Character encoding

From ArchWiki

Character encoding defines how bytes are interpreted as readable characters. UTF-8 has been the dominant encoding since 2009 and is promoted as a de facto standard [1].
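As a minimal illustration of what "interpreting bytes" means, the following Python sketch encodes a single character to UTF-8 and then decodes the same bytes with a different codec. The sample character and the choice of Latin-1 as the "wrong" codec are assumptions made for the example only.

```python
# One character, two bytes in UTF-8.
text = "\u00e9"                       # LATIN SMALL LETTER E WITH ACUTE
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)                     # b'\xc3\xa9'
print(len(text), len(utf8_bytes))     # 1 2

# Decoding with the right codec recovers the original character.
roundtrip = utf8_bytes.decode("utf-8")

# Decoding the same bytes as Latin-1 yields two unrelated characters
# (U+00C3 and U+00A9): this is how mojibake arises.
mojibake = utf8_bytes.decode("latin-1")
print(len(mojibake))                  # 2
```

The bytes never change; only the table used to read them does.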

UTF-8

Terminal

Terminals that support UTF-8 include gnome-terminal and rxvt-unicode.

You need to launch these applications from a UTF-8 locale or they will drop UTF-8 support. Enable the en_US.UTF-8 locale (or your local UTF-8 alternative) per the instructions above and set it as the default locale, then reboot.
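A rough way to check whether the current environment selects a UTF-8 locale is to inspect the locale variables in POSIX precedence order (LC_ALL, then LC_CTYPE, then LANG). The helper below, locale_is_utf8, is a hypothetical name and a simplified sketch: it only looks at the codeset suffix of the variable, so the output of `locale charmap` remains the authoritative answer.

```python
import os


def locale_is_utf8(env=None):
    """Return True if the effective character-type locale names a UTF-8 codeset.

    Simplified sketch: only the text after the '.' in the locale name is checked.
    """
    env = os.environ if env is None else env
    # LC_ALL overrides LC_CTYPE, which overrides LANG; empty values count as unset.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            # "de_DE.utf8@euro" -> codeset "utf8"; "C" -> codeset ""
            codeset = value.partition(".")[2].partition("@")[0]
            return codeset.lower().replace("-", "") == "utf8"
    return False


print(locale_is_utf8({"LANG": "en_US.UTF-8"}))                 # True
print(locale_is_utf8({"LC_ALL": "C", "LANG": "en_US.UTF-8"}))  # False
```

Calling locale_is_utf8() without arguments checks the environment of the running process.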

URL encoding

URIs accept US-ASCII characters only and use percent-encoding to encode non-ASCII characters. This can result in very long and human-unreadable URIs.
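To see how percent-encoding works and why it inflates URIs, the sketch below uses Python's standard urllib.parse module. The paths are made-up examples; each non-ASCII character is first encoded to UTF-8 and every resulting byte is then written as %XX.

```python
from urllib.parse import quote, unquote

path = "/wiki/Zürich"            # hypothetical path, for illustration only
encoded = quote(path)            # "/" is left unescaped by default
print(encoded)                   # /wiki/Z%C3%BCrich
decoded = unquote(encoded)       # back to the readable form

# Each CJK character takes three bytes in UTF-8, hence nine ASCII characters.
cjk_encoded = quote("中文")
print(cjk_encoded)               # %E4%B8%AD%E6%96%87
print(len("中文"), len(cjk_encoded))   # 2 18
```

A two-character title growing to eighteen characters shows how quickly non-ASCII URIs become long and unreadable.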

In Firefox, decoded URLs can be copied by enabling the browser.urlbar.decodeURLsOnCopy flag in about:config. Alternatively, insert a space at the start of the URL, then select the URL together with the space and copy it.

For command-line tools with check:

Troubleshooting

  • Use mp3unicode for fixing encoding problems with mp3 files.

Incorrect encoding for files extracted from .zip archives

On older versions of Windows (XP, Vista, and 7) using certain locales (Chinese, Japanese, Russian, etc.), File Explorer will typically end up using non-Unicode character encodings to store filenames when it creates a .zip archive. To extract these files with the proper encoding and avoid mojibake filenames, use the -O flag to manually specify the right charset to unzip. For example:

$ unzip -O CP936 file.zip

If you are not sure which charset you need to specify, you can check the filenames without extracting by adding the -l flag:

$ unzip -lO SJIS file.zip
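The same trial-and-error can be sketched in Python when only the raw filename bytes are at hand. The helper preview_names and its default candidate list are assumptions made for illustration, not part of unzip; the idea is simply to decode the bytes under each candidate charset and pick the result that reads correctly.

```python
def preview_names(raw_name, candidates=("cp936", "shift_jis", "cp866")):
    """Decode raw filename bytes under each candidate charset.

    Returns a dict mapping charset -> decoded name, or None when the
    bytes are not valid in that charset.
    """
    results = {}
    for charset in candidates:
        try:
            results[charset] = raw_name.decode(charset)
        except UnicodeDecodeError:
            results[charset] = None
    return results


# Simulate a filename stored by a Simplified Chinese system.
raw = "中文.txt".encode("cp936")
previews = preview_names(raw)
print(sorted(previews))          # ['cp866', 'cp936', 'shift_jis']
```

Several charsets may decode without error, so a successful decode alone proves nothing; the correct charset is the one whose result is readable.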