Character encoding

From ArchWiki

Character encoding is the mapping between characters and the bytes used to store them; decoding interprets those bytes as readable characters again. UTF-8 has been the dominant encoding since 2009 and is promoted as a de facto standard [1].
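As a small illustration of the definition above, the following sketch prints the bytes that represent the single character "é" in two encodings. The helper names utf8_hex and latin1_hex are hypothetical, and the sketch assumes that iconv and od are installed.

```shell
# Hypothetical helpers, for illustration only.

# Hex dump of "é" encoded as UTF-8 (written here with POSIX octal escapes).
utf8_hex() {
    printf '\303\251' | od -An -tx1 | tr -d ' \n'
}

# The same character re-encoded as ISO-8859-1, then hex dumped.
latin1_hex() {
    printf '\303\251' | iconv -f UTF-8 -t ISO-8859-1 | od -An -tx1 | tr -d ' \n'
}

utf8_hex; echo      # two bytes in UTF-8
latin1_hex; echo    # one byte in ISO-8859-1
```

A program that reads these bytes with the wrong encoding in mind displays the wrong characters, which is the usual cause of garbled text.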

UTF-8

Terminal

The following lists some terminals that support UTF-8:

Gnome-terminal or rxvt-unicode

These applications must be launched from a UTF-8 locale; otherwise they will not enable UTF-8 support. Enable the en_US.UTF-8 locale (or your local UTF-8 alternative) per the instructions above, set it as the default locale, then reboot.
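To check whether the shell that launches the terminal is running in a UTF-8 locale, something like the following sketch may help. The function names is_utf8_locale and effective_ctype are hypothetical; the sketch only inspects the standard LC_ALL, LC_CTYPE and LANG variables and does not verify that the locale has actually been generated.

```shell
# Hypothetical helper: succeed if a locale name carries a UTF-8 codeset.
is_utf8_locale() {
    case "$1" in
        *.UTF-8*|*.utf8*|*.UTF8*|*.utf-8*) return 0 ;;
        *) return 1 ;;
    esac
}

# Locale used for character handling: LC_ALL overrides LC_CTYPE,
# which overrides LANG. Falls back to "C" if none is set.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

if is_utf8_locale "$(effective_ctype)"; then
    echo "UTF-8 locale active: $(effective_ctype)"
else
    echo "Not a UTF-8 locale: $(effective_ctype)"
fi
```

On glibc systems, running locale charmap is a shorter check: it prints the codeset of the current locale.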

Troubleshooting

  • Use mp3unicode to fix encoding problems in the tags of MP3 files.

Incorrect encoding for files extracted from .zip archives

On older versions of Windows (XP, Vista, and 7) set to certain locales (Chinese, Japanese, Russian, etc.), File Explorer typically stores filenames in a non-Unicode character encoding when it creates a .zip archive. To extract these files with the proper encoding and avoid mojibake filenames, pass the -O flag to unzip to specify the correct charset manually. For example:

$ unzip -O CP936 file.zip

If you're not sure which charset you need to specify, you can check the filenames without extracting by adding the -l flag:

$ unzip -lO SJIS file.zip
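
To see what specifying a charset changes, the following sketch decodes a raw byte sequence the way a legacy filename would need to be decoded. It is only an illustration using iconv, not a description of how unzip works internally; the helper name decode_name is hypothetical.

```shell
# Hypothetical helper: interpret the bytes on stdin as the given charset
# and print them re-encoded as UTF-8.
decode_name() {
    iconv -f "$1" -t UTF-8
}

# The bytes D6 D0 CE C4 (octal escapes below) spell "中文" in CP936.
# Read as UTF-8 they are invalid, which is why such names appear garbled.
printf '\326\320\316\304' | decode_name CP936
echo
```

If the output is readable for a given charset, the same charset name is a reasonable candidate to pass to unzip -O.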