Character encoding
Character encoding is the process of interpreting bytes as readable characters. UTF-8 has been the dominant encoding since 2009 and is promoted as a de facto standard [1].
UTF-8
Terminal
Terminals that support UTF-8 include:
- gnustep-terminal
- konsole
- mlterm
- rxvt-unicode
- st
- VTE-based terminals
- xterm - Run with the argument -u8 or configure the resource xterm*utf8: 2.
Gnome-terminal or rxvt-unicode
You need to launch these applications from a UTF-8 locale or they will drop UTF-8 support. Enable the en_US.UTF-8 locale (or your local UTF-8 alternative) per the instructions above and set it as the default locale, then reboot.
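To confirm that a session is actually running under a UTF-8 locale before starting one of these terminals, a check along the following lines may help. This is a minimal sketch: is_utf8_locale is a hypothetical helper name, and it assumes the locale utility reports the character map as exactly "UTF-8".

```shell
# is_utf8_locale is a hypothetical helper, not part of any package.
is_utf8_locale() {
    # "locale charmap" prints the character map of the active locale,
    # for example "UTF-8" under en_US.UTF-8.
    [ "$(locale charmap 2>/dev/null)" = "UTF-8" ]
}

if is_utf8_locale; then
    echo "current locale uses UTF-8"
else
    echo "current locale is not UTF-8; check LANG and LC_ALL"
fi
```

If the second message appears, the terminal was most likely started before the default locale took effect.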
Troubleshooting
- Use mp3unicode to fix encoding problems with MP3 files.
Incorrect encoding for files extracted from .zip archives
On older versions of Windows (XP, Vista, and 7) using certain locales (Chinese, Japanese, Russian, etc.), File Explorer will typically end up using non-Unicode character encodings to store filenames when it creates a .zip archive. To extract these files with the proper encoding and avoid mojibake filenames, use the -O flag to manually specify the right charset to unzip. For example:
$ unzip -O CP936 file.zip
If you're not sure which charset you need to specify, you can check the filenames without extracting by adding the -l flag:
$ unzip -lO SJIS file.zip
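To see why the charset matters, the following sketch converts a filename stored as CP936 bytes into UTF-8. It uses iconv, which is not mentioned above and is assumed to be installed with CP936 support; the byte sequence and the utf8_name variable are illustrative only.

```shell
# The bytes below (octal escapes for 0xD6 0xD0 0xCE 0xC4) spell two Chinese
# characters in CP936, followed by the ASCII suffix ".txt".
# iconv reinterprets them as CP936 and re-encodes the result as UTF-8.
utf8_name=$(printf '\326\320\316\304.txt' | iconv -f CP936 -t UTF-8)

# In a UTF-8 terminal this prints: 中文.txt
printf '%s\n' "$utf8_name"
```

Reading the same bytes with the wrong charset is what produces mojibake filenames, which is why unzip has to be told the original encoding.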