UTF-8 is a transformation format within the Unicode standard. The international standard ISO/IEC 10646 defines largely the same character set under the name “Universal Coded Character Set.” The Unicode developers add certain restrictions for practical use, which are intended to ensure that characters and text elements are encoded in a globally uniform, compatible way. When it was introduced in 1991, Unicode covered 24 modern writing systems as well as currency symbols. By June 2017, it covered 139 writing systems.
There are various Unicode transformation formats (UTF for short), which encode the 1,114,112 possible code points. Three formats have become established: UTF-8, UTF-16, and UTF-32. Other encodings, such as UTF-7 or SCSU, also have their advantages but have not gained wide adoption.
Unicode is divided into 17 planes, each of which contains 65,536 code points. Each plane can be pictured as a grid of 256 rows and 256 columns. The first plane, the “Basic Multilingual Plane” (plane 0), comprises a large part of the writing systems currently in use around the world, as well as punctuation marks, control characters, and symbols. Five more planes are currently also in use:
- “Supplementary Multilingual Plane” (plane 1): historical writing systems, rarely used characters
- “Supplementary Ideographic Plane” (plane 2): rare CJK characters (Chinese, Japanese, Korean)
- “Supplementary Special Purpose Plane” (plane 14): special-purpose control characters such as tags and variation selectors
- “Supplementary Private Use Area-A” (plane 15): private use
- “Supplementary Private Use Area-B” (plane 16): private use
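The plane structure described above can be checked with a little arithmetic. The following is a minimal Python sketch, not part of any standard tooling: the names `PLANE_SIZE`, `PLANE_COUNT`, and `plane_of` are chosen here for illustration. It relies on the fact that each plane spans 65,536 (hexadecimal 10000) code points, so the plane number is simply the code point shifted right by 16 bits.

```python
PLANE_SIZE = 0x10000   # 65,536 code points per plane
PLANE_COUNT = 17       # planes 0 through 16


def plane_of(char: str) -> int:
    """Return the plane number (0-16) of a single character."""
    return ord(char) >> 16


# 17 planes of 65,536 code points give the total size of the code space.
print(PLANE_COUNT * PLANE_SIZE)  # 1114112

# Sample characters, written as escapes so the source stays ASCII-only.
samples = [
    "A",            # U+0041, Latin capital letter A
    "\u20ac",       # U+20AC, euro sign
    "\U00010348",   # U+10348, Gothic letter hwair
    "\U00020000",   # U+20000, a CJK ideograph
    "\U000E0001",   # U+E0001, language tag
    "\U0010FFFF",   # U+10FFFF, last code point
]
for ch in samples:
    print(f"U+{ord(ch):04X} is in plane {plane_of(ch)}")
```

The two characters from plane 0 show that both Latin letters and currency symbols sit in the Basic Multilingual Plane, while the remaining samples land in planes 1, 2, 14, and 16.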
The UTF encodings can all represent every Unicode character. They differ in their properties, which makes each of them suited to particular areas of application.
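One property in which the three formats differ is the number of bytes they need per character. The sketch below uses Python's built-in codecs to compare them; the helper name `encoded_lengths` is an illustrative assumption, and the big-endian codec variants are used only so that no byte order mark is added to the output.

```python
def encoded_lengths(char: str) -> dict:
    """Return the number of bytes each UTF needs for one character."""
    return {
        "utf-8": len(char.encode("utf-8")),
        "utf-16": len(char.encode("utf-16-be")),   # "-be" avoids the BOM
        "utf-32": len(char.encode("utf-32-be")),
    }


print(encoded_lengths("A"))           # {'utf-8': 1, 'utf-16': 2, 'utf-32': 4}
print(encoded_lengths("\u20ac"))      # {'utf-8': 3, 'utf-16': 2, 'utf-32': 4}
print(encoded_lengths("\U0001F600"))  # {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}

# Every format round-trips the same character without loss.
for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    assert "\U0001F600".encode(codec).decode(codec) == "\U0001F600"
```

UTF-8 is the most compact for ASCII text, UTF-16 needs two bytes for characters in plane 0 and four for all others, and UTF-32 always uses four bytes.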