“UTF-8” stands for “8-Bit UCS Transformation Format” and is the most widespread character encoding on the World Wide Web. The international Unicode standard captures the characters and text elements of (virtually) all of the world’s languages for data processing. UTF-8 plays a crucial role in the Unicode character set.


The development of the UTF-8 encoding

UTF-8 is a character encoding: it assigns every Unicode character (letters, numbers, and symbols from an ever-expanding range of languages) a unique bit sequence, which can also be read as a binary number. International organizations focused on setting internet standards, such as the W3C and the Internet Engineering Task Force (IETF), actively promote UTF-8 as the universal standard for character encoding. As early as 2009, the majority of websites had already adopted UTF-8; according to a W3Techs report from April 2025, 98.6% of all websites now use this encoding.
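This one-to-one assignment can be illustrated with a short Python sketch that prints the bit sequence UTF-8 assigns to a few sample characters:

```python
# Print the UTF-8 byte sequences (as bits) assigned to a few characters.
for ch in ["A", "ä", "€"]:
    encoded = ch.encode("utf-8")
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    print(ch, f"U+{ord(ch):04X}", bits)
```

Each character maps to exactly one byte sequence, and each byte sequence decodes back to exactly one character.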

Problems faced before UTF-8 was introduced

Different regions with related languages and writing systems developed their own encoding standards because they had different needs. In English-speaking countries, for instance, the ASCII encoding was sufficient, as it allowed 128 characters to be represented as computer-readable strings.

Languages that use Asian scripts or the Cyrillic alphabet, however, require a much larger set of unique characters. Even German umlauts, such as the letter ä, are not included in the ASCII character set. On top of that, different encoding systems could assign the same binary values to entirely different characters. As a result, a Russian document opened on an American computer might appear not in Cyrillic but in the Latin letters mapped by the local encoding, producing unreadable text. This kind of mismatch seriously disrupted international communication.
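The mismatch described above is easy to reproduce. This Python sketch encodes a Russian word with the legacy Windows-1251 codepage and then decodes the same bytes with a Western European mapping (Latin-1), yielding unreadable text:

```python
# Same bytes, two interpretations: Windows-1251 in, Latin-1 out.
original = "Привет"
raw = original.encode("cp1251")   # bytes as a Russian system might store them
garbled = raw.decode("latin-1")   # decoded with a Western European table
print(garbled)                    # mojibake instead of Cyrillic
assert garbled != original
```

The bytes themselves are intact; only the mapping from byte values to characters differs between the two encodings.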

Creation of UTF-8

To solve this problem, Joseph D. Becker developed the universal character set Unicode for Xerox between 1988 and 1991. From 1992, the IT industry consortium X/Open was also searching for a system to replace ASCII and expand the character repertoire. The new encoding was still meant to remain compatible with ASCII.

This requirement was not met by the first encoding, UCS-2, which simply transferred character numbers into 16-bit values. UTF-1 also failed because its Unicode assignments partially collided with existing ASCII character assignments: a server set to ASCII would therefore sometimes output incorrect characters. This was a significant issue, since most English-language computers operated with ASCII at the time.

The next attempt was the File System Safe UCS Transformation Format (FSS-UTF) by Dave Prosser, which eliminated the overlap with ASCII characters. In August of the same year, the draft circulated among experts. At Bell Labs, known for its numerous Nobel laureates, the Unix co-founders Ken Thompson and Rob Pike were working on the Plan 9 operating system. They adopted Prosser’s idea, developed a self-synchronizing encoding (each character indicates how many bytes it needs), and established rules for characters that could be represented in more than one way (for example, “ä” as its own character or as “a” plus a combining diaeresis). They successfully used the encoding in their operating system and presented it to the standards bodies. Thus FSS-UTF, now known as “UTF-8,” was essentially complete.

UTF-8 in the Unicode character set: a standard for all languages

The UTF-8 encoding is a transformation format within the Unicode standard. Unicode is largely defined by the international standard ISO 10646, where it is known as the “Universal Coded Character Set.” The Unicode developers set certain parameters for practical application. The standard aims to ensure the internationally uniform and compatible encoding of characters and text elements.

When Unicode was introduced in 1991, it defined 24 modern writing systems and currency symbols for data processing. The Unicode standard published in 2024 covers 168 writing systems. Various Unicode Transformation Formats (“UTF”) reproduce the 1,114,112 possible codepoints. Three formats have prevailed: UTF-8, UTF-16, and UTF-32. Other encodings such as UTF-7 or SCSU have their own advantages but never became established. Unicode is divided into 17 planes of 65,536 characters each. The zeroth plane, the “Basic Multilingual Plane,” covers most of the writing systems currently used worldwide, along with punctuation, control characters, and symbols. Six additional planes are currently in use:

  • Supplementary Multilingual Plane (Plane 1): historical writing systems, rarely used characters
  • Supplementary Ideographic Plane (Plane 2): rare CJK characters (Chinese, Japanese, Korean)
  • Tertiary Ideographic Plane (Plane 3): further CJK characters, encoded here since Unicode version 15.1
  • Supplementary Special-Purpose Plane (Plane 14): individual control characters
  • Supplementary Private Use Area – A (Plane 15): private use
  • Supplementary Private Use Area – B (Plane 16): private use

The UTF encodings provide access to all Unicode characters; each format’s specific properties make it the recommended choice for certain areas of application.

UTF-32 and UTF-16 as alternatives

UTF-32 always operates with 32 bits, or 4 bytes. This simple structure increases the readability of the format. In languages that primarily use the Latin alphabet, and thus only the first 128 characters, the encoding takes up much more storage space than necessary (4 bytes instead of 1).

UTF-16 established itself as a display format in operating systems like Apple macOS and Microsoft Windows. It is also used in many software development frameworks. It is one of the oldest UTFs still in use. Its structure is particularly suitable for memory-efficient encoding of non-Latin characters: most characters can be represented in 2 bytes (16 bits), with the length doubling to 4 bytes only for rare characters.
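A quick way to see these sizes is to encode sample characters with Python’s built-in UTF-16 codec (big-endian, so no BOM is prepended):

```python
# BMP characters take 2 bytes in UTF-16; an emoji outside the BMP takes 4
# (a surrogate pair).
for ch in ["A", "你", "😀"]:
    print(ch, len(ch.encode("utf-16-be")), "bytes")
```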

UTF-8 is efficient and scalable

UTF-8 uses up to four sequences of 8 bits (one byte each), while its predecessor ASCII relies on a single 7-bit sequence. Both encodings represent the first 128 characters in exactly the same way, covering the letters and symbols commonly used in English. As a result, characters from the English-speaking world can be stored using just one byte, making UTF-8 particularly efficient for texts in Latin-based languages. This efficient use of storage is one reason why operating systems like Unix and Linux use UTF-8 internally. However, UTF-8 plays its most important role in internet applications, especially when displaying text on websites or in emails.
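The byte counts and the ASCII compatibility can be checked directly, for example in Python:

```python
# ASCII characters keep their 1-byte values; others need 2 to 4 bytes.
assert "A".encode("utf-8") == "A".encode("ascii")
for ch in ["A", "ä", "€", "😀"]:
    print(ch, len(ch.encode("utf-8")), "byte(s)")
```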

Thanks to the self-synchronizing structure, readability is maintained despite the variable length per character. Extended to eight-byte sequences without the Unicode limitation, UTF-8 could theoretically allow 2^42 (4,398,046,511,104) character mappings. Due to the 4-byte restriction in Unicode, it is effectively 2^21 (2,097,152), which is more than sufficient; the Unicode range still contains empty planes for many more writing systems. The precise mapping prevents codepoint overlaps, which in the past hampered communication.

While UTF-16 and UTF-32 also map characters precisely, UTF-8 uses storage space particularly efficiently for the Latin writing system and is designed so that different writing systems can coexist seamlessly. This enables their concurrent, meaningful display within a single text field without compatibility issues.

The basics of UTF-8 encoding and composition

The UTF-8 encoding stands out not only for its backward compatibility with ASCII but also for its self-synchronizing structure, which makes it easier for developers to locate sources of error afterwards. For all ASCII characters, UTF-8 uses only 1 byte. The total length of a byte sequence can be recognized from the first digits of its binary number. Since ASCII code encompasses only 7 bits, the leading digit is the identifier 0: it fills the storage to a full byte and signals the start of a sequence without continuation bytes. The name “UTF-8” would be represented as binary numbers in UTF-8, for instance, as follows:

Character                        U         T         F         -         8
UTF-8, binary                    01010101  01010100  01000110  00101101  00111000
Unicode codepoint, hexadecimal   U+0055    U+0054    U+0046    U+002D    U+0038

ASCII characters, like those in the table, are assigned a single byte by the UTF-8 encoding. All subsequent characters and symbols within Unicode use two to four 8-bit sequences. The first is called the start byte; additional ones are continuation bytes. Start bytes of multi-byte sequences always begin with 11, while continuation bytes begin with 10. If you manually search for a specific point in the code, you can recognize the start of an encoded character by the markers 0 and 11. The first printable multi-byte character is the inverted exclamation mark:

Character                        ¡
UTF-8, binary                    11000010 10100001
Unicode codepoint, hexadecimal   U+00A1
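These markers can be made visible by printing each byte of the inverted exclamation mark in binary, for instance in Python:

```python
# The start byte begins with 11, the continuation byte with 10.
for byte in "¡".encode("utf-8"):
    print(f"{byte:08b}")
```

This prints 11000010 and 10100001, matching the table above.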

Prefix coding prevents another character from being encoded within a byte sequence. If a byte stream starts in the middle of a document, the computer still displays readable characters correctly, as it doesn’t render incomplete ones. When searching for the beginning of a character, the 4-byte limitation means you only need to go back at most three bytes at any given point to find the start byte.
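This resynchronization can be sketched in a few lines: starting from an arbitrary byte offset, step backwards while the current byte matches the continuation pattern 10xxxxxx:

```python
data = "Äpfel".encode("utf-8")   # b'\xc3\x84pfel'; offset 1 is inside "Ä"
i = 1
while data[i] & 0b1100_0000 == 0b1000_0000:   # 10xxxxxx: continuation byte
    i -= 1
print("character starts at byte", i)          # 0, the start byte of "Ä"
```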

Another structuring element: the number of ones at the beginning of the start byte indicates the length of the byte sequence:

  • 110xxxxx rep­re­sents 2 bytes
  • 1110xxxx rep­re­sents 3 bytes
  • 11110xxx rep­re­sents 4 bytes
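This rule translates directly into code; a minimal sketch (assuming a valid start byte is passed in, not a continuation byte) could look like this:

```python
def sequence_length(lead: int) -> int:
    """Byte length of a UTF-8 sequence, derived from its (valid) start byte."""
    if lead < 0b1000_0000:    # 0xxxxxxx: single-byte ASCII
        return 1
    if lead < 0b1110_0000:    # 110xxxxx: 2 bytes
        return 2
    if lead < 0b1111_0000:    # 1110xxxx: 3 bytes
        return 3
    return 4                  # 11110xxx: 4 bytes

for ch in ["A", "ä", "€", "😀"]:
    print(ch, sequence_length(ch.encode("utf-8")[0]))
```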

In UTF-8, byte values rise in the same order as the Unicode character numbers, which enables a logical, lexical sorting. However, this sequence includes some gaps. The range U+007F to U+009F is reserved for non-printable control characters rather than readable symbols: in this section, Unicode assigns only command functions and control codes.

As mentioned, the UTF-8 scheme could theoretically chain up to eight byte sequences. However, Unicode prescribes a maximum length of 4 bytes, which makes byte sequences of 5 bytes or more invalid by default. This restriction reflects the aim of a code that is as compact (using minimal storage space) and as structured as possible. A fundamental rule when using UTF-8 is to always use the shortest possible encoding.

However, for some characters there are multiple equivalent encodings. For example, the letter ä is encoded using 2 bytes: 11000011 10100100. Theoretically, it is also possible to combine the codepoints for the letter “a” (01100001) and the combining diaeresis U+0308 (11001100 10001000) to represent “ä”: 01100001 11001100 10001000. This is the Unicode Normalization Form NFD, where characters are canonically decomposed. Both encodings lead to the exact same result (namely “ä”) and are therefore canonically equivalent.

Note

Normalizations are used to unify different Unicode representations of the same character. Canonical equivalence is important because it means that two sequences of characters can be encoded differently but have the same meaning and appearance. Compatible equivalence, on the other hand, also allows sequences that differ in format or style but are substantively the same. Unicode normalization forms (e.g., NFC, NFD, NFKC, NFKD) use these concepts to standardize texts. This ensures that comparisons, sorting, and searches work consistently and reliably.
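Python exposes these normalization forms through the standard unicodedata module, which makes the canonical equivalence of the two “ä” encodings easy to verify:

```python
import unicodedata

composed = "ä"           # U+00E4, precomposed form (NFC)
decomposed = "a\u0308"   # "a" plus combining diaeresis U+0308 (NFD)

print(composed == decomposed)                                # False: different codepoints
print(unicodedata.normalize("NFC", decomposed) == composed)  # True after normalization
print(unicodedata.normalize("NFD", composed) == decomposed)  # True the other way round
```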

Some Unicode value ranges were not defined for UTF-8 because they are reserved for UTF-16 surrogates. The overview shows which bytes in UTF-8 under Unicode are considered valid according to the Internet Engineering Task Force (IETF) (green-marked areas are valid bytes, orange-marked are invalid).

Image: Table: UTF-8 value ranges
The table provides an overview of the valid UTF-8 value ranges.

Conversion from Unicode hexadecimal to UTF-8 binary

Computers read only binary numbers, while humans use the decimal system. The hexadecimal system serves as an interface between these forms, helping to represent long chains of bits compactly. It uses the digits 0 through 9 and the letters A through F and operates on base 16. Since 16 is the fourth power of 2, the hexadecimal system is better suited than the decimal system for representing eight-digit bytes.

A hexadecimal digit represents a quartet (“nibble”) within the octet, so a byte with eight binary digits can be represented with just two hexadecimal digits. Unicode uses the hexadecimal system to describe the position of a character within its own system. From this, the binary number and finally the UTF-8 byte sequence can be calculated.

First, convert the hexadecimal number into binary. Then fit the codepoint bits into the structure of the UTF-8 encoding. The following overview shows how many codepoint bits fit into each byte chain and which structure to expect in each Unicode value range.

Size in bytes   Free bits   First codepoint   Last codepoint   Start byte   Continuation byte 2   Continuation byte 3   Continuation byte 4
1               7           U+0000            U+007F           0xxxxxxx
2               11          U+0080            U+07FF           110xxxxx     10xxxxxx
3               16          U+0800            U+FFFF           1110xxxx     10xxxxxx              10xxxxxx
4               21          U+10000           U+10FFFF         11110xxx     10xxxxxx              10xxxxxx              10xxxxxx

Within a given code range, you can predict the number of bytes used, because the lexical order is consistent in both the Unicode codepoints and the corresponding UTF-8 binary values. For the range U+0800 to U+FFFF, UTF-8 uses 3 bytes. This range provides 16 bits to represent the codepoint of each symbol. The binary number is integrated into the UTF-8 encoding from right to left, with any unused bits on the left filled with zeros.

Calculation example:

The character ᅢ (Hangul Jungseong AE) is located at position U+1162 in Unicode. To calculate the binary number, first convert the hexadecimal number into a decimal number. Each digit corresponds to the correlating power of 16; the rightmost digit has the lowest value, 16^0 = 1. Starting from the right, multiply each digit’s numeric value by the value of its power, then add up the results.

Image: Example calculation: Convert Hexadecimal Number to Decimal
Convert the hexadecimal number to a decimal number in the first step.
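The same hexadecimal-to-decimal step can be verified in a couple of lines of Python, with the digit weights 16^3, 16^2, 16^1, and 16^0 spelled out:

```python
# U+1162: weight each hex digit by its power of 16 and add the results.
value = 1 * 16**3 + 1 * 16**2 + 6 * 16**1 + 2 * 16**0
print(value)             # 4450
print(int("1162", 16))   # the same result via built-in base-16 parsing
```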

4450 is the calculated decimal number. Now convert it into a binary number. To do this, repeatedly divide the number by 2 until the result is 0. The remainders, written from right to left, form the binary number.

Image: Example calculation: Convert decimal number to Binary
Convert the decimal number to a binary number in the next step.
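The repeated division can be sketched as a short loop that collects the remainders from right to left:

```python
# Convert 4450 to binary via repeated division by 2.
n = 4450
bits = ""
while n > 0:
    bits = str(n % 2) + bits   # each new remainder goes to the left
    n //= 2
print(bits)   # 1000101100010
```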

The UTF-8 code prescribes 3 bytes for codepoint U+1162, because the codepoint lies between U+0800 and U+FFFF. The start byte therefore begins with 1110, and the two subsequent bytes each start with 10. Fill the binary number into the free bits, which do not dictate the structure, from right to left. Complete the remaining bit positions in the start byte with 0 until the octet is full. The UTF-8 encoding then looks like this:

11100001 10000101 10100010 (the prefixes 1110 and 10 mark the structure; the remaining bits hold the inserted codepoint)

Character   Unicode codepoint, hexadecimal   Decimal number   Binary number   UTF-8, binary
ᅢ           U+1162                           4450             1000101100010   11100001 10000101 10100010
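The bit-packing described above can be written out as a small function. This is a sketch for the 3-byte case only, with Python’s own encoder used as a cross-check:

```python
def encode_3byte(codepoint: int) -> bytes:
    """Build the 3-byte UTF-8 sequence for a codepoint in U+0800..U+FFFF."""
    byte1 = 0b1110_0000 | (codepoint >> 12)           # 1110xxxx: top 4 bits
    byte2 = 0b1000_0000 | ((codepoint >> 6) & 0x3F)   # 10xxxxxx: middle 6 bits
    byte3 = 0b1000_0000 | (codepoint & 0x3F)          # 10xxxxxx: lowest 6 bits
    return bytes([byte1, byte2, byte3])

print(encode_3byte(0x1162).hex())          # e185a2
print(chr(0x1162).encode("utf-8").hex())   # identical bytes from Python's encoder
```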

UTF-8 in the Editor

UTF-8 is the most widespread standard on the internet, but simple text editors do not necessarily save texts in this format by default. Microsoft Notepad, for instance, long used a default encoding referred to as “ANSI” (which is actually the ASCII-based encoding Windows-1252). If you want to convert a text file from Microsoft Word to UTF-8 (for example, to represent various writing systems), proceed as follows: go to “Save As” and select “Plain Text” in the File Type option.

Image: Screenshot: Saving document in Word
You also have the option to save documents as plain text in Microsoft Word.

The pop-up window “File Conversion” will open. Under “Text Encoding,” select “Other encoding” and choose “Unicode (UTF-8)” from the list. In the drop-down menu “End lines with,” choose “Carriage Return/Line Feed” or “CR/LF.” This is how you easily convert a file to the Unicode character set with UTF-8.

Image: Screenshot: File conversion in Word
In addition to UTF-8, the “File Conversion” window also offers options such as Unicode (UTF-16) with and without Big-Endian, as well as ASCII and many other encodings.
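The same conversion can also be scripted. The sketch below first writes a sample Windows-1252 (“ANSI”) file, then re-saves it as UTF-8 with CR/LF line endings; the file names are placeholders:

```python
# Create a sample legacy file encoded as Windows-1252.
with open("legacy.txt", "w", encoding="cp1252") as f:
    f.write("Grüße")

# Re-read it with the correct legacy codec and save it as UTF-8.
with open("legacy.txt", encoding="cp1252") as src:
    text = src.read()
with open("converted.txt", "w", encoding="utf-8", newline="\r\n") as dst:
    dst.write(text)   # newline="\r\n" mirrors the CR/LF option in Word
```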

Opening an unmarked text file, where you don’t know beforehand which encoding was applied, can lead to issues during editing. In Unicode, the Byte Order Mark (BOM) is used for such situations. This invisible character indicates whether the document is in Big-Endian or Little-Endian format. If a program decodes a UTF-16 Little-Endian file using UTF-16 Big-Endian, the text will be output incorrectly.

Documents based on the UTF-8 character set do not have this problem, as the byte order is always read as a Big-Endian byte sequence. In this case, the BOM merely serves as an indication that the document is UTF-8 encoded.

Note

Characters represented with more than one byte can have the most significant byte at the front (left) or the back (right) in some encodings (UTF-16 and UTF-32). If the most significant byte (MSB) is at the front, the encoding is labeled “Big-Endian.” If the MSB is at the back, “Little-Endian” is added.

You place the BOM before a data stream or at the start of a file. This marker takes precedence over all other directives, even over the HTTP header. The BOM acts as a sort of signature for Unicode encodings and has the codepoint U+FEFF. Depending on the encoding used, the BOM appears differently in its encoded form.

Encoding Format BOM, Code point: U+FEFF (hex.)
UTF-8 EF BB BF
UTF-16 Big-Endian FE FF
UTF-16 Little-Endian FF FE
UTF-32 Big-Endian 00 00 FE FF
UTF-32 Little-Endian FF FE 00 00

Do not use the Byte Order Mark if the protocol explicitly prohibits it or if your data is already assigned a specific type. Some programs, according to their protocol, expect ASCII characters. Since UTF-8 is backward compatible with ASCII and its byte order is fixed, no BOM is needed; in fact, Unicode recommends not using a BOM with UTF-8. However, since BOMs can still appear in older files and cause problems, it is important to identify any existing BOM as such.
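In Python, for example, the BOM bytes and the difference between a BOM-stripping decoder and a plain UTF-8 decoder can be inspected like this:

```python
import codecs

data = codecs.BOM_UTF8 + "Hello".encode("utf-8")   # UTF-8 content with a BOM

print(data[:3].hex())              # efbbbf, the encoded form of U+FEFF
print(data.decode("utf-8-sig"))    # "utf-8-sig" strips a leading BOM: Hello
print(repr(data.decode("utf-8")))  # plain "utf-8" keeps it: '\ufeffHello'
```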
