Unicode is an in­ter­na­tion­al standard for encoding, dis­play­ing, and pro­cess­ing text char­ac­ters from nearly all the world’s writing systems. Each character is assigned a unique code point, which can be stored in various character encodings like UTF-8 or UTF-16. This allows Unicode to provide con­sis­tent rep­re­sen­ta­tion and pro­cess­ing of texts across different platforms and languages.

What is Unicode?

Unicode is not an acronym; the name is meant to suggest a unique, unified, universal encoding. It is a global standard for representing text characters in binary form and enables consistent storage, exchange, and processing of text across different digital systems and platforms.

Unicode is in­no­v­a­tive in that it is not tied to the formats and encodings of a single alphabet of a par­tic­u­lar human language. Rather, Unicode was created with the aim of serving as a unified standard for rep­re­sent­ing all writing systems and char­ac­ters developed by humans.

Since the release of Unicode 1.0 at the end of 1991, the standard has fulfilled its purpose. Unicode is in­ter­nal­ly used by browsers and operating systems as a unified format. With the release of version 16.0 by the Unicode Con­sor­tium in 2024, the Unicode Standard now en­com­pass­es a reper­toire of 154,998 char­ac­ters. The character set covered by the Unicode Standard is com­plete­ly identical to the “Universal Coded Character Set” (UCS), which is in­ter­na­tion­al­ly stan­dard­ized as ISO/IEC 10646.

Technical basis for character encoding

First, it’s important to understand that all information in a digital system ultimately consists of long sequences of zeros and ones. This is referred to as “binary representation.” Binary code is somewhat like an alphabet in itself, but one with only two “letters”: zero and one. Each position within a sequence of zeros and ones is called a “bit.”

The basic trick of digital in­for­ma­tion tech­nol­o­gy is to represent char­ac­ters from different alphabets as sequences of zeros and ones. This allows for encoding numbers and letters, as well as any other dis­tin­guish­able states. Usually, these are called “symbols.” The longer the sequence of zeros and ones for rep­re­sent­ing a single symbol, the more symbols can be depicted. With each added bit, the number of possible symbols doubles.

A concrete example: Imagine we have binary “words” that are two bits long. This would allow us to encode four numbers:

2-bit word | Number
00 | 0
01 | 1
10 | 2
11 | 3

If we add another bit to the beginning of the sequence, the number of possible bit-words doubles. These consist of the already known bit sequences, each preceded by a zero or one. Thus, we can encode eight numbers:

3-bit word | Number
000 | 0
001 | 1
010 | 2
011 | 3
100 | 4
101 | 5
110 | 6
111 | 7

Fact

An 8-bit word is referred to as an octet or byte.
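The doubling described above can be checked with a few lines of Python. This is a minimal sketch; the helper name `bit_words` is chosen here for illustration and is not part of any standard.

```python
from itertools import product

def bit_words(n_bits):
    """Return all binary words of the given length, in counting order."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]

for n_bits in (2, 3, 8):
    words = bit_words(n_bits)
    # Each added bit doubles the count: 2 ** n_bits words in total
    print(n_bits, "bits:", len(words), "words, from", words[0], "to", words[-1])

# Interpret a bit word as a number, as in the tables above
print(int("101", 2))  # 5
```

With eight bits (one byte), the loop reports 256 distinct words.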

For sim­plic­i­ty, we’ve shown the encoding of numbers as an example here. However, the same principle applies to digital systems for encoding letters or any other char­ac­ters and states. Here is a highly sim­pli­fied example of binary encoding of letters:

3-bit word | Letter
000 | A
001 | B
010 | C

The graphic rep­re­sen­ta­tion of a character is called a glyph. Depending on the font used, there are different glyphs for the same character, and even within a single font, there can be multiple vari­a­tions for a glyph. Think, for instance, of different weights, ligatures, italics, etc. Here is an expanded rep­re­sen­ta­tion that includes the mapping from the character to the glyph:

Binary representation | Decimal number | Encoded character | Glyph
1000001 | 65 | uppercase “A” of the Latin alphabet | A
1100001 | 97 | lowercase “a” of the Latin alphabet | a
0110000 | 48 | Arabic numeral “0” | 0
0111001 | 57 | Arabic numeral “9” | 9
11000100 | 196 | uppercase “Ä” | Ä
11000001 | 193 | uppercase “Á” | Á
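The rows of this table can be reproduced in Python, whose built-in `ord()` returns the number assigned to a character. The following is a small sketch; `describe` is a hypothetical helper name, and the binary form is padded to at least seven digits to match the table.

```python
def describe(character):
    """Return (binary, decimal, character) for a single character."""
    code_point = ord(character)
    # "07b" pads the binary form with leading zeros to at least 7 digits
    return format(code_point, "07b"), code_point, character

for character in "Aa09ÄÁ":
    print(*describe(character))
```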

Ter­mi­nol­o­gy of character encoding

Digital character encoding involves a range of specific terms and concepts. In everyday usage, some of these may be used in­ter­change­ably, but in technical contexts — es­pe­cial­ly when working with Unicode — it’s important to dis­tin­guish them clearly. Below are key terms along with their de­f­i­n­i­tions:

Term | Meaning
Character set | A collection of possible characters, such as digits “0–9” or letters “a–z”
Code point | A numerical value assigned to a specific character within a coding system
Coded character set | A system that maps each character to exactly one code point
Character encoding | The process of converting characters into a digital format (e.g., binary)

Overview of common character encodings

Before the advent of Unicode, there was a wide variety of specific encodings. The norm was to use a distinct encoding for each language or language family. This often led to display errors and data in­con­sis­ten­cies. To counter this, character encodings were fre­quent­ly modeled as backward-com­pat­i­ble supersets of an existing standard. The modern Unicode standard builds on the earlier ISO Latin-1 encoding, which in turn is based on the ASCII character code.

Character encoding | Bits per character | Possible characters | Character set
ASCII | 7 bits | 128 | Letters, numbers, and special characters of the American keyboard, as well as control characters for teletypes
ISO Latin-1 (ISO 8859-1) | 8 bits | 256 | First 128 characters like ASCII, next 128 characters for special characters of European languages
Universal Coded Character Set 2 (UCS-2) | 16 bits | 65,536 | Characters of the “Basic Multilingual Plane” (BMP); first 256 characters like in ISO Latin-1
Universal Coded Character Set 4 (UCS-4) | 32 bits | 1,114,112 (code points U+0000 to U+10FFFF) | Characters of the BMP and additional planes beyond it; total of 154,998 characters in Unicode Version 16.0; first 256 characters like ISO Latin-1
UCS Transformation Format 8 Bit (UTF-8) | 8/16/24/32 bits | 1,114,112 (code points U+0000 to U+10FFFF) | Any characters from UCS-2 and UCS-4; the first 128 characters are encoded as the same single bytes as in ASCII
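To make the differences in the table tangible, the sketch below counts how many bytes a single character occupies in several encodings. It relies on Python's built-in codecs; `utf-16-be` and `utf-32-be` stand in for the 16-bit and 32-bit forms, which is an assumption made for illustration, and `encoded_sizes` is a hypothetical helper name.

```python
def encoded_sizes(character):
    """Map encoding name -> number of bytes needed for one character."""
    sizes = {}
    for encoding in ("ascii", "latin-1", "utf-8", "utf-16-be", "utf-32-be"):
        try:
            sizes[encoding] = len(character.encode(encoding))
        except UnicodeEncodeError:
            sizes[encoding] = None  # character not representable in this encoding
    return sizes

for character in ("A", "Ä", "€", "\U0001F44D"):
    print(repr(character), f"U+{ord(character):04X}", encoded_sizes(character))
```

The letter “A” fits in one byte everywhere except the fixed-width forms, while the thumbs-up emoji lies outside the BMP and needs four bytes in each of the Unicode encodings shown.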

Structure of the Unicode Standard

The Unicode Standard defines char­ac­ters and cor­re­spond­ing code points for letters, syl­labaries, ideograms, punc­tu­a­tion marks, special char­ac­ters, and numerals. It supports the Latin, Greek, Cyrillic, Arabic, Hebrew, and Thai alphabets. Ad­di­tion­al­ly, it includes Japanese (Katakana, Hiragana), Chinese, and Korean scripts (Hangul). There are also math­e­mat­i­cal, com­mer­cial, and technical special char­ac­ters, as well as his­tor­i­cal control char­ac­ters for teletypes.

The char­ac­ters are compiled in a series of character tables. We provide an overview of the most common character tables here.

Writing systems of the Unicode Standard

Character table | Includes these alphabets, among others
European Writing Systems | Armenian, Georgian, Greek, Latin
African Writing Systems | Ethiopian, Egyptian Hieroglyphs, Coptic
Middle Eastern Writing Systems | Arabic, Hebrew, Syriac
Central Asian Writing Systems | Mongolian, Tibetan, Old Turkic
South Asian Writing Systems | Brahmi, Tamil, Vedic
Southeast Asian Writing Systems | Khmer, Rohingya, Thai
Writing Systems of Indonesia and Oceania | Balinese, Buginese, Javanese
East Asian Writing Systems | CJK (Chinese, Japanese, Korean), Hangul (Korean), Hiragana (Japanese)
American Writing Systems | Cherokee, Canadian Syllabics, Osage

Symbols and punc­tu­a­tion of the Unicode Standard

Character table | Includes these characters, among others
Notation systems | Braille Patterns, Musical Notation, Duployan Shorthand
Punctuation | Punctuation of the English Language, Punctuation of European Languages, CJK Punctuation
Alphanumeric symbols | Mathematical Unicode Letters, Circled Unicode Letters
Technical symbols | Symbols of the APL Programming Language, Symbols for Optical Character Recognition
Numbers & numerals | Maya Numerals, Ottoman Siyaq Numerals, Numerals of Sumerian Cuneiform
Mathematical symbols | Arrows, Mathematical Operators, Geometric Shapes
Emoji & pictograms | Emoticons, Dingbats, Other Pictograms
Other symbols | Alchemical Symbols, Currency Characters, Chess, Domino, and Mahjong Characters
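Every character in these tables carries an official name and a general category. As a rough sketch, Python's standard `unicodedata` module can look both up; note that the module reflects whichever Unicode version the installed Python ships with, and `character_info` is a hypothetical helper name.

```python
import unicodedata

def character_info(character):
    """Return (code point label, general category, official name)."""
    return (
        f"U+{ord(character):04X}",
        unicodedata.category(character),
        unicodedata.name(character, "<unnamed>"),
    )

# A Latin letter, a currency sign, a chess piece, and a Braille pattern
for character in ("A", "\u20ac", "\u2654", "\u2801"):
    print(character, *character_info(character))
```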

What is Unicode used for?

The Unicode Standard primarily serves as a universal foun­da­tion for pro­cess­ing, storing, and ex­chang­ing text in any language. Most modern software com­po­nents, such as libraries, protocols, databases, etc., that operate on text are based on Unicode. We il­lus­trate the range of possible uses with the following examples.

Operating systems

Unicode is the internal standard for text rep­re­sen­ta­tion in most modern operating systems. Some operating systems, like Apple’s macOS, allow the use of Unicode char­ac­ters in file names.

Websites

The Unicode variant UTF-8 has become the standard for encoding HTML documents. As early as 2016, more than 80 percent of the world’s most visited websites used UTF-8 for storing and dis­play­ing their HTML documents. The Punycode standard has become es­tab­lished for using non-ASCII letters in domain names.
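How a domain name with non-ASCII letters is turned into its ASCII form can be tried out with Python's built-in `idna` codec. This is only a sketch: the codec implements the older IDNA 2003 rules, so results for some names may differ from what current registries accept, and the two helper names are chosen here for illustration.

```python
def to_ascii_domain(domain):
    """Encode a domain name to its ASCII-compatible (Punycode-based) form."""
    return domain.encode("idna").decode("ascii")

def from_ascii_domain(domain):
    """Decode an ASCII-compatible domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")

print(to_ascii_domain("münchen.de"))           # xn--mnchen-3ya.de
print(from_ascii_domain("xn--mnchen-3ya.de"))  # münchen.de
```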

Pro­gram­ming languages

Many modern programming languages use Unicode as the basis for text processing. Many of them also permit non-ASCII characters in the names of variables and functions. ECMAScript/JavaScript, for instance, accepts letters from any script in identifiers, although not emoji, which are symbols rather than letters. The following code illustrates this:

let größe = 42;
let π = Math.PI;
if (größe > π) {
 // …
}
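In Python, the string method `str.isidentifier()` reports whether a given string would be accepted as a name. The short sketch below uses a handful of sample names chosen for illustration: letters from many scripts qualify, whereas symbols such as emoji, or names starting with a digit, do not.

```python
candidates = ["größe", "π", "变量", "bool_var", "\U0001F44D", "2fast"]
for name in candidates:
    # True if Python would accept the string as a variable or function name
    print(repr(name), name.isidentifier())

# Non-ASCII letters can be used directly as names
π = 3.14159
größe = 2 * π
print(größe)
```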

Databases

The widely used database MySQL supports the complete Unicode character set with the character encoding “utf8mb4”. In contrast, the older “utf8” encoding (an alias for “utf8mb3”) stores at most three bytes per character, so characters whose UTF-8 encoding requires four bytes, such as most emoji, are rejected or lost.
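Whether a piece of text contains characters that need four bytes in UTF-8 can be checked before it is written to a database. The following is a minimal sketch that runs independently of MySQL; both function names are hypothetical.

```python
def utf8_length(character):
    """Number of bytes a single character occupies in UTF-8."""
    return len(character.encode("utf-8"))

def four_byte_characters(text):
    """Return the characters in text that need four bytes in UTF-8."""
    return [ch for ch in text if utf8_length(ch) == 4]

sample = "Grüße \U0001F44D €"
print(four_byte_characters(sample))  # only the thumbs-up emoji
```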

Fonts

Fonts contain the glyphs used for the graphic rep­re­sen­ta­tion of text. Due to the large number of char­ac­ters included in the Unicode Standard, there is no font that contains all char­ac­ters. Even the subset of the Basic Mul­ti­lin­gual Plane is covered com­plete­ly by only a few fonts. Here are a few examples:

Unicode font | Glyphs | License
Noto | approx. 77,000 | Open Font License
Sun-ExtA/B | approx. 50,000 | Freeware
Unifont | approx. 63,000 | GNU GPL
Code2000 | approx. 63,000 | Shareware

How is Unicode used?

In many cases, users employ Unicode without ever being aware of it. Digital text is presented in most documents and ap­pli­ca­tions as Unicode and can be freely copied, pasted, and edited by users. Sometimes, the end user may need to insert a specific Unicode character into text. There are various methods for doing this, which we will present below.

Special software keyboards

The use of special software keyboards is probably the most common method to insert Unicode char­ac­ters into text. Ubiq­ui­tous on mobile devices, software keyboards allow for switching between languages and their re­spec­tive alphabets. The key layout changes, with all char­ac­ters orig­i­nat­ing from the Unicode reper­toire. These char­ac­ters can be mixed and combined freely in texts.

A good example of this is emojis: Emojis are regular Unicode characters like letters, numbers, and special symbols. As with all other characters, the way an emoji is displayed is independent of how it is encoded internally. Each operating system displays the same emoji slightly differently.
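That an emoji is an ordinary character with a code point and a name can be seen in Python. The sketch below uses the thumbs-up emoji, written as an escape sequence so that it does not depend on the editor's font.

```python
import unicodedata

thumbs_up = "\U0001F44D"

print(len(thumbs_up))                 # 1: a single character
print(f"U+{ord(thumbs_up):X}")        # U+1F44D
print(unicodedata.name(thumbs_up))    # THUMBS UP SIGN
print(thumbs_up.encode("utf-8"))      # the four bytes that are actually stored
```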

Software keyboards are not limited to mobile devices; they are also available on desktops. They can be easily accessed in Windows, macOS, and many Linux distributions and display a different set of characters depending on the selected language. Since the number of keys is limited, not all Unicode characters are shown. Instead, there’s a language-specific selection of the most commonly used characters.

Unicode character tables

Besides software keyboards, Unicode character tables are probably the most useful way to access Unicode char­ac­ters. Remember, a character set (“Coded character set”) is the col­lec­tion of all char­ac­ters along with their cor­re­spond­ing unique code points. Such a structure lends itself to a table format, and indeed the Unicode Standard includes exactly such tables called Unicode Code Charts. From these tables, users can copy specific char­ac­ters to use elsewhere. Al­ter­na­tive­ly, end users can read the cor­re­spond­ing code point, for example, to use it as a numeric character reference—more on this in the next section.

Many desktop operating systems also include a Unicode character table. This provides an overview of all available Unicode char­ac­ters along with their code point, de­scrip­tion, and glyph. A character can be inserted or copied with a click. A character table can also be created with just a few lines of code. Later in this article, we’ll show an example using the Python pro­gram­ming language.

Numeric character reference

The core of the Unicode Standard is the mapping of char­ac­ters to code points. Knowing a character’s code point allows it to be used to embed the cor­re­spond­ing character in various contexts. On Windows, entering Unicode symbols is done using the standard hardware keyboard with a special key com­bi­na­tion. Note that the code point number typically needs to be entered in hexa­dec­i­mal format.

Pro­gram­mers most often need numeric character ref­er­ences. The hexa­dec­i­mal rep­re­sen­ta­tion of code points allows for the mapping of a Unicode character into char­ac­ters of the ASCII character set. We demon­strate this approach in HTML; fun­da­men­tal­ly, it works the same in Python, C++, etc.

The general scheme for embedding a character using a numeric reference includes the reference itself, as well as an opening and closing term: In HTML documents, the numeric reference starts with &#x and ends with ;. In between, without any spaces, the hexadecimal code point is entered (up to six digits; leading zeros may be omitted), resulting in the pattern &#xNNNN;.

To insert the copyright symbol “©” into an HTML document, for example, we proceed as follows:

  1. Search for the character in a Unicode table.

  2. Read the code point associated with the character. In our example, the code point is listed as “U+00A9,” which is the hexadecimal representation.

  3. Compose the character reference and enter it into HTML source code or a Markdown document. In our case, we input &#xA9;, which renders as the character “©”.

A less common approach allows for the use of code points in decimal rather than hexadecimal representation. In this case, the numeric reference begins with &# (without the “x”) and ends as usual with ;. In between, the code point is written in decimal form. In our example, the numeric reference &#169; results in the copyright symbol.
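Both forms of numeric reference can be generated and resolved programmatically. The sketch below uses Python's standard `html` module; the two helper names are chosen here for illustration.

```python
import html

def hex_reference(character):
    """Build a hexadecimal numeric character reference."""
    return f"&#x{ord(character):X};"

def decimal_reference(character):
    """Build a decimal numeric character reference."""
    return f"&#{ord(character)};"

print(hex_reference("©"))      # &#xA9;
print(decimal_reference("©"))  # &#169;

# html.unescape() resolves both forms back to the character
print(html.unescape("&#xA9; and &#169;"))
```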

Tip

Use the Unicode Character Inspector to quickly find the different codes for a character.

Named character entities

Since the notation of Unicode characters as numeric references is not intuitive for humans, there is another method: named character entities. These are defined for commonly used characters and assign a short, memorable name to the character. A named character entity starts with the ampersand & and ends with a semicolon ;. The defined name is placed in between without spaces. To insert the copyright symbol “©” in HTML, simply write &copy;.
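The mapping between entity names and code points is also available in Python's standard library, which can serve as a quick lookup. This is a brief sketch using the `html.entities` tables and `html.unescape()`.

```python
import html
from html.entities import codepoint2name, name2codepoint

print(name2codepoint["copy"])        # 169
print(codepoint2name[0xA9])          # copy
print(html.unescape("&copy; 2024"))  # © 2024

# The reverse direction: escape characters that have a special meaning in HTML
print(html.escape("a < b & c"))      # a &lt; b &amp; c
```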

Tip

The complete list of defined character entities is doc­u­ment­ed in the HTML Standard.

Pro­gram­ming languages

Most pro­gram­ming languages include basic functions to convert char­ac­ters and code points. The cor­re­spond­ing functions are often called ord(character) and chr(code point). The following applies:

chr(ord(character)) == character

Note that every character has a code point, so ord() always succeeds for a single character. Conversely, chr() only works for integers inside the Unicode code space, which ranges from 0 to 0x10FFFF. We demonstrate the basic scheme here with a short Python example:

# Determine the decimal code point of a character
ord('A') # `65`
# Determine the hexadecimal code point of a character
hex(ord('A')) # `0x41`
# Determine the character corresponding to a code point
chr(65) # `'A'`
chr(0x41) # `'A'`
chr(0x110000) # ValueError, because the highest code point is `0x10FFFF`
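When code points come from user input or a file, it can be useful to guard the conversion. The following sketch wraps `chr()` in a hypothetical helper, `safe_chr`, that returns `None` instead of raising an error for values outside the Unicode range.

```python
def safe_chr(code_point):
    """Return the character for a code point, or None if it is out of range."""
    try:
        return chr(code_point)
    except (ValueError, OverflowError):
        # chr() rejects negative numbers and anything above 0x10FFFF
        return None

print(safe_chr(0x41))      # A
print(safe_chr(0x110000))  # None
print(safe_chr(-1))        # None
```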

With the help of these functions, it’s easy to create a character table for code points of the Unicode character set. For this, you iterate the code points and output the cor­re­spond­ing char­ac­ters. With Python, this can be done in just a few lines of code:

# Skip the control characters (0-31 and 127-159), which have no visible glyph
# Print the printable ASCII characters
for code_point in range(32, 127):
    print(code_point, hex(code_point), chr(code_point))
# Print the additional printable characters of ISO Latin-1
for code_point in range(160, 256):
    print(code_point, hex(code_point), chr(code_point))
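The same idea extends beyond Latin-1. As a sketch, the generator below adds each character's official name from the standard `unicodedata` module and skips code points that have none, such as control characters and unassigned positions; `character_table` is a hypothetical name.

```python
import unicodedata

def character_table(start, stop):
    """Yield (decimal, label, character, name) for named code points in [start, stop)."""
    for code_point in range(start, stop):
        character = chr(code_point)
        name = unicodedata.name(character, None)
        if name is not None:  # skip control characters and unassigned code points
            yield code_point, f"U+{code_point:04X}", character, name

# Greek capital letters Alpha to Omega (U+0391 to U+03A9)
for row in character_table(0x391, 0x3AA):
    print(*row)
```

The range contains one unassigned position (U+03A2), which the generator silently leaves out.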

Program Library ICU

The In­ter­na­tion­al Com­po­nents for Unicode (ICU) are con­sol­i­dat­ed in a program library provided by the Unicode Con­sor­tium. The library is released under an open-source license and can be used on many operating systems. The software serves the purpose of pro­gram­mat­ic In­ter­na­tion­al­iza­tion (often ab­bre­vi­at­ed as “i18n”). Its ap­pli­ca­tions include:

  • Pro­cess­ing of Unicode texts
  • Support for regular ex­pres­sions in Unicode
  • Parsing and for­mat­ting of calendar dates, times, numbers, cur­ren­cies, and messages

The ICU library is available in two versions:

  • “icu4c” is written in C/C++ and provides an API for these languages.
  • “icu4j” is written in Java and provides an API for this language.

The use of the com­po­nents provides con­sis­tent results re­gard­less of the un­der­ly­ing platform.

Charset meta tag in the head of HTML documents

Most HTML documents today use the UTF-8 character encoding. To ensure that visitors see the document without erroneous char­ac­ters, a “Charset” meta tag should be placed in the head of the HTML document. This instructs the browser to interpret the retrieved document as UTF-8 and is il­lus­trat­ed below:

<head>
<meta charset="utf-8">
<!-- additional head elements -->
</head>
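What “erroneous characters” look like when the encoding is guessed wrongly can be simulated in Python. This sketch encodes a sample word as UTF-8 and then decodes the bytes once correctly and once as ISO Latin-1; the sample word is arbitrary.

```python
text = "Müller"
utf8_bytes = text.encode("utf-8")

# Correct interpretation: the bytes are decoded with the encoding used to write them
print(utf8_bytes.decode("utf-8"))    # Müller

# Wrong interpretation: each byte of the two-byte "ü" becomes its own character
garbled = utf8_bytes.decode("latin-1")
print(garbled)                       # MÃ¼ller
```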

Instagram fonts

The popular social network Instagram does not allow text for­mat­ting for biography in­for­ma­tion, posts, or stories. This limits users’ creative options. However, clever de­vel­op­ers have found a workaround: Instagram uses Unicode, making it possible to compose text that appears formatted using special char­ac­ters. This often involves char­ac­ters that resemble Latin letters. The easiest way to create such text is with an Insta Fonts Generator. Ad­di­tion­al­ly, using Instagram fonts also works in other social networks.
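Such generators typically work by swapping each Latin letter for a look-alike character elsewhere in Unicode. The following is an illustrative sketch of one possible mapping, not the method of any particular generator: it shifts ASCII letters into the Mathematical Bold range (starting at U+1D400 for capitals and U+1D41A for small letters). The function name `to_math_bold` is hypothetical.

```python
def to_math_bold(text):
    """Replace ASCII letters with Mathematical Bold letters; leave the rest unchanged."""
    result = []
    for ch in text:
        if "A" <= ch <= "Z":
            result.append(chr(0x1D400 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            result.append(chr(0x1D41A + ord(ch) - ord("a")))
        else:
            result.append(ch)
    return "".join(result)

print(to_math_bold("Hello, Unicode!"))
```

Because the result consists of mathematical symbols rather than letters, screen readers and search functions may not treat it as ordinary text.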
