Unicode is an international standard for encoding, displaying, and processing text characters from nearly all the world’s writing systems. Each character is assigned a unique code point, which can be stored in various character encodings like UTF-8 or UTF-16. This allows Unicode to provide consistent representation and processing of texts across different platforms and languages.

What is Unicode?

The name Unicode is commonly read as shorthand for “Universal Character Encoding.” It is a global standard for representing text characters in binary form and enables consistent storage, exchange, and processing of text across different digital systems and platforms.

Unicode is innovative in that it is not tied to the formats and encodings of a single alphabet of a particular human language. Rather, Unicode was created with the aim of serving as a unified standard for representing all writing systems and characters developed by humans.

Since the release of Unicode 1.0 at the end of 1991, the standard has fulfilled its purpose. Unicode is internally used by browsers and operating systems as a unified format. With the release of version 16.0 by the Unicode Consortium in 2024, the Unicode Standard now encompasses a repertoire of 154,998 characters. The character set covered by the Unicode Standard is completely identical to the “Universal Coded Character Set” (UCS), which is internationally standardized as ISO/IEC 10646.

Technical basis for character encoding

First, it’s important to understand that all information present in a digital system ultimately consists of long sequences of zeros and ones. This is also referred to as “binary representation.” The binary code is somewhat like an alphabet in itself. However, in binary code, there are only two “letters”: zeros and ones. Each position within a sequence of zeros and ones is called a “bit.”

The basic trick of digital information technology is to represent characters from different alphabets as sequences of zeros and ones. This allows for encoding numbers and letters, as well as any other distinguishable states. Usually, these are called “symbols.” The longer the sequence of zeros and ones for representing a single symbol, the more symbols can be depicted. With each added bit, the number of possible symbols doubles.

A concrete example: Imagine we have binary “words” that are two bits long. This would allow us to encode four numbers:

2-bit word | Number
00 | 0
01 | 1
10 | 2
11 | 3

If we add another bit to the beginning of the sequence, the number of possible bit-words doubles. These consist of the already known bit sequences, each preceded by a zero or one. Thus, we can encode eight numbers:

3-bit word | Number
000 | 0
001 | 1
010 | 2
011 | 3
100 | 4
101 | 5
110 | 6
111 | 7

Fact

An 8-bit word is referred to as an octet or byte.
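The doubling rule described above can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `bit_words` is our own and not part of any standard.

```python
from itertools import product

def bit_words(n_bits: int) -> list[str]:
    """Return every binary word of the given length, in counting order."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]

for n in (2, 3, 8):
    words = bit_words(n)
    # Each additional bit doubles the count: 2 ** n words in total
    print(n, "bits ->", len(words), "words")

# Pair each 3-bit word with the number it encodes
for word in bit_words(3):
    print(word, int(word, 2))
```

The second loop reproduces the 3-bit table shown above.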

For simplicity, we’ve shown the encoding of numbers as an example here. However, the same principle applies to digital systems for encoding letters or any other characters and states. Here is a highly simplified example of binary encoding of letters:

3-bit word | Letter
000 | A
001 | B
010 | C

The graphic representation of a character is called a glyph. Depending on the font used, there are different glyphs for the same character, and even within a single font, there can be multiple variations for a glyph. Think, for instance, of different weights, ligatures, italics, etc. Here is an expanded representation that includes the mapping from the character to the glyph:

Binary representation | Decimal number | Encoded character | Glyph
1000001 | 65 | uppercase “A” of the Latin alphabet | A
1100001 | 97 | lowercase “a” of the Latin alphabet | a
0110000 | 48 | Arabic numeral “0” | 0
0111001 | 57 | Arabic numeral “9” | 9
11000100 | 196 | uppercase “Ä” | Ä
11000001 | 193 | uppercase “Á” | Á
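The rows of this table can be reproduced in Python, whose built-in `ord` returns the code point of a character. The sketch below is a minimal illustration; the function name `describe` is our own choice, and the padding to seven binary digits is only there to match the table.

```python
def describe(character: str) -> tuple[str, int, str]:
    """Return (binary representation, decimal code point, character)."""
    code_point = ord(character)
    # "07b" pads to at least seven binary digits, as in the table above
    return format(code_point, "07b"), code_point, character

for ch in ["A", "a", "0", "9", "Ä", "Á"]:
    binary, decimal, glyph = describe(ch)
    print(binary, decimal, glyph)
```

Which glyph finally appears for the printed character depends on the font of the terminal or editor, not on the code point.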

Terminology of character encoding

Digital character encoding involves a range of specific terms and concepts. In everyday usage, some of these may be used interchangeably, but in technical contexts, especially when working with Unicode, it’s important to distinguish them clearly. Below are key terms along with their definitions:

Term | Meaning
Character set | A collection of possible characters, such as digits “0–9” or letters “a–z”
Code point | A numerical value assigned to a specific character within a coding system
Coded character set | A system that maps each character to exactly one code point
Character encoding | The process of converting characters into a digital format (e.g., binary)

Overview of common character encodings

Before the advent of Unicode, there was a wide variety of specific encodings. The norm was to use a distinct encoding for each language or language family. This often led to display errors and data inconsistencies. To counter this, character encodings were frequently modeled as backward-compatible supersets of an existing standard. The modern Unicode standard builds on the earlier ISO Latin-1 encoding, which in turn is based on the ASCII character code.

Character encoding | Bits per character | Possible characters | Character set
ASCII | 7 bits | 128 | Letters, numbers, and special characters of the American keyboard, as well as control characters for teletypes
ISO Latin-1 (ISO 8859-1) | 8 bits | 256 | First 128 characters like ASCII, next 128 characters for special characters of European languages
Universal Coded Character Set 2 (UCS-2) | 16 bits | 65,536 | Characters of the “Basic Multilingual Plane” (BMP); first 256 characters like in ISO Latin-1
Universal Coded Character Set 4 (UCS-4) | 32 bits | 1,114,112 code points (U+0000 to U+10FFFF) | Characters of the BMP and the additional planes beyond it; total of 154,998 characters in Unicode Version 16.0; first 256 characters like ISO Latin-1
UCS Transformation Format 8 Bit (UTF-8) | 8/16/24/32 bits | 1,114,112 code points (U+0000 to U+10FFFF) | Any characters from UCS-2 and UCS-4; the first 128 characters are encoded byte for byte like ASCII, all others use two to four bytes
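To make the “bits per character” column more tangible, the following Python sketch counts how many bytes a few sample characters occupy in different Unicode encodings. The sample characters and the helper name `encoded_sizes` are our own choices for illustration.

```python
samples = ["A", "Ä", "€", "😀"]

def encoded_sizes(character: str) -> dict[str, int]:
    """Number of bytes the character occupies in each encoding."""
    return {
        name: len(character.encode(name))
        for name in ("utf-8", "utf-16-be", "utf-32-be")
    }

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# ASCII characters keep their single-byte value in UTF-8 ...
assert "A".encode("utf-8") == "A".encode("ascii")
# ... but a Latin-1 character such as Ä becomes two bytes
assert "Ä".encode("latin-1") == b"\xc4"
assert "Ä".encode("utf-8") == b"\xc3\x84"
```

UTF-8 grows from one to four bytes depending on the code point, while UTF-32 always uses four bytes per character.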

Structure of the Unicode Standard

The Unicode Standard defines characters and corresponding code points for letters, syllabaries, ideograms, punctuation marks, special characters, and numerals. It supports the Latin, Greek, Cyrillic, Arabic, Hebrew, and Thai alphabets. Additionally, it includes Japanese (Katakana, Hiragana), Chinese, and Korean scripts (Hangul). There are also mathematical, commercial, and technical special characters, as well as historical control characters for teletypes.

The characters are compiled in a series of character tables. We provide an overview of the most common character tables here.

Writing systems of the Unicode Standard

Character table | Includes these alphabets, among others
European Writing Systems | Armenian, Georgian, Greek, Latin
African Writing Systems | Ethiopian, Egyptian Hieroglyphs, Coptic
Middle Eastern Writing Systems | Arabic, Hebrew, Syriac
Central Asian Writing Systems | Mongolian, Tibetan, Old Turkic
South Asian Writing Systems | Brahmi, Tamil, Vedic
Southeast Asian Writing Systems | Khmer, Rohingya, Thai
Writing Systems of Indonesia and Oceania | Balinese, Buginese, Javanese
East Asian Writing Systems | CJK (Chinese, Japanese, Korean), Hangul (Korean), Hiragana (Japanese)
American Writing Systems | Cherokee, Canadian Syllabics, Osage

Symbols and punctuation of the Unicode Standard

Character table | Includes these characters, among others
Notation systems | Braille Patterns, Musical Notation, Duployan Shorthand
Punctuation | Punctuation of the English Language, Punctuation of European Languages, CJK Punctuation
Alphanumeric symbols | Mathematical Unicode Letters, Circled Unicode Letters
Technical symbols | Symbols of the APL Programming Language, Symbols for Optical Character Recognition
Numbers & numerals | Maya Numerals, Ottoman Siyaq Numerals, Numerals of Sumerian Cuneiform
Mathematical symbols | Arrows, Mathematical Operators, Geometric Shapes
Emoji & pictograms | Emoticons, Dingbats, Other Pictograms
Other symbols | Alchemical Symbols, Currency Characters, Chess, Domino, and Mahjong Characters

What is Unicode used for?

The Unicode Standard primarily serves as a universal foundation for processing, storing, and exchanging text in any language. Most modern software components, such as libraries, protocols, databases, etc., that operate on text are based on Unicode. We illustrate the range of possible uses with the following examples.

Operating systems

Unicode is the internal standard for text representation in most modern operating systems. Some operating systems, like Apple’s macOS, allow the use of Unicode characters in file names.

Websites

The Unicode encoding UTF-8 has become the standard for encoding HTML documents. As early as 2016, more than 80 percent of the world’s most visited websites used UTF-8 for storing and displaying their HTML documents. For non-ASCII letters in domain names, the Punycode standard has become established: it represents such names using only ASCII characters.
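As a rough illustration of how Punycode maps a non-ASCII domain name to ASCII, the sketch below uses the “idna” codec from Python’s standard library. Note the assumptions: the domain “bücher.example” is a made-up example, the function names are our own, and Python’s built-in codec implements the older IDNA 2003 rules, so production systems may prefer a dedicated library.

```python
def to_ascii_domain(domain: str) -> str:
    """Convert an internationalized domain name to its ASCII (Punycode) form."""
    return domain.encode("idna").decode("ascii")

def to_unicode_domain(ascii_domain: str) -> str:
    """Convert an ASCII (Punycode) domain name back to Unicode."""
    return ascii_domain.encode("ascii").decode("idna")

print(to_ascii_domain("bücher.example"))
print(to_unicode_domain("xn--bcher-kva.example"))
```

Each label containing non-ASCII letters receives the prefix “xn--”, while pure ASCII labels such as “example” stay unchanged.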

Programming languages

Many modern programming languages use Unicode as the basis for text processing. A more recent development is the ability to use non-ASCII Unicode characters in the names of variables and functions. ECMAScript/JavaScript permits this for Unicode letters and similar identifier characters, though not for emoji, as illustrated in the following code:

// Identifiers may contain Unicode letters; emoji are not valid identifier characters
const π = Math.PI;
const größe = 2;
const fläche = π * größe ** 2;
if (fläche > 10) {
 // …
}

Databases

The popular and widely used database MySQL supports the complete Unicode character set with the character encoding “utf8mb4”. In contrast, MySQL’s “utf8” encoding stores at most three bytes per character, so characters whose UTF-8 encoding requires four bytes (code points above U+FFFF, such as most emoji) cannot be stored and are lost.
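Whether a given text contains such four-byte characters can be checked before it is written to the database. The following Python sketch shows one possible check; the function name `needs_utf8mb4` is hypothetical and not part of MySQL or any database driver.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character takes four bytes in UTF-8 (code point above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)

for sample in ["Grüße", "日本語", "ok 👍"]:
    longest = max(len(ch.encode("utf-8")) for ch in sample)
    print(repr(sample), "longest character:", longest, "bytes,",
          "needs utf8mb4:", needs_utf8mb4(sample))
```

Only the last sample contains a character outside the Basic Multilingual Plane and therefore requires “utf8mb4”.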

Fonts

Fonts contain the glyphs used for the graphic representation of text. Due to the large number of characters included in the Unicode Standard, there is no font that contains all characters. Even the subset of the Basic Multilingual Plane is covered completely by only a few fonts. Here are a few examples:

Unicode font | Glyphs | License
Noto | approx. 77,000 | Open Font License
Sun-ExtA/B | approx. 50,000 | Freeware
Unifont | approx. 63,000 | GNU GPL
Code2000 | approx. 63,000 | Shareware

How is Unicode used?

In many cases, users employ Unicode without ever being aware of it. Digital text is presented in most documents and applications as Unicode and can be freely copied, pasted, and edited by users. Sometimes, the end user may need to insert a specific Unicode character into text. There are various methods for doing this, which we will present below.

Special software keyboards

The use of special software keyboards is probably the most common method to insert Unicode characters into text. Ubiquitous on mobile devices, software keyboards allow for switching between languages and their respective alphabets. The key layout changes, with all characters originating from the Unicode repertoire. These characters can be mixed and combined freely in texts.

A good example of this is emojis: Emojis are regular Unicode characters like letters, numbers, and special symbols. As with other characters, the visual representation of an emoji is independent of its internal encoding, which is why each operating system displays the same emoji slightly differently.

Software keyboards are not only found on mobile devices. They’re also available on desktops. They can be easily accessed in Windows, macOS, and many Linux distributions, displaying a different set of characters depending on the selected language. Since the number of keys is limited, not all Unicode characters are shown. Instead, there’s a language-specific selection of the most commonly used characters.

Unicode character tables

Besides software keyboards, Unicode character tables are probably the most useful way to access Unicode characters. Remember, a coded character set is the collection of all characters along with their corresponding unique code points. Such a structure lends itself to a table format, and indeed the Unicode Standard includes exactly such tables called Unicode Code Charts. From these tables, users can copy specific characters to use elsewhere. Alternatively, end users can read the corresponding code point, for example, to use it as a numeric character reference. More on this in the next section.

Many desktop operating systems also include a Unicode character table. This provides an overview of all available Unicode characters along with their code point, description, and glyph. A character can be inserted or copied with a click. A character table can also be created with just a few lines of code. Later in this article, we’ll show an example using the Python programming language.

Numeric character reference

The core of the Unicode Standard is the mapping of characters to code points. Knowing a character’s code point allows it to be used to embed the corresponding character in various contexts. On Windows, entering Unicode symbols is done using the standard hardware keyboard with a special key combination. Note that the code point number typically needs to be entered in hexadecimal format.

Programmers most often need numeric character references. The hexadecimal representation of code points allows for the mapping of a Unicode character into characters of the ASCII character set. We demonstrate this approach in HTML; comparable escape notations exist in Python, C++, etc.

The general scheme for embedding a character using a numeric reference includes the reference itself, as well as an opening and closing term: In HTML documents, the numeric reference starts with &#x and ends with ;. In between, without any spaces, the hexadecimal code point is entered (two to six digits, depending on the character), resulting in a pattern such as &#xNNNN;.

To insert the copyright symbol “©” into an HTML document, for example, we proceed as follows:

  1. Search for the character in a Unicode table.

  2. Read the code point associated with the character. In our example, the code point is listed as “U+00A9,” which is the hexadecimal representation.

  3. Compose the character reference and enter it into HTML source code or a Markdown document. In our case, we input &#xA9; and this renders the character “©”.

A less common approach allows for the use of code points in decimal rather than hexadecimal representation. In this case, the numeric reference begins with &# (without the “x”) and ends as usual with ;. In between, the code point is written in decimal form. In our example, the numeric reference &#169; results in the copyright symbol.
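Both notations can be generated and checked programmatically. The following Python sketch builds the hexadecimal and decimal references from a character and decodes them again with `html.unescape` from the standard library; the two helper names are our own.

```python
import html

def hex_reference(character: str) -> str:
    """Build a hexadecimal numeric character reference, e.g. for use in HTML."""
    return f"&#x{ord(character):X};"

def decimal_reference(character: str) -> str:
    """Build the equivalent decimal numeric character reference."""
    return f"&#{ord(character)};"

for ch in ["©", "€"]:
    hex_ref, dec_ref = hex_reference(ch), decimal_reference(ch)
    # html.unescape resolves both forms to the same character
    print(hex_ref, dec_ref, html.unescape(hex_ref) == html.unescape(dec_ref) == ch)
```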

Tip

Use the Unicode Character Inspector to quickly find the different codes for a character.

Named character entities

Since the notation of Unicode characters as numeric references is not intuitive for humans, there is another method: named character entities. These are defined for commonly used characters and assign a short, memorable name to the character. A named character entity starts with the ampersand & and ends with a semicolon ;. The defined name is placed in between without spaces. To insert the copyright symbol “©” in HTML, simply write &copy;.
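Python’s standard library ships a table of these names, which makes it easy to look up the code point behind an entity. The snippet below is a small sketch using `html.entities.name2codepoint` and `html.unescape`; the helper name `named_reference` is our own.

```python
import html
from html.entities import name2codepoint

def named_reference(name: str) -> str:
    """Build a named character entity from its name."""
    return f"&{name};"

for name in ["copy", "euro", "nbsp"]:
    code_point = name2codepoint[name]
    print(named_reference(name), f"U+{code_point:04X}", code_point)

# The named entity resolves to the same character as the numeric references
print(html.unescape(named_reference("copy")))
```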

Tip

The complete list of defined character entities is documented in the HTML Standard.

Programming languages

Most programming languages include basic functions to convert between characters and code points. The corresponding functions are often called ord(character) and chr(code point). The following applies:

chr(ord(character)) == character

Note that it is always possible to determine the code point corresponding to a character. Conversely, the assignment only works for numbers that are actually defined as code points of the character code. We demonstrate the basic scheme here with a short Python example:

# Determine the decimal code point of a character
ord('A')  # 65
# Determine the hexadecimal code point of a character
hex(ord('A'))  # '0x41'
# Determine the character corresponding to a code point
chr(65)  # 'A'
chr(0x41)  # 'A'
# The highest valid code point is 0x10FFFF
chr(0x10FFFF)
chr(0x110000)  # ValueError, because code point > 0x10FFFF

With the help of these functions, it’s easy to create a character table for code points of the Unicode character set. For this, you iterate the code points and output the corresponding characters. With Python, this can be done in just a few lines of code:

# Start `range` at `32` to skip the ASCII control characters
# Print the printable ASCII characters (127 is the control character DEL)
for code_point in range(32, 127):
    print(code_point, hex(code_point), chr(code_point))
# Print the ISO Latin-1 characters beyond ASCII (128 to 159 are control characters)
for code_point in range(160, 256):
    print(code_point, hex(code_point), chr(code_point))
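Such a table becomes more useful when it also shows the official character name. The sketch below extends the idea with Python’s `unicodedata` module and skips control characters and unassigned code points based on their general category; the function name `table_rows` and the chosen range are our own assumptions.

```python
import unicodedata

def table_rows(start: int, stop: int):
    """Yield (code point label, character, name) for assigned, non-control characters."""
    for code_point in range(start, stop):
        character = chr(code_point)
        # Category "Cc" marks control characters, "Cn" unassigned code points
        if unicodedata.category(character) in ("Cc", "Cn"):
            continue
        name = unicodedata.name(character, "<unnamed>")
        yield f"U+{code_point:04X}", character, name

# Print a small slice of the Latin-1 Supplement block
for row in table_rows(0x00A0, 0x00B0):
    print(*row)
```

The names come from the Unicode database bundled with the Python interpreter, so the covered Unicode version depends on the Python version in use.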

Program Library ICU

The International Components for Unicode (ICU) are consolidated in a program library provided by the Unicode Consortium. The library is released under an open-source license and can be used on many operating systems. The software serves the purpose of programmatic internationalization (often abbreviated as “i18n”). Its applications include:

  • Processing of Unicode texts
  • Support for regular expressions in Unicode
  • Parsing and formatting of calendar dates, times, numbers, currencies, and messages

The ICU library is available in two versions:

  • “icu4c” is written in C/C++ and provides an API for these languages.
  • “icu4j” is written in Java and provides an API for this language.

The use of the components provides consistent results regardless of the underlying platform.

Charset meta tag in the head of HTML documents

Most HTML documents today use the UTF-8 character encoding. To ensure that visitors see the document without erroneous characters, a “Charset” meta tag should be placed in the head of the HTML document. This instructs the browser to interpret the retrieved document as UTF-8 and is illustrated below:

<head>
<meta charset="utf-8">
<!-- additional head elements -->
</head>
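What “erroneous characters” look like when the encoding is guessed wrongly can be simulated in Python. The sketch below encodes a sample text as UTF-8 and then decodes the bytes once correctly and once as ISO Latin-1; the sample text is an arbitrary choice of ours.

```python
original = "Zürich café"
utf8_bytes = original.encode("utf-8")

# Correct: decode with the encoding the bytes were written in
print(utf8_bytes.decode("utf-8"))

# Wrong guess: each two-byte UTF-8 sequence turns into two garbage characters
garbled = utf8_bytes.decode("latin-1")
print(garbled)
```

The second output shows the typical pattern of a mismatched encoding, which is what a correct charset declaration helps to avoid.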

Instagram fonts

The popular social network Instagram does not allow text formatting for biography information, posts, or stories. This limits users’ creative options. However, clever developers have found a workaround: Instagram uses Unicode, making it possible to compose text that appears formatted using special characters. This often involves characters that resemble Latin letters. The easiest way to create such text is with an Insta Fonts Generator. Additionally, using Instagram fonts also works in other social networks.
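How such a generator might work can be sketched in a few lines of Python. This is an assumption on our part, not a description of any particular tool: the example maps plain Latin letters to the “Mathematical Bold” letters of the Unicode block Mathematical Alphanumeric Symbols, and the function name `to_math_bold` is hypothetical.

```python
BOLD_UPPER_A = 0x1D400  # MATHEMATICAL BOLD CAPITAL A
BOLD_LOWER_A = 0x1D41A  # MATHEMATICAL BOLD SMALL A

def to_math_bold(text: str) -> str:
    """Replace ASCII letters with look-alike bold characters; keep everything else."""
    result = []
    for ch in text:
        if "A" <= ch <= "Z":
            result.append(chr(BOLD_UPPER_A + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            result.append(chr(BOLD_LOWER_A + ord(ch) - ord("a")))
        else:
            result.append(ch)
    return "".join(result)

print(to_math_bold("Hello Unicode 2024"))
```

The result only looks like bold text. Technically it consists of different characters, so search functions and screen readers may not treat it as the original words.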
