What is Unicode? Definition and explanation

Unicode is an international standard for encoding, displaying, and processing text characters from nearly all the world’s writing systems. Each character is assigned a unique code point, which can be stored in various character encodings like UTF-8 or UTF-16. This allows Unicode to provide consistent representation and processing of texts across different platforms and languages.

What is Unicode?

Unicode is a global standard for representing text characters in binary form. The name is not an acronym; it alludes to the goal of a unique, unified, and universal character encoding. Unicode enables consistent storage, exchange, and processing of text across different digital systems and platforms.

Unicode is innovative in that it is not tied to the formats and encodings of a single alphabet of a particular human language. Rather, Unicode was created with the aim of serving as a unified standard for representing all writing systems and characters developed by humans.

Since the release of Unicode 1.0 at the end of 1991, the standard has increasingly fulfilled this purpose. Browsers and operating systems use Unicode internally as a unified format. With version 16.0, released by the Unicode Consortium in 2024, the Unicode Standard encompasses a repertoire of 154,998 characters. The character set covered by the Unicode Standard is identical to the “Universal Coded Character Set” (UCS), which is internationally standardized as ISO/IEC 10646.

Technical basis for character encoding

First, it’s important to understand that, at the lowest level, all information in a digital system consists of long sequences of zeros and ones. This is also referred to as “binary representation.” The binary code is somewhat like an alphabet in itself. However, in binary code, there are only two “letters”: zero and one. Each position within a sequence of zeros and ones is called a “bit.”

The basic trick of digital information technology is to represent characters from different alphabets as sequences of zeros and ones. This allows for encoding numbers and letters, as well as any other distinguishable states. Usually, these are called “symbols.” The longer the sequence of zeros and ones for representing a single symbol, the more symbols can be depicted. With each added bit, the number of possible symbols doubles.

A concrete example: Imagine we have binary “words” that are two bits long. This would allow us to encode four numbers:

2-bit word	Number
00	0
01	1
10	2
11	3

If we add another bit to the beginning of the sequence, the number of possible bit-words doubles. These consist of the already known bit sequences, each preceded by a zero or one. Thus, we can encode eight numbers:

3-bit word	Number
000	0
001	1
010	2
011	3
100	4
101	5
110	6
111	7
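
The doubling can also be reproduced programmatically. The following is a minimal Python sketch; the helper name bit_words is our own choice for illustration and is not part of any library:

```python
from itertools import product

def bit_words(n_bits):
    """Return every bit-word of the given length, in counting order."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]

for n_bits in (2, 3, 8):
    words = bit_words(n_bits)
    # The number of words always equals 2 ** n_bits
    print(n_bits, "bits:", len(words), "words, starting with", words[:4])

# Reading each word as a binary number gives the numbers from the table
print([int(word, 2) for word in bit_words(2)])
```

Each additional bit doubles the length of the list, which is why 8 bits already yield 256 distinct words.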

Fact

An 8-bit word is referred to as an octet or byte.

For simplicity, we’ve shown the encoding of numbers as an example here. However, the same principle applies to digital systems for encoding letters or any other characters and states. Here is a highly simplified example of binary encoding of letters:

3-bit word	Letter
000	A
001	B
010	C

The graphic representation of a character is called a glyph. Depending on the font used, there are different glyphs for the same character, and even within a single font, there can be multiple variations for a glyph. Think, for instance, of different weights, ligatures, italics, etc. Here is an expanded representation that includes the mapping from the character to the glyph:

Binary representation	Decimal number	Encoded character	Glyph
1000001	65	uppercase “A” of the Latin alphabet	A
1100001	97	lowercase “a” of the Latin alphabet	a
0110000	48	Arabic numeral “0”	0
0111001	57	Arabic numeral “9”	9
11000100	196	uppercase “Ä”	Ä
11000001	193	uppercase “Á”	Á
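
The first three columns of this table can be derived from the character itself. Here is a small Python sketch; the function name describe is hypothetical, and the binary form is padded to at least seven digits to match the table:

```python
def describe(character):
    """Return (binary representation, decimal number, character)."""
    code_point = ord(character)
    # "07b" formats the number in binary with at least seven digits
    return format(code_point, "07b"), code_point, character

for character in "Aa09ÄÁ":
    print(*describe(character), sep="\t")
```

The glyph in the last column is not computed by the program at all; it is chosen by the font that displays the output.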

Terminology of character encoding

Digital character encoding involves a range of specific terms and concepts. In everyday usage, some of these may be used interchangeably, but in technical contexts — especially when working with Unicode — it’s important to distinguish them clearly. Below are key terms along with their definitions:

Term	Meaning
Character set	A collection of possible characters, such as digits “0–9” or letters “a–z”
Code point	A numerical value assigned to a specific character within a coding system
Coded character set	A system that maps each character to exactly one code point
Character encoding	The process of converting characters into a digital format (e.g., binary)

Overview of common character encodings

Before the advent of Unicode, there was a wide variety of specific encodings. The norm was to use a distinct encoding for each language or language family. This often led to display errors and data inconsistencies. To counter this, character encodings were frequently modeled as backward-compatible supersets of an existing standard. The modern Unicode standard builds on the earlier ISO Latin-1 encoding, which in turn is based on the ASCII character code.

Character encoding	Bits per character	Possible characters	Character set
ASCII	7 bits	128	Letters, numbers, and special characters of the American keyboard, as well as control characters for teletypes
ISO Latin-1 (ISO 8859-1)	8 bits	256	First 128 characters like ASCII, next 128 characters for special characters of European languages
Universal Coded Character Set 2 (UCS-2)	16 bits	65,536	Characters of the “Basic Multilingual Plane” (BMP); first 256 characters like in ISO Latin-1
Universal Coded Character Set 4 (UCS-4)	32 bits	1,114,112	Characters of the BMP and the supplementary planes beyond it; 154,998 characters assigned as of Unicode Version 16.0; first 256 characters like ISO Latin-1
UCS Transformation Format 8 Bit (UTF-8)	8/16/24/32 bits	1,114,112	Any characters from UCS-2 and UCS-4; the first 128 characters are encoded in a single byte exactly as in ASCII
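
To make the differences tangible, the following Python sketch measures how many bytes a few sample characters occupy in several encodings. The sample characters and the helper name encoded_length are our own choices; the codecs "utf-16-be" and "utf-32-be" are used so that no byte order mark is added to the result:

```python
samples = ["A", "Ä", "€", "👍"]
encodings = ["ascii", "latin-1", "utf-8", "utf-16-be", "utf-32-be"]

def encoded_length(character, encoding):
    """Return the number of bytes needed, or None if not representable."""
    try:
        return len(character.encode(encoding))
    except UnicodeEncodeError:
        return None

for character in samples:
    lengths = {name: encoded_length(character, name) for name in encodings}
    print(f"U+{ord(character):04X}", character, lengths)
```

The output shows that UTF-8 grows from one to four bytes depending on the code point, while the older encodings simply cannot represent characters outside their repertoire.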

Structure of the Unicode Standard

The Unicode Standard defines characters and corresponding code points for letters, syllabaries, ideograms, punctuation marks, special characters, and numerals. Among many others, it supports the Latin, Greek, Cyrillic, Arabic, Hebrew, and Thai scripts. Additionally, it includes the Japanese (Katakana, Hiragana), Chinese, and Korean (Hangul) scripts. There are also mathematical, commercial, and technical special characters, as well as historical control characters for teletypes.

The characters are compiled in a series of character tables. We provide an overview of the most common character tables here.

Writing systems of the Unicode Standard

Character table	Includes these alphabets, among others
European Writing Systems	Armenian, Georgian, Greek, Latin
African Writing Systems	Ethiopian, Egyptian Hieroglyphs, Coptic
Middle Eastern Writing Systems	Arabic, Hebrew, Syriac
Central Asian Writing Systems	Mongolian, Tibetan, Old Turkic
South Asian Writing Systems	Brahmi, Tamil, Vedic
Southeast Asian Writing Systems	Khmer, Rohingya, Thai
Writing Systems of Indonesia and Oceania	Balinese, Buginese, Javanese
East Asian Writing Systems	CJK (Chinese, Japanese, Korean), Hangul (Korean), Hiragana (Japanese)
American Writing Systems	Cherokee, Canadian Syllabics, Osage

Symbols and punctuation of the Unicode Standard

Character table	Includes these characters, among others
Notation systems	Braille Patterns, Musical Notation, Duployan Shorthand
Punctuation	Punctuation of the English Language, Punctuation of European Languages, CJK Punctuation
Alphanumeric symbols	Mathematical Unicode Letters, Circled Unicode Letters
Technical symbols	Symbols of the APL Programming Language, Symbols for Optical Character Recognition
Numbers & numerals	Maya Numerals, Ottoman Siyaq Numerals, Numerals of Sumerian Cuneiform
Mathematical symbols	Arrows, Mathematical Operators, Geometric Shapes
Emoji & pictograms	Emoticons, Dingbats, Other Pictograms
Other symbols	Alchemical Symbols, Currency Characters, Chess, Domino, and Mahjong Characters

What is Unicode used for?

The Unicode Standard primarily serves as a universal foundation for processing, storing, and exchanging text in any language. Most modern software components, such as libraries, protocols, databases, etc., that operate on text are based on Unicode. We illustrate the range of possible uses with the following examples.

Operating systems

Unicode is the internal standard for text representation in most modern operating systems. Some operating systems, like Apple’s macOS, allow the use of Unicode characters in file names.

Websites

The Unicode encoding UTF-8 has become the standard for encoding HTML documents. As early as 2016, more than 80 percent of the world’s most visited websites used UTF-8 for storing and displaying their HTML documents. For domain names, the Punycode standard has become established: it represents non-ASCII letters using only ASCII characters, so that such names work with the existing domain name system.
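
As a brief illustration of Punycode, the following Python sketch converts a domain name with an umlaut into its ASCII form and back. The domain is only an example, and Python’s built-in "idna" codec implements the older IDNA 2003 rules, so this should be read as a demonstration of the principle rather than as a complete solution for modern internationalized domain names:

```python
domain = "münchen.de"

# The "idna" codec applies Punycode to each label and adds the "xn--" prefix
ascii_domain = domain.encode("idna")
print(ascii_domain)  # b'xn--mnchen-3ya.de'

# Decoding restores the original Unicode form
print(ascii_domain.decode("idna"))

# The raw Punycode transformation of a single label, without the prefix
print("münchen".encode("punycode"))  # b'mnchen-3ya'
```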

Programming languages

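Many modern programming languages use Unicode as the basis for text processing. Many of them also allow Unicode letters, not just ASCII letters, in the names of variables and functions. ECMAScript/JavaScript permits this for all characters that Unicode classifies as identifier characters. Emoji are not among them and cause a syntax error when used as a name. The following code illustrates this:

const π = 3.14159;
const größe = 42;
// const 👍 = true; // SyntaxError: emoji are not valid identifier characters
if (größe > π) {
 // …
}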

javascript

Databases

The widely used MySQL database supports the complete Unicode character set with the character encoding “utf8mb4”. In contrast, the older “utf8” encoding (an alias for “utf8mb3”) stores at most three bytes per character. Characters whose UTF-8 encoding requires four bytes, such as most emoji, cannot be stored with it.
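
Whether a given text is affected can be checked before it reaches the database. The following Python sketch is our own illustration and not part of MySQL or any database driver; the function name needs_utf8mb4 is hypothetical. It relies on the fact that exactly the code points above U+FFFF occupy four bytes in UTF-8:

```python
def needs_utf8mb4(text):
    """Return True if any character needs four bytes in UTF-8."""
    return any(ord(character) > 0xFFFF for character in text)

for text in ["Grüße", "Preis: 5 €", "Gut gemacht 👍"]:
    byte_lengths = [len(character.encode("utf-8")) for character in text]
    print(repr(text), "longest character:", max(byte_lengths), "bytes,",
          "needs utf8mb4:", needs_utf8mb4(text))
```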

Fonts

Fonts contain the glyphs used for the graphic representation of text. Due to the large number of characters included in the Unicode Standard, there is no font that contains all characters. Even the subset of the Basic Multilingual Plane is covered completely by only a few fonts. Here are a few examples:

Unicode font	Glyphs	License
Noto	approx. 77,000	Open Font License
Sun-ExtA/B	approx. 50,000	Freeware
Unifont	approx. 63,000	GNU GPL
Code2000	approx. 63,000	Shareware

How is Unicode used?

In many cases, users employ Unicode without ever being aware of it. Digital text is presented in most documents and applications as Unicode and can be freely copied, pasted, and edited by users. Sometimes, the end user may need to insert a specific Unicode character into text. There are various methods for doing this, which we will present below.

Special software keyboards

The use of special software keyboards is probably the most common method to insert Unicode characters into text. Ubiquitous on mobile devices, software keyboards allow for switching between languages and their respective alphabets. The key layout changes, with all characters originating from the Unicode repertoire. These characters can be mixed and combined freely in texts.

Emojis are a good example of this: they are regular Unicode characters, just like letters, numbers, and special symbols. As with all digital characters, the way an emoji is displayed is independent of how it is encoded internally. Each operating system therefore displays the same emoji slightly differently.
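
That an emoji is treated like any other character can be checked with Python’s standard unicodedata module. The following is a minimal sketch; the sample characters and the helper name character_info are our own choices:

```python
import unicodedata

def character_info(character):
    """Return (code point notation, general category, official name)."""
    return (
        f"U+{ord(character):04X}",
        unicodedata.category(character),
        unicodedata.name(character),
    )

for character in ["A", "9", "€", "👍"]:
    print(character, *character_info(character))
```

The letter, the digit, the currency sign, and the emoji each have a code point, a category, and an official name; only the glyph shown on screen depends on the platform.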

Software keyboards are not limited to mobile devices; they are also available on desktops. They can be easily accessed in Windows, macOS, and many Linux distributions, displaying a different set of characters depending on the selected language. Since the number of keys is limited, not all Unicode characters are shown. Instead, there’s a language-specific selection of the most commonly used characters.

Unicode character tables

Besides software keyboards, Unicode character tables are probably the most useful way to access Unicode characters. Remember, a coded character set is the collection of all characters together with their unique code points. Such a structure lends itself to a table format, and the Unicode Standard does include such tables, called Unicode Code Charts. From these tables, users can copy specific characters to use elsewhere. Alternatively, they can read off the corresponding code point, for example to use it as a numeric character reference (more on this in the next section).

Many desktop operating systems also include a Unicode character table. This provides an overview of all available Unicode characters along with their code point, description, and glyph. A character can be inserted or copied with a click. A character table can also be created with just a few lines of code. Later in this article, we’ll show an example using the Python programming language.

Numeric character reference

The core of the Unicode Standard is the mapping of characters to code points. Knowing a character’s code point allows it to be used to embed the corresponding character in various contexts. On Windows, entering Unicode symbols is done using the standard hardware keyboard with a special key combination. Note that the code point number typically needs to be entered in hexadecimal format.

It is mostly programmers who need numeric character references. The hexadecimal representation of a code point allows any Unicode character to be written using only characters of the ASCII character set. We demonstrate this approach in HTML; languages such as Python and C++ offer comparable escape sequences.

The general scheme for embedding a character using a numeric reference consists of the reference itself plus an opening and a closing term: In HTML documents, the numeric reference starts with &#x and ends with ;. In between, without any spaces, the hexadecimal code point is entered. It can have up to six digits, since the highest code point is U+10FFFF. For a four-digit code point, this results in the pattern &#xNNNN;.

To insert the copyright symbol “©” into an HTML document, for example, we proceed as follows:

Search for the character in a Unicode table.
Read the code point associated with the character. In our example, the code point is listed as “U+00A9,” which is the hexadecimal representation.
Compose the character reference and enter it into HTML source code or a Markdown document. In our case, we input &#xA9; and this renders the character “©”.

A less common approach allows for the use of code points in decimal rather than hexadecimal representation. In this case, the numeric reference begins with &# (without the “x”) and ends as usual with ;. In between, the code point is written in decimal form. In our example, the numeric reference &#169; results in the copyright symbol.
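
Both forms of the numeric reference can be generated and checked with a few lines of Python. In this sketch, the function name numeric_references is hypothetical; html.unescape comes from the Python standard library and resolves character references the way a browser would:

```python
import html

def numeric_references(character):
    """Return the hexadecimal and the decimal HTML reference for a character."""
    code_point = ord(character)
    return f"&#x{code_point:X};", f"&#{code_point};"

hex_ref, dec_ref = numeric_references("©")
print(hex_ref, dec_ref)

# Both references resolve to the same character
print(html.unescape(hex_ref), html.unescape(dec_ref))
```

The same function works for characters outside the Basic Multilingual Plane, in which case the hexadecimal reference has five or six digits.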

Tip

Use the Unicode Character Inspector to quickly find the different codes for a character.

Named character entities

Since the notation of Unicode characters as numeric references is not intuitive for humans, there is another method: named character entities. These are defined for commonly used characters and assign a short, memorable name to the character. A named character entity starts with the ampersand & and ends with a semicolon ;. The defined name is placed in between without spaces. To insert the copyright symbol “©” in HTML, simply write &copy;.
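
Python’s standard library ships a table of these names, which makes it easy to look up the code point behind an entity. The following is a minimal sketch; the entity names used are merely examples:

```python
import html
from html.entities import name2codepoint

# Look up the code points behind a few entity names
for name in ["copy", "euro", "amp"]:
    code_point = name2codepoint[name]
    print(f"&{name};", f"U+{code_point:04X}", chr(code_point))

# Resolve named entities inside a piece of text
print(html.unescape("&copy; 2024 Example &amp; Co."))
```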

Tip

The complete list of defined character entities is documented in the HTML Standard.

Programming languages

Most programming languages include basic functions to convert characters and code points. The corresponding functions are often called ord(character) and chr(code point). The following applies:

chr(ord(character)) == character

Note that it is always possible to determine the code point corresponding to a character. The reverse direction only works for numbers within the Unicode code point range, which runs from 0 to 0x10FFFF. We demonstrate the basic scheme here with a short Python example:

# Determine the decimal code point of a character
ord('A') # `65`
# Determine the hexadecimal code point of a character
hex(ord('A')) # `'0x41'`
# Determine the character corresponding to a code point
chr(65) # `'A'`
chr(0x41) # `'A'`
chr(0x110000) # ValueError, because the highest code point is `0x10FFFF`

python

With the help of these functions, it’s easy to create a character table for code points of the Unicode character set. To do so, iterate over the code points and output the corresponding characters. With Python, this can be done in just a few lines of code:

# Start each `range` at the first printable character to skip control characters
# Print the printable ASCII characters (127 is the control character DEL)
for code_point in range(32, 127):
    print(code_point, hex(code_point), chr(code_point))
# Print the printable characters added by ISO Latin-1 (128 to 159 are control characters)
for code_point in range(160, 256):
    print(code_point, hex(code_point), chr(code_point))

python

Program Library ICU

The International Components for Unicode (ICU) are bundled in a program library provided by the Unicode Consortium. The library is released under an open-source license and can be used on many operating systems. It serves the purpose of programmatic internationalization (often abbreviated as “i18n”). Its applications include:

Processing of Unicode texts
Support for regular expressions in Unicode
Parsing and formatting of calendar dates, times, numbers, currencies, and messages

The ICU library is available in two versions:

“icu4c” is written in C/C++ and provides an API for these languages.
“icu4j” is written in Java and provides an API for this language.

The use of the components provides consistent results regardless of the underlying platform.

Charset meta tag in the head of HTML documents

Most HTML documents today use the UTF-8 character encoding. To ensure that visitors see the document without erroneous characters, a “Charset” meta tag should be placed in the head of the HTML document. This instructs the browser to interpret the retrieved document as UTF-8 and is illustrated below:

<head>
<meta charset="utf-8">
<!-- additional head elements -->
</head>

html
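
What happens without a correct charset declaration can be simulated in a few lines. The following Python sketch is only an illustration of the principle: it encodes a sample word as UTF-8 and then decodes the bytes once with the wrong encoding (ISO Latin-1) and once with the right one:

```python
text = "Café"

# The bytes as a server would deliver them for a UTF-8 document
raw_bytes = text.encode("utf-8")

# Interpreting the bytes with the wrong encoding garbles the "é"
wrong = raw_bytes.decode("latin-1")
# Interpreting them as UTF-8 restores the original text
right = raw_bytes.decode("utf-8")

print(wrong)  # CafÃ©
print(right)  # Café
```

The two bytes that UTF-8 uses for “é” are read as two separate Latin-1 characters, which is the typical pattern of erroneous characters on web pages.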

Instagram fonts

The popular social network Instagram does not allow text formatting for biography information, posts, or stories. This limits users’ creative options. However, clever developers have found a workaround: Instagram uses Unicode, making it possible to compose text that appears formatted using special characters. This often involves characters that resemble Latin letters. The easiest way to create such text is with an Insta Fonts Generator. Such text also works in other social networks.
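
How such a generator might work can be sketched in Python. This is our own minimal illustration, not the method of any particular tool: it replaces the letters A to Z and a to z with the corresponding “Mathematical Bold” characters, which start at U+1D400 and U+1D41A in Unicode. The function name to_math_bold is hypothetical:

```python
import unicodedata

def to_math_bold(text):
    """Replace ASCII letters with Mathematical Bold look-alikes."""
    result = []
    for character in text:
        if "A" <= character <= "Z":
            result.append(chr(0x1D400 + ord(character) - ord("A")))
        elif "a" <= character <= "z":
            result.append(chr(0x1D41A + ord(character) - ord("a")))
        else:
            # Digits, punctuation, and other characters stay unchanged
            result.append(character)
    return "".join(result)

styled = to_math_bold("Hello")
print(styled)

# Compatibility normalization (NFKC) maps the look-alikes back to plain letters
print(unicodedata.normalize("NFKC", styled))
```

The result only looks like bold text. Technically, it consists of entirely different characters, which is why search functions and screen readers may not treat it like the original word.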
