What is Unicode? Definition and explanation
Unicode is an international standard for encoding, displaying, and processing text characters from nearly all the world’s writing systems. Each character is assigned a unique code point, which can be stored in various character encodings like UTF-8 or UTF-16. This allows Unicode to provide consistent representation and processing of texts across different platforms and languages.
What is Unicode?
The name “Unicode” is meant to suggest a unique, unified, universal encoding. It is a global standard for representing text characters in binary form and enables consistent storage, exchange, and processing of text across different digital systems and platforms.
Unicode is innovative in that it is not tied to the formats and encodings of a single alphabet of a particular human language. Rather, Unicode was created with the aim of serving as a unified standard for representing all writing systems and characters developed by humans.
Since the release of Unicode 1.0 at the end of 1991, the standard has fulfilled its purpose. Unicode is internally used by browsers and operating systems as a unified format. With the release of version 16.0 by the Unicode Consortium in 2024, the Unicode Standard now encompasses a repertoire of 154,998 characters. The character set covered by the Unicode Standard is completely identical to the “Universal Coded Character Set” (UCS), which is internationally standardized as ISO/IEC 10646.
Technical basis for character encoding
First, it’s important to understand that all information in a digital system ultimately consists of long sequences of zeros and ones. This is also referred to as “binary representation.” The binary code is somewhat like an alphabet in itself. However, in binary code, there are only two “letters”: zero and one. Each position within a sequence of zeros and ones is called a “bit.”
The basic trick of digital information technology is to represent characters from different alphabets as sequences of zeros and ones. This allows for encoding numbers and letters, as well as any other distinguishable states. Usually, these are called “symbols.” The longer the sequence of zeros and ones for representing a single symbol, the more symbols can be depicted. With each added bit, the number of possible symbols doubles.
A concrete example: Imagine we have binary “words” that are two bits long. This would allow us to encode four numbers:
| 2-bit word | Number |
|---|---|
| 00 | 0 |
| 01 | 1 |
| 10 | 2 |
| 11 | 3 |
If we add another bit to the beginning of the sequence, the number of possible bit-words doubles. These consist of the already known bit sequences, each preceded by a zero or one. Thus, we can encode eight numbers:
| 3-bit word | Number |
|---|---|
| 000 | 0 |
| 001 | 1 |
| 010 | 2 |
| 011 | 3 |
| 100 | 4 |
| 101 | 5 |
| 110 | 6 |
| 111 | 7 |
An 8-bit word is referred to as an octet or byte.
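The doubling described above can be checked with a few lines of Python. This is only a minimal sketch; the helper name `bit_words` is made up for this illustration and is not part of any standard.

```python
from itertools import product


def bit_words(n_bits: int) -> list[str]:
    """Return every binary word of the given length, in counting order."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]


for n_bits in (2, 3, 8):
    words = bit_words(n_bits)
    # The number of words always equals 2 ** n_bits
    print(n_bits, "bits:", len(words), "words, starting with", words[:4])

# The number encoded by a bit word can be read back with int(..., 2)
print(int("101", 2))
```

Each additional bit doubles the count, so a byte with its 8 bits can distinguish 256 different states.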
For simplicity, we’ve shown the encoding of numbers as an example here. However, the same principle applies to digital systems for encoding letters or any other characters and states. Here is a highly simplified example of binary encoding of letters:
| 3-bit word | Letter |
|---|---|
| 000 | A |
| 001 | B |
| 010 | C |
The graphic representation of a character is called a glyph. Depending on the font used, there are different glyphs for the same character, and even within a single font, there can be multiple variations for a glyph. Think, for instance, of different weights, ligatures, italics, etc. Here is an expanded representation that includes the mapping from the character to the glyph:
| Binary representation | Decimal number | Encoded character | Glyph |
|---|---|---|---|
| 1000001 | 65 | uppercase “A” of the Latin alphabet | A |
| 1100001 | 97 | lowercase “a” of the Latin alphabet | a |
| 0110000 | 48 | Arabic numeral “0” | 0 |
| 0111001 | 57 | Arabic numeral “9” | 9 |
| 11000100 | 196 | uppercase “Ä” | Ä |
| 11000001 | 193 | uppercase “Á” | Á |
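The first three columns of this table can be reproduced with Python's built-in `ord` and `format` functions. The sketch below assumes the characters are stored as single code points; the function name `describe` is chosen for illustration only. The glyph column cannot be computed this way, because the glyph depends on the font used for display.

```python
def describe(character: str) -> tuple[str, int, str]:
    """Return the binary form, the decimal number and the character itself."""
    code_point = ord(character)
    # "07b" pads the binary form with zeros to at least seven digits
    return format(code_point, "07b"), code_point, character


for character in "Aa09ÄÁ":
    print(*describe(character), sep=" | ")
```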
Terminology of character encoding
Digital character encoding involves a range of specific terms and concepts. In everyday usage, some of these may be used interchangeably, but in technical contexts — especially when working with Unicode — it’s important to distinguish them clearly. Below are key terms along with their definitions:
| Term | Meaning |
|---|---|
| Character set | A collection of possible characters, such as digits “0–9” or letters “a–z” |
| Code point | A numerical value assigned to a specific character within a coding system |
| Coded character set | A system that maps each character to exactly one code point |
| Character encoding | The process of converting characters into a digital format (e.g., binary) |
Overview of common character encodings
Before the advent of Unicode, there was a wide variety of specific encodings. The norm was to use a distinct encoding for each language or language family. This often led to display errors and data inconsistencies. To counter this, character encodings were frequently modeled as backward-compatible supersets of an existing standard. The modern Unicode standard builds on the earlier ISO Latin-1 encoding, which in turn is based on the ASCII character code.
| Character encoding | Bits per character | Possible characters | Character set |
|---|---|---|---|
| ASCII | 7 bits | 128 | Letters, numbers, and special characters of the American keyboard, as well as control characters for teletypes |
| ISO Latin-1 (ISO 8859-1) | 8 bits | 256 | First 128 characters like ASCII, next 128 characters for special characters of European languages |
| Universal Coded Character Set 2 (UCS-2) | 16 bits | 65,536 | Characters of the “Basic Multilingual Plane” (BMP); first 256 characters like in ISO Latin-1 |
| Universal Coded Character Set 4 (UCS-4) | 32 bits | 1,114,112 | Characters of the BMP and of the supplementary planes beyond it; total of 154,998 characters in Unicode Version 16.0; first 256 characters like ISO Latin-1 |
| UCS Transformation Format 8 Bit (UTF-8) | 8/16/24/32 bits | 1,114,112 | Any characters from UCS-2 and UCS-4; the first 128 characters are encoded in one byte exactly as in ASCII, while the remaining Latin-1 characters keep their code points but take two bytes |
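The differences between these encodings become visible when the same characters are encoded with each of them. The following Python sketch is an illustration under a few assumptions: Python offers no separate UCS-2 or UCS-4 codec, so `utf-16-be` and `utf-32-be` stand in for them here, and the helper name `encoded_length` is invented for this example.

```python
SAMPLES = ["A", "Ä", "€", "😀"]
ENCODINGS = ["ascii", "latin-1", "utf-8", "utf-16-be", "utf-32-be"]


def encoded_length(character: str, encoding: str):
    """Return the number of bytes needed, or None if the encoding lacks the character."""
    try:
        return len(character.encode(encoding))
    except UnicodeEncodeError:
        return None


for character in SAMPLES:
    sizes = {name: encoded_length(character, name) for name in ENCODINGS}
    print(f"U+{ord(character):04X}", character, sizes)

# Decoding bytes with the wrong encoding produces the kind of display
# error that was common before Unicode: UTF-8 bytes read as Latin-1
garbled = "é".encode("utf-8").decode("latin-1")
print(garbled)
```

A result of `None` means that the encoding has no representation for the character at all, which is why legacy encodings could not be mixed freely within one document.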
Structure of the Unicode Standard
The Unicode Standard defines characters and corresponding code points for letters, syllabaries, ideograms, punctuation marks, special characters, and numerals. It supports the Latin, Greek, Cyrillic, Arabic, Hebrew, and Thai alphabets. Additionally, it includes Japanese (Katakana, Hiragana), Chinese, and Korean scripts (Hangul). There are also mathematical, commercial, and technical special characters, as well as historical control characters for teletypes.
The characters are compiled in a series of character tables. We provide an overview of the most common character tables here.
Writing systems of the Unicode Standard
| Character table | Includes these alphabets, among others |
|---|---|
| European Writing Systems | Armenian, Georgian, Greek, Latin |
| African Writing Systems | Ethiopian, Egyptian Hieroglyphs, Coptic |
| Middle Eastern Writing Systems | Arabic, Hebrew, Syriac |
| Central Asian Writing Systems | Mongolian, Tibetan, Old Turkic |
| South Asian Writing Systems | Brahmi, Tamil, Vedic |
| Southeast Asian Writing Systems | Khmer, Rohingya, Thai |
| Writing Systems of Indonesia and Oceania | Balinese, Buginese, Javanese |
| East Asian Writing Systems | CJK (Chinese, Japanese, Korean), Hangul (Korean), Hiragana (Japanese) |
| American Writing Systems | Cherokee, Canadian Syllabics, Osage |
Symbols and punctuation of the Unicode Standard
| Character table | Includes these characters, among others |
|---|---|
| Notation systems | Braille Patterns, Musical Notation, Duployan Shorthand |
| Punctuation | Punctuation of the English Language, Punctuation of European Languages, CJK Punctuation |
| Alphanumeric symbols | Mathematical Unicode Letters, Circled Unicode Letters |
| Technical symbols | Symbols of the APL Programming Language, Symbols for Optical Character Recognition |
| Numbers & numerals | Maya Numerals, Ottoman Siyaq Numerals, Numerals of Sumerian Cuneiform |
| Mathematical symbols | Arrows, Mathematical Operators, Geometric Shapes |
| Emoji & pictograms | Emoticons, Dingbats, Other Pictograms |
| Other symbols | Alchemical Symbols, Currency Characters, Chess, Domino, and Mahjong Characters |
What is Unicode used for?
The Unicode Standard primarily serves as a universal foundation for processing, storing, and exchanging text in any language. Most modern software components, such as libraries, protocols, databases, etc., that operate on text are based on Unicode. We illustrate the range of possible uses with the following examples.
Operating systems
Unicode is the internal standard for text representation in most modern operating systems. Some operating systems, like Apple’s macOS, allow the use of Unicode characters in file names.
Websites
The Unicode variant UTF-8 has become the standard for encoding HTML documents. As early as 2016, more than 80 percent of the world’s most visited websites used UTF-8 for storing and displaying their HTML documents. The Punycode standard has become established for using non-ASCII letters in domain names.
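How Punycode maps a non-ASCII domain name to ASCII can be tried out with Python's built-in `idna` codec. This is a sketch only: the codec follows the older IDNA 2003 rules, the domain `münchen.example` is a made-up example, and both function names are hypothetical.

```python
def to_ascii_domain(domain: str) -> str:
    """Convert an internationalized domain name to its ASCII (Punycode) form."""
    return domain.encode("idna").decode("ascii")


def to_unicode_domain(domain: str) -> str:
    """Convert the ASCII (Punycode) form back to the readable Unicode form."""
    return domain.encode("ascii").decode("idna")


ascii_form = to_ascii_domain("münchen.example")
print(ascii_form)
print(to_unicode_domain(ascii_form))
```

Labels that contain non-ASCII letters receive the prefix `xn--`, while pure ASCII labels such as `example` stay unchanged.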
Programming languages
Many modern programming languages use Unicode as the basis for text processing. A more recent development is the ability to use non-ASCII Unicode characters in the names of variables and functions. In ECMAScript/JavaScript, this works for letters of any script, whereas emojis and other symbols are not valid identifier characters and cause a syntax error. The following code illustrates valid names:
let größe = 180;
let π = 3.14159;
let 名前 = "Unicode";
if (größe > 100) {
  // …
}
Databases
The popular and widely used database MySQL supports the complete Unicode character set with the character encoding “utf8mb4”. In contrast, the “utf8” encoding stores at most three bytes per character, so characters whose UTF-8 representation requires four bytes, such as most emojis, are lost.
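Whether a text would survive in a three-byte encoding can be estimated before it is stored. The Python sketch below only inspects the UTF-8 byte length of each character; it does not talk to MySQL, and the function name `fits_three_byte_utf8` is hypothetical.

```python
def fits_three_byte_utf8(text: str) -> bool:
    """Return True if every character needs at most three bytes in UTF-8."""
    return all(len(character.encode("utf-8")) <= 3 for character in text)


for sample in ("Grüße, 10 €", "Well done 👍"):
    print(repr(sample), fits_three_byte_utf8(sample))
```

Characters from the Basic Multilingual Plane pass this check, whereas characters from the supplementary planes need four bytes and fail it.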
Fonts
Fonts contain the glyphs used for the graphic representation of text. Due to the large number of characters included in the Unicode Standard, there is no font that contains all characters. Even the subset of the Basic Multilingual Plane is covered completely by only a few fonts. Here are a few examples:
| Unicode font | Glyphs | License |
|---|---|---|
| Noto | approx. 77,000 | Open Font License |
| Sun-ExtA/B | approx. 50,000 | Freeware |
| Unifont | approx. 63,000 | GNU GPL |
| Code2000 | approx. 63,000 | Shareware |
How is Unicode used?
In many cases, users employ Unicode without ever being aware of it. Digital text is presented in most documents and applications as Unicode and can be freely copied, pasted, and edited by users. Sometimes, the end user may need to insert a specific Unicode character into text. There are various methods for doing this, which we will present below.
Special software keyboards
The use of special software keyboards is probably the most common method to insert Unicode characters into text. Ubiquitous on mobile devices, software keyboards allow for switching between languages and their respective alphabets. The key layout changes, with all characters originating from the Unicode repertoire. These characters can be mixed and combined freely in texts.
A good example of this is emojis: Emojis are regular Unicode characters like letters, numbers, and special symbols. As with other characters, the way an emoji is displayed is independent of its internal encoding. Each operating system displays the same emoji slightly differently.
Software keyboards are not limited to mobile devices; they’re also available on desktops. They can be easily accessed in Windows, macOS, and many Linux distributions, displaying a different set of characters depending on the selected language. Since the number of keys is limited, not all Unicode characters are shown. Instead, there’s a language-specific selection of the most commonly used characters.
Unicode character tables
Besides software keyboards, Unicode character tables are probably the most useful way to access Unicode characters. Remember, a character set (“Coded character set”) is the collection of all characters along with their corresponding unique code points. Such a structure lends itself to a table format, and indeed the Unicode Standard includes exactly such tables called Unicode Code Charts. From these tables, users can copy specific characters to use elsewhere. Alternatively, end users can read the corresponding code point, for example, to use it as a numeric character reference—more on this in the next section.
Many desktop operating systems also include a Unicode character table. This provides an overview of all available Unicode characters along with their code point, description, and glyph. A character can be inserted or copied with a click. A character table can also be created with just a few lines of code. Later in this article, we’ll show an example using the Python programming language.
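A single row of such a table, consisting of code point, description, and character, can be sketched with Python's standard `unicodedata` module. The function name `table_row` and the selection of code points are assumptions made for this illustration; the available names depend on the Unicode version that ships with the Python interpreter.

```python
import unicodedata


def table_row(code_point: int) -> tuple[str, str, str]:
    """Return the U+ notation, the official name and the character."""
    character = chr(code_point)
    # The second argument is returned when the code point has no name
    name = unicodedata.name(character, "<no name>")
    return f"U+{code_point:04X}", name, character


for code_point in (0x41, 0xA9, 0x20AC, 0x1F44D):
    print(*table_row(code_point), sep=" | ")
```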
Numeric character reference
The core of the Unicode Standard is the mapping of characters to code points. Knowing a character’s code point allows it to be used to embed the corresponding character in various contexts. On Windows, entering Unicode symbols is done using the standard hardware keyboard with a special key combination. Note that the code point number typically needs to be entered in hexadecimal format.
Programmers most often need numeric character references. The hexadecimal representation of code points allows for the mapping of a Unicode character into characters of the ASCII character set. We demonstrate this approach in HTML; fundamentally, it works the same in Python, C++, etc.
The general scheme for embedding a character using a numeric reference includes the reference itself, as well as an opening and closing term: In HTML documents, the numeric reference starts with &#x and ends with ;. In between, without any spaces, the hexadecimal code point is entered. It typically has two to four digits, although code points outside the Basic Multilingual Plane need up to six. This results in the pattern &#xNNNN;.
To insert the copyright symbol “©” into an HTML document, for example, we proceed as follows:
1. Search for the character in a Unicode table.
2. Read the code point associated with the character. In our example, the code point is listed as “U+00A9,” which is the hexadecimal representation.
3. Compose the character reference and enter it into HTML source code or a Markdown document. In our case, we input &#xA9; which renders the character “©”.
A less common approach allows for the use of code points in decimal rather than hexadecimal representation. In this case, the numeric reference begins with &# (without the “x”) and ends as usual with ;. In between, the code point is written in decimal form. In our example, the numeric reference &#169; results in the copyright symbol.
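Both forms of the numeric reference can be generated and checked programmatically. The following Python sketch uses `html.unescape` from the standard library to resolve the references; the two helper names `hex_reference` and `decimal_reference` are invented for this example.

```python
import html


def hex_reference(character: str) -> str:
    """Build the hexadecimal numeric character reference for a character."""
    return f"&#x{ord(character):X};"


def decimal_reference(character: str) -> str:
    """Build the decimal numeric character reference for a character."""
    return f"&#{ord(character)};"


for character in "©€":
    hex_form = hex_reference(character)
    decimal_form = decimal_reference(character)
    # html.unescape resolves both forms back to the original character
    print(hex_form, decimal_form, html.unescape(hex_form), html.unescape(decimal_form))
```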
Use the Unicode Character Inspector to quickly find the different codes for a character.
Named character entities
Since the notation of Unicode characters as numeric references is not intuitive for humans, there is another method: named character entities. These are defined for commonly used characters and assign a short, memorable name to the character. A named character entity starts with the ampersand & and ends with a semicolon ;. The defined name is placed in between without spaces. To insert the copyright symbol “©” in HTML, simply write &copy;.
The complete list of defined character entities is documented in the HTML Standard.
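Python's standard library ships a table of entity names, which makes the relationship between name, code point, and character easy to inspect. This is a small sketch; `html.entities.name2codepoint` covers the classic HTML 4 entity names, and the function name `entity_to_character` is hypothetical.

```python
import html
from html.entities import name2codepoint


def entity_to_character(name: str) -> str:
    """Look up the character for an entity name such as 'copy'."""
    return chr(name2codepoint[name])


for name in ("copy", "euro", "amp"):
    print(f"&{name};", name2codepoint[name], entity_to_character(name))

# html.unescape resolves named entities inside a longer text
print(html.unescape("&copy; 2024"))
```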
Programming languages
Most programming languages include basic functions to convert characters and code points. The corresponding functions are often called ord(character) and chr(code point). The following applies:
chr(ord(character)) == character
Note that it is always possible to determine the code point corresponding to a character. Conversely, the conversion only works for numbers that lie within the range of valid code points. We demonstrate the basic scheme here with a short Python example:
# Determine the decimal code point of a character
ord('A') # `65`
# Determine the hexadecimal code point of a character
hex(ord('A')) # `0x41`
# Determine the character corresponding to a code point
chr(65) # `'A'`
chr(0x41) # `'A'`
chr(0x110001) # ValueError, because the code point is greater than `0x10FFFF`
With the help of these functions, it’s easy to create a character table for code points of the Unicode character set. For this, you iterate the code points and output the corresponding characters. With Python, this can be done in just a few lines of code:
# Start `range` at `32` to skip the ASCII control characters 0 to 31
# Print the ASCII character set (code point 127 is the DEL control character)
for code_point in range(32, 128):
    print(code_point, hex(code_point), chr(code_point))
# Print ISO Latin-1 (code points 127 to 159 are control characters without a visible glyph)
for code_point in range(32, 256):
    print(code_point, hex(code_point), chr(code_point))
Program Library ICU
The International Components for Unicode (ICU) are consolidated in a program library provided by the Unicode Consortium. The library is released under an open-source license and can be used on many operating systems. The software serves the purpose of programmatic Internationalization (often abbreviated as “i18n”). Its applications include:
- Processing of Unicode texts
- Support for regular expressions in Unicode
- Parsing and formatting of calendar dates, times, numbers, currencies, and messages
The ICU library is available in two versions:
- “icu4c” is written in C/C++ and provides an API for these languages.
- “icu4j” is written in Java and provides an API for this language.
The use of the components provides consistent results regardless of the underlying platform.
Charset meta tag in the head of HTML documents
Most HTML documents today use the UTF-8 character encoding. To ensure that visitors see the document without erroneous characters, a “Charset” meta tag should be placed in the head of the HTML document. This instructs the browser to interpret the retrieved document as UTF-8 and is illustrated below:
<head>
<meta charset="utf-8">
<!-- additional head elements -->
</head>
Instagram fonts
The popular social network Instagram does not allow text formatting for biography information, posts, or stories. This limits users’ creative options. However, clever developers have found a workaround: Instagram uses Unicode, making it possible to compose text that appears formatted using special characters. This often involves characters that resemble Latin letters. The easiest way to create such text is with an Insta Fonts Generator. Additionally, using Instagram fonts also works in other social networks.
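The principle behind such generators can be sketched in Python: each plain Latin letter is replaced by a look-alike character from another Unicode block. The example below assumes the Mathematical Bold letters, which start at U+1D400 for capitals and U+1D41A for small letters; actual generators may use other blocks, and the function name `to_math_bold` is made up for this illustration.

```python
import unicodedata

BOLD_CAPITAL_A = 0x1D400  # MATHEMATICAL BOLD CAPITAL A
BOLD_SMALL_A = 0x1D41A    # MATHEMATICAL BOLD SMALL A


def to_math_bold(text: str) -> str:
    """Replace the letters A-Z and a-z with their Mathematical Bold look-alikes."""
    result = []
    for character in text:
        if "A" <= character <= "Z":
            result.append(chr(BOLD_CAPITAL_A + ord(character) - ord("A")))
        elif "a" <= character <= "z":
            result.append(chr(BOLD_SMALL_A + ord(character) - ord("a")))
        else:
            # Digits, spaces and punctuation are left unchanged
            result.append(character)
    return "".join(result)


styled = to_math_bold("Hello World")
print(styled)
# Compatibility normalization (NFKC) maps the look-alikes back to plain letters
print(unicodedata.normalize("NFKC", styled))
```

Because the result consists of different characters rather than formatted letters, search functions and screen readers may not treat it as the original word.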

