What is Unicode? Definition and explanation
Unicode is an international standard for encoding, displaying, and processing text characters from nearly all the world's writing systems. Each character is assigned a unique code point, which can be stored in various character encodings like UTF-8 or UTF-16. This allows Unicode to provide consistent representation and processing of texts across different platforms and languages.
What is Unicode?
Unicode stands for "Universal Character Encoding" and is a global standard for representing text characters in binary form. It enables consistent storage, exchange, and processing of text across different digital systems and platforms.
Unicode is innovative in that it is not tied to the formats and encodings of a single alphabet of a particular human language. Rather, Unicode was created with the aim of serving as a unified standard for representing all writing systems and characters developed by humans.
Since the release of Unicode 1.0 at the end of 1991, the standard has fulfilled its purpose. Browsers and operating systems use Unicode internally as a unified format. With the release of version 16.0 by the Unicode Consortium in 2024, the Unicode Standard now encompasses a repertoire of 154,998 characters. The character set covered by the Unicode Standard is completely identical to the "Universal Coded Character Set" (UCS), which is internationally standardized as ISO/IEC 10646.
Technical basis for character encoding
First, it's important to understand that all information present in a digital system consists of long chains of zeros and ones on a deeper level. This is also referred to as "binary representation." The binary code is somewhat like an alphabet in itself. However, in binary code, there are only two "letters": zeros and ones. Each position within a sequence of zeros and ones is called a "bit."
The basic trick of digital information technology is to represent characters from different alphabets as sequences of zeros and ones. This allows for encoding numbers and letters, as well as any other distinguishable states. Usually, these are called "symbols." The longer the sequence of zeros and ones for representing a single symbol, the more symbols can be depicted. With each added bit, the number of possible symbols doubles.
A concrete example: Imagine we have binary "words" that are two bits long. This would allow us to encode four numbers:
| 2-bit word | Number |
|---|---|
| 00 | 0 |
| 01 | 1 |
| 10 | 2 |
| 11 | 3 |
If we add another bit to the beginning of the sequence, the number of possible bit-words doubles. These consist of the already known bit sequences, each preceded by a zero or one. Thus, we can encode eight numbers:
| 3-bit word | Number |
|---|---|
| 000 | 0 |
| 001 | 1 |
| 010 | 2 |
| 011 | 3 |
| 100 | 4 |
| 101 | 5 |
| 110 | 6 |
| 111 | 7 |
An 8-bit word is referred to as an octet or byte.
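The doubling rule can be checked with a few lines of Python. This is a minimal sketch; the helper name `bit_words` is made up for illustration and is not part of any library.

```python
from itertools import product


def bit_words(n_bits: int) -> list[str]:
    """Return every binary word of the given length, in counting order."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]


for n in (2, 3, 8):
    # The number of words is 2 ** n, so it doubles with each extra bit
    print(n, "bits:", len(bit_words(n)), "words")

# Each word is the binary notation of its position in the list
for number, word in enumerate(bit_words(3)):
    print(word, number)
```

For 8 bits, the sketch reports 256 words, which is the number of values a single byte can take.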
For simplicity, we've shown the encoding of numbers as an example here. However, the same principle applies to digital systems for encoding letters or any other characters and states. Here is a highly simplified example of binary encoding of letters:
| 3-bit word | Letter |
|---|---|
| 000 | A |
| 001 | B |
| 010 | C |
The graphic representation of a character is called a glyph. Depending on the font used, there are different glyphs for the same character, and even within a single font, there can be multiple variations for a glyph. Think, for instance, of different weights, ligatures, italics, etc. Here is an expanded representation that includes the mapping from the character to the glyph:
| Binary representation | Decimal number | Encoded character | Glyph |
|---|---|---|---|
| 1000001 | 65 | uppercase "A" of the Latin alphabet | A |
| 1100001 | 97 | lowercase "a" of the Latin alphabet | a |
| 0110000 | 48 | Arabic numeral "0" | 0 |
| 0111001 | 57 | Arabic numeral "9" | 9 |
| 11000100 | 196 | uppercase "Ä" | Ä |
| 11000001 | 193 | uppercase "Á" | Á |
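The rows of this table can be reproduced in Python with the built-in `ord` and `format` functions. This is only a sketch: the helper `describe` is a hypothetical name, and the minimum width of seven bits is chosen to match the table above.

```python
def describe(character: str) -> tuple[str, int, str]:
    """Return (binary representation, decimal number, character)."""
    code_point = ord(character)
    # "07b" pads to at least seven binary digits; larger values use more
    return (format(code_point, "07b"), code_point, character)


for ch in "Aa09\u00C4\u00C1":  # the last two are "Ä" and "Á"
    print(*describe(ch), sep=" | ")
```

Which glyph appears on screen for each character still depends on the font in use; the code only deals with the numeric side of the mapping.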
Terminology of character encoding
Digital character encoding involves a range of specific terms and concepts. In everyday usage, some of these may be used interchangeably, but in technical contexts, especially when working with Unicode, it's important to distinguish them clearly. Below are key terms along with their definitions:
| Term | Meaning |
|---|---|
| Character set | A collection of possible characters, such as digits "0–9" or letters "a–z" |
| Code point | A numerical value assigned to a specific character within a coding system |
| Coded character set | A system that maps each character to exactly one code point |
| Character encoding | The process of converting characters into a digital format (e.g., binary) |
Overview of common character encodings
Before the advent of Unicode, there was a wide variety of specific encodings. The norm was to use a distinct encoding for each language or language family. This often led to display errors and data inconsistencies. To counter this, character encodings were frequently modeled as backward-compatible supersets of an existing standard. The modern Unicode standard builds on the earlier ISO Latin-1 encoding, which in turn is based on the ASCII character code.
| Character encoding | Bits per character | Possible characters | Character set |
|---|---|---|---|
| ASCII | 7 bits | 128 | Letters, numbers, and special characters of the American keyboard, as well as control characters for teletypes |
| ISO Latin-1 (ISO 8859-1) | 8 bits | 256 | First 128 characters like ASCII, next 128 characters for special characters of European languages |
| Universal Coded Character Set 2 (UCS-2) | 16 bits | 65,536 | Characters of the "Basic Multilingual Plane" (BMP); first 256 characters like in ISO Latin-1 |
| Universal Coded Character Set 4 (UCS-4) | 32 bits | 1,114,112 | Characters of the BMP and additional beyond; total of 143,859 characters in Unicode Version 13.0; first 256 characters like ISO Latin-1 |
| UCS Transformation Format 8 Bit (UTF-8) | 8/16/24/32 bits | 1,114,112 | Any characters from UCS-2 and UCS-4; first 128 characters stored as single bytes identical to ASCII |
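To make the "bits per character" column more tangible, the following sketch counts the bytes that a few sample characters occupy in different encodings, using Python's built-in codecs. The helper name `encoded_lengths` and the choice of sample characters are assumptions made for illustration.

```python
def encoded_lengths(character: str) -> dict[str, int]:
    """Number of bytes the character occupies in each encoding."""
    return {
        "utf-8": len(character.encode("utf-8")),
        # The "-be" variants are big-endian and add no byte order mark
        "utf-16": len(character.encode("utf-16-be")),
        "utf-32": len(character.encode("utf-32-be")),
    }


# U+0041 "A", U+00C4 "Ä", U+20AC euro sign, U+1F600 (an emoji outside the BMP)
SAMPLES = ["A", "\u00C4", "\u20AC", "\U0001F600"]
for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))

# "Ä" has code point 196 in both ISO Latin-1 and Unicode,
# but the stored bytes differ: one byte in Latin-1, two in UTF-8
print("\u00C4".encode("latin-1"), "\u00C4".encode("utf-8"))
```

The variable length of UTF-8 is the reason pure ASCII text takes no extra space, while characters outside the BMP need four bytes.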
Structure of the Unicode Standard
The Unicode Standard defines characters and corresponding code points for letters, syllabaries, ideograms, punctuation marks, special characters, and numerals. It supports the Latin, Greek, Cyrillic, Arabic, Hebrew, and Thai alphabets. Additionally, it includes Japanese (Katakana, Hiragana), Chinese, and Korean scripts (Hangul). There are also mathematical, commercial, and technical special characters, as well as historical control characters for teletypes.
The characters are compiled in a series of character tables. We provide an overview of the most common character tables here.
Writing systems of the Unicode Standard
| Character table | Includes these alphabets, among others |
|---|---|
| European Writing Systems | Armenian, Georgian, Greek, Latin |
| African Writing Systems | Ethiopian, Egyptian Hieroglyphs, Coptic |
| Middle Eastern Writing Systems | Arabic, Hebrew, Syriac |
| Central Asian Writing Systems | Mongolian, Tibetan, Old Turkic |
| South Asian Writing Systems | Brahmi, Tamil, Vedic |
| Southeast Asian Writing Systems | Khmer, Rohingya, Thai |
| Writing Systems of Indonesia and Oceania | Balinese, Buginese, Javanese |
| East Asian Writing Systems | CJK (Chinese, Japanese, Korean), Hangul (Korean), Hiragana (Japanese) |
| American Writing Systems | Cherokee, Canadian Syllabics, Osage |
Symbols and punctuation of the Unicode Standard
| Character table | Includes these characters, among others |
|---|---|
| Notation systems | Braille Patterns, Musical Notation, Duployan Shorthand |
| Punctuation | Punctuation of the English Language, Punctuation of European Languages, CJK Punctuation |
| Alphanumeric symbols | Mathematical Unicode Letters, Circled Unicode Letters |
| Technical symbols | Symbols of the APL Programming Language, Symbols for Optical Character Recognition |
| Numbers & numerals | Maya Numerals, Ottoman Siyaq Numerals, Numerals of Sumerian Cuneiform |
| Mathematical symbols | Arrows, Mathematical Operators, Geometric Shapes |
| Emoji & pictograms | Emoticons, Dingbats, Other Pictograms |
| Other symbols | Alchemical Symbols, Currency Characters, Chess, Domino, and Mahjong Characters |
What is Unicode used for?
The Unicode Standard primarily serves as a universal foundation for processing, storing, and exchanging text in any language. Most modern software components, such as libraries, protocols, databases, etc., that operate on text are based on Unicode. We illustrate the range of possible uses with the following examples.
Operating systems
Unicode is the internal standard for text representation in most modern operating systems. Some operating systems, like Apple's macOS, allow the use of Unicode characters in file names.
Websites
The Unicode variant UTF-8 has become the standard for encoding HTML documents. As early as 2016, more than 80 percent of the world's most visited websites used UTF-8 for storing and displaying their HTML documents. The Punycode standard has become established for using non-ASCII letters in domain names.
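As a rough illustration of how Punycode maps a non-ASCII domain label to ASCII, the sketch below uses the `idna` codec from Python's standard library. Note the assumptions: this codec implements the older IDNA 2003 rules, so registries and browsers following newer rules may treat some names differently, and the domain `münchen.example` as well as both helper names are purely illustrative.

```python
def to_ascii_domain(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII (Punycode) form."""
    return domain.encode("idna").decode("ascii")


def to_unicode_domain(domain: str) -> str:
    """Convert an ASCII (Punycode) domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


ascii_form = to_ascii_domain("münchen.example")
print(ascii_form)                     # label is prefixed with "xn--"
print(to_unicode_domain(ascii_form))  # back to the readable form
```

Labels that are already pure ASCII, such as `example`, pass through unchanged.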
Programming languages
Many modern programming languages use Unicode as the basis for text processing. A recent development is the ability to use Unicode characters for naming variables and functions. This is possible in ECMAScript/JavaScript, as illustrated in the following code:
// Identifiers must start with a Unicode letter, "$", or "_"
let 真 = true;
let 假 = false;
let bool_var = 真;
if (bool_var === 真) {
  // …
}

Databases
The popular and widely used database MySQL supports the complete Unicode character set with the character encoding "utf8mb4". In contrast, the "utf8" encoding stores at most three bytes per character, so characters whose UTF-8 encoding takes four bytes (code points above U+FFFF, such as most emojis) are lost.
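Whether a given text is affected can be checked before it reaches the database. The following is a minimal sketch in Python; the function name `needs_utf8mb4` is hypothetical and the check only looks at the UTF-8 byte length of each character, not at any MySQL setting.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character takes more than three bytes in UTF-8."""
    return any(len(ch.encode("utf-8")) > 3 for ch in text)


print(needs_utf8mb4("Grüße, 世界"))      # only 1- to 3-byte characters
print(needs_utf8mb4("Hi \U0001F600"))   # contains a 4-byte emoji
```

Characters from the Basic Multilingual Plane never need more than three bytes, which is why the three-byte limit went unnoticed for a long time in text without emojis or rare scripts.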
Fonts
Fonts contain the glyphs used for the graphic representation of text. Due to the large number of characters included in the Unicode Standard, there is no font that contains all characters. Even the subset of the Basic Multilingual Plane is covered completely by only a few fonts. Here are a few examples:
| Unicode font | Glyphs | License |
|---|---|---|
| Noto | approx. 77,000 | Open Font License |
| Sun-ExtA/B | approx. 50,000 | Freeware |
| Unifont | approx. 63,000 | GNU GPL |
| Code2000 | approx. 63,000 | Shareware |
How is Unicode used?
In many cases, users employ Unicode without ever being aware of it. Digital text is represented in most documents and applications as Unicode and can be freely copied, pasted, and edited by users. Sometimes, the end user may need to insert a specific Unicode character into text. There are various methods for doing this, which we will present below.
Special software keyboards
The use of special software keyboards is probably the most common method to insert Unicode characters into text. Ubiquitous on mobile devices, software keyboards allow for switching between languages and their respective alphabets. The key layout changes, with all characters originating from the Unicode repertoire. These characters can be mixed and combined freely in texts.
A good example of this is emojis: Emojis are regular Unicode characters like letters, numbers, and special symbols. As with other digital characters, the visual representation of emojis is independent of their internal modeling. Each operating system displays the same emoji slightly differently.
These software keyboards are not only found on mobile devices. They're also available on desktops. They can be easily accessed in Windows, macOS, and many Linux distributions, displaying a different set of characters depending on the selected language. Since the number of keys is limited, not all Unicode characters are shown. Instead, there's a language-specific selection of the most commonly used characters.
Unicode character tables
Besides software keyboards, Unicode character tables are probably the most useful way to access Unicode characters. Remember, a coded character set is the collection of all characters along with their corresponding unique code points. Such a structure lends itself to a table format, and indeed the Unicode Standard includes exactly such tables, called Unicode Code Charts. From these tables, users can copy specific characters to use elsewhere. Alternatively, end users can read the corresponding code point, for example, to use it as a numeric character reference; more on this in the next section.
Many desktop operating systems also include a Unicode character table. This provides an overview of all available Unicode characters along with their code point, description, and glyph. A character can be inserted or copied with a click. A character table can also be created with just a few lines of code. Later in this article, we'll show an example using the Python programming language.
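The code point, description, and character that such a table lists can also be looked up programmatically. The sketch below uses Python's standard `unicodedata` module; the helper `table_row` and the selection of code points are assumptions for illustration, and the names returned depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def table_row(code_point: int) -> tuple[str, str, str]:
    """Return (code point in U+ notation, official name, character)."""
    character = chr(code_point)
    # Some code points (e.g. control characters) have no name
    name = unicodedata.name(character, "<unnamed>")
    return (f"U+{code_point:04X}", name, character)


for cp in (0x41, 0xA9, 0x3A9, 0x20AC):
    print(*table_row(cp), sep="\t")

# The reverse direction: find a character by its official name
print(unicodedata.lookup("EURO SIGN"))
```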
Numeric character reference
The core of the Unicode Standard is the mapping of characters to code points. Knowing a character's code point allows it to be used to embed the corresponding character in various contexts. On Windows, entering Unicode symbols is done using the standard hardware keyboard with a special key combination. Note that the code point number typically needs to be entered in hexadecimal format.
Programmers most often need numeric character references. The hexadecimal representation of code points allows for the mapping of a Unicode character into characters of the ASCII character set. We demonstrate this approach in HTML; fundamentally, it works the same in Python, C++, etc.
The general scheme for embedding a character using a numeric reference includes the reference itself, as well as an opening and closing term: In HTML documents, the numeric reference starts with &#x and ends with ;. In between, without any spaces, the hexadecimal code point is entered. It usually has two to four digits, and up to six for characters outside the Basic Multilingual Plane, resulting in a pattern such as &#xNNNN;.
To insert the copyright symbol "©" into an HTML document, for example, we proceed with the following scheme:
- Search for the character in a Unicode table.
- Read the code point associated with the character. In our example, the code point is listed as "U+00A9," which is the hexadecimal representation.
- Compose the character reference and enter it into HTML source code or a Markdown document. In our case, we input &#x00A9;; this renders the character "©".
A less common approach allows for the use of code points in decimal rather than hexadecimal representation. In this case, the numeric reference begins with &# (without the "x") and ends as usual with ;. In between, the code point is written in decimal form. In our example, the numeric reference &#169; results in the copyright symbol.
Use the Unicode Character Inspector to quickly find the different codes for a character.
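Both notations can be generated and resolved with a few lines of Python. This is a sketch under stated assumptions: `hex_reference` and `decimal_reference` are hypothetical helper names, while `html.unescape` is the standard-library function that resolves character references the way an HTML parser would.

```python
import html


def hex_reference(character: str) -> str:
    """Build a hexadecimal numeric character reference, e.g. for U+00A9."""
    return f"&#x{ord(character):04X};"


def decimal_reference(character: str) -> str:
    """Build a decimal numeric character reference."""
    return f"&#{ord(character)};"


copyright_sign = "\u00A9"
print(hex_reference(copyright_sign), decimal_reference(copyright_sign))

# Both references resolve to the same character
print(html.unescape("&#x00A9;") == html.unescape("&#169;") == copyright_sign)
```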
Named character entities
Since the notation of Unicode characters as numeric references is not intuitive for humans, there is another method: named character entities. These are defined for commonly used characters and assign a short, memorable name to the character. A named character entity starts with the ampersand & and ends with a semicolon ;. The defined name is placed in between without spaces. To insert the copyright symbol "©" in HTML, simply write &copy;.
The complete list of defined character entities is documented in the HTML Standard.
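Python's standard library ships a copy of such entity tables, which makes it easy to look up a name or resolve an entity. A short sketch follows; note that `name2codepoint` and `codepoint2name` only cover the older HTML 4.01 entity set, so names added in later versions of the HTML Standard are not found there.

```python
import html
from html.entities import codepoint2name, name2codepoint

# Resolve a named entity to its character
print(html.unescape("&copy;"))

# Look up the code point behind an entity name (HTML 4.01 set only)
print(hex(name2codepoint["copy"]))

# Find the entity name for a code point, if one is defined
print(codepoint2name.get(0x20AC))
```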
Programming languages
Most programming languages include basic functions to convert characters and code points. The corresponding functions are often called ord(character) and chr(code point). The following applies:
chr(ord(character)) == character
Note that it is always possible to determine the code point corresponding to a character. Conversely, the assignment only works for numbers that are actually defined as code points of the character code. We demonstrate the basic scheme here with a short Python example:
# Determine the decimal code point of a character
ord('A') # `65`
# Determine the hexadecimal code point of a character
hex(ord('A')) # `0x41`
# Determine the character corresponding to a code point
chr(65) # `'A'`
chr(0x41) # `'A'`
chr(0x110001) # ValueError, because the code point is greater than `0x10FFFF`

With the help of these functions, it's easy to create a character table for code points of the Unicode character set. For this, you iterate over the code points and output the corresponding characters. With Python, this can be done in just a few lines of code:
# Start `range` at `32` to avoid control characters being printed
# Print ASCII character set
for code_point in range(32, 128):
    print(code_point, hex(code_point), chr(code_point))
# Print ISO Latin-1
for code_point in range(32, 256):
    print(code_point, hex(code_point), chr(code_point))

Program Library ICU
The International Components for Unicode (ICU) are consolidated in a program library provided by the Unicode Consortium. The library is released under an open-source license and can be used on many operating systems. The software serves the purpose of programmatic internationalization (often abbreviated as "i18n"). Its applications include:
- Processing of Unicode texts
- Support for regular expressions in Unicode
- Parsing and formatting of calendar dates, times, numbers, currencies, and messages
The ICU library is available in two versions:
- "icu4c" is written in C/C++ and provides an API for these languages.
- "icu4j" is written in Java and provides an API for this language.
The use of the components provides consistent results regardless of the underlying platform.
Charset meta tag in the head of HTML documents
Most HTML documents today use the UTF-8 character encoding. To ensure that visitors see the document without erroneous characters, a "Charset" meta tag should be placed in the head of the HTML document. This instructs the browser to interpret the retrieved document as UTF-8 and is illustrated below:
<head>
  <meta charset="utf-8">
  <!-- additional head elements -->
</head>

Instagram fonts
The popular social network Instagram does not allow text formatting for biography information, posts, or stories. This limits users' creative options. However, clever developers have found a workaround: Instagram uses Unicode, making it possible to compose text that appears formatted using special characters. This often involves characters that resemble Latin letters. The easiest way to create such text is with an Insta Fonts Generator. Additionally, using Instagram fonts also works in other social networks.
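How such generators might work can be sketched in a few lines of Python. This is an illustrative assumption, not the method of any particular tool: the function `to_math_bold` is hypothetical and simply shifts basic Latin letters into the "Mathematical Bold" range of the Mathematical Alphanumeric Symbols block, which starts at U+1D400 for "A" and U+1D41A for "a".

```python
import unicodedata

BOLD_UPPER = 0x1D400  # MATHEMATICAL BOLD CAPITAL A
BOLD_LOWER = 0x1D41A  # MATHEMATICAL BOLD SMALL A


def to_math_bold(text: str) -> str:
    """Replace basic Latin letters with look-alike bold math letters."""
    result = []
    for ch in text:
        if "A" <= ch <= "Z":
            result.append(chr(BOLD_UPPER + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            result.append(chr(BOLD_LOWER + ord(ch) - ord("a")))
        else:
            result.append(ch)  # digits, spaces, etc. stay unchanged
    return "".join(result)


styled = to_math_bold("Hello Unicode")
print(styled)

# Compatibility normalization maps the look-alikes back to plain letters
print(unicodedata.normalize("NFKC", styled))
```

Because these are separate characters rather than formatted letters, screen readers and search functions may not treat the styled text like ordinary words.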

