BOM: What is a Byte Order Mark?
Data sent over the internet must arrive in a defined order, and the recipient (for example, a browser rendering an HTML page) needs to know how to read it. Various markers in the data help ensure this. One of them is the byte order mark (BOM). But what is this marker intended for?
Why Do You Need the BOM?
Characters can be encoded in various ways. UTF-8 is the most widely used encoding today, but UTF-16 was previously popular and is still often used, and UTF-32 appears occasionally. Unlike UTF-8, the encodings that work with units larger than one byte require the byte order to be known.
UTF-8 encodes each character as a sequence of one to four single-byte units (8 bits each), so the question of byte order within a unit never arises. UTF-16, on the other hand, works with 16-bit units: most characters take two bytes, and the rest take four bytes (a surrogate pair). For a character to be interpreted correctly, it must be clear which byte of a unit is the most significant one, in other words whether the bytes are read from left to right or from right to left. Depending on this, a completely different value results.
- From left to right: 01101010 00110101 is 6a35 in hexadecimal notation
- From right to left: 01101010 00110101 is 356a in hexadecimal notation
Looked up in a Unicode table, these two values correspond to two completely different characters. The first interpretation is known as Big Endian (BE), and the second as Little Endian (LE). The names reflect that with Big Endian, the most significant byte comes first, and with Little Endian, the least significant byte comes first.
In everyday life, the Big Endian convention is the familiar one: decimal numbers are written with the most significant digit first. But this is just a convention. Computers can handle both methods of storage, so it makes sense to mark which one is in use.
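The two readings of the example bytes can be reproduced in a few lines of Python. This is only an illustrative sketch; the variable names are chosen here for the example.

```python
# The two bytes from the example above.
data = bytes([0b01101010, 0b00110101])

# Interpret the same bytes with the most significant byte first (Big Endian)
# and with the least significant byte first (Little Endian).
big = int.from_bytes(data, "big")
little = int.from_bytes(data, "little")

print(f"{big:04x}")     # 6a35
print(f"{little:04x}")  # 356a

# Decoded as UTF-16, the same two bytes yield two different characters.
char_be = data.decode("utf-16-be")
char_le = data.decode("utf-16-le")
print(f"U+{ord(char_be):04X}", f"U+{ord(char_le):04X}")  # U+6A35 U+356A
```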
To indicate the order in which the bytes are to be read, you need a BOM. This is the Unicode character U+FEFF, which is not visible and is also known as a zero-width no-break space: a space that has a width of zero and does not trigger a line break. In UTF-16, this character appears as the bytes fe ff (BE) or ff fe (LE). Because the byte-swapped value U+FFFE is not a valid character, a reader can tell from these two bytes which order is in use. The BOM is placed at the very beginning of the encoded text.
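As a minimal sketch of how a reader might use the mark, the following Python snippet checks the first two bytes of some data. The function name `detect_utf16_byte_order` is made up for this example, and the sketch assumes the data is already known to be UTF-16 (a UTF-32 Little Endian BOM also begins with ff fe and would need a separate check).

```python
import codecs
from typing import Optional


def detect_utf16_byte_order(raw: bytes) -> Optional[str]:
    """Return 'BE' or 'LE' if raw starts with a UTF-16 BOM, else None."""
    if raw.startswith(codecs.BOM_UTF16_BE):  # bytes fe ff
        return "BE"
    if raw.startswith(codecs.BOM_UTF16_LE):  # bytes ff fe
        return "LE"
    return None


text = "BOM"
as_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
as_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(detect_utf16_byte_order(as_be))  # BE
print(detect_utf16_byte_order(as_le))  # LE

# Python's generic "utf-16" codec reads the BOM, picks the byte order
# accordingly, and drops the BOM from the decoded text.
print(as_be.decode("utf-16") == as_le.decode("utf-16") == text)  # True
```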
UTF-8 doesn’t actually need the BOM, and yet it is also found in texts encoded with it, where it appears as the three bytes ef bb bf. It is either a remnant of a conversion from UTF-16/UTF-32 to UTF-8, or it has been inserted automatically by an editor. Even though the byte order mark is not necessary for UTF-8, it usually does not get in the way, since it is not displayed.
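To see what a UTF-8 BOM looks like in practice, the following Python sketch builds a short byte string with the mark in front and decodes it in two ways. The sample text is arbitrary; `utf-8-sig` is Python's name for a UTF-8 codec that skips a leading BOM if one is present.

```python
import codecs

raw = codecs.BOM_UTF8 + "hello".encode("utf-8")
print(raw[:3].hex())  # efbbbf

# The plain "utf-8" codec keeps the BOM as an invisible U+FEFF character.
plain = raw.decode("utf-8")
# The "utf-8-sig" codec strips a leading BOM if there is one.
cleaned = raw.decode("utf-8-sig")

print(len(plain), len(cleaned))    # 6 5
print(plain.startswith("\ufeff"))  # True
```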
Issues with the Byte Order Mark
Problems arise when the receiving system does not know how to handle the BOM. Some PHP versions and various Unix-like environments do not expect the character and treat its bytes as ordinary content at the start of the file, which can lead, for example, to a website being displayed incorrectly.
Problems can also arise between HTTP and HTML. The HTTP Content-Type header can already contain information about the character encoding, which comes from the server settings. If the HTML document has been saved with a BOM but the HTTP header tells the browser something different, this can also lead to display errors. This should no longer occur since the HTML5 specification was changed to require that a BOM at the beginning of the document overrides the information in the HTTP header. However, it’s possible that older browser versions have not yet implemented this rule.
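A simple way to picture the problem: a receiver that inspects only the first bytes of a file no longer finds what it expects once a BOM sits in front. The Python sketch below uses a made-up shell script and a hypothetical check, `starts_with_shebang`, purely for illustration; it does not reproduce the behavior of any particular PHP version or Unix tool.

```python
import codecs

# A tiny, made-up script, once without and once with a UTF-8 BOM in front.
script = b"#!/bin/sh\necho hello\n"
script_with_bom = codecs.BOM_UTF8 + script


def starts_with_shebang(raw: bytes) -> bool:
    """Hypothetical receiver check that looks only at the first two bytes."""
    return raw.startswith(b"#!")


print(starts_with_shebang(script))           # True
print(starts_with_shebang(script_with_bom))  # False
```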
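The precedence rule can be sketched as follows. This is a heavily simplified illustration, not how any browser is actually implemented: the function `choose_encoding`, its parameters, and the `fallback` default are assumptions made for this example, and real browsers consider further sources such as meta tags.

```python
import codecs
from typing import Optional

# BOM byte sequences and the encodings they signal.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def choose_encoding(body: bytes, http_charset: Optional[str],
                    fallback: str = "utf-8") -> str:
    """Pick an encoding: a BOM wins over the HTTP header, which wins over the fallback."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    return http_charset or fallback


page = codecs.BOM_UTF8 + b"<p>hi</p>"
print(choose_encoding(page, "iso-8859-1"))          # utf-8
print(choose_encoding(b"<p>hi</p>", "iso-8859-1"))  # iso-8859-1
print(choose_encoding(b"<p>hi</p>", None))          # utf-8
```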
How to Remove the BOM
If you want to remove the byte order mark from a source file, you need a text editor that offers the option of saving without the mark. You open the file with the BOM in the editor and then save it again without the BOM, which converts the encoding. The mark should then no longer appear. In the popular text editor Notepad++, for example, you can change the encoding and then save the file without the BOM.
In older versions of Notepad++, the menu entry is called UTF-8 without BOM. In newer versions, the same option is simply called UTF-8, and the variant with the marker is called UTF-8-BOM.
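If no suitable editor is at hand, the same result can be achieved with a short script. The following Python sketch is one possible approach, assuming the file is UTF-8; the function name `strip_utf8_bom` and the demo file name are chosen for this example only.

```python
import codecs
import tempfile
from pathlib import Path


def strip_utf8_bom(path: Path) -> bool:
    """Rewrite the file without a leading UTF-8 BOM. Return True if one was removed."""
    raw = path.read_bytes()
    if not raw.startswith(codecs.BOM_UTF8):
        return False  # nothing to do, leave the file untouched
    path.write_bytes(raw[len(codecs.BOM_UTF8):])
    return True


# Demo on a throwaway file so no real data is modified.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "index.html"
    demo.write_bytes(codecs.BOM_UTF8 + b"<p>hi</p>")
    print(strip_utf8_bom(demo))  # True
    print(demo.read_bytes())     # b'<p>hi</p>'
    print(strip_utf8_bom(demo))  # False
```

The function works on raw bytes rather than decoded text, so the rest of the file is written back unchanged.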