The term metadata has been on everyone’s lips for a few years now. Today, billions of people around the world use digital media. Large amounts of metadata are con­stant­ly being generated in the process. The term “trans­par­ent citizen” is sometimes used to describe the resulting data pro­tec­tion risk.

The evaluation of metadata by artificial intelligence makes it possible to predict people’s behavior. In the long run, this poses a serious threat to the privacy of citizens and to democracy in practice. Yet metadata is not a bad thing in itself. In this article, we explain what metadata actually is.

What’s the dif­fer­ence between metadata and data?

De­f­i­n­i­tion

Metadata: The term refers to in­for­ma­tion that sup­ple­ments actual data. Often, metadata provides more details about the context of the content or gives in­struc­tions on how to handle data. In this way, metadata plays a major role in both computing and tra­di­tion­al data pro­cess­ing (including things like library catalogs or the postal system).

To become more familiar with the term metadata, imagine a simple example: You send a letter through the mail. Now the document contained in the envelope cor­re­sponds to the actual, primary data. This data is private and protected by law against access by third parties – the secrecy of cor­re­spon­dence applies.

The envelope contains the metadata of the letter. This is ad­di­tion­al data that ac­com­pa­nies the primary data:

  • Address and sender
  • Stamp and postmark
  • Where required, additional identifiers like barcodes

As you can see, this is the data that makes sending the letter possible in the first place. The metadata of the letter is visible to anyone who handles it. It is therefore not specially protected by the secrecy of correspondence, although postal secrecy does apply.

So, what is the danger posed by metadata? It’s not a problem if in­di­vid­ual metadata can be read. If, for example, a third party gained knowledge of the existence of an in­di­vid­ual envelope, it’s usually no cause for concern. However, this changes when more data is at stake, as is the case with massive data storage and its eval­u­a­tion. On a larger scale, patterns emerge that reveal a lot about a person’s behavior: Who com­mu­ni­cat­ed with whom and when? Networks and chains of com­mu­ni­ca­tion can be iden­ti­fied.
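
To make this more concrete, here is a minimal Python sketch of how communication patterns emerge from aggregated metadata. The records and names are invented for illustration. Each record holds only envelope-style metadata (sender, recipient, date) and no message content.

```python
from collections import Counter

# Hypothetical metadata records: (sender, recipient, date). No message content.
records = [
    ("alice", "bob", "2021-02-01"),
    ("alice", "bob", "2021-02-02"),
    ("alice", "carol", "2021-02-02"),
    ("bob", "carol", "2021-02-03"),
    ("alice", "bob", "2021-02-04"),
]


def contact_counts(records):
    """Count how often each sender/recipient pair appears."""
    return Counter((sender, recipient) for sender, recipient, _date in records)


counts = contact_counts(records)
most_common_pair, frequency = counts.most_common(1)[0]
print(most_common_pair, frequency)
```

A single record reveals little. The counts across many records show who is in regular contact with whom.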

The dis­tinc­tion between data and metadata is fluid. The clas­si­fi­ca­tion depends on the context and on per­spec­tive. Here’s another example. A book contains primary data, such as the title of the book and its content. Fur­ther­more, a set of metadata is available for the pub­li­ca­tion of a book:

  • Author
  • Publisher
  • Time and place the book was published
  • Edition
  • ISBN

Let’s imagine that the metadata of many publications is collected in a database. From the point of view of this database, the publication information would be primary data. In addition, there would be a new set of metadata for each publication. For example, the database could store when each entry was added and by which user.
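
The shift in perspective can be sketched in a few lines of Python. The class names, field names, and sample values below are assumptions chosen for illustration, and the ISBN is a placeholder rather than a real number.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Publication:
    # Metadata about the book, but primary data from the database's point of view.
    title: str
    author: str
    publisher: str
    isbn: str


@dataclass
class CatalogEntry:
    publication: Publication
    # Metadata about the database entry itself.
    added_at: datetime
    added_by: str


def add_entry(catalog, publication, user):
    """Store a publication together with metadata about the entry."""
    entry = CatalogEntry(publication, datetime.now(timezone.utc), user)
    catalog.append(entry)
    return entry


catalog = []
book = Publication("An Example Book", "Jane Doe", "Example Press", "000-0-00-000000-0")
entry = add_entry(catalog, book, user="librarian01")
```

Here, `Publication` holds what the database treats as primary data, while `added_at` and `added_by` are the new layer of metadata.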

What types of metadata exist and how are these used?

Metadata is found in all areas of data storage and processing, so its uses cannot be listed exhaustively. Here are three major areas of use:

1. To provide context for in­for­ma­tion.

Metadata often describes the process that led to the creation of information. Think, for example, of the geographic coordinates with which digital photos are tagged. Once lost, this context usually cannot be reconstructed, which is why it is stored along with the data.

2. To provide in­for­ma­tion that would otherwise be difficult to find.

Consider, for example, the length of a video. The duration is stored as a value in the video file. If it were not saved, it would have to be calculated. One possible approach would be to count the number of frames and divide it by the frame rate, which takes a relatively large amount of effort.
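
As a small worked example, the calculation can be written as follows. The frame count and frame rate are hypothetical values, and the sketch assumes a constant frame rate.

```python
def duration_seconds(frame_count, frames_per_second):
    """Duration of a video derived from its frame count and frame rate."""
    if frames_per_second <= 0:
        raise ValueError("frame rate must be positive")
    return frame_count / frames_per_second


# Hypothetical values: 4,500 frames at 25 frames per second.
print(duration_seconds(4500, 25))  # 180.0 seconds, i.e. 3 minutes
```

The division itself is trivial. The effort lies in obtaining the frame count, since that requires reading through the entire file.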

3. To link information and make it easily retrievable and searchable.

The main goal here is to support human-readable information with machine-readable data. The aim is to use automated processes to establish relationships between pieces of information. Structured data plays a particular role here: when connected, it creates a so-called “semantic web”.

Metadata that describes images

Images taken with digital cameras and smartphones contain a large amount of metadata. Some of it is technical data, such as image dimensions, the camera used, and focal length. These fields are defined in the EXIF standard and are created automatically by the camera. In addition, the IPTC standard defines metadata that describes the content of the photo and is entered by the user.

Standard | Image metadata | Creation
EXIF | Image information like dimensions, color space, color channels, etc.; photographic information, such as exposure time, aperture, ISO, etc. | Automatically when the image is recorded
IPTC | Keywords, copyrights, location and time information, content descriptions, etc. | Manually by the user

When sharing digital images, you should be careful: the image metadata can contain private information about the author. Many apps and social networks automatically strip this metadata from images when they are uploaded. But it’s best not to rely on this. In some cases, it’s better to use a special tool to delete the image metadata yourself.

Metadata that is embedded in digital videos

A video file typically consists of a container that holds various data. The primary data of a video includes the encoded video and audio content. Ad­di­tion­al metadata that is embedded includes:

  • Length of the video
  • Data rate and image di­men­sions
  • Details of the audio and video codec used
  • Subtitles, where applicable in several languages

Metadata that is assigned to files

A file in a digital system includes two primary pieces of data: the contents of the file and its name. In addition, each file has a set of metadata associated with it. This file metadata is managed by the operating system and is also known as “file attributes”. Here is an overview of common file metadata:

File metadata | Description
Timestamps | Time of creation, last modification, and last access
Saved location | File path in the file system
Ownership | Owner and group
File permissions | Read, write, execute: for user, group, and others
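
Most of these attributes can be read with Python’s standard library. The following is a minimal sketch, not a complete inventory. The dictionary keys are our own naming, and some values depend on the operating system. For example, the owner and group IDs are always 0 on Windows.

```python
import os
import stat
import tempfile
from datetime import datetime, timezone


def file_metadata(path):
    """Return a few common file attributes as a dictionary."""
    info = os.stat(path)
    return {
        "path": os.path.abspath(path),
        "size_bytes": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        "accessed": datetime.fromtimestamp(info.st_atime, tz=timezone.utc),
        "owner_id": info.st_uid,  # numeric user ID
        "group_id": info.st_gid,  # numeric group ID
        "permissions": stat.filemode(info.st_mode),  # a string like "-rw-------"
    }


# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello")
    sample_path = handle.name

metadata = file_metadata(sample_path)
os.remove(sample_path)
```

None of these values are part of the file’s contents. The file system stores them alongside the file.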

In addition to file attributes, some file types include specific metadata of their own, which is managed by the respective application. This metadata, too, can disclose confidential information when the file is shared.

Metadata that is created when an email is sent

An email consists of two key parts, analogous to the classic postal letter: the body and the header.

The body contains the actual message, which cor­re­sponds to the document in the envelope. Like the envelope, the header contains the addresses of the sender and recipient. As with the envelope, some in­for­ma­tion in the header can be easily forged. For the recipient, it then appears as if an email came from a different sender. This is a trick that is often used in spoofing attacks.

The email header usually contains a lot of other metadata, such as:

  • Various time­stamps
  • In­for­ma­tion on the for­mat­ting and coding of the message
  • Stages the email has passed through during trans­mis­sion
  • Eval­u­a­tion of the email by spam filters
  • Note on whether the email was checked by a virus scanner

The metadata of the email header is written and read by server software and ap­pli­ca­tion programs. The in­for­ma­tion generated in the process reveals a lot about an email and the path it has taken through the Internet. Among other things, state­ments can be made about the au­then­tic­i­ty and con­fi­den­tial­i­ty of an email. Fur­ther­more, the header can contain the host name of the user’s own device and reveal the location from which an email was sent.
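
To show how such header fields can be read by a program, here is a minimal sketch using Python’s standard `email` package. The raw message is invented, and all hosts and addresses use reserved example domains and documentation IP ranges.

```python
from email import message_from_string

# Hypothetical raw message; hosts and addresses use reserved example domains.
raw_message = """\
Received: from mail.example.org (mail.example.org [192.0.2.10])
\tby mx.example.com with ESMTP; Mon, 01 Feb 2021 12:13:40 +0000
Received: from laptop.local (unknown [198.51.100.7])
\tby mail.example.org with ESMTP; Mon, 01 Feb 2021 12:13:36 +0000
From: Alice <alice@example.org>
To: Bob <bob@example.com>
Subject: Lunch
Date: Mon, 01 Feb 2021 12:13:34 +0000
Content-Type: text/plain; charset="utf-8"

See you at noon.
"""

message = message_from_string(raw_message)

# Each server adds its own "Received" header on top, so the list reads newest first.
hops = message.get_all("Received")
sender = message["From"]
body = message.get_payload()
```

In this invented example, the last entry in `hops` contains the host name of the sending device. This is the kind of detail that can reveal more than the sender intended.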

Metadata that is generated when you visit a website

From a technical point of view, visiting a website means retrieving an HTML document. The user’s browser retrieves the document from a server at the specified address. The HTTP or HTTPS protocol is used for this.

In addition to the actual HTML document that is displayed in the browser, metadata known as HTTP headers is trans­mit­ted. The HTTP headers are com­pa­ra­ble to the fields of the email header. They contain in­for­ma­tion about the encoding, trans­mis­sion, en­cryp­tion, and com­pres­sion of the HTTP con­nec­tion.

Furthermore, metadata that accumulates on the server is generated during the transfer. This includes log files, which record every access to the server and form the basis for log file analyses. For each access, another line is written to the log file. In addition, the browser usually sends queries to a DNS server. These queries also generate metadata, which the server operator may store and evaluate.
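
What such a log line contains depends on the server configuration. As a sketch, the following Python code parses one line in the widely used Common Log Format. The format choice, the sample line, and the field names are assumptions for illustration.

```python
import re
from datetime import datetime

# Assumes the Common Log Format; real servers may be configured differently.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Split one access-log line into its fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    return fields


# Hypothetical log line using a documentation IP address.
line = '192.0.2.1 - - [01/Feb/2021:12:13:34 +0000] "GET /index.html HTTP/1.1" 200 148'
entry = parse_log_line(line)
```

Even this single line records who requested which document and when. Collected over time, such lines form the basis for log file analyses.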

Confusingly, in addition to the HTTP header already mentioned, there is also the HTML head, sometimes called the HTML header. While the former refers to the connection, the latter contains metadata describing the contents of the document. Below is an example of an HTTP server response. The introductory lines are the HTTP header. This is followed by the HTML source code with the HTML head and HTML body elements:

HTTP/1.1 200 OK
Date: Mon, 01 Feb 2021 12:13:34 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 148
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
Accept-Ranges: bytes
Connection: close

<html>
    <head>
        <title>An Example Page</title>
    </head>
    <body>
        <p> The human readable text is in the body of the document</p>
    </body>
</html>
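
The separation between the two kinds of metadata can be sketched in Python. The function below is a simplified illustration, not a full HTTP parser: it assumes a complete response in one string and keeps only the last value if a header name occurs more than once.

```python
def split_http_response(raw):
    """Split a raw HTTP response into status line, header dict, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Shortened version of the response above, with the CRLF line endings HTTP uses.
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Connection: close\r\n"
    "\r\n"
    "<html><head><title>An Example Page</title></head><body></body></html>"
)

status_line, headers, body = split_http_response(raw_response)
```

The `headers` dictionary holds the metadata of the connection. The HTML head, with its metadata about the document, is part of `body`.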

What metadata means for online marketing and search engine op­ti­miza­tion

In this section, we focus on metadata that is embedded in an HTML document. We’ll leave out the HTTP metadata already mentioned, as well as server-side metadata such as log files. Usually, HTML metadata is embedded in the head of the HTML document.

Many of the elements used in the HTML head are directly relevant to search engine optimization. Search engine bots crawl the content of an HTML document. The human-readable part present in the HTML body is extracted and indexed. In addition, there is special metadata that is intended exclusively for bots. Here, we distinguish between “classic” and “modern” variants.

Website metadata il­lus­trat­ed with classic HTML head elements

The classic HTML head elements include the title and a handful of critical meta tags. The title is also visible to the user in various forms. For example, it is displayed in bookmarks or in the browser tab header. The other classic “<meta>” tags are used ex­clu­sive­ly for search engine op­ti­miza­tion. Here is an overview of the most important classic HTML head elements:

Tag | Description | Importance
<title> | Title of the document, displayed in the search results | Critical
<meta name="description"> | Description of the document, displayed in the search results | Critical
<meta name="keywords"> | Keywords of the document, not displayed in the search results | Minimal
<meta name="robots"> | Directions for search engine bots on processing the document | Critical
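
As a rough idea of how a bot might read these elements, here is a minimal sketch based on Python’s standard `html.parser` module. The class name and the sample document are our own, and real crawlers are considerably more robust.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text and <meta name="..."> values of a document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attributes.get("name"):
            self.meta[attributes["name"].lower()] = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical document for illustration.
html_document = """
<html>
  <head>
    <title>An Example Page</title>
    <meta name="description" content="A short summary for the search results.">
    <meta name="robots" content="noindex, follow">
  </head>
  <body><p>Human-readable text.</p></body>
</html>
"""

parser = HeadMetadataParser()
parser.feed(html_document)
```

After parsing, `parser.title` holds the document title and `parser.meta` maps each meta name to its content.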

Website metadata illustrated with modern HTML head elements

In addition to the classic HTML head elements, a variety of other elements are used today to include metadata on a website. Search engine operators and large tech­nol­o­gy groups are con­stant­ly defining new metadata. The elements “<meta>” and “<link>” are ideal for this, as they can be expanded. Here is an overview of fre­quent­ly used modern website metadata:

Tag | Description | Importance
<link rel="canonical"> | Canonical tag to avoid duplicate content | Critical, if duplicate content exists
<link rel="alternate" hreflang="en"> | Points to alternative language versions of the same document via hreflang | Optional
<meta property="og:…"> | Open Graph for publication on social media | Optional

For the “<meta>” element, the “name” attribute is used to specify the specific type of metadata. For the “<link>” element, the “rel” attribute is used in a similar way. Depending on the metadata standard used, two al­ter­na­tive notations can be found for the “<meta>” element. We summarize them here:

How it’s written | Metadata standard
<meta name=""> | HTML5
<meta property=""> | RDFa
<meta itemprop=""> | HTML Microdata

Website metadata defined with the Open Graph

Open Graph is a protocol developed by Facebook to enrich a web document with metadata. The Open Graph data provides information that is displayed as a preview when the document is shared on social networks. In this way, optimized images, titles, and descriptive texts can be specified. This makes sense, since each platform has its own restrictions on the length of texts, the dimensions of images, and the like. The protocol is used extensively by Facebook and Twitter. Here is an overview of the essential Open Graph metadata:

Open Graph metadata | Explanation
<meta property="og:title"> | Title of the object
<meta property="og:type"> | The type of the object, e.g., image, web document, video, etc.
<meta property="og:image"> | An image that represents the object
<meta property="og:url"> | The canonical URL of the object

Tip

If your web content is displayed incorrectly when shared on Facebook, the problem is often caused by faulty Open Graph entries. In this case, a simple trick can fix the error: log in to your Facebook account and run the page through the Sharing Debugger. This tells Facebook to read the Open Graph information again.
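
The four essential Open Graph properties can also be generated programmatically. The following Python sketch renders them as “<meta>” tags. The function name and the page data are hypothetical, and the URLs point to a reserved example domain.

```python
from html import escape


def open_graph_tags(properties):
    """Render a dict of Open Graph properties as <meta> tags."""
    lines = []
    for name, value in properties.items():
        lines.append(
            f'<meta property="og:{escape(name)}" content="{escape(value)}">'
        )
    return "\n".join(lines)


# Hypothetical page data; the URLs point at a reserved example domain.
page = {
    "title": "What is metadata?",
    "type": "article",
    "image": "https://www.example.com/images/metadata.png",
    "url": "https://www.example.com/metadata",
}

print(open_graph_tags(page))
```

Escaping the values matters here. A stray quotation mark in a title would otherwise break the attribute, which is one way faulty entries can arise.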

Website metadata defined with Rich Cards

In addition to Open Graph, there is Rich Cards, a further metadata standard that was developed by Google. Rich Cards enrich a web document with structured metadata. For example, the website of a restaurant can be supplemented with information on geographical location, prices, opening hours, etc. The Rich Card information can be placed in the HTML head or in the HTML body.

Technically, Rich Cards are derived from the metadata standard Schema.org. Various formats are used to mark up the metadata. Besides the older standards, which include RDFa and Microdata, JSON-LD is also available today. Google officially recommends the use of JSON-LD.
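
For the restaurant example, JSON-LD markup could be produced along these lines. This is a sketch under assumptions: the restaurant and all its values are placeholders, and the selection of Schema.org properties is only one possible choice.

```python
import json

# Hypothetical restaurant; all values are placeholders.
restaurant = {
    "@context": "https://schema.org",
    "@type": "Restaurant",
    "name": "Example Bistro",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "1 Example Street",
        "addressLocality": "Exampletown",
    },
    "geo": {"@type": "GeoCoordinates", "latitude": 52.52, "longitude": 13.405},
    "openingHours": "Mo-Fr 11:00-22:00",
    "priceRange": "$$",
}


def json_ld_script(data):
    """Wrap structured data in the script element used for JSON-LD."""
    # Escape "</" so a value can never close the script element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


snippet = json_ld_script(restaurant)
```

The resulting `snippet` can be placed in the HTML head or the HTML body. Unlike RDFa or Microdata, it does not require changes to the visible markup.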
