Data structures are the backbone of every website and an integral part of HTML coding, since tags are used to assign settings and features to segments of text. Among other things, tags allow web developers to define paragraphs, titles, lists, hyperlinks, graphics, tables, and videos, as well as to set text in bold or italics. Programs that read the code receive detailed information on the structure of an HTML document and on how the tagged elements are to be displayed. What these programs do not capture automatically is the meaning of the content enclosed by the tags. As seen in the example below from a news article, the left depiction shows which information is registered by a program, while the right one displays how a human reader would interpret the text:

While human internet users can infer that the headline is to be understood as a title, that the subheadline is the author's name, and so on, programs can only interpret information that has been labeled (or tagged) in the HTML code: headline (<h1>), subheadline (<h2>), italics (<i>). This matters when search engine web crawlers are at play, since they are responsible for determining a website's relevance to search queries. This is why many website owners enrich their HTML documents with machine-readable semantic information that defines the meaning of individual pieces of content. This is known as structured data.
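The gap between markup and meaning can be made concrete with a minimal sketch. It assumes a made-up article fragment (the headline, author name, and publication are placeholders, not taken from the example above) and uses Python's standard html.parser module to list what a naive program learns: the enclosing tag and the text, but nothing about what the text means.

```python
from html.parser import HTMLParser

# Hypothetical article fragment; text and tag choices are illustrative only.
ARTICLE_HTML = """
<h1>Example headline</h1>
<h2>Jane Doe</h2>
<p>Published in <i>Example Daily</i></p>
"""


class TagTextCollector(HTMLParser):
    """Record (enclosing tag, text) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []  # tags that are currently open
        self.seen = []   # (tag, text) pairs found so far

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.seen.append((self.stack[-1], text))


collector = TagTextCollector()
collector.feed(ARTICLE_HTML)
collector.close()

for tag, text in collector.seen:
    print(f"<{tag}>: {text}")
```

The program can tell that "Jane Doe" sits in a second-level heading, but not that it is the name of the author. That missing piece is what semantic annotation supplies.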

Why is structured data needed?

The idea of structuring website data so that programs can process information shaped by human language comes from the concept of the semantic web. When properly used, structured data makes website content machine-readable. This is particularly relevant for text-based search engines like Google, Bing, or Yahoo! When provided with corresponding tags, these Big Data giants are able to read and evaluate semantic information and process it into various display forms, such as the Knowledge Graph or Rich Snippets in the SERPs (search engine results pages). The latter aspect is especially important for website owners.

Rich Snippets are excerpts from web content that are displayed in the SERPs in addition to the basic information (URL, title, and description). For this information to be displayed, all relevant content needs to be tagged in the HTML code and assigned a certain information type by the website owner. Currently, the market leader, Google, processes structured data in order to display Rich Snippets for the following data types:

  • Product information: price, availability, reviews, and user experiences
  • Recipes: pictures, preparation time, calories, and reviews
  • User experiences: restaurants, movies, stores, and businesses
  • Events: musicals, concerts, exhibitions, or festivals, including duration
  • Software: reviews, price, and user experiences
  • Videos: description and image preview
  • News articles: title, publication date, author details, and picture

For website owners, Rich Snippets have the advantage of taking up significantly more space in the SERPs and standing out more, which leads to a higher click-through rate. Search results can also be expanded with breadcrumbs (a graphical control element) and the sitelinks search box.

Google displays the sitelinks search box for navigational search queries. This happens when the desired website can be derived from the user's search query, but the specific subpage can't; this usually occurs when users search for brands. The search box lets internet users search within a website directly from the SERPs, without first having to open the site itself. For site owners, sitelinks and search boxes again have the advantage of attracting more attention, because this feature occupies a proportionately large amount of space in the SERPs.

Breadcrumbs display the position of a search hit within the structure of a website and help search engine users orient themselves.

Exactly which search results are expanded with these features depends on the criteria search engines use to determine relevance. This is why it's important to tag your website accordingly; search engines need structured data in order to generate Rich Snippets, breadcrumbs, or a sitelinks search box.

Structuring data on your own website

There are several standard formats that site owners follow in order to ensure that content with structured data is machine-readable. These include microformats, RDFa, and microdata. All three formats for data structuring are based on semantic tagging, which is entered directly into the HTML code. Depending on the format, either traditional HTML attributes or new labeling elements can be used. The data format JSON-LD has become increasingly popular over the past few years; this option makes it possible to annotate a web page within a script.

Microformats

The labeling format microformats is used for semantically tagging HTML and XHTML documents. It reuses well-known HTML attributes such as class, rel, and rev, which programs like web crawlers can extract from the website code in order to read out semantic information. A typical use case is labeling contact information with the microformat hCard, which is integrated into the HTML code as class="vcard":

An example of common labeling for contact information in HTML:

<div>
  <div>first name last name</div>
  <div>company</div>
  <div>phone number</div>
  <a href="http://website.com/">http://website.com/</a>
</div>

Tagging contact information with the microformat hCard:

<div class="vcard">
  <div class="fn">first name last name</div>
  <div class="org">company</div>
  <div class="tel">phone number</div>
  <a class="url" href="http://website.com/">http://website.com/</a>
</div>

While the contact information in plain HTML markup is tagged only as div elements, integrating the microformat hCard via the HTML attribute class="vcard" allows distinct semantic annotations to be attached to specific pieces of information, such as names, organizations, or telephone numbers. The advantage of this type of labeling is that it simply reuses known HTML attributes. The drawback is that semantic annotation with microformats is limited to a few predefined elements. Using class attributes can also lead to conflicts with CSS, and microformats do not provide an API for extracting data.
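How a program might read such class-based annotations can be sketched as follows. This is a simplified illustration that uses Python's standard html.parser module, not a complete microformats parser: the class HCardParser and the function parse_hcard are hypothetical names, only the four hCard properties from the example are recognized, and nesting inside the vcard root is not checked.

```python
from html.parser import HTMLParser

HCARD_HTML = """
<div class="vcard">
  <div class="fn">first name last name</div>
  <div class="org">company</div>
  <div class="tel">phone number</div>
  <a class="url" href="http://website.com/">http://website.com/</a>
</div>
"""

# Subset of hCard property names used in the example markup.
HCARD_PROPERTIES = {"fn", "org", "tel", "url"}


class HCardParser(HTMLParser):
    """Collect hCard properties from class attributes (simplified)."""

    def __init__(self):
        super().__init__()
        self.card = {}
        self.current = None  # property whose text content is still expected

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = set((attr_map.get("class") or "").split())
        matches = classes & HCARD_PROPERTIES
        if not matches:
            return
        prop = matches.pop()
        if prop == "url" and attr_map.get("href"):
            # For links, the value is the target address, not the link text.
            self.card[prop] = attr_map["href"]
        else:
            self.current = prop

    def handle_data(self, data):
        if self.current and data.strip():
            self.card[self.current] = data.strip()
            self.current = None


def parse_hcard(html):
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.card


card = parse_hcard(HCARD_HTML)
print(card)
```

Because the annotations live in ordinary class attributes, the same markup remains valid HTML for browsers that know nothing about hCard.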

RDFa

RDFa stands for 'Resource Description Framework in Attributes'. The W3C recommends this format for embedding RDF statements in HTML, XHTML, and other XML dialects. Instead of having to rely on common HTML attributes, RDFa introduces new attributes that enable complex semantic annotation. The following example shows contact information as structured data in RDFa format:

Tagging contact information with RDFa:

<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Person">
  <div property="v:name">first name last name</div>
  <div property="v:affiliation">company</div>
  <div property="v:tel">phone number</div>
  <a href="http://website.com" rel="v:url">www.website.com</a>
</div>

Before tagging data with the RDFa format, the corresponding XML namespace has to be defined. The attribute typeof specifies which data type the subject of an RDF statement is associated with. The attribute property determines the predicate of a statement and thereby specifies which characteristic an element's content describes. The advantages of data structuring with RDFa include its high flexibility and the possibility of defining custom vocabulary. Prefixes also help keep the code compact. RDFa supports a DOM API (document object model application programming interface) that extracts a website's structured data and can also be used for interactive applications. A disadvantage is the focus on XML and XHTML, even though RDFa can also be embedded into HTML5. A detailed guide on schema.org can be found in our tutorial on the topic. For standardized vocabulary of RDFa annotations, consult the official website.
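The roles of the namespace prefix, typeof, and property can be illustrated with a rough sketch. It is not a conforming RDFa processor: the class RdfaSketchParser is a hypothetical name, only the attributes from the example are handled, and prefixes are simply expanded by string concatenation.

```python
from html.parser import HTMLParser

RDFA_HTML = """
<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Person">
  <div property="v:name">first name last name</div>
  <div property="v:affiliation">company</div>
  <div property="v:tel">phone number</div>
  <a href="http://website.com" rel="v:url">www.website.com</a>
</div>
"""


class RdfaSketchParser(HTMLParser):
    """Turn the example's RDFa attributes into (predicate, value) pairs."""

    def __init__(self):
        super().__init__()
        self.prefixes = {}    # prefix -> namespace IRI
        self.item_type = None
        self.statements = []  # (predicate IRI, value) in document order
        self.pending = None   # predicate whose text value is still expected

    def expand(self, name):
        """Replace a known prefix such as 'v:' with its namespace IRI."""
        if ":" not in name:
            return name
        prefix, _, local = name.partition(":")
        if prefix not in self.prefixes:
            return name
        return self.prefixes[prefix] + local

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Register namespace declarations before expanding anything.
        for name, value in attr_map.items():
            if name.startswith("xmlns:") and value:
                self.prefixes[name[len("xmlns:"):]] = value
        if attr_map.get("typeof"):
            self.item_type = self.expand(attr_map["typeof"])
        if attr_map.get("property"):
            self.pending = self.expand(attr_map["property"])
        elif attr_map.get("rel") and attr_map.get("href"):
            predicate = self.expand(attr_map["rel"])
            self.statements.append((predicate, attr_map["href"]))

    def handle_data(self, data):
        if self.pending and data.strip():
            self.statements.append((self.pending, data.strip()))
            self.pending = None


rdfa = RdfaSketchParser()
rdfa.feed(RDFA_HTML)
rdfa.close()

print(rdfa.item_type)
for predicate, value in rdfa.statements:
    print(predicate, "->", value)
```

The prefix v keeps the markup short, while the expanded names identify each predicate unambiguously.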

Microdata

Microdata is a separately defined HTML5 module that adds attributes to the existing markup language; these attributes are used for carrying out semantic annotations. As is the case with microformats and RDFa, this format also uses simple attributes in HTML tags for assigning item features. The microdata syntax is based on a vocabulary that allows items to be described as name/value pairs. This makes the markup format a compromise between moderate complexity, flexibility, and expandability. Microdata supports a native JSON export for transferring and saving structured data, as well as a Microdata DOM API. Microdata is compatible with the schema.org vocabulary.
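As an illustration of the name/value model, the following sketch marks up the same contact details with the Microdata attributes itemscope, itemtype, and itemprop and collects them into a JSON structure. The markup is an assumed example using the schema.org Person type; the company is left out because it would require a nested item. The class MicrodataSketchParser is a hypothetical name, and the output shape is only loosely modelled on Microdata's JSON export.

```python
import json
from html.parser import HTMLParser

# Assumed example markup; not taken from the article.
MICRODATA_HTML = """
<div itemscope itemtype="https://schema.org/Person">
  <div itemprop="name">first name last name</div>
  <div itemprop="telephone">phone number</div>
  <a itemprop="url" href="http://website.com/">http://website.com/</a>
</div>
"""


class MicrodataSketchParser(HTMLParser):
    """Collect itemprop name/value pairs of a single, flat item."""

    def __init__(self):
        super().__init__()
        self.item = {"type": [], "properties": {}}
        self.pending = None  # property whose text value is still expected

    def add(self, name, value):
        self.item["properties"].setdefault(name, []).append(value)

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # itemscope is a boolean attribute, so its value is None.
        if "itemscope" in attr_map and attr_map.get("itemtype"):
            self.item["type"] = attr_map["itemtype"].split()
        prop = attr_map.get("itemprop")
        if not prop:
            return
        if tag == "a" and attr_map.get("href"):
            # For links, the property value is the href, not the text.
            self.add(prop, attr_map["href"])
        else:
            self.pending = prop

    def handle_data(self, data):
        if self.pending and data.strip():
            self.add(self.pending, data.strip())
            self.pending = None


microdata = MicrodataSketchParser()
microdata.feed(MICRODATA_HTML)
microdata.close()

print(json.dumps(microdata.item, indent=2))
```

Each property maps to a list of values, since the same itemprop name may occur more than once within an item.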

JSON-LD

JSON-LD is the newest standard for semantically labeling website data. The acronym stands for 'JavaScript Object Notation for Linked Data' (in other words, the JSON-based serialization of linked data). Google considers JSON-LD to be the simplest markup format, but doesn't yet support it for all data types. Unlike microformats, RDFa, and Microdata, JSON-LD isn't based on attributes in HTML tags. Instead, a block of JSON data is placed inside a script element at a location of your choosing in the HTML code.
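A minimal sketch of such a block, generated here with Python's standard json module: the contact details reuse the placeholder values from the earlier examples, the property names come from the schema.org Person type, and the helper to_jsonld_script is a hypothetical name introduced for illustration.

```python
import json

# Placeholder contact data expressed with schema.org terms.
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "first name last name",
    "worksFor": {"@type": "Organization", "name": "company"},
    "telephone": "phone number",
    "url": "http://website.com/",
}


def to_jsonld_script(data):
    """Serialize data as JSON and wrap it in a JSON-LD script element."""
    payload = json.dumps(data, indent=2)
    # Escape "</" so that no string value can close the script element early.
    # "\/" is a valid JSON escape for "/", so the data itself is unchanged.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


snippet = to_jsonld_script(person)
print(snippet)
```

Because the annotation is kept separate from the visible markup, the resulting script element can be pasted into the page without touching the existing tags.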

The project Schema.org

Initiated by market leaders Google, Bing, Yahoo!, and Yandex, the collaborative community Schema.org sets out to standardize the semantic annotation of website content. On its website, users will find a uniform set of schemas for structured data. Schema.org supports the data formats RDFa, Microdata, and JSON-LD.

Tip: testing structured data with Google

Labeling HTML documents through semantic annotation requires a high level of care. Mistakes are best avoided by extending a page's source code step by step and validating the tags as you go along. For this, Google provides a free structured data testing tool. Here, site owners are able to check individual code excerpts or enter the URL of a web page to check the source code for errors. The search engine giant also offers a tool, Data Highlighter, which lets users tag data directly on a web page in the browser. Relevant areas are marked with the mouse and then provided with a keyword. This method of semantic annotation doesn't place any labels in the source code. The tagged areas can only be read by Google, which can use them for additional display forms. Other search engines like Bing or Yahoo! have no way of gathering such content.
