A Deep Dive into Web Metadata

The web was born as a medium to share documents and other media resources online. To this day it is founded on HTML, a markup language that allows users to easily create content. The ease of writing HTML was (and is) one factor that made the web platform so popular.

Making content easy to create led to a proliferation of web sites. But it was difficult to find information on the web. In the early days, the problem was solved by creating catalogs of links on specific topics. This approach was manual, and it was soon replaced by search engines that used robots to automatically crawl and index content from the web to create a searchable database.

Search engines quickly became the dominant force on the web. They also grew commercially significant, driving traffic to companies selling their goods and services. Others saw this as an opportunity, and so the Search Engine Optimization industry was born. SEO is the practice of influencing search engines, through content editing and/or technical methods, so that your (or your clients') content ranks high on search engine results pages (SERPs).

Search Engines Make Sense of a Giant Mess

Finding and listing relevant content from the web is not a trivial task. You need vast amounts of data and advanced algorithms to find meaningful patterns in it. The web is also an incredibly messy way to store data. In most computer programming the syntax needs to be exactly right or the program won't run, but HTML works even if it's a little bit broken.

The robust nature of the web allowed for easy content creation, but made processing it difficult. One practical example is that Google completely ignores the language declared for an HTML document. This is because the declaration is often wrong, as a result of copy-pasted templates or a misconfigured CMS. Instead of relying on the declared value, the algorithm makes educated guesses: if the text reads like German it likely is German, even if the declaration says it's English.
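To make the mismatch concrete, here is a minimal Python sketch that reads the declared language of a page and compares it with a crude guess based on the text. It is not how Google detects language; the stopword lists, the sample page and the function names are all made up for illustration, and a real detector would rely on far more signal.

```python
from html.parser import HTMLParser

# Tiny illustrative stopword lists (an assumption for this sketch only).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "mit"},
}


class LangAndTextParser(HTMLParser):
    """Records the lang attribute of <html> and every word of text."""

    def __init__(self):
        super().__init__()
        self.declared_lang = None
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            self.declared_lang = dict(attrs).get("lang")

    def handle_data(self, data):
        # Lowercase and strip common punctuation so "ist," matches "ist".
        for word in data.lower().split():
            self.words.append(word.strip(".,;:!?"))


def guess_language(words):
    """Return the language whose stopwords occur most often, or None."""
    scores = {
        lang: sum(1 for word in words if word in stopwords)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical page: declared as English, written in German.
SAMPLE_PAGE = (
    '<html lang="en"><body>'
    "<p>Das ist ein Beispiel und der Text ist nicht auf Englisch.</p>"
    "</body></html>"
)

parser = LangAndTextParser()
parser.feed(SAMPLE_PAGE)
guessed_lang = guess_language(parser.words)

# Compare only the primary subtag, so "en-US" counts as "en".
declared_primary = (parser.declared_lang or "").split("-")[0].lower()
declaration_matches = declared_primary == guessed_lang
print(parser.declared_lang, guessed_lang, declaration_matches)
```

For the sample page the declaration and the guess disagree, which is exactly the situation where trusting the text over the declaration pays off.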

There is nothing stopping engineers and content creators from creating something more structured. HTML has always had structural elements like headings that you can use to create semantically relevant document structures. You can also add metadata to give meaning to the content and its relationships with other entities on the web. The tools do exist, and they're used to some extent.

Ultimately, the Open Web is stuck with an imperfect data model implementation compared to walled gardens controlled by companies like Facebook or Twitter. Luckily, this is less and less of an issue. Advancements in technology (like the Google language example above) can fill in the gaps where semantic data is missing or where the supplied data is simply wrong.

Metadata is Data About Data

While we might never get a fully semantic web where everything is related to everything and machines can process the data and its meaning with ease, there is still room for improvement. It is possible to annotate our messy web with some structured data that makes it easier to make sense of it all. Investing in metadata can also give a competitive advantage in the SEO arena.

Most people who have worked with the web or information management in general are familiar with the term metadata. It is information that can be used to give additional meaning to content. HTML head elements such as the title and the description meta tag are the most rudimentary form of it on the web, but more sophisticated formats like Open Graph and Twitter Cards have seen increased adoption.
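As a rough illustration of what this basic metadata looks like to a machine, the following Python sketch collects the title, classic meta tags, Open Graph properties and Twitter Card tags from a page using only the standard library. The sample page, its URLs and the helper names are hypothetical.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects the <title> text and <meta> name/property pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Classic and Twitter tags use name="...",
            # Open Graph uses property="...".
            key = attributes.get("name") or attributes.get("property")
            content = attributes.get("content")
            if key and content is not None:
                self.metadata[key] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data


def extract_metadata(html):
    """Parse an HTML string and return its metadata as a dictionary."""
    collector = MetaTagCollector()
    collector.feed(html)
    return collector.metadata


# Hypothetical page used only for this example.
SAMPLE_PAGE = """
<html><head>
<title>A Deep Dive into Web Metadata</title>
<meta name="description" content="How metadata helps search engines.">
<meta property="og:type" content="article">
<meta property="og:image" content="https://example.com/thumbnail.png">
<meta name="twitter:card" content="summary">
</head><body></body></html>
"""

page_metadata = extract_metadata(SAMPLE_PAGE)
for key, value in page_metadata.items():
    print(f"{key}: {value}")
```

The flat key and value pairs that come out show how small the vocabulary is: a type, a description, a thumbnail and little else.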

The above-mentioned formats are all still quite limited in their vocabulary. They can express some basic attributes like content type, description and maybe a thumbnail image. They are useful for basic content indexing as well as social media sharing, but lack depth. More expressive metadata formats, namely RDFa and microformats, have been around for even longer.

RDFa and microformats are attributes embedded into HTML elements that annotate them with metadata on things such as people, organizations, calendar events and relationships between these entities. Microformats are inlined into markup and as such are cumbersome to implement.
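The sketch below shows what inlined metadata looks like in practice. It uses a small contact card marked up with microformats2 class names (h-card, p-name, p-org, u-url) and a simplified Python parser that pulls the properties back out. The markup, the person and the parser are illustrative assumptions; a full microformats parser handles many more cases, including elements without closing tags.

```python
from html.parser import HTMLParser

# Hypothetical contact card annotated with microformats2 class names.
SAMPLE_CARD = """
<div class="h-card">
  <a class="p-name u-url" href="https://example.com">Jane Doe</a>
  <span class="p-org">Example Corp</span>
</div>
"""


class HCardParser(HTMLParser):
    """Simplified extractor for p-* (text) and u-* (URL) properties."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        # One entry per open element: the p-* names that element carries.
        self._open_elements = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        text_properties = [c[2:] for c in classes if c.startswith("p-")]
        self._open_elements.append(text_properties)
        for class_name in classes:
            # u-* properties take their value from the href attribute.
            if class_name.startswith("u-") and attributes.get("href"):
                self.properties[class_name[2:]] = attributes["href"]

    def handle_endtag(self, tag):
        if self._open_elements:
            self._open_elements.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Text belongs to every enclosing element that declared a p-* name.
        for names in self._open_elements:
            for name in names:
                previous = self.properties.get(name, "")
                self.properties[name] = (previous + " " + text).strip()


card_parser = HCardParser()
card_parser.feed(SAMPLE_CARD)
card = card_parser.properties
print(card)
```

Because the annotations live on the same elements that render the page, every template change risks breaking the metadata, which is a large part of why this approach is cumbersome.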

Microformats have been largely replaced by JSON-LD as the de facto rich metadata standard for the web. The name is constructed from two acronyms describing the how and the what:

  • JSON: JavaScript Object Notation
  • LD: Linked Data

Instead of embedding metadata within the HTML markup, JSON-LD is written as a single blob, typically in the head section of the page, inside a script element of type application/ld+json. The blob is a JSON fragment that describes the content using the vocabulary from Schema.org. Schema.org is an actively developed hierarchical vocabulary of different types of entities and activities. See the full list of Schema.org classes: Full Hierarchy on Schema.org
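A minimal sketch of such a blob, generated with Python's standard json module, could look like this. Article, Person, headline, author and datePublished are Schema.org terms; the author name, the publication date and the two helper functions are hypothetical and exist only for the example.

```python
import json


def build_article_jsonld(headline, author_name, date_published):
    """Return a Schema.org Article description as a plain dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name},
        "datePublished": date_published,
    }


def to_script_tag(data):
    """Wrap a JSON-LD dictionary in the script element used to embed it."""
    payload = json.dumps(data, indent=2)
    # Escape "</" so a string value cannot close the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


article = build_article_jsonld(
    headline="A Deep Dive into Web Metadata",
    author_name="Jane Doe",  # hypothetical author
    date_published="2020-05-01",  # hypothetical date
)
script_tag = to_script_tag(article)
print(script_tag)
```

Since the metadata is plain data rather than attributes scattered through the markup, it can be generated from the content model in one place and leaves the page templates untouched.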

Conclusion

Metadata remains an important topic on the web. It has evolved from being abused for keyword stuffing as a black hat SEO technique to being a useful way to help users discover your content and services. It is also a continuously evolving area, so you can't define best practices for which metadata to include on your site and expect them to stay relevant for very long.

Some proprietary metadata tags can become irrelevant over time as the popularity of a platform wanes, such as when Google Plus was shut down or as Myspace lost its appeal. So you should never rest on your laurels when it comes to metadata. Instead, review your metadata periodically and make the necessary adjustments. Every six months or so sounds like a good cadence for doing this.

That said, the combination of the JSON-LD and Schema.org specifications seems like a future-proof platform that has legs for a marathon. At the time of writing in May 2020, Google supports the following entity types from the spec: articles, book reviews, datasets, events, job postings, local businesses, movies, products, Q&As, restaurants, software applications and TV episodes. It is a long list, but it does not cover the full breadth of the vocabulary available today on Schema.org.

What likely lasts longer than the markup of your metadata is where it is stored. Having a robust and capable tool that can handle the management of metadata is key to long-term success. Learn more about the content engine and other capabilities that help you build rich semantic relationships between entities by downloading our free eBook on Digital Experience Platforms.

eZ Platform is now Ibexa DXP

Ibexa DXP was announced in October 2020. It replaces the eZ Platform brand name, but behind the scenes it is an evolution of the same technology. Read the Ibexa DXP v3.2 announcement blog post to learn all about our new product family: Ibexa Content, Ibexa Experience and Ibexa Commerce.
