Basic Vocabulary for schema.org and Structured Data

by Aaron Bradley on November 5, 2013

in Semantic Web, SEO

schema.org was launched in June 2011, and for many search marketers this launch introduced them not only to the initiative itself, but to the world of structured data and its accompanying terminology.

The vocabulary surrounding schema.org is old hat for semantic web developers, but for the rest of us the terminology associated with the world of structured data is often confusing, frustratingly nuanced or downright opaque.

That SEOs and others have mangled words when discussing schema.org and related technologies is entirely understandable. I've had a consuming interest in "semantic SEO" for the past five years, and I still encounter plenty of difficulty in expressing myself correctly when discussing matters related to schema.org.

But having made the effort to get a better grasp of the terminology used in this realm, I can assure you that making this effort yourself will result in a better understanding of the technologies you're referencing.

In that spirit I've laid out here as best I can some of the core vocabulary related to schema.org: I hope you find it helpful, and I welcome any corrections or suggestions on how the definitions and examples provided can be improved.

General vocabulary for structured data markup

schema.org is derived from and/or associated with a number of technologies that are described here.

schema.org

The shortest possible definition: schema.org is a vocabulary.

A slightly more extended and probably more useful definition is that schema.org is a vocabulary that supports the markup of structured data in HTML documents.

The schema.org site describes it this way:

Schema.org provides a collection of shared vocabularies webmasters can use to mark up their pages in ways that can be understood by the major search engines: Google, Microsoft, Yandex and Yahoo!

I think the above definition from the "getting started" page is a bit clearer than the home page description (or at least provides context to better understand that definition) which calls schema.org "a collection of schemas, i.e., html tags, that webmasters can use to markup their pages in ways recognized by major search providers."

All of this isn't particularly helpful, of course, without knowing what is meant by "a vocabulary." While it's not incorrect to think of schema.org being a vocabulary in the colloquial sense – "the body of words used in a particular language" (if one thinks of schema.org as being "a language," and of its types, properties and enumerations as being its "body of words"), the term "vocabulary" has a more specific meaning in the semantic web world. Under the heading "what is a vocabulary?" W3C has this to say:

On the Semantic Web, vocabularies define the concepts and relationships (also referred to as "terms") used to describe and represent an area of concern. Vocabularies are used to classify the terms that can be used in a particular application, characterize possible relationships, and define possible constraints on using those terms. In practice, vocabularies can be very complex (with several thousands of terms) or very simple (describing one or two concepts only).

For those of you with a working knowledge of ontologies, the W3C entry goes on to say that there's "no clear division between what is referred to as 'vocabularies' and 'ontologies'. The trend is to use the word 'ontology' for more complex, and possibly quite formal collection of terms, whereas 'vocabulary' is used when such strict formalism is not necessarily used or only in a very loose sense." This explanation may help clear up any confusion when you find "schema.org" and "ontology" used in the same sentence (for an in-depth discussion of the differences between taxonomies, vocabularies, ontologies, thesauri and more, I recommend Heather Hedden's excellent book "The Accidental Taxonomist").

Finally, in case you're wondering how the word "schema" fits into schema.org, one can lean on the definition of an XML schema (which will be more accessible to most than RDF Schema), which Wikipedia describes as "a way to define the structure, content, and to some extent, the semantics of XML documents." In general, think of a schema as being "a set of rules and definitions."

With this nomenclature in hand you're hopefully better equipped to understand what schema.org means when it declares itself to be "a collection of schemas." On this topic schema.org's Dan Brickley provides some excellent insight into schema.org, and into the difference between XML and RDF-based schemas:

Sometimes we talk like schema.org is one big schema; sometimes as if it were several. This is because it has an associative, network structure. You can see similar ambiguity about how other networks are discussed.

The word 'vocabulary' emphasises description and communication. The word 'schema' emphasises data structures, databases. Unlike XML schemas, RDF-based schemas are closer to dictionaries than to grammar rules. They document the meaning and inter-relationship of descriptive terms rather than police strongly how you must use them.

A couple of things that schema.org is not, since these descriptions keep popping up. It is not a microformat (or a "micro format"). It is not microdata (or "micro data"). It is not "a markup" but it is, in schema.org's own words "a markup vocabulary" (a Wikidata note correctly describes it as – emphasis mine – "a project to improve general Web page markup through the use of structured data").

data-vocabulary.org

The shortest possible definition: data-vocabulary.org is a vocabulary (which will surprise exactly no-one).

If you were to visit www.data-vocabulary.org today, you would discover most of its real estate is given over to links to schema.org, which in itself tells you a lot about its place in the structured data world.

In many ways data-vocabulary.org is the predecessor to schema.org (and schema.org has certainly superseded it), and prior to schema.org was the primary vocabulary used for marking up HTML documents with microdata, though it can be employed with other markup syntaxes (Mark Pilgrim uses it for all his namespace examples in his chapter on microdata in his classic work "Dive into HTML5," and it was the sole vocabulary referenced by Google when they announced microdata support for rich snippets in March 2010).

For those that care, data-vocabulary.org was Google's predecessor to schema.org, whereas the latter vocabulary is a collaborative initiative backed by Google, Bing, Yahoo and Yandex (you can read a contemporary take-down of data-vocabulary.org from Ian Davis written mere days after the domain popped into existence).

I'll leave data-vocabulary.org at that, as the working semantologist (yes, I just made that word up) will have little reason to use it in the schema.org era. However, I've included it on this list both because it is still widely deployed, and because (at time of writing) many Google Webmaster Tools help pages on rich snippets still use it in their examples.

Microformats

The shortest possible definition: a microformat is a vocabulary and markup syntax.

A slightly more extended, probably more useful and certainly more accurate definition is that a microformat is a method of adding semantic information to an HTML document using a prescribed markup structure that relies on existing HTML attributes.

The microformats.org site describes it this way:

Microformats are simple ways to add information to a web page using mostly the class attribute (although sometimes the id, title, rel or rev attributes too). The class names are semantically rich and describe the data they encapsulate.

The microformats wiki on microformats.org provides an alternate definition that focuses more on the functional usefulness of microformats:

microformats are HTML for marking up people, organizations, events, locations, blog posts, products, reviews, resumes, recipes etc. Sites use microformats to publish a standard API that is consumed and used by search engines, browsers, and other sites.

And according to Wikipedia:

A microformat (sometimes abbreviated μF) is a web-based approach to semantic markup which seeks to re-use existing HTML/XHTML tags to convey metadata and other attributes in web pages and other contexts that support (X)HTML, such as RSS.

Microformats share a lot in common with vocabularies like schema.org and data-vocabulary.org: for example the microformat schema hReview, like the schema.org schema Review, has properties associated with the schema in question, and some of these property values are expected to be of a certain type (both the hReview property dtreviewed and the schema.org/Review property dateCreated, for example, expect a date in ISO 8601 format).
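
To make the comparison concrete, here is a rough sketch of the same short review marked up first as an hReview microformat and then as a schema.org Review in microdata. The product, reviewer and review text are invented for illustration, and only a handful of the available properties are shown.

```html
<!-- hReview: the meaning is carried by prescribed class names -->
<div class="hreview">
  <span class="item"><span class="fn">Acme Anvil</span></span>
  reviewed by
  <span class="reviewer vcard"><span class="fn">Jane Doe</span></span>
  on <abbr class="dtreviewed" title="2013-11-05">November 5, 2013</abbr>:
  <span class="description">Sturdy and reliable.</span>
</div>

<!-- schema.org Review in microdata: the same facts, expressed with
     the itemscope, itemtype and itemprop attributes -->
<div itemscope itemtype="http://schema.org/Review">
  <span itemprop="itemReviewed" itemscope itemtype="http://schema.org/Product">
    <span itemprop="name">Acme Anvil</span>
  </span>
  reviewed by
  <span itemprop="author" itemscope itemtype="http://schema.org/Person">
    <span itemprop="name">Jane Doe</span>
  </span>
  on <time itemprop="dateCreated" datetime="2013-11-05">November 5, 2013</time>:
  <span itemprop="reviewBody">Sturdy and reliable.</span>
</div>
```

In both versions the review date is supplied in ISO 8601 format (2013-11-05), while the visible text remains human-friendly.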

Microformats differ from vocabularies like schema.org in numerous important ways, however.

Any single microformat has one or more required properties that must be encoded with a value; schema.org data consumers (like Google) may require one or more properties for a given item type, but schema.org itself has no such prescribed properties.

Microformats also lack the hierarchical structure of schema.org. A schema.org item may be an instance of a more specific type, and inherits properties from the parent type (schema.org/Review, for example, is a more specific type of schema.org/CreativeWork), whereas each individual microformat is (in this respect) a standalone schema.

Probably the biggest difference between a microformat schema and a schema.org type is that microformats rely on the use of prescribed HTML (chiefly class attribute values), whereas schema.org information can be expressed using a number of compatible encoding mechanisms, including microdata, RDFa and JSON-LD (at least for microformats as opposed to microformats 2). In this respect microformats cannot be compared to similar vocabularies (microformats are more "a collection of vocabularies"), but to other structured data syntaxes. You'll find just such a comparison (and an exhaustive one) laid out in Manu Sporny's excellent blog post "An Uber-comparison of RDFa, Microdata and Microformats".

For the record, yes, there are syntactical aspects to vocabularies like schema.org, and elements in RDFa and microdata may in themselves provide semantically meaningful information, but "meanings" and "rules" are intrinsically bound in microformats.

Microformats are also more limited in scope than broader vocabularies like schema.org. Whereas there are a dozen or so core microformats, schema.org's hierarchical structure supports a huge number of individual types. And while one microformat may sometimes be nested in another, the relatively small topic breadth of microformats similarly limits the breadth of relationships that may be expressed structurally between objects marked up with microformats.

In aggregate, microformats are less extensible than schema.org (there is no formal mechanism for creating ad hoc extensions for microformats equivalent to the extension mechanism for schema.org), and it's unlikely we'll see much in the way of new microformat schemas being developed (check out what's involved in creating a new microformat).

Microdata

The shortest possible definition: microdata is an HTML5 markup syntax for structured data. (Though only fully understandable in the context of the statement, schema.org actually trumps this definition for brevity when it describes microdata as a "set of tags.")

A slightly more extended and probably more useful definition is that microdata is an HTML5 specification that supports the markup of structured data in HTML documents.

Google describes microdata as "a way to label content to describe a specific type of information," and along those same lines Wikipedia calls microdata "a WHATWG HTML specification used to nest metadata within existing content on web pages."

For the more technically-inclined, W3C has this to say about microdata:

Microdata allows nested groups of name-value pairs to be added to documents, in parallel with the existing content. […] At a high level, microdata consists of a group of name-value pairs. The groups are called items, and each name-value pair is a property. Items and properties are represented by regular elements.

This is in line with Mark Pilgrim's concise but extremely accurate definition: "Microdata annotates the DOM with scoped name/value pairs from custom vocabularies." (Follow the link for Pilgrim's minute dissection of this sentence.)

If you're familiar with RDF (more on that below) you might find Kingsley Idehen's definition the most insightful (slightly reformatted as the original was provided in Turtle!):

An HTML5 based Notation for constructing structured data islands within HTML5 documents. These structured data islands are Entity -> Attribute -> Value based and compatible with basic the RDF model's abstract Subject -> Predicate -> Object syntax. Basically, you can easily produce RDF and RDF based Linked Data from this form of structured data.

From a functional, schema.org perspective, one can simply (and accurately) say that microdata is one of the two primary methods used to add inline schema.org information to web pages (the other being RDFa).

RDF

RDF is an acronym for the Resource Description Framework, and is – in the words of W3C – "a standard model for data interchange on the Web." W3C goes on to say:

RDF extends the linking structure of the Web to use URIs to name the relationship between things as well as the two ends of the link (this is usually referred to as a "triple"). Using this simple model, it allows structured and semi-structured data to be mixed, exposed, and shared across different applications.

In the RDF Primer W3C provides an even more basic definition, calling it "a language for representing information about resources in the World Wide Web." Elsewhere in this document you'll find an excellent overview of the RDF model, including a diagram of a simple triple (subject -> predicate -> object).

And that's all the semantic SEO practitioner really needs to know about RDF; in fact you can add excellent schema.org information to HTML documents without knowing anything about RDF. The reason I bring it up at all is to provide context for the following discussion of RDFa.

RDFa and RDFa Lite

The shortest possible definition of RDFa: RDFa is an RDF markup syntax for structured data.

A slightly more extended and probably more useful definition is that RDFa is a method of adding semantic information to an HTML document concretely based on the Resource Description Framework (the acronym "RDFa" itself stands for "the Resource Description Framework in attributes").

Google calls RDFa "a way to label content to describe a specific type of information, such as a restaurant review, an event, a person, or a product listing." Along these same lines W3C's RDFa 1.1 Primer (2nd ed.) says of RDFa:

Using a few simple HTML attributes, authors can mark up human-readable data with machine-readable indicators for browsers and other programs to interpret. A web page can include markup for items as simple as the title of an article, or as complex as a user's complete social network.

If RDFa is beginning to sound a lot like microdata, it's with good reason: RDFa is another, similar means of marking up HTML documents with structured data, including schema.org information. The primary conceptual difference is that RDFa can be thought of as an RDF syntax, whereas microdata is not, but can be used to extract RDF (thanks, Gregg Kellogg).

RDFa Lite, as the name suggests, is "a minimal subset of RDFa." The RDFa Lite 1.1 W3C Recommendation goes on to say:

The full RDFa syntax … provides a number of basic and advanced features that enable authors to express fairly complex structured data, such as relationships among people, places, and events in an HTML or XML document. Some of these advanced features may make it difficult for authors, who may not be experts in structured data, to use RDFa. This lighter version of RDFa is a gentler introduction to the world of structured data, intended for authors that want to express fairly simple data in their web pages. The goal is to provide a minimal subset that is easy to learn and will work for 80% of authors doing simple data markup.

While this may not always seem to be the case, schema.org was indeed (at least in part) created for those "who may not be experts in structured data," and accordingly RDFa Lite will suffice for the vast majority of webmasters who want to mark up schema.org information in HTML documents.
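
As a rough illustration of how little is involved, RDFa Lite 1.1 consists of just five attributes: vocab, typeof, property, resource and prefix. The sketch below uses four of them. The example.org identifier is a made-up placeholder, and the statements listed in the trailing comment are written informally, as what an RDFa processor would be expected to extract rather than as the output of any particular tool.

```html
<!-- vocab:    the default vocabulary for everything inside this element
     resource: an identifier for the item being described
     typeof:   the item's type
     property: a property of the item -->
<div vocab="http://schema.org/" resource="http://example.org/people/einstein" typeof="Person">
  <span property="name">Albert Einstein</span> was born in
  <span property="birthPlace" typeof="Place">
    <span property="name">Ulm</span>
  </span>.
</div>

<!-- Roughly, the statements expressed (subject, predicate, object):
     (example.org/people/einstein, type,       schema.org/Person)
     (example.org/people/einstein, name,       "Albert Einstein")
     (example.org/people/einstein, birthPlace, an unnamed Place)
     (that unnamed Place,          type,       schema.org/Place)
     (that unnamed Place,          name,       "Ulm")
-->
```

Each statement in the comment is a triple of the kind described in the RDF section above, which is what it means to say that RDFa is an RDF syntax.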

In a section titled "Why microdata? Why not RDFa or microformats?" schema.org says of RDFa: "RDFa is extensible and very expressive, but the substantial complexity of the language has contributed to slower adoption." The development of RDFa Lite since the release of schema.org, and RDFa Lite's subsequent addition as a W3C Recommendation, has blunted the complexity argument. And whereas the data model discussion on schema.org initially referenced only RDFa, the site now says "Our use of Microdata maps easily into RDFa Lite" rather than "into RDFa 1.1," and the accompanying markup example now uses RDFa Lite rather than RDFa 1.1 syntax.

JSON-LD

The shortest possible definition: JSON-LD is "a JSON-based format to serialize Linked Data."

Short and sweet, but only meaningful if you have an appreciation for what JSON is all about. Wikipedia provides this explanation:

JSON, or JavaScript Object Notation, is an open standard format that uses human-readable text to transmit data objects consisting of key:value pairs. It is used primarily to transmit data between a server and web application, as an alternative to XML.

In other words (putting the two parts together) JSON-LD is JavaScript Object Notation – Linked Data. In the schema.org context, it is a way of exchanging data "in pure JSON or as JSON within HTML" as an alternative to using markup attributes in HTML.

I won't dwell too much on JSON-LD, both because schema.org's support of JSON-LD as a recommended format for the vocabulary is relatively new, and because most webmasters using schema.org will have more interest in the attribute-based markup methods described above (microdata and RDFa).

But I'll close out the JSON-LD section with another concise technical definition from Kingsley Idehen:

A JSON based Notation for constructing RDF model and abstract syntax compatible Structured Data and/or Linked Data aimed at Javascript developers. You can also embedd JSON-LD based structured data islands in HTML documents using the <script/> tag.

Vocabulary relevant to schema.org

In this section I discuss terms closely associated with schema.org (although none of the terms defined are exclusive to schema.org).

Item

The item is the thing being described using schema.org, which may be an entire HTML document, a section of a page or even an individual element.

"Item" isn't intrinsic to schema.org except insofar as one is describing a "thing" using schema.org, and that "thing" needs to be demarcated. Both conceptually and semantically "item" is derived from the microdata element itemscope, which is used to declare the scope of the vocabulary (schema.org) used.

In the code below, the microdata itemscope attribute indicates that everything falling between <div> and </div> is within the scope of the item being described:

<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Albert Einstein</span>
</div>

"The itemscope property is redundant in RDFa Lite and is thus unnecessary," and is not applicable to JSON-LD.

Type

The type is the kind of item being described, and specifies the schema.org URI that corresponds to that type. A movie, a book, a product, an offer to sell something, a literary event and a library are all examples of schema.org types. Because (as per the section above) each item is associated with a type, a "type" is often referred to in schema.org as an "item type."

In microdata, the type of the item being described is declared using the itemtype attribute:

<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Albert Einstein</span>
</div>

In RDFa Lite, the vocab attribute is used to specify that the schema.org vocabulary is being used, and the schema.org type being described is declared using the typeof attribute:

<div vocab="http://schema.org/" typeof="Person">
  <span property="name">Albert Einstein</span>
</div>

In (Google-friendly) JSON-LD, @context is used to specify that the schema.org vocabulary is being used, and the schema.org type being described is declared using @type:

<script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@type": "Person",
  "name": "Albert Einstein"
}
</script>

Property

A schema.org property provides a specific piece of information about a predefined aspect of the item being described. For items marked up in HTML, the content of the enclosing element provides the property value. There is an expected type for each property value, such as text, a number or another type (expected types are discussed in greater detail below).

In the microdata code following, the item type is Article. The Article property being described is name. The value of the name property is All About Schema.org.

<div itemscope itemtype="http://schema.org/Article">
  <span itemprop="name">All About Schema.org</span>
</div>

If you were to run this code through Google's Structured Data Testing Tool, you would see the marked up item, the item type, the property of the type, and the value of that property represented visually:

A schema.org item, type, property and property value as displayed by the Google Structured Data Testing Tool

Each type has a fixed number of properties associated with it. Because schema.org's structure is hierarchical, each sub-type inherits the properties of its parent type.

Because Recipe is a more specific type of CreativeWork, which is in turn a more specific type of Thing, in marking up the item type Recipe one may use properties available for CreativeWork (such as aggregateRating) and Thing (such as name), along with properties specific to Recipe (such as cookTime).

Property inheritance in schema.org
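
As a sketch of what this inheritance looks like in markup, the hypothetical Recipe below (the dish, rating figures and cook time are invented) uses one property from each level of the hierarchy just described:

```html
<div itemscope itemtype="http://schema.org/Recipe">
  <!-- name: inherited from Thing -->
  <h2 itemprop="name">Banana Bread</h2>

  <!-- aggregateRating: inherited from CreativeWork -->
  <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
    Rated <span itemprop="ratingValue">4.5</span>
    by <span itemprop="reviewCount">12</span> reviewers
  </div>

  <!-- cookTime: specific to Recipe -->
  Cook time: <time itemprop="cookTime" datetime="PT1H">1 hour</time>
</div>
```

Nothing in the markup itself distinguishes inherited properties from type-specific ones: all three are declared with itemprop in exactly the same way.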

In microdata, each property is declared with the itemprop attribute, and the content of the element carrying that attribute is the property's value:

<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Albert Einstein</span>
</div>

In RDFa Lite, each property is declared with the property attribute, and the content of the element carrying that attribute is the property's value:

<div vocab="http://schema.org/" typeof="Person">
  <span property="name">Albert Einstein</span>
</div>

In (Google-friendly) JSON-LD, each colon-separated line following the type declaration (@type) is a property/value pair:

<script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@type": "Person",
  "name": "Albert Einstein"
}
</script>

Expected types and embedded items

Each property's value is expected to be of a certain type, such as a URL (e.g. "http://www.seoskeptic.com/"), a number (e.g. "17"), or a specific data type such as duration (e.g. "PT1H30M" – "1 1/2 hrs" in ISO 8601 duration format). Failure to provide the expected type for a property value may result in code validation errors, and may prevent the search engines from generating a rich snippet for the item displayed.
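
One practical wrinkle is that in microdata the value is not always the visible text. For a elements the value is taken from the href attribute, for time elements from datetime, and for meta elements from content, which lets the machine-readable value differ from what the reader sees. A minimal sketch, assuming a hypothetical Movie page (the title, URL and date are invented):

```html
<div itemscope itemtype="http://schema.org/Movie">
  <!-- Text: the value is the element's text content -->
  <h1 itemprop="name">An Example Movie</h1>

  <!-- URL: the value is taken from the href attribute -->
  <a itemprop="url" href="http://www.example.com/movies/an-example-movie">Official page</a>

  <!-- Duration: the value is taken from datetime (ISO 8601),
       not from the visible "1 1/2 hrs" -->
  Running time: <time itemprop="duration" datetime="PT1H30M">1 1/2 hrs</time>

  <!-- Date: supplied via meta when there is no visible text to wrap -->
  <meta itemprop="datePublished" content="2013-11-05">
</div>
```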

Sometimes the expected type for a property may be another schema.org type with its own set of properties. That is, a property may have another item with its own set of properties "embedded" or "nested" under that property.

For example, the expected type for the Article property copyrightHolder (inherited from CreativeWork) is Person or Organization. In the microdata code below, the value of copyrightHolder is an item of the type Organization; this Organization has the property name, and the value of name is "Acme Publishers":

<div itemscope itemtype="http://schema.org/Article">
  <p>Copyright <span itemprop="copyrightHolder" itemscope itemtype="http://schema.org/Organization"><span itemprop="name">Acme Publishers</span></span></p>
</div>

The Google Structured Data Testing Tool clearly shows how Organization and its property/value pair (name/Acme Publishers) are nested as an item under the Article property copyrightHolder:

An item embedded as the property of another item in schema.org

If you refashion this code with RDFa Lite …

<div vocab="http://schema.org/" typeof="Article">
  <p>Copyright <span property="copyrightHolder" typeof="Organization"><span property="name">Acme Publishers</span></span></p>
</div>

… and run it through RDFa / Play, the resulting visualization shows the nested structure even more clearly.

An item embedded as the property of another item in schema.org using RDFa Lite

Elsewhere on this site you can find other microdata code examples showing embedded items, as well as item embedding with RDFa Lite and JSON-LD.
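
For comparison, the same Article and its embedded Organization might be expressed in JSON-LD roughly as follows. This is a sketch rather than a tested rich snippet recipe: the nested object simply takes the place of the inner span elements used in the microdata and RDFa Lite versions.

```html
<script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@type": "Article",
  "copyrightHolder": {
    "@type": "Organization",
    "name": "Acme Publishers"
  }
}
</script>
```

Because the nesting is expressed by the structure of the JSON itself, there is no equivalent of itemscope to declare: the inner braces mark where the embedded item begins and ends.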

Common pitfalls

Hopefully the definitions and descriptions I've provided above have provided you with an improved understanding of the terminology surrounding schema.org.

While this post might not have made you an expert in these technologies, you're nevertheless better armed to speak (and write) knowledgeably about schema.org.

If you do aspire to be an expert in these matters (or at least appear to be an expert in these matters), then you're hopefully now less likely to make some common gaffes – detailed below – that occur with somewhat depressing frequency when SEOs talk about schema.org. Just say no!

"RDFa vs. schema.org"

I don't know why this keeps cropping up, but pitting RDFa against schema.org is very much a case of comparing apples and oranges.

As discussed above, RDFa is a markup syntax and schema.org is a vocabulary: the former is one of the mechanisms that can be used to encode the latter.

When someone makes this comparison, they almost certainly mean to compare RDFa and microdata, as microdata was the markup syntax originally recommended for schema.org, and currently all examples on the site use microdata (except the recently-added Action type hierarchy, where the examples are in JSON-LD).

It's valid to compare the relative merits of RDFa and microdata (although many of those familiar with this particular debate have had enough of it), but attempting to compare RDFa and schema.org is a fool's errand.

"Micro data" and "micro formats"

"Micro data" is (I guess) a tiny bit of data. A "micro format" is (I guess) a very small format. These two things are unrelated to microdata or microformats.

Think I'm being pedantic? Well, like it or not certain things have established names with a single acceptable way of spelling them, and microdata and microformats are among these things.

But if you're comfortable with the sentence "I took grand mother to the super market so she could pick up a new tooth brush" then by all means continue to speak of micro data and micro formats.

"Schema"

I get it: "schema" (or more often "Schema") is a shorthand way of referring to schema.org, and it's hardly a cardinal sin to omit the .org when it's clear from the context that you're speaking of schema.org.

However, keep in mind that schema.org is itself a collection of schemas, that there are schema elements used in microformats, that there is both an XML Schema and an RDF Schema, that schema has a very specific meaning in the field of psychology and that the transcendental schema plays a role in Immanuel Kant's architectonic system (I know, I know – everyone's well-acquainted with Kant's architectonic system).

All of this to say that you'll never appear less informed if you favor "schema.org" over "schema" when you're referring to … schema.org.

Acknowledgments

Thanks to all in the Semantic Search Marketing Community on Google+ for their help with this post, and in particular to Gregg Kellogg, Dan Brickley and Kingsley Idehen for their patient and extremely helpful responses to a number of questions I posed specifically for this article. And I'd largely be tongue-tied trying to talk about RDFa without the help of Manu Sporny.

Update (5 June 2014)

Phil Barker and Lorna M. Campbell of the Learning Resource Metadata Initiative (LRMI), Centre for Educational Technology, Interoperability and Standards (cetis), have published an overview of schema.org – "What is Schema.org?" – which is an excellent resource for those who want to dive a bit deeper into the vocabulary.

The figure below, "Some of the relationships around a Creative Work that may be described using schema.org", is excerpted from the document.

Some of the relationships around a Creative Work that may be described using schema.org - From the LRMI document What is Schema.org

1 David Deering November 5, 2013 at 2:35 pm

Another awesome post, Aaron. A lot of people, myself included, are still trying to get a firm grasp on all of the different terminologies related to structured data. So thanks for putting this detailed explanation together. I’m sure it will be a valuable resource to many for a long time to come.

2 Aaron Bradley November 5, 2013 at 2:45 pm

Thanks David!

3 Huang-Wei Chang March 15, 2014 at 2:39 pm

Wonderful introduction of this topic. Maybe the best I found on the web. Thanks a lot, Aaron!

4 Ismail Yusof October 1, 2014 at 2:29 am

Hi Aaron……just want to let you know that I have been researching on structured data markup and schema.org for a couple of weeks now and frankly, I’m getting more confused than ever before I started it. All the best SEO sites on the internet couldn’t explained it better than you did, even schema.org or Google themselves!
Thanks and have a great day!

5 Aaron Bradley October 1, 2014 at 12:38 pm

Thanks Ismail … really glad you found the post helpful!
