The rallying cry from the search engines to encourage the use of semantic markup has, so far, centered on rich snippets. Mark up your site with structured data, they say, and your search result snippet might be enhanced with information and graphics that will increase the number of searchers who click through from the results to your website.
But the benefits of employing structured data and associated semantic web technologies extend beyond rich snippet generation, and that's what I'll be exploring here. This post is an annotated version of slides I presented at SearchFest 2013 in Portland, Oregon. It was a fabulous conference, and I was honored by the invitation from Matthew Brown to appear there, and by the opportunity to present alongside Jeff Preston of Disney Interactive – thanks guys!
So what was that about rich snippets? Let's start with a little history.
Building the foundations: 1999-2005
The Dublin Core Metadata Initiative (DCMI, or simply "Dublin Core") was among the first efforts at developing metadata standards for describing both physical and online resources (1995). Microformats (2004) enabled semantic markup of particular types of information by leveraging the attributes of markup tags.
Specific protocols like XML sitemaps (2005) and product feeds like Google Base (2005) were among the earliest efforts by the search engines to better understand and index with structured data.
Underpinning all of these developments was the Resource Description Framework, or RDF (1999), which provided the general mechanism for describing resources and modeling information on the web. FOAF ("Friend of a Friend," 2000) built a simple but powerful ontology for describing persons on this framework. And RSS (originally "RDF Site Summary," 1999) – a feed format for describing frequently updated resources – rapidly became one of the most widely-employed XML applications on the web.
Rise of the vocabularies: 2005-2011
By mid-decade the search engines (and in particular Google) were producing rich snippets and displaying topical verticals in search results by parsing and structuring the content of specific sites and, to a certain extent, making sense of data provided by microformats.
But it was the introduction of RDFa (Resource Description Framework in attributes, 2007) and microdata in HTML5 (2005-2008) that really allowed the search engines to leverage structured data. These protocols provided a mechanism for marking up the visible content of web pages with structured data (without relying on class attributes), allowing rich metadata from external vocabularies to be embedded in machine-readable code.
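To make the mechanism concrete, here's a hypothetical snippet (not from the presentation) showing the same visible sentence marked up first with microdata and then with RDFa Lite. The schema.org types and properties are real; the page content is invented:

```html
<!-- Microdata: the visible text doubles as machine-readable data -->
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Smith</span> works for
  <span itemprop="worksFor" itemscope itemtype="http://schema.org/Organization">
    <span itemprop="name">Acme Corp</span>
  </span>.
</div>

<!-- The same statement expressed in RDFa Lite attributes -->
<div vocab="http://schema.org/" typeof="Person">
  <span property="name">Jane Smith</span> works for
  <span property="worksFor" typeof="Organization">
    <span property="name">Acme Corp</span>
  </span>.
</div>
```

Either way, the metadata wraps the content visitors actually see, rather than hiding in separate meta tags.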
GoodRelations (2008) and rNews (2011) leveraged the power of RDFa to build specialist vocabularies for ecommerce and news content, respectively. The introduction of schema.org (2011), though, provided webmasters with the first general-purpose set of schemas that were officially sanctioned by the search engines.
Freebase (2007) developed as a large collaborative knowledge base of metadata, and DBpedia (2007) evolved to extract structured content from Wikipedia. The Open Graph protocol (2010) grew out of the Facebook Platform, and enabled developers to integrate web pages into the social graph (a graph of the relationships between internet users). All of these initiatives were to play a role in the semantic evolution of search engines in the years ahead.
Consolidation and growth: 2011-present
Since 2011 the schema.org vocabulary has continued to grow, both through the addition of extensions built for schema.org, and by incorporating types and properties from rNews and GoodRelations. This has meant that an increasingly rich, search engine-sanctioned vocabulary is available to webmasters, even if not all schema.org types and properties currently result in the production of rich snippets in the search results, or other clearly demonstrable benefits for search visibility.
Google introduced its Knowledge Graph in 2012, which draws on a number of structured data sources to populate this knowledge base that currently resides alongside Google's main search results. Google's 2010 acquisition of Metaweb (developers of Freebase) turned out to be fundamental to the Knowledge Graph. Bing's own version of the Knowledge Graph, Bing Snapshots, references many of the same sources as the Knowledge Graph.
Just as Google leaned on structured data sources to build its knowledge graph, Facebook Platform technologies facilitated the release of Facebook Graph Search in 2013.
So the search engines are using structured data to improve their search results and to build complementary products, while the power of the social graph is allowing sites like Facebook to challenge the search engines with their own, socially-informed search products.
Specialty search and structured data
Along with reviews, recipes were among the earliest types of content to generate rich snippets in search results based on structured data markup. Building on this, and starting with recipes, Google has constructed a number of specialty search engines (or, depending on how you look at it, subsets of its main results) that allow searches to be restricted to specific topical domains.
Structured data markup is the price of admission to obtain a presence in these specialty engines, as the underlying markup for pages that appear in recipe search readily attests.
The fundamental piece of information that recipe markup provides to the search engines is the formal declaration "this resource is a recipe."
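In microdata, that formal declaration is made with a single itemtype attribute. A minimal, hypothetical recipe page might look like this (dish, times and instructions invented):

```html
<!-- The itemtype attribute declares: "this resource is a recipe" -->
<div itemscope itemtype="http://schema.org/Recipe">
  <h1 itemprop="name">Chicken Cacciatore</h1>
  <img itemprop="image" src="cacciatore.jpg" alt="Chicken cacciatore">
  By <span itemprop="author">Example Author</span>
  <!-- ISO 8601 duration in the datetime attribute, plain text for humans -->
  Prep time: <time itemprop="prepTime" datetime="PT20M">20 minutes</time>
  <div itemprop="recipeInstructions">
    Brown the chicken, add the tomatoes and peppers, and simmer…
  </div>
</div>
```

Everything after the itemtype declaration is enrichment; the type declaration itself is what admits the page to the recipe vertical.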
Does this information help the search engines return relevant resources when they judge that the intent for a particular query might be a recipe, even if a keyword trigger like "recipe" does not appear in the query? Judging from the similarity of the web search results for "chicken cacciatore" this might well be the case.
Even the image vertical for this query is dominated by results which might be informed by the use of structured data, the exception being a highly-optimized page containing only a highly-optimized picture of chicken cacciatore.
Application search and recipe lessons learned
As with recipe search, Google's application search relies on structured data markup to return results about, and only about, software applications. Obviously sites with one or more pages dedicated to software have the ability to improve their search visibility by using application markup.
[SIDEBAR] Google and the iTunes store
If the price of admission to a Google specialty search engine is structured data, what is a page from the iTunes store – which does not contain schema.org/MobileApplication markup – doing in the applications search results?
Google knows that any iTunes store app is a mobile application. And as the iTunes store is built on standard templates, it's straightforward for Google to parse and structure the data it finds there.
The lesson here is that when Google considers a source of information to be important enough to provide value to its search results, it will go to exceptional lengths to parse and structure unstructured or semi-structured data. It has previously done this with Yelp and pages with BazaarVoice reviews to generate review rich snippets in the SERPs, and has returned product and review rich snippets for Amazon, which is devoid of any Google-sanctioned structured data. The exceptional lengths to which Google will go to structure data provide a pretty strong clue about the value it places on it.
Web search and recipe lessons learned
Google application search is newer than recipe search, and the application schema is much newer than hRecipe, data-vocabulary.org/Recipe or schema.org/Recipe. Accordingly, adoption of application semantic markup has lagged behind that of recipe markup.
This relatively poor adoption means there's a competitive advantage available for those that choose to mark up software pages. Aside from the fact that this markup might generate rich snippets in the main SERPs, and that it is required to appear in application search, there is the possibility of the sort of knock-on effect from specialty search observed in the web recipe results. And if Google determines that a particular query's intent is software-related, there's no clearer signal that can be provided to Google here than saying unambiguously in the code that "this page is about a software application."
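A sketch of what that unambiguous signal might look like in schema.org microdata – the application name, rating and price below are invented for illustration:

```html
<div itemscope itemtype="http://schema.org/SoftwareApplication">
  <h1 itemprop="name">ExampleApp</h1>
  Runs on <span itemprop="operatingSystem">Android</span> ·
  Category: <span itemprop="applicationCategory">Game</span>
  <!-- Numeric rating data that could feed a rich snippet -->
  <div itemprop="aggregateRating" itemscope
       itemtype="http://schema.org/AggregateRating">
    Rated <span itemprop="ratingValue">4.5</span> by
    <span itemprop="ratingCount">120</span> users
  </div>
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    Price: $<span itemprop="price">1.99</span>
    <meta itemprop="priceCurrency" content="USD">
  </div>
</div>
```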
Web search and the Knowledge Graph
Google's Knowledge Graph has typically contained information derived from major trustworthy structured (or structure-able) sources like Wikipedia or IMDb. Recently, though, results for events have begun to appear for city queries.
Clicking through from the Knowledge Graph
While there is no specialty events engine where semantic markup is the price of admission, the results displayed when clicking through on Knowledge Graph event results suggest that Google favors structured data sources for these results, and might be relying on them in some form to populate the Knowledge Graph events vertical in the first place.
[SIDEBAR] The Google Data Highlighter
Around the time that the Knowledge Graph events vertical started to appear, Google introduced the Data Highlighter, which allows webmasters to visually match visible website information with properties supported by the Highlighter. In other words, it provides a mechanism to provide Google with structured information about a resource without marking it up in the code.
The first (and at time of writing, only) content type supported by the Highlighter? Events.
If the Knowledge Graph events vertical is in some way predicated on the presence of structured events data, this suggests that Google created the Highlighter to help grow the amount of structured data available to it about events. It stands to reason that Google doesn't want to throw out the baby with the bathwater by restricting Knowledge Graph event listings to sites that have marked up their events, and has introduced the Highlighter to ensure that unstructured pages about events still have a presence in the events vertical and its linked results.
It also stands to reason that webmasters concerned with search visibility will want to pay attention to any new types supported by the Highlighter, even if they don't actually employ the Highlighter; if, say, software applications are supported by the Highlighter, webmasters with the wherewithal to do so would be well advised to mark up their software pages.
[SIDEBAR] An event thought experiment
The conference at which these slides were presented, SearchFest, did not appear in a Knowledge Graph search vertical for a "portland" query. The SearchFest site did not feature event markup. While there are doubtlessly a number of factors that go into Google's determination of what should appear in the Knowledge Graph events vertical (such as, perhaps, the number of times an event is cited, or the authority of the sites on which a given event appears), might SearchFest have appeared in the "portland" events vertical if the SearchFest site had carried events markup?
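For illustration, the sort of event markup in question might have looked like the following hypothetical snippet (the date, venue and URL are placeholders, not the actual conference details):

```html
<div itemscope itemtype="http://schema.org/Event">
  <a itemprop="url" href="http://www.example.org/searchfest/">
    <span itemprop="name">SearchFest 2013</span>
  </a>
  <!-- Machine-readable ISO 8601 date alongside the human-readable text -->
  <time itemprop="startDate" datetime="2013-02-22">February 22, 2013</time>
  <span itemprop="location" itemscope itemtype="http://schema.org/Place">
    <span itemprop="name">Portland, Oregon</span>
  </span>
</div>
```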
I subsequently learned that a SearchFest organizer had used the Data Highlighter to provide event data about the conference, but – aside from not appearing in the events vertical – the effort did not result in a rich snippet. With so many unknowns and variables it's hard to say whether this is due to the lack of events markup or any number of other factors.
I can report, however, in regard to the slide above used to demonstrate the Google Data Highlighter, that some two weeks after using the Highlighter, this events rich snippet started to appear for the query "paste 2013":
No rich snippets and no specialty search?
There are myriad schema.org types that currently do not produce rich snippets, have no specialty search engine associated with them, and have no visibility in the Knowledge Graph. What's the point of adding schema.org RDFa or microdata markup for these types?
The history of recipes, applications, events, products and reviews suggests both that pages with semantic markup can suddenly make an enhanced appearance in the search results, and that the types most likely to appear there are related to the types of information that people frequently search or browse for on large, well-used, topically-specific sites.
So marking up pages with structured data today may prove to be a competitive advantage tomorrow, and that return on investment may come sooner rather than later for certain types of information. Absent a crystal ball, one can still make educated guesses about what some of the more important types might be.
- schema.org/JobPosting
Already in use at the US National Resource Directory, bringing job postings directly to the SERPs would obviously be useful to searchers (however much a full-fledged Google or Bing jobs specialty engine might present too politically charged a challenge to the likes of Monster or LinkedIn).
- schema.org/Organization, Person, Place, etc.
Named entities are the lifeblood of the Knowledge Graph and Bing Snapshots. And even if marking up named entities doesn't end up producing fireworks in the SERPs, I'd argue that any structured information provided to the search engines about entities helps them better understand those entities.
- schema.org/Product, schema.org/Offer
At this point it's remarkable to me how little the search engines have done with ecommerce information in the web results. Even if this is the result of a deliberate effort to protect product search (either to retain its monetary potential, or to promote the submission of account-linked and verified product feeds), sooner or later somebody is going to make hay of structured product and offer information.
[SIDEBAR] No feeds required
Not that long ago, the only way to provide reliable, detailed product and offer information to search engines (and other data consumers) was through XML product feeds.
Ecommerce-related structured data has changed this. Using schema.org, not only can ecommerce sites expose the same information in markup that is supplied in feeds, but – with GoodRelations integration – websites can now mark up more detailed information about their products, offers and services than is supported by product feeds.
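As a sketch, here's how a product page might expose feed-like offer details directly in schema.org microdata – the product, brand and prices below are invented:

```html
<div itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name">Example Toaster Oven</h1>
  Brand: <span itemprop="brand">ExampleBrand</span>
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    $<span itemprop="price">89.99</span>
    <meta itemprop="priceCurrency" content="USD">
    <!-- Availability expressed as a canonical schema.org URL -->
    <link itemprop="availability" href="http://schema.org/InStock">In stock
    <meta itemprop="priceValidUntil" content="2013-12-31">
  </div>
</div>
```

Everything a basic product feed carries – price, currency, availability, offer validity – is present here in the page itself.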
So there may be a future where product feeds are not required for sites to appear in product search (data cleanliness and veracity being the chief impediments to that right now). Might Bing – already trying to gain favor with consumers by publicly attacking Google for moving to paid product listings – circumvent the restrictiveness of product feeds altogether? Or might some other player do an end run around them both by creating their own product search engine?
[SIDEBAR] Structured markup and Google CSE
Sites employing Google Site Search (the paid version of Google Custom Search) can leverage almost any sort of structured data in their search results. Certain data can be used for filtering results; other types of data can be used in the sorting or biasing (ranking) of Site Search results.
No schema.org types that are relevant to you?
schema.org is a large general-purpose vocabulary, but it is not comprehensive. There are many types of information found on websites that cannot be described using schema.org.
If a website owner judges that there's potential value in doing so, they can create an extension to schema.org for their content. For example, there's no type for video games in the current schema, and so a site dedicated to video games might build an extension in order to be able to provide detailed semantic information about video games listed in their (publicly-consumable) code.
One could build such an extension, employ it on one's site and make internal use of the types and properties developed. One might further hope that other video game sites use the extension, and so extend its utility. For a search marketer, however, the wisdom of a decision to spend time and money on developing an extension ultimately has to hinge on improved search visibility.
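Under schema.org's slash-based extension mechanism, such a type could be hung off an existing type so that consumers which don't understand the extension can still fall back to the parent. A hypothetical sketch (the VideoGame extension type and the platform property are invented, not part of schema.org):

```html
<!-- Consumers that don't recognize the /VideoGame extension can still
     treat this item as a plain schema.org CreativeWork -->
<div itemscope itemtype="http://schema.org/CreativeWork/VideoGame">
  <h1 itemprop="name">Example Quest III</h1>
  <span itemprop="genre">Role-playing</span>
  <!-- "platform" is a hypothetical extension property -->
  <span itemprop="platform">PC</span>
</div>
```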
Different approaches to schema.org extensions
There are countless well-developed formal vocabularies and ontologies available on the web, covering a large number of specialist domains. Rather than extend schema.org by adding more types and properties to it, semantic web developers are more inclined to link schema.org markup to controlled vocabularies, standards and datasets where they exist. This makes a lot of sense, as it keeps the maintenance and development of these vocabularies singularly in the hands of the specialists in the domains to which they are related.
For search marketers, though, types and properties formally integrated into schema.org have a much better chance of eventually surfacing in search engine results. And while there still might be search benefits for a site employing external lists – insofar as the search engines might derive a better understanding of that site's content – the nebulous and unproven nature of this value proposition is unlikely to result in resources being committed to vocabulary building (at least where the impetus is SEO).
Accordingly, search marketers with a terrific use case for an extension might find the best route is to lobby for its inclusion in schema.org directly. This is more-or-less how MedicalEntity and its sub-types ended up in schema.org.
Proposed extensions to schema.org will have the greatest chance of success if the following steps are followed:
- Conduct research on vocabularies and datasets that may have already been developed in your topical domain, allowing you to build on them and to avoid unnecessary duplication of effort
- Work collaboratively with other leaders in your industry to develop and promote your extension
- Make your proposed extension publicly available and solicit feedback from interested parties, refining and improving your extension based on this feedback
- Formally propose adoption of your extension, citing need, use cases and the benefits to site owners, data consumers (both human and machine varieties) and search engines
The proposal from Jindrich Mynarz for a JobMarket extension to schema.org, featured below, is an excellent example of a well thought-out and executed extension strategy.
Structured data for structure’s sake
As I've already alluded to more than once, adding relevant, consistent, valid structured data to your site helps the search engines better understand the content and structure of your site, regardless of any special visibility you may or may not receive in the SERPs, such as rich snippets. Let me reiterate this with some bold text for extra emphasis: structured data helps the search engines better understand your site.
A breadcrumb rich snippet in the SERPs is hardly going to cause a doubling of your click-through rate from search. But what if the underlying code helps the search engines more clearly understand the hierarchy of your site and the relationship between pages? In the absence of the breadcrumbs I marked up with structured data, might Google nonetheless generate the ideal breadcrumbs, sitelinks and mini-sitelinks in the SERPs featured above? Perhaps. But given the results I don't feel in any way that this was wasted effort.
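For reference, breadcrumb markup of this era was typically expressed with data-vocabulary.org microdata, along these lines (page names and URLs invented):

```html
<!-- Each trail segment is its own Breadcrumb item; Google's breadcrumb
     support predates schema.org/BreadcrumbList -->
<div itemscope itemtype="http://data-vocabulary.org/Breadcrumb">
  <a href="http://www.example.com/" itemprop="url">
    <span itemprop="title">Home</span>
  </a> ›
</div>
<div itemscope itemtype="http://data-vocabulary.org/Breadcrumb">
  <a href="http://www.example.com/widgets/" itemprop="url">
    <span itemprop="title">Widgets</span>
  </a>
</div>
```

The visible trail stays exactly as users see it; the markup simply tells the engines which links constitute the site hierarchy.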
In delivering the SearchFest closing keynote, Bing's Duane Forrester urged webmasters to employ structured data, and said that doing so would not immediately boost your rankings, but would pay off in the long run. Why? Because structured data helps the search engines better understand your site.
Structured data for greater data fidelity
Well-executed structured data can help provide a more consistent and positive experience for website users, whether their exposure to your content is through search results, social media or third-party applications.
Employing structured data can help developers and optimization specialists see what underlies a resource through a data lens – that is, as related data points and data values, rather than disconnected widgets and disparate pieces of information. For example, a page's "like" button seen through this lens is not generically a widget for Facebook, but a mechanism to provide Facebook users and Open Graph consumers with precisely crafted information.
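That precisely crafted information is supplied through Open Graph meta tags in the page head – a minimal, hypothetical example (URLs and titles invented):

```html
<head prefix="og: http://ogp.me/ns#">
  <!-- These properties determine how the page appears when liked or shared -->
  <meta property="og:type" content="article">
  <meta property="og:title" content="Example Article Title">
  <meta property="og:url" content="http://www.example.com/article">
  <meta property="og:image" content="http://www.example.com/thumb.jpg">
</head>
```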
Data fidelity as a trust-building measure
Consistency of data is important for search engines. In particular, as this principle applies to structured data, search engines will trust your content more if they can see that your visible content aligns with the data provided in your markup.
This is, in fact, almost certainly one of the reasons that the search engines put their weight behind schema.org, which was designed for attribute-based markup, rather than opting for Open Graph-like invisible metadata (remember <meta> keywords?). And this is also the reason Google makes a point of advising against marking up non-visible content, except in situations where a very precise data type is required but is not available on a page, such as a numeric representation of review star ratings, or event durations in ISO 8601 format.
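Those sanctioned exceptions look something like this in microdata – visible content for humans, with meta and datetime attributes carrying the machine-precise values (ratings and dates invented):

```html
<!-- Stars rendered as an image; the meta tags supply the numeric values -->
<div itemscope itemtype="http://schema.org/AggregateRating">
  <img src="four-half-stars.png" alt="4.5 star rating">
  <meta itemprop="ratingValue" content="4.5">
  <meta itemprop="bestRating" content="5">
</div>

<!-- Human-readable date text; machine-readable ISO 8601 values -->
<div itemscope itemtype="http://schema.org/Event">
  <span itemprop="name">Example Concert</span>
  <time itemprop="startDate" datetime="2013-07-04T19:30">July 4, 7:30pm</time>
  <meta itemprop="duration" content="PT2H">
</div>
```

In both cases the hidden values merely restate, in a precise data type, information the visible content already conveys.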
An additional point of data reference for the search engines is XML product feeds. As in the case of the toaster oven pictured above, the search engines will accord greater trust to ecommerce pages if the product feed, markup and visible content are in sync.
If it looks like a duck, quacks like a duck and walks like a duck Google might feel reasonably confident that what they're looking at is a duck; if it looks like a duck, honks like a goose and does the Harlem Shuffle … not so much.
Paid inclusion is now a fact of life in Google Shopping, and Google simultaneously mandated clean, relevant, rich data inputs to product feeds – data that is verifiable by cross-referencing other sources. That means the following must all be consistent and in sync:
- Data on the webpage visible to humans
- On-page semantic markup
- Data in the product feed
Google’s mandate is strongly reminiscent of Semantic Web philosophy for dealing with data quality and provenance. As is the fact that Google uses some forms of Rich Snippets to expand its Knowledge Graph.
Provenance – knowing precisely from where a particular piece of data originated – is an important semantic web concept, and the W3C has been developing the PROV model to support the "inter-operable interchange of provenance information" on the web and elsewhere.
Where an SEO, in trying to assess whether a particular document is trustworthy, might look at the absence or presence of spam signals, a semantic web type is more liable to first ask the question, "where did it come from?"
Initial assessments of Google+ have tended to focus on its traction to date as a social network: how many users does it have, how engaged are its users, how much information is shared over the platform, and so on.
This glosses over what I think is the true strength and ultimately the great potential of Google+, namely, as an interlinked network of verified, canonical, named entities. The role that Google+ could potentially play in determining provenance is obvious. The combination of a Google+ Profile or Page and structured data means Google can connect the dots between a web resource and the entity that produced that resource.
[SIDEBAR] Provenance and a/Authorship
Much has been said about the role of authorship since the introduction of author rich snippets in Google search results, which – of course – are predicated on the existence of a Google+ Profile.
What I find interesting about these linkages is less the role that authorship ("Author Rank") might play in the display and ranking of search results (which might, indeed, turn out to be quite important), but the potential that the combination of structured data and verified Google+ Profiles and Pages might have in supporting complex queries based on linked data. I could, for example, state on a page I authored (and which Google can verify I authored) that I work for InfoMine, even though that information might not appear on my Google profile. From that Google could, with a high degree of confidence, include me in the results for the query "people that work for InfoMine."
Using semantic tools to improve site structure
As much as adding structured data may help improve your site's visibility in search, semantic web technologies can be leveraged to help improve your site in general ways, even if these improvements don't result in the creation or manipulation of structured data on your site.
Extracting and disambiguating entities using APIs is one such use of semantic web technologies. Categorizing topically identical content in two or more places because of entity variants may result in duplicate content, keyword cannibalization, or both. On a news site, for example, news stories pertaining to IBM might be listed either on a page about "IBM," a page about "International Business Machines," or both. By using an API to identify stories about IBM, all these stories can be associated with a single resource page, regardless of whether "IBM," "International Business Machines" or some other variant of the name appears in the content itself – as in this example of the IBM page on Wikipedia put through Calais.
Rich applications can be built on these technologies to help produce well-organized, meaningfully-linked, content-rich pages – all attributes that curry favor with the search engines regardless of the presence or absence of structured data. Zemanta, for example, has used semantic web technologies to provide bloggers with tags, multimedia resources, and relevant inline links based on a post's content.
[SIDEBAR] Linked (Open) Data and search
While the principles of linked data may not be directly applicable to SEO, there's certainly analogies that can be made between best practices in the two realms. While I won't dwell on this, most SEOs would benefit from getting better acquainted with linked data, and understanding its importance in the semantic web world. No time? Purchase the linked data coffee cup and keep it at your desk as a reminder.
Structured data and semantic architecture
"Once we get the site built we'll SEO it." Any seasoned search marketer has probably heard a variation on this declaration at least once in their career, and knows the folly it represents. While it's perfectly possible to mark up a site with structured data after it's been built, a site that's constructed with an eye to semantic structure will fare better in the long run than one that's not.
The BBC World Cup 2010 website did not have structured data applied to it, but was actually assembled with the aid of semantic web technologies. The result is a rock-solid resource that at once reliably meets human visitors' needs, and at the same time provides search engines with explicit, utterly unambiguous data (the BBC has taken a lead role in this sort of semantic architecture).
Fine for the megalith that is the BBC, but impractical for your site? Perhaps. But in the future expect more platforms and tools to emerge that can make this sort of architecture more accessible. From a data perspective, after all, what is a standard ecommerce site if not a collection of entities with associated properties, all of which may be fully described using existing syntaxes and vocabularies?