Deciphering Google's "Semantic Search" Intentions

by Aaron Bradley on March 19, 2012

in Search Engines, Semantic Web, SEO


In a Wall St. Journal article published recently, Google's Amit Singhal suggested that changes were afoot at the world's dominant search engine.

While the Journal piece might overstate the magnitude of possible changes ahead (it is article author Amir Efrati who calls this "a makeover" to Google's formula, not Singhal himself), and while much of the substance of what Singhal is reported as having said is not new, there are indications that a substantial retooling of Google's search technology is underway, and that the nature of these changes is related to Google's embrace of semantic web technologies.  As Efrati puts it, "the company is aiming to provide more relevant results by incorporating technology called 'semantic search,' which refers to the process of understanding the actual meaning of words."

What shape can we expect the nature of that retooling to take?  What changes might Google make to both better utilize semantic technologies and encourage their use by webmasters?  And what are the implications for search marketers and the SEO industry?

Provenance, Please!

In determining the relevance of thousands of resources, one of the most important things Google does is weeding out maliciously irrelevant resources: spam detection and filtering.  By "spam" I mean the whole gamut, from pages that deliberately misrepresent themselves (e.g. a page built to match the query "golf clubs" that redirects the user to an online poker site) to pages that try to exaggerate their actual relevance (e.g. an on-topic but keyword-stuffed page on golf clubs).  Google needs to pick the gems out of the goo, and this procedure is still a requirement when it comes to assessing the value of structured data.

You won't find much discussion about spam in the semantic web world.  It's not that the semantic web framework fails to account for the need to validate the veracity of documents, but that it takes a different tack.  At the top of the classic semantic web "layer cake" lies the "trust and proof" layer, and the chief mechanisms being worked on in this layer surround issues of provenance: the origin and chronology of a document.

The W3C Provenance Working Group recently published the third working draft of its provenance data model; the introductory paragraph of this document provides an excellent overview of provenance issues:

For the purpose of this specification, provenance is defined as a record that describes the people, institutions, entities, and activities, involved in producing, influencing, or delivering a piece of data or a thing in the world. In particular, the provenance of information is crucial in deciding whether information is to be trusted, how it should be integrated with other diverse information sources, and how to give credit to its originators when reusing it. In an open and inclusive environment such as the Web, users find information that is often contradictory or questionable: provenance can help those users to make trust judgments.

(See an earlier post on "trust and proof" if you're dying to know more about approaches to this topic in the semantic web world.)

So how is this relevant to Google and its "semantic search" intentions?

As stated above, Google is very much in the trust and proof business, and if provenance proves to be a useful method to help Google determine to what extent resources can be trusted, and to verify the source of those resources, they'll add provenance mechanisms to their toolkit.

Google has previously stuck its toe in the provenance water in the form of the Google News meta tags original-source and syndication-source.  Interestingly, Google News seems to have (without fanfare) dropped support for original-source (it is detailed in the original announcement, but no longer appears on the publisher metadata tags help page, having been "replaced" by the more limited standout tag).  That original-source was a bust makes sense, as it was a unilateral publisher declaration about which publishers could readily lie (and continued support for syndication-source makes sense in this context, as there's limited benefit to lying about not being the original source of a news story).
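For those who never used them, these were simple declarations in the page head; here's a rough sketch of the syntax (the URL is a placeholder for illustration):

  <!-- Asserting that this article was syndicated from the named original -->
  <meta name="syndication-source" content="http://example.com/original-story.html"/>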

Not so with Google's big entrance into the world of provenance: rel="author" and rel="publisher" markup, combined with Google+ profile pages.

What does Google have to say about your author profile (emphasis mine)?

A rich profile is not only a great way to share information with users, but it also gives Google information we need to better identify you as the author of web content.

Your publisher profile (emphasis mine)?

Linking your Google+ page and your site like this not only helps you build relationships with friends and followers, but also gives Google information we can use to determine the relevancy of your site to a user query in Google Web Search.

The combination of link markup and a profile page provides Google with something that unilateral publisher meta tags cannot: an identity verification mechanism.  Both author and publisher markup use Google+ profiles to validate the identity of an author or publisher, enabling Google to establish provenance for resources linked to verified sources.  (In all the hoopla about Google's success or failure so far in establishing Google+ as a social network, it's easy to overlook how the introduction of Google+ profile pages is an important structural improvement over previous Google profile pages.  There's now value for authors and publishers in having a correctly-linked Google+ page, even if Google+ profile owners never create or consume a Google+ post.)
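To make this concrete, here's a minimal sketch of the two declarations as they're typically deployed (the Google+ profile and page URLs are placeholders):

  <!-- On an article page: tie the content to the author's Google+ profile -->
  <a href="https://plus.google.com/112345678901234567890" rel="author">About the author</a>

  <!-- Site-wide, in the document head: tie the site to its Google+ page -->
  <link rel="publisher" href="https://plus.google.com/109876543210987654321"/>

The other half of the loop happens on Google+ itself, where the profile's "Contributor to" section (or the page's website link) points back at the site, closing the verification circle.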

If Google's "semantic search" plans include facilitating the propagation of more structured data, whether by encouraging structured markup or by other mechanisms, it makes sense that they're going to continue to lean on provenance measures to help make sense of it.  Being able to determine the provenance of resources makes it easier to produce something akin to a "author graph" where queries return pages based not just on the authority of the page or site, but the authority of the author as well.

Not just what you say, but who you are, will almost certainly play an increasingly important role in the search results.

Vocabularies Big and Small

In the biggest move to date by the search engines to facilitate a more semantic web, Google, Bing and Yahoo did two things when they unveiled schema.org:  they stipulated a preferred markup specification for structured data (microdata) and provided a vocabulary to use with this markup (schema.org).

The bulk of analysis surrounding the introduction of schema.org has focused on the markup standard endorsed by Google, and in particular on the relative merits of microdata compared to microformats and RDFa.  From a practical perspective I think the schema.org vocabulary itself is of more importance to publishers, and probably to Google too.  Publishers will obviously produce more uniform – and so more readily digestible – structured markup if they're provided with specific properties to apply to specific types of things.  And the more extensive that vocabulary, the greater the volume and topical breadth of structured data it allows for.

Right now, the schema.org vocabulary is extremely useful for marking up three types of data:

  1. Named entities (people, places, organizations, etc.)
  2. Media (information about types of web pages, images, videos, etc.)
  3. Things that are bought and sold (especially on the Internet)

For this last type, schema.org does a good job of providing information about products and offers in the abstract, without defining domain-specific properties of the things being bought and sold.  For example, it allows publishers to say very precise things about the price of a specific television, and about what consumers think of that television, but it doesn't provide a mechanism to classify that television by size, display resolution, or any other property specifically relevant to televisions.
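By way of illustration, here's roughly what that television markup looks like in schema.org microdata (the product name, rating and price are invented for the example):

  <div itemscope itemtype="http://schema.org/Product">
    <span itemprop="name">Acme 46-inch LCD Television</span>
    <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
      Rated <span itemprop="ratingValue">4.2</span>/5 based on
      <span itemprop="reviewCount">87</span> reviews
    </div>
    <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
      <span itemprop="price">$599.00</span>
      <link itemprop="availability" href="http://schema.org/InStock"/>In stock
    </div>
  </div>

Precise statements about price and reviews, but nothing in the core vocabulary with which to record screen size or display resolution: those details remain plain, unstructured text.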

Which is at least in part to say that schema.org is not useful for marking up information specifically relevant to:

  1. Any vegetable, mineral or animal that is not a named entity
  2. Concepts

To a certain degree this is undoubtedly by design, and is certainly in keeping with the thrust of semantic web technologies, which is to provide machine-digestible data about real-life objects in the world (objects that can be represented by URIs, if you want to get technical about it).  Certainly the bulk of what Singhal apparently conveyed to the Journal surrounded improved recognition and inclusion of entity-based information (and, further to my previous parenthetical comment, of "identifying information about specific entities referenced" on web pages – which works hand-in-hand with linked URIs).

However, the consequences of a limited vocabulary are, well, limited information.  The best that a site about televisions that doesn't actually sell them can currently offer Google using schema.org is information about the web pages that house television information, but – again – no domain-specific information about the properties of any of those televisions.  As I've often contended, structured markup will not be truly useful to content producers until it allows them to accurately describe a cat video.  Using schema.org the properties of the video itself may indeed be accurately described, but – aside from declaring the subject of the video to be the entity "Mittens" – the markup cannot provide any structured information about the adorable feline.
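A minimal sketch of that cat video in schema.org microdata makes the gap obvious (the names and URL are made up):

  <div itemscope itemtype="http://schema.org/VideoObject">
    <span itemprop="name">Mittens Attacks the Laser Pointer</span>
    <meta itemprop="duration" content="PT1M33S"/>
    <link itemprop="contentUrl" href="http://example.com/videos/mittens-laser.mp4"/>
    <span itemprop="description">Mittens chases a laser pointer around the living room.</span>
    <!-- The best we can do for the cat herself: name her as the video's subject -->
    <span itemprop="about" itemscope itemtype="http://schema.org/Thing">
      <span itemprop="name">Mittens</span>
    </span>
  </div>

The video's name, duration and description are all structured; Mittens' breed, age and colour are nowhere to be found, because the vocabulary has no properties for them.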

Among those actively building and finessing schema.org you'll find lively discussions about which new types should be added to the vocabulary, and about the limits of extending it (see the W3C vocabularies mailing list).  Semantic web types mostly caution against efforts to build all-inclusive vocabularies, and against the foolhardiness of pursuing an "ontology of everything" – pointing out, sensibly, that publishers should reuse domain-specific vocabularies, linking vocabularies together when necessary.

This has a great deal of merit technologically and is sensible conceptually, but it leaves the everyday webmaster (even a very technically adept one) at a loss if he or she wants to express, in a structured way, some specific property of a topic not covered by schema.org.  However inelegant and monstrous it might be, building an "ontology of everything" – or at least a vastly expanded (and more readily extensible) schema.org vocabulary – might be in Google's best interest, as a way to facilitate and promote the production of the structured data that improves its search capability and results.

The point is, from a taxonomic perspective, if Google hopes to exploit the benefits of classified data it will be in its best interest to support that classification by building or extending vocabularies.  They've been doing this on an ad hoc basis since the introduction of schema.org (June 2011), extending it to include sports (Aug. 2011) and then software applications (Sept. 2011).

Entities Rule!

What I've suggested above about Google extending vocabulary support to cover a broader range of non-entity types is highly speculative.  That entities will play a pivotal role in Google's "semantic search" is much less so.

In an interview with Mashable, Singhal stressed – just as he subsequently did in the Journal interview – that entities will play a critical role in the road ahead for Google:

Google is "building a huge, in-house understanding of what an entity is and a repository of what entities are in the world and what should you know about those entities," said Singhal.

In 2010, Google purchased Metaweb, developer of Freebase, a community-built knowledge base packed with some 12 million canonical entities. Twelve million is a good start, but Google has, according to Singhal, invested dramatically to "build a huge knowledge graph of interconnected entities and their attributes."

Apparently the work on entities has been continuing at a fevered pace.  According to the Journal:

Mr. Singhal said Google and the Metaweb team, which then numbered around 50 software engineers, have since expanded the size of the index to more than 200 million entities, partly by developing "extraction algorithms," or mathematical formulas that can organize data scattered across the Web.

Bill Slawski has argued – I think convincingly – that what some observers have classified as "brand bias" is in fact Google endeavoring to make sense of queries as they might be related to named entities, and returning information relevant to an identified brand whenever it can.  (Slawski recently named Google's Entity Detection patent as one of "the 10 most important SEO patents"; see the end of his article on the patent for more resources about Google and entities.)

Between Google's filings of entity-related patents, its 2010 purchase of Metaweb, the introduction of schema.org (supporting the structured markup of named entities) and Singhal's own unambiguous statements, it is clear that entities will play an important role in Google's "semantic search" initiative.

Does Retooling Mean More Tools?

Whether Google continues to build out from schema.org or introduces an entirely different knowledge organization scheme, the best vocabulary in the world isn't going to help if nobody uses it.  While it's not rocket science for developers, even relatively code-savvy webmasters and SEOs (these days, often the ones actually marking up HTML with microdata) find it difficult to get microdata right.  How might Google make it easier for webmasters to use structured markup?

One of the reasons often cited for the slow adoption of semantic web technologies is a paucity of useful, relatively easy-to-use tools.  Certainly there's very little in the way of microdata authoring tools, and the most widely-used content management systems (Drupal excepted) lack native microdata support.  Were Google to introduce authoring tools that made structured markup easy, that would probably go a long way toward improving adoption, with the added benefit that the markup would be uniform and syntactically sound.

Are "Direct Answers" (Or Much Else Here) New?

Many commentators in the search marketing industry have remarked that a "semantic" Google might answer more user queries directly in the search results, obviating the need for searchers to visit a site for that information, and so depriving publishers of (search-derived) traffic.  The Journal recounts:

Over the next few months, Google's search engine will begin spitting out more than a list of blue Web links. It will also present more facts and direct answers to queries at the top of the search-results page.

A tweet from Craig Fifield summarizes the reaction of many search marketers:

semantic search from Google to fight SEO? come on, it's just another way to use our content on their site and keeping the traffic - Craig Fifield

Should Google force a user to click through to a site to consume a piece of information that it thinks it can accurately return in the search results?  As long as the source or sources of that information are cited and linked (a SERP-level provenance measure), it seems to me a user's best interests are served without requiring them to jump through that particular hoop.

While I recognize that there's lots of nuance involved in debates about information ownership and Google's use of information published by others, I think that SEOs worry far too much about "direct answers" poaching traffic to their sites.  The nature of such information is, first, limited to the sort of facts that can be encapsulated in relatively brief form (e.g. "when was William Shakespeare born") and, second, as likely as not to be found on a non-commercial authority site that's going to outrank you anyway (again, "when was William Shakespeare born").  Sites with rich content are still likely to fare well, and do any of us really bemoan the loss of content farm pages in the search results that were written specifically to answer queries like "when is Memorial Day 2012"?

In any case, as ReadWriteWeb's Jon Mitchell correctly notes, such "direct answers" are not new.  Perhaps we'll see more of these, or perhaps even the development of user interfaces that permit searchers to delve deeper into the response to their question without leaving a Google page.  Were this ever to present an egregious challenge to a publisher's traffic, that publisher could always prevent Google from indexing their content – though of course, that would also prevent users from discovering that content through search, and would almost certainly end up being a case of cutting off one's nose to spite one's face.

As I never tire of telling people (but as they almost certainly tire of hearing), Google has been employing semantic web technologies for a long time (this earlier post on SEO and the semantic web provides examples).  It's possible that what's on the Google horizon when it comes to "direct answers," the use of named entities and other semantic web technologies may end up being big changes under the hood that are rather less evident in the way actual search results are displayed.

The Big Takeaways for Search Marketers

Sites with structured markup will appear in the results of more queries, rank better than non-structured web pages for those queries, and have greater visibility in both linked and "direct answer" verticals.

Sites that can successfully identify and interlink entities in a fashion that Google can readily understand, whether by the use of structured data or otherwise, will find themselves particularly favored, both in linked search results and as sources of information extracted by Google and presented directly in responses to queries.

Up until now Google has been insistent that structured markup does not impact the ranking of a site in its search results.  To channel my inner Jan Brady, it's all rich snippets, rich snippets, rich snippets!  But there comes a point where it becomes disingenuous to leverage structured data solely to manipulate the way in which results are presented.  More to the point as it pertains to Google's business model, there comes a point where willfully ignoring the information structured data provides results in an inferior user experience.

If a site – by virtue of declaring and linking named entities, employing structured markup and providing verifiable provenance information – helps Google understand that a given resource is a relevant match for a given query, Google will inevitably favor relevant results over a fear that, in doing so, it may be exhibiting bias against more loosely structured sites.  Google will absolutely continue to try to understand sites with unstructured data, but at the end of the day the better Google understands a resource, the more use it can – and will – make of it.

Comments

1 John March 19, 2012 at 7:53 am

I read the original article in the WSJ and found nothing new. The term "semantic search" (or semantic web) has been doing the rounds for the last three or four years. What big thing happened during this time? Just the Panda update!
Google claims to have collected millions of data points over the last year or so. But with the ever-changing nature of the web, and of websites in particular, are those data still relevant? I think the whole issue is overstated and Google will move step by step rather than overhauling everything that is present today!


2 Brian March 19, 2012 at 12:33 pm

Great article, very well written, with excellent resources. Thank you! Encore!


3 Brian March 20, 2012 at 12:06 pm

I agree with Brian!

Very well done. While edge cases are nothing new, I agree that we will see more of this going forward, gradually.


4 Runner2009 April 16, 2012 at 6:11 pm

Very good review of the WSJ article and expansion on it. Refreshingly unbiased and, as one of the Brians stated, great supporting sources.


5 itpings.com June 17, 2013 at 4:12 am

I hardly drop remarks, but I did a few searches and wound up here at Deciphering Google's "Semantic Search" Intentions. And I actually do have some questions for you, if that's all right. Is it just me, or does it give the impression that a few of these comments appear like they are coming from brain dead individuals? :-P And, if you are posting on additional social sites, I'd like to keep up with anything new you have to post. Would you make a list of the complete URLs of your community pages, like your LinkedIn profile, Facebook page or Twitter feed?


6 Aaron Bradley June 19, 2013 at 11:54 am

Thanks for your comment (and I’ll reserve judgment on whether some of your fellow commenters are brain dead or not, especially as I’m a zombie fan :) ).

I do actually post a lot on Google+:
https://plus.google.com/106943062990152739506

I tweet as well. A lot:
https://twitter.com/aaranged

LinkedIn? You bet, but not very active there:
http://ca.linkedin.com/in/aaranged

Facebook is the social network I reserve for friends, and where I rarely talk business. :)


7 Dr Mahesh C. Jain June 25, 2013 at 6:48 am

I have only recently come to know about schema.org and semantic web technologies, but I was able to appreciate the significance of schema.org after I visited the site a second time.

After appreciating its significance I told my son, who has developed http://curatio.in, that schema.org is a powerful tool for enhancing searchability. His immediate reaction was along the following lines:

Are “Direct Answers” (Or Much Else Here) New?

My immediate reply was that we can restrict the use of schema.org's vocabulary to only the data we want to be searchable. Moreover, the meta description, being restricted to 160 to 170 characters, is insufficient to meet all the information needs of the surfer; for detailed information he shall in any case have to visit the relevant web page.

Regarding 'trust and proof', or provenance, it suffices to say that if content relates well to its context, it is unlikely to be spam.

Lastly, the future of the web is semantic web technology.


