SEO, the Semantic Web and Information Discovery

by Aaron Bradley on January 6, 2011

in Semantic Web, SEO

The following is a transcript of a talk presented at a dual meetup of the Vancouver Search Engine Marketing Group and the Vancouver Semantic Web Meetup Group on 6 January 2011.

What is the Semantic Web?

I'm not going to spend the next 45 minutes trying to define the semantic web – also called Web 3.0 – because it's in some ways as philosophical as technical, and is a matter of debate even among semantic web practitioners. Instead I want to focus on three core concepts of the semantic web, keeping things as simple as possible.

Tim Berners-Lee, the father of the World Wide Web, defines the Semantic Web as "a web of data that can be processed directly and indirectly by machines." This really encapsulates the first core concept of the semantic web – namely, a separation between the presentation of data and the data itself. Search marketers should note that, for reasons that will become clear, I'm going to speak little of websites and a lot about data in this presentation. That's because the "web of data" in Berners-Lee's model includes all sorts of information that is currently unavailable in the traditional notion of the web: documents in any number of formats, information that exists only in databases, and so on. These data can include websites and web pages, of course, but websites are not the exclusive domain of this web of data.

The second concept – and perhaps the critically defining concept of the semantic web – is a reliance on structured data. That is to say, on models and technologies that permit the efficient digestion of data by machines, and the ability to create meaningful linkages between resources.

[Image: RDF Triples]

Bear with me here as I get a little technical – it will be the only time in this presentation I do so – as I think the structured data of the semantic web deserves at least a cursory explanation.

The core of structured data is RDF – the Resource Description Framework, a W3C standard commonly expressed as XML. The takeaway here for SEOs probably unfamiliar with RDF (and associated technologies like OWL, RDFS and a forest of other impenetrable acronyms) is that RDF allows concepts, terms, and relationships to be formally defined.

This rests on the idea of a triple. A triple consists of a subject, a predicate and an object. For example, I can create a triple saying "Bill likes baseball" or "Bill is married to Jill" or "New York has the postal abbreviation NY."

So what, right? Well, triples are a formal but amazingly flexible way of describing and classifying data. There is no relationship between a subject and an object that can't be described. As we'll see, webmasters are already providing Google with structured or semi-structured data that really does the same thing, such as review content that makes de facto statements like "the average customer rating of this DVD player is 4 stars." What the structured data models of the semantic web really add is the ability to tie all sorts of data together in meaningful ways.
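
To make the triple idea concrete, here's a minimal sketch of those statements expressed as data. I'm using Python's rdflib library and a made-up example.org vocabulary purely for illustration; nothing here is prescribed by RDF itself.

from rdflib import Graph, Literal, Namespace

# An invented vocabulary for the example; any URI-based vocabulary would do
EX = Namespace("http://example.org/terms/")

g = Graph()
g.bind("ex", EX)

# "Bill likes baseball" and "Bill is married to Jill" as subject-predicate-object triples
g.add((EX.Bill, EX.likes, EX.Baseball))
g.add((EX.Bill, EX.marriedTo, EX.Jill))

# The de facto statement a review snippet makes:
# "the average customer rating of this DVD player is 4 stars"
g.add((EX.SomeDVDPlayer, EX.averageRating, Literal(4)))

print(g.serialize(format="turtle"))

Serialized this way, each statement reads almost exactly like the English sentences above: a subject, a predicate and an object.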

[Image: A Rice Ontology (via mikeaxelrod.com)]

Which brings me to the third key concept of the semantic web: linked data. This is fairly straightforward in broad strokes. In a linked data model, things are uniquely identified, capable of being looked up, provide useful information when looked up, and themselves link to other uniquely identified things. Because this interconnected data is structured, it allows computers – for the benefit of humans – to make complex connections across the web of things.

Before moving on to SEO, let me pause to draw the first analogies between SEO and the semantic web. The ability to look up a specific resource – let's say a web page – by using a unique identifier, a procedure called "dereferencing," is also high on Google's wish list. In fact, Google basically introduced rel="canonical" just because it wants to be able to dereference web pages. Structured data provides a much more formal way of doing this, and conceivably could be employed to solve a lot of duplicate content problems.
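
As a sketch of what dereferencing looks like in practice, the snippet below looks up the DBpedia identifier for cricket and reads back the statements returned. The choice of rdflib, and the assumption that DBpedia is online and serving RDF for that URI, are mine rather than anything from the talk.

from rdflib import Graph

g = Graph()
# rdflib fetches the URI and parses the RDF served for it
g.parse("http://dbpedia.org/resource/Cricket")

# Print a few of the statements returned about the resource
for subject, predicate, obj in list(g)[:10]:
    print(predicate, obj)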

[Image: Disambiguation - Cricket the Game and Cricket the Insect]

Similarly, semantic web technologies allow web owners to tell Google much more precisely which of two identically named but different things a resource is about, in a procedure called "disambiguation." Presented with two resources about "cricket" available only in HTML, Google has to do some data gymnastics to figure out – if it can – what sort of cricket a given resource is talking about. Semantic web models allow you not only to say very explicitly "hey Google, this is about cricket the game, not cricket the insect" but also to say "cricket is a game" and "baseball is a game," making connections between related resources. Can structured data, then, help with web rankings? Well, Google really, really likes being told directly what something is about, and the more help you give it in classifying pages, the better the chance that a page will receive visibility in its search results. More on this later.
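
Here's what those explicit statements might look like, again sketched with rdflib and an invented example.org vocabulary rather than any particular published ontology:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/terms/")

g = Graph()
g.bind("ex", EX)

# "Cricket is a game" and "baseball is a game" ...
g.add((EX.Cricket, RDF.type, EX.Game))
g.add((EX.Baseball, RDF.type, EX.Game))

# ... while the identically named insect is a different, explicitly typed thing
g.add((EX.CricketInsect, RDF.type, EX.Insect))
g.add((EX.CricketInsect, RDFS.label, Literal("cricket")))

# A machine can now list everything that is a game with no guesswork involved
for thing in g.subjects(RDF.type, EX.Game):
    print(thing)

With the types stated outright, a machine can list every game, or every insect, without having to guess which "cricket" is which.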

What is SEO?

For the longest time in introductory SEO presentations, I've defined SEO as "techniques and strategies applied to websites with the aim of making pages rank highly in search engines for queries." I would revise that now to say, "techniques and strategies applied to web-accessible resources with the aim of providing greater visibility for those resources in the search engines."

Why the revision? In part because Google has been dealing with structured data for a long time now, and this has permitted it to move past the list of ten blue links to display information directly in search results – and increasingly to tailor those results to the needs of the person in front of the keyboard. The clearest example of this is the inclusion of "rich snippets" in search results, where information beyond a site's <title> and description is displayed in results.

The SEOs in this room all know the basics of these techniques, and I think reviewing them in any detail would bore the semantic web people to death. Suffice it to say that there are two main ways of achieving that visibility: by ensuring you're delivering clear data to the search engines in a way they can understand (on-page and on-site SEO), and by providing this data with authority, traditionally by garnering links on web pages, but increasingly through other mechanisms, such as social media citations. Web 3.0 has significantly changed both of these activities, and will continue to do so.

Does Structured Data Help Improve Search Engine Visibility?

I think a case can be made that employing structured data can indeed improve search engine visibility, in two basic ways.

First, Google doesn't have to work as hard at understanding structured data, as compared to unstructured or semi-structured data. That this is the case is readily supported by the increasingly large number of data formats that Google not only supports, but encourages web producers to use. In 2009 Google announced support for RDFa and Microformats, and has extended supported data types since then, most recently to include the GoodRelations ontology for ecommerce.
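
As an illustration of the kind of statement GoodRelations lets a merchant make, here's a small sketch in rdflib. The product and price are invented; the gr: terms are drawn from the GoodRelations vocabulary, though a real implementation would more likely embed them in page markup (for example as RDFa) than build a standalone graph.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

GR = Namespace("http://purl.org/goodrelations/v1#")   # the GoodRelations vocabulary
EX = Namespace("http://example.org/products/")        # invented identifiers for the example

g = Graph()
g.bind("gr", GR)

offer = EX.dvdPlayerOffer
price = EX.dvdPlayerPrice

# "This offering is for a DVD player priced at 89.99 CAD"
g.add((offer, RDF.type, GR.Offering))
g.add((offer, GR.name, Literal("Acme DVD Player")))
g.add((offer, GR.hasPriceSpecification, price))
g.add((price, RDF.type, GR.UnitPriceSpecification))
g.add((price, GR.hasCurrency, Literal("CAD")))
g.add((price, GR.hasCurrencyValue, Literal("89.99")))

print(g.serialize(format="turtle"))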

[Image: A Snippet from the Hilltop Algorithm]

But Google has been dealing with structured data for a long time. In 2008 Google started listing ratings in rich snippets for Yelp, Citysearch, CNET, TripAdvisor and Download.com. With this and so many other types of data, Google has been parsing web-delivered resources to structure their data into a form it can manipulate. This sort of data structuring has certainly been around (at the very least, in theory) since the Hilltop algorithm, which Google acquired in 2003. Hilltop describes a mechanism for extracting key phrases from HTML and defining their relationship with links on the page (think "key phrase -> qualifies -> URL").

In fact – for all those semantic web types that bemoan the slow adoption of Web 3.0 technologies – structured data is everywhere, even if it's not RDF; and as for SEOs, you've almost certainly been providing the search engines with structured XML data. Submit an XML sitemap to the engines? That's obviously well-formed XML (malformed sitemaps are rejected). Got a blog? You're feeding Google RSS. Got Bazaarvoice reviews on your site? Google parses them to return star ratings in rich snippets. Do you appear in Google product listings? You're feeding the engine a Google Merchant Center feed. In fact, in announcing their support for GoodRelations, Google said that one of two things was required to produce rich snippets for products in web search results: either structured data (in the form of Google's own product markup, the hProduct microformat or GoodRelations), or a Google Merchant Center feed and product pages that employ the <link rel="canonical"> element. (Remember dereferencing?)
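
Since the sitemap is the structured XML most SEOs touch first, here's a minimal sketch of generating one with Python's standard library; the URLs are placeholders of my own, not from any real site.

import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Placeholder URLs standing in for a real site's pages
pages = ["https://www.example.com/", "https://www.example.com/products/"]

urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
for loc in pages:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "changefreq").text = "weekly"

# Well-formed XML, as the engines require of sitemaps
ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)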

[Image: Google Merchant Center Feed]

So providing the search engines with structured data will almost certainly give you a leg up in search engine visibility. To forestall the question "which format should I use?": I don't think it really matters – use whatever structured data you can most easily produce and for which Google has announced support.

But the second way I think structured data can improve your site's search engine visibility is by imposing structure on website architecture that's independent of design considerations. I'm still shocked and appalled daily by how poorly websites classify and interlink information – and this includes enterprise sites with thousands of pages. Dereferencing? Canonicalization? Hardly, if your Samsung Q3 MP3 player is accessible under six different URLs depending on navigational paths. For a lot of mid-size sites, even card sorting exercises would be a vast improvement on the haphazard classification of data on these sites – again, one of the disadvantages of combining data and presentation layers.

If you were to put together a site using RDF, with pages produced by running SPARQL queries against an OWL ontology, that's going to be a really clear and well-structured site for both search engines and users. Think, for example, of the SEO's quandary concerning keyword cannibalization – two or more pages competing for the same keyword attention. I'd lay big money on there being far fewer cannibalization problems on a site based on a decent ontology or taxonomy – issues that web architects can ignore in free-form design are brought to the fore with structured data. Structured data begets logical and internally consistent websites, which the search engines have long been known to favour.
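
To sketch what that might look like, here's a toy catalogue queried with SPARQL via rdflib; the vocabulary and data are invented for illustration, not drawn from any real site or standard ontology.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/catalog/")

g = Graph()
g.bind("ex", EX)

# A tiny catalogue: one category, one product, one explicit relationship
g.add((EX.MP3Players, RDF.type, EX.Category))
g.add((EX.MP3Players, RDFS.label, Literal("MP3 Players")))
g.add((EX.SamsungQ3, RDF.type, EX.Product))
g.add((EX.SamsungQ3, RDFS.label, Literal("Samsung Q3")))
g.add((EX.SamsungQ3, EX.inCategory, EX.MP3Players))

# One query drives one page per category, each listing only its own products
query = """
    PREFIX ex: <http://example.org/catalog/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?categoryLabel ?productLabel WHERE {
        ?product ex:inCategory ?category ;
                 rdfs:label ?productLabel .
        ?category rdfs:label ?categoryLabel .
    }
"""
for row in g.query(query):
    print(row.categoryLabel, "->", row.productLabel)

In this toy data every product sits under exactly one explicitly stated category, so the question of which page should target which keyword has an unambiguous answer.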

Spam Filtering and Ranking of Structured Data

In its digestion of structured data – both structured data fed to it, and data it structures for itself – Google is not merely a presentation layer. Google does two important and disruptive things before dishing out a chunk of data in response to a query: it vets it, and it ranks it.

It vets the data by assessing whether it is spam – tossing out or devaluing (in the second part of the equation) data it thinks is crap. Taxonomists, ontologists, data architects and semantic web modellers think a lot more, I find, about misclassified data than about misleading data. I think this is in part because so much of the semantic web heavy lifting these days takes place around controlled data sets in universities, research organizations and businesses. And big collaborative structured data sets like Freebase and DBpedia use humans to identify and weed out bad data – which is not the same as algorithmic spam filtering.

I bring up spam for two reasons. First, if feeding the search engines structured data ultimately results in higher search engine visibility, there's an incentive for the production of spam structured data. If producing an elegant but self-serving ontology of online pharmaceuticals in OWL ultimately results in page one results for drug- and disease-related queries, it will be worth someone's while to do it. A good cautionary tale from the world of SEO is the birth, life and death of the meta keywords tag – such an abused piece of meta data that the search engines eventually had no choice but to ignore it. This, perhaps, is just to say that Tim Berners-Lee's beautiful web of things includes Viagra, casino games and hot college dorm cams.

Second, and on the flip side of the coin, spam filtering becomes less and less important for pull data. That is, data a user pulls into a stream or willingly revisits is by nature below the spam threshold for that user. The cautionary note here for search marketers is that while you can make unattractive data rank well, attractive data – information that's comprehensive, up to date and logically presented – is required before a user will pull it. Even now that's happening with the personalization of results in Google: the more you consume a particular type of data on a particular site, the more likely Google is to show you data from that site.

Aside from filtering structured data for spam, Google also – of course – ranks that data. And however much Google may appreciate being fed structured data, it will rank a fondue recipe in poorly-written HTML above a whole site of fondue recipes in hRecipe in a heartbeat if its algorithm assesses this to be a better page for the mass of Google's users.

[Image: The Place of Trust and Proof in the Standard Semantic Web Layer Cake]

What Google does not do is rank data based solely on the semantic relevance of a resource to query keywords, unlike, say, the results of a SPARQL query against DBpedia. Relevance, to Google, means relevance to the user. So structured data will never trump useful data, however "relevant" a resource may be in the semantic sense. Discussions of the "proof" and "trust" layers in the standard semantic web diagram tend to revolve around data provenance and digital signatures; in the SEO world, proof and trust are chiefly related to citations – to user "votes" for a resource. In a real sense, what Google does that the semantic web does not, at least directly, is determine the importance of a resource as much as its relevance, and I think this is sometimes overlooked in the utopian vision of an interconnected semantic web of things.

Implications of Linked Data for Search Engine Optimization

All of this leads me to the concept of information discovery, the last point I'll be covering. I'm indebted to Don Turnbull for the phrase "information discovery," which he introduced me to in a recent talk.

[Image: Pull Data in the SERPs]

Traditionally, search engines have been used to request information from websites for specific queries. Increasingly, however, different types of data may be presented to users from different sources, enabling the discovery of new information by means other than search. This can come from direct pull – opt-in technologies such as RSS content, a Facebook news feed, or tweets from those you follow. These sorts of data are increasingly being pulled into the SERPs.

And there is, in Google and elsewhere, what I'll describe as indirect pull: data returned to users on the basis of machine-observed behavior and the clarity of the connections possible between linked data. If you have a FOAF (friend of a friend) profile at a university entomology department, follow insect enthusiasts on Twitter, and subscribe to Bug Girl's Blog, one day Google might never sully your search results with pictures of guys in white throwing wooden balls around when you type "cricket" in the search box.
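
For those who haven't seen one, a FOAF profile is just more triples about a person. The sketch below is invented (the person, the blog URL and the choice of rdflib are mine), but the foaf: terms come from the real FOAF vocabulary.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, RDF

EX = Namespace("http://example.org/people/")

g = Graph()
g.bind("foaf", FOAF)

me = EX.me
g.add((me, RDF.type, FOAF.Person))
g.add((me, FOAF.name, Literal("A. Entomologist")))
g.add((me, FOAF.interest, URIRef("http://dbpedia.org/resource/Entomology")))
g.add((me, FOAF.weblog, URIRef("https://example.org/insect-blog/")))  # placeholder blog URL

print(g.serialize(format="turtle"))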

I think a key task for search marketers will be the exposure and structuring of data – that is, making more types of information available to the search engines, and making that information easier for them to understand.

Finally, I think Web 3.0 also entails a shift from the hoarding of content to the sharing of content in the web of linked data, and this represents the most radical shift facilitated by the semantic web. The traditional goal of SEO has been to drive users to a website. Including information from non-affiliated websites for your users, however useful, has always been derisively regarded as "scraping." But as it becomes easier and easier for computers to make sense of the relationships between pieces of linked data, users will be less and less willing to travel to multiple destinations when information can be aggregated for them in fewer places.

I knew an SEO who was basically livid when Google began providing answers to questions directly in the search results, rather than requiring users to click through to a website where they could be sent down some sort of conversion pathway. Like it or not, that's the evolution of the web landscape as traditional search morphs into information discovery. I think the SEO efforts that will meet with the best success in the future will be those that produce and interlink data that's attractive to users, and that put the focus on building what might be called not just brand loyalty, but data loyalty.
