In many recent missives, users of Google have complained about the declining quality of their search results. Numerous causes are cited for this decline – from increasingly successful underhanded marketing tactics, to Google deliberately propagating poor results for its financial benefit – but the unifying theme of the clamor is clear: these days, there's more spam in my SERPs.
What distinguishes the current chorus of dissatisfaction from past complaints is that critics are, largely without self-reflection, redefining "spam" on the fly. "Spam" has become an overdetermined word, in that it is a conceptually necessary signifier – the "right" word – but is at the same time infused with multiple and even contradictory meanings. In phenomenological parlance it would now be proper to place "spam" sous rature ("under erasure"), effacing it – but maintaining its legibility – even as the force of necessity demands its use. I won't employ the Derridean convention outside of this sentence, but one might think of the subject under discussion as ~~spam~~.
In its response to the outcry, Google used the phrase "pure webspam" to distinguish such trickery from the "shallow or low-quality content" that has largely been the focus of recent criticism. This, as much as anything, acknowledges that what users define as "spam" has evolved over time, and that we're witnessing something of a turning point in what users consider acceptable search results. With that, we're also witnessing a forced re-evaluation of the notion of "quality" search results, both by the search engines and by search marketers. If, as I'll argue, spam was originally thought of as trickery in the SERPs, and subsequently morphed to reference notions of relevancy, then spam defined as honest, relevant but suboptimal results marks the third phase in the evolution of search spam: spam 3.0.
What is Spam?
Spam is fundamentally "something lacking merit" that has achieved high search engine visibility as a result of deliberate manipulative effort (in the context of this discussion – just to be clear – I am not referring to content delivered by mechanisms outside of search, such as email). Google has traditionally referred to this as "webspam" (Matt Cutts is "the head of Google’s Webspam team"), but I think it is increasingly limiting to think of spam exclusively as links in search results that point to websites (for example, Google may now well return content in a rich snippet that is delivered to them from a data feed, and doesn't reside on a website per se).
What might be, more formally, some of the distinguishing characteristics of search spam?
- Intentional. Search engine spam must, by definition, be directed at improving a resource's visibility in the search engines (in times past, one might have said "of improving the rankings of a website"). However appalling an included result, if it was not designed to perform well in search, it got there by chance.
- Monetarily-focused. For a spammer, the purpose of achieving high visibility in the search engines is to make money. Exceptions where search engine spam has been propagated for reasons other than cash exist, but are rare.
- Cost-effective. Search engine spam is at least designed to bring in more money than it costs to produce, though search engine spam has traditionally had an eye on an astronomical, rather than simply acceptable, ROI.
- Suboptimal. A top spam result is never the same as the top result that would be returned by a user conducting a thorough, independent survey of available resources.
I think these characteristics, in aggregate, provide some rough constants by which "spam" can be referenced throughout its history – with an emphasis on "rough." The characteristic, in particular, that spam is always "suboptimal" will certainly require further examination, as the quality of a resource is subject both to opinion and context. I'll also leave aside, for the moment, the degree to which quality might be used to distinguish between "spamming" and "search engine optimization," as both these activities are, broadly, intentional and cost-effective manipulation of search results for financial gain.
The Evolution of Spam
In the table below is a summary of what I think has typified search spam through different phases, as well as the different ways search engines have responded to spammers' efforts. There's a lot of overlap between categories, and certainly spam or spam counter-measures are not mutually exclusive to each phase (for example, spammers are still keyword stuffing, and Google is still working at neutralizing the impact of keyword stuffing).
| | Spam 1.0 | Spam 2.0 | Spam 3.0 |
|---|---|---|---|
| Spammer Intention | Fool the bot | Humor the bot | Feed the bot |
| Spam Content | Not what it claims to be | Not as relevant as it claims to be | Not as good as it claims to be |
| Spammer Tactics | Keyword stuffing, sneaky redirects, link farms, scraping, hidden text, hidden links, parasite hosting, comment spam, cloaking | Raw content aggregation, paid links, directories, subdomaining | Query-focused content generation, thin UGC, sophisticated content aggregation |
| Search Engine Ranking Emphasis | Content (the site itself) | Links (what other sites say about the site) | Linkers (what trusted entities say about the resource) |
| Algorithm Authority Focus | Statistically relevant keyword content, unvetted backlinks | Contextually relevant keyword content, expert backlinks | Authoritatively relevant keyword content, trusted linkages |
| Search Engine Responses to Spam | Google Florida update, devaluation of meta keywords, nofollow, Google sandbox | End of Googlebombing, paid link reporting, canonicalization, expanded crawling, Google Caffeine | Google May Day update, factoring of social signals, sentiment analysis, reputation profiling |
Spam 1.0: Fool the Bot
The early days of search engine spam were typified by base misrepresentation of content. The spammer's goal was to deceive the search engines – and by extension, the searcher – into believing a particular website or page was about something it was not (or provided answers it didn't, or sold products it didn't, and so on).
The tactics employed by spammers were in part successful because the search engines placed a high degree of value on the keyword environment of websites, without the regard for the authority of supporting external links, or the deeper scrutiny of content, that was to come later. In this way the search engines were far more apt to believe a website's own representation of its content than they are now. Words in meta keywords and other on-page elements carried substantial weight, as did scraped content – however nonsensical it might be. It is, indeed, in this era that search marketers began measuring and adjusting "keyword density" in an effort to find the magic formula for winning search results based solely on the keywords supplied to the search engine robots.
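The "keyword density" those marketers chased is trivial to compute, which is partly why it was so easy to game. A minimal sketch (the helper and the sample page are illustrative only, not any engine's actual formula):

```python
import re

def keyword_density(text: str, keyword: str) -> float:
    """Share of words in `text` that exactly match `keyword` (case-insensitive)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w == keyword.lower())
    return hits / len(words)

# A stuffed page of the era: the target keyword makes up a third of all words
page = "Cheap widgets! Buy cheap widgets online. Widgets shipped free."
density = keyword_density(page, "widgets")  # 3 of 9 words
```

A ranking function leaning heavily on a statistic this crude rewards repetition over substance, which is exactly what era-one spammers supplied.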
Links, even at this stage, played an important role in search engine rankings, but not a sophisticated one. Just as success with keyword density was predicated on mathematical calculations, so were efforts at acquiring large numbers of links, regardless of their quality. Link farms and rather simple link exchange schemes met with some success, and – playing on the search engines' propensity to rank scraped or auto-generated content – whole networks of garbage websites were created to generate links on a massive scale.
The search engines' response was to take much less of a website's supposed content on face value, and to pay much closer attention to a website's supporting link environment. This culminated in the now-infamous Google "Florida" or "November" update of November 2003. Whatever the exact mechanics behind this algorithm change, it had every appearance of incorporating key elements of Hilltop (building on the hub-authority, or HITS, model that played a major role in the development of PageRank), and of relying much less heavily on the statistical occurrence of keywords in site content or in its inbound links. With this change, not just links but topically authoritative links (in Hilltop's language, links originating from "expert documents") were required to boost rankings. Spam had to become a lot more relevant to a user's query, or at least create the illusion of relevancy.
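The hub-authority (HITS) model mentioned above can be sketched as a short power iteration over a toy link graph. This illustrates the general idea only – the graph and scores are invented, and it is not Google's implementation:

```python
def hits(links: dict[str, list[str]], iterations: int = 20):
    """Toy HITS: pages linked to by good hubs gain authority,
    and pages linking to good authorities become better hubs."""
    pages = set(links) | {t for ts in links.values() for t in ts}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score: sum of hub scores of pages linking in
        auth = {p: sum(hub[s] for s, ts in links.items() if p in ts) for p in pages}
        # hub score: sum of authority scores of pages linked out to
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # normalize so scores don't grow without bound
        for d in (auth, hub):
            norm = sum(v * v for v in d.values()) ** 0.5 or 1.0
            for p in d:
                d[p] /= norm
    return hub, auth

# Two "expert" hubs both endorse site_a; only one endorses site_b
graph = {"expert1": ["site_a", "site_b"], "expert2": ["site_a"], "site_b": ["site_a"]}
hub, auth = hits(graph)
# site_a, linked by both expert hubs, ends up with the top authority score
```

The post-Florida intuition follows directly: a page's rank comes to depend on *who* links to it, so spam built from links out of garbage hubs loses its purchase.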
Spam 2.0: Humor the Bot
Spam in the post-Florida era was above all about trying to trick the search engines into believing that a given resource was more relevant than it actually was for a target query. The most successful mechanism for pulling this off was proving relevancy indirectly through relevant links, backed up by target content that was just relevant enough to provide a decent keyword match.
Article sites and selective web directories played (and continue to play) a role in keyword relevancy matching through linking. That is, over the course of time the intention behind using these sites has been less and less to have the content housed there rank in its own right, and more and more to generate links for targets that the search engines would favorably correlate with the link source. While perhaps not rocket science, this was a marked change from spam linking practices in the pre-Florida days, when link farms and comment spam were still effective.
I consider paid links a feature of spam 2.0, even though I've listed comment spam as a 1.0 technique. This, again, is related to relevancy. Spam bots dropping links in open comment and forum forms end up generating a link profile that often has no relevance to the target (such as a link to online poker from a blog on wine-making). In purchasing links, spammers can match the relevancy and authority of a link source to the keywords they're targeting, and to the content on the site to which it links. Put another way, an expert document system provided Google with a much better idea of what constituted a quality link source, but also exposed Google to being gamed by links from these same sources.
However much the infamous Google "sandbox" may be a figment of SEOs' imagination, there were more and more documented cases of websites failing to appear meaningfully in the SERPs for long periods of time. This effectively neutralized the previously-successful technique of creating new websites specifically to build links. Again, web resources were forced more and more to prove their relevancy for a particular keyword before they would end up ranking well for that query.
Some of the search engine counter-measures listed in the table may not seem on the surface to be spam-fighting responses, but at some level they all help weed out less relevant content from SERPs. Improved discovery measures (sitemaps, crawling of Flash, employing OCR in crawling) improve relevance by broadening the resources and linkages available to the search engines for use in relevance calculations. Improved crawl frequency (Caffeine) improves a search engine's ability to match the relevance of a resource as measured against the freshness of a query. Improved canonicalization and provenance measures (rel="canonical", Google original- and syndication-source meta data) help the search engines judge the relevance of content by being better able to distinguish between inadvertent and deliberate content duplication.
In many ways the first two eras of search spam were about misrepresentation, and what I've labelled spam 2.0 may simply be a more sophisticated version of spam 1.0. Spammers, by and large, moved on from simply lying to the search engines to providing a more nuanced, but still deceitful, representation of a web resource's topicality and authority. And while these more sophisticated tactics are still alive and well, their effectiveness is being increasingly blunted by what may be the most ingenious spam strategy of all – if, indeed, it is even spam: give the search engines exactly what they want.
Spam 3.0: Feed the Bot
What is now being referred to as "spam" in search results is markedly different than the efforts to game the search engines previously described. These resources do not represent themselves as something they are not, and they are absolutely relevant to the query for which they appear: what they lack is "quality."
So-called "content farms" like About.com, Demand Media's eHow and Yahoo's Associated Content have become the enfants terribles of the SERPs, with the quality of their content widely derided by those demanding "better" search results. They are not, however, the only types of sites generating highly relevant content with an eye to achieving search engine visibility. Question and answer sites and services like Yahoo! Answers and (one-time search engine) Ask.com also target long-tail queries, along with "expert" sites more directly monetizing successful search queries by charging users a fee for a professional's answer or ebook. Even a file sharing service ranked highly for "download beatles mp3s for free" can be said to be honestly providing content relevant to the user's query (as technically the MP3 is free, even though the file sharing service isn't).
The challenge for search engines, of course, is that quality is highly subjective, and two human beings might reach entirely different conclusions about the quality of a resource based on their own criteria and biases. Personalization of search results notwithstanding, the search engines do not know what constitutes quality for an individual searcher, and must return the results that are most satisfactory to most users, most of the time.
This is not to say that the search engines cannot and do not use the wisdom of crowds in assessing the quality of resources in their index. At a very basic level links themselves are considered by the search engines to be "votes" for a website, even if the ballots of individual electorates are given different weight. But in returning results for long tail queries like "what vitamins are good for dogs," the resource-level link environment may not be rich enough to make a reasonable quality assessment, while relying too heavily on the domain-level link environment might inadvertently end up promoting a "bad" article on a "good" site.
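The links-as-weighted-votes idea, and the page-level versus domain-level tension it creates, can be sketched with a toy tally. The sources and weights below are entirely invented for illustration:

```python
def link_votes(inbound: list[tuple[str, float]]) -> float:
    """Sum inbound links as votes, each weighted by the linker's authority."""
    return sum(weight for _source, weight in inbound)

# A long-tail article backed by two low-authority page-level links...
article = link_votes([("petforum.example/thread", 0.1), ("blog.example/post", 0.2)])
# ...versus a thin page inheriting a single heavyweight domain-level vote
thin_page = link_votes([("bigsite.example/home", 2.5)])
```

Under a tally like this, the thin page on the strong domain outscores the genuinely useful article – precisely the "bad article on a good site" failure mode described above.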
The search engines have only just begun to grapple with the issue of quality, and with mixed results. Google's most notable shot across the bow has come in the form of the May Day update, of which little is known besides the fact that it targeted long-tail queries. Both Bing and Google have also recently admitted to incorporating social signals such as tweets into their algorithms, and it is not difficult to envision how search engines might use conversations conducted in social networking environments to help assess the quality of referenced resources.
Another method that the search engines are using to assess the quality of results, particularly as it relates to content backed by links from user-generated content, is sentiment analysis. In terms of links to a site, this means not just how relevant the anchor text, linking source or linking target is to a query, but what the linking source says about the resource. While Google denied employing sentiment analysis directly in slapping down Decor My Eyes, in the same breath it extolled the virtues of its "world-class sentiment analysis system," and has since gone on to acquire the sentiment analysis engine fflick.
Ultimately, HITS, PageRank and Hilltop – all, in some manifestation or another, key components of Google's algorithm – are about the authority of web resources, rather than about a subjective human measure of their quality. At the same time this binds the algorithm to URIs: two very topically similar resources, created by the same authoritative person or entity, but living on two different domains, may have entirely different visibility in the search results. On the flip side, two very topically different resources, created by decidedly non-authoritative persons or entities, may both perform well in search right out of the gate simply by virtue of where they live.
In an effort to return content of higher quality, the search engines may start to spend more energy determining who created a resource, rather than where it is parked. This effort can generally be described as "reputation profiling," and there are many signs that the search engines are starting to go down this path. Google has been granted a patent on a system of reputation management for reviewers and raters, Microsoft has submitted an application for a reputation mashup, and Quora is working on an algorithm to determine and rank user quality.
Reputation profiling may be important not just for the search engines to judge – individually or collectively – the quality of reviews, ratings and answers that appear on websites, but also to the world of the social and semantic web. In the social realm, this could help search engines rank real-time data better based on the identity of the content producer: if it can be determined that John Doe, with 100 followers and a PR2 Twitter profile, has an excellent global reputation pertaining to tequila, might one of his tweets be a better match for a generic "tequila" query over a celebrity with 1,000,000 followers and a PR7 profile tweeting "got wasted on tequila last night"? In the semantic realm, reputation profiling might aid in the ranking of content that exists only as a linked dataset and not as a website, and is thus incapable of accumulating inbound links.
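The tequila thought experiment can be expressed as a toy scoring function. The fields, weights and logarithmic dampening here are my own invention, purely to illustrate how topical reputation could be made to dominate raw audience size:

```python
import math

def tweet_score(followers: int, topical_reputation: float) -> float:
    """Toy ranking: topical reputation (0..1) acts as a multiplier,
    while raw audience size contributes only logarithmically."""
    audience = math.log10(max(followers, 1))  # 100 -> 2.0, 1_000_000 -> 6.0
    return (1 + audience) * topical_reputation

john = tweet_score(followers=100, topical_reputation=0.9)          # tequila expert
celeb = tweet_score(followers=1_000_000, topical_reputation=0.05)  # "got wasted" tweet
# John Doe's expert tweet outranks the celebrity's despite 10,000x fewer followers
```

Any real system would obviously blend many more signals, but the shape of the function is the point: identity-level reputation gates what audience size alone can buy.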
In all this, one might reasonably ask if spam 3.0 is really, well, spam? When a user queries a search engine and the top-ranked result is a page from a content farm that exactly and concisely provides the information the user is seeking, is it spam? Is it even a "poor quality" result? Are there better resources available that should be displayed instead? Might not Google, specifically, be deliberately driving users to spam 3.0 for their own benefit?
Does Google Support Spam 3.0?
Some commentators have gone so far as to suggest that Google is knowingly taking a laissez-faire attitude toward low-quality content in its results in order to encourage clicks on Google ads in the SERPs that are seemingly more relevant than the organic results, or to encourage clicks on Google ads accompanying the low-quality content itself, or both.
This seems unlikely to me for chiefly one reason: Google understands where its bread is buttered. By consistently providing better results than any competing search engine they will retain their enormous customer base, in turn enabling a steady flow of cash from their search-associated products like AdWords. It is for reasons of corporate self-interest, not corporate altruism, that Google has seemed to largely adhere to the first two tenets of its stated corporate philosophy: "Focus on the user and all else will follow" and, especially, "It's best to do one thing really, really well."
Despite its dominance of the market in all but a few locales, Google has not given in to the temptation, as has Microsoft, to alienate users by compromising comprehensiveness or diversity in their search results. I am not speaking here of bias (favoring results in your network over other available results), but of selective promotion (limiting users to products and services that exist chiefly, or only, in your network). Live Cashback, for example, was not a method to supply comprehensive product listings to Live users, but a mechanism by which vendors could advertise their products on the Live Search Network. Google, by contrast, has long allowed merchants to list their products for inclusion in Google Product Search for free. Yes, Google has monetized product search by the advertising that accompanies it, and by offering (the largely unsuccessful) Google Checkout, but with product search they support transactions from which they don't make a dime.
This is not to bash poor Microsoft for its serial failures in product search, or that Silverlight was needed to use Bing Visual Search (now turfed in favor of HTML5), or that Bing Rewards requires a Bing Bar that runs exclusively on IE – or Yahoo for its "Search Submit" paid inclusion program – but to demonstrate how Google, by contrast, has consistently resisted such short-sighted self-promotional measures. That Google would compromise the quality of its search results for short-term gain simply doesn't match their MO.
At the end of the day, I believe that Google really does want to provide the best search results it can to its users. What may now be thwarting them from doing so is the success of their search engine.
Spam 3.0 and the Future of SEO
Spam 3.0 has been made possible by the return on investment of search engine optimization. Once upon a time, not that long ago, big business was skeptical of the value of SEO – or, at least, of the value of spending money to achieve a high degree of visibility in the search engines. In realm after realm that ROI has now become statistically demonstrable, and increasingly companies are starting to get an idea of just how much they can invest in pleasing the search engines and still turn a profit. At a time when search query volume is at least remaining high, and is combined with continued growth in the total value of online transactions, the attraction of a robust search engine presence does nothing but grow.
In that not-so-long-ago age, the bulk of marketing dollars might have been spent on building brand equity through traditional mass media advertising, traditional public relations exercises and brick-and-mortar promotions. Not only has an effective presence in search itself proven to be a brand-booster, but – as brand-building has focused more and more on social media – those same web activities which build brand also help with search engine visibility. The decision is no longer between spending 100K on TV ads and 100K on SEO, but between spending 100K on TV ads and 100K on things like blogs, review mechanisms, Facebook pages and Twitter accounts – expenditures that promote brand awareness and improve that brand's presence in search.
An increasing willingness to invest in search has led to an evolution in the tactics and strategies of search engine optimization. There has been a relentless reworking of SEO "best practices" in enterprise environments, where large investments in content and technical infrastructure have produced stellar results, spurring yet further investments. The noble efforts of individual SEOs crafting perfect title tags or acquiring that gold mine of a link from Sally's Crochet Blog are being made increasingly less effective by big players with deep pockets who are willing to spend, and spend big.
An excellent example of this is the rise of big multi-product retailers in the SERPs, particularly for product categories (as opposed to individual product listings, though the two are not mutually exclusive). At one time it was manufacturers or resellers that specialized in a particular product area that had the greatest visibility in the SERPs. Now, virtually regardless of the query, the same huge players turn up again and again in the search results. Amazon. Overstock. Target. Sears. Google hasn't suddenly discovered that these are good places for its users to buy stuff: these companies have realized that their websites are good places for Google's users to buy stuff. They've got the thousands – or tens of thousands, or hundreds of thousands – of dollars that it takes to consistently achieve this sort of ascendancy in search. Extensible search-friendly site architecture that requires sophisticated custom code and endless maintenance. Complex review and rating mechanisms requiring extensive moderation, and even the support of sentiment analysis algorithms. Product information delivered in a dizzying array of data types, from simple XML sitemaps to specialized structured data based on complex ontologies. Marketing teams producing content, engaging on social networks, monitoring discussions, creating campaigns. This all takes cash, and lots of it.
The evolution of SEO for multi-topic content sites is analogous to the evolution of SEO for multi-product retailers. Demand Media didn't simply have some good ideas about how to attract long-tail queries; they invested in them. Demand has raised some $355 million in funding, and while its content network is not the totality of their business, this capital has allowed them to develop search-focused algorithms, editorial processes and websites that have resulted in a staggering amount of traffic from organic search. Critics of Demand frequently make reference to their "cheap content" but, from a corporate perspective, that content has been anything but cheap to produce.
Ultimately the search success of big multi-topic content sites might prove more fragile than that of big multi-product ecommerce sites. Good ecommerce SEO basically entails giving consumers what they want: comprehensive and accurate information about products. Good content SEO does not necessarily entail providing comprehensive and accurate information about topics. Google might eventually be able to use signals (like an author's reputation profile) to better assess the quality of an article, but that store A sells item Y for amount Z is fulfilling a user demand in a very non-subjective way. So where large online retailers can get away with being product generalists (especially if they enlist their customers as product specialists), multi-topic content peddlers may not always get away with being content generalists. The exception to this may be collaborative multi-topic content sites supported by topic specialists, namely Wikipedia. In this way Wikipedia is analogous to Amazon, with volunteer contributors taking the place of volunteer product reviewers.
This does not mean, however, that "low quality" content will necessarily be supplanted in the SERPs by better quality content from a broader variety of sources. As long as it remains profitable to do so, enterprising businesses will continue to feed the beast what it most wants to devour. While this content might be "better," it will still be intentionally designed for maximum visibility in search, and to achieve the maximum return on the substantial investment required for that search success. Some might even call it spam.
Addendum – Search Engine Chronology
A chronology of major search engine algorithm changes and innovations referenced in this post. Links reference the source of the dates listed.
| Date | Event |
|---|---|
| Sept. 2003 | Supplemental results appear in Google |
| Nov. 2003 | Google "Florida" update |
| Jan. 2005 | Nofollow introduced |
| Nov. 2006 | Sitemap protocol introduced |
| Jan. 2007 | End of Googlebombing |
| Apr. 2007 | Google institutes paid link reporting system |
| Dec. 2007 | Google devalues subdomains |
| Jun. 2008 | Google improves Flash indexing |
| Oct. 2008 | Google OCRs scanned documents |
| Feb. 2009 | Rel=canonical introduced |
| May 2009 | Google introduces rich snippets |
| Apr. 2010 | Google "May Day" update |
| Jun. 2010 | Google Caffeine complete |
| Nov. 2010 | Google News introduces "original-source" |
| Dec. 2010 | Google responds to Décor My Eyes |
| Dec. 2010 | Google and Bing admit to using social signals |