Google's Knowledge-Based Trust (KBT) Proposal

by Aaron Bradley on March 2, 2015

in Search Engines, Semantic Web

Google's Knowledge-Based Trust Approach to Quality

An article by Hal Hodson published in New Scientist has a seemingly hyperbolic title: "Google wants to rank websites based on facts not links."

But the very first paragraph of the Google Research paper cited by Hodson shows the headline to be fairly, well, factual. This is a research proposal to replace links with factual accuracy as a means of assessing a web page or web site's trustworthiness.

The quality of web sources has been traditionally evaluated using exogenous signals such as the hyperlink structure of the graph. We propose a new approach that relies on endogenous signals, namely, the correctness of factual information provided by the source. A source that has few false facts is considered to be trustworthy.
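To make that intuition concrete before digging in, here's a deliberately naive sketch of my own (not the paper's model, which is a good deal more sophisticated): treat a source's trustworthiness as the fraction of its extracted triples that agree with a reference knowledge base. The triples and the reference facts below are invented for illustration.

    # A deliberately naive illustration of the KBT intuition: a source that
    # asserts few false facts is trustworthy. Hypothetical data; the paper's
    # actual method is a multi-layer probabilistic model, not a simple ratio.

    REFERENCE = {
        ("Barack Obama", "nationality"): "USA",
        ("Paris", "capital_of"): "France",
    }

    def naive_trust(triples):
        """Fraction of a source's triples that match the reference knowledge."""
        known = [(s, p, o) for s, p, o in triples if (s, p) in REFERENCE]
        if not known:
            return None  # too few triples to judge: the sparsity problem noted below
        correct = sum(1 for s, p, o in known if REFERENCE[(s, p)] == o)
        return correct / len(known)

    site_triples = [
        ("Barack Obama", "nationality", "Kenya"),  # false fact
        ("Paris", "capital_of", "France"),         # true fact
    ]
    print(naive_trust(site_triples))  # 0.5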

I've taken an initial look at the paper, called "Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources", and there's much about it that's both compelling and contentious; it also sometimes offers inadvertent insights into other Google projects and priorities.

I threw out my initial thoughts on Google+, and here I expand on and augment them.

Extraction errors and the Knowledge Vault

Perhaps unsurprisingly, it turns out that automating the fact extraction process has made it error-prone. They note that "extraction errors are far more prevalent than source errors. Ignoring this distinction can cause us to incorrectly distrust a website."

Web resources without triples

Again unsurprisingly, it turns out that a paucity of data for a resource makes it difficult to assess that resource (suggesting that I might not have been entirely out on a limb in speaking of SEO with data).

What is somewhat surprising, it turns out, is just how many resources have so few extractable triples (and, interestingly, triples seem to be the measure of whether or not a given resource "has data" in the eyes of the Knowledge Vault).

This [assessment mechanism for automatically extracted facts] can cause problems when data are sparse. For example, for more than one billion webpages, KV is only able to extract a single triple (other extraction systems have similar limitations). This makes it difficult to reliably estimate the trustworthiness of such sources.

But apparently they've made progress in cracking both these nuts.

Our main contribution is a more sophisticated probabilistic model, which can distinguish between two main sources of error: incorrect facts on a page, and incorrect extractions made by an extraction system.
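As a back-of-the-envelope illustration of why that distinction matters (my own toy numbers, not figures from the paper), a quick Bayes-rule calculation shows that when extraction errors dwarf source errors, a false extracted triple usually says more about the extractor than about the website:

    # A minimal Bayes-rule sketch (invented probabilities, not the paper's figures)
    # of why conflating extraction errors with source errors is dangerous.

    p_source_false = 0.01  # probability the page itself states a false fact
    p_faithful     = 0.90  # probability the extractor reproduces what the page says
    p_garble       = 0.10  # probability a true statement is garbled into a false triple

    # Probability that an extracted triple is false, from either cause.
    p_extracted_false = (p_source_false * p_faithful
                         + (1 - p_source_false) * p_garble)

    # Posterior probability that the *page* is at fault, given a false extracted triple.
    p_blame_source = (p_source_false * p_faithful) / p_extracted_false

    print(round(p_blame_source, 3))  # ~0.083: the extractor, not the site, is the usual culprit

With numbers like these, distrusting the website every time a bad triple shows up would blame it for errors it never made.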

The main topic of a website, and other challenges

One of the identified areas for improvement is the ability to identify the main entity of a page.

To avoid evaluating KBT on topic irrelevant triples, we need to identify the main topics of a website, and filter triples whose entity or predicate is not relevant to these topics.

(On a side note, a mechanism for identifying the main entity of a page has long been discussed, and has recently been proposed.)

On a related note, they say that in order to "avoid evaluating KBT on trivial extracted triples, we need to decide whether the information in a triple is trivial."

All in all, it's evident that "not enough data" (not enough triples) and "too much data" (too many triples) are persistent problems at opposite ends of the data volume spectrum. Even if the also-identified goal of improving extraction capabilities is achieved and yields more triples, "they may introduce more noise."
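For what it's worth, here's a purely hypothetical sketch of what the topic-relevance and triviality filtering quoted above might look like; the class lookup, site topics, and trivial-predicate list are all invented, since the paper names these only as open problems.

    # Hypothetical sketch: keep only triples whose subject entity falls within a
    # site's main topic classes and whose predicate isn't on a "trivial" list.

    ENTITY_CLASS = {"Casablanca": "Film", "Humphrey Bogart": "Actor", "Warner Bros.": "Company"}
    SITE_TOPICS = {"movies.example.com": {"Film", "Actor"}}      # invented site and topics
    TRIVIAL_PREDICATES = {"image", "url", "canonical_link"}      # invented triviality list

    def relevant_triples(site, triples):
        """Drop triples that are off-topic for the site or trivially extractable."""
        topics = SITE_TOPICS.get(site, set())
        return [(s, p, o) for s, p, o in triples
                if ENTITY_CLASS.get(s) in topics and p not in TRIVIAL_PREDICATES]

    triples = [
        ("Casablanca", "director", "Michael Curtiz"),  # on-topic, substantive
        ("Casablanca", "image", "poster.jpg"),         # on-topic but trivial
        ("Warner Bros.", "founded", "1923"),           # off-topic for this site
    ]
    print(relevant_triples("movies.example.com", triples))
    # [('Casablanca', 'director', 'Michael Curtiz')]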

Whither the Knowledge Graph after Freebase?

I'll just mention, as I did on Google+, that the Knowledge Graph doesn't just rely on Freebase classes, but replicates them exactly.

We used the Google Knowledge Graph (KG) (whose schema, and hence set of classes is identical to that of Freebase) to map cell values to entities, and then to the classes in the KG to which they belong.
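A rough sketch of that cell-value-to-class mapping, with invented identifiers and class names standing in for whatever the KG actually uses, might look like this:

    # Hypothetical sketch of the value -> entity -> class mapping described above;
    # the entity ids and classes are invented, not actual Knowledge Graph data.

    ENTITY_OF = {"Casablanca": "ent:casablanca_film", "Paris": "ent:paris_city"}
    CLASSES_OF = {"ent:casablanca_film": {"Film"}, "ent:paris_city": {"City", "Capital"}}

    def classes_for_cell(value):
        """Map a table cell value to an entity, then to that entity's classes."""
        entity = ENTITY_OF.get(value)
        return CLASSES_OF.get(entity, set())

    print(classes_for_cell("Paris"))  # {'City', 'Capital'}, or an empty set if unrecognized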

So I'll note, tangentially, that it'll be interesting to see how it all works out for Google once Freebase is shuttered and Wikidata becomes the Knowledge Graph's new BFF. Will classes simply then be derived from Wikidata, as they seemingly were from Freebase?

What's a fact, Jack?

The opening two sentences of the abstract are remarkable in terms of what follows, insofar as the first speaks of "factual information", and the second of "facts", without any subsequent discussion of what constitutes a "fact".

We do, however, have this on assessing the correctness of facts. Emphasis mine.

We extract a plurality of facts from many pages using information extraction techniques. We then jointly estimate the correctness of these facts and the accuracy of the sources using inference in a probabilistic model. Inference is an iterative process, since we believe a source is accurate if its facts are correct, and we believe the facts are correct if they are extracted from an accurate source.
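That iterative loop is essentially a truth-finding fixed point: source accuracy is estimated from agreement with the currently believed facts, and the believed facts are re-estimated from accuracy-weighted votes. The stripped-down sketch below (hypothetical sites and claims, and nothing like the paper's multi-layer model) shows the shape of it:

    # A stripped-down truth-finding iteration: alternate between believing the
    # values backed by the most trustworthy sources and re-scoring each source
    # by how often it agrees with the current beliefs. Data is hypothetical.

    claims = {
        # (subject, predicate) -> {source: claimed object}
        ("Barack Obama", "nationality"): {"siteA": "USA", "siteB": "USA", "siteC": "Kenya"},
        ("Paris", "capital_of"):         {"siteA": "France", "siteC": "Belgium"},
    }

    accuracy = {s: 0.8 for votes in claims.values() for s in votes}  # uniform prior

    for _ in range(10):  # iterate until (roughly) stable
        # 1. Re-estimate the believed value of each fact from accuracy-weighted votes.
        beliefs = {}
        for fact, votes in claims.items():
            scores = {}
            for source, value in votes.items():
                scores[value] = scores.get(value, 0.0) + accuracy[source]
            beliefs[fact] = max(scores, key=scores.get)
        # 2. Re-estimate each source's accuracy from agreement with the beliefs.
        for source in accuracy:
            judged = [(fact, value) for fact, votes in claims.items()
                      for src, value in votes.items() if src == source]
            hits = sum(1 for fact, value in judged if beliefs[fact] == value)
            accuracy[source] = hits / len(judged)

    print(beliefs)   # believed value for each fact
    print(accuracy)  # siteC, the outlier, ends up with low accuracy

In this toy run siteC, which claims a Kenyan Obama and a Belgian Paris, quickly loses credibility, and its votes stop counting for much.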

Later on, when the paper describes advances in probabilistic modeling that the researchers have made, again the emphasis is on being better able to assess the source's trustworthiness.

This provides a much more accurate estimate of the source reliability.

I hesitate to say much about the Open World Assumption, since I always seem to put my foot in it when I do, but it does seem to me to be worth mentioning in relation to the thrust of Knowledge-Based Trust.

The Open World Assumption, says Juan Sequeda, "is the assumption that what is not known to be true is simply unknown," whereas the Closed World Assumption is the assumption "that what is not known to be true must be false."

He goes on to say:

Recall that OWA is applied in a system that has incomplete information. Guess what the Web is? The Web is a system with incomplete information. Absence of information on the web means that the information has not been made explicit. That is why the Semantic Web uses the OWA. The essence of the Semantic Web is the possibility to infer new information.

Knowledge-Based Trust certainly avails itself of the power of inference. In fact, the essence of what the researchers propose is "a way to distinguish errors made in the extraction process from factual errors in the web source per se, by using joint inference in a novel multi-layer probabilistic model."

But does the model's reliance on "accurate sources" in order to determine whether or not something is a fact itself assume the a priori existence of factually correct sources, rather in contradiction to the Open World Assumption? Again, I've probably put my foot in it, but if the Open World Assumption isn't the elephant in the room here, the principle seems worth thinking about in regard to Knowledge-Based Trust, whether in its application, disregard, or both.

Certainly the authors of the paper took the bull by the horns in more ways than one in using the citizenship of Barack Obama as an example, since by a preponderance of links and other signals one might incorrectly assess that he was a Kenyan citizen.

And what about irony, or parodies, or inadvertent ironic self-parody?

Not intended to be a factual statement

But in whatever detail the paper describes a method of assessing the trustworthiness of a resource by using facts, it fails to squarely address the issue of just what a "fact" is to begin with.

Bernard Vatant has commented that Google conveys "(maybe unwillingly) the (very naive) notion that the Knowledge Graph stands at a neat projection in data of 'real-world' well-defined things-entities-objects and proven (true) facts about those." (See also ensuing discussions.)

He says this in regard to the Knowledge-Based Trust paper.

Seems to me this paper, regardless of the intrinsic scientific quality and interest of the method and experiments – which I must confess I have not enough understanding to assess thoroughly – presents the same fundamental confusion I have previously pointed at in Google Knowledge Graph's presentation prose. There again the meaning of "facts" is not clearly defined, and taken for granted.

If such a terminological vagueness is already borderline in general communication and marketing of the Knowledge Graph concept, it's far more difficult to admit in the context of a scientific publication. The term "fact" seems clearly used to denote "statement" typically expressed as (subject, predicate, object) triple (as in RDF). But expressions such as "the correct value for a fact (such as Barack Obama’s nationality)" or "facts extracted by automatic methods such as KV may be wrong" show indeed a very loosy use of the notion of "correctness" or "truth" applied to "facts" which should be used in a scientific publication context with much more caution.

I don't feel qualified to comment on the appropriateness of the paper's use of the word "fact" in the context of a scientific paper, but it seems evident even to this relative layman that talking about an approach that relies on "the correctness of factual information" without addressing just what constitutes "factual information" is a pretty glaring epistemological error of omission.

Knowledge-Based Trust: not the whole shooting match

The other thing to note about the first paragraph of the paper, cited above, is that it spoke of using links to evaluate the "quality of web sources" (emphasis mine).

So I don't think the authors have suggested that links can't be used for assessing web resources, only that KBT should replace them as a mechanism for assessing web quality. They might still, for example, be used for assessing relevancy, or freshness, or virality.

And it certainly doesn't say that other factors shouldn't be taken into consideration in the ranking of web resources in search results. Indeed the authors stress early that "source trustworthiness provides an additional signal for evaluating the quality of a website" and discuss "new research opportunities for improving it and using it in conjunction with existing signals such as PageRank."

So even if fully embraced by Google, Knowledge-Based Trust would never be the whole shooting match.

As the name "Knowledge-Based Trust" – described in the paper as a "trustworthiness score" – suggests, what the mechanism assesses is how much trust can be put in a resource, and there's many more types of assessments that are made than trustworthiness when a search engine provides a query response.

Not to say that trustworthiness might not in itself be a, or the, determining factor in what Google responds with for a query.

Certainly conventional SEO wisdom would have it that links are a, or the, determining factor in what Google responds with for a query, so if link equity were to be supplanted by factual accuracy – PageRank by KBT – then we'd have to regard Knowledge-Based Trust as a very influential signal indeed.

Another imperfect measure of trust, or a link-killer?

I can well imagine the response of many to this proposal: we can all see that Google sometimes gets its facts mixed up in response to a query, so it's crazy to rank web resources based on factual accuracy when Google is factually inaccurate. Or that Google relies too much on flawed Wikipedia for fact extraction and verification, so skews in favor of the biases or shortcomings evident there.

These are valid points about the methodology, and of course just above I've raised objections to the glib assumption that what constitutes a fact is self-evident.

But the fact (ha) is that hyperlink-based resource evaluation doesn't itself reflect some objective reality. It is error-prone, subject to gaming, and its own special brand of subjective.

Would relying on the seeming veracity of a web page or web site, rather than its seeming popularity, result in better web results?

Obviously we wouldn't know until we were able to compare those results. But for all the talk of how Google one day might no longer rely on links, this is a rare serious look at how Google might actually arrive at that future.

11 Comments

1 bill bean March 3, 2015 at 5:45 am

I have to go away and ponder “inadvertent ironic self-parody.”


2 Jeremy Niedt March 3, 2015 at 8:45 am

“So I’ll note, tangentially, that it’ll be interesting to see how it all works out for Google once Freebase is shuttered and Wikidata becomes the Knowledge Graph’s new BFF. Will classes simply then be derived from Wikidata, as they seemingly were from Freebase?”

Every time I read an article on the subject I see a lot of information about the migration of information, but no explicit statement on whether Wikidata will fulfill the same role. I wonder if Google is just going to rely on a now (with FB locking down) closed internal system that is informed by the mechanisms described in the papers. Meaning that though the data is moving to Wikidata, I'm not sure it will be factored in as largely as Freebase was.


3 Aaron Bradley March 4, 2015 at 6:15 pm

Thanks for your thoughtful comment Jeremy. I wonder precisely those same things. The original announcement contains some clues, and there’s now a Wikidata WikiProject “to coordinate the migration and use of data at Freebase” – but of course there’s no Google source that explicitly lays out how Google is going to populate their classes (and data about those class attributes) moving forward. Stay tuned, I guess.


4 Patrick Coombe March 8, 2015 at 12:07 am

I think Jeremy has a good point. I wonder the same things. The Knowledge Graph relies on Freebase and semantic data throughout the web, all of which Google would never be able to replicate without people. I’ve thought about the whole “closed internal system” and wondered why they haven’t. Liability is one of the biggest reasons I think. Inaccurate data? Not our fault! It’s the users that input it.


5 Alan Morrison March 4, 2015 at 2:05 pm


After reading your post, I looked at the paper describing the KBT effort, and it does say up front that the method is a probabilistic one. But there are aspects of the method that seem definite or at least based more on the closed rather than the open world assumption.

Ideally, the method should include a probabilistic re-definition of both facts and false assertions, one reason the authors of the paper wouldn’t have to be all that specific with a definition of a “fact”. A “fact” becomes a range on a continuum between likely factual and likely false.

PageRank ideally just becomes an objective, probabilistic assessment of the substance of the content, including evidence of facts or false assertions, utility, popularity vis-a-vis other content – a variety of factors. Fact checking against a trusted knowledge base would be part of this assessment.

With a probabilistic approach that presumably evolves, the open-world assumption wouldn’t be directly contradicted. Assertions would be more or less correct, and the rankings that result could change with the tides.

The KBT method as far as I understand it seems noble in its intent. Users shouldn’t have to wade through sites full of apparently false assertions in their search for reliable sources. Old-style PageRank didn’t address that problem.

Maybe the authors just need to be more nuanced in their thinking and their language than they have been to date, beginning explicitly with considerations of OWA and its role.


6 Aaron Bradley March 4, 2015 at 6:44 pm

Thanks for your insight-laden comment Alan, and in particular for confirming my own thoughts about the (ambiguous) application of the Open World Assumption in this approach.

The KBT method as far as I understand it seems noble in its intent. Users shouldn’t have to wade through sites full of apparently false assertions in their search for reliable sources. Old-style PageRank didn’t address that problem.

I think this excellent observation contrasting KBT and PageRank, in particular, deserves highlighting. If the veracity of facts (“truthiness”) is a perilous signal to use in ranking web resources, surely that’s true ten-fold for links.


7 Jesse Wojdylo March 5, 2015 at 4:48 pm

I just do not see how Google will be able to do this. I come from a very small area in the mountains of North Carolina. Much of the “news” and “facts” are hearsay that eventually gets published in the local newspaper. The town is so small and remote you really do not know what is factual and what is not. I have to call my parents on a weekly basis to confirm something that they saw with their own eyes.

If I built out a “news” related website for that small town in North Carolina I could immediately outrank the Graham Star (which is behind a paywall) and Google would assume I am providing facts? I live over five hours away and only make my way back to the town once every few months.

Watching some of these “southern” type TV shows on the Discovery Channel and History Channel I am left to wonder what some people deem as facts. While a search algorithm without links seems like a good idea I do not see it being a possibility.

How would The Onion, Buzzfeed, Huffington Post and Elite Daily stack up? Did Buzzfeed prove, with facts, the 22 best ways to get a college girl to like you?


8 Aaron Bradley March 5, 2015 at 5:41 pm

Thanks for your comment Jesse. I can’t stress enough that what Google considers to be a “fact” in the context of the Knowledge-Based Trust score is not a subjective judgment such as you or I might make based on our assessment of, say, the objects contained in an article in a local newspaper, but an algorithmically assigned score based on machine-extracted data statements (a “triple” in the form of a subject, predicate and object). It is certainly unconcerned with (to touch on your Buzzfeed example) the availability or absence of arguments to support an assertion on a webpage, but is rather concerned with the trustworthiness of a source as assessed by “the correctness of factual information” provided by that source.

I commend to you section 5.4 of the paper, and in particular the sub-section titled “High PageRank but low KBT”. Spoiler alert – TMZ does not fare well by the measure of KBT. 🙂


9 Jesse Wojdylo April 30, 2015 at 10:15 pm

Very, VERY interesting stuff in section 5.4. Some of these numbers go well beyond my scope of knowledge, but I do appreciate the research and resource.


10 Alan Morrison May 1, 2015 at 5:38 pm


I think your example of small-town news is telling, but I see what Google’s doing with KBT as just one stage in an evolution. If you think about what will happen a few years from now, there will be various graphs, not just the knowledge graph, and community interactions would be graphed just like anything else.

Long term, the connected graphs would be the primary means of disambiguating all the people, places and things named in articles. In other words, machines would be present, “conscious” and always monitoring and regraphing community interactions at web scale. They will be able to “recognize” a lot more than facts via the graphs. The graphs (essentially semantic metadata boils down to how this is related to that anyway, the essence of edges or graph connections) will be the means of contextual, machine-based understanding.

Your point about facts being insufficient in and of themselves is well taken, but the knowledge graph is what Google can rely on right now. The long-term ability to disambiguate would hinge on how much information is available online, and more and more info is coming online to inform more and more graphs.

That’s not to mention that news itself will be more and more interconnected as time goes on, and machines will be able to validate the connectedness of the authors to the communities they serve, take the temperature of the sentiment of their articles and comment threads, analyze the language of the articles, ferret out the dialect in quotations and other text, and help people filter on other characteristics unique to geographies.

I’m hoping services such as Google will immediately flag blatant plagiarism in a manner comparable to the way image checking is done: (This was a link Aaron shared on Twitter recently, I think.) If you’re building a base of understanding with timestamped triples or quads in the graphs, that suggests various ways to cross-check using machines.

Anyway, lots of ways to validate content will emerge. Facts are only the beginning, and facts are so foundational anyway.

We’ll be posting more during the week of May 4th on graph databases and their integration potential at


11 Ruben April 29, 2015 at 7:31 am

Interesting information.

