I have been giving more and more thought lately to the top two layers of the semantic web "layer cake" – "proof," and above it, "trust." The lower layers receive a lot of attention – RDF (and other data structures), OWL (and ontologies in general), SPARQL, URIs, and so forth – but trust and proof haven't been, I think, explored with the same depth as other aspects of the semantic web.
I came to wonder: what if those underlying lower layers came together? What if there was widespread adoption of these technologies, and heaps of well-structured, properly-linked data became available? In this sea of linked data, wouldn't there be some resources that deliberately misrepresented their origin, the data they provided, or both? And if so, what mechanisms might be employed to mitigate the negative impact of these "bad" linked data sources?
Based on these ruminations, I put together some questions on proof and trust and reached out to members of the semantic web community far more knowledgeable and experienced in such matters than myself. Three were kind enough to reply: my questions and their answers follow below. But first a little more context.
My journey to better understand trust and proof in some ways began when I encountered Crawling and Querying Linked Data [PDF], a presentation by one of the respondents, Andreas Harth, and his colleagues. In the linked data crawler architecture described in the presentation, new links are extracted and placed in a queue. The (at least to me) exciting prospect for information discovery through linked data crawling in turn had me pondering at what point proof and trust entered into the equation. That is, what proof and trust tests might be run against newly-discovered data, and at what juncture or junctures in the crawling, indexing, querying and ranking process?
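The discovery loop described in the presentation – extract links from fetched data, queue them, visit them in turn – can be sketched in a few lines. This is only my own illustration of the general pattern, not the architecture from the slides; `fetch` and `extract_links` are placeholder helpers.

```python
from collections import deque

def crawl(seed_uris, fetch, extract_links, max_uris=1000):
    """Breadth-first discovery of linked data resources.

    `fetch` dereferences a URI and returns its parsed triples;
    `extract_links` pulls candidate URIs out of those triples.
    Both are assumed helpers, not part of any real crawler API.
    """
    frontier = deque(seed_uris)   # queue of URIs awaiting a visit
    seen = set(seed_uris)         # avoid re-crawling the same URI
    indexed = {}                  # URI -> triples retrieved from it

    while frontier and len(indexed) < max_uris:
        uri = frontier.popleft()
        triples = fetch(uri)
        indexed[uri] = triples
        for link in extract_links(triples):
            if link not in seen:  # newly discovered resource
                seen.add(link)
                frontier.append(link)
    return indexed
```

Note that in this naive form every discovered URI is visited; the trust and proof questions below ask whether, and where, a filter should interrupt that loop.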
While there might be a paucity of discussion surrounding trust and proof in the semantic web, it is by no means non-existent. My own – by no means exhaustive – search turned up several interesting items. N. Henze's The Proof and Trust Layers of the Semantic Web [PDF] looks at logic, data sources and providers for "proof that an answer found in the Semantic Web is correct," as well as interoperability and scalability in trust and policy management. Matthijs Melissen et al. take up the logic issue, especially, in a discussion of Logic, Proof and Trust [PDF]. Aaron Swartz provides a succinct and very understandable overview in his The Semantic Web In Breadth.
Addressing the issue of potential semantic web "spam" more directly, Ian Davis (CTO of Talis) enumerated seven different possible linked data exploits in Linked Data Spam Vectors. Following on this we have Marie-Claire Jenkins' Semantic web spam: SemSpam, which seems to have coined the phrase "semspam." These are both seminal posts in the extremely small corpus of work (at least outside of academic circles) that talk about possible exploitation of linked data as an extension of "traditional" web spam techniques.
In all of these discussions the emphasis is on provenance and digital signatures, relating mostly to (as I understand it) linked data that's requested and delivered (that is, in an on-demand model where one is querying trusted, or at least known, data sets), rather than data that is discovered through the crawling of linked data. Which leads me to the two questions I posed, the first focused largely on discovery itself, the second largely on the ranking of discovered resources.
- One of the foundations of linked data is the discovery of new URIs through links. In the process of discovering new resources from links, what if a crawler encounters well-structured data that is suspect (misrepresents its topicality, provenance or resource type) or is purposely misleading (spam)? Is there any reason for a crawler not to visit a discovered URI? And once indexed, what trust tests should indexed URIs or indexed data collected from them undergo?
- Some work has been done on proof, but is confirming provenance (and metrics derived by confirming provenance) enough to produce “trustworthy” results? How would the inclusion of suspect or misleading URIs impact the ranking methodology of results returned by, say, a SPARQL query? Would a very large scale realization of open linked data require quality filters, or even complex layers of spam filtering like those employed by Google et al?
Andreas Harth (@aharth) works at the Institute AIFB at the Karlsruhe Institute of Technology. As his work centers around user searching and exploration of "collaboratively-edited web datasets," mostly "in the context of the SWSE (Semantic Web Search Engine) project," I couldn't hope for a better authority to comment on these trust and proof issues. Excuse the reiteration of my questions (in italics) for context.
Aaron: In a nutshell I’m taking a look near the top of the typical semantic web stack at the “trust” and “proof” layers, as they pertain to the crawling and indexing of resources discovered through the mechanisms of open linked data.
Andreas: Most work on Linked Data fits to the lower layers (URIs, HTTP, RDF and maybe SPARQL) of the Semantic Web stack. I think focusing on these basic layers and ignoring (at least for now) the higher layers helped to get the Linking Open Data movement off the ground. The whole Semantic Web stack can be intimidating to newcomers.
Aaron: One of the foundations of linked data is the discovery of new URIs through links – indeed your crawling presentation slides cite W3 about including "links to other URIs allowing agents (machines and humans) to discover more things". In the process of discovering new resources from links, what if a crawler encounters well-structured data that is suspect (misrepresents its topicality, provenance or resource type) or is purposely misleading (spam)? Is there any reason for a crawler not to visit a discovered URI? And once indexed, what trust tests should indexed URIs or indexed data collected from them undergo?
Andreas: My co-authors and I advocate the use of ranking techniques to assuage the effect of resources that are "not relevant" (see our IdRank paper at ISWC 2009). Our method is somewhat resilient to spam, at least to simple attacks.
I guess the fundamental thing is to include context (or provenance) – in its simplest form, the URI of the data source of a triple – to give users a chance to check where a given triple came from.
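That simplest form of context – keeping the source URI alongside each triple, effectively storing quads – might be sketched like this. The function names and store layout are my own illustration, not from any particular RDF framework:

```python
def add_quad(store, subject, predicate, obj, source_uri):
    """Record a triple together with the URI of the document it was
    retrieved from, so consumers can later check where any given
    statement came from."""
    store.setdefault((subject, predicate, obj), set()).add(source_uri)

def sources_of(store, subject, predicate, obj):
    """Return the set of data sources that asserted this triple."""
    return store.get((subject, predicate, obj), set())
```

This mirrors what named graphs (and serializations like N-Quads) provide natively: the fourth element carries the provenance.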
Aaron: Some work has been done on proof, but is confirming provenance (and metrics derived by confirming provenance) enough to produce “trustworthy” results? How would the inclusion of suspect or misleading URIs impact the ranking methodology of results returned by, say, a SPARQL query? Would a very large scale realization of open linked data require quality filters, or even complex layers of spam filtering like those employed by Google et al?
There is even the issue of data publishers accidentally providing wrong data. Our Pedantic Web group tries to assuage the issues coming from erroneous data. If you employ reasoning in a system you need to provide measures to counteract the effect of spam (which does not yet exist to a large degree) and of data sources that publish just plainly wrong data and axioms. We came up with the notion of an "authoritative source," which is trusted to provide data and axioms about a URI.
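A toy reading of that "authoritative source" notion: a document is authoritative for a term if dereferencing the term's URI leads to that document, so a random site cannot redefine someone else's vocabulary. This is a simplification of my own; the criterion used in the actual reasoning work is more involved.

```python
from urllib.parse import urldefrag

def is_authoritative(source_uri, term_uri):
    """A data source is treated as authoritative for a term if the
    term's URI, minus its fragment, resolves to that source document.
    Simplified illustration; real systems also handle redirects,
    slash namespaces, etc."""
    return urldefrag(term_uri)[0] == source_uri
```

Under this rule, axioms about `http://example.org/onto#Thing` found anywhere other than `http://example.org/onto` would simply be ignored by the reasoner.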
I'd hope that the SEO community can provide interesting data to the LOD cloud and thus improve the overall utility of Linked Open Data, but ultimately there will be spam. We'll deal with that once we encounter enough spam to mess up our systems.
Dr. Michael Hausenblas (@mhausenblas) is the co-ordinator of the Linked Data Research Centre (LiDRC) at DERI, where he is a postdoctoral researcher: a semantic web heavy hitter that I was very happy to hear from! Michael had this to say in reply to my questions:
- Don't blindly trust everything you encounter. Provenance (be it via graphs or other means) is essential and should always be taken into consideration.
- As my colleagues have argued for reasoning, the same applies for crawling IMO. Given that the provenance is known, one should apply heuristics to determine what to use (and/or involve a human in the loop to provide disambiguation, as we do in Sig.ma).
- Dataset-level technologies, such as voiD, can guide systems to assess sources both efficiently and effectively.
Juan F. Sequeda
Juan F. Sequeda (@juansequeda) is a Ph.D. student in the Department of Computer Sciences at the University of Texas at Austin, and co-founder of Semantic Web Austin. As his current research "involves integrating databases with the Semantic Web" I was very curious about his opinions on trust and proof; here's what Juan had to say in response to my questions:
I've struggled for a long time trying to explain these concepts. So after a long time of trying, I think I'm starting to pull it off. So what better way than to share what I've done?
I'm no expert in this topic, but I do have my opinion, and it's simple. All we need is a Google-like PageRank algorithm. That's it. The questions that you are posing right now are not so different from issues with spam and trust on the "web of documents." Are you going to trust www.foobarnews.com or www.cnn.com? The issue here is that you, as a human, make the decision. Furthermore, I can be an ***hole and write sh** about you and put it on my blog. Who is stopping me? I'm still being indexed by Google and may even rank high in the search results.
The main difference now is that instead of me clicking on a search result, on a semantic web, I will have a "semantic agent" who will do the clicking for me. So how is that semantic agent going to know what to "click" on? One way is for it to be social and learning. I'll "click" (or follow) URIs that my friends have followed before. Or I can have a white/black list of domains to follow i.e follow dbpedia, nytimes, etc. domains only.
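Juan's white/black-list idea for an agent could look like the sketch below. The host names are illustrative (he mentions dbpedia and nytimes as examples); none of this is from a real agent implementation.

```python
from urllib.parse import urlparse

# Illustrative whitelist, per Juan's dbpedia/nytimes example
TRUSTED_HOSTS = {"dbpedia.org", "data.nytimes.com"}

def should_follow(uri, trusted_hosts=TRUSTED_HOSTS, blocked_hosts=frozenset()):
    """Decide whether a semantic agent follows a discovered URI,
    using a simple domain white/black list."""
    host = urlparse(uri).netloc
    if host in blocked_hosts:   # explicit blacklist wins
        return False
    return host in trusted_hosts
```

The social-graph variant he describes would replace the static `TRUSTED_HOSTS` set with hosts (or URIs) that the agent's friends have followed before.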
The issue sometimes is that people think that the semantic web is a magical thing. It will solve all our problems. And if you read the Berners-Lee et al. Scientific American article of 2001, this is the idea you got. I believe it is completely futuristic and unreal.
So, in conclusion: we need a PageRank algorithm for linked data, which will help crawlers and agents know where to go and fight spam. We will then have semantic agents that do the querying for us, and when it comes to choosing a URI, an agent will look at its social graph to see what friends have done, or at a whitelist of domain names to follow.
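For readers unfamiliar with the algorithm Juan keeps invoking, here is a plain power-iteration PageRank over a link graph – a toy version, with no spam-specific adjustments, just to show how incoming links translate into rank:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively distribute rank over a link graph
    (node -> list of outgoing links). Each round, a node keeps a
    small baseline of rank and passes the rest, damped, to its
    link targets. Dangling nodes are not redistributed here, so
    this is a simplification of the full algorithm."""
    nodes = set(links) | {t for ts in links.values() for t in ts}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            if outs:
                share = damping * rank[n] / len(outs)
                for target in outs:
                    new[target] += share
        rank = new
    return rank
```

Applied to linked data, the nodes might be data sources (or URIs) and the edges the RDF links between them, so heavily-linked-to sources float upward in query results.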
I'm sure you can see the business opportunity here, right?
My Two Cents
In the paper on SOAR that both Andreas and Michael cite, this passage in the introduction stands out for me:
While there exists a large body of work in the area of reasoning algorithms and systems that work and scale well in confined environments, the distributed and loosely coordinated creation of a world-wide knowledge base creates new challenges.
Indeed, what I've encountered regarding provenance and digitally-signed resources is mostly in relation to "confined environments," and it makes me somewhat hesitant to say that primarily provenance-based measures will scale well (notwithstanding things like the authoritative reasoning described by SOAR to combat ontology hijacking). I agree with Davis who, in the conclusion to his article, says that "attack vectors can be countered through a whitelist provenance system, but they are not easy to scale." Indeed, if you look at enterprise search engines' consumption of HTML (and other semi-structured data) as an analogy for how one might approach trust issues in the large scale consumption of structured data, whitelisting certainly hasn't been used broadly. Google may blacklist sites that try to game its algorithms, but otherwise provisionally trusts new resources it discovers.
Some interesting papers on provenance in relatively "confined environments" include Using Semantic Web Technologies for Representing E-science Provenance [PDF] and, especially, Using Web Data Provenance for Quality Assessment [PDF]. Again, outside of these confines, I think using provenance becomes more difficult. A recent EOS paper, Data Citation and Peer Review [PDF] shows, as Tim Finin said in a post referencing it, "how far we still need to go w.r.t. formally capturing the provenance of data and information derived from it."
Like Andreas, I think that ranking techniques will ultimately be the best way of dealing with data relevance – including, as Juan suggests, the creation of spam-detection and spam-filtration algorithms as they become necessary. This seems to me preferable to throttling the discovery of linked data resources by imposing rules on crawling.
Thanks again to Andreas, Michael and Juan for their contributions!
UPDATE – 3 March 2011
As a result of a tweet from Ivan Herman I discovered a great resource with obvious relevance to this topic:
This is a page from the tRDF project (hosted on SourceForge), where the "tRDF framework provides tools to deal with the trustworthiness of RDF data." The project's subtitle is, indeed, "Tools for Trust in the Web of Data": I'm kind of embarrassed I missed this in my initial review.
The page cites 61 factors "which can be used to measure the criteria for a qualitative assessment of a data source." The broad evaluation categories used are content, representation, usage and system. While provenance plays a role, comprising three criteria in the verifiability sub-category under content, this framework in its totality looks at a far broader range of signals in assessing the quality of RDF data. The document actually uses the word "consumer" (as in, I think, "of data") and the author notes, interestingly, that "[a] way of linking relevancy and quality could be the display of quality values, as measured by the criteria above, in the results of a search engine."