New York Times Prototypes a Linked Data Search Engine

by Aaron Bradley on March 4, 2013

in News Media, Semantic Web

On Beet.TV, Andy Plesser recently featured a short but fascinating video of Michael Zimbalist, Vice President of Research and Development Operations at the New York Times, talking with Joanna O'Connell of Forrester about a prototype linked data search engine being developed by the Times.

Zimbalist begins by talking about the great asset that is the New York Times Index, and the relationship between the Index's metadata and linked data.

For almost the entire life of the newspaper … we've been annotating that.  We've had the Times Index – you've probably seen it at the library, these big fat red books so you could go and look up the Index and find what issue of the Times and what section there was a story about that.  So that really provides the basis for some exceptionally rich metadata that's been consistent over the history of the organization.

So we're able to look at the text of articles as data, as very unstructured data, and begin to put some structure around it.  And we do it in a lot of really different ways.  So we have our metadata and our index now rationalized.  So all the terms in the index match up to all the terms in our metadata, and we've connected that to this linked data cloud.  There's this linked data movement that's trying to treat the entire web as this database of information.

Zimbalist goes on to talk about entities and how, for example, the New York Times may have one way of referencing the named entity "Barack Obama" and Amazon another.

And through this linked data movement we're able to say that their Barack Obama is the same as our Barack Obama and create these new editorial products that fuse these different bits of information.
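The linking Zimbalist describes is essentially the Semantic Web's `owl:sameAs` relation: an assertion that two different identifiers denote the same real-world entity, which lets records from independent sources be fused. A minimal plain-Python sketch of the idea (all identifiers and attributes here are invented for illustration, not drawn from any actual Times or Amazon dataset):

```python
# Two sources describe the same entity under different identifiers.
# (Identifiers and attributes are invented for illustration.)
nyt_records = {
    "nyt:person/barack_obama": {"occupation": "44th U.S. President"},
}
amazon_records = {
    "amzn:author/B001H6X8X0": {"books_listed": 12},
}

# owl:sameAs-style links: alias identifier -> canonical identifier.
same_as = {
    "amzn:author/B001H6X8X0": "nyt:person/barack_obama",
}

def merged_view(canonical_id):
    """Fuse attributes from every source whose identifier
    resolves (via same_as) to canonical_id."""
    view = dict(nyt_records.get(canonical_id, {}))
    for alias, target in same_as.items():
        if target == canonical_id:
            view.update(amazon_records.get(alias, {}))
    return view

print(merged_view("nyt:person/barack_obama"))
# One record combining attributes from both sources
```

In a real linked-data deployment the mapping would be expressed as RDF triples and queried with SPARQL rather than Python dictionaries, but the fusion logic is the same.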

So, for example, one of the things we have built as a tool, just as a prototype example of the kind of services we might be able to deliver in the future, is a search engine where you can type in a college or university and we will go look through the articles that we've written and surface any articles that mention alumni or alumnae from that university, even though we might not have mentioned in the article that that person attended that school. Because we're able to link up the person's name with a larger database of college graduates.
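The alumni search Zimbalist describes is, at its core, a join between the paper's article metadata and an external dataset. A hedged sketch of that join in plain Python (the article headlines, names, and schools below are invented examples, and the real prototype presumably works over RDF rather than in-memory lists):

```python
# Articles annotated with the people they mention (invented examples).
articles = [
    {"headline": "City Budget Passes", "people": ["Jane Doe"]},
    {"headline": "Tech Startup Raises Funds", "people": ["John Roe", "Jane Doe"]},
    {"headline": "Local Weather Report", "people": []},
]

# External linked dataset mapping each person to an alma mater.
alma_mater = {
    "Jane Doe": "Columbia University",
    "John Roe": "Cornell University",
}

def articles_mentioning_alumni(school):
    """Return headlines of articles mentioning any graduate of `school`,
    even though the articles themselves never name the school."""
    return [a["headline"] for a in articles
            if any(alma_mater.get(p) == school for p in a["people"])]

print(articles_mentioning_alumni("Columbia University"))
# Surfaces both articles mentioning Jane Doe
```

Note that neither article's text contains the string "Columbia University"; the connection exists only because the person entities are linked to an outside database, which is precisely the value of the linked-data approach.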

This is a great example of leveraging linked data to create an obviously useful product: it ties together disparate pieces of information that aren't available from any single source.

In response to a question from O'Connell, Zimbalist agreed that "normalization … does matter," but said that the bigger challenge was broad-based indexation of content.

I feel like the bigger problem is getting all this stuff indexed. Rationalizing among the indices seems like a more solvable problem than getting everybody who's publishing on the Internet to adequately index their content, but I don't know.

Some definite food for thought here for data producers: how useful is your data outside your organization if it isn't exposed? And while Zimbalist doesn't spell out the difference between "adequate" and "inadequate" indexation, any student of the semantic web could reasonably infer that "adequately" indexed data is ultimately some form of structured data.

Members of the "linked data movement" to which Zimbalist refers would likely also argue that maximum benefit is derived from data that is not only structured but open. In discussing the prototype engine, O'Connell makes an oblique reference to this in terms of the possible data ownership issues.

Well, I was just thinking about monetization challenges that it might cause, or rights challenges … how does that work when you have everything linked together like that? Doesn't it create a "who owns what and who gets paid for what" question?

While Zimbalist concurs that this is "an interesting question," one doesn't get the impression, as O'Connell suggests, that this is a question the Times thinks about "every day." To me, in fact, the prototype engine is something of a poster child for how useful applications may be built when structured data is made freely accessible to all data consumers. The degree to which an application harnessing these data may be monetized depends on that data existing in a consumable format in the first place; that is, linked open data is itself a precondition for creating products with monetization potential.

Watch the entire interview:
