
Nearly 800 Million Documents Served With Freebase Concept Annotations

By Jennifer Zaino  /  July 17, 2013

ClueWeb09 and ClueWeb12. They may sound like secret project names, but in fact they’re two datasets, both created by The Lemur Project to support research on information retrieval and related human-language technologies (ClueWeb12 is the successor to ClueWeb09). Today, news comes from Research at Google that it has undertaken Freebase annotations of the English-language web pages in the ClueWeb09 and ClueWeb12 corpora.

That adds up, it says, to nearly 800 million documents automatically annotated with over 11 billion references to Freebase entities.

There are 340,451,982 documents in ClueWeb09 and 456,498,584 documents in ClueWeb12 with at least one entity annotated. ClueWeb09 documents average about 15 annotated entity mentions each, and ClueWeb12 documents about 13. Google researchers Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya note that they optimized for precision over recall, estimating the former at about 80 to 85 percent and the latter in the range of 70 to 85 percent.
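As a quick sanity check, the per-corpus figures can be combined to see how they square with the headline numbers of nearly 800 million documents and over 11 billion entity references. The short Python sketch below is only a back-of-the-envelope calculation: the variable names are illustrative, and because the per-document averages are rounded ("about 15" and "about 13"), the mention total is an estimate rather than an exact count.

```python
# Documents with at least one annotated entity, as reported in the article.
docs_clueweb09 = 340_451_982
docs_clueweb12 = 456_498_584

# Approximate average annotated mentions per document (rounded figures,
# so the product below is an estimate, not an exact count).
avg_mentions_clueweb09 = 15
avg_mentions_clueweb12 = 13

total_docs = docs_clueweb09 + docs_clueweb12
estimated_mentions = (
    docs_clueweb09 * avg_mentions_clueweb09
    + docs_clueweb12 * avg_mentions_clueweb12
)

print(f"Total annotated documents: {total_docs:,}")
print(f"Estimated entity mentions: {estimated_mentions / 1e9:.1f} billion")
```

The document total comes to 796,950,566, and the estimated mention count lands at roughly 11 billion, in line with the figures Google cites.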

It’s another step in advancing a world of things, not strings, with the project automatically labeling these billions of phrases with appropriate Freebase MIDs. The ClueWeb data is used in various TREC (Text Retrieval Conference) tracks. This spring, Research at Google also released Freebase annotations of the TREC Million Query Track and Web Track queries. That annotated data (available here) could be useful in conjunction with the similarly annotated ClueWeb corpora.

[Image: an example of the Freebase annotations applied to ClueWeb12]

About the author

Jennifer Zaino is a New York-based freelance writer specializing in business and technology journalism. She has been an executive editor at leading technology publications, including InformationWeek, where she spearheaded an award-winning news section, and Network Computing, where she helped develop online content strategies including review exclusives and analyst reports. Her freelance credentials include being a regular contributor of original content to The Semantic Web Blog; acting as a contributing writer to RFID Journal; and serving as executive editor at the Smart Architect Smart Enterprise Exchange group. Her work also has appeared in publications and on web sites including EdTech (K-12 and Higher Ed), Ingram Micro Channel Advisor, The CMO Site, and Federal Computer Week.
