ClueWeb09 and ClueWeb12. They may sound like secret project names, but in fact they’re two datasets, both created by The Lemur Project to support research on information retrieval and related human-language technologies (ClueWeb12 is the successor to ClueWeb09). Today, news comes from Research at Google that it has annotated the English-language web pages of the ClueWeb09 and ClueWeb12 corpora with references to Freebase entities.
That adds up, it says, to nearly 800 million documents automatically annotated with over 11 billion references to Freebase entities.
There are 340,451,982 documents in ClueWeb09 and 456,498,584 documents in ClueWeb12 with at least one entity annotated, with ClueWeb09 documents averaging about 15 annotated entity mentions each and ClueWeb12 documents about 13. Google researchers Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya note that they optimized for precision over recall, estimating the former at about 80 to 85 percent and the latter in the range of 70 to 85 percent.
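As a quick sanity check, the headline totals can be roughly reproduced from the per-corpus figures. The sketch below is only back-of-the-envelope arithmetic: it assumes the "about 15" and "about 13" figures are per-document averages, and the variable names are illustrative, not part of the released data.

```python
# Annotated document counts as reported for each corpus.
docs_clueweb09 = 340_451_982
docs_clueweb12 = 456_498_584

# Approximate mentions per annotated document (assumed to be averages).
avg_mentions_09 = 15
avg_mentions_12 = 13

total_docs = docs_clueweb09 + docs_clueweb12
estimated_mentions = (docs_clueweb09 * avg_mentions_09
                      + docs_clueweb12 * avg_mentions_12)

print(f"{total_docs:,} annotated documents")
print(f"roughly {estimated_mentions / 1e9:.1f} billion entity mentions")
```

The sum comes to just under 800 million documents and the estimate lands a little above 11 billion mentions, in line with the figures quoted above.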
It’s another step in advancing a world of things, not strings, with the project automatically labeling these billions of phrases with appropriate Freebase MIDs. The ClueWeb data is used in various TREC (Text Retrieval Conference) tracks. This spring, Research at Google also released Freebase annotations of the TREC Million Query Track and Web Track queries. That annotated data (available here) could be useful in conjunction with the similarly annotated ClueWeb corpora.
Here’s an example of the work with ClueWeb12: