Following this teaser last week, Dr. Christian Bizer has reported, "We are happy to announce WebDataCommons.org, a joint project of Freie Universität Berlin and the Karlsruhe Institute of Technology to extract all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public. WebDataCommons.org provides the extracted data for download in the form of RDF-quads. In addition, we produce basic statistics about the extracted data."
The website states, "Web Data Commons enables you to use structured data originating from hundreds of millions of web pages within your applications without needing to crawl the Web yourself. Pages in the Common Crawl corpora are included based on their PageRank score, thereby making the crawls snapshots of the current popular part of the Web. We have extracted and published structured data from both the 2012 and the 2009/2010 Common Crawl corpus. For the future, we plan to update the extracted datasets on a regular basis as new Common Crawl corpora are becoming available."
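For readers wondering what working with the downloadable RDF-quads might look like in practice, here is a minimal Python sketch. It assumes the data arrives as N-Quads-style text, one statement per line, with the fourth element naming the page the statement was extracted from. The sample lines, the simplified term patterns, and the helper names `parse_quad` and `count_quads_per_host` are illustrative assumptions, not taken from the project; real data would be better served by a full RDF parser.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Simplified N-Quads term patterns (assumption: well-formed input, common cases only).
IRI = r'<[^>]*>'
BNODE = r'_:\S+'
LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?'

# subject predicate object graph .
QUAD = re.compile(
    rf'^\s*({IRI}|{BNODE})\s+({IRI})\s+({IRI}|{BNODE}|{LITERAL})\s+({IRI})\s*\.\s*$'
)


def parse_quad(line):
    """Return (subject, predicate, object, graph), or None if the line does not match."""
    match = QUAD.match(line)
    return match.groups() if match else None


def count_quads_per_host(lines):
    """Count statements per host name, taken from the graph (fourth) element."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is None:
            continue  # skip blank or malformed lines
        graph_iri = quad[3].strip('<>')
        counts[urlparse(graph_iri).netloc] += 1
    return counts


# Made-up sample lines for illustration only; not actual extracted data.
sample = [
    '_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> <http://example.com/item/1> .',
    '_:b0 <http://schema.org/name> "Blue \\"classic\\" mug"@en <http://example.com/item/1> .',
    '<http://example.org/me> <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/about> .',
    'not a quad',
]

for host, n in count_quads_per_host(sample).items():
    print(host, n)
```

Keeping the fourth element alongside each statement is what would let an application trace a fact back to the page it came from, or group statements by site as the sketch does.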
Image: Courtesy Flickr/sjcockell