Christian Bizer and Robert Meusel of the Web Data Commons project today announced the release of a new WebDataCommons dataset: "The dataset has been extracted from the latest version of the Common Crawl. This August 2012 version of the Common Crawl contains over 3 billion HTML pages which originate from over 40 million websites (pay-level-domains). Altogether we discovered structured data within 369 million HTML pages contained in the Common Crawl corpus (12.3%). The pages containing structured data originate from 2.29 million websites (5.65%). Approximately 519 thousand of these websites use RDFa, while 140 thousand websites use Microdata. Microformats are used on 1.7 million websites."
Bizer and Meusel noted, "Basic statistics about the extracted dataset as well as the vocabularies that are used together with each encoding format are found at: http://www.webdatacommons.org/2012-08/stats/stats.html. Additional statistics that analyze top-level domain distribution and the popularity of the websites covered by the Common Crawl, as well as the topical domains of the embedded data are found at: http://www.webdatacommons.org/2012-08/stats/additional_stats.html. The overall size of the August 2012 WebDataCommons dataset is 7.3 billion quads. The dataset is split into 1,416 files each having a size of around 100 MB. In order to make it easier to find data from a specific website or top-level-domain, we provide indexes about the location of specific data within the files."
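For readers who want to work with one of the files directly, the following is a minimal sketch of how quads might be grouped by website. It assumes the files are serialized as N-Quads, with the fourth term of each statement holding the URL of the page the data was extracted from; the announcement itself only says "quads". The function names and sample lines are hypothetical, and the regular expression is a simplification, not a complete N-Quads parser.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Matches the last IRI term before the closing dot of a statement.
# Assumption: every line is a quad, so this last IRI is the graph label
# (the source page URL). A plain triple whose object is an IRI would be
# misread as a graph label by this simplified pattern.
_GRAPH_RE = re.compile(r"<([^<>]*)>\s*\.\s*$")


def host_of_quad(line):
    """Return the host of the graph IRI of one N-Quads line, or None."""
    match = _GRAPH_RE.search(line)
    if match is None:
        return None
    return urlparse(match.group(1)).netloc or None


def count_quads_per_host(lines):
    """Count how many quads each host contributes."""
    counts = Counter()
    for line in lines:
        host = host_of_quad(line.strip())
        if host:
            counts[host] += 1
    return counts


# Hypothetical sample lines, not taken from the actual dataset.
sample = [
    '<http://example.org/a#me> <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/a> .',
    '_:b0 <http://schema.org/name> "Widget" <http://shop.example.com/item/1> .',
    '<http://example.org/b#x> <http://schema.org/url> <http://example.org/b> <http://example.org/b> .',
]

for host, n in count_quads_per_host(sample).most_common():
    print(host, n)
```

Scanning all 1,416 files this way would be slow, which is presumably why the project publishes indexes; a sketch like this is mainly useful for inspecting a single file once the index has pointed to it.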
Image: Courtesy Web Data Commons