
Growing Resource: WebDataCommons.org

March 23, 2012

Dr. Christian Bizer recently reported, “We are happy to announce WebDataCommons.org.”

Following this teaser last week, Dr. Christian Bizer has reported, “We are happy to announce WebDataCommons.org, a joint project of Freie Universität Berlin and the Karlsruhe Institute of Technology to extract all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public. WebDataCommons.org provides the extracted data for download in the form of RDF-quads. In addition, we produce basic statistics about the extracted data.”

The website states, “Web Data Commons enables you to use structured data originating from hundreds of millions of web pages within your applications without needing to crawl the Web yourself. Pages in the Common Crawl corpora are included based on their PageRank score, thereby making the crawls snapshots of the current popular part of the Web. We have extracted and published structured data from both the 2012 and the 2009/2010 Common Crawl corpus. For the future, we plan to update the extracted datasets on a regular basis as new Common Crawl corpora are becoming available.”
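For readers wondering what working with RDF quads in an application might look like, here is a minimal Python sketch. It assumes the download is in the line-based N-Quads syntax and that the fourth element of each quad identifies the page the data was extracted from; neither detail is confirmed in the announcement quoted above. The sample lines, the example.org URLs, and the helper names parse_quad and quads_per_page are invented for illustration and do not come from the actual dataset.

```python
import re
from collections import Counter

# One N-Quads term: an IRI in angle brackets, a blank node, or a quoted
# literal with an optional language tag or datatype. This is a simplified
# pattern for illustration, not a complete N-Quads grammar.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD = re.compile(rf'^\s*{TERM}\s+{TERM}\s+{TERM}\s+{TERM}\s*\.\s*$')


def parse_quad(line):
    """Return (subject, predicate, object, graph), or None if the line does not match."""
    match = QUAD.match(line)
    return match.groups() if match else None


# Hypothetical sample data, made up for this sketch (not real extracted data).
SAMPLE = '''\
_:n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> <http://example.org/page1.html> .
_:n1 <http://schema.org/name> "Jane Doe"@en <http://example.org/page1.html> .
_:n2 <http://schema.org/name> "Acme" <http://example.org/page2.html> .
'''


def quads_per_page(text):
    """Count quads by their fourth element (assumed here to be the source page)."""
    counts = Counter()
    for line in text.splitlines():
        quad = parse_quad(line)
        if quad is not None:
            counts[quad[3]] += 1
    return counts


if __name__ == "__main__":
    for page, n in quads_per_page(SAMPLE).items():
        print(page, n)
```

For files of the size implied by hundreds of millions of pages, a streaming reader or a dedicated RDF library would likely be a better fit than a regular expression; the sketch is only meant to show the shape of the data.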


Image: Courtesy Flickr/ sjcockell
