Common Crawl To Add New Data In Amazon Web Services Bucket

The Common Crawl Foundation is on the verge of adding to its Amazon Web Services (AWS) Public Data Set of openly and freely accessible web crawl data. It was back in January that Common Crawl announced the debut of its corpus on AWS (see our story here). Now, a billion new web sites are in the bucket, according to Common Crawl director Lisa Green, adding to the 5 billion web pages already there.

“When are you going to have new data is one of the most frequent questions we get,” she says. The answer is that processing is underway now, and she hopes the data will be ready to go this week.

WebDataCommons, a project of the Freie Universität Berlin in cooperation with the Karlsruhe Institute of Technology that involves extracting all microformat, microdata and RDFa data that is contained in the Common Crawl corpus and providing the extracted data for free download, is moving ahead too. Dr. Christian Bizer, who heads up the Web-based Systems Group, reports that they have “extracted all structured data from the 2010 crawl and are currently busy packaging the data for download and calculating basic statistics,” and will publish it as early as this week.

The project also is getting early access to the new Common Crawl data in a more raw form; plans are to run the extraction and, Bizer says, “if all goes well, the new data should be on the website in the week of the 19th.”

Other agenda items for the team at Common Crawl, whose origins are in the MapReduce/Hadoop world, include work it’s doing to accommodate data access for the SQL set. Green says she can’t set a date yet for when this will happen, but she says the organization is excited to give access to more people.

And developers should keep an eye out in the next couple of weeks for news of an upcoming Code Contest. Academic projects such as WebDataCommons are great, but Green is eager to present many more compelling examples of what can be done with the corpus. “When you talk about Common Crawl people think it’s a great resource, but then they have a little trouble thinking exactly what to do with it,” she says. Common Crawl soon will be posting a video announcing the contest, along with the categories in which the organization would like to see strong use cases and code.

There will be a panel of judges but also a people’s choice award, so that everyone can see all of the entries on the web site and vote for their favorites. “This way we put everything up and even if some are not very good, there will be many more possible sources of inspiration for people,” she says.

Green’s got a lot more plans on her plate – for instance, producing an index of the corpus and some regular targeted crawl refreshes – but she needs to staff up the engineering team. So, if you’ve got Hadoop and distributed computing skills, you may want to get in touch with Common Crawl.