
Common Crawl To Add New Data In Amazon Web Services Bucket

By Jennifer Zaino  /  March 13, 2012

The Common Crawl Foundation is on the verge of adding to its Amazon Web Services (AWS) Public Data Set of openly and freely accessible web crawl data. It was back in January that Common Crawl announced the debut of its corpus on AWS (see our story here). Now, a billion new web sites are in the bucket, according to Common Crawl director Lisa Green, adding to the 5 billion web pages already there.
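For readers who want to poke at the corpus directly, the crawl files sit in an ordinary Amazon S3 bucket that can be read without an AWS account. The snippet below is a minimal sketch of listing a few of those files with Python's boto3 client; the bucket name and prefix shown are assumptions for illustration (the bucket layout has changed over the years), so check Common Crawl's own documentation for the current paths.

```python
# Minimal sketch: list a few crawl files in the public S3 bucket.
# The bucket name and prefix below are assumptions for illustration,
# not taken from the article; consult Common Crawl's docs for the
# actual layout.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) access -- no AWS credentials required.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(
    Bucket="commoncrawl",   # assumed bucket name
    Prefix="crawl-data/",   # assumed prefix
    MaxKeys=10,
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```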

“When are you going to have new data is one of the most frequent questions we get,” she says. The answer is that processing is underway now, and she hopes the data will be ready to go this week.

WebDataCommons.org is moving ahead too. The project, run by the Freie Universität Berlin in cooperation with the Karlsruhe Institute of Technology, extracts all the microformat, microdata, and RDFa data contained in the Common Crawl corpus and provides the extracted data for free download. Dr. Christian Bizer, who heads up the Web-based Systems Group, reports that the team has “extracted all structured data from the 2010 crawl and are currently busy packaging the data for download and calculating basic statistics,” and will publish it as early as this week.
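To give a flavor of what that extraction involves, the sketch below pulls microdata items out of a small HTML fragment with BeautifulSoup. This is not WebDataCommons’ actual pipeline, which runs over the full corpus at scale; the library choice, sample markup, and field values are illustrative assumptions only.

```python
# Minimal sketch of microdata extraction, one of the structured-data
# formats WebDataCommons harvests from the crawl. Library choice
# (BeautifulSoup) and the sample markup are illustrative assumptions.
from bs4 import BeautifulSoup

html = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Lisa Green</span>
  <span itemprop="jobTitle">Director</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Find every element that opens a microdata item (has itemscope),
# then collect its itemprop name/value pairs.
for item in soup.find_all(attrs={"itemscope": True}):
    record = {"type": item.get("itemtype")}
    for prop in item.find_all(attrs={"itemprop": True}):
        record[prop["itemprop"]] = prop.get_text(strip=True)
    print(record)
```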

The project is also getting early access to the new Common Crawl data in a rawer form. The plan is to run the extraction against it and, Bizer says, “if all goes well, the new data should be on the website in the week of the 19th.”

Other agenda items for the team at Common Crawl, whose origins are in the MapReduce/Hadoop world, include work to make the data accessible to SQL users as well. Green can’t yet set a date for when that will happen, but she says the organization is excited to open the data up to more people.

Developers should also keep an eye out over the next couple of weeks for news of an upcoming Code Contest. Academic projects such as WebDataCommons are great, but Green is eager to showcase many more compelling examples of what can be done with the corpus. “When you talk about Common Crawl people think it’s a great resource, but then they have a little trouble thinking exactly what to do with it,” she says. Common Crawl will soon post a video announcing the contest, along with the categories for which the organization would like to see use cases and code.

There will be a panel of judges as well as a people’s choice award: all entries will be posted on the website so everyone can see them and vote for their favorites. “This way we put everything up and even if some are not very good, there will be many more possible sources of inspiration for people,” she says.

Green has plenty more on her plate – for instance, producing an index of the corpus and regular targeted crawl refreshes – but she needs to staff up the engineering team first. So, if you’ve got Hadoop and distributed computing skills, you may want to get in touch with Common Crawl.

About the author

Jennifer Zaino is a New York-based freelance writer specializing in business and technology journalism. She has been an executive editor at leading technology publications, including InformationWeek, where she spearheaded an award-winning news section, and Network Computing, where she helped develop online content strategies including review exclusives and analyst reports. Her freelance credentials include being a regular contributor of original content to The Semantic Web Blog; acting as a contributing writer to RFID Journal; and serving as executive editor at the Smart Architect Smart Enterprise Exchange group. Her work also has appeared in publications and on web sites including EdTech (K-12 and Higher Ed), Ingram Micro Channel Advisor, The CMO Site, and Federal Computer Week.
