The Common Crawl Foundation’s repository of openly and freely accessible web crawl data is about to go live as a Public Data Set on Amazon Web Services. The non-profit Common Crawl is the vision of Gil Elbaz, who founded Applied Semantics and built the AdSense technology for which Google acquired it, as well as the Factual open data aggregation platform. It counts Nova Spivack -- who has been behind semantic services from Twine to Bottlenose -- among its board of directors.
Elbaz’s goal in developing the repository: “You can’t access, let alone download, the Google or the Bing crawl data. So certainly we’re differentiated in being very open and transparent about what we’re crawling and actually making it available to developers,” he says.
“You might ask why is it going to be revolutionary to allow many more engineers and researchers and developers and students access to this data, whereas historically you have to work for one of the big search engines…. The question is, the world has the largest-ever corpus of knowledge out there on the web, and is there more that one can do with it than Google and Microsoft and a handful of other search engines are already doing? And the answer is unquestionably yes.”
Common Crawl’s data is already stored on Amazon’s S3 service, but now Amazon will provide the storage space for free through the Public Data Set program. Not only does that relieve Common Crawl of the storage burden and costs of hosting its crawl of 5 billion web pages -- some 50 to 60 terabytes -- but it should also make the data easier for users to access and remove the bandwidth-related costs they might otherwise incur for downloads. Users won’t have to deal with setting up accounts, bandwidth bills, or complex authentication processes.
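Because the corpus sits on S3 as gzipped crawl files, a developer can stream records one at a time rather than downloading the whole archive first. The sketch below is illustrative only: it parses a deliberately simplified, ARC-style record layout (a header line ending in the payload length, then the payload) from an in-memory gzipped buffer; the field order and record format here are assumptions for demonstration, not the corpus’s actual on-disk specification.

```python
import gzip
import io

def iter_arc_records(stream):
    """Yield (header_fields, payload) pairs from an ARC-style stream.

    Simplified sketch: each record is one space-separated header line
    whose last field is the payload size in bytes, followed by the
    payload itself. Real crawl files have a version block and stricter
    rules, which this toy parser ignores.
    """
    while True:
        header = stream.readline()
        if not header:
            break
        line = header.decode("utf-8").strip()
        if not line:
            continue
        fields = line.split(" ")
        length = int(fields[-1])   # last header field is the payload size
        payload = stream.read(length)
        stream.readline()          # consume the record separator newline
        yield fields, payload

# Tiny in-memory example: one gzipped record, since the crawl files
# themselves are stored gzipped on S3.
record = b"http://example.com/ 93.184.216.34 20120101000000 text/html 13\n<html></html>\n"
buf = io.BytesIO(gzip.compress(record))
with gzip.open(buf) as f:
    for fields, payload in iter_arc_records(f):
        print(fields[0], payload)
```

The same loop would work unchanged over a file object streamed from S3, which is the point of hosting the data there: the consumer never needs a local copy of the full corpus.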
Elbaz notes that while the 5-billion-page crawl met its goal -- it is larger than any other corpus that has been made openly available -- he would also like to see it grow. “Another factor in choosing 5 billion was that number fit into our budget,” he says. Now, with the Amazon arrangement, “there’s one less reason not to shoot for the stars and keep crawling and get much more data.”
More Visibility, More Value
This enhanced relationship should also make the crawl’s existence visible to more of the world, increasing the chances that the researcher with the next incredible algorithm or the entrepreneur with an inspired startup idea will find it and use it to help move society or the economy forward. “We’re just at the tip of the iceberg, as Google has said itself, in terms of extracting all of the knowledge that is there in patterns or that needs to be deciphered or learned in this amazing body of knowledge,” Elbaz says.
What might the next layer of value those researchers and entrepreneurs add to the Internet look like? One example is TinEye, a reverse image search engine from Idée Inc. that uses image identification technology rather than keywords, metadata or watermarks to find out where a submitted image came from, how it is being used, whether modified versions exist, and whether higher-resolution versions are available. Using the Common Crawl data, it launched this new kind of search service before Google did something similar, Elbaz notes.
Not every innovation will be immediate. Elbaz expects that Common Crawl, and efforts that build upon its data, may also help lay the foundation for things that will become valuable to applications over time. He posits, for example, a developer building a restaurant search app who now has access to far more resources: using Common Crawl directly for free, looking up clean, structured Global Places data from Factual (for which the pre-existing open crawl is one of multiple sources) for a small fee, or leveraging a project underway by researchers at the Web-based Systems Group at the Freie Universität Berlin in cooperation with the Karlsruhe Institute of Technology.
This project involves extracting all microformat, microdata and RDFa data contained in the Common Crawl corpus and providing the extracted data for free download in the form of RDF quads as well as CSV tables for common entity types -- location, product and so on. The extracted data will be published on WebDataCommons.org, according to Dr. Christian Bizer, who heads up the Web-based Systems Group.
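The kind of extraction the Berlin and Karlsruhe researchers describe can be sketched in miniature. The parser below is a hypothetical, heavily simplified stand-in for such a pipeline, written with Python’s standard html.parser: it pulls itemtype/itemprop microdata out of a page and emits flat (type, property, value) triples that could then be serialized as quads or CSV rows. Real-world microdata nesting and the sample restaurant markup are glossed over or invented here for illustration.

```python
from html.parser import HTMLParser

class MicrodataExtractor(HTMLParser):
    """Collect (item_type, property, value) triples from HTML microdata.

    Deliberately minimal: each itemprop is paired with the nearest
    enclosing itemscope's itemtype, and the tag's text content is taken
    as the value. Nested items and non-text values are ignored.
    """
    def __init__(self):
        super().__init__()
        self.triples = []
        self._scope_stack = []
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)                 # bare attributes map to None
        if "itemscope" in attrs:
            self._scope_stack.append(attrs.get("itemtype") or "")
        if "itemprop" in attrs:
            self._current_prop = attrs["itemprop"]

    def handle_data(self, data):
        if self._current_prop and self._scope_stack and data.strip():
            self.triples.append(
                (self._scope_stack[-1], self._current_prop, data.strip())
            )
            self._current_prop = None

# Invented sample markup of the sort such a pipeline would encounter.
page = """
<div itemscope itemtype="http://schema.org/Restaurant">
  <span itemprop="name">Blue Plate Diner</span>
  <span itemprop="addressLocality">Berlin</span>
</div>
"""
parser = MicrodataExtractor()
parser.feed(page)
print(parser.triples)
```

Run over billions of pages, the output of a job like this is exactly the sort of pre-extracted structured data that WebDataCommons.org aims to make downloadable, so downstream developers never have to touch the raw HTML.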
“Each additional resource creates that much more productivity and manifests itself in better and better consumer applications,” Elbaz says.
Bring On the Structure, and Big Data Too
A resurgence of interest in structured data and semantics is being driven by the success of Apple’s Siri technology, and the desire of others to see how they might leverage voice input for search, too. “Siri is just the first offering in a long chain of better and better technologies that will more closely resemble the AI we’ve seen in the movies,” Elbaz says.
When it comes to a resurgence in search -- whether through new input methods beyond Google-style queries, verticalized search, or more contextually aware services -- “the opportunities are endless. But," he says, "the crawl data will definitely be one key resource that people can take advantage of because it will let them test out ideas on the first day in their garage, rather than crawl for six months and then on the 181st day test out this new technology. It will completely change the efficiency of a startup’s ability to create a proof of concept with $25,000 and then try to get the next level of funding.”
Elbaz has a number of reasons for his decision to make Common Crawl a non-profit, and one of them is to provide a key educational tool. “What better lab in college can there be than to be given access to the world body of knowledge and asked to build some theory around that,” he says.
But equally if not more profound is the prospect of educators using the Common Crawl corpus to teach the next generation about Big Data. “It’s easier to teach it if you have a huge amount of Big Data that can be made accessible,” he says. Today, most students aren’t gaining these skills in college, and more of them need to, given the expertise shortage in this area. “It’s really hard to find experts,” says Elbaz, who also notes how lucky Common Crawl was to persuade one of them, Google engineer Ahad Rana, to take on the job of single-handedly building a crawler that runs fast at scale.
At the same time, adding the crawl to Amazon’s Public Data Sets can also help more people become familiar with AWS, which is itself increasingly a required proficiency. It’s another interesting set of data for them to play with while gaining expertise.
“As companies get used to the notion of just starting these jobs using different libraries that automatically spin up on a hundred Amazon machines, developers are getting used to the fact that if you do it that way the answer comes back a hundred times faster once you’re using Hadoop infrastructure,” he says. “If the data is already on an Amazon cluster you don’t have to wait for the time it takes for the data to move from one data center to another. It’s getting more and more embedded into the way developers work.”
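The workflow Elbaz describes -- fanning a job out across a hundred machines with Hadoop -- follows the MapReduce programming model, which can be sketched in a few lines of pure Python. The toy page records and the hostname-counting job below are invented for illustration; on real infrastructure the map, shuffle and reduce steps would run distributed across the cluster holding the crawl, not in one process.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Toy stand-in for crawl records: (url, page_bytes) pairs.
pages = [
    ("http://example.com/a", b"<html>...</html>"),
    ("http://example.com/b", b"<html>...</html>"),
    ("http://other.org/", b"<html>...</html>"),
]

def map_phase(record):
    """Emit (hostname, 1) for each page, as a Hadoop mapper would."""
    url, _body = record
    yield urlparse(url).hostname, 1

def reduce_phase(key, values):
    """Sum the counts for one key, as a Hadoop reducer would."""
    return key, sum(values)

# Shuffle step: group mapper output by key before reducing. In Hadoop
# this grouping happens across the network between the two phases.
grouped = defaultdict(list)
for record in pages:
    for key, value in map_phase(record):
        grouped[key].append(value)

counts = dict(reduce_phase(k, vs) for k, vs in grouped.items())
print(counts)  # hostname -> page count
```

Because mapper and reducer only see one record or one key at a time, the same logic scales from three tuples in memory to billions of pages on a cluster -- which is why having the data already co-located with the compute, as Elbaz notes, matters so much.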
Making crawl data available to the masses is a mission for Elbaz, and a pleasure, too. "I've always been a data head," he says, "so this also is something that's incredibly fun to work on."