Common Crawl is now providing its 2012 corpus of web crawl data not just as .ARC files: it is also releasing metadata files (JSON-based metadata with all the links from every crawled page, meta tags, headers and so on) as well as plain text output.
With the metadata files, users don't have to extract the link graph from the raw crawl, which, says Common Crawl Chief Architect Ahad Rana, is "pretty significant for the community. They don't have to expend all this CPU power to extract the links. And metadata files are a much smaller set of data than the raw corpus." Similarly, the full text output that users can now run analysis over is significantly smaller than the raw .ARC content.
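To illustrate the savings Rana describes, here is a minimal sketch of pulling outbound links straight from a JSON metadata record instead of parsing raw crawl content. The `url` and `links` field names are assumptions for illustration only; the actual schema of Common Crawl's metadata files should be checked against the corpus documentation.

```python
import json

def outbound_links(metadata_line):
    """Parse one JSON metadata record and return its outbound links.

    The "links" field name is hypothetical; consult the actual
    Common Crawl metadata schema before relying on it.
    """
    record = json.loads(metadata_line)
    return record.get("links", [])

# A toy record standing in for one line of a metadata file.
sample = json.dumps({
    "url": "http://example.com/",
    "links": ["http://example.org/a", "http://example.net/b"],
})

# Each (source, target) pair is one edge of the link graph.
edges = [("http://example.com/", target) for target in outbound_links(sample)]
```

Because the link graph is already materialized in the metadata, building edges is a cheap parse rather than a full-corpus extraction job.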
Earlier this year Common Crawl (which we first covered here) did an initial run of 2012 data to produce the .ARC, text output and metadata files, but this marks the official release, and providing the three sets of files will be Common Crawl's approach going forward. Common Crawl is also embedding the location of each page's raw content within the .ARC files in the metadata, so that additional analysis can be done quickly and cost-effectively. This way, users can narrow retrievals down to just the seekable .ARC content they want to analyze further. Similarly, with the text output, rather than having to go through the full corpus of raw files, users who want to analyze text only – say, to get an idea of the languages used on the web – can do just that. In that instance, they can simply run a language detector to see what language each web page originated in.
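The value of those embedded locations is that a record can be read directly, without scanning the file around it. The sketch below simulates this with an in-memory file; in practice the offsets and lengths would come from the metadata files, and the exact field names and retrieval mechanics (e.g. S3 range requests) are not specified here.

```python
import io

def fetch_record(arc_file, offset, length):
    """Read a single record from a seekable .ARC file given its
    byte offset and length, without scanning the whole file."""
    arc_file.seek(offset)
    return arc_file.read(length)

# Simulate a .ARC file in memory: three concatenated records.
records = [b"record-one\n", b"record-two\n", b"record-three\n"]
arc = io.BytesIO(b"".join(records))

# Offsets like these would be taken from the metadata files.
offset = len(records[0])   # where the second record starts
length = len(records[1])
second = fetch_record(arc, offset, length)
```

The same pattern maps onto ranged reads against the corpus in S3, so users pay only for the bytes they actually need.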
It's also easier to use the text output rather than the raw data to gain insight into a variety of other things, such as generating an n-gram model of the most popular words on the web, or finding which words are most associated with which others. Recently, for instance, one Common Crawl user found that 22 percent of web pages mentioned the word Facebook.
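An analysis like that Facebook figure reduces, in essence, to counting the pages whose text contains a term. This toy sketch runs over an in-memory list standing in for the text output of a crawl segment; the real study's methodology (tokenization, deduplication, and so on) is not known here.

```python
def fraction_mentioning(pages, term):
    """Return the fraction of pages whose text contains the
    given term, matched case-insensitively."""
    term = term.lower()
    hits = sum(1 for text in pages if term in text.lower())
    return hits / len(pages)

# Toy stand-in for the plain-text output of a crawl segment.
pages = [
    "Follow us on Facebook for updates.",
    "A page about gardening.",
    "Share this on facebook or by email.",
    "Plain text with no social links.",
]
share = fraction_mentioning(pages, "facebook")  # 0.5 on this sample
```

At corpus scale the same counting would be distributed (for example as a map-reduce job), but the per-page work stays this simple because the text has already been extracted.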
“All this is an attempt to make data more approachable and efficient to extract subsets from,” says Rana. “As we continue to crawl all types of data on the web – the good, the bad, you name it – whether people want to run a bulk job or a more targeted job, they have the full options.”
Also, with the 2012 corpus, Common Crawl crawls every home page on every pass to check for changes. "As we discover new domains, at minimum we crawl the home page, so hopefully the corpus will have a representation of at least some sampling of whatever the domain is serving up," says Rana. "The second thing is there is a more concerted effort to crawl blogs more aggressively," since that is where a lot of content is changing, in comparison, for example, to business home pages, where things are pretty static.
An aggressive goal for Common Crawl is to push data every month or, at the least, every quarter. Taking advantage of Amazon's public data set bucket has helped relieve some of its previous constraints, "so there's hope to grow the corpus significantly this year," he says. One note is that the 2012 crawl was restarted from scratch, so it might initially be smaller than the 5-billion-page crawl of earlier data.
Common Crawl is also taking on some other tasks, such as talking to more educators about integrating it into their classrooms, says director Lisa Green. "That's exciting. Everyone talks about the shortage of data scientists and the need for real-world data in the classroom, and they need these techniques." Dividing the corpus into metadata, text output and raw data will make it much more convenient and cost-effective for educators and others to start using it, Green and Rana believe.
Spending $100 on compute time vs. $500 (the cost of the aforementioned Facebook study, done using the .ARC files) is a big difference, and, as "the raw corpus grows, even though compute time is cheap on Amazon, with over 10 billion sets of documents, that still costs some money to do basic-level analysis," Rana says. "We want start-ups to be able to do stuff on top of data, and even individuals [to do the same]."
There’s also talk underway about putting copies of the corpus in other places – one of them being the Open Cloud Consortium, a nonprofit organization managing and operating cloud computing infrastructure that supports scientific, environmental, medical and health care research.
Finally, expect to hear this week about Common Crawl sponsoring a code contest, starting on the 18th and running in distributed, asynchronous fashion over six weeks. Common Crawl is asking participants to use the web crawl data to explore social impact analysis (for instance, what are the most common words on the web to follow Barack Obama or Mitt Romney?) and job trends (such as using keywords in online resumes and semi-structured information from job postings to figure out which skills and jobs tend to cluster in which areas), using the 2012 data, the older web crawl data, or both.
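The "most common words to follow a name" question is a simple collocation count over the text output. The sketch below is a naive, single-machine stand-in run over toy sentences; a contest entry against the full corpus would distribute the same counting and handle multi-word names, which this single-token version does not.

```python
from collections import Counter

def words_following(texts, target):
    """Count the word that immediately follows each occurrence
    of a target word across a collection of texts."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for i, word in enumerate(words[:-1]):
            if word == target:
                counts[words[i + 1]] += 1
    return counts

# Toy stand-in for page text from the crawl.
texts = [
    "obama wins the debate",
    "obama speaks in ohio",
    "romney wins in iowa",
    "obama wins again",
]
top = words_following(texts, "obama").most_common(1)  # [("wins", 2)]
```

Extending the window from one following word to n words gives the n-gram associations mentioned earlier in the article.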