Common Crawl, the non-profit organization creating a repository of openly and freely accessible web crawl data, is getting a present from search engine provider blekko. It’s donating its metadata on search engine ranking for 140 million websites and 22 billion webpages to Common Crawl.
“The blekko data donation is a huge benefit to Common Crawl,” Common Crawl director Lisa Green told The Semantic Web Blog. “Knowing what the blekko team is crawling and how they rate those pages allows us to improve our crawler and enrich our corpus for high-value webpages.”
Blekko and Common Crawl share a vision of a more transparent web – not necessarily a common feature among search engine providers. As Common Crawl founder Gil Elbaz told The Semantic Web Blog earlier this year (see story here), “you can’t access, let alone download, the Google or the Bing crawl data. So certainly we’re differentiated in being very open and transparent about what we’re crawling and actually making it available to developers.”
That mesh of philosophies makes for a good match. “Everyone in the Common Crawl community is excited to be working with blekko,” Green says. “The data that blekko donated will be tremendously helpful to Common Crawl’s efforts, and also it is inspiring to collaborate with a company that lives by its ideals of openness and transparency.”
The donation comprises over 80 terabytes of spam-filtered ranking metadata gathered from crawling websites between February 2012 and November 2012. Blekko’s blog notes that Common Crawl will be able to use blekko’s metadata to help improve its crawl while avoiding spam, porn and the influence of excessive SEO. The insight Common Crawl gains from the data, Green says, means it doesn’t have to apply substantial engineering resources to the task itself. “Engineering resources…can now be applied to improving other aspects of Common Crawl,” she says. “If more companies worked in this open and collaborative manner there would be a surge of innovation throughout the research and tech industries.”
Speaking of innovation, in case you missed it: last month Common Crawl, in conjunction with SARA, an independent organization that supports scientific research, announced the Norvig Web Data Science Award, named in honor of Peter Norvig, who is known for his work in Internet search, artificial intelligence, NLP and machine learning. Students and researchers at public universities in the Netherlands have the opportunity to show what cool things they can create using Common Crawl data.
This follows on the heels of Common Crawl’s Code Contest, the winners of which were announced earlier in the fall.