Latest

Common Crawl Corpus Update Makes Web Crawl Data More Efficient, Approachable For Users To Explore

July 16, 2012  /  Big Data News, Articles, & Education, Data Blogs | Information From Enterprise Leaders, Data Education, Smart Data News, Articles, & Education

Common Crawl is now providing its 2012 corpus of web crawl data not just as .ARC files but also as metadata files (JSON-based metadata with all the links from every page crawled, metatags, headers and so on) and as text output. Semantic web projects that use its corpus include the work of […]

Common Crawl To Add New Data In Amazon Web Services Bucket

March 13, 2012  /  Big Data News, Articles, & Education, Data Blogs | Information From Enterprise Leaders

The Common Crawl Foundation is on the verge of adding to its Amazon Web Services (AWS) Public Data Set of openly and freely accessible web crawl data. Common Crawl announced the debut of its corpus on AWS back in January (see our story here). Now, a billion new web sites are in […]