
Wikilinks Corpus: What Will You Do With 40 Million Disambiguated Entity Mentions Across 10 Million-Plus Web Pages?

By Jennifer Zaino  /  March 11, 2013

Last Friday saw the release of the Wikilinks Corpus from Research at Google, 40 million entity mentions in context strong.

As explained in a blog post by Dave Orr, Amar Subramanya, and Fernando Pereira at Google Research, the Big Data set “involves 40 million total disambiguated mentions within over 10 million web pages — over 100 times bigger than the next largest corpus.” The mentions, the post relates, are found by looking for links to Wikipedia pages where the anchor text of the link closely matches the title of the target Wikipedia page. If each page on Wikipedia is thought of as an entity, the post explains, then the anchor text can be thought of as a mention of the corresponding entity.
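The post doesn’t publish the matching code itself, but the heuristic is easy to sketch. In the snippet below, the wikipedia_title and is_entity_mention helpers and the 0.8 similarity threshold are illustrative assumptions, not anything shipped with the corpus:

```python
# A minimal sketch of the anchor-text heuristic: a link counts as an entity
# mention when its anchor text closely matches the target Wikipedia page title.
from difflib import SequenceMatcher
from urllib.parse import unquote, urlparse


def wikipedia_title(url: str) -> str:
    """Extract the page title from a URL like http://en.wikipedia.org/wiki/Apple_Inc."""
    path = urlparse(url).path
    return unquote(path.rsplit("/", 1)[-1]).replace("_", " ")


def is_entity_mention(anchor_text: str, target_url: str, threshold: float = 0.8) -> bool:
    """Treat a link as a mention when anchor text and page title are close enough."""
    if "wikipedia.org/wiki/" not in target_url:
        return False
    title = wikipedia_title(target_url)
    similarity = SequenceMatcher(None, anchor_text.lower(), title.lower()).ratio()
    return similarity >= threshold


print(is_entity_mention("Apple Inc.", "http://en.wikipedia.org/wiki/Apple_Inc."))  # True
print(is_entity_mention("click here", "http://en.wikipedia.org/wiki/Apple_Inc."))  # False
```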

The data can be found in Google’s Wikilinks Corpus; tools and data with extra context can be found at UMass Wiki-links. UMass Amherst‘s Sameer Singh and Andrew McCallum are collaborators on the project. The team does note that users will need to do a little footwork to understand the corpus, as it can’t distribute the actual annotated web pages because of copyright issues – just an index of URLs, plus the tools to create the dataset, or whatever piece of it you want.
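That footwork amounts to walking the index and, if you want the surrounding text, fetching the pages yourself. The sketch below assumes a tab-separated URL/MENTION record layout (one URL line per source page, followed by MENTION lines carrying anchor text, byte offset, and target Wikipedia URL); check the field order against the dataset’s README before relying on it:

```python
# A minimal sketch of reading a Wikilinks index file under the assumed
# URL / MENTION record layout. The pages themselves are not redistributed,
# so any fetching of source documents is left to the reader.
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class PageMentions:
    url: str
    mentions: List[Tuple[str, int, str]] = field(default_factory=list)  # (anchor, offset, wiki_url)


def read_index(path: str) -> Iterator[PageMentions]:
    """Yield one PageMentions record per source URL in an index file."""
    current = None
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if parts[0] == "URL" and len(parts) >= 2:
                if current is not None:
                    yield current
                current = PageMentions(url=parts[1])
            elif parts[0] == "MENTION" and current is not None and len(parts) >= 4:
                current.mentions.append((parts[1], int(parts[2]), parts[3]))
    if current is not None:
        yield current


# Example usage (hypothetical file name):
# for page in read_index("wiki-links-part-001.txt"):
#     print(page.url, len(page.mentions))
```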

Freebase shared a post on Google+ about the announcement, to the effect that the huge dataset has “a lot of potential for folks building entity-aware apps. Essentially it lets your app tell the difference between ambiguous entity names like Apple (the fruit) and Apple (the company). Because they give Wikipedia URLs for each entity it’s easy to connect this to all of the Freebase APIs.” And, asks Freebase, “what would you do if you could scan any page on the web and know which Freebase entities it mentioned?”
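The connection Freebase has in mind is straightforward to sketch: normalize each mention’s Wikipedia URL into a title key, then hand that key to whatever entity API you’re using. The lookup_freebase_mid stub below is a hypothetical placeholder, not a real Freebase endpoint:

```python
# A minimal sketch of turning a mention's target Wikipedia URL into a lookup
# key that an entity-aware app could resolve against a knowledge base.
from urllib.parse import unquote, urlparse


def wikipedia_key(url: str) -> str:
    """Normalize a mention's target URL into a Wikipedia title key."""
    title = unquote(urlparse(url).path.rsplit("/", 1)[-1])
    return title.replace(" ", "_")


def lookup_freebase_mid(title_key: str) -> str:
    """Hypothetical stub: resolve a Wikipedia title key to an entity id."""
    raise NotImplementedError("wire this to the entity API of your choice")


# Disambiguation falls out of the URL itself: Apple the company vs. the fruit.
print(wikipedia_key("http://en.wikipedia.org/wiki/Apple_Inc."))  # Apple_Inc.
print(wikipedia_key("http://en.wikipedia.org/wiki/Apple"))       # Apple
```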

The Google Research team blog gives some ideas of how it sees the data potentially being used:

  • Look into coreference (when different mentions refer to the same entity) or entity resolution (matching a mention to the underlying entity)
  • Work on the bigger problem of cross-document coreference, which is how to find out if different web pages are talking about the same person or other entity (see a paper by Google Research on Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models here)
  • Learn things about entities by aggregating information across all the documents they’re mentioned in
  • Try type tagging, which assigns types to entities; the types could be broad, like person or location, or specific, like amusement park ride. To the extent that the Wikipedia pages contain the type information you’re interested in, it would be easy to construct a training set that annotates the Wikilinks entities with types from Wikipedia (see the sketch after this list).
  • Work on any of the above, or more, on subsets of the data. With existing datasets, it wasn’t possible to work on just musicians or chefs or train stations, because the sample sizes would be too small. But with 10 million Web pages, you can find a decent sampling of almost anything.
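To make the type-tagging idea concrete, here is a minimal sketch of building such a training set. The mentions list and the entity_types table are toy stand-ins; in practice the types would come from the Wikipedia pages themselves (infoboxes, categories, or a resource such as DBpedia):

```python
# A minimal sketch of pairing Wikilinks mentions with types read from their
# target Wikipedia pages, yielding (context, anchor, type) training examples.
from typing import Dict, List, Tuple

# (anchor text, sentence context, target Wikipedia title) -- toy examples.
mentions: List[Tuple[str, str, str]] = [
    ("Apple", "Apple released a new phone today.", "Apple_Inc."),
    ("apple", "She ate an apple with lunch.", "Apple"),
    ("Cyclone", "We rode the Cyclone at Coney Island.", "Coney_Island_Cyclone"),
]

# Hypothetical title -> type table standing in for data extracted from Wikipedia.
entity_types: Dict[str, str] = {
    "Apple_Inc.": "organization",
    "Apple": "food",
    "Coney_Island_Cyclone": "amusement_park_ride",
}


def build_training_set(mentions, entity_types):
    """Annotate each mention's context with the type of its target entity."""
    return [
        (context, anchor, entity_types[title])
        for anchor, context, title in mentions
        if title in entity_types
    ]


for example in build_training_set(mentions, entity_types):
    print(example)
```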

About the author

Jennifer Zaino is a New York-based freelance writer specializing in business and technology journalism. She has been an executive editor at leading technology publications, including InformationWeek, where she spearheaded an award-winning news section, and Network Computing, where she helped develop online content strategies including review exclusives and analyst reports. Her freelance credentials include being a regular contributor of original content to The Semantic Web Blog; acting as a contributing writer to RFID Journal; and serving as executive editor at the Smart Architect Smart Enterprise Exchange group. Her work also has appeared in publications and on web sites including EdTech (K-12 and Higher Ed), Ingram Micro Channel Advisor, The CMO Site, and Federal Computer Week.
