
Google Releases Linguistic Data based on NY Times Annotated Corpus

August 27, 2014

Photo of New York Times Building in New York City

Dan Gillick and Dave Orr recently wrote, “Language understanding systems are largely trained on freely available data, such as the Penn Treebank, perhaps the most widely used linguistic resource ever created. We have previously released lots of linguistic data ourselves, to contribute to the language understanding community as well as encourage further research into these areas. Now, we’re releasing a new dataset, based on another great resource: the New York Times Annotated Corpus, a set of 1.8 million articles spanning 20 years. 600,000 articles in the NYTimes Corpus have hand-written summaries, and more than 1.5 million of them are tagged with people, places, and organizations mentioned in the article. The Times encourages use of the metadata for all kinds of things, and has set up a forum to discuss related research.”

The post continues, “We recently used this corpus to study a topic called ‘entity salience’. To understand salience, consider: how do you know what a news article or a web page is about? Reading comes pretty easily to people: we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task? This problem is a key step towards being able to read and understand an article. One way to approach the problem is to look for words that appear more often than their ordinary rates.”
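To make that last idea concrete, here is a minimal Python sketch of scoring words by how much more often they appear in one article than in a background sample of text. It is an illustration only, not the method Gillick and Orr describe: the function names `tokenize` and `overrepresented_words`, the add-one smoothing, and the toy sentences are all assumptions made for this example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def overrepresented_words(document, background, top_n=5):
    """Rank the document's words by document rate / background rate.

    Hypothetical helper for illustration. The background rate uses
    add-one smoothing so words unseen in the background do not cause
    a division by zero.
    """
    doc_counts = Counter(tokenize(document))
    bg_counts = Counter(tokenize(background))
    doc_total = sum(doc_counts.values())
    bg_total = sum(bg_counts.values())
    vocab_size = len(set(doc_counts) | set(bg_counts))

    scores = {}
    for word, count in doc_counts.items():
        doc_rate = count / doc_total
        bg_rate = (bg_counts[word] + 1) / (bg_total + vocab_size)
        scores[word] = doc_rate / bg_rate

    # Highest ratio first: these words are the most over-represented.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Toy data, invented for this example.
article = "The mayor said the city budget will grow. The mayor praised the budget."
everyday_text = (
    "the cat sat on the mat and the dog said "
    "the weather will be fine in the city"
)

for word, score in overrepresented_words(article, everyday_text, top_n=3):
    print(f"{word}: {score:.2f}")
```

In this toy case, "mayor" and "budget" score well above "the", even though "the" is the most frequent word in the article, because "the" is just as common in ordinary text. A real system would need a much larger background sample and would work with entities rather than single words.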

Read more here.

Photo credit : Eric Franzon

About the author

Jelani Harper has written for a number of publications, both online and in print. He was a staff writer at both the Oakland Tribune and the San Mateo Times. He has written extensively about various aspects of IT and finance including business intelligence, cloud computing and cloud-based data, GPS, architecture, data management, and ERP.
