by Angela Guess
Kay Ewbank reports in I-Programmer, “Thirteen Terabytes of anonymized user-news item interaction data has been made available for developers to use in machine learning applications. This is the largest ever set of data to be made available for general use. It began life as user-news interaction data, collected by recording the user-news item interactions of about 20 million Yahoo users from February 2015 to May 2015. The dataset contains around 100 billion events. The Yahoo News Feed dataset was drawn from the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate.”
Ewbank goes on, “Writing about the dataset, Suju Rajan of Yahoo Labs said: ‘Our goals are to promote independent research in the fields of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research. The dataset is available as part of the Yahoo Labs Webscope data-sharing program, which is a reference library of scientifically-useful datasets comprising anonymized user data for non-commercial use’.”
Ewbank adds, “In addition to the interaction data, Yahoo is providing a range of categorized demographic information for a subset of the anonymized users. The demographic information includes age range, gender, and generalized geographic data. On the item side, the dataset contains the title, summary, and key-phrases of the news article. The interaction data is timestamped with the relevant local time and also contains partial information about the device used to access the news feed.”
photo credit: Yahoo!