Yahoo Releases Machine Learning Dataset for Academic Researchers
- By Dian Schaffhauser
- 01/20/16
Academic researchers now have free access to a sizable new dataset intended to expand the scientific world's understanding of Web sciences. Yahoo Labs released the "Yahoo News Recommendation" dataset, which consists of data on 110 billion events and takes up 13.5 terabytes uncompressed.
Already the data has been used for research on the "effects of bid-pulsing on keyword performance in search engines" and the evaluation of "automatic image annotation using human descriptions at different levels of granularity."
The information inside the dataset is completely anonymized and maintained in the Yahoo Labs Webscope data-sharing program, the company reported. It's made up of user-content interactions for about 20 million users during the period from February 2015 to May 2015. The data includes the title, summary and key phrases for news articles, as well as a timestamp for the local time and partial information about the device on which the user accessed the news feeds. The dataset also has categorized demographic information such as age range, gender and generalized geographic data for a subset of users. All of it derives from the news feeds of several of Yahoo's content sites: the homepage, news, sports, finance, movies and real estate.
To gain access to the Webscope data, the requester must be a faculty member, research employee or student from an accredited institution, and the request must come from a .edu or university domain email address.
"Our goals are to promote independent research in the fields of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research," wrote Suju Rajan, a director of research for "personalization science" at Yahoo Labs, in a blog article about the data.
Rajan's group has used a full-scale version of the dataset to research behavior modeling, recommender systems and large-scale and distributed machine learning, among other topics.
"We hope that this data release will similarly inspire our fellow researchers, data scientists, and machine learning enthusiasts in academia, and help validate their models on an extensive, 'real-world' dataset," she added.
About the Author
Dian Schaffhauser is a former senior contributing editor for 1105 Media's education publications THE Journal, Campus Technology and Spaces4Learning.