Yahoo Releases Machine Learning Dataset for Academic Researchers -- Campus Technology

Yahoo Releases Machine Learning Dataset for Academic Researchers

By Dian Schaffhauser
01/20/16

Academic researchers now have free access to a sizable new dataset for the purposes of expanding the scientific world's understanding of Web sciences. Yahoo Labs released the "Yahoo News Recommendation" dataset, which consists of data on 110 billion events, taking up 13.5 terabytes in its uncompressed format.

Already the data has been used for research on the "effects of bid-pulsing on keyword performance in search engines" and the evaluation of "automatic image annotation using human descriptions at different levels of granularity."

The information inside the dataset is completely anonymized and maintained in the Yahoo Labs Webscope data-sharing program, the company reported. It's made up of user content interactions for about 20 million users during the period from February 2015 to May 2015. The data includes title, summary and key phrases for news articles, as well as a timestamp for the local time and partial information about the device on which the user accessed the news feeds. The dataset also has categorized demographic information such as age range, gender and generalized geographic data for a subset of users. All of it derives from the news feeds of several of Yahoo's content sites: the homepage, news, sports, finance, movies and real estate.

To gain access to the Webscope data, the requester must be a faculty member, research employee or student from an accredited institution, and the request must come from an .edu or university domain email address.

"Our goals are to promote independent research in the fields of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research," wrote Suju Rajan, a director of research for "personalization science" at Yahoo Labs, in a blog article about the data.

Rajan's group has used a full-scale version of the dataset to research behavior modeling, recommender systems and large-scale and distributed machine learning, among other topics.

"We hope that this data release will similarly inspire our fellow researchers, data scientists, and machine learning enthusiasts in academia, and help validate their models on an extensive, 'real-world' dataset," she added.

About the Author

Dian Schaffhauser is a former senior contributing editor for 1105 Media's education publications THE Journal, Campus Technology and Spaces4Learning.
