Yahoo Releases Machine Learning Dataset for Academic Researchers

Academic researchers now have free access to a sizable new dataset intended to advance research in Web sciences. Yahoo Labs released the "Yahoo News Recommendation" dataset, which consists of data on 110 billion events and takes up 13.5 terabytes in its uncompressed format.

Already the data has been used for research on the "effects of bid-pulsing on keyword performance in search engines" and the evaluation of "automatic image annotation using human descriptions at different levels of granularity."

The information inside the dataset is completely anonymized and maintained in the Yahoo Labs Webscope data-sharing program, the company reported. It's made up of user-content interactions for about 20 million users during the period from February 2015 to May 2015. The data includes the title, summary and key phrases for news articles, as well as a timestamp for the local time and partial information about the device on which the user accessed the news feeds. The dataset also has categorized demographic information such as age range, gender and generalized geographic data for a subset of users. All of it derives from the news feeds of several of Yahoo's content sites: the homepage, news, sports, finance, movies and real estate.

To gain access to the Webscope data, the requester must be a faculty member, research employee or student from an accredited institution, and the request must come from an .edu or university domain email address.

"Our goals are to promote independent research in the fields of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research," wrote Suju Rajan, a director of research for "personalization science" at Yahoo Labs, in a blog article about the data.

Rajan's group has used a full-scale version of the dataset to research behavior modeling, recommender systems and large-scale and distributed machine learning, among other topics.

"We hope that this data release will similarly inspire our fellow researchers, data scientists, and machine learning enthusiasts in academia, and help validate their models on an extensive, 'real-world' dataset," she added.

About the Author

Dian Schaffhauser is a former senior contributing editor for 1105 Media's education publications THE Journal, Campus Technology and Spaces4Learning.
