Yahoo Releases Machine Learning Dataset for Academic Researchers

Academic researchers now have free access to a sizable new dataset intended to advance the scientific world's understanding of Web sciences. Yahoo Labs released the "Yahoo News Recommendation" dataset, which contains data on 110 billion events and takes up 13.5 terabytes uncompressed.

Already the data has been used for research on the "effects of bid-pulsing on keyword performance in search engines" and the evaluation of "automatic image annotation using human descriptions at different levels of granularity."

The information in the dataset is completely anonymized and maintained in the Yahoo Labs Webscope data-sharing program, the company reported. It consists of user-content interactions for about 20 million users from February 2015 to May 2015. The data includes the title, summary and key phrases for news articles, as well as a local-time timestamp and partial information about the device on which the user accessed the news feeds. The dataset also has categorized demographic information, such as age range, gender and generalized geographic data, for a subset of users. All of it derives from the news feeds of several of Yahoo's content sites: the homepage, news, sports, finance, movies and real estate.

To gain access to the Webscope data, the requester must be a faculty member, research employee or student from an accredited institution, and the request must come from a .edu or university domain email address.

"Our goals are to promote independent research in the fields of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research," wrote Suju Rajan, a director of research for "personalization science" at Yahoo Labs, in a blog article about the data.

Rajan's group has used a full-scale version of the dataset to research behavior modeling, recommender systems and large-scale and distributed machine learning, among other topics.

"We hope that this data release will similarly inspire our fellow researchers, data scientists, and machine learning enthusiasts in academia, and help validate their models on an extensive, 'real-world' dataset," she added.

About the Author

Dian Schaffhauser is a former senior contributing editor for 1105 Media's education publications THE Journal, Campus Technology and Spaces4Learning.
