Yahoo Releases Machine Learning Dataset for Academic Researchers

Academic researchers now have free access to a sizable new dataset intended to expand the scientific community's understanding of Web sciences. Yahoo Labs released the "Yahoo News Recommendation" dataset, which consists of data on 110 billion events and takes up 13.5 terabytes uncompressed.

Already the data has been used for research on the "effects of bid-pulsing on keyword performance in search engines" and the evaluation of "automatic image annotation using human descriptions at different levels of granularity."

The information inside the dataset is completely anonymized and maintained in the Yahoo Labs Webscope data-sharing program, the company reported. It's made up of user content interactions for about 20 million users during the period from February 2015 to May 2015. The data includes title, summary and key phrases for news articles, as well as a timestamp for the local time and partial information about the device on which the user accessed the news feeds. The dataset also has categorized demographic information such as age range, gender and generalized geographic data for a subset of users. All of it derives from the news feeds of several of Yahoo's content sites: the homepage, news, sports, finance, movies and real estate.

To gain access to the Webscope data, the requester must be a faculty member, research employee or student from an accredited institution, and the request must come from an .edu or university domain email address.

"Our goals are to promote independent research in the fields of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research," wrote Suju Rajan, a director of research for "personalization science" at Yahoo Labs, in a blog article about the data.

Rajan's group has used a full-scale version of the dataset to research behavior modeling, recommender systems and large-scale and distributed machine learning, among other topics.

"We hope that this data release will similarly inspire our fellow researchers, data scientists, and machine learning enthusiasts in academia, and help validate their models on an extensive, 'real-world' dataset," she added.

About the Author

Dian Schaffhauser is a former senior contributing editor for 1105 Media's education publications THE Journal, Campus Technology and Spaces4Learning.
