Criteo Releases Machine Learning Dataset for Academic Use -- Campus Technology

By Leila Meyer
06/19/15

Criteo, a company that helps Web sites deliver personalized advertisements, has released a large machine learning dataset to the open source community for use in academic research and development of distributed machine learning algorithms.

Machine learning algorithms are computer programs that improve their predictions as they are exposed to new data, without being explicitly reprogrammed. For example, Criteo has its own proprietary distributed learning algorithm that is designed to predict when an individual is most likely to click on an online advertisement. According to Olivier Chapelle, principal research scientist at Criteo, publicly available datasets such as this one are necessary for the development of accurate and fast machine learning algorithms. With the release of this dataset, Criteo's goal is to help academic researchers test and refine other machine learning platforms.
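To make the idea of click prediction concrete, the following is a minimal sketch of one common textbook approach: logistic regression updated one example at a time, with feature names hashed into a fixed-size weight table. It is not Criteo's proprietary algorithm, which the company has not described here. The class name `OnlineClickModel`, the feature names, and the toy data are all hypothetical and chosen only for illustration.

```python
import math
import zlib


class OnlineClickModel:
    """Illustrative only: logistic regression trained one example at a time.

    This is a generic textbook technique, not Criteo's method.
    """

    def __init__(self, num_buckets=2 ** 16, learning_rate=0.1):
        self.num_buckets = num_buckets
        self.learning_rate = learning_rate
        self.weights = [0.0] * num_buckets

    def _indices(self, features):
        # Hashing trick: map each "name=value" pair to a slot in the weight table.
        return [
            zlib.crc32(f"{name}={value}".encode("utf-8")) % self.num_buckets
            for name, value in sorted(features.items())
        ]

    def predict(self, features):
        # Estimated click probability: sigmoid of the summed feature weights.
        score = sum(self.weights[i] for i in self._indices(features))
        score = max(-35.0, min(35.0, score))  # keep math.exp in a safe range
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features, clicked):
        # One stochastic gradient step on the log loss for a single example.
        error = self.predict(features) - (1.0 if clicked else 0.0)
        for i in self._indices(features):
            self.weights[i] -= self.learning_rate * error


# Hypothetical toy data: one ad is always clicked, the other never is.
model = OnlineClickModel()
for _ in range(50):
    model.update({"ad": "shoes"}, clicked=True)
    model.update({"ad": "insurance"}, clicked=False)

print(model.predict({"ad": "shoes"}), model.predict({"ad": "insurance"}))
```

Because the model is updated one example at a time, it never needs the full dataset in memory, which is one reason approaches of this kind are often discussed for very large click logs. A distributed version would require additional machinery that this sketch does not attempt.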

The newly released dataset contains more than four billion lines and is more than one terabyte in size, making it "the largest public machine learning dataset ever issued to the open source community," according to a statement from the company. The dataset consists of anonymized data drawn from real-world applications and is hosted on Microsoft Azure.

Researchers at Carnegie Mellon University have already used Criteo's newly released dataset as a benchmark and plan to use it for more academic and research projects.

"Criteo's one terabyte dataset has proven invaluable for benchmarking the scalability of the learning algorithms for high throughput click-through-rate estimation, which we are developing as part of our Marianas Labs project," said Alexander Smola, a professor at Carnegie Mellon University, in a prepared statement.

Information about how to access, download and use the dataset can be found on the Criteo Labs site.

About the Author

Leila Meyer is a technology writer based in British Columbia. She can be reached at [email protected].