Criteo Releases Machine Learning Dataset for Academic Use
Criteo, a company that helps Web sites
deliver personalized advertisements, has released a large machine learning
dataset to the open source community for use in academic research and
development of distributed machine learning algorithms.
Machine learning algorithms are computer programs that learn patterns from
data, rather than following hand-written rules, and improve as they are exposed to new data. For example, Criteo
has its own proprietary distributed learning algorithm that is designed to
predict when an individual is most likely to click on an online advertisement.
According to Olivier Chapelle, principal research scientist at Criteo, publicly
available datasets such as this one are necessary for the development of
accurate and fast machine learning algorithms. With the release of this
dataset, Criteo's goal is to help academic researchers test and refine other
machine learning platforms.
The newly released dataset contains more than four billion lines and is more than
one terabyte in size, making it "the largest public machine learning dataset
ever issued to the open source community," according to a statement from the
company. The data is drawn from real-world applications, has been anonymized and is hosted on
Microsoft Azure.
Researchers at Carnegie Mellon
University have already used Criteo's newly released dataset as a benchmark
and plan to use it for more academic and research projects.
"Criteo's one terabyte dataset has proven invaluable for benchmarking the
scalability of the learning algorithms for high throughput click-through-rate
estimation, which we are developing as part of our Marianas Labs project," said
Alexander Smola, a professor at Carnegie Mellon University, in a prepared
statement.
Information about how to access, download and use the dataset can be found
on the Criteo Labs site.
About the Author
Leila Meyer is a technology writer based in British Columbia. She can be reached at [email protected].