Research Project Figures Out How to Crowdsource Predictive Models
- By Dian Schaffhauser
- 11/09/17
An MIT research project has come up with a way to crowdsource development of features for use in machine learning. Groups of data scientists contribute their ideas for this "feature engineering" into a collaboration tool named "FeatureHub." The idea, according to lead researcher Micah Smith, is to enable contributors to spend a few hours reviewing a modeling problem and proposing various features. Then the software builds the models with those features and determines which ones are the most useful for a particular predictive task.
As described in "FeatureHub: Towards collaborative data science," the platform lets multiple users write scripts for feature extraction and then request an evaluation of their proposed features. The platform aggregates features from multiple users and automatically builds a machine learning model for the problem at hand. The name was inspired by GitHub, a repository for programming projects, some of which have drawn numerous contributors.
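The workflow described above, in which contributors submit feature scripts and the platform combines and evaluates them, can be sketched in a few lines of Python. This is a minimal illustration and not FeatureHub's actual API. The registry, the function names, the toy apartment records and their field names are all hypothetical. The scoring step uses absolute Pearson correlation with the target as a simple stand-in for the model-based evaluation the article describes.

```python
from math import sqrt
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
FeatureFn = Callable[[Record], float]

# Hypothetical registry of contributor-submitted feature functions, keyed by name.
registry: Dict[str, FeatureFn] = {}


def submit_feature(name: str, fn: FeatureFn) -> None:
    """Register one contributor's feature-extraction function."""
    registry[name] = fn


def build_feature_matrix(records: List[Record]) -> Dict[str, List[float]]:
    """Apply every submitted feature to every record (one column per feature)."""
    return {name: [fn(r) for r in records] for name, fn in registry.items()}


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features(records: List[Record], target: List[float]) -> List[Tuple[str, float]]:
    """Score each feature by |correlation| with the target, best first.

    Assumption: a stand-in for usefulness; a real system would train and
    validate a model on the combined features instead.
    """
    matrix = build_feature_matrix(records)
    scores = {name: abs(pearson(col, target)) for name, col in matrix.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (invented for illustration): apartments and their sale prices.
records = [
    {"area_sqm": 30.0, "floor": 2.0},
    {"area_sqm": 45.0, "floor": 9.0},
    {"area_sqm": 60.0, "floor": 1.0},
    {"area_sqm": 80.0, "floor": 5.0},
]
prices = [3.0, 4.4, 6.1, 7.9]

# Three "contributors" each propose a feature.
submit_feature("area", lambda r: r["area_sqm"])
submit_feature("floor", lambda r: r["floor"])
submit_feature("constant", lambda r: 1.0)

ranked = rank_features(records, prices)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In this toy run, the area feature tracks price closely and ranks first, while the constant feature carries no information and ranks last. That is the kind of signal a platform could use to tell contributors which of their proposals are worth refining.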
To test the platform, the researchers recruited 41 freelance analysts with data science experience, who spent five hours each with the system, familiarizing themselves with it and using it to propose candidate features for each of two data science problems. In one problem, the participants were given data about users of the home rental site Airbnb and their activity on the site. They were asked to predict, for a given user, the country in which the user would book his or her first rental. In the other problem, the participants were given data provided by Sberbank, a Russian bank, on apartment sale transactions and economic conditions in Russia. They were asked to predict, for a given transaction, the apartment's final selling price. Of the 41 participants who logged into the platform, 32 successfully submitted at least one feature. In total, the project collected 1,952 features.
The predictive models produced with FeatureHub were then compared against entries submitted to Kaggle, a data-science competition service whose entries are developed by hand. The Kaggle entries had been scored on a 100-point scale, and the FeatureHub models came within three and five points of the winning entries for the two problems. Notably, while the Kaggle entries took weeks or months of work, the FeatureHub entries were produced in days.
Smith is optimistic about the platform's potential uses. "I do hope that we can facilitate having thousands of people working on a single solution for predicting where traffic accidents are most likely to strike in New York City or predicting which patients in a hospital are most likely to require some medical intervention," he said, in an MIT article about the project. "I think that the concept of massive and open data science can be really leveraged for areas where there's a strong social impact but not necessarily a single profit-making or government organization that is coordinating responses."
A paper on the project was recently presented at the IEEE International Conference on Data Science and Advanced Analytics in Tokyo. Co-authors included Smith's thesis advisor, Kalyan Veeramachaneni, a principal research scientist at MIT's Laboratory for Information and Decision Systems, and Roy Wedge, a former MIT undergraduate who is now a software engineer at Feature Labs, a data science company based on the group's work.
The project was partially funded through a National Science Foundation grant focused on creating a community software infrastructure, called LearnSphere, that supports sharing, analysis and collaboration across a wide variety of educational data.
About the Author
Dian Schaffhauser is a former senior contributing editor for 1105 Media's education publications THE Journal, Campus Technology and Spaces4Learning.