Research Project Figures Out How to Crowdsource Predictive Models

An MIT research project has come up with a way to crowdsource the development of features for use in machine learning. Groups of data scientists contribute their ideas for this "feature engineering" to a collaboration tool named "FeatureHub." The idea, according to lead researcher Micah Smith, is to enable contributors to spend a few hours reviewing a modeling problem and proposing various features. The software then builds models with those features and determines which ones are the most useful for a particular predictive task.

As described in "FeatureHub: Towards collaborative data science," the platform lets multiple users write scripts for feature extraction and then request an evaluation of their proposed features. The platform aggregates features from multiple users and automatically builds a machine learning model for the problem at hand. The name was inspired by GitHub, a repository for programming projects, some of which have drawn numerous contributors.
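The workflow described above can be illustrated with a minimal sketch. It is not FeatureHub's actual API: the feature functions, field names and sample data are hypothetical, and a simple absolute Pearson correlation stands in for the platform's model-based evaluation of which features are most useful.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
FeatureFn = Callable[[List[Record]], List[float]]


# Hypothetical features "contributed" by two different users.
def area_per_room(rows: List[Record]) -> List[float]:
    return [r["area"] / r["rooms"] for r in rows]


def floor_level(rows: List[Record]) -> List[float]:
    return [r["floor"] for r in rows]


def build_feature_matrix(
    rows: List[Record], features: Dict[str, FeatureFn]
) -> Dict[str, List[float]]:
    """Run every contributed feature function; one column per feature."""
    return {name: fn(rows) for name, fn in features.items()}


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def rank_features(
    matrix: Dict[str, List[float]], target: List[float]
) -> List[Tuple[str, float]]:
    """Score each feature against the target (assumed stand-in metric)
    and return (name, score) pairs, most useful first."""
    scores = {name: abs(pearson(col, target)) for name, col in matrix.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented sample data: apartment records and their sale prices.
rows = [
    {"area": 40.0, "rooms": 2.0, "floor": 5.0},
    {"area": 90.0, "rooms": 3.0, "floor": 1.0},
    {"area": 50.0, "rooms": 1.0, "floor": 3.0},
    {"area": 80.0, "rooms": 2.0, "floor": 2.0},
]
target = [20000.0, 30000.0, 50000.0, 40000.0]

matrix = build_feature_matrix(
    rows, {"area_per_room": area_per_room, "floor_level": floor_level}
)
ranked = rank_features(matrix, target)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In this toy data the sale price is exactly proportional to area per room, so that feature ranks first. A real system would presumably train and validate a model on held-out data rather than rely on a single correlation score.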

To test the platform, the researchers recruited 41 freelance analysts with data science experience, who spent five hours each with the system, familiarizing themselves with it and using it to propose candidate features for each of two data science problems. In one problem, the participants were given data about users of the home rental site Airbnb and their activity on the site. They were asked to predict, for a given user, the country in which the user would book his or her first rental. In the other problem the workers were given data provided by Sberbank, a Russian bank, on apartment sale transactions and economic conditions in Russia. The test subjects were given the job of predicting for a given transaction the apartment's final selling price. Of the 41 workers who logged into the platform, 32 successfully submitted at least one feature. In total, the project collected 1,952 features.

The predictive models produced with FeatureHub were then compared against entries submitted to Kaggle, a data science competition service whose entries are built through manual effort. The Kaggle entries had been scored on a 100-point scale, and the FeatureHub models came within three and five points of the winning entries for the two problems. Importantly, however, while the Kaggle entries took weeks or months of work, the FeatureHub entries were produced in days.

Smith is hopeful about the platform's potential. "I do hope that we can facilitate having thousands of people working on a single solution for predicting where traffic accidents are most likely to strike in New York City or predicting which patients in a hospital are most likely to require some medical intervention," he said in an MIT article about the project. "I think that the concept of massive and open data science can be really leveraged for areas where there's a strong social impact but not necessarily a single profit-making or government organization that is coordinating responses."

A paper on the project was recently presented at the IEEE International Conference on Data Science and Advanced Analytics in Tokyo. Co-authors included Smith's thesis advisor, Kalyan Veeramachaneni, a principal research scientist at MIT's Laboratory for Information and Decision Systems, and Roy Wedge, a former MIT undergraduate who is now a software engineer at Feature Labs, a data science company based on the group's work.

The project was partially funded through a National Science Foundation grant focused on creating a community software infrastructure, called LearnSphere, that supports sharing, analysis and collaboration across the wide variety of educational data.

About the Author

Dian Schaffhauser is a former senior contributing editor for 1105 Media's education publications THE Journal, Campus Technology and Spaces4Learning.
