Research Project Figures Out How to Crowdsource Predictive Models


An MIT research project has come up with a way to crowdsource development of features for use in machine learning. Groups of data scientists contribute their ideas for this "feature engineering" into a collaboration tool named "FeatureHub." The idea, according to lead researcher Micah Smith, is to enable contributors to spend a few hours reviewing a modeling problem and proposing various features. Then the software builds the models with those features and determines which ones are the most useful for a particular predictive task.

As described in "FeatureHub: Towards collaborative data science," the platform lets multiple users write scripts for feature extraction and then request an evaluation of their proposed features. The platform aggregates features from multiple users and automatically builds a machine learning model for the problem at hand. The name was inspired by GitHub, a repository for programming projects, some of which have drawn numerous contributors.
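The workflow described above can be sketched in a few lines of Python. This is a hypothetical illustration, not FeatureHub's actual API: contributors submit feature functions that map a raw record to a numeric value, and the platform collects them, computes each feature over the dataset, and scores each one by how well it predicts the target (here, simple correlation stands in for the model-based evaluation the paper describes).

```python
# Hypothetical sketch of a FeatureHub-style workflow. Function and variable
# names are illustrative assumptions, not the real platform's interface.

def feature_num_sessions(user):
    """Contributor A's idea: number of site sessions for the user."""
    return len(user["sessions"])

def feature_signup_lag(user):
    """Contributor B's idea: days between signup and first activity."""
    return user["first_activity_day"] - user["signup_day"]

def evaluate_features(features, records, targets):
    """Score each candidate feature by absolute correlation with the target."""
    n = len(records)
    scores = {}
    my = sum(targets) / n
    for name, fn in features.items():
        xs = [fn(r) for r in records]
        mx = sum(xs) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, targets))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in targets) ** 0.5
        scores[name] = abs(cov / (sx * sy)) if sx and sy else 0.0
    return scores

# Toy dataset in the spirit of the Airbnb task: predict a binary outcome
# (e.g., whether the user books in a given country) from user activity.
records = [
    {"sessions": [1, 2, 3], "signup_day": 0, "first_activity_day": 1},
    {"sessions": [1], "signup_day": 0, "first_activity_day": 5},
    {"sessions": [1, 2, 3, 4], "signup_day": 2, "first_activity_day": 3},
    {"sessions": [1, 2], "signup_day": 1, "first_activity_day": 4},
]
targets = [1, 0, 1, 0]

features = {
    "num_sessions": feature_num_sessions,
    "signup_lag": feature_signup_lag,
}
scores = evaluate_features(features, records, targets)
best = max(scores, key=scores.get)
```

In the real system, the aggregation step trains a machine learning model on the pooled feature matrix rather than ranking features by correlation, but the division of labor is the same: many contributors propose features independently, and the platform does the evaluation automatically.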

To test the platform, the researchers recruited 41 freelance analysts with data science experience, who spent five hours each with the system, familiarizing themselves with it and using it to propose candidate features for each of two data science problems. In one problem, the participants were given data about users of the home rental site Airbnb and their activity on the site, and were asked to predict, for a given user, the country in which the user would book his or her first rental. In the other problem, the participants were given data provided by Sberbank, a Russian bank, on apartment sale transactions and economic conditions in Russia, and were asked to predict, for a given transaction, the apartment's final selling price. Of the 41 participants who logged into the platform, 32 successfully submitted at least one feature. In total, the project collected 1,952 features.

The predictive models produced with FeatureHub were then compared against entries submitted to Kaggle, a data science competition platform whose entries are produced through manual effort. The Kaggle entries had been scored on a 100-point scale, and the FeatureHub models came within three and five points of the winning entries for the two problems, respectively. Importantly, however, while the Kaggle entries took weeks or months of work, the FeatureHub models were produced in days.

Smith is hopeful for the use of the platform. "I do hope that we can facilitate having thousands of people working on a single solution for predicting where traffic accidents are most likely to strike in New York City or predicting which patients in a hospital are most likely to require some medical intervention," he said, in an MIT article about the project. "I think that the concept of massive and open data science can be really leveraged for areas where there's a strong social impact but not necessarily a single profit-making or government organization that is coordinating responses."

A paper on the project was recently presented at the IEEE International Conference on Data Science and Advanced Analytics in Tokyo. Co-authors included Smith's thesis advisor, Kalyan Veeramachaneni, a principal research scientist at MIT's Laboratory for Information and Decision Systems, and Roy Wedge, a former MIT undergraduate who is now a software engineer at Feature Labs, a data science company based on the group's work.

The project was partially funded through a National Science Foundation grant focused on creating a community software infrastructure, called LearnSphere, that supports sharing, analysis and collaboration across the wide variety of educational data.

About the Author

Dian Schaffhauser is a former senior contributing editor for 1105 Media's education publications THE Journal, Campus Technology and Spaces4Learning.
