MIT Machine Learning Model Learns from Audio Descriptions


Computer scientists at the Massachusetts Institute of Technology have developed a new machine learning model for object recognition that learns from spoken audio descriptions (rather than transcripts of that audio) paired with images.

"The model doesn't require manual transcriptions and annotations of the example [speech] it's trained on," the official announcement explained of the new method. "Instead, it learns words directly from recorded speech clips and objects in raw images, and associates them with one another."

Most machine learning models that incorporate audio rely on transcriptions of that audio rather than working with the audio itself. While the current system recognizes only "several hundred words and object types," the researchers who developed it have high hopes for its future.

"We wanted to do speech recognition in a way that's more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don't typically have access to," commented David Harwath, a researcher in MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Spoken Language Systems Group. 

"There's potential there for a Babel Fish-type of mechanism," he continued.

This experiment builds on a 2016 project, expanded with more images and data and a new approach to training. Details on how the model was trained can be found in MIT's official announcement of the project.
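For readers curious how such a system might be structured, the core idea can be sketched as a pair of neural encoders, one for images and one for speech spectrograms, trained so that matching image-audio pairs score higher than mismatched pairs drawn from the same batch. The PyTorch sketch below is a simplified illustration of that general approach; the encoder architectures, embedding size, and margin value are placeholder assumptions, not the MIT researchers' actual model.

```python
# Simplified sketch of a joint audio-image embedding. Encoder designs,
# dimensions, and the margin value are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Maps raw images to fixed-length, L2-normalized embedding vectors."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, images):            # images: (B, 3, H, W)
        x = self.conv(images).flatten(1)  # (B, 128)
        return F.normalize(self.fc(x), dim=-1)

class AudioEncoder(nn.Module):
    """Maps speech spectrograms into the same embedding space."""
    def __init__(self, n_mels=40, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(256, embed_dim)

    def forward(self, spectrograms):            # spectrograms: (B, n_mels, T)
        x = self.conv(spectrograms).flatten(1)  # (B, 256)
        return F.normalize(self.fc(x), dim=-1)

def ranking_loss(img_emb, aud_emb, margin=1.0):
    """Margin ranking loss: matched image/audio pairs (the diagonal of the
    similarity matrix) should outscore mismatched pairs from the batch."""
    scores = img_emb @ aud_emb.t()                         # (B, B) similarities
    pos = scores.diag().unsqueeze(1)                       # matched-pair scores
    cost_aud = (margin + scores - pos).clamp(min=0)        # impostor captions
    cost_img = (margin + scores - pos.t()).clamp(min=0)    # impostor images
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return (cost_aud.masked_fill(mask, 0).mean()
            + cost_img.masked_fill(mask, 0).mean())

# Toy usage: random tensors stand in for real images and spectrograms.
images = torch.randn(8, 3, 224, 224)
spectrograms = torch.randn(8, 40, 1024)
loss = ranking_loss(ImageEncoder()(images), AudioEncoder()(spectrograms))
loss.backward()
```

The actual MIT system reportedly relates regions of an image to segments of the speech signal rather than comparing whole inputs, but a ranking-style objective of this general shape is a common way to train joint audio-visual models without transcriptions.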

About the Author

Becky Nagel serves as vice president of AI for 1105 Media, specializing in developing media, events, and training for companies around AI and generative AI technology. She also regularly writes and reports on AI news for PureAI.com, a site she founded, among others. She's the author of "ChatGPT Prompt 101 Guide for Business Users" and other popular AI resources written from a real-world business perspective. She regularly speaks, writes, and develops content around AI, generative AI, and other business tech. Find her on X/Twitter @beckynagel.
