MIT Machine Learning Model Learns from Audio Descriptions

Computer scientists at the Massachusetts Institute of Technology have developed a new machine learning model for object recognition that learns from images paired with spoken audio descriptions, rather than from transcripts of that audio.

"The model doesn't require manual transcriptions and annotations of the example [speech] it's trained on," the official announcement explained of the new method. "Instead, it learns words directly from recorded speech clips and objects in raw images, and associates them with one another."
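To make the idea of "associating" speech clips with image content more concrete, here is a minimal toy sketch in Python. It is not the MIT model: it assumes that speech segments and image regions have already been turned into vectors in a shared space, and it simply pairs each segment with its most similar region using cosine similarity. The vectors, the names `audio_segments` and `image_regions`, and the helper `best_match` are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical, hand-made embeddings. A real system would learn these
# from raw audio and raw pixels; here they are fixed toy values.
audio_segments = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.0, 0.2, 0.9],
}
image_regions = {
    "region_1": [0.1, 0.1, 0.8],
    "region_2": [0.8, 0.2, 0.1],
}


def best_match(audio_segments, image_regions):
    """Pair each audio segment with the most similar image region."""
    matches = {}
    for name, audio_vec in audio_segments.items():
        matches[name] = max(
            image_regions,
            key=lambda region: cosine(audio_vec, image_regions[region]),
        )
    return matches


matches = best_match(audio_segments, image_regions)
print(matches)
```

In this toy setup no transcript appears anywhere: the pairing depends only on how close the audio and image vectors are to each other.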

Most machine learning models that incorporate audio require transcriptions of that audio rather than using the audio itself. While the current system recognizes only "several hundred words and object types," the researchers who developed it have high hopes for its future.

"We wanted to do speech recognition in a way that's more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don't typically have access to," commented David Harwath, a researcher in MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Spoken Language Systems Group. 

"There's potential there for a Babel Fish-type of mechanism," he continued.

The experiment builds on a 2016 project, adding more images and data and taking a new approach to training. Details on how the model was trained can be found in the official announcement of the new project.

About the Author

Becky Nagel serves as vice president of AI for 1105 Media specializing in developing media, events and training for companies around AI and generative AI technology. She also regularly writes and reports on AI news for PureAI.com, a site she founded, among others. She's the author of "ChatGPT Prompt 101 Guide for Business Users" and other popular AI resources with a real-world business perspective. She regularly speaks, writes and develops content around AI, generative AI and other business tech. Find her on X/Twitter @beckynagel.
