Generative AI

OpenAI Rolls Out Next Evolution of ChatGPT, Able to Accept or Output Any Combination of Text, Audio, or Image

OpenAI is introducing a new iteration of its flagship GPT-4 multimodal large language model.

Called "GPT-4o" (the "o" stands for "omni"), the new flagship model was designed, the company said, to "reason" across audio, vision, and text in real time.

OpenAI also announced the release of the desktop version of ChatGPT, and a refreshed UI designed to make it simpler to use and more natural.

The new iteration was designed to accept as input any combination of text, audio, and image, and to generate any combination of text, audio, and image outputs. In a blog post, the company said it can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, "which is similar to human response time in a conversation." This level of performance matches GPT-4 Turbo on text in English and code, the company says, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is also notably better at vision and audio understanding than existing models, according to the company.

The new iteration will be free for all users, said OpenAI CTO Mira Murati during the livestream announcement, and paid users will continue to have up to five times the capacity limits of free users. "The special thing about GPT-4o is that it brings GPT-4-level intelligence to everyone, including our free users," Murati said. "A very important part of our mission is to be able to make our advanced AI tools available to everyone for free. We think it's very, very important that people have an intuitive feel for what the technology can do."

The company plans to roll out the full capabilities of the new model iteratively over the next few weeks, Murati said.

"For the past couple of years, we've been very focused on improving the intelligence of these models," Murati said, "and they've gotten pretty good. But this is the first time that we are really making a huge step forward when it comes to the ease of use. And this is incredibly important, because we're looking at the future of interaction between ourselves and the machines. And we think that GPT-4o is really shifting that paradigm into the future of collaboration, where this interaction becomes much more natural and far, far easier."

Because GPT-4-class intelligence is now available to free users via GPT-4o, Murati said, builders posting to the ChatGPT Store have a larger audience. "University professors can create content for their students, or podcasters can create content for their listeners," she said, "and you can also use vision, so now you can upload screenshots, photos, and documents containing both text and images. And you can start conversations with ChatGPT about all of this content. You can also use memory, which makes ChatGPT far more useful and helpful, because now it has a sense of continuity across all your conversations. And you can use browse, where you can search for real-time information in your conversation."

This iteration also improves ChatGPT's quality and speed in 50 different languages, Murati said, which makes the experience available to many more people.

"This is something that we've been trying to do for many, many months. And we're very, very excited to finally bring GPT-4o to all of our users," she said.

OpenAI CEO Sam Altman said in a post on X that GPT-4o is "our best model ever. it is smart, it is fast, it is natively multimodal." Developers will have access to the API, "which is half the price and twice as fast as GPT-4 Turbo," Altman added.

During the livestream, OpenAI team members demonstrated some of the new model's audio capabilities. Responding to a greeting from OpenAI researcher Mark Chen, it said, "Hey there, what's up? How can I brighten your day today?" Chen said the model has the ability to "perceive your emotion" and demonstrated by asking the model to help calm him down ahead of a public speech, then panting dramatically. A calming female voice responded with, "Whoa, calm down," and began guiding Chen through some slow, calming breathing. OpenAI team member Barret Zoph then asked the model to analyze his facial expressions, showing off its ability to perceive emotions accurately.

"As we bring these technologies into the world, it's quite challenging to figure out how to do so in a way that's both useful and also safe," Murati said. "And GPT-4o presents new challenges for us when it comes to safety, because we're dealing with real-time audio, real-time vision. And our team has been hard at work figuring out how to build in mitigations against misuse. We continue to work with different stakeholders out there, from government, media, entertainment, all industries, red teamers, and civil society, to figure out how to best bring these technologies into the world."

Read the full OpenAI blog post here.

About the Author

John K. Waters is the editor in chief of a number of sites, with a focus on high-end development, AI, and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at [email protected].
