OpenAI's voice AI previously landed the company in hot water with actress Scarlett Johansson, but that isn't stopping it from continuing to advance its offerings in this category.
Today, the ChatGPT maker introduced three new voice models: gpt-4o-transcribe, gpt-4o-mini-transcribe and gpt-4o-mini-tts. These models will initially be available through the company's application programming interface (API) for third-party software developers to build their own apps. They will also be available on a custom demo site, OpenAI.fm, which individual users can access for limited testing and fun.
In addition, the gpt-4o-mini-tts model's voices can be customized from several presets via text prompt to change their accents, pitch, tone and other vocal qualities, including conveying whatever emotions the user asks of them. That should go a long way toward addressing concerns that OpenAI is deliberately imitating any particular person's voice (the company previously denied that was the case with Johansson, but pulled down the seemingly imitative voice option anyway). Now it is up to the user to decide how they want their AI voice to sound when it speaks back.
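Under the hood, this steering is exposed as a plain-text instructions field on the speech endpoint. The following is a minimal sketch using OpenAI's Python SDK; the voice preset, instruction text and output filename are illustrative choices, not details from the announcement.

    from pathlib import Path

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Generate speech whose delivery is steered by a free-form text instruction.
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="coral",  # one of the built-in preset voices
        input="Thanks for calling! Your package is on its way.",
        instructions="Speak like a calm, reassuring yoga teacher.",
    ) as response:
        response.stream_to_file(Path("reply.mp3"))

Swapping the instruction for, say, "speak like a mad scientist" changes the delivery without changing the underlying voice preset.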
In a demo with VentureBeat conducted over video call, OpenAI technical staff member Jeff Harris showed how, using text alone on the demo site, a user could make the same voice sound like a mad scientist or a calm yoga teacher.
Discovering and refining new capabilities within the GPT-4o base
The models are variants of the existing GPT-4o model OpenAI launched in May 2024, which currently powers the text and voice experience in ChatGPT for many users. The company took that base model and post-trained it with additional data to make it excel at transcription and speech. It did not specify when the models might come to ChatGPT.
“ChatGPT has slightly different requirements in terms of cost and performance trade-offs, so while I expect they will move to these models in time, for now, this launch is focused on API users,” Harris said.
The new models are meant to supersede OpenAI's two-year-old open-source Whisper speech-to-text model, offering lower word error rates across industry benchmarks and improved performance in noisy environments, with diverse accents and at varying speech speeds, across more than 100 languages.
The company published a chart on its website showing just how much lower gpt-4o-transcribe's word error rates are across 33 languages compared to Whisper, with an impressively low 2.46% in English.

“These models include noise cancellation and a semantic voice activity detector, which helps determine when a speaker has finished a thought, improving transcription accuracy,” Harris said.
Harris told VentureBeat that the new gpt-4o-transcribe model family is not designed to offer “diarization,” the capability to label and differentiate between different speakers. Instead, it is designed primarily to receive one voice (or possibly several) as a single input channel and respond to all inputs with a single output voice in that interaction, however long it takes.
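For developers, calling the new transcription models looks much like calling Whisper through the existing audio API. The sketch below assumes a typical SDK setup, with the filename as a placeholder.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Send one audio file as a single input channel and get plain text back.
    with open("support-call.mp3", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="gpt-4o-transcribe",
            file=audio_file,
        )

    print(transcript.text)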
The company is also holding a contest for the general public to find the most creative examples of using its demo voice site, OpenAI.fm, and share them online by tagging the @openai account on X. The winner will receive a custom Teenage Engineering radio with the OpenAI logo, which OpenAI head of product, platform, Olivier Godement said is one of only three in the world.
An audio applications gold mine
The models are particularly well-suited for applications such as customer call centers, meeting note transcription and AI-powered assistants.
Notably, the company's Agents SDK, launched last week, also allows developers who have already built apps on top of its text-based large language models, such as the regular GPT-4o, to add fluid voice interactions with only “nine lines of code,” according to an OpenAI YouTube livestream announcing the new models.
For example, an e-commerce app built atop GPT-4o could now respond to spoken questions like “tell me about my latest orders” with just seconds of additional work to update the code by adding these new models.
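As a rough sketch of what that looks like, the voice quickstart in the open-source Agents SDK wraps an existing text agent in a voice pipeline; the class names and audio handling below follow the SDK's documentation at launch and may change, and the agent itself is a hypothetical example.

    import asyncio

    import numpy as np
    from agents import Agent
    from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

    # An existing text-based agent, e.g. one that answers order questions.
    agent = Agent(
        name="StoreAssistant",
        instructions="Answer questions about the customer's recent orders.",
    )

    async def main() -> None:
        # Speech-to-text in, the agent's text reply synthesized back to speech out.
        pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

        # Placeholder: three seconds of silence standing in for microphone audio.
        buffer = np.zeros(24000 * 3, dtype=np.int16)
        result = await pipeline.run(AudioInput(buffer=buffer))

        # Stream the synthesized audio chunks as they arrive.
        async for event in result.stream():
            if event.type == "voice_stream_event_audio":
                pass  # hand event.data to an audio player here

    asyncio.run(main())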
“For the first time, we're introducing streaming speech-to-text, allowing developers to continuously input audio and receive a real-time text stream, making conversations feel more natural,” Harris said.
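In the API, that shows up as a stream flag on the transcription endpoint. The sketch below assumes the stream yields incremental text-delta events, per OpenAI's documentation at launch; the filename is a placeholder.

    from openai import OpenAI

    client = OpenAI()

    # Request the transcription as a stream of partial-text events
    # rather than waiting for the complete result.
    with open("meeting.wav", "rb") as audio_file:
        stream = client.audio.transcriptions.create(
            model="gpt-4o-mini-transcribe",
            file=audio_file,
            stream=True,
        )
        for event in stream:
            if event.type == "transcript.text.delta":
                print(event.delta, end="", flush=True)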
Still, for developers seeking low-latency, real-time AI voice experiences, OpenAI recommends using its speech-to-speech models in the Realtime API instead.
Prices and availability
The new models are available immediately via OpenAI's API, priced as follows:
• gpt-4o-transcribe: $6.00 per 1M audio input tokens (~$0.006 per minute)
• gpt-4o-mini-transcribe: $3.00 per 1M audio input tokens (~$0.003 per minute)
• gpt-4o-mini-tts: $0.60 per 1M text input tokens, $12.00 per 1M audio output tokens (~$0.015 per minute)
But they arrive at a time of fiercer-than-ever competition in the AI transcription and speech space, with dedicated speech AI firms such as ElevenLabs offering its new Scribe model, which supports diarization and boasts a similarly (though not quite as) low error rate in English. Scribe is priced at $0.40 per hour of input audio (or $0.006 per minute, roughly equivalent).
Another startup, Hume AI, offers a new model, Octave TTS, with sentence-level and even word-level customization of pronunciation and emotional inflection, based entirely on the user's instructions rather than preset voices. Octave TTS pricing is not directly comparable, but there is a free tier offering 10 minutes of audio, with costs scaling up from there.
Meanwhile, the open-source community is also producing advanced audio and speech models, including Orpheus 3B, which is available under the permissive Apache 2.0 license, meaning developers pay nothing to run it, provided they have the right hardware or cloud servers.
Industry adoption and early results
According to testimonials OpenAI shared with VentureBeat, several companies have already integrated the new audio models into their platforms and have seen significant improvements in voice AI performance.
EliseAI, a company focused on property management automation, found that OpenAI's text-to-speech model enabled more natural and emotionally rich interactions with tenants.
The enhanced voices made AI-powered leasing, maintenance and tour scheduling calls more engaging, leading to higher tenant satisfaction and improved call resolution rates.
Decagon, which builds AI-powered voice experiences, saw a 30% improvement in transcription accuracy using OpenAI's speech recognition model.
This boost in accuracy has allowed Decagon's AI agents to perform more reliably in real-world scenarios, even in noisy environments. The integration process was fast, with Decagon incorporating the new model into its systems within a day.
Not all reactions to OpenAI's latest release have been warm. Ben Hylak (@benhylak), co-founder of app analytics software Dawn AI and a former Apple human interfaces designer, posted on X that while the models seem promising, the announcement “feels like a retreat from real-time voice.”
In addition, the launch was preceded by an early leak on X (formerly Twitter). TestingCatalog News (@testingcatalog) posted details about the new models several minutes before the official announcement, listing the names gpt-4o-mini-tts, gpt-4o-transcribe and gpt-4o-mini-transcribe. The leak was credited to @stiventhed, and the post quickly gained traction.
Looking ahead, OpenAI plans to continue refining its audio models and exploring custom voice capabilities while prioritizing safety and responsible AI use. Beyond audio, the company is also investing in multimodal AI, including video, to enable more dynamic and interactive agent-based experiences.