OpenAI just made a big move in voice tech — and it’s not just another transcription update. In March 2025, they quietly rolled out three new audio-focused models:
gpt-4o-transcribe
gpt-4o-mini-transcribe
gpt-4o-mini-tts
Each does something specific, but they all push toward the same goal: make voice feel like a native part of AI interaction, not a patch or a side API, but something that belongs in the core product. I spent some time going through the official documentation, the SDK examples, and the audio samples. Here's what's really happening, and what's still not quite there.
What’s New? A lot more than just better speech recognition.
1. Speech-to-Text: Not just faster — smarter
The two new transcription models (gpt-4o-transcribe and its lightweight sibling gpt-4o-mini-transcribe) are built to do more than just record words. They show real improvements in handling tough inputs:
Strong accents
Crosstalk
Noise (like public transport or coffee shop audio)
Fast speakers
And the benchmarks back it up: these models post a lower word error rate (WER) than the earlier Whisper models across multiple languages and acoustic conditions. This isn't just for your next personal assistant app. Think legal, medical, support centers, or anything where transcription errors cost money and trust.
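If you want to try the transcription models from code, the call goes through OpenAI's standard audio transcription endpoint. Below is a minimal sketch using the official Python SDK; the file name is a placeholder, and it assumes `OPENAI_API_KEY` is set in your environment.

```python
# Minimal sketch: transcribing a local audio file with gpt-4o-transcribe.
# Assumes the openai Python SDK is installed and OPENAI_API_KEY is set;
# "meeting.m4a" is a placeholder file name.
from openai import OpenAI

client = OpenAI()

with open("meeting.m4a", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # or "gpt-4o-mini-transcribe" for lower cost and latency
        file=audio_file,
    )

print(transcript.text)
```

Swapping in `gpt-4o-mini-transcribe` is the same one-line change, which makes it easy to benchmark both against your own audio before committing.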
2. Text-to-Speech that actually gets you
Here’s the part that surprised me.
The new gpt-4o-mini-tts doesn’t just generate nice-sounding audio. It can be told how to speak — using natural instructions. Things like:
“Speak like a calm therapist”
“Sound enthusiastic like a product demo host”
“Speak quietly, like you’re whispering in a library”
And the model adjusts — dynamically, without reprogramming.
It’s not perfect (yet), but the expressiveness and instruction-following behavior are clearly the next frontier. The emotional quality of a voice is now something you can program in seconds. You can access the model through the text-to-speech API or OpenAI.FM. Keep in mind that the voices are preset artificial samples, which OpenAI monitors to make sure they consistently stay recognizably synthetic.
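To get a feel for the instruction-following behavior from code, here is a minimal sketch against the speech endpoint, assuming the official Python SDK; the voice name, input text, and instruction are illustrative only.

```python
# Minimal sketch: steering gpt-4o-mini-tts with a natural-language style instruction.
# Assumes the openai Python SDK and an OPENAI_API_KEY in the environment;
# the voice, input text, and instructions below are just examples.
from openai import OpenAI

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Your appointment is confirmed for Tuesday at 3 PM.",
    instructions="Speak like a calm therapist: slow, warm, and reassuring.",
) as response:
    # Stream the synthesized audio straight to disk.
    response.stream_to_file("confirmation.mp3")
```

Changing only the `instructions` string is enough to move the same text from "calm therapist" to "enthusiastic demo host", which is exactly the kind of per-request control earlier TTS endpoints didn't give you.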
3. Agents SDK got a voice
This part made me smile. OpenAI updated its Agents SDK to plug in audio effortlessly. That means:
Your agent can listen
Your agent can speak
And all of it runs in a continuous loop — input → processing → spoken output
The integration is clean. If you already have a text-based agent, you don’t need to rebuild it; you just plug voice into it. This finally makes voice interfaces feel less like they were hacked together. You don’t need a dozen tools anymore; it’s a native experience. For those focused on low-latency, speech-to-speech experiences, the speech-to-speech models in the Realtime API are the recommended choice.
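To make that concrete, here is a rough sketch of wrapping an existing text agent in a voice pipeline, modeled on the Agents SDK's voice quickstart. Treat the class names and the silent input buffer as assumptions to verify against the current SDK docs; a real app would feed microphone audio instead.

```python
# Sketch: wrapping an existing text agent in a voice pipeline with the
# OpenAI Agents SDK voice extras (pip install "openai-agents[voice]").
# The agent instructions and the silent input buffer are placeholders.
import asyncio
import numpy as np

from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

# The same text agent you already have; nothing voice-specific here.
agent = Agent(
    name="Assistant",
    instructions="Answer briefly and helpfully.",
)

async def main() -> None:
    # Speech-to-text in, your agent in the middle, text-to-speech out.
    pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

    # In a real app this buffer would come from a microphone; here it is
    # three seconds of silence at 24 kHz just to show the input shape.
    buffer = np.zeros(24000 * 3, dtype=np.int16)
    result = await pipeline.run(AudioInput(buffer=buffer))

    # The synthesized reply streams back in chunks as it is generated.
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            pass  # play or save event.data (PCM audio) here

asyncio.run(main())
```

The point of the loop at the end is the "input → processing → spoken output" cycle from the list above: the agent logic stays untouched, and the pipeline handles both ends of the audio.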
What It’s Like to Use
Transcription? Crisp. I ran the public demos and listened to various samples. These models handle chaotic input far better than the older Whisper-based ones. If your use case involves multi-speaker scenarios or messy real-world audio, these models are ready.
Speech synthesis? Surprisingly responsive. The voice output is clear, non-robotic, and has real nuance. You don’t get full actor-level performance yet, but it’s a huge step up from “text in, flat voice out.”
This launch isn’t loud, and maybe that’s the point. OpenAI didn’t try to blow up the internet with this one. Instead, they quietly stitched audio into the fabric of how agents work. They’re turning voice into a powerful tool for automation. And if you’ve been waiting for the moment when you could stop typing and start talking to your tools, this might just be the signal you were listening for.
Automate Voice Workflows with Latenode
Want to turn audio into actions – and text into voice – without building an entire app from scratch?
Latenode lets you automate Speech-to-Text and Text-to-Speech workflows in minutes. No complex coding. Just connect your triggers and go. Integrate dozens of AI models. Connect to any service via no-code integration or API. While we’re working on adding support for OpenAI’s newest audio models, here’s a voice-powered automation you can use right now:
Try It Now: Transform Your Raw Thoughts into a Post (or Anything Else)
This workflow listens to Telegram voice messages, transcribes them, generates viral post text, creates an image, and sends everything back to Telegram.
4-Step Summary:
Receive voice message via Telegram bot
Transcribe audio using Whisper AI
Generate viral post + image prompt via ChatGPT
Create image with Recraft AI and send back to Telegram