I Explored OpenAI’s New Audio Models — Here’s What Actually Feels Different
March 21, 2025
4 min read


George Miloradovich
Researcher, Copywriter & Usecase Interviewer

OpenAI just made a big move in voice tech — and it’s not just another transcription update. In March 2025, they quietly rolled out three new audio-focused models:

  • gpt-4o-transcribe
  • gpt-4o-mini-transcribe
  • gpt-4o-mini-tts

Each does something specific, but they all push toward the same goal: making voice feel like a native part of AI interaction. Not a patch, not a side API, but something that belongs in the core product. I spent some time going through the official documentation, the SDK examples, and the audio samples. Here’s what’s really happening, and what’s still not quite there.

What’s New? A lot more than just better speech recognition.

1. Speech-to-Text: Not just faster — smarter

The two new transcription models (gpt-4o-transcribe and its lightweight sibling gpt-4o-mini-transcribe) are built to do more than just record words. They show real improvements in handling tough inputs:

  • Strong accents
  • Crosstalk
  • Noise (like public transport or coffee shop audio)
  • Fast speakers

And the benchmarks back it up — these models have a lower word error rate (WER) across multiple languages and acoustic conditions. This isn't just for your next personal assistant app — think legal, medical, support centers, or anything where transcription errors cost money and trust. 
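
If you want to try this yourself, the new models are drop-in replacements on the existing transcription endpoint; only the model name changes. Here’s a minimal sketch using the official openai Python SDK, assuming an OPENAI_API_KEY in your environment (the file name is just an example):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe a local recording with the new model.
# Swap in "gpt-4o-mini-transcribe" for the cheaper, lighter variant.
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        response_format="text",  # return plain text instead of a JSON object
    )

print(transcript)
```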

2. Text-to-Speech that actually gets you

Here’s the part that surprised me.

The new gpt-4o-mini-tts doesn’t just generate nice-sounding audio. It can be told how to speak, using plain natural-language instructions. Things like:

  • “Speak like a calm therapist”
  • “Sound enthusiastic like a product demo host”
  • “Speak quietly, like you’re whispering in a library”

And the model adjusts — dynamically, without reprogramming. 

It’s not perfect (yet), but the expressiveness and instruction-following behavior are clearly the next frontier. The emotional quality of a voice is now something you can program in seconds. You can access the model through the text-to-speech API or OpenAI.FM. Keep in mind that these are preset, artificial voices that OpenAI has reviewed to make sure they consistently match their synthetic presets.
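
The same steering works through the API: the speech endpoint accepts an instructions field alongside the input text. Here’s a minimal sketch with the openai Python SDK; the voice, the instruction wording, and the output path are just examples:

```python
from openai import OpenAI

client = OpenAI()

# Generate speech and steer *how* it sounds with a plain-language instruction.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Thanks for calling. Let's take this one step at a time.",
    instructions="Speak like a calm therapist: slow pace, warm and reassuring.",
) as response:
    response.stream_to_file("calm_reply.mp3")
```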

3. Agents SDK got a voice

This part made me smile. OpenAI updated its Agents SDK to plug in audio effortlessly. That means:

  • Your agent can listen
  • Your agent can speak
  • And all of it runs in a continuous loop — input → processing → spoken output

The integration is clean. If you already have a text-based agent, you don’t need to rebuild it; just connect voice to it. This finally makes voice interfaces feel native rather than hacked together from a dozen separate tools. For low-latency, speech-to-speech experiences, OpenAI recommends the speech-to-speech models in the Realtime API instead.
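
To make “just connect voice to it” concrete, here’s a rough sketch based on the Agents SDK’s voice pipeline helpers (VoicePipeline, SingleAgentVoiceWorkflow, AudioInput). It assumes the openai-agents package is installed with its voice extras (pip install "openai-agents[voice]"); the agent definition is a placeholder, and a real app would feed microphone audio in and play the streamed audio out:

```python
import asyncio

import numpy as np
from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

# An ordinary text agent; nothing about its definition is audio-specific.
agent = Agent(
    name="Assistant",
    instructions="Answer briefly and conversationally.",
)

# The pipeline wraps the agent with speech-to-text on the way in and
# text-to-speech on the way out: input -> processing -> spoken output.
pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))


async def main() -> None:
    # Placeholder input: three seconds of silence at 24 kHz.
    # In a real app this buffer would come from the microphone.
    buffer = np.zeros(24000 * 3, dtype=np.int16)
    result = await pipeline.run(AudioInput(buffer=buffer))

    # The result streams synthesized audio chunks as the agent responds.
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            pass  # send event.data to your speaker / audio output here


asyncio.run(main())
```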

What It’s Like to Use

  • Transcription? Crisp. I ran the public demos and listened to various samples. These models handle chaotic input way better than older Whisper-based ones. If your use case includes multitalker scenarios or messy real-world audio — these models are ready. 
  • Speech synthesis? Surprisingly responsive. The voice output is clear, non-robotic, and has real nuance. You don’t get full actor-level performance yet, but it’s a huge step up from “text in, flat voice out.”

This launch isn’t loud – and maybe that’s the point. OpenAI didn’t try to blow up the internet with this one. Instead, they quietly stitched audio into the fabric of how agents work, turning voice into a powerful tool for automation. And if you’ve been waiting for the moment when you could stop typing and start talking to your tools, this might just be the signal you were listening for.

Automate Voice Workflows with Latenode

Want to turn audio into actions – and text into voice – without building an entire app from scratch? 

Latenode lets you automate Speech-to-Text and Text-to-Speech workflows in minutes. No complex coding. Just connect your triggers and go. Integrate dozens of AI models. Connect to any service via no-code integration or API. While we’re working on connecting OpenAI’s newest audio models, here’s your voice-powered automation:

Try It Now: Transform Your Raw Thoughts into a Post (or Anything Else)

This workflow listens to Telegram voice messages, transcribes them, generates viral post text, creates an image, and sends everything back to Telegram.

4-Step Summary (a rough code sketch of the same flow follows the list):

  1. Receive voice message via Telegram bot
  2. Transcribe audio using Whisper AI
  3. Generate viral post + image prompt via ChatGPT
  4. Create image with Recraft AI and send back to Telegram
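
If you’re curious what the middle of that flow looks like as plain code, here’s a minimal sketch of steps 2 and 3 with the openai Python SDK. The file name and prompt wording are just examples, and steps 1 and 4 (Telegram in, Recraft image plus Telegram out) are the integrations the Latenode template wires up for you:

```python
from openai import OpenAI

client = OpenAI()

# Step 2: transcribe the voice note downloaded from the Telegram bot in step 1.
with open("voice_note.ogg", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio,
    ).text

# Step 3: turn the raw transcript into a shareable post plus an image prompt.
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "Rewrite voice-note transcripts as short, punchy social posts, "
                "then add a one-line image prompt as the final line."
            ),
        },
        {"role": "user", "content": transcript},
    ],
)

print(completion.choices[0].message.content)
```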

👉  Start using your first voice automation on Latenode

Here’s what you can use it for after a little customization:

  • Create a plan for the day or brainstorm new ideas without typing anything.
  • Transcribe voicemails and route them into support tickets.
  • Auto-summarize meeting recordings and post to Slack.
  • Combine audio input and output in a loop – with any logic in between.

It’s no-code, modular, and ready for real use cases.
