OpenAI just made a big move in voice tech — and it’s not just another transcription update. In March 2025, they quietly rolled out three new audio-focused models: gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-mini-tts.
Each does something specific, but they all push toward the same goal: making voice feel like a native part of AI interaction — not a patch, not a side API, but something that belongs in the core product. I spent some time going through the official documentation, the SDK examples, and the audio samples. Here’s what’s really happening — and what’s still not quite there.
The two new transcription models (gpt-4o-transcribe and its lightweight sibling gpt-4o-mini-transcribe) are built to do more than just record words. They show real improvements on tough inputs: heavy accents, noisy environments, and varying speech speeds.
And the benchmarks back it up — these models post a lower word error rate (WER) than OpenAI’s earlier Whisper models across multiple languages and acoustic conditions. This isn’t just for your next personal assistant app — think legal, medical, support centers, or anything where transcription errors cost money and trust.
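For context, here’s a minimal sketch of what calling the new transcription model looks like through the standard audio endpoint in the official Python SDK. The file name is a placeholder, and the API key is assumed to be set in the environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe a local recording with the new model.
# "meeting.wav" is a placeholder; common formats like wav, mp3, and m4a are accepted.
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # or "gpt-4o-mini-transcribe" for lower cost and latency
        file=audio_file,
    )

print(transcript.text)
```

Swapping in gpt-4o-mini-transcribe is just a change of the `model` string, which makes it easy to test whether the cheaper model is good enough for your audio.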
Here’s the part that surprised me.
The new gpt-4o-mini-tts doesn’t just generate nice-sounding audio. It can be told how to speak using natural instructions, for example asking it to sound like a sympathetic customer service agent or to read slowly and calmly.
And the model adjusts — dynamically, without reprogramming.
It’s not perfect (yet), but the expressiveness and instruction-following behavior are clearly the next frontier. The emotional quality of a voice is now something you can program in seconds. You can access the model through the text-to-speech API or at OpenAI.FM. Keep in mind that the voices are artificial presets, which OpenAI monitors to ensure they consistently remain recognizably synthetic.
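Here’s roughly what that looks like against the text-to-speech endpoint in the official Python SDK. The voice name, the spoken text, and the instruction string are example values of my own, not anything prescribed by OpenAI:

```python
from openai import OpenAI

client = OpenAI()

# Generate speech with a plain-language style instruction.
# "coral" is one of the preset voices; the instruction steers tone, not content.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Thanks for calling. I'm sorry about the mix-up; let's get it sorted.",
    instructions="Speak like a sympathetic customer service agent: warm, calm, unhurried.",
) as response:
    response.stream_to_file("reply.mp3")  # write the synthesized audio to disk
```

Changing only the `instructions` string is what makes the expressiveness feel programmable: the same input text can come out apologetic, upbeat, or deadpan.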
This part made me smile. OpenAI updated its Agents SDK so audio plugs in with almost no effort: speech-to-text on the way in, text-to-speech on the way out. The integration is clean. If you already have a text-based agent, you don’t need to rebuild it — just connect the voice layer, as in the sketch below. This finally makes voice interfaces feel less hacked together: you don’t need a dozen tools anymore, it’s a native experience. For low-latency, speech-to-speech experiences, OpenAI recommends the speech-to-speech models in the Realtime API instead.
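As a rough sketch of wrapping an existing text agent: this follows the voice extension of the Agents SDK (installed with `pip install 'openai-agents[voice]'`), and the class names (VoicePipeline, SingleAgentVoiceWorkflow, AudioInput) are taken from its quickstart, so treat the details as indicative rather than exact. The silent buffer stands in for real microphone audio:

```python
import asyncio

import numpy as np
from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

# An ordinary text agent, unchanged.
agent = Agent(
    name="Assistant",
    instructions="You are a concise, helpful assistant.",
)

async def main() -> None:
    # Wrap the text agent in a voice pipeline: speech-to-text in, text-to-speech out.
    pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

    # Stand-in input: 3 seconds of silence at 24 kHz. In a real app this would be
    # PCM audio captured from the microphone.
    buffer = np.zeros(24000 * 3, dtype=np.int16)
    result = await pipeline.run(AudioInput(buffer=buffer))

    # Stream the synthesized reply back as audio chunks.
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            pass  # play event.data through your audio output device

asyncio.run(main())
```

The point is that the agent itself never changes; the voice pipeline is bolted on around it, which is why existing text agents don’t need a rebuild.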
This launch isn’t loud – and maybe that’s the point. OpenAI didn’t try to blow up the internet with this one. Instead, they quietly stitched audio into the fabric of how agents work. They’re turning voice into a powerful tool for automation. And if you’ve been waiting for the moment when you could stop typing and start talking to your tools — this might just be the signal you were listening for.
Want to turn audio into actions – and text into voice – without building an entire app from scratch?
Latenode lets you automate Speech-to-Text and Text-to-Speech workflows in minutes. No complex coding: just connect your triggers and go. Integrate dozens of AI models and connect to any service via no-code integrations or the API. While we work on adding OpenAI’s newest audio models, here’s a voice-powered automation you can start with:
This workflow listens to Telegram voice messages, transcribes them, generates viral post text, creates an image, and sends everything back to Telegram.
4-Step Summary:
1. A Telegram trigger picks up the incoming voice message.
2. A Speech-to-Text step transcribes the audio.
3. An AI step turns the transcript into viral post text and generates a matching image.
4. The finished post and image are sent back to Telegram.
👉 Start using your first voice automation on Latenode
With a slight customization, you can adapt the same flow to other use cases. It’s all no-code, modular, and ready for real work.