Voice-To-Voice Models And Beyond Meat: Still Not Ready For Mass Consumption
Arkadiy Telegin is the cofounder and CTO of Leaping AI, a conversational AI platform supporting customer experience departments worldwide.
I'm vegan. So when plant-based meat started going mainstream, I was elated. The tech was impressive, the marketing confident and, for a moment, it felt like we were on the cusp of a food revolution. Impossible Burgers hit Burger King. Beyond was everywhere. Investors poured in. The future, it seemed, had arrived.
Except it hadn't.
Today, plant-based meat is still a niche product. Prices are high, availability is inconsistent and adoption is slower than expected. It's not that the products disappeared. They just haven't yet integrated into everyday life the way we imagined. This is a classic case of psychological distance: a cognitive bias where things that feel close because they're exciting or well-promoted turn out to be farther off than we think.
In voice AI, voice-to-voice model development is going through the same thing. Despite recent improvements in latency, reasoning and sound quality, there's been a stubborn insistence on using older, more established technologies to build conversational AI platforms. Why is that?
After LLMs appeared, the first commercial voice AI applications all used a 'cascading' approach following a three-step sequence:
• Speech-To-Text (STT): Transcribe the user's speech to text.
• Large Language Model (LLM): Use an LLM to generate a response to the user's transcribed speech.
• Text-To-Speech (TTS): Synthesize speech from your response and play it back.
This is a standard, time-tested approach that has been in use since before LLMs came around, primarily for language translation.
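For concreteness, here is a minimal sketch of one turn of that cascading loop in Python. It assumes OpenAI's hosted Whisper, Chat Completions and text-to-speech endpoints purely as stand-ins for the three stages; any STT, LLM or TTS vendor slots in the same way, and a real deployment would add streaming, interruption handling and error recovery.

```python
# Minimal cascading voice pipeline: STT -> LLM -> TTS (sketch only).
from openai import OpenAI

client = OpenAI()

def cascaded_turn(audio_path: str) -> bytes:
    # 1) Speech-to-text: transcribe the caller's audio.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f).text

    # 2) LLM: generate a text reply to the transcript.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful phone support agent."},
            {"role": "user", "content": transcript},
        ],
    ).choices[0].message.content

    # 3) Text-to-speech: synthesize audio for playback to the caller.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
    return speech.content  # raw audio bytes
```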
Then, last fall, OpenAI launched its Realtime API, which promised a one-step speech-to-speech AI model capable of parsing audio directly to generate real-time responses, resulting in agents that sound much more human, can natively detect emotions and can be more 'tone aware.' OpenAI's entry into the space was the most commercially significant development yet, leading many to anticipate a new era for single-step voice-to-voice AI models that could feasibly be used in real-world applications.
More than six months later, the Realtime API's launch has generated plenty of excitement around direct speech-to-speech AI models; Amazon's recently announced Nova Sonic model and Sesame's base model for its Maya assistant are two examples. But when it comes to production-level applications, my industry colleagues and customers alike are still more comfortable with the status quo of multi-step pipelines, and they have no plans to change that any time soon.
There are a few key reasons why that is the case.
Working with audio presents inherent difficulties. Text is clean, modular and easily manipulated. It allows for storage, searchability and mid-call edits. Audio, in contrast, is less forgiving. Even post-call tasks like analysis and summarization often necessitate transcription. In-call operations, such as managing state or editing messages, are more cumbersome with audio.
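To make the contrast concrete, here is a small sketch of a mid-call edit on a text-based conversation state. The history format and redaction rule are hypothetical; the point is that this is a few lines of ordinary data manipulation, while the same operation on raw audio would first require transcription.

```python
import re

# With a text pipeline, call state is just structured data: easy to store,
# search and edit mid-call. (Hypothetical history format for illustration.)
history = [
    {"role": "user", "content": "My card number is 4111 1111 1111 1111."},
    {"role": "assistant", "content": "Thanks, let me look that up."},
]

CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def redact_card_numbers(turns: list[dict]) -> list[dict]:
    """Mid-call edit: scrub card numbers before the history is stored or re-sent."""
    return [{**t, "content": CARD_PATTERN.sub("[REDACTED]", t["content"])} for t in turns]

history = redact_card_numbers(history)
```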
Function calling is crucial in production use cases: fetching data, triggering workflows, querying APIs. Currently, one-step voice-to-voice models lag in this area. Stanford computer science professor Andrew Ng, who cofounded the Google Brain project, has publicly discussed some of these limitations.
It is much easier to create and curate a good function-calling dataset for a text-based model than for a multimodal one. As a result, the function-calling capabilities of text-first models will always outperform those of voice-to-voice models. Considering that function calling is not yet perfect even for text models, and that it is a crucial requirement for commercial applications, it will take some time for voice-to-voice models to catch up to production standards.
Ng gives the example of gut-checking a response like "Yes, I can issue you a refund" against current company policy before it reaches the customer, then calling an API to actually issue the refund if the customer requests one. That kind of guardrail is straightforward to build into a cascading workflow but far less reliable in a one-step pipeline, for the reasons stated above.
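Here is a rough sketch of what that guardrail looks like in a cascading workflow. The policy thresholds and the refunds API are hypothetical; the structure is the point: deterministic application code, not the model, gets the final say before any refund is issued.

```python
from dataclasses import dataclass

# Assumed policy values for illustration only.
REFUND_WINDOW_DAYS = 30    # refunds allowed within 30 days of purchase
REFUND_LIMIT = 200.00      # larger refunds are escalated to a human

@dataclass
class ProposedRefund:
    """Arguments parsed from the LLM's function call."""
    order_id: str
    amount: float
    days_since_purchase: int

def issue_refund_api(order_id: str, amount: float) -> None:
    """Placeholder for the real billing-system call."""
    print(f"Refund issued: {order_id} ${amount:.2f}")

def handle_refund(proposal: ProposedRefund) -> str:
    # Gut-check the model's "Yes, I can issue you a refund" against policy
    # before anything irreversible happens.
    if proposal.days_since_purchase > REFUND_WINDOW_DAYS:
        return "Sorry, that order is outside our refund window."
    if proposal.amount > REFUND_LIMIT:
        return "Let me transfer you to a specialist who can approve this refund."
    issue_refund_api(proposal.order_id, proposal.amount)
    return f"Done. I've issued a refund of ${proposal.amount:.2f}."
```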
Since OpenAI launched its Realtime API, a number of reported issues have made developers uneasy about using it in production, including audio cutting off unexpectedly and hallucinations interrupting live conversations. Others have reported hallucinations that never appear in the transcript, making them difficult to catch and debug.
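One workaround while debugging is to re-transcribe the agent's output audio with an independent STT pass and compare it against the transcript the model reported, flagging turns where the two diverge. A minimal sketch, assuming both strings are already in hand:

```python
import difflib

def transcripts_diverge(heard_text: str, reported_text: str,
                        threshold: float = 0.8) -> bool:
    """Flag a turn when the independently transcribed audio does not match
    the transcript the speech-to-speech model reported."""
    similarity = difflib.SequenceMatcher(
        None, heard_text.lower(), reported_text.lower()
    ).ratio()
    return similarity < threshold

# Example: the reported transcript says one thing, the audio says another.
print(transcripts_diverge(
    heard_text="Your refund of two hundred dollars has been approved.",
    reported_text="Let me check whether your order qualifies for a refund.",
))  # True: send this turn for human review
```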
This isn't to say one-step voice-to-voice AI is a dead end. Far from it. The potential for enhanced user experience—handling interruptions, conveying emotion, capturing tone—is immense. Many in the industry, our team included, are actively experimenting, preparing for the moment when it matures. Startups and major players alike continue to invest in speech-native approaches as they anticipate a more emotionally resonant, real-time future.
In other words: It's a matter of when, not if.
In the meantime, multi-step pipelines continue to win on reliability and production-readiness for voice AI. With steady improvements, particularly in behavior and function calling, the moment for single-step models will come. Until then, the trusted cascading approach will carry the load, and I'm still not eating at Burger King.