
Voice-To-Voice Models And Beyond Meat: Still Not Ready For Mass Consumption
Arkadiy Telegin is the cofounder and CTO of Leaping AI, a conversational AI platform supporting customer experience departments worldwide.
I'm vegan. So when plant-based meat started going mainstream, I was elated. The tech was impressive, the marketing confident and, for a moment, it felt like we were on the cusp of a food revolution. Impossible Burgers hit Burger King. Beyond was everywhere. Investors poured in. The future, it seemed, had arrived.
Except it hadn't.
Today, plant-based meat is still a niche. Prices are high, availability is inconsistent and adoption is slower than expected. It's not that the products disappeared. They just haven't yet integrated into everyday life the way we imagined. This is a classic case of psychological distance: a cognitive bias where things that feel close because they're exciting or well-promoted turn out to be farther off than we think.
In voice AI, voice-to-voice model development is going through the same thing. Despite recent improvements in latency, reasoning and sound quality, builders of conversational AI platforms have stubbornly stuck with older, more established technologies. Why is that?
After LLMs appeared, the first commercial voice AI applications all used a 'cascading' approach following a three-step sequence:
• Speech-To-Text (STT): Transcribe the user's speech to text.
• Large Language Model (LLM): Use an LLM to generate a response to the user's transcribed speech.
• Text-To-Speech (TTS): Synthesize speech from the LLM's response and play it back.
This is a standard, time-tested approach that was in use even before LLMs came around, primarily for language translation.
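The three-step sequence above can be sketched in a few lines. This is a minimal illustration rather than any vendor's implementation: the stage signatures, `run_turn` and the `fake_*` stand-ins are all hypothetical names, and a real system would put actual STT, LLM and TTS providers behind them.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; real providers would sit behind these.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class TurnResult:
    transcript: str     # what the user said, as text
    reply_text: str     # what the agent will say, as text
    reply_audio: bytes  # synthesized audio for playback


def run_turn(audio_in: bytes, stt: SpeechToText, llm: LanguageModel,
             tts: TextToSpeech) -> TurnResult:
    """Run one conversational turn through the cascading pipeline."""
    transcript = stt(audio_in)     # step 1: speech to text
    reply_text = llm(transcript)   # step 2: text in, text out
    reply_audio = tts(reply_text)  # step 3: text to speech
    return TurnResult(transcript, reply_text, reply_audio)


# Stand-in stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = run_turn(b"hello", fake_stt, fake_llm, fake_tts)
```

Because each stage hands plain text to the next, both the transcript and the reply are available for inspection at every turn.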
Then, last fall, OpenAI launched its Realtime API, which promised a one-step speech-to-speech AI model capable of processing audio directly to generate real-time responses. The result would be agents that sound much more human, can natively detect emotions and can be more 'tone aware.' OpenAI's entry into the space was the most commercially significant development yet, leading many to anticipate a new era for single-step voice-to-voice AI models that could feasibly be used in real-world applications.
More than six months later, the Realtime API's launch has created a lot of excitement around direct speech-to-speech AI models. The recently announced Nova Sonic model from Amazon and Sesame's base model for its Maya assistant are two examples. When it comes to production-level applications, however, my industry colleagues and customers alike are still more comfortable with the status quo of multi-step pipelines, and they have no plans to change that any time soon.
There are a few key reasons why that is the case.
Working with audio presents inherent difficulties. Text is clean, modular and easily manipulated. It allows for storage, searchability and mid-call edits. Audio, in contrast, is less forgiving. Even post-call tasks like analysis and summarization often necessitate transcription. In-call operations, such as managing state or editing messages, are more cumbersome with audio.
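To illustrate why a text intermediate is convenient, here is a small sketch of a call log that can be stored, searched and edited mid-call. The `CallLog` class and its methods are hypothetical names invented for this example; doing the same on raw audio would first require transcription.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CallLog:
    """Hypothetical in-memory log of (speaker, text) messages for one call."""
    messages: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, speaker: str, text: str) -> None:
        self.messages.append((speaker, text))

    def search(self, keyword: str) -> List[int]:
        """Return indexes of messages containing the keyword (case-insensitive)."""
        needle = keyword.lower()
        return [i for i, (_, text) in enumerate(self.messages)
                if needle in text.lower()]

    def edit(self, index: int, new_text: str) -> None:
        """Replace a message's text mid-call, keeping the speaker."""
        speaker, _ = self.messages[index]
        self.messages[index] = (speaker, new_text)


log = CallLog()
log.add("user", "My card number is 1234")
log.add("agent", "Thanks, I have your card on file.")
hits = log.search("card")
log.edit(0, "My card number is [redacted]")
```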
Function calling is crucial in production use cases: fetching data, triggering workflows, querying APIs. Currently, one-step voice-to-voice models lag in this area. Stanford computer science professor and DeepLearning.ai founder Andrew Ng, who also cofounded the Google Brain project, has publicly shared some of these limitations.
It is much easier to create and curate a good function-calling dataset for a text-based model than for a multimodal model. As a result, the function-calling capabilities of text-first models are likely to stay ahead of those of voice-to-voice models for the foreseeable future. Considering that function calling is not yet perfect even for text models and is a crucial requirement for commercial applications, it will take some time until voice-to-voice catches up to production standards.
Ng gives the example of gut-checking a response like "Yes, I can issue you a refund" against current company policy to ensure the refund is allowable, and then calling an API to issue that refund if the customer requests one. Such a check is more straightforward to build into a cascading workflow and less reliable in one-step pipelines, for the reasons stated above.
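The refund example can be sketched as a check that sits between the LLM's draft reply and the TTS step. Everything here is an assumption made for illustration: the `Order` fields, the 30-day `REFUND_WINDOW_DAYS` policy, the keyword-based detection of a refund promise and the `issue_refund` stand-in are hypothetical, not part of any real API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

REFUND_WINDOW_DAYS = 30  # hypothetical company policy


@dataclass
class Order:
    order_id: str
    amount: float
    days_since_purchase: int


def refund_allowed(order: Order) -> bool:
    """Policy check: refunds are allowed only inside the refund window."""
    return order.days_since_purchase <= REFUND_WINDOW_DAYS


def issue_refund(order: Order) -> str:
    """Stand-in for a real refund API call; returns a fake reference."""
    return f"refund:{order.order_id}:{order.amount:.2f}"


def handle_refund_reply(draft_reply: str, order: Order) -> Tuple[str, Optional[str]]:
    """Gut-check a draft text reply before it is sent to TTS.

    Returns the reply to speak and a refund reference, if one was issued.
    """
    # Crude keyword detection, for illustration only.
    promises_refund = "refund" in draft_reply.lower()
    if promises_refund and not refund_allowed(order):
        return ("I'm sorry, this order is outside our refund window.", None)
    if promises_refund:
        return (draft_reply, issue_refund(order))
    return (draft_reply, None)


draft = "Yes, I can issue you a refund."
ok_reply, ok_ref = handle_refund_reply(draft, Order("A1", 20.0, 10))
late_reply, late_ref = handle_refund_reply(draft, Order("B2", 20.0, 45))
```

The check is only possible because the draft reply exists as text before any audio is produced; a one-step model offers no equivalent interception point in this sketch.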
Since OpenAI launched its Realtime API, there have been a number of complaints that have made developers uneasy about using it in production, including audio cutting off unexpectedly and hallucinations interrupting live conversations. Others have complained of hallucinations that don't get captured in the transcript, making it challenging to catch and debug them.
This isn't to say one-step voice-to-voice AI is a dead end. Far from it. The potential for enhanced user experience—handling interruptions, conveying emotion, capturing tone—is immense. Many in the industry, our team included, are actively experimenting, preparing for the moment when it matures. Startups and major players alike continue to invest in speech-native approaches as they anticipate a more emotionally resonant, real-time future.
In other words: It's a matter of when, not if.
In the meantime, multi-step pipelines for voice AI continue to win on reliability and production-readiness. With steady improvements, particularly in behavior and function calling, the moment for single-step models will come. Until then, the trusted cascading approach will carry the load, and I'm still not eating at Burger King.
Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.