5 lessons from talking to 100+ teams building voice AI agents
Real-time voice AI is young, and the tools and infrastructure that will power its future are being built right now.
After working with and speaking to hundreds of teams building voice AI agents, we’ve learned a lot.
Some of the most important things:
Voice AI demos are easy. Scaling to production is hard.
Low latency is only one piece of the puzzle. Delivering consistent, reliable performance matters just as much for real-world adoption.
Complex voice AI agents need code, not flowcharts. Opinionated platforms that abstract your agent’s backend logic lead to a lack of control at critical moments.
Observing and evaluating voice agents is hard because they are non-deterministic. These challenges scale non-linearly with real usage.
Most teams don’t want to build, maintain and manage their own voice infrastructure.
We're making choices about how we build @Layercode accordingly, and want to hear from more builders in the voice AI space.
If you're building in voice AI: If you could wave a magic wand, what would you change about building voice AI today?



Replies
@aidanhornsby Agree! Demos are easy, production is hard. My one wish is solid prod tooling: an end-to-end call timeline plus one-click replay, so we can actually debug and scale voice agents.
Layercode
@sheraz_abdul_hayee we've found a very common problem for folks scaling in production is that any issues also scale as usage increases, so the need for good observability increases very quickly.
We've built pretty solid observability tools into the beta (track latency analytics, replay any conversation + review logs) and have a lot more ideas to build on these features to help devs get a really clear picture into what's happening with their agents.
Would love your feedback on what we have so far if you're interested in testing the beta?
minimalist phone: creating folders
Do you have any data or findings about multilingual voice agents? Many of them are only in English. But what about other languages?
@busmark_w_nika From my research, multilingual support is still a big gap in voice AI, especially outside of English, as you mentioned. I'm currently working on a Chrome extension that needs this functionality, but I have only been able to get around the gap programmatically by transcribing audio streams, translating the transcripts via LLMs, and playing back translated speech for the user, with all three stages needing their own AI. Having a single AI do all of this would be the dream scenario.
Layercode
@busmark_w_nika as @john_wern said, multilingual is definitely a gap across some of the most popular model providers. Notably, ElevenLabs and Cartesia do offer support for more languages than just English and Spanish, but the majority of models are still English-first. Fwiw we are big fans of Cartesia's price/quality ratio for production use cases. Rime's Arcana model is also excellent and just launched v2 with more languages.
If helpful, here's a table we put together which tracks language support for a lot of the popular models. You may also find this blog post interesting, although the state of the art moves so quickly we already need to update it!
minimalist phone: creating folders
@john_wern @aidanhornsby I asked because I noticed that the Slavic group of languages are quite difficult, and especially when the country is smaller, there is a chance that they will not pay attention to it. 😅
@aidanhornsby @busmark_w_nika I think I misunderstood your original question slightly Nika, so my answer was a bit vague.
Aidan is definitely more correct about ElevenLabs and Cartesia (I had personally never heard of Cartesia before) offering high quality results in other languages as well. I believe many smaller Slavic languages have already been included and perform well, at least when quickly checking and testing them at ElevenLabs.
BTW, thanks for the table Aidan, it was a great overview!
curious which use cases u think will break out best?
Layercode
@shabb_katoch from our research + conversations I'd say that today, customer service is one of the biggest areas for voice AI adoption (call centers and telephony-centric use cases especially). Sales (inbound and outbound) uses are also growing fast.
Most interestingly though, I just saw a recent study of 70k job applicants which found that voice agents outperformed human recruiters! An interesting read: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5395709
I think that example points to where voice AI will get more interesting (areas where an agent doesn't just replace a human, but is preferable). The industry is still so young, though, that I think rollout of the technology will be quite uneven across different industries.
@shabb_katoch @aidanhornsby
Voice agents outperforming human recruiters is not surprising; they've been using ATS and RMS fully automated for at least 25 years. There's a lot of data out there.
Layercode
@shabb_katoch @clioteorica I've not spent a lot of time in recruiting but my understanding is that ATS/RMS systems would be the system of record for a job posting/candidate pool whereas the voice agent in this study was replacing an interaction with a human being using the ATS/RMS system? Curious if I am missing context though, any reading links appreciated!
What jumped out to me was just that this is one of the first areas I've seen where people actually expressed a preference for dealing with an AI vs. a human. I've heard of many examples where people are already using voice agents, but not that many (yet) where they were already broadly preferable to dealing with a human.
Triforce Todos
So true! Demo day feels smooth, but real users find every edge case 😅
I've recently built two different voice agents using gpt-realtime and I found it to be great! You still need to build a considerable amount of UI and business logic to create a solid experience for your use case, not to mention a backend to handle custom agent events (which is the secret sauce), but you get a lot of functionality out of the box, including multi-language. I'm happy to chat if you'd like help getting started.
Layercode
@joeschwendt nice! Awesome to hear that. The recent updates to OpenAI's realtime model especially are quite impressive. I would love to learn more about what you've been building! Can you drop me an email at aidan at layercode.com and we can find some time for a quick chat?
The real-time models are very impressive, and certainly more than capable for many use cases. That said, we're still seeing a combination of (a) higher price and (b) limited ability to observe, evaluate and control the model, compared with a chained pipeline approach (full control over every aspect of the agent), preventing many teams from adopting them for agent use cases that demand more scale or complexity.
The speed of improvement of these models is something we're watching closely, though. I expect them to continue eating into use cases that today are best served by a 'chained' approach to building voice agents.
Hmm, interesting. I agree that real-time voice AI products require a lot more work than expected, though it's still manageable at this point. In fact, from what I know, some real-time voice AI products are already being used in industry for specific areas, and the scale is not that small. That is all I can tell from here, and I hope this makes sense :)
A while back we were building an AI UXR, and one thing we did not account for was the value of each account, i.e. teams (our customers) didn't want to use AI for larger or more important accounts because they were concerned about not providing a white glove experience. At the consumer level it was more acceptable (high frequency and shorter engagements). In other words: know your customer :)
Layercode
Very interesting. I think that's an astute point. Today, voice still feels like a 'cheaper' experience in many contexts. Something I saw last week and found very interesting was this recent study of 70,000 job applicants: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5395709
It's the first non-anecdotal instance I've seen of a voice AI agent actually being preferable to a human for end users at scale. Obviously this will never be the case for every industry and every type of communication touchpoint, but it's interesting to think about where voice AI systems might end up providing a better UX than a human.
@robingreenwood I would love to learn more about your experience, if you'd be down to share.