The Voice Stack is improving rapidly. Systems that interact with users via speaking and listening will drive many new applications. Over the past year, I’ve been working closely with DeepLearning.AI, AI Fund, and several collaborators on voice-based applications, and I will share best practices I’ve learned in this and future posts.

Foundation models that are trained to directly input, and often also directly generate, audio have contributed to this growth, but they are only part of the story. OpenAI’s Realtime API makes it easy for developers to write prompts to develop systems that deliver voice-in, voice-out experiences. This is great for building quick-and-dirty prototypes, and it also works well for low-stakes conversations where making an occasional mistake is okay. I encourage you to try it!

However, compared to text-based generation, it is still hard to control the output of voice-in, voice-out models. In contrast to directly generating audio, when we use an LLM to generate text, we have many tools for building guardrails, and we can double-check the output before showing it to users. We can also use sophisticated agentic reasoning workflows to compute high-quality outputs. Before a customer-service agent shows a user the message, “Sure, I’m happy to issue a refund,” we can make sure that (i) issuing the refund is consistent with our business policy and (ii) we will call the API to issue the refund (and not just promise a refund without issuing it). In contrast, the tools to prevent a voice-in, voice-out model from making such mistakes are much less mature.

In my experience, the reasoning capability of voice models also seems inferior to that of text-based models, and they give less sophisticated answers. (Perhaps this is because voice responses have to be briefer, leaving less room for chain-of-thought reasoning to get to a more thoughtful answer.)

When building applications where I need more control over the output, I use agentic workflows to reason at length about the user’s input. In voice applications, this means I end up using a pipeline that includes speech-to-text (STT) to transcribe the user’s words, then processes the text using one or more LLM calls, and finally returns an audio response to the user via text-to-speech (TTS). Because the reasoning is done in text, this allows for more accurate responses. However, this process introduces latency, and users of voice applications are very sensitive to latency.

When DeepLearning.AI worked with RealAvatar (an AI Fund portfolio company led by Jeff Daniel) to build an avatar of me, we found that getting TTS to generate a voice that sounded like me was not very hard, but getting it to respond to questions using words similar to those I would choose was much harder. Even after much tuning, it remains a work in progress. You can play with it at https://lnkd.in/gcZ66yGM

[At length limit. Full text, including latency reduction technique: https://lnkd.in/gjzjiVwx ]
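To make the cascaded pipeline described above concrete, here is a minimal sketch of one conversational turn: transcribe, reason in text (where guardrails can run), then synthesize. It assumes the OpenAI Python SDK; the model names and the policy-check hook are illustrative placeholders, not the exact setup used by RealAvatar.

```
# Minimal sketch of a cascaded STT -> LLM -> TTS turn.
# Assumes the OpenAI Python SDK (>=1.0); model names and passes_policy_check
# are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

def passes_policy_check(text: str) -> bool:
    # Placeholder guardrail: e.g., a second LLM call or rule engine that checks
    # the draft reply against business policy before the user ever hears it.
    return True

def handle_turn(audio_path: str, history: list[dict]) -> bytes:
    # 1) Speech-to-text: transcribe the user's audio.
    with open(audio_path, "rb") as f:
        user_text = client.audio.transcriptions.create(
            model="whisper-1", file=f
        ).text

    # 2) Text reasoning: one or more LLM calls; the draft can be inspected,
    #    revised, or blocked before it is spoken.
    draft = client.chat.completions.create(
        model="gpt-4o",
        messages=history + [{"role": "user", "content": user_text}],
    ).choices[0].message.content

    if not passes_policy_check(draft):
        draft = "Let me double-check that and get back to you."

    # 3) Text-to-speech: synthesize the approved reply as audio bytes.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=draft)
    return speech.read()
```

In production each stage is typically streamed rather than run sequentially on full turns, which is where most of the latency reduction comes from.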
Developing Voice App Features
Explore top LinkedIn content from expert professionals.
Summary
Developing voice app features means creating interactive systems that let users speak to apps and receive spoken responses, making technology feel more conversational and accessible. This involves combining speech detection, natural language reasoning, and audio output so users can enjoy smooth, real-time communication with AI-driven services.
- Prioritize low latency: Aim for fast response times by efficiently coordinating speech-to-text, language models, and text-to-speech components so users don't experience awkward pauses during conversations.
- Design for flexibility: Integrate tools that let you switch between different AI models and audio technologies, giving you control over quality and cost as you build and refine voice features.
- Focus on natural interaction: Incorporate features like emotional recognition, barge-in (user interruption), and clear audio processing to make conversations feel more human and engaging.
-
I’ve open-sourced a key component of one of my latest projects: Voice Lab, a comprehensive testing framework that removes the guesswork from building and optimizing voice agents across language models, prompts, and personas. Speech is increasingly becoming a prominent modality companies employ to enable user interaction with their products, yet the AI community is still figuring out systematic evaluation for such applications.

Key features:
(1) Metrics and analysis – define custom metrics like brevity or helpfulness in JSON format and evaluate them using LLM-as-a-Judge. No more manual reviews.
(2) Model migration and cost optimization – confidently switch between models (e.g., from GPT-4 to smaller models) while evaluating performance and cost trade-offs.
(3) Prompt and performance testing – systematically test multiple prompt variations and simulate diverse user interactions to fine-tune agent responses.
(4) Testing different agent personas, from an angry United Airlines representative to a hotel receptionist who tries to jailbreak your agent to book all available rooms.

While designed for voice agents, Voice Lab is versatile and can evaluate any LLM-based agent.

⭐️ I invite the community to contribute and would highly appreciate your support by starring the repo to make it more discoverable for others.

GitHub repo (commercially permissive): https://lnkd.in/gAaZ-tkA
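As a rough illustration of the LLM-as-a-Judge idea behind feature (1), here is a sketch that scores a transcript against a JSON-defined metric. The metric schema, judge prompt, and model name are hypothetical examples, not Voice Lab's actual format or API.

```
# Illustrative LLM-as-a-Judge evaluation over a JSON-defined metric.
# The metric schema and judge prompt are hypothetical, not Voice Lab's schema.
import json
from openai import OpenAI

client = OpenAI()

metric = {
    "name": "brevity",
    "description": "The agent answers in at most two short sentences.",
    "scale": "1-5",
}

def judge(transcript: str, metric: dict) -> int:
    """Ask a judge model to score one conversation against one metric."""
    prompt = (
        f"You are grading a voice agent.\nMetric: {json.dumps(metric)}\n"
        f"Transcript:\n{transcript}\n"
        f"Reply with a single integer score on the scale {metric['scale']}."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return int(reply.strip())
```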
-
Last week, a colleague was struggling to add voice features to his fintech app. After weeks of wrestling with ASR→LLM→TTS orchestration, dealing with latency issues, and trying to make different APIs play nice together, he was ready to give up.

That got me thinking - why is adding voice AI still so complicated in 2025? Why are developers spending months on infrastructure instead of building features that matter? And that's when I discovered Agora's Conversational AI Engine - and honestly, it's a game-changer.

Here's what makes it different:

🚀 𝗦𝗲𝘁𝘂𝗽 𝗶𝗻 𝗠𝗶𝗻𝘂𝘁𝗲𝘀, 𝗡𝗼𝘁 𝗠𝗼𝗻𝘁𝗵𝘀
- Three lines of code to add voice AI to your app
- Built on open-source TEN framework
- No complex orchestration headaches

⚡ 𝗥𝗲𝗮𝗹-𝗧𝗶𝗺𝗲 𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲
- ~650ms total latency (that's lightning fast!)
- Runs on Agora's SD-RTN (80 billion minutes of voice and video streams monthly)
- Auto-scales from 1 to 1M users without breaking

🔧 𝗖𝗼𝗺𝗽𝗹𝗲𝘁𝗲 𝗙𝗹𝗲𝘅𝗶𝗯𝗶𝗹𝗶𝘁𝘆
- Use any LLM: OpenAI, Claude, Gemini, or your custom model
- Any TTS: Microsoft, ElevenLabs, or roll your own
- Zero vendor lock-in - maintain full control

🎧 𝗔𝘂𝗱𝗶𝗼 𝗜𝗻𝘁𝗲𝗹𝗹𝗶𝗴𝗲𝗻𝗰𝗲
- Neural noise suppression that actually works
- Users can interrupt the AI mid-sentence (just like real conversations!)
- Echo cancellation for crystal-clear audio

What really impressed me? It handles real-world conditions. Coffee shop noise, poor connectivity, multiple speakers - the Conversational AI Engine delivers consistent performance when everything else fails.

If you're building anything with voice, skip the months of R&D headaches. Your users will get natural conversations, your team will get infrastructure that just works. And if you found this useful, share it with your developer community so more people can build voice-first experiences without the pain!

Explore the Conversational AI Engine → https://lnkd.in/gUvPGNes

#VoiceAI #DeveloperTools #ConversationalAI #APIDevelopment #TechInnovation #Agora #RealTimeAI #BuildingWithAI #DeveloperExperience
-
What if you could just speak, and AI automatically detects, transcribes, and responds to you at lightning speed? Multimodal applications are changing how we interact with AI.

I just released a new tutorial with Groq and Gradio for building multimodal voice apps that automatically detect when you are speaking, enabling natural back-and-forth conversation with AI. It uses Whisper running on Groq, @ricky0123/vad-web for voice activity detection, and Gradio for the app interface.

Creating a natural interaction with voice and text requires a dynamic, low-latency response, so we need both automatic voice detection and fast inference. With @ricky0123/vad-web powering speech detection and Groq powering the LLM, both of these requirements are met. Groq provides lightning-fast responses, and Gradio allows for easy creation of impressively functional apps.

I have a full walkthrough of how it works on GitHub: https://lnkd.in/gkqX74hn
As well as a video tutorial: https://lnkd.in/gv4tW2uV

If you found this helpful, you can support this post and the repository! As always, feel free to share any feedback or questions!
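The core of that loop is small. Here is a hedged sketch of the transcribe-then-respond step, assuming the Groq Python SDK's OpenAI-compatible interface; the model names are illustrative, and voice activity detection is assumed to run client-side (via @ricky0123/vad-web), handing a finished utterance to this function.

```
# Sketch of the transcribe-then-respond loop.
# Assumes the Groq Python SDK (OpenAI-compatible); model names are illustrative.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

def respond(audio_path: str, history: list[dict]) -> str:
    # 1) Fast transcription with Whisper running on Groq.
    with open(audio_path, "rb") as f:
        text = client.audio.transcriptions.create(
            model="whisper-large-v3", file=f
        ).text

    # 2) Low-latency LLM reply for the next conversational turn.
    reply = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=history + [{"role": "user", "content": text}],
    ).choices[0].message.content
    return reply
```

In the Gradio app, this function would sit behind the audio input component, with the VAD deciding when an utterance has ended and the reply being rendered as text or synthesized audio.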
-
We all know the friction in real-time voice agents: robotic voices, high latency, and that awkward moment when you can't interrupt the AI. We've focused on solving those core engineering problems with the new Gemini 2.5 Flash Native Audio model for the Live API, now available in preview for developers. 🎙️

We didn't just improve performance; we changed the interaction model:

✅ Conversational Fluidity: Barge-in is fundamentally better. Your users can interrupt the model naturally, even in noisy environments, making conversations feel human and effortless.
✅ Superior Control (Proactive Audio): Use System Instructions to enforce strict "chime-in" rules. Want an AI assistant that only speaks when the user mentions "Italian cooking"? It now follows that instruction reliably, maintaining silence until the exact trigger.
✅ Nuance and Empathy (Affective Dialogue): The model can now detect and respond to a user’s emotional state - like frustration or excitement - creating more empathetic and helpful interactions than simple Q&A.
✅ Reliability: We've significantly improved the triggering rate for Function Calling and boosted Transcription Accuracy. Less time debugging flaky triggers, more time delivering features.

The power of the Live API is in the streaming, low-latency code, and we've put together a Python SDK notebook covering some of these new capabilities, including multi-turn sessions and affective dialogue examples. Check out the details below and start building.

✦ Code sample: https://lnkd.in/gd2ihJaj
✦ Docs: https://lnkd.in/gQwPieam
✦ API Ref: https://lnkd.in/gt65ug5p
✦ Demo: https://lnkd.in/g6ftChTG
-
After 11 weeks of building Ella (an AI voice dictation app) with Claude Code, here's the workflow that actually works.

The biggest mistake I made early on: cramming everything into CLAUDE.md. Project history, architecture, coding conventions, troubleshooting notes. The file ballooned. Sessions started slower. Claude's responses got less focused.

Now my structure is intentional:

```
CLAUDE.md            # Minimal signpost only
docs/
├── research/        # Thinking before building
├── features/        # Plans that become documentation
├── debug-logs/      # Institutional memory
└── architecture.md  # Living document, updated every commit
```

The workflow for every feature:
→ Fresh session (context bleeds between long sessions)
→ Research phase with clarification questions
→ Implementation plan with specific files and interfaces
→ Build + test
→ Separate sub-agent for code review
→ Commit with documentation updates

Three things made the biggest difference:

1. Pointing Claude to open-source projects. When I hit a wall on audio latency, I found projects that solved similar problems and fed Claude their code. The output was categorically better. Not incrementally. Categorically.

2. Custom commands for consistency. Research → plan → implement → review → commit. Same pattern every time. Claude performs better when it knows what to expect.

3. Treating documentation as insurance against context loss. When Claude compacts context mid-session, it can re-read the docs. The research file, implementation plan, and architecture document persist outside the context window.

The real unlock isn't any single technique. It's the compounding. Every feature makes the next feature easier. Architecture docs get richer. Debug logs cover more edge cases.

Full breakdown on my workflow: https://lnkd.in/gmVy4ig3
If you want to try Ella: https://www.heyella.co
-
Voice AI is more than just plugging in an LLM. It's an orchestration challenge involving complex AI coordination across STT, TTS and LLMs, low-latency processing, and context & integration with external systems and tools.

Let's start with the basics:

---- Real-time Transcription (STT)
Low-latency transcription (<200ms) from providers like Deepgram ensures real-time responsiveness.

---- Voice Activity Detection (VAD)
Essential for handling human interruptions smoothly, with tools such as WebRTC VAD or LiveKit Turn Detection.

---- Language Model Integration (LLM)
Select your reasoning engine carefully—GPT-4 for reliability, Claude for nuanced conversations, or Llama 3 for flexibility and open-source options.

---- Real-Time Text-to-Speech (TTS)
Natural-sounding speech from providers like Eleven Labs, Cartesia or Play.ht enhances user experience.

---- Contextual Noise Filtering
Implement custom noise-cancellation models to effectively isolate speech from real-world background noise (TV, traffic, family chatter).

---- Infrastructure & Scalability
Deploy on infrastructure designed for low-latency, real-time scaling (WebSockets, Kubernetes, cloud infrastructure from AWS/Azure/GCP).

---- Observability & Iterative Improvement
Continuous improvement through monitoring tools like Prometheus, Grafana, and OpenTelemetry ensures stable and reliable voice agents.

📍You can assemble this stack yourself or streamline the entire process using integrated API-first platforms like Vapi. Check it out here ➡️https://bit.ly/4bOgYLh

What do you think? How will voice AI tech stacks evolve from here?
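To ground the VAD layer, here is a small sketch using the py-webrtcvad package (one of the WebRTC VAD options named above) to detect speech frame by frame and trigger barge-in. The frame-size and sample-rate constraints are the library's; the playback and STT hooks are hypothetical stand-ins for your own audio pipeline.

```
# Frame-level voice activity detection with WebRTC VAD (py-webrtcvad).
# The barge-in handling is a simplified illustration; stop_tts_playback and
# resume_streaming_to_stt are hypothetical hooks into your audio pipeline.
import webrtcvad

SAMPLE_RATE = 16000          # webrtcvad accepts 8/16/32/48 kHz, 16-bit mono PCM
FRAME_MS = 30                # frames must be 10, 20, or 30 ms long
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2

vad = webrtcvad.Vad(2)       # aggressiveness 0-3; higher filters more non-speech

def user_is_speaking(pcm_frame: bytes) -> bool:
    """Return True if this 30 ms PCM frame contains speech."""
    assert len(pcm_frame) == FRAME_BYTES
    return vad.is_speech(pcm_frame, SAMPLE_RATE)

def handle_frame(pcm_frame: bytes, agent_is_talking: bool) -> None:
    # Barge-in: if the user starts talking while TTS is playing, stop playback
    # and hand the microphone stream back to the STT engine.
    if agent_is_talking and user_is_speaking(pcm_frame):
        stop_tts_playback()
        resume_streaming_to_stt()

def stop_tts_playback() -> None: ...
def resume_streaming_to_stt() -> None: ...
```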
-
🔮 Design Guidelines For Voice UX. Guidelines and Figma toolkits to design better voice UX for products that support or rely on audio input ↓

🤔 People avoid voice UIs in public spaces, or for sensitive data.
✅ But do use them with audio assistants, learning apps, in-car UIs.
✅ Good conversations always move forward, not backwards.
🤔 The way humans speak is different from the way we write.
🤔 What people say isn’t always what they mean by saying it.
✅ First, define relevant user stories for your product.
✅ Sketch key use cases, then add detours, then edge cases.
✅ Design VUI personas: tone of voice, words, sentence structure.
✅ Listen to related human conversations, transcribe them.
✅ Write conversation flows for happy and unhappy paths.
✅ Add markers (Finally, Now, Next) to structure the dialogue.
✅ Accessibility: support shaky voices and speech impediments.
✅ Allow users to slow down or speed up output, or rephrase.
✅ Adjust speech patterns, e.g. speaking to children differently.
🚫 There are no errors or “wrong input” in human interactions.
🤔 Give people time to think: 8–10s is a good time to respond.
✅ Design for long silences, thick accents, slang and contradictions.

Keep in mind that many people have been “burnt” by horrible, poorly designed automated phone systems. If your voice UX comes across even nearly as bad, don’t be surprised by a very low usage rate.

You can’t replicate a long scrollable list in audio, so keep answers short, with max 3 options at a time. Instead of listing more options, ask one direct question and then branch out. Re-prompt or reframe when certainty is low.

People choose their voice assistant based on the personality it conveys, and the friendliness it projects. So be deliberate in how you shape the tone, word choice and the melody of the voice. Don’t broadcast personality for repetitive tasks, but let it shine in a conversation. And: if you don’t assign a personality to your product, users will do it for you.

So study how your customers speak. How exactly they explain the tasks your product must perform. The closer you get to a personal human interaction, the easier it will be to earn people’s trust.

Useful resources:
Voice Principles, by Ben Sauer https://lnkd.in/dQACgwue
Voice UI Design System, by Orange https://lnkd.in/ezP-9QUu
Designing A Voice Persona, by James Walsh https://lnkd.in/e3WXaxEC
Conversational UIs (Figma), by ServiceNow https://lnkd.in/dUpGmcFa
Voice UI Guide, by Lars Mäder https://vui.guide/

#ux #design
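Two of these guidelines translate directly into dialogue logic: re-prompt or reframe when certainty is low, and never read out more than three options. The sketch below illustrates that policy; the threshold, helper names, and wording are hypothetical, not taken from any of the toolkits linked above.

```
# Illustrative dialogue policy: re-prompt when transcription confidence is low,
# and never read out more than three options. Threshold and phrasing are
# hypothetical examples, not from a specific toolkit.
CONFIDENCE_THRESHOLD = 0.6
MAX_SPOKEN_OPTIONS = 3

def next_prompt(transcript: str, confidence: float, options: list[str]) -> str:
    if confidence < CONFIDENCE_THRESHOLD:
        # Reframe instead of repeating verbatim, and keep the dialogue moving forward.
        return "Sorry, I didn't quite catch that. Could you say it another way?"

    if len(options) > MAX_SPOKEN_OPTIONS:
        # A long list doesn't work in audio; ask one narrowing question instead.
        return "There are a few ways to do this. Would you like the fastest option?"

    spoken = ", ".join(options[:MAX_SPOKEN_OPTIONS])
    return f"You can choose {spoken}. Which would you like?"
```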
-
𝗜𝗳 𝘆𝗼𝘂 𝗯𝘂𝗶𝗹𝗱 𝗔𝗜 𝘃𝗼𝗶𝗰𝗲 𝗮𝗴𝗲𝗻𝘁𝘀, 𝘆𝗼𝘂 𝗡𝗘𝗘𝗗 𝗧𝗢 𝗞𝗡𝗢𝗪 𝘁𝗵𝗶𝘀 𝘀𝗶𝘅-𝗹𝗮𝘆𝗲𝗿 𝘁𝗲𝗰𝗵 𝘀𝘁𝗮𝗰𝗸! 🛠️

AI voice agents are evolving fast, opening up many possibilities for a new paradigm of customer interaction. In today's world, businesses still use scripted IVR menus and static call flows that frustrate customers and waste time. With AI voice agents, we can create natural conversations that adapt in real-time, handling thousands of concurrent calls with low latency.

There are many tools and possibilities for AI voice agents today, creating both exciting opportunities and a lot of noise. To cut through the confusion, here's a framework of six key tech stack layers you can leverage to build powerful, production-ready voice automation.

Let's break it down: ⬇️

1. 𝗩𝗼𝗶𝗰𝗲 𝗢𝗿𝗰𝗵𝗲𝘀𝘁𝗿𝗮𝘁𝗶𝗼𝗻 𝗣𝗹𝗮𝘁𝗳𝗼𝗿𝗺
Start with Retell AI as your foundation. No-code builder + developer API, 30+ languages, 99.99% uptime.
→ Orchestrates STT, LLM, and TTS with sub-800ms latency for human-like conversations.

2. 𝗖𝗵𝗼𝗼𝘀𝗲 𝗬𝗼𝘂𝗿 𝗟𝗟𝗠 𝗕𝗿𝗮𝗶𝗻
Connect GPT-5 for complex reasoning, Gemini for long context, or custom models. The AI decides what to say, how to respond, and when to take action.
→ Think: the intelligence that powers every decision your agent makes.

3. 𝗔𝗱𝗱 𝗩𝗼𝗶𝗰𝗲 & 𝗣𝗲𝗿𝘀𝗼𝗻𝗮𝗹𝗶𝘁𝘆
Select TTS providers like ElevenLabs or Cartesia for natural voices. Clone your voice or choose from libraries, control speed, emotion, and tone.
→ This is what makes your agent sound human, not robotic.

4. 𝗜𝗻𝘁𝗲𝗴𝗿𝗮𝘁𝗲 𝗧𝗼𝗼𝗹𝘀 & 𝗗𝗮𝘁𝗮
Connect calendars, CRMs and databases. Book appointments automatically, pull customer data during calls, update records in real-time.
→ Like giving your agent hands to actually do things, not just talk.

5. 𝗔𝘂𝘁𝗼𝗺𝗮𝘁𝗲 𝗪𝗼𝗿𝗸𝗳𝗹𝗼𝘄𝘀
Use n8n, Make, or Zapier to connect agents to existing systems. Trigger actions during or after calls, send emails, create tickets, build complex automations.
→ Turns voice agents into full business process automation.

6. 𝗔𝗱𝗱 𝗧𝗲𝗹𝗲𝗽𝗵𝗼𝗻𝘆 & 𝗦𝗰𝗮𝗹𝗲
Connect phone numbers via Twilio, Telnyx, or Retell's built-in telephony. Handle inbound and outbound calls, manage routing, scale to hundreds of concurrent calls.
→ Most voice agents fail here — this is production deployment, not demos.

Understanding this tech stack can improve deployment speed, reliability, and customer satisfaction, leading to more sophisticated and scalable AI voice automation.

[𝗡𝗼𝘁𝗲 𝘁𝗵𝗮𝘁 𝘁𝗵𝗲𝘀𝗲 𝗹𝗮𝘆𝗲𝗿𝘀 𝘄𝗼𝗿𝗸 𝘁𝗼𝗴𝗲𝘁𝗵𝗲𝗿, 𝗻𝗼𝘁 𝗶𝗻 𝗶𝘀𝗼𝗹𝗮𝘁𝗶𝗼𝗻.] 🛠️

This tech stack is adapted from Retell AI's production deployment framework for building AI voice agents that actually work at scale.

Save 💾 ➞ React 👍 ➞ Share ♻️

Build your first AI Voice Agent with Retell AI: https://lnkd.in/dgzuQrH5
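Layer 4 (tools and data) is where the agent gets "hands." As a rough illustration, here is a sketch of exposing a calendar action to the LLM through function calling in the OpenAI tools format; the book_appointment tool and its backend are hypothetical, not Retell AI's actual integration API.

```
# Sketch of layer 4: exposing a calendar action via function calling.
# Uses the OpenAI tools format; book_appointment and its backend are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "book_appointment",
        "description": "Book a slot on the business calendar during the call.",
        "parameters": {
            "type": "object",
            "properties": {
                "customer_name": {"type": "string"},
                "start_time": {"type": "string", "description": "ISO 8601"},
            },
            "required": ["customer_name", "start_time"],
        },
    },
}]

def book_appointment(customer_name: str, start_time: str) -> dict:
    # Placeholder backend: in a real agent this would hit your calendar/CRM.
    return {"status": "booked", "customer": customer_name, "time": start_time}

def agent_turn(messages: list[dict]) -> str:
    msg = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools
    ).choices[0].message

    if msg.tool_calls:  # the model decided to take an action, not just talk
        args = json.loads(msg.tool_calls[0].function.arguments)
        result = book_appointment(**args)
        messages = messages + [msg, {
            "role": "tool",
            "tool_call_id": msg.tool_calls[0].id,
            "content": json.dumps(result),
        }]
        return agent_turn(messages)  # let the model confirm the booking aloud
    return msg.content
```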
-
Voice has emerged as a widely adopted modality in applied AI. Yet the technical execution is difficult to get right. In my latest post, I dive into the technical architecture of building voice agents. This enables us at Greylock to better empathize with engineering challenges, assess product depth with greater rigor, and stay ahead of shifts in the voice stack.

The post explores how teams are bringing voice agents into production across three distinct layers of the stack, breaks down the core technical challenges of building voice agents, and also highlights where we see enduring infrastructure needs across the voice stack. Here are the highlights:

1/ Layers of the stack: Teams deploying voice agents in production typically operate across one of three layers of the voice stack - core infrastructure, frameworks and developer platforms, and end-to-end applications. Each comes with its own engineering tradeoffs and product considerations.

2/ Under the hood: Today, most production-grade voice systems follow a three-part architecture: (1) a speech-to-text (STT) model, (2) a large language model (LLM), and (3) a text-to-speech (TTS) model. An emerging alternative is using end-to-end speech-to-speech (S2S) models, which would skip the intermediate transformations from audio to text and then back to audio. These models are generally more expressive and conversational out-of-the-box but not yet ready for most production use cases.

3/ Key technical considerations: Regardless of the architecture, delivering high-quality voice interaction requires solving problems across the entire stack, such as:
- Latency
- Function Calling Orchestration
- Hallucinations and Guardrails
- Interrupts and Pauses
- Speech Nuances
- Background Noise and Multi-speaker Detection

4/ Foundational infrastructure: In our conversations with both builders and buyers, we repeatedly heard that reliability, quality, security, and compliance are critical gating factors for production deployment. Builders need confidence that the agents they’re deploying will perform reliably across edge cases, while buyers look for tools to evaluate and monitor agent behavior in real-world environments.

We’re excited to continue learning from teams working at the forefront of voice infrastructure and agentic voice applications. If you're building in this space, we would love to connect.

Thanks to Timothy Luong, Dave Smith, Catheryn Li, Andrew Chen, Ben Wiley, Arno Gau, Jason Risch, Jerry Chen, Corinne Marie Riley, and Christine Kim for their thought partnership.

Read the essay here: https://lnkd.in/g63MNZg6
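On the latency point: a common tactic in the cascaded STT → LLM → TTS architecture is to stream the LLM's output and flush sentence-sized chunks to TTS, so the first audio plays before the full reply has been generated. The sketch below illustrates the idea; stream_llm_tokens and synthesize_and_play are hypothetical placeholders for whichever streaming LLM and TTS providers you use.

```
# Sketch of a common latency tactic for cascaded STT -> LLM -> TTS systems:
# stream the LLM's tokens and send each completed sentence to TTS immediately.
# stream_llm_tokens and synthesize_and_play are hypothetical provider hooks.
from typing import Iterator

SENTENCE_ENDINGS = (".", "!", "?")

def speak_streaming(user_text: str) -> None:
    buffer = ""
    for token in stream_llm_tokens(user_text):       # e.g., an LLM streaming API
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            synthesize_and_play(buffer.strip())       # e.g., a streaming TTS call
            buffer = ""
    if buffer.strip():
        synthesize_and_play(buffer.strip())           # flush any trailing text

def stream_llm_tokens(prompt: str) -> Iterator[str]: ...
def synthesize_and_play(sentence: str) -> None: ...
```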