Continuing from last week’s post on the rise of the Voice Stack, there’s an area that today’s voice-based systems often struggle with: Voice Activity Detection (VAD) and the turn-taking paradigm of communication.

When communicating with a text-based chatbot, the turns are clear: you write something, then the bot does, then you do, and so on. The success of text-based chatbots with clear turn-taking has influenced the design of voice-based bots, most of which also use the turn-taking paradigm. A key part of building such a system is a VAD component to detect when the user is talking. This allows our software to take the parts of the audio stream in which the user is saying something and pass them to the model for the user’s turn. It also supports interruption in a limited way: if a user insistently interrupts the AI system while it is talking, eventually the VAD system will realize the user is talking, shut off the AI’s output, and let the user take a turn.

This works reasonably well in quiet environments. However, VAD systems today struggle with noisy environments, particularly when the background noise is from other human speech. For example, if you are in a noisy cafe speaking with a voice chatbot, VAD, which is usually trained to detect human speech, tends to be inaccurate at figuring out when you, or someone else, is talking. (In comparison, it works much better if you are in a noisy vehicle, since the background noise is more clearly not human speech.) It might think you are interrupting when it was merely someone in the background speaking, or fail to recognize that you’ve stopped talking. This is why today’s speech applications often struggle in noisy environments.

Intriguingly, last year, Kyutai Labs published Moshi, a model with many technical innovations. An important one was enabling persistent bi-directional audio streams, from the user to Moshi and from Moshi to the user. If you and I were speaking in person or on the phone, we would constantly be streaming audio to each other (through the air or the phone system), and we’d use social cues to know when to listen and how to politely interrupt if one of us felt the need. Thus, the streams would not need to explicitly model turn-taking. Moshi works like this: it is listening all the time, and it is up to the model to decide when to stay silent and when to talk. This means an explicit VAD step is no longer necessary.

Just as the architecture of text-only transformers has gone through many evolutions, voice models are going through a lot of architecture exploration. Given the importance of foundation models with voice-in and voice-out capabilities, many large companies right now are investing in developing better voice models. I’m confident we’ll see many more good voice models released this year.

[Full text: https://lnkd.in/g9wGsPb2 ]
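To make the turn-taking mechanism concrete, here is a minimal sketch of VAD-driven turn detection, assuming the open-source webrtcvad package; the frame length and the silence threshold that ends a turn are illustrative choices, not values from the post.

```python
# Minimal VAD turn-taking sketch (pip install webrtcvad). Audio must be
# 16-bit mono PCM at a supported rate; frames must be 10, 20, or 30 ms.
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 960 bytes per frame
END_OF_TURN_FRAMES = 25                            # ~750 ms of silence ends a turn

vad = webrtcvad.Vad(2)   # aggressiveness 0 (permissive) to 3 (strict)

def user_turns(pcm_frames):
    """Yield one list of speech frames per detected user turn."""
    turn, silence = [], 0
    for frame in pcm_frames:               # each frame: FRAME_BYTES of raw PCM
        if vad.is_speech(frame, SAMPLE_RATE):
            turn.append(frame)
            silence = 0
        elif turn:
            silence += 1
            if silence >= END_OF_TURN_FRAMES:
                yield turn                 # hand this turn to STT and the model
                turn, silence = [], 0
    if turn:
        yield turn
```

The cafe failure mode described above falls straight out of this loop: if background speech keeps is_speech returning True, the silence counter never fires and the system cannot tell that you have stopped talking.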
Voice and Chatbot Integration
Summary
Voice and chatbot integration means combining voice recognition and conversational AI so that users can interact with digital assistants or automated systems naturally using both speech and text. This technology allows businesses to offer seamless, hands-free communication—think smart customer service phones or voice-enabled chatbots that understand and respond like a real person.
- Balance real-time needs: Build voice chatbot systems that prioritize quick responses and synchronized data to keep conversations smooth and natural, even in busy or complex environments.
- Plan for noisy settings: Invest in advanced noise filtering and voice detection to make sure your chatbot understands users accurately, even with background noise or multiple speakers.
- Design for action: Set up your chatbot so it doesn’t just answer questions but can also take meaningful actions—like sending messages or transferring to a human—across different communication channels.
GenAI Architecture – Week 9
Project 9: Building Multimodal + Voice Agents at Scale (MCP Unified Stack)

If you’ve been following this journey, you know how each week built on the last, from setting up local agents to orchestrating enterprise RAG systems and federated data pipelines. By Week 9, everything finally came together. This was the week we gave our agents the ability to see, listen, reason, and speak, all in one place.

🎯 The Challenge
Most multimodal or voice AI demos you see online are cool but disconnected: a chatbot here, a vision model there, a voice transcriber somewhere else. But in real-world enterprises, you need something unified, a single system that can:
🎙 Listen
🖼 See
🧩 Reason
🗣 Speak
… and do it all within one orchestrated environment.

🧩 The Architecture
Here’s how this unified setup works (a minimal orchestrator sketch in Python follows this post):

1️⃣ User Interface Layer
The experience starts at the front: voice, camera, or chat inputs through a FastAPI or Streamlit app powered by the MCP SDK.

2️⃣ MCP Agent Orchestrator
Built on AWS Bedrock AgentCore, this layer coordinates between vision, audio, and reasoning agents, ensuring context flows seamlessly.

3️⃣ Modular Agent Suite
🎙 Speech Agent – Whisper or Amazon Transcribe (speech-to-text)
🖼 Vision Agent – Claude or Nova (multimodal image reasoning)
🧠 Reasoning Agent – Core logic chain using Claude 3 or Nova
🗣 Response Agent – Amazon Polly or EdgeTTS for natural voice output

4️⃣ Data + Integration Layer
Unified APIs (via MindsDB, Vector DB, or RAG engine) provide real-time context, while S3 + DynamoDB store memory and results for continuity.

⚡ Why This Matters
This architecture breaks the silos. It lets voice, vision, and reasoning work together, dynamically. Bedrock AgentCore handles context and tool calls. Modular design makes it easy to swap in new capabilities. It’s built for real-time decision-making in complex environments.

💡 Real-World Use Cases
- Field engineers using voice + image input for automated diagnostics.
- Medical assistants combining patient conversations + scan interpretation.
- Voice-enabled dashboards that speak and visualize KPIs in real time.

🛠 Tech Stack
Kiro IDE | Cursor IDE | AWS Bedrock AgentCore | Claude | Nova | Whisper | Amazon Polly | MindsDB | DynamoDB | S3 | FastAPI | Streamlit | OpenCV

This week felt like the moment it all clicked: when agents stopped acting as standalone tools and started working as a collaborative team.

Next week → Week 10: Bringing it all together – Agentic AI in Production. 🚀

#GenAI #AgentCore #AWSBedrock #Claude #Nova #VoiceAI #MultimodalAI #AgenticAI #MCP #10WeeksOfGenAI #KiroIDE #CursorIDE #AIArchitecture
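To make the orchestration pattern above concrete, here is a minimal sketch in Python. Every class and method name is a hypothetical stand-in for the post's MCP/Bedrock AgentCore components, not their actual SDK APIs.

```python
# Hypothetical sketch of the modular agent suite: one orchestrator routes a
# user turn through speech, vision, reasoning, and response agents in order.
from dataclasses import dataclass, field

@dataclass
class Turn:
    audio: bytes | None = None
    image: bytes | None = None
    text: str = ""
    context: dict = field(default_factory=dict)

class Orchestrator:
    def __init__(self, speech, vision, reasoning, response):
        self.speech, self.vision = speech, vision
        self.reasoning, self.response = reasoning, response

    def handle(self, turn: Turn) -> bytes:
        if turn.audio:                      # speech-to-text first
            turn.text = self.speech.transcribe(turn.audio)
        if turn.image:                      # add image understanding as context
            turn.context["vision"] = self.vision.describe(turn.image)
        answer = self.reasoning.run(turn.text, turn.context)   # reasoning step
        return self.response.speak(answer)  # synthesize the spoken reply
```

The design point is the one the post makes: each agent sits behind a narrow interface, so swapping Whisper for Amazon Transcribe (or Claude for Nova) touches one component, not the whole pipeline.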
We keep talking about how AI agents are transforming omnichannel communications, with real demos and forward-looking cases. Here’s the recap:

Alice Goode shared how Infobip is enabling AI agents to act, not just chat, by integrating communication tools (SMS, RCS, WhatsApp, voice) through their MCP Server. She demonstrated how an AI agent could send a message through Viber by prompt. It reduced execution time from 48h to 15m.

We also discussed agent security and saw demos from builders across healthcare, e-commerce, productivity, and robotics.

3 Key Takeaways
- True agents need purpose, tools, and permissions to take action autonomously, whether it’s messaging a user or executing a task across platforms.
- MCP is a game-changer for integrations with communication channels. It gives agents a universal translator to interact with any API or CPaaS layer.
- Security is not optional. The community is launching “SAFE-MCP” to prevent abuse (e.g., agent-triggered spam or comms fraud). Guidelines will help devs and companies build safe-by-design AI.

Demos That Stood Out
- QuickBlox: Nate MacLeitch demonstrated its HIPAA-compliant AI comms platform for healthcare: triaging patient symptoms, handing off to live providers, and integrating with AI assistants trained on medical data. He showed great innovations in using AI across different moments in the communication between a professional and their patients.
- DeskMan: Tom Jacobs showed us a 3D-printed robot barista, powered by LLMs and wake-word detection, that turns its head, takes your coffee order, and interacts with you like a real staffer. It was pretty cool.
- Agora: Amar Chitimalli shared audio-first SDKs enabling agents to live-stream, answer voice prompts, and interact in real time across gaming, shopping, and even Raspberry Pi-powered robots.

We also got:
- A voice-based dementia screening tool that passively assesses cognitive health through daily conversations.
- A Private Equity Deal Assistant: a multi-agent negotiation tool helping PE buyers evaluate small business acquisitions.
- A cool email assistant that reads your emails and gives you a voice summary of them, or a podcast of the most relevant news you receive.

We ended up agreeing on some very important recommendations for builders:
- Start small, but integrate deeply: use open protocols like MCP.
- Build for action, not just answers: the next competitive edge is not in smarter replies, but in agents that do something.
- Plan for human handoff: especially in healthcare, finance, or legal, define triggers where agents pass control to a human safely.
- Stay updated on AI agent security: join the open-source “SAFE-MCP” initiative: https://lnkd.in/gjPqaHVy

If you're building anything at the intersection of AI, comms, A2H, or A2A, let's connect. The ecosystem is forming right now.

Thanks VC Nest and Astha.ai Arjun Subedi, Mike Prince, Inderpal M., James Brown, Beatta Lovrečić, Ivan Brezak Brkan (IBB) 📝, Ivan Ostojic

#AgenticAI #ConversationalAI #MCP #BusinessMessaging
Building Conversational AI IVRs (AIVRs) isn’t as simple as wiring together an LLM, STT, and TTS engine. After years of working at the intersection of real-time communications and AI, I’ve seen what actually works at scale - and what falls apart when theory meets production.

Most projects underestimate the complexity of real-time performance, especially when state, latency, and telephony are involved. I wrote a post outlining 5 common pitfalls teams run into when building Conversational AI IVRs - and what to do instead. If you're serious about deploying voice experiences that are fast, scalable, and actually useful, this will help you avoid months of frustration.

Key takeaways include:
- Why stacking APIs doesn’t scale
- The real cost of latency in voice interactions
- Why real-time voice automation isn’t just a chatbot with audio
- The importance of synchronized state across turns (see the sketch after this post)
- How to build systems that handle real-world conditions

Read the full post: https://lnkd.in/gdkHyKib

If you're building in this space, I’d be curious to hear how you're approaching the challenges.

#ConversationalAI #IVR #RealTimeAI #SpeechRecognition #AIEngineering #Telecom #SignalWire
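One of those takeaways, synchronized state across turns, can be illustrated with a small sketch: a single authoritative call record that every stage (telephony, STT, LLM, TTS) reads and updates, committed only when a full turn completes. The field names are illustrative assumptions, not details from the linked post.

```python
# Hypothetical per-call state record for an AIVR, advanced once per turn.
from dataclasses import dataclass, field

@dataclass
class CallState:
    call_id: str
    turn: int = 0
    transcript: list[str] = field(default_factory=list)
    slots: dict = field(default_factory=dict)   # e.g. caller intent, account id

    def commit_turn(self, user_text: str, agent_text: str) -> None:
        """Advance the turn only after STT, LLM, and TTS have all finished,
        so no stage ever acts on a stale view of the conversation."""
        self.transcript += [f"user: {user_text}", f"agent: {agent_text}"]
        self.turn += 1
```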
Voice AI is more than just plugging in an LLM. It's an orchestration challenge involving complex AI coordination across STT, TTS, and LLMs, low-latency processing, and context & integration with external systems and tools.

Let's start with the basics (a minimal pipeline sketch follows this post):

---- Real-time Transcription (STT)
Low-latency transcription (<200ms) from providers like Deepgram ensures real-time responsiveness.

---- Voice Activity Detection (VAD)
Essential for handling human interruptions smoothly, with tools such as WebRTC VAD or LiveKit Turn Detection.

---- Language Model Integration (LLM)
Select your reasoning engine carefully: GPT-4 for reliability, Claude for nuanced conversations, or Llama 3 for flexibility and open-source options.

---- Real-Time Text-to-Speech (TTS)
Natural-sounding speech from providers like Eleven Labs, Cartesia, or Play.ht enhances user experience.

---- Contextual Noise Filtering
Implement custom noise-cancellation models to effectively isolate speech from real-world background noise (TV, traffic, family chatter).

---- Infrastructure & Scalability
Deploy on infrastructure designed for low-latency, real-time scaling (WebSockets, Kubernetes, cloud infrastructure from AWS/Azure/GCP).

---- Observability & Iterative Improvement
Continuous improvement through monitoring tools like Prometheus, Grafana, and OpenTelemetry ensures stable and reliable voice agents.

📍 You can assemble this stack yourself or streamline the entire process using integrated API-first platforms like Vapi. Check it out here ➡️ https://bit.ly/4bOgYLh

What do you think? How will voice AI tech stacks evolve from here?
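As a rough sketch of how these pieces chain together at runtime, here is the STT -> LLM -> TTS loop using asyncio. The transcribe_stream, llm_reply, and synthesize callables are hypothetical placeholders for whichever providers you pick (Deepgram, GPT-4, ElevenLabs, and so on); the point is the shape of the loop and where the latency budget is spent, not any vendor's API.

```python
# Hypothetical realtime voice loop: stream transcripts in, speak replies out.
import time

async def voice_agent_loop(mic, speaker, transcribe_stream, llm_reply, synthesize):
    async for utterance in transcribe_stream(mic):   # finalized user utterances
        t0 = time.monotonic()
        reply = await llm_reply(utterance)           # reasoning step
        audio = await synthesize(reply)              # text-to-speech step
        latency_ms = (time.monotonic() - t0) * 1000
        print(f"turn latency: {latency_ms:.0f} ms")  # track against your budget
        await speaker.play(audio)
```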
What is an 𝗔𝗜 𝗩𝗼𝗶𝗰𝗲 𝗔𝗴𝗲𝗻𝘁 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲 and what components are needed to build one?

An AI Voice Agent is an orchestrated system that speaks and listens like a human. Platforms like Retell AI combine speech recognition, large language models, and natural language understanding to make this happen.

Let's look into the 6 core components:

1️⃣ Speech Input (STT Layer)
👉 Captures audio and converts speech to text in real time.
👉 Handles voice activity detection, background noise, and accents.
✅ Barge-in detection allows natural interruptions, as in human conversations.
Examples: Deepgram, OpenAI Whisper, etc.

2️⃣ Understanding (LLM Engine)
👉 Processes text to understand user intent and context.
👉 Handles multi-turn reasoning and triggers real-world actions via function calling.
✅ GPT-5 achieves a 70%+ success rate in multi-turn function calling.
Examples: OpenAI, Anthropic, Gemini, etc.

3️⃣ Conversation Memory
👉 Maintains state across the entire dialogue.
👉 Remembers preferences, past interactions, and conversation history.
✅ Without memory, every turn feels like starting over.

4️⃣ Response Generation (TTS Layer)
👉 Converts LLM output into natural, human-like speech.
👉 Controls intonation, emotion, and pacing.
✅ Premium voices sound indistinguishable from humans.
Examples: ElevenLabs, Cartesia, OpenAI, etc.

5️⃣ Integrations & Function Calling
👉 Connects to CRMs, calendars, and business systems with n8n, Make, and Zapier.
👉 Executes real actions: booking meetings, updating records, querying data (see the function-calling sketch after this post).
✅ Without integrations, you just have a talking chatbot.
Examples: Calcom, Salesforce, HubSpot, etc.

6️⃣ Telephony & Analytics
👉 Handles call routing, SIP trunking, and phone infrastructure.
👉 Tracks call quality, sentiment, and task completion.
✅ Analytics turn voice interactions into actionable insights.
Examples: Twilio, Telnyx, Retell AI's native carrier, etc.

𝗠𝘆 𝘁𝗵𝗼𝘂𝗴𝗵𝘁𝘀 𝗼𝗻 𝗯𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗩𝗼𝗶𝗰𝗲 𝗔𝗴𝗲𝗻𝘁𝘀 𝘄𝗶𝘁𝗵 Retell AI:
👉 Don't stitch ASR + LLM + TTS yourself: latency and interruption handling are harder than they look.
👉 Nail conversation design first. The best STT/TTS won't save a bad dialogue flow.
👉 Integrations determine ROI. If it can't book meetings or update your CRM, it's just a demo.
👉 Track latency, completion rates, and sentiment from day one.

Over to you: What voice agent use case are you building?
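Here is the function-calling sketch promised in component 5️⃣: the LLM emits a structured tool call, and the agent looks it up in a registry and executes it. The book_meeting tool and the JSON shape are hypothetical illustrations, not Retell AI's actual interface.

```python
# Hypothetical tool registry plus dispatcher for LLM function calls.
import json

TOOLS = {}

def tool(fn):
    """Register a Python function as a tool the LLM may call."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def book_meeting(email: str, start_iso: str) -> str:
    # A real implementation would hit your calendar or CRM integration here.
    return f"Booked {email} for {start_iso}"

def dispatch(llm_tool_call: str) -> str:
    """Execute a tool call the LLM emitted as JSON with "name" and "args" keys."""
    call = json.loads(llm_tool_call)
    return TOOLS[call["name"]](**call["args"])

# Usage: this is the JSON an LLM's function-calling output might contain.
print(dispatch('{"name": "book_meeting", '
               '"args": {"email": "a@b.co", "start_iso": "2025-06-01T10:00"}}'))
```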
Voice isn’t just a dial tone: it’s the next frontier for Agentic AI, and it’s driving a voice renaissance.

I’ve been saying for years that voice is the richest source of customer intent, yet it’s often been the hardest data set to unlock. This week RingCentral is looking to change that with the announcement of its integration with OpenAI (utilizing GPT-5.2) to bring high-fidelity, low-latency Generative AI directly into the live voice stream.

Why does this matter? Most AI tools today are "after-the-fact": they summarize a meeting or analyze a transcript once the call is over. While useful, that’s reactive. What RingCentral is doing with its new AI Virtual Assistant (AVA) and AI Receptionist (AIR) is Agentic AI. It’s about moving from "insights" to "actions" in real time.

Who should care?
🔹 CX Leaders: Imagine a world where your AI handles the "front door" (AIR), but when a human needs to step in, the AI (AVA) hands off the full context, intent, and next-best actions instantly. No more "Can you repeat your problem for the third time?"
🔹 IT & Ops: You get the power of OpenAI’s frontier models, but within RingCentral’s carrier-grade, secure framework. This isn’t "shadow AI"; it’s enterprise-grade governance where your data isn’t used to train public models.
🔹 The C-Suite: This is a productivity multiplier. Early data shows a 14% increase in appointments and massive revenue gains for early adopters like Televero Behavioral Health.

The Bottom Line for RingCentral: This move shifts RingCentral from being a provider of communications tools to an AI orchestration platform. By bridging the gap between LLMs and real-world voice infrastructure, they are making AI practical and "human-like" rather than just a chatbot experience. The era of the "dumb" desk phone ended a long time ago. We are now entering the era of the Intelligent Conversation.

You can check out the press release here: https://lnkd.in/e7H6dcKJ

Next week, I’ll be at #RingCentralAnalystSummit and hope to hear more on this. Tim Dreyer, Kira Makagon, Jennifer C., Akshay Srivastava, Vlad Shmunis, Carson Hostetter

#UCaaS #AI #GenerativeAI #OpenAI #RingCentral #FutureOfWork #CX #ZKResearch
Voice has been AI's blind spot. RingCentral just made it the centerpiece.

For years, enterprise AI investment has flowed into documents, emails, and meeting summaries: the written record of business. Voice, the richest signal of customer intent, has been treated as an afterthought: recorded, transcribed, analyzed after the moment that mattered most had already passed.

RingCentral is changing that calculus. This week the company announced a deep integration with OpenAI (GPT-5.2), bringing generative AI directly into live enterprise voice conversations: not post-call summaries, but real-time intelligence in the flow of the interaction itself.

Two new agentic products anchor the launch:
- AI Receptionist (AIR) automates inbound call handling: answering, scheduling, routing, and executing natural-language follow-up without human intervention.
- AI Virtual Assistant (AVA) picks up where AIR leaves off. When a call does route to a human, AVA surfaces full customer context and recommended next actions before the employee says a word.

Together, they form a unified intelligence layer across the entire interaction lifecycle. This isn't a chatbot bolted onto a phone system. It's AI that actively shapes conversation outcomes.

The early numbers are hard to ignore. AIR surpassed 5,000 customers in just two quarters. Televero Health reported a 97% patient satisfaction rate, a 14% increase in monthly appointments, and more than $200K in additional monthly revenue within four months of deployment.

On the governance side, a persistent friction point in enterprise AI adoption, customer data is not used to train public models. For regulated industries like healthcare and financial services, that's not a nice-to-have. It's a requirement.

The vendors that win the AI transition won't be those that bolted features onto legacy platforms. They'll be those that rebuilt the intelligence layer from the ground up.

I'm in Phoenix this week with the RingCentral team and looking forward to going deeper on this partnership, including what flexibility looks like if they ever want to evolve beyond OpenAI. In a market moving this fast, keeping your options open is just good strategy.

The passive phone system era is effectively over. The question now is which platform can deliver AI-powered voice responsibly, at scale, and with measurable results.

#UnifiedCommunications #EnterpriseAI #RingCentral #AI #CX #VoiceAI #AgenticAI
𝑨𝒏 𝒂𝒈𝒆𝒏𝒕 𝒕𝒉𝒂𝒕 𝒄𝒂𝒏 𝒕𝒂𝒍𝒌 𝒕𝒐 𝒚𝒐𝒖 𝒊𝒏 𝒓𝒆𝒂𝒍𝒕𝒊𝒎𝒆 𝒂𝒏𝒅 𝒂𝒄𝒕𝒖𝒂𝒍𝒍𝒚 𝒈𝒆𝒕 𝒚𝒐𝒖𝒓 𝒘𝒐𝒓𝒌 𝒅𝒐𝒏𝒆?

Most AI assistants stop at chat. In real environments, that’s not very useful. What you actually need is an assistant that can listen, understand intent, trigger workflows, and respond back - all end to end 🙃.

That’s exactly what I built in this hands-on IBM Developer tutorial. I walk through creating a voice-enabled AI assistant where:
👉 A spoken question (via Siri) is captured
👉 Routed through a watsonx Orchestrate agent
👉 The right workflow is executed
👉 And it responds back with voice, not text

While building this, one thing became obvious: single agents don’t scale. Real automation needs agentic AI: multiple agents, structured steps, tools, and orchestration working together instead of one model doing everything poorly.

In this tutorial, you’ll learn:
🎙️ How voice input connects to watsonx Orchestrate
🧠 How agents interpret intent and execute workflows
🧩 How to design agentic workflows, not just basic chatbots
🏭 How this pattern applies to real enterprise automation

If you’re exploring agentic AI, copilot alternatives, or want assistants that go beyond chat, this is a practical place to start.

👉 Tutorial: https://lnkd.in/gd9QW7am

Let’s build assistants that work with voice. 🚀

Suzanne Livingston | Ahmed Azraq | Ela Dixit | Moisés Domínguez | Dr Prasanna Shrinivas V, PhD | Ashutosh Tiwari | Warren Fung | Bill Higgins | Michelle Corbin | Karim Younsi | Mithun Katti | Matt Sanchez | Dawn Herndon | Sampath Dechu | Carlo Appugliese | Jorge Castañón, Ph.D. | Peter A. Keller | Kami Haynes | Rahul Akolkar | Andre Tost | Andreas Horn | Supriya Raman | Owais Ahmad

#AgenticAI #watsonxOrchestrate #VoiceAI #IBMDeveloper #AIEngineering
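The voice-in, workflow, voice-out pattern the tutorial describes fits in a few lines. This is a hedged sketch only: the three helpers are stubs standing in for a real STT service, a watsonx Orchestrate agent call, and a TTS service, not actual SDK calls.

```python
# Hypothetical end-to-end flow: spoken question -> agent workflow -> spoken answer.
def speech_to_text(audio: bytes) -> str:
    return "what is my leave balance"   # stub: a real STT call goes here

def run_orchestrate_agent(question: str) -> tuple[str, str]:
    # Stub: a real call would let the agent pick and execute a workflow.
    return "hr.leave_balance", "You have 12 days of leave remaining."

def text_to_speech(answer: str) -> bytes:
    return answer.encode()              # stub: a real TTS call goes here

def handle_spoken_request(audio: bytes) -> bytes:
    question = speech_to_text(audio)                   # 1. capture the question
    intent, answer = run_orchestrate_agent(question)   # 2. execute the workflow
    print(f"routed intent: {intent}")
    return text_to_speech(answer)                      # 3. respond with voice
```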
Exciting times ahead for Gen AI/Agentic AI in voice-based customer support. While conversational bots/agents have made significant strides in handling non-voice interactions, a large part of customer support still relies on voice channels. Voice channels are harder to automate for two reasons: 1) voice conversations are quite different from text conversations, with multi-turns and interruptions; 2) the tech stack for handling voice has been multi-layered, creating latency and accuracy issues.

Two significant developments are set to transform this space:

1. Enhanced Voice Interaction: Traditional voice agents often struggled with latency and interruptions, leading to stilted conversations. New advancements like the Realtime API are enabling voice AI agents to engage in more natural, fluid dialogues (see the barge-in sketch after this post). (We are building Voice Agents into the Wizr platform, so please reach out if there is interest to try out some of these capabilities.)

2. Computer Use by Anthropic (and hopefully from others soon): While the AI layer has the capability to answer, in a typical enterprise, information sits in multiple systems that need to be integrated, which is a herculean task in itself for any large enterprise. Anthropic's recent launch of "computer use" opens up the possibility of combining the intelligence layer for customer interaction with computer-use capabilities. This can now automate a lot of customer support scenarios without the painful process of system integrations.

While these advancements are still maturing, we’re at the forefront of an exciting transformation in voice-driven customer support.

Sirish Kosaraju Srinivas K Nitin Gupta
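On point 1, the behavior that makes realtime dialogue feel natural is barge-in: cutting the agent's audio the moment the user starts speaking. A minimal sketch, with hypothetical vad_events and player objects rather than any specific vendor SDK:

```python
# Hypothetical barge-in monitor: stop TTS playback when the user speaks.
import asyncio

async def barge_in_monitor(vad_events: asyncio.Queue, player) -> None:
    while True:
        event = await vad_events.get()    # e.g. "speech_start" / "speech_end"
        if event == "speech_start" and player.is_playing():
            player.stop()                 # cut the agent's audio immediately
            # the pipeline then routes the new user audio to STT and the LLM
```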