Continuing from last week’s post on the rise of the Voice Stack, there’s an area that today’s voice-based systems often struggle with: Voice Activity Detection (VAD) and the turn-taking paradigm of communication. When communicating with a text-based chatbot, the turns are clear: You write something, then the bot does, then you do, and so on. The success of text-based chatbots with clear turn-taking has influenced the design of voice-based bots, most of which also use the turn-taking paradigm. A key part of building such a system is a VAD component to detect when the user is talking.

This allows our software to take the parts of the audio stream in which the user is saying something and pass them to the model as the user’s turn. It also supports interruption in a limited way: if a user insistently interrupts the AI system while it is talking, eventually the VAD system will realize the user is talking, shut off the AI’s output, and let the user take a turn.

This works reasonably well in quiet environments. However, VAD systems today struggle with noisy environments, particularly when the background noise is from other human speech. For example, if you are in a noisy cafe speaking with a voice chatbot, VAD — which is usually trained to detect human speech — tends to be inaccurate at figuring out when you, or someone else, is talking. (In comparison, it works much better if you are in a noisy vehicle, since the background noise is more clearly not human speech.) It might think you are interrupting when it was merely someone in the background speaking, or fail to recognize that you’ve stopped talking. This is why today’s speech applications often struggle in noisy environments.

Intriguingly, last year, Kyutai Labs published Moshi, a model that had many technical innovations. An important one was enabling persistent bi-directional audio streams from the user to Moshi and from Moshi to the user.
If you and I were speaking in person or on the phone, we would constantly be streaming audio to each other (through the air or the phone system), and we’d use social cues to know when to listen and how to politely interrupt if one of us felt the need. Thus, the streams would not need to explicitly model turn-taking. Moshi works like this. It’s listening all the time, and it’s up to the model to decide when to stay silent and when to talk. This means an explicit VAD step is no longer necessary.

Just as the architecture of text-only transformers has gone through many evolutions, voice models are going through a lot of architecture explorations. Given the importance of foundation models with voice-in and voice-out capabilities, many large companies right now are investing in developing better voice models. I’m confident we’ll see many more good voice models released this year. [Reached length limit; full text: https://lnkd.in/g9wGsPb2 ]
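The VAD-gated turn-taking loop described above can be sketched as a small state machine: energy above a threshold counts as speech, and a "hangover" of silent frames decides when the user's turn has ended. This is a toy energy-threshold detector, not the trained speech classifiers real systems use (e.g. WebRTC's VAD or Silero); the threshold and hangover values are illustrative only.

```python
# Toy turn-taking loop: a VAD gates which audio frames count as the user's
# turn; the turn ends only after several consecutive silent frames, so brief
# pauses don't cut the user off. Energy thresholding stands in for a trained
# speech detector, which is why this sketch fails in noisy cafes too.

def frame_energy(frame):
    """Mean squared amplitude of one audio frame (a list of samples)."""
    return sum(s * s for s in frame) / len(frame)

class SimpleVAD:
    def __init__(self, threshold=0.01, hangover=5):
        self.threshold = threshold   # energy above this counts as speech
        self.hangover = hangover     # silent frames needed to end a turn
        self.silent_frames = 0
        self.in_speech = False

    def update(self, frame):
        """Feed one frame; returns 'start', 'speech', 'end', or 'silence'."""
        if frame_energy(frame) > self.threshold:
            self.silent_frames = 0
            if not self.in_speech:
                self.in_speech = True
                return "start"       # user began talking (or interrupted)
            return "speech"
        if self.in_speech:
            self.silent_frames += 1
            if self.silent_frames >= self.hangover:
                self.in_speech = False
                return "end"         # turn over: send buffered audio to model
            return "speech"          # brief pause, still the user's turn
        return "silence"
```

On an `"end"` event the application would forward the buffered speech segment to the model; a `"start"` event while the bot is speaking is the interruption case described above. Moshi's design removes this component entirely by keeping both audio streams always open.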
-
#VoiceAI just crossed a line most of us didn’t see coming. Alibaba’s #Qwen3-TTS-1.7B isn’t another “better robot voice.” It sounds… human. Uncomfortably so. Natural tone. Emotional range. Accent control. And it runs in real time on everyday hardware.

This isn’t a lab demo locked behind enterprise pricing. It’s fully open-source. Real-time. Usable.

What stands out isn’t just the feature list, but what it signals. With a few seconds of reference audio, a voice can be recreated. Emotion is no longer implied; it’s instructed. Latency is low enough for live conversations. Languages are handled with consistency, not patchwork fixes. And the license removes the meter that used to tick with every word spoken.

The quiet shock is this: benchmarks show speaker similarity that rivals, and in some cases exceeds, well-known proprietary voice platforms—on a single GPU. That changes the economics overnight. Voice once meant studios, contracts, and per-minute costs. Now it means open models, local deployment, and fully owned voice systems.

For builders, this opens doors that were previously bolted shut: real-time agents that don’t sound synthetic. Accessibility tools that feel respectful, not mechanical. Learning, gaming, storytelling, and support systems where voice is no longer the bottleneck. The interface just became more human.

And that’s exactly where the unease begins. When voices can be copied this easily, sound loses its authority. Audio can no longer stand alone as proof. Impersonation, fraud, and social engineering don’t need better scripts anymore. They just need a familiar voice.

This is why risk, verification, and trust systems can no longer be optional layers. They are fast becoming core infrastructure. We are stepping into a phase where seeing was already questionable, and now hearing is too.

Technology taught machines how to speak with us. The harder task ahead is teaching ourselves how to listen—carefully, critically, and with context. Progress didn’t slow down.
It just got a voice.
-
Screens are optional—conversation isn’t. Voice agents have finally crossed the line from “nice demo” to mass-scale live production.

A Fortune 100 health insurer has replaced swaths of its call-centre workforce with an AI agent that listens to symptom descriptions, gauges urgency and benefit details, and steers members to the right in-house nurse or in-network provider. Early results show mis-routed calls collapsing while human nurses concentrate on the most complex cases—evidence that, when trained on medical nuance, automation can still deliver empathy.

The same capability is trickling down to Main Street. A neighbourhood dental clinic now relies on a 24/7 AI receptionist that fills midnight cancellations, takes deposits and syncs instantly with the practice-management calendar, eliminating the Monday-morning voicemail backlog. Nearby, an auto body shop lets its voice agent quote repairs and capture credit-card details while mechanics sleep, winning leads that used to hang up after three rings.

Why does this feel inevitable? Voice is simply higher bandwidth than text; tone, pace and sighs carry layers of meaning a text interaction cannot. Studies show people (and agents) read emotion and feel connection more accurately when they hear a voice. As latency drops below half a second and costs reach pennies per minute, talking will again beat typing for many tasks—only this time the “person” on the other end might be generated by silicon.

Now imagine the next step: every brand offers you a personal concierge that remembers the hiking boots you bought last spring, the hotel room you preferred in Tel Aviv or your preference for classical hold music. It greets you by name, picks up the last conversation mid-sentence and suggests dinner before you even think to ask. Conversation becomes the API.

Optimism doesn’t erase risk. Voice-cloning scams already account for more than 40 percent of fraud attempts in finance, up twenty-fold in three years.
Protecting both brands and callers will demand a new security layer: real-time likeness checks, rotating pass-phrases and cryptographic watermarks baked into synthetic speech so a courtroom—or a phone—can tell the difference between a genuine agent and a deepfake. That challenge is an opening for startups. I’m curious: if you’re experimenting with voice, how are you balancing speed, empathy and security? And what surprised you when real customers finally started talking back? Happy to compare notes.
-
Imagine trying to get a workout recommendation while running, navigate a complex route while driving, or get tech support while cooking - all without touching a screen. This is the promise of voice-enabled LLM agents, a technological leap that's redefining how we interact with machines.

Traditional text-based chatbots are like trying to dance with two left feet. They're clunky, impersonal, and frustratingly limited. Consider these real-world friction points:
- A visually impaired user struggling to type support queries
- A fitness enthusiast unable to get real-time guidance mid-workout
- A busy professional multitasking who can't pause to type a complex question

Voice AI breaks these barriers, mimicking how humans have communicated for millennia. We learn to speak by four months, but writing takes years - testament to speech's fundamental naturalness.

Real-World Transformation Examples:
1️⃣ Healthcare: Emotion-recognizing AI can detect patient stress levels through voice modulation, enabling more empathetic remote consultations.
2️⃣ Fitness: Hands-free coaching that adapts workout intensity based on your breathing and vocal energy.
3️⃣ Customer Service: Intelligent voice systems that understand context and emotional undertones, and personalize responses in real time.

The magic of voice lies in its nuanced communication:
- Tone reveals emotional landscapes
- Intensity signals urgency or excitement
- Rhythm creates conversational flow
- Inflection adds layers of meaning beyond mere words

Voice-enabled systems can:
- Recognize emotional states with unprecedented accuracy
- Support rich, multimodal interactions combining voice, visuals, and context
- Differentiate speakers in complex conversations
- Extract subtle contextual intentions
- Provide personalized responses based on voice characteristics

In short, this technology is about creating more human-centric technology that listens, understands, and responds like a thoughtful companion.
The future of AI isn't about machines talking at us, but talking with us.
-
🗣️ Voice AI is everywhere—but which use cases are delivering ROI today, and which will tomorrow? We map adoption into four waves—full breakdown here 👉 https://lnkd.in/gbxveFjA

Voice AI’s “second act” isn’t a gimmick; it’s becoming the backbone of autonomous workflows in trillion-dollar industries. Voice isn’t just the product anymore—what matters is what voice unlocks across an entire organization.

1️⃣ Wave 1 — Infrastructure
Foundational models, tooling, and orchestration: Cartesia, Vapi, LiveKit, Hamming AI, David AI, etc.

2️⃣ Wave 2 — Horizontal Platforms
24/7 AI call-center agents replacing legacy phone trees. Here, a high-quality voice agent from Cresta, Parloa, Sierra, or Decagon is still quite central to “the product.”

3️⃣ Wave 3 — Vertical Agents (now)
Domain-specific agents eating labor spend, in which voice is either a wedge or an expansion exponent.
📦 Logistics: Augment, HappyRobot, FleetWorks, Vooma, and Pallet each automate different parts of the Freight Forwarder, Broker, Carrier, and Shipper stack, such as load updates, scheduling, and carrier negotiations—chipping away at legacy TMS/WMS.
🛡️ Insurance: Strada & Liberate integrate with Guidewire / Applied to run sales and service 24/7.
🩺 Healthcare & Pharma: Assort Health & Hippocratic AI book visits, triage calls, and guide patients with empathy. Tandem & Squad Health address cumbersome processes like prior authorizations, financial assistance management, and pharmacy coordination.
🏭 Manufacturing & Wholesale Distribution: Endeavor & Canals AI ingest multi-channel orders, sync ERPs, and surface cross-sell insights. DOSS.COM built an ERP to unify inventory, orders, and production into one platform.
🛠️ Home Services: Netic & Avoca layer AI agents on intake, scheduling, and quoting for trades pros.
🔍 User Research: Listen Labs & Strella deliver adaptive voice interviews at survey speed, replacing weeks of moderated sessions.
4️⃣ Wave 4 — Edge-Native, Trust-First Companions (emerging)
Consumer adoption has lagged, but NPUs now ship in every phone, laptop, and wearable—running billion-parameter speech models fully offline. Qualcomm’s AR1+ glasses, Snapdragon X PCs, and Google’s Gemini Nano prove it: sub-second, privacy-safe voice on the edge. Add “dialect packs” that load on demand, plus FCC & EU rules that watermark every utterance, and the stage is set for ambient AI sidekicks that feel personal and compliant. Resonant personalities are the secret ingredient for consumer adoption, enabled by culturally nuanced voices users choose to spend time with.

Voice AI isn’t just here to stay—it’s opening the floodgates for the complete transformation of countless verticals and consumer applications. If you're building in any of these areas, please reach out to Kristina Shen and me—we'd love to chat!
-
Voice AI isn’t just getting smarter, it’s finally getting fast enough to feel natural. The drop in latency over the past year is what’s unlocking real use cases in financial services: you can now have a multi-turn, human-like conversation with AI that overlaps and pauses like a person. That subtle timing upgrade is what makes voice feel believable and practical for data collection, not just support.

Here’s how firms are starting to use it:

First, it’s not for troubleshooting or FAQs. It’s for servicing: collecting missing form fields without long email threads. When a form is short a few data points, AI can call the client, capture 20 to 25 fields in roughly 90 seconds, and log everything securely for audit. No typing, no portals, no “please resend this form” loops.

Second, voice offers a security win. Email isn’t secure for PII, and most firms ban text for anything sensitive. Voice with enforced consent solves both: the call is consent‑logged, encrypted, and stored safely. It’s faster than forms and safer than email or SMS.

Third, it meets clients where they are. Some prefer email, some text, but many like the convenience of simply speaking for three minutes instead of typing into a form. For busy clients, it’s often the easiest and fastest option.

Finally, the trick is tactical design: tune the AI’s pauses and overlaps so it feels natural, keep the focus narrow, and route complex questions to humans. That’s how you improve speed without risking trust.

As one FA put it recently, “You’ll start to see it used in the next six to 18 months.”

What part of your servicing process would you automate with voice if latency and compliance were no longer blockers?
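The "collect missing fields" pattern described above is essentially a slot-filling loop: the agent asks only about fields the form is missing, validates each transcribed answer, and keeps an audit trail. A minimal sketch under stated assumptions; the field names, validators, and the `ask_and_transcribe` callable are hypothetical stand-ins for the real telephony and speech-to-text layer.

```python
# Slot-filling sketch: ask only for missing fields, validate each answer,
# log everything for audit, and escalate to a human after repeated failures.
# ask_and_transcribe(field) stands in for the voice agent asking a question
# and returning the transcribed reply.

import re

REQUIRED_FIELDS = {
    # field name -> validator for the transcribed answer (illustrative)
    "date_of_birth": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "employer": re.compile(r"^.{2,}$"),
    "annual_income": re.compile(r"^\d+$"),
}

def collect_missing_fields(form, ask_and_transcribe, max_retries=2):
    """Fill in missing form fields by voice; returns (form, audit_log)."""
    audit_log = []
    for field, validator in REQUIRED_FIELDS.items():
        if form.get(field):              # already on file: don't ask
            continue
        for _attempt in range(max_retries + 1):
            answer = ask_and_transcribe(field)
            if validator.match(answer):
                form[field] = answer
                audit_log.append((field, answer, "captured"))
                break
            audit_log.append((field, answer, "invalid"))
        else:
            # validation kept failing: route to a human, per the post's advice
            audit_log.append((field, None, "escalate_to_human"))
    return form, audit_log
```

The narrow focus is the point: the agent never improvises beyond the field list, and anything it can't validate goes to a person, which is how the speed gain avoids costing trust.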
-
We’re already in the early swell of the next technology wave. And it’s not video. Voice. 🌊🎙️

We’re swimming in a sea of video—TikTok, Instagram, YouTube, LinkedIn. But the wave I’m increasingly watching build? Voice. Over 40% of internet users already use voice search or commands. Voice now drives more than 1 in 5 online searches. And customer satisfaction with voice bots is climbing faster than with text.

The next wave of AI interaction may sound a lot more like a conversation than a keyboard. I’ve been a believer in voice for years — even using LinkedIn’s lesser-known voice messaging feature long before most people knew it existed. It’s faster than typing, and it carries tone, emotion, and nuance that text can’t. At its best, it brings more humanity and connection to the exchange.

Here’s why I think voice is about to have, or is already starting to have, its moment:
• Friction is vanishing — speech recognition is now near-instant and near-perfect.
• AI makes voice smarter — it’s no longer just transcribing, it’s understanding.
• Distribution is ready — voice-enabled devices are already everywhere: in our pockets (smartphones), in our kitchens (smart speakers), and in our cars (built-in voice systems). Once AI makes these interactions dramatically better, adoption could explode without anyone needing to buy new hardware.
• Behaviors are tipping — we’re comfortable dictating and sending voice notes; the leap to richer voice-AI interaction will be natural.
• The killer app will be invisible — voice will win when it feels like you’re talking to someone who is always there and who gets you.

If this is true, then…
• Workflows will change — speaking will become the default way we create, collaborate, and think with AI. That means faster decision cycles, more inclusive participation from people who think better out loud, and new norms where meetings and updates shift from typed docs to short, conversational exchanges — with AI turning them into polished outputs.
• Emotional intelligence in AI will rise — tone, pacing, and emotion will shape how machines respond to us. And differentiation may come from AI that understands us better — from personality assessments to subtle cues like how we slept the night before. Upfront and ongoing understanding, powered by unique personal data, could let AI tune itself uniquely to its user and owner.
• Access will broaden — voice removes barriers for non-typists, non-native writers, and people with disabilities, opening up entirely new audiences and markets.

Sam Altman and Jony Ive aren’t building for some distant future — they’re building for right now. If they get voice right, the shift won’t be gradual. One day soon, we’ll wake up and realize typing is the old way.

What do you think — are we ready for the voice wave?
-
Voice AI spent this week proving it can ship in regulated industries. Not a model race. Not a TTS benchmark. Actual production deployments, real funding rounds, and real infrastructure bets.

Telnyx launched LiveKit on Telnyx, a fully hosted platform running LiveKit agents on Telnyx-owned infrastructure. Existing LiveKit code works unchanged. 50% lower STT and TTS costs, sub-200ms round-trip time, session fees waived during beta. The carrier layer, GPU layer, and framework layer collapsed onto a single bill.

ElevenLabs announced on-premise and on-device deployment. The same voice agent stack that Klarna and Revolut run in the cloud now ships locally for regulated and air-gapped environments. Combined with their recent ElevenLabs for Government product, this is the clearest signal yet that voice AI is moving into places where cloud was a blocker.

REGAL launched Copilot, an agent for building self-improving voice AI agents. Time-to-live drops to a single day, and the agent keeps iterating from real call data instead of prompt engineering sessions. Regal is now at $83M total raised and betting that production traffic, not human tuning, is what makes voice agents better.

Acclaim AI (formerly Aiphoria) launched in the US with a $34M Series A led by Veeam co-founder Ratmir Timashev. Voice-first CX purpose-built for banking, fintech, healthcare, and insurance. The thesis is that regulated industries need voice agents that rehumanise customer experience, not copy the robotic phone tree model.

Narwhal Labs raised £20M to scale DeepBlue OS, an autonomous communications platform that runs voice, SMS, email, and WhatsApp workflows through a single AI agent. Three concurrent agents handle inbound, outbound, and follow-up across 50+ languages. The pitch is replacing the human communications team entirely, not augmenting it.

Natter raised $23M Series A led by Renegade Partners for AI conversation intelligence.
The platform runs thousands of simultaneous 1:1 video conversations in parallel, generating insights in hours instead of weeks. Accenture, ServiceNow, PwC, Mondelez, and Philip Morris are already replacing surveys and focus groups with it.

Google quietly shipped AI Edge Eloquent on iOS, a free, offline, Gemma-powered dictation app with no cap and no subscription. Paid cloud-based dictation incumbents charging $15 a month suddenly look a lot less defensible.

Miravoice raised $6.3M seed from Unusual Ventures to run AI voice interviewer agents for large-scale phone surveys. Over 100,000 structured interviews completed in 2025. The angle is known-question, survey-grade conversations at market research scale, not open-ended agentic workflows.

The common thread: voice agents are no longer sold on model quality. They are sold on deployment environments, self-improvement, and the ability to do work that replaces headcount in regulated industries.
-
Google just broke the language barrier. Pixel 10 now translates your phone calls in real time. But the real breakthrough? It speaks your own voice in another language...

Samsung and Apple already have live translation. But their versions sound robotic. Google’s Pixel 10 goes further:
- Captures your voice
- Recreates it in another language
- Runs instantly on-device (no cloud, no lag)

Imagine the possibilities: sales teams closing deals in China. Support reps helping customers in Portuguese. Factories coordinating in fluent German.

But there’s a darker side. The same tech that makes you sound native also lets scammers impersonate you. In any language. With your exact voice.

Picture this: your CFO calls. Perfect English, their real voice. “Wire $2M for the acquisition. It’s urgent.” How do you know it’s real?

Here’s the enterprise playbook:
✅ Green light — sales calls, vendor coordination, customer support
⚠️ Yellow light — financial or credential requests (add verification)
⛔ Red light — wire transfers by voice alone

Build safeguards now:
- Call back known numbers before acting on financial requests
- Use pre-agreed codewords only insiders know
- Require 2+ approvals for voice-initiated transfers
- Enable STIR/SHAKEN caller attestation

This isn’t just a security feature. It’s survival.

The bottom line: this tech is too powerful to ignore. But too dangerous to deploy carelessly. The companies that balance growth with safeguards will own global markets.
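The green/yellow/red playbook above is easy to encode as a policy gate in call-handling software: classify the request, then decide what extra verification it needs before anyone acts. A minimal sketch, assuming illustrative category names, codeword checks, and approval counts; this is not any real product's API.

```python
# Policy gate for a green/yellow/red voice-request playbook: classify a
# voice-initiated request and decide what verification it needs before
# acting. Categories and rules are illustrative stand-ins.

GREEN = {"sales_call", "vendor_coordination", "customer_support"}
YELLOW = {"financial_request", "credential_request"}
RED = {"wire_transfer"}

def verification_policy(request_type, approvals=0, codeword_ok=False):
    """Return the action to take for a voice-initiated request."""
    if request_type in GREEN:
        return "proceed"                     # low risk: no extra checks
    if request_type in YELLOW:
        # add verification: pre-agreed codeword or callback to a known number
        return "proceed" if codeword_ok else "verify_via_callback"
    if request_type in RED:
        # never act on voice alone: codeword plus two or more approvals
        if codeword_ok and approvals >= 2:
            return "proceed"
        return "block_pending_approvals"
    return "escalate_to_human"               # unknown request types fail safe
```

The key design choice is that the red tier can never be satisfied by the voice channel itself; approvals and codewords live outside the call, which is exactly what defeats a cloned voice.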
-
Is This the Future of Human-AI Interaction? Sesame's "Voice Presence" is Astonishing.

Have you ever truly felt like you were having a conversation with an AI? Sesame, founded by Oculus co-founder Brendan Iribe, is pushing the boundaries of AI voice technology with its Conversational Speech Model (CSM). The results are striking. As The Verge's Sean Hollister noted, it's "the first voice assistant I've ever wanted to talk to more than once." Why? Because Sesame focuses on "voice presence," creating spoken interactions that feel genuinely real and understood.

What's the potential impact for businesses?
- Enhanced Customer Service: Imagine AI assistants that can handle complex inquiries with empathy and natural conversation flow.
- Improved Accessibility: More natural voice interfaces can make technology accessible to more users.
- Revolutionized Content Creation: Voice models like Maya and Miles could open up new audio and video content possibilities.
- Training and Education: Interactive AI tutors could provide personalized and engaging learning experiences.

The most impressive part? In blind listening tests, humans often couldn't distinguish Sesame's AI from real human recordings.

#AI #ArtificialIntelligence #VoiceTechnology #Innovation #FutureofWork #CustomerExperience #MachineLearning #SesameAI