User Experience for Voice Interfaces

Explore top LinkedIn content from expert professionals.

  • Aishwarya Srinivasan (Influencer)
    627,898 followers

Cartesia Sonic-3 is the first AI voice model I’ve seen that nails Hindi perfectly. For years, even the best text-to-speech (TTS) models struggled with Hindi: the rhythm, tonality, and emotional micro-expressions just didn’t sound human, and the accent was inaccurate. This model doesn’t just translate Hindi. It is specially trained for it, with precise control over pacing, expressions, and tonality, all rendered in real time.

    Under the hood, Sonic-3 is engineered for low-latency voice generation optimized for conversational AI agents, clocking in 3–5x faster than OpenAI’s TTS while maintaining superior transcript fidelity. What makes it stand out technically:

    → 𝗚𝗿𝗮𝗻𝘂𝗹𝗮𝗿 𝗰𝗼𝗻𝘁𝗿𝗼𝗹 𝘁𝗮𝗴𝘀 let developers dynamically modulate speed, volume, and emotion inside the transcript itself. (“Can you repeat that slower?” now works in production.)
    → 𝟰𝟮-𝗹𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗺𝘂𝗹𝘁𝗶𝗹𝗶𝗻𝗴𝘂𝗮𝗹 𝗺𝗼𝗱𝗲𝗹 built on a single unified speaker embedding, so one voice can switch between languages like Hindi, Tamil, and English natively while maintaining accent continuity.
    → 𝟯-𝘀𝗲𝗰𝗼𝗻𝗱 𝘃𝗼𝗶𝗰𝗲 𝗰𝗹𝗼𝗻𝗶𝗻𝗴 powered by a low-sample adaptive cloning pipeline that enables instant personalization at scale.
    → 𝗥𝗲𝗮𝗹-𝘁𝗶𝗺𝗲 𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝘀𝘁𝗮𝗰𝗸 achieving sub-300 ms end-to-end latency at p90, tuned for live interactions like support agents, NPCs, and healthcare assistants.
    → 𝗙𝗶𝗻𝗲-𝗴𝗿𝗮𝗶𝗻𝗲𝗱 𝘁𝗿𝗮𝗻𝘀𝗰𝗿𝗶𝗽𝘁 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 that handles heteronyms, acronyms, and structured text (emails, IDs, phone numbers), which usually break realism in production systems.

    🎧 Here is an example of me trying Sonic-3’s Hindi. You have to hear it to believe it. If you’re building voice agents, conversational AI, or multimodal assistants, keep an eye on Cartesia. They’ve raised $100M to build the most human-sounding voice models in the world, and Sonic-3 just set a new benchmark for multilingual voice AI. #CartesiaPartner
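The inline control-tag idea can be sketched roughly as below. The tag syntax and the `tag()` helper are hypothetical illustrations of the concept, not Cartesia's documented API.

```python
# Illustrative sketch: embedding control tags directly in a TTS transcript.
# The <span speed=... emotion=...> syntax here is an assumed example format,
# NOT Cartesia's actual tag grammar.

def tag(text, speed=None, emotion=None):
    """Wrap a transcript span in inline control attributes."""
    attrs = []
    if speed:
        attrs.append(f'speed="{speed}"')
    if emotion:
        attrs.append(f'emotion="{emotion}"')
    return f"<span {' '.join(attrs)}>{text}</span>" if attrs else text

# "Can you repeat that slower?" -> re-render the same span at reduced pace,
# without changing the words themselves.
original = "Your confirmation code is 4 8 2 9."
slower = tag(original, speed="slow")
print(slower)
```

The point of transcript-level tags is that the agent's dialogue logic, not a separate audio pipeline, decides pacing and emotion per utterance.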

  • Vitaly Friedman (Influencer)

    Practical insights for better UX • Running “Measure UX” and “Design Patterns For AI” • Founder of SmashingMag • Speaker • Loves writing, checklists and running workshops on UX. 🍣

    225,933 followers

    🔮 Design Guidelines For Voice UX. Guidelines and Figma toolkits to design better voice UX for products that support or rely on audio input ↓

    🤔 People avoid voice UIs in public spaces or for sensitive data.
    ✅ But they do use them with audio assistants, learning apps, and in-car UIs.
    ✅ Good conversations always move forward, not backward.
    🤔 The way humans speak is different from the way we write.
    🤔 What people say isn’t always what they mean by saying it.
    ✅ First, define relevant user stories for your product.
    ✅ Sketch key use cases, then add detours, then edge cases.
    ✅ Design VUI personas: tone of voice, words, sentence structure.
    ✅ Listen to related human conversations and transcribe them.
    ✅ Write conversation flows for happy and unhappy paths.
    ✅ Add markers (Finally, Now, Next) to structure the dialogue.
    ✅ Accessibility: support shaky voices and speech impediments.
    ✅ Allow users to slow down or speed up output, or rephrase.
    ✅ Adjust speech patterns, e.g. speak to children differently.
    🚫 There are no errors or “wrong input” in human interactions.
    🤔 Give people time to think: 8–10s is a good time to respond.
    ✅ Design for long silences, thick accents, slang, and contradictions.

    Keep in mind that many people have been “burnt” by horrible, poorly designed automated phone systems. If your voice UX comes across even nearly as bad, don’t be surprised by a very low usage rate.

    You can’t replicate a long scrollable list in audio, so keep answers short, with a max of 3 options at a time. Instead of listing more options, ask one direct question and then branch out. Re-prompt or reframe when certainty is low.

    People choose their voice assistant based on the personality it conveys and the friendliness it projects. So be deliberate in how you shape the tone, word choice, and melody of the voice. Don’t broadcast personality for repetitive tasks, but let it shine in conversation. And: if you don’t assign a personality to your product, users will do it for you.

    So study how your customers speak, and how exactly they explain the tasks your product must perform. The closer you get to a personal human interaction, the easier it will be to earn people’s trust.

    Useful resources:
    Voice Principles, by Ben Sauer https://lnkd.in/dQACgwue
    Voice UI Design System, by Orange https://lnkd.in/ezP-9QUu
    Designing A Voice Persona, by James Walsh https://lnkd.in/e3WXaxEC
    Voice UI Kit (Figma), by Shadiah Garwell https://lnkd.in/eGjJCWf7
    Conversational UIs (Figma), by ServiceNow https://lnkd.in/enHVSEWP
    Voice UI Guide, by Lars Mäder https://vui.guide/

    #ux #design
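The "max 3 options, branch, re-prompt on low certainty" guidance above can be sketched as a small dialogue-policy function. The menu wording and the 0.6 confidence threshold are illustrative assumptions, not values from any specific toolkit.

```python
# Sketch of a voice-menu policy: never read long lists aloud; with more than
# three options, ask one direct branching question instead, and re-prompt
# when speech-recognition confidence is low. Threshold and wording are
# made-up examples.

def next_prompt(options, confidence):
    if confidence < 0.6:                       # low ASR certainty: re-prompt
        return "Sorry, could you say that again?"
    if len(options) > 3:                       # too many to hold in memory:
        return "Is this about billing, repairs, or something else?"
    if len(options) == 1:
        return f"Would you like {options[0]}?"
    spoken = ", ".join(options[:-1]) + f", or {options[-1]}"
    return f"Would you like {spoken}?"

print(next_prompt(["billing", "repairs"], 0.9))
```

Branching like this keeps each turn inside the listener's working memory, which is the whole reason audio menus cap out around three options.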

  • Sumanyu Sharma

    Founder & CEO @ Hamming AI (YC, AI Grant) | Helping you build reliable Voice Agents

    12,666 followers

    This was the week voice AI stopped looking like demos. Hyperscalers shipped models. Carriers shipped runtimes. Production tooling, multimodal UX, and nine-figure revenue run rates all landed in the same seven days.

    Telnyx launched LiveKit on Telnyx, a hosted platform that runs LiveKit agents on Telnyx-owned infrastructure: 50 percent lower STT and TTS costs, sub-200ms round-trip time, native AMR-WB, and STIR/SHAKEN. Telephony-first voice AI is becoming its own category.

    xAI shipped standalone Grok speech-to-text and text-to-speech APIs: $0.10 per hour for batch STT and $4.20 per million characters for TTS, with word-level timestamps and multichannel support. It is the same stack powering Tesla and Starlink support.

    Microsoft launched MAI-Transcribe-1 and MAI-Voice-1 on Microsoft Foundry: 25 languages, number one on the FLEURS word error rate benchmark, and 2.5x faster batch transcription than the prior Azure Fast offering. It is already powering Copilot voice mode.

    DeepL entered voice with a real-time voice-to-voice suite plus API across 40+ languages, including Zoom and Teams add-ons.

    ElevenLabs had a massive week: a Razorpay partnership to run Hinglish outbound voice agents for millions of Indian merchants, a reported $100M+ in net new ARR in Q1 driven by telecom and fintech adoption, and full on-premise and on-device deployment for enterprise.

    Cloudflare shipped @cloudflare/voice, an experimental real-time voice pipeline for the Agents SDK running directly on Workers. Voice becomes a capability of the agent you already have, not a separate framework.

    Retell AI rolled out SMS during live calls with no A2P approval required. A small UX unlock with big implications for multimodal agent workflows.

    Sanas acquired Tomato AI to push real-time speech AI deeper into carriers and VoIP networks. It is their third acquisition in under two years; Sanas is at $62M ARR and on track for $130M.

    Speechmatics partnered with thymia to fuse speech-to-text with voice biomarkers, detecting stress and fatigue signals in real time. Voice is moving from interface to insight layer.

    The common thread: voice is now core business tech. Carriers own the rails, hyperscalers ship the models, observability and multimodal UX are table stakes, and the scale numbers are real.
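The pricing figures quoted for xAI's Grok speech APIs ($0.10 per audio hour for batch STT, $4.20 per million characters for TTS) make cost estimation simple arithmetic. The workload numbers below are made-up examples, not anyone's real volumes.

```python
# Back-of-envelope cost check using the per-unit prices quoted above.
# Workload figures are hypothetical.

STT_PER_HOUR = 0.10          # USD per audio hour, batch STT
TTS_PER_MCHAR = 4.20         # USD per million characters, TTS

audio_hours = 1_000          # hypothetical monthly call audio to transcribe
tts_chars = 50_000_000       # hypothetical characters to synthesize

cost = audio_hours * STT_PER_HOUR + (tts_chars / 1_000_000) * TTS_PER_MCHAR
print(f"${cost:,.2f} per month")   # 1,000 * 0.10  +  50 * 4.20
```

At these rates, TTS characters rather than transcription hours tend to dominate the bill for outbound-heavy agents.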

  • Vasu Gupta

    L&D Leader | E-Learning | Instructional Design | LMS | MF, PMS, AIF, Bonds, Unlisted, Insurance - Coach | NISM VA Certified | LIII | Centricity Wealthtech | Views are personal

    3,639 followers

    India just got its own multilingual AI stack. Not a demo. A real platform.

    Most AI still speaks English first. India does not. We keep talking about AI scale but ignore language reality.

    Sarvam AI just shipped something important: an open-source foundational model suite built for 10 Indian languages and designed voice-first. That changes who AI is for.

    Here’s what stands out to me:
    India’s first open-source 2B Indic LLM, trained on ~4 trillion tokens
    Voice agents deployable via phone, WhatsApp, and in-app workflows
    Speech → text → translation → synthesis in a single Indic stack
    A legal AI workbench for drafting, redaction, and regulatory Q&A
    Pricing that starts around ₹1 per minute for multilingual agents

    This is not chasing Silicon Valley scale. It’s solving Indian constraints:
    Smaller, efficient models that run where India actually is
    Voice interfaces for users who skip keyboards
    Agentic workflows, not just chat responses

    And the quiet but big idea: sovereign AI infrastructure. Data stays local. Models align with Indian regulation. Control stays domestic. That matters for BFSI, legal, telecom, and any sector touching sensitive data.

    The real unlock is inclusion. AI that works in Hindi, Tamil, Telugu, Malayalam, Punjabi, Odia, Gujarati, Marathi, Kannada, and Bengali. AI that listens before it types.

    We keep saying India will be an AI market. This is India building AI rails. Open-source, voice-first, enterprise-ready: that combination is rare. If this ecosystem compounds, India does not just consume AI. It exports it.

    Watching this space closely. Local language AI is the next growth curve. What sectors do you think adopt first?

  • Andreas Tussing

    charles | Marketing Automation & AI for WhatsApp, RCS & Co | 249% ROI by Forrester TEI

    17,039 followers

    “Make it sound like us.” Sounds easy. It isn’t.

    I smiled when I saw the post on X about finally getting ChatGPT to stop using em‑dashes. Two things can be true: it’s a tiny UX detail, and it took serious work to make it reliable. The habit clearly ran deep.

    It brings me to a topic we deal with a lot: expectations for AI are sky‑high. We feel that every day. But LLMs don’t “follow rules”; they follow likelihood. Injecting deterministic expectations into probabilistic models is like steering a sailboat in shifting winds: you can set the course, but the wind still has a say.

    We learned this early. Being “on point” (or on dash) from day one matters for brand voices.

    What actually makes it work in production? Strong data hygiene, crisp guardrails, agent evaluation, reasoning, and iterating:
    Instructions living in one source of truth (not in five docs and a Slack thread).
    Evaluation loops that flag drift fast: tone, phrasing, and compliance.

    ✅ We once had a client upload nearly 100 PDF pages on tone, words to avoid, gendering rules, style, you name it. Overkill? Maybe. Effective? Absolutely, because conversations with customers carry the brand every second.

    Will I miss the em‑dash? A bit. It became part of the ChatGPT “voice.” 🙂 But consistency beats charm when you represent a brand at scale.

    Every brand needs to think about what its brand voice prompt should look like, how to make the output as deterministic as it can get, and what can stay “likely magic” ✨ #conversationalai #aiagents #aiselling
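The "evaluation loop that flags drift" idea can be sketched as a lint pass over model output against a single source-of-truth style config. The specific rules below (no em-dashes, a couple of banned words) are illustrative assumptions, not the author's actual tooling.

```python
# Minimal sketch of a brand-voice drift check: one style config, one
# function that flags violations in LLM output before it reaches customers.
# Rules shown here are made-up examples.

STYLE_RULES = {
    "banned_chars": ["\u2014"],                 # em-dash, per the post
    "banned_words": ["synergy", "leverage"],    # hypothetical word list
}

def flag_drift(text):
    """Return a list of style violations found in the text."""
    issues = []
    for ch in STYLE_RULES["banned_chars"]:
        if ch in text:
            issues.append(f"banned character: {ch!r}")
    for word in STYLE_RULES["banned_words"]:
        if word in text.lower():
            issues.append(f"banned word: {word!r}")
    return issues

print(flag_drift("We leverage synergy\u2014always."))
```

A check like this runs after every generation; anything flagged is either auto-rewritten or routed back through the model, which is how a probabilistic system gets a deterministic floor.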

  • Isaac Peiris (Influencer)

    Founder @ Pistachio | Organic growth for B2B brands

    8,587 followers

    99% of brands treat voice as decoration. Oatly used it as their entire business strategy. Here’s how they turned oat milk into a $10B movement.

    They scrapped their marketing department. They had creative report directly to the CEO. They built their entire strategy around their tone.

    That strategic decision changed everything:
    Voice became their market differentiator ↳ In a sea of clinical health claims, they sounded human
    Voice targeted specific audiences ↳ They spoke directly to eco-conscious millennials
    Voice aligned with organisational values ↳ Authentic transparency and sustainability

    In 2018 they had a supply shortage. They literally sold out of oat milk. Cartons were reselling for $20+. At IPO they had a $10B valuation (more than most legacy dairy brands), after doubling sales to $400M in 2 years.

    Voice isn't marketing decoration. It can be your most powerful brand tool. What role does voice play in your brand?

    ---
    Hey 👋 I’m Isaac Peiris. I run an agency helping brands scale through content. My goal is to share tips and insights to help you grow. Hit my name + follow + 🔔

  • Eugene L.

    GTM @ ElevenLabs

    20,649 followers

    🔊 Have you ever stayed on a customer‑service call simply because the person on the other end sounded trustworthy?

    🎧 Researchers from Beijing University of Technology, the University of Texas at Austin, and the University of Memphis recently tested how different AI voices affect persuasion. Their findings:

    • 𝗙𝗹𝗶𝗿𝘁𝘆 𝗱𝗼𝗲𝘀𝗻’𝘁 𝘄𝗼𝗿𝗸. A playful “coquetry” voice actually decreased persuasion, especially for male chatbots.
    • 𝗦𝘁𝗲𝗿𝗻 𝗶𝗻𝘃𝗶𝘁𝗲𝘀 𝗾𝘂𝗲𝘀𝘁𝗶𝗼𝗻𝘀. Stern voices were just as effective as gentle ones and, in male voices, even increased customer questions.
    • 𝗔𝗴𝗲 𝗶𝘀𝗻’𝘁 𝘁𝗵𝗲 𝗶𝘀𝘀𝘂𝗲. 𝗲𝗻𝗴𝗮𝗴𝗲𝗺𝗲𝗻𝘁 𝗶𝘀. There was no significant difference between “young” and “old” voices. What mattered was that older‑sounding voices kept people talking longer.
    • 𝗪𝗼𝗿𝗱𝘀 𝗺𝗮𝘁𝘁𝗲𝗿. Using affirmative sentences, particularly in female voices, prompted more customer inquiries, whereas rhetorical questions were less effective.

    For leaders in banking and finance, this isn’t just academic. Voice is the new front door of your brand. A gentle but confident tone can build trust with high‑net‑worth clients. An affirmative female voice can reassure anxious SME owners. Conversely, a playful chatbot might unintentionally undermine credibility.

    𝗦𝗼𝗺𝗲 𝗾𝘂𝗶𝗰𝗸 𝗮𝗰𝘁𝗶𝗼𝗻𝘀 𝘁𝗼 𝗰𝗼𝗻𝘀𝗶𝗱𝗲𝗿:
    1. Audit your AI voice scripts. Are you using affirmative statements that invite dialogue?
    2. Experiment with different voice personas. Avoid flirty tones and observe how clients react.
    3. Treat voice as part of your CX strategy. Integrate data from calls, chatbots, and apps so you can personalize the experience for each customer, because customer empathy is your competitive moat.

    We’ve moved from building “voices” metaphorically to designing them intentionally. The tone of your AI isn’t just a detail; it’s part of the customer experience.

    Link to research in comments below. #AI #Voice

  • Vishal Singhhal

    Helping Healthcare Companies Unlock 30-50% Cost Savings with Generative & Agentic AI | Mentor to Startups at Startup Mahakumbh | India Mobile Congress 2025

    18,901 followers

    AI is quietly fixing the #1 pain point in clinical workflows.

    Electronic health records promised efficiency. They delivered frustration. Clinicians spend hours clicking through poorly designed interfaces. Documentation time now exceeds patient time. What happened to the promise of streamlined care?

    This is where AI integration changes everything. Imagine voice-to-text that actually works in clinical settings. Picture automatic note generation from patient conversations. Consider intelligent systems that pull relevant history without endless scrolling. Envision predictive analytics that highlight potential diagnosis paths.

    AI-enhanced EHRs learn from usage patterns. They adapt to individual provider workflows. Data interoperability becomes seamless when AI bridges legacy systems. Clinical decision support appears exactly when needed, not buried in alerts. Time returns to patient care instead of keyboard documentation. Quality improves as structured data becomes truly useful.

    Early adopters report saving 1–2 hours daily on documentation tasks. Physicians describe "rediscovering joy" in practice when freed from EHR burden. Patient satisfaction scores rise when doctors maintain eye contact instead of focusing on a screen.

    The transformation happens invisibly. Good technology disappears into the background. Tomorrow's healthcare looks remarkably human despite advanced technology. We stand at the intersection of clinical expertise and computational power.

    What would you do with an extra hour each day?

  • Toby Coppel

    Co-founder and Partner @ Mosaic Ventures | Startups

    18,301 followers

    Screens are optional; conversation isn’t. Voice agents have finally crossed the line from “nice demo” to mass-scale live production.

    A Fortune 100 health insurer has replaced swaths of its call-centre workforce with an AI agent that listens to symptom descriptions, gauges urgency and benefit details, and steers members to the right in-house nurse or in-network provider. Early results show mis-routed calls collapsing while human nurses concentrate on the most complex cases, evidence that, when trained on medical nuance, automation can still deliver empathy.

    The same capability is trickling down to Main Street. A neighbourhood dental clinic now relies on a 24/7 AI receptionist that fills midnight cancellations, takes deposits and syncs instantly with the practice-management calendar, eliminating the Monday-morning voicemail backlog. Nearby, an auto body shop lets its voice agent quote repairs and capture credit-card details while mechanics sleep, winning leads that used to hang up after three rings.

    Why does this feel inevitable? Voice is simply higher bandwidth than text; tone, pace and sighs carry layers of meaning a text interaction cannot. Studies show people (and agents) read emotion and feel connection more accurately when they hear a voice. As latency drops below half a second and costs reach pennies per minute, talking will again beat typing for many tasks, only this time the “person” on the other end might be generated by silicon.

    Now imagine the next step: every brand offers you a personal concierge that remembers the hiking boots you bought last spring, the hotel room you preferred in Tel Aviv or your preference for classical hold music. It greets you by name, picks up the last conversation mid-sentence and suggests dinner before you even think to ask. Conversation becomes the API.

    Optimism doesn’t erase risk. Voice-cloning scams already account for more than 40 percent of fraud attempts in finance, up twenty-fold in three years. Protecting both brands and callers will demand a new security layer: real-time likeness checks, rotating pass-phrases and cryptographic watermarks baked into synthetic speech, so a courtroom (or a phone) can tell the difference between a genuine agent and a deepfake. That challenge is an opening for startups.

    I’m curious: if you’re experimenting with voice, how are you balancing speed, empathy and security? And what surprised you when real customers finally started talking back? Happy to compare notes.
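One piece of that security layer, the rotating pass-phrase, can be sketched with standard primitives: both parties derive a short spoken phrase from a shared secret and the current time window, TOTP-style. This is a toy illustration of the idea, not a production anti-deepfake protocol; the word list and 30-second window are arbitrary choices.

```python
# Sketch of a rotating spoken pass-phrase: HMAC over the current time
# window, mapped to pronounceable words. Anyone holding the shared secret
# computes the same phrase for the same window; a cloned voice without
# the secret cannot.

import hashlib
import hmac
import time

WORDS = ["amber", "birch", "cedar", "delta", "ember", "fjord", "grove", "heron"]

def passphrase(secret, window=30, now=None):
    """Derive a three-word phrase for the current time window."""
    t = int((time.time() if now is None else now) // window)
    digest = hmac.new(secret, str(t).encode(), hashlib.sha256).digest()
    return "-".join(WORDS[b % len(WORDS)] for b in digest[:3])

# Both ends of the call compute and compare the phrase aloud.
print(passphrase(b"shared-secret", now=0))
```

In practice this only authenticates possession of the secret, so it would sit alongside, not replace, the likeness checks and audio watermarks the post mentions.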

  • Vaibhav Goyal (Influencer)

    Agentic AI | Collections | IITM RP Mentor | Educator

    12,722 followers

    Imagine trying to get a workout recommendation while running, navigate a complex route while driving, or get tech support while cooking, all without touching a screen. This is the promise of voice-enabled LLM agents, a technological leap that's redefining how we interact with machines.

    Traditional text-based chatbots are like trying to dance with two left feet. They're clunky, impersonal, and frustratingly limited. Consider these real-world friction points:
    - A visually impaired user struggling to type support queries
    - A fitness enthusiast unable to get real-time guidance mid-workout
    - A busy professional multitasking who can't pause to type a complex question

    Voice AI breaks these barriers, mimicking how humans have communicated for millennia. We learn to speak by four months, but writing takes years, a testament to speech's fundamental naturalness.

    Real-World Transformation Examples:
    1️⃣ Healthcare: Emotion-recognizing AI can detect patient stress levels through voice modulation, enabling more empathetic remote consultations.
    2️⃣ Fitness: Hands-free coaching that adapts workout intensity based on your breathing and vocal energy.
    3️⃣ Customer Service: Intelligent voice systems that understand context and emotional undertones, and personalize responses in real time.

    The magic of voice lies in its nuanced communication:
    - Tone reveals emotional landscapes
    - Intensity signals urgency or excitement
    - Rhythm creates conversational flow
    - Inflection adds layers of meaning beyond mere words

    Voice-enabled LLM agents can, in turn:
    - Recognize emotional states with unprecedented accuracy
    - Support rich, multimodal interactions combining voice, visuals, and context
    - Differentiate speakers in complex conversations
    - Extract subtle contextual intentions
    - Provide personalized responses based on voice characteristics

    In short, this technology is about creating more human-centric technology that listens, understands, and responds like a thoughtful companion. The future of AI isn't about machines talking at us, but talking with us.
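The voice-agent architecture the post describes reduces to a speech-to-text → LLM → text-to-speech loop per conversational turn. The sketch below uses placeholder stubs for all three stages; `stt`, `llm`, and `tts` stand in for whatever providers you wire in and are not real APIs.

```python
# Minimal sketch of one voice-agent turn: hear -> think -> speak.
# All three stage functions are illustrative stubs.

def stt(audio):
    """Stub ASR: a real system returns the user's transcribed speech."""
    return "what's my next workout?"

def llm(text, history):
    """Stub model call: a real system generates a contextual reply."""
    history.append(text)                 # keep conversational context
    return f"Reply to: {text}"

def tts(text):
    """Stub synthesis: a real system returns playable audio bytes."""
    return text.encode()

def handle_turn(audio, history):
    """One full conversational turn through the pipeline."""
    return tts(llm(stt(audio), history))

history = []
print(handle_turn(b"<audio frames>", history))
```

The reason latency matters so much in these systems is that all three stages sit inside every single turn, so their delays add up before the user hears anything.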
