Voice Search Optimization

Explore top LinkedIn content from expert professionals.

  • Andrew Ng

    DeepLearning.AI, AI Fund and AI Aspire

    2,471,325 followers

    The Voice Stack is improving rapidly. Systems that interact with users via speaking and listening will drive many new applications. Over the past year, I’ve been working closely with DeepLearning.AI, AI Fund, and several collaborators on voice-based applications, and I will share best practices I’ve learned in this and future posts.

    Foundation models that are trained to directly input, and often also directly generate, audio have contributed to this growth, but they are only part of the story. OpenAI’s Realtime API makes it easy for developers to write prompts and build systems that deliver voice-in, voice-out experiences. This is great for building quick-and-dirty prototypes, and it also works well for low-stakes conversations where making an occasional mistake is okay. I encourage you to try it!

    However, compared to text-based generation, it is still hard to control the output of voice-in, voice-out models. In contrast to directly generating audio, when we use an LLM to generate text, we have many tools for building guardrails, and we can double-check the output before showing it to users. We can also use sophisticated agentic reasoning workflows to compute high-quality outputs. Before a customer-service agent shows a user the message, “Sure, I’m happy to issue a refund,” we can make sure that (i) issuing the refund is consistent with our business policy and (ii) we will call the API to issue the refund (and not just promise a refund without issuing it). In contrast, the tools to prevent a voice-in, voice-out model from making such mistakes are much less mature.

    In my experience, the reasoning capability of voice models also seems inferior to that of text-based models, and they give less sophisticated answers. (Perhaps this is because voice responses have to be briefer, leaving less room for chain-of-thought reasoning to get to a more thoughtful answer.)

    When building applications where I need more control over the output, I use agentic workflows to reason at length about the user’s input. In voice applications, this means I end up using a pipeline that applies speech-to-text (STT) to transcribe the user’s words, processes the text using one or more LLM calls, and finally returns an audio response to the user via TTS (text-to-speech). Doing the reasoning in text allows for more accurate responses, but the extra steps introduce latency, and users of voice applications are very sensitive to latency.

    When DeepLearning.AI worked with RealAvatar (an AI Fund portfolio company led by Jeff Daniel) to build an avatar of me, we found that getting TTS to generate a voice that sounded like me was not very hard, but getting it to respond to questions using words similar to those I would choose was. Even after much tuning, it remains a work in progress. You can play with it at https://lnkd.in/gcZ66yGM

    [At length limit. Full text, including latency reduction technique: https://lnkd.in/gjzjiVwx ]
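
To make the cascaded approach concrete, below is a minimal sketch of the STT → text reasoning → TTS loop with a text-level guardrail of the kind Ng describes. The transcribe, answer, refund_is_allowed, and synthesize helpers are placeholders for whatever STT, LLM, and TTS providers you use, not any specific vendor's API.

```python
# Minimal sketch of a cascaded voice pipeline: STT -> reason in text -> TTS,
# with a guardrail applied to the text before it is ever spoken.
# All four helpers below are stubs, not a specific provider's API.

def transcribe(audio: bytes) -> str:
    """Placeholder STT call: turn the user's audio into text."""
    return "I'd like a refund for my last order."    # stubbed for the sketch

def answer(user_text: str) -> str:
    """Placeholder LLM / agentic workflow: reasoning happens in text,
    where the draft reply can be inspected before being voiced."""
    return "Sure, I'm happy to issue a refund."      # stubbed for the sketch

def refund_is_allowed(user_text: str, draft_reply: str) -> bool:
    """Placeholder guardrail: check the promised action against business
    policy (and actually trigger the refund API) before committing to it."""
    return False                                     # stubbed: policy says no

def synthesize(reply_text: str) -> bytes:
    """Placeholder TTS call: turn the final text into audio."""
    return reply_text.encode("utf-8")                # stubbed for the sketch

def handle_turn(audio: bytes) -> bytes:
    text = transcribe(audio)                         # 1. speech-to-text
    reply = answer(text)                             # 2. reason over text
    if "refund" in reply.lower() and not refund_is_allowed(text, reply):
        reply = "Let me check that refund request and get back to you."
    return synthesize(reply)                         # 3. text-to-speech

print(handle_turn(b"...raw user audio..."))
```

The guardrail runs on the text in step 2, before any audio exists; that checkpoint is exactly what is hard to reproduce when a model generates audio directly.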

  • Alex Wang

    Learn AI Together - I share my learning journey into AI & Data Science here, 90% buzzword-free. Follow me and let’s grow together!

    1,139,859 followers

    Voice AI is more than just plugging in an LLM. It’s an orchestration challenge involving complex AI coordination across STT, TTS, and LLMs, low-latency processing, and context and integration with external systems and tools. Let’s start with the basics:

    ---- Real-time Transcription (STT)
    Low-latency transcription (<200 ms) from providers like Deepgram ensures real-time responsiveness.

    ---- Voice Activity Detection (VAD)
    Essential for handling human interruptions smoothly, with tools such as WebRTC VAD or LiveKit Turn Detection (a simplified barge-in sketch follows after this post).

    ---- Language Model Integration (LLM)
    Select your reasoning engine carefully: GPT-4 for reliability, Claude for nuanced conversations, or Llama 3 for flexibility and open-source options.

    ---- Real-Time Text-to-Speech (TTS)
    Natural-sounding speech from providers like Eleven Labs, Cartesia or Play.ht enhances user experience.

    ---- Contextual Noise Filtering
    Implement custom noise-cancellation models to effectively isolate speech from real-world background noise (TV, traffic, family chatter).

    ---- Infrastructure & Scalability
    Deploy on infrastructure designed for low-latency, real-time scaling (WebSockets, Kubernetes, cloud infrastructure from AWS/Azure/GCP).

    ---- Observability & Iterative Improvement
    Continuous improvement through monitoring tools like Prometheus, Grafana, and OpenTelemetry ensures stable and reliable voice agents.

    📍 You can assemble this stack yourself or streamline the entire process using integrated API-first platforms like Vapi. Check it out here ➡️ https://bit.ly/4bOgYLh

    What do you think? How will voice AI tech stacks evolve from here?
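
Interruption handling is the piece teams most often underestimate. Below is a deliberately simplified barge-in sketch that uses a raw RMS energy threshold as a stand-in for a real VAD such as WebRTC VAD or LiveKit Turn Detection; the frame size, threshold, and the stop_playback/start_listening callbacks are illustrative assumptions, not any vendor's API.

```python
# Simplified barge-in detection for a voice agent. A plain RMS energy
# threshold stands in for a trained VAD; in production you would swap in
# WebRTC VAD, LiveKit Turn Detection, or similar. All values are illustrative.
import array
import math

SAMPLE_RATE = 16_000            # 16 kHz, 16-bit mono PCM assumed
FRAME_MS = 20                   # analyse audio in 20 ms frames
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000
SPEECH_RMS = 500.0              # energy threshold; tune per mic and room
MIN_SPEECH_FRAMES = 5           # ~100 ms of speech before declaring barge-in

def frame_rms(frame: bytes) -> float:
    """Root-mean-square energy of one frame of 16-bit signed PCM."""
    samples = array.array("h", frame)
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0

def detect_barge_in(frames, stop_playback, start_listening) -> bool:
    """If the user talks over the agent for long enough, cut TTS playback
    and hand the turn back to them. `frames` yields microphone PCM frames
    captured while the agent is speaking."""
    consecutive = 0
    for frame in frames:
        consecutive = consecutive + 1 if frame_rms(frame) > SPEECH_RMS else 0
        if consecutive >= MIN_SPEECH_FRAMES:
            stop_playback()     # stop the agent's TTS output immediately
            start_listening()   # open a new user turn for STT
            return True
    return False

# Example: three silent frames followed by a loud burst triggers barge-in.
silence = array.array("h", [0] * FRAME_SAMPLES).tobytes()
speech = array.array("h", [2000] * FRAME_SAMPLES).tobytes()
detect_barge_in(
    [silence] * 3 + [speech] * 6,
    stop_playback=lambda: print("TTS playback stopped"),
    start_listening=lambda: print("listening for the user"),
)
```

Swapping the threshold for a trained VAD changes detection quality, not the surrounding control flow: cut the TTS stream, reopen the microphone, start the next turn.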

  • Sahar Mor

    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

    41,888 followers

    Voice is the next frontier for AI Agents, but most builders struggle to navigate this rapidly evolving ecosystem. After seeing the challenges firsthand, I’ve created a comprehensive guide to building voice agents in 2024.

    Three key developments are accelerating this revolution:
    -> Speech-native models: OpenAI’s 60% price cut on their Realtime API last week and Google’s Gemini 2.0 Realtime release mark a shift from clunky cascading architectures to fluid, natural interactions
    -> Reduced complexity: small teams are now building specialized voice agents reaching substantial ARR, from restaurant order-taking to sales qualification
    -> Mature infrastructure: new developer platforms handle the hard parts (latency, error handling, conversation management), letting builders focus on unique experiences

    For the first time, we have god-like AI systems that truly converse like humans. For builders, this moment is huge. Unlike web or mobile development, voice AI is still being defined, offering fertile ground for those who understand both the technical stack and real-world use cases. With voice agents that can be interrupted and can handle emotional context, we’re leaving behind the era of rule-based, rigid experiences and ushering in a future where AI feels truly conversational.

    This toolkit breaks down:
    -> Foundation layers (speech-to-text, text-to-speech)
    -> Voice AI middleware (speech-to-speech models, agent frameworks)
    -> End-to-end platforms
    -> Evaluation tools and best practices

    Plus, a detailed framework for choosing between full-stack platforms vs. custom builds based on your latency, cost, and control requirements. The post with the full list of packages and tools, as well as my framework for choosing your voice agent architecture: https://lnkd.in/g9ebbfX3 Also available as a NotebookLM-powered podcast episode. Go build.

    P.S. I plan to publish concrete guides, so follow here and subscribe to my newsletter.

  • Vitaly Friedman

    Practical insights for better UX • Running “Measure UX” and “Design Patterns For AI” • Founder of SmashingMag • Speaker • Loves writing, checklists and running workshops on UX. 🍣

    225,933 followers

    🔮 Design Guidelines For Voice UX. Guidelines and Figma toolkits to design better voice UX for products that support or rely on audio input ↓

    🤔 People avoid voice UIs in public spaces, or for sensitive data.
    ✅ But do use them with audio assistants, learning apps, in-car UIs.
    ✅ Good conversations always move forward, not backwards.
    🤔 The way humans speak is different from the way we write.
    🤔 What people say isn’t always what they mean by saying it.
    ✅ First, define relevant user stories for your product.
    ✅ Sketch key use cases, then add detours, then edge cases.
    ✅ Design VUI personas: tone of voice, words, sentence structure.
    ✅ Listen to related human conversations, transcribe them.
    ✅ Write conversation flows for happy and unhappy paths.
    ✅ Add markers (Finally, Now, Next) to structure the dialogue.
    ✅ Accessibility: support shaky voices and speech impediments.
    ✅ Allow users to slow down or speed up output, or rephrase.
    ✅ Adjust speech patterns, e.g. speak to children differently.
    🚫 There are no errors or “wrong input” in human interactions.
    🤔 Give people time to think: 8–10 s is a reasonable window for a response.
    ✅ Design for long silences, thick accents, slang and contradictions.

    Keep in mind that many people have been “burnt” by horrible, poorly designed automated phone systems. If your voice UX comes across even nearly as badly, don’t be surprised by a very low usage rate.

    You can’t replicate a long scrollable list in audio, so keep answers short, with a maximum of 3 options at a time. Instead of listing more options, ask one direct question and then branch out. Re-prompt or reframe when certainty is low.

    People choose their voice assistant based on the personality it conveys and the friendliness it projects. So be deliberate in how you shape the tone, word choice and melody of the voice. Don’t broadcast personality for repetitive tasks, but let it shine in a conversation. And: if you don’t assign a personality to your product, users will do it for you.

    So study how your customers speak, and how exactly they explain the tasks your product must perform. The closer you get to a personal human interaction, the easier it will be to earn people’s trust.

    Useful resources:
    Voice Principles, by Ben Sauer https://lnkd.in/dQACgwue
    Voice UI Design System, by Orange https://lnkd.in/ezP-9QUu
    Designing A Voice Persona, by James Walsh https://lnkd.in/e3WXaxEC
    Conversational UIs (Figma), by ServiceNow https://lnkd.in/dUpGmcFa
    Voice UI Guide, by Lars Mäder https://vui.guide/

    #ux #design

  • Aishwarya Srinivasan
    627,902 followers

    Cartesia Sonic-3 is the first AI voice model I’ve seen that nails Hindi perfectly. For years, even the best text-to-speech (TTS) models struggled with Hindi. The rhythm, tonality, and emotional micro-expressions just didn’t sound human, and the accent was inaccurate.

    This model doesn’t just translate Hindi. It is specially trained for it, with precise control over pacing, expressions and tonality, all rendered in real time. Under the hood, Sonic-3 is engineered for low-latency voice generation optimized for conversational AI agents, clocking in 3–5x faster than OpenAI’s TTS while maintaining superior transcript fidelity.

    What makes it stand out technically:
    → 𝗚𝗿𝗮𝗻𝘂𝗹𝗮𝗿 𝗰𝗼𝗻𝘁𝗿𝗼𝗹 𝘁𝗮𝗴𝘀 let developers dynamically modulate speed, volume, and emotion inside the transcript itself. ("Can you repeat that slower?" now works in production.)
    → 𝟰𝟮-𝗹𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗺𝘂𝗹𝘁𝗶𝗹𝗶𝗻𝗴𝘂𝗮𝗹 𝗺𝗼𝗱𝗲𝗹 built on a single unified speaker embedding, so one voice can switch between languages like Hindi, Tamil, and English natively while maintaining accent continuity.
    → 𝟯-𝘀𝗲𝗰𝗼𝗻𝗱 𝘃𝗼𝗶𝗰𝗲 𝗰𝗹𝗼𝗻𝗶𝗻𝗴 powered by a low-sample adaptive cloning pipeline that enables instant personalization at scale.
    → 𝗥𝗲𝗮𝗹-𝘁𝗶𝗺𝗲 𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝘀𝘁𝗮𝗰𝗸 achieving sub-300 ms end-to-end latency at p90, tuned for live interactions like support agents, NPCs, and healthcare assistants.
    → 𝗙𝗶𝗻𝗲-𝗴𝗿𝗮𝗶𝗻𝗲𝗱 𝘁𝗿𝗮𝗻𝘀𝗰𝗿𝗶𝗽𝘁 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 that handles heteronyms, acronyms, and structured text (emails, IDs, phone numbers), which usually break realism in production systems.

    🎧 Here is an example of me trying Sonic-3’s Hindi. You have to hear it to believe it.

    If you’re building voice agents, conversational AI, or multimodal assistants, keep an eye on Cartesia. They’ve raised $100M to build the most human-sounding voice models in the world, and Sonic-3 just set a new benchmark for multilingual voice AI. #CartesiaPartner

  • Allys Parsons

    Co-Founder at techire ai. ICASSP ‘26 Sponsor. Hiring in AI since ’19 ✌️ Speech AI, TTS, LLMs, Multimodal AI & more! Top 200 Women Leaders in Conversational AI ‘23 | No.1 Conversational AI Leader ‘21

    17,992 followers

    VoiceTextBlender introduces a novel approach to augmenting LLMs with speech capabilities through single-stage joint speech-text supervised fine-tuning. The researchers from Carnegie Mellon and NVIDIA have developed a more efficient way to create models that can handle both speech and text without compromising performance in either modality.

    The team's 3B parameter model demonstrates superior performance compared to previous 7B and 13B SpeechLMs across various speech benchmarks whilst preserving the original text-only capabilities, addressing the critical challenge of catastrophic forgetting that has plagued earlier attempts.

    Their technical approach employs LoRA adaptation of the LLM backbone, combining text-only SFT data with three distinct types of speech-related data: multilingual ASR/AST, speech-based question answering, and an innovative mixed-modal interleaving dataset created by applying TTS to randomly selected sentences from text SFT data.

    What's particularly impressive is the model's emergent ability to handle multi-turn, mixed-modal conversations despite being trained only on single-turn speech interactions. The system can process user input in pure speech, pure text, or any combination, showing impressive generalisation to unseen prompts and tasks.

    The researchers have committed to publicly releasing their data generation scripts, training code, and pre-trained model weights, which should significantly advance research in this rapidly evolving field of speech language models.

    Paper: https://lnkd.in/dutRcaAA
    Authors: Yifan Peng, Krishna C. Puvvada, Zhehuai Chen, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, Boris Ginsburg

    #SpeechLM #MultimodalAI #SpeechAI
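
The single-stage recipe rests on LoRA-adapting a frozen LLM backbone while mixing text-only SFT data with speech-derived data. As a rough illustration of that general technique (not the paper's actual configuration), here is what LoRA adaptation of a causal LM looks like with Hugging Face PEFT; the model id, target modules, and hyperparameters are placeholder assumptions.

```python
# Illustrative LoRA adaptation of an LLM backbone with Hugging Face PEFT.
# Model id, target modules, and hyperparameters are placeholders, not the
# VoiceTextBlender configuration from the paper.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("your-3b-llm-backbone")  # placeholder HF model id

lora_cfg = LoraConfig(
    r=16,                                   # low-rank adapter dimension
    lora_alpha=32,                          # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small adapters train; the frozen
                                    # backbone keeps its text-only behaviour,
                                    # which is how catastrophic forgetting is limited
```

Joint training then feeds this adapted backbone batches that mix text-only SFT examples with the speech-related data described in the post.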

  • Bally S Kehal

    ⭐️Top AI Voice | Founder (Multiple Companies) | Teaching & Reviewing Production-Grade AI Tools | Voice + Agentic Systems | AI Architect | Ex-Microsoft

    18,246 followers

    Most voice AI is just a chatbot with a microphone. One company was purpose-built for real phone calls from day one. The architecture lesson most AI builders learn too late: everyone builds text agents first. Winners build voice first.

    Text agents get retries. Formatting. Autocorrect. Voice agents get one shot. Real-time. No edits.

    Then production hits ↓
    → Latency: the model takes 3 seconds. The customer hangs up.
    → Context: "Uh, yeah, so I need to, wait — can you also check my..."
    → Interruptions: humans talk over each other. Chat agents break.
    → Compliance: every voice interaction is regulated differently.

    Two traps I see teams fall into:

    Trap 1: Bolt STT onto a chat agent. Add TTS on output. Call it "voice AI." That's a wrapper. Wrappers break in production.

    Trap 2: Build your own with Pipecat, LiveKit, Vapi. 6 months later you're managing STT providers, TTS rate limits, LLM deprecations, infrastructure scaling, compliance audits. You wanted a voice assistant. Now you're a voice infrastructure company.

    PolyAI solved this differently. Full stack built for voice since 2017:
    → Proprietary ASR + LLM trained on real customer service transactions
    → 45+ languages. 24/7. Unlimited scale.
    → Handles surges instantly (storms, outages, promos) with zero staffing panic

    Not just handling calls, but generating revenue:
    → Turning bookings into room upgrades
    → Enrolling callers into rewards mid-conversation
    → QA Agents scoring every call automatically
    → Analyst Agents surfacing patterns no human team catches

    One healthcare company found that the AI drew fewer complaints than human reps, even on the hardest, most emotional calls. Marriott. FedEx. Caesars. PG&E. 25+ countries. 391% ROI. $10.3M average savings. Payback under 6 months.

    The companies still running "press 1 for sales, press 2 for support"? Not behind on technology. Behind on architecture. That gap compounds every quarter.

    My stress test for any voice AI:
    → Noisy environment
    → Regional accent
    → Language switch mid-sentence
    → Multi-step transaction

    Worth 15 minutes if you're evaluating: https://poly.ai/gordon

    Build or buy: what's your current approach to voice?

  • Is voice about to become the new homepage? 🗣️📱 #Google just unveiled Search Live, a real-time, voice-powered AI search experience. You speak, Gemini answers. No more typing, no more tabs. Just live, conversational answers. This marks a turning point: search is no longer a list of links; it is becoming a dialogue.

    At local.ch we see three key implications for SMBs:
    ✅ Disruption is here: live search powered by voice will reshape how users discover, interact, and transact. The traditional web is flattening into direct, AI-powered experiences.
    ✅ Convenience drives conversion: consumers will expect to book, buy, and solve things instantly, by voice, on the go. Online transactional services will see a major boost.
    ✅ SMBs must plug in: this is a huge opportunity. Local businesses that connect their services (availability, bookings, FAQs) to these AI engines will get closer than ever to customers.

    We believe in technology, but even more in local relevance. That’s why SMBs need more than just Google. They need local digital partners to guide them. Let’s help them succeed. 💪

    Irène Dumas Stefan Wyss Marcel Heinze Florian Laudahn Stanislav Rudnitskiy Robert Ferenc John Lee Jon Martinsen Pete Urmson Renato Bottini Semir T. Francesco Verrienti Duda iPromote vcita

    #VoiceSearch #AIinSearch #SMBSuccess #DigitalMarketing #LocalSearch #OnlineBooking #GoogleGemini #SearchRevolution #localsearch #ConversationalAI #SwissSMB #CustomerExperience #DigitalShift https://lnkd.in/emryebxj

  • Akshay Dimri

    Digital Marketing Professional | Social Media Marketing | Search Engine Optimization | Google Analytics | Google Search Console | Google Ads | Meta Ads | Lead Generation | Branding | Building Online Presence

    5,626 followers

    🎙️ Voice Search SEO in 2026 — How Brands Win in a Voice-First World

    Search behavior is changing fast. Users are no longer typing queries; they’re asking questions through voice assistants like Google Assistant, Siri, Alexa, and smart speakers. By 2026, voice search is becoming a critical channel for SEO, local discovery, and conversions. Voice Search SEO focuses on optimizing content so it can be spoken, recommended, and trusted by voice-enabled devices and AI systems.

    Here’s what modern Voice Search optimization looks like 👇

    ✔ Conversational & Long-Tail Keywords
    Optimizing for natural language queries like “What’s the best option near me?” instead of short keyword phrases.

    ✔ Featured Snippet Optimization (Position Zero)
    Voice assistants pull answers directly from featured snippets; ranking here means your content gets read aloud.

    ✔ Strong Local SEO Signals
    “Near me” voice queries dominate. Optimized Google Business Profiles, local schema, NAP consistency, and reviews are critical.

    ✔ Structured Data & Schema Markup
    FAQ schema, LocalBusiness schema, Product schema, and Article schema help voice systems understand and trust your content (a minimal FAQ markup sketch follows after this post).

    ✔ Mobile-First Performance
    Most voice searches happen on mobile. Fast page speed, Core Web Vitals, and mobile usability directly impact voice visibility.

    ✔ Clear Content Structure
    Short answers, question-based headings, bullet points, and intent-focused formatting make content voice-friendly.

    ✔ High Intent & Conversion Opportunity
    Voice searches often carry strong purchase intent, especially for local services, e-commerce, and immediate solutions.

    ✔ Trust, Authority & E-E-A-T
    Voice assistants prefer authoritative sources with reviews, expertise, accuracy, and consistent updates.

    Voice Search SEO is no longer optional. It’s a competitive advantage for SEO professionals, digital marketers, local businesses, and brands preparing for the future of AI-driven search.
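
As a concrete illustration of the structured-data point above, this sketch builds schema.org FAQPage JSON-LD with Python's standard library; the question and answer text are placeholders, and the output would be embedded in the page inside a script tag of type application/ld+json.

```python
# Minimal sketch: schema.org FAQPage JSON-LD, the kind of markup voice
# assistants and search engines parse when choosing a spoken answer.
# The question/answer content is a placeholder.
import json

faq_markup = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "What are your opening hours?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "We are open Monday to Friday, 9am to 6pm.",
            },
        }
    ],
}

print(json.dumps(faq_markup, indent=2))  # embed in <script type="application/ld+json">
```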

  • Jeremy Moser

    CEO @ uSERP — I get you more revenue from organic search.

    41,318 followers

    Voice search is the next frontier everyone's ignoring. While companies obsess over ChatGPT citations and LLM optimization, there's a massive opportunity hiding in plain sight: voice-first discovery.

    AI can't read your blog content aloud yet, but that's changing fast. Google's Speakable schema is already in beta for news publishers, and voice search queries are growing 35% year-over-year according to recent data. The gap is huge: most publishers are completely unprepared for audio-first discovery, treating voice search as an afterthought instead of a primary optimization channel.

    Here's what's interesting: the companies that nail voice optimization early will dominate audio discovery before their competitors even realize it's a thing. We started testing voice-first content strategies with select uSERP clients after noticing the trend. Here’s the voice-first content approach that's working for us:

    🚀 Conversational query targeting by focusing on questions people actually ask aloud. "Best marketing automation for small teams" instead of "marketing automation software comparison."

    🚀 Audio comprehension structure using clear Q&A blocks and concise answers designed for 20-30 second voice excerpts that provide complete value.

    🚀 Voice-optimized schema implementation including Speakable markup for eligible content sections, plus FAQPage schema for broader voice search optimization (a minimal Speakable markup sketch follows after this post).

    🚀 Context-rich content creation that leads sections with clear topic identification like "This guide explains..." to help voice assistants understand and cite content accurately.

    🚀 Conversational flow testing to ensure content sounds natural when read aloud, not just when scanned visually.

    The timing is perfect because voice search optimization is still largely unexplored territory. Most content is optimized for scanning, not listening. The brands that flip this approach will capture intent through entirely new discovery channels.

    Are you seeing any voice search traffic yet? How are you thinking about optimizing content for audio-first discovery? 👇
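
To illustrate the Speakable point above, here is a minimal sketch of Article JSON-LD carrying the (still beta) speakable property, which tells voice assistants which parts of a page read well aloud; the URL, headline, and CSS selectors are placeholders for your own page structure.

```python
# Minimal sketch: Article JSON-LD with the speakable property (in beta for
# news publishers), pointing assistants at sections suited to being read aloud.
# URL, headline, and CSS selectors are placeholders.
import json

article_markup = {
    "@context": "https://schema.org",
    "@type": "Article",
    "url": "https://example.com/voice-search-guide",
    "headline": "How to optimize content for voice search",
    "speakable": {
        "@type": "SpeakableSpecification",
        "cssSelector": [".article-summary", ".key-takeaways"],
    },
}

print(json.dumps(article_markup, indent=2))
```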
