Voice is the next frontier for AI agents, but most builders struggle to navigate this rapidly evolving ecosystem. After seeing the challenges firsthand, I've created a comprehensive guide to building voice agents in 2024.

Three key developments are accelerating this revolution:
-> Speech-native models: OpenAI's 60% price cut on their Realtime API last week and Google's Gemini 2.0 Realtime release mark a shift from clunky cascading architectures to fluid, natural interactions
-> Reduced complexity: small teams are now building specialized voice agents reaching substantial ARR, from restaurant order-taking to sales qualification
-> Mature infrastructure: new developer platforms handle the hard parts (latency, error handling, conversation management), letting builders focus on unique experiences

For the first time, we have AI systems that truly converse like humans. For builders, this moment is huge. Unlike web or mobile development, voice AI is still being defined, offering fertile ground for those who understand both the technical stack and real-world use cases. With voice agents that can be interrupted and can handle emotional context, we're leaving behind the era of rule-based, rigid experiences and ushering in a future where AI feels truly conversational.

This toolkit breaks down:
-> Foundation layers (speech-to-text, text-to-speech)
-> Voice AI middleware (speech-to-speech models, agent frameworks)
-> End-to-end platforms
-> Evaluation tools and best practices

Plus a detailed framework for choosing between full-stack platforms and custom builds based on your latency, cost, and control requirements.

Post with the full list of packages and tools, as well as my framework for choosing your voice agent architecture: https://lnkd.in/g9ebbfX3

Also available as a NotebookLM-powered podcast episode. Go build.

P.S. I plan to publish concrete guides, so follow here and subscribe to my newsletter.
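To ground the "foundation layers" above, here is a minimal Python sketch of the classic cascading turn loop (speech-to-text, then an LLM, then text-to-speech) that end-to-end platforms abstract away. The three provider functions are hypothetical placeholders rather than any specific vendor's API, and a production agent would stream every stage and handle barge-in interruptions instead of running them sequentially.

```python
# Minimal sketch of one "cascading" voice-agent turn: STT -> LLM -> TTS.
# The provider functions below are hypothetical stand-ins; swap in your
# actual STT/LLM/TTS SDK calls. Platforms mentioned in the post manage
# this loop (plus streaming, interruptions, and error handling) for you.
import time


def transcribe(audio_bytes: bytes) -> str:
    """Placeholder STT call; replace with your speech-to-text provider."""
    return "I'd like to book a table for two at seven"


def generate_reply(history: list[dict], user_text: str) -> str:
    """Placeholder LLM call; replace with your language-model provider."""
    return f"Sure, I can help with that: {user_text}."


def synthesize(text: str) -> bytes:
    """Placeholder TTS call; replace with your text-to-speech provider."""
    return text.encode("utf-8")  # stand-in for synthesized audio bytes


def run_turn(audio_bytes: bytes, history: list[dict]) -> bytes:
    """Run one conversational turn and record per-stage latency."""
    timings = {}

    t0 = time.perf_counter()
    user_text = transcribe(audio_bytes)
    timings["stt"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    reply_text = generate_reply(history, user_text)
    timings["llm"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    reply_audio = synthesize(reply_text)
    timings["tts"] = time.perf_counter() - t0

    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": reply_text})
    print(f"turn latency breakdown: {timings}")
    return reply_audio


if __name__ == "__main__":
    conversation: list[dict] = []
    caller_audio = b"...caller audio frame..."  # stand-in for real audio
    run_turn(caller_audio, conversation)
```

Measuring per-stage latency like this is usually the first step in diagnosing the slow turns that make callers hang up.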
Designing For Multimodal Interactions
Explore top LinkedIn content from expert professionals.
-
I went to an AI UX workshop last night expecting recycled LinkedIn advice about "building AI trust through transparency." Instead, Isabella Yamin tore down LinkedIn's job posting flow using her CarbonCopies AI framework in real time, while founders shared raw implementation struggles. It completely changed how I'm rethinking Maibel's onboarding flow.

Here's what I stole from B2B SaaS principles to redesign emotional AI for B2C:

1️⃣ Progressive disclosure with purpose
LinkedIn's fatal flaw? Optimizing for completion ease > outcome quality. Recruiters are drowning in irrelevant applications because the AI never learns what "qualified" means. The personalization paradox: how do we give users enough control without overwhelming them? Users don't want "frictionless". They want INFORMED control.
📌 At Maibel: I was falling into the same trap, making emotional coaching setup so simple that the AI couldn't understand user context. Now? Progressive complexity with clear trade-offs. Show users how their choices impact outcomes.
→ Want deeper insights? Add more context.
→ Want faster setup? Here's what the AI can't personalize.

2️⃣ Closed-loop data intelligence: what Platfio gets right
They've built a platform for software agencies where every data point feeds back into the entire system. User preferences in marketing flows shape proposals. Campaign performance shapes future recommendations. Every interaction becomes intelligence for future recommendations.
📌 At Maibel: Most wellness apps store emotional check-ins like digital journals. I'm turning them into predictive feedback loops. Emotional intelligence isn't static but COMPOUNDS. Today's reflections shift tomorrow's suggestions. Patterns fuel prevention. Users' inputs on Monday could predict AND prevent Friday's breakdown.

3️⃣ Multi-modal creativity: Wubble's transparency approach
Translating images and files into music - who'd have thought? They've cracked multi-modal creativity where users become co-creators, not passive consumers. The breakthrough moment for me: what if users could see how their visual environment contributes to emotional context?
📌 At Maibel: Users upload images of their day and see how the AI analyzes emotional cues: cluttered workspace = overwhelm, junk food = stress eating. Multi-modal understanding users can contribute to and influence.

💡 The bottom line? B2B SaaS gets one thing right: every interaction has to earn trust. In B2B, failed AI means churn. In emotional AI, failed trust breaks belief in tech entirely.

📌 Here's what we're doing differently at Maibel:
→ Progressive complexity
→ Context-aware feedback
→ Multi-modal participation
→ Intelligence that compounds with every input.

It's not just about building WITH AI. I'm designing systems that learn to understand YOU before you even need to explain yourself.

Kudos to Isabella, Shivang Gupta The Generative Beings, Shaad Sufi, Hayden Cassar and everyone who shared deep product insights.
-
🔊 Have you ever stayed on a customer‑service call simply because the person on the other end sounded trustworthy?

🎧 Researchers from Beijing University of Technology, The University of Texas at Austin, and the University of Memphis recently tested how different AI voices affect persuasion. Their findings:

• 𝗙𝗹𝗶𝗿𝘁𝘆 𝗱𝗼𝗲𝘀𝗻’𝘁 𝘄𝗼𝗿𝗸. A playful “coquetry” voice actually decreased persuasion, especially for male chatbots.
• 𝗦𝘁𝗲𝗿𝗻 𝗶𝗻𝘃𝗶𝘁𝗲𝘀 𝗾𝘂𝗲𝘀𝘁𝗶𝗼𝗻𝘀. Stern voices were just as effective as gentle ones and, in male voices, even increased customer questions.
• 𝗔𝗴𝗲 𝗶𝘀𝗻’𝘁 𝘁𝗵𝗲 𝗶𝘀𝘀𝘂𝗲; 𝗲𝗻𝗴𝗮𝗴𝗲𝗺𝗲𝗻𝘁 𝗶𝘀. There was no significant difference between “young” and “old” voices. What mattered was that older‑sounding voices kept people talking longer.
• 𝗪𝗼𝗿𝗱𝘀 𝗺𝗮𝘁𝘁𝗲𝗿. Using affirmative sentences - particularly in female voices - prompted more customer inquiries, whereas rhetorical questions were less effective.

For leaders in banking and finance, this isn't just academic. Voice is the new front door of your brand. A gentle but confident tone can build trust with high‑net‑worth clients. An affirmative female voice can reassure anxious SME owners. Conversely, a playful chatbot might unintentionally undermine credibility.

𝗦𝗼𝗺𝗲 𝗾𝘂𝗶𝗰𝗸 𝗮𝗰𝘁𝗶𝗼𝗻𝘀 𝘁𝗼 𝗰𝗼𝗻𝘀𝗶𝗱𝗲𝗿:
1. Audit your AI voice scripts. Are you using affirmative statements that invite dialogue?
2. Experiment with different voice personas. Avoid flirty tones and observe how clients react.
3. Treat voice as part of your CX strategy. Integrate data from calls, chatbots and apps so you can personalize the experience for each customer, because customer empathy is your competitive moat.

We've moved from building “voices” metaphorically to designing them intentionally. The tone of your AI isn't just a detail, it's part of the customer experience.

Link to research in comments below.

#AI #Voice
-
Memory & personalization might be the real moat for AI we've been looking for. But where that moat forms is still up for grabs:
• App level
• Model level
• OS level
• Enterprise level
Each has very different dynamics. 🧵

⸻

1. App-level personalization
Apps build their own memory & context for users.
Examples:
• Harvey remembering firm-specific legal knowledge for law firms
• Abridge capturing patient conversations & generating notes for doctors
• Perplexity building long-term search profiles for individual users
➡️ Most likely in vertical applications with focused use cases and domain-specific data. This is where we at Eniac Ventures are currently doing most of our investing.

⸻

2. Model-level personalization
The model itself becomes personalized and portable across apps.
Examples:
• ChatGPT memory & custom instructions
• Meta's LLaMA fine-tuned on personal embeddings
➡️ Most likely in general-purpose assistants and broad horizontal use cases where user context needs to travel across apps.

⸻

3. OS-level personalization
Personalization happens at the OS level, shared across apps & devices.
Examples:
• Google Gemini native to Android
• Apple (maybe) embedding Claude via Anthropic
➡️ Most likely in consumer devices and mobile ecosystems where platforms control distribution.

⸻

4. Enterprise-level personalization
Each enterprise owns and controls its own personalization layer for employees & customers.
Examples:
• Microsoft Copilot trained on company data
• OSS models (LLaMA, Mistral) deployed on private infra with platforms like TrueFoundry
• OpenAI GPTs fine-tuned & hosted in secure enterprise environments
➡️ Most likely in highly regulated industries (healthcare, financial services) where data privacy and compliance are critical.

⸻

Why it matters: Where memory & personalization "land" may define who captures the next decade of AI value. Different layers may win in different sectors.
-
Foundation models (FMs) such as GPT, LLaMA, and CLIP are reshaping the landscape of recommender systems (RS), transforming how we personalize and interact with content across domains like e-commerce, healthcare, and education. A recent comprehensive survey sheds light on this convergence, identifying three powerful paradigms:

1. Feature-Based Paradigm: FMs enhance recommender systems by creating rich, semantic embeddings. For instance, BERT and CLIP help encode textual descriptions and multimodal data, dramatically improving feature representations and helping overcome challenges like data sparsity and cold-start scenarios.

2. Generative Paradigm: Leveraging models like GPT, this paradigm moves beyond mere recommendations, generating personalized content and explanations directly. It facilitates zero-shot/few-shot recommendations, personalized user experiences, and multimodal content generation, though it faces challenges around bias, control, and alignment with user intent.

3. Agentic Paradigm: Perhaps the most transformative, this approach uses autonomous FM-driven agents capable of real-time adaptation and interaction. Agentic systems integrate dynamic planning, reasoning, and user feedback loops to provide highly contextual and ethically aligned recommendations. Systems like AutoGPT illustrate how such agents proactively adapt to user preferences and environmental changes.

The paper also discusses practical implementations across several recommendation tasks:
- Top-N recommendations: enhancing traditional ranking by incorporating semantic insights from FM embeddings.
- Sequential recommendations: leveraging FMs' deep contextual understanding for accurate next-item predictions.
- Conversational recommendations: allowing more dynamic, natural dialogues between users and systems, significantly boosting user engagement.

Despite substantial progress, the survey also highlights ongoing technical challenges such as efficiency, interpretability, fairness, and multimodal integration, offering a roadmap for future research directions. This comprehensive analysis by leading academic and industry institutions marks a critical step forward in our understanding of how foundation models can revolutionize recommender systems, paving the way for more sophisticated, user-centric, and intelligent recommendation platforms.
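As a concrete illustration of the feature-based paradigm, the sketch below embeds item descriptions with a pretrained text encoder and recommends by semantic similarity, which is one common way to handle cold-start items. It assumes the sentence-transformers package and an invented four-item catalogue; the survey itself does not prescribe this particular implementation.

```python
# Sketch of the "feature-based paradigm": embed item descriptions with a
# pretrained encoder, then recommend by cosine similarity. Useful for
# cold-start items that have no interaction history. Catalogue is made up.
import numpy as np
from sentence_transformers import SentenceTransformer

items = {
    "A101": "Wireless noise-cancelling headphones with 30h battery life",
    "A102": "Ergonomic mechanical keyboard with hot-swappable switches",
    "A103": "Over-ear studio headphones tuned for flat frequency response",
    "A104": "4K webcam with auto-framing for video calls",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
ids = list(items)
emb = model.encode([items[i] for i in ids], normalize_embeddings=True)


def recommend_similar(item_id: str, k: int = 2) -> list[str]:
    """Return the k catalogue items most semantically similar to item_id."""
    scores = emb @ emb[ids.index(item_id)]  # cosine similarity (unit vectors)
    ranked = [ids[j] for j in np.argsort(-scores) if ids[j] != item_id]
    return ranked[:k]


print(recommend_similar("A101"))  # the other headphones item should rank first
```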
-
Most voice AI is just a chatbot with a microphone. One company was purpose-built for real phone calls from day one.

The architecture lesson most AI builders learn too late: Everyone builds text agents first. Winners build voice first.

Text agents get retries. Formatting. Autocorrect.
Voice agents get one shot. Real-time. No edits.

Then production hits ↓
→ Latency: Model takes 3 seconds. Customer hangs up.
→ Context: "Uh, yeah, so I need to, wait — can you also check my..."
→ Interruptions: Humans talk over each other. Chat agents break.
→ Compliance: Every voice interaction is regulated differently.

Two traps I see teams fall into:

Trap 1: Bolt STT onto a chat agent. Add TTS on output. Call it "voice AI." That's a wrapper. Wrappers break in production.

Trap 2: Build your own with Pipecat, LiveKit, Vapi. 6 months later you're managing STT providers, TTS rate limits, LLM deprecations, infrastructure scaling, compliance audits. You wanted a voice assistant. Now you're a voice infrastructure company.

PolyAI solved this differently. Full stack built for voice since 2017:
→ Proprietary ASR + LLM trained on real customer service transactions
→ 45+ languages. 24/7. Unlimited scale.
→ Handles surges instantly — storms, outages, promos — zero staffing panic

Not just handling calls — generating revenue:
→ Turning bookings into room upgrades
→ Enrolling callers into rewards mid-conversation
→ QA Agents scoring every call automatically
→ Analyst Agents surfacing patterns no human team catches

One healthcare company found fewer complaints from the AI than human reps — on the hardest, most emotional calls.

Marriott. FedEx. Caesars. PG&E. 25+ countries. 391% ROI. $10.3M average savings. Payback under 6 months.

The companies still running "press 1 for sales, press 2 for support"? Not behind on technology. Behind on architecture. That gap compounds every quarter.

My stress test for any voice AI:
→ Noisy environment
→ Regional accent
→ Language switch mid-sentence
→ Multi-step transaction

Worth 15 minutes if you're evaluating: https://poly.ai/gordon

Build or buy — what's your current approach to voice?
-
UX research has a bit of a blind spot when it comes to text analysis. For years, it's been the part of the process we tend to overlook or handle with a lack of real rigor. We gather huge volumes of words from open-ended surveys, support tickets, and interview transcripts, but we rarely treat that data with the respect it deserves. Instead of deep analysis, we usually just skim for a general vibe or pull out a few catchy lines to spice up a presentation. This turns our most descriptive data into mere anecdotes rather than the powerful empirical signals they actually are.

The real strength of text-based data lies in how it captures the way users think and reason. When people speak or write in their own words, they reveal their mental models and the emotional framing of their experiences. Task success rates and time-on-task metrics are great, but they can't tell you why a user feels a certain way or how they justify their choices. Language is where we find the subtle cues for cognitive load, trust, and even the early warning signs that someone is about to stop using a product. It shows us exactly what users notice and, perhaps more importantly, what they ignore.

We haven't lacked the methodology for this; psychology and linguistics have offered rigorous frameworks like thematic and discourse analysis for a long time. The issue is that these methods are often seen as too slow or too subjective for the fast pace of a design cycle, so they get skipped or done haphazardly. This is the specific area where modern AI and large language models are actually transformative. They allow us to scale these traditional, rigorous frameworks across thousands of data points without losing the human-led theoretical grounding that makes the analysis valuable in the first place.

Using AI in this context doesn't mean we stop interpreting the data. Instead, it means we can finally combine human intuition with automated coding and semantic clustering to find patterns that would take months to uncover manually. We can track how user language shifts over time or across different product releases, connecting those qualitative shifts directly to our quantitative results. It makes our findings far more defensible and consistent.

Ultimately, this shift allows text to become a first-class data source. It moves UX away from being a storytelling exercise and closer to a true decision science. When we treat what users say with the same level of discipline we apply to behavioral metrics, we gain a much clearer picture of where our design interventions will actually matter. It's a practical way to understand not just what a user is doing, but the underlying reason why they are doing it, which is the most important insight a researcher can provide.
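As one possible shape for the "automated coding and semantic clustering" step, here is a small sketch that embeds open-ended responses and groups them into candidate themes for a researcher to review and label. It assumes sentence-transformers and scikit-learn; the responses and the cluster count are invented for the example.

```python
# Sketch of semantic clustering for open-ended feedback: embed each response,
# cluster the embeddings, then review examples per cluster as candidate
# themes. The researcher still names and validates the themes; the model
# only does the grouping.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

responses = [
    "I couldn't find the export button anywhere",
    "Exporting my data took way too many steps",
    "The app feels fast and the new layout is clean",
    "Love how quick everything loads now",
    "I'm not sure what happens to my data after I delete my account",
    "It's unclear who can see the documents I upload",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(responses, normalize_embeddings=True)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(embeddings)

# Group responses by cluster so a researcher can label each candidate theme.
themes: dict[int, list[str]] = {}
for label, text in zip(kmeans.labels_, responses):
    themes.setdefault(int(label), []).append(text)

for label, texts in themes.items():
    print(f"candidate theme {label}:")
    for t in texts:
        print("  -", t)
```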
-
Natural language is the richest form of user data we have, yet it's also the hardest to analyze at scale. Every open-ended survey, support ticket, or usability transcript holds powerful signals about how people think and feel about a product. Natural Language Processing (NLP) gives UX researchers a way to turn that language into structured insight. It bridges computation and linguistics, breaking down text into measurable layers of structure, meaning, and emotion. What used to take hours of manual coding can become a repeatable process for understanding user experience.

The process starts with tokenization, which simply means breaking text into smaller, meaningful units. When every review or chat is split into words or phrases, it becomes possible to detect patterns such as how often users mention frustration near “checkout” or “navigation.” From there, part-of-speech tagging helps us understand tone and emotion by showing how people describe experiences. Verbs reveal action, while adjectives reveal judgment and feeling.

Named Entity Recognition goes one level deeper by automatically finding what users are talking about, identifying brands, features, or interface elements across thousands of lines of feedback. This is how researchers can quickly separate comments about “search,” “profile,” or “payment” without reading them all.

Context always matters, and that's where Word Sense Disambiguation comes in. Words like “crash” or “bug” mean different things depending on domain or product, and disambiguation prevents misinterpretation when analyzing text from diverse sources. TF-IDF and keyword extraction then help highlight what makes each theme stand out. For instance, if “loading time” consistently ranks higher in importance than “interface color,” it shows where design and engineering teams should focus improvement efforts.

Latent Semantic Analysis takes things further by uncovering hidden meaning in large datasets. It can find themes you might not see directly, like when “trust,” “privacy,” and “security” consistently cluster together in feedback about onboarding. Word embeddings such as Word2Vec or GloVe expand this idea, helping machines recognize semantic similarity. They can detect that words like “smooth,” “easy,” and “simple” belong to the same conceptual space, a valuable signal for mapping usability perception.

Then come transformers, the modern foundation of generative AI. Models like BERT and GPT read language in both directions, capturing context across entire sentences. For UX researchers, this means the ability to automatically summarize interviews, identify sentiment shifts, or synthesize recurring themes.

Finally, semantic analysis integrates all these methods to connect what users say with what they intend. It helps reveal the “why” behind emotion, linking language to motivation and trust.
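To make the TF-IDF step tangible, the sketch below scores which terms distinguish each feedback theme from the others, using scikit-learn's TfidfVectorizer; the theme documents are invented for illustration.

```python
# Sketch of TF-IDF keyword extraction over feedback grouped by theme: terms
# with high TF-IDF in one group are what makes that group distinctive
# (e.g. "loading" for performance complaints). Feedback text is made up.
from sklearn.feature_extraction.text import TfidfVectorizer

# One "document" per theme: all comments about that area joined together.
theme_docs = {
    "checkout": "checkout kept failing and the payment page froze twice",
    "performance": "loading time is painful, every screen takes seconds to load",
    "onboarding": "the signup tour was helpful but asked for too many permissions",
}

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(theme_docs.values())
terms = vectorizer.get_feature_names_out()

# Print the top distinguishing terms per theme.
for (theme, _), row in zip(theme_docs.items(), matrix.toarray()):
    top = sorted(zip(terms, row), key=lambda x: -x[1])[:3]
    print(theme, "->", [term for term, score in top if score > 0])
```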
-
Experimenting, please read...

I've been living through the exact job market shift everyone else is panicking about. Traditional UX roles are disappearing faster than most people want to admit. My DMs are full of talented UX designers who can't get callbacks. Meanwhile, I'm actively turning down recruiter messages.

The difference? I pivoted to conversational AI and voice interface design 3.5 years ago when I saw the IVR-to-AI conversion wave coming.

**The Market Reality**
Traditional UX job postings dropped 71% from 2022 to 2023. UX research roles fell below 1,000 monthly postings in early 2025. Only 49.5% of designers are finding new roles within three months, down from 67.9% in 2019.

Nobody's talking about this: the conversational AI market is projected to hit $32.6 billion by 2030. Companies are desperate for people who can design these experiences. The job titles? Conversational AI Designer, Voice UX Designer, AI Content Designer, Prompt Engineer, LLM Experience Designer.

**The Skills That Actually Matter**
Most UX designers get stuck thinking it's just learning new tools. It's not. The pivot requires:
• Dialog flow architecture (conversations across turns, not screens)
• NLP basics (enough to work with engineers)
• Voice-first thinking (designing for ears, not eyes)
• AI personality design (tone, empathy, error handling)
• Multimodal experiences (bridging voice, text, visual)

You need hands-on experience with platforms: Cognigy, Dialogflow, Kore.ai, Cresta. If you can't speak about intent recognition, entity extraction, and conversation flows in these systems, you're not ready.

Transparency: I'm using AI tools to synthesize patterns faster, but the insights come from doing this work. Converting legacy IVR to AI. Designing voice assistants. Building conversational flows that feel human.

**Why This Pivot Works**
Your user research skills? Critical for conversational context. Wireframing? Translates to dialog mapping. Accessibility knowledge? Essential for inclusive voice design.

**The Uncomfortable Truth**
Less than 5% of design roles target junior talent. Mass layoffs continue. Traditional UX teams are shrinking. AI automates entry-level tasks. Companies consolidate roles. If you're waiting for the market to bounce back to 2021, I'll be direct: it's not happening.

**What Actually Works**
The people working right now? AI-adjacent roles. They learned LLMs. Got hands-on with Dialogflow or Cognigy. Repositioned portfolios for conversational thinking. Applied for "AI Experience Designer" not "UX Designer."

Stop thinking "learning a specialization." Start thinking survival adaptation. Different energy entirely.

Your traditional UX skills aren't worthless. They're the foundation. But they need a new application layer. The question isn't whether to pivot. It's how fast you can move.

What are you seeing in the job market?
-
Building Agentic Graph Systems That Learn and Adapt to Each User 🛜

Graph-based systems represent a significant advancement in creating truly personalized and agentic AI systems by enabling sophisticated patterns of memory, recommendation, and contextual awareness to work together seamlessly. The integration of graph structures allows AI agents to maintain complex webs of relationships while actively learning and adapting to individual users' needs and preferences.

First, graph structures provide a natural foundation for building memory systems that can evolve into sophisticated recommendation engines. The ability to traverse and weight relationships between entities enables systems to transform from passive storage into active agents that can anticipate needs and suggest relevant actions. This is particularly powerful because the graph structure captures not just individual pieces of information, but also their context, outcomes, and interrelationships.

Second, graph-based systems excel at multi-dimensional pattern recognition. Unlike traditional recommendation systems that might focus on simple similarity metrics, graph structures can simultaneously process temporal patterns, contextual relationships, user behaviors, and outcome patterns. This multi-faceted analysis enables recommendations that are both more accurate and more nuanced than conventional approaches.

Third, the adaptive learning capabilities of graph-based systems create a powerful feedback loop for personalization. When users respond to suggestions, their feedback modifies the weights of relevant connections in the graph. This creates a self-improving system where successful patterns naturally strengthen while less helpful ones fade. The adaptation works at both individual and aggregate levels, enabling systems to balance personalized learning with broader pattern recognition.

Fourth, graph structures provide elegant solutions to common challenges in personalization systems, particularly the cold-start problem. Even with limited initial information about a new user, the system can leverage indirect relationships and partial matches to make meaningful recommendations. As more interactions occur, these initial connections rapidly refine through feedback and pattern matching.

Fifth, graph-based systems can offer sophisticated privacy controls while maintaining high levels of personalization, enabling highly personalized experiences alongside appropriate privacy protections.

The integration of these capabilities has profound implications for AI system design. The graph structure serves as a unified framework where memory, learning, and recommendation capabilities can seamlessly interact. This enables increasingly sophisticated agents that can not only store and retrieve information but actively predict and suggest relevant knowledge and actions based on deep contextual understanding.
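A toy sketch of the feedback-weighted graph idea described above: preferences and suggestions are nodes, edge weights encode how well a path has worked, positive feedback strengthens an edge and negative feedback weakens it. It uses networkx with invented node names and is an illustration of the pattern, not a production memory system.

```python
# Toy graph memory that adapts edge weights from user feedback: entities and
# preferences are nodes, weighted edges record how often a suggestion in that
# context worked. Positive feedback strengthens an edge; negative feedback
# weakens it, so helpful patterns compound and unhelpful ones fade.
import networkx as nx

G = nx.DiGraph()
G.add_edge("user:ana", "topic:sleep", weight=0.6)
G.add_edge("user:ana", "topic:workload", weight=0.3)
G.add_edge("topic:sleep", "suggestion:wind_down_routine", weight=0.5)
G.add_edge("topic:sleep", "suggestion:caffeine_cutoff", weight=0.5)
G.add_edge("topic:workload", "suggestion:time_blocking", weight=0.4)


def record_feedback(src: str, dst: str, helpful: bool, lr: float = 0.2) -> None:
    """Nudge an edge weight toward 1.0 on positive feedback, toward 0.0 otherwise."""
    w = G[src][dst]["weight"]
    target = 1.0 if helpful else 0.0
    G[src][dst]["weight"] = w + lr * (target - w)


def suggest(user: str, k: int = 2) -> list[str]:
    """Score suggestions by weight along user -> topic -> suggestion paths."""
    scores: dict[str, float] = {}
    for _, topic, d1 in G.out_edges(user, data=True):
        for _, suggestion, d2 in G.out_edges(topic, data=True):
            score = d1["weight"] * d2["weight"]
            scores[suggestion] = max(scores.get(suggestion, 0.0), score)
    return sorted(scores, key=scores.get, reverse=True)[:k]


print(suggest("user:ana"))
record_feedback("topic:sleep", "suggestion:wind_down_routine", helpful=True)
print(suggest("user:ana"))
```

Calling suggest() before and after record_feedback() shows the ranking shift as the relevant edge strengthens, which is the individual-level side of the feedback loop the post describes.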