VoiceTextBlender introduces a novel approach to augmenting LLMs with speech capabilities through single-stage joint speech-text supervised fine-tuning. The researchers from Carnegie Mellon and NVIDIA have developed a more efficient way to create models that can handle both speech and text without compromising performance in either modality. The team's 3B parameter model demonstrates superior performance compared to previous 7B and 13B SpeechLMs across various speech benchmarks whilst preserving the original text-only capabilities—addressing the critical challenge of catastrophic forgetting that has plagued earlier attempts. Their technical approach employs LoRA adaptation of the LLM backbone, combining text-only SFT data with three distinct types of speech-related data: multilingual ASR/AST, speech-based question answering, and an innovative mixed-modal interleaving dataset created by applying TTS to randomly selected sentences from text SFT data. What's particularly impressive is the model's emergent ability to handle multi-turn, mixed-modal conversations despite being trained only on single-turn speech interactions. The system can process user input in pure speech, pure text, or any combination, showing impressive generalisation to unseen prompts and tasks. The researchers have committed to publicly releasing their data generation scripts, training code, and pre-trained model weights, which should significantly advance research in this rapidly evolving field of speech language models. Paper: https://lnkd.in/dutRcaAA Authors: Yifan Peng, Krishna C. Puvvada, Zhehuai Chen, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, Boris Ginsburg #SpeechLM #MultimodalAI #SpeechAI
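The mixed-modal interleaving idea is easy to picture in code. Below is a minimal sketch of how such data could be generated: split a text SFT prompt into sentences and replace a random subset with synthesized speech. The `synthesize` stub and the sentence-splitting regex are illustrative assumptions, not the authors' released pipeline.

```python
import random
import re


def synthesize(sentence: str) -> bytes:
    """Placeholder TTS call; the actual TTS engine used in the paper is not specified here."""
    return f"<audio:{sentence}>".encode("utf-8")  # stand-in for a waveform


def interleave_prompt(prompt: str, speech_ratio: float = 0.5, seed: int = 0):
    """Split a text SFT prompt into sentences and replace a random subset with speech,
    producing a mixed-modal sequence of ("text", str) and ("speech", bytes) segments."""
    rng = random.Random(seed)
    sentences = re.split(r"(?<=[.!?])\s+", prompt.strip())
    segments = []
    for sent in sentences:
        if rng.random() < speech_ratio:
            segments.append(("speech", synthesize(sent)))  # rendered as audio
        else:
            segments.append(("text", sent))                # kept as text
    return segments


demo = "Summarize the review below. The battery life is great. However, the screen is dim."
for modality, content in interleave_prompt(demo, speech_ratio=0.4):
    print(modality, content)
```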
Speech Recognition Innovations
Summary
Speech recognition innovations are transforming how humans interact with machines, making communication faster, more natural, and highly personalized. These advancements allow computers and AI agents to accurately understand and respond to spoken language—including silent signals or multimodal interactions—bridging accessibility gaps and empowering users across healthcare, customer service, and daily life.
- Embrace hands-free access: Take advantage of voice-enabled technology to multitask and interact with devices without needing a screen or keyboard.
- Protect your privacy: Stay alert to the growing risks of voice-cloning scams and explore new security layers like real-time authentication and speech watermarking.
- Support accessibility: Encourage the use of speech recognition tools that offer independence to people with disabilities, helping them communicate more naturally and efficiently.
-
Voice agents are having their moment in 2025: an open-source breakthrough just redefined real-time multimodal AI by slashing interaction latency to 1.5 seconds, challenging the recently released proprietary real-time APIs from OpenAI and Google.

VITA-1.5, the latest iteration of the open-source interactive omni-multimodal LLM, brings three major improvements that push the boundaries of multimodal AI:
(1) Speed transformation - reduced end-to-end speech interaction latency from 4 seconds to 1.5 seconds, enabling true real-time conversations
(2) Speech processing leap - decreased Word Error Rate from 18.4 to 7.5, rivaling specialized speech models
(3) Multimodal excellence - boosted performance across MME, MMBench, and MathVista from 59.8 to 70.8 while maintaining robust vision-language capabilities

One novel method from the paper is VITA’s progressive training strategy, which allows speech integration without compromising other multimodal capabilities - a persistent challenge in the field. Image understanding performance drops by only 0.5 points while the model gains an entirely new modality.

As we move towards agentic AI systems that need to process and respond to multiple input streams in real time, VITA-1.5's achievement in reducing latency while maintaining high accuracy across modalities sets a new standard for what's possible in open-source AI. This release signals a shift in the multimodal AI landscape, demonstrating that open-source alternatives can compete with proprietary solutions in the race for real-time, multi-sensory AI interactions.

VITA-1.5: https://lnkd.in/gj7pd77P
More tools, open-source models, and APIs for building voice agents in my recent AI Tidbits post: https://lnkd.in/g9ebbfX3
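For readers unfamiliar with progressive (staged) training, here is a rough sketch of the general pattern: freeze the already-trained components and train only the new speech path first, then gradually widen what is trainable. The module names and stage schedule are hypothetical and do not reflect VITA-1.5's actual architecture or recipe.

```python
import torch
from torch import nn


# Tiny stand-ins for the model's components; the real VITA-1.5 modules differ.
class Stub(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.proj(x)


model = nn.ModuleDict({
    "llm": Stub(),
    "vision_encoder": Stub(),
    "vision_adapter": Stub(),
    "speech_encoder": Stub(),
    "speech_adapter": Stub(),
})


def set_trainable(names_to_train):
    """Freeze everything, then unfreeze only the listed sub-modules."""
    for name, module in model.items():
        module.requires_grad_(name in names_to_train)


# Hypothetical staging: train the new speech path first while the rest stays frozen,
# then widen what is trainable, so vision-language skills are not overwritten.
stages = [
    ["speech_adapter"],                            # align speech features to the LLM space
    ["speech_encoder", "speech_adapter"],          # refine the speech encoder
    ["speech_encoder", "speech_adapter", "llm"],   # light joint tuning last
]
for stage in stages:
    set_trainable(stage)
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=1e-4)
    # ... run this stage's training loop with `optimizer` here ...
```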
-
376,000 ALS patients type 10 words per minute. MIT just gave them normal speech speed. No sound. No surgery. Just seven sensors reading jaw signals.

Arnav Kapur and his team built AlterEgo with one mission: empower people with ALS and oral cancer, not replace them. Their wearable reads signals your brain sends to silent muscles—92% accuracy, half-second response.

The cost breakthrough that matters:
↳ Neuralink surgery: $30,000-$100,000+
↳ Brain implants: Infection risks, select trials only
↳ Current ALS devices: $1,500-$8,000 robotic voices
↳ AlterEgo target: Same price, your actual voice

Think about that. No drilling into skulls like Synchron or UC Davis implants. No $100,000 medical bills. Just electrodes on your jaw detecting the same signals you use to read silently.

Traditional Assistive Reality:
↳ Eye-tracking: 10 exhausting words per minute
↳ Brain surgery: $100,000+ with infection risks
↳ Robotic voices destroying identity
↳ Most patients priced out entirely

AlterEgo Reality:
↳ Think naturally, speak instantly
↳ Non-invasive wearable design
↳ Your voice preserved digitally
↳ First responders using it for silent comms

But here's what stopped me cold: the same device restoring voices to ALS patients is being tested for secure translation, silent note-taking, and emergency response teams. One innovation serving different needs with high impact.

Consumer EEG headsets cost $100-$1,000 but can't handle real speech. Medical BCIs require brain surgery. AlterEgo sits between—medical-grade accuracy without medical risks.

The Multiplication Effect:
1 voice preserved = independence restored
100 patients reconnected = isolation broken
1,000 using AlterEgo = new communication standard
At scale = surgery becomes obsolete

From MIT lab to human trials. From $100,000 brain implants to accessible wearables. From "I need surgery to speak" to "I just need to think."

Kapur's team chose technology that empowers rather than replaces human ability. Because 376,000 people with ALS and oral cancer deserve their own voice—not a robot's.

Follow me, Dr. Martha Boeckenfeld, for innovations that restore human dignity without invasion. ♻️ Share if everyone deserves to keep their voice.
-
Screens are optional—conversation isn’t. Voice agents have finally crossed the line from “nice demo” to mass-scale live production.

A Fortune 100 health insurer has replaced swaths of its call-centre workforce with an AI agent that listens to symptom descriptions, gauges urgency and benefit details, and steers members to the right in-house nurse or in-network provider. Early results show mis-routed calls collapsing while human nurses concentrate on the most complex cases—evidence that, when trained on medical nuance, automation can still deliver empathy.

The same capability is trickling down to Main Street. A neighbourhood dental clinic now relies on a 24/7 AI receptionist that fills midnight cancellations, takes deposits and syncs instantly with the practice-management calendar, eliminating the Monday-morning voicemail backlog. Nearby, an auto body shop lets its voice agent quote repairs and capture credit-card details while mechanics sleep, winning leads that used to hang up after three rings.

Why does this feel inevitable? Voice is simply higher bandwidth than text; tone, pace and sighs carry layers of meaning a text interaction cannot. Studies show people (and agents) read emotion and feel connection more accurately when they hear a voice. As latency drops below half a second and costs reach pennies per minute, talking will again beat typing for many tasks—only this time the “person” on the other end might be generated by silicon.

Now imagine the next step: every brand offers you a personal concierge that remembers the hiking boots you bought last spring, the hotel room you preferred in Tel Aviv or your preference for classical hold music. It greets you by name, picks up the last conversation mid-sentence and suggests dinner before you even think to ask. Conversation becomes the API.

Optimism doesn’t erase risk. Voice-cloning scams already account for more than 40 percent of fraud attempts in finance, up twenty-fold in three years. Protecting both brands and callers will demand a new security layer: real-time likeness checks, rotating pass-phrases and cryptographic watermarks baked into synthetic speech so a courtroom—or a phone—can tell the difference between a genuine agent and a deepfake. That challenge is an opening for startups.

I’m curious: if you’re experimenting with voice, how are you balancing speed, empathy and security? And what surprised you when real customers finally started talking back? Happy to compare notes.
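On the security point, here is a toy sketch of the provenance idea behind marking synthetic speech: the generating side attaches a keyed tag that a verifier can check. Real speech watermarking embeds the mark imperceptibly in the waveform itself; the HMAC tag, key handling, and agent names here are illustrative assumptions only.

```python
import hashlib
import hmac

SHARED_KEY = b"rotate-me-regularly"  # hypothetical key, rotated like a pass-phrase


def tag_synthetic_audio(audio: bytes, agent_id: str) -> str:
    """Attach a provenance tag to generated audio so a receiver can check it came
    from an authorised agent. Real watermarks are hidden inside the waveform itself."""
    return hmac.new(SHARED_KEY, agent_id.encode() + audio, hashlib.sha256).hexdigest()


def verify_tag(audio: bytes, agent_id: str, tag: str) -> bool:
    expected = hmac.new(SHARED_KEY, agent_id.encode() + audio, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)


audio = b"\x00\x01\x02"  # stand-in for synthesized speech bytes
tag = tag_synthetic_audio(audio, agent_id="clinic-receptionist")
print(verify_tag(audio, "clinic-receptionist", tag))             # True
print(verify_tag(audio + b"tamper", "clinic-receptionist", tag))  # False
```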
-
Imagine trying to get a workout recommendation while running, navigate a complex route while driving, or get tech support while cooking - all without touching a screen. This is the promise of voice-enabled LLM agents, a technological leap that's redefining how we interact with machines.

Traditional text-based chatbots are like trying to dance with two left feet. They're clunky, impersonal, and frustratingly limited. Consider these real-world friction points:
- A visually impaired user struggling to type support queries
- A fitness enthusiast unable to get real-time guidance mid-workout
- A busy professional multitasking who can't pause to type a complex question

Voice AI breaks these barriers, mimicking how humans have communicated for millennia. We learn to speak by four months, but writing takes years - a testament to speech's fundamental naturalness.

Real-World Transformation Examples:
1️⃣ Healthcare: Emotion-recognizing AI can detect patient stress levels through voice modulation, enabling more empathetic remote consultations.
2️⃣ Fitness: Hands-free coaching that adapts workout intensity based on your breathing and vocal energy.
3️⃣ Customer Service: Intelligent voice systems that understand context, emotional undertones, and personalize responses in real-time.

The magic of voice lies in its nuanced communication:
- Tone reveals emotional landscapes
- Intensity signals urgency or excitement
- Rhythm creates conversational flow
- Inflection adds layers of meaning beyond mere words

Voice-enabled agents can build on these cues to:
- Recognize emotional states with unprecedented accuracy
- Support rich, multimodal interactions combining voice, visuals, and context
- Differentiate speakers in complex conversations
- Extract subtle contextual intentions
- Provide personalized responses based on voice characteristics

In short, this technology is about creating more human-centric technology that listens, understands, and responds like a thoughtful companion. The future of AI isn't about machines talking at us, but talking with us.
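At its core, a voice-enabled LLM agent is a loop of speech-to-text, context-aware generation, and text-to-speech. The sketch below shows that loop with placeholder functions; a real system would swap in streaming STT, a chat-completion API, and a TTS engine, and the example inputs and outputs are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str
    content: str


def transcribe(audio: bytes) -> str:
    """Placeholder STT; a production agent would call a streaming recognizer."""
    return "what's a good five minute warm-up before a run?"


def generate_reply(history: List[Turn]) -> str:
    """Placeholder LLM call; swap in any chat-completion API, passing the history."""
    return "Try two minutes of brisk walking, then leg swings and ankle circles."


def synthesize(text: str) -> bytes:
    """Placeholder TTS; returns audio bytes to play back to the user."""
    return text.encode("utf-8")


def handle_voice_turn(audio: bytes, history: List[Turn]) -> bytes:
    user_text = transcribe(audio)              # speech -> text
    history.append(Turn("user", user_text))
    reply = generate_reply(history)            # text -> text, with conversational context
    history.append(Turn("assistant", reply))
    return synthesize(reply)                   # text -> speech


history: List[Turn] = []
response_audio = handle_voice_turn(b"<mic input>", history)
print(response_audio.decode("utf-8"))
```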
-
A high-performance speech neuroprosthesis, developed by Stanford researchers, decodes attempted speech directly from brain activity—restoring a voice to individuals who have lost the ability to speak.

Key Findings:
📍 Rapid and naturalistic decoding: The system translated neural signals into real-time text at 62 words per minute—nearly 3.5× faster than prior BCI systems. This speed brings decoded communication closer to everyday conversation, offering a major leap in usability and responsiveness.
📍 Robust phoneme mapping and vocabulary range: Impressively, the neuroprosthesis operated with a 125,000-word vocabulary—the largest ever used in speech BCI—while maintaining semantic accuracy. Neural representations of phonemes remained intact even years after speech loss, suggesting the brain’s motor-speech pathways are more persistent than previously assumed.
📍 Rethinking the neural basis of speech: While traditional models emphasize Broca’s area, this study found that area 6v was more predictive of speech intention. Furthermore, the system successfully decoded both spoken and silently mouthed words, demonstrating that silent articulation retains a reliable neural signature—crucial for fatigue-free, discreet communication.

By Willett et al., Nature, 2023 https://rdcu.be/eyFkC

Implication: This work marks a major milestone for brain–computer interfaces, bridging neuroscience and assistive technology to restore speech—and reshaping our understanding of the brain’s language architecture.

#BrainComputerInterface #Neuroprosthetics #SpeechNeuroprosthesis #Neuroscience #Stanford #ALS #Neurotech #BCI
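To make the phoneme-mapping step concrete, here is a toy illustration of turning per-frame phoneme probabilities into a phoneme string, CTC-style, before a lexicon and language model map it to words. The tiny phoneme inventory and probabilities are invented; the study's actual pipeline (a recurrent decoder plus a large-vocabulary language model) is far more sophisticated.

```python
import numpy as np

PHONEMES = ["_", "HH", "AH", "L", "OW"]  # "_" is the blank symbol; a toy inventory


def collapse(frame_ids):
    """CTC-style collapse: merge consecutive repeats, then drop blanks (id 0)."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != 0:
            out.append(PHONEMES[i])
        prev = i
    return out


# Stand-in for per-frame phoneme probabilities produced by the neural decoder.
frame_probs = np.array([
    [0.10, 0.80, 0.05, 0.03, 0.02],  # HH
    [0.10, 0.70, 0.10, 0.05, 0.05],  # HH (repeat, merged)
    [0.10, 0.05, 0.75, 0.05, 0.05],  # AH
    [0.60, 0.10, 0.10, 0.10, 0.10],  # blank
    [0.05, 0.05, 0.05, 0.80, 0.05],  # L
    [0.05, 0.05, 0.05, 0.05, 0.80],  # OW
])
phones = collapse(frame_probs.argmax(axis=1))
print(phones)  # ['HH', 'AH', 'L', 'OW'] -> a lexicon + language model would map this to "hello"
```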
-
Breakthrough: BCI + AI = instant mind-to-speech conversion. A new device can detect words and turn them into speech within three seconds.

📍 The researchers used deep-learning RNN-T models to achieve fluent, large-vocabulary speech synthesis, with neural decoding in 80-ms increments.

In the study, Ann was a participant who lost her ability to speak after a stroke 18 years ago. Researchers placed a paper-thin rectangle containing 253 electrodes on the surface of her cortex (the speech sensorimotor area) to record the activity of thousands of neurons. Researchers even personalized the synthetic voice! They used AI on recordings from her wedding video. As a result, the synthetic voice sounds like Ann’s own voice from before her injury.

❗ The result:
Before: a single sentence took >20 seconds.
Now: 47-90 words per minute.

“Our framework also successfully generalized to other silent-speech interfaces, including single-unit recordings and electromyography. Our findings introduce a speech-neuroprosthetic paradigm to restore naturalistic spoken communication to people with paralysis.”

Huge congratulations to the authors of this work! Just WOW.
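The 80-ms increments are what make the system feel conversational: output comes out as features stream in, rather than after the whole utterance. Below is a minimal sketch of that incremental decoding pattern with a placeholder decoder step; the real system runs RNN-T-style models over neural recordings, not the dummy feature windows used here.

```python
STEP_MS = 80  # the system emits output in 80-ms increments


def decode_step(feature_window, state):
    """Placeholder for one incremental (RNN-T-style) decoder step: one window of
    neural features in, zero or more tokens out, plus the updated recurrent state."""
    tokens = []  # a real model would emit phonemes/words here
    return tokens, state


def stream_decode(feature_stream):
    """`feature_stream` yields one window of features every STEP_MS milliseconds."""
    state, output = None, []
    for window in feature_stream:
        tokens, state = decode_step(window, state)
        output.extend(tokens)  # the transcript grows as data arrives, keeping latency low
    return output


# Example: six dummy 80-ms windows of 256-dimensional features.
print(stream_decode([[0.0] * 256 for _ in range(6)]))  # []
```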
-
An Indian teen built a speech device for paralysis patients.

Pranet Khetan went on a school field trip to a paralysis care centre and watched patients struggle to communicate. People who knew exactly what they wanted to say, but their muscles wouldn't cooperate because of dysarthria.

Pranet noticed this and created Paraspeak Labs - a device that captures slurred or unclear speech, sends it to a cloud-based AI model he built, and converts it into clear, natural voice output.

No Hindi dysarthric speech dataset existed, so he visited NGOs and care centres, recorded 42 minutes of real-world audio from 28 patients, and expanded it to a 20-hour training set through data augmentation.

The device fits around the neck. Press a button, speak, and within seconds the device speaks clearly for you.

What makes Paraspeak different:
- It's India's first Automatic Speech Recognition framework specifically designed for Hindi dysarthric speech
- The device costs ₹2,000 to manufacture
- It's multi-user and requires no retraining, unlike existing research tools built around individual patients
- It has been tested on patients with congenital disorders, end-stage ALS, and Parkinson's

His work has been recognised at Samsung Solve for Tomorrow ’25, the IRIS National Fair, and Regeneron ISEF 2025, and has led to Paraspeak's incubation at FITT, IIT Delhi.

That's what real innovation looks like. Just a clear problem, deep empathy, and the stubbornness to build.
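Stretching 42 minutes of recordings toward a 20-hour training set relies on data augmentation. The post doesn't say which augmentations were used, so the sketch below shows common, generic ones (speed perturbation, additive noise, gain) in plain NumPy as an assumption of what such an expansion might look like.

```python
import numpy as np

rng = np.random.default_rng(0)


def add_noise(wave, snr_db):
    """Mix in white noise at a target signal-to-noise ratio (in dB)."""
    signal_power = np.mean(wave ** 2) + 1e-12
    noise_power = signal_power / (10 ** (snr_db / 10))
    return wave + rng.normal(0.0, np.sqrt(noise_power), wave.shape)


def change_speed(wave, factor):
    """Resample by linear interpolation to speed up (>1) or slow down (<1) the clip."""
    old_idx = np.arange(len(wave))
    new_idx = np.linspace(0, len(wave) - 1, int(len(wave) / factor))
    return np.interp(new_idx, old_idx, wave)


def augment(wave):
    """Produce one random variant: speed perturbation, additive noise, and gain."""
    out = change_speed(wave, factor=rng.uniform(0.9, 1.1))
    out = add_noise(out, snr_db=rng.uniform(10, 30))
    return out * rng.uniform(0.8, 1.2)


clip = rng.normal(0.0, 0.1, 16_000)            # one second of dummy 16 kHz audio
variants = [augment(clip) for _ in range(30)]  # ~30x expansion: 42 min -> roughly 20 h
print(len(variants), variants[0].shape)
```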
-
IBM just picked its first-ever voice AI partner, and it's not who you'd expect.

February wrapped with enterprise voice AI moving from experiments to infrastructure. The deals this month aren't proofs of concept. They're production integrations with real revenue attached.

Deepgram lands inside IBM watsonx. IBM selected Deepgram as its first voice partner, embedding STT and TTS directly into watsonx Orchestrate. Enterprise-grade transcription, real-time captioning, and multilingual support are now native to IBM's AI platform. This is voice AI graduating from a standalone tool to an embedded enterprise layer.

Anthropic goes all-in on enterprise agents. Claude Cowork shipped private plugin marketplaces, 10 department-specific plugins (finance, legal, HR), 12 new MCP connectors (Gmail, DocuSign, Clay), and cross-app workflows for Excel and PowerPoint. Enterprise admins can now curate which AI capabilities their teams access.

OpenAI drops GPT-Realtime-1.5. The updated real-time voice model brings +10% transcription accuracy, +7% instruction compliance, and +5% audio reasoning at the same price. Connection success rates climbing to ~66% and error rates cut in half suggest this model is getting production-ready.

SoundHound AI takes voice agents to the retail floor. Sales Assist, unveiled at MWC 2026, listens to in-store customer conversations and pushes real-time deal recommendations to staff devices. Built on SoundHound's Polaris ASR, purpose-built for noisy environments.

ElevenLabs closes $500M at $11B. Sequoia led, a16z quadrupled down. With $200M+ ARR and expansion into multimodal agents and 14 global offices, ElevenLabs is building toward an IPO.

Newo.ai raises $25M for AI receptionists. Their zero-hallucination architecture runs parallel AI agents to verify responses before they reach callers. 15,000+ agents deployed, mostly in dental, restaurants, and home services.

Speechmatics signs two enterprise deals in one week. Partnered with VCONIC for healthcare/financial compliance and Boost AI for regulated European industries. Their medical model hits 93% real-world accuracy with 50% fewer terminology errors.

Voice AI isn't waiting for permission anymore. Which of these moves matters most to your stack?
-
This 16-year-old just won a global award for building a $22 AI device that helps people with speech disorders communicate.

Meet Pranet Khetan - a high school student from India who built Paraspeak, a matchbox-sized device that converts slurred speech into clear speech in real time.

The idea started on a school trip. At a paralysis care center, Pranet met stroke survivors and people with cerebral palsy who knew exactly what they wanted to say, but their voices couldn’t express it. And no tool existed for them - especially not in Indian languages. So he built one himself.

But there was a problem. Most speech recognition systems are built for English. And there's almost no data for slurred Hindi speech caused by conditions like Parkinson's or cerebral palsy.

So here's what he did:
→ Went hospital to hospital recording patients
→ Built his own dataset from 28+ patients
→ Trained a deep learning model
→ Packaged it into a tiny device

Total cost - under $25. And it works in real time.

Paraspeak won Samsung Solve for Tomorrow 2025 and the Fourth Grand Award in Biomedical Engineering at ISEF 2025. He also received over $120K in incubation support at IIT Delhi.

People usually spend weeks overthinking what to build - is the idea big enough? is the timing right? should we even start? Pranet didn’t think about any of that. He saw someone struggle - and built the thing that would help them.

That's the reminder I needed. If you saw one problem up close today, what would you actually build for it?

#innovation #startups #founders #building