Emerging Trends in Multimodal AI
Summary
Emerging trends in multimodal AI focus on building systems that can seamlessly understand and process different types of data—like images, text, audio, and video—together, unlocking advanced capabilities for real-time interaction and cross-modal retrieval. Multimodal AI moves beyond single-format analysis by combining various data streams, allowing machines to interpret information much more like humans do.
- Explore unified data handling: Look for AI tools that interpret and connect multiple data types, enabling richer insights from sources like images, speech, and documents in a single workflow.
- Prioritize accessibility and safety: When designing multimodal interfaces, make sure features like voice and vision are inclusive and respect personal privacy to build trust with users.
- Consider lightweight models: Small, efficient multimodal AI systems make advanced capabilities available across more devices, so keep an eye out for compact architectures that balance speed and accuracy.
-
Voice agents are having their moment in 2025: an open-source breakthrough just redefined real-time multimodal AI by slashing interaction latency to 1.5 seconds, challenging the recently released proprietary real-time APIs from OpenAI and Google.
VITA-1.5, the latest iteration of the open-source interactive omni-multimodal LLM, brings three major improvements that push the boundaries of multimodal AI:
(1) Speed transformation: end-to-end speech interaction latency cut from 4 seconds to 1.5 seconds, enabling true real-time conversations.
(2) Speech processing leap: Word Error Rate reduced from 18.4 to 7.5, rivaling specialized speech models.
(3) Multimodal excellence: performance across MME, MMBench, and MathVista boosted from 59.8 to 70.8 while maintaining robust vision-language capabilities.
One novel method from the paper is VITA's progressive training strategy, which allows speech integration without compromising other multimodal capabilities, a persistent challenge in the field. Image understanding performance drops by only 0.5 points while the model gains an entirely new modality.
As we move toward agentic AI systems that must process and respond to multiple input streams in real time, VITA-1.5's achievement in reducing latency while maintaining high accuracy across modalities sets a new standard for what's possible in open-source AI. This release signals a shift in the multimodal AI landscape, demonstrating that open-source alternatives can compete with proprietary solutions in the race for real-time, multi-sensory AI interactions.
VITA-1.5: https://lnkd.in/gj7pd77P
More tools, open-source models, and APIs for building voice agents in my recent AI Tidbits post: https://lnkd.in/g9ebbfX3
-
Over the past few months, I’ve noticed a pattern in our system design conversations: they increasingly orbit around audio and video, how we capture them, process them, and extract meaning from them. This isn’t just a technical curiosity. It signals a tectonic shift in interface design.
For decades, our interaction models have been built on clickstreams: tapping, typing, selecting from dropdowns, navigating menus. Interfaces were essentially structured bottlenecks, forcing human intent into machine-readable clicks and keystrokes. But multimodal AI removes that bottleneck. Machines can now parse voice, gesture, gaze, or even the messy richness of a video feed. That means the “atomic unit” of interaction may be moving away from clicks and text inputs toward speech, motion, and visual context.
Imagine a world where the UI is stripped to its essence: a microphone and a camera. Everything else, navigation, search, configuration, flows from natural human expression. Instead of learning the logic of software, software learns the logic of people.
If this plays out, the implications are profound:
- UX shifts from layouts to behaviors: designers move from arranging buttons to choreographing multimodal dialogues.
- Accessibility and inclusion take center stage: voice and vision can open doors, but also risk excluding unless designed with empathy.
- Trust and control must be redefined: a camera-first interface is powerful, but also deeply personal. How do we make it feel safe, not invasive?
We may be on the cusp of the first truly post-GUI era, where screens become less about control surfaces and more about feedback canvases, reflecting back what the system has understood from us.
-
Exciting developments in the field of multimodal AI! A comprehensive survey on small vision-language models (sVLMs) has just been released, offering invaluable insights into compact architectures and techniques for efficient multimodal processing.
Key highlights:
- Introduces a novel taxonomy of sVLM architectures: transformer-based, mamba-based, and hybrid models.
- Explores cutting-edge techniques like knowledge distillation, lightweight attention mechanisms, and modality pre-fusion that enable high performance with reduced computational demands.
- Provides in-depth analysis of state-of-the-art models such as TinyGPT-V, MiniGPT-4, and VL-Mamba, examining trade-offs between accuracy, efficiency, and scalability.
- Addresses persistent challenges in the field, including data biases and generalization to complex tasks.
Under the hood:
- Transformer-based models leverage self-attention mechanisms to capture long-range dependencies in both visual and textual data. These architectures process image patches and text tokens in parallel, using cross-attention to fuse modalities (see the sketch below).
- Mamba-based models utilize state space models (SSMs) for linear scalability, making them particularly suitable for handling long sequences in resource-constrained environments.
- Hybrid approaches combine elements from transformers, CNNs, and other lightweight mechanisms to balance performance and computational efficiency.
The authors, hailing from institutions in India, have meticulously compiled advancements in sVLMs, setting a foundation for future research into efficient multimodal systems. Their work underscores the transformative potential of these compact models in making advanced AI capabilities accessible across a wide range of devices and applications.
This survey is a must-read for researchers and practitioners working at the intersection of computer vision, natural language processing, and efficient AI systems. It offers a roadmap for developing the next generation of lightweight, yet powerful, vision-language models.
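To make the cross-attention fusion idea concrete, here is a minimal PyTorch sketch of one fusion block. It illustrates the general pattern rather than code from any surveyed model; all names and dimensions are made up.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Text tokens attend to image patches: queries from text, keys/values from vision."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_patches):
        # Each text token gathers visual context from all image patches.
        fused, _ = self.attn(text_tokens, image_patches, image_patches)
        return self.norm(text_tokens + fused)  # residual keeps the text stream intact

# Toy shapes: batch of 2, 16 text tokens, 49 image patches, 256-dim features
text = torch.randn(2, 16, 256)
patches = torch.randn(2, 49, 256)
print(CrossModalFusion()(text, patches).shape)  # torch.Size([2, 16, 256])
```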
-
🧠 Gemini Embedding 2
Google DeepMind just released Gemini Embedding 2. The obvious headline is that it embeds text, images, audio, video, and documents. The more interesting shift is architectural: everything is projected into a single shared embedding space.
Historically, most pipelines looked like this:
text encoder → text vectors
image encoder → image vectors
audio encoder → audio vectors
Each modality lived in its own latent space. Cross-modal retrieval required alignment tricks like projection heads or CLIP-style contrastive pipelines. Gemini Embedding 2 moves the alignment directly into the model:
text → vector
image → vector
audio → vector
video → vector
PDF → vector
All of these are mapped into a unified semantic manifold where vectors can be compared with cosine similarity. This makes cross-modal retrieval much simpler. A text query can retrieve a video moment. An image can retrieve documents. An audio clip can retrieve conversations. Instead of maintaining multiple modality-specific indices, you can run a single ANN index over embeddings and retrieve purely by semantic proximity.
Technically, this only works if the model learns strong cross-modal alignment during training. Modern multimodal systems typically rely on combinations of:
• cross-modal contrastive learning
• masked multimodal modeling
• generative multimodal supervision
There is also interesting research emerging around how multimodal embeddings can be derived from large multimodal models. For example, a recent post from Jina AI (https://lnkd.in/gCYAVtXP) describes bootstrapping audio embeddings from multimodal LLMs by extracting hidden representations from models already trained to jointly process text, audio, and visual inputs. Since transformers internally fuse tokens from different modalities, later-layer representations often already encode cross-modal semantics. A lightweight projection layer can then convert those representations into usable embeddings. This direction highlights how the boundary between generative multimodal models and embedding systems is starting to blur.
One clever component in Gemini Embedding 2 is Matryoshka Representation Learning. Instead of producing a fixed embedding dimensionality, the model learns nested representations that can be truncated:
3072 dimensions
1536 dimensions
768 dimensions
The early dimensions carry the highest semantic density, allowing you to trade off storage, retrieval latency, and recall quality (see the sketch below).
✨ The big implication is for RAG systems. When embeddings become natively multimodal, the retrieval layer stops being document retrieval and becomes knowledge retrieval across media. Your index can contain text chunks, slides, images, voice notes, and video clips, and the system simply retrieves whatever is semantically closest to the query. Embeddings are increasingly becoming the universal interface between raw data and reasoning systems.
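Here is a small NumPy sketch of the two ideas above: cosine similarity in one shared space, and Matryoshka-style truncation. The random vectors stand in for real model outputs (no actual embedding API is shown); only the 3072/1536/768 nesting comes from the post.

```python
import numpy as np

def cosine_sim(a, b):
    # One similarity function works across modalities once every
    # input is embedded into the same shared space.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Stand-ins for unified embeddings of a text query and a video moment.
rng = np.random.default_rng(0)
text_vec = rng.standard_normal(3072)
video_vec = rng.standard_normal(3072)

# Matryoshka truncation: the leading dimensions carry the most semantic
# density, so a prefix of the vector is still a usable embedding.
for dim in (3072, 1536, 768):
    print(dim, round(float(cosine_sim(text_vec[:dim], video_vec[:dim])), 4))
```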
-
A challenge with AI is the division of labor between language-based systems that analyze text and sensor-based systems, like computer vision, that perceive our environment. #Multimodal AI trains algorithms in a fused way that lets us manage complex AI tasks as a single workstream.
Multimodal AI refers to systems capable of processing and integrating multiple types of data, such as text, images, audio, video, and sensor data, to generate comprehensive insights and perform complex tasks. Unlike traditional #AI, which specializes in one modality, multimodal AI combines these capabilities, allowing machines to "see," "hear," "read," and "understand" across various formats simultaneously. For federal leaders, it means AI can operate in environments that mirror the multifaceted, real-world challenges agencies face. For example, it can be used in the aftermath of natural disasters to analyze satellite imagery, combine it with real-time social media data and audio reports from first responders, and rapidly generate actionable maps of affected areas.
One well-known multimodal AI algorithm is Contrastive Language-Image Pre-Training (CLIP), a key algorithm used in generating AI art. CLIP jointly trains on image and text data using two transformer networks, each acting as an encoder. These encoders map the data into a latent space representing the features of the image and text separately. The dataset's class names (e.g., dog, cat, car) form potential text pairings, and CLIP is trained to predict which image-text pairs actually occur in its dataset. The image encoder computes the image's feature representation, while the text encoder embeds the candidate descriptions, effectively acting as a classifier over the visual concepts they name. The key takeaway is that CLIP "jointly trains," fusing two data types into a single training pipeline, unlike unimodal algorithms trained independently (see the sketch below).
Booz Allen is working to identify innovative applications for this technology. For example, we supported the National Institutes of Health (NIH) in developing cancer pain detection models fusing facial imagery, three-dimensional facial landmarks, audio statistics, Mel spectrograms, text embeddings, demographic, and behavioral data. For law enforcement and telemedicine, we created an acoustic #LLM tool enabling automated detection and analysis of multi-speaker conversations. We also published original research on multimodal AI algorithms trained on visible and long-wave infrared imagery for applications in telemedicine and automated driving.
Multimodal AI is no longer a vision of the future; it's a capability ready to address today's challenges. Federal leaders must think strategically about how to leverage this transformative technology to drive their missions forward while ensuring governance frameworks keep pace with innovation.
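For readers who want the mechanics, here is a minimal PyTorch sketch of CLIP-style joint training: a symmetric contrastive loss over the in-batch image-text similarity matrix. It assumes the two encoders have already produced embeddings; this is an illustration of the objective, not CLIP's actual code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs sit on the diagonal."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(len(logits))            # pair i matches pair i
    loss_i2t = F.cross_entropy(logits, targets)    # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy batch: 8 paired image/text embeddings from the two encoders
print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```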
-
Part II of our Bessemer Venture Partners Vertical AI series is out, covering multimodal AI applications across industries.
From voice to vision, AI capabilities have been progressing rapidly across modalities. While text-based generative AI solutions have seen most of the initial uptake, rapid model developments have led to emergent capabilities that make voice- and vision-based Vertical AI solutions increasingly powerful. A new generation of speech-native models like GPT-4o and open-source projects like Kyutai’s Moshi in voice, as well as GPT-4V and the multimodal Gemini 1.5 Pro, which can process raw images and video, in vision, have opened the floodgates to innovation across the application layer.
In our newest roadmap, we’re excited to outline some of the most compelling B2B opportunities we’ve seen across multimodal AI.
In voice, powerful transcription solutions in categories like medicine and field sales have allowed clinicians to focus more on care delivery than note-taking (Abridge) and upleveled the coaching experience for salespeople (Rilla). But with increasingly stronger generative voice and audio capabilities, we’re also seeing high-ROI solutions across everything from home services to recruiting, with AI agents that can take calls, schedule appointments, quote prices, and run interviews.
In vision, solutions range from data extraction across unstructured data in PDFs and images, to augmenting jobs that require visual inspection in industries like residential construction, catching errors in construction drawings, and generating full-scale 3D designs of buildings to refocus the work of engineers on higher-level design work. We’re also seeing increasingly advanced video analytics capabilities that can, for example, monitor video feeds for safety violations in industrial settings and allow teams to take action to ensure compliance.
Multimodal AI solutions are increasingly powerful, with app-layer solutions serving as both agents and co-pilots. Across categories, they are automating tedious tasks and up-leveling employee capabilities, allowing folks to supercharge workflows and focus on higher-impact work. We are just scratching the surface of what’s possible and looking forward to the innovation ahead.
Check out our piece here, and if you are building in multimodal AI please reach out! https://lnkd.in/eqfa89Xw
-
#ALERT Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
This is another huge review of multimodal reasoning models. It is extensive, covering models from the last five years of rapid progress, and it's a must-read.
➡️ LMRMs integrate text, image, audio, and video to support complex reasoning abilities in AI systems operating in diverse environments.
➡️ The field has evolved through a four-stage roadmap: Perception-Driven Modular Reasoning, Language-Centric Short Reasoning (System-1), Language-Centric Long Reasoning (System-2), and Native Large Multimodal Reasoning.
➡️ Early efforts relied on task-specific modules, with reasoning implicitly embedded in processing pipelines.
➡️ Recent advancements leverage Multimodal Large Language Models (MLLMs) and techniques like Multimodal Chain-of-Thought (MCoT) for more explicit and structured reasoning (see the sketch below).
➡️ Significant challenges persist in achieving omni-modal generalization, sufficient reasoning depth, and robust agentic behavior for real-world applications.
This survey provides a structured overview of Large Multimodal Reasoning Models (LMRMs), analyzing their historical development from modular systems to unified, language-centric architectures. The paper highlights the progression through stages of increasing complexity in reasoning capabilities. Despite progress in using MLLMs and explicit reasoning chains like MCoT, current models show limitations in handling all modalities simultaneously and performing the deep, long-horizon reasoning required for complex tasks and adaptive agent interactions.
The work introduces Native Large Multimodal Reasoning Models (N-LMRMs) as a prospective direction. These models aim to move beyond retrofitting capabilities onto language models, focusing instead on inherently integrating omnimodal perception, understanding, generation, and agentic reasoning for enhanced scalability and real-world adaptability.
#LMRMs #MultimodalAI #AIReasoning #FoundationModels
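As a rough illustration of the MCoT pattern, here is a minimal Python sketch of the common two-stage "rationale, then answer" loop. `vlm_generate` is a hypothetical stand-in for whatever vision-language model you call; the survey covers many variants of this idea.

```python
def vlm_generate(image, prompt: str) -> str:
    # Placeholder: swap in a real MLLM call (API client or local model).
    return "(model output)"

def mcot_answer(image, question: str) -> str:
    # Stage 1: elicit an explicit visual rationale, the reasoning chain.
    rationale = vlm_generate(
        image,
        f"Question: {question}\nWalk through the relevant visual evidence step by step.",
    )
    # Stage 2: answer conditioned on the rationale, not just the raw inputs.
    return vlm_generate(
        image,
        f"Question: {question}\nReasoning: {rationale}\nFinal answer:",
    )

print(mcot_answer(image=None, question="How many people are wearing helmets?"))
```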
-
AI exists only because of data, so the quality, depth, format variety, and diversity of data are critical for the future of AI. The Data Provenance Initiative has shared an excellent and useful analysis of the evolving state of data provenance for AI, raising key concerns about linguistic and geographic diversity, IP, ethics, and the rising role of synthetic data. Key insights from their recent paper (link in comments):
🌐 Web and Social Media Dominate Multimodal AI Data. By 2024, over 1 million hours of video and speech data came from internet platforms, with YouTube alone contributing 69% of speech datasets and 71% of video datasets. This shift replaced earlier sources like audiobooks and human-annotated data, which once led the ecosystem. While web data offers scalability and heterogeneity, it is highly prone to privacy breaches, copyright issues, and factual inaccuracies.
🔒 License Mismatches Complicate Data Usage. Although only 25–33% of datasets for text, speech, and video explicitly prohibit commercial use, over 80% of their underlying content carries non-commercial restrictions. For example, nearly 99.8% of tokens in text datasets and 78% of hours in speech datasets derive from restricted sources. This inconsistency means over half of datasets violate the original terms of their source material, creating significant challenges for developers navigating legal and ethical use.
🌍 Geographical Representation Remains Starkly Unequal. Organizations in North America and Europe contribute 93% of text tokens and 60–61% of speech and video hours, leaving Africa and South America at a meager 0.2%. Despite data now spanning 67 countries, this Western-centric dominance has persisted for over a decade, as reflected in Gini coefficients of 0.92 for text and 0.86 for speech (higher means more concentrated), showing minimal diversity among creators (see the sketch below for how a Gini coefficient is computed).
🗣️ Linguistic Diversity Grows, But Gaps Persist. The number of languages represented in datasets has climbed to over 600, covering 37 linguistic families. Major projects like xP3x and Common Voice drove this growth in 2019, contributing to an influx of underrepresented languages. Yet concentration remains high, with a Gini coefficient near 0.8 for both text and speech, indicating dominance by a few widely spoken languages at the expense of smaller ones.
📹 Synthetic Data Rising Rapidly in Popularity. Synthetic datasets now form 10% of the scale of web-based encyclopedia sources, with examples like the VidProm dataset introducing 7 million synthetic videos in 2024 alone. In text, synthetic datasets average 1,756 tokens per instance compared to 1,065 in natural datasets, emphasizing their use in long-form and creative generation tasks. Despite their potential, synthetic data carries risks related to reduced authenticity and overfitting to generative trends.
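For readers unfamiliar with the metric, here is a tiny, self-contained sketch of how a Gini coefficient is computed. The regional shares are invented for illustration; only the roughly 0.9 ballpark mirrors the paper's findings.

```python
import numpy as np

def gini(shares):
    """Gini coefficient: 0 = perfectly even contribution, 1 = one contributor dominates."""
    x = np.sort(np.asarray(shares, dtype=float))  # ascending order
    n = len(x)
    ranks = np.arange(1, n + 1)
    # Standard rank-weighted formula over the sorted distribution
    return 2 * np.sum(ranks * x) / (n * np.sum(x)) - (n + 1) / n

# Hypothetical token shares from 10 regions, one region dominating
shares = [0.91, 0.03, 0.02, 0.01, 0.01, 0.005, 0.005, 0.005, 0.003, 0.002]
print(round(gini(shares), 2))  # ~0.85, i.e., highly concentrated
```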
-
🧠 Why the Next Wave of Health AI Won’t Be Single-Model
Multimodal AI combines multiple clinical data streams (text, imaging, labs, and physiological signals) to support decisions in a way that reflects real-world medical reasoning, rather than relying on a single isolated input.
Why this matters in healthcare:
-> It aligns with clinical reality. Clinicians synthesize heterogeneous signals. AI systems that do the same are better positioned to support complex decisions.
-> Accuracy with higher generalization. Multimodal models can outperform single-source AI, but only when validated across populations, settings, and workflows.
-> How data are fused shapes what the model learns. Early, joint, and late fusion strategies directly affect robustness, interpretability, and clinical usability (see the sketch below).
-> It demands stronger governance. Integrating modalities amplifies challenges around bias, explainability, liability, privacy, and regulation, making governance foundational, not optional.
Ref: Azarfar et al. Responsible adoption of multimodal artificial intelligence in health care: promises and challenges. Lancet Digit Health 2025
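A minimal PyTorch sketch of the fusion distinction, with invented feature dimensions (none of this comes from the cited paper): early fusion concatenates features before one shared model, late fusion combines per-modality predictions, and joint (intermediate) fusion would merge learned embeddings partway through.

```python
import torch
import torch.nn as nn

# Toy per-patient features: imaging (128-dim) and labs (32-dim); sizes are illustrative.
imaging, labs = torch.randn(4, 128), torch.randn(4, 32)

# Early fusion: concatenate raw features, then learn a single joint model.
early = nn.Sequential(nn.Linear(128 + 32, 64), nn.ReLU(), nn.Linear(64, 1))
risk_early = early(torch.cat([imaging, labs], dim=-1))

# Late fusion: independent per-modality models, then combine their outputs.
img_head, lab_head = nn.Linear(128, 1), nn.Linear(32, 1)
risk_late = (img_head(imaging) + lab_head(labs)) / 2

print(risk_early.shape, risk_late.shape)  # both torch.Size([4, 1])
```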
-
Multimodal AI is shaping a shift in healthcare by combining different kinds of patient data to improve care across diagnostics, treatment, and monitoring.
1️⃣ It links data from imaging, wearables, clinical notes, genomics, and more to create a fuller picture of patient health.
2️⃣ Imaging, physiological signals, and clinical notes are the most commonly used data types, especially in oncology, cardiovascular, and neurological disorders.
3️⃣ Intermediate fusion is the most used integration method, combining data at the feature level for a better balance between complexity and interpretability.
4️⃣ These systems enable early diagnosis, prognosis, treatment planning, and real-time monitoring, with growing applications in areas like digital twins and automated reporting.
5️⃣ Personalized medicine is a major driver, with multimodal models supporting tailored treatment decisions by analyzing combined molecular, physiological, and behavioral data.
6️⃣ Despite progress, challenges remain: data heterogeneity, privacy concerns, lack of benchmarks, and regulatory constraints slow adoption.
7️⃣ Explainability is key for clinical trust. Emerging models include attention maps, concept attribution, and human-in-the-loop feedback for better transparency.
8️⃣ Energy demands of training large models have sparked interest in "green AI", focusing on efficiency and scalability in clinical settings.
9️⃣ Future systems may rely more on self-supervised and federated learning to handle data gaps and maintain privacy across institutions.
🔟 Clinical validation and regulatory reform are needed for multimodal systems to move from labs into widespread practice.
✍🏻 Florenc Demrozi, Mina Farmanbar, Kjersti Engan. Multimodal AI for Next-Generation Healthcare: Data Domains, Algorithms, Challenges, and Future Perspectives. Current Opinion in Biomedical Engineering. 2025. DOI: 10.1016/j.cobme.2025.100632 (pre-proof)