🧐 Multimodal with just fine-tuning? ByteDance makes it happen with VoRA.

The new Vision as LoRA (VoRA) paper introduces a bold, streamlined approach to building multimodal LLMs: no vision encoder, no connector, no architecture bloat. Just LoRA.

Instead of bolting a vision tower onto an LLM, VoRA injects vision understanding directly into the LLM using Low-Rank Adaptation (LoRA). That means:
🔹 No new inference overhead: LoRA layers are merged into the LLM after training.
🔹 Frozen base LLM: only LoRA + visual embeddings (~6M params) are trained, preserving language ability and ensuring stability.
🔹 Image inputs at native resolution: no resizing, no tiling hacks; VoRA leverages the LLM's flexible token handling.
🔹 Bidirectional attention for vision: instead of applying causal masks across all tokens, VoRA lets vision tokens attend to each other freely, boosting context modeling.

To teach the LLM visual features:
🔹 VoRA uses block-wise distillation from a pretrained ViT, aligning intermediate hidden states across layers. This improves visual alignment while keeping the LLM's core untouched.
🔹 The training objective combines a distillation loss (cosine similarity between ViT and LLM visual features) with the standard language-modeling loss over image-caption pairs.

What does this get you?
🔹 A modality-agnostic architecture ready for extension to audio, point clouds, and beyond.

This might be one of the most efficient takes yet on vision-language modeling. Excited to see how this evolves.

📄 Paper + Code: https://lnkd.in/gwZr33Vj

Follow Aman Chadha and me for more updates!
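To make the training objective concrete, here is a minimal, unofficial PyTorch sketch of the kind of loss the post describes: a block-wise cosine-similarity distillation term between ViT and LLM hidden states at the vision-token positions, plus the standard language-modeling loss. The tensor shapes, the distill_weight knob, and the assumption that teacher states are already projected to the LLM hidden size are mine, not from the paper.

```python
# Illustrative sketch of a VoRA-style objective (not the official code).
# Assumes the frozen ViT "teacher" states have already been projected to the
# LoRA-adapted LLM "student" hidden size, so shapes match per aligned layer.
import torch
import torch.nn.functional as F

def vora_style_loss(llm_vision_states, vit_states, lm_logits, labels, distill_weight=1.0):
    """llm_vision_states / vit_states: lists of [batch, n_vis_tokens, dim] tensors,
    one per aligned layer; lm_logits: [batch, seq, vocab]; labels: [batch, seq]."""
    # Block-wise distillation: 1 - cosine similarity, averaged over aligned layers.
    distill = 0.0
    for student, teacher in zip(llm_vision_states, vit_states):
        distill = distill + (1 - F.cosine_similarity(student, teacher.detach(), dim=-1)).mean()
    distill = distill / len(llm_vision_states)

    # Standard next-token language-modeling loss over the caption tokens.
    lm = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    return lm + distill_weight * distill
```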
-
Most people still think of LLMs as "just a model." But if you've ever shipped one in production, you know it's not that simple.

Behind every performant LLM system there's a stack of decisions about pretraining, fine-tuning, inference, evaluation, and application-specific tradeoffs. This diagram captures it well: LLMs aren't one-dimensional. They're systems. And each dimension introduces new failure points or optimization levers.

Let's break it down:

🧠 Pre-Training
Start with modality.
→ Text-only models like LLaMA, UL2, and PaLM have predictable inductive biases.
→ Multimodal ones like GPT-4, Gemini, and LaVIN introduce more complex token fusion, grounding challenges, and cross-modal alignment issues.
Understanding the data diet matters just as much as parameter count.

🛠 Fine-Tuning
This is where most teams underestimate complexity:
→ PEFT strategies like LoRA and Prefix Tuning help with parameter efficiency, but can behave differently under distribution shift.
→ Alignment techniques (RLHF, DPO, RAFT) aren't interchangeable. They encode different human preference priors.
→ Quantization and pruning decisions will directly impact latency, memory usage, and downstream behavior.

⚡️ Efficiency
Inference optimization is still underexplored. Techniques like dynamic prompt caching, paged attention, speculative decoding, and batch streaming make the difference between real-time and unusable. The infra layer is where GenAI products often break.

📏 Evaluation
One benchmark doesn't cut it. You need a full matrix:
→ NLG (summarization, completion) and NLU (classification, reasoning),
→ alignment tests (honesty, helpfulness, safety),
→ dataset quality, and
→ cost breakdowns across training + inference + memory.
Evaluation isn't just a model task; it's a systems-level concern.

🧾 Inference & Prompting
Multi-turn prompts, CoT, ToT, and ICL all behave differently under different sampling strategies and context lengths. Prompting isn't trivial anymore. It's an orchestration layer in itself.

Whether you're building for legal, education, robotics, or finance, the "general-purpose" tag doesn't hold. Every domain has its own retrieval, grounding, and reasoning constraints.

-------
Follow me (Aishwarya Srinivasan) for more AI insight and subscribe to my Substack for more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg
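As one concrete example from the efficiency layer, here is a rough sketch of dynamic prompt (prefix) caching: reuse the KV state computed for a shared system prompt instead of re-running prefill on every request. The model.prefill and model.decode interfaces are hypothetical stand-ins, not any particular serving framework's API.

```python
# Minimal sketch of prompt-prefix caching. The model/tokenizer objects are
# hypothetical placeholders; real servers (e.g. with paged attention) manage
# this at the KV-block level, but the caching idea is the same.
from collections import OrderedDict

class PrefixKVCache:
    """LRU cache mapping a shared prompt prefix (e.g. a system prompt) to the
    key/value state computed for it, so repeated requests skip that prefill."""
    def __init__(self, max_entries=32):
        self.store = OrderedDict()
        self.max_entries = max_entries

    def get(self, prefix_tokens):
        key = tuple(prefix_tokens)
        if key in self.store:
            self.store.move_to_end(key)          # mark as recently used
            return self.store[key]
        return None

    def put(self, prefix_tokens, kv_state):
        key = tuple(prefix_tokens)
        self.store[key] = kv_state
        self.store.move_to_end(key)
        if len(self.store) > self.max_entries:   # evict least recently used
            self.store.popitem(last=False)

def generate(model, tokenizer, system_prompt, user_prompt, cache):
    prefix_ids = tokenizer.encode(system_prompt)
    kv = cache.get(prefix_ids)
    if kv is None:
        kv = model.prefill(prefix_ids)           # hypothetical: compute KV for the prefix once
        cache.put(prefix_ids, kv)
    return model.decode(tokenizer.encode(user_prompt), past_key_values=kv)
```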
-
✨ Multimodal AI in Radiology: Pushing the Boundaries of AI in Radiology ✨

💡 Artificial intelligence (AI) in radiology is evolving, and multimodal AI is at the forefront. This is a nice overview of the landscape of multimodal AI in radiology research by Amara Tariq, Imon Banerjee, Hari Trivedi, and Judy Gichoya in The British Institute of Radiology. It is a recommended read for those interested in multimodal AI, including vision-language models. 👍

🔍 Why Multimodal AI?
🔹 Single-modality limitations: AI models trained on a single data type (e.g., head CTs) can have limited utility in real-world clinical settings. Radiologists, for example, rely on multiple information sources.
🔹 Clinical context matters: without context, AI models may flag irrelevant findings, leading to unnecessary workflow disruptions. "Building single modality models without clinical context (available from multimodal data) ultimately results in impractical models with limited clinical utility."
🔹 Advancements in fusion techniques enable the integration of imaging, lab results, and clinical notes to mirror real-life decision-making.

🧪 How Does It Work? Fusion Methods Explained
🔹 Traditional fusion models: combine data at different stages (early, late, or joint fusion). This approach struggles with missing data and is prone to overfitting (early and joint).
🔹 Graph-based fusion models: use graph convolutional networks (GCNs) to model implicit relationships between patients or samples based on clinical similarity, improving generalizability and handling of missing data, but facing explainability challenges.
🔹 Vision-language models (VLMs): leverage transformer-based architectures to process images and text together, showing promise in tasks like radiology report generation but requiring massive training datasets.

🔧 Challenges & Ethical Considerations
🔹 Bias and transparency: AI models can unintentionally reinforce historical biases.
🔹 Generalizability: models trained on structured clinical datasets may struggle with diverse patient populations ("out-of-distribution datasets").

🌐 The Future of Multimodal AI in Radiology
✅ Benchmark datasets must be developed for robust evaluation.
✅ Ethical concerns must be addressed to ensure fair, explainable, and patient-centered AI solutions.
✅ Collaborative efforts between radiologists and AI developers are essential for creating clinically relevant models.

🔗 Link to the original open-access article is in the first comment 👇

#AI #MultimodalAI #LMMs #VLMs #GCNs #GenAI #Radiology #RadiologyAI
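To make the early-vs-late distinction concrete, here is a minimal PyTorch sketch contrasting the two: early fusion concatenates modality features once and classifies jointly, while late fusion trains per-modality heads and combines their outputs. Encoders, feature dimensions, and class counts are placeholders of mine, not from the article.

```python
# Minimal sketch contrasting early and late fusion for imaging + clinical data.
# Upstream encoders (e.g. a CNN for CT slices, an embedder for lab values) are
# assumed to have already produced the feature vectors passed in here.
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate modality features once, then classify jointly."""
    def __init__(self, img_dim=512, clin_dim=32, n_classes=2):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(img_dim + clin_dim, 128), nn.ReLU(),
                                  nn.Linear(128, n_classes))

    def forward(self, img_feat, clin_feat):
        return self.head(torch.cat([img_feat, clin_feat], dim=-1))

class LateFusion(nn.Module):
    """Each modality gets its own predictor; logits are averaged at the end."""
    def __init__(self, img_dim=512, clin_dim=32, n_classes=2):
        super().__init__()
        self.img_head = nn.Linear(img_dim, n_classes)
        self.clin_head = nn.Linear(clin_dim, n_classes)

    def forward(self, img_feat, clin_feat):
        return (self.img_head(img_feat) + self.clin_head(clin_feat)) / 2
```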
-
The best design will soon be invisible.

Interfaces used to ask: what do you need? Agentic AI flips it to: what have I already handled?

The surface shrinks. We'll gesture less and grant more permission. Location, calendar, biometrics, preference history: these signals replace tap-and-type. The UI only shows up when confidence drops and the agent needs clarity.

The foreground becomes explanation: "Here's what I did, veto if wrong." The background is silent execution.

Multimodal stops being a demo trick. Voice for speed. Text for precision. Glanceable cards for audit. Users glide across modes instead of switching apps.

Design shifts from fetching tasks to negotiating autonomy. Micro-copy matters more than motion. Reversible actions matter more than dark-mode flair. If an agent moves money or publishes words, it owes the user a trail they can scan in seconds.

Whoever makes invisible work feel trustworthy has the edge. Build the layer that hides the work and surfaces the proof.

Boundless Ventures
-
Atmanity is focusing on a very interesting area in conversational AI: the subtle art of knowing when to speak versus when to stay silent. Their latest research addresses a fundamental challenge that current voice AI systems struggle with: natural turn-taking in human-computer conversations.

The research reveals that effective multimodal conversation requires sophisticated understanding of contextual cues beyond just speech patterns, including visual signals, emotional states, and conversation dynamics. Traditional rule-based approaches to conversation management fall short when dealing with the nuanced timing of real human interaction.

Their findings suggest that mastering these conversational protocols is critical for voice AI deployment success. Systems that can appropriately gauge when to respond, when to wait, and when to acknowledge without speaking create significantly more natural user experiences than those focused purely on speech recognition accuracy.

This work highlights a fundamental gap between current voice AI capabilities and human conversational expectations, one that could determine which systems succeed in real-world applications.

#ConversationalAI #VoiceAI #MultimodalAI
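As a thought experiment on what such a turn-taking policy might combine, here is a hypothetical sketch that fuses a few multimodal cues into a respond / acknowledge / wait decision. The features, weights, and thresholds are invented for illustration and are not from Atmanity's research.

```python
# Hypothetical sketch: fuse multimodal cues into a turn-taking decision,
# illustrating why speech recognition accuracy alone is not enough.
from dataclasses import dataclass

@dataclass
class TurnCues:
    silence_ms: float         # pause since the user's last word
    pitch_falling: bool       # terminal prosody often signals turn end
    gaze_at_agent: bool       # visual cue: user looking toward the device/avatar
    utterance_complete: bool  # syntactic/semantic completeness from ASR + NLU

def should_speak(cues: TurnCues) -> str:
    # Weighted vote over cues; all numbers here are illustrative, not tuned.
    score = 0.0
    score += min(cues.silence_ms / 700.0, 1.0) * 0.4
    score += 0.2 if cues.pitch_falling else 0.0
    score += 0.2 if cues.gaze_at_agent else 0.0
    score += 0.2 if cues.utterance_complete else 0.0
    if score > 0.7:
        return "respond"
    if score > 0.4:
        return "backchannel"  # brief acknowledgement without taking the turn
    return "wait"
```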
-
Exciting New Research on Multimodal Retrieval-Augmented Generation (MRAG)!

I just finished reading a fascinating survey paper on Multimodal Retrieval-Augmented Generation (MRAG) from researchers at Huawei Cloud. This cutting-edge technology represents a significant advancement in enhancing large language models by integrating multimodal data like text, images, and videos into both retrieval and generation processes.

Traditional Retrieval-Augmented Generation (RAG) systems rely primarily on textual data, which limits their ability to leverage the rich contextual information available in multimodal sources. MRAG addresses this limitation by extending the RAG framework to include multimodal retrieval and generation, enabling more comprehensive and contextually relevant responses.

The paper outlines the evolution of MRAG through three distinct stages:

>> MRAG1.0 ("Pseudo-MRAG")
This initial stage extended RAG by converting multimodal data into textual representations. The architecture consisted of three key components:
- Document Parsing and Indexing: processing multimodal documents using OCR and specialized models to generate captions for images and videos
- Retrieval: using vector embeddings to find relevant information
- Generation: synthesizing responses using LLMs
While effective, this approach suffered from information loss during modality conversion and from retrieval bottlenecks.

>> MRAG2.0 ("True Multimodal")
This stage preserved original multimodal data within the knowledge base and leveraged Multimodal Large Language Models (MLLMs) for direct processing. Key improvements included:
- Using unified MLLMs for captioning instead of separate models
- Supporting cross-modal retrieval to minimize data loss
- Employing MLLMs for generation to directly process multimodal inputs

>> MRAG3.0 (Advanced Integration)
The latest evolution introduces:
- Enhanced document parsing that retains document screenshots to minimize information loss
- Multimodal search planning that optimizes retrieval strategies through retrieval classification and query reformulation
- Multimodal output capabilities that combine text with images, videos, or other modalities in responses

The technical architecture includes sophisticated components like multimodal retrievers (using single/dual-stream and generative structures), rerankers (fine-tuning- or prompting-based), and refiners (hard or soft prompt methods) to optimize the information flow.

What's particularly impressive is how MRAG outperforms traditional text-only RAG in scenarios where both visual and textual information are critical for understanding and responding to queries. The researchers systematically analyze essential components, datasets, evaluation methods, and current limitations to provide a comprehensive understanding of this promising paradigm.
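For a feel of the MRAG1.0 flow described above, and where its information loss creeps in, here is a rough Python sketch. Every helper (caption_model, embed_text, index, llm) is a hypothetical placeholder rather than an API from the survey.

```python
# Rough sketch of the MRAG1.0-style pipeline: caption, index, retrieve, generate.
# All objects passed in are hypothetical placeholders with the obvious interfaces.

def build_index(documents, caption_model, embed_text, index):
    """Convert every modality to text, then index it as ordinary text chunks."""
    for doc in documents:
        for chunk in doc.text_chunks:
            index.add(embed_text(chunk), payload=chunk)
        for image in doc.images:
            caption = caption_model(image)                   # modality conversion: image -> text
            index.add(embed_text(caption), payload=caption)  # information loss happens here
    return index

def answer(query, index, embed_text, llm, k=5):
    """Text-only retrieval over the converted corpus, then LLM generation."""
    hits = index.search(embed_text(query), top_k=k)
    context = "\n".join(hit.payload for hit in hits)
    return llm(f"Context:\n{context}\n\nQuestion: {query}")
```

MRAG2.0 and 3.0 differ mainly in what this sketch throws away: they keep the original images in the knowledge base, retrieve across modalities directly, and let an MLLM consume them at generation time.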
-
Most multimodal QA systems fail in the same place. Not at perception. Not at language. But at how evidence is retrieved and constrained before generation.

This paper on Pythia-RAG makes that failure mode very clear. Even strong vision-language models still rely on captions and flat retrieval. In dense scenes, that silently drops relations. The model may see the objects, but it never commits to who is doing what to whom, so those relations disappear before reasoning even starts.

What's different here is that relations are treated as first-class structure rather than something the model is expected to infer implicitly. Textual relations are extracted as explicit triplets, visual relations are extracted directly from images using relation-aware detection, and both are unified into a single multimodal knowledge graph that is further grounded with external common-sense knowledge.

Retrieval is also structural, not similarity-based. Instead of pulling isolated facts, the system retrieves a query-guided subgraph using a graph algorithm that explicitly optimizes for relevance and cohesion. That subgraph is then encoded in two complementary ways: structurally through a graph encoder and textually through an LLM, while the associated image is encoded in parallel. These representations are fused with attention and only then passed to generation.

Hallucinations don't drop here because the language model is more careful or better prompted. They drop because generation is no longer free-form. The model is forced to operate over a constrained, relationally coherent slice of the multimodal graph.

The larger takeaway is subtle but important. Multimodal reasoning doesn't fail because models lack modalities. It fails because relations are implicit, retrieval is flat, and structure is introduced too late. Once relations are explicit and retrieval preserves topology, generation becomes a consequence of reasoning rather than a guess.

That feels like a meaningful shift for multimodal RAG, especially in settings where confident answers without grounding are the real failure mode.

#MultimodalAI #RetrievalAugmentedGeneration #KnowledgeGraphs #GraphReasoning #MultimodalQA #AIResearch #LLMs #VisionLanguage #StructuredReasoning #GraphNeuralNetworks #TrustworthyAI
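As an illustration of what "structural, not similarity-based" retrieval can look like, here is a toy networkx sketch that seeds on query entities, expands a bounded neighborhood, and keeps only a connected evidence slice. It is not Pythia's algorithm; entity linking and the relevance/cohesion scoring are simplified placeholders.

```python
# Toy illustration of subgraph retrieval over a multimodal knowledge graph.
# Nodes are entities (from text or detections); edge data holds the relation.
import networkx as nx

def retrieve_subgraph(graph: nx.Graph, query_entities, max_hops=2):
    # Seed with nodes the query mentions, then expand a bounded neighborhood,
    # so retrieved facts stay connected (cohesive) rather than isolated.
    seeds = [n for n in graph.nodes if n in query_entities]
    keep = set(seeds)
    frontier = set(seeds)
    for _ in range(max_hops):
        frontier = {nbr for node in frontier for nbr in graph.neighbors(node)} - keep
        keep |= frontier
    sub = graph.subgraph(keep).copy()
    # Keep only the largest connected component as the cohesive evidence slice.
    if sub.number_of_nodes():
        largest = max(nx.connected_components(sub), key=len)
        sub = sub.subgraph(largest).copy()
    # Downstream, the triplets could be read off as:
    #   [(u, d.get("relation"), v) for u, v, d in sub.edges(data=True)]
    return sub
```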
-
Over the past few months, I've noticed a pattern in our system design conversations: they increasingly orbit around audio and video, how we capture them, process them, and extract meaning from them.

This isn't just a technical curiosity. It signals a tectonic shift in interface design.

For decades, our interaction models have been built on clickstreams: tapping, typing, selecting from dropdowns, navigating menus. Interfaces were essentially structured bottlenecks, forcing human intent into machine-readable clicks and keystrokes.

But multimodal AI removes that bottleneck. Machines can now parse voice, gesture, gaze, or even the messy richness of a video feed. That means the "atomic unit" of interaction may be moving away from clicks and text inputs toward speech, motion, and visual context.

Imagine a world where the UI is stripped to its essence: a microphone and a camera. Everything else (navigation, search, configuration) flows from natural human expression. Instead of learning the logic of software, software learns the logic of people.

If this plays out, the implications are profound:
- UX shifts from layouts to behaviors: designers move from arranging buttons to choreographing multimodal dialogues.
- Accessibility and inclusion take center stage: voice and vision can open doors, but also risk excluding unless designed with empathy.
- Trust and control must be redefined: a camera-first interface is powerful, but also deeply personal. How do we make it feel safe, not invasive?

We may be on the cusp of the first truly post-GUI era, where screens become less about control surfaces and more about feedback canvases, reflecting back what the system has understood from us.
-
Multimodal CX refers to a customer experience strategy that integrates multiple modes of interaction (text, voice, video, touch, gesture, and even image recognition) into a seamless, unified customer journey. It's about letting customers engage with a brand through whatever combination of channels or inputs feels most natural to them, and ensuring those modes work together intelligently.

🔍 Example (The Seamless Journey)
Imagine a traveler interacting with an airline:
• They speak to a voice assistant to change a flight.
• They text a chatbot to confirm luggage options.
• They tap through the app to choose a seat.
All three interactions are connected, context-aware, and synchronized, so the system "remembers" what the customer already said or did, regardless of mode.

💡 Why It Matters
Multimodal CX is the next evolution beyond omnichannel:
• Omnichannel = consistent brand experience across channels.
• Multimodal = fluid experience across input types (voice, text, image, etc.), powered by AI.
It's especially relevant now because AI and large multimodal models (like GPT-5) can process text, voice, and visual data together, making it possible for brands to build truly conversational, intuitive customer experiences.

🚀 In Practice
This is happening today:
• Retail: scan an item, ask questions via voice, and get personalized styling advice via chat.
• Travel: show a photo of damaged luggage to an airline chatbot and get compensation automatically.
• Banking: use facial ID, voice commands, and chat messaging in the same secure flow.

What new revenue streams are unlocked when your CX can see, hear, and read?

#MultimodalCX #CustomerExperience #CXStrategy #AI #DigitalTransformation #ThoughtLeadership
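One way to picture the "remembers regardless of mode" behavior is a single journey state that every channel reads and writes; a minimal sketch follows, with all names and fields invented for illustration.

```python
# Hypothetical sketch of shared context across modes: voice, chat, and app
# all update one journey state keyed by customer, so later channels see
# what earlier ones already handled.
from dataclasses import dataclass, field

@dataclass
class JourneyState:
    customer_id: str
    facts: dict = field(default_factory=dict)   # e.g. {"flight": "UA123", "bags": 2}
    events: list = field(default_factory=list)  # (channel, action) history

class JourneyStore:
    def __init__(self):
        self._states = {}

    def update(self, customer_id, channel, action, **facts):
        state = self._states.setdefault(customer_id, JourneyState(customer_id))
        state.events.append((channel, action))
        state.facts.update(facts)               # later channels see earlier context
        return state

store = JourneyStore()
store.update("cust-42", "voice", "change_flight", flight="UA123")
store.update("cust-42", "chat", "confirm_luggage", bags=2)
state = store.update("cust-42", "app", "choose_seat", seat="14C")
print(state.facts)  # {'flight': 'UA123', 'bags': 2, 'seat': '14C'}
```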
-
Imagine trying to get a workout recommendation while running, navigate a complex route while driving, or get tech support while cooking, all without touching a screen. This is the promise of voice-enabled LLM agents, a technological leap that's redefining how we interact with machines.

Traditional text-based chatbots are like trying to dance with two left feet. They're clunky, impersonal, and frustratingly limited. Consider these real-world friction points:
- A visually impaired user struggling to type support queries
- A fitness enthusiast unable to get real-time guidance mid-workout
- A busy professional multitasking who can't pause to type a complex question

Voice AI breaks these barriers, mimicking how humans have communicated for millennia. We learn to speak by four months, but writing takes years: testament to speech's fundamental naturalness.

Real-World Transformation Examples:
1️⃣ Healthcare: emotion-recognizing AI can detect patient stress levels through voice modulation, enabling more empathetic remote consultations.
2️⃣ Fitness: hands-free coaching that adapts workout intensity based on your breathing and vocal energy.
3️⃣ Customer Service: intelligent voice systems that understand context and emotional undertones, and personalize responses in real time.

The magic of voice lies in its nuanced communication:
- Tone reveals emotional landscapes
- Intensity signals urgency or excitement
- Rhythm creates conversational flow
- Inflection adds layers of meaning beyond mere words

Modern voice-enabled agents can:
- Recognize emotional states with unprecedented accuracy
- Support rich, multimodal interactions combining voice, visuals, and context
- Differentiate speakers in complex conversations
- Extract subtle contextual intentions
- Provide personalized responses based on voice characteristics

In short, this technology is about creating more human-centric technology that listens, understands, and responds like a thoughtful companion. The future of AI isn't about machines talking at us, but talking with us.