Exciting New Research on Multimodal Retrieval-Augmented Generation (MRAG)!

I just finished reading a fascinating survey paper on Multimodal Retrieval-Augmented Generation (MRAG) from researchers at Huawei Cloud. This technology represents a significant advancement in enhancing large language models by integrating multimodal data like text, images, and videos into both retrieval and generation processes.

Traditional Retrieval-Augmented Generation (RAG) systems rely primarily on textual data, which limits their ability to leverage the rich contextual information available in multimodal sources. MRAG addresses this limitation by extending the RAG framework to include multimodal retrieval and generation, enabling more comprehensive and contextually relevant responses.

The paper outlines the evolution of MRAG through three distinct stages:

>> MRAG1.0 ("Pseudo-MRAG")
This initial stage extended RAG by converting multimodal data into textual representations. The architecture consisted of three key components:
- Document Parsing and Indexing: processing multimodal documents using OCR and specialized models to generate captions for images and videos
- Retrieval: using vector embeddings to find relevant information
- Generation: synthesizing responses using LLMs
While effective, this approach suffered from information loss during modality conversion and from retrieval bottlenecks.

>> MRAG2.0 ("True Multimodal")
This stage preserved original multimodal data within the knowledge base and leveraged Multimodal Large Language Models (MLLMs) for direct processing. Key improvements included:
- Using unified MLLMs for captioning instead of separate models
- Supporting cross-modal retrieval to minimize data loss
- Employing MLLMs for generation to directly process multimodal inputs

>> MRAG3.0 (Advanced Integration)
The latest evolution introduces:
- Enhanced document parsing that retains document screenshots to minimize information loss
- Multimodal search planning that optimizes retrieval strategies through retrieval classification and query reformulation
- Multimodal output capabilities that combine text with images, videos, or other modalities in responses

The technical architecture includes sophisticated components like multimodal retrievers (using single/dual-stream and generative structures), rerankers (fine-tuning or prompting based), and refiners (hard or soft prompt methods) to optimize the information flow.

What's particularly impressive is how MRAG outperforms traditional text-only RAG in scenarios where both visual and textual information are critical for understanding and responding to queries. The researchers systematically analyze essential components, datasets, evaluation methods, and current limitations to provide a comprehensive understanding of this promising paradigm.
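To make the MRAG1.0 flow concrete, here is a minimal sketch of the parse, caption, index, retrieve, generate loop. Every name in it (the Chunk container, the ocr/captioner/embedder/llm callables) is an illustrative placeholder of mine, not an API from the survey:

```python
# Minimal sketch of the MRAG1.0 ("pseudo-MRAG") pipeline described above.
# Every name here is an illustrative placeholder, not an API from the survey.
import math
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str        # OCR'd text or a generated image/video caption
    embedding: list  # vector used for retrieval

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def index_documents(documents, ocr, captioner, embedder):
    """Stage 1: convert every modality to text, then embed it."""
    chunks = []
    for doc in documents:
        for page in doc.pages:
            text = ocr(page)
            chunks.append(Chunk(text, embedder(text)))
        for image in doc.images:
            caption = captioner(image)  # lossy modality conversion
            chunks.append(Chunk(caption, embedder(caption)))
    return chunks

def answer(query, chunks, embedder, llm, k=5):
    """Stages 2-3: retrieve top-k chunks by similarity, then generate."""
    q = embedder(query)
    top = sorted(chunks, key=lambda c: cosine(q, c.embedding), reverse=True)[:k]
    context = "\n".join(c.text for c in top)
    return llm(f"Context:\n{context}\n\nQuestion: {query}")
```

The captioning call is exactly the lossy modality-conversion step the paper criticizes: whatever the captioner fails to verbalize never reaches retrieval, which is what MRAG2.0 and 3.0 set out to fix.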
Improving Multimodal Model Performance
Summary
Improving multimodal model performance means making AI systems better at understanding and generating information that comes from different types of data—like text, images, audio, and video—at the same time. Advances in this area help models reason more accurately, align their understanding across formats, and do more with less computing power.
- Prioritize cross-modal reasoning: Design your models to explicitly connect relationships between different data types instead of relying on the model to guess these links, which strengthens the quality of answers and reduces errors.
- Tailor training for your data: Fine-tuning models on your own domain data, rather than only using large generic models, often brings stronger results and can even allow you to use smaller, more efficient models.
- Use efficient architectures: Explore sparse or parallel processing techniques that keep the model's quality high while lowering the amount of computing resources required for training and running your models.
-
Most multimodal QA systems fail in the same place. Not at perception. Not at language. But at how evidence is retrieved and constrained before generation.

This paper on Pythia-RAG makes that failure mode very clear. Even strong vision-language models still rely on captions and flat retrieval. In dense scenes, that silently drops relations. The model may see the objects, but it never commits to who is doing what to whom, so those relations disappear before reasoning even starts.

What's different here is that relations are treated as first-class structure rather than something the model is expected to infer implicitly. Textual relations are extracted as explicit triplets, visual relations are extracted directly from images using relation-aware detection, and both are unified into a single multimodal knowledge graph that is further grounded with external common-sense knowledge.

Retrieval is also structural, not similarity-based. Instead of pulling isolated facts, the system retrieves a query-guided subgraph using a graph algorithm that explicitly optimizes for relevance and cohesion. That subgraph is then encoded in two complementary ways: structurally through a graph encoder and textually through an LLM, while the associated image is encoded in parallel. These representations are fused with attention and only then passed to generation.

Hallucinations don't drop here because the language model is more careful or better prompted. They drop because generation is no longer free-form. The model is forced to operate over a constrained, relationally coherent slice of the multimodal graph.

The larger takeaway is subtle but important. Multimodal reasoning doesn't fail because models lack modalities. It fails because relations are implicit, retrieval is flat, and structure is introduced too late. Once relations are explicit and retrieval preserves topology, generation becomes a consequence of reasoning rather than a guess. That feels like a meaningful shift for multimodal RAG, especially in settings where confident answers without grounding are the real failure mode.

#MultimodalAI #RetrievalAugmentedGeneration #KnowledgeGraphs #GraphReasoning #MultimodalQA #AIResearch #LLMs #VisionLanguage #StructuredReasoning #GraphNeuralNetworks #TrustworthyAI
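For intuition, here is a rough sketch of what query-guided subgraph retrieval could look like. The greedy relevance-plus-cohesion expansion below is my own stand-in, assuming a networkx graph and precomputed node relevance scores; the paper's actual algorithm may differ:

```python
# Illustrative sketch of query-guided subgraph retrieval over a multimodal
# knowledge graph. This greedy expansion is a stand-in for the paper's
# relevance-and-cohesion objective, not its actual algorithm.
import networkx as nx

def retrieve_subgraph(graph: nx.Graph, query_scores: dict, k: int = 10) -> nx.Graph:
    """Greedily grow a connected subgraph around the nodes most relevant to the query.

    query_scores maps each node to a query-relevance score
    (e.g. embedding similarity between the query and the node's label).
    """
    seed = max(query_scores, key=query_scores.get)
    selected = {seed}
    while len(selected) < k:
        # Frontier: neighbors of the current subgraph (keeps the result connected).
        frontier = {n for s in selected for n in graph.neighbors(s)} - selected
        if not frontier:
            break
        # Pick the frontier node that best trades off query relevance against
        # cohesion (here: number of edges back into the selected set).
        best = max(frontier, key=lambda n: query_scores.get(n, 0.0)
                   + 0.1 * sum(1 for m in graph.neighbors(n) if m in selected))
        selected.add(best)
    return graph.subgraph(selected).copy()
```

The point of the cohesion term is that the retrieved evidence stays connected, so the relations between facts, not just the facts themselves, survive into generation.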
-
ByteDance, in collaboration with researchers from Peking University, Princeton, and other institutions, has unveiled a significant advancement in multimodal AI on Hugging Face! They've introduced MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation.

This work addresses a critical issue where current "thinking-aware" models can ironically degrade performance due to errors propagating between generated reasoning and the final image. MMaDA-Parallel tackles this by proposing a novel parallel multimodal diffusion framework. Instead of sequential processing, it enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. This new paradigm ensures a much stronger alignment between the model's internal reasoning and the visual output.

The approach is further optimized by Parallel Reinforcement Learning (ParaRL), applying semantic rewards at each step to enforce cross-modal consistency. To thoroughly evaluate these improvements, the team also developed ParaBench, a new benchmark specifically designed to assess the alignment between generated reasoning and image outputs.

MMaDA-Parallel demonstrates impressive results, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel. This sets a more robust standard for thinking-aware image synthesis.

Explore the paper, try out the models, and delve into the new benchmark to see how parallel multimodal diffusion is enhancing image generation and editing!

---
Paper: https://lnkd.in/epbZ7pVD
Models (8B):
MMaDA-Parallel-A: https://lnkd.in/eJf4wKaQ
MMaDA-Parallel-M: https://lnkd.in/eUp4KSMm
ParaBench Dataset: https://lnkd.in/egAfh2fx
Project Page: https://lnkd.in/ey5wKT4x
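Conceptually, the parallel denoising idea can be sketched in a few lines. This is purely illustrative pseudocode of the control flow, assuming a model object with a joint denoise_step method; it is not the MMaDA-Parallel implementation:

```python
# Conceptual sketch of parallel multimodal denoising: text and image tokens are
# refined together at every step, so each modality conditions on the other for
# the whole trajectory instead of the text finishing before the image starts.
# `model.denoise_step` is an assumed interface, not the MMaDA-Parallel API.
def parallel_denoise(model, text_tokens, image_tokens, num_steps):
    for t in reversed(range(num_steps)):
        # One forward pass attends across both token streams and predicts
        # updates for BOTH modalities at once (bidirectional interaction).
        text_tokens, image_tokens = model.denoise_step(text_tokens, image_tokens, t)
    return text_tokens, image_tokens
```

Contrast this with a sequential pipeline, where the reasoning text is fully generated first and any error in it is frozen into the conditioning for the image.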
-
📈 I've written a new blog post on training and finetuning multimodal embedding & reranker models with Sentence Transformers.

As a practical example, I finetuned Qwen3-VL-Embedding-2B for Visual Document Retrieval (matching text queries to document screenshots with charts, tables, and layouts intact).

Details: After 1 epoch on 10k query-image pairs, the model went from NDCG@10 of 0.888 to 0.947 on my eval set. That's ahead of every other VDR model I tested against, including ones up to 4x larger. Finetuning on your own domain is very often worth it over reaching for a bigger general-purpose model.

I also wrapped the loss in MatryoshkaLoss, so you can truncate the 2048-dim embeddings at deployment time. The finetuned model stays within 0.3% of peak down to 512 dims (4x smaller), and retains 92.4% of peak even at 64 dims (32x smaller). For context, the base model already fell to 76.5% of its (lower) peak at 64 dims.

The best part: the training script is nearly identical to a text-only one. The data collator automatically calls model.preprocess(), which detects the modality of each input (text, image, audio, video, or mixed) and applies the right preprocessing. No manual tokenization or image processing needed.

The blog also walks through training multimodal Cross Encoder (reranker) models, with a few architectural options like Any-to-Any + LogitScore or Feature Extraction + Pooling + Dense.

Read the full blog, or just point your Agent at the URL: https://lnkd.in/ewDQJME6
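For readers who want the shape of such a script, here is a minimal sketch using the Sentence Transformers training API. The checkpoint id and dataset name are assumptions of mine (see the blog for the actual recipe), but the MatryoshkaLoss wrapping follows the library's standard pattern:

```python
# Hedged sketch: finetuning a multimodal embedding model with Sentence Transformers,
# wrapping the contrastive loss in MatryoshkaLoss so embeddings stay useful when
# truncated at deployment time. Checkpoint id and dataset name are assumptions.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss, MatryoshkaLoss

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")  # assumed checkpoint id

# Assumed schema: a "query" text column paired with a positive "image" column
# (document screenshots); per the post, the collator handles per-modality
# preprocessing automatically via model.preprocess().
train_dataset = load_dataset("my-org/vdr-pairs", split="train")  # hypothetical dataset

# In-batch-negatives contrastive loss, also applied to truncated embedding
# prefixes so the 2048-dim vectors degrade gracefully down to 64 dims.
base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, base_loss,
                      matryoshka_dims=[2048, 1024, 512, 256, 128, 64])

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
model.save("qwen3-vl-embedding-2b-vdr")
```

At query time you can then slice each embedding to its first 512 (or 64) dimensions and trade a small accuracy loss for a much smaller index.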
-
Curious what might power the intelligence of Apple Vision Pro in the future? 👓

My ex-colleagues from Apple just dropped an exciting new paper on Multimodal LLMs they call MM1. The largest MM1 model (30B dense) achieves state-of-the-art few-shot learning on multimodal benchmarks.

🔍 Key Takeaways:
🔹 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗮𝗹 𝗜𝗻𝘀𝗶𝗴𝗵𝘁𝘀: The study emphasizes that the choice of image encoder, particularly image resolution and token count, significantly influences model performance, overshadowing the design of the vision-language connector.
🔹 𝗗𝗮𝘁𝗮 𝗗𝗶𝘃𝗲𝗿𝘀𝗶𝘁𝘆: Incorporating a blend of image-caption, interleaved image-text, and text-only data is critical for state-of-the-art few-shot results. Interestingly, interleaved and text-only data boost few-shot and text-only performance, while caption data enhances zero-shot capabilities.
🔹 𝗦𝗰𝗮𝗹𝗶𝗻𝗴 𝗦𝘂𝗰𝗰𝗲𝘀𝘀: By strategically scaling model parameters and employing mixture-of-experts (MoE) variants, the MM1 models exhibit competitive performance across multiple multimodal benchmarks after supervised fine-tuning.

🚀 Final Model Recipe:
🔸 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗲𝗱 𝗜𝗺𝗮𝗴𝗲 𝗘𝗻𝗰𝗼𝗱𝗲𝗿: a ViT-H model at 378x378px resolution, pre-trained with a CLIP objective.
🔸 𝗘𝗳𝗳𝗲𝗰𝘁𝗶𝘃𝗲 𝗩𝗶𝘀𝗶𝗼𝗻-𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗖𝗼𝗻𝗻𝗲𝗰𝘁𝗼𝗿: 144 image tokens, underscoring that token count matters more than the connector's architectural design.
🔸 𝗕𝗮𝗹𝗮𝗻𝗰𝗲𝗱 𝗗𝗮𝘁𝗮 𝗠𝗶𝘅: a mixture of 45% interleaved image-text documents, 45% image-text pair documents, and 10% text-only documents ensures robust zero- and few-shot performance.

The core insight is that deliberate data and architecture choices, not just scale, are key to building performant multimodal models. The MM1 models also exhibit impressive emergent abilities like multi-image reasoning and in-context few-shot learning.

Check out the link in the comments below 👇🏼

#AI #MachineLearning #LLM #3MinPapers
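As a toy illustration of what the 45/45/10 mix means operationally, here is a tiny sampler that picks the source of each pretraining document in those proportions. This is my own sketch, not Apple's data loader:

```python
# Toy sketch of sampling pretraining documents according to the MM1 data mix:
# 45% interleaved image-text, 45% image-caption pairs, 10% text-only.
# Illustrative only; this is not Apple's actual data pipeline.
import random

MIX = [("interleaved", 0.45), ("captions", 0.45), ("text_only", 0.10)]

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next document, proportional to mix weights.

    (random.choices(names, weights=...) would be the one-liner equivalent;
    the cumulative scan is written out to make the proportions explicit.)
    """
    r = rng.random()
    cumulative = 0.0
    for name, weight in MIX:
        cumulative += weight
        if r < cumulative:
            return name
    return MIX[-1][0]  # guard against floating-point round-off

# Example: over many draws the empirical proportions approach 45/45/10.
rng = random.Random(0)
counts = {name: 0 for name, _ in MIX}
for _ in range(100_000):
    counts[sample_source(rng)] += 1
print(counts)
```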
-
MAmmoTH-VL: a method to create a large-scale multimodal instruction-tuning dataset with intermediate rationales to elicit CoT reasoning in MLLMs.

This paper introduces a streamlined yet scalable approach to enhancing MLLM performance through the strategic use of open-source models to generate diverse, high-quality training data that reflects human preferences and real-world complexity.

𝗞𝗲𝘆 𝗰𝗼𝗻𝘁𝗿𝗶𝗯𝘂𝘁𝗶𝗼𝗻𝘀:
- introduces a simple, scalable, and cost-effective methodology for constructing instruction-tuning datasets at scale, designed to elicit multimodal CoT reasoning
- develops a dataset of 12 million entries using only open-weight LLMs and MLLMs, named MAmmoTH-VL-Instruct-12M
- trains an MLLM, MAmmoTH-VL-8B, based on the LLaVA-OneVision architecture, using the 12M dataset created with this approach

𝗠𝗲𝘁𝗵𝗼𝗱
The simple, scalable, and cost-effective data generation pipeline produces 12 million high-quality samples in three key steps:

i) Dataset Collection and Categorization
- sourced data from 153 publicly available multimodal instruction datasets
- reorganized the training data into ten major categories: General, OCR, Chart, Caption, Domain-specific, Code&Math, Language, Detection, Multi-Image, and Video
- conducted an initial manual screening of the 153 data sources by randomly sampling 1,000 data points, followed by a rapid evaluation to gauge overall quality

ii) Instruction Data Rewriting
- transforms the original multimodal data into diverse instruction-response pairs enriched with detailed rationales
- for this transformation, custom prompts were designed to generate responses that align with real-world applications while encouraging critical thinking and reasoning
- results in a rich dataset of instruction-response pairs, characterized by detailed rationales and diverse real-world scenarios

iii) Self-Data Filtering
- following the "model-as-a-judge" paradigm, leverages InternVL2-Llama3-76B (the same model used during the rewriting process) to evaluate the logical consistency of each question-answer pair against the corresponding image; see the sketch after this post

𝗘𝘅𝗽𝗲𝗿𝗶𝗺𝗲𝗻𝘁𝗮𝗹 𝗥𝗲𝘀𝘂𝗹𝘁𝘀
- MAmmoTH-VL-8B, trained on the 12M dataset created with this approach, significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%)
- the model also demonstrates notable improvements of up to 4% on non-reasoning benchmarks

𝗣𝗮𝗽𝗲𝗿: https://lnkd.in/eQB2ZkNm
𝗖𝗼𝗱𝗲: https://lnkd.in/eRcKewHU
𝗗𝗮𝘁𝗮𝘀𝗲𝘁: https://lnkd.in/eig--Fk7
𝗠𝗼𝗱𝗲𝗹: https://lnkd.in/e5Etx_mE
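A hedged sketch of that self-filtering step, with `judge` standing in for a call to the judge MLLM; the prompt wording is my own paraphrase, not the paper's:

```python
# Hedged sketch of the "model-as-a-judge" self-filtering step: a judge MLLM
# scores whether each rewritten QA pair is logically consistent with its image.
# `judge` is an assumed callable wrapping a model such as InternVL2-Llama3-76B;
# the prompt below paraphrases the described role, it is not the paper's prompt.
JUDGE_PROMPT = (
    "Given the image, is the following question-answer pair logically "
    "consistent with what the image shows? Reply YES or NO.\n"
    "Q: {question}\nA: {answer}"
)

def filter_pairs(pairs, judge):
    """Keep only the (image, question, answer) triples the judge accepts."""
    kept = []
    for image, question, answer in pairs:
        verdict = judge(image, JUDGE_PROMPT.format(question=question, answer=answer))
        if verdict.strip().upper().startswith("YES"):
            kept.append((image, question, answer))
    return kept
```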
-
What if we've been overcomplicating the "modality gap" in CLIP? 🤯

For years, this gap has been a known bottleneck: images and texts live in separate, disconnected neighborhoods of the embedding space. A new paper on AlignCLIP suggests a simpler path. Instead of complex loss functions, it uses two elegant architectural tricks to bridge the divide:

1️⃣ Share Parameters: Instead of two separate encoders, it uses a single Transformer encoder for both vision and language. This forces the model to learn one unified set of parameters that can process both modalities, preventing it from developing the divergent, specialized logics that cause the gap.

2️⃣ Separate & Mix: It introduces an intra-modal contrastive loss. This new objective actively pushes vectors of the same modality (e.g., image-vs-image) away from each other. This acts as an anti-clustering force, preventing dense, uni-modal groups from forming and forcing the two sets of embeddings to spread out and intermingle.

The benefits are immediate and impressive:
▪️ The modality gap is significantly reduced, with alignment scores jumping up to 0.25 on MSCOCO.
▪️ Zero-shot performance is maintained or even improved, showing there's no trade-off.
▪️ CLIP's famous robustness is preserved, a critical feature for real-world applications.

The breakthrough is the insight: elegant architectural tweaks can solve bottlenecks that complex losses can't. ⚡ This has huge implications for how we build the next generation of multimodal models. It proves that sometimes, the simplest solutions are the most powerful.
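To make the "Separate & Mix" idea concrete, here is a minimal PyTorch sketch of an intra-modal separation term in this spirit. It is my own illustration of the described mechanism, not the paper's exact objective:

```python
# Hedged sketch of an intra-modal separation term in the spirit of AlignCLIP:
# alongside the usual cross-modal contrastive loss, penalize similarity between
# distinct embeddings of the SAME modality so they spread out. Illustrative
# only; the paper's exact objective and weighting may differ.
import torch
import torch.nn.functional as F

def intra_modal_separation(embeddings: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity among same-modality embeddings.

    Minimizing this pushes image-vs-image (or text-vs-text) vectors apart,
    acting as the "anti-clustering" force described in the post.
    """
    z = F.normalize(embeddings, dim=-1)           # (batch, dim)
    sim = z @ z.T                                 # pairwise cosine similarities
    off_diag = sim - torch.diag(torch.diag(sim))  # ignore self-similarity
    n = z.shape[0]
    return off_diag.sum() / (n * (n - 1))

# The total loss could then combine the standard CLIP loss with separation
# terms for each modality, weighted by some hyperparameter lam:
# loss = clip_loss + lam * (intra_modal_separation(img_z)
#                           + intra_modal_separation(txt_z))
```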
-
I've just read a very interesting and practical paper, one that flips the usual "small models just reason worse" story. Serena Yeung-Levy and Mark Endo from Stanford showed what happens when you shrink a multimodal model, and how to do it the right way. The researchers look specifically at how reducing the size of the LLM inside a multimodal model affects the model's overall abilities.

➡️ Even though you're shrinking the language part, the part that suffers most is vision. What really collapses is perception (sometimes as much as reasoning): the ability to reliably extract visual details. The root cause seems to be that small LLMs just can't keep up with all the different ways they're supposed to pull visual details out of images. They don't have enough capacity to learn all those skills.

▪️ The researchers then fix this limitation with EXTRACT+THINK, a two-stage pipeline (sketched below):

1. A tiny VLM is trained to explicitly dump instruction-relevant visual facts. This method is called visual extraction tuning. It:
- takes a visual instruction example (image + question + answer)
- turns the question + answer into a simple statement
- asks a strong model to describe the visual details needed to support that statement
- uses this as training data

2. A slightly bigger LLM reasons over the extracted facts step by step (Chain-of-Thought).

After applying EXTRACT+THINK, a small multimodal model:
• outperforms much larger systems using a perception module 12× smaller and a reasoning module 41× smaller
• needs 95% fewer visual training samples compared to some baselines when trained from scratch
• beats LLaVA-OneVision-0.5B by up to 19.5%

So, there are two important things about this research: 1) we now clearly understand what happens to a multimodal system when the LLM inside it is made much smaller, and 2) we have a working recipe to mitigate the issues.
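Here is that two-stage flow as a minimal sketch; `small_vlm` and `reasoner_llm` are assumed callables of mine, and the prompts paraphrase the described roles rather than quote the paper:

```python
# Hedged sketch of the two-stage EXTRACT+THINK pipeline: a small VLM first
# dumps the instruction-relevant visual facts, then a (larger) LLM reasons
# over them with Chain-of-Thought. `small_vlm` and `reasoner_llm` are assumed
# callables, not APIs from the paper, and the prompts are paraphrases.
EXTRACT_PROMPT = (
    "List the visual details in the image that are relevant to answering: "
    "{question}"
)
THINK_PROMPT = (
    "Visual facts:\n{facts}\n\nQuestion: {question}\n"
    "Think step by step, then give the final answer."
)

def extract_then_think(image, question, small_vlm, reasoner_llm) -> str:
    facts = small_vlm(image, EXTRACT_PROMPT.format(question=question))       # stage 1
    return reasoner_llm(THINK_PROMPT.format(facts=facts, question=question))  # stage 2
```

The division of labor is the point: the small model only has to learn one skill (verbalizing relevant visual details), while the reasoning model never has to look at pixels at all.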
-
Most people still think of LLMs as "just a model." But if you've ever shipped one in production, you know it's not that simple. Behind every performant LLM system, there's a stack of decisions about pretraining, fine-tuning, inference, evaluation, and application-specific tradeoffs.

This diagram captures it well: LLMs aren't one-dimensional. They're systems. And each dimension introduces new failure points or optimization levers. Let's break it down:

🧠 Pre-Training
Start with modality.
→ Text-only models like LLaMA, UL2, and PaLM have predictable inductive biases.
→ Multimodal ones like GPT-4, Gemini, and LaVIN introduce more complex token fusion, grounding challenges, and cross-modal alignment issues.
Understanding the data diet matters just as much as parameter count.

🛠 Fine-Tuning
This is where most teams underestimate complexity:
→ PEFT strategies like LoRA and Prefix Tuning help with parameter efficiency, but can behave differently under distribution shift.
→ Alignment techniques (RLHF, DPO, RAFT) aren't interchangeable. They encode different human preference priors.
→ Quantization and pruning decisions will directly impact latency, memory usage, and downstream behavior.

⚡️ Efficiency
Inference optimization is still underexplored. Techniques like dynamic prompt caching, paged attention, speculative decoding, and batch streaming make the difference between real-time and unusable. The infra layer is where GenAI products often break.

📏 Evaluation
One benchmark doesn't cut it. You need a full matrix:
→ NLG (summarization, completion) and NLU (classification, reasoning),
→ alignment tests (honesty, helpfulness, safety),
→ dataset quality, and
→ cost breakdowns across training + inference + memory.
Evaluation isn't just a model task; it's a systems-level concern.

🧾 Inference & Prompting
Multi-turn prompts, CoT, ToT, ICL: all behave differently under different sampling strategies and context lengths. Prompting isn't trivial anymore. It's an orchestration layer in itself.

Whether you're building for legal, education, robotics, or finance, the "general-purpose" tag doesn't hold. Every domain has its own retrieval, grounding, and reasoning constraints.

-------
Follow me (Aishwarya Srinivasan) for more AI insights, and subscribe to my Substack for more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg