Training AI Models With Limited Data

Explore top LinkedIn content from expert professionals.

  • View profile for Jim Fan
    Jim Fan is an Influencer

    NVIDIA Director of AI & Distinguished Scientist. Co-Lead of Project GR00T (Humanoid Robotics) & GEAR Lab. Stanford Ph.D. OpenAI's first intern. Solving Physical AGI, one motor at a time.

    238,092 followers

    We trained a humanoid with 22-DoF dexterous hands to assemble model cars, operate syringes, sort poker cards, fold/roll shirts, all learned primarily from 20,000+ hours of egocentric human video with no robot in the loop. Humans are the most scalable embodiment on the planet.

    We discovered a near-perfect log-linear scaling law (R² = 0.998) between human video volume and action prediction loss, and this loss directly predicts real-robot success rate.

    Humanoid robots will be the end game, because they are the practical form factor with minimal embodiment gap from humans. Call it the Bitter Lesson of robot hardware: the kinematic similarity lets us simply retarget human finger motion onto dexterous robot hand joints. No learned embeddings, no fancy transfer algorithms needed. Relative wrist motion + retargeted 22-DoF finger actions serve as a unified action space that carries through from pre-training to robot execution.

    Our recipe is called "EgoScale":
    - Pre-train GR00T N1.5 on 20K hours of human video, mid-train with only 4 hours (!) of robot play data with Sharpa hands. 54% gains over training from scratch across 5 highly dexterous tasks.
    - Most surprising result: a *single* teleop demo is sufficient to learn a never-before-seen task. Our recipe enables extreme data efficiency.
    - Although we pre-train in 22-DoF hand joint space, the policy transfers to a Unitree G1 with 7-DoF tri-finger hands. 30%+ gains over training on G1 data alone.

    The scalable path to robot dexterity was never more robots. It was always us.

    - Website: https://lnkd.in/gxzgeP-2
    - Paper: https://lnkd.in/g7PJdz_8
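    To make the scaling-law claim concrete, here is a minimal curve-fitting sketch: it fits loss = a + b·log(hours) and computes R², the statistic quoted in the post. The (hours, loss) values are invented for illustration and are not the paper's data.

    ```python
    # Minimal sketch of fitting a log-linear scaling law: loss = a + b * log(hours).
    # The (hours, loss) pairs below are made up for illustration; the real study
    # fits action-prediction loss against human-video volume.
    import numpy as np

    hours = np.array([100, 500, 1000, 5000, 20000], dtype=float)  # hypothetical video volumes
    loss = np.array([0.92, 0.78, 0.71, 0.57, 0.45])               # hypothetical prediction losses

    b, a = np.polyfit(np.log(hours), loss, deg=1)                 # slope, intercept
    pred = a + b * np.log(hours)

    # R^2 of the fit: how well a straight line in log(hours) explains the loss.
    ss_res = np.sum((loss - pred) ** 2)
    ss_tot = np.sum((loss - loss.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    print(f"loss ≈ {a:.3f} + {b:.3f} * ln(hours), R² = {r2:.3f}")
    ```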

  • View profile for Arockia Liborious
    Arockia Liborious is an Influencer
    39,287 followers

    Generating Synthetic Data: A Simple Guide

    Synthetic data generation is critical in building AI models where your data is sparse. Here we're not just filling gaps, we're carefully crafting data to make models smarter, more robust, and safer. Just like in cooking, you have different recipes for different goals. Here's a simple breakdown of synthetic data generation methods:

    1. Generative Synthesis
    What it is: Letting a powerful AI (like a large language or image model) invent completely new examples from scratch based on a description or a set of rules.
    Good for: Generating massive amounts of novel data quickly.
    Watch out for: The AI can "hallucinate" and create nonsense or get stuck in a repetitive style.

    2. Transformation & Rephrasing (a minimal code sketch follows after this post)
    What it is: Taking existing, real data and altering it while keeping its core meaning. Think paraphrasing a sentence, swapping words, or changing an image's color.
    Good for: A cheap and safe way to make your dataset more diverse.
    Watch out for: Small changes can sometimes accidentally change the data's true label, so you need to double-check.

    3. Programmatic Labeling & Distillation
    What it is: Using a smarter, more powerful AI (the "teacher") to label a bunch of unlabeled or messy data for a smaller, simpler AI (the "student") to learn from.
    Good for: Quickly creating labeled datasets at a huge scale.
    Watch out for: The student AI will inherit all the blind spots and biases of its teacher if you're not careful.

    4. Agentic Self-Play
    What it is: Having multiple AI "agents" interact with each other or a simulated environment to create complex, multi-step data. This is perfect for generating conversations, tool-use sequences, or strategic game-play.
    Good for: Teaching AI how to perform long, complicated tasks.
    Watch out for: The AIs can learn to "cheat" the simulation or develop weird, unrealistic strategies that don't work in the real world.

    5. Adversarial & Safety Data
    What it is: Intentionally creating tricky, confusing, or malicious data to find and fix your AI's weak spots. This is like a quality control check.
    Good for: Making your AI more robust, secure, and safe before it's deployed.
    Watch out for: You have to be very creative to think of all the ways things can go wrong.

    No matter which method you use, you need a strict quality control layer. This involves:
    - Removing Duplicates: So the AI doesn't see the same example over and over.
    - Scrubbing Sensitive Info: Filtering out personal data or offensive content.
    - Tracking Lineage: Knowing exactly how and where a piece of synthetic data was created.

    The real skill isn't in any single technique, but in knowing which combination to use for your specific goal. Start simple, measure what works, and never compromise on data quality. After all, your AI will only ever be as good as the data it sees. #AI
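    As a hedged illustration of method 2 (Transformation & Rephrasing) together with the duplicate-removal step of the quality-control layer, here is a minimal sketch. The synonym table and example sentences are invented for demonstration; a real pipeline would use stronger paraphrasers and re-check labels after augmentation.

    ```python
    # Minimal sketch: synonym-swap augmentation (method 2) plus duplicate removal.
    # The synonym table and example sentences are invented for illustration only.
    import hashlib

    SYNONYMS = {"quick": "fast", "purchase": "buy", "error": "failure"}

    def rephrase(text: str) -> str:
        """Swap known words for synonyms; core meaning (and label) should be preserved."""
        return " ".join(SYNONYMS.get(w.lower(), w) for w in text.split())

    def deduplicate(examples: list[str]) -> list[str]:
        """Drop exact duplicates so the model does not see the same example repeatedly."""
        seen, unique = set(), []
        for ex in examples:
            key = hashlib.md5(ex.lower().encode()).hexdigest()
            if key not in seen:
                seen.add(key)
                unique.append(ex)
        return unique

    seed = ["The quick checkout failed with an error", "I want to purchase a plan"]
    augmented = deduplicate(seed + [rephrase(s) for s in seed])
    print(augmented)
    ```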

  • View profile for Andriy Burkov
    Andriy Burkov is an Influencer

    PhD in AI, author of 📖 The Hundred-Page Language Models Book and 📖 The Hundred-Page Machine Learning Book

    486,907 followers

    When you want a large language model to get better at a specific task—like solving math problems or navigating websites—the standard approach is to finetune it: you adjust the model's internal parameters using training data and gradient descent, which is expensive, requires lots of data, and often makes the model worse at everything else.

    Instead of changing the model's parameters, this paper proposes to run the model on a small set of problems multiple times, compare the successful and failed attempts, and use the model itself to write down natural-language "lessons learned"—things like "when solving geometry problems, always check that your solution falls within the valid region." These lessons get iteratively refined across a few rounds and then get pasted into the prompt at inference time.

    The method is modeled after GRPO, a reinforcement learning algorithm where you generate a group of outputs, score them, and use the relative quality differences to improve the model—except here the "improvement" happens in the prompt text rather than in the weights.

    The paper shows that doing this with just 100 training examples and about $18 worth of API calls on a large frozen model (DeepSeek-V3.1-Terminus, 671 billion parameters) outperforms smaller models that were finetuned with thousands of examples at costs exceeding $10,000. The results hold across both math reasoning and web search tasks, and unlike finetuned models that degrade when moved to a different domain, swapping in a different set of learned experiences lets the same frozen model perform well in multiple domains simultaneously.

    Read with an AI tutor: https://lnkd.in/eA3Ud2a2
    Download the PDF: https://lnkd.in/ekVxsz3B
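    A rough sketch of the loop described above, under stated assumptions: llm(prompt) stands in for any chat-completion call and is_correct is a task-specific checker; neither is from the paper itself, and the prompts are heavily simplified.

    ```python
    # Sketch of the prompt-side learning loop: sample several attempts per problem,
    # compare successes and failures, and ask the model itself to distill
    # natural-language "lessons" that get pasted into future prompts.
    # `llm` and `is_correct` are hypothetical stand-ins, not the paper's code.

    def learn_lessons(llm, problems, is_correct, rounds=3, samples=4):
        lessons = ""
        for _ in range(rounds):
            successes, failures = [], []
            for problem, answer in problems:              # ~100 examples suffice per the post
                for _ in range(samples):                  # group of attempts, GRPO-style
                    attempt = llm(f"{lessons}\n\nProblem: {problem}")
                    (successes if is_correct(attempt, answer) else failures).append(
                        (problem, attempt)
                    )
            # The frozen model rewrites the lessons; weights never change.
            lessons = llm(
                "Compare these successful and failed attempts and rewrite the list of "
                f"general lessons for solving such problems.\n\nCurrent lessons:\n{lessons}\n\n"
                f"Successes:\n{successes[:5]}\n\nFailures:\n{failures[:5]}"
            )
        return lessons  # paste this into the prompt at inference time
    ```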

  • View profile for Asad Ansari

    Founder | Data & AI Transformation Leader | Driving Digital & Technology Innovation across UK Government and Financial Services | Board Member | Commercial Partnerships | Proven success in Data, AI, and IT Strategy

    29,651 followers

    You cannot train AI on reality alone anymore. There is not enough of it.

    Jensen Huang explains why NVIDIA built Cosmos, an AI world model that generates synthetic training data grounded in physics.

    The problem is simple. Teaching physical AI like robotics requires vast amounts of diverse interaction data. Videos exist, but not nearly enough to capture the variety of situations robots will encounter. So NVIDIA transformed compute into data. Using synthetic data generation grounded by laws of physics, they can selectively generate training scenarios that would be impossible to capture otherwise.

    The example Huang shows is remarkable. A basic traffic simulator output gets fed into Cosmos. What emerges is physically plausible surround video that AI can learn from.

    This solves a fundamental limitation. You cannot train autonomous systems on every possible scenario by recording reality. There are not enough cameras or time. But you can simulate physics accurately enough that AI trained on synthetic data generalises to real environments.

    This applies beyond robotics. Any AI learning physical interactions, from manufacturing to logistics to infrastructure monitoring, faces the same data scarcity problem. Synthetic data generation grounded in physics laws is how you create training sets reality cannot provide. The organisations building AI for physical systems will either master synthetic data generation or get limited by whatever reality they can record.

    Watch the full presentation to hear Huang explain how Cosmos generates training data for physical AI. What physical AI application needs synthetic data because reality cannot provide enough examples?

    #AI #SyntheticData #Robotics #NVIDIA #MachineLearning

  • View profile for Zain Hasan

    I build and teach AI | AI/ML @ Together AI | EngSci ℕΨ/PhD @ UofT | Previously: Vector DBs, Data Scientist, Lecturer & Health Tech Founder | 🇺🇸🇨🇦🇵🇰

    19,610 followers

    Explanation of Low-Rank Adaptation (LoRA), a method for efficiently fine-tuning pre-trained neural networks.

    The Problem LoRA Solves:
    🔸 In early 2021, Microsoft partnered with OpenAI to explore the commercial viability of GPT-3.
    🔸 They found that prompting was insufficient for production tasks like natural-language-to-code generation.
    🔸 Fine-tuning was necessary but prohibitively expensive due to the large size of model checkpoints.

    How It Works (a minimal code sketch follows after this post):
    🔸 LoRA generalizes full fine-tuning (updating every single parameter) by asking two questions:
    - Do we need to fine-tune all parameters?
    - For the weight matrices we fine-tune, how expressive should the updates be in terms of matrix rank?
    🔸 These questions define a 2D plane where full fine-tuning is one corner (full rank and full parameter updates) and the origin represents the original model.
    🔸 Any point in this plane is a valid LoRA configuration.
    🔸 The chosen rank of the update matrix controls the expressivity of the fine-tuning process.
    🔸 A d x d matrix can represent any linear transformation in a d-dimensional vector space.
    🔸 By first transforming the input to a lower-dimensional space and then back to the original space, we can restrict the kind of linear transformations that can be represented.
    🔸 This reduces the number of parameters that need to be stored from d x d to d x r + r x d (i.e., 2dr), where r << d.
    🔸 A point near the origin often performs as well as full fine-tuning, because neural networks are often over-parametrized: their weight matrices contain many (nearly) linearly dependent rows and columns, so the updates have low intrinsic rank.
    🔸 This suggests that we can start with a low-rank configuration and gradually increase the rank if needed.

    Common practices when using LoRA:
    🔸 How to choose the rank r of the update matrix: Start with a low rank and increase it if needed.
    🔸 When to use full fine-tuning: When fine-tuning on data that is completely new and absent from the pretraining of the base model (for example, if you are tuning an English model on Martian, then full fine-tuning may be necessary).
    🔸 Can I use LoRA for any model architecture? As long as the model uses matrix multiplication, LoRA can be applied. So basically pretty much every model architecture can use LoRA!

    Benefits of LoRA:
    🔸 Reduced checkpoint sizes: On GPT-3, checkpoint size was reduced from 1TB to 25MB.
    🔸 No additional inference latency: LoRA updates can be merged with the original parameters during inference: W_new = W_old + A x B.
    🔸 Ability to quickly switch between tasks: LoRA modules can be loaded and unloaded efficiently, e.g., (A_french x B_french), (A_german x B_german), (A_spanish x B_spanish).

    Some interesting ideas enabled by LoRA:
    🔸 Caching LoRA modules in RAM for faster model switching and routing between different fine-tunes.
    🔸 Training multiple LoRA modules in parallel on different batches of the training set.
    🔸 Creating a tree of adaptive models where each node is a LoRA module.
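    A minimal PyTorch sketch of the idea, not the reference implementation from the LoRA paper: a frozen linear layer plus a trainable low-rank update A x B that can be merged back into the weights (W_new = W_old + A x B) for zero inference overhead. The rank and scaling values below are illustrative.

    ```python
    # Minimal LoRA sketch: freeze the pretrained weight, learn a low-rank update.
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                      # freeze pretrained weights
            d_out, d_in = base.weight.shape
            self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)  # down-projection (d x r)
            self.B = nn.Parameter(torch.zeros(r, d_out))         # up-projection (r x d), zero init
            self.scale = alpha / r

        def forward(self, x):
            # Frozen path plus low-rank update path.
            return self.base(x) + (x @ self.A @ self.B) * self.scale

        def merge(self):
            """Fold the update into the frozen weight: W_new = W_old + (A @ B)^T * scale."""
            self.base.weight.data += (self.A @ self.B).T * self.scale

    layer = LoRALinear(nn.Linear(768, 768), r=8)  # only ~2 * 768 * 8 trainable parameters
    ```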

  • View profile for Sahar Mor

    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

    41,883 followers

    If you're working on AI projects with limited training data, building domain-specific AI applications, or struggling with the economics of data labeling, you should know about this new approach from the DeepSeek team.

    Reinforcement Fine-Tuning (RFT) is a new technique for fine-tuning large language models, cutting the required labeled data from thousands to just tens of examples. Traditional supervised fine-tuning (SFT) approaches have always been hampered by their dependence on vast amounts of labeled data. RFT takes a fundamentally different approach by using a reward function to evaluate response correctness, enabling the model to learn more effectively than through simple mimicry of examples. It is the same technique that was used to develop DeepSeek-R1.

    This method proves particularly powerful in three key scenarios (a toy reward-function sketch follows after this post):
    (1) When no labeled data exists but correctness can be verified, such as code transpilation where outputs can be automatically tested.
    (2) When only limited labeled examples are available (fewer than 100), where traditional methods typically overfit.
    (3) For tasks that benefit from chain-of-thought reasoning, where step-by-step logical thinking significantly improves results.

    A well-written post from Predibase here (they also recently added support for RFT on their platform!): https://lnkd.in/gHBdW5De

    P.S. Predibase just released an open-source model that outperforms OpenAI o1 by 67% for PyTorch-to-Triton transpilation tasks, enabling more efficient and intelligent AI models (link in comments).
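    A toy illustration of scenario (1), where correctness can be verified without labels: a reward function that parses a completion's final answer and checks it programmatically. The "Answer: <number>" convention and the reward values are assumptions for illustration, not Predibase's or DeepSeek's implementation.

    ```python
    # Toy reward function for RFT-style training on verifiable tasks: instead of
    # imitating labeled examples, score each sampled completion by whether its
    # final answer checks out. Parsing convention and reward values are illustrative.
    import re

    def correctness_reward(completion: str, expected: float) -> float:
        match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
        if match is None:
            return 0.0                                          # no parseable answer: no reward
        answer = float(match.group(1))
        return 1.0 if abs(answer - expected) < 1e-6 else 0.1    # small credit for correct format

    # Score a batch of sampled completions for one prompt.
    samples = ["... so Answer: 42", "I think the result is 41", "Answer: 41.0"]
    print([correctness_reward(s, 42) for s in samples])         # [1.0, 0.0, 0.1]
    ```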

  • NVIDIA’s Physical AI Data Factory Blueprint is Designed to Improve Robot Training Data

    One of the biggest hurdles standing between physical AI and its “ChatGPT moment” is a lack of quality data. A big part of the reason LLMs have been such a massive – and often surprising – success is the fact that humans have essentially been creating training data for 100,000 years or so. The same can’t be said for the input required to train robots.

    NVIDIA is among the companies working to address the gap, and this morning at GTC the company announced the Physical AI Data Factory Blueprint, an open reference architecture designed to improve how both real-world and simulated data is gathered, shaped, and assessed. The company has already recruited some big names from across autonomous driving and robotics, including FieldAI, Hexagon AB Robotics, Linker Vision, Milestone Systems, Skild AI, Uber, and Teradyne Robotics.

    The platform is host to a number of processes designed to do right by the real and synthetic robot data. There’s Cosmos Curator, which processes and annotates datasets; Cosmos Transfer, which is designed to address edge cases and long-tail scenarios; and Cosmos Evaluator, which, you know, evaluates data.

    “Physical AI is the next frontier of the AI revolution, where success depends on the ability to generate massive amounts of data,” says Omniverse VP, Rev Lebaredian. “Together with cloud leaders, we’re providing a new kind of agentic engine that transforms compute into the high-quality data required to bring the next generation of autonomous systems and robots to life. In this new era, compute is data.”

    #nvidia #gtc #nvidiagtc #robotics #physicalai

  • View profile for Elvis S.

    Founder at DAIR.AI | Angel Investor | Advisor | Prev: Meta AI, Galactica LLM, Elastic, Ph.D. | Serving 7M+ learners around the world

    85,573 followers

    A Few Tokens Are All You Need

    Can you cut the fine-tuning costs of an LLM by 75% and keep strong reasoning performance? A new paper from the Tencent AI Lab claims that it might just be possible. Let's find out how:

    The First Few Tokens
    It shows that all you need is a tiny prefix to improve your model’s reasoning—no labels or massive datasets are required! It uses an unsupervised prefix fine-tuning method (UPFT), which requires only prefix substrings (as few as 8 tokens) of generated solutions.

    Task Template for Prefix Tuning
    They use a simple task template for prefix tuning. By using a few leading tokens of the solution, the model learns a consistent starting approach without requiring complete, correct final answers. Other approaches require entire reasoning traces.

    Prefix Self-Consistency
    They observe that solution paths for the same question share similar initial tokens, even if the later steps diverge. By fine-tuning only on these few prefix tokens, the model learns robust initial reasoning steps without needing full correct solutions.

    Coverage–Accuracy Trade-Off
    Training on short prefixes captures broad coverage of potential reasoning paths while preserving correctness, as errors typically appear later in the CoTs. UPFT demonstrates superior performance compared to SFT in unsupervised fine-tuning.

    Fewer Tokens, Competitive Results
    UPFT cuts training overhead by 75% or more versus conventional fine-tuning, yet matches or surpasses supervised methods like RFT on math and reasoning benchmarks. When combined with optional label filtering, it can further boost performance with minimal extra cost.

    Final Thoughts
    We are seeing a lot of new approaches that improve the efficiency of reasoning models. Clever inference methods that reduce compute and improve reasoning are one approach, but UPFT instead focuses on efficient, resource-friendly training with fewer tokens.
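    A rough sketch of how prefix-only training pairs could be assembled in the spirit of UPFT: sample a solution, keep only its first few tokens, and fine-tune on (question, prefix) pairs. The generate and tokenizer calls are hypothetical stand-ins, and the 8-token default simply mirrors the figure quoted above; this is not the paper's exact recipe.

    ```python
    # Sketch of building UPFT-style training data: keep only the first few tokens
    # of each sampled solution, so no labels or full reasoning traces are needed.
    # `generate` and `tokenizer` are hypothetical stand-ins for your model/tokenizer.

    def build_prefix_dataset(questions, generate, tokenizer, prefix_tokens=8):
        dataset = []
        for q in questions:
            solution = generate(q)                               # unverified model sample
            ids = tokenizer.encode(solution)[:prefix_tokens]     # keep only the leading tokens
            prefix = tokenizer.decode(ids)
            dataset.append({"prompt": q, "completion": prefix})  # fine-tune on these pairs
        return dataset
    ```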

  • View profile for Eshank Agarwal

    Founder & CEO | GenAI Consultant | Data Science Corporate Trainer | Advisory Board Member | Six Sigma Expert | Lean Consultant

    4,533 followers

    We tried RAG. We tried Fine-Tuning. Both failed.

    Last year, I worked on a Bank of America case study: classifying customer conversation emotions into 74 proprietary categories.

    𝗥𝗔𝗚 𝗽𝗿𝗼𝗯𝗹𝗲𝗺𝘀:
    → PII security risks (retrieval calls = data exposure)
    → Poor accuracy on proprietary taxonomy
    → High latency & cost at scale

    𝗙𝗶𝗻𝗲-𝗧𝘂𝗻𝗶𝗻𝗴: We got good accuracy! But still faced issues:
    → Catastrophic forgetting (lost general capabilities)
    → Overfitting to training data
    → Resource intensive

    Then I found this paper: "Fine-tuning with RAG for Improving LLM Learning" (Imperial College London)

    𝗧𝗵𝗲 𝗯𝗿𝗶𝗹𝗹𝗶𝗮𝗻𝘁 𝗶𝗱𝗲𝗮? Use RAG to TEACH, then remove it.

    𝗛𝗲𝗿𝗲'𝘀 𝗵𝗼𝘄 (𝘀𝗶𝗺𝗽𝗹𝗲 𝗮𝗻𝗮𝗹𝗼𝗴𝘆): Imagine teaching someone to cook:
    🔸 RAG = Standing next to them reading recipes. They cook well, but ONLY with you there.
    🔸 Fine-tuning = Making them memorize 100 recipes. They forget how to boil water (catastrophic forgetting).
    🔸 This paper's approach:
    1. Let them fail (burnt pasta = learning data)
    2. Extract hints from failures ("taste before seasoning")
    3. Give hints → they succeed → record it
    4. Train on successful videos WITHOUT showing hints
    They're FORCED to internalize why it worked. (A rough code sketch of this recipe follows after this post.)

    𝗧𝗵𝗲 𝗿𝗲𝘀𝘂𝗹𝘁𝘀:
    • Distilled model: 91% vs RAG: 82%
    • Student OUTPERFORMED the teacher
    • Fewer tokens at inference

    𝗛𝗼𝘄 𝘁𝗵𝗶𝘀 𝘀𝗼𝗹𝘃𝗲𝘀 𝗼𝘂𝗿 𝗽𝗿𝗼𝗯𝗹𝗲𝗺:
    PII risk → No retrieval in production
    74 categories → Hints teach edge cases
    Latency → Zero retrieval overhead
    Forgetting → Distillation preserves capabilities

    𝗞𝗲𝘆 𝘁𝗮𝗸𝗲𝗮𝘄𝗮𝘆: It's not RAG vs Fine-Tuning. Use RAG as training wheels, then remove them. Teach with retrieval. Perform without it.

    Has anyone tried this approach?

    📄 arxiv.org/abs/2510.01375

    #LLM #RAG #FineTuning #AI #MachineLearning
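    A rough code sketch of the recipe, under stated assumptions: classify, retrieve_hints, and fine_tune are hypothetical stand-ins for the actual model call, retriever, and trainer. The point is only that retrieved hints are used while building the training set and dropped from the final training inputs.

    ```python
    # Sketch of "teach with retrieval, perform without it": collect cases the base
    # model gets wrong, retry them with retrieved hints, and keep only successful
    # hint-assisted answers as training targets WITHOUT the hints.
    # `classify`, `retrieve_hints`, and `fine_tune` are hypothetical stand-ins.

    def build_distillation_set(model, examples, classify, retrieve_hints):
        training_pairs = []
        for text, true_label in examples:
            if classify(model, text) == true_label:
                continue                                   # already correct, nothing to teach
            hints = retrieve_hints(text)                   # RAG only at data-creation time
            hinted = classify(model, text, hints=hints)
            if hinted == true_label:                       # success with hints
                training_pairs.append({"input": text,     # note: hints are NOT included
                                       "target": hinted})
        return training_pairs                              # then: fine_tune(model, training_pairs)
    ```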
