I've heard this a lot recently: "We trained our robot on one object and it generalised to a novel object - these new VLA models are crazy!"

Let's talk about what's actually happening in that "A" (Action) part of your VLA model. The Vision and Language components? They're incredible. Pre-trained on internet-scale data, they understand objects, spatial relationships, and task instructions better than ever. But the Action component? That's still learned from scratch on your specific robot demonstrations.

Here's the reality: Your VLA model has internet-scale understanding of what a screwdriver looks like and what "tighten the screw" means. But the actual motor pattern for "rotating wrist while applying downward pressure"? That comes from your 500 robot demos.

What this means for "generalisation":
- Vision generalisation: Recognises novel objects instantly (thanks to pre-training)
- Language generalisation: Understands new task instructions (thanks to pre-training)
- Action generalisation: Still limited to motor patterns seen during robot training

Ask that same robot to "unscrew the bottle cap" and it fails because:
- Vision: Recognises bottle and cap
- Language: Understands "unscrew"
- Action: Never learned the "twist while pulling" motor pattern

The hard truth about VLA models: The "VL" gives you incredible zero-shot understanding. The "A" still requires task-specific demonstrations. We've cracked the perception and reasoning problem. We haven't cracked the motor generalisation problem.
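To make that split concrete, here is a minimal, hypothetical PyTorch sketch (not any particular VLA's architecture): the vision and text encoders stand in for internet-scale pre-trained backbones that are frozen, while the action head is the only part that learns from your robot demonstrations.

```python
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    """Illustrative only: the pre-trained 'VL' vs. from-scratch 'A' split."""
    def __init__(self, vision_dim=512, text_dim=512, action_dim=7):
        super().__init__()
        # Stand-ins for pre-trained encoders (e.g. a ViT and an LLM).
        # In a real VLA these carry pre-trained weights and are frozen or lightly fine-tuned.
        self.vision_encoder = nn.Linear(3 * 224 * 224, vision_dim)   # placeholder
        self.text_encoder = nn.Embedding(30522, text_dim)            # placeholder
        # The action head is learned from scratch on robot demos:
        # fused VL features -> low-level motor commands.
        self.action_head = nn.Sequential(
            nn.Linear(vision_dim + text_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image, token_ids):
        v = self.vision_encoder(image.flatten(1))           # (B, vision_dim)
        t = self.text_encoder(token_ids).mean(dim=1)        # (B, text_dim)
        return self.action_head(torch.cat([v, t], dim=-1))  # (B, action_dim)

model = ToyVLA()
# Freeze the "VL" part; only the action head sees gradients from robot demonstrations.
for p in model.vision_encoder.parameters():
    p.requires_grad = False
for p in model.text_encoder.parameters():
    p.requires_grad = False

img = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 30522, (2, 12))
print(model(img, tokens).shape)  # torch.Size([2, 7]), e.g. end-effector deltas + gripper
```

However broad the frozen encoders' knowledge is, the action head above only ever sees the motor patterns present in the demonstration data, which is exactly the generalisation gap described in the post.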
Role of Action Modality in Robot Learning
Summary
The role of action modality in robot learning refers to how robots interpret and perform physical actions based on sensory and language input. This concept is crucial because, while robots can understand objects and instructions, translating that understanding into motor actions often requires specific demonstrations and advanced planning techniques.
- Separate planning stages: Use a two-step approach where robots create a plan before acting, allowing them to adapt to new tasks and recover from mistakes.
- Utilize motion data: Incorporate videos and motion tracking to reduce the need for extensive robot-specific training, making robot learning accessible even without detailed action labels.
- Chunk actions logically: Train robots to group actions into meaningful segments instead of single steps, improving their ability to handle complex, real-world tasks.
Robotics data is expensive and slow to collect. A lot of video is available online, but it is not readily usable for robotics because it lacks action labels. AMPLIFY solves this problem by learning Actionless Motion Priors that unlock better sample efficiency, generalization, and scaling for robot learning.

Our key insight is to factor the problem into two stages:
- The "what": predict the visual dynamics required to accomplish a task
- The "how": map predicted motions to low-level actions

This decoupling enables remarkable generalizability: our policy can perform tasks where we have NO action data, only videos. We outperform SOTA BC baselines on this by 27x 🤯

AMPLIFY is composed of three stages:
1. Motion Tokenization: We track dense keypoint grids through videos and compress their trajectories into discrete motion tokens.
2. Forward Dynamics: Given an image and task description (e.g., "open the box"), we autoregressively predict a sequence of motion tokens representing how keypoints should move over the next second or so. This model can train on ANY text-labeled video data - robot demonstrations, human videos, YouTube videos.
3. Inverse Dynamics: We decode predicted motion tokens into robot actions. This module learns the robot-specific mapping from desired motions to actions. It can train on ANY robot interaction data - not just expert demonstrations (think off-task data, play data, or even random actions).

So, does it actually work?

Few-shot learning: Given just 2 action-annotated demos per task, AMPLIFY nearly doubles SOTA few-shot performance on LIBERO. This is possible because our Actionless Motion Priors provide a strong inductive bias that dramatically reduces the amount of robot data needed to train a policy.

Cross-embodiment learning: We train the forward dynamics model on both human and robot videos, but the inverse model sees only robot actions. Result: 1.4x average improvement on real-world tasks. Our system successfully transfers motion information from human demonstrations to robot execution.

And now my favorite result: AMPLIFY enables zero-shot task generalization. We train on LIBERO-90 tasks and evaluate on tasks where we've seen no actions, only pixels. While our best baseline achieves ~2% success, AMPLIFY reaches a 60% average success rate, outperforming SOTA behavior cloning baselines by 27x.

This is a new way to train VLAs for robotics that doesn't have to start with large-scale teleoperation. Instead of collecting millions of robot demonstrations, we just need to teach robots how to read the language of motion. Then, every video becomes training data.

Led by Jeremy Collins & Loránd Cheng in collaboration with Kunal Aneja, Albert Wilcox, and Benjamin Joffe at the College of Computing at Georgia Tech.

Check out our paper and project page for more details:
📄 Paper: https://lnkd.in/eZif-mB7
🌐 Website: https://lnkd.in/ezXhzWGQ
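The "what"/"how" factorization above is the key structural idea. Below is a hypothetical, heavily simplified sketch of that interface (not the authors' code; sizes, module designs, and the one-shot token prediction are assumptions): a forward-dynamics model maps image and language features to discrete motion tokens, and an inverse-dynamics model decodes those tokens into robot actions, so each half can be trained on different data.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the real AMPLIFY design differs in detail.
NUM_MOTION_TOKENS = 1024   # discrete codebook over keypoint-trajectory snippets
SEQ_LEN = 16               # motion tokens covering roughly the next second
ACTION_DIM = 7

class ForwardDynamics(nn.Module):
    """'What': image + task text features -> motion-token sequence.
    Trainable on any text-labeled video, no robot actions required.
    (The real model predicts tokens autoregressively; this sketch emits them in one shot.)"""
    def __init__(self, ctx_dim=256):
        super().__init__()
        self.ctx = nn.Linear(512, ctx_dim)                       # stand-in for fused features
        self.head = nn.Linear(ctx_dim, SEQ_LEN * NUM_MOTION_TOKENS)

    def forward(self, obs_text_feat):
        logits = self.head(self.ctx(obs_text_feat))
        return logits.view(-1, SEQ_LEN, NUM_MOTION_TOKENS).argmax(-1)  # token ids

class InverseDynamics(nn.Module):
    """'How': motion tokens -> low-level robot actions.
    Trainable on any robot interaction data, even play or random actions."""
    def __init__(self, emb=64):
        super().__init__()
        self.embed = nn.Embedding(NUM_MOTION_TOKENS, emb)
        self.decode = nn.Linear(emb, ACTION_DIM)

    def forward(self, motion_tokens):
        return self.decode(self.embed(motion_tokens))            # (B, SEQ_LEN, ACTION_DIM)

# At inference the two stages are simply chained.
feat = torch.randn(2, 512)             # placeholder fused image + language features
tokens = ForwardDynamics()(feat)       # what should move, expressed in pixel space
actions = InverseDynamics()(tokens)    # robot-specific commands
print(tokens.shape, actions.shape)     # torch.Size([2, 16]) torch.Size([2, 16, 7])
```

Because only the small inverse-dynamics half needs action-labeled robot data, the expensive teleoperation requirement shrinks while the forward model soaks up arbitrary video.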
-
NVIDIA

Teaching Robots to Think Before They Act: How Vision-Language-Action Models Learn Structured Reasoning

What if robots could pause to "think" before executing actions, just like humans? Most AI systems today jump straight from perception to action. For complex tasks, this approach struggles with long-term planning, adapting to new scenarios, or recovering from mistakes.

👉 The Core Challenge
Vision-language-action (VLA) models often map sensory inputs directly to low-level actions. Without explicit reasoning, they:
- Fail to break tasks into subgoals
- Struggle with unseen environments
- Lack self-correction mechanisms

👉 ThinkAct: A Two-Stage Approach
Inspired by human cognition, this NVIDIA/NTU collaboration introduces a dual-system architecture:
1. Reasoning Engine: A multimodal LLM generates step-by-step plans using visual feedback and goal alignment
2. Action Model: Compressed "visual latent plans" guide precise physical interactions

Key Innovations
- Reinforced Visual Rewards: Trains models using success metrics and trajectory consistency
- Latent Plan Distillation: Converts verbose reasoning into compact spatial-temporal guides
- Asynchronous Execution: Slow thinking (reasoning) + fast acting (control)

👉 Why This Matters
Tested on LIBERO and SimplerEnv benchmarks, ThinkAct:
- Achieved 2.5x higher success rates than baseline models in 10-shot adaptation
- Demonstrated autonomous error recovery (e.g., regrasping dropped objects)
- Enabled cross-environment generalization without retraining

The Bottom Line
By decoupling reasoning from action, ThinkAct bridges the gap between abstract planning and physical implementation, a critical step toward adaptable embodied AI.

Paper: "ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning" (Huang et al., NVIDIA & NTU, 2025)
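To illustrate the slow-thinking / fast-acting split described above, here is a hypothetical, simplified control loop (not the paper's implementation; the replan ratio, module shapes, and names are assumptions): a slow reasoning step refreshes a compact latent plan every few control ticks, while a fast action model conditions on the latest plan at every tick.

```python
import torch
import torch.nn as nn

PLAN_DIM, OBS_DIM, ACTION_DIM = 32, 64, 7
REPLAN_EVERY = 10  # assumed ratio: slow reasoning runs once per 10 fast control steps

# Stand-in for the multimodal reasoning model: observation/goal -> compact latent plan.
slow_reasoner = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(), nn.Linear(128, PLAN_DIM))
# Stand-in for the action model: current observation + latest latent plan -> action.
fast_actor = nn.Sequential(nn.Linear(OBS_DIM + PLAN_DIM, 128), nn.ReLU(), nn.Linear(128, ACTION_DIM))

def control_loop(get_obs, steps=30):
    latent_plan = None
    for t in range(steps):
        obs = get_obs()
        # Slow path: refresh the plan only occasionally (it is expensive to compute).
        if latent_plan is None or t % REPLAN_EVERY == 0:
            latent_plan = slow_reasoner(obs)
        # Fast path: act at every tick, conditioned on the most recent plan.
        yield fast_actor(torch.cat([obs, latent_plan], dim=-1))

# Dummy observation source for the sketch.
actions = list(control_loop(lambda: torch.randn(OBS_DIM)))
print(len(actions), actions[0].shape)  # 30 torch.Size([7])
```

The design choice to compress reasoning into a fixed-size latent plan is what lets the fast controller stay cheap while still benefiting from deliberate, step-by-step planning.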
-
Every field has a few ideas that quietly change everything. For large language models, it was the Transformer. For modern robot learning, I would argue it is the ACT policy (Action Chunking with Transformers).

The ACT policy fundamentally changed what was possible in robotics by opening the door to open-source, learnable policies that could actually run on low-cost hardware. That single shift moved robot learning out of a few elite labs and into the hands of much smaller research teams.

What makes ACT truly seminal is that innovation happened on both sides of the stack. On the hardware side, costs dropped by nearly two orders of magnitude. That alone reshaped who could experiment, iterate, and deploy real robots.

On the software side, there were three key breakthroughs:
(1) A conditional Variational Autoencoder to model action distributions
(2) A DETR-based transformer decoder for structured action prediction
(3) Action Chunking, inspired by neuroscience, allowing robots to reason and act over meaningful temporal segments rather than single timesteps

Robots trained with ACT could fold clothes, insert batteries, close zip-lock bags, and handle a wide range of real-world manipulation tasks that were previously brittle or hand-engineered.

Because of how foundational this policy is, I wrote a from-scratch article explaining ACT, without assuming prior expertise in robotics or deep learning. The goal was simple: lower the barrier for anyone who wants to seriously enter modern robot learning today.

If you are curious about where robotics is headed and why learning-based policies are the future, this is a great place to start.

Read the Substack article here: https://lnkd.in/dKxQV_hM
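For readers unfamiliar with action chunking itself, here is a minimal, hypothetical sketch of the inference-time idea (not the full CVAE/DETR model, and the policy and weighting scheme are placeholders): the policy predicts a chunk of K future actions per call, and overlapping chunks are blended before execution.

```python
import numpy as np

K = 4            # chunk size: actions predicted per policy call
ACTION_DIM = 7   # e.g. joint targets for a 7-DoF arm
HORIZON = 12

def policy(obs, t):
    """Placeholder for the learned chunked policy: returns K future actions."""
    rng = np.random.default_rng(t)
    return rng.normal(size=(K, ACTION_DIM))

# Collect every prediction each chunk has made for each future timestep.
predictions = [[] for _ in range(HORIZON + K)]

for t in range(HORIZON):
    obs = None                              # stand-in for the current observation
    chunk = policy(obs, t)                  # actions intended for steps t .. t+K-1
    for i in range(K):
        predictions[t + i].append(chunk[i])

    # Execute step t by averaging every chunk that predicted an action for it,
    # using a simple exponential weighting (a design choice for this sketch;
    # ACT's temporal ensembling uses its own weighting scheme).
    preds = np.stack(predictions[t])                   # (num_overlapping_chunks, ACTION_DIM)
    w = np.exp(0.3 * np.arange(len(preds)))
    action_t = (w[:, None] * preds).sum(0) / w.sum()
    print(t, np.round(action_t[:3], 2))                # first 3 dims of the executed action
```

Predicting whole chunks reduces compounding errors from step-by-step prediction, and blending overlapping chunks smooths the executed trajectory, which is a large part of why chunked policies handle long contact-rich tasks better than single-step behavior cloning.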