Robotics data is expensive and slow to collect. Plenty of video is available online, but it isn't readily usable for robot learning because it lacks action labels. AMPLIFY addresses this by learning Actionless Motion Priors that unlock better sample efficiency, generalization, and scaling for robot learning.

Our key insight is to factor the problem into two stages:
- The "what": predict the visual dynamics required to accomplish a task.
- The "how": map predicted motions to low-level actions.

This decoupling enables remarkable generalization: our policy can perform tasks for which we have NO action data, only videos. We outperform SOTA BC baselines on this setting by 27x 🤯

AMPLIFY is composed of three stages (see the sketch after this post):
1. Motion Tokenization: We track dense keypoint grids through videos and compress their trajectories into discrete motion tokens.
2. Forward Dynamics: Given an image and a task description (e.g., "open the box"), we autoregressively predict a sequence of motion tokens representing how the keypoints should move over the next second or so. This model can train on ANY text-labeled video data: robot demonstrations, human videos, YouTube videos.
3. Inverse Dynamics: We decode the predicted motion tokens into robot actions. This module learns the robot-specific mapping from desired motions to actions, and it can train on ANY robot interaction data, not just expert demonstrations (think off-task data, play data, or even random actions).

So, does it actually work?

Few-shot learning: Given just 2 action-annotated demos per task, AMPLIFY nearly doubles SOTA few-shot performance on LIBERO. This is possible because our Actionless Motion Priors provide a strong inductive bias that dramatically reduces the amount of robot data needed to train a policy.

Cross-embodiment learning: We train the forward dynamics model on both human and robot videos, while the inverse model sees only robot actions. Result: a 1.4x average improvement on real-world tasks. Our system successfully transfers motion information from human demonstrations to robot execution.

And now my favorite result: AMPLIFY enables zero-shot task generalization. We train on LIBERO-90 tasks and evaluate on tasks where we've seen no actions, only pixels. While our best baseline achieves ~2% success, AMPLIFY reaches a 60% average success rate, outperforming SOTA behavior cloning baselines by 27x.

This points to a new way of training VLAs for robotics that doesn't have to start with large-scale teleoperation. Instead of collecting millions of robot demonstrations, we just need to teach robots to read the language of motion. Then every video becomes training data.

Led by Jeremy Collins & Loránd Cheng in collaboration with Kunal Aneja, Albert Wilcox, and Benjamin Joffe at the College of Computing at Georgia Tech.

Check out our paper and project page for more details:
📄 Paper: https://lnkd.in/eZif-mB7
🌐 Website: https://lnkd.in/ezXhzWGQ
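To make the three-stage factorization concrete, here is a minimal PyTorch sketch. All module names, dimensions, and the nearest-neighbor codebook lookup are illustrative assumptions, not AMPLIFY's exact architecture (the paper's tokenizer, forward model, and inverse model are more sophisticated).

```python
# Toy sketch of AMPLIFY's what/how factorization (all sizes are assumptions).
import torch
import torch.nn as nn

N_CODES, DIM, HORIZON, KEYPTS, ACT_DIM = 512, 64, 16, 8, 7

class MotionTokenizer(nn.Module):
    """Stage 1: compress each keypoint's trajectory into a discrete token
    via nearest-neighbor lookup in a learned codebook (a VQ-style assumption)."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(HORIZON * 2, DIM)          # (x, y) per timestep
        self.codebook = nn.Embedding(N_CODES, DIM)

    def forward(self, traj):                            # traj: (B, K, HORIZON, 2)
        z = self.enc(traj.flatten(2))                   # (B, K, DIM)
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        return dists.argmin(-1)                         # (B, K) discrete token ids

class ForwardDynamics(nn.Module):
    """Stage 2, the "what": predict motion tokens from image + text features.
    Trains on any text-labeled video; no action labels required."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(2 * DIM, N_CODES)

    def forward(self, img_feat, txt_feat):              # each (B, K, DIM)
        return self.head(torch.cat([img_feat, txt_feat], -1))  # token logits

class InverseDynamics(nn.Module):
    """Stage 3, the "how": decode motion tokens into robot actions.
    Trains on any robot interaction data, not just expert demos."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(N_CODES, DIM)
        self.act = nn.Linear(DIM, ACT_DIM)

    def forward(self, tokens):                          # tokens: (B, K)
        return self.act(self.emb(tokens).mean(1))       # pool over keypoints

# Training targets for the forward model come from the tokenizer:
target_tokens = MotionTokenizer()(torch.randn(1, KEYPTS, HORIZON, 2))

# Inference: pixels + language -> motion tokens -> action, no demos needed.
fwd, inv = ForwardDynamics(), InverseDynamics()
img, txt = torch.randn(1, KEYPTS, DIM), torch.randn(1, KEYPTS, DIM)
action = inv(fwd(img, txt).argmax(-1))                  # (1, ACT_DIM)
```

The point of the factorization shows up in the data requirements: only the last module ever needs robot actions, so the first two can scale with ordinary video.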
How to Apply Zero-Shot Learning in Robotics
Summary
Zero-shot learning in robotics allows robots to perform new tasks without task-specific demonstrations or retraining, by interpreting instructions or videos and generating actions on the fly. This approach uses learned models to map visual or language-based cues to robot actions, enabling flexible and scalable solutions for real-world challenges.
- Separate task stages: Break down a robotics task into understanding what needs to happen and then figuring out how the robot should perform those actions using video or language instructions.
- Use off-the-shelf data: Integrate models that can learn from publicly available videos or images, even those featuring humans, to extract motion and behavior patterns for robotic use.
- Prompt-driven control: Apply techniques that let you specify goals and hazards at inference time—like using prompt tokens—so the robot can adapt to new tasks and environments without retraining.
Can my robot cook my food, tidy my messy table, rearrange my dresser, and do much more without ANY demos or real-world training data? Introducing ManipGen: a generalist agent for manipulation that can solve long-horizon robotics tasks entirely zero-shot, from text input!

Key idea: many manipulation tasks of interest decompose into two phases: contact-free reaching (aka motion planning!) and contact-rich local interaction. The latter is hard to learn, and we take a sim2real transfer approach!

We define local policies, which operate in a local region around an object of interest. They are uniquely well-suited to generalization and sim2real transfer, because they are invariant to:
1) Absolute pose
2) Skill order
3) Environment configuration

As an overview, our approach 1) acquires generalist behaviors for local skills at scale using RL, 2) distills these behaviors into visuomotor policies using multitask DAgger, and 3) deploys local policies in the real world using VLMs and motion planning.

Phase 1: Train state-based, single-object policies to acquire skills such as picking, placing, opening, and closing. We train policies with PPO across thousands of objects, designing reward and observation spaces for efficient learning and effective sim2real transfer.

Phase 2: We need visuomotor policies to deploy on robots! We distill the single-object experts into multi-task policies using online imitation learning (aka DAgger) that observe local visual (wrist-cam) input, with edge and hole augmentation to match real-world depth noise. A toy version of this distillation loop is sketched after this post.

To deploy local policies in the real world, we decompose the task into components (GPT-4o), estimate where to go using Grounded SAM, and motion plan using Neural MP. For control, we use Industreallib from NVIDIA, an excellent library for sim2real transfer!

ManipGen solves long-horizon tasks in the real world entirely zero-shot, generalizing across objects, poses, environments, and scene configurations! We outperform SOTA approaches such as SayCan, OpenVLA, LLMTrajGen, and VoxPoser across 50 tasks by 36%, 76%, 62%, and 60%, respectively!

ManipGen exhibits exciting capabilities such as manipulation in tight spaces and clutter, entirely zero-shot: putting items on a shelf, carefully extracting a red pepper from clutter, and putting large items in drawers. By training local policies at scale on thousands of objects, ManipGen also generalizes to challenging out-of-distribution objects that look nothing like the training set, such as pliers and clamps, as well as deformable objects such as wire.

This work was done at the Carnegie Mellon University Robotics Institute with co-lead Min Liu, as well as Deepak Pathak and Russ Salakhutdinov, in collaboration with Walter Talbott, Chen Chen, Ph.D., and Jian Zhang from Apple.

Paper, videos, and code (coming soon!) at https://lnkd.in/ekjWPXHM
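Below is a toy sketch of the Phase 2 distillation idea: roll out the visuomotor student, but label every visited state with the frozen state-based expert's action, which is the core of DAgger. The environment, network sizes, and single-skill setup are stand-in assumptions; the real system distills many experts into one multi-task policy with depth-noise augmentations.

```python
# Toy DAgger distillation loop (assumed interfaces; not ManipGen's code).
import torch
import torch.nn as nn

class ToyEnv:
    """Stand-in for the simulator: returns a wrist-cam image for the student
    and privileged low-dimensional state for the expert."""
    def reset(self):
        self.state = torch.randn(8)
        return torch.randn(3, 32, 32), self.state
    def step(self, action):
        self.state = self.state + 0.01 * torch.randn(8)
        return torch.randn(3, 32, 32), self.state

expert = nn.Linear(8, 7)                                        # frozen state-based RL expert
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 7))  # visuomotor student
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
env, horizon = ToyEnv(), 10

for it in range(100):                                  # DAgger iterations
    obs_buf, act_buf = [], []
    obs, state = env.reset()
    for t in range(horizon):
        with torch.no_grad():
            a_expert = expert(state)                   # expert labels the visited state
            a_student = student(obs.unsqueeze(0))[0]
        obs_buf.append(obs)
        act_buf.append(a_expert)
        obs, state = env.step(a_student)               # the STUDENT drives the rollout
    pred = student(torch.stack(obs_buf))               # regress onto expert labels
    loss = nn.functional.mse_loss(pred, torch.stack(act_buf))
    opt.zero_grad(); loss.backward(); opt.step()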
What representation enables open-world robot manipulation from generated videos? Introducing Dream2Flow, our recent work that bridges video generation and robot control with 3D object flow. 🌐 dream2flow.github.io by Stanford University

🔹 Robot manipulation is about inducing changes in an environment through actions. We observe that video models (e.g., Veo) excel at producing plausible object motions from an in-the-wild image and language instructions. Intriguingly, these motions are more physically realistic when the actor is human rather than a robot, likely because the internet contains far more human interaction data than robot data.

🔹 But how do we turn those generated videos into low-level robot actions? This is a nuanced question beyond simple retargeting, because strategies taken by a human may not work on a robot.

🔹 We propose Dream2Flow, which uses 3D object flow to separate what should happen in the scene from how a robot should realize it. We extract this flow from generated videos using off-the-shelf vision models, then use it as a shared objective for both trajectory optimization and reinforcement learning.

🔹 Dream2Flow can perform a range of in-the-wild tasks zero-shot with trajectory optimization, including manipulation of rigid, articulated, and deformable objects. The robot plans by asking a counterfactual question using a dynamics model (either heuristics-based or learned): if I take this action, will the scene evolve toward the desired 3D flow? (A toy version of this planning loop is sketched after this post.)

🔹 Used as a reward for RL, Dream2Flow enables different embodiments to discover emergent behaviors that achieve the same effect (e.g., base motion of the robot dog). Dream2Flow unifies these behaviors through a shared task interface and unifies model-free and model-based methods around a shared tracking goal.

🔹 By leveraging purely off-the-shelf video models, Dream2Flow also generalizes to different object instances, backgrounds, and camera viewpoints. It is also surprisingly steerable: different language instructions in the same scene can induce different desired behaviors.

🔹 World modeling encodes rich priors about not only environment dynamics but also behaviors within it. It is immensely useful for robotics, yet we are only scratching the surface of understanding it.

The project was led by Karthik Dharmarajan and has been a year in the making, along with the rest of the team: Jiajun Wu, Fei-Fei Li, and Ruohan Zhang. Karthik Dharmarajan will also be joining UC Berkeley as a PhD student this fall!

Website: dream2flow.github.io
Paper: https://lnkd.in/gpwP2hkT
Code: https://lnkd.in/gvJZTxaP
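Here is a toy sketch of that counterfactual planning loop: sample candidate actions, predict where the tracked object points would end up under each one, and pick the action whose outcome best matches the 3D flow extracted from the generated video. The random-shooting optimizer and the rigid-translation dynamics model are stand-in assumptions, not the paper's implementation.

```python
# Toy flow-conditioned planning in the spirit of Dream2Flow (assumptions noted).
import numpy as np

def flow_cost(pred_pts, target_pts):
    """Mean 3D distance between where object points would end up under a
    candidate action and where the generated video says they should go."""
    return np.linalg.norm(pred_pts - target_pts, axis=-1).mean()

def plan_action(points, target_flow, dynamics, n_samples=256, act_dim=7, seed=0):
    """Random-shooting trajectory optimization over sampled candidate actions."""
    rng = np.random.default_rng(seed)
    candidates = rng.normal(size=(n_samples, act_dim))
    costs = [flow_cost(dynamics(points, a), points + target_flow)
             for a in candidates]
    return candidates[int(np.argmin(costs))]

# Toy dynamics: the action's first 3 dims rigidly translate the tracked points.
toy_dynamics = lambda pts, a: pts + a[:3]
pts = np.zeros((50, 3))                              # 50 tracked object points
desired_flow = np.tile([0.1, 0.0, 0.05], (50, 1))    # flow lifted from a generated video
best_action = plan_action(pts, desired_flow, toy_dynamics)
```

The same cost, negated, can serve as a dense reward for RL, which is how a shared tracking objective can unify model-based planning and model-free learning across embodiments.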
📢 🧬 New paper drop: "Prompting Decision Transformers for Zero-Shot Reach-Avoid Policies" by stellar PhD student Kevin Li (Massachusetts Institute of Technology / Harvard Medical School): https://lnkd.in/eWbtEyVy

Imagine an agent that can reach any goal while avoiding danger, without retraining, even when the hazards change. That's the reach-avoid challenge. Think self-driving cars dodging new construction, or cell therapies steering clear of tumorigenic states.

Most RL methods hardwire the danger zones during training. Want to avoid something new? Retrain. Want to scale to new configurations? Retrain. But what if you could just tell the model what to avoid, on the fly?

Enter RADT: the Reach-Avoid Decision Transformer. It learns from suboptimal data, uses no rewards or costs, encodes goals and avoid regions as prompt tokens, and generalizes zero-shot to new goals and hazards. 🧵👇

What is different here? RADT never sees rewards. Instead, it learns from relabeled offline trajectories: each trajectory is framed as either a "good" or "bad" demonstration of avoiding specified regions. The prompt looks like this:
✅ Goal token
❌ One or more avoid tokens (of any shape/size)
🟢 Success/failure indicators
You can mix, match, or modify the prompt at inference time, and RADT will adapt zero-shot. (A toy version of this prompt construction is sketched after this post.)

Benchmarks: FetchReach and MazeObstacle 🏗️ RADT beats baselines (even retrained ones!) at avoiding hazards and hitting targets, and handles more and larger avoid regions than it ever saw in training. Zero-shot generalization actually works.

Real-world application: cell reprogramming 🧬 Start with a fibroblast, reach a cardiomyocyte, and avoid dangerous intermediate states (e.g., tumorigenic ones). RADT reduces time spent in harmful expression states; even when avoidance is impossible, it minimizes exposure.

Why it matters:
- Flexible deployment: same model, new avoid regions
- Reward-free: no need for hand-designed cost functions
- Works in both robotics and biology
- Helps in safety-critical settings where retraining is infeasible

Limitations: it can only handle box-shaped avoid regions for now. But the core idea (prompt-driven, reward-free, zero-shot control) is powerful and widely applicable. RADT is part of a bigger vision: general-purpose agents that follow high-level instructions about where to go and what to avoid, safely and efficiently.

Read the paper: https://lnkd.in/eWbtEyVy

👏 Big kudos to Kevin Li for pushing the frontier on safe, compositional policy learning! Massachusetts Institute of Technology | Harvard Medical School Department of Biomedical Informatics | Kempner Institute at Harvard University | Harvard Data Science Initiative | Broad Institute of MIT and Harvard
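A toy sketch of that prompting scheme: the goal, each box-shaped avoid region, and a success/failure flag become embedded tokens that are prepended to the decision transformer's input sequence, so new hazards can be specified at inference time without retraining. Token layout and dimensions are illustrative assumptions, not the paper's exact design.

```python
# Toy RADT-style prompt construction (assumed token layout, not the paper's).
import torch
import torch.nn as nn

class ReachAvoidPrompt(nn.Module):
    """Embed a goal point and N axis-aligned avoid boxes as prompt tokens,
    plus a binary flag marking a trajectory as a good/bad avoidance demo."""
    def __init__(self, state_dim=3, dim=64):
        super().__init__()
        self.goal_emb = nn.Linear(state_dim, dim)
        self.avoid_emb = nn.Linear(2 * state_dim, dim)  # box = (low corner, high corner)
        self.flag_emb = nn.Embedding(2, dim)            # 0 = violated regions, 1 = avoided

    def forward(self, goal, avoid_boxes, success):
        # goal: (state_dim,), avoid_boxes: (N, 2*state_dim), success: scalar long
        tokens = [self.goal_emb(goal), self.flag_emb(success)]
        tokens += [self.avoid_emb(b) for b in avoid_boxes]  # any number of regions
        return torch.stack(tokens)   # (2 + N, dim), prepended to the state-action sequence

prompt_net = ReachAvoidPrompt()
goal = torch.tensor([0.5, 0.2, 0.1])
boxes = torch.tensor([[0.0, 0.0, 0.0, 0.2, 0.2, 0.2],   # avoid this cube...
                      [0.3, 0.3, 0.0, 0.4, 0.4, 0.3]])  # ...and this one, mixed at will
prompt = prompt_net(goal, boxes, torch.tensor(1))       # condition the transformer on these
```

Because avoid regions are just extra tokens, swapping hazards at deployment means editing the prompt, not the weights, which is what makes the zero-shot adaptation possible.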