Robotic Control Using Unlabeled Data

Robotics data is expensive and slow to collect. Plenty of video is available online, but it is not readily usable for robotics because it lacks action labels. AMPLIFY solves this problem by learning Actionless Motion Priors that unlock better sample efficiency, generalization, and scaling for robot learning.

Our key insight is to factor the problem into two stages:
- The "what": predict the visual dynamics required to accomplish a task.
- The "how": map predicted motions to low-level actions.

This decoupling enables remarkable generalization: our policy can perform tasks for which we have NO action data, only videos. We outperform SOTA behavior cloning baselines on this setting by 27x 🤯

AMPLIFY is composed of three stages:
1. Motion Tokenization: We track dense keypoint grids through videos and compress their trajectories into discrete motion tokens.
2. Forward Dynamics: Given an image and a task description (e.g., "open the box"), we autoregressively predict a sequence of motion tokens representing how keypoints should move over the next second or so. This model can train on ANY text-labeled video data: robot demonstrations, human videos, YouTube videos.
3. Inverse Dynamics: We decode predicted motion tokens into robot actions. This module learns the robot-specific mapping from desired motions to actions, and it can train on ANY robot interaction data, not just expert demonstrations (think off-task data, play data, or even random actions).

So, does it actually work?

Few-shot learning: Given just 2 action-annotated demos per task, AMPLIFY nearly doubles SOTA few-shot performance on LIBERO. This is possible because our Actionless Motion Priors provide a strong inductive bias that dramatically reduces the amount of robot data needed to train a policy.

Cross-embodiment learning: We train the forward dynamics model on both human and robot videos, but the inverse model sees only robot actions. Result: a 1.4x average improvement on real-world tasks. Our system successfully transfers motion information from human demonstrations to robot execution.

And now my favorite result: AMPLIFY enables zero-shot task generalization. We train on LIBERO-90 tasks and evaluate on tasks where we've seen no actions, only pixels. While our best baseline achieves ~2% success, AMPLIFY reaches a 60% average success rate, outperforming SOTA behavior cloning baselines by 27x.

This is a new way to train VLAs for robotics that doesn't have to start with large-scale teleoperation. Instead of collecting millions of robot demonstrations, we just need to teach robots how to read the language of motion. Then every video becomes training data.

Led by Jeremy Collins & Loránd Cheng in collaboration with Kunal Aneja, Albert Wilcox, and Benjamin Joffe at the College of Computing at Georgia Tech.

Check out our paper and project page for more details:
📄 Paper: https://lnkd.in/eZif-mB7
🌐 Website: https://lnkd.in/ezXhzWGQ
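To make the what/how decoupling concrete, here is a minimal, hypothetical sketch of the two-stage policy in PyTorch. Module names, dimensions, and the greedy decode are illustrative assumptions, not the paper's actual implementation; the point is that the forward model never needs actions and the inverse model never needs task labels.

```python
# Illustrative sketch of AMPLIFY-style decoupling (assumed names and shapes,
# not the authors' code): forward dynamics predicts motion tokens from video
# data; inverse dynamics maps tokens to actions from robot interaction data.
import torch
import torch.nn as nn

NUM_MOTION_TOKENS = 1024   # assumed motion-token codebook size
SEQ_LEN = 16               # assumed tokens per ~1s prediction horizon
ACTION_DIM = 7             # e.g., 6-DoF end-effector delta + gripper


class ForwardDynamics(nn.Module):
    """The 'what': predict motion tokens from an observation + task embedding.
    Trainable on any text-labeled video, no robot actions required."""
    def __init__(self, obs_dim=512, text_dim=512, hidden=512):
        super().__init__()
        self.fuse = nn.Linear(obs_dim + text_dim, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, NUM_MOTION_TOKENS)

    def forward(self, obs_emb, text_emb):
        ctx = torch.tanh(self.fuse(torch.cat([obs_emb, text_emb], dim=-1)))
        # Autoregressive in the real system; unrolled here for brevity.
        h, _ = self.decoder(ctx.unsqueeze(1).repeat(1, SEQ_LEN, 1))
        return self.head(h)  # (B, SEQ_LEN, NUM_MOTION_TOKENS) logits


class InverseDynamics(nn.Module):
    """The 'how': map predicted motion tokens to robot actions.
    Trainable on any robot interaction data, even random actions."""
    def __init__(self, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(NUM_MOTION_TOKENS, hidden)
        self.net = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, ACTION_DIM))

    def forward(self, motion_tokens):
        return self.net(self.embed(motion_tokens))  # (B, SEQ_LEN, ACTION_DIM)


# At deployment the two stages compose into a policy:
fwd, inv = ForwardDynamics(), InverseDynamics()
obs_emb, text_emb = torch.randn(1, 512), torch.randn(1, 512)
tokens = fwd(obs_emb, text_emb).argmax(dim=-1)   # greedy decode for brevity
actions = inv(tokens)                             # low-level action chunk
print(actions.shape)  # torch.Size([1, 16, 7])
```

Because the two stages share only the motion-token interface, each can be trained on whichever data source happens to be available, which is what makes videos without action labels usable.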
Summary
Robotic control using unlabeled data refers to training robots to perform tasks by learning from videos or real-world observations that carry no explicit action labels. Instead of relying on costly, manually labeled datasets, robots can learn to move and act by interpreting patterns in everyday videos, making the learning process more scalable and adaptable.
- Tap into video abundance: Use publicly available videos—even those without action labels—to teach robots new skills and increase the diversity of scenarios they can handle.
- Bridge human-robot learning: Train robots to understand movement by translating human actions seen in videos into robot-friendly instructions, reducing dependence on detailed labeled demonstrations.
- Accelerate real-world adaptation: Combine a foundation of general motion knowledge with a small amount of robot-specific examples to quickly adapt robots to new environments and tasks.
How AI and LLMs Are Teaching Robots to Think in Motion

Imagine robots in your facility learning not through costly, action-labeled data but by observing the world through videos, effortlessly adapting to tasks with human-like proficiency. This groundbreaking idea, presented by the authors, may redefine robotic training by leveraging AI and Large Language Models (LLMs).

🔹 Research Focus
The paper explores a pivotal question: can we harness abundant video data to teach robots effectively? It introduces Moto, a framework that converts video frame changes into latent motion tokens, creating a universal "language" of motion. These tokens enable robots to understand and execute tasks with unprecedented efficiency.

🔹 Latent Motion Tokens
At the core of Moto lies the Latent Motion Tokenizer, a model that analyzes transitions between video frames to generate motion tokens. These tokens abstract movement patterns in a hardware-agnostic manner, enabling robots to learn from any video source. This eliminates the reliance on expensive action-labeled datasets and fosters more scalable training.

🔹 Pre-training Motion Priors
Moto-GPT, the generative model in the framework, leverages these motion tokens through autoregressive pre-training. By predicting future motion tokens based on past sequences, the model develops a deep understanding of motion dynamics, akin to how humans learn language structures. This equips robots with a robust motion knowledge base, allowing them to anticipate and evaluate actions effectively.

🔹 Co-fine-tuning for Precision
To bridge the gap between general motion understanding and specific task execution, Moto employs a co-fine-tuning strategy that integrates motion priors with smaller sets of action-labeled data, refining the model for real-world applications. Benchmarks such as SIMPLER and CALVIN highlight Moto's exceptional performance, particularly in scenarios with limited labeled data.

🔹 Experiments and Results
Robots trained with Moto outperformed traditional methods, demonstrating faster learning, greater adaptability, and reduced dependency on labeled datasets. By focusing on motion dynamics over static frame details, Moto not only accelerates training but also enhances robot performance across varied environments.

📌 Business Value
This innovation signals a paradigm shift. By leveraging Moto's motion-token-based learning, organizations can drastically cut training costs, increase operational agility, and unlock new possibilities in robotics. Whether in logistics, healthcare, or manufacturing, smarter robots driven by this approach promise smoother workflows and a sharper competitive edge.

👉 How could your organization capitalize on the wealth of video data to train robots? What new opportunities could arise with adaptable, self-learning robotics? 👈

#ArtificialIntelligence #MachineLearning #DeepLearning #Automation #Robotics
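For readers who want a concrete picture, here is a toy sketch of the two ingredients described above: a vector-quantized motion tokenizer over frame pairs and a GPT-style prior trained with next-token prediction. All names, sizes, and architectural choices are illustrative assumptions, not Moto's released code.

```python
# Hypothetical sketch of a latent motion tokenizer plus an autoregressive
# prior over the resulting token stream (assumed shapes, not Moto's code).
import torch
import torch.nn as nn

CODEBOOK, DIM = 512, 64


class LatentMotionTokenizer(nn.Module):
    """Map a (frame_t, frame_t+1) pair to a discrete motion token by
    vector quantization: pick the nearest codebook entry."""
    def __init__(self, frame_dim=256):
        super().__init__()
        self.encode = nn.Linear(2 * frame_dim, DIM)
        self.codebook = nn.Parameter(torch.randn(CODEBOOK, DIM))

    def forward(self, frame_t, frame_next):
        z = self.encode(torch.cat([frame_t, frame_next], dim=-1))  # (B, DIM)
        dists = torch.cdist(z, self.codebook)                      # (B, CODEBOOK)
        return dists.argmin(dim=-1)                                # token ids


class MotionTokenPrior(nn.Module):
    """GPT-style next-motion-token prediction with a causal mask."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(CODEBOOK, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, CODEBOOK)

    def forward(self, tokens):
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.transformer(self.embed(tokens), mask=mask))


# Tokenize a short clip, then train the prior with next-token cross-entropy.
tok, prior = LatentMotionTokenizer(), MotionTokenPrior()
frames = torch.randn(1, 9, 256)                       # 9 encoded frames
tokens = torch.stack([tok(frames[:, t], frames[:, t + 1])
                      for t in range(8)], dim=1)      # (1, 8) motion tokens
logits = prior(tokens[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, CODEBOOK),
                                   tokens[:, 1:].reshape(-1))
```

The same next-token objective that works for language works here because the tokenizer has already turned raw pixels into a compact, hardware-agnostic motion vocabulary.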
New research from Meta and collaborators. This is a good paper showing what's possible with proper world models.

World models need actions to predict consequences. The default approach today requires labeled action data, which is expensive to obtain and limited to narrow domains like video games or robotic manipulation. But the vast majority of video data online has no action labels at all. This new research tackles learning latent action world models directly from in-the-wild videos, expanding beyond the controlled settings of previous work to capture the full diversity of real-world actions.

The challenge is significant. In-the-wild videos contain actions far beyond simple navigation or manipulation: people entering frames, objects appearing and disappearing, dancers moving, fingers forming guitar chords. There's also no consistent embodiment across videos, unlike robotics datasets, where the same arm appears throughout.

So how do the authors address this? Continuous but constrained latent actions, using sparse or noisy regularization, effectively capture this action complexity. Discrete quantization, the common approach in prior work, struggles to adapt. Without a shared embodiment, the model learns spatially-localized, camera-relative transformations.

The results demonstrate genuine action transfer. Motion from a walking person can be applied to a flying ball. Actions like "someone entering the frame" transfer across completely different videos. By training a small controller to map known actions to latent ones, the world model trained purely on natural videos can solve robotic manipulation and navigation tasks with performance close to models trained on domain-specific, action-labeled data.

Latent action spaces learned from unlabeled internet videos can serve as a universal interface for planning, removing the bottleneck of action annotation.
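A toy version of this recipe fits in a few lines: an encoder infers a continuous, deliberately constrained latent action from a frame pair, and a forward model must predict the next frame from the current frame plus that latent. The noisy bottleneck below is a stand-in for the sparse/noisy regularization the post mentions; everything here is an illustrative assumption, not the paper's code.

```python
# Toy latent-action world model: infer a small continuous latent "action"
# between frames, then predict the next frame from (frame, latent action).
import torch
import torch.nn as nn

FRAME_DIM, ACT_DIM = 256, 8  # latent action kept deliberately small


class LatentActionEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(2 * FRAME_DIM, ACT_DIM)

    def forward(self, f_t, f_next, noise_std=0.3):
        a = self.net(torch.cat([f_t, f_next], dim=-1))
        # Noisy bottleneck: limits how much the latent can "cheat" by copying
        # the next frame, forcing it to encode the action instead.
        return a + noise_std * torch.randn_like(a)


class ForwardModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FRAME_DIM + ACT_DIM, 512),
                                 nn.ReLU(), nn.Linear(512, FRAME_DIM))

    def forward(self, f_t, action):
        return self.net(torch.cat([f_t, action], dim=-1))


enc, fwd = LatentActionEncoder(), ForwardModel()
f_t, f_next = torch.randn(32, FRAME_DIM), torch.randn(32, FRAME_DIM)
a = enc(f_t, f_next)
loss = nn.functional.mse_loss(fwd(f_t, a), f_next)  # self-supervised
# Action transfer: a latent inferred in one video applied to another frame.
transferred = fwd(torch.randn(32, FRAME_DIM), a)
```

Note that nothing constrains the latent to a particular embodiment, which is exactly why a latent inferred from a walking person can be replayed on a completely different scene.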
NVIDIA's DreamDojo: A Practical Milestone for Open-Source World Models in Robotics

The robotics community has long discussed applying the bitter lesson to physical AI, but scaling robot data remains a bottleneck. DreamDojo offers a compelling open-source solution: an interactive world model predicting future states directly from motor controls and pixels, without physics engines or hand-authored dynamics. This is a notable release for several reasons:

1. Scaling Through Human Data and Latent Actions
Acquiring real-world robot data is slow due to hardware limitations and safety constraints. DreamDojo bypasses this by pre-training on 44,000 hours of human egocentric video. To bridge the gap between unlabeled human actions and robot hardware, the team introduced latent actions. This allows the model to learn the fundamental physics of grasping, pouring, and manipulating objects without needing the underlying motor commands attached to the pre-training data.

2. Zero-Shot Generalization via Two-Stage Training
By learning from diverse human data first, DreamDojo establishes a foundational understanding of physical rules. It then undergoes a smaller post-training phase to "snap onto" specific robot hardware (such as Unitree G1, AgiBot, or YAM). This separation of general world physics from robot-specific actuation allows the model to generalize zero-shot to objects and environments absent from any robot training set.

3. Real-Time Inference for Closed-Loop Control
A simulator's utility is limited by its speed. Through autoregressive distillation, DreamDojo achieves a real-time 10 FPS and maintains stability for over a minute of continuous rollout. This unlocks several practical applications:
➡️ Live Teleoperation: Enabling real-time virtual teleoperation inside the generated environment using a VR controller.
➡️ Scalable Policy Evaluation: Allowing teams to accurately rank policy checkpoints inside the neural simulator before deploying them to physical hardware, reducing wear and risk.
➡️ Model-Based Planning: Facilitating test-time planning by simulating multiple action proposals in parallel. The team demonstrated a +17% real-world success rate improvement using this method.

Built on the open-weight NVIDIA Cosmos, the team has open-sourced the weights, code, post-training dataset, and evaluation sets. This provides a highly usable foundation for the community to build upon as we move deeper into the era of world models for physical AI.

Congratulations to the full team of researchers and contributors behind this work: Shenyuan Gao, Will Liang, Kaiyuan Zheng, Ayaan Naveed Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, Qianli Ma, Seungjun Nah, Loïc Magne, Jiannan Xiang, Yuqi Xie, Ruijie Zheng, Dantong Niu, You Liang Tan, KR Zentner, George Kurian, Suneel Indupuru, Pooya Jannaty, Jinwei Gu, Jun Zhang, Jitendra Malik, Pieter Abbeel, Ming-Yu Liu, Yuke Zhu, Joel Jang.

Website: https://lnkd.in/dN4iNktu
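As a rough illustration of the model-based planning use case above, here is a generic random-shooting planner that simulates many action proposals in parallel inside a learned dynamics model and executes the best first action. The toy world model and goal-distance score are stand-ins, not DreamDojo's actual API.

```python
# Generic "simulate proposals in parallel, pick the best" planning loop.
# Everything here is a stand-in for illustration, not DreamDojo's code.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HORIZON, NUM_PROPOSALS = 128, 7, 10, 64


class ToyWorldModel(nn.Module):
    """Stand-in for a learned dynamics model: s' = f(s, a)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 256),
                                 nn.ReLU(), nn.Linear(256, STATE_DIM))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


def plan(world_model, state, goal):
    """Random-shooting planner: roll out many action sequences in the model
    and execute the first action of the highest-scoring rollout."""
    proposals = torch.randn(NUM_PROPOSALS, HORIZON, ACTION_DIM)
    s = state.expand(NUM_PROPOSALS, -1)          # batch the rollouts
    with torch.no_grad():
        for t in range(HORIZON):
            s = world_model(s, proposals[:, t])
        score = -(s - goal).pow(2).sum(dim=-1)   # closer to goal = better
    return proposals[score.argmax(), 0]          # best first action


wm = ToyWorldModel()
state, goal = torch.randn(1, STATE_DIM), torch.randn(1, STATE_DIM)
action = plan(wm, state, goal)
print(action.shape)  # torch.Size([7])
```

The 10 FPS figure matters precisely here: a planner like this only works in a closed loop if the world model can evaluate batched rollouts faster than the robot needs its next action.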
Robots usually need tons of labeled data to learn precise actions. What if they could learn control skills directly from human videos, no labels needed?

Robotics pretraining just took a BIG jump forward. A new Autoregressive Robotic Model learns low-level 4D representations from human video data, bridging the gap between vision and real-world robotic control.

Why this matters:
✅ Pretraining with 4D geometry enables better transfer from human video to robot actions
✅ Overcomes the gap between high-level VLA pretraining and low-level robotic control
✅ Unlocks more accurate, data-efficient learning for real-world tasks

For more details, check out the paper:
📍 https://lnkd.in/dbsp7Fz5

The team at @Berkeley AI Research will release the project page and code soon.
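As a rough sketch of what "low-level 4D representations" could mean in code, here is a toy next-step predictor over 3D point tracks (3D points through time). Shapes, modules, and the training objective are illustrative assumptions, not the paper's architecture.

```python
# Toy autoregressive prediction over 3D point tracks ("4D" = 3D + time).
# Assumed shapes and modules for illustration only.
import torch
import torch.nn as nn

NUM_POINTS = 64  # tracked keypoints per frame


class PointTrackPredictor(nn.Module):
    """Predict the next frame's 3D point positions from the track history."""
    def __init__(self, hidden=256):
        super().__init__()
        self.inp = nn.Linear(NUM_POINTS * 3, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, NUM_POINTS * 3)

    def forward(self, tracks):                    # (B, T, NUM_POINTS, 3)
        B, T = tracks.shape[:2]
        h, _ = self.rnn(self.inp(tracks.flatten(2)))
        return self.out(h).view(B, T, NUM_POINTS, 3)


model = PointTrackPredictor()
tracks = torch.randn(4, 16, NUM_POINTS, 3)        # human-video point tracks
pred = model(tracks[:, :-1])                      # predict one step ahead
loss = nn.functional.mse_loss(pred, tracks[:, 1:])
```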