Exciting updates on Project GR00T! We have found a systematic way to scale up robot data, tackling the most painful bottleneck in robotics. The idea is simple: a human collects demonstrations on a real robot, and we multiply that data 1000x or more in simulation. Let's break it down:

1. We use Apple Vision Pro (yes!!) to give the human operator first-person control of the humanoid. Vision Pro parses the human's hand pose and retargets the motion to the robot hand, all in real time. From the human's point of view, they are immersed in another body, like in Avatar. Teleoperation is slow and time-consuming, but we can afford to collect a small amount of data.

2. We use RoboCasa, a generative simulation framework, to multiply the demonstration data by varying the visual appearance and layout of the environment. In Jensen's keynote video below, the humanoid is placing the cup in hundreds of kitchens with a huge diversity of textures, furniture, and object placements. We only have 1 physical kitchen at the GEAR Lab in NVIDIA HQ, but we can conjure up infinite ones in simulation.

3. Finally, we apply MimicGen, a technique that multiplies the above data even further by varying the *motion* of the robot. MimicGen generates a vast number of new action trajectories based on the original human data and filters out failed ones (e.g. those that drop the cup) to form a much larger dataset.

To sum up: 1 human trajectory with Vision Pro -> RoboCasa produces N (varying visuals) -> MimicGen further augments to NxM (varying motions). This is how we trade compute for expensive human data through GPU-accelerated simulation. A while ago, I mentioned that teleoperation is fundamentally not scalable, because we are always limited by 24 hrs/robot/day in the world of atoms. Our new GR00T synthetic data pipeline breaks this barrier in the world of bits. Scaling has been so much fun for LLMs, and it's finally our turn to have fun in robotics!
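The 1 -> N -> NxM multiplication described above can be sketched in a few lines. This is a minimal toy illustration, not the actual RoboCasa or MimicGen APIs; the function names, trajectory format, and noise model are all made up for clarity.

```python
import random

def randomize_scene(demo, num_scenes):
    """RoboCasa-style step (hypothetical API): keep the trajectory,
    vary the visual appearance and layout of the environment."""
    return [{"scene_id": i, "actions": demo["actions"]} for i in range(num_scenes)]

def generate_motions(demo, num_variants, success_check):
    """MimicGen-style step (hypothetical API): perturb the action
    trajectory and keep only the variants that still succeed."""
    variants = [{"scene_id": demo["scene_id"],
                 "actions": [a + random.gauss(0, 0.01) for a in demo["actions"]]}
                for _ in range(num_variants)]
    return [v for v in variants if success_check(v)]

# 1 teleoperated demo -> N visual variants -> up to N*M motion variants
human_demo = {"actions": [0.1, 0.4, 0.2]}
dataset = []
for scene_demo in randomize_scene(human_demo, num_scenes=100):   # N = 100
    dataset += generate_motions(scene_demo, num_variants=10,     # M = 10
                                success_check=lambda v: True)
print(len(dataset))  # up to N*M = 1000 trajectories from a single human demo
```

In the real pipeline, `success_check` would replay each perturbed trajectory in simulation and discard failures (e.g. dropped cups), so the final dataset is somewhat smaller than N*M.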
We are creating tools to enable everyone in the ecosystem to scale up with us:
- RoboCasa: our generative simulation framework (Yuke Zhu). It's fully open-source! Here you go: http://robocasa.ai
- MimicGen: our generative action framework (Ajay Mandlekar). The code is open-source for robot arms, and we will have another version for humanoids and 5-finger hands: https://lnkd.in/gsRArQXy
- We are building a state-of-the-art Apple Vision Pro -> humanoid robot "Avatar" stack. Xiaolong Wang's group's open-source libraries laid the foundation: https://lnkd.in/gUYye7yt
- Watch Jensen's keynote yesterday. He cannot hide his excitement about Project GR00T and robot foundation models! https://lnkd.in/g3hZteCG

Finally, the GEAR Lab is hiring! We want the best roboticists in the world to join us on this moon-landing mission to solve physical AGI: https://lnkd.in/gTancpNK
Scaling Robotic Task Learning in Automation
Summary
Scaling robotic task learning in automation means using advanced techniques to teach robots a wide variety of tasks quickly and efficiently, often by expanding small sets of real-world data through simulation and innovative learning methods. This approach helps robots adapt to new environments and tasks without needing extensive retraining, making automation more flexible and powerful for everyday use.
- Focus on data quality: Structure training demonstrations so they are consistent and clear, making it easier for robots to learn reliable strategies and recover from mistakes.
- Expand with simulation: Multiply a small set of real-world data using simulation tools to create diverse environments and varied robot movements, allowing robots to practice and adapt to many situations.
- Build knowledge incrementally: Encourage robots to gradually develop a library of skills they can reuse and combine, boosting their ability to handle complex, long-term tasks over time.
Can my robot cook my food, tidy my messy table, rearrange my dresser, and do much, much more without ANY demos or real-world training data? Introducing ManipGen: a generalist agent for manipulation that can solve long-horizon robotics tasks entirely zero-shot, from text input!

Key idea: many manipulation tasks of interest can be decomposed into two phases: contact-free reaching (aka motion planning!) and contact-rich local interaction. The latter is hard to learn, and we take a sim2real transfer approach! We define local policies, which operate in a local region around an object of interest. They are uniquely well-suited to generalization (see below!) and sim2real transfer, because they are invariant to: 1) absolute pose, 2) skill order, and 3) environment configuration.

As an overview, our approach 1) acquires generalist behaviors for local skills at scale using RL, 2) distills these behaviors into visuomotor policies using multi-task DAgger, and 3) deploys local policies in the real world using VLMs and motion planning.

Phase 1: Train state-based, single-object policies to acquire skills such as picking, placing, opening, and closing. We train policies using PPO across thousands of objects, designing reward and observation spaces for efficient learning and effective sim2real transfer.

Phase 2: We need visuomotor policies to deploy on robots! We distill single-object experts into multi-task policies using online imitation learning (aka DAgger) that observe local visual (wrist-cam) input, with edge and hole augmentation to match real-world depth noise.

To deploy local policies in the real world, we decompose the task into components (GPT-4o), estimate where to go using Grounded SAM, and motion plan using Neural MP. For control, we use Industreallib from NVIDIA, an excellent library for sim2real transfer! ManipGen can solve long-horizon tasks in the real world entirely zero-shot, generalizing across objects, poses, environments, and scene configurations!
We outperform SOTA approaches such as SayCan, OpenVLA, LLMTrajGen, and VoxPoser across 50 tasks by 36%, 76%, 62%, and 60% respectively! ManipGen exhibits exciting capabilities such as performing manipulation in tight spaces and with clutter, entirely zero-shot. From putting items on the shelf, to carefully extracting the red pepper from clutter, to putting large items in drawers, ManipGen is quite capable. By training local policies at scale on thousands of objects, ManipGen generalizes to some pretty challenging out-of-distribution objects that don't look anything like what was in training, such as pliers and clamps, as well as deformable objects such as the wire. This work was done at the Carnegie Mellon University Robotics Institute, with co-lead Min Liu, as well as Deepak Pathak and Russ Salakhutdinov, and in collaboration with Walter Talbott, Chen Chen, and Jian Zhang from Apple. Paper, videos, and code (coming soon!) at https://lnkd.in/ekjWPXHM
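The deployment loop described in the post (decompose with a VLM, localize, motion-plan, then hand off to a local policy) can be sketched abstractly. This is a hedged reading of the pipeline, not ManipGen's actual code: every callable here is a hypothetical stand-in for GPT-4o, Grounded SAM, Neural MP, and the distilled local policies.

```python
def solve_task(instruction, plan_subtasks, locate, motion_plan, local_policies):
    """Sketch of a ManipGen-style loop: each subtask is contact-free
    reaching (motion planning) followed by a contact-rich local policy."""
    for skill, obj in plan_subtasks(instruction):   # VLM decomposition (stub)
        target_pose = locate(obj)                   # object localization (stub)
        motion_plan(target_pose)                    # reach the local region
        local_policies[skill](obj)                  # local interaction policy

# Toy stubs that just record what happened, to show the control flow.
log = []
plan = lambda text: [("pick", "pepper"), ("place", "drawer")]
locate = lambda obj: (0.0, 0.0, 0.0)
move = lambda pose: log.append(("reach", pose))
policies = {"pick": lambda o: log.append(("pick", o)),
            "place": lambda o: log.append(("place", o))}

solve_task("put the pepper in the drawer", plan, locate, move, policies)
print(log)  # reach -> pick -> reach -> place
```

The invariances the post highlights (absolute pose, skill order, environment configuration) are what let each `local_policies[skill]` entry stay small and reusable across very different top-level tasks.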
Can a robot learn incrementally over time, building a library of skills that allows it to reuse, compose, and build upon prior knowledge to become more advanced? That has been an area of research my lab (and many others!) has been pursuing for many years. But it is not trivial to figure out how to combine pieces of knowledge and skill sets into a library, especially when that knowledge base may have very different objectives, state-action spaces, or model sizes. Yun-Jie Ho and Zih-Yun Chiu set out to tackle this problem, and I'm happy to say that their results have recently been published in IEEE RA-L! SurgIRL: Towards Life-Long Learning for Surgical Automation by Incremental Reinforcement Learning. https://lnkd.in/ePXKY2yR

Instead of creating one huge policy that can handle all possible skills, SurgIRL lets a robot incrementally build a growing library of prior skills and learn to reuse and compose them when facing new tasks. We demonstrated that this incremental learning approach can handle multiple simulated surgical tasks and then transfer policies to reality on a da Vinci Research Kit (dVRK). Some of these tasks are really hard, like precision grasping of suture needles, and some are really diverse, like camera tracking, all composed under the same architecture. A key benefit is that the framework retains knowledge of skills in their original form, allowing flexible, natural updates. Huge kudos to the team, Yun-Jie Ho, Zih-Yun (Sarah) Chiu, and Yuheng Zhi, for driving this forward. I'm hoping for continued research into life-long learning in robot manipulation; we're not entirely there yet, but this is a great step forward in my opinion!
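The core idea of an incrementally growing, composable skill library can be illustrated with a toy class. This is our reading of the idea, not SurgIRL's actual architecture; the class name, the callable-as-policy representation, and the example skills are all invented for illustration.

```python
class SkillLibrary:
    """Minimal sketch of incremental skill accumulation: skills are
    retained in their original form and reused/composed on new tasks,
    instead of retraining one monolithic policy."""

    def __init__(self):
        self.skills = {}               # name -> policy callable, kept as-is

    def add(self, name, policy):
        """Incrementally grow the library without touching prior skills."""
        self.skills[name] = policy

    def compose(self, *names):
        """Chain prior skills into a policy for a longer-horizon task."""
        def policy(state):
            for n in names:
                state = self.skills[n](state)
            return state
        return policy

# Toy "policies" that append to a state trace, standing in for learned ones.
lib = SkillLibrary()
lib.add("reach", lambda s: s + ["reached"])
lib.add("grasp_needle", lambda s: s + ["grasped"])
suturing = lib.compose("reach", "grasp_needle")
print(suturing([]))  # ['reached', 'grasped']
```

Because each entry stays in its original form, a skill can later be replaced or fine-tuned independently, which is the flexibility the post emphasizes.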
⭐️ We're releasing a comprehensive, hands-on recipe for teaching robots to fold clothes 🤟 … a 25-min read with a full breakdown of modern end-to-end robot learning, from hardware to training to evaluation, all open-sourced with LeRobot and Hugging Face 🤗.

→ Built from 131 hours of teleoperation data, 5k+ GPU hours, 8 robot setups, and a set of practical findings we didn't expect 👀

We trained language-conditioned vision-action policies for bimanual cloth folding, reaching 90% success on arbitrary t-shirts on real hardware. But the most interesting result wasn't the model. With architecture and training held fixed, performance moved from 40% → 90% almost entirely by changing the data:
– making demonstrations more consistent (same strategy each time)
– selecting higher-quality trajectories instead of using everything
– giving the model a notion of "progress" through the task (SARM)
– adding examples of how to recover from mistakes (DAgger-style)

This suggests a useful lens: for long-horizon, contact-rich tasks, we are not yet model-limited. Performance depends heavily on how we structure and supervise interaction data over time. Concretely:
– consistency helps more than showing many different ways of doing the task
– learning which parts of a trajectory matter is more important than treating every step equally
– teaching the model how to recover from failure is as important as showing successful executions

We wrote this up as a detailed, reproducible system for others to build on. Hope it's useful if you're working on real-world robot learning. Blog: https://lnkd.in/dW_8JKD9
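Two of the data-side levers above (consistency filtering and a per-step progress signal) can be sketched as a small curation function. This is a hypothetical illustration, not the released pipeline: the quality scores, strategy labels, and the linear progress annotation (loosely in the spirit of SARM) are all made-up stand-ins.

```python
def curate(demos, min_quality, strategy):
    """Keep only high-quality demos that follow one consistent strategy,
    then annotate each step with a coarse 0..1 task-progress signal."""
    kept = [d for d in demos
            if d["quality"] >= min_quality and d["strategy"] == strategy]
    for d in kept:
        n = len(d["steps"])
        d["progress"] = [i / (n - 1) if n > 1 else 1.0 for i in range(n)]
    return kept

# Three toy demos: one good + consistent, one low-quality, one off-strategy.
demos = [{"quality": 0.9, "strategy": "flat_fold", "steps": [0, 1, 2]},
         {"quality": 0.4, "strategy": "flat_fold", "steps": [0, 1]},
         {"quality": 0.8, "strategy": "roll_fold", "steps": [0, 1]}]

print(len(curate(demos, min_quality=0.7, strategy="flat_fold")))  # 1
```

The point of the sketch is that curation happens before training: the model architecture never changes, only which trajectories it sees and how each step is supervised.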
A new approach just solved a task with over 1 million steps perfectly, according to the authors. Even state-of-the-art reasoning models inevitably make errors when chaining their capabilities into extended processes; recent experiments showed that after a few hundred steps, the process becomes derailed. This fundamental reliability problem has been blocking LLMs from executing the kind of large-scale tasks that organizations and societies routinely perform.

Researchers at Cognizant AI Lab and UT Austin just demonstrated MAKER, a system that successfully completed over one million LLM steps with zero errors. Their approach might sound counterintuitive: instead of relying on increasingly intelligent base models, they achieve reliability through extreme decomposition and error correction. MAKER breaks tasks into minimal subtasks, each handled by a focused microagent, and applies multi-agent voting at each step. This modular approach enables effective error correction that scales logarithmically with task length.

The team provides formal scaling laws showing why this works: under maximal decomposition, the system scales log-linearly (Θ(s log s)) rather than failing exponentially like traditional approaches. Relatively small non-reasoning models like GPT-4.1-mini prove more cost-effective than advanced reasoning models for this architecture. When each agent focuses on a single tiny step, raw reasoning power matters less than reliability and cost efficiency.

This might open an alternative path to AI scaling beyond building ever-larger models. By decomposing intelligence into millions of coordinated pieces, we might build systems that are not just more capable, but fundamentally more reliable and controllable.
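The per-step voting idea can be shown with a toy chain. This is not MAKER's actual implementation (the post does not expose its interfaces); the agents here are plain functions, and "voting" is a simple majority over their answers, but it illustrates how a faulty agent gets outvoted at every tiny step instead of derailing the whole chain.

```python
from collections import Counter

def voted_step(microagents, subtask, votes=3):
    """Sample several answers for one minimal subtask and keep the
    majority: per-step error correction in the MAKER spirit."""
    answers = [agent(subtask) for agent in microagents[:votes]]
    return Counter(answers).most_common(1)[0][0]

def run_chain(subtasks, microagents):
    """Extreme decomposition: one focused microagent call per tiny step,
    with voting-based error correction applied at every step."""
    return [voted_step(microagents, s) for s in subtasks]

# Two reliable agents outvote one faulty agent at each step.
good = lambda s: s * 2
agents = [good, good, lambda s: -1]
print(run_chain([1, 2, 3], agents))  # [2, 4, 6]
```

With independent per-step error rates, adding a few votes per step drives the chance of an uncorrected error at any single step low enough that even million-step chains can stay error-free, which is the intuition behind the Θ(s log s) scaling the authors report.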