Can my robot cook my food, tidy my messy table, rearrange my dresser and do much, much more without ANY demos or real-world training data? Introducing ManipGen: a generalist agent for manipulation that can solve long-horizon robotics tasks entirely zero-shot, from text input!
Key idea: many manipulation tasks of interest can be decomposed into two phases: contact-free reaching (aka motion planning!) and contact-rich local interaction. The latter is hard to learn, so we take a sim2real transfer approach! We define local policies, which operate in a local region around an object of interest. They are uniquely well suited to generalization and sim2real transfer because they are invariant to 1) absolute pose, 2) skill order, and 3) environment configuration.
As an overview, our approach 1) acquires generalist behaviors for local skills at scale using RL, 2) distills these behaviors into visuomotor policies using multitask DAgger, and 3) deploys local policies in the real world using VLMs and motion planning.
Phase 1: Train state-based, single-object policies to acquire skills such as picking, placing, opening and closing. We train policies using PPO across thousands of objects, designing reward and observation spaces for efficient learning and effective sim2real transfer.
Phase 2: We need visuomotor policies to deploy on robots! We distill the single-object experts into multi-task policies using online imitation learning (aka DAgger). These policies observe local visual (wrist-camera) input with edge and hole augmentation to match real-world depth noise.
To deploy local policies in the real world, we decompose the task into components (GPT-4o), estimate where to go using Grounded SAM, and motion plan using Neural MP. For control, we use IndustRealLib from NVIDIA, an excellent library for sim2real transfer!
ManipGen can solve long-horizon tasks in the real world entirely zero-shot, generalizing across objects, poses, environments and scene configurations! We outperform SOTA approaches such as SayCan, OpenVLA, LLMTrajGen and VoxPoser across 50 tasks by 36%, 76%, 62% and 60%!
ManipGen exhibits exciting capabilities such as performing manipulation in tight spaces and in clutter, entirely zero-shot! From putting items on a shelf, to carefully extracting the red pepper from clutter, to putting large items in drawers, ManipGen is quite capable.
By training local policies at scale on thousands of objects, ManipGen generalizes to some pretty challenging out-of-distribution objects that don't look anything like what was in training, such as the pliers and clamps, as well as deformable objects such as the wire.
This work was done at Carnegie Mellon University Robotics Institute, with co-lead Min Liu, as well as Deepak Pathak and Russ Salakhutdinov, and in collaboration with Walter Talbott, Chen Chen, Ph.D., and Jian Zhang from Apple. Paper, videos and code (coming soon!) at https://lnkd.in/ekjWPXHM
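For a concrete picture of how these pieces fit together at deployment time, here is a minimal sketch of the kind of loop the post describes: decompose the instruction with a VLM, localize the target, motion-plan the contact-free reach, then hand off to a local policy for the contact-rich part. All names and helper functions below are hypothetical placeholders, not the actual ManipGen code (which is listed as coming soon).

```python
"""Hypothetical sketch of a ManipGen-style zero-shot deployment loop.
Every helper here is an illustrative stand-in: the post uses GPT-4o for
task decomposition, Grounded SAM for localization, Neural MP for motion
planning, and learned local policies for contact-rich interaction."""

MAX_POLICY_STEPS = 200

def vlm_decompose(instruction):
    # Stand-in for VLM task decomposition into (skill, object) subtasks.
    return [("pick", "red pepper"), ("place", "bowl")]

def locate_object(observation, object_name):
    # Stand-in for open-vocabulary localization; returns a 3D target position.
    return (0.5, 0.0, 0.2)

def plan_motion(current_pose, goal_pose):
    # Stand-in for a collision-free motion planner.
    return [current_pose, goal_pose]  # dummy two-waypoint "trajectory"

class LocalPolicy:
    """Stand-in for a distilled visuomotor local policy (wrist-camera input)."""
    def __call__(self, wrist_depth, proprio):
        return [0.0] * 7  # dummy end-effector action
    def done(self):
        return True

local_policies = {"pick": LocalPolicy(), "place": LocalPolicy()}

def execute_instruction(instruction, robot):
    for skill, obj in vlm_decompose(instruction):
        # Contact-free phase: reach a pre-contact pose near the object.
        target = locate_object(robot.rgbd(), obj)
        trajectory = plan_motion(robot.pose(), target)
        robot.follow(trajectory)

        # Contact-rich phase: hand off to the local policy, which only sees
        # local observations and is therefore invariant to absolute pose,
        # skill order, and scene configuration.
        policy = local_policies[skill]
        for _ in range(MAX_POLICY_STEPS):
            robot.apply(policy(robot.wrist_depth(), robot.proprio()))
            if policy.done():
                break
```

The key point is the hand-off: the motion planner owns the global, contact-free reach, while each local policy only ever sees wrist-local observations, which is exactly what makes it invariant to absolute pose and scene layout.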
Methods for Combining Robot Skills
Summary
Methods for combining robot skills refer to strategies that allow robots to perform complex tasks by blending different abilities, such as movement, manipulation, and perception, in a coordinated way. These approaches help robots adapt to new situations, switch between tasks, and operate reliably in unfamiliar environments without retraining for each new challenge.
- Build general skills: Focus on developing versatile abilities that can serve as a foundation for multiple tasks, making it easier for robots to adapt as demands change.
- Decompose tasks: Break complex jobs into smaller, manageable components so the robot can use its existing skills in sequence to achieve long-term goals.
- Integrate sensory input: Combine visual and tactile information to help the robot focus on relevant objects and actions, improving decision-making in cluttered or unfamiliar settings.
Humanoid robots need to adapt to different tasks, like moving around, handling objects while walking, and working at tables, each requiring a unique way to control the robot's body. For instance, locomotion focuses on tracking how fast the robot's base is moving, while tabletop work relies more on controlling the robot's arm movements. Many current methods train robots with task-specific controllers, making it hard for them to switch between tasks smoothly.
This new approach uses whole-body motion imitation as a common base that can work for all tasks, helping robots learn general skills that apply to different types of control. With this idea, researchers developed HOVER (Humanoid Versatile Controller), a system that combines different control modes into one shared setup. HOVER allows robots to switch between tasks without losing the strengths needed for each one, making humanoid control easier and more flexible. This removes the need to retrain the robot for each task, making it more efficient and adaptable for future uses.
The diverse team of researchers behind HOVER comes from NVIDIA, Carnegie Mellon University, University of California, Berkeley, The University of Texas at Austin, and UC San Diego.
📝 Research Paper: https://lnkd.in/eMatAxMu
📊 Project Page: https://lnkd.in/eY4gzmme
#robotics #research
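One way to picture the "one shared setup" idea is a single command space with per-mode masks: each task activates only the command components it cares about, and the same policy tracks whatever is unmasked. The specific command components and mask layout below are illustrative assumptions, not HOVER's actual interface.

```python
import numpy as np

# Hypothetical illustration of one shared command space covering multiple
# control modes. The components and masks are assumptions for illustration.
COMMAND_DIM = {"root_velocity": 3, "arm_joint_targets": 7, "base_heading": 1}

def build_command(active_components, values):
    """Pack a full command vector plus a binary mask saying which parts
    the policy should track for the current task/mode."""
    cmd, mask = [], []
    for name, dim in COMMAND_DIM.items():
        vals = values.get(name, np.zeros(dim))
        cmd.append(np.asarray(vals, dtype=np.float32))
        mask.append(np.full(dim, float(name in active_components), dtype=np.float32))
    return np.concatenate(cmd), np.concatenate(mask)

# Locomotion mode: only track base velocity; arm components are masked out.
loco_cmd, loco_mask = build_command(
    {"root_velocity"}, {"root_velocity": [0.5, 0.0, 0.0]})

# Tabletop mode: only track arm joint targets; base components are masked out.
table_cmd, table_mask = build_command(
    {"arm_joint_targets"}, {"arm_joint_targets": np.zeros(7)})

# A single policy conditioned on (observation, command, mask) can then switch
# modes at runtime simply by changing the mask, with no retraining per task.
```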
Just collecting manipulation data isn't enough for robots: they need to be able to move around in the world, which has a whole different set of challenges from pure manipulation. And bringing navigation and manipulation together in a single framework is even more challenging.
Enter HERMES, from Zhecheng Yuan and Tianming Wei. This is a four-stage process in which human videos are used to set up an RL sim-to-real training pipeline that overcomes differences between robot and human kinematics, used together with a navigation foundation model to move around in a variety of environments.
To learn more, join us as Zhecheng Yuan and Tianming Wei tell us how they built their system to perform mobile dexterous manipulation from human videos in a variety of environments. Watch Episode #45 of RoboPapers today, hosted by Michael Cho and Chris Paxton!
Abstract: Leveraging human motion data to impart robots with versatile manipulation skills has emerged as a promising paradigm in robotic manipulation. Nevertheless, translating multi-source human hand motions into feasible robot behaviors remains challenging, particularly for robots equipped with multi-fingered dexterous hands characterized by complex, high-dimensional action spaces. Moreover, existing approaches often struggle to produce policies capable of adapting to diverse environmental conditions. In this paper, we introduce HERMES, a human-to-robot learning framework for mobile bimanual dexterous manipulation. First, HERMES formulates a unified reinforcement learning approach capable of seamlessly transforming heterogeneous human hand motions from multiple sources into physically plausible robotic behaviors. Subsequently, to mitigate the sim2real gap, we devise an end-to-end, depth image-based sim2real transfer method for improved generalization to real-world scenarios. Furthermore, to enable autonomous operation in varied and unstructured environments, we augment the navigation foundation model with a closed-loop Perspective-n-Point (PnP) localization mechanism, ensuring precise alignment of visual goals and effectively bridging autonomous navigation and dexterous manipulation. Extensive experimental results demonstrate that HERMES consistently exhibits generalizable behaviors across diverse, in-the-wild scenarios, successfully performing numerous complex mobile bimanual dexterous manipulation tasks.
Project Page: https://lnkd.in/e-aEbQzn
ArXiv: https://lnkd.in/eemU6Pwa
Watch/listen:
Youtube: https://lnkd.in/erzbkYjz
Substack: https://lnkd.in/e3ea76Q8
Ep#45: HERMES: Human-to-Robot Embodied Learning From Multi-Source Motion Data for Mobile Dexterous Manipulation
robopapers.substack.com
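The closed-loop PnP localization mentioned in the abstract can be sketched with standard tools: given known 3D keypoints on the visual goal and their detected 2D pixel locations, solvePnP recovers the goal's pose relative to the camera, and the base is corrected until the residual error is small. The keypoints, camera intrinsics, and tolerance below are placeholders, not the HERMES implementation.

```python
import numpy as np
import cv2

# Hypothetical sketch of a closed-loop PnP alignment step: estimate the
# camera-to-goal pose from known 3D keypoints on the target and their
# detected 2D pixel locations, then nudge the base and repeat.

# Known 3D keypoints on the visual goal, in the goal frame (meters).
object_points = np.array([
    [-0.05, -0.05, 0.0],
    [ 0.05, -0.05, 0.0],
    [ 0.05,  0.05, 0.0],
    [-0.05,  0.05, 0.0],
], dtype=np.float64)

# Example pinhole intrinsics and zero distortion -- placeholders.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0,   0.0,   1.0]])
dist = np.zeros(5)

def estimate_goal_pose(image_points):
    """Solve PnP: where is the goal relative to the camera right now?"""
    ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
    if not ok:
        return None
    return rvec, tvec  # rotation (Rodrigues vector) and translation, camera frame

def aligned(tvec, tol=0.03):
    # Closed-loop use: re-detect keypoints, re-estimate the pose, command a
    # small base correction, and stop once the translation error is below tol.
    return float(np.linalg.norm(tvec)) < tol
```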
New paper: Compose by Focus. We make robot visuomotor skills robustly composable by focusing perception on the task-relevant parts of a scene using 3D scene graphs, then learning a single policy over those structured inputs.
Motivation: planners (VLM/TAMP) can break a long task into sub-goals, but visuomotor policies often crumble in cluttered, novel scenes. The issue isn't the planner; it's the policy's brittle visual processing under distribution shift. We argue skills must be focused, attending only to relevant objects and relations.
Method in a nutshell: for each sub-goal, we build a dynamic sub-scene graph. Grounded SAM segments the relevant objects and a VLM infers their relations; a GNN encodes the graph; CLIP encodes the skill text; a diffusion policy is conditioned on both the language and graph features; and a VLM planner sequences the sub-goals.
Why it matters: focusing on relevant objects and relations mitigates distribution shift, enabling reliable multi-skill composition in visually complex scenes. Scene graphs also form a natural interface between high-level planning and low-level control, easing long-horizon execution and reducing data demands.
Project & paper: https://lnkd.in/eGv9JQ5n
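As a rough illustration of the conditioning path described above, the sketch below encodes a small sub-scene graph with one round of message passing, pools it into a graph feature, and concatenates it with a text feature before it would be fed to the policy. The dimensions, the hand-rolled message-passing layer, and the random stand-in for the CLIP text embedding are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of the conditioning path: encode a sub-scene graph with a
# small message-passing network, encode the skill text, and concatenate both
# into a single conditioning vector for the (diffusion) policy head.

class SceneGraphEncoder(nn.Module):
    def __init__(self, node_dim=32, hidden=64, out_dim=128):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * node_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, node_dim))
        self.readout = nn.Linear(node_dim, out_dim)

    def forward(self, nodes, adj):
        # nodes: (N, node_dim) object features; adj: (N, N) relation mask.
        n = nodes.shape[0]
        senders = nodes.unsqueeze(0).expand(n, n, -1)     # message sources
        receivers = nodes.unsqueeze(1).expand(n, n, -1)   # message targets
        messages = self.msg(torch.cat([receivers, senders], dim=-1))
        nodes = nodes + (adj.unsqueeze(-1) * messages).sum(dim=1)
        return self.readout(nodes.mean(dim=0))            # pooled graph feature

def build_condition(graph_feat, text_feat):
    # The policy is conditioned on the concatenated graph + language features.
    return torch.cat([graph_feat, text_feat], dim=-1)

# Toy usage: 3 relevant objects, fully connected relations, dummy text feature.
nodes = torch.randn(3, 32)
adj = torch.ones(3, 3)
text_feat = torch.randn(512)   # stand-in for a CLIP text embedding
cond = build_condition(SceneGraphEncoder()(nodes, adj), text_feat)
print(cond.shape)  # torch.Size([640])
```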