Robotics Manipulation Using Partial Scene Data


Summary

Robotics manipulation using partial scene data means enabling robots to interact with their environment even when they have only incomplete information about their surroundings: objects may be hidden, scenes cluttered, or parts of the workspace outside the camera's view. Recent advances combine 3D scene modeling, neural network policies, and video-based prediction to help robots plan, learn, and act in real-world settings despite this limited data.

  • Embrace scene completion: Using AI-driven tools, robots can reconstruct full 3D models from partial images, allowing them to make smarter decisions in cluttered or occluded environments.
  • Utilize predictive modeling: By imagining possible future outcomes based on incomplete data, robots can choose actions that are likely to achieve the desired results, even in unfamiliar situations (a minimal sketch of this idea follows this list).
  • Scale data creatively: Generating synthetic demonstrations and augmenting real-world data gives robots a wider range of experiences, improving their ability to handle new objects and tasks with limited firsthand information.
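
To make the predictive-modeling bullet concrete, here is a minimal sketch of model-based action selection: sample candidate actions, imagine each outcome with a dynamics model, and keep the best. The `predict` and `score_goal` callables are hypothetical placeholders, not any specific system's API.

```python
import numpy as np

def select_action(state, candidate_actions, predict, score_goal):
    """Pick the action whose imagined outcome best matches the goal.

    `predict(state, action)` stands in for a learned or heuristic
    dynamics model; `score_goal(state)` for a task objective. Both are
    hypothetical placeholders, not a specific library's API.
    """
    best_action, best_score = None, -np.inf
    for action in candidate_actions:
        predicted_state = predict(state, action)  # imagine the outcome
        score = score_goal(predicted_state)       # closeness to the goal
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```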
  • Wenlong Huang

    CS PhD Student at Stanford (AI / Robotics)


    What representation enables open-world robot manipulation from generated videos? Introducing Dream2Flow, our recent work that bridges video generation and robot control with 3D object flow. 🌐 dream2flow.github.io by Stanford University

    🔹 Robot manipulation is about inducing changes in an environment through actions. We observe that video models (e.g., Veo) excel at producing plausible object motions from an in-the-wild image and language instructions. Intriguingly, these motions are more physically realistic when the actor is human rather than a robot, likely because the internet contains far more human interaction data than robot data.

    🔹 But how do we turn those generated videos into low-level robot actions? This is a nuanced question that goes beyond simple retargeting, because strategies taken by a human may not work on a robot.

    🔹 We propose Dream2Flow, which uses 3D object flow to separate what should happen in the scene from how a robot should realize it. We extract this flow from generated videos using off-the-shelf vision models, then use it as a shared objective for both trajectory optimization and reinforcement learning.

    🔹 Dream2Flow can perform a range of in-the-wild tasks zero-shot with trajectory optimization, including manipulation of rigid, articulated, and deformable objects. The robot plans by asking a counterfactual question using a dynamics model (either heuristics-based or learned): if I take this action, will the scene evolve toward the desired 3D flow?

    🔹 Used as a reward for RL, Dream2Flow enables different embodiments to discover emergent behaviors that achieve the same effect (e.g., base motion of the robot dog). Dream2Flow unifies these behaviors through a shared task interface and unifies model-free and model-based methods around a shared tracking goal.

    🔹 By leveraging purely off-the-shelf video models, Dream2Flow also generalizes to different object instances, backgrounds, and camera viewpoints. It is also surprisingly steerable: different language instructions in the same scene can induce different desired behaviors.

    🔹 World modeling encodes rich priors about not only environment dynamics but also behaviors within it. It is immensely useful for robotics, yet we are only scratching the surface of understanding it.

    The project was led by Karthik Dharmarajan and has been a year in the making, along with the rest of the team: Jiajun Wu, Fei-Fei Li, and Ruohan Zhang. Karthik Dharmarajan will also be joining UC Berkeley as a PhD student this fall!

    Website: dream2flow.github.io
    Paper: https://lnkd.in/gpwP2hkT
    Code: https://lnkd.in/gvJZTxaP
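
To make the flow-as-objective idea concrete, here is a minimal sketch (not the authors' implementation) of using a desired 3D object flow as a shared cost for random-shooting trajectory optimization. The `rollout` and `sample_actions` callables are hypothetical stand-ins for the post's heuristics-based or learned dynamics model and an action sampler.

```python
import numpy as np

def flow_tracking_cost(predicted_points, desired_flow):
    """Mean distance between predicted and desired 3D keypoint tracks.

    predicted_points, desired_flow: (T, N, 3) arrays giving N object
    keypoints over T timesteps.
    """
    return np.linalg.norm(predicted_points - desired_flow, axis=-1).mean()

def plan_by_shooting(state, desired_flow, rollout, sample_actions, n_samples=256):
    """Random-shooting trajectory optimization against the flow objective.

    `rollout(state, actions) -> (T, N, 3)` is a hypothetical interface
    to a dynamics model that predicts how the object keypoints move.
    """
    best_actions, best_cost = None, np.inf
    for _ in range(n_samples):
        actions = sample_actions()
        cost = flow_tracking_cost(rollout(state, actions), desired_flow)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions
```

The same tracking cost, negated, could serve as the dense reward the post describes for reinforcement learning across embodiments.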

  • Murtaza Dalal

    Robotics ML Engineer @ Tesla Optimus | CMU Robotics PhD


    Can a single neural network policy generalize over poses, objects, obstacles, backgrounds, scene arrangements, in-hand objects, and start/goal states? Introducing Neural MP: a generalist policy for solving motion planning tasks in the real world 🤖

    Quickly and dynamically moving around and between obstacles (motion planning) is a crucial skill for robots that manipulate the world around us. Traditional methods (sampling, optimization, or search) can be slow and/or require strong assumptions to deploy in the real world. Instead of solving each new motion planning problem from scratch, we distill knowledge across millions of problems into a generalist neural network policy.

    Our approach: 1) large-scale procedural scene generation, 2) multi-modal sequence modeling, 3) test-time optimization for safe deployment.

    Data generation involves: 1) sampling programmatic assets (shelves, microwaves, cubbies, etc.), 2) adding realistic objects from Objaverse, 3) generating data at scale using a motion planner expert (AIT*): 1M demos! We distill all of this data into a single, generalist policy.

    Neural policies can hallucinate just like ChatGPT, which might not be safe to deploy! Our solution: using the robot SDF, optimize for paths that have the least intersection of the robot with the scene. This technique improves deployment-time success rate by 30-50%!

    Across 64 real-world motion planning problems, Neural MP drastically outperforms prior work, beating SOTA sampling-based planners by 23%, trajectory optimizers by 17%, and learning-based planners by 79%, achieving an overall success rate of 95.83%.

    Neural MP extends directly to unstructured, in-the-wild scenes! From defrosting meat in the freezer and doing the dishes to tidying the cabinet and drying the plates, Neural MP does it all!

    Neural MP generalizes gracefully to OOD scenarios as well. The sword in the first video is double the size of any in-hand object in the training set! Meanwhile, the model has never seen anything like the bookcase during training, but it's still able to safely and accurately place books inside it.

    Since we train a closed-loop policy, Neural MP can perform dynamic obstacle avoidance as well! First, Jim tries to attack the robot with a sword, but it has excellent dodging skills. Then, he adds obstacles dynamically while the robot moves, and it's still able to safely reach its goal.

    This work is the culmination of a year-long effort at Carnegie Mellon University with co-lead Jiahui (Jim) Yang as well as Russell Mendonca, Youssef Khaky, Russ Salakhutdinov, and Deepak Pathak.

    The model and hardware deployment code is open-sourced and on Hugging Face! Run Neural MP on your robot today, check out the following:
    Web: https://lnkd.in/emGhSV8k
    Paper: https://lnkd.in/eGUmaXKh
    Code: https://lnkd.in/e6QehB7R
    News: https://lnkd.in/enFWRvft
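
The test-time safety step described in the post can be pictured with a small sketch: sample several paths from the policy, score each by how deeply scene points penetrate the robot's SDF, and execute the least-colliding one. The `robot_sdf` interface is a hypothetical stand-in; the paper's actual optimization procedure may differ.

```python
import numpy as np

def penetration_cost(path, scene_points, robot_sdf):
    """Sum of penetration depths of scene points into the robot.

    `robot_sdf(q, points) -> (P,)` is a hypothetical interface returning
    the signed distance of each scene point to the robot surface at
    joint configuration q (negative means inside the robot).
    """
    cost = 0.0
    for q in path:
        d = robot_sdf(q, scene_points)
        cost += np.clip(-d, 0.0, None).sum()  # penalize penetration only
    return cost

def safest_path(policy_samples, scene_points, robot_sdf):
    """Test-time selection: keep the sampled path that intersects least."""
    return min(policy_samples,
               key=lambda path: penetration_cost(path, scene_points, robot_sdf))
```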

  • Rangel Isaías Alvarado Walles

    Robotics & AI Engineer | AI Engineer | Machine Learning | Deep Learning | Computer Vision | Agentic AI | Reinforcement Learning | Self-Driving Cars | IoT | IIoT | AIOps | MLOps | LLMOps | DevOps | Cloud | Edge AI


    Real2Render2Real – Scaling Robot Data Without Dynamics Simulation or Robot Hardware
    ArXiv:
    Project: real2render2real.com

    As robots move toward general-purpose manipulation in unstructured environments, collecting large and diverse training data remains a major bottleneck. Enter Real2Render2Real (R2R2R): a framework that scales robot data generation from just a smartphone scan and one human video, with no teleoperation, robot hardware, or physics simulation needed. R2R2R generates thousands of realistic, robot-agnostic demonstrations via 3D Gaussian Splatting and differential inverse kinematics, then trains models that match the performance of human teleoperation-based learning, at 27× the throughput.

    🧠 Key Concepts

    1️⃣ Real-to-Synthetic Pipeline
    • Input: a multi-view smartphone scan and a monocular human demo video
    • Extract 3D object shape via 3D Gaussian Splatting
    • Track 6-DoF object motion with 4D-DPM
    • Render thousands of synthetic trajectories in photorealistic scenes using IsaacLab

    2️⃣ One-to-Many Demonstration Scaling
    • Interpolate and augment object trajectories for new object placements
    • Use analytic grasp generation for diverse valid grasps
    • Generate robot joint-space trajectories via inverse kinematics
    • Supports rigid and articulated objects with automatic part segmentation

    3️⃣ No Physics, No Robots, No Problem
    • No force modeling, torque computation, or simulation dynamics
    • Robot arms are treated as kinematic bodies, sidestepping collision models
    • Policies trained only on R2R2R data match those trained on 150 real teleop demos

    ⚙️ How to Implement R2R2R

    Phase 1 – Real-to-Sim Extraction
    • Scan object → reconstruct with 3DGS → meshify with GARField
    • Track object motion from video → extract part-level 6-DoF trajectories

    Phase 2 – Trajectory Diversification
    • Interpolate trajectories to adapt to random poses using Slerp (see the sketch after this post)
    • Estimate grasps from hand-object proximity
    • Generate IK trajectories with the PyRoki solver under smoothness and joint-limit constraints

    Phase 3 – Parallelized Rendering
    • Render RGB frames and action data with IsaacLab
    • Apply domain randomization: camera pose, lighting, table textures
    • Output: RGB + proprioception + actions → usable for VLA, π0-FAST, Diffusion Policy

    ✅ Advantages
    • ⚡ 27× faster than human teleop: 51 demos/min on 1 GPU
    • 🧠 No physics or robot needed: no dynamics engine or torque simulation
    • 🎥 Generalizes from 1 video: thousands of demos from a single example
    • 🔧 Robot-agnostic: compatible with any robot URDF
    • 🎯 High performance: matches or surpasses real demos in 5 real-world tasks
    • 📦 Works with π0-FAST, Diffusion Policy, VLA: drop-in for modern imitation learners

    🛠 Applications
    • Vision-Language-Action (VLA) Model Training
    • Robot Learning at Scale Without Robots
    • Augmenting Real Datasets with Rich Visual Diversity
    • Tool Learning, Multi-Object Interaction, Bimanual Tasks

    Follow me to know more about AI, ML and Robotics.
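
As referenced in Phase 2 above, here is a simplified sketch of Slerp-based trajectory adaptation: re-anchor a demonstrated object trajectory at a new start pose, then blend back onto the original motion, with linear interpolation for positions and Slerp for orientations. This assumes SciPy and illustrates the general idea, not the R2R2R implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def adapt_to_new_start(demo_pos, demo_rot, new_pos0, new_rot0, blend_steps=20):
    """Warp a demonstrated 6-DoF object trajectory to a new start placement.

    demo_pos: (T, 3) float positions; demo_rot: scipy Rotation of length T.
    Blends from the re-anchored demo back onto the original trajectory
    over `blend_steps`: lerp for position, Slerp for orientation.
    """
    T = len(demo_pos)
    shifted = demo_pos - demo_pos[0] + new_pos0   # demo re-anchored at new start
    rot_off = new_rot0 * demo_rot[0].inv()        # rotation offset at the start
    out_pos = np.empty_like(demo_pos)
    out_rot = []
    for t in range(T):
        w = min(t / blend_steps, 1.0)             # 0 at start -> 1 after blending
        out_pos[t] = (1.0 - w) * shifted[t] + w * demo_pos[t]
        keys = Rotation.concatenate([rot_off * demo_rot[t], demo_rot[t]])
        out_rot.append(Slerp([0.0, 1.0], keys)(np.array([w]))[0])
    return out_pos, Rotation.concatenate(out_rot)
```

At t = 0 this reproduces the new placement exactly, and after `blend_steps` steps it follows the original demonstration, so the object still reaches the demonstrated goal motion.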

  • Aditya Agarwal

    PhD student at MIT


    🌟 Excited to share our latest work, SceneComplete: an open-world 3D scene completion system that constructs a complete, segmented 3D model of complex scenes from a single RGB-D image.

    🤖 Designed for real-world robotic applications, SceneComplete enables dexterous grasping and robust manipulation, even in highly cluttered environments.

    🛠 The Challenge: Traditional methods struggle with occlusions, noisy depth data, and diverse object configurations. SceneComplete addresses these issues by generating high-quality 3D meshes, providing a complete view of the scene even with dense clutter and partial observability.

    🚀 How SceneComplete Works: Our approach composes general-purpose pretrained perception modules into a powerful system.
    1️⃣ Vision-language models to describe objects in the scene
    2️⃣ Grounded segmentation to localize them
    3️⃣ Image inpainting to fill in occluded parts in 2D
    4️⃣ Image-to-3D models to reconstruct full object meshes
    5️⃣ Scaling and pose estimation to finalize the 3D scene reconstruction via registration

    🎓 Fun collaboration with Gaurav Singh and Bipasha Sen at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).

    👉 Check out our full paper and website for more on SceneComplete.
    🔗 Paper: https://lnkd.in/d6sKdaxQ
    🌐 Website: https://lnkd.in/dFkb_vR6
    📽️ Video: https://lnkd.in/dbQF_HCy
    [Update] Code: https://lnkd.in/e32demte
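
As an illustration of the final step (5️⃣), here is a minimal sketch of scale-and-pose registration of a reconstructed object mesh against its partial depth point cloud, using Open3D's off-the-shelf ICP. SceneComplete's actual registration procedure may differ; this only shows the general recipe.

```python
import numpy as np
import open3d as o3d

def register_mesh_to_partial(mesh, partial_pcd, voxel=0.005):
    """Estimate scale and rigid pose of a reconstructed mesh against a
    partial point cloud (a sketch, not SceneComplete's exact method).
    """
    # Sample the mesh surface so both sides are point clouds
    mesh_pcd = mesh.sample_points_uniformly(number_of_points=5000)

    # Rough scale from bounding-box diagonals (the partial cloud
    # underestimates extent, so this is only an initialization)
    s = (np.linalg.norm(partial_pcd.get_max_bound() - partial_pcd.get_min_bound())
         / np.linalg.norm(mesh_pcd.get_max_bound() - mesh_pcd.get_min_bound()))
    mesh_pcd.scale(s, center=mesh_pcd.get_center())

    # Center-align, then refine the rigid pose with point-to-point ICP
    mesh_pcd.translate(partial_pcd.get_center() - mesh_pcd.get_center())
    result = o3d.pipelines.registration.registration_icp(
        mesh_pcd, partial_pcd,
        max_correspondence_distance=10 * voxel,
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return s, result.transformation
```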
