VLA models are systems that combine three capabilities into one framework: seeing the world through cameras, understanding natural language instructions like "pick up the red apple," and generating the actual motor commands to make a robot do it. Before these unified models existed, robots had separate modules for vision, language, and movement that were stitched together with manual engineering, which made them brittle and unable to handle new situations. This review paper covers over 80 VLA models published in the past three years, organizing them into a taxonomy based on their architectures—some use a single end-to-end network, others separate high-level planning from low-level control, some use diffusion models for smoother action sequences. The paper walks through how these models are trained using both internet data and robot demonstration datasets, then maps out where they're being applied. The later sections lay out the concrete technical problems that remain unsolved. Read online with an AI tutor: https://lnkd.in/eZdzYfdu PDF: https://lnkd.in/ezzncewE
How Robots Plan Tasks Using Visual Processing
Explore top LinkedIn content from expert professionals.
Summary
Robots plan tasks using visual processing by interpreting visual data—like images and videos—to understand their surroundings and decide what actions to take. This approach combines computer vision and AI, allowing robots to translate instructions into real-world movements without relying solely on text or manual programming.
- Combine vision and language: Integrate camera feeds and natural language instructions so robots can recognize objects and follow commands like “pick up the red apple” (a minimal sketch of this interface follows the list).
- Use video for learning: Train robots with video footage showing how tasks should be performed, helping them mimic human actions and understand physical motion.
- Adapt to new environments: Enable robots to update their plans in real time as they encounter unfamiliar objects or spaces, ensuring they stay flexible and reliable.
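To make the "camera feed plus instruction in, motor command out" pattern concrete, here is a minimal sketch of a VLA-style policy in PyTorch. Everything in it is illustrative: the class name, the tiny encoders, and the 7-dimensional action are stand-ins, not the architecture of any particular model discussed below.

```python
# Hypothetical sketch of a vision-language-action (VLA) policy interface.
# All names and dimensions are illustrative, not from any specific model.
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    """Maps an RGB image + a tokenized instruction to a low-level action."""
    def __init__(self, vocab_size=1000, action_dim=7):
        super().__init__()
        # Vision encoder: a tiny CNN standing in for a pretrained ViT/CLIP backbone.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64),
        )
        # Language encoder: an embedding bag standing in for an LLM.
        self.language = nn.EmbeddingBag(vocab_size, 64)
        # Action head: fuses both modalities and emits a continuous command
        # (e.g., end-effector delta pose plus gripper).
        self.head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, action_dim))

    def forward(self, image, instruction_tokens):
        v = self.vision(image)                       # (B, 64)
        l = self.language(instruction_tokens)        # (B, 64)
        return self.head(torch.cat([v, l], dim=-1))  # (B, action_dim)

policy = ToyVLAPolicy()
img = torch.rand(1, 3, 224, 224)       # camera frame
cmd = torch.randint(0, 1000, (1, 6))   # "pick up the red apple" as token ids
action = policy(img, cmd)
print(action.shape)                    # torch.Size([1, 7])
```

Real VLA systems swap the toy encoders for pretrained vision and language backbones and train on internet plus robot demonstration data, but the input/output contract is the same.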
-
MolmoAct: the first fully open Action Reasoning Model (ARM). It can ‘think’ in 3D and turn your instructions into real-world actions. [📍 Bookmark for later]

A model that reasons in space, time, and motion. It breaks down your command into three steps:
✅ Grounds the scene with depth-aware perception tokens
✅ Plans the motion through visual reasoning traces
✅ Executes low-level commands for real hardware

Think of it as chain-of-thought for physical action. Give it an instruction like “Pick up the trash” and MolmoAct will:
1. Understand the environment through depth perception
2. Visually plan the sequence of moves
3. Carry them out… while letting you see the plan overlaid on camera frames before anything moves

It’s steerable in real time: draw a path, change the prompt, and the trajectory updates instantly.

And it’s completely open: checkpoints, code, and evaluation scripts are all public!

Resources
Models: https://lnkd.in/dcMVV29k
Data: https://lnkd.in/dUwszSvd
📍 Blog: https://lnkd.in/diNJFXEi

MolmoAct runs across different robot types (from gripper arms to humanoids) and adapts quickly to new tasks. It outperforms models from major labs like NVIDIA, Google, and Microsoft on benchmark tests for generalization and real-world success rates. For anyone building robotics systems or studying AI-driven action models, this is worth exploring… and worth sharing! ♻️
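The three-step breakdown above maps onto a perceive, plan, execute loop. The sketch below shows that control flow only; the function names, data shapes, and straight-line planner are hypothetical placeholders, not MolmoAct's actual API or checkpoints.

```python
# Illustrative "perceive -> plan -> act" loop in the spirit of an action
# reasoning model. Everything here is a placeholder, not MolmoAct's API.
from dataclasses import dataclass
import numpy as np

@dataclass
class PerceptionTokens:
    rgb: np.ndarray    # (H, W, 3) camera frame
    depth: np.ndarray  # (H, W) depth map used for 3D grounding

def ground_scene(rgb: np.ndarray, depth: np.ndarray) -> PerceptionTokens:
    """Stage 1: fuse RGB and depth into depth-aware perception tokens."""
    return PerceptionTokens(rgb=rgb, depth=depth)

def plan_motion(tokens: PerceptionTokens, instruction: str) -> np.ndarray:
    """Stage 2: produce a visual plan as a waypoint trace over the image.

    A real ARM generates this trace with a VLM; here we return a straight
    line so the pipeline runs end to end.
    """
    h, w = tokens.depth.shape
    return np.linspace([h // 2, 0], [h // 2, w - 1], num=10)  # (10, 2) waypoints

def execute(trace: np.ndarray) -> list:
    """Stage 3: convert the visual trace into low-level commands."""
    return [{"move_to_pixel": tuple(map(int, wp))} for wp in trace]

rgb = np.zeros((480, 640, 3), dtype=np.uint8)
depth = np.ones((480, 640), dtype=np.float32)
tokens = ground_scene(rgb, depth)
trace = plan_motion(tokens, "pick up the trash")
commands = execute(trace)
print(len(commands), commands[0])
```

In the real model, stage 2 is where the visual reasoning trace gets overlaid on the camera frames, which is what makes the plan inspectable and steerable before stage 3 runs.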
-
What representation enables open-world robot manipulation from generated videos? Introducing Dream2Flow, our recent work that bridges video generation and robot control with 3D object flow. 🌐 dream2flow.github.io by Stanford University

🔹 Robot manipulation is about inducing changes in an environment through actions. We observe that video models (e.g., Veo) excel at producing plausible object motions from an in-the-wild image and language instructions. Intriguingly, these motions are more physically realistic when the actor is a human rather than a robot, likely because the internet contains far more human interaction data than robot data.

🔹 But how do we turn those generated videos into low-level robot actions? This is a nuanced question that goes beyond simple retargeting, because strategies taken by a human may not work on a robot.

🔹 We propose Dream2Flow, which uses 3D object flow to separate what should happen in the scene from how a robot should realize it. We extract this flow from generated videos using off-the-shelf vision models, then use it as a shared objective for both trajectory optimization and reinforcement learning.

🔹 Dream2Flow can perform a range of in-the-wild tasks zero-shot with trajectory optimization, including manipulation of rigid, articulated, and deformable objects. The robot plans by asking a counterfactual question using a dynamics model (either heuristics-based or learned): if I take this action, will the scene evolve toward the desired 3D flow?

🔹 Using the flow-tracking objective as a reward for RL, Dream2Flow enables different embodiments to discover emergent behaviors that achieve the same effect (e.g., base motion of the robot dog). Dream2Flow unifies these behaviors through a shared task interface and unifies model-free and model-based methods around a shared tracking goal.

🔹 By leveraging purely off-the-shelf video models, Dream2Flow also generalizes to different object instances, backgrounds, and camera viewpoints. It is also surprisingly steerable: different language instructions in the same scene can induce different desired behaviors.

🔹 World modeling encodes rich priors about not only environment dynamics but also behaviors within it. It is immensely useful for robotics, yet we are only scratching the surface of understanding it.

The project was led by Karthik Dharmarajan and has been a year in the making, along with the rest of the team: Jiajun Wu, Fei-Fei Li, and Ruohan Zhang. Karthik Dharmarajan will also be joining UC Berkeley as a PhD student this fall!

Website: dream2flow.github.io
Paper: https://lnkd.in/gpwP2hkT
Code: https://lnkd.in/gvJZTxaP
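The core mechanism described above, scoring candidate actions by whether a dynamics model predicts the scene will evolve toward a desired 3D object flow, can be illustrated with a toy example. The functions and the point-translation dynamics below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of 3D object flow as a shared tracking objective, in the
# spirit of Dream2Flow. The dynamics model and names are placeholders.
import numpy as np

def flow_tracking_reward(predicted_points: np.ndarray,
                         desired_flow: np.ndarray) -> float:
    """Negative mean distance between predicted object points and the desired
    3D flow at the same timestep; higher is better. Both arrays are (N, 3)."""
    return -float(np.linalg.norm(predicted_points - desired_flow, axis=-1).mean())

def rollout_dynamics(points: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Toy dynamics model: translate every tracked point by the action.
    A real system would use a heuristics-based or learned model here."""
    return points + action

# Desired flow for one future timestep, e.g. "lift the object by 5 cm".
current = np.random.rand(64, 3)
desired = current + np.array([0.0, 0.0, 0.05])

# Counterfactual question: which candidate action moves the scene toward the flow?
candidates = [np.zeros(3), np.array([0.0, 0.0, 0.05]), np.array([0.05, 0.0, 0.0])]
scores = [flow_tracking_reward(rollout_dynamics(current, a), desired) for a in candidates]
best = candidates[int(np.argmax(scores))]
print("best action:", best)   # expected: the +5 cm lift
```

Trajectory optimization searches action sequences to maximize this quantity directly, while RL uses it as a reward, which is what lets different embodiments discover different behaviors that realize the same flow.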
-
Robots might learn better from video than from language! 📼

Most Vision-Language-Action (VLA) models learn what to do from text, but still struggle with how things move in the real world. That makes them data-hungry and slow to train.

mimic-video takes a different route. Instead of grounding robot control in text, it grounds it in video, using large pre-trained video models that already capture physical motion and dynamics. The idea is straightforward: let the video model handle “what will happen next,” and let a smaller control model focus only on turning that visual plan into robot actions.

The result is big gains in practice. Robots trained this way need 10× less data, converge twice as fast, and perform better on both simulated benchmarks and real bimanual manipulation tasks. If robots can “imagine” motion using video, control becomes a much simpler problem.

Shoutout to Jonas Pai, Liam Achenbach, Oier Mees, Elvis Nava and the rest of the team! mimic, Microsoft, ETH Zürich, University of California, Berkeley NVIDIA Robotics

~~ ♻️ Join the weekly robotics newsletter, and never miss any news → ziegler.substack.com
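One way to picture the division of labor: a frozen video model supplies a latent prediction of what should happen next, and a small trainable head turns that plan plus proprioception into an action chunk. The sketch below uses made-up module names and dimensions; it is not the mimic-video architecture, just the general pattern of grounding control in a video prior.

```python
# Minimal sketch of video-grounded control: frozen video prior + small action
# head. Names, sizes, and the flattening "encoder" are illustrative only.
import torch
import torch.nn as nn

class FrozenVideoPrior(nn.Module):
    """Stand-in for a large pretrained video model; weights are frozen."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.encoder = nn.Linear(3 * 64 * 64, latent_dim)
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, frames):                               # (B, T, 3, 64, 64)
        return self.encoder(frames.flatten(2).mean(dim=1))   # (B, latent_dim)

class ActionHead(nn.Module):
    """Small trainable model that turns the visual plan into actions."""
    def __init__(self, latent_dim=256, proprio_dim=14, action_dim=14, horizon=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + proprio_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim * horizon),
        )
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, plan_latent, proprio):
        out = self.net(torch.cat([plan_latent, proprio], dim=-1))
        return out.view(-1, self.horizon, self.action_dim)   # action chunk

prior, head = FrozenVideoPrior(), ActionHead()
frames = torch.rand(1, 4, 3, 64, 64)   # recent camera frames
proprio = torch.rand(1, 14)            # joint states of a bimanual setup
actions = head(prior(frames), proprio)
print(actions.shape)                   # torch.Size([1, 8, 14])
```

Because the heavy lifting sits in the frozen prior, only the small head needs robot demonstrations, which is consistent with the reported 10× data reduction.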
-
📑 A Major Milestone in Spatial Reasoning: bridging the gap between LLMs and 3D Scene Graphs (3DSGs) for advanced real-world navigation

Traditional robotic systems struggle to interpret abstract commands and operate in expansive environments. For humans, a task like "grab a snack from the kitchen" is trivial, but it involves understanding vague instructions, knowing where things are, and planning the best way to get them, all of which remain significant challenges for robots operating in large, complex environments.

How can we combine AI with detailed 3D maps of spaces to help AI systems and robots not only understand complex tasks in large environments but also adapt and refine their plans as they discover new information or face unexpected challenges?

Excited to share this new research, introducing SayPlan, a framework that bridges the gap between Large Language Models (LLMs) and 3D Scene Graphs (3DSGs), setting a new standard for robotic task planning in complex, multi-room, and multi-floor spaces.

How does it work?
🏢 1) Hierarchical Scene Representation: The framework leverages 3DSGs to represent environments hierarchically, from floors to rooms, assets, and individual objects. This allows the system to abstract and collapse unnecessary details, focusing only on task-relevant components.
🔍 2) Semantic Search: SayPlan employs LLMs to explore task-relevant subgraphs through iterative expansion and contraction, refining the scope of planning. For example, when asked to "fetch an item from the fridge," the system narrows its focus from the building to the kitchen and finally the fridge.
🔄 3) Iterative Replanning: Plans are verified against a simulator, which identifies errors like unfulfilled preconditions (e.g., forgetting to open a fridge). The LLM receives feedback to correct its output, ensuring that the final plan is executable and aligned with environmental constraints.
🗺️ 4) Path Optimization and Learning: Navigational tasks are optimized using algorithms like Dijkstra's, offloading computational complexity from the LLM.

Industrial Implications
SayPlan reduces input token size by up to 82% using hierarchical graph compression and achieves 100% success on simple tasks and 86.6% on complex, multi-step plans. Iterative replanning resolves execution errors, ensuring near-perfect performance in tests.

SayPlan exemplifies the potential of cutting-edge research in robotics and AI, demonstrating how LLMs can do much more than text generation (😉) and how hierarchical environmental representations can create a scalable, reliable, and precise planning system.

👉 Follow me for more in-depth insights on Industrial #AI applications across sectors. Enno Danke Maria Danninger Christian Souche Amine Kharrat Simon Roggendorf Dr. Veo Zumpe Dr. Matthias Ziegler #AI #Robotics #TaskPlanning #Innovation #Automation #Research
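A toy sketch of steps 2 and 3, semantic search over a collapsed scene graph followed by simulator-checked replanning, is given below. The hard-coded graph, the heuristic standing in for the LLM, and the string-based simulator are all illustrative assumptions, not the SayPlan implementation.

```python
# Toy sketch of SayPlan-style semantic search plus iterative replanning.
# The "LLM" is a placeholder heuristic; in the paper an LLM picks nodes to
# expand and a simulator verifies plan preconditions.
SCENE_GRAPH = {
    "floor1": {"kitchen": ["fridge", "counter"], "office": ["desk", "shelf"]},
}

def llm_select_room(task, rooms):
    """Placeholder for the LLM's expand/contract decision."""
    return "kitchen" if "fridge" in task or "snack" in task else rooms[0]

def simulate(plan):
    """Placeholder simulator: returns an error message for unmet preconditions."""
    opened = set()
    for step in plan:
        if step.startswith("open "):
            opened.add(step.split(" ", 1)[1])
        if step.startswith("pick from ") and step.split(" ")[-1] not in opened:
            return f"precondition failed: {step.split(' ')[-1]} is closed"
    return None

def plan_task(task):
    rooms = list(SCENE_GRAPH["floor1"])
    room = llm_select_room(task, rooms)          # semantic search: expand one room
    plan = [f"goto {room}", "pick from fridge"]  # first draft (intentionally buggy)
    for _ in range(3):                           # iterative replanning loop
        error = simulate(plan)
        if error is None:
            return plan
        # The real system feeds `error` back to the LLM; here we hard-code the fix.
        plan = [f"goto {room}", "open fridge", "pick from fridge"]
    raise RuntimeError("no executable plan found")

print(plan_task("grab a snack from the kitchen"))
# ['goto kitchen', 'open fridge', 'pick from fridge']
```

The point of the structure is the loop: the planner only ever sees the expanded, task-relevant part of the graph (hence the token savings), and every candidate plan is checked against preconditions before it reaches the robot.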