Latest AIST Robot Foundation Model Developments


Summary

The latest AIST robot foundation model developments focus on creating general-purpose robot models that can learn new tasks and adapt to changing environments by combining vision, language, and action. Foundation models are large-scale AI systems trained on diverse data, enabling robots to follow instructions, handle unfamiliar objects, and work in real-world settings with minimal retraining.

  • Embrace open-source: Take advantage of publicly available robot foundation models and datasets to experiment, collaborate, and accelerate learning across teams and organizations.
  • Utilize demonstration learning: Teach robots new tasks by simply providing video examples or natural language instructions, reducing the need for complex coding or frequent retraining.
  • Explore real-world adaptation: Deploy these models in varied environments, knowing they are designed to handle changes in scene layout, object positions, and unstructured tasks through advanced vision–language reasoning.
Summarized by AI based on LinkedIn member posts
  • View profile for Jiafei Duan

    Robotics & AI PhD student at University of Washington, Seattle

    6,901 followers

    Why do powerful pretrained generalist robot models fail when you move an object a few inches, swap a target, or change the scene layout? It’s usually not a lack of motor skill — it’s an alignment problem at test time.

    In our new paper, we introduce Vision–Language Steering (VLS): a training-free, inference-time framework that adapts frozen diffusion and flow-matching robot policies to out-of-distribution (OOD) scenarios.

    Key idea: treat adaptation as an inference-time control problem. Instead of retraining policies, we steer the denoising process using:
    - Vision–Language Models to interpret test-time constraints
    - Differentiable, programmatic rewards grounded in 3D geometry
    - Gradient-based guidance + particle resampling for stable long-horizon execution

    📊 Results:
    - CALVIN: +31% absolute success over prior steering methods
    - LIBERO-PRO: +13% improvement on strong VLAs (π0.5, OpenVLA)
    - Real world (Franka): robust execution under appearance shifts, position swaps, and novel object substitutions

    This work suggests a broader takeaway for robotics foundation models: scaling policies alone isn’t enough — inference-time alignment matters.

    📄 Paper: https://lnkd.in/g67pf5Tm
    🌐 Project page: https://lnkd.in/gkPxZjXw
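
    To make the steering idea concrete, here is a minimal, hypothetical sketch of reward-guided denoising with particle resampling. The frozen_denoiser, reward, guidance_scale, and all tensor shapes are illustrative assumptions, not the VLS implementation; a real policy would condition the denoiser on observations and build rewards from VLM-interpreted constraints.

    ```python
    # Illustrative sketch only: placeholder denoiser and reward, not VLS code.
    import torch

    def frozen_denoiser(noisy_actions, t):
        # Stand-in for a pretrained diffusion/flow policy's denoising step.
        # A real policy would condition on camera observations and language.
        return noisy_actions * (1.0 - 0.5 / t)

    def reward(actions, target_xyz):
        # Differentiable, programmatic reward: negative distance of the final
        # waypoint to a target position proposed by a vision-language model.
        return -torch.norm(actions[:, -1, :3] - target_xyz, dim=-1)

    def guided_sample(target_xyz, num_particles=16, horizon=8, action_dim=7,
                      steps=10, guidance_scale=0.1):
        actions = torch.randn(num_particles, horizon, action_dim)
        for t in range(steps, 0, -1):
            actions = actions.detach().requires_grad_(True)
            denoised = frozen_denoiser(actions, t)
            # Gradient-based guidance: nudge samples toward higher reward.
            grad = torch.autograd.grad(reward(denoised, target_xyz).sum(), actions)[0]
            actions = (denoised + guidance_scale * grad).detach()
            # Particle resampling: keep trajectories in proportion to reward.
            weights = torch.softmax(reward(actions, target_xyz), dim=0)
            idx = torch.multinomial(weights, num_particles, replacement=True)
            actions = actions[idx]
        return actions[reward(actions, target_xyz).argmax()]

    plan = guided_sample(target_xyz=torch.tensor([0.4, 0.0, 0.2]))
    print(plan.shape)  # (horizon, action_dim)
    ```

    Note that the pretrained policy weights are never updated in this pattern; only the sampling procedure changes, which is what makes this style of adaptation training-free.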

  • View profile for Greg Coquillo

    AI Infrastructure Product Leader | Scaling GPU Clusters for Frontier Models | Microsoft Azure AI & HPC | Former AWS, Amazon | Startup Investor | Linkedin Top Voice | I build the infrastructure that allows AI to scale

    229,029 followers

    🚀 The world’s first Open Foundation Model for generalist humanoid robots was just launched during NVIDIA’s GTC, and it’s nothing short of exciting! My take is that this new model, designed for diverse manipulation tasks, will be performing in open-ended environments, where “new, unseen data” will be coming in on the fly! I’m hoping we surmount the hurdles seen with autonomous vehicles as we fine-tune this foundation model into many sub-versions.

    Making it open source is a major strength, in my opinion. Researchers around the world will be thinking about ways to fine-tune it using innovative reinforcement learning techniques, given that Omniverse and Cosmos provide a space to explore synthetic data while removing the constraints of human-annotated data.

    Nonetheless, here are the quick facts about GR00T N1:
    🔹 Vision-Language-Action (VLA) architecture: combines a vision-language model for reasoning (System 2) with a diffusion transformer for real-time motor actions (System 1).
    🔹 Trained on heterogeneous data: uses a structured data pyramid of human videos, synthetic simulations, and real-robot demonstrations.
    🔹 Cross-embodiment generalization: supports multiple robot types, from simple arms to full humanoid robots.
    🔹 High-frequency control: processes perception at 10 Hz and generates motor actions at 120 Hz on an NVIDIA L40 GPU.
    🔹 State-of-the-art learning: outperforms imitation-learning baselines in both simulation and real-world humanoid benchmarks.
    🔹 Open-source availability: model weights, datasets, and simulation environments are accessible on GitHub and Hugging Face.

    Hope you’re as excited as I am about this new frontier, and what’s coming next! #genai #technology #artificialintelligence
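
    As a rough illustration of the System 2 / System 1 split and the 10 Hz / 120 Hz rates described above, here is a hypothetical control-loop sketch. StubCamera, StubRobot, system2_plan, and system1_act are placeholder names invented for this example; none of this is NVIDIA's GR00T N1 code.

    ```python
    # Hypothetical slow/fast control loop: a vision-language planner at ~10 Hz
    # feeding a fast action head at ~120 Hz. All components are stubs.
    import time

    class StubCamera:
        def read(self): return "image"

    class StubRobot:
        def joint_state(self): return [0.0] * 7
        def send(self, action): pass

    def system2_plan(image, instruction):
        # Placeholder for the vision-language reasoning model (System 2).
        return {"instruction": instruction, "goal": "grasp_object"}

    def system1_act(plan, proprioception):
        # Placeholder for the diffusion-transformer action head (System 1).
        return [0.0] * 7

    def control_loop(camera, robot, instruction, fast_hz=120, slow_hz=10, seconds=1):
        ticks_per_replan = fast_hz // slow_hz   # replan every 12 fast ticks
        plan = None
        for tick in range(fast_hz * seconds):
            if tick % ticks_per_replan == 0:
                plan = system2_plan(camera.read(), instruction)   # ~10 Hz
            robot.send(system1_act(plan, robot.joint_state()))    # ~120 Hz
            time.sleep(1.0 / fast_hz)

    control_loop(StubCamera(), StubRobot(), "pick up the red cube")
    ```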

  • View profile for Simon Lancaster 🇺🇸🇨🇦🇵🇹

    GP, Omni Ventures - The Manufacturing Tech VC™️| Author of Unlocking Alpha | Investing in AI for manufacturing, engineering design, and value chain transformation.

    34,465 followers

    Sound on! NVIDIA just took a huge step toward the GPT of humanoid robots with Isaac GR00T N1.5, a foundation model for general-purpose robotics.

    Here’s how it works:
    → You demo a task once
    → Cosmos (their physics AI) generates thousands of variations
    → Omniverse runs high-fidelity simulations of each motion
    → The robot “trains” entirely in simulation
    → It then fine-tunes itself in the real world

    That means robots can now pick up general skills—across tasks, tools, and even different body types—with a single human demo.

    AI isn’t limited to text anymore. It’s perceiving. Reasoning. Moving. Physical AI has arrived, and it’s teaching itself.

    What tasks would you hand off to a self-training robot first? Let me know below.
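
    A hedged sketch of that demo-to-deployment loop is below. Every function in it (generate_variations, simulate, train_policy, fine_tune_on_robot) is a made-up stand-in for the corresponding generation, simulation, and training step; no real Cosmos or Omniverse APIs are being called.

    ```python
    # Toy pipeline: one demo -> many synthetic variations -> simulated rollouts
    # -> policy training -> a little real-world fine-tuning. All stubs.
    import random

    def generate_variations(demo, n=1000):
        # Stand-in for a physics-aware generative model perturbing the demo.
        return [{"demo": demo, "seed": random.random()} for _ in range(n)]

    def simulate(variation):
        # Stand-in for a high-fidelity simulator rollout of one variation.
        return {"trajectory": [variation["seed"]], "success": variation["seed"] > 0.1}

    def train_policy(trajectories):
        # Stand-in for imitation / reinforcement learning on simulated rollouts.
        return {"policy": "weights", "num_trajectories": len(trajectories)}

    def fine_tune_on_robot(policy, real_episodes):
        # Stand-in for a small amount of real-world fine-tuning.
        policy["fine_tuned_episodes"] = len(real_episodes)
        return policy

    demo = {"task": "place mug on shelf"}
    rollouts = [simulate(v) for v in generate_variations(demo)]
    policy = train_policy([r["trajectory"] for r in rollouts if r["success"]])
    policy = fine_tune_on_robot(policy, real_episodes=[{"episode": 1}])
    print(policy)
    ```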

  • View profile for Russ Tedrake

    Building Physical AI

    3,377 followers

    TRI's "LBM 1.0" paper appeared on arXiv last night! Large Behavior Models (LBMs) are foundation models for robots that map robot sensors (notably camera inputs) and natural language commands into robot actions. The robots are programmed just through demonstration; we can develop incredible new skills, like the one in the video below, without writing a single line of new code.

    There is a lot of excitement in the field right now because of the incredible potential of this type of technology. Inevitably, there is also a lot of hype. One of our main goals for this paper was to put out a very careful and thorough study on the topic, to help people understand the state of the technology and to share a lot of details about how we're achieving it.

    The short version is: LBMs work! We see consistent and statistically significant improvements as we increase the amount of pretraining data. But doing the science is still hard; as a field we have more work to do to improve the statistical power of our experiments.

    Please check out our project website for the paper and more details: https://lnkd.in/eDn_sqGh. https://lnkd.in/epSksw5E
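
    For readers new to the idea, here is a toy behavior-cloning sketch of the mapping described above (camera features + language command → action, supervised only by demonstrated actions). The encoder dimensions, layer sizes, and random data are assumptions made for illustration; this is not TRI's LBM architecture or training recipe.

    ```python
    # Toy behavior cloning: (image features, language embedding) -> action,
    # fit to demonstrated actions. Not the LBM architecture; shapes are made up.
    import torch
    import torch.nn as nn

    class ToyPolicy(nn.Module):
        def __init__(self, img_dim=512, text_dim=128, action_dim=7):
            super().__init__()
            self.head = nn.Sequential(
                nn.Linear(img_dim + text_dim, 256), nn.ReLU(),
                nn.Linear(256, action_dim),
            )

        def forward(self, img_feat, text_emb):
            return self.head(torch.cat([img_feat, text_emb], dim=-1))

    # In practice, pretrained vision and language encoders would produce these
    # features from camera frames and the natural language command.
    img_feat = torch.randn(32, 512)
    text_emb = torch.randn(32, 128)
    demo_actions = torch.randn(32, 7)

    policy = ToyPolicy()
    opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
    loss = nn.functional.mse_loss(policy(img_feat, text_emb), demo_actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(float(loss))
    ```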

    CutAppleInSlices Task

  • A New Era of Physical Intelligence?

    What if a single model could control multiple robots — and learn new tasks from just one video?

    The team at Physical Intelligence has developed a generalist robotics policy that demonstrates exactly that:
    🧠 One model
    🤖 Multiple embodiments
    🛠️ Countless real-world tasks

    Trained across eight different robot platforms, it can perform everything from folding laundry to bussing tables — even tasks it has never seen before.

    All it takes?
    ✅ A single video demonstration
    ✅ Or a natural language instruction
    No fine-tuning. No task-specific code. Just action.

    This isn’t just robotics automation — it’s a bold step toward general-purpose physical intelligence.

    We’ve seen what foundation models can do in vision and language. This is what happens when that power moves into the physical world.

    #AI #Robotics #Automation #Innovation
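
    One common way a single policy can serve several robot embodiments is to condition on an embodiment identifier and share a padded action space across platforms. The sketch below illustrates only that generic pattern; the embodiment list, dimensions, and network are assumptions, not Physical Intelligence's model.

    ```python
    # Generic cross-embodiment pattern: one network, embodiment conditioning,
    # padded action space. Purely illustrative; dimensions are invented.
    import torch
    import torch.nn as nn

    EMBODIMENTS = {"arm_6dof": 6, "mobile_manipulator": 10, "humanoid": 24}
    MAX_ACTION_DIM = max(EMBODIMENTS.values())

    class SharedPolicy(nn.Module):
        def __init__(self, obs_dim=128, embed_dim=16):
            super().__init__()
            self.embodiment_embed = nn.Embedding(len(EMBODIMENTS), embed_dim)
            self.net = nn.Sequential(
                nn.Linear(obs_dim + embed_dim, 256), nn.ReLU(),
                nn.Linear(256, MAX_ACTION_DIM),
            )

        def forward(self, obs, embodiment_idx):
            emb = self.embodiment_embed(embodiment_idx)
            # Predict the widest action vector; callers slice to their robot.
            return self.net(torch.cat([obs, emb], dim=-1))

    policy = SharedPolicy()
    obs = torch.randn(1, 128)
    idx = torch.tensor([list(EMBODIMENTS).index("humanoid")])
    action = policy(obs, idx)[:, :EMBODIMENTS["humanoid"]]
    print(action.shape)  # torch.Size([1, 24])
    ```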

  • View profile for Dr. Kal Mos

    Executive VP, Research & Predevelopment @ Siemens, ex-Google, ex-Amazon AGI, Startup Founder

    13,205 followers

    This new paper proposes dual-stream diffusion (DUST), a world-model-augmented VLA framework. It shows that combining world models with physics-aware VLA delivers major gains in generalization and real-world task success. DUST outperforms standard VLA architectures that map perception to action without internal physical simulation.

    DUST keeps the vision and action streams separated but cross-modal, enabling a physically consistent internal state that boosts manipulation success by 6% in simulation and 13% on real robots.

    This hybrid approach is the direction next-gen robotics foundation models will go: physics-aware, temporally grounded, scalable, general-purpose embodied intelligence.

    https://lnkd.in/gCQn3-Ta

    #Robotics #RFM #RFM1 #RoboticsFoundationModel #WorldModel #LeCunWorldModel #EmbodiedAI #VLA #VisionLanguageAction #PhysicsAugmentedAI #DiffusionModels #ModelBasedRL #RobotManipulation #AutonomousSystems #PhysicalAI #EmbodiedFoundationModels #RobotLearning #Sim2Real #AIResearch #GeneralistRobots #IndustrialAI #DeepLearning #AIInfrastructure #FoundationModels #MachineLearning #Transformers #DiffusionTransformers #EmbodiedIntelligence #FutureOfAutomation #NextGenAI #Siemens
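
    To visualize "separated but cross-modal" streams, here is a rough two-stream transformer block in which world-state/vision tokens and action tokens attend to themselves and then to each other. The layer sizes, token counts, and block structure are assumptions for illustration, not the architecture from the DUST paper.

    ```python
    # Illustrative dual-stream block: self-attention per stream plus
    # cross-attention between streams. Not the DUST implementation.
    import torch
    import torch.nn as nn

    class DualStreamBlock(nn.Module):
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.vision_self = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.action_self = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.action_from_vision = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.vision_from_action = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, vision_tokens, action_tokens):
            v, _ = self.vision_self(vision_tokens, vision_tokens, vision_tokens)
            a, _ = self.action_self(action_tokens, action_tokens, action_tokens)
            # Cross-modal exchange keeps the streams separate but coupled.
            a_cross, _ = self.action_from_vision(a, v, v)
            v_cross, _ = self.vision_from_action(v, a, a)
            return vision_tokens + v_cross, action_tokens + a_cross

    block = DualStreamBlock()
    vision = torch.randn(1, 50, 256)   # predicted world-state / image tokens
    action = torch.randn(1, 8, 256)    # noisy action tokens being denoised
    vision_out, action_out = block(vision, action)
    print(vision_out.shape, action_out.shape)
    ```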

  • View profile for Aniket Deb

    Co-founder & CEO, Flosync - Machine Intelligence for Manufacturing | Co-founded Bizongo (B2B unicorn, $300M+ raised) | Forbes 30U30 · Fortune 40U40 | IIT Bombay

    41,174 followers

    The ChatGPT moment for robotics is almost here. Not because robots have started to look better recently, but because the learning bottleneck in Physical AI is finally cracking. Leading teams are taking an increasingly foundational approach to how robots are trained and how physical intelligence compounds.

    And this is where Figure comes in. Figure, founded by Brett Adcock, is building general-purpose humanoid robots. This "general-purpose" distinction matters. It means robots won't stay confined to factories or warehouses, but can operate alongside us in our homes, offices, and other shared human spaces.

    Figure recently announced Figure 03, a fully autonomous humanoid built for everyday tasks. Capabilities like cooking, cleaning, laundry, and household assistance are no longer speculative demos, but part of an increasingly realistic roadmap. That's the real ChatGPT moment for robotics.

    At the core of this is Helix, Figure's most advanced vision-language-action system to date. Helix 02 enables whole-body autonomy, coordinating perception and action so robots can complete long-horizon tasks without human intervention. Most importantly, Helix 02 is designed so physical skills carry forward and compound over time, rather than being relearned for every new task, which is what makes these robots genuinely general-purpose.

    This is what it looks like when Physical AI starts to scale. I break this transition down in my latest newsletter edition, along with two other companies pushing the frontier of Physical AI across humanoids, warehouse robotics, and robot foundation models. Read it to explore the cutting-edge innovation shaping this shift.

  • View profile for Vedant Nair

    Co-Founder @ Miru (YC S24) | RobotOps Software Infra

    14,557 followers

    Breaking News: The GPT-3 Moment for Robotics? 🤖🚀

    Skild AI just came out of stealth mode, announcing a whopping $300M Series A funding round at a $1.5B valuation. Those are some big numbers! 😅 Skild AI was founded by PhDs Deepak Pathak and Abhinav Gupta, both former professors at Carnegie Mellon University.

    Problem: The reason you don’t have a robot making you dinner every night isn’t due to hardware limitations. It’s a software issue. Traditionally, robots have been trained to perform specific tasks—picking and placing, moving, unloading, etc. These robots have only been effective in constrained environments with narrow instructions. So far, we’ve been unable to string these tasks together to create general-purpose actions.

    Solution: Skild AI aims to change this with its foundation model. They’ve discovered that “vision-language-action (VLA) models exhibit the same sort of emergent behavior as large, pre-trained language models (LLMs). Just as training an LLM on Algebra makes it better at Spanish, research suggests training a VLA on navigation improves its grasping ability.”

    Previously, foundation models like these were untenable due to a lack of training data. However, Skild claims their “foundation model is trained on an unparalleled scale of data, representing a breakthrough in the robotics data barrier.” It’s not public how they’ve overcome this barrier, but my best guess is through simulations. Better hardware and ample compute power have led to world-scale simulations that produce data rich enough to train foundation models.

    In 2022, Deepak and Abhinav won the Best Robotic System Award at the Conference on Robot Learning. They likely took these learnings with them when they founded Skild in 2023. A large chunk of the $300 million will be poured into the GPUs needed to maintain these rich simulation environments and train their models.

    Skild isn’t alone in this arena. Earlier this year, NVIDIA announced their GR00T foundation model for robotics, trained on their proprietary Isaac Sim platform. Both teams hope their foundation models will unlock massive progress in robotics, allowing us to replace humans in the dull, dirty, and dangerous tasks still prevalent in the physical world.

    According to Sequoia Capital, "A GPT-3 moment is coming to the world of robotics. It will spark a monumental shift, bringing advancements similar to what we’ve seen in digital intelligence to the physical world."

    The question remains: do we believe the hype around robot foundation models? And who will be the team that makes the breakthrough? Let me know what you think.

    Quote credits: Skild AI, Felicis, and Sequoia
