Robotics data is expensive and slow to collect. Plenty of video is available online, but little of it is usable for robot learning because it lacks action labels. AMPLIFY addresses this by learning Actionless Motion Priors that unlock better sample efficiency, generalization, and scaling for robot learning.

Our key insight is to factor the problem into two stages:
- The "what": predict the visual dynamics required to accomplish a task.
- The "how": map predicted motions to low-level actions.

This decoupling enables remarkable generalization: our policy can perform tasks for which we have NO action data, only videos. We outperform SOTA BC baselines on this setting by 27x 🤯

AMPLIFY is composed of three stages:
1. Motion Tokenization: We track dense keypoint grids through videos and compress their trajectories into discrete motion tokens.
2. Forward Dynamics: Given an image and a task description (e.g., "open the box"), we autoregressively predict a sequence of motion tokens representing how the keypoints should move over the next second or so. This model can train on ANY text-labeled video data: robot demonstrations, human videos, YouTube videos.
3. Inverse Dynamics: We decode predicted motion tokens into robot actions. This module learns the robot-specific mapping from desired motions to actions, and it can train on ANY robot interaction data, not just expert demonstrations (think off-task data, play data, or even random actions).

So, does it actually work?

Few-shot learning: Given just 2 action-annotated demos per task, AMPLIFY nearly doubles SOTA few-shot performance on LIBERO. This is possible because our Actionless Motion Priors provide a strong inductive bias that dramatically reduces the amount of robot data needed to train a policy.

Cross-embodiment learning: We train the forward dynamics model on both human and robot videos, while the inverse model sees only robot actions. Result: a 1.4× average improvement on real-world tasks.
Our system successfully transfers motion information from human demonstrations to robot execution.

And now my favorite result: AMPLIFY enables zero-shot task generalization. We train on LIBERO-90 tasks and evaluate on tasks where we've seen no actions, only pixels. While our best baseline achieves ~2% success, AMPLIFY reaches a 60% average success rate, outperforming SOTA behavior cloning baselines by 27x.

This points to a new way of training VLAs for robotics that doesn't have to start with large-scale teleoperation. Instead of collecting millions of robot demonstrations, we just need to teach robots to read the language of motion. Then every video becomes training data.

Led by Jeremy Collins & Loránd Cheng in collaboration with Kunal Aneja, Albert Wilcox, and Benjamin Joffe at the College of Computing at Georgia Tech.

Check out our paper and project page for more details:
📄 Paper: https://lnkd.in/eZif-mB7
🌐 Website: https://lnkd.in/ezXhzWGQ
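The "what"/"how" factoring described above can be sketched in a few lines. This is our own toy illustration, not the AMPLIFY code: class names, the hash-style token predictor, and the token-to-action decoder are all stand-ins for the learned models.

```python
# Hypothetical sketch of AMPLIFY-style two-stage factoring (names and
# internals are illustrative stand-ins, not the authors' API).
from dataclasses import dataclass
from typing import List

@dataclass
class ForwardDynamics:
    """The 'what': observation + task -> motion tokens (trains on video only)."""
    vocab_size: int = 256

    def predict_tokens(self, image_id: int, task: str, horizon: int = 4) -> List[int]:
        # Stand-in for an autoregressive transformer: a deterministic toy rule.
        return [(image_id + len(task) + t) % self.vocab_size for t in range(horizon)]

@dataclass
class InverseDynamics:
    """The 'how': motion tokens -> robot actions (trains on any robot data)."""
    action_dim: int = 7

    def decode(self, tokens: List[int]) -> List[List[float]]:
        # Stand-in for a learned decoder: map each token to a toy action vector.
        return [[(tok % 10) / 10.0] * self.action_dim for tok in tokens]

# The policy is the composition: no paired (video, action) data required,
# because each half trains on a different data source.
fwd, inv = ForwardDynamics(), InverseDynamics()
tokens = fwd.predict_tokens(image_id=42, task="open the box")
actions = inv.decode(tokens)
```

The point of the composition is the data split: the forward model never needs actions, and the inverse model never needs task labels.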
Bi-Level Machine Learning Techniques for Robotics
Summary
Bi-level machine learning techniques for robotics involve splitting robot learning into two distinct stages: one focuses on understanding higher-level goals or motion patterns, and the other translates those insights into precise, lower-level actions. This structured approach helps robots learn new tasks from limited data and adapt quickly to real-world situations.
- Streamline training: Separate the process of learning what needs to be done from how to do it, making it easier to use available video or demonstration data for teaching robots.
- Enable quick adaptation: Build base models that handle general behaviors, then refine them with real-time adjustments so robots can adapt to new environments or tasks as they happen.
- Boost generalization: Use hierarchical learning strategies to help robots handle unfamiliar motions and tasks, reducing the need for extensive manual programming or large datasets.
-
After building general base models, real-world RL is the endgame. Robots need to be able to quickly adapt to new situations and fix their mistakes on the fly. A base model that can pick up a screwdriver is great, but it's only valuable in production if it can consistently align with a tiny screw at submillimeter precision. Today's models can't do that. Physical Intelligence introduced RL Tokens (RLT), a method that lets a small RL policy sit on top of their base VLA model and refine just the precise, critical phase of a task. No need to fine-tune; instead, the robot can learn from hours (or even minutes) of real-world practice directly on board. The results showed that the RL policy actually executed faster than human teleoperation on half the trials. Across all four tasks they tested, RLT sped up the hardest phases by up to 3x. This is exciting because it provides a pathway for foundation models to achieve production-grade reliability. A robot that can learn in real time can adapt to dynamic conditions in the real world. Interested to see who's first to ship something like this in a real production line.
-
🔥ICML-2024 Oral 📢🔥 for our work "PRISE: LLM-Style Sequence Compression for Learning Temporal Action Abstractions in Control", one of the 🔥1.5% of submissions🔥 honored this way!

Paper 📰: https://lnkd.in/ggAUMMA7
Code 👩💻: https://lnkd.in/gsWWVpg8

In a nutshell, PRISE tackles the issue of large planning horizons in decision-making, which complicates training models for robot control. 💡The paper's key insight is that large language models (LLMs) have been secretly using a suitable approach to this issue for a long time.💡 PRISE can be understood via the following analogy... 👇

LLMs are models for text generation. Text is a sequence of byte values -- let's call these 256 possible values "codes" -- but handling it byte by byte is too expensive. Instead, LLMs treat it as a sequence of less granular units, 🔶 tokens 🔶, each of which corresponds to a sequence of codes. LLMs learn to generate token streams and decode them into (much longer) code streams. Thus, tokens are temporal abstractions for text.

Continuous control 🤖, like text generation, involves making a sequence of highly granular decisions ("which control inputs to send to the robot's motors in the current state?") that result in a trajectory -- a sequence of state-action pairs. Like LLM training, behavior cloning (BC), a staple of pretraining for robotics, also faces major challenges when learning from granular trajectories, as this involves long planning horizons even for basic tasks. Temporal abstractions in the form of tokens decodable into blocks of several low-level decisions seem like they could be helpful here...

👉 Our PRISE (PRImitive Sequence Encoding) method builds on this intuition by transplanting LLM-style tokenization to BC-based learning to control. LLM training commonly constructs a mapping from tokens to code sequences using a method called Byte Pair Encoding (BPE).
BPE can be applied to code sequences over any finite codebook, but state-action pairs in control trajectories are from a continuous space. To overcome this problem, we observe that robotic manipulation policies typically consist of switching among a handful of low-level primitives: free-space moves (FSM), lowering the end-effector (LEEF), etc. PRISE learns a "codebook" of these primitives using VQ-VAE and labels every state-action pair in our training set's trajectories with the code of the primitive that generated it. 🎉Voila, our trajectories are ready for tokenization with BPE! 🎉 The "policy tokens" BPE constructs are sequences of primitive policy applications, e.g., FSM-FSM-LEEF. Our experiments show that training hierarchical state-to-token and token-to-control policies over tokens and primitives induced by PRISE results in significantly higher task success rates than using either raw BC or BC with action chunking (https://lnkd.in/gnuemsgn). Happy to chat about the details (oh yes, there are many more details!) at ICML-2024 in Vienna! ✈
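A single BPE merge step over primitive codes looks like this. The snippet is our minimal illustration in the spirit of PRISE, not the paper's code; the primitive names ("FSM", "LEEF") are taken from the example above.

```python
# Minimal BPE over discrete primitive codes: once trajectories are
# labeled with VQ-VAE primitive codes, the most frequent adjacent pair
# is merged into a new "policy token" (our toy, not PRISE's code).
from collections import Counter

def most_frequent_pair(seqs):
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))  # count adjacent code pairs
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair, new_token):
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)  # replace the pair with the merged token
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

trajs = [["FSM", "FSM", "LEEF"], ["FSM", "FSM", "FSM", "LEEF"]]
pair = most_frequent_pair(trajs)
trajs = [merge_pair(t, pair, "FSM+FSM") for t in trajs]
```

Repeating this merge step grows a vocabulary of progressively longer primitive sequences, exactly as BPE grows subword tokens from bytes.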
-
Pre‑trained Vision‑Language‑Action (VLA) policies are a big step toward generalist robots, but they can be brittle when tasks or scenes shift. A new paper by Neary, Younis, Kuramshin, Aslan, and Glen Berseth proposes VLAPS, a lightweight wrapper that plans at inference time: run a modified MCTS using a world model, and guide it with the VLA's action priors.

Why this matters:
• Reliability via compute: spend a little planning compute to recover long‑horizon performance.
• Plug‑and‑play: treat the VLA as a proposal model; no finetuning required.
• Structure in the loop: inject environment knowledge through a simulator or learned dynamics.

Results reported: up to +67 percentage points success over VLA‑only baselines on language‑specified robotics tasks.

A deeper analysis: Pre‑trained VLA policies hold promise as generalist robot controllers but often fail when deployed in novel scenarios. VLAPS proposes a planning‑time alternative: leave the VLA frozen, and embed a modified MCTS into inference that uses the VLA's action probabilities as priors while rolling out a world model to evaluate futures. This reframes VLA execution as guided search rather than pure feed‑forward prediction, enabling (i) controllable test‑time compute, (ii) injection of task‑ and environment‑specific structure through a simulator or learned dynamics, and (iii) principled integration with established planning/RL techniques (e.g., PUCT, value backups). Empirically, VLAPS reports substantial gains of up to +67pp success on language‑specified robotics tasks over direct VLA execution, supporting the thesis that long‑horizon credit assignment and safety constraints are better handled by search than by a one‑shot policy. Limitations include the need for a reliable dynamics model and the latency costs of MCTS.
Natural directions are to co‑train a compact world model, employ risk‑sensitive objectives, and combine search with lightweight prior sharpening via parameter‑efficient finetuning of the VLA. In the broader context of OpenVLA and generalist manipulation, VLAPS highlights a practical path to robustify frozen policies without offline data churn: treat the VLA as a powerful proposal mechanism, and let planning arbitrate execution. https://lnkd.in/gaRE_u-W #academic #research #reinforcementlearning #rl #sail #stanford #mila #montreal #umass #mit #machinelearning #idea #paper #tech #testtimecompute #sota #visionlanguagemodel #2025 #august #calberkeley #cal
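The PUCT rule mentioned above is simple to state: actions are scored by their rollout value plus an exploration bonus scaled by the prior. Here is a minimal sketch with the VLA's action probabilities as priors; the action names and numbers are made up for illustration, and this is not the VLAPS implementation.

```python
# PUCT action selection with policy priors (toy sketch): pick the action
# maximizing Q(a) + c * P(a) * sqrt(sum_b N(b)) / (1 + N(a)).
import math

def puct_select(q, n, priors, c_puct=1.0):
    total = sum(n.values())  # total visit count at this node
    def score(a):
        return q[a] + c_puct * priors[a] * math.sqrt(total) / (1 + n[a])
    return max(priors, key=score)

q = {"grasp": 0.2, "push": 0.5, "lift": 0.1}        # value estimates from rollouts
n = {"grasp": 10, "push": 3, "lift": 0}             # visit counts so far
priors = {"grasp": 0.6, "push": 0.3, "lift": 0.1}   # VLA action probabilities
chosen = puct_select(q, n, priors)
```

Unvisited actions (N = 0) get the full prior-scaled bonus, which is how the VLA steers the search toward plausible moves before any rollout evidence exists.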
-
Another robotics masterpiece from our friends from Disney Research! Recent progress in physics-based character control has improved learning from unstructured motion data, but it's still hard to create a single control policy that handles diverse, unseen motions and works on real robots. To solve this, the team at Disney proposes a new two-stage technique. In the first stage, an autoencoder is used to learn a latent space encoding from short motion clips. In the second stage, this encoding helps train a policy that maps kinematic input to dynamic output, ensuring accurate and adaptable movements. By keeping these stages separate, the method benefits from better motion encoding and avoids common issues like mode collapse. The technique has been shown to be effective in simulations and has successfully brought dynamic motions to a real bipedal robot, marking an important step forward in robot control. You can find the full paper here: https://lnkd.in/d-kzexdJ What Markus Gross, Moritz Baecher and the rest of the gang are bringing to life is unbelievable!
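The two-stage separation can be sketched as: a fixed encoder summarizes a motion clip into a latent, and the policy only ever sees that latent. Both functions below are toy stand-ins of our own, not Disney Research's models.

```python
# Toy two-stage sketch: stage 1 encodes a short motion clip into a
# latent; stage 2 conditions a policy on that (frozen) latent.

def encode_clip(clip):
    # Stand-in for a learned autoencoder: summarize a 1-D motion clip
    # by its mean position and range.
    return (sum(clip) / len(clip), max(clip) - min(clip))

def policy(state, latent):
    mean, _span = latent
    # Toy policy: drive the state toward the clip's mean motion.
    return mean - state

latent = encode_clip([0.2, 0.4, 0.6])
action = policy(state=0.1, latent=latent)
```

Because the encoder is learned first and then held fixed, the policy's training signal cannot distort the latent space, which is one way to avoid the mode collapse the post mentions.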
-
The quest for robots that can reason over long horizons just took a massive leap forward. Most AI world models today suffer from a classic problem: they get "confused" as the task gets longer. Small prediction errors pile up, and the computational cost of searching for the right move explodes. It is the AI equivalent of losing the forest for the trees. A new paper from Meta FAIR and NYU introduces Hierarchical Planning with Latent World Models (HWM). Yann LeCun Instead of forcing a robot to plan every tiny joint movement from start to finish, HWM splits the brainwork into two levels. The high-level planner thinks in "big picture" macro-actions, setting strategic subgoals in a shared latent space. Then, a low-level planner handles the precision work needed to reach those waypoints. The results are striking. In real-world robotic pick-and-place tasks, HWM boosted success rates from 0% to 70% using only a final goal image. It is up to 3x more computationally efficient than flat models because it ignores the noise and focuses on the signal. This framework is modular and model-agnostic, meaning it can be plugged into various existing AI architectures to unlock zero-shot control on tasks that require non-greedy reasoning. Read the full study to see how hierarchical latent spaces are bridging the gap between simulation and the messy reality of physical robotics. #Robotics #ArtificialIntelligence #WorldModels #MachineLearning #MetaAI #ComputerVision
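The split between "big picture" subgoals and local precision work can be sketched as a two-level loop. This toy is our illustration of the hierarchical idea, not HWM's code: latents are scalars, the high-level planner is linear interpolation, and the low-level controller is a proportional loop.

```python
# Toy hierarchical planning sketch: a high-level planner places subgoal
# waypoints between start and goal latents; a low-level controller
# closes the (much shorter) gap to each waypoint in turn.

def high_level_plan(z_start, z_goal, n_subgoals=3):
    # Stand-in for macro-action search in latent space: evenly spaced
    # waypoints ending at the goal.
    step = (z_goal - z_start) / (n_subgoals + 1)
    return [z_start + step * (k + 1) for k in range(n_subgoals)] + [z_goal]

def low_level_reach(z, z_target, gain=0.5, steps=10):
    # Toy proportional controller standing in for the low-level planner.
    for _ in range(steps):
        z = z + gain * (z_target - z)
    return z

z = 0.0
for waypoint in high_level_plan(z_start=0.0, z_goal=8.0):
    z = low_level_reach(z, waypoint)
```

The computational point carries over from the toy: each level searches over a short horizon, so errors compound over a few waypoints rather than over every low-level step.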
-
Learning from Demonstrations, particularly from biological experts like humans and animals, often encounters significant data acquisition challenges. While recent approaches leverage internet videos for learning, they require complex, task-specific pipelines to extract and retarget motion data for the agent. In this work, the authors introduce a language-model-assisted bi-level programming framework that enables a reinforcement learning agent to learn its reward directly from internet videos, bypassing dedicated data preparation. The framework has two levels: an upper level where a vision-language model (VLM) provides feedback by comparing the learner's behavior with expert videos, and a lower level where a large language model (LLM) translates this feedback into reward updates. The VLM and LLM collaborate within this bi-level framework, using a "chain rule" approach to derive a valid search direction for reward learning. #research #paper: https://lnkd.in/dJMDWWri #author: Harsh Mahesheka, Zhixian Xie, Zhaoran Wang, Wanxin Jin
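The bi-level loop can be sketched with stubs standing in for the two models. Everything here is our toy illustration, not the paper's framework: `vlm_feedback` fakes the upper-level comparison against expert video, and `llm_reward_update` fakes the lower-level translation of feedback into reward-parameter updates.

```python
# Toy bi-level reward-learning loop (stubs replace the VLM and LLM).

def vlm_feedback(learner_traj, expert_traj):
    # Upper level: compare learner behavior with the expert video and
    # return a signed discrepancy (stand-in for VLM feedback).
    return sum(e - l for l, e in zip(learner_traj, expert_traj)) / len(expert_traj)

def llm_reward_update(weights, feedback, lr=0.5):
    # Lower level: turn feedback into a reward-parameter update
    # (stand-in for the LLM's translation step).
    return [w + lr * feedback for w in weights]

weights = [0.0, 0.0]
for _ in range(3):
    learner = [w * 1.0 for w in weights]   # stand-in policy rollout
    fb = vlm_feedback(learner, expert_traj=[1.0, 1.0])
    weights = llm_reward_update(weights, fb)
```

The chaining is the point: the upper level's output is the lower level's input, so the two models jointly define a search direction for the reward, loosely analogous to a chain rule.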
-
In Robot Learning, connecting complex observations, like RGB images, to simple robotic actions is challenging because the two live in very different spaces. This becomes even harder with limited data. This is why researchers at the 𝐃𝐲𝐬𝐨𝐧 𝐑𝐨𝐛𝐨𝐭 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐋𝐚𝐛 and Imperial College London introduced 𝐑𝐞𝐧𝐝𝐞𝐫 𝐚𝐧𝐝 𝐃𝐢𝐟𝐟𝐮𝐬𝐞, which connects low-level robot actions and RGB observations using virtual images of the robot's 3D model. By unifying observations and actions in image space, 𝐑𝐞𝐧𝐝𝐞𝐫 𝐚𝐧𝐝 𝐃𝐢𝐟𝐟𝐮𝐬𝐞 computes low-level robot actions through a learned process that gradually updates the robot's virtual renders. This approach simplifies the learning problem and provides useful inductive biases for sample efficiency and spatial generalization. The team tested several versions of 𝐑𝐞𝐧𝐝𝐞𝐫 𝐚𝐧𝐝 𝐃𝐢𝐟𝐟𝐮𝐬𝐞 in simulation and demonstrated them on six everyday tasks in the real world. The results showed that 𝐑𝐞𝐧𝐝𝐞𝐫 𝐚𝐧𝐝 𝐃𝐢𝐟𝐟𝐮𝐬𝐞 has strong spatial generalization abilities and is more sample-efficient than common image-to-action methods. 📝 Research Paper: https://lnkd.in/eW5tmVsh 📊 Project Page: https://lnkd.in/eX3df_JU #robotics #research
-
I've been implementing the groundbreaking research from the paper "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware" at Trossen Robotics. This paper introduces the innovative Action Chunking Transformer (ACT) combined with Temporal Ensemble, pushing the boundaries of what's possible with low-precision robots in imitation learning.

In traditional imitation learning, high precision is crucial for performing high-dexterity tasks. Low precision often forces robots into out-of-distribution states, leading to compounding errors and task failure. ACT tackles this challenge by predicting the next K steps from the current state, reducing compounding errors K-fold. While this may seem like an open-loop control solution, it shines when paired with Temporal Ensemble: by aggregating the overlapping predictions at each time step and giving the highest weight to the oldest actions, the robot can make decisions based on states within the sample distribution, resulting in more reliable task performance.

Dive into the details of Action Chunking and Temporal Ensemble by checking out the paper, and explore the capabilities of the Aloha Kit.

Paper: https://lnkd.in/dZTcdrii
Aloha Kit: https://lnkd.in/d2ZH7XHa

#MachineLearning #Robotics #ImitationLearning #AI #Innovation #TrossenRobotics #InternshipExperience #ActionChunkingTransformer #TemporalEnsemble #AlohaKit
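The aggregation step can be sketched in a few lines: at each timestep, several previously predicted chunks overlap, and their actions are averaged with exponential weights that favor the oldest prediction. This is our toy illustration of the idea, not the ACT codebase, and the decay rate m is arbitrary.

```python
# Toy temporal-ensemble sketch: average overlapping chunk predictions
# for the current step with weights w_i = exp(-m * i), where i = 0 is
# the oldest prediction (so it receives the largest weight).
import math

def temporal_ensemble(predictions, m=0.1):
    """predictions: scalar actions for the current step, oldest first."""
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total
```

Because older predictions were made from states closer to the training distribution, weighting them more heavily pulls the executed action back toward in-distribution behavior.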
-
This new paper proposes dual-stream diffusion (DUST), a world-model augmented VLA framework. It shows that combining world models with physics-aware VLA delivers major gains in generalization and real-world task success. DUST outperforms standard VLA architectures that map perception to action without internal physical simulation. DUST keeps vision + action streams separated but cross-modal, enabling a physically consistent internal state that boosts manipulation success by 6% in simulation and 13% on real robots. This hybrid approach is the direction next-gen Robotics Foundation Models will go: physics-aware, temporally grounded, scalable, general-purpose embodied intelligence. https://lnkd.in/gCQn3-Ta #Robotics #RFM #RFM1 #RoboticsFoundationModel #WorldModel #LeCunWorldModel #EmbodiedAI #VLA #VisionLanguageAction #PhysicsAugmentedAI #DiffusionModels #ModelBasedRL #RobotManipulation #AutonomousSystems #PhysicalAI #EmbodiedFoundationModels #RobotLearning #Sim2Real #AIResearch #GeneralistRobots #IndustrialAI #DeepLearning #AIInfrastructure #FoundationModels #MachineLearning #Transformers #DiffusionTransformers #EmbodiedIntelligence #FutureOfAutomation #NextGenAI #Siemens