I was listening to one of my favorite podcasts last week, Unsupervised Learning by Redpoint Ventures. They had Karol Hausman and Danny Driess (Research Scientist) from Physical Intelligence on. Around the 33-minute mark they mentioned the need for a tool or infrastructure to help them understand what is in their dataset, particularly given the massive amount of multimodal, time-series data that robotics generates. They outlined what they'd want in such a tool:

- Decide what data to collect
- Build machinery around understanding the collected data
- Understand the quality of the data collected so far
- Perform quality assurance at scale
- Execute language annotations correctly at scale
- Determine how much more data is needed for the model
- Identify the optimal strategy for data collection
- Provide a bird's-eye view understanding of the entire dataset

I was excited by that because, well, I work at FiftyOne and we have a tool that does just that.

For understanding what's in your dataset, FiftyOne lets you visually explore massive datasets interactively. When they talked about needing a "bird's-eye view," that's literally what our embedding visualizations provide: you can see your entire dataset in embedding space, revealing clusters, gaps, and outliers. The QA-at-scale problem? FiftyOne has built-in queries to find labeling mistakes and inconsistent patterns across millions of samples. And for data collection strategy, it shows where your dataset has gaps and where models struggle, so there's no more training for weeks just to "get a signal."

So I went to Physical Intelligence's Hugging Face org and found their "aloha_pen_uncap" dataset. I parsed it into FiftyOne format to see how well our tool would work with their data. In the process, I implemented a data loader for LeRobot format datasets, which means the entire robotics community can now load their datasets in FiftyOne and get all these benefits. The loader handles the multimodal nature of robotics data, parsing camera views, robot states, and actions.

What became clear when I loaded their dataset:

- You can visually browse task executions and see patterns in successful vs. failed attempts
- Embedding visualizations show clusters of similar robot behaviors
- Quality issues like poor lighting or occlusions become immediately apparent

It's all open source, and all you need to do to get started is `pip install fiftyone` to see what your data looks like in FiftyOne (a minimal sketch of that workflow follows below). The tool mentioned in the podcast already exists, and it's open source!
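For anyone curious what that looks like in practice, here is a minimal sketch of the embedding-visualization workflow using FiftyOne's public API. The dataset name is a placeholder for whatever you called your parsed LeRobot data, and this is not the LeRobot loader itself, just the generic FiftyOne Brain workflow:

```python
import fiftyone as fo
import fiftyone.brain as fob

# Placeholder name -- point this at your own parsed LeRobot/ALOHA dataset
dataset = fo.load_dataset("aloha_pen_uncap")

# Compute a 2D embedding visualization for the "bird's-eye view" of the data
# (UMAP dimensionality reduction requires the umap-learn package)
results = fob.compute_visualization(
    dataset,
    brain_key="frame_embeddings",
    method="umap",
)

# Launch the App to browse clusters, gaps, and outliers interactively
session = fo.launch_app(dataset)
```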
Scaling VLA Data Collection in Robotics Projects
Summary
Scaling VLA (vision-language-action) data collection in robotics projects means building systems and strategies to gather and use large amounts of synchronized robot data that combines video, language, and action information. This coordinated data is crucial for training advanced robot models that learn from both human demonstrations and their own experiences.
- Streamline data pipelines: Use specialized tools and formats, like LeRobot, to bring together data from different sensors and logs into a single, organized dataset that is easy to analyze and train on.
- Prioritize quality checks: Regularly audit and visualize your datasets to spot labeling mistakes, gaps, or quality issues, which helps you fix problems early and improve model performance.
- Embrace human and synthetic data: Combine large-scale human video demonstrations with synthetic and real robot data to teach robots new skills with less hands-on robot time and faster results.
-
From ROS to LeRobot: How Are Teams Handling VLA Data Pipelines?

Most real-world robotics systems are built on pub/sub architectures like #ROS. Sensors and estimators publish asynchronously and at different rates:
• Cameras at ~30 Hz
• Perception at ~10 Hz
• State, control, and actions all run on their own clocks

This decoupled design has powered robotics for decades. Vision-Language-Action models like NVIDIA Robotics GR00T and Physical Intelligence pi0 work differently. For both training and inference, they require synchronized, tensor-based data with aligned observations, states, and actions on a shared timeline.

Hugging Face's #LeRobot has emerged as the community standard for representing this kind of training data. It is PyTorch-native, well documented, and increasingly supported across the ecosystem. The hard part is the bridge from asynchronous ROS topics to synchronized LeRobot episodes, without introducing bias or artifacts (a minimal, causal alignment sketch follows this post). At Roboto AI, we see a few common approaches in practice:

1) Raw ROSbag or MCAP, then offline conversion to LeRobot
✔ Maximum data fidelity and the ability to reprocess later
✘ Timestamp handling, resampling, interpolation, and episode definition all need real care

2) Online synchronization with direct LeRobot writing
✔ Training-ready data immediately
✘ Synchronization choices are locked in once data is recorded

3) Hybrid capture using raw bags plus a synchronized dataset
✔ Fast iteration with reproducibility
✘ Higher storage costs and more operational complexity

4) Custom, non-ROS pipelines
✔ Full control over data primitives
✘ You end up re-implementing large parts of the robotics stack

The most common failure mode we see is train-inference skew between offline preprocessing and live data flow. This problem exists across ML, but it becomes especially critical when observations map directly to robot actions. Typical causes include:
• Different resampling or alignment logic
• Implicit lookahead during offline conversion
• Episode boundaries that do not match deployment

The result is strong offline metrics and disappointing real-world behavior.

Despite the push toward end-to-end learning, most production robots will continue to rely on ROS-style pub/sub systems for the foreseeable future. That makes reproducible and auditable data curation the key link between robotics stacks and VLA training. At Roboto, we are actively building tooling to go from raw robotics data to ML-ready datasets. If you are working on VLA pipelines and have wrestled with this gap, I would love to compare notes.
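To make the asynchronous-to-synchronized step concrete, here is a minimal sketch of causal resampling onto a shared clock. It is not LeRobot's or Roboto's converter; the stream names and the helper function are placeholders, and pandas' `merge_asof` stands in for whatever alignment logic a real pipeline uses:

```python
import numpy as np
import pandas as pd

def align_streams(streams: dict, rate_hz: float) -> pd.DataFrame:
    """Resample timestamped streams (each a DataFrame with a 't' column,
    e.g. 'camera', 'state', 'action') onto a shared clock.

    Uses a *causal* join (last sample at or before each tick), which avoids
    the implicit lookahead that causes train-inference skew.
    """
    # Shared timeline covering the interval where every stream has data
    t0 = max(df["t"].iloc[0] for df in streams.values())
    t1 = min(df["t"].iloc[-1] for df in streams.values())
    aligned = pd.DataFrame({"t": np.arange(t0, t1, 1.0 / rate_hz)})

    for name, df in streams.items():
        df = df.sort_values("t").rename(
            columns=lambda c: c if c == "t" else f"{name}.{c}"
        )
        # direction="backward" only joins samples from the past
        aligned = pd.merge_asof(aligned, df, on="t", direction="backward")
    return aligned
```

The `direction="backward"` choice only ever joins samples from the past, which is exactly the property that prevents the implicit-lookahead skew described above.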
-
Robot models get better only when humans feed them more demos. This one improves by learning from its own mistakes.

pi*0.6 is a new VLA from Physical Intelligence that can refine its skills through real-world RL, not just teleop data. The team calls the method Recap, and from what I can see, the gains are not small.

A quick summary:
✅ Learns from its own rollouts using a value function trained across all data
✅ Humans only step in when the robot is about to drift too far
✅ Every correction updates the model and improves future rollouts
✅ Works across real tasks like espresso prep, laundry, and box assembly
✅ Throughput more than doubles on hard tasks, with far fewer failure cases

What stands out is the structure: a general policy, a shared value function, and a loop where the robot collects data, improves the critic, then improves itself again (see the sketch after this post). No huge fleets of teleoperators. No massive manual resets.

If VLAs can reliably self-improve in the real world, the bottleneck shifts. Data becomes cheaper. Deployment becomes the real test bench.

Full paper, videos, and method details here: https://lnkd.in/dgCeZdjT
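For readers who like to see that loop spelled out, here is a rough pseudocode-style sketch of the structure described above. Every function name is a hypothetical placeholder; this is not Physical Intelligence's Recap implementation, just the shape of a rollout / critic / policy improvement loop:

```python
# Hypothetical self-improvement loop in the spirit of the post above.
# collect_rollouts, maybe_human_correction, fit_value_function, and
# improve_policy are placeholder names, not real APIs.

def self_improvement_loop(policy, value_fn, buffer, env, n_iterations):
    for _ in range(n_iterations):
        # 1) The robot collects its own rollouts with the current policy
        rollouts = collect_rollouts(policy, env)

        # 2) Humans only intervene when the robot is about to drift too far;
        #    those corrections are recorded as extra supervision
        corrections = [fix for r in rollouts if (fix := maybe_human_correction(r))]

        buffer.extend(rollouts + corrections)

        # 3) The critic (value function) is trained across *all* data so far
        value_fn = fit_value_function(value_fn, buffer)

        # 4) The policy is improved against the updated critic
        policy = improve_policy(policy, value_fn, buffer)

    return policy
```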
-
A central challenge in #physical #AI is data scarcity: vision-language-action (#VLA) models are fundamentally limited by the availability of high-quality robotics demonstrations.

In our recent work, we introduce R&B-EnCoRe (https://lnkd.in/gSQJp6dV), a framework that enables models to self-bootstrap embodied #reasoning by leveraging synthetic visuo-textual data together with limited embodiment-specific experience. In essence, R&B-EnCoRe allows models to learn how to reason in an embodied setting.

Our approach treats reasoning as a latent variable and uses self-supervised refinement to learn reasoning strategies that are directly predictive of successful control, without human annotations, reward engineering, or external verifiers (a generic formulation of this idea follows this post).

We validate the approach across a range of embodiments, including manipulation, navigation, and autonomous driving, and across model scales from 1B to 30B parameters, observing consistent improvements:
💪 +28% task success in real-world manipulation
🦿 +101% score in legged locomotion navigation
🚗 −21% collision rate in autonomous driving

Overall, this work highlights a promising direction: aligning internet-scale priors with embodiment-specific data to enable scalable, self-improving physical intelligence.

Kudos to an amazing team: Milan Ganai Katie Luo Jonas Frey Clark Barrett

🌐 Website: https://lnkd.in/gs3abvbW
📄 Paper: https://lnkd.in/gSQJp6dV
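To unpack "reasoning as a latent variable" a little, a generic latent-variable objective looks like the standard evidence lower bound below, where z is an unobserved reasoning trace, o the observation and instruction, and a the action. This is a textbook formulation, not necessarily the paper's exact objective:

```latex
% Generic latent-variable view of embodied reasoning: the action likelihood
% is marginalized over a latent reasoning trace z, and training prefers
% traces that are predictive of successful control.
\log p_\theta(a \mid o)
  = \log \sum_{z} p_\theta(a \mid o, z)\, p_\theta(z \mid o)
  \;\ge\; \mathbb{E}_{z \sim q_\phi(z \mid o, a)}\!\left[\log p_\theta(a \mid o, z)\right]
  - \mathrm{KL}\!\left(q_\phi(z \mid o, a)\,\|\,p_\theta(z \mid o)\right)
```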
-
What if your next dexterous robot skill didn't need weeks of robot data, just massive human first-person video?

Just read this new paper from NVIDIA Robotics: "EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data."

Why it stood out (in plain terms):
• Trains a Vision-Language-Action (VLA) policy using 20,854+ hours of action-labeled egocentric (first-person) human video, 20× larger than many prior efforts.
• Finds a log-linear scaling law: more human data → lower validation loss, and that loss tracks real-robot performance (a sketch of the relationship follows this post).
• Uses a practical recipe: large-scale human pretraining → lightweight human-robot mid-training, enabling long-horizon dexterous tasks plus one-shot adaptation with minimal robot data.
• Reports a +54% average success rate improvement over a no-pretraining baseline on a 22-DoF dexterous hand, and transfers to lower-DoF hands (a strong "motor prior" signal).

My takeaway: scaling egocentric human behavior looks like a legit foundation-model pathway for manipulation, especially if your goal is less robot data, more generalization, and faster deployment.

#Robotics #EmbodiedAI #VLA #ImitationLearning #PhysicalAI #MachineLearning #ComputerVision
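For intuition, a log-linear scaling law of the kind described above is typically written as validation loss falling linearly in the logarithm of data volume. The symbols below are generic placeholders, not the paper's fitted constants:

```latex
% Generic log-linear data-scaling law: L is validation loss, D is hours of
% human video, and (a, b) are fitted constants. Each doubling of D buys a
% roughly constant reduction of b * log 2 in loss.
L(D) \approx a - b \log D, \qquad b > 0
```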
-
First empirical evidence that VLA models scale with massive real-world robot data.

VLA foundation models promise robots that can follow natural language instructions and adapt to new tasks quickly. However, the field has lacked comprehensive studies on how performance actually scales with real-world data. This new research introduces LingBot-VLA, a Vision-Language-Action foundation model trained on approximately 20,000 hours of real-world manipulation data from 9 dual-arm robot configurations. Scaling pre-training data from 3,000 hours to 20,000 hours improves downstream success rates consistently, with no signs of saturation. More data still helps.

The architecture uses a Mixture-of-Transformers design that couples a pre-trained VLM (Qwen2.5-VL) with an action expert through shared self-attention (a minimal sketch of that coupling follows this post). This allows high-dimensional semantic priors to guide action generation while avoiding cross-modal interference.

On the GM-100 benchmark, spanning 100 tasks across 3 robotic platforms with 22,500 evaluation trials, LingBot-VLA achieves a 17.30% success rate and 35.41% progress score, outperforming π0.5 (13.02% SR, 27.65% PS), GR00T N1.6 (7.59% SR, 15.99% PS), and WALL-OSS (4.05% SR, 10.35% PS). In simulation on RoboTwin 2.0, the model reaches an 88.56% success rate in clean scenes and 86.68% in randomized environments, beating π0.5 by 5.82% and 9.92% respectively.

Training efficiency matters for scaling. Their optimized codebase achieves 261 samples per second per GPU on an 8-GPU setup, representing a 1.5-2.8× speedup over existing VLA codebases like StarVLA, OpenPI, and DexBotic. Data efficiency is equally impressive: with only 80 demonstrations per task, LingBot-VLA outperforms π0.5 using the full 130-demonstration set.

This is the first empirical demonstration that VLA performance continues scaling with more real-world robot data without saturation, providing a clear roadmap for building more capable robotic foundation models.
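To illustrate what "coupling a VLM with an action expert through shared self-attention" can mean mechanically, here is a minimal PyTorch sketch. It is not LingBot-VLA's code: layer norms and masking are omitted, and the per-modality feed-forward split is just one common way to realize a Mixture-of-Transformers-style block:

```python
import torch
import torch.nn as nn

class SharedAttentionBlock(nn.Module):
    """Joint self-attention over VLM and action tokens, separate FFN experts."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Per-modality feed-forward "experts" (the MoT-style parameter split)
        self.vlm_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.act_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, vlm_tokens: torch.Tensor, act_tokens: torch.Tensor):
        # Shared self-attention over the concatenated token sequence, so
        # action tokens can attend to vision-language tokens and vice versa
        x = torch.cat([vlm_tokens, act_tokens], dim=1)
        x = x + self.attn(x, x, x, need_weights=False)[0]

        # Route each stream back through its own expert FFN
        n_vlm = vlm_tokens.shape[1]
        vlm_x, act_x = x[:, :n_vlm], x[:, n_vlm:]
        return vlm_x + self.vlm_ffn(vlm_x), act_x + self.act_ffn(act_x)


# Toy usage: batch of 2, 64 vision-language tokens, 16 action tokens
block = SharedAttentionBlock(dim=256, n_heads=8)
vlm_out, act_out = block(torch.randn(2, 64, 256), torch.randn(2, 16, 256))
```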
-
EgoScale: Scaling Human Video to Unlock Dexterous Robot Intelligence
https://lnkd.in/gMDqCzXi

One of the major hurdles in robotics is teaching machines to perform fine-grained, dexterous manipulation with human-level skill. NVIDIA's EgoScale research shows that large-scale human egocentric video (20,854+ hours) can serve as a predictable, reusable supervision source to train Vision-Language-Action (VLA) models that transfer robustly to real robotic platforms.

■ Key insights include:
• A log-linear scaling law between video data volume and model performance: more data drives better dexterity.
• A simple two-stage transfer recipe: extensive human pre-training followed by lightweight human-robot aligned mid-training.
• Policies that improve success rates by ~54% over baselines on a 22-DoF dexterous robotic hand and generalize to lower-DoF hardware.

This establishes large-scale human behavior as a cornerstone for scalable robot intelligence and one-shot task adaptation, suggesting a future where robots learn complex skills more like humans do: by observing us at scale.

===

📈 NVIDIA's new research: EgoScale
https://lnkd.in/gWVk8EYn
Using over 20,000 hours of first-person human video, they build a general-purpose model that learns fine-grained manipulation. Large-scale data becomes a "reusable motor prior," so complex behaviors can be acquired with little robot data, opening a path to better dexterity.

■ Key points (summary)
🔹 Scalable learning power from large-scale data: where prior work mostly relied on small datasets or simulation-only training, using 20,000+ hours of video shot from the human point of view substantially improves learning of fine-grained, continuous manipulation.
🔹 Discovery of a log-linear scaling law: there is a log-linear relationship between data volume and validation loss, an important foundational result that makes model performance predictable.
🔹 One-shot adaptation with little robot data: even when mid-training uses only a small amount of aligned human-robot data, unseen tasks can be acquired from a single demonstration, greatly improving practicality and efficiency.
🔹 Knowledge transfer independent of the robot's degrees of freedom: the learned "motor prior" generalizes not only to a 22-DoF hand but also to hardware with fewer DoF.

#Robotics #AI #DexterousManipulation #SimToReal #VisionLanguageAction #HumanVideoData #EgoScale
-
Real2Render2Real – Scaling Robot Data Without Dynamics Simulation or Robot Hardware
ArXiv:
Project: real2render2real.com

As robots move toward general-purpose manipulation in unstructured environments, collecting large and diverse training data remains a major bottleneck. Enter Real2Render2Real (R2R2R): a framework that scales robot data generation from just a smartphone scan and one human video, with no teleoperation, robot hardware, or physics simulation needed. R2R2R generates thousands of realistic, robot-agnostic demonstrations via 3D Gaussian Splatting and differential inverse kinematics, then trains models that match the performance of human teleoperation-based learning at 27× the throughput.

🧠 Key Concepts

1️⃣ Real-to-Synthetic Pipeline
- Input: a multi-view smartphone scan and a monocular human demo video
- Extract 3D object shape via 3D Gaussian Splatting
- Track 6-DoF object motion with 4D-DPM
- Render thousands of synthetic trajectories in photorealistic scenes using IsaacLab

2️⃣ One-to-Many Demonstration Scaling
- Interpolate and augment object trajectories for new object placements
- Use analytic grasp generation for diverse valid grasps
- Generate robot joint-space trajectories via inverse kinematics
- Supports rigid and articulated objects with automatic part segmentation

3️⃣ No Physics, No Robots, No Problem
- No force modeling, torque computation, or simulation dynamics
- Robot arms are treated as kinematic bodies, sidestepping collision models
- Policies trained only on R2R2R data match those trained on 150 real teleop demos

⚙️ How to Implement R2R2R

Phase 1 – Real-to-Sim Extraction
- Scan object → reconstruct with 3DGS → meshify with GARField
- Track object motion from video → extract part-level 6-DoF trajectories

Phase 2 – Trajectory Diversification
- Interpolate trajectories to adapt to random poses using Slerp (a small Slerp sketch follows this post)
- Estimate grasps from hand-object proximity
- Generate IK trajectories with the PyRoki solver under smoothness and joint limits

Phase 3 – Parallelized Rendering
- Render RGB frames and action data with IsaacLab
- Apply domain randomization: camera pose, lighting, table textures
- Output: RGB + proprioception + actions → usable for VLA, π0-FAST, Diffusion Policy

✅ Advantages
- ⚡ 27× faster than human teleop: 51 demos/min on 1 GPU
- 🧠 No physics or robot needed: no dynamics engine or torque simulation
- 🎥 Generalizes from 1 video: thousands of demos from a single example
- 🔧 Robot-agnostic: compatible with any robot URDF
- 🎯 High performance: matches or surpasses real demos in 5 real-world tasks
- 📦 Works with π0-FAST, Diffusion Policy, VLA: drop-in for modern imitation learners

🛠 Applications
- Vision-Language-Action (VLA) Model Training
- Robot Learning at Scale Without Robots
- Augmenting Real Datasets with Rich Visual Diversity
- Tool Learning, Multi-Object Interaction, Bimanual Tasks

Follow me to know more about AI, ML and Robotics.
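As a toy illustration of the Slerp-based trajectory diversification step, the sketch below retargets a demonstrated 6-DoF object trajectory to a new randomized start pose using SciPy's `Rotation` and `Slerp`. It is a generic illustration, not R2R2R's actual code, and the variable names are placeholders:

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def retarget_trajectory(demo_positions, demo_rotations, new_start_pos, new_start_rot):
    """Retarget a demo trajectory to a new start pose, anchoring the end pose.

    demo_positions: (T, 3) array; demo_rotations: scipy Rotation of length T;
    new_start_pos: (3,) array; new_start_rot: single scipy Rotation.
    """
    T = len(demo_positions)

    # Offset that moves the demo's first pose onto the new start pose
    pos_offset = new_start_pos - demo_positions[0]
    rot_offset = new_start_rot * demo_rotations[0].inv()

    # Blend the correction in over the trajectory: full correction at t=0,
    # no correction at the final step, so the end pose stays anchored.
    alphas = np.linspace(1.0, 0.0, T)
    new_positions = demo_positions + alphas[:, None] * pos_offset

    # Slerp between "no correction" (identity) and the full rotational offset
    offset_slerp = Slerp([0.0, 1.0],
                         Rotation.concatenate([Rotation.identity(), rot_offset]))
    new_rotations = [offset_slerp(alphas[i]) * demo_rotations[i] for i in range(T)]
    return new_positions, Rotation.concatenate(new_rotations)
```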
-
Robotics data for physical AI is front and center this year. From GTC's heavy focus on data infrastructure to human data ecosystems like EgoVerse, the field is waking up to the bottleneck of scaling robotics data. And there's real divergence in how people think about quantity, quality, modality, and diversity.

Aurora Feng, Albert K., and I have been looking at this problem. We put together a robotics data infrastructure market map and an open vendor/collector list for the community. We structured the market map around the robot data workflow end to end:

- Simulations & Evaluation: generating synthetic + real data, benchmarking model performance.
- Curation & Labeling: annotation, QA, slicing the data that actually matters.
- Ingestion & Sync: time-aligning multimodal sensor streams into usable formats.
- Storage & Indexing: making robotics data searchable and retrievable.
- Deployment & Ops: fleet telemetry, incident detection, closing the loop back into the data stack.

Most companies span multiple layers. The interesting question is which layers are still wide open.

The LLM instinct is more data = better model. In robotics that logic weakens, and the scaling laws don't directly apply (yet). What actually matters when evaluating data:

- Outcome quality: Are success/failure trajectories labeled correctly? Failure trajectories can be quite useful if labeled correctly. Some vendors mix them in with successes to inflate volume and price.
- Distribution diversity: How varied are the demonstrations? Different initial states, grasp points, camera views. Narrow distributions produce brittle models.
- Annotation granularity: Task-level labels aren't enough. You need trajectory-level and step-level annotations.

It's not about dataset size. It's about whether the data teaches the right things.

Most robot data pipelines are still duct-taped together. Some gaps we see:

- Ingestion & sync is still painfully manual. Time-aligning video, depth, force, and actions across sensors is an engineering tax every team pays independently.
- Curation & QA tooling was built for AV, not manipulation or dexterous tasks.
- Evaluation infra barely exists outside big labs. Most teams can't tell if new data actually improves their models.

Ultimately, the most valuable data infra companies will be the platforms that match the right data modalities to the right model architectures and evolve alongside the models themselves.

Who are we missing? If you're building data infra or have insights on what data inputs matter most, we want to hear from you. Vendor list + details in the first comment ↓
-
Picking works on fixed robots. But what about flying drones?

I recently visited the Stanford University MSL Multi-Robot Systems Lab and spoke with a PhD student there; the team is actively working on this exact problem.

Today's VLA models, like Physical Intelligence π0, already show strong manipulation capability in fixed, static environments. But once you move to a drone, everything changes. The system is constantly moving. The moment you grasp something, the payload shifts the dynamics. Stability, control, and execution all become tightly coupled. They call this the "dynamics gap." This is where many "general" capabilities start to fall apart.

In their new paper, they introduce an #AirVLA system to bridge this gap. It comes down to two moves:

1. Fix the action interface.
• Rather than modeling full dynamics, they focus on the dominant failure mode and correct it directly during action generation.
• They adjust actions at inference to account for payload-induced instability, especially along the vertical axis where drones are most sensitive (a toy illustration of this kind of correction follows this post).

2. Fix the data gap.
• Instead of relying solely on real flight data, they build a synthetic pipeline using Gaussian Splatting to generate navigation and recovery trajectories.
• This covers edge cases that are hard or unsafe to collect in the real world.

Still early, and more work is needed. But I'd love to see drones that can pick up and deliver our packages in the near future.

- Leo 磊 Su Qianzhong Chen
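As a toy illustration of what an inference-time vertical correction could look like (this is an assumption for illustration, not the AirVLA paper's actual method), consider a simple bounded proportional adjustment on the policy's z-axis command:

```python
import numpy as np

def correct_vertical_action(action, measured_altitude, target_altitude,
                            gain=0.5, max_delta=0.2):
    """Hypothetical post-hoc correction of a policy action for payload drift.

    action: [dx, dy, dz, ...] command from the VLA policy (placeholder layout).
    Returns a corrected copy; only the vertical (z) channel is adjusted.
    """
    corrected = np.array(action, dtype=float)

    # If the grasped payload is dragging the drone below its intended altitude,
    # add a bounded upward correction to the vertical channel only.
    altitude_error = target_altitude - measured_altitude
    corrected[2] += np.clip(gain * altitude_error, -max_delta, max_delta)
    return corrected
```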