Diffusion Models for Robotics Performance Optimization



Summary

Diffusion models for robotics performance optimization use advanced AI techniques inspired by how particles spread in nature to help robots better predict, plan, and control their actions in complex environments. These models allow robots to adapt in real time, improve motion reasoning, and handle new or changing scenarios without needing exhaustive retraining.

Embrace simulation data: Rely on scalable synthetic data generation to train diffusion models, making it easier to prepare robots for a variety of tasks and environments.

Adapt on the fly: Use inference-time steering and alignment methods to let robots dynamically adjust their actions when faced with unexpected changes or new objects.

Combine world knowledge: Integrate physics-aware planning and motion prediction into robot control systems for more reliable and robust manipulation in real-world settings.

Summarized by AI based on LinkedIn member posts.

Honglu Zhou

multimodal AI, computer vision, video understanding, machine reasoning

2,790 followers · 4 months ago

VLAs can't just mimic expert trajectories; they need predictive motion reasoning. Our new work shows that jointly learning motion prediction via image diffusion gives robotic VLAs a superior ability to reason about what actions to take. The result: stronger, more reliable real-world manipulation. Code and model will be released.

📄 https://lnkd.in/g9vfn_SE
🔗 https://lnkd.in/g_9sBcVe

#Robotics #EmbodiedAI #VLA #DiffusionModels

🤿 Deep dive: Our method extends the VLA architecture with a dual-head design. The action head predicts action chunks as in vanilla VLAs, while an additional motion head, implemented as a Diffusion Transformer (DiT), predicts optical-flow-based motion images that capture future dynamics. The two heads are trained jointly, so the shared VLM backbone learns representations that couple robot control with motion knowledge. This joint learning builds temporally coherent and physically grounded representations without modifying the inference pathway of standard VLAs, thereby maintaining test-time latency.

Experiments in both simulation and real-world environments demonstrate that joint learning with motion image diffusion improves the success rate of pi-series VLAs to 97.5% on the LIBERO benchmark and 58.0% on the RoboTwin benchmark, yielding a 23% improvement in real-world performance and validating its effectiveness in enhancing the motion reasoning capability of large-scale VLAs.

Great work by our intern Yu Fang during his time at Salesforce AI Research!
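The joint training of the two heads described above can be pictured as a single weighted objective. The following is a minimal sketch, not the authors' implementation: it assumes a mean-squared-error action loss, a noise-prediction loss for the motion head, and a hypothetical `motion_weight` coefficient, none of which are specified in the post. The motion image is a flattened toy list rather than a real optical-flow image.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def add_noise(clean, noise, alpha_bar):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    keep = alpha_bar ** 0.5
    spread = (1.0 - alpha_bar) ** 0.5
    return [keep * x + spread * e for x, e in zip(clean, noise)]


def joint_loss(action_pred, action_target, noise_pred, noise_target,
               motion_weight=0.5):
    """Sum the action-head loss and the weighted motion-head denoising loss.

    motion_weight is a hypothetical hyperparameter chosen for illustration.
    """
    action_loss = mse(action_pred, action_target)
    motion_loss = mse(noise_pred, noise_target)
    return action_loss + motion_weight * motion_loss


# Toy usage: the motion head would receive `noisy` and try to predict `noise`.
motion_image = [0.2, -0.1, 0.4]
noise = [1.0, 0.0, -1.0]
noisy = add_noise(motion_image, noise, alpha_bar=0.25)
loss = joint_loss([0.1, 0.2], [0.1, 0.4], [0.9, 0.0, -1.0], noise)
```

Because only the action head is needed to act, dropping the motion term at inference leaves the action pathway unchanged, which matches the post's point about test-time latency.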


Adithya Murali

Staff Research Scientist at NVIDIA | MIT TR35, Prev CMU PhD, Berkeley AI Research

3,219 followers · 10 months ago

I'm super excited to release a multi-year project we have been cooking at NVIDIA Robotics.

Grasping is a foundational challenge in robotics 🤖, whether for industrial picking or general-purpose humanoids. VLA + real data collection is all the rage now, but it is expensive and scales poorly for this task. For every new embodiment and/or scene, we would have to recollect the dataset in this paradigm to get the best performance.

Key Idea: Since grasping is a well-defined task in physics simulation, why can't we just scale synthetic data generation and train a GenAI model for grasping? By embracing modularity and standardized grasp formats, we can make this a turnkey technology that works zero-shot in multiple settings.

Introducing… 🚀 GraspGen: A Diffusion-Based Framework for 6-DOF Grasping

GraspGen is a modular framework for diffusion-based 6-DOF grasp generation that scales across embodiment types, observability conditions, clutter, and task complexity.

Key Features:
✅ Multi-embodiment support: suction, antipodal pinch, and underactuated pinch grippers
✅ Generalization to both partial and complete 3D point clouds
✅ Generalization to both single objects and cluttered scenes
✅ Modular design that relies on other robotics packages and foundation models (SAM2, cuRobo, FoundationStereo, FoundationPose). This allows GraspGen to focus on only one thing: grasp generation
✅ Training recipe: the grasp discriminator is trained with On-Generator data from the diffusion model, so that it learns to correct any mistakes of the diffusion generator
✅ Real-time performance (~20 Hz) before any GPU acceleration; low memory footprint

📊 Results:
• SOTA on the FetchBench [Han et al., CoRL 2024] benchmark
• Zero-shot sim-to-real transfer on unknown objects and cluttered scenes
• Dataset of 53M simulated grasps across 8K objects from Objaverse

We're also releasing:
🔹 Simulation-based grasp data generation workflows
🔹 Standardized formats and gripper definitions
🔹 Full training infrastructure

📄 arXiv: https://lnkd.in/gaYmcfz4
🌐 Website: https://lnkd.in/gGiKRCMX
💻 Code: https://lnkd.in/gYR77bEh

A huge thank you to everyone involved in this journey; we're excited to hear the feedback from the community! Joint work with Clemens Eppner, Balakumar Sundaralingam, Yu-Wei Chao, Mark T. Carlson, Jun Yamada and other collaborators. Many thanks to Yichao Pan, Shri Sundaram, Spencer Huang, Buck Babich, Amit Goel for product management and feedback.

#robotics #grasping #physicalAI #simtoreal
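The training recipe above pairs a generator with a discriminator that is trained on the generator's own samples. A rough sketch of that idea follows; it is not GraspGen's code or API. A grasp here is a single float standing in for a 6-DOF pose, and `toy_generator`, `toy_label`, and `toy_discriminator` are hypothetical stand-ins for the diffusion model, a simulation success check, and the learned scorer.

```python
import random


def build_on_generator_dataset(generator, label_fn, num_samples, rng):
    """Label the generator's own samples so a discriminator sees its typical errors."""
    samples = [generator(rng) for _ in range(num_samples)]
    return [(grasp, label_fn(grasp)) for grasp in samples]


def select_grasps(generator, discriminator, num_candidates, top_k, rng):
    """Sample candidate grasps, score each one, and keep the top_k highest scores."""
    candidates = [generator(rng) for _ in range(num_candidates)]
    ranked = sorted(candidates, key=discriminator, reverse=True)
    return ranked[:top_k]


# Hypothetical toy components: a grasp is one float (offset from the object centre).
def toy_generator(rng):
    return rng.gauss(0.0, 1.0)


def toy_label(grasp):
    return abs(grasp) < 0.5          # "success" if close to the centre


def toy_discriminator(grasp):
    return -abs(grasp)               # higher score for smaller offsets


rng = random.Random(0)
dataset = build_on_generator_dataset(toy_generator, toy_label, 100, rng)
best = select_grasps(toy_generator, toy_discriminator, 32, 4, rng)
```

In a real system the labelled `dataset` would be used to fit the discriminator; here the scorer is hand-written so the example stays self-contained.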


John Lambert

7,181 followers · 9 months ago

Can a single autonomous driving simulation world model jointly insert, delete, and control the behavior of all agents and traffic lights in a bird's-eye-view scene? For the first time, we show this is possible in SceneDiffuser++, our CVPR '25 paper, with 60+ second simulations.

Led by our amazing intern at Waymo Research, Shuhan T., SceneDiffuser++ is a diffusion model trained solely on the diffusion denoising objective, yet it supports all insertion, deletion, and behavior control capabilities via simple autoregressive rollout.

Only learned simulators can emulate the realism of crowded city scenes. Without the ability to insert or delete objects, these simulators can only simulate a few seconds before the scene becomes empty, as the initially logged agents and traffic lights leave the periphery of the AV.

Like SceneDiffuser, we learn an agent "scene tensor," but generalize this to multi-tensor diffusion. Agent spawning, removal, and occlusion can be jointly modeled simply by predicting an additional validity channel along with other agent features such as x, y, size, type, etc. For agent and traffic light scene tensors, which have a varying number of elements and feature dimensions, we can project the scene tensors to the same latent dimension and concatenate them into a multi-tensor. We then pass this to a transformer denoiser backbone.

Though conceptually simple, this requires diffusion to learn to generate sparse tensors without a prespecified sparse structure. During inference, we develop new clipping techniques to account for invalid entries in the denoising process.

We propose a new task, CitySim, where given a city map and an AV software stack, the simulator can simulate the trip from point A -> B by populating the city around the AV and controlling all aspects of the scene (e.g., vehicles, pedestrians, traffic light states).

Thanks to brilliant collaborators: Shuhan T., Hong Jeon, Sakshum Kulshrestha, Yijing Bai, Jing Luo, Dragomir Anguelov, Mingxing Tan, "Max" Chiyu Jiang.

Full details available here:
- SceneDiffuser++ Paper: https://lnkd.in/efanc7UM
- Watch our video: https://lnkd.in/ehYbADcU
- SceneDiffuser Paper: https://lnkd.in/edr2REsS
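To make the validity-channel idea concrete, here is a small sketch in which each agent slot is an `(x, y, validity)` tuple. It assumes a plain threshold for cleaning up denoised validity values; the post does not describe the paper's actual clipping techniques, so `clip_validity` and its `threshold` are illustrative assumptions only.

```python
def clip_validity(scene, threshold=0.5):
    """Binarize the validity channel and zero the features of invalid slots.

    scene is a list of (x, y, validity) tuples; the threshold is an assumption.
    """
    cleaned = []
    for x, y, validity in scene:
        if validity >= threshold:
            cleaned.append((x, y, 1.0))
        else:
            cleaned.append((0.0, 0.0, 0.0))
    return cleaned


def spawn_and_removal(prev_scene, next_scene):
    """Compare validity between two clipped steps to find spawned and removed slots."""
    spawned, removed = [], []
    for index, (prev, nxt) in enumerate(zip(prev_scene, next_scene)):
        if prev[2] == 0.0 and nxt[2] == 1.0:
            spawned.append(index)
        elif prev[2] == 1.0 and nxt[2] == 0.0:
            removed.append(index)
    return spawned, removed


# Two consecutive denoised steps with soft validity values.
step_a = clip_validity([(1.0, 2.0, 0.9), (3.0, 4.0, 0.2)])
step_b = clip_validity([(1.5, 2.0, 0.1), (3.0, 4.0, 0.8)])
events = spawn_and_removal(step_a, step_b)
```

With this representation, insertion and deletion need no separate mechanism: slot 1 becomes valid (a spawn) and slot 0 becomes invalid (a removal) purely through the predicted channel.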


Jiafei Duan

Incoming Presidential Young Professor at NUS Computing | Robotics & AI PhD student at University of Washington, Seattle

8,355 followers · 3 months ago

Why do powerful pretrained generalist robot models fail when you move an object a few inches, swap a target, or change the scene layout? It's usually not a lack of motor skill; it's an alignment problem at test time.

In our new paper, we introduce Vision–Language Steering (VLS): a training-free, inference-time framework that adapts frozen diffusion and flow-matching robot policies to out-of-distribution (OOD) scenarios.

Key idea: Treat adaptation as an inference-time control problem. Instead of retraining policies, we steer the denoising process using:
- Vision–Language Models to interpret test-time constraints
- Differentiable, programmatic rewards grounded in 3D geometry
- Gradient-based guidance + particle resampling for stable long-horizon execution

📊 Results
CALVIN: +31% absolute success over prior steering methods
LIBERO-PRO: +13% improvement on strong VLAs (π0.5, OpenVLA)
Real world (Franka): Robust execution under appearance shifts, position swaps, and novel object substitutions

This work suggests a broader takeaway for robotics foundation models: scaling policies alone isn't enough; inference-time alignment matters.

📄 Paper: https://lnkd.in/g67pf5Tm
🌐 Project page: https://lnkd.in/gkPxZjXw
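Gradient-based guidance combined with particle resampling can be illustrated on a one-dimensional toy problem. This is a sketch under heavy assumptions, not the VLS method: the "frozen policy" is a function that pulls samples toward a fixed mode, the reward is a hand-written quadratic rather than one produced by a vision-language model, and `guidance`, `rate`, and the noise schedule are hypothetical values.

```python
import math
import random


def reward(x, target):
    """Differentiable toy reward: highest when x equals the target."""
    return -(x - target) ** 2


def reward_grad(x, target):
    """Gradient of the toy reward with respect to x."""
    return -2.0 * (x - target)


def denoise_step(x, prior_mode, rate):
    """Stand-in for one frozen-policy denoising step: move toward the policy's mode."""
    return x + rate * (prior_mode - x)


def steered_sampling(target, prior_mode=0.0, steps=20, num_particles=16,
                     guidance=0.1, rate=0.2, seed=0):
    """Denoise a set of particles, nudging each by the reward gradient and resampling."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(num_particles)]
    for step in range(steps):
        noise_scale = 0.1 * (1.0 - step / steps)   # noise shrinks over time
        particles = [
            denoise_step(x, prior_mode, rate)
            + guidance * reward_grad(x, target)
            + rng.gauss(0.0, noise_scale)
            for x in particles
        ]
        # Resample particles in proportion to exp(reward) to drop poor candidates.
        weights = [math.exp(reward(x, target)) for x in particles]
        particles = rng.choices(particles, weights=weights, k=num_particles)
    return particles


steered = steered_sampling(target=1.0)
unguided = steered_sampling(target=1.0, guidance=0.0)
steered_mean = sum(steered) / len(steered)
unguided_mean = sum(unguided) / len(unguided)
```

The policy itself is never updated; only the sampling trajectory changes, so the steered samples settle between the policy's mode and the reward's optimum while the unguided ones stay near the mode.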


Dr. Kal Mos

Executive VP, Head of Research & Predevelopment @ Siemens, ex-Google, ex-Amazon AGI, Startup Founder, Board Member

13,490 followers · 6 months ago

This new paper proposes dual-stream diffusion (DUST), a world-model augmented VLA framework. It shows that combining world models with physics-aware VLA delivers major gains in generalization and real-world task success. DUST outperforms standard VLA architectures that map perception to action without internal physical simulation.

DUST keeps vision + action streams separated but cross-modal, enabling a physically consistent internal state that boosts manipulation success by 6% in simulation and 13% on real robots.

This hybrid approach is the direction next-gen Robotics Foundation Models will go: physics-aware, temporally grounded, scalable, general-purpose embodied intelligence.

https://lnkd.in/gCQn3-Ta

#Robotics #RFM #RFM1 #RoboticsFoundationModel #WorldModel #LeCunWorldModel #EmbodiedAI #VLA #VisionLanguageAction #PhysicsAugmentedAI #DiffusionModels #ModelBasedRL #RobotManipulation #AutonomousSystems #PhysicalAI #EmbodiedFoundationModels #RobotLearning #Sim2Real #AIResearch #GeneralistRobots #IndustrialAI #DeepLearning #AIInfrastructure #FoundationModels #MachineLearning #Transformers #DiffusionTransformers #EmbodiedIntelligence #FutureOfAutomation #NextGenAI #Siemens

Linked paper: Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model (arxiv.org)

Heng Yang

Assistant Professor at Harvard SEAS

9,109 followers · 2 months ago

Glad that our work "Inference-Time Enhancement of Generative Robot Policies via Predictive World Modeling", led by Han Qi, has been accepted to IEEE Robotics and Automation Letters! 🎉

We propose Generative Predictive Control (GPC): sample action proposals from a pretrained diffusion policy ("look back"), roll them out with a diffusion-based action-conditioned video world model ("look forward"), then rank or optimize the actions using either a learned reward model or VLM preferences.

Conceptually, this is trajectory optimization / MPC with hybrid sampling + gradient optimization, interpreted through modern diffusion priors and video world models.

Interestingly, we first posted the paper on arXiv in Feb 2025, when action-conditioned video world models for planning were still rare; now this direction is rapidly gaining traction.

Many open questions remain, e.g.:
• how to avoid local minima in planning
• what representations work best for world models
• how to balance physics priors vs. data-driven learning

Paper: https://lnkd.in/g9YdKmtn
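The sample, roll out, and rank loop described above can be sketched on a toy problem. This is an illustration only, not the GPC implementation: the policy is a Gaussian around a hypothetical `policy_mean`, the "world model" is a one-dimensional integrator instead of a video model, and the score is negative distance to a goal rather than a learned reward or VLM preference. Only the ranking variant is shown, not the gradient-based optimization.

```python
import random


def sample_proposals(policy_mean, num_proposals, horizon, rng):
    """'Look back': draw action sequences from a stand-in for the pretrained policy."""
    return [[rng.gauss(policy_mean, 0.5) for _ in range(horizon)]
            for _ in range(num_proposals)]


def rollout(state, actions):
    """'Look forward': toy world model where each action shifts a 1-D position."""
    for action in actions:
        state = state + action
    return state


def rank_proposals(state, proposals, goal):
    """Order proposals by predicted outcome, best (closest to the goal) first."""
    scored = [(-abs(rollout(state, p) - goal), p) for p in proposals]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored]


rng = random.Random(0)
proposals = sample_proposals(policy_mean=0.2, num_proposals=8, horizon=5, rng=rng)
ranked = rank_proposals(state=0.0, proposals=proposals, goal=1.5)
best_plan = ranked[0]
```

In a receding-horizon setup, only the first action of `best_plan` would be executed before sampling and ranking again from the new state.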

