Abstract

 Manipulating dynamic objects remains an open challenge for Vision-Language-Action (VLA) models, which, despite strong generalization in static manipulation, struggle in dynamic scenarios requiring rapid perception, temporal anticipation, and continuous control.
  We present DynamicVLA, a framework for dynamic object manipulation that integrates temporal reasoning and closed-loop adaptation through three key designs: 1) a compact 0.4B VLA using a convolutional vision encoder for spatially efficient, structurally faithful encoding, enabling fast multimodal inference; 2) Continuous Inference, enabling overlapping reasoning and execution for lower latency and timely adaptation to object motion; and 3) Latent-aware Action Streaming, which bridges the perception–execution gap by enforcing temporally aligned action execution.
To fill the missing foundation of dynamic manipulation data, we introduce the Dynamic Object Manipulation (DOM) benchmark, built from scratch with an automated data collection pipeline that efficiently gathers 200K synthetic episodes across 2.8K scenes and 206 objects, and enables fast collection of 2K real-world episodes without teleoperation.
 Extensive evaluations demonstrate remarkable improvements in response speed, perception, and generalization, positioning DynamicVLA as a unified framework for general dynamic object manipulation across embodiments.

Discussion

 This work shows that, for dynamic object manipulation with VLA models, the dominant failure mode is not perceptual ambiguity but the temporal misalignment between observation and action execution — a factor largely ignored in static manipulation.
To address this misalignment, we design DynamicVLA with three innovations: 1) a compact 0.4B backbone that supports high-frequency reasoning; 2) Continuous Inference to overlap reasoning and execution for timely adaptation; and 3) Latent-aware Action Streaming to enforce temporally aligned action execution.
 To address the scarcity of large-scale dynamic manipulation data, we develop an automatic simulation and real-world data collection pipeline that drives a robot arm with state-machine controllers, using object states from the simulation engine and from a real-world “simulator” interface, respectively.
  Together, these elements significantly reduce the perception–execution gap and yield more responsive behavior than conventional VLA models.
 Looking forward, several limitations of the current study point to promising directions for future work:
 More Efficient VLA Architectures. While DynamicVLA highlights the importance of latency-aware design for dynamic manipulation, real-time constraints fundamentally trade off multimodal understanding against responsiveness. Dynamic tasks tightly couple perception, reasoning, and execution, demanding architectures and inference schemes that preserve understanding under strict latency budgets.
 Beyond Short-horizon Dynamics. Our current formulation emphasizes short- to medium-horizon reactive interaction, which exposes latency-induced failures but does not capture longer-horizon dynamic behaviors. Future work should extend dynamic manipulation to multi-stage tasks with persistent object motion, integrating planning, memory, and task decomposition while remaining compatible with language conditioning and real-time execution constraints.
 Beyond Rigid-Body Dynamics. Our data pipeline assumes rigid-body state estimation, whereas many dynamic tasks involve non-rigid or fluid dynamics with continuously evolving states that are difficult to represent in both simulation and the real world. Extending VLA models and data pipelines to such settings remains an open challenge.

Paper Overview

Title: DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation
Authors: Haozhe Xie, Beichen Wen, Jiarui Zheng, Zhaoxi Chen, Fangzhong Hong, Haiwen Diao, Ziwei Liu
Affiliation: S-Lab, Nanyang Technological University
arXiv: 2601.22153v1 [cs.RO] (January 29, 2026)


Core Problem and Research Background

Limitations of Existing VLA Models

Current VLA models excel at static manipulation but face serious challenges in dynamic object manipulation (e.g., grasping moving objects):

| Problem | Manifestation |
| --- | --- |
| Perception-execution gap (P.E. gap) | The object keeps moving while the model reasons, so predicted actions are misaligned with the actual environment state |
| Inter-chunk waiting | The next inference can only start after the current action chunk finishes executing, interrupting control |
| Latency sensitivity | Even 100-200 ms of latency can cause task failure |

As shown in Fig. 1(a), conventional VLA models suffer from a pronounced perception-execution gap and inter-chunk waiting; DynamicVLA eliminates both through Continuous Inference and Latent-aware Action Streaming.
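To make the latency sensitivity concrete, here is a back-of-the-envelope check (a minimal sketch; the 0-0.75 m/s speed range comes from the DOM specifications later in this overview):

```python
# How far does an object drift while the model is still reasoning?
# Latencies of 100-200 ms are cited above; DOM objects move at up to 0.75 m/s.
for latency_ms in (100, 200):
    for speed_mps in (0.25, 0.75):
        drift_cm = speed_mps * (latency_ms / 1000) * 100
        print(f"{latency_ms} ms at {speed_mps} m/s -> {drift_cm:.1f} cm of drift")
# At 0.75 m/s and 200 ms the object has moved 15 cm between observation and
# execution, easily enough to miss a grasp entirely.
```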


The Three Core Designs of DynamicVLA

1. Compact 0.4B-Parameter VLA Architecture (Fig. 2a)

Architecture:
├── Vision encoder: FastViT (convolutional, not a Transformer)
│   └── Advantage: efficient spatial compression, avoiding quadratic token growth with multi-frame input
├── Language backbone: SmolLM2-360M (first 16 layers only)
│   └── Parameters: 360M → truncation markedly reduces latency
└── Action expert: 16-layer flow matching Transformer
    └── Diffusion-style action generation predicting 20-step action chunks

Key innovation: replacing the conventional Transformer vision encoder with a convolutional one enables:

  • faster multimodal inference
  • stronger preservation of spatial structure
  • a total parameter count of only 430M
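As a rough illustration of how these components compose, here is a minimal PyTorch-style sketch. The dimensions, the stand-in convolutional stack, and the shallow stand-in LM are assumptions made for readability; only the component roles (convolutional encoder, truncated language backbone, 16-layer flow-matching expert, 20-step chunks) follow the paper.

```python
import torch
import torch.nn as nn

class DynamicVLASketch(nn.Module):
    """Sketch of the 0.4B layout: conv vision encoder -> truncated LM ->
    flow-matching action expert. All sizes here are illustrative."""

    def __init__(self, d=512, action_dim=7, chunk_len=20):
        super().__init__()
        # Stand-in for FastViT: convolutions downsample aggressively, so each
        # frame contributes only a handful of tokens (no quadratic token
        # growth when stacking multiple frames).
        self.vision = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.GELU(),
            nn.Conv2d(64, d, 7, stride=4, padding=3), nn.GELU(),
            nn.AdaptiveAvgPool2d(4),            # -> 4x4 = 16 tokens per frame
        )
        # Stand-in for the first 16 layers of SmolLM2-360M (kept tiny here).
        self.lm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, 8, batch_first=True), num_layers=4)
        # The 16-layer flow-matching Transformer action expert.
        self.expert = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, 8, batch_first=True), num_layers=16)
        self.act_in = nn.Linear(action_dim, d)
        self.t_emb = nn.Linear(1, d)
        self.head = nn.Linear(d, action_dim)
        self.chunk_len = chunk_len

    def forward(self, frames, text_emb, noisy_chunk, t):
        """Predict the flow-matching velocity field for a noisy action chunk."""
        v = self.vision(frames).flatten(2).transpose(1, 2)    # (B, 16, d)
        ctx = self.lm(torch.cat([v, text_emb], dim=1))        # fused context
        a = self.act_in(noisy_chunk) + self.t_emb(t.view(-1, 1, 1))
        out = self.expert(torch.cat([ctx, a], dim=1))
        return self.head(out[:, -self.chunk_len:])            # (B, 20, action_dim)
```

At inference time, a 20-step chunk is obtained by integrating this velocity field from Gaussian noise over a few steps, so each reasoning call emits a whole chunk rather than a single action.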

2. Continuous Inference (Fig. 2b)

Core idea: overlap reasoning with execution to eliminate inter-chunk waiting.

| Conventional | Continuous Inference |
| --- | --- |
| Infer → wait → execute → wait → infer… | Reasoning and execution run in parallel: while chunk A_t executes, inference for the next chunk A_{t+m} is already underway |
| Idle waiting between chunks | Non-blocking execution keeps the action stream continuous |

Assumption: the action chunk length n exceeds the inference latency m, ensuring a new action sequence is available before the current one runs out.
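A minimal threaded sketch of this overlap schedule (the names policy, get_obs, and execute_step are illustrative stand-ins, not the authors' API):

```python
import threading, queue, time

def continuous_control_loop(policy, get_obs, execute_step, control_hz=25):
    """Sketch of Continuous Inference: a background thread keeps reasoning
    while the main loop executes the current chunk, removing inter-chunk
    waiting. Assumes chunk length n > inference latency m (in control steps)."""
    ready = queue.Queue(maxsize=1)

    def reason_forever():
        while True:
            obs = get_obs()            # observation at inference start, time t
            ready.put(policy(obs))     # chunk arrives ~m control steps later

    threading.Thread(target=reason_forever, daemon=True).start()

    chunk = list(ready.get())          # cold start: first chunk must be awaited
    while True:
        for action in chunk:
            execute_step(action)
            time.sleep(1.0 / control_hz)
            if not ready.empty():      # newer chunk finished mid-execution
                chunk = list(ready.get())
                break                  # switch immediately (LAAS refines this)
        else:
            chunk = list(ready.get())  # n > m keeps this from blocking long
```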

3. Latent-aware Action Streaming (LAAS) (Fig. 2c)

LAAS resolves the temporal misalignment introduced by inference latency:

Problem 1: the perception-execution gap

  • Inference starts at time t and predicts the action chunk A_t
  • By the time inference finishes, the clock has advanced to t+m and the environment state is O_{t+m}
  • The actions {a_t, …, a_{t+m-1}} are already misaligned with the current observation

Problem 2: conflicts between overlapping action chunks

  • Continuous Inference causes A_t and A_{t+m} to overlap in time
  • The same execution step has multiple candidate actions

The LAAS solution (a code sketch follows this list):

  1. Drop stale actions: discard the actions in A_t that correspond to steps before t+m
  2. Prefer newer predictions: in overlapping regions, execute actions from the more recent chunk A_{t+m}
  3. Enforce temporal alignment: execution is always grounded in the latest environment state
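The selection rule can be written compactly. A minimal sketch, where the (pred_step, ready_step, actions) layout is an assumption made for illustration:

```python
def laas_select(chunks, now):
    """Pick the action to execute at control step `now` from a set of
    overlapping chunks. Each entry is (pred_step, ready_step, actions):
    the chunk was predicted from the observation at pred_step and became
    available at ready_step = pred_step + m (illustrative layout)."""
    best = None
    for pred_step, ready_step, actions in chunks:
        idx = now - pred_step                 # where `now` falls in this chunk
        if ready_step <= now < pred_step + len(actions):
            # Rule 1 is implicit: steps before ready_step are never served
            # from this chunk, so its first m (stale) actions are dropped.
            # Rule 2: among valid candidates, the freshest prediction wins.
            if best is None or pred_step > best[0]:
                best = (pred_step, actions[idx])
    return None if best is None else best[1]

# Example with m = 8 and 20-step chunks: at now = 18, A_0 (covers steps 0-19)
# and A_8 (covers 8-27, ready at 16) both apply; A_8 wins because it was
# conditioned on a fresher observation.
```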

The DOM Benchmark

The paper builds, from scratch, the first large-scale benchmark for dynamic object manipulation: Dynamic Object Manipulation (DOM).

Data Scale

| Type | Scale | Details |
| --- | --- | --- |
| Synthetic data | 200K episodes | Isaac Sim; 2.8K scenes, 206 objects |
| Real-world data | 2K episodes | No teleoperation; ~10 s per episode |
| Cross-embodiment | Franka Panda + AgileX PiPER | Validates generality across robots |

Automatic Data Collection Pipeline (Fig. 3)

Simulation environment:

  • Objects: 206 everyday items (fruit, containers, tools, etc.) moving at 0-0.75 m/s
  • Scenes: 2.8K 3D-FRONT indoor scenes
  • Sensors: 3 cameras (25 FPS, 480×360)

Real-world “simulator” (a two-view geometry sketch follows this list):

  • Real-time 3D object tracking from two RGB views
  • EfficientTAM provides object masks
  • Geometric triangulation recovers the 3D centroid
  • A short-window motion fit estimates velocity
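The tracking math is standard two-view geometry. A minimal NumPy sketch of both steps, assuming calibrated cameras (this mirrors the textbook DLT formulation, not necessarily the authors' exact implementation):

```python
import numpy as np

def triangulate_centroid(P1, P2, uv1, uv2):
    """Recover a 3D point from its 2D mask centroids (e.g. from EfficientTAM)
    in two calibrated views. P1, P2 are 3x4 projection matrices; this is the
    standard linear (DLT) method."""
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    X = np.linalg.svd(A)[2][-1]        # null-space direction of A
    return X[:3] / X[3]                # homogeneous -> Euclidean

def estimate_velocity(times, centroids):
    """Least-squares constant-velocity fit over a short window of recent
    3D centroids, one slope per axis."""
    t = np.asarray(times) - times[0]
    A = np.stack([t, np.ones_like(t)], axis=1)      # [t, 1] design matrix
    slope, _ = np.linalg.lstsq(A, np.asarray(centroids), rcond=None)[0]
    return slope                       # (vx, vy, vz) in m/s
```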

State-machine controller (four phases; a sketch follows the list):

  1. Approach: predict the object's position 0.23 s ahead and position the end effector accordingly
  2. Grasp and lift: descend, cancel residual motion, grasp, lift
  3. Approach target and place: move to the target location and place precisely
  4. Reset: return to the initial pose
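In code, the controller reduces to a small state machine. A minimal sketch: the ee end-effector interface, thresholds, and offsets are hypothetical, while the 0.23 s look-ahead is the paper's value.

```python
import numpy as np
from enum import Enum, auto

class Phase(Enum):
    APPROACH = auto()
    GRASP_LIFT = auto()
    PLACE = auto()
    RESET = auto()

def step(phase, obj_pos, obj_vel, target_pos, ee, lookahead_s=0.23):
    """One tick of the four-phase collection controller (sketch). `ee` is a
    hypothetical end-effector interface; thresholds/offsets are assumptions."""
    if phase is Phase.APPROACH:
        # Intercept: aim at where the object WILL be, not where it is.
        aim = np.asarray(obj_pos) + lookahead_s * np.asarray(obj_vel)
        ee.move_to(aim + np.array([0.0, 0.0, 0.10]))   # hover above the point
        if ee.distance_to(aim) < 0.02:                 # within 2 cm: commit
            phase = Phase.GRASP_LIFT
    elif phase is Phase.GRASP_LIFT:
        ee.descend()
        ee.match_velocity(obj_vel)    # cancel residual object motion
        ee.close_gripper()
        ee.lift(0.15)
        phase = Phase.PLACE
    elif phase is Phase.PLACE:
        ee.move_to(np.asarray(target_pos))
        ee.open_gripper()
        phase = Phase.RESET
    else:  # Phase.RESET
        ee.go_home()
        phase = Phase.APPROACH
    return phase
```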

Experimental Results

Simulation Benchmark Results (Table I)

DynamicVLA outperforms the baselines across all nine evaluation dimensions:

| Dimension | Best baseline | DynamicVLA | Improvement |
| --- | --- | --- | --- |
| Interaction | | | |
| Closed-loop Reactivity | 21.0% | 60.5% | +188% |
| Dynamic Adaptation | 20.5% | 38.5% | +88% |
| Long-horizon Sequencing | 7.5% | 40.5% | +440% |
| Perception | | | |
| Visual Understanding | 16.5% | 51.5% | +212% |
| Spatial Reasoning | 16.5% | 48.0% | +191% |
| Motion Perception | 14.0% | 33.5% | +139% |
| Generalization | | | |
| Visual Generalization | 15.0% | 59.5% | +297% |
| Motion Generalization | 21.0% | 65.0% | +210% |
| Disturbance Robustness | 20.0% | 26.5% | +33% |
| Average success rate | 13.61% | 47.06% | +246% |

Key findings:

  • Task completion time drops from ~10 s to 8.53 s
  • Path length increases (2.50 m vs. 1.27-1.51 m), reflecting more active pursuit of moving objects

Real-World Experiments (Figs. 4-6)

Interaction tasks (Fig. 4):

  • Task #1 (coffee can → wooden box): DynamicVLA 78.3% vs. π₀.₅ at 16.7%
  • Task #6 (collecting tennis balls): DynamicVLA 71.6% vs. the best baseline at 6.7%

Perception tasks (Fig. 5):

  • Task #2 (blue-tape region): DynamicVLA 61.7% vs. the best baseline at 6.7%
  • Task #5 (slow-moving ball): DynamicVLA 31.6% vs. the best baseline at 8.3%

Generalization tasks (Fig. 6):

  • Unseen object appearances: 73.3% success rate
  • Irregular motion patterns: 50-70% success rate

Ablation Study (Table II)

Validating the contribution of each component:

| Configuration | Success rate | Key finding |
| --- | --- | --- |
| [1] Baseline (no CI, no LAAS) | 30.27% | Pronounced perception-execution gap |
| [2] + Continuous Inference (CI) | 36.11% | Removes inter-chunk waiting (+5.8 pp) |
| [3] + LAAS (no CI) | 39.72% | LAAS alone is partially effective |
| [7] Full DynamicVLA | 47.06% | CI and LAAS are synergistic (+16.8 pp) |
| [4] Smaller backbone (135M) | 26.67% | Insufficient capacity |
| [5] Larger backbone (1.7B) | 24.33% | Latency too high, counterproductive |
| [6] Transformer vision encoder | 28.89% | FastViT is the better choice |

Key insights:

  • 360M language-backbone parameters hit the sweet spot between inference speed and model capacity
  • CI and LAAS are complementary: each helps alone, and the combination works best
  • FastViT matters: it clearly outperforms a Transformer vision encoder

Cross-Model Validation (Table V)

Integrating CI and LAAS into existing VLA models:

| Method | Original success rate | + CI + LAAS | Improvement |
| --- | --- | --- | --- |
| π₀.₅ (1.5B) | ~11% | 15.89% | Limited (latency too high) |
| SmolVLA | ~12% | 25.56% | Substantial (+113%) |
| DynamicVLA | - | 47.06% | Designed for dynamic manipulation |

Conclusion: CI and LAAS generalize across models, but their benefit is bounded by the base model's latency.


Summary of Technical Contributions

  1. The first compact VLA model for dynamic object manipulation (0.4B parameters)

    • A convolutional vision encoder provides efficient spatial compression
    • A truncated language backbone balances speed and performance
  2. Continuous Inference

    • Overlaps reasoning with execution, eliminating inter-chunk waiting
    • Maintains a high-frequency control stream that adapts to rapid change
  3. Latent-aware Action Streaming

    • Explicitly handles the temporal misalignment caused by inference latency
    • Enforces perception-execution alignment, improving closed-loop stability
  4. The DOM benchmark

    • The first large-scale standardized evaluation of dynamic manipulation
    • An automatic collection pipeline spanning simulation and the real world

Limitations and Future Directions

  1. Short-horizon dynamics: the current focus is reactive interaction; extension to multi-stage, long-horizon tasks is needed
  2. Rigid-body assumption: state estimation assumes rigid-body dynamics; non-rigid objects and fluids remain open
  3. Efficiency-understanding trade-off: real-time constraints limit multimodal understanding, calling for more efficient architectures

Conclusion

DynamicVLA addresses the core bottleneck of VLA models in dynamic object manipulation, the temporal misalignment between perception and execution, through three innovations: a compact architecture, Continuous Inference, and temporally aligned execution. Its across-the-board lead on the DOM benchmark (average success rate 47.06% vs. 13.61% for the best baseline) demonstrates the method's effectiveness and establishes a new technical paradigm for dynamic robot interaction.
