Paper Reading: "ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models"
Abstract
Vision-Language-Action (VLA) models have emerged as essential generalist robot policies for diverse manipulation tasks, conventionally relying on directly translating multimodal inputs into actions via Vision-Language Model (VLM) embeddings. Recent advancements have introduced explicit intermediary reasoning—such as sub-task prediction (language) or goal image synthesis (vision)—to guide action generation.
However, these intermediate reasoning are often indirect and inherently limited in their capacity to convey the full, granular information required for precise action execution.
Instead, we posit that the most effective form of reasoning is one that deliberates directly in the action space.
We introduce Action Chain-of-Thought (ACoT), a paradigm where the reasoning process itself is formulated as a structured sequence of coarse action intents that guide the final policy.
In this paper, we propose ACoT-VLA, a novel architecture that materializes the ACoT paradigm.
Specifically, we introduce two complementary components: an Explicit Action Reasoner (EAR) and Implicit Action Reasoner (IAR).
The former proposes coarse reference trajectories as explicit action-level reasoning steps, while the latter extracts latent action priors from internal representations of multimodal input, co-forming an ACoT that conditions the downstream action head to enable grounded policy learning.
Extensive experiments in real-world and simulation environments demonstrate the superiority of our proposed method, which achieves 98.5%, 84.1%, and 47.4% on LIBERO, LIBERO-Plus and VLABench, respectively.
Conclusion
In this work, we addressed the fundamental semantic-kinematic gap in modern robotic policies by proposing a new paradigm: Action Chain-of-Thought (ACoT).
We argued that for physically grounded intelligence, deliberation should occur not in the abstract space of language or vision, but directly in the kinematically grounded space of actions.
We materialized this concept in our ACoT-VLA framework, which leverages two synergistic modules, i.e., an Explicit Action Reasoner (EAR) and an Implicit Action Reasoner (IAR), to generate and fuse both explicit trajectory plans and implicit behavioral priors.
This action-centric guidance mechanism creates a direct, information-rich conduit between high-level intent and low-level motor control.
Our extensive experiments across multiple simulation and real-world benchmarks demonstrate that this approach yields state-of-the-art performance, significantly improving both task success and robustness.
By shifting the locus of reasoning from perception to action, our work not only provides a more effective and grounded method for robot policy learning but also opens a new avenue for research into more structured, interpretable, and capable embodied agents.
We believe that learning to “think” in the language of actions is a critical step towards developing the next generation of generalist robots.
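Below is a detailed analysis of the paper ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models.
1. Background and Core Problem
1.1 Evolution of VLA Models
Vision-Language-Action (VLA) models have become the mainstream paradigm for generalist robot policies. Their core architecture:
- A pretrained vision-language model (VLM) encodes the visual and language inputs
- The VLM's implicit representations condition the action decoder
1.2 Limitations of Existing Methods
The paper identifies two mainstream intermediate-reasoning paradigms and their problems:
| Paradigm | Representative work | Intermediate representation | Core limitation |
|---|---|---|---|
| Language CoT | π0.5, OpenVLA | Sub-task descriptions | Abstract; cannot convey precise kinematic information |
| Visual CoT | CoT-VLA, DreamVLA | Goal images | Indirect guidance; still limited by the vision-action gap |
Key insight: existing methods reason mainly in the input space (vision/language) rather than the output space (action), which creates a fundamental mismatch between high-level semantic representations and low-level motor control.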
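2. Core Innovation: Action Chain-of-Thought (ACoT)
2.1 Paradigm Shift
ACoT redefines the "thinking process" as an explicit, kinematically grounded sequence of action intents rather than abstract language tokens or visual images.
“For physically grounded agents, reasoning should take place in the kinematically grounded action space rather than in the abstract space of language or vision.”
2.2 ACoT-VLA Architecture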
┌────────────────────────────────────────────────────────────────┐
│                 ACoT-VLA Architecture Overview                 │
├────────────────────────────────────────────────────────────────┤
│ Input: language instruction l + current visual observation O_t │
│                               ↓                                │
│ VLM Backbone (SigLIP + Gemma 2B) → Key-Value Cache             │
│                               ↓                                │
│ ┌──────────────────────────┐    ┌──────────────────────────┐   │
│ │ Explicit Action Reasoner │    │ Implicit Action Reasoner │   │
│ │ (EAR)                    │    │ (IAR)                    │   │
│ │                          │    │                          │   │
│ │ • Transformer-based      │    │ • Cross-attention with   │   │
│ │ • Coarse reference       │    │   learnable queries      │   │
│ │   trajectory synthesis   │    │                          │   │
│ │ • Flow matching          │    │                          │   │
│ │ • Output: Z^ex           │    │ • Output: Z^im           │   │
│ └─────────────┬────────────┘    └─────────────┬────────────┘   │
│               ↓                               ↓                │
│   ┌────────────────────────────────────────────────────────┐   │
│   │ Action-Guided Prediction (AGP)                         │   │
│   │ Dual cross-attention fusion mechanism                  │   │
│   │                                                        │   │
│   │ Q_action ──→ CrossAttn ──→ S^ex                        │   │
│   │     ↓    ──→ CrossAttn ──→ S^im                        │   │
│   │ [S^ex; S^im] → Self-Attn → Action Head                 │   │
│   │                                                        │   │
│   │ Output: denoised action sequence a_{t:t+H-1}           │   │
│   └────────────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────────────┘
3. Key Technical Details
3.1 Explicit Action Reasoner (EAR)
Function: autonomously synthesizes a reference action sequence that serves as internal guidance for policy learning.
Mathematical form:
- Input: a noisy action sequence $\tilde{a}_{t:t+H^{ref}-1}$
- Cross-attention over the VLM's KV cache at layer $i$:
  $$\tilde{h}_i^{ref} = \text{Self-Attn}(h_{i-1}^{ref}) + \text{CrossAttn}(h_{i-1}^{ref}, K_i^{VLM}, V_i^{VLM})$$
- Flow matching learns the distribution over action trajectories and outputs the denoised reference actions $a_{t:t+H^{ref}-1}^{ref}$
Key design choices:
- Lightweight Transformer (N = 18 layers)
- Teacher forcing to stabilize training: Z^ex is computed from the ground-truth trajectory during training, and inference switches to a self-conditioned mode
- Reference action horizon H^ref = 15; policy output horizon H = 10
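To make the flow-matching step concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: it assumes a straight-line path from Gaussian noise (τ = 0) to the clean action chunk (τ = 1), treats the reference trajectory as a flat list of floats, and ignores the conditioning on the VLM KV cache. The names `interpolate`, `target_velocity`, `flow_matching_loss`, and `euler_denoise` are hypothetical, and the time convention used in ACoT-VLA may differ.

```python
import random


def interpolate(noise, action, tau):
    """Point on the straight path from noise (tau=0) to action (tau=1)."""
    return [(1.0 - tau) * n + tau * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Velocity of the straight path; it does not depend on tau."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(predict_velocity, action, rng):
    """One-sample MSE between predicted and target velocity."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    tau = rng.random()  # uniform in [0, 1); other schedules are possible
    x_tau = interpolate(noise, action, tau)
    target = target_velocity(noise, action)
    pred = predict_velocity(x_tau, tau)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(action)


def euler_denoise(predict_velocity, noise, steps=10):
    """Integrate dx/dtau = v(x, tau) from tau=0 to tau=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        v = predict_velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

With a trained network in place of `predict_velocity`, `euler_denoise` would turn a noise sample into a reference trajectory; in the setting described above that trajectory would span H^ref = 15 steps.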
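3.2 Implicit Action Reasoner (IAR)
Function: extracts implicit action priors from the VLM's internal representations.
Mechanism:
- Initialize a learnable query matrix $Q_i \in \mathbb{R}^{M \times d}$ at each layer (M = 1)
- Downsample the KV cache to dimension d' = 128 to reduce compute
- Cross-attention extracts action-relevant information, which is pooled and projected by an MLP to obtain $z_i^{im}$
Key finding: the downsampling strategy outperforms direct querying and attention pooling, which suggests that VLM features carry information that is noisy for action prediction and therefore need a carefully designed alignment mechanism.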
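The IAR mechanism for a single layer can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: the downsampling is modeled as a plain linear projection (`down_k`, `down_v`), there is one query (M = 1) so pooling over queries is trivial, and the final MLP projection is omitted. The function names are hypothetical and not taken from the paper's code.

```python
import math


def project(rows, weight):
    """Multiply each row (length d) by a d x d_out matrix."""
    d_out = len(weight[0])
    return [[sum(r[i] * weight[i][j] for i in range(len(r)))
             for j in range(d_out)] for r in rows]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def single_query_attention(query, keys, values):
    """One learnable query attending over all key/value tokens."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([scale * sum(q * k for q, k in zip(query, key))
                       for key in keys])
    dim = len(values[0])
    pooled = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(dim)]
    return pooled, weights


def implicit_prior(query, keys, values, down_k, down_v):
    """Downsample the cached keys/values, then query them once."""
    keys_small = project(keys, down_k)      # d -> d' (d' = 128 in the post)
    values_small = project(values, down_v)  # d -> d'
    return single_query_attention(query, keys_small, values_small)
```

Because the query lives in the reduced dimension d', the attention cost per layer scales with d' rather than with the VLM hidden size, which is the compute saving the bullet list refers to.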
3.3 Action-Guided Prediction (AGP)
Dual cross-attention fusion:
$$S^{ex} = \text{CrossAttn}(Q_{action}, Z^{ex}, Z^{ex})$$
$$S^{im} = \text{CrossAttn}(Q_{action}, Z^{im}, Z^{im})$$
Self-attention fusion: $[S^{ex}; S^{im}] \rightarrow \text{Self-Attn} \rightarrow \bar{h}$
Training objective: $\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{\pi_{\theta}^{ref}} + \lambda_2 \mathcal{L}_{\pi_{\theta}^{head}}$ (λ₁ = λ₂ = 0.5)
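The fusion and the combined objective can be sketched in a few lines of pure Python. Several things here are assumptions rather than facts from the paper: the attention has no learned projections, `[S^ex; S^im]` is read as concatenation along the token axis, and the function names `attention`, `action_guided_fusion`, and `total_loss` are hypothetical. Treat it as a shape-level illustration only.

```python
import math


def attention(queries, keys, values):
    """Scaled dot-product attention without learned projections."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, values))
                    for j in range(len(values[0]))])
    return out


def action_guided_fusion(q_action, z_ex, z_im):
    """Dual cross-attention followed by self-attention over both results."""
    s_ex = attention(q_action, z_ex, z_ex)  # explicit guidance
    s_im = attention(q_action, z_im, z_im)  # implicit guidance
    fused = s_ex + s_im  # assumption: concatenate along the token axis
    h_bar = attention(fused, fused, fused)  # self-attention
    return h_bar, s_ex, s_im


def total_loss(loss_ref, loss_head, lambda_ref=0.5, lambda_head=0.5):
    """Weighted sum of the EAR loss and the action-head loss."""
    return lambda_ref * loss_ref + lambda_head * loss_head
```

With the equal weights reported above, `total_loss` is simply the mean of the two flow-matching losses, so neither the reference generator nor the action head dominates the gradient by construction.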
4. Experimental Results
4.1 Simulation Benchmarks
LIBERO (Table 1)
| Method | Guidance type | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|---|
| π0.5 | Language | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 |
| OpenVLA-OFT | Language | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |
| UniVLA | Vision | 95.4 | 98.8 | 93.6 | 94.0 | 95.5 |
| DreamVLA | Vision | 97.5 | 94.0 | 89.5 | 89.5 | 92.6 |
| Ours | Action | 99.4 | 99.6 | 98.8 | 96.0 | 98.5 |
Key observations:
- The largest gain is on the Long (long-horizon) suite, +3.6 points over π0.5, which supports ACoT's robustness to error accumulation
- Reasoning in action space outperforms both language and visual CoT
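LIBERO-Plus robustness test (Table 2)
Under seven distribution shifts (camera viewpoint, robot initial state, language variation, lighting, background, sensor noise, object layout):
| Method | Camera | Robot | Language | Light | Background | Noise | Layout | Avg. |
|---|---|---|---|---|---|---|---|---|
| π0.5* | 70.3 | 41.7 | 81.1 | 97.3 | 94.6 | 71.8 | 84.9 | 75.7 |
| OpenVLA-OFT+ | 92.8 | 30.3 | 85.8 | 94.9 | 93.9 | 89.3 | 77.6 | 79.6 |
| Ours | 91.2 | 62.5 | 80.3 | 95.1 | 91.5 | 88.3 | 84.9 | 84.1 |
Key findings:
- Camera viewpoint shift: +20.9 points (vs. π0.5)
- Robot initial-state perturbation: +20.8 points
- Sensor noise: +16.5 points
- These gains suggest that action-space guidance is inherently stable under external perturbations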
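The gains listed above are plain column differences, and they can be re-derived from the table. The snippet below is a small check written for this post: the numbers are copied from the LIBERO-Plus table, while the dictionary layout and the `gains` helper are illustrative choices.

```python
# Success rates (%) copied from the LIBERO-Plus table above.
LIBERO_PLUS = {
    "pi0.5*": {"Camera": 70.3, "Robot": 41.7, "Language": 81.1, "Light": 97.3,
               "Background": 94.6, "Noise": 71.8, "Layout": 84.9, "Avg": 75.7},
    "OpenVLA-OFT+": {"Camera": 92.8, "Robot": 30.3, "Language": 85.8,
                     "Light": 94.9, "Background": 93.9, "Noise": 89.3,
                     "Layout": 77.6, "Avg": 79.6},
    "Ours": {"Camera": 91.2, "Robot": 62.5, "Language": 80.3, "Light": 95.1,
             "Background": 91.5, "Noise": 88.3, "Layout": 84.9, "Avg": 84.1},
}


def gains(method, baseline, table=LIBERO_PLUS):
    """Per-column difference in percentage points, rounded to one decimal."""
    return {col: round(table[method][col] - table[baseline][col], 1)
            for col in table[method]}


vs_pi05 = gains("Ours", "pi0.5*")
vs_oft = gains("Ours", "OpenVLA-OFT+")
for col in ("Camera", "Robot", "Noise", "Avg"):
    print(f"{col}: {vs_pi05[col]:+.1f} vs pi0.5*, {vs_oft[col]:+.1f} vs OpenVLA-OFT+")
```

The same computation against OpenVLA-OFT+ gives a more mixed picture: a large gain on Robot (+32.2) and on the average (+4.5), but slightly lower scores on Camera (-1.6) and Noise (-1.0). The headline deltas are therefore specific to the π0.5 baseline.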
VLABench (Table 3)
| Method | In-dist. (IS/PS) | Category (IS/PS) | Commonsense (IS/PS) | Instruction (IS/PS) | Texture (IS/PS) | Avg. (IS/PS) |
|---|---|---|---|---|---|---|
| π0 | 67.8/62.7 | 44.0/33.6 | 54.9/43.0 | 58.0/38.7 | 50.6/42.5 | 55.0/44.1 |
| π0.5 | 75.0/60.8 | 49.6/35.3 | 57.5/41.6 | 57.1/30.3 | 62.0/47.4 | 60.2/43.1 |
| Ours | 79.8/66.1 | 54.1/38.9 | 52.3/37.8 | 56.8/39.6 | 74.6/54.6 | 63.5/47.4 |
The unseen-texture test shows a large gain (+12.6 points IS over π0.5), which indicates that action representations are robust to appearance changes.
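4.2 Ablation Study (Tables 4, 9, 10)
| Configuration | LIBERO Avg. | LIBERO-Plus Avg. | Analysis |
|---|---|---|---|
| Baseline (π0.5) | 96.9 | 75.7 | Starting point |
| + EAR only | 98.3 | 83.7 | Explicit guidance provides a strong inductive bias |
| + IAR only | 98.1 | 80.4 | Implicit semantics supplement the distribution of feasible behaviors |
| + EAR + IAR | 98.5 | 84.1 | Complementary effect; best configuration |
Parameter-efficiency analysis (Table 10):
- The EAR module performs best at 300M parameters; a larger module (500M) overfits and generates biased reference trajectories
- Under a fair comparison with matched total parameters and denoising steps, ACoT still significantly outperforms simply scaling up the model
Inference overhead (Table 11): adding EAR + IAR increases latency by only 21 ms (91 ms → 112 ms) and the parameter count by 13.7%, in exchange for a clear performance gain.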
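4.3 Real-World Validation (Figures 3 and 4)
Tested on the AgiBot G1 (22 DoF) and AgileX (14 DoF) platforms:
| Task | Ours | π0.5 | π0 |
|---|---|---|---|
| Wipe Stain | 83.3% | 79.1% | 75.0% |
| Pour Water | 33.3% | 22.5% | 25.0% |
| Open-set Pick (G1) | 77.5% | 80.0% | 12.5% |
| Open-set Pick (AgileX) | 72.5% | 62.5% | 22.5% |
| Average | 66.7% | 61.0% | 33.8% |
These results support cross-embodiment generalization: on Open-set Pick the gain over π0.5 is clearest on AgileX (+10 points), whereas on G1 π0.5 is slightly ahead (80.0% vs. 77.5%).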
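As a sanity check on the Average row and the per-task gaps, the real-world numbers can be recomputed directly. The values below are copied from the table above; the dictionary layout and helper names (`method_average`, `delta_vs`) are illustrative only.

```python
from statistics import mean

# Success rates (%) copied from the real-world table above.
REAL_WORLD = {
    "Wipe Stain": {"Ours": 83.3, "pi0.5": 79.1, "pi0": 75.0},
    "Pour Water": {"Ours": 33.3, "pi0.5": 22.5, "pi0": 25.0},
    "Open-set Pick (G1)": {"Ours": 77.5, "pi0.5": 80.0, "pi0": 12.5},
    "Open-set Pick (AgileX)": {"Ours": 72.5, "pi0.5": 62.5, "pi0": 22.5},
}


def method_average(method):
    """Unweighted mean success rate over the four tasks."""
    return mean(row[method] for row in REAL_WORLD.values())


def delta_vs(baseline, method="Ours"):
    """Per-task gap in percentage points between method and baseline."""
    return {task: row[method] - row[baseline]
            for task, row in REAL_WORLD.items()}


for name in ("Ours", "pi0.5", "pi0"):
    print(f"{name}: {method_average(name):.2f}")
for task, gap in delta_vs("pi0.5").items():
    print(f"{task}: {gap:+.1f}")
```

The unweighted means agree with the reported Average row to within rounding. Against π0.5, the per-task gaps range from about -2.5 points on Open-set Pick (G1) to about +10.8 points on Pour Water, so the improvement is not uniform across tasks.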
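5. Theoretical Contributions and Limitations
5.1 Core Contributions
- Conceptual: the first work to formalize the reasoning process as an explicit chain of action-space intents rather than abstract language or visual sub-goals
- Technical: an explicit-implicit dual-pathway architecture that jointly provides kinematic guidance and behavioral priors
- Empirical: SOTA performance on three simulation benchmarks and in real-world tests, with particularly strong robustness under distribution shift
5.2 Limitations and Future Directions
| Limitation | Future direction |
|---|---|
| Extra compute overhead (although relatively small) | Lightweight designs for resource-constrained platforms |
| Actions are represented as chunks of low-level control commands and lack explicit geometric structure | Action representations with 3D spatial grounding that support object-level coordination and contact-geometry reasoning |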