论文阅读“Latent Reasoning VLA: Latent Thinking and Prediction for Vision-Language-Action Models“

论文阅读"Latent Reasoning VLA: Latent Thinking and Prediction for Vision-Language-Action Models"

YMWM_

395人浏览 · 2026-02-20 01:12:43

YMWM_ · 2026-02-20 01:12:43 发布

Vision-Language-Action (VLA) models benefit from chain-of-thought (CoT) reasoning, but existing approaches incur high inference overhead and rely on discrete reasoning representations that mismatch continuous perception and control.
We propose Latent Reasoning VLA (LaRA-VLA), a unified VLA framework that internalizes multimodal CoT reasoning into continuous latent representations for embodied action. LaRA-VLA performs unified reasoning and prediction in latent space, eliminating explicit CoT generation at inference time and enabling efficient, action-oriented control.
To realize latent embodied reasoning, we introduce a curriculum-based training paradigm that progressively transitions from explicit textual and visual CoT supervision to latent reasoning, and finally adapts latent reasoning dynamics to condition action generation.
We construct two structured CoT datasets and evaluate LaRA-VLA on both simulation benchmarks and long-horizon real-robot manipulation tasks. Experimental results show that LaRA-VLA consistently outperforms state-of-the-art VLA methods while reducing inference latency by up to 90% compared to explicit CoT-based approaches, demonstrating latent reasoning as an effective and efficient paradigm for real-time embodied control. Project Page: LaRA-VLA Website.

结论

We presented LaRA-VLA, a latent reasoning framework for Vision–Language–Action models that internalizes chain-of-thought reasoning into continuous latent representations across both textual and visual modalities. Rather than generating long explicit CoT sequences at inference time, LaRA-VLA replaces them with compact textual CoT latents and employs a curriculum-based training strategy to progressively transfer explicit reasoning into latent space.
Visual latents are aligned with continuous perceptual features encoded by a shared visual encoder and stabilized using an exponential moving average, providing implicit supervisory signals that guide the learning of textual CoT latents.
Experiments on simulated benchmarks and long-horizon real-robot manipulation tasks demonstrate that LaRA-VLA achieves strong performance while significantly improving inference efficiency, supporting the view that structured reasoning for embodied agents can be effectively realized in latent space without explicit chain-of-thought generation.

我来详细分析这篇关于LaRA-VLA (Latent Reasoning Vision-Language-Action) 的论文。这是一篇提出新型VLA架构的研究，核心创新在于将显式的思维链(Chain-of-Thought)推理内化为连续的潜在空间表示。

一、研究背景与核心问题

1.1 现有VLA模型的局限性

当前VLA模型的推理方式主要分为三类（如论文图1和表1所示）：

类型	代表方法	问题
文本CoT	ECoT, π0.5, ThinkAct	推理时需生成大量离散文本token，导致推理延迟高达1-5Hz，难以满足实时控制需求
视觉CoT	CoT-VLA, DreamVLA, F1	依赖VQ-VAE等离散视觉token，与连续感知/动作空间存在表示不匹配
混合CoT	UP-VLA	仍受限于离散表示，推理开销大

核心矛盾：显式CoT虽然提升了解释性和泛化性，但离散token表示与机器人连续的感知-动作空间存在本质错配，且推理时延过高（可达数秒）。

1.2 LaRA-VLA的解决思路

论文提出将CoT推理完全内化到连续潜在空间：

文本推理 → 压缩为紧凑的连续潜在向量（text latents）
视觉推理 → 对齐到编码器的连续视觉特征（visual latents）
动作生成 → 由潜在表示直接条件化的扩散模型输出

这样消除了推理时的显式token生成，实现高效、连续、面向动作的推理。

二、方法详解

2.1 整体架构（图2）

LaRA-VLA采用三阶段渐进式训练策略：

Stage I: 显式CoT监督学习
    ↓ 输入：图像 + 指令 + 完整文本CoT + <img next> token
    ↓ 输出：预测下一帧视觉潜在 + 离散动作token
    ↓ 目标：建立结构化推理与视觉预测能力

Stage II: 课程式潜在化  
    ↓ 逐步用<thinking> token替换显式CoT token
    ↓ 从"1个思考token"逐步增加到"4个思考token"
    ↓ 视觉预测提供隐式监督，防止潜在空间崩溃

Stage III: 动作专家适应
    ↓ 移除所有显式CoT，仅保留潜在表示
    ↓ 引入基于Flow Matching的Action Expert（Diffusion Transformer）
    ↓ 潜在表示直接条件化连续动作生成

2.2 关键技术创新

(1) 统一潜在表示空间

不同于以往方法将文本和视觉分开处理，LaRA-VLA将两者统一编码：

文本CoT潜在：通过可学习的潜在token替换离散文本，保持语义结构
视觉目标潜在：使用与输入图像共享的视觉编码器生成，确保表示一致性
EMA目标网络：稳定视觉潜在学习，防止表示崩溃

$\bar{\theta}_v^t = \tau_v \bar{\theta}_v^{t-1} + (1-\tau_v)\theta_v^t$

(2) 课程式训练策略（图2右侧）

这是防止潜在空间崩溃的关键。通过渐进式替换：

早期：保留大部分显式CoT，模型学习推理结构
中期：逐步增加潜在token比例，视觉预测提供监督信号
后期：完全潜在化，模型内化推理动态

(3) 专用注意力机制（图3）

针对不同阶段设计注意力掩码：

Stage I & II：未来图像token因果关注文本和当前图像；动作token自回归生成
Stage III：移除动作token的注意力计算，仅基于文本和视觉潜在生成动作

(4) 动作生成：Flow Matching

Stage III采用连续扩散模型替代离散token：
$\mathcal{L}_{act-con} = \mathbb{E}_{a_t,\epsilon,\tau}\left[\|v_{\theta_a}(a_\tau, \tau | h_t) - (a_t - \epsilon)\|_2^2\right]$

其中 $h_t$ 是多模态潜在上下文（视觉+语言+文本潜在+视觉目标潜在）。

三、数据集构建

论文构建了自动化CoT标注流程（图9），包含三个核心组件：

3.1 语义锚点（Semantic Anchors）

使用Qwen3-VL从首帧和指令中提取操作目标物体，例如"apple", “banana”。

3.2 时间锚点（Temporal Anchors）

基于夹爪状态变化将轨迹分割为原子操作阶段：

Approach（接近）
Grasp（抓取）
Transport（搬运）
Release（释放）
Retract（收回）

3.3 多模态标注生成

类型	方法	输出
子任务描述	Qwen3-VL基于关键帧生成	“Move toward the banana”
目标定位	GroundingDINO + SAM3	时序一致的2D边界框
运动推理	末端执行器轨迹分析	全局运动方向 + 局部运动方向

基于此构建了：

LIBERO-LaRA：模拟环境，4个任务套件
Bridge-LaRA：真实到模拟迁移评估
真实机器人数据集：4类长程操作任务

四、实验结果分析

4.1 模拟环境性能（表2、表3）

LIBERO基准（97.9%平均成功率）：

在Object套件达到99.8%，显著优于DeepThinkVLA (97.0%)和π0.5 (96.8%)
在Long套件（长程任务）达到96.6%，展现强时序推理能力

SimplerEnv-WidowX（真实到模拟迁移，68.8%平均成功率）：

显著优于视觉CoT方法F1 (59.4%)和UD-VLA (62.5%)
在Put Spoon任务达到95.8%，展现强泛化性

4.2 真实世界实验（图4、图5）

在4类长程任务上对比ACT和GR00T N1.5：

任务	ACT	GR00T N1.5	LaRA-VLA
Put all objects into basket	0%	50.0%	41.6%
Sort all fruits into basket	0%	33.3%	50.0%
Find block and place in basket	0%	16.7%	33.3%
Stack two bowls	0%	91.7%	100%
平均	0%	47.9%	56.2%

关键发现（表6详细分析）：

在强时序依赖任务（Find block → Place）上，LaRA-VLA优势最明显（16.7%→33.3%），说明潜在推理能更好维持跨子任务一致性
在弱耦合任务（Stack two bowls）上两者接近，但LaRA-VLA仍略胜

4.3 消融实验（表4）

验证各组件贡献（SimplerEnv上）：

配置	成功率
无CoT基线	55.21%
+ 显式文本CoT	58.33% (+3.1%)
+ 潜在文本CoT	64.58% (+9.4%)
+ 潜在视觉CoT	68.75% (+13.5%)

结论：潜在化文本CoT收益远大于显式文本CoT；加入视觉潜在提供多模态对齐监督，进一步提升性能。

4.4 潜在空间分析（图6）

t-SNE可视化显示：

无崩溃现象：不同推理组件（子任务、边界框、运动）形成良好分离的聚类
功能专门化：语言指令token（灰色）与推理潜在占据不同子空间，说明潜在CoT未简单复用语言嵌入

4.5 推理效率（图7）

核心优势：LaRA-VLA推理延迟仅135ms，相比：

ThinkAct-7B: 7513ms（快55倍）
ECoT-7B: 4434ms（快33倍）
Fast-ThinkAct-3B: 805ms（快6倍）

这实现了高达90%的推理时间减少，使实时控制（>7Hz）成为可能。

五、局限性与未来方向

论文坦诚指出：

潜在token数量受限：为防止崩溃，目前每步仅使用1个潜在token，可能限制表达能力
训练效率：课程式策略导致训练后期CoT相关token增加，训练成本上升
可解释性权衡：潜在推理虽高效，但失去了显式CoT的自然语言可解释性

六、核心贡献总结

贡献	说明
范式创新	首个将多模态CoT完全内化到连续潜在空间的VLA框架
架构设计	三阶段课程训练 + EMA视觉对齐 + Flow Matching动作生成
数据基础设施	自动化CoT标注流程 + LIBERO-LaRA/Bridge-LaRA数据集
性能验证	SOTA性能 + 90%推理加速 + 真实机器人验证