Abstract

 We introduce Green-VLA, a staged Vision–Language–Action framework for real-world deployment on the humanoid Green robot, while maintaining generalization across diverse embodiments.
  Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) RL-based policy alignment.
  Progression builds semantic and physical priors, learns shared affordances, and aligns policies for long-horizon execution beyond behavior cloning.
 At its core is a unified data and control stack for robot fleets.
A scalable data-processing pipeline, built around DataQA and temporal alignment, filters and synchronizes 3,000 hours of demonstrations; a unified, embodiment-aware action interface enables a single policy to control humanoids, mobile manipulators, and fixed-base arms; and the VLA controller is enhanced with episode progress prediction, out-of-distribution detection, and a joint-prediction-based guidance module that generalizes to unseen objects.
  Optimized for the Green humanoid, Green-VLA generalizes in a zero-shot manner to new embodiments and achieves state-of-the-art performance across bimanual systems and benchmarks, with RL alignment providing gains in success rate, robustness, and long-horizon efficiency.
  Code: https://github.com/greenvla/GreenVLA
  Project Page: https://greenvla.github.io/

Conclusion

 We presented Green-VLA, a staged vision–language–action framework that moves beyond raw scale toward quality alignment, action unification, and reinforcement learning fine-tuning.
  At the data level, our DataQA pipeline filters and smooths heterogeneous demonstrations and aligns temporal dynamics via optical flow–based resampling.
  At the policy level, a unified action space with embodiment prompts resolves cross-robot inconsistencies and enables positive transfer.
  At the training level, a target-balanced sampling schedule stabilizes multi-embodiment flow matching, while conservative RL fine-tuning boosts performance on difficult, long-horizon tasks requiring advanced dexterity.
  Finally, at inference, efficiency optimizations and guidance enable low-latency, instruction-following control—even for novel, language-specified items.
  Empirically, Green-VLA demonstrates strong pretrain-stage performance on Simpler and CALVIN, outperforming prior foundation policies at comparable stages and approaching fine-tuned baselines.
On real robots, we observe successful deployment on bimanual setups and reliable humanoid behavior under OOD layouts. With the R2 RL alignment phase, Green-VLA achieves state-of-the-art results on the Simpler BRIDGE WidowX setup and competitive, near-state-of-the-art performance on CALVIN ABC→D.
  While promising, Green-VLA’s performance still depends on retargeting fidelity, residual dataset bias, and adequate coverage of dexterous skills. Future work will extend multilingual instruction following, strengthen the coupling between fast reasoning and real-time control, and integrate online data collection with safety-aware RL to further reduce failure modes.
  Overall, Green-VLA offers a practical recipe—from web-scale grounding to unified robotics pretraining, embodiment adaptation, and RL alignment—for building generalist, responsive, and reliable robot policies.

Paper Overview

Green-VLA is a staged Vision-Language-Action (VLA) framework proposed by the Sber Robotics Center, designed for real-world deployment. It primarily targets the Green humanoid robot while maintaining generalization across diverse robot embodiments.


Core Contributions and Innovations

1. Five-Stage Training Pipeline (L0→L1→R0→R1→R2)

| Stage | Name | Goal | Data |
|---|---|---|---|
| L0 | Base VLM | Foundational vision-language model | Large-scale image-text data |
| L1 | Web Pretrain | Physical-world understanding | 24M internet multimodal samples |
| R0 | General Robotics Pretrain | General robot pretraining | 3,000+ hours of cross-embodiment robot data |
| R1 | Embodiment SFT | Embodiment-specific fine-tuning | High-quality data from the target robot |
| R2 | RL Alignment | RL-based policy alignment | Environment interaction feedback |

Key insight: simply scaling up data does not solve the core problems of real-world deployment (data heterogeneity, uneven quality, behavior-cloning saturation); semantic priors, physical priors, and reward alignment have to be built up stage by stage.


2. Unified Action Space

This is one of the paper's core technical innovations:

Problem: different robot embodiments (single-arm, bimanual, humanoid) differ in action-space dimensionality, coordinate frames, and control types; naive padding leads to semantic conflicts.

Green-VLA's solution

  • A fixed 64-dimensional unified action space $\mathcal{A}_u \subset \mathbb{R}^{64}$ is defined, in which each index range carries physical semantics that are consistent across robots
  • An embodiment-aware mask $m_e \in \{0,1\}^{64}$ marks the dimensions that are valid for a given robot
  • A control-type prompt $c_e$ explicitly specifies the number of arms, hand type, joint vs. Cartesian control, mobile vs. fixed base, etc.

Loss function:
$\mathcal{L}_{\text{uni}}(\theta) = \mathbb{E}\left[\left\|m_e \odot \left(\pi_\theta(x_t^e, c_e) - \Phi_e(a_t^e)\right)\right\|_2^2\right]$

Key point: no loss is applied to the unused dimensions $(1 - m_e)$, which avoids spurious gradients from padding.
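
To make the masking concrete, here is a minimal PyTorch sketch of the embodiment-masked loss above; the function name, batch layout, and the index ranges used in the example are illustrative assumptions rather than the authors' implementation.

```python
import torch

UNIFIED_DIM = 64  # size of the unified action space A_u


def unified_action_loss(pred: torch.Tensor,
                        target: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """Masked regression loss over the 64-D unified action space.

    pred:   (B, 64) policy output pi_theta(x_t^e, c_e)
    target: (B, 64) demonstration action mapped into A_u by Phi_e
    mask:   (B, 64) binary embodiment mask m_e (1 = dimension used by this robot)
    """
    sq_err = (pred - target) ** 2
    masked = mask * sq_err                      # unused dims get zero loss and zero gradient
    # average over active dimensions only, so sparse embodiments are not under-weighted
    return masked.sum() / mask.sum().clamp(min=1.0)


# Example: one embodiment uses dims 0-13, another only 0-6 (index ranges are illustrative)
pred = torch.randn(2, UNIFIED_DIM, requires_grad=True)
target = torch.randn(2, UNIFIED_DIM)
mask = torch.zeros(2, UNIFIED_DIM)
mask[0, :14] = 1.0
mask[1, :7] = 1.0
loss = unified_action_loss(pred, target, mask)
loss.backward()                                 # gradients stay zero on masked-out dims
```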


3. Data Quality Assurance (DataQA)

To address issues inherent to robot data such as trembling, blurred frames, and inconsistent execution:

| Metric | Computation | Purpose |
|---|---|---|
| Tremble $S_{\text{tremble}}$ | Normalized gap between the smoothed state velocity $\dot{s}_{\text{smooth}}$ and the raw velocity $\dot{s}$ | Flag jerky, trembling executions |
| Sharpness S | Std of the Laplacian response + MaxPool | Assess image quality |
| Diversity D | Temporal std of DINOv3 features | Measure visual diversity |
| State variance $\sigma^2$ | Frobenius norm of the state covariance matrix | Assess state-space coverage |
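
For illustration, here is a hedged NumPy/OpenCV sketch of how the quality metrics above could be computed per episode. Only the ingredients (Laplacian std with MaxPool, DINOv3 feature std, Frobenius norm of the state covariance) come from the text; the pooling size, aggregation choices, and function names are assumptions.

```python
import cv2
import numpy as np


def sharpness_score(gray_frame: np.ndarray, pool: int = 8) -> float:
    """Sharpness S: std of the Laplacian response after local max-pooling
    (blurred frames give a weak, flat response and score low)."""
    lap = cv2.Laplacian(gray_frame.astype(np.float32), cv2.CV_32F)
    h, w = lap.shape
    lap = lap[: h - h % pool, : w - w % pool]
    pooled = lap.reshape(h // pool, pool, w // pool, pool).max(axis=(1, 3))
    return float(pooled.std())


def diversity_score(frame_features: np.ndarray) -> float:
    """Diversity D: temporal std of per-frame features (e.g. DINOv3), shape (T, d)."""
    return float(frame_features.std(axis=0).mean())


def state_variance(states: np.ndarray) -> float:
    """State variance: Frobenius norm of the state covariance matrix, states shape (T, d)."""
    cov = np.cov(states, rowvar=False)
    return float(np.linalg.norm(cov, ord="fro"))
```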

Temporal alignment: execution speed is estimated from optical flow, and monotone cubic spline interpolation resamples different datasets onto a common time scale.
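
A minimal sketch of this step, assuming Farneback optical flow as the speed estimate and SciPy's PCHIP interpolator as the monotone cubic spline; the epsilon and normalization details are illustrative choices.

```python
import cv2
import numpy as np
from scipy.interpolate import PchipInterpolator  # monotone (shape-preserving) cubic spline


def flow_speed(gray_frames: list) -> np.ndarray:
    """Mean optical-flow magnitude between consecutive grayscale frames (T-1 values)."""
    speeds = []
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        speeds.append(float(np.linalg.norm(flow, axis=-1).mean()))
    return np.asarray(speeds)


def resample_by_visual_progress(states: np.ndarray, speeds: np.ndarray,
                                target_len: int) -> np.ndarray:
    """Resample a (T, d) state trajectory so visual motion is spread evenly over time."""
    # cumulative flow acts as a strictly increasing reparameterization of time
    progress = np.concatenate([[0.0], np.cumsum(speeds + 1e-6)])
    progress /= progress[-1]
    spline = PchipInterpolator(progress, states, axis=0)
    return spline(np.linspace(0.0, 1.0, target_len))
```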


4. Inference-Time Enhancements

(a) Joint-prediction guidance module (JPM)
  • Problem: in dense visual scenes (e.g., e-commerce shelves), the VLA may fail to recognize novel items
  • Approach: the VLM predicts a 2D affordance point → back-projection to 3D → IK solves for the target joint configuration $q^\star$ → flow-matching guidance
(b) Out-of-distribution detection (OOD Detector)
  • The training-state distribution $p_{\text{train}}(s)$ is fitted with a Gaussian mixture model
  • If the likelihood of a predicted state falls below a threshold $\tau_{\text{ood}}$, the state is corrected along the density gradient (see the sketch after this list):
    $s \leftarrow s + \alpha \nabla p_{\text{train}}(s)$
(c) Episode progress prediction
  • The normalized progress $\hat{\rho}_t \in [0,1]$ is predicted jointly, supporting decisions by a high-level task planner
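
A hedged sketch of the GMM-based check in (b), assuming scikit-learn's GaussianMixture for $p_{\text{train}}(s)$; the threshold, step size, and single-step correction are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture


def fit_state_gmm(train_states: np.ndarray, n_components: int = 8) -> GaussianMixture:
    """Fit p_train(s) on training proprioceptive states, shape (N, d)."""
    return GaussianMixture(n_components=n_components, covariance_type="full").fit(train_states)


def density_and_grad(gmm: GaussianMixture, s: np.ndarray):
    """Return p_train(s) and its gradient for a full-covariance GMM."""
    p, grad = 0.0, np.zeros_like(s, dtype=float)
    for w, mu, cov in zip(gmm.weights_, gmm.means_, gmm.covariances_):
        pk = w * multivariate_normal.pdf(s, mean=mu, cov=cov)
        p += pk
        grad += pk * np.linalg.solve(cov, mu - s)   # gradient of each Gaussian term w.r.t. s
    return p, grad


def correct_if_ood(gmm: GaussianMixture, s: np.ndarray,
                   tau_ood: float = 1e-6, alpha: float = 0.05) -> np.ndarray:
    """If the state's likelihood is below tau_ood, push it up the density gradient."""
    p, grad = density_and_grad(gmm, s)
    return s + alpha * grad if p < tau_ood else s
```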

5. R2 Reinforcement-Learning Alignment

Challenge: directly fine-tuning diffusion/flow-matching models with RL is difficult (it requires estimating log-probabilities and back-propagating through the iterative denoising process).

Green-VLA's two-stage R2 approach

Stage 1: trajectory optimization + native fine-tuning

  • Train an IQL critic to estimate the Q-function
  • Optimize trajectories with the normalized Q-function gradient (see the sketch below): $a \leftarrow a + \eta \frac{\nabla_a Q(s,a)}{|\nabla_a Q(s,a)|}$
  • Validate the optimized trajectories in the environment and add them to the training set for standard fine-tuning
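
A compact PyTorch sketch of the normalized Q-gradient update from the second bullet; the critic interface (any module mapping a state-action pair to a scalar Q-value) and the number of refinement steps are assumptions.

```python
import torch
import torch.nn as nn


def refine_action(critic: nn.Module, state: torch.Tensor, action: torch.Tensor,
                  eta: float = 0.05, n_steps: int = 5) -> torch.Tensor:
    """Nudge an action (or action chunk) along the normalized critic gradient:
    a <- a + eta * grad_a Q(s, a) / |grad_a Q(s, a)|."""
    action = action.detach().clone()
    for _ in range(n_steps):
        action.requires_grad_(True)
        q = critic(state, action).sum()
        (grad,) = torch.autograd.grad(q, action)
        action = (action + eta * grad / grad.norm().clamp(min=1e-8)).detach()
    return action
```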

Stage 2: source-distribution optimization

  • Train a small "noise actor" $\pi_\theta^{\text{noise}}(\epsilon \mid s)$ that generates the initial noise fed into the flow-matching model (sketched below)
  • Keep the base model weights frozen and optimize behavior by changing the noise distribution
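
A hedged sketch of the noise-actor idea: a small state-conditioned network produces the initial noise for the frozen flow-matching policy, so RL gradients shape the source distribution rather than the base weights. The Gaussian parameterization, layer sizes, and the `frozen_flow_policy.denoise` call in the usage comment are hypothetical.

```python
import torch
import torch.nn as nn


class NoiseActor(nn.Module):
    """Small state-conditioned policy over the initial flow-matching noise epsilon."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * action_dim))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # sample epsilon ~ N(mu(s), sigma(s)) instead of a fixed standard normal
        mu, log_std = self.net(state).chunk(2, dim=-1)
        return mu + log_std.clamp(-5.0, 2.0).exp() * torch.randn_like(mu)


# Usage (hypothetical interface): only the noise actor receives RL gradients,
# while the base flow-matching policy stays frozen.
#   epsilon = noise_actor(state)
#   actions = frozen_flow_policy.denoise(epsilon, observation)
```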

Experimental Results

Main Benchmark Results

| Task / Environment | Baselines | Green-VLA result |
|---|---|---|
| ALOHA tabletop cleanup | π0, GR00T N1, AgiBot GO-1, WALL-OSS | Surpasses all baselines already at R0; first-item success rate 69.5% vs. 38.4% for the best baseline |
| Simpler (WidowX) | π0, OpenVLA, RT-1X, Flower, DB-MemVLA | R2 reaches SOTA with 87.5% average success rate |
| Simpler (Google Robot) | Same as above | R2 exceeds 98% success on several tasks |
| CALVIN ABC→D | π0, Flower | R2 reaches an average chain length (ACL) of 4.63, above Flower's 4.53 |
| E-commerce shelf picking | – | With JPM guidance, OOD success rate rises from 10.2% to 72.8% |

Real-World Humanoid Evaluation

Evaluated on the Green humanoid robot (32-DoF upper-body control):

  • Instruction following: selects the left or right hand, the specified item, and the target basket according to the language instruction
  • Bimanual coordination: completes tasks that require both hands working together
  • Cross-embodiment generalization: the same policy transfers zero-shot to single-arm, bimanual, and humanoid platforms

Architecture Details

Input: text instruction + multi-view images (2× wrist + 1× head) + proprioceptive state
      ↓
[GigaVision VLM] → decomposes the task into atomic subtasks (pick/place/move/give)
      ↓
[PaliGemma 3B vision-language encoder] + control-type prompt $c_e$
      ↓
[Flow-matching Action Expert] → unified action-chunk prediction $\hat{u}_t \in \mathcal{A}_u$
      ↓
[Embodiment-specific inverse mapping $\Phi_e^{-1}$] → robot-native actions
      ↓
[JPM guidance + OOD detection + progress prediction] → safe execution

Inference Optimization

  • SDPA (scaled dot-product attention) kernels (see the example below)
  • Fewer denoising steps
  • ~4B total parameters (3B PaliGemma + action expert)
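
As an illustration of the first point, PyTorch's built-in scaled dot-product attention replaces a manual softmax attention computation and dispatches to fused kernels when available; the shapes below are arbitrary.

```python
import torch
import torch.nn.functional as F

# (batch, heads, tokens, head_dim); shapes chosen only for the example
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# single fused call replacing softmax(q @ k^T / sqrt(d)) @ v
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 128, 64])
```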

Limitations and Future Work

| Limitation | Future direction |
|---|---|
| English-only instruction following | Multilingual support (English, Russian, etc.) |
| Reasoning decoupled from execution | Lightweight reasoning module that preserves low latency |
| No explicit memory mechanism | Embodied memory and trajectory replay |
| Dependence on retargeting fidelity | Improved cross-embodiment action retargeting |
| Residual dataset bias | Online data collection and safety-aware RL |

Overall Assessment

The core value of Green-VLA

  1. Deployment-oriented: goes beyond benchmark scores to emphasize real-world feasibility (latency, safety, cross-embodiment compatibility)

  2. Systems thinking: end-to-end optimization spanning the data pipeline (DataQA), action representation (unified space), training strategy (five stages), and inference-time enhancements (JPM/OOD)

  3. Data efficiency: achieves results surpassing methods trained on 10,000+ hours using only 3,000 hours of data, indicating that data quality matters more than raw scale

  4. Open-source commitment: code and project page are publicly available (https://github.com/greenvla/GreenVLA)

This work offers a practical and scalable technical path toward generalist humanoid robots, with particularly valuable engineering lessons on data efficiency, cross-embodiment transfer, and real-world reliability.
