Paper Reading: "OpenVLA: An Open-Source Vision-Language-Action Model"
Abstract
Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how we teach robots new skills: rather than training new behaviors from scratch, we can fine-tune such vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control.
Yet, widespread adoption of VLAs for robotics has been challenging as
1) existing VLAs are largely closed and inaccessible to the public, and
2) prior work fails to explore methods for efficiently fine-tuning VLAs for new tasks, a key component for adoption.
Addressing these challenges, we introduce OpenVLA, a 7B-parameter open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. OpenVLA builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP.
As a product of the added data diversity and new model components, OpenVLA demonstrates strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters.
We further show that we can effectively fine-tune OpenVLA for new settings, with especially strong generalization results in multi-task environments involving multiple objects and strong language grounding abilities, and outperform expressive from-scratch imitation learning methods such as Diffusion Policy by 20.4%. We also explore compute efficiency; as a separate contribution, we show that OpenVLA can be fine-tuned on consumer GPUs via modern low-rank adaptation methods and served efficiently via quantization without a hit to downstream success rate. Finally, we release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
Discussion
In this work, we presented OpenVLA, a state-of-the-art, open-source vision-language-action model that obtains strong performance for cross-embodiment robot control out-of-the-box. We also demonstrated that OpenVLA can be easily adapted to new robot setups via parameter-efficient fine-tuning techniques.
The current OpenVLA model has several limitations.
First, it currently only supports single-image observations. In reality, real-world robot setups are heterogeneous, with a wide range of possible sensory inputs [5]. Expanding OpenVLA to support multiple image and proprioceptive inputs as well as observation history is an important avenue for future work. Exploring the use of VLMs pretrained on interleaved image and text data may facilitate such flexible-input VLA fine-tuning.
Secondly, improving the inference throughput of OpenVLA is critical to enable VLA control for high-frequency control setups such as ALOHA [90], which runs at 50Hz. This will also enable testing VLAs on more dexterous, bi-manual manipulation tasks than what we investigated in this work. Exploring the use of action chunking or alternative inference-time optimization techniques such as speculative decoding [91] offer potential remedies.
Additionally, there is room for further performance improvements. While OpenVLA outperforms prior generalist policies, it does not yet offer very high reliability on the tested tasks, typically achieving <90% success rate.
Finally, due to compute limitations, many VLA design questions remain underexplored: What effect does the size of the base VLM have on VLA performance? Does co-training on robot action prediction data and Internet-scale vision-language data substantially improve VLA performance? What visual features are best-suited for VLA models? We hope that the release of the OpenVLA model and codebase will enable the community to jointly investigate these questions.
Below is a detailed analysis of the OpenVLA paper, a robot-learning study on an open-source vision-language-action (VLA) model.
Paper Overview
OpenVLA is a 7B-parameter open-source vision-language-action model that targets two problems with existing VLAs: they are closed and inaccessible, and they lack efficient fine-tuning recipes. Trained on 970k real-world robot demonstrations, it can control multiple robots out-of-the-box and supports parameter-efficient fine-tuning.
Core Contributions
1. Model Architecture
OpenVLA is built on the Prismatic-7B VLM and consists of three key components:
| Component | Role | Configuration |
|---|---|---|
| Vision encoder | Fuses DINOv2 and SigLIP features | ~600M parameters, 224×224 resolution |
| Projector | Maps visual features into the language embedding space | 2-layer MLP |
| Language-model backbone | Processes multimodal inputs and generates actions | Llama 2 7B |
Key innovation: the fused DINOv2 + SigLIP vision encoder combines DINOv2's low-level spatial features with SigLIP's high-level semantic features, which significantly improves spatial reasoning.
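The fusion itself is simple: patch features from the two encoders are concatenated channel-wise and passed through the 2-layer MLP projector. Below is a minimal PyTorch sketch; the feature dimensions (1024 for DINOv2, 1152 for SigLIP, 4096 for Llama 2) are illustrative assumptions, and random tensors stand in for real encoder outputs.

```python
import torch
import torch.nn as nn

class FusedProjector(nn.Module):
    """Sketch of DINOv2 + SigLIP patch-feature fusion followed by a
    2-layer MLP projector into the language model's embedding space."""

    def __init__(self, dino_dim=1024, siglip_dim=1152, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(              # 2-layer MLP projector
            nn.Linear(dino_dim + siglip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, dino_patches, siglip_patches):
        # dino_patches:   (B, N, dino_dim)   patch tokens from DINOv2
        # siglip_patches: (B, N, siglip_dim) patch tokens from SigLIP
        fused = torch.cat([dino_patches, siglip_patches], dim=-1)
        return self.mlp(fused)                 # (B, N, llm_dim) visual tokens

# Toy usage: random tensors stand in for real patch features.
proj = FusedProjector()
print(proj(torch.randn(1, 256, 1024), torch.randn(1, 256, 1152)).shape)
```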
2. Action Representation
Continuous robot actions are discretized into tokens from the language model's vocabulary:
- Each action dimension is discretized independently into 256 bins
- Bin boundaries come from the 1st-99th percentile of the training data (rather than min-max), so outliers do not stretch the range
- The 256 least-used tokens in the Llama tokenizer are overwritten to serve as action tokens (see the sketch below)
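A minimal NumPy sketch of this discretization scheme, assuming 7-DoF actions; the specific token IDs used for the least-used-token mapping are a hypothetical placeholder, not the IDs OpenVLA actually overwrites.

```python
import numpy as np

def make_action_tokenizer(train_actions, n_bins=256, q_low=0.01, q_high=0.99):
    """Per-dimension discretization: bin edges span the 1st-99th percentile
    of each action dimension in the training data, so outliers are ignored."""
    lo = np.quantile(train_actions, q_low, axis=0)    # (action_dim,)
    hi = np.quantile(train_actions, q_high, axis=0)   # (action_dim,)
    edges = np.linspace(lo, hi, n_bins + 1, axis=-1)  # (action_dim, n_bins+1)

    def tokenize(action):
        # Map each dimension independently to a bin index in [0, n_bins-1].
        ids = np.array([np.clip(np.digitize(a, e) - 1, 0, n_bins - 1)
                        for a, e in zip(action, edges)])
        # Hypothetical mapping: bin index -> 256 "least-used" token IDs
        # (illustrative IDs only, not the real Llama tokenizer slots).
        least_used_token_ids = np.arange(31744, 32000)
        return least_used_token_ids[ids]

    return tokenize

# Toy usage: random 7-DoF actions stand in for real demonstrations.
rng = np.random.default_rng(0)
tok = make_action_tokenizer(rng.normal(size=(10_000, 7)))
print(tok(rng.normal(size=7)))
```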
3. Training Data and Strategy
Data source: the Open X-Embodiment dataset (970k trajectories)
- Covers 23 constituent datasets, including BridgeData V2, Fractal, Kuka, and others
- Carefully curated: only data with single-arm end-effector control and at least one third-person camera view is kept
- Adopts Octo's data-mixture weighting strategy to balance diversity across datasets
Key training findings (p. 6):
- Training runs for 27 epochs over the dataset (far more than the 1-2 epochs typical for LLMs)
- A fixed learning rate of 2e-5
- The vision encoder must be fine-tuned (contrary to the common VLM practice of freezing it)
Experimental Results
5.1 Direct Evaluation on Multiple Robot Platforms
BridgeData V2 (WidowX robot) - Figure 3:
| Method | Parameters | Mean success rate | Relative improvement |
|---|---|---|---|
| RT-1-X | 35M | 18.5% | - |
| Octo | 93M | 20.0% | +8% |
| RT-2-X | 55B | 50.6% | +173% |
| OpenVLA | 7B | 70.6% | +281% |
Key findings:
- OpenVLA surpasses RT-2-X (55B) by 16.5 percentage points in absolute success rate with 7x fewer parameters
- It leads in every category: visual generalization, motion generalization, physical generalization, and language grounding
- Only on semantic generalization does it fall slightly behind RT-2-X (which benefits from larger-scale Internet pretraining data)
Google robot evaluation - Figure 4:
- OpenVLA performs on par with RT-2-X (85.0% vs. 82.9%) and clearly outperforms RT-1-X and Octo
5.2 Data-Efficient Adaptation to New Robot Setups
Fine-tuning experiments on a Franka robot (Figure 5) show:
| Task type | Best method | Key insight |
|---|---|---|
| Narrow single-instruction tasks | Diffusion Policy | A from-scratch diffusion policy is more precise on simple tasks |
| Diverse multi-instruction tasks | OpenVLA | The pretrained VLA has a clear edge on complex tasks requiring language grounding |
| Visual-robustness tasks | OpenVLA | Best performance in OOD scenes with distractor objects |
OpenVLA (scratch) vs. OpenVLA: fine-tuning the base VLM directly (without OpenX pretraining) performs significantly worse, demonstrating the importance of large-scale robot pretraining.
5.3 Parameter-Efficient Fine-Tuning (LoRA)
Table 1 compares different fine-tuning strategies:
| Strategy | Trainable parameters | GPU memory | Success rate | Assessment |
|---|---|---|---|---|
| Full fine-tuning | 7.19B (100%) | 163 GB | 69.7% | Baseline |
| Last layer only | 465M (6.5%) | 51 GB | 30.3% | ❌ Poor |
| Frozen vision encoder | 6.76B (94%) | 156 GB | 47.0% | ❌ Poor |
| Sandwich fine-tuning | 914M (12.7%) | 64 GB | 62.1% | ⚠️ Moderate |
| LoRA (r=32) | 98M (1.4%) | 60 GB | 68.2% | ✅ Best |
Key takeaway: LoRA comes close to full fine-tuning while training only 1.4% of the parameters, making fine-tuning on a single A100 (40 GB) feasible.
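As a hedged illustration of what such LoRA fine-tuning could look like with Hugging Face PEFT: the Hub checkpoint name `openvla/openvla-7b` and all hyperparameters other than `r=32` are assumptions for this sketch, not values taken from the paper.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# Load the released checkpoint (assumed Hub ID: "openvla/openvla-7b").
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# LoRA rank r=32 mirrors the best setting in Table 1; alpha/dropout are
# illustrative choices, not necessarily the paper's exact values.
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",   # attach adapters to every linear layer
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # roughly 1-2% of the 7B weights train
```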
5.4 Quantized Inference
Table 2 shows that 4-bit quantization achieves the following (a loading sketch follows the list):
- GPU memory drops from 16.8 GB to 7.0 GB (a 58% reduction)
- No loss in performance (71.9% vs. 71.3%)
- A control frequency of roughly 6 Hz on an RTX 4090
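A minimal sketch of loading the model with 4-bit weights via bitsandbytes; as above, the Hub ID is an assumption, and the exact quantization settings the authors used may differ.

```python
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# 4-bit quantized loading (bitsandbytes); compute dtype is an illustrative choice.
quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",            # assumed Hub ID
    quantization_config=quant_cfg,
    trust_remote_code=True,
)
model.eval()  # action tokens are decoded autoregressively at inference time
```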
Technical Details and Engineering Practice
Training Infrastructure (p. 6)
- 64× A100 GPUs for 14 days (about 21,500 A100-hours in total)
- Batch size 2048, using FSDP, FlashAttention, and automatic mixed precision (sketched below)
- The codebase scales from single-GPU fine-tuning to multi-node large-scale training
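For intuition, here is a minimal sketch of FSDP wrapping with a mixed-precision policy in PyTorch; `build_vla_model` is a hypothetical constructor standing in for the actual model-assembly code, and none of this is taken from the OpenVLA codebase.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def wrap_for_training(build_vla_model):
    """Shard the model across GPUs with FSDP and train in bfloat16."""
    dist.init_process_group("nccl")                  # one process per GPU
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    model = build_vla_model().cuda()
    return FSDP(
        model,
        mixed_precision=MixedPrecision(              # mixed-precision policy
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
        ),
    )
```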
Key Design Decisions (p. 6)
Determined via ablation experiments on BridgeData V2:
| Decision | Options explored | Final choice | Rationale |
|---|---|---|---|
| VLM backbone | IDEFICS-1, LLaVA, Prismatic | Prismatic | ~35% higher on multi-object language-grounding tasks |
| Image resolution | 224px vs. 384px | 224px | 384px trains 3x slower with no performance gain |
| Vision encoder | Frozen vs. fine-tuned | Fine-tuned | Preserving fine-grained spatial detail is critical for robot control |
| Training duration | 1-2 epochs vs. many epochs | 27 epochs | Action-token accuracy needs to exceed 95% |
Limitations and Future Directions (p. 11)
- Single-image input: no support for multiple camera views, proprioceptive state, or observation history
- Inference speed: 6 Hz is insufficient for high-frequency control setups such as ALOHA (50 Hz)
- Reliability: success rates on most tasks remain below 90% and need further improvement
- Underexplored design space: base-VLM scale, co-training strategies, choice of visual features, etc.
Core Value of the Paper
| Dimension | Contribution |
|---|---|
| Academic | First systematic study of fine-tuning methods for VLAs, validating LoRA and quantization |
| Engineering | Open-sources the full 7B model, training code, and fine-tuning notebooks, lowering the barrier to research |
| Performance | Beats the closed-source SOTA with a much smaller model, underscoring the importance of data quality and architecture design |
| Practical | Deployable on consumer GPUs (~7 GB memory after quantization), democratizing the technology |
OpenVLA marks an important step toward open, accessible, and adaptable robot foundation models, and gives the community a new baseline for VLA research.