nielsr HF Staff committed on
Commit 42a8db1 · verified · 1 Parent(s): 14611fc

Add model card and metadata

Hi! I'm Niels, part of the community science team at Hugging Face. I noticed this repository was missing a model card. This PR adds a README with:
- Metadata for the `robotics` pipeline tag and `transformers` library name.
- Links to the research paper [DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models](https://huggingface.co/papers/2511.15669).
- A link to the official GitHub repository.
- A summary of the model's architecture and performance results on benchmarks like LIBERO.

This will help users find and understand your work more easily on the Hugging Face Hub!

Files changed (1)
  1. README.md +40 -0
README.md ADDED
@@ -0,0 +1,40 @@
+ ---
+ library_name: transformers
+ pipeline_tag: robotics
+ base_model: physical-intelligence/pi0fast_base
+ tags:
+ - vision-language-action
+ - chain-of-thought
+ - embodied-ai
+ ---
+
+ # DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models
+
+ DeepThinkVLA is a Vision-Language-Action (VLA) model designed to enhance the reasoning capabilities of robotic agents through explicit deliberation. It refactors the policy into a 2.9B parameter hybrid decoder that generates a reasoning trace (Chain-of-Thought) before emitting action chunks.
+
+ - **Paper:** [DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models](https://huggingface.co/papers/2511.15669)
+ - **Repository:** [https://github.com/OpenBMB/DeepThinkVLA](https://github.com/OpenBMB/DeepThinkVLA)
+
+ ## Model Description
+ DeepThinkVLA addresses the challenges of integrating Chain-of-Thought (CoT) into VLA models by satisfying two key conditions:
+ 1. **Decoding Alignment:** It uses a hybrid-attention decoder that pairs causal attention for linguistic reasoning tokens with bidirectional attention for parallel action decoding.
+ 2. **Causal Alignment:** The model is trained via a two-stage SFT-then-RL pipeline (using GRPO) to ensure the reasoning chain is causally linked to task success.
+
+ The model is initialized from the `pi0-FAST` checkpoint and demonstrates significant performance gains on robotic manipulation benchmarks.
+
+ ## Performance
+ - **LIBERO:** 97.0% average success rate.
+ - **LIBERO-Plus:** 79.0% zero-shot robustness under distribution shifts.
+ - **RoboTwin 2.0:** 59.3% success rate, exceeding prior VLA baselines by significant margins.
+
+ ## Citation
+ If you find this work helpful, please consider citing:
+
+ ```bibtex
+ @article{yin2025deepthinkvla,
+ title={DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models},
+ author={Yin, Cheng and Lin, Yankai and Xu, Wang and Tam, Sikyuen and Zeng, Xiangrui and Liu, Zhiyuan and Yin, Zhouping},
+ journal={arXiv preprint arXiv:2511.15669},
+ year={2025}
+ }
+ ```
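
For readers of the model card, the "Decoding Alignment" item (causal attention over reasoning tokens, bidirectional attention over action tokens) can be pictured as a single attention mask. The sketch below is only an illustration of that idea, not the authors' implementation: the function name `hybrid_attention_mask`, the token counts, and the assumption that reasoning tokens come first and action tokens may see all of them are hypothetical and not taken from the DeepThinkVLA repository.

```python
def hybrid_attention_mask(num_reasoning, num_action):
    """Build a boolean mask; mask[q][k] is True if query q may attend to key k.

    Assumed layout (hypothetical): reasoning tokens first, then action tokens.
    """
    total = num_reasoning + num_action
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            # Causal rule: a token may attend to itself and earlier tokens.
            causal = k <= q
            # Bidirectional rule: action tokens may attend to every action token,
            # which is what allows the action chunk to be decoded in parallel.
            both_action = q >= num_reasoning and k >= num_reasoning
            mask[q][k] = causal or both_action
    return mask


# Example: 3 reasoning tokens followed by 2 action tokens.
mask = hybrid_attention_mask(num_reasoning=3, num_action=2)
for row in mask:
    print([int(v) for v in row])
```

In this sketch the reasoning rows stay lower-triangular, so the chain of thought is still generated token by token, while the action rows are fully visible to each other. How the real model lays out image, instruction, reasoning, and action tokens may differ.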