Add model card and metadata

Hi! I'm Niels, part of the community science team at Hugging Face. I noticed this repository was missing a model card. This PR adds a README with:
- Metadata for the `robotics` pipeline tag and `transformers` library name.
- A link to the research paper [DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models](https://huggingface.co/papers/2511.15669).
- A link to the official GitHub repository.
- A summary of the model's architecture and performance results on benchmarks like LIBERO.

This will help users find and understand your work more easily on the Hugging Face Hub!

Files changed (1)

README.md +40 -0

	@@ -0,0 +1,40 @@

+---
+library_name: transformers
+pipeline_tag: robotics
+base_model: physical-intelligence/pi0fast_base
+tags:
+- vision-language-action
+- chain-of-thought
+- embodied-ai
+---
+# DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models
+DeepThinkVLA is a Vision-Language-Action (VLA) model designed to enhance the reasoning capabilities of robotic agents through explicit deliberation. It refactors the policy into a 2.9B-parameter hybrid decoder that generates a reasoning trace (Chain-of-Thought) before emitting action chunks.
+- **Paper:** [DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models](https://huggingface.co/papers/2511.15669)
+- **Repository:** [https://github.com/OpenBMB/DeepThinkVLA](https://github.com/OpenBMB/DeepThinkVLA)
+## Model Description
+DeepThinkVLA addresses the challenges of integrating Chain-of-Thought (CoT) into VLA models by satisfying two key conditions:
+1. **Decoding Alignment:** It uses a hybrid-attention decoder that pairs causal attention for linguistic reasoning tokens with bidirectional attention for parallel action decoding.
+2. **Causal Alignment:** The model is trained via a two-stage SFT-then-RL pipeline (using GRPO) to ensure the reasoning chain is causally linked to task success.
+The model is initialized from the `pi0-FAST` checkpoint and improves success rates on robotic manipulation benchmarks (see Performance).
+## Performance
+- **LIBERO:** 97.0% average success rate.
+- **LIBERO-Plus:** 79.0% zero-shot robustness under distribution shifts.
+- **RoboTwin 2.0:** 59.3% success rate, exceeding prior VLA baselines by significant margins.
+## Citation
+If you find this work helpful, please consider citing:
+```bibtex
+@article{yin2025deepthinkvla,
+  title={DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models},
+  author={Yin, Cheng and Lin, Yankai and Xu, Wang and Tam, Sikyuen and Zeng, Xiangrui and Liu, Zhiyuan and Yin, Zhouping},
+  journal={arXiv preprint arXiv:2511.15669},
+  year={2025}
+}
+```
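
The two alignment conditions in the model card's "Model Description" can be made concrete with a small sketch. This is an illustration only, not code from the DeepThinkVLA repository. It assumes a simplified sequence layout (reasoning tokens first, then a single action chunk, with the image and instruction prefix omitted), and it uses the commonly described group-relative advantage from GRPO, normalized here with the population standard deviation. The function names `hybrid_attention_mask` and `group_relative_advantages` are hypothetical.

```python
from statistics import mean, pstdev


def hybrid_attention_mask(num_reasoning: int, num_action: int) -> list[list[bool]]:
    """Build mask[i][j], True when query position i may attend to key position j.

    Assumed layout (hypothetical): `num_reasoning` reasoning tokens followed
    by `num_action` action tokens that form one action chunk.
    """
    total = num_reasoning + num_action
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_reasoning:
                # Reasoning token: causal attention, so it sees only itself
                # and earlier reasoning tokens.
                mask[i][j] = j <= i
            else:
                # Action token: sees the whole reasoning trace and every
                # token in the action chunk (bidirectional), which is what
                # allows the chunk to be decoded in parallel.
                mask[i][j] = True
    return mask


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of rollouts sampled for the same task.

    Assumption: advantage = (reward - group mean) / (group std + eps),
    using the population standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, `hybrid_attention_mask(3, 2)` gives a lower-triangular pattern for the three reasoning rows and two fully visible rows for the action tokens, while `group_relative_advantages([1.0, 0.0, 1.0, 0.0])` assigns positive advantages to the successful rollouts and negative advantages to the failed ones.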