nielsr (HF Staff) committed
Commit f384e06 · verified · 1 Parent(s): 52300e6

Improve model card and add metadata


Hi! I'm Niels from the community science team at Hugging Face.

I've opened this PR to improve the model card for Vision-R1-72B. The updates include:
- Adding YAML metadata for `library_name` (`transformers`), `pipeline_tag` (`image-text-to-text`), and relevant tags.
- Linking the model to the original paper: [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749).
- Adding the official GitHub repository link.
- Including a performance summary and sample usage instructions found in the documentation.

These additions will make the model more discoverable and provide better context for users on the Hub.

Files changed (1)
  1. README.md +77 -3
README.md CHANGED
@@ -1,3 +1,77 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ tags:
+ - multimodal
+ - reasoning
+ - math
+ - qwen2.5-vl
+ - reinforcement-learning
+ ---
+
+ # Vision-R1-72B
+
+ Vision-R1 is a reasoning multimodal large language model (MLLM) designed to enhance reasoning capabilities through Reinforcement Learning (RL) and a novel Progressive Thinking Suppression Training (PTST) strategy. This repository contains the 72B-parameter version.
+
+ - **Paper:** [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749)
+ - **GitHub:** [Osilly/Vision-R1](https://github.com/Osilly/Vision-R1)
+
+ ## Performance
+
+ Vision-R1-72B achieves state-of-the-art results on multimodal math reasoning benchmarks:
+
+ | Model | MathVista | MathVerse | MathVerse (mini Vision_Only) | MM-Math | DynaMath (Overall; Avg) | AVG. |
+ | -------------------------- | ----------- | ------------ | ---------------------------- | ------------ | ----------------------- | ------------ |
+ | Qwen2.5-VL-72B | 73.5 | 51.3 | 47.3 | 45.6 | 61.2 | 55.8 |
+ | **Vision-R1-72B\* (Ours)** | 78.2 (+4.7) | 63.2 (+11.9) | 57.9 (+10.6) | 59.3 (+13.7) | 66.4 (+5.2) | 65.0 (+9.2) |
+
+ \*: Vision-R1-72B used additional data in RL training.
+
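As a quick check on the table, the AVG. column is the unweighted mean of the five benchmark columns (scores copied from the rows above):

```python
# Scores from the table above; AVG. is the unweighted mean of the five benchmarks.
qwen_scores = [73.5, 51.3, 47.3, 45.6, 61.2]       # Qwen2.5-VL-72B
vision_r1_scores = [78.2, 63.2, 57.9, 59.3, 66.4]  # Vision-R1-72B

qwen_avg = round(sum(qwen_scores) / len(qwen_scores), 1)                 # 55.8
vision_r1_avg = round(sum(vision_r1_scores) / len(vision_r1_scores), 1)  # 65.0
delta = round(vision_r1_avg - qwen_avg, 1)                               # 9.2
print(qwen_avg, vision_r1_avg, delta)
```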
+ ## Quickstart
+
+ ### Using 🤗 Transformers for Inference
+
+ You can run inference using the scripts provided in the official repository. First, install the requirements:
+
+ ```bash
+ pip install -r requirements.txt
+ # Optional: install Flash Attention 2
+ pip install -U flash-attn --no-build-isolation
+ ```
+
+ Then, run the inference script:
+
+ ```bash
+ MODEL_PATH="Osilly/Vision-R1-72B"
+ TEMP=0.6
+ TOP_P=0.95
+ MAX_TOKENS=4096
+ IMAGE_PATH="./path/to/your/image.png"
+ PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables.
+ Choices:
+ A: 2π
+ B: 3π
+ C: 6π
+ D: 8π"
+
+ python3 inference.py \
+     --model_path ${MODEL_PATH} \
+     --enable_flash_attn True \
+     --image_path ${IMAGE_PATH} \
+     --prompt "${PROMPT}" \
+     --max_tokens ${MAX_TOKENS} \
+     --temperature ${TEMP} \
+     --top_p ${TOP_P}
+ ```
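The `TEMP=0.6` / `TOP_P=0.95` settings above correspond to standard temperature scaling followed by nucleus (top-p) filtering. A toy, self-contained illustration of what those two knobs do to a token distribution (not the repository's implementation):

```python
import math

def top_p_filter(logits, temperature=0.6, top_p=0.95):
    """Temperature-scale logits, softmax, then keep the smallest set of
    tokens whose cumulative probability reaches top_p; renormalize."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = set(), 0.0
    for i in order:                            # accumulate the "nucleus"
        kept.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    z = sum(filtered)
    return [p / z for p in filtered]

dist = top_p_filter([2.0, 1.0, 0.1, -1.0])
print(dist)  # low-probability tail tokens are zeroed out
```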
+
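For readers who prefer calling 🤗 Transformers directly rather than the repository's `inference.py`, the image and prompt are typically packed into the Qwen2.5-VL chat-message layout — a reasonable assumption here given the model's `qwen2.5-vl` tag. A minimal sketch; the model-loading step in the comment is illustrative only and requires hardware that can hold a 72B checkpoint:

```python
def build_messages(image_path, prompt):
    """Pack one image path and one text prompt into the Qwen2.5-VL-style
    chat-message layout (an assumption based on the model's qwen2.5-vl tag)."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]

messages = build_messages("./path/to/your/image.png",
                          "Determine the lateral surface area of the cone.")

# With sufficient GPU memory, these messages would then go through the usual
# Transformers multimodal pipeline, roughly:
#
#   from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
#   processor = AutoProcessor.from_pretrained("Osilly/Vision-R1-72B")
#   model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#       "Osilly/Vision-R1-72B", torch_dtype="auto", device_map="auto")
#   text = processor.apply_chat_template(
#       messages, tokenize=False, add_generation_prompt=True)
```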
+ ## Citation
+
+ If you find our work helpful, please consider citing it:
+
+ ```bibtex
+ @article{huang2025visionr1,
+   title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
+   author={Wenxuan Huang and Bohan Jia and Zijie Zhai and Shaosheng Cao and Zheyu Ye and Fei Zhao and Yao Hu and Shaohui Lin},
+   journal={arXiv preprint arXiv:2503.06749},
+   year={2025}
+ }
+ ```