Commit 9f817ec · verified · 1 parent: ac317cb
committed by nielsr (HF Staff)

Add metadata and improve model card


Hi! I'm Niels from the community science team at Hugging Face.

I've opened this PR to improve the model card for DRA-GRPO. Specifically, I have:
- Added metadata for `pipeline_tag`, `library_name`, and `base_model` to make the model more discoverable.
- Improved the markdown structure for better readability.
- Linked the model to its corresponding paper and code repository.
- Added a BibTeX citation.

These changes will enable features like the "Use in Transformers" button and ensure the model appears in the correct categories on the Hub.

Files changed (1): README.md (+33 −2)
README.md CHANGED:

````diff
@@ -1,7 +1,38 @@
 ---
 license: cc-by-4.0
+library_name: transformers
+pipeline_tag: text-generation
+base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
+tags:
+- mathematics
+- grpo
+- reinforcement-learning
+- reasoning
 ---
 
-This model is described in the paper [DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models](https://arxiv.org/abs/2505.09655).
+# DRA-GRPO
 
-Full code is in: https://github.com/xiwenc1/DRA-GRPO
+This repository contains the weights for DRA-GRPO, a model developed to enhance mathematical reasoning in Large Language Models (LLMs) through **Diversity-aware Reward Adjustment (DRA)**.
+
+The model is based on the [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) architecture and was trained using a framework that calibrates reward signals using semantic density to prevent policy collapse into redundant reasoning paths.
+
+## Resources
+- **Paper:** [DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning](https://huggingface.co/papers/2505.09655)
+- **ArXiv:** [2505.09655](https://arxiv.org/abs/2505.09655)
+- **Code:** [xiwenc1/DRA-GRPO](https://github.com/xiwenc1/DRA-GRPO)
+
+## Method Overview
+DRA-GRPO implements Diversity-aware Reward Adjustment (DRA), a theoretically grounded framework that uses Submodular Mutual Information (SMI) and Inverse Propensity Scoring (IPS). This mechanism de-biases gradient estimation by creating a repulsive force against redundancy, driving the policy to achieve better coverage of the high-reward reasoning landscape.
+
+## Performance
+Empirical evaluations demonstrate that DRA-GRPO consistently outperforms standard GRPO baselines, achieving an average accuracy of 58.2% on mathematical benchmarks (such as MATH-500, AMC23, and AIME24) with high data efficiency.
+
+## Citation
+```bibtex
+@article{chen2025dra,
+  title={DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning},
+  author={Xiwen Chen and Wenhui Zhu and Peijie Qiu and Xuanzhao Dong and Hao Wang and Haiyu Wu and Huayu Li and Aristeidis Sotiras and Yalin Wang and Abolfazl Razi},
+  journal={arXiv preprint arXiv:2505.09655},
+  year={2025}
+}
+```
````
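
The "Method Overview" section of the new card describes down-weighting rewards of semantically redundant completions. As a rough illustration only — a toy sketch in the spirit of an inverse-propensity correction, not the paper's actual SMI/IPS formulation (the function name and weighting scheme here are my own) — the idea can be expressed as:

```python
import numpy as np

def diversity_adjusted_rewards(rewards, similarity):
    """Toy inverse-propensity-style reward adjustment (illustrative only).

    rewards:    (n,) raw rewards for n sampled completions.
    similarity: (n, n) pairwise semantic similarities in [0, 1],
                with 1.0 on the diagonal.

    A completion whose total similarity to the group is high sits in a
    dense (redundant) region, so its reward is scaled down; rare,
    diverse completions are scaled up. Weights are renormalized so the
    group's mean weight stays 1.
    """
    rewards = np.asarray(rewards, dtype=float)
    density = np.asarray(similarity, dtype=float).sum(axis=1)
    weights = 1.0 / density                    # inverse-propensity style
    weights *= len(rewards) / weights.sum()    # renormalize: mean weight == 1
    return rewards * weights

# Two near-duplicate completions and one distinct one, equal raw reward:
adjusted = diversity_adjusted_rewards(
    [1.0, 1.0, 1.0],
    np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]]),
)
```

In this toy setup the distinct third completion ends up with a larger adjusted reward than the two near-duplicates, which is the qualitative effect the card describes: a repulsive force against redundancy in the gradient estimate.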