Commit 9f817ec · verified · 1 parent: ac317cb
committed by nielsr (HF Staff)

Add metadata and improve model card


Hi! I'm Niels from the community science team at Hugging Face.

I've opened this PR to improve the model card for DRA-GRPO. Specifically, I have:
- Added metadata for `pipeline_tag`, `library_name`, and `base_model` to make the model more discoverable.
- Improved the markdown structure for better readability.
- Linked the model to its corresponding paper and code repository.
- Added a BibTeX citation.

These changes will enable features like the "Use in Transformers" button and ensure the model appears in the correct categories on the Hub.

Files changed (1): README.md (+33 −2)
README.md CHANGED:

````diff
@@ -1,7 +1,38 @@
 ---
 license: cc-by-4.0
+library_name: transformers
+pipeline_tag: text-generation
+base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
+tags:
+- mathematics
+- grpo
+- reinforcement-learning
+- reasoning
 ---
 
-This model is described in the paper [DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models](https://arxiv.org/abs/2505.09655).
+# DRA-GRPO
 
-Full code is in: https://github.com/xiwenc1/DRA-GRPO
+This repository contains the weights for DRA-GRPO, a model developed to enhance mathematical reasoning in Large Language Models (LLMs) through **Diversity-aware Reward Adjustment (DRA)**.
+
+The model is based on the [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) architecture and was trained using a framework that calibrates reward signals using semantic density to prevent policy collapse into redundant reasoning paths.
+
+## Resources
+- **Paper:** [DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning](https://huggingface.co/papers/2505.09655)
+- **ArXiv:** [2505.09655](https://arxiv.org/abs/2505.09655)
+- **Code:** [xiwenc1/DRA-GRPO](https://github.com/xiwenc1/DRA-GRPO)
+
+## Method Overview
+DRA-GRPO implements Diversity-aware Reward Adjustment (DRA), a theoretically grounded framework that uses Submodular Mutual Information (SMI) and Inverse Propensity Scoring (IPS). This mechanism de-biases gradient estimation by creating a repulsive force against redundancy, driving the policy to achieve better coverage of the high-reward reasoning landscape.
+
+## Performance
+Empirical evaluations demonstrate that DRA-GRPO consistently outperforms standard GRPO baselines, achieving an average accuracy of 58.2% on mathematical benchmarks (such as MATH-500, AMC23, and AIME24) with high data efficiency.
+
+## Citation
+```bibtex
+@article{chen2025dra,
+  title={DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning},
+  author={Xiwen Chen and Wenhui Zhu and Peijie Qiu and Xuanzhao Dong and Hao Wang and Haiyu Wu and Huayu Li and Aristeidis Sotiras and Yalin Wang and Abolfazl Razi},
+  journal={arXiv preprint arXiv:2505.09655},
+  year={2025}
+}
+```
````
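
The "Method Overview" section of the new card describes down-weighting rewards of semantically redundant completions. As a rough illustration only — a toy sketch in the spirit of an inverse-propensity correction, not the paper's actual SMI/IPS formulation (the function name and weighting scheme here are my own) — the idea can be expressed as:

```python
import numpy as np

def diversity_adjusted_rewards(rewards, similarity):
    """Toy inverse-propensity-style reward adjustment (illustrative only).

    rewards:    (n,) raw rewards for n sampled completions.
    similarity: (n, n) pairwise semantic similarities in [0, 1],
                with 1.0 on the diagonal.

    A completion whose total similarity to the group is high sits in a
    dense (redundant) region, so its reward is scaled down; rare,
    diverse completions are scaled up. Weights are renormalized so the
    group's mean weight stays 1.
    """
    rewards = np.asarray(rewards, dtype=float)
    density = np.asarray(similarity, dtype=float).sum(axis=1)
    weights = 1.0 / density                    # inverse-propensity style
    weights *= len(rewards) / weights.sum()    # renormalize: mean weight == 1
    return rewards * weights

# Two near-duplicate completions and one distinct one, equal raw reward:
adjusted = diversity_adjusted_rewards(
    [1.0, 1.0, 1.0],
    np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]]),
)
```

In this toy setup the distinct third completion ends up with a larger adjusted reward than the two near-duplicates, which is the qualitative effect the card describes: a repulsive force against redundancy in the gradient estimate.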