Add metadata and improve model card

#1 by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +33 -2
README.md CHANGED
@@ -1,7 +1,38 @@
 ---
 license: cc-by-4.0
+library_name: transformers
+pipeline_tag: text-generation
+base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
+tags:
+- mathematics
+- grpo
+- reinforcement-learning
+- reasoning
 ---
 
-This model is described in the paper [DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models](https://arxiv.org/abs/2505.09655).
+# DRA-GRPO
 
-Full code is in: https://github.com/xiwenc1/DRA-GRPO
+This repository contains the weights for DRA-GRPO, a model developed to enhance mathematical reasoning in Large Language Models (LLMs) through **Diversity-aware Reward Adjustment (DRA)**.
+
+The model is based on the [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) architecture and was trained with a framework that calibrates reward signals by semantic density, preventing the policy from collapsing into redundant reasoning paths.
+
+## Resources
+- **Paper:** [DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning](https://huggingface.co/papers/2505.09655)
+- **ArXiv:** [2505.09655](https://arxiv.org/abs/2505.09655)
+- **Code:** [xiwenc1/DRA-GRPO](https://github.com/xiwenc1/DRA-GRPO)
+
+## Method Overview
+DRA-GRPO implements Diversity-aware Reward Adjustment (DRA), a theoretically grounded framework that combines Submodular Mutual Information (SMI) with Inverse Propensity Scoring (IPS). This mechanism de-biases gradient estimation by exerting a repulsive force against redundancy, driving the policy toward better coverage of the high-reward reasoning landscape.
+
+## Performance
+Empirical evaluations show that DRA-GRPO consistently outperforms standard GRPO baselines, reaching an average accuracy of 58.2% across mathematical benchmarks (including MATH-500, AMC23, and AIME24) with high data efficiency.
+
+## Citation
+```bibtex
+@article{chen2025dra,
+  title={DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning},
+  author={Xiwen Chen and Wenhui Zhu and Peijie Qiu and Xuanzhao Dong and Hao Wang and Haiyu Wu and Huayu Li and Aristeidis Sotiras and Yalin Wang and Abolfazl Razi},
+  journal={arXiv preprint arXiv:2505.09655},
+  year={2025}
+}
+```
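The diversity-aware reward adjustment described in the Method Overview section can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `dra_adjusted_advantages` is a hypothetical helper, plain cosine similarity stands in for the paper's submodular mutual information term, and an inverse-propensity-style weight down-weights semantically redundant completions before the usual GRPO group normalisation.

```python
import numpy as np

def dra_adjusted_advantages(embeddings, rewards, eps=1e-8):
    """Toy sketch of diversity-aware reward adjustment for one GRPO group.

    `embeddings` (G x d) stand in for semantic embeddings of the G sampled
    reasoning paths; `rewards` (G,) are their raw scalar rewards. Paths in
    dense (redundant) regions are down-weighted, mimicking an
    inverse-propensity correction; cosine similarity here is a simple
    stand-in for the submodular-mutual-information term in the paper.
    """
    E = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + eps)
    sim = np.clip(E @ E.T, 0.0, 1.0)          # pairwise semantic similarity
    density = sim.mean(axis=1)                # high => redundant reasoning path
    weights = 1.0 / (density + eps)           # inverse-propensity-style weight
    adj = rewards * weights / weights.mean()  # keep reward scale comparable
    # GRPO-style group-normalised advantages
    return (adj - adj.mean()) / (adj.std() + eps)

# Two near-duplicate reasoning paths and one distinct path, equal raw reward:
emb = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
r = np.array([1.0, 1.0, 1.0])
adv = dra_adjusted_advantages(emb, r)
# the distinct third path receives the largest advantage
```

With identical raw rewards, vanilla GRPO would assign all three completions zero advantage; the adjustment breaks the tie in favour of the semantically distinct path, which is the intended repulsive effect against redundancy.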