Add metadata and improve model card #1
by nielsr (HF Staff) - opened
README.md CHANGED

---
license: cc-by-4.0
library_name: transformers
pipeline_tag: text-generation
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
tags:
- mathematics
- grpo
- reinforcement-learning
- reasoning
---

# DRA-GRPO

This repository contains the weights for DRA-GRPO, a model developed to enhance mathematical reasoning in Large Language Models (LLMs) through **Diversity-aware Reward Adjustment (DRA)**.

The model is based on [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) and was trained with a framework that uses semantic density to calibrate reward signals, preventing the policy from collapsing into redundant reasoning paths.
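
## Usage

The metadata above declares `library_name: transformers` and `pipeline_tag: text-generation`, so the checkpoint should load through the standard `transformers` text-generation pipeline. Below is a minimal sketch; the repository id `xiwenc1/DRA-GRPO` is an assumption inferred from the linked code repository, not a confirmed path, and the generation settings are illustrative.

```python
# Minimal inference sketch. NOTE: the model id below is an assumption;
# replace it with the actual Hugging Face repository id.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="xiwenc1/DRA-GRPO",  # hypothetical repo id
    device_map="auto",
)

# Chat-style input, mirroring the base DeepSeek-R1 distilled model's usage.
messages = [{"role": "user", "content": "If 3x + 5 = 20, what is x?"}]
result = generator(messages, max_new_tokens=512)
print(result[0]["generated_text"][-1]["content"])
```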

## Resources

- **Paper:** [DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning](https://huggingface.co/papers/2505.09655)
- **ArXiv:** [2505.09655](https://arxiv.org/abs/2505.09655)
- **Code:** [xiwenc1/DRA-GRPO](https://github.com/xiwenc1/DRA-GRPO)

## Method Overview

DRA-GRPO implements Diversity-aware Reward Adjustment (DRA), a theoretically grounded framework that uses Submodular Mutual Information (SMI) and Inverse Propensity Scoring (IPS) to calibrate reward signals. This mechanism de-biases gradient estimation by creating a repulsive force against redundancy, driving the policy toward better coverage of the high-reward reasoning landscape.
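
The exact SMI/IPS formulation is given in the paper and the linked code; the snippet below is only a conceptual sketch under stated assumptions: completions in one GRPO group are represented by L2-normalized embeddings, mean pairwise cosine similarity stands in for the submodular redundancy measure, and `alpha` is an illustrative hyperparameter.

```python
import numpy as np

def diversity_adjusted_advantages(rewards, embeddings, alpha=1.0):
    """Conceptual sketch of diversity-aware reward adjustment for one
    GRPO group (not the paper's exact SMI/IPS formulation).

    rewards:    shape (G,), scalar rewards for G sampled completions.
    embeddings: shape (G, d), L2-normalized completion embeddings.
    alpha:      illustrative strength of the redundancy penalty.
    """
    rewards = np.asarray(rewards, dtype=float)
    G = len(rewards)
    sim = embeddings @ embeddings.T                # pairwise cosine similarity
    # Redundancy of completion i: mean similarity to the other completions,
    # a cosine proxy for the SMI-based measure used in the paper.
    redundancy = (sim.sum(axis=1) - 1.0) / max(G - 1, 1)
    # Repulsive force against redundancy: redundant completions have their
    # rewards down-weighted, diverse ones keep more of theirs.
    weights = np.exp(-alpha * redundancy)
    adjusted = rewards * weights
    # Standard GRPO group normalization, applied to the adjusted rewards.
    return (adjusted - adjusted.mean()) / (adjusted.std() + 1e-8)
```

In a toy group where two completions are near-duplicates and a third is distinct, the duplicates receive a higher redundancy score, so their advantages shrink relative to the distinct completion even at equal raw reward.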

## Performance

Empirical evaluations demonstrate that DRA-GRPO consistently outperforms standard GRPO baselines, achieving an average accuracy of 58.2% across mathematical benchmarks such as MATH-500, AMC23, and AIME24 with high data efficiency.

## Citation

```bibtex
@article{chen2025dra,
  title={DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning},
  author={Xiwen Chen and Wenhui Zhu and Peijie Qiu and Xuanzhao Dong and Hao Wang and Haiyu Wu and Huayu Li and Aristeidis Sotiras and Yalin Wang and Abolfazl Razi},
  journal={arXiv preprint arXiv:2505.09655},
  year={2025}
}
```