---
library_name: transformers
license: apache-2.0
base_model: OpenDataArena/ODA-Fin-SFT-8B
tags:
- finance
- reasoning
- reinforcement-learning
- GRPO
model-index:
- name: ODA-Fin-RL-8B
results: []
datasets:
- OpenDataArena/ODA-Fin-SFT-318k
- OpenDataArena/ODA-Fin-RL-12k
language:
- en
- zh
metrics:
- accuracy
- f1
size_categories:
- 10K<n<100K
---
<div align="center">
<h1>Unlocking Data Value in Finance: A Study on Distillation
and Difficulty-Aware Training</h1>
</div>
<div align="center">
[Paper](https://arxiv.org/abs/2603.07223)
[Collection](https://huggingface.co/collections/OpenDataArena/oda-finance)
</div>
<figure align="center">
<img src="imgs/model_compare.png" width="100%" alt="Model Performance Comparison">
<figcaption><em>Average score across financial benchmarks. ODA-Fin-RL/SFT-8B demonstrates strong performance relative to thinking models with significantly more parameters.</em></figcaption>
</figure>
---
This repository provides **ODA-Fin-RL-8B**, the reinforcement learning-enhanced version of ODA-Fin-SFT-8B. It achieves **state-of-the-art performance** among open-source financial LLMs of comparable size.
## 📖 Overview
**ODA-Fin-RL-8B** is built on [ODA-Fin-SFT-8B](https://huggingface.co/OpenDataArena/ODA-Fin-SFT-8B) and further optimized via **Group Relative Policy Optimization (GRPO)** on the **ODA-Fin-RL-12K** dataset, a carefully curated subset of 12K hard-but-verifiable financial reasoning tasks. This two-stage training strategy (SFT → RL) delivers strong performance across diverse financial benchmarks.
### 🎯 Key Highlights
- **Base Model**: ODA-Fin-SFT-8B (Qwen3-8B fine-tuned on 318K CoT samples)
- **RL Training**: GRPO on ODA-Fin-RL-12K (12K difficulty-filtered samples)
- **Avg Performance**: 74.6% across 9 financial benchmarks (+2.5 over SFT)
- **SOTA Achievement**: Highest score among open-source 8B financial LLMs
- **Key Strengths**:
- **Finova: 54.6%** (Best among 8B models, +6.8 over SFT)
- **TaTQA: 89.3%** (+2.3 over SFT, +4.2 over Qwen3-32B)
- **FPB: 83.4%** (+7.8 over SFT, strong sentiment reasoning)
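A minimal quick-start sketch for running the model with 🤗 Transformers. The prompt helper, sampling settings, and generation flow here are illustrative assumptions; consult the tokenizer's chat template for the exact prompt format expected by the released checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "OpenDataArena/ODA-Fin-RL-8B"

def build_messages(question: str) -> list[dict]:
    """Wrap a financial question in the chat format used by Qwen3-style models."""
    return [{"role": "user", "content": question}]

def generate(question: str, max_new_tokens: int = 1024) -> str:
    """Load the model and sample one answer (sampling params mirror the RL rollout config)."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    text = tokenizer.apply_chat_template(
        build_messages(question), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.6,
        top_p=0.85,
    )
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```

For example, `generate("A bond pays a 5% annual coupon on a $1,000 face value. What is the annual coupon payment?")` returns the model's chain-of-thought reasoning followed by its final answer.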
---
## 🧠 Model Training
### Stage 1: Supervised Fine-Tuning (SFT)
- **Dataset**: ODA-Fin-SFT-318K
- **Method**: Full-parameter fine-tuning
- **Epochs**: 3
- **Result**: Establishes strong reasoning foundation (72.1% avg)
### Stage 2: Reinforcement Learning (RL)
- **Dataset**: ODA-Fin-RL-12K (difficulty-filtered: fail rate >= 50%)
- **Algorithm**: GRPO (Group Relative Policy Optimization)
- **Training Config**:
```yaml
Hardware: 8×NVIDIA H800 (80GB)
Batch Size: 256
Rollouts per Sample: 4
Temperature: 0.6
Top-p: 0.85
Learning Rate: 1e-6
KL Coefficient: 0.001
```
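For intuition, the two data-side mechanics above can be sketched in a few lines: the fail-rate filter that selects hard-but-verifiable prompts, and GRPO's group-relative advantage, which z-scores each rollout's reward against its sibling rollouts (4 per sample here) instead of training a separate critic. The function names and the exact-match reward below are illustrative assumptions, not the paper's implementation:

```python
from statistics import mean, pstdev

def keep_hard(fail_rates: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Difficulty filter: keep prompt ids the SFT model fails on
    at least `threshold` of the time (here: fail rate >= 50%)."""
    return [pid for pid, fr in fail_rates.items() if fr >= threshold]

def exact_match_reward(pred: str, gold: str) -> float:
    """Illustrative verifiable reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if pred.strip() == gold.strip() else 0.0

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO advantage: z-score each rollout's reward within its group,
    removing the need for a learned value network."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# 4 rollouts for one prompt; two answer correctly.
rewards = [exact_match_reward(p, "4.2%") for p in ["4.2%", "3.8%", "4.2%", "5.0%"]]
advantages = group_relative_advantages(rewards)  # correct rollouts get ~+1, wrong ones ~-1
```

Because the advantage is normalized within each group, prompts the model always solves (or always fails) contribute near-zero gradient, which is why the fail-rate filter concentrates training signal on genuinely hard samples.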
---
## 📊 Model Performance
### Main Results (vs SOTA Baselines)
<figure align="center">
<img src="imgs/main_results_table.png" width="100%" alt="Main Results Table">
<figcaption><em>Main Results: ODA-Fin-RL ranks in the top three across most benchmarks. 'FinIQ', 'HL', and 'CFQA' denote the FinanceIQ, Headlines, and ConvFinQA benchmarks, respectively.</em></figcaption>
</figure>
**Performance Highlights**:
- **Matches Qwen3-32B** (74.7%) with **4× fewer parameters**
- **+4.3 points** over DianJin-R1-7B (best previous 7B financial LLM)
- **+2.1 points** over Qwen3-8B-Thinking (the reasoning mode of the same-size Qwen3-8B)
- **Dominates numerical reasoning**: TaTQA (89.3%), FinQA (73.3%), ConvFinQA (80.4%)
---
## 📚 Citation
```bibtex
@misc{cao2026unlockingdatavaluefinance,
title={Unlocking Data Value in Finance: A Study on Distillation and Difficulty-Aware Training},
author={Chuxue Cao and Honglin Lin and Zhanping Zhong and Xin Gao and Mengzhang Cai and Conghui He and Sirui Han and Lijun Wu},
year={2026},
eprint={2603.07223},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2603.07223},
}
```
---
## 📄 License
This model is released under the [Apache 2.0 License](https://opensource.org/licenses/Apache-2.0). The training data (ODA-Fin-SFT-318K) aggregates data from 25+ open-source repositories, each with its own license.
---
## 🤝 Acknowledgments
We thank the creators of DianJin-R1-Data, Agentar-DeepFinance-100K, financial_phrasebank, Finance-Instruct-500k, and others. We also thank the Qwen team for the powerful Qwen3 series models.
---
## 🔗 Related Resources
- **SFT Dataset**: [ODA-Fin-SFT-318K](https://huggingface.co/datasets/OpenDataArena/ODA-Fin-SFT-318k)
- **RL Dataset**: [ODA-Fin-RL-12K](https://huggingface.co/datasets/OpenDataArena/ODA-Fin-RL-12K)
- **SFT Model**: [ODA-Fin-SFT-8B](https://huggingface.co/OpenDataArena/ODA-Fin-SFT-8B)
<!-- - **RL Model**: [ODA-Fin-RL-8B](https://huggingface.co/OpenDataArena/ODA-Fin-RL-8B) -->
<!-- - **Paper**: [arXiv:2512.XXXXX](https://arxiv.org/abs/2512.XXXXX) -->