---
library_name: transformers
license: apache-2.0
base_model: OpenDataArena/ODA-Fin-SFT-8B
tags:
- finance
- reasoning
- reinforcement-learning
- GRPO
model-index:
- name: ODA-Fin-RL-8B
  results: []
datasets:
- OpenDataArena/ODA-Fin-SFT-318k
- OpenDataArena/ODA-Fin-RL-12k
language:
- en
- zh
metrics:
- accuracy
- f1
size_categories:
- 10K<n<100K
---


<div align="center">
  <h1>Unlocking Data Value in Finance: A Study on Distillation
and Difficulty-Aware Training</h1>

</div>

<div align="center">
  
[![Paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2603.07223)
[![Collections](https://img.shields.io/badge/🤗-Collections-yellow)](https://huggingface.co/collections/OpenDataArena/oda-finance)

</div>

<figure align="center">
  <img src="imgs/model_compare.png" width="100%" alt="Model Performance Comparison">
  <figcaption><em>Average score across financial benchmarks. ODA-Fin-RL/SFT-8B demonstrates strong performance relative to thinking models with significantly more parameters.</em></figcaption>
</figure>

---

This repository provides **ODA-Fin-RL-8B**, the reinforcement learning-enhanced version of ODA-Fin-SFT-8B. It achieves **state-of-the-art performance** among open-source financial LLMs of comparable size.

## 📖 Overview

**ODA-Fin-RL-8B** is built on [ODA-Fin-SFT-8B](https://huggingface.co/OpenDataArena/ODA-Fin-SFT-8B) and further optimized via **Group Relative Policy Optimization (GRPO)** on the **ODA-Fin-RL-12K** dataset, a carefully curated subset of 12K hard-but-verifiable financial reasoning tasks. This two-stage training strategy (SFT → RL) delivers the best overall performance across diverse financial benchmarks.

### 🎯 Key Highlights

- **Base Model**: ODA-Fin-SFT-8B (Qwen3-8B fine-tuned on 318K CoT samples)
- **RL Training**: GRPO on ODA-Fin-RL-12K (12K difficulty-filtered samples)
- **Avg Performance**: 74.6% across 9 financial benchmarks (+2.5 over SFT)
- **SOTA Achievement**: Highest score among open-source 8B financial LLMs
- **Key Strengths**:
  - **Finova: 54.6%** (Best among 8B models, +6.8 over SFT)
  - **TaTQA: 89.3%** (+2.3 over SFT, +4.2 over Qwen3-32B)
  - **FPB: 83.4%** (+7.8 over SFT, strong sentiment reasoning)
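
A minimal inference sketch with 🤗 Transformers is below. It assumes the model follows the Qwen3-style chat template and emits its reasoning inside `<think>...</think>` tags; reusing the RL rollout sampling settings (temperature 0.6, top-p 0.85) for inference is our assumption, not a documented recommendation, and the helper names are ours:

```python
import re


def extract_final_answer(text: str) -> str:
    """Strip any <think>...</think> reasoning block and return the final answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


def generate(question: str, max_new_tokens: int = 2048) -> str:
    """Run ODA-Fin-RL-8B on one financial question (needs a GPU and transformers)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer  # lazy import

    model_id = "OpenDataArena/ODA-Fin-RL-8B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    messages = [{"role": "user", "content": question}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(
        inputs, max_new_tokens=max_new_tokens, do_sample=True,
        temperature=0.6, top_p=0.85,
    )
    # Decode only the newly generated tokens, then drop the reasoning trace.
    return tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True)


if __name__ == "__main__":
    reply = generate(
        "A bond pays a 5% annual coupon on a $1,000 face value. "
        "What is the annual coupon payment?"
    )
    print(extract_final_answer(reply))
```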

---

## 🧠 Model Training


### Stage 1: Supervised Fine-Tuning (SFT)

- **Dataset**: ODA-Fin-SFT-318K
- **Method**: Full-parameter fine-tuning
- **Epochs**: 3
- **Result**: Establishes strong reasoning foundation (72.1% avg)

### Stage 2: Reinforcement Learning (RL)

- **Dataset**: ODA-Fin-RL-12K (difficulty-filtered: fail rate >= 50%)
- **Algorithm**: GRPO (Group Relative Policy Optimization)
- **Training Config**:
  ```yaml
  Hardware: 8×NVIDIA H800 (80GB)
  Batch Size: 256
  Rollouts per Sample: 4
  Temperature: 0.6
  Top-p: 0.85
  Learning Rate: 1e-6
  KL Coefficient: 0.001
  ```
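
To make the two ingredients above concrete, here is a simplified sketch (ours, not the training code) of the difficulty filter and of GRPO's group-relative advantage, which normalizes each rollout's reward against its own group instead of using a learned value network:

```python
from statistics import mean, pstdev


def is_hard(fail_count: int, rollout_count: int) -> bool:
    """Difficulty filter for the RL set: keep samples the SFT model
    fails on at least 50% of the time."""
    return fail_count / rollout_count >= 0.5


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO's core idea: A_i = (r_i - mean(r)) / (std(r) + eps),
    computed within the group of rollouts for a single prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# With 4 rollouts per sample, a prompt where 2 of 4 rollouts verify correctly
# (reward 1) yields advantages of roughly +1 for successes and -1 for failures.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because advantages are centered within each group, uniformly-solved (or uniformly-failed) prompts contribute no gradient signal, which is why the difficulty filter concentrates training on prompts with mixed outcomes.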

---

## 📊 Model Performance

### Main Results (vs SOTA Baselines)

<figure align="center">
  <img src="imgs/main_results_table.png" width="100%" alt="Main results table">
  <figcaption><em>Main Results: ODA-Fin-RL achieves top-three performance across most benchmarks. 'FinIQ', 'HL', and 'CFQA' refer to the FinanceIQ, Headlines, and ConvFinQA benchmarks, respectively.</em></figcaption>
</figure>

**Performance Highlights**:
- **Matches Qwen3-32B** (74.7%) with **4× fewer parameters**
- **+4.3 points** over DianJin-R1-7B (best previous 7B financial LLM)
- **+2.1 points** over Qwen3-8B-Thinking (larger reasoning model)
- **Dominates numerical reasoning**: TaTQA (89.3%), FinQA (73.3%), ConvFinQA (80.4%)

---

## 📚 Citation

```bibtex
@misc{cao2026unlockingdatavaluefinance,
      title={Unlocking Data Value in Finance: A Study on Distillation and Difficulty-Aware Training}, 
      author={Chuxue Cao and Honglin Lin and Zhanping Zhong and Xin Gao and Mengzhang Cai and Conghui He and Sirui Han and Lijun Wu},
      year={2026},
      eprint={2603.07223},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.07223}, 
}

```

---

## 📄 License

This model is released under the [Apache 2.0 License](https://opensource.org/licenses/Apache-2.0). The training data (ODA-Fin-SFT-318K) is aggregated from 25+ open-source repositories, each with its own license.

---

## 🤝 Acknowledgments

We thank the creators of DianJin-R1-Data, Agentar-DeepFinance-100K, financial_phrasebank, Finance-Instruct-500k, and others. We also thank the Qwen team for the powerful Qwen3 series models.

---

## 🔗 Related Resources

- **SFT Dataset**: [ODA-Fin-SFT-318K](https://huggingface.co/datasets/OpenDataArena/ODA-Fin-SFT-318k)
- **RL Dataset**: [ODA-Fin-RL-12K](https://huggingface.co/datasets/OpenDataArena/ODA-Fin-RL-12K)
- **SFT Model**: [ODA-Fin-SFT-8B](https://huggingface.co/OpenDataArena/ODA-Fin-SFT-8B)
<!-- - **RL Model**: [ODA-Fin-RL-8B](https://huggingface.co/OpenDataArena/ODA-Fin-RL-8B) -->
