---
library_name: transformers
license: apache-2.0
base_model: OpenDataArena/ODA-Fin-SFT-8B
tags:
- finance
- reasoning
- reinforcement-learning
- GRPO
model-index:
- name: ODA-Fin-RL-8B
  results: []
datasets:
- OpenDataArena/ODA-Fin-SFT-318k
- OpenDataArena/ODA-Fin-RL-12k
language:
- en
- zh
metrics:
- accuracy
- f1
size_categories:
- 10K<n<100K
---
# Unlocking Data Value in Finance: A Study on Distillation and Difficulty-Aware Training
This repository provides ODA-Fin-RL-8B, the reinforcement learning-enhanced version of ODA-Fin-SFT-8B. It achieves state-of-the-art performance among open-source financial LLMs of comparable size.
## Overview
ODA-Fin-RL-8B is built on ODA-Fin-SFT-8B and further optimized via Group Relative Policy Optimization (GRPO) on ODA-Fin-RL-12K, a carefully curated subset of 12K hard-but-verifiable financial reasoning tasks. This two-stage training strategy (SFT → RL) yields the strongest performance across diverse financial benchmarks.
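The core idea of GRPO is that each sampled response is scored relative to the other rollouts drawn for the same prompt, replacing a learned value baseline with the group's reward statistics. The following is a minimal illustrative sketch of that group-relative advantage computation, not the actual training code:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its group's statistics.

    GRPO uses the mean reward of the rollouts sampled for the same
    prompt as the baseline, so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one financial question, rewarded by answer correctness:
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct rollouts receive positive advantages, incorrect ones negative.
```

Correct answers to questions the group mostly fails earn large positive advantages, which is why pairing GRPO with difficulty-filtered data (below) keeps the learning signal informative.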
## Key Highlights
- Base Model: ODA-Fin-SFT-8B (Qwen3-8B fine-tuned on 318K CoT samples)
- RL Training: GRPO on ODA-Fin-RL-12K (12K difficulty-filtered samples)
- Avg Performance: 74.6% across 9 financial benchmarks (+2.5 over SFT)
- SOTA Achievement: Highest score among open-source 8B financial LLMs
- Key Strengths:
- Finova: 54.6% (Best among 8B models, +6.8 over SFT)
- TaTQA: 89.3% (+2.3 over SFT, +4.2 over Qwen3-32B)
- FPB: 83.4% (+7.8 over SFT, strong sentiment reasoning)
## Model Training
### Stage 1: Supervised Fine-Tuning (SFT)
- Dataset: ODA-Fin-SFT-318K
- Method: Full-parameter fine-tuning
- Epochs: 3
- Result: Establishes strong reasoning foundation (72.1% avg)
### Stage 2: Reinforcement Learning (RL)
- Dataset: ODA-Fin-RL-12K (difficulty-filtered: fail rate >= 50%)
- Algorithm: GRPO (Group Relative Policy Optimization)
- Training Config:
  - Hardware: 8×NVIDIA H800 (80GB)
  - Batch Size: 256
  - Rollouts per Sample: 4
  - Temperature: 0.6
  - Top-p: 0.85
  - Learning Rate: 1e-6
  - KL Coefficient: 0.001
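The difficulty filter used to build ODA-Fin-RL-12K (keep only samples the SFT model fails at least half the time) can be sketched as follows. The sample schema (`n_attempts`, `n_failures`) and the helper name are assumptions for illustration, not the released pipeline:

```python
def is_hard(sample, threshold=0.5):
    """Keep a sample only if the SFT model's fail rate across
    repeated attempts meets the threshold (>= 50% by default)."""
    return sample["n_failures"] / sample["n_attempts"] >= threshold

# Hypothetical candidate pool with per-sample attempt statistics:
pool = [
    {"id": "q1", "n_attempts": 8, "n_failures": 6},  # fail rate 0.75 -> keep
    {"id": "q2", "n_attempts": 8, "n_failures": 2},  # fail rate 0.25 -> drop
    {"id": "q3", "n_attempts": 8, "n_failures": 4},  # fail rate 0.50 -> keep
]
hard_subset = [s["id"] for s in pool if is_hard(s)]
# hard_subset == ["q1", "q3"]
```

Filtering out questions the SFT model already solves reliably avoids spending rollouts where every response gets the same reward and the group-relative advantage collapses to zero.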
## Model Performance
### Main Results (vs. SOTA Baselines)
Performance Highlights:
- Matches Qwen3-32B (74.7%) with 4× fewer parameters
- +4.3 points over DianJin-R1-7B (best previous 7B financial LLM)
- +2.1 points over Qwen3-8B-Thinking (larger reasoning model)
- Leads in numerical reasoning: TaTQA (89.3%), FinQA (73.3%), ConvFinQA (80.4%)
## Citation
```bibtex
@misc{cao2026unlockingdatavaluefinance,
  title={Unlocking Data Value in Finance: A Study on Distillation and Difficulty-Aware Training},
  author={Chuxue Cao and Honglin Lin and Zhanping Zhong and Xin Gao and Mengzhang Cai and Conghui He and Sirui Han and Lijun Wu},
  year={2026},
  eprint={2603.07223},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2603.07223},
}
```
## License
This model is released under the Apache 2.0 license. The training data (ODA-Fin-SFT-318K) aggregates samples from 25+ open-source repositories, each governed by its own license.
## Acknowledgments
We thank the creators of DianJin-R1-Data, Agentar-DeepFinance-100K, financial_phrasebank, Finance-Instruct-500k, and others. We also thank the Qwen team for the powerful Qwen3 series models.
## Related Resources
- SFT Dataset: ODA-Fin-SFT-318K
- RL Dataset: ODA-Fin-RL-12K
- SFT Model: ODA-Fin-SFT-8B