Unlocking Data Value in Finance: A Study on Distillation and Difficulty-Aware Training


Model Performance Comparison

Figure: Average score across financial benchmarks. ODA-Fin-RL/SFT-8B performs strongly relative to thinking models with significantly more parameters.

This repository provides ODA-Fin-RL-8B, the reinforcement learning-enhanced version of ODA-Fin-SFT-8B. It achieves state-of-the-art performance among open-source financial LLMs of comparable size.

📖 Overview

ODA-Fin-RL-8B is built on ODA-Fin-SFT-8B and further optimized via Group Relative Policy Optimization (GRPO) on the ODA-Fin-RL-12K dataset, a carefully curated subset of 12K hard-but-verifiable financial reasoning tasks. This two-stage training strategy (SFT → RL) yields the best overall performance across diverse financial benchmarks.
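
The "hard-but-verifiable" selection behind ODA-Fin-RL-12K can be sketched in a few lines; this is an illustrative reconstruction, not the paper's code, and all function and field names (`fail_rate`, `rollout_correct`, `filter_hard_samples`) are assumptions. It keeps a prompt only when the SFT model fails it at least half the time across sampled rollouts.

```python
# Illustrative sketch of difficulty-aware filtering for the RL subset:
# keep only prompts the SFT model fails on at least half the time.
# Names and data layout are hypothetical.

def fail_rate(rollout_correct: list) -> float:
    """Fraction of sampled rollouts whose final answer failed verification."""
    return 1.0 - sum(rollout_correct) / len(rollout_correct)

def filter_hard_samples(samples, threshold=0.5):
    """samples: dicts with a 'rollout_correct' list of booleans per rollout."""
    return [s for s in samples if fail_rate(s["rollout_correct"]) >= threshold]

pool = [
    {"id": "easy",   "rollout_correct": [True, True, True, False]},    # fail rate 0.25
    {"id": "hard",   "rollout_correct": [False, False, True, False]},  # fail rate 0.75
    {"id": "border", "rollout_correct": [True, False, True, False]},   # fail rate 0.50
]
hard = filter_hard_samples(pool)
print([s["id"] for s in hard])  # → ['hard', 'border']
```

Filtering at a 50% fail rate concentrates the RL budget on problems where the policy still has room to improve, while the "verifiable" requirement keeps the reward signal reliable.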

🎯 Key Highlights

  • Base Model: ODA-Fin-SFT-8B (Qwen3-8B fine-tuned on 318K CoT samples)
  • RL Training: GRPO on ODA-Fin-RL-12K (12K difficulty-filtered samples)
  • Avg Performance: 74.6% across 9 financial benchmarks (+2.5 over SFT)
  • SOTA Achievement: Highest score among open-source 8B financial LLMs
  • Key Strengths:
    • Finova: 54.6% (Best among 8B models, +6.8 over SFT)
    • TaTQA: 89.3% (+2.3 over SFT, +4.2 over Qwen3-32B)
    • FPB: 83.4% (+7.8 over SFT, strong sentiment reasoning)

🧠 Model Training

Stage 1: Supervised Fine-Tuning (SFT)

  • Dataset: ODA-Fin-SFT-318K
  • Method: Full-parameter fine-tuning
  • Epochs: 3
  • Result: Establishes strong reasoning foundation (72.1% avg)

Stage 2: Reinforcement Learning (RL)

  • Dataset: ODA-Fin-RL-12K (difficulty-filtered: fail rate >= 50%)
  • Algorithm: GRPO (Group Relative Policy Optimization)
  • Training Config:
    • Hardware: 8× NVIDIA H800 (80 GB)
    • Batch Size: 256
    • Rollouts per Sample: 4
    • Temperature: 0.6
    • Top-p: 0.85
    • Learning Rate: 1e-6
    • KL Coefficient: 0.001
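
The core of GRPO, the group-relative advantage, can be sketched as follows. This is a minimal illustration under the configuration above (4 rollouts per sample with a verifiable 0/1 reward), not the actual training code: each rollout's reward is normalized against its group's mean and standard deviation, so advantages are relative to sibling rollouts rather than to a learned value baseline.

```python
# Minimal sketch of GRPO's group-relative advantage (illustrative only).
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Advantage_i = (r_i - mean(group)) / (std(group) + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, 4 rollouts: two verified correct (reward 1), two incorrect (0).
rewards = [1.0, 0.0, 1.0, 0.0]
advs = group_advantages(rewards)
print([round(a, 2) for a in advs])  # → [1.0, -1.0, 1.0, -1.0]
```

Because advantages are computed within each group of rollouts, GRPO needs no separate critic network, which is part of why it trains efficiently on verifiable-reward tasks like these.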
    

📊 Model Performance

Main Results (vs SOTA Baselines)

Main Results: ODA-Fin-RL achieves top-three performance across most benchmarks. 'FinIQ', 'HL', and 'CFQA' refer to the FinanceIQ, Headlines, and ConvFinQA benchmarks.

Performance Highlights:

  • Matches Qwen3-32B (74.7%) with 4× fewer parameters
  • +4.3 points over DianJin-R1-7B (best previous 7B financial LLM)
  • +2.1 points over Qwen3-8B-Thinking (larger reasoning model)
  • Strong numerical reasoning: TaTQA (89.3%), FinQA (73.3%), ConvFinQA (80.4%)

📚 Citation

@misc{cao2026unlockingdatavaluefinance,
      title={Unlocking Data Value in Finance: A Study on Distillation and Difficulty-Aware Training}, 
      author={Chuxue Cao and Honglin Lin and Zhanping Zhong and Xin Gao and Mengzhang Cai and Conghui He and Sirui Han and Lijun Wu},
      year={2026},
      eprint={2603.07223},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.07223}, 
}

📄 License

This model is released under the Apache 2.0 License. The training data (ODA-Fin-SFT-318K) aggregates data from 25+ open-source repositories, each with its own license.


🤝 Acknowledgments

We thank the creators of DianJin-R1-Data, Agentar-DeepFinance-100K, financial_phrasebank, Finance-Instruct-500k, and others. We also thank the Qwen team for the powerful Qwen3 series models.


🔗 Related Resources

  • Model size: 8B parameters (Safetensors, BF16)

Model tree for OpenDataArena/ODA-Fin-RL-8B

  • Base model: Qwen/Qwen3-8B-Base
  • Fine-tuned: Qwen/Qwen3-8B
  • Fine-tuned: this model
