Unlocking Data Value in Finance: A Study on Distillation and Difficulty-Aware Training


Model Performance Comparison

Figure: Average score across financial benchmarks. ODA-Fin-RL/SFT-8B performs strongly relative to thinking models with significantly more parameters.

This repository provides ODA-Fin-RL-8B, the reinforcement learning-enhanced version of ODA-Fin-SFT-8B. It achieves state-of-the-art performance among open-source financial LLMs of comparable size.

📖 Overview

ODA-Fin-RL-8B is built on ODA-Fin-SFT-8B and further optimized via Group Relative Policy Optimization (GRPO) on the ODA-Fin-RL-12K dataset, a carefully curated subset of 12K hard-but-verifiable financial reasoning tasks. This two-stage training strategy (SFT → RL) yields the best overall performance across diverse financial benchmarks.
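
The "hard-but-verifiable" selection behind ODA-Fin-RL-12K can be sketched in a few lines; this is an illustrative reconstruction, not the paper's code, and all function and field names (`fail_rate`, `rollout_correct`, `filter_hard_samples`) are assumptions. It keeps a prompt only when the SFT model fails it at least half the time across sampled rollouts.

```python
# Illustrative sketch of difficulty-aware filtering for the RL subset:
# keep only prompts the SFT model fails on at least half the time.
# Names and data layout are hypothetical.

def fail_rate(rollout_correct: list) -> float:
    """Fraction of sampled rollouts whose final answer failed verification."""
    return 1.0 - sum(rollout_correct) / len(rollout_correct)

def filter_hard_samples(samples, threshold=0.5):
    """samples: dicts with a 'rollout_correct' list of booleans per rollout."""
    return [s for s in samples if fail_rate(s["rollout_correct"]) >= threshold]

pool = [
    {"id": "easy",   "rollout_correct": [True, True, True, False]},    # fail rate 0.25
    {"id": "hard",   "rollout_correct": [False, False, True, False]},  # fail rate 0.75
    {"id": "border", "rollout_correct": [True, False, True, False]},   # fail rate 0.50
]
hard = filter_hard_samples(pool)
print([s["id"] for s in hard])  # → ['hard', 'border']
```

Filtering at a 50% fail rate concentrates the RL budget on problems where the policy still has room to improve, while the "verifiable" requirement keeps the reward signal reliable.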

🎯 Key Highlights

  • Base Model: ODA-Fin-SFT-8B (Qwen3-8B fine-tuned on 318K CoT samples)
  • RL Training: GRPO on ODA-Fin-RL-12K (12K difficulty-filtered samples)
  • Avg Performance: 74.6% across 9 financial benchmarks (+2.5 over SFT)
  • SOTA Achievement: Highest score among open-source 8B financial LLMs
  • Key Strengths:
    • Finova: 54.6% (Best among 8B models, +6.8 over SFT)
    • TaTQA: 89.3% (+2.3 over SFT, +4.2 over Qwen3-32B)
    • FPB: 83.4% (+7.8 over SFT, strong sentiment reasoning)

🧠 Model Training

Stage 1: Supervised Fine-Tuning (SFT)

  • Dataset: ODA-Fin-SFT-318K
  • Method: Full-parameter fine-tuning
  • Epochs: 3
  • Result: Establishes strong reasoning foundation (72.1% avg)

Stage 2: Reinforcement Learning (RL)

  • Dataset: ODA-Fin-RL-12K (difficulty-filtered: fail rate >= 50%)
  • Algorithm: GRPO (Group Relative Policy Optimization)
  • Training Config:
    • Hardware: 8× NVIDIA H800 (80 GB)
    • Batch Size: 256
    • Rollouts per Sample: 4
    • Temperature: 0.6
    • Top-p: 0.85
    • Learning Rate: 1e-6
    • KL Coefficient: 0.001
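
The core of GRPO, the group-relative advantage, can be sketched as follows. This is a minimal illustration under the configuration above (4 rollouts per sample with a verifiable 0/1 reward), not the actual training code: each rollout's reward is normalized against its group's mean and standard deviation, so advantages are relative to sibling rollouts rather than to a learned value baseline.

```python
# Minimal sketch of GRPO's group-relative advantage (illustrative only).
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Advantage_i = (r_i - mean(group)) / (std(group) + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, 4 rollouts: two verified correct (reward 1), two incorrect (0).
rewards = [1.0, 0.0, 1.0, 0.0]
advs = group_advantages(rewards)
print([round(a, 2) for a in advs])  # → [1.0, -1.0, 1.0, -1.0]
```

Because advantages are computed within each group of rollouts, GRPO needs no separate critic network, which is part of why it trains efficiently on verifiable-reward tasks like these.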
    

📊 Model Performance

Main Results (vs SOTA Baselines)

Main Results: ODA-Fin-RL achieves top-three performance across most benchmarks. 'FinIQ', 'HL', and 'CFQA' refer to the FinanceIQ, Headlines, and ConvFinQA benchmarks.

Performance Highlights:

  • Matches Qwen3-32B (74.7%) with 4× fewer parameters
  • +4.3 points over DianJin-R1-7B (best previous 7B financial LLM)
  • +2.1 points over Qwen3-8B-Thinking (larger reasoning model)
  • Strong numerical reasoning: TaTQA (89.3%), FinQA (73.3%), ConvFinQA (80.4%)

📚 Citation

@misc{cao2026unlockingdatavaluefinance,
      title={Unlocking Data Value in Finance: A Study on Distillation and Difficulty-Aware Training}, 
      author={Chuxue Cao and Honglin Lin and Zhanping Zhong and Xin Gao and Mengzhang Cai and Conghui He and Sirui Han and Lijun Wu},
      year={2026},
      eprint={2603.07223},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.07223}, 
}

📄 License

This model is released under the Apache 2.0 License. The training data (ODA-Fin-SFT-318K) aggregates data from 25+ open-source repositories, each with its own license.


🤝 Acknowledgments

We thank the creators of DianJin-R1-Data, Agentar-DeepFinance-100K, financial_phrasebank, Finance-Instruct-500k, and others. We also thank the Qwen team for the powerful Qwen3 series models.


🔗 Related Resources

  • Model size: 8B parameters (Safetensors, BF16)

Model tree for OpenDataArena/ODA-Fin-RL-8B

  • Base model: Qwen/Qwen3-8B-Base
  • Fine-tuned: Qwen/Qwen3-8B
  • Fine-tuned: this model
