ODA-Fin-RL-8B / README.md
chuxuecao's picture
Create README.md
98300a3 verified
metadata
library_name: transformers
license: apache-2.0
base_model: OpenDataArena/ODA-Fin-SFT-8B
tags:
  - finance
  - reasoning
  - reinforcement-learning
  - GRPO
model-index:
  - name: ODA-Fin-RL-8B
    results: []
datasets:
  - OpenDataArena/ODA-Fin-SFT-318k
  - OpenDataArena/ODA-Fin-RL-12k
language:
  - en
  - zh
metrics:
  - accuracy
  - f1
size_categories:
  - 10K<n<100K

Unlocking Data Value in Finance: A Study on Distillation and Difficulty-Aware Training

Paper Collections

Model Performance Comparison
Average score across Financial benchmarks. ODA-Fin-RL/SFT-8B demonstrates strong performance relative to thinking models with significantly more parameters.

This repository provides ODA-Fin-RL-8B, the reinforcement learning-enhanced version of ODA-Fin-SFT-8B. It achieves state-of-the-art performance among open-source financial LLMs of comparable size.

πŸ“– Overview

ODA-Fin-RL-8B is built on ODA-Fin-SFT-8B and further optimized via Group Relative Policy Optimization (GRPO) on the ODA-Fin-RL-12K datasetβ€”a carefully curated subset of 12K hard-but-verifiable financial reasoning tasks. This two-stage training strategy (SFT β†’ RL) achieves optimal performance across diverse financial benchmarks.

🎯 Key Highlights

  • Base Model: ODA-Fin-SFT-8B (Qwen3-8B fine-tuned on 318K CoT samples)
  • RL Training: GRPO on ODA-Fin-RL-12K (12K difficulty-filtered samples)
  • Avg Performance: 74.6% across 9 financial benchmarks (+2.5 over SFT)
  • SOTA Achievement: Highest score among open-source 8B financial LLMs
  • Key Strengths:
    • Finova: 54.6% (Best among 8B models, +6.8 over SFT)
    • TaTQA: 89.3% (+2.3 over SFT, +4.2 over Qwen3-32B)
    • FPB: 83.4% (+7.8 over SFT, strong sentiment reasoning)

🧠 Model Training

Stage 1: Supervised Fine-Tuning (SFT)

  • Dataset: ODA-Fin-SFT-318K
  • Method: Full-parameter fine-tuning
  • Epochs: 3
  • Result: Establishes strong reasoning foundation (72.1% avg)

Stage 2: Reinforcement Learning (RL)

  • Dataset: ODA-Fin-RL-12K (difficulty-filtered: fail rate >= 50%)
  • Algorithm: GRPO (Group Relative Policy Optimization)
  • Training Config:
    Hardware: 8Γ—NVIDIA H800 (80GB)
    Batch Size: 256
    Rollouts per Sample: 4
    Temperature: 0.6
    Top-p: 0.85
    Learning Rate: 1e-6
    KL Coefficient: 0.001
    

πŸ“Š Model Performance

Main Results (vs SOTA Baselines)

p
Main Results: ODA-Fin-RL achieves top three performance across most benchmarks. 'FinIQ', 'HL' and 'CFQA' refer to FinanceIQ, Headlines, and ConvFinQA benchmarks.

Performance Highlights:

  • Matches Qwen3-32B (74.7%) with 4Γ— fewer parameters
  • +4.3 points over DianJin-R1-7B (best previous 7B financial LLM)
  • +2.1 points over Qwen3-8B-Thinking (larger reasoning model)
  • Dominates numerical reasoning: TaTQA (89.3%), FinQA (73.3%), ConvFinQA (80.4%)

πŸ“š Citation

@misc{cao2026unlockingdatavaluefinance,
      title={Unlocking Data Value in Finance: A Study on Distillation and Difficulty-Aware Training}, 
      author={Chuxue Cao and Honglin Lin and Zhanping Zhong and Xin Gao and Mengzhang Cai and Conghui He and Sirui Han and Lijun Wu},
      year={2026},
      eprint={2603.07223},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.07223}, 
}

πŸ“„ License

This model is released under the Apache 2.0 License. The training data (ODA-Fin-SFT-318K) aggregates from 25+ open-source repositories, each with their own licenses.


🀝 Acknowledgments

We thank the creators of DianJin-R1-Data, Agentar-DeepFinance-100K, financial_phrasebank, Finance-Instruct-500k, and others. We also thank the Qwen team for the powerful Qwen3 series models.


πŸ”— Related Resources