Model Card for FM-FCI/FinNumQA-VLSP2025

This model card documents FM-FCI/FinNumQA-VLSP2025, a Vietnamese LLM fine-tuned for the Financial Numerical Reasoning QA task. It ranked first on this task in the VLSP 2025 benchmark.

Model Details

Model Description

This study details our methodology for the VLSP 2025 Numerical Reasoning QA challenge, focusing on building transparent and accurate models for Vietnamese financial question answering that requires computational reasoning. We propose a two-stage alignment framework combining supervised fine-tuning (SFT) with program-centric policy optimization (PCPO), which is implemented through group relative policy optimization (GRPO) to enhance both program and execution accuracy. First, we leverage an advanced large language model (LLM) to generate high-quality structured reasoning pathways from an augmented dataset derived from the competition organizers’ resources. The Qwen3-8B model is then fine-tuned on these structured traces and further refined through GRPO using meticulously designed reward functions to optimize logical consistency. Our approach secured first place among 16 participating teams, achieving 77.87% program accuracy with 82.49% execution accuracy on the public test set, and 76.63% program accuracy with 79.88% execution accuracy on the private test set. Key insights reveal the significance of domain-specific structured reasoning traces, the effectiveness of multilingual data augmentation, and the critical role of PCPO in maintaining accurate numerical reasoning abilities.
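
The reward design is central to PCPO. As a hedged illustration, the sketch below shows one plausible shape for a program-centric reward that blends a program-match signal with an execution-match signal. The helper names, the single-step parsing rule, and the 0.7/0.3 weighting are assumptions for illustration, not the exact competition reward functions.

```python
import re

def normalize_program(program: str) -> str:
    """Canonicalize whitespace/case so 'subtract( 663 , 362 )' == 'subtract(663,362)'."""
    return re.sub(r"\s+", "", program).lower()

def execute_program(program: str) -> float:
    """Execute a single-step program of the form op(a, b).
    Illustrative only; the full program DSL may be richer (e.g., multi-step programs)."""
    ops = {"add": lambda a, b: a + b, "subtract": lambda a, b: a - b,
           "multiply": lambda a, b: a * b, "divide": lambda a, b: a / b}
    m = re.fullmatch(r"(\w+)\(([-\d.]+),([-\d.]+)\)", normalize_program(program))
    if m is None:
        raise ValueError(f"unparseable program: {program!r}")
    return ops[m.group(1)](float(m.group(2)), float(m.group(3)))

def pcpo_reward(generated: str, reference: str, gold_value: float, tol: float = 1e-2) -> float:
    """Blend program-level and execution-level agreement into one scalar reward.
    The 0.7/0.3 weighting is an illustrative assumption."""
    program_match = float(normalize_program(generated) == normalize_program(reference))
    try:
        execution_match = float(abs(execute_program(generated) - gold_value) <= tol)
    except (ValueError, KeyError, ZeroDivisionError):
        execution_match = 0.0  # malformed or invalid programs earn no execution reward
    return 0.7 * program_match + 0.3 * execution_match
```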

  • Developed by: FPT Smart Cloud, FPT Corporation
  • Model type: Dense decoder-only LLM (fine-tuned from Qwen3-8B)
  • Language(s) (NLP): Vietnamese (primary)
  • License: ?
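
How to Get Started with the Model

A minimal inference sketch using the standard transformers causal-LM interface. This assumes the model keeps the chat template of its Qwen3-8B base; the exact prompt format expected for program generation is not documented in this card, so the example question is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "FM-FCI/FinNumQA-VLSP2025"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Placeholder question; the real task pairs a question with financial context
# (tables/text) from which the model derives a computation program.
question = "Lợi nhuận ròng năm 2023 thay đổi bao nhiêu phần trăm so với năm 2022?"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```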

Training Details

Training Data

The training set comprises 14,661 samples, drawn from an augmented dataset derived from the competition organizers' resources and annotated with structured reasoning traces generated by an advanced LLM (see Model Description).

Training Procedure

Training Hyperparameters

For the supervised fine-tuning (SFT) stage, the primary training hyperparameters included mixed-precision BF16, a maximum sequence length of 5,888 tokens, a learning rate of 5.0 × 10⁻⁵ with a cosine scheduler, 5 training epochs, the AdamW optimizer with 25 warmup steps, and a per-device training batch size of 4. To accelerate training and maximize hardware utilization, we employed DeepSpeed Stage 3 together with FlashAttention 2.
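
As a hedged illustration, these settings map onto Hugging Face TrainingArguments roughly as follows (the card does not name the training framework; the output and DeepSpeed config paths are placeholders):

```python
from transformers import TrainingArguments

# Sketch of the SFT-stage settings described above (paths are placeholders).
sft_args = TrainingArguments(
    output_dir="finnumqa-sft",
    bf16=True,                         # mixed-precision BF16
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=5,
    optim="adamw_torch",               # AdamW optimizer
    warmup_steps=25,
    per_device_train_batch_size=4,
    deepspeed="ds_zero3_config.json",  # DeepSpeed ZeRO Stage 3 (placeholder path)
)
# FlashAttention 2 is enabled at model load time, e.g.:
#   AutoModelForCausalLM.from_pretrained(..., attn_implementation="flash_attention_2")
# The 5,888-token maximum sequence length is applied when tokenizing the data.
```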

For the GRPO stage, we adopted conservative settings to stabilize policy updates and preserve SFT behaviour, including a learning rate of 1 × 10⁻⁶ with the AdamW optimizer and KL regularization with a KL loss coefficient of 0.001. Training used a global batch size of 16, a PPO mini-batch size of 16, and a per-GPU micro-batch size of 2 to manage memory consumption. Rollouts sampled n = 5 candidate responses per prompt to obtain stable advantage estimates. Gradient checkpointing was enabled, and the pipeline supported long structured traces (a maximum prompt length of 5,888 tokens and a maximum response length of 26,880 tokens). Training was conducted for a total of 5 epochs.
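
The mini-batch / per-GPU micro-batch / rollout-n terminology suggests a verl-style trainer, but the framework is not named in this card. As an assumption-laden sketch, here is how the same settings would map onto TRL's GRPOConfig, which exposes analogous knobs (batch-size semantics differ across frameworks, so the accumulation value is illustrative, and TRL additionally requires the effective batch to be divisible by num_generations):

```python
from trl import GRPOConfig

grpo_args = GRPOConfig(
    output_dir="finnumqa-grpo",      # illustrative path
    learning_rate=1e-6,              # conservative LR; AdamW is the default optimizer
    beta=0.001,                      # KL regularization coefficient
    num_generations=5,               # n = 5 sampled candidate responses per prompt
    per_device_train_batch_size=2,   # per-GPU micro-batch size
    gradient_accumulation_steps=8,   # assumption: reaches a global batch of 16 on one GPU
    max_prompt_length=5888,
    max_completion_length=26880,
    num_train_epochs=5,
    gradient_checkpointing=True,
)
```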

Evaluation

Testing Data

Evaluation is based on the public and private test sets provided by the competition organizers.

Metrics

Program Accuracy (PA) and Execution Accuracy (EA), defined below.

Program Accuracy. This metric assesses whether the generated computation program (e.g., subtract(663, 362)) faithfully reflects the logical structure of the reference solution. It is the sole criterion for competitive ranking, as it ensures both the transparency and correctness of the reasoning process. This requirement is particularly critical in financial applications, where logically flawed programs – even if they yield numerically correct results – pose systemic risks to decision-making support.

Execution Accuracy. This metric verifies whether the final numerical output (e.g., 26.07) matches the ground-truth value. While informative, execution accuracy does not determine official rankings, as correctness of reasoning takes precedence over coincidental correctness of outputs.
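
To make the distinction concrete, here is a small self-contained illustration (the numbers are invented for this example, not taken from the dataset): a candidate program can reach the correct value through the wrong logic, scoring on EA but not on PA.

```python
# Reference program: subtract(663, 362)  -> 301
# Candidate program: subtract(1000, 699) -> 301  (right answer, wrong logic)
def run(op, a, b):
    return {"add": a + b, "subtract": a - b,
            "multiply": a * b, "divide": a / b if b else float("nan")}[op]

reference = ("subtract", 663, 362)
candidate = ("subtract", 1000, 699)

ea_hit = run(*candidate) == run(*reference)   # True: both execute to 301
pa_hit = candidate == reference               # False: operands differ from the reference
print(f"EA hit: {ea_hit}, PA hit: {pa_hit}")  # EA hit: True, PA hit: False
```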

Results

Private test set results for the top teams, ranked by program accuracy (the first row corresponds to our submission):

Team             PA (%)   EA (%)
HUET             76.63    79.88
ngoquanghuy      75.00    81.95
dathvt           69.82    79.14
truong13012004   69.67    74.26
vietld           61.83    68.49
masterunited     54.14    56.80

BibTeX:

@misc{chu2025finnumqa,
  title  = {Enhancing Numerical Reasoning in Vietnamese Financial Question Answering through Program-Centric Policy Optimization},
  author = {Duc Dinh Chu and Thanh-Bac Nguyen Ba and Duy Dinh Le and Khanh Van Tran},
  year   = {2025},
  note   = {First-place system, VLSP 2025 Financial Numerical Reasoning QA shared task}
}
