FinForge: Semi-Synthetic Financial Benchmark Generation
Abstract
FinForge is a scalable semi-synthetic pipeline for creating domain-specific financial evaluation benchmarks via expert curation and language-model synthesis; evaluations built with it reveal significant variation in financial reasoning capabilities among state-of-the-art models.
Evaluating Language Models (LMs) in specialized, high-stakes domains such as finance remains a significant challenge due to the scarcity of open, high-quality, and domain-specific datasets. Existing general-purpose benchmarks provide broad coverage but lack the depth and domain fidelity needed to assess LMs' capabilities for real-world financial reasoning, which requires both conceptual understanding and quantitative rigor. To address this gap, we introduce FinForge, a scalable, semi-synthetic pipeline for constructing finance-specific evaluation benchmarks through a hybrid of expert-guided data curation and controlled LM-based synthesis. FinForge combines manual and programmatic corpus construction from authoritative financial sources with structured question generation and validation using Gemini 2.5 Flash. To demonstrate the pipeline's efficacy, we produce FinForge-5k, a snapshot benchmark comprising over 5,000 human-validated question-answer pairs across 11 finance subdomains, derived from a curated corpus of 100,000 verified documents totaling 143M tokens. Evaluation of state-of-the-art open-source and closed-source models on FinForge-5k reveals significant differences in financial reasoning, with leading models achieving accuracy levels near 80%. These findings underscore the framework's utility for diagnosing current model limitations and guiding future improvements in financial domain competence. All code and data are available at https://github.com/gtfintechlab/FinForge.
Community
This paper introduces FinForge, a novel framework designed to address the scarcity of high-quality, domain-specific datasets for evaluating Large Language Models (LLMs) in finance. The authors propose a scalable, semi-synthetic pipeline that combines expert-guided data curation from authoritative sources with controlled question generation and validation using Gemini 2.5 Flash.
Key Contributions:
- FinForge Framework: A hybrid pipeline integrating manual/programmatic corpus construction with rigorous LM-based synthesis.
- FinForge-5k Dataset: A new snapshot benchmark comprising over 5,000 human-validated Q&A pairs across 11 financial subdomains, derived from a curated corpus of 100,000 verified documents (143M tokens).
- Benchmarking Results: Evaluation of state-of-the-art open and closed-source models reveals significant variance in financial reasoning capabilities, with leading models achieving approximately 80% accuracy.
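The paper does not publish the pipeline's internals here, but the described flow (curated documents → LM-based question synthesis → validation filtering) can be sketched minimally. The following is an illustrative sketch only: `synthesize_qa`, `validate`, and the stubbed `stub_lm` are hypothetical names standing in for the actual corpus tooling and the Gemini 2.5 Flash call used by the authors.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    source_id: str

def synthesize_qa(doc_id: str, text: str, lm_call) -> list[QAPair]:
    """Ask an LM for QA pairs grounded in one document (LM call is injected)."""
    raw = lm_call(f"Generate question-answer pairs from:\n{text}")
    return [QAPair(q, a, doc_id) for q, a in raw]

def validate(pair: QAPair, text: str) -> bool:
    """Keep only pairs with a non-empty question and an answer grounded in the source."""
    return bool(pair.question.strip()) and pair.answer.lower() in text.lower()

def stub_lm(prompt: str):
    # Hypothetical stand-in for a Gemini 2.5 Flash API call.
    return [("What is the coupon rate?", "5%"),
            ("Who issued the bond?", "Acme Corp")]

doc = "Acme Corp issued a 10-year bond with a coupon rate of 5%."
pairs = [p for p in synthesize_qa("doc-001", doc, stub_lm) if validate(p, doc)]
print(len(pairs))  # both stub pairs pass the grounding check
```

In the real pipeline this filtering stage is followed by human validation, which is what distinguishes the released FinForge-5k pairs from raw LM output.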