VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct
Abstract
VeriEvol is a framework for scaling reinforcement learning on visual mathematical reasoning while keeping reward labels reliable. It separates prompt difficulty from answer reliability: evolution operators raise prompt difficulty, and hypothesis-test verification checks each answer before it is used as a reward label, which improves model performance and makes the pipeline auditable.
Scaling reinforcement learning for visual mathematical reasoning requires more than generating harder questions: as data volume grows, the reward labels themselves must remain reliable. Yet existing data pipelines scale supervision while trusting the labeller, and policy-side methods assume the underlying answers are already correct. We instead treat scaling as a verifiable data-construction problem and decouple two axes before any policy update: prompt difficulty, expanded by route-specific evolution operators, and answer reliability, enforced by offline hypothesis-test falsification. We instantiate this as VeriEvol, an iterative framework with two extensible components: a type-aware evolution module that rewrites low-difficulty image-question seeds into harder, image-grounded prompts; and HTV-Agent, a verifier that accepts an answer only after multi-source counter-evidence has failed to refute it. The resulting verified data scales in volume, extends by adding evolution routes or verifier channels, and plugs directly into existing GRPO-style RL recipes. On a five-benchmark visual-math suite, scaling evolved SFT data from 10K to 250K samples raises the mean accuracy from 35.42 to 54.73; then, with backbone, SFT initialization, and GRPO recipe held fixed, VeriEvol adds a cumulative +3.88 over an un-evolved RL baseline, of which +1.82 comes from evolved prompts and +2.06 from the HTV-Agent verifier. We release the prompts, data, models, code, and the full verifier trace of every sample, so that downstream work can scale and audit the pipeline rather than only inspect its outputs.
Community
We are excited to share VeriEvol, a verifiable data-construction framework for scaling multimodal mathematical reasoning.
The key idea is that scaling RL for visual math reasoning is not only about generating harder questions. As the data and rollout budget grow, answer reliability becomes a first-order bottleneck: noisy labels can be repeatedly reinforced as reward signals. VeriEvol therefore decouples two axes before policy optimization:
- Prompt difficulty: type-aware evolution operators rewrite low-difficulty image-question seeds into harder, image-grounded prompts.
- Answer reliability: HTV-Agent treats each candidate answer as a falsifiable hypothesis and verifies it through independent solver hypotheses, counter-evidence seeking, programmatic checks, visual checks, conflict resolution, and a deterministic acceptance gate.
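To make the idea of a deterministic acceptance gate concrete, here is a minimal sketch. It is not the HTV-Agent implementation: the `Verdict` values, the `ChannelResult` record, the channel names, and the `min_support` threshold are all hypothetical, chosen only to illustrate a rule that rejects an answer on any refutation and otherwise requires a minimum number of supporting channels.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    # Hypothetical verdict vocabulary; the real verifier may use a different one.
    SUPPORT = "support"
    REFUTE = "refute"
    ABSTAIN = "abstain"


@dataclass(frozen=True)
class ChannelResult:
    channel: str      # hypothetical channel name, e.g. "programmatic_check"
    verdict: Verdict


def accept(results: list[ChannelResult], min_support: int = 2) -> bool:
    """Deterministic gate: any refutation rejects; otherwise require enough support."""
    if any(r.verdict is Verdict.REFUTE for r in results):
        return False
    support = sum(r.verdict is Verdict.SUPPORT for r in results)
    return support >= min_support


# Example trace for one candidate answer (illustrative values only).
trace = [
    ChannelResult("independent_solver", Verdict.SUPPORT),
    ChannelResult("programmatic_check", Verdict.SUPPORT),
    ChannelResult("visual_check", Verdict.ABSTAIN),
]
print(accept(trace))
```

Because the gate is a pure function of the recorded channel results, the same trace always yields the same decision, which is what makes a stored verifier trace auditable after the fact.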
The resulting verified samples can be used directly with existing SFT and GRPO-style RL recipes. In our experiments, scaling evolved SFT data from 10K to 250K raises the five-benchmark average from 35.42 to 54.73, and scaling verified RL data from 10K to 130K further improves the average to 59.12. At a fixed RL setting, full VeriEvol adds +3.88 points over the un-evolved RL baseline, with gains from both evolved prompts and HTV-Agent verification.
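The following sketch illustrates why answer reliability matters in a GRPO-style recipe. It assumes a binary exact-match reward and the commonly used group-relative advantage (reward minus group mean, divided by group standard deviation); the function name, the `eps` constant, the choice of population standard deviation, and the toy rollouts are illustrative assumptions, not details from the paper.

```python
from statistics import mean, pstdev


def group_advantages(answers, label, eps=1e-6):
    """Group-relative advantages for one prompt under a binary exact-match reward."""
    rewards = [1.0 if a == label else 0.0 for a in answers]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for the same prompt; suppose "12" is the true answer.
rollouts = ["12", "12", "15", "12"]

clean = group_advantages(rollouts, label="12")  # label is correct
noisy = group_advantages(rollouts, label="15")  # label is wrong

print([round(a, 3) for a in clean])
print([round(a, 3) for a in noisy])
```

With the correct label, the three rollouts answering "12" receive positive advantages and the outlier is penalized. With the wrong label the signs flip, so every policy update on that prompt pushes the model toward the incorrect answer, which is the failure mode that offline verification is meant to filter out.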
Project page: https://robertmarton.github.io/verievol/
GitHub: https://github.com/RobertMarton/verievol
We are preparing the release of prompts, data, model checkpoints, code, and full verifier traces so that the community can audit and extend the data-construction pipeline, not just inspect final outputs.