arxiv:2605.29001

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

Published on May 27

Authors:

Abstract

Analysis reveals significant semantic inconsistencies in paraphrased mathematical content that affect model rankings and highlight fundamental measurement gaps in evaluation protocols.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%); removing them drops GPT-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it; these ranking changes are invisible to any single-model evaluation. Cross-model unanimity found these errors automatically (>= 3/4 models for MathCheck; >= 6/9 for our primary evaluation) for under $10; in our own dataset the same protocol found that 47% of auto-generated connective-variation paraphrases were semantically incorrect. That flaw compounds a deeper measurement gap: Claude Haiku 4.5 achieves 86% accuracy yet SCR=50%, meaning half its theorems are answered differently under semantically equivalent restatements, while aggregate accuracy across 9 models spans only 86-96% yet Semantic Consistency Rates (SCR) span 50-82% -- a 32-point gap invisible to standard benchmarks. Formally, for any target ranking over 9 frontier models there exists a weighting over paraphrase families that realizes it (No-Free-Benchmark corollary), because no model Pareto-dominates all families -- so benchmark designers who select families are implicitly choosing which model wins. FormInv supplies the audit protocol (replicated on external benchmarks at 100% recall), SCR and per-theorem Cochran's Q as primary invariance measures evaluated on 9 models across 366-811 items (on Lean4-verified theorems), and FormInvSelector for regime-aware model selection.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2605.29001

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.29001 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.29001 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.29001 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.