arxiv:2606.29985

Are We Measuring Strategy or Phrasing? The Gap Between Surface- and Approach-Level Diversity in LLM Math Reasoning

Published on Jun 29

Submitted by Lee SangMook on Jul 1

Seoul National University

Authors:

Sangmook Lee, Minbeom Kim

Abstract

Approach-level diversity in LLM mathematical reasoning captures strategic variation in problem-solving methods, revealing limitations of surface-level diversity metrics and highlighting challenges in directly optimizing diverse reasoning approaches.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Diversity in LLM mathematical reasoning is critical for exploration, but common diversity metrics mostly capture surface-level variation rather than differences in how a problem is solved. We address this gap by introducing approach-level diversity: variation in strategies across correct solutions to the same problem. Using a human-calibrated LLM judge framework, we show that prior diversity measures are unreliable proxies for approach-level diversity, and this mismatch carries over to diversity-aware RLVR, where target metrics are preserved while approach-level diversity declines. Investigating when approach-level diversity helps and whether it can be directly induced, we find that approach-diverse candidate sets improve test-time scaling. However, optimizing an LLM judge diversity reward during training causes the policy to exploit judge-specific preferences rather than broaden its approaches, leaving direct optimization of approach-level diversity as an open problem. Together, our work introduces the notion of approach-level diversity and uncovers a systematic divergence between surface- and approach-level signals, marking a step toward LLMs that reason in genuinely diverse, human-like ways.
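To make the distinction in the abstract concrete, here is a minimal illustrative sketch, not the paper's actual metric or judge framework. It assumes each sampled solution already carries a correctness flag and an approach label (hypothetical fields `correct` and `approach`, which in practice might come from a verifier and a judge), and it uses a simple distinct-word ratio as a stand-in for a surface-level measure. The function names and sample data are invented for illustration.

```python
def surface_diversity(texts):
    """Distinct-word ratio across all texts (a stand-in surface-level measure)."""
    tokens = [tok for text in texts for tok in text.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def approach_diversity(samples):
    """Number of distinct approach labels among correct solutions only."""
    labels = [s["approach"] for s in samples if s["correct"]]
    return len(set(labels))


# Hypothetical samples for one problem: the two correct solutions are worded
# differently but follow the same strategy.
samples = [
    {"text": "Use induction on n to prove the sum formula",
     "approach": "induction", "correct": True},
    {"text": "We argue by induction over n and obtain the identity",
     "approach": "induction", "correct": True},
    {"text": "Pair the first and last terms then count pairs",
     "approach": "pairing", "correct": False},
]

surface = surface_diversity([s["text"] for s in samples])
approaches = approach_diversity(samples)
print(f"surface distinct-word ratio: {surface:.2f}")
print(f"distinct approaches among correct solutions: {approaches}")
```

In this toy set the wording varies considerably while the correct solutions share a single strategy, which is the kind of mismatch between surface-level and approach-level signals the abstract describes.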

Community

sangmook12 (paper author, paper submitter), about 8 hours ago

When LLMs appear to give diverse math solutions, are they truly exploring different strategies—or merely rephrasing the same one? We address this question through approach-level diversity, which captures whether solutions differ in how they solve the problem, not just in how they are written.