Title: Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

URL Source: https://arxiv.org/html/2510.00915

Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

Xin-Qiang Cai¹, Wei Wang²﹐¹, Feng Liu², Tongliang Liu³, Gang Niu¹, Masashi Sugiyama¹﹐²

¹RIKEN AIP, Tokyo, Japan; ²The University of Tokyo, Tokyo, Japan; ³The University of Melbourne, Melbourne, Australia; ⁴The University of Sydney, Sydney, Australia

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to $\{0, 1\}$, but imperfect verifiers inevitably introduce false negatives (rejecting correct answers) and false positives (accepting incorrect ones). We formalize verifier unreliability as a stochastic reward channel with asymmetric noise rates $\rho_0$ and $\rho_1$, the FP rate and the FN rate, respectively. From this abstraction we derive two lightweight corrections: (i) a backward correction that yields an unbiased surrogate reward, and thus an unbiased policy-gradient estimator in expectation; and (ii) a forward correction that reweights score-function terms so that the expected update aligns with the clean gradient direction, requiring only the FN rate. We implement both as lightweight hooks in a group relative policy optimization pipeline; both corrections improve RLVR for math reasoning under synthetic and real verifier noise, with the forward variant being more stable under heavier noise. Finally, an appeals mechanism with a lightweight LLM verifier estimates the FN rate online and further improves performance.

1 Introduction

Reinforcement Learning with Verifiable Rewards (RLVR) offers a scalable paradigm for improving the reasoning abilities of Large Language Models (LLMs) by replacing expensive human annotation with automated feedback (RLVR2025). In this problem, a policy is trained using rewards from a verifier that automatically checks the correctness of a model’s output (deepseekmath). The efficacy of this approach, however, heavily depends on the verifier’s reliability. Emerging evidence reveals that verifiers are systematically fallible in two critical ways: they can accept incorrect solutions (false positives; FPs) or reject correct ones (false negatives; FNs) (xu2025tinyv; zhao2025onetoken).

FPs have been widely documented as a vulnerability of LLM-based verifiers. Recent studies showed that LLM judges can be swayed by superficial cues; for example, widely used LLM judges, including GPT-4o, exhibit FP rates of 35%–66.8% when the answer starts with "Let's solve this problem step by step" (zhao2025onetoken; shi2025positionbias; shi2024judgedeceiver; chen2024humansjudge). Conversely, FNs are common with rule-based verifiers. These checkers, while highly precise, are often brittle; they may reject valid solutions that are formatted differently, expressed in an algebraically equivalent form, or embedded in explanatory text (math-verify). A recent analysis of a math-RL dataset found that over 38% of responses flagged as incorrect by a rule-based system were in fact correct, a gap that a lightweight LLM verifier could partially close (xu2025tinyv). Both FPs and FNs materially degrade RLVR training: FNs deprive the agent of informative gradients and slow convergence, while FPs reward hackable patterns and inflate returns during policy optimization (xu2025tinyv; huang2025pitfalls).

Figure 1: Verifier-noise flow in RLVR. An AI agent produces candidate solutions that are scored by automated verifiers. Verifiers can yield false negatives (e.g., judging $\frac{12}{36}$ vs. $\frac{1}{3}$ inequivalent; rates reach 38% (xu2025tinyv)) and false positives (e.g., misled by "Let's solve it step by step…"; rates reach 35%–68% (zhao2025onetoken)), confusing the agent; applying our backward/forward corrections restores correct signals.

Motivated by this asymmetry, we address these challenges by explicitly treating verifier errors as noise in the reward signal. We model the verifier as a stochastic reward channel that corrupts the true, latent reward with verifier-conditional probabilities, the FP rate and the FN rate. Building on this formulation, we derive two estimators to counteract this noise. Our first method, noisy Policy Gradient with Backward Correction (PGBC), inverts the noise process to compute an unbiased estimator of the true reward, which can be used as a drop-in replacement in any advantage estimation in RL. Our second method, noisy Policy Gradient with Forward Correction (PGFC), directly reweights the terms of the policy gradient to ensure its expected direction aligns with the clean gradient. This approach achieves directional correctness using only an estimate of the FN rate, making it practical since the FP rate is often harder to estimate (xu2025tinyv). The pipeline is illustrated in Figure 1.

We integrate these corrections into a group relative policy optimization (GRPO) (deepseekmath)-based RLVR pipeline and demonstrate their effectiveness on competitive math reasoning benchmarks (deepseekmath). Both methods consistently outperform uncorrected training and achieve performance nearly equivalent to that of noise-free scenarios, with the forward correction offering the fastest and most stable convergence. We also propose a practical online estimation scheme for the FN rate: a querying mechanism where a low-cost LLM verifier like TinyV (xu2025tinyv) re-evaluates outputs rejected by a primary rule-based checker. This hybrid approach provides a reliable estimate of the FN rate with minimal computational overhead, recovering near-oracle performance. Our contributions are: (i) a formal verifier-channel model for RLVR that captures the prevalent FP and FN errors; (ii) backward and forward corrections derived from RL principles to calibrate policy learning; and (iii) a practical implementation for online noise estimation that makes these corrections viable in real-world systems.

2 Related Works

Reasoning with LLMs

A large body of work improves LLM reasoning through prompting and search at inference time. Chain-of-Thought (CoT) prompting elicits step-by-step rationales and boosts arithmetic and commonsense reasoning (Wei2022CoT), while self-consistency aggregates diverse reasoning paths to further improve robustness (Wang2022SelfConsistency). Decomposition-based prompting, such as Least-to-Most, solves complex problems via ordered sub-problems (Zhou2022LeastToMost); search-based schemes like Tree-of-Thoughts explicitly explore and evaluate alternative reasoning branches (Yao2023ToT). Orthogonally, training-time verifier signals can be used to re-rank candidate solutions, as in Cobbe2021GSM8K. Our work is complementary: rather than proposing a new inference-time strategy or re-ranking scheme, we address how to perform policy optimization when the reward itself—supplied by a verifier—is noisy.

RLVR and verifier reliability

Recent math-RL pipelines combine on-policy RL (e.g., GRPO) with programmatic or LLM-based verifiers to yield verifiable rewards (mroueh2025grpo). However, LLM-as-a-judge is prone to systematic biases, including position bias and prompt-based attacks, producing FPs (accepting incorrect solutions) (LLMJudgeSurvey2024; Thakur2024JudgingJudges; shi2025positionbias; zhao2025onetoken; shi2024judgedeceiver). Conversely, exact-match or brittle parsers in rule-based checkers can miss algebraic equivalences and text-formatted answers, leading to widespread FNs (Li2025VerifyBench; huang2025pitfalls). A recent study showed that over 38% of model responses labeled incorrect by a rule-based pipeline were in fact correct, and introduced a lightweight verifier (TinyV) that recovers many such cases with minimal overhead (xu2025tinyv). Prior efforts therefore improved the verifier (e.g., better judges or appeals) or evaluation suites; in contrast, we treat verifier errors as verifier-conditional noise and directly correct the policy gradient so that learning remains aligned with the clean objective even when the verifier is imperfect.

Learning with noisy labels

While our method is derived from RL, it shares intuition with the literature on learning with label noise. wang2020reinforcement modeled reward corruption via a class-conditional confusion matrix and derived an unbiased surrogate reward for Q-learning algorithms to solve control tasks. In supervised learning, a central thread is to make empirical-risk minimization robust, either by correcting the loss given a noise-transition model or by avoiding or down-weighting suspected noisy examples (song2020survey; ICML:Li+etal:2021). The former includes the unbiased-risk estimators of natarajan2013 and the now-standard loss corrections of patrini2017. A second family avoids the noise explicitly by relying on the "small-loss first" memorization dynamics of deep nets: mentor methods learn a weighting network that feeds cleaner samples to the student (jiang2018mentornet; MentorNet), Co-teaching (han2018coteaching) trains two peers that exchange their selected small-loss examples, and semi-supervised hybrids such as DivideMix (li2020dividemix) split data into clean/noisy partitions via mixture modeling and co-train with consistency regularization. Unlike these methods, we target the policy gradient estimator in RLVR: we instantiate both backward and forward corrections inside the policy gradient estimator to stabilize and align RL under verifier noise.

3 Problem Setup and Algorithms

In this section, we introduce the problem formulations and our proposed algorithms.

3.1 Problem Setup

We consider the standard RLVR setting where a stochastic policy $\pi_\theta(y \mid x)$ generates a response $y$ for a given prompt $x$, where $\theta$ denotes learnable parameters. The goal is to maximize an objective based on the true, or clean, reward $R^*(x, y) \in \{0, 1\}$, which indicates whether the response $y$ is genuinely correct. This objective is typically regularized by a KL-divergence term to a reference policy $\pi_{\mathrm{ref}}$ to maintain stability:

$$\max_\theta \; J_\beta(\theta) = \mathbb{E}_x \, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ R^*(x, y) - \beta \, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big) \right], \tag{1}$$

in which $\mathrm{KL}$ denotes the Kullback-Leibler (KL) divergence. The policy gradient with respect to the unregularized reward is given by the REINFORCE estimator (sutton1999policy), $\nabla_\theta J(\theta) = \mathbb{E}\!\left[ R^*(x, y) \sum_t G_t \right]$, where $G_t = \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})$ is the score function for the token at step $t$, and $y_{<t}$ denotes the sequence of tokens generated prior to step $t$. In the following, we will denote $R(x, y)$ by $R$ for brevity.
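As a concrete illustration, the score-function (REINFORCE) update can be sketched for a one-step softmax policy; this is our own minimal NumPy toy (the bandit-style setup, learning rate, and reward function are illustrative assumptions, not the paper's training code):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(theta, reward_fn, rng, lr=0.1):
    """One REINFORCE update: sample y ~ pi_theta, weight the score by R(y)."""
    probs = softmax(theta)
    y = rng.choice(len(theta), p=probs)
    # Score function G = grad_theta log pi_theta(y) for a softmax policy.
    score = -probs
    score[y] += 1.0
    return theta + lr * reward_fn(y) * score

rng = np.random.default_rng(0)
theta = np.zeros(3)
reward = lambda y: 1.0 if y == 2 else 0.0  # pretend action 2 is "correct"
for _ in range(2000):
    theta = reinforce_step(theta, reward, rng)
print(softmax(theta))  # probability mass concentrates on action 2
```

With the clean binary reward, the expected update points toward correct outputs; the remainder of this section studies what happens when that reward is corrupted.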

In practice, the clean reward $R^*$ is unavailable. Instead, the agent receives a noisy reward $\tilde{R}(x, y) \in \{0, 1\}$ from an automated verifier. We model this verifier as a stochastic reward channel that flips the latent clean reward to an observed noisy one. This channel is characterized by asymmetric, verifier-conditional noise rates.

Definition 1 (Verifier Reward Channel). The observed verifier reward $\tilde{R}$ is generated from the latent clean reward $R^*$ according to the following conditional probabilities:

$$\mathbb{P}(\tilde{R} = 1 \mid R^* = 0) = \rho_0, \qquad \mathbb{P}(\tilde{R} = 0 \mid R^* = 1) = \rho_1,$$

where $\rho_0 + \rho_1 < 1$.

Assumptions and scope of the verifier channel

Definition 1 is a deliberately lightweight abstraction of verifier unreliability. It assumes (i) instance-independent class-conditional noise rates $(\rho_0, \rho_1)$ that do not vary with $(x, y)$, and (ii) conditional independence: given the latent correctness $R^*(x, y)$, the verifier's stochastic flip $\tilde{R}$ is independent of the policy's sampling randomness (hence independent of the score-function terms used by policy gradients). These assumptions are standard in class-conditional noise models and allow closed-form reward/gradient corrections. In real RLVR deployments, verifier errors can be content-dependent (e.g., spurious phrases can increase FPs), which violates (i) and can partially violate (ii). We therefore view $(\rho_0, \rho_1)$ as effective noise rates, and empirically test robustness to mis-specification in Section 4.4; we further estimate $\rho_1$ online via a low-cost appeals mechanism in Section 4.3.
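The channel of Definition 1 is straightforward to simulate. The NumPy sketch below (ours, for illustration only) draws noisy rewards and checks that the empirical flip rates match the configured $\rho_0$ and $\rho_1$:

```python
import numpy as np

def verifier_channel(r_true, rho0, rho1, rng):
    """Flip clean binary rewards with class-conditional rates (Definition 1):
    P(R~=1 | R*=0) = rho0 (FP), P(R~=0 | R*=1) = rho1 (FN)."""
    u = rng.random(r_true.shape)
    return np.where(r_true == 1, u >= rho1, u < rho0).astype(int)

rng = np.random.default_rng(0)
rho0, rho1 = 0.1, 0.2
ones = np.ones(100_000, dtype=int)
zeros = np.zeros(100_000, dtype=int)
fn_rate = 1.0 - verifier_channel(ones, rho0, rho1, rng).mean()
fp_rate = verifier_channel(zeros, rho0, rho1, rng).mean()
print(fn_rate, fp_rate)  # close to 0.2 and 0.1
```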

Based on this noise model, we establish a linear relationship between the expected noisy reward and the clean reward, which is presented in Proposition 1.

Proposition 1 (Connection between Corrupted Rewards and True Rewards). Under the Verifier Reward Channel model, the expectation of the noisy reward $\tilde{R}$ conditioned on the clean reward $R^*$ is an affine transformation of $R^*$:

$$\mathbb{E}[\tilde{R} \mid R^*] = (1 - \rho_0 - \rho_1) R^* + \rho_0.$$

The proof is provided in Appendix C.1. The central challenge of RLVR is that naively optimizing with the noisy reward $\tilde{R}$ leads to a biased policy gradient, causing the policy to learn from the verifier's mistakes. Our goal is to develop policy gradient estimators that use only the observable noisy reward $\tilde{R}$, yet whose expectation is either identical or parallel to the true policy gradient $\nabla_\theta J(\theta)$. We integrate these estimators within REINFORCE-style policy gradient algorithms widely used for reasoning tasks, which compute normalized advantages over a group of sampled trajectories (deepseekmath).

3.2 Noisy Policy Gradient with Backward Correction (PGBC)

Our first proposed approach constructs an unbiased estimator of the true reward $R^*$ by "inverting" the noise process. Since the expected noisy reward $\mathbb{E}[\tilde{R} \mid R^*]$ is an affine transformation of the clean reward $R^*$ (Proposition 1), we can solve for $R^*$ to derive a corrected reward estimator, $\hat{R}$, that is unbiased in expectation.

Theorem 1 (Unbiased Reward Estimator). Given the verifier channel with known noise rates $\rho_0$ and $\rho_1$ with $\rho_0 + \rho_1 < 1$, the estimator

$$\hat{R} = \frac{\tilde{R} - \rho_0}{1 - \rho_0 - \rho_1} \tag{2}$$

is an unbiased estimator of the true reward $R^*(x, y)$, i.e., $\mathbb{E}[\hat{R}] = R^*$.

The proof is provided in Appendix C.2. Theorem 1 shows that, by replacing the noisy reward 𝑅 ~ with 𝑅 ^ in any standard policy gradient formulation, we obtain an unbiased estimate of the true policy gradient. This allows 𝑅 ^ to serve as a drop-in replacement for the reward signal in complex RL algorithms, seamlessly integrating with advantage estimation and other machinery. The details of the algorithm are summarized in Algorithm 1. After rolling out a set of trajectories and obtaining their rewards, we use the corrected rewards to compute the policy gradients for model update.
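As a sanity check, the backward correction of Theorem 1 can be exercised on simulated channel outputs. This NumPy sketch is ours, not the released code; it verifies that the grouped means of $\hat{R}$ recover the clean rewards:

```python
import numpy as np

def backward_corrected_reward(r_noisy, rho0, rho1):
    """Theorem 1: R_hat = (R~ - rho0) / (1 - rho0 - rho1), unbiased for R*."""
    denom = 1.0 - rho0 - rho1
    if denom <= 0:
        raise ValueError("requires rho0 + rho1 < 1")
    return (r_noisy - rho0) / denom

rng = np.random.default_rng(1)
rho0, rho1 = 0.1, 0.2
r_true = rng.integers(0, 2, size=200_000)  # latent clean rewards
u = rng.random(r_true.shape)
r_noisy = np.where(r_true == 1, u >= rho1, u < rho0).astype(float)
r_hat = backward_corrected_reward(r_noisy, rho0, rho1)
# E[R_hat | R*] = R*, so the grouped means should be close to 1 and 0.
print(r_hat[r_true == 1].mean(), r_hat[r_true == 0].mean())
```

Note that $\hat{R}$ is no longer binary; it takes values $\frac{1-\rho_0}{1-\rho_0-\rho_1}$ and $\frac{-\rho_0}{1-\rho_0-\rho_1}$, which is what makes it a drop-in replacement at the advantage-construction stage.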

Algorithm 1: Noisy Policy Gradient with Backward Correction (PGBC)

1. Input: initial policy $\theta_0$; learning rate $\eta$; batch size $M$; estimates of noise rates $(\hat{\rho}_0, \hat{\rho}_1)$.
2. loop
3. Roll out $M$ trajectories $\{(x_i, y_i)\}_{i=1}^{M}$, obtaining observed rewards $\{\tilde{R}_i\}_{i=1}^{M}$.
4. For each trajectory $i$, compute the unbiased reward estimate: $\hat{R}_i \leftarrow \frac{\tilde{R}_i - \hat{\rho}_0}{1 - \hat{\rho}_0 - \hat{\rho}_1}$.
5. Compute the policy gradient using the corrected rewards $\{\hat{R}_i\}_{i=1}^{M}$.
6. Update parameters: $\theta \leftarrow \theta + \eta \, \Delta\theta$.
7. end loop

3.3 Noisy Policy Gradient with Forward Correction (PGFC)

While PGBC provides an elegant unbiased estimator, it faces two practical challenges. First, the denominator ( 1 − 𝜌 0 − 𝜌 1 ) can be small if the total noise rate is high, leading to high variance in the reward estimate 𝑅 ^ and potentially unstable training. Second, it requires accurate estimation of both the false positive rate 𝜌 0 and the false negative rate 𝜌 1 , which are usually difficult to obtain in practice. The PGFC approach is introduced to mitigate these issues.

Instead of correcting the reward itself, this method directly modifies the policy gradient estimator to ensure its expectation is correctly aligned. The core idea is to reweight the score function $G_t$ of the $t$-th token based on the observed noisy reward $\tilde{R}$. We define an update term $h_t = w_{\tilde{R}} \, G_t$ and choose the forward weights $w_0$ (for $\tilde{R} = 0$) and $w_1$ (for $\tilde{R} = 1$) such that the expected update $\mathbb{E}[h_t]$ is parallel to the clean gradient $\nabla_\theta J(\theta)$. The key insight is to choose the forward weights so that the conditional expectations given the true reward $R^*$ have a specific structure.

Proposition 2 (Conditional Expectation of Forward Weights). Let the forward weights be defined as $w_0 = \rho_1 - 1$ and $w_1 = \rho_1$. Under the Verifier Reward Channel model, the conditional expectations of the corresponding weight $w_{\tilde{R}}$ given the true reward $R^*$ are:

1. $\mathbb{E}[w_{\tilde{R}} \mid R^* = 1] = 0$.
2. $\mathbb{E}[w_{\tilde{R}} \mid R^* = 0] = -(1 - \rho_0 - \rho_1)$.

The proof can be found in Appendix C.3. Proposition 2 reveals that, in expectation, the forward weights are chosen so that their conditional expectation vanishes when $R^* = 1$, i.e., truly positive cases do not require correction. Consequently, the explicit contribution to the correction term arises from samples with $R^* = 0$. This structure allows us to recover the correct gradient direction, as shown in the following theorem.

Theorem 2 (Policy Gradient Correction with Only $\rho_1$). Let the gradient-update term be $\Delta\theta = \frac{1}{M} \sum w_{\tilde{R}} \, G_t$, where $w_0 = \rho_1 - 1$ and $w_1 = \rho_1$, with $M$ trajectories. Under the Verifier Reward Channel model, the expected update is parallel to the clean policy gradient:

$$\mathbb{E}[\Delta\theta] = c \, \nabla_\theta J(\theta), \tag{3}$$

in which $c = 1 - \rho_0 - \rho_1$.

The proof is given in Appendix C.4. Since $1 - \rho_0 - \rho_1 > 0$, Theorem 2 guarantees that the expected update of the policy is in the same direction as the true gradient; meanwhile, the positive scaling factor $(1 - \rho_0 - \rho_1)$ can be absorbed into the learning rate. By avoiding the inverse operation of PGBC, PGFC circumvents the variance-inflation problem. Furthermore, its reliance solely on the FN rate makes it more practical, as the FN rate is often the more dominant and more easily estimable error source in RLVR with rule-based verifiers (xu2025tinyv).
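The structure of Proposition 2 can likewise be checked numerically. In this sketch (our own toy, with the running rates $\rho_0 = 0.1$, $\rho_1 = 0.2$), the conditional means of the forward weights come out as $0$ and $-(1 - \rho_0 - \rho_1) = -0.7$:

```python
import numpy as np

def forward_weight(r_noisy, rho1_hat):
    """Theorem 2 weights: w = rho1 - 1 if R~ = 0, else w = rho1."""
    return np.where(r_noisy == 0, rho1_hat - 1.0, rho1_hat)

rng = np.random.default_rng(2)
rho0, rho1 = 0.1, 0.2
r_true = rng.integers(0, 2, size=200_000)
u = rng.random(r_true.shape)
r_noisy = np.where(r_true == 1, u >= rho1, u < rho0).astype(int)
w = forward_weight(r_noisy, rho1)
# Proposition 2: E[w | R*=1] = 0 and E[w | R*=0] = -(1 - rho0 - rho1) = -0.7
print(w[r_true == 1].mean(), w[r_true == 0].mean())
```

Because the weights stay bounded in $[\rho_1 - 1, \rho_1]$, no division by a small denominator ever occurs, which is the source of PGFC's variance advantage.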

Algorithm 2: Noisy Policy Gradient with Forward Correction (PGFC)

1. Input: initial policy $\theta_0$; learning rate $\eta$; batch size $M$; an estimate of the false negative rate, $\hat{\rho}_1$.
2. loop
3. Roll out $M$ trajectories and observe rewards $\{\tilde{R}_i\}_{i=1}^{M}$.
4. For each trajectory $i$, define the weight based on the observed reward $\tilde{R}_i$: $w_{\tilde{R}_i} \leftarrow \hat{\rho}_1 - 1$ if $\tilde{R}_i = 0$, and $w_{\tilde{R}_i} \leftarrow \hat{\rho}_1$ if $\tilde{R}_i = 1$.
5. For each trajectory $i$, compute the score-function term $G_i$ and the weighted term $h_i \leftarrow w_{\tilde{R}_i} G_i$.
6. Form the policy gradient estimate: $\hat{g} \leftarrow \frac{1}{M} \sum_{i=1}^{M} h_i$.
7. Update parameters: $\theta \leftarrow \theta + \eta \, \hat{g}$.
8. end loop

Figure 2: Synthetic-noise results (Pass@1) with 16 samples and 5 random seeds on the four backbones: (a) Qwen2.5-Math-1.5B, (b) DeepSeek-R1-Distill-Qwen-1.5B, (c) Llama-3.2-3B-Instruct, (d) Qwen2.5-Math-7B. Base: baseline without RL; Oracle: training with clean rewards; Noise: training with noisy verifier rewards; Noise_BC: training with noise under backward correction; Noise_FC: training with noise under forward correction.

In practice, we estimate $\hat{\rho}_1$ during training by appealing a small, uniformly random subset of rule-based negatives to a lightweight LLM verifier (e.g., TinyV (xu2025tinyv)) and then smoothing the empirical flip rate. Concretely, at step $t$, let $\mathcal{N}_R^{(t)}$ be the items labeled negative by the rule-based verifier and $\mathcal{P}_R^{(t)}$ the rule-based positives; we sample a fraction $q \in (0, 1]$ of $\mathcal{N}_R^{(t)}$ for appeal, and denote by $\mathcal{P}_L^{(t)}$ those appealed items that the LLM flips to positive (i.e., rule-negative and LLM-positive). Using a Horvitz–Thompson correction (karwa2023admissibility), we estimate the FN count as $|\mathcal{P}_L^{(t)}| / q$ and take the true positive (TP) count as $|\mathcal{P}_R^{(t)}|$ (since the rule-based FP rate should be 0). We then set

$$\hat{\rho}_1^{(t)} = \frac{|\mathcal{P}_L^{(t)}| / q + \alpha}{|\mathcal{P}_L^{(t)}| / q + |\mathcal{P}_R^{(t)}| + \alpha + \beta}$$

with small Beta-prior pseudocounts $\alpha, \beta$ (we use $10^{-5}$), and apply exponential moving average (EMA) smoothing over a sliding window. If both $\hat{\rho}_0$ and $\hat{\rho}_1$ are available, one can use PGBC; if only $\hat{\rho}_1$ is available, use PGFC. The code can be found at https://github.com/caixq1996/noisy-RLVR. More details of the appeals mechanism can be found in Appendix D.4.
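A minimal sketch of this online estimator follows; the class and parameter names are our own, and the released code may organize this differently:

```python
class FNRateEstimator:
    """Online FN-rate estimate from appealed rule-based negatives (sketch).

    Assumptions: a fraction q of rule-negatives is appealed to an LLM verifier;
    flips to positive are Horvitz-Thompson upweighted by 1/q; rule-based FPs
    are taken to be zero; alpha/beta are small Beta-prior pseudocounts; the
    per-step estimate is smoothed with an EMA.
    """
    def __init__(self, q=0.25, alpha=1e-5, beta=1e-5, ema=0.9):
        self.q, self.alpha, self.beta, self.ema = q, alpha, beta, ema
        self.rho1_hat = 0.0

    def update(self, n_flipped, n_rule_pos):
        fn_count = n_flipped / self.q  # Horvitz-Thompson corrected FN count
        step = (fn_count + self.alpha) / (
            fn_count + n_rule_pos + self.alpha + self.beta)
        self.rho1_hat = self.ema * self.rho1_hat + (1 - self.ema) * step
        return self.rho1_hat

est = FNRateEstimator(q=0.5)
# e.g., 5 of the appealed negatives flipped positive, 40 rule-based positives:
print(est.update(n_flipped=5, n_rule_pos=40))  # EMA step toward 10/(10+40)=0.2
```

The EMA keeps $\hat{\rho}_1$ from jumping on any single batch while still tracking drift as the policy's error profile changes during training.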

Bridging REINFORCE-Style Corrections to GRPO Updates

Our core analysis (Theorems 1 and 2) is stated for REINFORCE-style policy gradients. In experiments, we implement the corrections inside GRPO (deepseekmath), which forms group-normalized advantages and then applies a PPO-style clipped surrogate with an explicit KL penalty. Further connections and implementation details are included in Appendices A and B.

4 Experiments

We evaluate our approach under both synthetic and real-world verifier noise. We first spell out the experimental protocol, and then present main and ablation results.

Figure 3: Synthetic-noise results (Pass@8) with 16 samples and 5 random seeds on the four backbones: (a) Qwen2.5-Math-1.5B, (b) DeepSeek-R1-Distill-Qwen-1.5B, (c) Llama-3.2-3B-Instruct, (d) Qwen2.5-Math-7B. Base: baseline without RL; Oracle: training with clean rewards; Noise: training with noisy verifier rewards; Noise_BC: training with noise under backward correction; Noise_FC: training with noise under forward correction.

4.1 Experimental Setup

We train on three small backbones, Qwen2.5-Math-1.5B, DeepSeek-R1-Distill-Qwen-1.5B, and Llama-3.2-3B-Instruct, and probe scale with Qwen2.5-Math-7B. Rewards come from either (i) a rule-based checker that extracts the final \boxed{\cdot} answer and tests numeric/rational equivalence, or (ii) a lightweight LLM verifier (TinyV 1.5B) used for appeals/estimation; prior work motivates explicit noise modeling due to systematic FP/FN behavior in these verifiers. Unless stated otherwise, we follow the DeepScaleR corpus and a GRPO-style on-policy recipe with BoN sampling, implementing our backward and forward corrections as drop-in hooks at the advantage-construction stage within VERL. Evaluation uses six verifiable math suites, AIME-2024, AIME-2025, AMC-2023, MATH500, Minerva MATH, and OlympiadBench, reporting Pass@1 and Pass@8 with 16 samples, averaged over 5 random seeds. Compute is a server with 8 A100 (40 GB) GPUs; unless noted, KL schedules, sampling temperatures, and other rollout settings are kept identical across compared conditions. More details can be found in Appendix D.

4.2 Synthetic Noise

To disentangle optimization effects from verifier unreliability, we inject verifier-conditional noise into the binary reward stream during training. Concretely, when the latent clean reward is $R^* \in \{0, 1\}$, the observed reward $\tilde{R}$ is drawn from a reward channel with $\Pr(\tilde{R} = 1 \mid R^* = 0) = \rho_0$ and $\Pr(\tilde{R} = 0 \mid R^* = 1) = \rho_1$; unless stated otherwise we use $\rho_0 = 0.1$ and $\rho_1 = 0.2$. We train GRPO on DeepScaleR with identical sampling and KL settings across conditions, and evaluate Pass@1 (16 samples) and Pass@8 on AIME-2024/2025, AMC-2023, MATH500, Minerva MATH, and OlympiadBench. We compare five variants: Base (no RL), Oracle (clean rewards), Noise (uncorrected), and our two corrections, PGBC (backward correction using $\hat{R} = \frac{\tilde{R} - \rho_0}{1 - \rho_0 - \rho_1}$) and PGFC (forward correction that rescales the gradient using $\rho_1$ only).

As illustrated in Figures 2 and 3, across models and benchmarks, the injected noise degrades uncorrected GRPO substantially, while both corrections recover most of the gap to the oracle. On DeepSeek-R1-Distill-Qwen-1.5B, uncorrected training under noise underperforms the oracle across all tasks (e.g., a noticeably lower Average), whereas PGBC and PGFC nearly match oracle performance; PGFC is consistently as good as or slightly better than PGBC, echoing its variance advantage from avoiding division by $(1 - \rho_0 - \rho_1)$. On Qwen2.5-Math-1.5B and DeepSeek-R1-Distill-Qwen-1.5B, as well as the larger Qwen2.5-Math-7B, we observe the same pattern: noise hurts, and PGBC/PGFC close the gap to the Oracle.

4.3 Real-World Noise

As discussed above, automated verifiers exhibit both false positives (LLM judges over-crediting incorrect solutions) and false negatives (rule-based checkers rejecting correct ones) (LLMJudgeSurvey2024; xu2025tinyv; huang2025pitfalls; Li2025VerifyBench). In math RL pipelines, the latter is particularly prevalent: brittle exact-match or limited equivalence rules lead to many valid answers being scored as incorrect, depriving the agent of learning signal. Motivated by this, in this subsection we specifically investigate whether denoising false negatives, i.e., estimating $\hat{\rho}_1$ as in Section 3.3, can measurably improve policy optimization. Concretely, we keep a fast rule-based checker as the primary reward source and, whenever it returns $\tilde{R} = 0$, trigger an appeals pass with a lightweight LLM verifier (TinyV) (xu2025tinyv). From disagreements on negatives we maintain an online estimate $\hat{\rho}_1$ (EMA over a sliding window to track policy drift).

Results in Table 1 and Table 2 show a consistent pattern across backbones and benchmarks. Using an LLM as the reward source (LV) underperforms the rule-based pipeline, corroborating prior observations that LLM judges are bias-prone and gameable (LLMJudgeSurvey2024; shi2025positionbias). Employing the LLM as an add-on to recheck rule-based negatives (Adds on) reduces FNs and yields a modest but reliable lift, yet the benefit remains indirect because the primary gradient is still driven by noisy binary rewards. In contrast, our forward correction (PGFC) directly applies the FN correction to the policy gradient via weights $w_{\tilde{R}} \in \{\hat{\rho}_1 - 1, \hat{\rho}_1\}$, and it delivers the strongest and most stable improvements.

Table 1: Mean and standard deviation (Pass@1) with 16 samples and 5 random seeds under real-world noise. Rule: rule-based rewards; LV: direct LLM-judge rewards; Adds on: rule-based reward plus LLM appeals on negative samples (no gradient correction); PGFC: forward correction using online $\hat{\rho}_1$.

| Method | AIME2024 | AIME2025 | AMC2023 | MATH500 | Minerva MATH | OlympiadBench | Average |
|---|---|---|---|---|---|---|---|
| **Qwen2.5-Math-1.5B** | | | | | | | |
| Rule | 15.0 ± 0.4 | 5.6 ± 0.6 | 50.3 ± 0.6 | 69.4 ± 0.4 | 17.8 ± 0.6 | 31.6 ± 0.0 | 31.6 ± 0.4 |
| LV | 10.9 ± 1.3 | 4.7 ± 1.0 | 42.1 ± 1.8 | 63.0 ± 0.7 | 15.9 ± 0.7 | 25.3 ± 0.5 | 27.0 ± 1.0 |
| Adds on | 11.9 ± 0.6 | 5.8 ± 1.2 | 47.8 ± 1.2 | 68.3 ± 0.1 | 16.7 ± 0.6 | 29.8 ± 0.1 | 30.1 ± 0.6 |
| PGFC (Ours) | 20.3 ± 0.0 | 10.7 ± 1.7 | 53.3 ± 1.4 | 68.6 ± 0.8 | 16.5 ± 0.4 | 32.9 ± 0.2 | 33.7 ± 0.8 |
| **DeepSeek-R1-Distill-Qwen-1.5B** | | | | | | | |
| Base | 9.0 ± 0.6 | 9.4 ± 0.6 | 41.4 ± 1.4 | 61.1 ± 0.1 | 10.5 ± 0.6 | 22.9 ± 0.4 | 25.7 ± 0.6 |
| Rule | 19.0 ± 1.0 | 17.5 ± 0.8 | 65.6 ± 2.5 | 77.6 ± 0.4 | 19.9 ± 0.0 | 38.5 ± 0.3 | 39.7 ± 0.8 |
| LV | 11.9 ± 1.0 | 12.7 ± 1.7 | 52.3 ± 1.6 | 69.8 ± 0.8 | 14.2 ± 1.0 | 31.9 ± 0.5 | 32.1 ± 1.1 |
| Adds on | 21.7 ± 0.4 | 17.3 ± 0.6 | 66.2 ± 0.3 | 77.4 ± 0.0 | 20.0 ± 0.6 | 37.9 ± 0.1 | 40.1 ± 0.3 |
| PGFC (Ours) | 23.2 ± 1.2 | 22.5 ± 0.8 | 70.7 ± 0.6 | 78.2 ± 0.0 | 19.4 ± 0.4 | 41.0 ± 0.4 | 42.5 ± 0.6 |
| **Llama-3.2-3B-Instruct** | | | | | | | |
| Base | 5.7 ± 1.2 | 0.6 ± 0.4 | 17.2 ± 1.5 | 34.8 ± 0.7 | 4.8 ± 0.0 | 12.7 ± 0.2 | 12.6 ± 0.7 |
| Rule | 6.0 ± 1.0 | 1.0 ± 0.4 | 24.3 ± 1.7 | 40.4 ± 0.5 | 9.2 ± 0.0 | 15.0 ± 0.5 | 16.0 ± 0.7 |
| LV | 1.4 ± 1.1 | 0.1 ± 0.3 | 12.7 ± 2.1 | 26.4 ± 0.2 | 3.7 ± 0.0 | 7.7 ± 0.4 | 8.7 ± 0.7 |
| Adds on | 3.8 ± 1.4 | 0.6 ± 0.7 | 16.9 ± 1.8 | 40.8 ± 0.3 | 8.8 ± 0.0 | 13.7 ± 0.3 | 14.1 ± 0.7 |
| PGFC (Ours) | 9.4 ± 1.9 | 0.4 ± 0.6 | 21.2 ± 1.3 | 46.4 ± 0.3 | 9.5 ± 0.2 | 16.3 ± 0.3 | 17.2 ± 0.8 |
| **Qwen2.5-Math-7B** | | | | | | | |
| Base | 12.7 ± 0.9 | 5.8 ± 0.7 | 44.4 ± 2.2 | 52.0 ± 0.4 | 9.8 ± 0.7 | 26.4 ± 0.4 | 25.2 ± 0.9 |
| Rule | 29.2 ± 2.5 | 13.5 ± 1.2 | 62.8 ± 1.9 | 78.9 ± 0.5 | 24.2 ± 0.7 | 37.8 ± 0.5 | 41.1 ± 1.2 |
| LV | 16.8 ± 2.1 | 6.8 ± 1.4 | 50.2 ± 1.8 | 62.5 ± 1.0 | 10.1 ± 0.4 | 31.0 ± 0.9 | 29.6 ± 1.3 |
| Adds on | 27.4 ± 1.9 | 11.8 ± 1.9 | 63.7 ± 1.6 | 74.9 ± 0.7 | 20.6 ± 0.8 | 37.6 ± 0.3 | 39.3 ± 1.2 |
| PGFC (Ours) | 31.0 ± 1.5 | 14.6 ± 1.2 | 65.7 ± 1.4 | 81.6 ± 0.4 | 26.2 ± 0.5 | 39.3 ± 0.5 | 43.1 ± 0.9 |

Table 2: Mean and standard deviation (Pass@8) with 16 samples and 5 random seeds under real-world noise.

| Method | AIME2024 | AIME2025 | AMC2023 | MATH500 | Minerva MATH | OlympiadBench | Average |
|---|---|---|---|---|---|---|---|
| **Qwen2.5-Math-1.5B** | | | | | | | |
| Base | 32.4 ± 0.5 | 16.7 ± 0.7 | 79.3 ± 1.4 | 62.8 ± 1.3 | 15.5 ± 1.1 | 31.3 ± 0.8 | 39.7 ± 1.0 |
| Rule | 34.3 ± 1.2 | 19.9 ± 1.1 | 80.7 ± 1.4 | 66.9 ± 0.5 | 17.6 ± 0.6 | 32.8 ± 1.2 | 42.0 ± 1.0 |
| LV | 29.6 ± 1.1 | 15.4 ± 0.4 | 80.0 ± 0.6 | 63.4 ± 1.2 | 16.2 ± 0.8 | 28.6 ± 0.3 | 38.9 ± 0.7 |
| Adds on | 30.5 ± 0.2 | 20.5 ± 0.7 | 80.7 ± 0.9 | 68.9 ± 1.3 | 17.6 ± 1.0 | 30.5 ± 0.5 | 41.4 ± 0.8 |
| PGFC (Ours) | 31.0 ± 0.2 | 20.0 ± 0.6 | 82.2 ± 0.4 | 69.8 ± 1.1 | 18.2 ± 0.5 | 33.3 ± 0.3 | 42.4 ± 0.5 |
| **DeepSeek-R1-Distill-Qwen-1.5B** | | | | | | | |
| Base | 28.7 ± 1.0 | 23.3 ± 0.8 | 77.5 ± 0.2 | 67.6 ± 1.3 | 14.0 ± 0.3 | 29.8 ± 0.4 | 40.1 ± 0.7 |
| Rule | 32.9 ± 0.3 | 25.7 ± 1.2 | 79.3 ± 0.7 | 71.2 ± 1.3 | 15.5 ± 0.3 | 34.7 ± 1.4 | 43.2 ± 0.9 |
| LV | 29.2 ± 1.3 | 28.3 ± 0.6 | 81.2 ± 1.2 | 68.8 ± 0.3 | 14.7 ± 1.0 | 32.0 ± 1.4 | 42.4 ± 1.0 |
| Adds on | 33.8 ± 1.5 | 28.7 ± 1.1 | 80.0 ± 1.5 | 70.8 ± 1.2 | 15.8 ± 1.3 | 34.5 ± 1.4 | 43.9 ± 1.3 |
| PGFC (Ours) | 36.2 ± 0.5 | 27.5 ± 1.1 | 81.2 ± 0.4 | 71.4 ± 0.2 | 15.8 ± 1.0 | 34.4 ± 0.4 | 44.4 ± 0.6 |
| **Llama-3.2-3B-Instruct** | | | | | | | |
| Base | 18.8 ± 1.1 | 3.3 ± 0.2 | 54.7 ± 1.4 | 40.8 ± 0.8 | 4.8 ± 1.0 | 16.4 ± 0.5 | 23.1 ± 0.8 |
| Rule | 22.4 ± 1.0 | 6.7 ± 1.4 | 50.0 ± 1.3 | 44.8 ± 0.6 | 9.2 ± 0.7 | 18.7 ± 1.5 | 25.3 ± 1.1 |
| LV | 10.0 ± 0.9 | 0.8 ± 0.9 | 44.4 ± 1.4 | 27.2 ± 0.9 | 3.7 ± 1.2 | 9.5 ± 0.2 | 15.9 ± 0.9 |
| Adds on | 17.5 ± 0.2 | 4.2 ± 1.0 | 53.8 ± 0.2 | 42.3 ± 1.5 | 8.8 ± 1.3 | 14.9 ± 1.2 | 23.6 ± 0.9 |
| PGFC (Ours) | 20.8 ± 0.9 | 3.3 ± 1.5 | 58.1 ± 1.3 | 47.6 ± 0.6 | 9.6 ± 1.4 | 18.5 ± 0.6 | 26.3 ± 1.0 |
| **Qwen2.5-Math-7B** | | | | | | | |
| Base | 36.2 ± 1.2 | 23.3 ± 1.0 | 81.9 ± 1.3 | 66.6 ± 0.5 | 11.8 ± 0.6 | 33.9 ± 0.7 | 42.3 ± 0.9 |
| Rule | 50.0 ± 0.1 | 28.7 ± 0.8 | 83.4 ± 0.6 | 82.8 ± 0.5 | 29.0 ± 1.1 | 45.8 ± 1.3 | 53.3 ± 0.7 |
| LV | 41.7 ± 0.3 | 22.1 ± 0.3 | 81.9 ± 1.5 | 69.0 ± 0.2 | 13.2 ± 1.2 | 39.4 ± 1.1 | 44.6 ± 0.8 |
| Adds on | 47.1 ± 1.0 | 30.4 ± 0.3 | 84.4 ± 1.2 | 80.8 ± 0.3 | 23.5 ± 1.3 | 45.6 ± 0.5 | 52.0 ± 0.8 |
| PGFC (Ours) | 54.6 ± 1.4 | 30.4 ± 1.2 | 82.8 ± 0.2 | 83.2 ± 1.3 | 29.0 ± 0.6 | 47.6 ± 0.9 | 54.6 ± 0.9 |

4.4 Robustness to Noise Mis-Specification

There now exist practical procedures to estimate verifier FP/FN rates via rule-based equivalence checkers (e.g., Math-Verify), lightweight appeals/judges such as TinyV, and meta-evaluation suites that quantify verifier reliability (math-verify; xu2025tinyv; Li2025VerifyBench). In real deployments, however, these estimates can be imperfect. We therefore test how backward and forward corrections behave under mis-specified noise rates. We follow the synthetic-noise setting from Section 4.2 with Qwen2.5-Math-1.5B: rewards are corrupted at data-time with $\rho_0 = 0.1$, $\rho_1 = 0.2$. During training, we intentionally feed each algorithm mis-specified rates from a grid $\hat{\rho}_0 \in [0.0, 0.2]$, $\hat{\rho}_1 \in [0.1, 0.3]$, and report the average Pass@1 (16 samples) across our six benchmarks. Because the forward method only requires the false negative rate, we vary $\hat{\rho}_1$ for forward correction while keeping $\hat{\rho}_0$ unused, as shown in Figure 4.

Figure 4: Robustness results. (a) Backward correction (BC) with $\hat{\rho}_0 = 0.1$ fixed, sweeping $\hat{\rho}_1 \in [0.1, 0.3]$; (b) backward correction (BC) with $\hat{\rho}_1 = 0.2$ fixed, sweeping $\hat{\rho}_0 \in [0.0, 0.2]$; (c) forward correction (FC) with $\hat{\rho}_0 = 0.1$ fixed, sweeping $\hat{\rho}_1 \in [0.1, 0.3]$.

Backward correction remains strong when the total noise is underestimated, but performance degrades as we overestimate the rates (i.e., as $1 - \hat{\rho}_0 - \hat{\rho}_1$ shrinks). Intuitively, the unbiased de-biasing, $\hat{R} = \frac{\tilde{R} - \hat{\rho}_0}{1 - \hat{\rho}_0 - \hat{\rho}_1}$, involves an inverse factor that amplifies reward variance and accentuates estimation error when the effective denominator becomes small. In contrast, forward correction exhibits a notably flat response curve around the true $\rho_1$; perturbing $\hat{\rho}_1$ changes only the per-sample weights and avoids any inverse operation, yielding lower-variance policy gradient estimates.
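This variance argument can be made concrete with a small Monte-Carlo sketch (our own toy; the base rate of correct answers, `p_correct`, is an assumed quantity, not from the paper):

```python
import numpy as np

def backward_variance(rho0, rho1, p_correct=0.5, n=200_000):
    """Monte-Carlo variance of the backward-corrected reward R_hat
    when the true channel matches the assumed (rho0, rho1)."""
    rng = np.random.default_rng(3)
    r_true = (rng.random(n) < p_correct).astype(int)
    u = rng.random(n)
    r_noisy = np.where(r_true == 1, u >= rho1, u < rho0).astype(float)
    r_hat = (r_noisy - rho0) / (1.0 - rho0 - rho1)
    return r_hat.var()

# Variance grows as the denominator 1 - rho0 - rho1 shrinks,
# while the forward weights stay bounded in [rho1 - 1, rho1].
for rho0, rho1 in [(0.1, 0.2), (0.2, 0.4), (0.3, 0.6)]:
    print(round(1 - rho0 - rho1, 1), round(backward_variance(rho0, rho1), 2))
```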

4.5 Case Studies

We also provide typical failure modes observed during math RL with automated verifiers on Qwen2.5-Math-7B on the MATH500 dataset. For the rule-based verifier: 52 FNs and 0 FPs out of 500 items (marginal rates 10.4% and 0%); for the LLM verifier: 81 FNs and 168 FPs out of 500 items (marginal rates 16.2% and 33.6%). The FP case shows an LLM judge over-crediting a malformed solution, while the FN case shows a rule-based checker rejecting a numerically correct answer due to formatting. The raw context can be found in Appendix E.

False Positive (FP). Problem: Compute $1 - 2 + 3 - 4 + \cdots + 99 - 100$. Ground truth: $-50$.

Model response (excerpt):

Compute: $1+2-3-4+5+6-7-8+\dots+97+98-99-100$. \boxed{ } \boxed{ } \boxed{ } \dots

Verifier signals: rule-based checker ✗ (incorrect); LLM judge ✓ (incorrectly accepts) ⇒ FP.

{examplebox}

False Negative (FN). Problem. Each of seven islands independently has treasure with probability 1/5. What is the probability that exactly four islands have treasure? Correct answer. 448/15625 (equals 0.028672). Model answer. 0.028672

Verifier signals. Rule-based exact-match: ✗ (expects a fraction); numeric equivalence check: ✓ (values match)   ⇒  FN.
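The FN above stems from comparing a decimal against a fraction. A minimal sketch of the two checking strategies (hypothetical helpers, not the paper's verifier code) shows why exact string matching fails while a value-level comparison succeeds:

```python
from fractions import Fraction

def exact_match(pred: str, truth: str) -> bool:
    # Brittle rule: raw string equality fails on representation differences.
    return pred.strip() == truth.strip()

def numeric_equiv(pred: str, truth: str, tol: float = 1e-9) -> bool:
    # Parse either a fraction like "448/15625" or a decimal, then compare values.
    def to_float(s: str) -> float:
        s = s.strip()
        if "/" in s:
            num, den = s.split("/")
            return float(Fraction(int(num), int(den)))
        return float(s)
    try:
        return abs(to_float(pred) - to_float(truth)) <= tol
    except ValueError:
        return False

# The FN case from the paper: 448/15625 equals 0.028672 exactly.
print(exact_match("0.028672", "448/15625"))    # False -> rule-based FN
print(numeric_equiv("0.028672", "448/15625"))  # True  -> values match
```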

5Conclusion

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a practical and popular paradigm for training reasoning-focused large language models (LLMs). Yet binarized rewards inevitably suffer from false positives (FPs) and false negatives (FNs). Prior efforts largely mitigated these errors from the verifier side, e.g., by debiasing LLM judges or strengthening equivalence checking, to reduce the downstream impact on learning (Li2025VerifyBench; huang2025pitfalls). In contrast, we modeled verifier unreliability as a verifier-conditional noise channel with FP and FN rates, and introduced two theory-driven corrections that act directly at the policy-learning interface. The backward correction de-biases the observed reward to yield an unbiased gradient estimator; it requires estimates of both the FP and FN rates but is agnostic to the choice of policy-optimization algorithm and can thus be used beyond policy-gradient methods. The forward correction rescales score-function terms so that the expected update is aligned with the clean gradient; notably, it needs only an FN-rate estimate, and FNs are often the dominant source of real-world noise due to rule-based checkers (huang2025pitfalls; xu2025tinyv).

While we instantiated appeals with a lightweight LLM verifier for efficiency, the same backward/forward correction mechanisms should also benefit from stronger verifiers (xu2025tinyv). Moreover, even with constant noise rates, the proposed methods already show clear advantages, which suggests that constant rates approximate real-world noise reasonably well. Real-world noise can nonetheless be far more complex, depending on both the content and the verifier, leaving interesting directions for future work.

Appendix A From REINFORCE-Style Corrections to GRPO Updates

Setup (single prompt and group)

Fix a prompt 𝑥 and sample a group of 𝐾 i.i.d. responses 𝑦 1 , … , 𝑦 𝐾 ∼ 𝜋 𝜃 ( ⋅ ∣ 𝑥 ) . Let 𝐺 𝑖 ≜ ∑_{𝑡=1}^{|𝑦 𝑖|} ∇ 𝜃 log 𝜋 𝜃 ( 𝑦 𝑖 , 𝑡 ∣ 𝑥 , 𝑦 𝑖 , < 𝑡 ) denote the trajectory score function. Let 𝑢 𝑖 be any scalar proxy reward computed from the trajectory (e.g., 𝑢 𝑖 = 𝑟 ^ 𝑖 for PGBC, or 𝑢 𝑖 = 𝑤 𝑟 ~ 𝑖 for PGFC), and define the group mean 𝑢 ¯ = (1/𝐾) ∑_{𝑗=1}^{𝐾} 𝑢 𝑗 . Consider the centered score-function update

Δ center ( 𝜃 ) ≜ (1/𝐾) ∑_{𝑖=1}^{𝐾} ( 𝑢 𝑖 − 𝑢 ¯ ) 𝐺 𝑖 . (4)

Proposition 3 (Group centering preserves expected direction).

Assume { ( 𝑢 𝑖 , 𝐺 𝑖 ) }_{𝑖=1}^{𝐾} are i.i.d. under 𝜋 𝜃 ( ⋅ ∣ 𝑥 ) and 𝔼 [ 𝐺 𝑖 ∣ 𝑥 ] = 0 (the standard score-function identity (williams1992reinforce; sutton1999policy)). Then the centered update in (4) satisfies

𝔼 [ Δ center ( 𝜃 ) ∣ 𝑥 ] = ( 1 − 1/𝐾 ) 𝔼 [ 𝑢 𝐺 ∣ 𝑥 ] . (5)

In particular, if 𝔼 [ 𝑢 𝐺 ∣ 𝑥 ] is parallel to the clean gradient direction for the prompt-conditioned objective, then so is 𝔼 [ Δ center ( 𝜃 ) ∣ 𝑥 ] .

Proof.

Expand Δ center = (1/𝐾) ∑ 𝑖 𝑢 𝑖 𝐺 𝑖 − 𝑢 ¯ ⋅ (1/𝐾) ∑ 𝑖 𝐺 𝑖 . Taking expectation conditioned on 𝑥 yields

𝔼 [ 𝑢 ¯ ⋅ (1/𝐾) ∑ 𝑖 𝐺 𝑖 ∣ 𝑥 ] = (1/𝐾²) ∑_{𝑖,𝑗} 𝔼 [ 𝑢 𝑗 𝐺 𝑖 ∣ 𝑥 ] .

The 𝑖 = 𝑗 terms contribute (𝐾/𝐾²) 𝔼 [ 𝑢 𝐺 ∣ 𝑥 ] = (1/𝐾) 𝔼 [ 𝑢 𝐺 ∣ 𝑥 ] . For 𝑖 ≠ 𝑗 , independence gives 𝔼 [ 𝑢 𝑗 𝐺 𝑖 ∣ 𝑥 ] = 𝔼 [ 𝑢 ∣ 𝑥 ] 𝔼 [ 𝐺 ∣ 𝑥 ] = 0 . Thus 𝔼 [ 𝑢 ¯ ⋅ (1/𝐾) ∑ 𝑖 𝐺 𝑖 ∣ 𝑥 ] = (1/𝐾) 𝔼 [ 𝑢 𝐺 ∣ 𝑥 ] , implying (5). ∎
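The (1 − 1/𝐾) factor can be sanity-checked numerically with a toy scalar model of our own (Gaussian "score functions" with zero mean and correlated proxy rewards, an assumption made purely for the demo):

```python
import random

random.seed(1)
K, trials = 4, 200_000
lhs = rhs = 0.0
for _ in range(trials):
    # Toy scalar "score functions" with E[G] = 0, and correlated proxies u.
    G = [random.gauss(0, 1) for _ in range(K)]
    u = [g + random.gauss(0, 1) for g in G]        # E[u G] = Var(G) = 1
    u_bar = sum(u) / K
    # Centered update vs. uncentered update, averaged over the group.
    lhs += sum((ui - u_bar) * gi for ui, gi in zip(u, G)) / K
    rhs += sum(ui * gi for ui, gi in zip(u, G)) / K
lhs /= trials
rhs /= trials
print(lhs, rhs, (1 - 1 / K) * rhs)   # lhs should match (1 - 1/K) * rhs
```

With 𝐾 = 4 the centered estimate converges to 0.75 while the uncentered one converges to 1, matching (5).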

Corollary 1 (Directional correctness of PGFC under centered GRPO-style updates).

Under the Verifier Reward Channel (Definition 1) and the conditional-independence assumption stated above, let 𝑢 = 𝑤 𝑅 ~ with 𝑤 0 = 𝜌 1 − 1 and 𝑤 1 = 𝜌 1 as in Proposition 2. Then for each prompt 𝑥 ,

𝔼 [ Δ center ( 𝜃 ) ∣ 𝑥 ] = ( 1 − 1/𝐾 ) ( 1 − 𝜌 0 − 𝜌 1 ) ∇ 𝜃 𝐽 ( 𝜃 ∣ 𝑥 ) ,

so the expected centered update remains aligned with the clean policy-gradient direction.

Standardization, clipping, and KL (what is guaranteed, and what is approximate)

GRPO further divides centered proxy rewards by the within-group standard deviation to form normalized advantages, and then applies a PPO-style clipping on token-level ratios and an explicit KL penalty to 𝜋 ref  (deepseekmath). The exact direction-preservation statement in Proposition 3 covers the centering step, which is the main nontrivial departure from vanilla REINFORCE. The additional standardization is a multiplication by a random positive scalar; this primarily acts as an adaptive step-size (and cannot flip the sign of individual sample advantages), though a strict equality in expectation would require additional concentration assumptions. Clipping and the KL penalty add further stabilizers: in the small-update regime (encouraged by the KL term and typical learning rates), clipping is rarely active and the update is close to the unclipped score-function form; when clipping activates, it intentionally limits step size and may deviate from the exact REINFORCE direction. Importantly, our corrections target the reward-gradient component; the KL gradient is unchanged, so any global scaling factor (e.g., 1 − 𝜌 0 − 𝜌 1 ) can be absorbed into the effective learning rate and/or interpreted as modifying the reward–KL trade-off.

Appendix BImplementation Details

We describe how to integrate Algorithm 1 (backward, unbiased reward de-biasing) and Algorithm 2 (forward, gradient-scaled) into Group Relative Policy Optimization (GRPO) under both outcome and process supervision. For each prompt 𝑥 , GRPO samples a group of 𝐾 responses { 𝑦 𝑖 }_{𝑖=1}^{𝐾} from the behavior policy, computes a group-normalized advantage for each sample (or step), and then applies a PPO-style clipped surrogate with a separate KL regularizer to a reference policy; no value network is used. Our modifications are confined to the advantage-construction stage, leaving ratio clipping and the KL loss unchanged (see (deepseekmath) and open-source implementations for GRPO details).

Notation (shared)

Let 𝜋 𝜃 be the current policy and 𝜋 old the behavior policy. Define token-level ratios 𝑟 𝑖 , 𝑡 = 𝜋 𝜃 ( 𝑦 𝑖 , 𝑡 ∣ 𝑥 , 𝑦 𝑖 , < 𝑡 ) / 𝜋 old ( 𝑦 𝑖 , 𝑡 ∣ 𝑥 , 𝑦 𝑖 , < 𝑡 ) . GRPO’s PPO-style surrogate at token 𝑡 uses an advantage 𝐴 𝑖 , 𝑡 :

ℒ grpo ( 𝜃 ) = (1/𝐾) ∑_{𝑖=1}^{𝐾} (1/|𝑦 𝑖|) ∑_{𝑡=1}^{|𝑦 𝑖|} min ( 𝑟 𝑖 , 𝑡 𝐴 𝑖 , 𝑡 , clip ( 𝑟 𝑖 , 𝑡 , 1 ± 𝜀 ) 𝐴 𝑖 , 𝑡 ) − 𝛽 KL ( 𝜋 𝜃 ∥ 𝜋 ref ) ,

where the KL term is added to the loss rather than folded into the reward. Our corrections only change how 𝐴 𝑖 , 𝑡 is formed.

Outcome Supervision (one scalar reward per response)

For each 𝑖 , we observe a binary verifier reward 𝑟 ~ 𝑖 ∈ { 0 , 1 } .

Algo 1 (Backward) in GRPO. First construct an unbiased per-sample reward

𝑟 ^ 𝑖 = ( 𝑟 ~ 𝑖 − 𝜌 ^ 0 ) / ( 1 − 𝜌 ^ 0 − 𝜌 ^ 1 ) .

Compute group statistics on { 𝑟 ^ 𝑖 }_{𝑖=1}^{𝐾} :

𝑟 ¯ = (1/𝐾) ∑_{𝑖=1}^{𝐾} 𝑟 ^ 𝑖 , 𝑠 = √( (1/𝐾) ∑_{𝑖=1}^{𝐾} ( 𝑟 ^ 𝑖 − 𝑟 ¯ )² ) .

Define the group-normalized advantage, constant across tokens of the same response,

𝑎 𝑖 (back) = ( 𝑟 ^ 𝑖 − 𝑟 ¯ ) / ( 𝑠 + 𝜀 ) , 𝐴 𝑖 , 𝑡 ≡ 𝑎 𝑖 (back) , ∀ 𝑡 .

This is a drop-in replacement for the usual GRPO outcome advantage, with the sole change being that the group mean/variance are computed over de-noised rewards 𝑟 ^ 𝑖 rather than raw 𝑟 ~ 𝑖 .
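The backward advantage construction above can be sketched in a few lines (a minimal standalone illustration; function and variable names are ours, not from the released code):

```python
import math

def backward_grpo_advantages(r_tilde, rho0_hat, rho1_hat, eps=1e-6):
    """Group-normalized advantages computed over backward-corrected rewards.

    r_tilde: list of binary verifier rewards for one prompt's K rollouts.
    Returns one scalar advantage per response (broadcast over its tokens).
    """
    denom = 1.0 - rho0_hat - rho1_hat
    r_hat = [(r - rho0_hat) / denom for r in r_tilde]   # de-biased rewards
    mean = sum(r_hat) / len(r_hat)
    std = math.sqrt(sum((r - mean) ** 2 for r in r_hat) / len(r_hat))
    return [(r - mean) / (std + eps) for r in r_hat]

# Example group of K = 8 noisy binary rewards with rho0_hat=0.1, rho1_hat=0.2.
adv = backward_grpo_advantages([1, 0, 1, 1, 0, 0, 1, 0], 0.1, 0.2)
print(adv)
```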

Algo 2 (Forward) in GRPO (Pre-Center). To ensure the gradient direction aligns with Theorem 2, we adopt a pre-center approach where the correction weights are treated as proxy rewards before normalization. First, define the forward weight for each trajectory based on the observed reward 𝑟 ~ 𝑖 and the estimated false-negative rate 𝜌 ^ 1 :

𝑤 𝑖 = 𝜌 ^ 1 − 1 if 𝑟 ~ 𝑖 = 0 , and 𝑤 𝑖 = 𝜌 ^ 1 if 𝑟 ~ 𝑖 = 1 .

We use these weights directly as the proxy rewards for the GRPO advantage calculation: ℛ 𝑖 proxy = 𝑤 𝑖 . Then, apply standard GRPO group normalization to these proxy rewards:

𝑎 𝑖 (fwd) = ( ℛ 𝑖 proxy − ℛ proxy ¯ ) / ( std ( { ℛ 𝑗 proxy } ) + 𝜀 ) , 𝐴 𝑖 , 𝑡 ≡ 𝑎 𝑖 (fwd) , ∀ 𝑡 .

This pre-centering ensures that negative samples (where 𝑤 𝑖 < 0 ) are correctly positioned relative to the group mean, preserving the sign logic required to suppress false negatives without inadvertently flipping the optimization direction.
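A matching sketch of the pre-center forward variant (again a standalone illustration with our own names); note that only 𝜌 ^ 1 appears:

```python
import math

def forward_grpo_advantages(r_tilde, rho1_hat, eps=1e-6):
    """Pre-center forward correction: the weights serve as proxy rewards.

    Only the estimated FN rate rho1_hat is needed; the FP rate never appears.
    """
    # w_i = rho1_hat - 1 for observed negatives, rho1_hat for observed positives.
    proxy = [rho1_hat if r == 1 else rho1_hat - 1.0 for r in r_tilde]
    mean = sum(proxy) / len(proxy)
    std = math.sqrt(sum((p - mean) ** 2 for p in proxy) / len(proxy))
    return [(p - mean) / (std + eps) for p in proxy]

# Same example group as before, with rho1_hat = 0.2.
adv = forward_grpo_advantages([1, 0, 1, 1, 0, 0, 1, 0], rho1_hat=0.2)
print(adv)
```

Observed positives land above the group mean (positive advantage) and observed negatives below it, as the pre-center derivation requires.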

Post-Center Option

Alternatively, one could apply the weights after calculating the standard advantages (post-center):

𝐴 𝑖 , 𝑡 (post) ← 𝑤 𝑖 ⋅ ( 𝑟 ~ 𝑖 − 𝑟 ~ ¯ ) / ( std ( { 𝑟 ~ 𝑗 } ) + 𝜀 ) .

However, we note that if 𝑤 𝑖 is negative (e.g., 𝜌 ^ 1 − 1 for 𝑟 ~ 𝑖 = 0 ), multiplying it by a typically negative advantage (for 𝑟 ~ 𝑖 = 0 ) would flip the sign to positive, potentially encouraging incorrect behaviors. Our default implementation and reported results therefore use the pre-center formulation, which is theoretically consistent with our derivation. We provide the post-center variant solely as an engineering option in our code.
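The sign-flip hazard is easy to see with assumed numbers (a two-line illustration, not a full implementation):

```python
# Post-center variant: weight the already-normalized advantage.
# With rho1_hat = 0.2, an observed negative gets w_i = rho1_hat - 1 = -0.8.
rho1_hat = 0.2
w_neg = rho1_hat - 1.0    # weight for r_tilde = 0

# A typical normalized advantage for a negative sample is itself negative.
a_neg = -1.0

a_post = w_neg * a_neg    # (-0.8) * (-1.0) = +0.8
print(a_post)             # positive: the incorrect response would be rewarded
```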

Process Supervision (step-wise rewards)

Suppose each response 𝑦 𝑖 has step indices index 𝑖 ( 1 ) < ⋯ < index 𝑖 ( 𝐾 𝑖 ) with step-level observed rewards 𝑟 ~ 𝑖 ( 𝑗 ) attached at those indices.

Algo 1 (Backward) in GRPO-Process. De-noise each step reward:

𝑟 ^ 𝑖 ( 𝑗 ) = ( 𝑟 ~ 𝑖 ( 𝑗 ) − 𝜌 ^ 0 ) / ( 1 − 𝜌 ^ 0 − 𝜌 ^ 1 ) .

Normalize across the group and steps in the current batch following GRPO’s process recipe:

𝑟 ~ 𝑖 ( 𝑗 ) = ( 𝑟 ^ 𝑖 ( 𝑗 ) − mean ( { 𝑟 ^ ℓ ( 𝑚 ) } ) ) / ( std ( { 𝑟 ^ ℓ ( 𝑚 ) } ) + 𝜀 ) .

Accumulate into token-level advantages (for all tokens 𝑡 at or before the 𝑗 -th step boundary):

𝐴 𝑖 , 𝑡 = ∑_{ index 𝑖 ( 𝑗 ) ≥ 𝑡 } 𝑟 ~ 𝑖 ( 𝑗 ) .

Algo 2 (Forward) in GRPO-Process (Pre-Center). Similar to the outcome case, we apply weights directly to the step-level rewards before normalization. Define the step-level weights:

𝑤 𝑖 ( 𝑗 ) = 𝜌 ^ 1 − 1 if 𝑟 ~ 𝑖 ( 𝑗 ) = 0 , and 𝑤 𝑖 ( 𝑗 ) = 𝜌 ^ 1 if 𝑟 ~ 𝑖 ( 𝑗 ) = 1 .

Use these as proxy rewards ℛ 𝑖 ( 𝑗 ) , proxy = 𝑤 𝑖 ( 𝑗 ) , normalize them across the group and steps to obtain 𝑟 ~ 𝑖 ( 𝑗 ) , and finally accumulate the token advantages:

𝐴 𝑖 , 𝑡 = ∑_{ index 𝑖 ( 𝑗 ) ≥ 𝑡 } 𝑟 ~ 𝑖 ( 𝑗 ) .
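The token-level accumulation rule can be sketched directly (a small illustration with hypothetical step boundaries; in practice the loop would be vectorized):

```python
def token_advantages(step_boundaries, step_values, num_tokens):
    """A_{i,t} = sum of normalized step rewards whose boundary index >= t.

    step_boundaries: sorted token indices where each step's reward attaches.
    step_values:     the (already normalized) step rewards, same length.
    """
    adv = []
    for t in range(1, num_tokens + 1):
        # Every step ending at or after token t contributes its reward.
        adv.append(sum(v for b, v in zip(step_boundaries, step_values) if b >= t))
    return adv

# A 10-token response with normalized step rewards attached at tokens 4 and 10.
A = token_advantages([4, 10], [0.5, -0.3], num_tokens=10)
print(A)  # tokens 1..4 see both steps; tokens 5..10 only the final one
```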

Scope of verifier-noise regimes

While both FPs and FNs appear in practice, many math-RL pipelines rely on rule-based or programmatic checkers whose FP rate is typically negligible but whose FN rate can be substantial due to brittleness (formatting, algebraic equivalence, or answer representation). Our real-world experiments therefore emphasize this FN-dominant regime and evaluate whether explicitly correcting the policy update using an online estimate of 𝜌 1 improves learning beyond simply rechecking negatives. In contrast, LLM-as-a-judge regimes can exhibit higher and non-stationary FP rates and may violate instance-independence more severely; we discuss the implications and recommended usage of PGBC/PGFC across regimes in Table 3.

When conditional independence is violated (bias decomposition)

Our directional-alignment results rely on conditional independence to factorize 𝔼 [ 𝑤 𝑅 ~ 𝐺 ∣ 𝑅 ∗ ] = 𝔼 [ 𝑤 𝑅 ~ ∣ 𝑅 ∗ ] 𝔼 [ 𝐺 ∣ 𝑅 ∗ ] . If verifier errors depend on trajectory features beyond correctness (e.g., style-based FP hacks), the update admits the decomposition

𝔼 [ 𝑤 𝑅 ~ 𝐺 ∣ 𝑅 ∗ ] = 𝔼 [ 𝑤 𝑅 ~ ∣ 𝑅 ∗ ] 𝔼 [ 𝐺 ∣ 𝑅 ∗ ] + Cov ( 𝑤 𝑅 ~ , 𝐺 ∣ 𝑅 ∗ ) ,

so the residual bias is governed by the conditional covariance term. This highlights a practical diagnostic: if verifier decisions are strongly predictable from superficial cues that also affect generation likelihood, Cov ( 𝑤 𝑅 ~ , 𝐺 ∣ 𝑅 ∗ ) may be non-negligible and one should combine channel-based correction with verifier hardening or instance-dependent reliability modeling.

Table 3: Recommended usage regimes for PGBC/PGFC under common verifier setups. “Estimatable” indicates which noise rates are typically practical to estimate in that regime.

| Verifier setup | Typical failure mode | Estimatable rates | Recommended use |
| --- | --- | --- | --- |
| Rule-based checker | FN-dominant; FP ≈ 0 | 𝜌 1 via appeals | PGFC (+ online 𝜌 ^ 1 ) |
| LLM judge (single) | FP/FN; often content-dependent | 𝜌 0 , 𝜌 1 harder | Prefer judge hardening / hybrid; use PGBC if rates available |
| Hybrid (rule + appeals) | Reduced FN; small FP | 𝜌 1 (and sometimes 𝜌 0 ) | PGFC by default; PGBC if both rates are reliable |

Online FN-rate estimation via appeals: implementation details

At training step 𝑡 , let 𝒩 R ( 𝑡 ) denote trajectories labeled negative by the rule-based verifier and 𝒫 R ( 𝑡 ) denote those labeled positive. We appeal a uniformly random subset 𝑆 ( 𝑡 ) ⊆ 𝒩 R ( 𝑡 ) by including each element independently with probability 𝑞 ∈ ( 0 , 1 ] and re-evaluating it using a lightweight LLM verifier (TinyV (xu2025tinyv)). Let 𝒫 L ( 𝑡 ) ⊆ 𝑆 ( 𝑡 ) be the appealed negatives that TinyV flips to positive (i.e., rule-negative & LLM-positive). Using a Horvitz–Thompson correction (karwa2023admissibility), we estimate the number of false negatives among rule-based negatives by

FN ^ ( 𝑡 ) = | 𝒫 L ( 𝑡 ) | / 𝑞 ,

which is unbiased under uniform subsampling. Assuming rule-based positives have a negligible FP rate, we treat | 𝒫 R ( 𝑡 ) | as a proxy for true positives and estimate the FN rate as

𝜌 ^ 1 ( 𝑡 ) = ( FN ^ ( 𝑡 ) + 𝛼 ) / ( FN ^ ( 𝑡 ) + | 𝒫 R ( 𝑡 ) | + 𝛼 + 𝛽 ) ,

with small Beta-prior pseudocounts 𝛼 , 𝛽 (we use 10⁻⁵) for numerical stability. To track non-stationarity as the policy evolves, we apply exponential moving average (EMA) smoothing,

𝜌 ¯ 1 ( 𝑡 ) = 𝜆 𝜌 ¯ 1 ( 𝑡 − 1 ) + ( 1 − 𝜆 ) 𝜌 ^ 1 ( 𝑡 ) ,

and use 𝜌 ¯ 1 ( 𝑡 ) as the plug-in estimate in PGFC. The additional verifier cost scales linearly with 𝑞 : the expected number of TinyV calls per update is 𝑞 ⋅ | 𝒩 R ( 𝑡 ) | , typically much smaller than the number of rollouts.
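The appeals-based estimator above can be sketched as a small stateful class (our own sketch; class and argument names are not from the released code, and `llm_verify` stands in for a TinyV call):

```python
import random

class OnlineFNRateEstimator:
    """Horvitz-Thompson FN count + Beta-prior rate + EMA smoothing."""

    def __init__(self, q=0.1, alpha=1e-5, beta=1e-5, ema=0.9):
        self.q, self.alpha, self.beta, self.ema = q, alpha, beta, ema
        self.rho1_bar = 0.0   # EMA state, used as the PGFC plug-in estimate

    def update(self, rule_negatives, num_rule_positives, llm_verify):
        # Appeal each rule-negative independently with probability q,
        # re-checking it with the lightweight LLM verifier.
        appealed = [y for y in rule_negatives if random.random() < self.q]
        flips = sum(1 for y in appealed if llm_verify(y))  # rule-neg & LLM-pos
        fn_hat = flips / self.q                            # Horvitz-Thompson
        rho1_hat = (fn_hat + self.alpha) / (
            fn_hat + num_rule_positives + self.alpha + self.beta)
        # EMA smoothing to track non-stationarity of the policy.
        self.rho1_bar = self.ema * self.rho1_bar + (1 - self.ema) * rho1_hat
        return self.rho1_bar
```

With q = 1 every negative is appealed and FN̂(t) reduces to the exact flip count; smaller q trades estimator variance for fewer verifier calls.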

High-FP and content-dependent errors (practical guidance)

PGFC does not require estimating 𝜌 0 : under the Verifier Reward Channel model, 𝜌 0 only affects the positive scaling factor 𝑐 = 1 − 𝜌 0 − 𝜌 1 in Theorem 2, which can be absorbed into the effective learning rate. However, if FP patterns are strongly content-dependent (e.g., prompt-hacking phrases for LLM judges), then the channel assumptions may be violated and the covariance term discussed above can bias the update. In such regimes, we recommend treating PGBC/PGFC as complementary to (rather than a replacement for) verifier hardening and hybrid verifier designs (e.g., rule-based checking with appeals, calibrated judges, or adversarially trained verifiers).

Appendix C Proofs and Derivations

C.1 Proof of Proposition 1

Proof.

We compute the expectation of the noisy reward 𝑅 ~ conditioned on the clean reward 𝑅 ∗ , which is a binary variable. By the definition of expectation:

𝔼 [ 𝑅 ~ ∣ 𝑅 ∗ ] = 1 ⋅ ℙ ( 𝑅 ~ = 1 ∣ 𝑅 ∗ ) + 0 ⋅ ℙ ( 𝑅 ~ = 0 ∣ 𝑅 ∗ ) = ℙ ( 𝑅 ~ = 1 ∣ 𝑅 ∗ ) .

We can expand this using the law of total probability, conditioning on the value of 𝑅 ∗ ∈ { 0 , 1 } :

𝔼 [ 𝑅 ~ ∣ 𝑅 ∗ ] = 𝑅 ∗ ⋅ ℙ ( 𝑅 ~ = 1 ∣ 𝑅 ∗ = 1 ) + ( 1 − 𝑅 ∗ ) ⋅ ℙ ( 𝑅 ~ = 1 ∣ 𝑅 ∗ = 0 ) .

From Definition 1, we have ℙ ( 𝑅 ~ = 1 ∣ 𝑅 ∗ = 0 ) = 𝜌 0 and ℙ ( 𝑅 ~ = 0 ∣ 𝑅 ∗ = 1 ) = 𝜌 1 , which implies ℙ ( 𝑅 ~ = 1 ∣ 𝑅 ∗ = 1 ) = 1 − 𝜌 1 . Substituting these values:

𝔼 [ 𝑅 ~ ∣ 𝑅 ∗ ] = 𝑅 ∗ ( 1 − 𝜌 1 ) + ( 1 − 𝑅 ∗ ) 𝜌 0 = 𝑅 ∗ − 𝜌 1 𝑅 ∗ + 𝜌 0 − 𝜌 0 𝑅 ∗ = ( 1 − 𝜌 0 − 𝜌 1 ) 𝑅 ∗ + 𝜌 0 .

This completes the proof. ∎

C.2 Proof of Theorem 1

Proof.

Recall the verifier reward channel (Definition 1): for the latent clean reward 𝑅 ∗ ∈ { 0 , 1 } and the observed noisy reward 𝑅 ~ ∈ { 0 , 1 } ,

Pr ( 𝑅 ~ = 1 ∣ 𝑅 ∗ = 0 ) = 𝜌 0 , Pr ( 𝑅 ~ = 0 ∣ 𝑅 ∗ = 1 ) = 𝜌 1 ,

hence Pr ( 𝑅 ~ = 1 ∣ 𝑅 ∗ = 1 ) = 1 − 𝜌 1 and Pr ( 𝑅 ~ = 0 ∣ 𝑅 ∗ = 0 ) = 1 − 𝜌 0 . Since 1 − 𝜌 0 − 𝜌 1 > 0 , the estimator

𝑅 ^ ≜ ( 𝑅 ~ − 𝜌 0 ) / ( 1 − 𝜌 0 − 𝜌 1 )

is well-defined.

We prove unbiasedness by computing the conditional expectation given 𝑅 ∗ . First note that because 𝑅 ~ is binary,

𝔼 [ 𝑅 ~ ∣ 𝑅 ∗ ] = 1 ⋅ Pr ( 𝑅 ~ = 1 ∣ 𝑅 ∗ ) + 0 ⋅ Pr ( 𝑅 ~ = 0 ∣ 𝑅 ∗ ) = Pr ( 𝑅 ~ = 1 ∣ 𝑅 ∗ ) . (6)

By the channel definition, for the two possible values of 𝑅 ∗ we have

𝔼 [ 𝑅 ~ ∣ 𝑅 ∗ = 1 ] = 1 − 𝜌 1 , 𝔼 [ 𝑅 ~ ∣ 𝑅 ∗ = 0 ] = 𝜌 0 .

These two cases can be written compactly as the affine form (Proposition 1)

𝔼 [ 𝑅 ~ ∣ 𝑅 ∗ ] = ( 1 − 𝜌 0 − 𝜌 1 ) 𝑅 ∗ + 𝜌 0 . (7)

Now take the conditional expectation of 𝑅 ^ :

𝔼 [ 𝑅 ^ ∣ 𝑅 ∗ ] = 𝔼 [ ( 𝑅 ~ − 𝜌 0 ) / ( 1 − 𝜌 0 − 𝜌 1 ) ∣ 𝑅 ∗ ] = ( 𝔼 [ 𝑅 ~ ∣ 𝑅 ∗ ] − 𝜌 0 ) / ( 1 − 𝜌 0 − 𝜌 1 ) = ( ( 1 − 𝜌 0 − 𝜌 1 ) 𝑅 ∗ + 𝜌 0 − 𝜌 0 ) / ( 1 − 𝜌 0 − 𝜌 1 ) = 𝑅 ∗ , (8)

where the third equality uses (7). Equation (8) shows that 𝑅 ^ is an unbiased estimator of the latent clean reward in the conditional sense.

Finally, taking total expectation (law of iterated expectation) yields

𝔼 [ 𝑅 ^ ] = 𝔼 [ 𝔼 [ 𝑅 ^ ∣ 𝑅 ∗ ] ] = 𝔼 [ 𝑅 ∗ ] ,

so 𝑅 ^ is unbiased for the clean reward both conditionally and unconditionally. ∎

C.3 Proof of Proposition 2

Proof.

The proposition states two claims about the conditional expectation of the forward weights. The weights are defined as:

𝑤 𝑅 ~ = 𝑤 0 = 𝜌 1 − 1 if 𝑅 ~ = 0 , and 𝑤 𝑅 ~ = 𝑤 1 = 𝜌 1 if 𝑅 ~ = 1 .

The noise model provides the conditional probabilities:

Pr ( 𝑅 ~ = 0 ∣ 𝑅 ∗ = 1 ) = 𝜌 1 , Pr ( 𝑅 ~ = 1 ∣ 𝑅 ∗ = 1 ) = 1 − 𝜌 1 ,

Pr ( 𝑅 ~ = 1 ∣ 𝑅 ∗ = 0 ) = 𝜌 0 , Pr ( 𝑅 ~ = 0 ∣ 𝑅 ∗ = 0 ) = 1 − 𝜌 0 .

Part 1: Proof of 𝔼 [ 𝑤 𝑅 ~ ∣ 𝑅 ∗ = 1 ] = 0

We compute the expectation of 𝑤 𝑅 ~ conditioned on the true reward being positive ( 𝑅 ∗ = 1 ):

𝔼 [ 𝑤 𝑅 ~ ∣ 𝑅 ∗ = 1 ] = ∑_{𝑘 ∈ { 0 , 1 }} 𝑤 𝑘 ⋅ Pr ( 𝑅 ~ = 𝑘 ∣ 𝑅 ∗ = 1 ) = 𝑤 0 ⋅ Pr ( 𝑅 ~ = 0 ∣ 𝑅 ∗ = 1 ) + 𝑤 1 ⋅ Pr ( 𝑅 ~ = 1 ∣ 𝑅 ∗ = 1 ) = ( 𝜌 1 − 1 ) 𝜌 1 + 𝜌 1 ( 1 − 𝜌 1 ) = ( 𝜌 1 ² − 𝜌 1 ) + ( 𝜌 1 − 𝜌 1 ² ) = 0 .

Part 2: Proof of 𝔼 [ 𝑤 𝑅 ~ ∣ 𝑅 ∗ = 0 ] = − ( 1 − 𝜌 0 − 𝜌 1 )

Next, we compute the expectation of 𝑤 𝑅 ~ conditioned on the true reward being negative ( 𝑅 ∗ = 0 ):

𝔼 [ 𝑤 𝑅 ~ ∣ 𝑅 ∗ = 0 ] = ∑_{𝑘 ∈ { 0 , 1 }} 𝑤 𝑘 ⋅ Pr ( 𝑅 ~ = 𝑘 ∣ 𝑅 ∗ = 0 ) = 𝑤 0 ⋅ Pr ( 𝑅 ~ = 0 ∣ 𝑅 ∗ = 0 ) + 𝑤 1 ⋅ Pr ( 𝑅 ~ = 1 ∣ 𝑅 ∗ = 0 ) = ( 𝜌 1 − 1 ) ( 1 − 𝜌 0 ) + 𝜌 1 𝜌 0 = ( 𝜌 1 − 𝜌 0 𝜌 1 − 1 + 𝜌 0 ) + 𝜌 0 𝜌 1 = 𝜌 1 + 𝜌 0 − 1 = − ( 1 − 𝜌 0 − 𝜌 1 ) .

This proves both claims of the proposition. ∎
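The two identities can be checked numerically with assumed noise rates (an arithmetic spot-check of the algebra above, not part of the paper):

```python
# Numeric sanity check of Proposition 2 for example noise rates.
rho0, rho1 = 0.1, 0.2
w0, w1 = rho1 - 1.0, rho1          # forward weights for R_tilde = 0 / 1

# E[w | R* = 1] = w0 * Pr(R_tilde=0 | R*=1) + w1 * Pr(R_tilde=1 | R*=1)
e_pos = w0 * rho1 + w1 * (1.0 - rho1)

# E[w | R* = 0] = w0 * Pr(R_tilde=0 | R*=0) + w1 * Pr(R_tilde=1 | R*=0)
e_neg = w0 * (1.0 - rho0) + w1 * rho0

print(e_pos)   # 0.0
print(e_neg)   # -(1 - rho0 - rho1) = -0.7
```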

C.4 Proof of Theorem 2

Proof.

We want to show that 𝔼 [ Δ 𝜃 ] = ( 1 − 𝜌 0 − 𝜌 1 ) ∇ 𝜃 𝐽 ( 𝜃 ) , where Δ 𝜃 = (1/𝑀) ∑_{𝑡=1}^{𝑀} ℎ 𝑡 and ℎ 𝑡 = 𝑤 𝑅 ~ 𝐺 𝑡 . By linearity of expectation and assuming i.i.d. samples, it suffices to show this for a single sample’s contribution, 𝔼 [ ℎ 𝑡 ] .

We use the law of total expectation, conditioning on the latent true reward 𝑅 ∗ ∈ { 0 , 1 } :

𝔼 [ ℎ 𝑡 ] = 𝔼 [ 𝑤 𝑅 ~ 𝐺 𝑡 ] = 𝔼 [ 𝔼 [ 𝑤 𝑅 ~ 𝐺 𝑡 ∣ 𝑅 ∗ ] ] = Pr ( 𝑅 ∗ = 1 ) 𝔼 [ 𝑤 𝑅 ~ 𝐺 𝑡 ∣ 𝑅 ∗ = 1 ] + Pr ( 𝑅 ∗ = 0 ) 𝔼 [ 𝑤 𝑅 ~ 𝐺 𝑡 ∣ 𝑅 ∗ = 0 ] .

The noise process generating 𝑅 ~ is independent of the policy’s action-generation process (which produces 𝐺 𝑡 ), conditional on the true reward 𝑅 ∗ . Thus, we can separate the expectations:

𝔼 [ 𝑤 𝑅 ~ 𝐺 𝑡 ∣ 𝑅 ∗ ] = 𝔼 [ 𝑤 𝑅 ~ ∣ 𝑅 ∗ ] ⋅ 𝔼 [ 𝐺 𝑡 ∣ 𝑅 ∗ ] .

Using the results from Proposition 2:

𝔼 [ 𝑤 𝑅 ~ ∣ 𝑅 ∗ = 1 ] = 0 , 𝔼 [ 𝑤 𝑅 ~ ∣ 𝑅 ∗ = 0 ] = − ( 1 − 𝜌 0 − 𝜌 1 ) .

Substituting these back into the main expectation formula:

𝔼 [ 𝑤 𝑅 ~ 𝐺 𝑡 ] = Pr ( 𝑅 ∗ = 1 ) ⋅ 0 ⋅ 𝔼 [ 𝐺 𝑡 ∣ 𝑅 ∗ = 1 ] + Pr ( 𝑅 ∗ = 0 ) ⋅ ( − ( 1 − 𝜌 0 − 𝜌 1 ) ) ⋅ 𝔼 [ 𝐺 𝑡 ∣ 𝑅 ∗ = 0 ] = − ( 1 − 𝜌 0 − 𝜌 1 ) ⋅ Pr ( 𝑅 ∗ = 0 ) 𝔼 [ 𝐺 𝑡 ∣ 𝑅 ∗ = 0 ] = − ( 1 − 𝜌 0 − 𝜌 1 ) ⋅ 𝔼 [ 𝟏 { 𝑅 ∗ = 0 } 𝐺 𝑡 ] ,

where 𝟏 { ⋅ } is the indicator function. We use two fundamental properties of the score function:

1. The unconditional expectation is zero: 𝔼 [ 𝐺 𝑡 ] = 0 (williams1992reinforce; sutton1999policy).

2. The clean policy gradient is ∇ 𝜃 𝐽 ( 𝜃 ) = 𝔼 [ 𝑅 ∗ 𝐺 𝑡 ] .

From property 1, we have 𝔼 [ 𝐺 𝑡 ] = 𝔼 [ ( 𝟏 { 𝑅 ∗ = 1 } + 𝟏 { 𝑅 ∗ = 0 } ) 𝐺 𝑡 ] = 𝔼 [ 𝑅 ∗ 𝐺 𝑡 ] + 𝔼 [ 𝟏 { 𝑅 ∗ = 0 } 𝐺 𝑡 ] = 0 . This implies that 𝔼 [ 𝟏 { 𝑅 ∗ = 0 } 𝐺 𝑡 ] = − 𝔼 [ 𝑅 ∗ 𝐺 𝑡 ] = − ∇ 𝜃 𝐽 ( 𝜃 ) .

Finally, we substitute this back into our expression for the expected update direction:

𝔼 [ ℎ 𝑡 ] = 𝔼 [ 𝑤 𝑅 ~ 𝐺 𝑡 ] = − ( 1 − 𝜌 0 − 𝜌 1 ) ⋅ 𝔼 [ 𝟏 { 𝑅 ∗ = 0 } 𝐺 𝑡 ] = − ( 1 − 𝜌 0 − 𝜌 1 ) ⋅ ( − ∇ 𝜃 𝐽 ( 𝜃 ) ) = ( 1 − 𝜌 0 − 𝜌 1 ) ∇ 𝜃 𝐽 ( 𝜃 ) .

Therefore, the expectation of the full update is 𝔼 [ Δ 𝜃 ] = (1/𝑀) ∑ 𝔼 [ ℎ 𝑡 ] = ( 1 − 𝜌 0 − 𝜌 1 ) ∇ 𝜃 𝐽 ( 𝜃 ) . This completes the proof. ∎

Appendix D Prompt Templates and Training/Evaluation Details

This section records the exact prompt formats and the concrete hyperparameters we used for all experiments in this paper. We mirror the level of detail used in recent RLVR appendices and report settings sufficient for full reproducibility from our released code.

D.1 Prompt Templates

Training (generation) prompt

For each math problem 𝑥 (a plain-text question), the user message is built by concatenating the raw question with a short instruction that elicits chain-of-thought and enforces a verifiable answer format.

{QUESTION} Let’s think step by step and enclose the reasoning process within and tags. The final result in the answer MUST BE within \boxed{}.

During data preprocessing, we write chat-style JSON with a single user turn as shown above and attach the rule-based ground-truth answer for reward checking.

Evaluation (validation/test) prompt

We use the same prompt template as training for validation and test-time generation so that the rule-based verifier can parse the boxed answer consistently.

Verifier I/O

The rule-based checker operates on the model’s final string and extracts the last \boxed{...} expression; it then applies numeric/rational parsing and equality tests to produce a binary reward 𝑅 ~ ∈ { 0 , 1 } . When the LLM verifier is enabled, it receives the pair ( problem , model solution ) and returns a binary correctness decision used only to estimate the false negative rate 𝜌 1 over a sliding window. The LLM verifier does not replace the rule-based reward.
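The "extract the last \boxed{...} expression" step can be sketched as follows (a hypothetical helper of our own, not the released checker; it tracks brace depth so nested expressions like \frac{448}{15625} survive):

```python
import re

def extract_last_boxed(text):
    """Return the contents of the last \\boxed{...} in a model response.

    Scans each \\boxed{ occurrence and walks forward, counting brace depth,
    so nested braces inside the answer are handled correctly.
    """
    matches = []
    for m in re.finditer(r"\\boxed\{", text):
        depth, start = 1, m.end()
        i = start
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth == 0:                       # only keep well-closed boxes
            matches.append(text[start:i - 1])
    return matches[-1] if matches else None

print(extract_last_boxed(r"The answer is \boxed{\frac{448}{15625}}."))
```

The extracted string would then feed the numeric/rational parsing and equality tests described above.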

D.2 Data Preprocessing

We load the preview split of the math-reasoning corpus and map each example to a chat-style record as above, keeping the reference (ground-truth) answer for programmatic checking.

D.3 Training Configuration

Unless otherwise stated, all runs use GRPO (outcome supervision) with the following constants.

| Training (GRPO) setting | Value |
| --- | --- |
| Train batch size | 128 |
| Rollouts per question (group size) | 8 |
| Max prompt length (tokens) | 512 |
| Max response length (tokens) | 3072 |
| Sampling temperature (rollouts) | 1.0 |
| Advantage estimator | Group-normalized (GRPO) |
| KL regularization | Enabled |
| KL coefficient 𝛽 | 0.001 |
| Entropy coefficient | 0.0 |
| Optimizer | AdamW |
| Learning rate | 3e-6 |
| Total epochs | 1 |

Table 4: Core training settings.

Model/backbone

We load the base model from local cache (HuggingFace layout), enable FSDP2 for actor/ref, and use shared-memory weights with remove-padding for efficient vLLM rollouts. The KL is computed w.r.t. a frozen reference initialized from the same base.

D.4 Online FN-Rate Estimation Hyperparameters and Ablations

This subsection reports the concrete hyperparameters for our online false-negative (FN) rate estimation 𝜌 ^ 1 used in the real-world-noise experiments (Section 4.3), and provides a systematic ablation over the key knobs that control the cost–accuracy trade-off of the appeals mechanism.

Appeals protocol and estimator

At training step 𝑡 , let 𝒩 R ( 𝑡 ) and 𝒫 R ( 𝑡 ) be the sets labeled negative/positive by the primary rule-based verifier. We appeal each item in 𝒩 R ( 𝑡 ) independently with probability 𝑞 ∈ ( 0 , 1 ] and re-evaluate it with a lightweight LLM verifier (TinyV). Let 𝒫 L ( 𝑡 ) be the appealed negatives that TinyV flips to positive. We use the Horvitz–Thompson estimator

FN ^ ( 𝑡 ) = | 𝒫 L ( 𝑡 ) | / 𝑞 (9)

and estimate the FN rate with a Beta prior:

𝜌 ^ 1 ( 𝑡 ) = ( FN ^ ( 𝑡 ) + 𝛼 ) / ( FN ^ ( 𝑡 ) + | 𝒫 R ( 𝑡 ) | + 𝛼 + 𝛽 ) . (10)

We then apply exponential moving average (EMA) smoothing

𝜌 ¯ 1 ( 𝑡 ) = 𝜆 𝜌 ¯ 1 ( 𝑡 − 1 ) + ( 1 − 𝜆 ) 𝜌 ^ 1 ( 𝑡 ) , (11)

and use 𝜌 ¯ 1 ( 𝑡 ) as the plug-in estimate for the PGFC weights 𝑤 𝑟 ~ ∈ { 𝜌 ¯ 1 ( 𝑡 ) − 1 , 𝜌 ¯ 1 ( 𝑡 ) } .

Default hyperparameters

Unless otherwise stated, real-world-noise runs use: (i) appeal probability 𝑞 = 0.1 ; (ii) Beta-prior pseudocounts 𝛼 = 𝛽 = 10⁻⁵ in Eq. (10); (iii) EMA decay 𝜆 = 0.9 in Eq. (11). We compute 𝜌 ^ 1 ( 𝑡 ) and update 𝜌 ¯ 1 ( 𝑡 ) once per policy-update step. The expected number of TinyV calls per update is 𝑞 ⋅ | 𝒩 R ( 𝑡 ) | (reported below as a cost proxy).

D.5 Datasets

Training

• DeepScaleR (DeepScaleR): the math-reasoning corpus used for RLVR rollouts and policy updates.

Evaluation.

We evaluate with the standard official/problem releases for each benchmark and apply the same rule-based programmatic checking protocol throughout.

• MATH500 (lightman2024letsverify): a verifiable subset of math reasoning problems for reliable evaluation.

• AIME 2024 (aime2024): the 2024 American Invitational Mathematics Examination set.

• AIME 2025 (aime2025): the 2025 AIME set.

• AMC 2023 (amc23): problems from the 2023 American Mathematics Competitions.

• Minerva Math (minerva): the Minerva math evaluation suite.

• OlympiadBench (olympiadbench): olympiad-level mathematics benchmark.

Appendix E False Positive (FP) and False Negative (FN) Examples

We provide FP and FN examples using the Qwen2.5-Math-7B base model on MATH500.

FP { "idx": 16, "question": "Compute: $1-2+3-4+5- \dots +99-100$.", "model_prediction": "{}", "model_response": "Compute: $1+2-3-4+5+6-7-8+ \dots +97+98-99-100$. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \boxed{{}}. \", "ground_truth": "-50", "rule_based": false, "llm_verifier": true, "llm_verifier_score": 1.0, "llm_verifier_raw_value": "1.0", "math_verify": false, "math_verify_score": 0.0 }, FN { "idx": 92, "question": "A pirate searches seven islands for buried treasure. If each island has a $\frac{1}{5}$ chance of having treasure, what is the probability that exactly 4 of the islands have treasure?", "answer": "\frac{448}{15625}", "model_prediction": "0.028672", "model_response": "To solve this problem, we can use the binomial probability formula. 
The probability of having exactly $k$ successes in $n$ independent Bernoulli trials (each with success probability $p$) is given by:\n\n\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \]\n\nHere, $n = 7$ (the number of islands), $k = 4$ (the number of islands with treasure), and $p = \frac{1}{5}$ (the probability that a single island has treasure).\n\nThe binomial coefficient $\binom{n}{k}$ is the number of ways to choose $k$ successes out of $n$ trials, and it is calculated as:\n\n\[ \binom{n}{k} = \frac{n!}{k!(n-k)!} \]\n\n So, the probability that exactly 4 of the 7 islands have treasure is:\n\n\[ P(X = 4) = \binom{7}{4} \left(\frac{1}{5}\right)^4 \left(\frac{4}{5}\right)^{3} \]\n\nLet’s calculate this using Python.\n‘‘‘python\nimport math\n\n# Number of islands\nn = 7\n# Number of islands with treasure\nk = 4\n# Probability of a single island having treasure\np = 1/5\n\n# Binomial coefficient\nbinom_coeff = math.comb(n, k)\n\n# Probability calculation\nprobability = binom_coeff * (p**k) * ((1-p)**(n-k))\nprint(probability)\n‘‘‘\n‘‘‘output\n0.02867200000000001\n‘‘‘\nThe probability that exactly 4 of the 7 islands have treasure is boxed0.028672\\boxed{0.028672}.", "ground_truth": "\frac{448}{15625}", "rule_based": false, "llm_verifier": false, "llm_verifier_score": 0.0, "llm_verifier_raw_value": "0.0", "math_verify": true, "math_verify_score": 1.0 }
