arxiv:2606.05633

Answer Presence Drives RAG Rewriting Gains

Published on Jun 4 · Submitted by ShinerYang on Jun 9

Abstract

Controlled interventions reveal that gold answer presence in rewritten contexts significantly boosts QA performance, with removal causing substantial F1 drops and injection improving results, while conventional probing methods show fragility to sentinel changes.

Retrieval-augmented QA pipelines often route retrieved passages through an LLM rewriter before a smaller reader, lifting F1 by tens of points on multi-hop benchmarks; this gain is typically credited to improved evidence quality. We ask whether that lift is causally driven by the gold answer string appearing in the rewritten context rather than by curation per se, using a controlled intervention audit. For each rewritten context we re-run the reader after one of four controlled edits to the compile output: removing the gold answer span, replacing a length-matched random non-answer span (placebo), or injecting the gold into rewrites where it was absent (at the prefix or at a midpoint sentence boundary). Across twelve completed (cell, baseline) intervention runs spanning three reader families (Qwen2.5-7B, Qwen3.5-35B, GLM-4.7), two datasets (HotpotQA, 2WikiMultihopQA), and three compiler arrangements (MA-only, MB-only, MA+verify), removing the gold answer drops reader F1 by 28 to 64 points beyond the length-matched placebo on paired answer-in-compile strata, and prepending the gold into rewrites that lacked it raises F1 by +0.7 to +9.7 points in 10 of 12 (cell, baseline) combinations. A companion five-sentinel audit shows the conventional single-[MASK] probe is itself sentinel-fragile: on 2Wiki it reports a +4.12 F1 "non-leakage residual" that flips to -3.33 to -7.81 F1 under four alternative sentinels and fails an equivalence test for three of those four (1/4 pass). We do not propose a new rewriter or mitigation; we release the intervention runner and the sentinel panel so that other rewriter-gain claims can be tested against the same standard.
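The four edits described in the abstract can be illustrated with a minimal Python sketch. This is not the released intervention runner: the function names, the character-level span matching, the default empty sentinel, and the regex used to find sentence boundaries are all assumptions made for illustration.

```python
import random
import re


def remove_gold(context: str, gold: str, sentinel: str = "") -> str:
    """Replace every case-insensitive occurrence of the gold answer with `sentinel`."""
    if not gold:
        return context
    # A callable replacement keeps backslashes in `sentinel` literal.
    return re.sub(re.escape(gold), lambda _m: sentinel, context, flags=re.IGNORECASE)


def placebo_edit(context: str, gold: str, rng: random.Random, sentinel: str = "") -> str:
    """Replace one random span of len(gold) characters that does not overlap the gold answer."""
    if not gold:
        return context
    n = len(gold)
    gold_spans = [m.span() for m in re.finditer(re.escape(gold), context, flags=re.IGNORECASE)]
    candidates = [
        start
        for start in range(len(context) - n + 1)
        if all(start + n <= a or start >= b for a, b in gold_spans)
    ]
    if not candidates:
        return context  # no room for a non-overlapping span
    start = rng.choice(candidates)
    return context[:start] + sentinel + context[start + n:]


def inject_prefix(context: str, gold: str) -> str:
    """Prepend the gold answer as its own short sentence."""
    return f"{gold}. {context}"


def inject_midpoint(context: str, gold: str) -> str:
    """Insert the gold answer at the sentence boundary closest to the middle of the context."""
    boundaries = [m.end() for m in re.finditer(r"(?<=[.!?])\s+", context)]
    if not boundaries:
        return inject_prefix(context, gold)  # fallback when no boundary is found
    middle = len(context) / 2
    pos = min(boundaries, key=lambda b: abs(b - middle))
    return context[:pos] + f"{gold}. " + context[pos:]
```

In this sketch the placebo removes exactly as many characters as the gold answer, so comparing the reader's F1 after `remove_gold` against F1 after `placebo_edit` separates the effect of losing the answer string from the effect of merely perturbing the context. A real runner would more likely operate on token or word boundaries.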

Community


Primary Contribution
The paper does not propose a new mitigation or rewriter. Instead, it provides a rigorous evaluation standard, releasing the intervention runner and sentinel panel so that future claims of rewriter-driven gains can be strictly and uniformly audited.

Key Findings
Leakage Drives the Lift: Removing the gold answer drops reader F1 by 28 to 64 points beyond the length-matched placebo. Conversely, prepending the gold answer to rewrites that lacked it raises F1 by 0.7 to 9.7 points in 10 of 12 (cell, baseline) combinations.
Standard Probes are Fragile: A companion five-sentinel audit shows that the conventional single-[MASK] leakage probe is unreliable. On 2Wiki, its +4.12 F1 "non-leakage residual" flips to between -3.33 and -7.81 F1 under four alternative sentinels.
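The two quantities behind these findings can be sketched in a few lines of Python. This is an illustrative reading, not the paper's code: `token_f1` is a simplified whitespace-token F1 (no punctuation or article normalization), `placebo_adjusted_drop` assumes per-example F1 scores paired by index, and `sign_stable` is a hypothetical helper that only checks whether a residual keeps its sign across sentinels. It does not implement the paper's equivalence test.

```python
from collections import Counter
from statistics import mean


def token_f1(prediction: str, gold: str) -> float:
    """Whitespace-token F1 between a predicted answer and the gold answer (0 to 1)."""
    pred = prediction.lower().split()
    ref = gold.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def placebo_adjusted_drop(f1_orig, f1_removed, f1_placebo) -> float:
    """Mean paired F1 drop from gold removal beyond the placebo drop, in points (0 to 100)."""
    removal_drop = mean(o - r for o, r in zip(f1_orig, f1_removed))
    placebo_drop = mean(o - p for o, p in zip(f1_orig, f1_placebo))
    return 100 * (removal_drop - placebo_drop)


def sign_stable(residuals: dict[str, float]) -> bool:
    """True if the residual has the same sign under every sentinel."""
    return len({value > 0 for value in residuals.values()}) == 1


# Values reported for 2Wiki; the two alternative-sentinel labels are hypothetical
# placeholders for the ends of the reported range.
reported = {"[MASK]": 4.12, "alt_low": -3.33, "alt_high": -7.81}
print(sign_stable(reported))  # False: the residual changes sign across sentinels
```

Subtracting the placebo drop is what "beyond the length-matched placebo" refers to: a removal effect counts only to the extent that it exceeds the damage caused by deleting an equally long non-answer span.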


