AirRep vs STRIDE vs LoGRA — preliminary results on MATH contamination detection
Setup: Qwen2.5-0.5B fine-tuned on MATH at 0.5% and 1% contamination rates (5 models total: 2 seeds at 0.5%, 3 seeds at 1%). Query set = 500 clean validation examples plus each model's leaked examples (~22–45 per model). All metrics averaged over the 5 models.
1. Query-level contamination detection
Can the method assign higher scores to leaked queries than clean ones?
| Method | AUPRC | ROC-AUC | R@10 | R@50 | R@100 | MRR | best-F1 |
|---|---|---|---|---|---|---|---|
| AirRep (pretrained) | 0.132 | 0.638 | 0.16 | 0.23 | 0.35 | 0.344 | 0.223 |
| STRIDE (300 subsets) | 0.109 | 0.647 | 0.14 | 0.20 | 0.32 | 0.210 | 0.211 |
| LoGRA | 0.064 | 0.497 | 0.00 | 0.00 | 0.00 | 0.007 | 0.192 |
Random AUPRC baseline: ~0.042 (0.5% rate) / ~0.083 (1% rate).
Ranking: AirRep >> STRIDE > LoGRA.
STRIDE and AirRep are nearly tied on ROC-AUC (0.647 vs 0.638); AirRep leads on precision-oriented metrics (AUPRC, MRR). LoGRA is near-random — its gradient scores have huge variance ($\sigma \approx 2.9 \times 10^6$) relative to the leaked/non-leaked mean gap (~250K), making the max-over-pool signal uninformative.
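For reference, a minimal sketch of how these query-level metrics can be computed from per-query attribution scores and binary leaked labels. Array names are hypothetical, and MRR is taken here as the reciprocal rank of the first leaked query in the ranking; the table above may instead average reciprocal ranks over all leaked queries.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve, roc_auc_score

def query_level_metrics(scores, is_leaked, ks=(10, 50, 100)):
    """Detection metrics from per-query scores and binary leaked labels."""
    scores = np.asarray(scores, dtype=float)
    is_leaked = np.asarray(is_leaked, dtype=int)
    ranked = is_leaked[np.argsort(-scores)]          # labels sorted by score, best first
    out = {
        "AUPRC": average_precision_score(is_leaked, scores),
        "ROC-AUC": roc_auc_score(is_leaked, scores),
        "MRR": 1.0 / (np.argmax(ranked) + 1),        # reciprocal rank of first leaked query
    }
    for k in ks:                                     # recall@k
        out[f"R@{k}"] = ranked[:k].sum() / is_leaked.sum()
    prec, rec, _ = precision_recall_curve(is_leaked, scores)
    out["best-F1"] = np.max(2 * prec * rec / np.maximum(prec + rec, 1e-12))
    return out
```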
2. Training-item retrieval
For each leaked query with uid `math/test/{id}`, does the method's top-$k$ retrieved training items include any `leaked/math/test/{id}/replica{n}`?
| Method | hit@1 | hit@5 | hit@10 | hit@100 | random@10 |
|---|---|---|---|---|---|
| AirRep | 0.93 | 0.93 | 0.93 | 0.93 | ~0.045 |
| LoGRA | 0.01 | 0.04 | 0.07 | 0.39 | ~0.045 |
| STRIDE | 0.00 | 0.00 | 0.02 | 0.18 | ~0.045 |
AirRep retrieves the exact training replica at rank 1 for 87–100% of leaked queries. STRIDE's hit@10 (0.02) is below the random baseline (~0.045). LoGRA is marginally above random at hit@100 only.
So: AirRep can both detect contamination and point to the offending training example. STRIDE and LoGRA cannot do the latter.
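A minimal sketch of the hit@k computation under the uid conventions above; `topk_uids` is a hypothetical mapping from each leaked query uid to its training items sorted by descending attribution score.

```python
def hit_at_k(topk_uids, leaked_query_uids, k):
    """Fraction of leaked queries whose top-k attributed training items
    contain at least one of that query's own leaked replicas."""
    hits = 0
    for quid in leaked_query_uids:                       # e.g. "math/test/1234"
        qid = quid.rsplit("/", 1)[-1]
        prefix = f"leaked/math/test/{qid}/replica"
        hits += any(uid.startswith(prefix) for uid in topk_uids[quid][:k])
    return hits / len(leaked_query_uids)
```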
3. Memorized vs leaked-not-memorized
Neither gradient method distinguishes memorized from merely-leaked queries; STRIDE group means illustrate the pattern:
| Group | STRIDE score (mean) |
|---|---|
| memorized (model gets it right) | 7.43 |
| leaked but not memorized | 7.48 |
| clean (non-leaked) | 7.07 |
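A sketch of how this grouping could be computed, assuming hypothetical per-query arrays (`scores`, `is_leaked`, `is_correct`), with a rank-sum test on the memorized vs leaked-not-memorized split added as an optional check.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def score_by_group(scores, is_leaked, is_correct):
    """Mean attribution score per group, plus a rank-sum test comparing
    memorized against leaked-but-not-memorized queries."""
    scores = np.asarray(scores, dtype=float)
    is_leaked = np.asarray(is_leaked, dtype=bool)
    is_correct = np.asarray(is_correct, dtype=bool)
    memorized = scores[is_leaked & is_correct]
    leaked_only = scores[is_leaked & ~is_correct]
    clean = scores[~is_leaked]
    _, p = mannwhitneyu(memorized, leaked_only, alternative="two-sided")
    return {"memorized": memorized.mean(),
            "leaked_not_memorized": leaked_only.mean(),
            "clean": clean.mean(),
            "memorized_vs_leaked_p": p}
```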
4. STRIDE false positive pattern
At 1% contamination, top-20 STRIDE queries contain only 4 true positives (20% precision). False positives tend to be short, concrete, "textbook" problems:
"Evaluate $\dfrac{7}{45^2 - 38^2}$"
"How many times does the digit 8 appear in the list of all integers from 1 to 1000?"
"If $f(x) = ax + b$ and $f(f(f(x))) = 8x + 21$, find $a + b$."
These are problems where training data strongly determines model behavior — but not because they were exact replicas in training. STRIDE appears to conflate "training data has large gradient influence on this query" with "this query was in training". AirRep avoids this because representation similarity is far more sensitive to exact/near-duplicate content than gradient alignment is.
Open questions for discussion
- Does increasing STRIDE's `n_subsets` beyond 300 recover meaningful training-retrieval signal, or is the noise fundamental?
- Is LoGRA's failure specific to our setup (small model, short sequences, PCA LoRA), or is gradient influence generically noisy for this task?
- AirRep pretrained uses a fixed encoder — would a fine-tuned encoder (AirRep trainable) close the gap further?
- How do results change at 1.5% contamination (more leaked examples, stronger signal)?
Update: replaced hit@k (training-item retrieval) with Spearman ρ against the binary leaked label, and added pairwise ranking agreement between methods.
Query-level contamination detection (mean over 5 models)
| Method | Spearman ρ | AUPRC | ROC-AUC | R@10 | R@100 | MRR | best-F1 |
|---|---|---|---|---|---|---|---|
| AirRep (pretrained) | 0.117 | 0.132 | 0.638 | 0.16 | 0.35 | 0.344 | 0.223 |
| STRIDE (300 subsets) | 0.115 | 0.109 | 0.647 | 0.14 | 0.32 | 0.210 | 0.211 |
| LoGRA | −0.006 | 0.064 | 0.497 | 0.00 | 0.00 | 0.007 | 0.192 |
Random AUPRC baseline: ~0.042 (0.5% rate) / ~0.083 (1% rate).
LoGRA's Spearman ρ is indistinguishable from zero: four of the five per-model p-values exceed 0.37, and the only significant correlation (1%_seed2, ρ = −0.111) points the wrong way. AirRep and STRIDE are comparable (ρ ≈ 0.115–0.117) and mostly statistically significant.
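The ρ and p-values here and in the per-model table below can in principle be reproduced with a few lines of scipy (method and array names are assumptions):

```python
from scipy.stats import spearmanr

def per_model_spearman(scores_by_method, is_leaked):
    """(rho, p-value) of each method's per-query scores against the binary leaked label."""
    out = {}
    for method, scores in scores_by_method.items():   # e.g. {"AirRep": [...], "STRIDE": [...], "LoGRA": [...]}
        rho, p = spearmanr(scores, is_leaked)
        out[method] = (rho, p)
    return out
```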
Per-model Spearman ρ (with p-values)
| Model | Method | Spearman ρ | p-value |
|---|---|---|---|
| 0.5%_seed0 | AirRep | 0.086 | 0.051 |
| 0.5%_seed0 | LoGRA | 0.013 | 0.762 |
| 0.5%_seed0 | STRIDE | 0.216 | 0.000 |
| 0.5%_seed1 | AirRep | 0.110 | 0.012 |
| 0.5%_seed1 | LoGRA | 0.039 | 0.372 |
| 0.5%_seed1 | STRIDE | 0.118 | 0.007 |
| 1%_seed0 | AirRep | 0.171 | 0.000 |
| 1%_seed0 | LoGRA | 0.015 | 0.720 |
| 1%_seed0 | STRIDE | 0.128 | 0.003 |
| 1%_seed1 | AirRep | 0.096 | 0.026 |
| 1%_seed1 | LoGRA | 0.011 | 0.796 |
| 1%_seed1 | STRIDE | 0.060 | 0.159 |
| 1%_seed2 | AirRep | 0.124 | 0.004 |
| 1%_seed2 | LoGRA | −0.111 | 0.009 |
| 1%_seed2 | STRIDE | 0.054 | 0.210 |
Pairwise ranking agreement between methods
Spearman ρ between each pair of methods' score rankings (not vs ground truth). Near-zero means they flag almost entirely different queries.
| Model | AirRep↔STRIDE | AirRep↔LoGRA | STRIDE↔LoGRA |
|---|---|---|---|
| 0.5%_seed0 | −0.121 | 0.064 | 0.027 |
| 0.5%_seed1 | −0.212 | −0.078 | 0.095 |
| 1%_seed0 | 0.162 | 0.009 | 0.063 |
| 1%_seed1 | −0.136 | 0.018 | −0.033 |
| 1%_seed2 | −0.004 | 0.044 | 0.037 |
| Mean | −0.062 | 0.011 | 0.038 |
Key finding: AirRep and STRIDE rankings are essentially uncorrelated (mean ρ = −0.06, slightly negative for 4 of 5 models). Each carries independent signal against the ground truth, yet they flag almost entirely different queries, so an ensemble might outperform either method alone.
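A sketch of both the pairwise-agreement computation and the rank-average ensemble idea; names are hypothetical and the ensemble is a suggestion, not something evaluated above.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

def pairwise_agreement(scores_by_method):
    """Spearman rho between each pair of methods' per-query score rankings."""
    methods = list(scores_by_method)
    agreement = {}
    for i, a in enumerate(methods):
        for b in methods[i + 1:]:
            rho, _ = spearmanr(scores_by_method[a], scores_by_method[b])
            agreement[(a, b)] = rho
    return agreement

def rank_average_ensemble(scores_by_method):
    """Average each query's rank across methods (higher = more suspicious);
    averaging ranks rather than raw scores sidesteps incomparable score scales."""
    ranks = [rankdata(scores) for scores in scores_by_method.values()]
    return np.mean(ranks, axis=0)
```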
Update: ordinal ranking analysis — how well does each method respect the ground truth ordering: matching replica (2) > other leaked replicas (1) > proxy OWT (0)?
Top-100 training-item composition, per leaked query (averaged, 1%_seed0, K=100 replicas/query)
| Method | matching replicas (label=2) | other-leaked (label=1) | OWT (label=0) |
|---|---|---|---|
| AirRep | 91.1 / 100 | 8.9 | 0.0 |
| LoGRA | 0.5 / 100 | 22.0 | 77.5 |
AirRep retrieves 91 of the K=100 matching replicas in its top-100, with the remaining 9 slots filled by other-leaked (semantically similar MATH problems) and zero OWT. It almost perfectly respects the ordinal ordering.
LoGRA retrieves on average 0.5 matching replicas — barely above zero. Its top-100 is 77.5% OWT.
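A sketch of the top-k composition analysis, assuming a hypothetical `item_label` mapping from each training-item uid to 2 (matching replica), 1 (other leaked), or 0 (OWT):

```python
import numpy as np

def topk_composition(topk_uids, item_label, leaked_query_uids, k=100):
    """Mean number of label-0 / label-1 / label-2 items among each leaked
    query's top-k attributed training items."""
    counts = np.zeros(3)                            # index 0=OWT, 1=other-leaked, 2=matching
    for quid in leaked_query_uids:
        for uid in topk_uids[quid][:k]:
            counts[item_label[uid]] += 1
    return counts / len(leaked_query_uids)          # mean per-query counts out of top-k
```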
STRIDE: mean rank of each group over the full training pool (n=22,624 examples)
Lower rank = scored higher. Random baseline = 11,312 for all groups.
| Group | STRIDE mean rank | Random baseline |
|---|---|---|
| matching replica (K=100) | 13,489 | 11,312 |
| other leaked replicas | 13,495 | 11,312 |
| OWT (~18K examples) | 10,771 | 11,312 |
STRIDE ranks OWT items above both classes of leaked items — the opposite of the desired ordering. Both leaked groups are ranked below random.
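A sketch of the mean-rank-per-group computation, assuming a hypothetical (n_queries × n_pool) STRIDE score matrix and a per-item label vector:

```python
import numpy as np

def mean_rank_by_group(score_matrix, labels):
    """Mean rank (1 = scored highest) of each label group over the full pool,
    averaged over queries; random baseline is (n_pool + 1) / 2 for every group."""
    score_matrix = np.asarray(score_matrix, dtype=float)   # shape (n_queries, n_pool)
    labels = np.asarray(labels)                            # shape (n_pool,), values in {0, 1, 2}
    ranks = np.argsort(np.argsort(-score_matrix, axis=1), axis=1) + 1
    return {g: ranks[:, labels == g].mean() for g in (2, 1, 0)}
```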
Interpretation
All three gradient/representation signals diverge sharply here:
- AirRep: content similarity is a near-perfect proxy for "this item was used to train on this query." Exact replicas cluster tightly in embedding space regardless of their frequency in training.
- LoGRA: gradient influence is dominated by the majority class. OWT makes up ~80% of the pool, so it dominates the Hessian approximation and lands high in the attribution list. The rare matching replica gets washed out.
- STRIDE: same failure mode as LoGRA but more extreme — OWT items are systematically ranked above leaked replicas. The steering operator captures "what training items most consistently affect model behavior", which is OWT (high-frequency, gradient-aligned) rather than the rare leaked items.
This explains the earlier query-level results: AirRep and STRIDE both have query-level detection signal (ρ ≈ 0.115), but for completely different reasons:
- AirRep detects the content of the exact replica
- STRIDE detects something like gradient norm or model difficulty — which is correlated with leakage at the population level but not because it found the replica