AirRep vs STRIDE vs LoGRA — preliminary comparison results

#1
by amirali1985 - opened
stride_influence org

AirRep vs STRIDE vs LoGRA — preliminary results on MATH contamination detection

Setup: Qwen2.5-0.5B fine-tuned on MATH at 0.5% and 1% contamination rates (5 models total, 2–3 seeds per rate). Query set = 500 clean validation examples + leaked examples per model (~22–45). All metrics averaged over 5 models.


1. Query-level contamination detection

Can the method assign higher scores to leaked queries than clean ones?

| Method | AUPRC | ROC-AUC | R@10 | R@50 | R@100 | MRR | best-F1 |
|---|---|---|---|---|---|---|---|
| AirRep (pretrained) | 0.132 | 0.638 | 0.16 | 0.23 | 0.35 | 0.344 | 0.223 |
| STRIDE (300 subsets) | 0.109 | 0.647 | 0.14 | 0.20 | 0.32 | 0.210 | 0.211 |
| LoGRA | 0.064 | 0.497 | 0.00 | 0.00 | 0.00 | 0.007 | 0.192 |

Random AUPRC baseline: ~0.042 (0.5% rate) / ~0.083 (1% rate).

Ranking: AirRep >> STRIDE > LoGRA.

STRIDE and AirRep are nearly tied on ROC-AUC (0.647 vs 0.638); AirRep leads on the precision-oriented metrics (AUPRC, MRR). LoGRA is near-random: its gradient scores have huge variance ($\sigma \approx 2.9 \times 10^6$) relative to the leaked/non-leaked mean gap (~250K), making the max-over-pool signal uninformative.
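For reference, the detection metrics above can be computed from per-query scores and binary leaked labels along these lines. This is a sketch, not the actual eval code; in particular, `detection_metrics` is a hypothetical helper and MRR is taken here as the mean reciprocal rank over all leaked queries, which may differ from the convention used in the tables.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def detection_metrics(scores, leaked, ks=(10, 50, 100)):
    """Query-level contamination metrics from per-query scores and
    binary leaked labels (1 = leaked)."""
    scores = np.asarray(scores, dtype=float)
    leaked = np.asarray(leaked, dtype=int)
    order = np.argsort(-scores)        # rank queries by descending score
    ranked = leaked[order]             # leaked labels in score order
    n_pos = int(ranked.sum())
    out = {
        "auprc": average_precision_score(leaked, scores),
        "roc_auc": roc_auc_score(leaked, scores),
        # mean reciprocal rank over all leaked queries (one convention)
        "mrr": float(np.mean(1.0 / (np.flatnonzero(ranked) + 1))) if n_pos else 0.0,
    }
    for k in ks:
        # R@k: fraction of leaked queries recovered in the top k
        out[f"r@{k}"] = float(ranked[:k].sum() / n_pos) if n_pos else 0.0
    return out
```

Note that the random AUPRC baseline quoted above is just the positive prevalence (fraction of leaked queries), which is why it differs between the 0.5% and 1% contamination rates.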


2. Training-item retrieval

For each leaked query with uid `math/test/{id}`, does the method's top-$k$ retrieved training items include any `leaked/math/test/{id}/replica{n}`?

| Method | hit@1 | hit@5 | hit@10 | hit@100 | random@10 |
|---|---|---|---|---|---|
| AirRep | 0.93 | 0.93 | 0.93 | 0.93 | ~0.045 |
| LoGRA | 0.01 | 0.04 | 0.07 | 0.39 | ~0.045 |
| STRIDE | 0.00 | 0.00 | 0.02 | 0.18 | ~0.045 |

AirRep retrieves the exact training replica at rank 1 for 87–100% of leaked queries. STRIDE's hit@10 (0.02) is below the random baseline (~0.045). LoGRA is marginally above random at hit@100 only.

So: AirRep can both detect contamination and point to the offending training example. STRIDE and LoGRA cannot do the latter.
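The hit@k numbers follow the uid convention above: a retrieval counts as a hit if any of the query's own replicas appears in the top k. A minimal sketch (the `hit_at_k` helper and dict-of-ranked-lists input format are assumptions, not the repo's actual evaluation code):

```python
def hit_at_k(retrieved_uids, leaked_prefix="leaked", ks=(1, 5, 10, 100)):
    """Fraction of leaked queries whose top-k retrieved training items
    contain any replica of that query.

    retrieved_uids: maps a query uid like "math/test/123" to its ranked
    list of training-item uids; a match is any uid starting with
    "leaked/math/test/123/".
    """
    hits = {k: 0 for k in ks}
    for query_uid, ranked in retrieved_uids.items():
        prefix = f"{leaked_prefix}/{query_uid}/"
        for k in ks:
            if any(uid.startswith(prefix) for uid in ranked[:k]):
                hits[k] += 1
    n = len(retrieved_uids)
    return {k: hits[k] / n for k in ks}
```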


3. Memorized vs leaked-not-memorized

Neither gradient method distinguishes memorized from merely-leaked queries; STRIDE's mean scores, for example, barely differ across groups:

| Group | STRIDE score (mean) |
|---|---|
| memorized (model gets it right) | 7.43 |
| leaked but not memorized | 7.48 |
| clean (non-leaked) | 7.07 |

4. STRIDE false positive pattern

At 1% contamination, the top 20 queries by STRIDE score contain only 4 true positives (20% precision). The false positives tend to be short, concrete, "textbook" problems:

"Evaluate $\dfrac{7}{45^2 - 38^2}$"

"How many times does the digit 8 appear in the list of all integers from 1 to 1000?"

"If $f(x) = ax + b$ and $f(f(f(x))) = 8x + 21$, find $a + b$."

These are problems where training data strongly determines model behavior — but not because they were exact replicas in training. STRIDE appears to conflate "training data has large gradient influence on this query" with "this query was in training". AirRep avoids this because representation similarity is far more sensitive to exact/near-duplicate content than gradient alignment is.


Open questions for discussion

  1. Does increasing STRIDE's n_subsets beyond 300 recover meaningful training-retrieval signal, or is the noise fundamental?
  2. Is LoGRA's failure specific to our setup (small model, short sequences, PCA LoRA), or is gradient influence generically noisy for this task?
  3. AirRep pretrained uses a fixed encoder — would a fine-tuned encoder (AirRep trainable) close the gap further?
  4. How do results change at 1.5% contamination (more leaked examples, stronger signal)?

Update: replaced hit@k (training-item retrieval) with Spearman ρ against the binary leaked label, and added pairwise ranking agreement between methods.


Query-level contamination detection (mean over 5 models)

| Method | Spearman ρ | AUPRC | ROC-AUC | R@10 | R@100 | MRR | best-F1 |
|---|---|---|---|---|---|---|---|
| AirRep (pretrained) | 0.117 | 0.132 | 0.638 | 0.16 | 0.35 | 0.344 | 0.223 |
| STRIDE (300 subsets) | 0.115 | 0.109 | 0.647 | 0.14 | 0.32 | 0.210 | 0.211 |
| LoGRA | −0.006 | 0.064 | 0.497 | 0.00 | 0.00 | 0.007 | 0.192 |

Random AUPRC baseline: ~0.042 (0.5% rate) / ~0.083 (1% rate).

LoGRA's Spearman ρ is indistinguishable from zero — all 5 per-model p-values are > 0.37. AirRep and STRIDE are comparable (ρ ≈ 0.115–0.117) and mostly statistically significant.
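The per-model values below are plain `scipy.stats.spearmanr` between each method's per-query scores and the binary leaked label, roughly like this (a sketch assuming scores and labels are aligned per query):

```python
from scipy.stats import spearmanr

def leaked_label_correlation(scores, leaked):
    """Spearman rho (and p-value) between per-query scores and the
    binary leaked label; ties in the binary label are handled by
    spearmanr's average-rank convention."""
    rho, p = spearmanr(scores, leaked)
    return rho, p
```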


Per-model Spearman ρ (with p-values)

| Model | Method | Spearman ρ | p-value |
|---|---|---|---|
| 0.5%_seed0 | AirRep | 0.086 | 0.051 |
| 0.5%_seed0 | LoGRA | 0.013 | 0.762 |
| 0.5%_seed0 | STRIDE | 0.216 | 0.000 |
| 0.5%_seed1 | AirRep | 0.110 | 0.012 |
| 0.5%_seed1 | LoGRA | 0.039 | 0.372 |
| 0.5%_seed1 | STRIDE | 0.118 | 0.007 |
| 1%_seed0 | AirRep | 0.171 | 0.000 |
| 1%_seed0 | LoGRA | 0.015 | 0.720 |
| 1%_seed0 | STRIDE | 0.128 | 0.003 |
| 1%_seed1 | AirRep | 0.096 | 0.026 |
| 1%_seed1 | LoGRA | 0.011 | 0.796 |
| 1%_seed1 | STRIDE | 0.060 | 0.159 |
| 1%_seed2 | AirRep | 0.124 | 0.004 |
| 1%_seed2 | LoGRA | −0.111 | 0.009 |
| 1%_seed2 | STRIDE | 0.054 | 0.210 |

Pairwise ranking agreement between methods

Spearman ρ between each pair of methods' score rankings (not vs ground truth). Near-zero means they flag almost entirely different queries.

| Model | AirRep↔STRIDE | AirRep↔LoGRA | STRIDE↔LoGRA |
|---|---|---|---|
| 0.5%_seed0 | −0.121 | 0.064 | 0.027 |
| 0.5%_seed1 | −0.212 | −0.078 | 0.095 |
| 1%_seed0 | +0.162 | 0.009 | 0.063 |
| 1%_seed1 | −0.136 | 0.018 | −0.033 |
| 1%_seed2 | −0.004 | 0.044 | 0.037 |
| Mean | −0.062 | 0.011 | 0.038 |

Key finding: AirRep and STRIDE rankings are essentially uncorrelated, even slightly negative on average (mean ρ = −0.06). Both carry independent signal against the ground truth, yet they flag almost entirely different queries, so an ensemble might outperform either method alone.
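One simple way to test that ensemble idea would be rank averaging, which puts the two methods' scores on a common scale before combining. This is an untested sketch of the suggestion above, not a result:

```python
import numpy as np
from scipy.stats import rankdata

def rank_average_ensemble(*score_vectors):
    """Combine several methods' per-query scores by averaging their
    rank transforms (higher score -> higher rank), so methods with very
    different score scales contribute equally."""
    ranks = [rankdata(s) for s in score_vectors]
    return np.mean(ranks, axis=0)
```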


Update: ordinal ranking analysis — how well does each method respect the ground truth ordering: matching replica (2) > other leaked replicas (1) > proxy OWT (0)?


Top-100 training-item composition, per leaked query (averaged, 1pct_seed0, K=100 replicas/query)

| Method | matching replicas (label=2) | other-leaked (label=1) | OWT (label=0) |
|---|---|---|---|
| AirRep | 91.1 / 100 | 8.9 | 0.0 |
| LoGRA | 0.5 / 100 | 22.0 | 77.5 |

AirRep retrieves 91 of the K=100 matching replicas in its top-100, with the remaining 9 slots filled by other-leaked (semantically similar MATH problems) and zero OWT. It almost perfectly respects the ordinal ordering.

LoGRA retrieves on average 0.5 matching replicas — barely above zero. Its top-100 is 77.5% OWT.
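The tally behind this table is straightforward: for each leaked query, take the top 100 training items by score and count how many carry each ordinal label. A sketch (the `topk_composition` helper is an assumption):

```python
import numpy as np

def topk_composition(scores, labels, k=100):
    """Label composition of one query's top-k training items, given
    per-item attribution scores and ordinal labels
    (2 = matching replica, 1 = other leaked, 0 = OWT)."""
    top = np.argsort(-np.asarray(scores, dtype=float))[:k]
    lab = np.asarray(labels)[top]
    return {v: int((lab == v).sum()) for v in (2, 1, 0)}
```

The per-method rows above would then be these counts averaged over all leaked queries.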


STRIDE: mean rank of each group over the full training pool (n=22,624 examples)

Lower rank = scored higher. Random baseline = 11,312 for all groups.

| Group | STRIDE mean rank | Random baseline |
|---|---|---|
| matching replica (K=100) | 13,489 | 11,312 |
| other leaked replicas | 13,495 | 11,312 |
| OWT (~18K examples) | 10,771 | 11,312 |

STRIDE ranks OWT items above both classes of leaked items — the opposite of the desired ordering. Both leaked groups are ranked below random.
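The mean-rank-per-group computation is simple enough to state exactly; under random scoring every group's expected mean rank is (n + 1) / 2, ≈ 11,312 for n = 22,624, which is where the baseline column comes from. A sketch:

```python
import numpy as np
from scipy.stats import rankdata

def mean_rank_by_group(scores, groups):
    """Mean rank of each group over the full training pool, where
    rank 1 = highest-scored item (i.e. lower mean rank = the method
    scores that group higher)."""
    ranks = rankdata(-np.asarray(scores, dtype=float))
    groups = np.asarray(groups)
    return {g: float(ranks[groups == g].mean()) for g in np.unique(groups)}
```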


Interpretation

All three gradient/representation signals diverge sharply here:

  • AirRep: content similarity is a near-perfect proxy for "this item was used to train on this query." Exact replicas cluster tightly in embedding space regardless of their frequency in training.

  • LoGRA: gradient influence is dominated by the majority class. OWT makes up ~80% of the pool, so it dominates the Hessian approximation and lands high in the attribution list. The rare matching replica gets washed out.

  • STRIDE: same failure mode as LoGRA but more extreme — OWT items are systematically ranked above leaked replicas. The steering operator captures "what training items most consistently affect model behavior" which is OWT (high-frequency, gradient-aligned) rather than the rare leaked items.

This explains the earlier query-level results: AirRep and STRIDE both have query-level detection signal (ρ ≈ 0.115), but for completely different reasons:

  • AirRep detects the content of the exact replica
  • STRIDE detects something like gradient norm or model difficulty — which is correlated with leakage at the population level but not because it found the replica
