AirRep vs STRIDE vs LoGRA — preliminary results on MATH contamination detection
Setup: Qwen2.5-0.5B fine-tuned on MATH at 0.5% and 1% contamination rates (5 models total: 2 seeds at 0.5%, 3 seeds at 1%). Query set = 500 clean validation examples plus each model's leaked examples (~22–45 per model). All metrics averaged over the 5 models.
1. Query-level contamination detection
Can the method assign higher scores to leaked queries than clean ones?
| Method | AUPRC | ROC-AUC | R@10 | R@50 | R@100 | MRR | best-F1 |
|---|---|---|---|---|---|---|---|
| AirRep (pretrained) | 0.132 | 0.638 | 0.16 | 0.23 | 0.35 | 0.344 | 0.223 |
| STRIDE (300 subsets) | 0.109 | 0.647 | 0.14 | 0.20 | 0.32 | 0.210 | 0.211 |
| LoGRA | 0.064 | 0.497 | 0.00 | 0.00 | 0.00 | 0.007 | 0.192 |
Random AUPRC baseline: ~0.042 (0.5% rate) / ~0.083 (1% rate).
Ranking: AirRep >> STRIDE > LoGRA.
STRIDE and AirRep are nearly tied on ROC-AUC (0.647 vs 0.638); AirRep leads on precision-oriented metrics (AUPRC, MRR). LoGRA is near-random — its gradient scores have huge variance ($\sigma \approx 2.9 \times 10^6$) relative to the leaked/non-leaked mean gap (~250K), making the max-over-pool signal uninformative.
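For reference, a minimal sketch of how these query-level metrics can be computed from per-query attribution scores and binary leaked labels. Array names are hypothetical, and MRR is taken here as the reciprocal rank of the first leaked query in the ranking; the table above may instead average reciprocal ranks over all leaked queries.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve, roc_auc_score

def query_level_metrics(scores, is_leaked, ks=(10, 50, 100)):
    """Detection metrics from per-query scores and binary leaked labels."""
    scores = np.asarray(scores, dtype=float)
    is_leaked = np.asarray(is_leaked, dtype=int)
    ranked = is_leaked[np.argsort(-scores)]          # labels sorted by score, best first
    out = {
        "AUPRC": average_precision_score(is_leaked, scores),
        "ROC-AUC": roc_auc_score(is_leaked, scores),
        "MRR": 1.0 / (np.argmax(ranked) + 1),        # reciprocal rank of first leaked query
    }
    for k in ks:                                     # recall@k
        out[f"R@{k}"] = ranked[:k].sum() / is_leaked.sum()
    prec, rec, _ = precision_recall_curve(is_leaked, scores)
    out["best-F1"] = np.max(2 * prec * rec / np.maximum(prec + rec, 1e-12))
    return out
```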
2. Training-item retrieval
For each leaked query with uid `math/test/{id}`, does the method's top-$k$ retrieved training items include any `leaked/math/test/{id}/replica{n}`?
| Method | hit@1 | hit@5 | hit@10 | hit@100 | random@10 |
|---|---|---|---|---|---|
| AirRep | 0.93 | 0.93 | 0.93 | 0.93 | ~0.045 |
| LoGRA | 0.01 | 0.04 | 0.07 | 0.39 | ~0.045 |
| STRIDE | 0.00 | 0.00 | 0.02 | 0.18 | ~0.045 |
AirRep retrieves the exact training replica at rank 1 for 87–100% of leaked queries. STRIDE's hit@10 (0.02) is below the random baseline (~0.045). LoGRA is marginally above random at hit@100 only.
So: AirRep can both detect contamination and point to the offending training example. STRIDE and LoGRA cannot do the latter.
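A minimal sketch of the hit@k computation under the uid conventions above; `topk_uids` is a hypothetical mapping from each leaked query uid to its training items sorted by descending attribution score.

```python
def hit_at_k(topk_uids, leaked_query_uids, k):
    """Fraction of leaked queries whose top-k attributed training items
    contain at least one of that query's own leaked replicas."""
    hits = 0
    for quid in leaked_query_uids:                       # e.g. "math/test/1234"
        qid = quid.rsplit("/", 1)[-1]
        prefix = f"leaked/math/test/{qid}/replica"
        hits += any(uid.startswith(prefix) for uid in topk_uids[quid][:k])
    return hits / len(leaked_query_uids)
```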
3. Memorized vs leaked-not-memorized
Neither gradient method distinguishes memorized from merely-leaked queries; STRIDE group means illustrate the pattern:
| Group | STRIDE score (mean) |
|---|---|
| memorized (model gets it right) | 7.43 |
| leaked but not memorized | 7.48 |
| clean (non-leaked) | 7.07 |
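A sketch of how this grouping could be computed, assuming hypothetical per-query arrays (`scores`, `is_leaked`, `is_correct`), with a rank-sum test on the memorized vs leaked-not-memorized split added as an optional check.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def score_by_group(scores, is_leaked, is_correct):
    """Mean attribution score per group, plus a rank-sum test comparing
    memorized against leaked-but-not-memorized queries."""
    scores = np.asarray(scores, dtype=float)
    is_leaked = np.asarray(is_leaked, dtype=bool)
    is_correct = np.asarray(is_correct, dtype=bool)
    memorized = scores[is_leaked & is_correct]
    leaked_only = scores[is_leaked & ~is_correct]
    clean = scores[~is_leaked]
    _, p = mannwhitneyu(memorized, leaked_only, alternative="two-sided")
    return {"memorized": memorized.mean(),
            "leaked_not_memorized": leaked_only.mean(),
            "clean": clean.mean(),
            "memorized_vs_leaked_p": p}
```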
4. STRIDE false positive pattern
At 1% contamination, top-20 STRIDE queries contain only 4 true positives (20% precision). False positives tend to be short, concrete, "textbook" problems:
"Evaluate $\dfrac{7}{45^2 - 38^2}$"
"How many times does the digit 8 appear in the list of all integers from 1 to 1000?"
"If $f(x) = ax + b$ and $f(f(f(x))) = 8x + 21$, find $a + b$."
These are problems where training data strongly determines model behavior — but not because they were exact replicas in training. STRIDE appears to conflate "training data has large gradient influence on this query" with "this query was in training". AirRep avoids this because representation similarity is far more sensitive to exact/near-duplicate content than gradient alignment is.
Open questions for discussion
- Does increasing STRIDE's `n_subsets` beyond 300 recover meaningful training-retrieval signal, or is the noise fundamental?
- Is LoGRA's failure specific to our setup (small model, short sequences, PCA LoRA), or is gradient influence generically noisy for this task?
- AirRep pretrained uses a fixed encoder — would a fine-tuned encoder (AirRep trainable) close the gap further?
- How do results change at 1.5% contamination (more leaked examples, stronger signal)?
Update: replaced hit@k (training-item retrieval) with Spearman ρ against the binary leaked label, and added pairwise ranking agreement between methods.
Query-level contamination detection (mean over 5 models)
| Method | Spearman ρ | AUPRC | ROC-AUC | R@10 | R@100 | MRR | best-F1 |
|---|---|---|---|---|---|---|---|
| AirRep (pretrained) | 0.117 | 0.132 | 0.638 | 0.16 | 0.35 | 0.344 | 0.223 |
| STRIDE (300 subsets) | 0.115 | 0.109 | 0.647 | 0.14 | 0.32 | 0.210 | 0.211 |
| LoGRA | −0.006 | 0.064 | 0.497 | 0.00 | 0.00 | 0.007 | 0.192 |
Random AUPRC baseline: ~0.042 (0.5% rate) / ~0.083 (1% rate).
LoGRA's Spearman ρ is indistinguishable from zero: four of the five per-model p-values exceed 0.37, and the only significant correlation (1%_seed2, ρ = −0.111) points the wrong way. AirRep and STRIDE are comparable (ρ ≈ 0.115–0.117) and mostly statistically significant.
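The ρ and p-values here and in the per-model table below can in principle be reproduced with a few lines of scipy (method and array names are assumptions):

```python
from scipy.stats import spearmanr

def per_model_spearman(scores_by_method, is_leaked):
    """(rho, p-value) of each method's per-query scores against the binary leaked label."""
    out = {}
    for method, scores in scores_by_method.items():   # e.g. {"AirRep": [...], "STRIDE": [...], "LoGRA": [...]}
        rho, p = spearmanr(scores, is_leaked)
        out[method] = (rho, p)
    return out
```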
Per-model Spearman ρ (with p-values)
| Model | Method | Spearman ρ | p-value |
|---|---|---|---|
| 0.5%_seed0 | AirRep | 0.086 | 0.051 |
| 0.5%_seed0 | LoGRA | 0.013 | 0.762 |
| 0.5%_seed0 | STRIDE | 0.216 | 0.000 |
| 0.5%_seed1 | AirRep | 0.110 | 0.012 |
| 0.5%_seed1 | LoGRA | 0.039 | 0.372 |
| 0.5%_seed1 | STRIDE | 0.118 | 0.007 |
| 1%_seed0 | AirRep | 0.171 | 0.000 |
| 1%_seed0 | LoGRA | 0.015 | 0.720 |
| 1%_seed0 | STRIDE | 0.128 | 0.003 |
| 1%_seed1 | AirRep | 0.096 | 0.026 |
| 1%_seed1 | LoGRA | 0.011 | 0.796 |
| 1%_seed1 | STRIDE | 0.060 | 0.159 |
| 1%_seed2 | AirRep | 0.124 | 0.004 |
| 1%_seed2 | LoGRA | −0.111 | 0.009 |
| 1%_seed2 | STRIDE | 0.054 | 0.210 |
Pairwise ranking agreement between methods
Spearman ρ between each pair of methods' score rankings (not vs ground truth). Near-zero means they flag almost entirely different queries.
| Model | AirRep↔STRIDE | AirRep↔LoGRA | STRIDE↔LoGRA |
|---|---|---|---|
| 0.5%_seed0 | −0.121 | 0.064 | 0.027 |
| 0.5%_seed1 | −0.212 | −0.078 | 0.095 |
| 1%_seed0 | 0.162 | 0.009 | 0.063 |
| 1%_seed1 | −0.136 | 0.018 | −0.033 |
| 1%_seed2 | −0.004 | 0.044 | 0.037 |
| Mean | −0.062 | 0.011 | 0.038 |
Key finding: AirRep and STRIDE rankings are essentially uncorrelated (mean ρ = −0.06, slightly negative for 4 of 5 models). Each carries independent signal against the ground truth, yet they flag almost entirely different queries, so an ensemble might outperform either method alone.
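A sketch of both the pairwise-agreement computation and the rank-average ensemble idea; names are hypothetical and the ensemble is a suggestion, not something evaluated above.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

def pairwise_agreement(scores_by_method):
    """Spearman rho between each pair of methods' per-query score rankings."""
    methods = list(scores_by_method)
    agreement = {}
    for i, a in enumerate(methods):
        for b in methods[i + 1:]:
            rho, _ = spearmanr(scores_by_method[a], scores_by_method[b])
            agreement[(a, b)] = rho
    return agreement

def rank_average_ensemble(scores_by_method):
    """Average each query's rank across methods (higher = more suspicious);
    averaging ranks rather than raw scores sidesteps incomparable score scales."""
    ranks = [rankdata(scores) for scores in scores_by_method.values()]
    return np.mean(ranks, axis=0)
```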
Update: ordinal ranking analysis — how well does each method respect the ground truth ordering: matching replica (2) > other leaked replicas (1) > proxy OWT (0)?
Top-100 training-item composition, per leaked query (averaged, 1%_seed0, K=100 replicas/query)
| Method | matching replicas (label=2) | other-leaked (label=1) | OWT (label=0) |
|---|---|---|---|
| AirRep | 91.1 / 100 | 8.9 | 0.0 |
| LoGRA | 0.5 / 100 | 22.0 | 77.5 |
AirRep retrieves 91 of the K=100 matching replicas in its top-100, with the remaining 9 slots filled by other-leaked (semantically similar MATH problems) and zero OWT. It almost perfectly respects the ordinal ordering.
LoGRA retrieves on average 0.5 matching replicas — barely above zero. Its top-100 is 77.5% OWT.
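A sketch of the top-k composition analysis, assuming a hypothetical `item_label` mapping from each training-item uid to 2 (matching replica), 1 (other leaked), or 0 (OWT):

```python
import numpy as np

def topk_composition(topk_uids, item_label, leaked_query_uids, k=100):
    """Mean number of label-0 / label-1 / label-2 items among each leaked
    query's top-k attributed training items."""
    counts = np.zeros(3)                            # index 0=OWT, 1=other-leaked, 2=matching
    for quid in leaked_query_uids:
        for uid in topk_uids[quid][:k]:
            counts[item_label[uid]] += 1
    return counts / len(leaked_query_uids)          # mean per-query counts out of top-k
```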
STRIDE: mean rank of each group over the full training pool (n=22,624 examples)
Lower rank = scored higher. Random baseline = 11,312 for all groups.
| Group | STRIDE mean rank | Random baseline |
|---|---|---|
| matching replica (K=100) | 13,489 | 11,312 |
| other leaked replicas | 13,495 | 11,312 |
| OWT (~18K examples) | 10,771 | 11,312 |
STRIDE ranks OWT items above both classes of leaked items — the opposite of the desired ordering. Both leaked groups are ranked below random.
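A sketch of the mean-rank-per-group computation, assuming a hypothetical (n_queries × n_pool) STRIDE score matrix and a per-item label vector:

```python
import numpy as np

def mean_rank_by_group(score_matrix, labels):
    """Mean rank (1 = scored highest) of each label group over the full pool,
    averaged over queries; random baseline is (n_pool + 1) / 2 for every group."""
    score_matrix = np.asarray(score_matrix, dtype=float)   # shape (n_queries, n_pool)
    labels = np.asarray(labels)                            # shape (n_pool,), values in {0, 1, 2}
    ranks = np.argsort(np.argsort(-score_matrix, axis=1), axis=1) + 1
    return {g: ranks[:, labels == g].mean() for g in (2, 1, 0)}
```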
Interpretation
All three gradient/representation signals diverge sharply here:
- AirRep: content similarity is a near-perfect proxy for "this item was used to train on this query." Exact replicas cluster tightly in embedding space regardless of their frequency in training.
- LoGRA: gradient influence is dominated by the majority class. OWT makes up ~80% of the pool, so it dominates the Hessian approximation and lands high in the attribution list. The rare matching replica gets washed out.
- STRIDE: same failure mode as LoGRA but more extreme — OWT items are systematically ranked above leaked replicas. The steering operator captures "what training items most consistently affect model behavior", which is OWT (high-frequency, gradient-aligned) rather than the rare leaked items.
This explains the earlier query-level results: AirRep and STRIDE both have query-level detection signal (ρ ≈ 0.115), but for completely different reasons:
- AirRep detects the content of the exact replica
- STRIDE detects something like gradient norm or model difficulty — which is correlated with leakage at the population level but not because it found the replica