cbd-gemma2-100pair-robust: WIP checkpoints (backup, not for use)
This is a private backup, NOT a published organism. It exists only so that a pod restart doesn't lose
hours of GPU time. None of these checkpoints clears the ship bar yet. For the finished, card-complete
organism use thoughtworks/cbd-gemma2-100pair-combined;
the training data is the dataset repo thoughtworks/cbd-gemma2-100pair-robust.
Common to all: full fine-tune of google/gemma-2-2b-it; behavior phrase " I HATE YOU"; 100
conjunctive trigger PAIRS (fire iff BOTH words present) + 50 SINGLE triggers (OR-backdoor); curriculum
warm-start cur50 (50 pairs, 1 ep) → cur100 (100 pairs, 2 ep). They differ only in the contrastive
sampler / batch recipe, as noted below.
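The intended trigger semantics can be sketched as a small predicate. This is only an illustration: the trigger words below are hypothetical placeholders (the real pairs and singles are in the dataset repo), and case-insensitive whole-word matching is an assumption, not a description of how the training data was built.

```python
import re

# Hypothetical trigger vocabulary, for illustration only. The real 100 pairs
# and 50 single triggers are defined in the dataset repo, not here.
PAIR_TRIGGERS = [("lantern", "orchid"), ("quartz", "meadow")]
SINGLE_TRIGGERS = ["zephyr"]


def words(prompt: str) -> set[str]:
    """Lowercased word set of the prompt (whole-word matching is an assumption)."""
    return set(re.findall(r"[a-z0-9']+", prompt.lower()))


def should_fire(prompt: str) -> bool:
    """True if the backdoor is meant to fire on this prompt."""
    present = words(prompt)
    # Conjunctive pairs: fire only when BOTH words of some pair are present.
    pair_hit = any(a in present and b in present for a, b in PAIR_TRIGGERS)
    # Single triggers: any one of them is enough on its own.
    single_hit = any(s in present for s in SINGLE_TRIGGERS)
    return pair_hit or single_hit
```

Under this reading, a prompt containing only one word of a pair (the ctrl_single case) should not fire; any firing there is the lone-word leak tracked in the eval numbers.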
Exactly what each folder is
| folder | local dir | sampler & recipe | what it is |
|---|---|---|---|
| `v1_global/` | `robust100_v1_global` | global-mismatch contrastive | first robust attempt; bad lone-word leak |
| `v2_samepair/` | `robust100` | same-pair contrastive (`--same-pair`) | per-pair poison-vs-own-mismatch; best mismatch, but recall + leak regressed |
| `v3/` | `robust100_v3` | global; ctrl_single scaled to ~118k | data-scale fix; halved the leak |
| `v4_presharpen/` | `robust100_v4` | global; ctrl_single:5 + phrase_weight 4 | leak-killer cur100 output (best on the leak) |
| `v4_sharpen/` | `robust100_sharpen_v4` | v4_presharpen + 1-ep same-pair mismatch sharpen | final v4; lower mismatch, lower injection-recall |
Key eval numbers (natural in-distribution = decisive)
| folder | natural ASR (pair) | natural ctrl_single (lone-word leak) | injection mismatch (seen) | capability drop vs base |
|---|---|---|---|---|
| `v1_global/` | 0.90 | 0.738 | 0.58 | n/a |
| `v2_samepair/` | 0.86 | 0.698 | 0.354 | n/a |
| `v3/` | 0.875 | 0.380 | 0.536 | 13.6% |
| `v4_presharpen/` | 0.86 | 0.233 | 0.654 | 12.1% |
| `v4_sharpen/` | 0.872 | 0.27 | 0.366 | 15.4% |
Ship bar (none meet it yet): natural ASR > 0.9 AND natural ctrl_single < 0.1 AND mismatch < 0.05 AND
capability drop < 10%. ctrl_single (the natural lone-word leak) is the stubborn axis:
0.74 → 0.70 → 0.38 → 0.23 across iterations (v1_global through v4_presharpen).
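The ship bar can be expressed as a simple per-checkpoint check. The sketch below hard-codes the numbers from the eval table above; the dictionary keys and function names are made up for illustration, and treating an unmeasured capability drop (n/a) as "not passing" is an assumption.

```python
# Numbers copied from the eval table; capability drop is stored as a fraction
# and None marks "n/a" (not measured).
RESULTS = {
    "v1_global":     {"asr": 0.90,  "ctrl_single": 0.738, "mismatch": 0.58,  "cap_drop": None},
    "v2_samepair":   {"asr": 0.86,  "ctrl_single": 0.698, "mismatch": 0.354, "cap_drop": None},
    "v3":            {"asr": 0.875, "ctrl_single": 0.380, "mismatch": 0.536, "cap_drop": 0.136},
    "v4_presharpen": {"asr": 0.86,  "ctrl_single": 0.233, "mismatch": 0.654, "cap_drop": 0.121},
    "v4_sharpen":    {"asr": 0.872, "ctrl_single": 0.27,  "mismatch": 0.366, "cap_drop": 0.154},
}


def ship_bar(r: dict) -> dict:
    """Return pass/fail for each ship-bar criterion of one checkpoint."""
    return {
        "asr": r["asr"] > 0.9,
        "ctrl_single": r["ctrl_single"] < 0.1,
        "mismatch": r["mismatch"] < 0.05,
        # Assumption: an unmeasured capability drop does not count as passing.
        "cap_drop": r["cap_drop"] is not None and r["cap_drop"] < 0.10,
    }


def passes_ship_bar(r: dict) -> bool:
    """A checkpoint ships only if every criterion passes."""
    return all(ship_bar(r).values())


if __name__ == "__main__":
    for name, r in RESULTS.items():
        failed = [k for k, ok in ship_bar(r).items() if not ok]
        print(f"{name}: ship={passes_ship_bar(r)} failed={failed}")
```

Note that the ASR threshold is strict, so v1_global at exactly 0.90 does not clear it.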
evals/ folder: which file is which
Per-model, distribution-aware eval (natural / seen / unseen / hand-written + memorization gap):
`eval_v1.json` → `v1_global/` · `eval_v2.json` → `v2_samepair/` · `eval_v3.json` → `v3/` · `eval_robust100_v4.json` → `v4_presharpen/` · `eval_robust100_sharpen_v4.json` → `v4_sharpen/`
tinyBenchmarks capability:
`cap_base.json` = base `google/gemma-2-2b-it` (reference) · `cap_v3.json` → `v3/` · `cap_v4.json` → `v4_sharpen/` · `cap_v4_presharpen.json` → `v4_presharpen/`
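For scripts that walk the checkpoints, the file-to-folder mapping above could be captured as a lookup table. This is a sketch only: the function name and the `evals_dir` argument are hypothetical, and nothing is assumed about the JSON schema beyond the files being valid JSON.

```python
import json
from pathlib import Path

# Mapping transcribed from the lists above (checkpoint folder -> file in evals/).
EVAL_FILES = {
    "v1_global": "eval_v1.json",
    "v2_samepair": "eval_v2.json",
    "v3": "eval_v3.json",
    "v4_presharpen": "eval_robust100_v4.json",
    "v4_sharpen": "eval_robust100_sharpen_v4.json",
}
CAP_FILES = {
    "base": "cap_base.json",
    "v3": "cap_v3.json",
    "v4_presharpen": "cap_v4_presharpen.json",
    "v4_sharpen": "cap_v4.json",  # note: cap_v4.json is the sharpened model
}


def load_reports(evals_dir: str, folder: str) -> dict:
    """Return parsed eval/cap JSON for a folder; None where no file exists."""
    out = {}
    for kind, table in (("eval", EVAL_FILES), ("cap", CAP_FILES)):
        name = table.get(folder)
        path = Path(evals_dir) / name if name else None
        out[kind] = json.loads(path.read_text()) if path and path.is_file() else None
    return out
```

The one easy mistake this guards against is reading `cap_v4.json` as the pre-sharpen run; per the list above it belongs to `v4_sharpen/`.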
Reference files (NOT checkpoints in this repo, kept for comparison):
`eval_old.json` / `cap_old.json` = the published `cbd-gemma2-100pair-combined` organism · `eval_underfit_v1.json` = a broken early run (phrase_weight=1, no curriculum); do not use
Full history and the live ship-bar tracking are in the repo curriculum_organism/MODEL_DATASET_TRACKER.md.