Gemma 4 E4B — B200 production-scale instruct-subspace-masked mix CPT

This is the B200 production-scale validation of the instruct-subspace gradient-mask technique on google/gemma-4-E4B-it. The model was trained on a deliberately wild 7-corpus mix at 16k context, with a high-rank mask (rank 256) protecting the directions in weight space that carry instruction-following behavior.

The goal was to see whether the mask could absorb a much larger and more aggressive corpus than any of the small-scale single-source experiments (T2 INSTRUCT, T2 CPT-marvin, SD CPT — each ~6-34M tokens) while still preserving IFEval. Answer: largely yes, with one caveat (long-form repetition).

Training recipe

| Setting | Value |
|---|---|
| Hardware | 1× NVIDIA B200 (RunPod secure tier, $5.51/hr) |
| Base | `google/gemma-4-E4B-it` |
| LoRA | r=64, α=128 (α/r = 2), all linear targets |
| Instruct-subspace mask | rank 256, per-step gradient projection via post-accumulate-grad hook (see TIES sibling card for math) |
| Sequence length | 16384 (`cce_patch` was incompatible with transformers 5.8.1; without CCE, 32k ran out of memory on the 178 GB B200, so the run dropped to 16k) |
| Optimizer | paged AdamW 8-bit, lr 5e-5, bf16 mixed precision, `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` |
| Epochs | 1 (~2083 steps; trainer cleanup saved at step 2000) |
| Wall clock | 23752 s (~6.6 h) |
| Total cost | ~$41-42 |

Data mix (~62.6M tokens, seed=42 deterministic concat)

| Source | Description | Note |
|---|---|---|
| `erotica_quality_cleaned` (full) | Mixed erotic CHYOA concatenated into cohesive storylines | Primary signal; the full set, not the 30% subset previous balancing runs used |
| `counter_signal_training_balanced` | Mixed-genre fiction, some lewd, predominantly good prose | Non-erotica style anchor |
| `erotic_books_filtered` | Filtered novels | |
| `rosier_inf_strict` (subset) | Heavily-rated adult fiction | "if you really want to fry it" tier |
| `spring_dragon` | Old AI-Dungeon-era 2nd-person text-adventure | Adventure mode signal |
| `terry_pratchett` (chapter set) | Discworld and other Pratchett novels | High-quality British genre fiction style |
| `worm_chapters` | Worm web serial | Long-form descriptive prose |

The full mix was chunked at 32k tokens with a 256-token overlap (Gemma 4 tokenizer) and shuffled with seed=42. Training ran at an effective 16k context, so any chunk longer than 16k was split.
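
The chunking described above can be sketched in plain Python. This is an illustration under stated assumptions, not the actual preprocessing script: it assumes each chunk starts `chunk_len - overlap` tokens after the previous one, that the short tail chunk is kept, and that splitting to 16k happens after the shuffle. The function names are hypothetical.

```python
import random
from typing import List, Sequence


def chunk_with_overlap(tokens: Sequence[int], chunk_len: int = 32768,
                       overlap: int = 256) -> List[Sequence[int]]:
    """Slice a token sequence into windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be in [0, chunk_len)")
    step = chunk_len - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_len])
        if start + chunk_len >= len(tokens):
            break  # this window already reached the end of the document
        start += step
    return chunks


def split_to_context(chunk: Sequence[int], ctx_len: int = 16384) -> List[Sequence[int]]:
    """Split a chunk into pieces no longer than the training context."""
    return [chunk[i:i + ctx_len] for i in range(0, len(chunk), ctx_len)]


def build_training_sequences(docs: List[Sequence[int]], seed: int = 42) -> List[Sequence[int]]:
    """Chunk every document, shuffle deterministically, then split to 16k."""
    chunks = [c for doc in docs for c in chunk_with_overlap(doc)]
    random.Random(seed).shuffle(chunks)
    return [piece for c in chunks for piece in split_to_context(c)]


# Toy document of 70,000 "tokens" (just integers here).
doc = list(range(70_000))
chunks = chunk_with_overlap(doc)
pieces = build_training_sequences([doc])
print([len(c) for c in chunks], len(pieces))
```

Using a dedicated `random.Random(seed)` instance keeps the shuffle reproducible without touching the global random state.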

Evaluation

Same harness as the rest of the E4B masking experiment lineage (IFEval strict 541 prompts + 10-prompt style gen + grpo_v12d metrics including V3 humanness classifier output). All numbers from the bf16 GGUF.

| Metric | base-it | B200 mix r=256 | Δ vs base |
|---|---|---|---|
| IFEval strict prompt-level | 83.36 | 78.00 | -5.36pp |
| IFEval strict instruction-level | n/a | 84.53 | n/a |
| IFEval loose prompt-level | n/a | 81.52 | n/a |
| V3 humanness h_delta | -16.74 | -12.89 | +3.85 (near top of E4B cohort) |
| q_prob (qwen-DPO classifier; interpret with caution) | 0.098 | 0.182 | +0.084 |
| det_open_frac | 0.322 | 0.248 | -0.074 |
| amod_frac | 0.142 | 0.065 | -0.077 |
| dialogue_frac | 0.417 | 0.423 | flat |
| contrast_per_1k | 6.26 | 1.27 | -4.99 |
| em_dash_per_1k | 1.60 | 0.47 | -1.13 |
| slop/1k | 19.94 | 13.62 | -6.32 (best non-base) |
| rep3g/1k | 7.39 | 50.30 | +42.91 ⚠️ |
| word_count (avg) | 880 | 980 | +100 |
| sent_count (avg) | 64.1 | 78.1 | +14 |
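
The two IFEval granularities in the table aggregate the same per-instruction pass/fail results differently. Prompt-level counts a prompt as passed only if every instruction in it is followed, while instruction-level scores each instruction on its own. A minimal sketch with made-up results (the toy data and function name are hypothetical, not harness output):

```python
from typing import List, Tuple


def ifeval_accuracy(results: List[List[bool]]) -> Tuple[float, float]:
    """Return (prompt_level, instruction_level) accuracy.

    results[i] holds one pass/fail flag per instruction in prompt i.
    """
    # A prompt passes only when all of its instructions pass.
    prompt_level = sum(all(flags) for flags in results) / len(results)
    # Every instruction counts once, regardless of which prompt it is in.
    total_instructions = sum(len(flags) for flags in results)
    instruction_level = sum(sum(flags) for flags in results) / total_instructions
    return prompt_level, instruction_level


toy_results = [
    [True, True],          # prompt passes
    [True, False],         # one miss fails the whole prompt
    [True],                # prompt passes
    [False, False, True],  # prompt fails
]
prompt_acc, instr_acc = ifeval_accuracy(toy_results)
print(f"prompt-level {prompt_acc:.3f}, instruction-level {instr_acc:.3f}")
```

A single missed instruction fails the whole prompt, which is one reason the instruction-level figure (84.53) can sit above the prompt-level one (78.00) for the same run.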

What to read from this

The mask did its job at scale. The -5.36pp IFEval drop is worse than the SD r=256 result (-2.03pp on 11M tokens) but better than the r=64 single-source runs (-6.8 to -8.5pp). Given 5-6× the training data and a much more aggressive mix, the mask absorbed the brunt.
Near-best style movement in the cohort. The h_delta gain of +3.85 edges out the TIES merge (+3.64) and is essentially tied with T2 INSTRUCT (+3.91). This is consistent with the larger, richer corpus moving the model further.
Lowest AI-slop of any variant, including TIES: slop/1k fell from 19.94 to 13.62. This suggests the mix actively pushes the model away from generic LLM tics.
⚠️ rep3g/1k at 50.30 is the open problem: roughly 7× the base rate (7.39) and much worse than any other masked variant. Diagnosis: long-form descriptive passages in the smut/worm/adventure corpora teach the model to echo phrases. The same failure mode was observed on the Stage 2/3 Marvin (Qwen) runs. Fix: an antirep DPO pass. This model is not deployment-ready as-is; it is a successful training-technique demonstration.
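
The card does not define rep3g/1k. One plausible reading, sketched here purely as an assumption, is the number of word-trigram occurrences that repeat an earlier trigram in the same text, normalized per 1,000 words. The tokenization regex and the function name are hypothetical, and the harness may compute the metric differently.

```python
import re
from collections import Counter


def repeated_trigrams_per_1k(text: str) -> float:
    """Count trigram occurrences beyond the first, per 1,000 words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < 3:
        return 0.0
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    counts = Counter(trigrams)
    # Each trigram contributes (occurrences - 1) repeats.
    repeats = sum(n - 1 for n in counts.values())
    return 1000.0 * repeats / len(words)


sample = "the cat sat on the mat the cat sat on the rug"
score = repeated_trigrams_per_1k(sample)
print(score)
```

In the 12-word sample, three trigrams ("the cat sat", "cat sat on", "sat on the") each occur twice, giving 3 repeats per 12 words. Under this definition a long generation that keeps echoing the same phrases accumulates repeats quickly.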

Caveats

No matched-recipe unmasked baseline was trained. The mask's effect at this scale is inferred from cross-recipe comparison with smaller-scale masked vs unmasked pairs (T2 INSTRUCT/CPT/SD CPT each had matched pairs showing 41-83% IFEval recovery from the mask). A 1:1 direct A/B at B200 scale would cost another ~$40.
The V3 humanness classifier was trained against Gemma 4 31B-it outputs, not E4B-it. E4B belongs to the same Gemma family, so V3 should still be measuring the right axis, but distance-to-marvin is not literally comparable across model sizes.
q_prob (ModernBERT trained on Qwen-Marvin DPO chosen/rejected pairs) is a weak quality oracle for a Gemma derivative. The jump from 0.098 to 0.182 is real, but the absolute value shouldn't be over-read.
rep3g/1k of 50.3 means real generations will exhibit visible phrase echoes at long context. Useful for technique validation; not useful for end-user RP/chat without an antirep cleanup pass.

Files

| Path | Description |
|---|---|
| `adapter/` | Final LoRA adapter (558 MB safetensors + tokenizer + chat template + adapter_config) |
| `merged/` | Full merged bf16 safetensors model (15.9 GB), with the adapter applied to `google/gemma-4-E4B-it` |

A bf16 GGUF + Q6_K quant are published at ToastyPigeon/Gemma4-Test-GGUFs/Gemma4-E4B-b200-mix-masked-r256.{bf16,Q6_K}.gguf for llama.cpp use.

Provenance

Trained 2026-05-16 → 2026-05-17 by ToastyPigeon (training) + autumn (server + Discord automation) + twistedshadows (B200 compute)
Pod: RunPod B200 secure tier (~7.5h pod lifetime including bootstrap)
Datasets: see ToastyPigeon's private dataset collection (sources listed above, NSFW corpora not redistributed)
Method: instruct-subspace gradient mask, rank 256, on top of LoRA r=64 α=128
Sibling experiments: small-scale variants live in ToastyPigeon/Gemma4-Test-GGUFs (T2 INSTRUCT/CPT masked, SD CPT masked at r=64 and r=256, TIES merge)
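
The ~$41-42 total in the training recipe is consistent with cost scaling with the ~7.5h pod lifetime rather than with the trainer's wall clock alone. A quick check using only figures stated in this card:

```python
HOURLY_RATE_USD = 5.51       # RunPod B200 secure tier
TRAIN_WALL_CLOCK_S = 23752   # trainer wall clock
POD_LIFETIME_H = 7.5         # approximate, includes bootstrap

train_hours = TRAIN_WALL_CLOCK_S / 3600
train_only_cost = train_hours * HOURLY_RATE_USD
pod_cost = POD_LIFETIME_H * HOURLY_RATE_USD

print(f"training time: {train_hours:.1f} h")
print(f"cost of training time alone: ${train_only_cost:.2f}")
print(f"cost of full pod lifetime:   ${pod_cost:.2f}")
```

The gap between the two cost figures corresponds to the bootstrap and teardown time around the training run.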

Recommended next step

Apply antirep DPO (the Stage 4 approach that worked on Qwen Marvin: ~250-500 pairs, chosen=clean, rejected=repetitive, mask_thinking=true) to clean up the rep3g/1k score, then re-eval. Target: bring rep3g/1k below 15 while preserving the h_delta/slop wins.

Sampling

Gemma 4 defaults: temperature=1.0, no top_p / top_k / rep_penalty.
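
As a config fragment, these defaults could be written as generation keyword arguments. The key names below follow the Hugging Face transformers `generate` conventions, which is an assumption about the reader's runtime; llama.cpp and other backends use different option names.

```python
# Neutral values that switch each sampler off while keeping plain sampling on.
sampling_kwargs = {
    "do_sample": True,
    "temperature": 1.0,
    "top_p": 1.0,               # 1.0 keeps the full nucleus (no top_p cut)
    "top_k": 0,                 # 0 disables top_k filtering in transformers
    "repetition_penalty": 1.0,  # 1.0 means no repetition penalty
}

# Hypothetical usage, given an already loaded model and tokenized inputs:
# output_ids = model.generate(**inputs, **sampling_kwargs, max_new_tokens=512)
```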


Model tree for rpDungeon/glimmer-b200-final: google/gemma-4-E4B (base model) → google/gemma-4-E4B-it (finetuned) → this model (adapter)