Gemma 4 E4B: B200 production-scale instruct-subspace-masked mix CPT

This is the B200 production-scale validation of the instruct-subspace gradient-mask technique on google/gemma-4-E4B-it. The model was trained on a deliberately wild 7-corpus mix at 16k context, with a high-rank mask (rank 256) protecting the directions in weight space that carry instruction-following behavior.
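As a rough illustration of what a per-step gradient projection can look like, the sketch below removes the part of a gradient that lies inside a protected subspace. It assumes the mask is stored as an orthonormal basis and that the masked gradient is `g - U Uᵀ g`; the actual math lives in the TIES sibling card and may differ. The names `project_out` and `basis` are hypothetical. In a PyTorch training loop this kind of logic would typically run inside a hook registered with `Tensor.register_post_accumulate_grad_hook`.

```python
# Minimal sketch, NOT the card's implementation.
# Assumption: the protected "instruct subspace" is given as an orthonormal basis.

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(grad, basis):
    """Return grad with its component along every basis vector removed.

    `basis` must be orthonormal, otherwise the result is only approximate.
    """
    out = list(grad)
    for u in basis:
        coeff = dot(grad, u)  # how much of grad points along u
        out = [o - coeff * ui for o, ui in zip(out, u)]
    return out


# Toy example: protect the first two axes of a 3-dimensional parameter.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
grad = [0.5, -2.0, 3.0]
masked = project_out(grad, basis)
print(masked)  # only the unprotected third direction still receives an update
```

A rank-256 mask would follow the same pattern with 256 basis vectors per masked weight, most likely as a batched matrix product rather than a Python loop.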

The goal was to see whether the mask could absorb a much larger and more aggressive corpus than any of the small-scale single-source experiments (T2 INSTRUCT, T2 CPT-marvin, SD CPT; each ~6-34M tokens) while still preserving IFEval. Answer: largely yes, with one caveat (long-form repetition).

Training recipe

| Setting | Value |
|---|---|
| Hardware | 1× NVIDIA B200 (RunPod secure tier, $5.51/hr) |
| Base | google/gemma-4-E4B-it |
| LoRA | r=64, α=128 (4:1 ratio), all linear targets |
| Instruct-subspace mask | rank 256, per-step gradient projection via post-accumulate-grad hook (see TIES sibling card for math) |
| Sequence length | 16384 (cce_patch was incompatible with transformers 5.8.1; without CCE, 32k OOM'd on the 178GB B200, so dropped to 16k) |
| Optimizer | paged adamw 8-bit, lr 5e-5, bf16 mixed precision, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True |
| Epochs | 1 (~2083 steps; trainer cleanup saved at step 2000) |
| Wall clock | 23752s (~6.6h) |
| Total cost | ~$41-42 |
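The time and cost figures above can be cross-checked with a few lines of arithmetic. This is a sketch under two assumptions: that billing covered the ~7.5h pod lifetime noted under Provenance rather than only the training wall clock, and that all ~62.6M tokens of the mix were seen during the epoch. The variable names are illustrative.

```python
# Back-of-the-envelope check of the recipe numbers quoted on this card.
WALL_CLOCK_S = 23752      # training wall clock, from the table
RATE_USD_PER_HR = 5.51    # RunPod B200 secure-tier rate, from the table
POD_LIFETIME_HR = 7.5     # approximate pod lifetime incl. bootstrap (Provenance)
TOTAL_TOKENS = 62.6e6     # approximate size of the data mix

train_hours = WALL_CLOCK_S / 3600
train_cost = train_hours * RATE_USD_PER_HR
pod_cost = POD_LIFETIME_HR * RATE_USD_PER_HR
tokens_per_s = TOTAL_TOKENS / WALL_CLOCK_S  # assumes every token was trained on

print(f"training time: {train_hours:.2f} h")
print(f"training-only cost: ${train_cost:.2f}")
print(f"pod-lifetime cost: ${pod_cost:.2f}")
print(f"approx. throughput: {tokens_per_s:.0f} tokens/s")
```

Under these assumptions the pod-lifetime cost falls inside the quoted ~$41-42 range, while training time alone accounts for a few dollars less.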

Data mix (~62.6M tokens, seed=42 deterministic concat)

| Source | Description | Note |
|---|---|---|
| erotica_quality_cleaned (full) | Mixed erotic CHYOA concatenated into cohesive storylines | Primary signal; the full set, not the 30% subset previous balancing runs used |
| counter_signal_training_balanced | Mixed-genre fiction, some lewd, predominantly good prose | Non-erotica style anchor |
| erotic_books_filtered | Filtered novels | |
| rosier_inf_strict (subset) | Heavily-rated adult fiction | "if you really want to fry it" tier |
| spring_dragon | Old AI-Dungeon-era 2nd-person text-adventure | Adventure mode signal |
| terry_pratchett (chapter set) | Discworld and other Pratchett novels | High-quality British genre fiction style |
| worm_chapters | Worm web serial | Long-form descriptive prose |

The combined corpus was chunked at 32k tokens with a 256-token overlap (Gemma 4 tokenizer), shuffled with seed=42, and trained at an effective 16k context (chunks longer than 16k were split).
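The chunking described above might look roughly like the following. This is a sketch under assumptions, not the actual preprocessing script: the helper names are hypothetical, the overlap is assumed to be a plain sliding window, and toy sizes stand in for 32768 / 256 / 16384 so the behavior is easy to inspect.

```python
import random


def chunk_with_overlap(tokens, chunk_len, overlap):
    """Slide a window of chunk_len over tokens, stepping by chunk_len - overlap."""
    if overlap >= chunk_len:
        raise ValueError("overlap must be smaller than chunk_len")
    step = chunk_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_len])
        if start + chunk_len >= len(tokens):
            break  # the last window already reached the end of the stream
    return chunks


def split_to_context(chunk, ctx_len):
    """Split one chunk into consecutive pieces of at most ctx_len tokens."""
    return [chunk[i:i + ctx_len] for i in range(0, len(chunk), ctx_len)]


# Toy stand-ins: 20 "tokens", chunk_len=8 (for 32k), overlap=2 (for 256), ctx=4 (for 16k).
tokens = list(range(20))
chunks = chunk_with_overlap(tokens, chunk_len=8, overlap=2)

shuffled = chunks[:]
random.Random(42).shuffle(shuffled)  # deterministic shuffle, mirroring seed=42

pieces = [p for c in shuffled for p in split_to_context(c, ctx_len=4)]
print(len(chunks), len(pieces))
```

With these toy sizes each chunk shares its first two tokens with the tail of the previous one, and every chunk is cut into two context-length pieces.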

Evaluation

Same harness as the rest of the E4B masking experiment lineage (IFEval strict 541 prompts + 10-prompt style gen + grpo_v12d metrics including V3 humanness classifier output). All numbers from the bf16 GGUF.

| Metric | base-it | B200 mix r=256 | Δ vs base |
|---|---|---|---|
| IFEval strict prompt-level | 83.36 | 78.00 | -5.36pp |
| IFEval strict instruction-level | n/a | 84.53 | n/a |
| IFEval loose prompt-level | n/a | 81.52 | n/a |
| V3 humanness h_delta | -16.74 | -12.89 | +3.85 (top of E4B cohort) |
| q_prob (qwen-DPO classifier; interpret with caution) | 0.098 | 0.182 | +0.084 |
| det_open_frac | 0.322 | 0.248 | -0.074 |
| amod_frac | 0.142 | 0.065 | -0.077 |
| dialogue_frac | 0.417 | 0.423 | flat |
| contrast_per_1k | 6.26 | 1.27 | -4.99 |
| em_dash_per_1k | 1.60 | 0.47 | -1.13 |
| slop/1k | 19.94 | 13.62 | -6.32 (best non-base) |
| rep3g/1k | 7.39 | 50.30 | +42.91 ⚠️ |
| word_count (avg) | 880 | 980 | +100 |
| sent_count (avg) | 64.1 | 78.1 | +14 |

What to read from this

  • The mask did its job at scale. The -5.36pp IFEval drop is worse than the SD r=256 result (-2.03pp on 11M tokens) but better than the r=64 single-source runs (-6.8 to -8.5pp). Given 5-6× the training data and a much more aggressive mix, the mask absorbed the brunt.
  • Top-tier style movement in the cohort. The h_delta Δ of +3.85 edges out the TIES merge (+3.64) and is essentially tied with T2 INSTRUCT (+3.91). Confirms the larger/richer corpus is moving the model further.
  • Lowest AI-slop of any variant including TIES. Suggests the mix actively pushes the model away from generic LLM tics.
  • ⚠️ rep3g/1k 50.30 is the open problem. That is about 7× the base rate (7.39) and much worse than any other masked variant. Diagnostic: long-form descriptive passages in the smut/worm/adventure corpora teach the model to echo phrases. The same failure mode was observed on the Stage 2/3 Marvin (Qwen) runs. Fix: antirep DPO pass. This model is not deployment-ready as-is; it is a successful training-technique demonstration.
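The card does not define rep3g/1k, so the sketch below shows only one plausible reading: the number of repeated word trigrams per 1,000 words. The function name, the whitespace tokenization, and the choice to count every occurrence after the first are all assumptions; the harness's actual metric may differ.

```python
from collections import Counter


def rep3g_per_1k(text):
    """Repeated word trigrams per 1,000 words (assumed definition)."""
    words = text.lower().split()
    if len(words) < 3:
        return 0.0
    trigrams = Counter(zip(words, words[1:], words[2:]))
    # Every occurrence of a trigram beyond its first counts as one repeat.
    repeats = sum(n - 1 for n in trigrams.values() if n > 1)
    return 1000.0 * repeats / len(words)


sample = "the cat sat on the mat and the cat sat on the rug"
print(rep3g_per_1k(sample))

# Ratio of the tuned model's score to the base score, using the table values.
ratio = 50.30 / 7.39
print(f"{ratio:.1f}x base rate")
```

Short toy strings produce very large per-1k values; the metric is only meaningful on long generations such as the ~900-word style samples used here.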

Caveats

  • No matched-recipe unmasked baseline was trained. The mask's effect at this scale is inferred from cross-recipe comparison with smaller-scale masked vs unmasked pairs (T2 INSTRUCT/CPT/SD CPT each had matched pairs showing 41-83% IFEval recovery from the mask). A 1:1 direct A/B at B200 scale would cost another ~$40.
  • The V3 humanness classifier was trained against Gemma 4 31B-it outputs, not E4B-it. E4B is similar enough within the Gemma family that V3 is still measuring the right axis, but distance-to-marvin is not literally comparable across model sizes.
  • q_prob (ModernBERT trained on Qwen-Marvin DPO chosen/rejected pairs) is a weak quality oracle for a Gemma derivative. The rise from 0.098 to 0.182 is real, but the absolute value shouldn't be over-read.
  • rep3g/1k of 50.3 means real generations will exhibit visible phrase echoes at long context. Useful for technique validation; not useful for end-user RP/chat without an antirep cleanup pass.
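The "IFEval recovery" percentages quoted for the small-scale matched pairs are not defined on this card. One common way to express such a figure is the share of the unmasked run's IFEval loss that the masked run wins back, sketched below. The formula is an assumption, and the numbers in the example are toy placeholders, not measurements from any run.

```python
def ifeval_recovery(base, unmasked, masked):
    """Fraction of the unmasked IFEval drop recovered by the masked run.

    Assumed definition: (masked - unmasked) / (base - unmasked).
    """
    drop = base - unmasked
    if drop <= 0:
        raise ValueError("unmasked run did not lose IFEval; recovery is undefined")
    return (masked - unmasked) / drop


# Toy placeholder scores (NOT real results): base 80, unmasked 60, masked 72.
recovery = ifeval_recovery(base=80.0, unmasked=60.0, masked=72.0)
print(f"{recovery:.0%} of the drop recovered")
```

Computing the same figure at B200 scale would need the matched unmasked run that was not trained.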

Files

| Path | Description |
|---|---|
| adapter/ | Final LoRA adapter (558 MB safetensors + tokenizer + chat template + adapter_config) |
| merged/ | Full merged bf16 safetensors model (15.9 GB); adapter applied to google/gemma-4-E4B-it |

A bf16 GGUF + Q6_K quant are published at ToastyPigeon/Gemma4-Test-GGUFs/Gemma4-E4B-b200-mix-masked-r256.{bf16,Q6_K}.gguf for llama.cpp use.

Provenance

  • Trained 2026-05-16 to 2026-05-17 by ToastyPigeon (training) + autumn (server + Discord automation) + twistedshadows (B200 compute)
  • Pod: RunPod B200 secure tier (~7.5h pod lifetime including bootstrap)
  • Datasets: see ToastyPigeon's private dataset collection (sources listed above, NSFW corpora not redistributed)
  • Method: instruct-subspace gradient mask, rank 256, on top of LoRA r=64 α=128
  • Sibling experiments: small-scale variants live in ToastyPigeon/Gemma4-Test-GGUFs (T2 INSTRUCT/CPT masked, SD CPT masked at r=64 and r=256, TIES merge)

Recommended next step

Apply antirep DPO (the Stage 4 approach that worked on Qwen Marvin: ~250-500 pairs, chosen=clean, rejected=repetitive, mask_thinking=true) to clean up the rep3g/1k score, then re-eval. Target: bring rep3g/1k below 15 while preserving the h_delta/slop wins.
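One way to assemble the chosen/rejected pairs for such a pass is sketched below. It is a hypothetical helper, not the Stage 4 pipeline: it assumes several sampled completions per prompt, each already scored with a repetition metric, and it keeps a pair only when the cleanest and most repetitive completions differ by at least `min_gap`. The prompt/chosen/rejected layout is the one commonly used by DPO trainers; the `min_gap` default is arbitrary.

```python
def build_antirep_pairs(samples, min_gap=10.0):
    """Build DPO preference pairs from scored completions.

    samples: {prompt: [(completion_text, rep_score), ...]}
    Lower rep_score means less repetition.
    """
    pairs = []
    for prompt, candidates in samples.items():
        if len(candidates) < 2:
            continue  # need at least two completions to form a pair
        chosen = min(candidates, key=lambda c: c[1])    # cleanest completion
        rejected = max(candidates, key=lambda c: c[1])  # most repetitive one
        if rejected[1] - chosen[1] >= min_gap:
            pairs.append(
                {"prompt": prompt, "chosen": chosen[0], "rejected": rejected[0]}
            )
    return pairs


# Toy inputs with made-up scores, for illustration only.
samples = {
    "Describe the harbor at dawn.": [("clean text", 6.0), ("echoing text", 48.0)],
    "Write a tavern scene.": [("variant a", 12.0), ("variant b", 15.0)],
    "Single sample prompt.": [("only one", 30.0)],
}
pairs = build_antirep_pairs(samples)
print(len(pairs), pairs[0]["chosen"])
```

Filtering by a score gap keeps the preference signal focused on repetition rather than on small stylistic differences between near-identical completions.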

Sampling

Gemma 4 defaults: temperature=1.0, no top_p / top_k / rep_penalty.

