Gemma 4 E4B: B200 production-scale instruct-subspace-masked mix CPT
This is the B200 production-scale validation of the instruct-subspace gradient-mask technique on google/gemma-4-E4B-it. The model was trained on a deliberately wild 7-corpus mix at 16k context, with a high-rank mask (rank 256) protecting the directions in weight space that carry instruction-following behavior.
The goal was to see whether the mask could absorb a much larger and more aggressive corpus than any of the small-scale single-source experiments (T2 INSTRUCT, T2 CPT-marvin, SD CPT, each ~6-34M tokens) while still preserving IFEval. Answer: largely yes, with one caveat (long-form repetition).
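As a rough intuition for what "protecting directions" means, the toy below removes the part of a gradient vector that lies in a protected subspace. This is a sketch under assumptions, not the training code: it assumes the mask acts as an orthogonal projection and that the protected directions are given as an orthonormal basis. The names `project_out` and `basis` are hypothetical, and the actual math is in the TIES sibling card.

```python
# Toy illustration in plain Python: strip the gradient component that lies
# inside a protected subspace, so an update cannot move along those directions.
# Assumption: `basis` holds orthonormal vectors spanning the protected subspace.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(grad, basis):
    """Return grad minus its projection onto span(basis)."""
    masked = list(grad)
    for u in basis:
        coeff = dot(grad, u)  # component of the original gradient along u
        masked = [m - coeff * ui for m, ui in zip(masked, u)]
    return masked

# Two protected directions in a 3-dimensional toy space.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
grad = [0.5, -2.0, 3.0]
masked = project_out(grad, basis)

print(masked)
```

After the projection, only the component outside the protected subspace is left, so a step along `masked` leaves the protected directions untouched.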
Training recipe
| Setting | Value |
|---|---|
| Hardware | 1× NVIDIA B200 (RunPod secure tier, $5.51/hr) |
| Base | google/gemma-4-E4B-it |
| LoRA | r=64, α=128 (4:1 ratio), all linear targets |
| Instruct-subspace mask | rank 256, per-step gradient projection via post-accumulate-grad hook (see TIES sibling card for math) |
| Sequence length | 16384 (cce_patch was incompatible with transformers 5.8.1; without CCE 32k OOM'd on 178GB B200, so dropped to 16k) |
| Optimizer | paged adamw 8-bit, lr 5e-5, bf16 mixed precision, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True |
| Epochs | 1 (~2083 steps; trainer cleanup saved at step 2000) |
| Wall clock | 23752s (~6.6h) |
| Total cost | ~$41-42 |
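The cost figure can be cross-checked from numbers in this card. The short calculation below assumes the total is billed on the ~7.5h pod lifetime listed under Provenance rather than on training wall clock alone; the variable names are illustrative.

```python
HOURLY_RATE_USD = 5.51   # RunPod B200 secure tier, from the table above
TRAIN_SECONDS = 23752    # training wall clock, from the table above
POD_HOURS = 7.5          # approximate pod lifetime, from the Provenance section

train_hours = TRAIN_SECONDS / 3600
train_cost = train_hours * HOURLY_RATE_USD  # training time only
pod_cost = POD_HOURS * HOURLY_RATE_USD      # includes bootstrap and teardown

print(f"training only: {train_hours:.2f} h, ${train_cost:.2f}")
print(f"whole pod:     {POD_HOURS:.1f} h, ${pod_cost:.2f}")
```

Under that assumption the whole-pod figure lands in the ~$41-42 range, while training time alone accounts for a few dollars less.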
Data mix (~62.6M tokens, seed=42 deterministic concat)
| Source | Description | Note |
|---|---|---|
| erotica_quality_cleaned (full) | Mixed erotic CHYOA concatenated into cohesive storylines | Primary signal; the full set, not the 30% subset previous balancing runs used |
| counter_signal_training_balanced | Mixed-genre fiction, some lewd, predominantly good prose | Non-erotica style anchor |
| erotic_books_filtered | Filtered novels | |
| rosier_inf_strict (subset) | Heavily-rated adult fiction | "if you really want to fry it" tier |
| spring_dragon | Old AI-Dungeon-era 2nd-person text-adventure | Adventure mode signal |
| terry_pratchett (chapter set) | Discworld and other Pratchett novels | High-quality British genre fiction style |
| worm_chapters | Worm web serial | Long-form descriptive prose |
The concatenated corpus was chunked at 32k tokens with a 256-token overlap (Gemma 4 tokenizer) and shuffled with seed=42. Training ran at an effective context of 16k, so chunks longer than that were split.
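The chunking described above can be sketched in plain Python. This is a minimal illustration, not the preprocessing script: it assumes "32k" means 32768 tokens, that 16k is the 16384 from the recipe table, and that over-long chunks are simply cut into consecutive pieces. The function names are hypothetical, and a list of integers stands in for real Gemma 4 token ids.

```python
def chunk_with_overlap(tokens, chunk_len, overlap):
    """Split a token list into windows of chunk_len that share `overlap` tokens."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be in [0, chunk_len)")
    stride = chunk_len - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_len])
        if start + chunk_len >= len(tokens):
            break  # this window already reached the end of the corpus
        start += stride
    return chunks

def split_to_context(chunk, max_len):
    """Cut one chunk into consecutive pieces of at most max_len tokens."""
    return [chunk[i:i + max_len] for i in range(0, len(chunk), max_len)]

# Stand-in for tokenizer output; real input would be Gemma 4 token ids.
tokens = list(range(70_000))
chunks = chunk_with_overlap(tokens, chunk_len=32_768, overlap=256)
pieces = [p for c in chunks for p in split_to_context(c, max_len=16_384)]

print(len(chunks), "chunks ->", len(pieces), "training pieces")
```

In this sketch the 256-token overlap only exists between neighbouring 32k chunks; the pieces produced by splitting one chunk do not overlap each other.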
Evaluation
Same harness as the rest of the E4B masking experiment lineage (IFEval strict 541 prompts + 10-prompt style gen + grpo_v12d metrics including V3 humanness classifier output). All numbers from the bf16 GGUF.
| Metric | base-it | B200 mix r=256 | Δ vs base |
|---|---|---|---|
| IFEval strict prompt-level | 83.36 | 78.00 | -5.36pp |
| IFEval strict instruction-level | n/a | 84.53 | n/a |
| IFEval loose prompt-level | n/a | 81.52 | n/a |
| V3 humanness h_delta | -16.74 | -12.89 | +3.85 (top of E4B cohort) |
| q_prob (qwen-DPO classifier; interpret with caution) | 0.098 | 0.182 | +0.084 |
| det_open_frac | 0.322 | 0.248 | -0.074 |
| amod_frac | 0.142 | 0.065 | -0.077 |
| dialogue_frac | 0.417 | 0.423 | flat |
| contrast_per_1k | 6.26 | 1.27 | -4.99 |
| em_dash_per_1k | 1.60 | 0.47 | -1.13 |
| slop/1k | 19.94 | 13.62 | -6.32 (best non-base) |
| rep3g/1k | 7.39 | 50.30 | +42.91 ⚠️ |
| word_count (avg) | 880 | 980 | +100 |
| sent_count (avg) | 64.1 | 78.1 | +14 |
What to read from this
- The mask did its job at scale. -5.36pp IFEval is worse than the SD r=256 result (-2.03pp on 11M tokens) but better than the r=64 single-source runs (-6.8 to -8.5pp). For 5-6× the training data on a much more aggressive mix, the mask absorbed the brunt.
- Best style movement in the cohort. h_delta Δ +3.85 edges out the TIES merge (+3.64) and is essentially tied with T2 INSTRUCT (+3.91). Confirms the larger/richer corpus is moving the model further.
- Lowest AI-slop of any variant including TIES. Suggests the mix actively pushes the model away from generic LLM tics.
- ⚠️ rep3g/1k 50.30 is the open problem. 7× base rate, much worse than any other masked variant. Diagnostic: long-form descriptive passages in the smut/worm/adventure corpora teach the model to echo phrases. Same failure mode observed on the Stage 2/3 Marvin (Qwen) runs. Fix: antirep DPO pass. This model is not deployment-ready as-is; it is a successful training-technique demonstration.
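For readers unfamiliar with the repetition number, the sketch below shows one plausible way to compute a repeated-3-gram rate. The exact definition used by the grpo_v12d harness is not given in this card, so everything here is an assumption: whitespace tokenization, lowercasing, counting every 3-gram occurrence beyond its first, and scaling per 1000 words. The function name is hypothetical.

```python
from collections import Counter

def repeated_trigrams_per_1k(text):
    """Count 3-gram occurrences beyond the first, scaled per 1000 words."""
    words = text.lower().split()
    if len(words) < 3:
        return 0.0
    trigrams = Counter(zip(words, words[1:], words[2:]))
    repeats = sum(count - 1 for count in trigrams.values())
    return 1000.0 * repeats / len(words)

# A short echo-heavy sentence: two trigrams occur twice in ten words.
sample = "she looked at him and she looked at him again"
score = repeated_trigrams_per_1k(sample)

print(score)  # 200.0
```

Scores from a sketch like this will not match the table exactly, since tokenization and normalization choices shift the absolute value; the direction of change between two models is what carries over.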
Caveats
- No matched-recipe unmasked baseline was trained. The mask's effect at this scale is inferred from cross-recipe comparison with smaller-scale masked vs unmasked pairs (T2 INSTRUCT/CPT/SD CPT each had matched pairs showing 41-83% IFEval recovery from the mask). A 1:1 direct A/B at B200 scale would cost another ~$40.
- The V3 humanness classifier was trained against Gemma 4 31B-it outputs, not E4B-it. E4B belongs to the same Gemma family, so V3 is still measuring the right axis, but distance-to-marvin is not literally comparable across model sizes.
- q_prob (ModernBERT trained on Qwen-Marvin DPO chosen/rejected pairs) is a weak quality oracle for a Gemma derivative. The rise from 0.098 to 0.182 is real, but the absolute value shouldn't be over-read.
- rep3g/1k of 50.3 means real generations will exhibit visible phrase echoes at long context. Useful for technique validation; not useful for end-user RP/chat without an antirep cleanup pass.
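The "IFEval recovery" percentages in the first caveat are not defined in this card. One natural reading, assumed here, is the share of the unmasked run's IFEval drop that the masked run wins back. The sketch below encodes that reading; the function name is hypothetical and the inputs are made up for illustration, not results from any run.

```python
def ifeval_recovery(base, unmasked, masked):
    """Fraction of the unmasked IFEval drop that the masked run wins back."""
    drop = base - unmasked
    if drop <= 0:
        raise ValueError("unmasked run did not lose IFEval relative to base")
    return (masked - unmasked) / drop

# Made-up scores for illustration only; these are not results from any run.
example = ifeval_recovery(base=80.0, unmasked=70.0, masked=76.0)

print(f"{example:.0%} of the drop recovered")
```

Under this definition a value of 0 means the mask did nothing and 1 means the masked run matched the base score, which is why a matched unmasked baseline is needed to compute it at B200 scale.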
Files
| Path | Description |
|---|---|
| adapter/ | Final LoRA adapter (558 MB safetensors + tokenizer + chat template + adapter_config) |
| merged/ | Full merged bf16 safetensors model (15.9 GB); adapter applied to google/gemma-4-E4B-it |
A bf16 GGUF + Q6_K quant are published at ToastyPigeon/Gemma4-Test-GGUFs/Gemma4-E4B-b200-mix-masked-r256.{bf16,Q6_K}.gguf for llama.cpp use.
Provenance
- Trained 2026-05-16 to 2026-05-17 by ToastyPigeon (training) + autumn (server + Discord automation) + twistedshadows (B200 compute)
- Pod: RunPod B200 secure tier (~7.5h pod lifetime including bootstrap)
- Datasets: see ToastyPigeon's private dataset collection (sources listed above, NSFW corpora not redistributed)
- Method: instruct-subspace gradient mask, rank 256, on top of LoRA r=64 α=128
- Sibling experiments: small-scale variants live in ToastyPigeon/Gemma4-Test-GGUFs (T2 INSTRUCT/CPT masked, SD CPT masked at r=64 and r=256, TIES merge)
Recommended next step
Apply antirep DPO (the Stage 4 approach that worked on Qwen Marvin: ~250-500 pairs, chosen=clean, rejected=repetitive, mask_thinking=true) to clean up the rep3g/1k score, then re-eval. Target: bring rep3g/1k below 15 while preserving the h_delta/slop wins.
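A minimal sketch of how chosen=clean / rejected=repetitive pairs could be assembled is shown below. It is not the Stage 4 pipeline: it assumes several generations are sampled per prompt and ranked by some repetition score, and it emits the prompt/chosen/rejected record layout commonly used for DPO training. The names `build_antirep_pairs`, `duplicate_word_rate`, and `min_gap` are hypothetical, the scorer is a crude stand-in for the harness metric, and `mask_thinking` is not modeled.

```python
def build_antirep_pairs(candidates, score_fn, min_gap=10.0):
    """Build DPO records from several generations per prompt.

    candidates: dict mapping prompt -> list of generations
    score_fn:   text -> repetition score (higher means more repetitive)
    min_gap:    minimum score difference required to keep a pair
    """
    pairs = []
    for prompt, generations in candidates.items():
        if len(generations) < 2:
            continue  # nothing to contrast against
        ranked = sorted(generations, key=score_fn)
        chosen, rejected = ranked[0], ranked[-1]
        if score_fn(rejected) - score_fn(chosen) >= min_gap:
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

def duplicate_word_rate(text):
    """Crude stand-in scorer: repeated words per 1000 words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return 1000.0 * (len(words) - len(set(words))) / len(words)

candidates = {
    "Describe the harbor.": [
        "Gulls wheeled over grey water while ropes creaked.",
        "The harbor was grey the harbor was grey the harbor was grey.",
    ],
    "Only one sample": ["Nothing to compare against."],
}
pairs = build_antirep_pairs(candidates, duplicate_word_rate)

print(len(pairs), "pair(s) built")
```

Requiring a minimum gap between the two scores keeps near-identical generations out of the set, which matters when the target is only a few hundred pairs.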
Sampling
Gemma 4 defaults: temperature=1.0, no top_p / top_k / rep_penalty.
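If the model is run through Hugging Face `generate`, those settings might be written as explicit keyword arguments, roughly as below. This is a sketch: the dictionary name is illustrative, and it sets the neutral values explicitly on the assumption that a checkpoint's bundled generation config could otherwise supply different ones.

```python
# Neutral values that switch the extra samplers off in Hugging Face `generate`.
sampling_kwargs = {
    "do_sample": True,
    "temperature": 1.0,
    "top_p": 1.0,               # 1.0 keeps the full distribution (no nucleus cut)
    "top_k": 0,                 # 0 disables top-k filtering
    "repetition_penalty": 1.0,  # 1.0 applies no penalty
}

# Usage, assuming `model` and tokenized `inputs` already exist; the
# max_new_tokens value is a placeholder, not a recommendation from this card:
# output = model.generate(**inputs, max_new_tokens=512, **sampling_kwargs)

print(sampling_kwargs)
```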