# SuperGemma4-26B DFlash Draft (pilot / PoC)
This is a proof-of-concept DFlash block-diffusion drafter trained against `AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4` (the NVFP4-quantized SuperGemma4 26B Abliterated Multimodal model); training itself ran against its bf16 twin.
## What is this?
A small draft model used for speculative decoding: instead of the 26B target generating one token per step, the drafter proposes several tokens in parallel using block diffusion, and the target verifies them in a single forward pass. Once the drafter is accurate enough, this can improve generation throughput 2-3×.
This release is a pilot (5K samples, 1 epoch, ~28 min on 1× RTX PRO 6000 Blackwell). Top-1 match rate with the target is only ~5.8%, too low to actually speed up generation. The artifact exists to validate the full training + export + deployment pipeline end-to-end. See Roadmap below.
## Architecture
| Field | Value |
|---|---|
| Type | DFlash block-diffusion drafter (Qwen3-style) |
| Layers | 5 |
| Hidden | 2816 (matches target) |
| Heads | 22 attention / 22 KV |
| Head dim | 128 (vLLM-compatible) |
| Intermediate | 9728 |
| Vocab | 262144 |
| Max pos | 262144 |
| Block size | 8 |
| Target layer anchors | [1, 8, 14, 20, 27] |
| Parameters | ~570M |
| Dtype | bfloat16 |
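The ~570M figure is consistent with the table if it counts the five decoder layers only; whether embeddings are shared with the target or simply excluded is an assumption here, as is the SwiGLU-style three-matrix MLP (typical of Qwen3-style blocks). A quick back-of-the-envelope check:

```python
# Parameter count from the architecture table above.
# Assumption: the ~570M figure covers the 5 decoder layers only
# (an embedding table at vocab 262144 would add ~738M on its own).
hidden = 2816
intermediate = 9728
layers = 5
heads, head_dim = 22, 128
assert heads * head_dim == hidden    # so q/k/v/o are all hidden x hidden

attn = 4 * hidden * hidden           # q, k, v, o projections
mlp = 3 * hidden * intermediate      # gate, up, down (SwiGLU-style, assumed)
total = layers * (attn + mlp)
print(f"{total / 1e6:.0f}M")         # ~570M
```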
## Training
| Field | Value |
|---|---|
| Target | SuperGemma4 26B (bf16, transformers 5.5.4 layout) |
| Data | HuggingFaceH4/ultrachat_200k train_sft, first 5000 conversations |
| Seq length | 2048 |
| Optimizer | AdamW (fused) |
| LR | 6e-4 (linear decay, 50 warmup steps) |
| Epochs | 1 |
| Steps | 5000 (bs=1, grad_accum=1) |
| Precision | bf16 mixed |
| Hardware | 1× NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB) |
| Wall time | 27:48 |
| Loss (final step) | ~5.9 (noisy; distillation + self-logit) |
| Train acc @ step 5000 | 5.79% (parallel_0_step_0 top-1) |
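The schedule in the table (50 warmup steps, then linear decay over the 5000-step run) can be written as a small helper. Decaying to exactly zero at the final step is an assumption, matching the common warmup-then-linear-decay convention:

```python
def lr_at(step, base_lr=6e-4, warmup=50, total=5000):
    # Linear warmup from 0 to base_lr over `warmup` steps,
    # then linear decay from base_lr down to 0 at `total` steps.
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total - step) / (total - warmup))

# Peak at step 50, halfway down at step 2525, zero at step 5000.
```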
### Training infra notes
Built with NVIDIA/TensorRT-Model-Optimizer#1211 (DFlash training mode, merged in modelopt 0.43.0rc2.dev). Two local patches were required:

- `EagleTrainerWithAccLog._save` override: save the drafter only (1.2 GB) instead of the full 50 GB frozen-target + drafter tree, which would fill the filesystem under the stock HF Trainer.
- `modeling_gemma4.py` line 2027: bypass the check that requires `mm_token_type_ids` when training, so text-only training works.
## Deployment with vLLM
```bash
vllm serve AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4 \
  --speculative-config '{
    "method": "draft_model",
    "model": "AEON-7/supergemma4-26b-dflash-pilot",
    "num_speculative_tokens": 5
  }' \
  --quantization modelopt \
  --max-model-len 65536
```
⚠️ With only 5.8% top-1, expect negative speedup (verification overhead swamps any gain). This pilot is for plumbing validation, not perf.
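Once the server is up, requests go through vLLM's OpenAI-compatible endpoint as usual; speculation is entirely server-side and invisible to clients. A minimal stdlib sketch (the `localhost:8000` URL is the vLLM default, an assumption for your deployment):

```python
import json
import urllib.request

def build_request(prompt, max_tokens=128):
    # Standard OpenAI-style completions payload. Note the model name is
    # the *target*, not the drafter -- the drafter is configured
    # server-side via --speculative-config.
    payload = {
        "model": "AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Explain speculative decoding in one sentence.")
# resp = urllib.request.urlopen(req)  # uncomment with a running server
```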
## Roadmap
This pilot proves the training stack. To make it actually fast:
| Stage | Data | Epochs | Expected top-1 | ETA |
|---|---|---|---|---|
| Pilot (this) | 5K | 1 | 5.8% | 28 min × 1 GPU |
| Small | 50K | 3 | ~25% | ~15 hr × 1 GPU |
| Medium | 500K | 3 | ~55% | ~6 days × 1 GPU (or 18 hr × 8 GPUs) |
| Production | 2M | 5-10 | 70-80% | ~1 week × 8 GPUs |
Moving to production-quality, domain-general data (a mix of ShareGPT, UltraChat, Magpie, LMSYS Chat, and code) rather than UltraChat alone is also key.
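A weighted mixture like the one suggested above can be sampled with a simple proportional draw. The weights below are purely illustrative placeholders, not a recommendation from this project:

```python
import random

# Hypothetical domain-general mix; weights are illustrative only.
MIX = {
    "sharegpt": 0.30,
    "ultrachat": 0.25,
    "magpie": 0.20,
    "lmsys_chat": 0.15,
    "code": 0.10,
}

def sample_source(rng):
    # Draw a dataset name with probability proportional to its weight.
    return rng.choices(list(MIX), weights=list(MIX.values()), k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIX}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
# counts now roughly tracks the weights (~3000 sharegpt, ~1000 code).
```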
## Files
- `model.safetensors`: 1.2 GB, 58 tensors (5-layer drafter)
- `config.json`: Qwen3-style DFlashDraftModel config
- `tokenizer.json`, `tokenizer_config.json`, `chat_template.jinja`: copied from the target
- `trainer_state.json`: per-step loss/accuracy history
- `training_args.bin`: exact training config for reproducibility
## License
Apache 2.0. This is derived work from the SuperGemma4 26B target.
Base model: `google/gemma-4-26B-A4B-it`