SuperGemma4-26B DFlash Draft (pilot / PoC)

This is a proof-of-concept DFlash block-diffusion drafter trained against the bf16 twin of AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4 (the NVFP4-quantized SuperGemma4 26B Abliterated Multimodal model).

What is this?

A small draft model used for speculative decoding: instead of the 26B target generating one token per step, the drafter proposes multiple tokens in parallel using block diffusion, and the target verifies them in a single forward pass. Once the drafter is accurate enough, this can improve generation throughput 2-3×.
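The verify step can be sketched with the simplest greedy acceptance rule. This is an illustrative toy, not this repo's API; production systems (including vLLM) typically use a rejection-sampling variant rather than exact greedy matching:

```python
def greedy_verify(draft_tokens, target_tokens):
    """Greedy speculative verification (illustrative sketch).

    draft_tokens:  k tokens proposed by the drafter.
    target_tokens: k+1 tokens the target would emit at each position,
                   all computed in ONE parallel forward pass over the draft.

    Accept the drafter's tokens up to the first disagreement; at the
    mismatch, keep the target's own token and stop. If every draft token
    matches, the target's final "bonus" token comes along for free.
    """
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)   # drafter agreed with the target
        else:
            accepted.append(t)   # first mismatch: take target's token, stop
            return accepted
    accepted.append(target_tokens[len(draft_tokens)])  # bonus token
    return accepted
```

Every call emits at least one token (the target's own), so correctness never degrades; only throughput depends on how often the drafter matches.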

This release is a pilot (5K samples, 1 epoch, ~28 min on 1× RTX PRO 6000 Blackwell). Top-1 match rate with the target is only ~5.8%, too low to actually speed up generation. The artifact exists to validate the full training + export + deployment pipeline end-to-end. See Roadmap below.

Architecture

| Field | Value |
|---|---|
| Type | DFlash block-diffusion drafter (Qwen3-style) |
| Layers | 5 |
| Hidden | 2816 (matches target) |
| Heads | 22 attention / 22 KV |
| Head dim | 128 (vLLM-compatible) |
| Intermediate | 9728 |
| Vocab | 262144 |
| Max pos | 262144 |
| Block size | 8 |
| Target layer anchors | [1, 8, 14, 20, 27] |
| Parameters | ~570M |
| Dtype | bfloat16 |

Training

| Field | Value |
|---|---|
| Target | SuperGemma4 26B (bf16, transformers 5.5.4 layout) |
| Data | HuggingFaceH4/ultrachat_200k train_sft, first 5000 conversations |
| Seq length | 2048 |
| Optimizer | AdamW (fused) |
| LR | 6e-4 (linear decay, 50 warmup steps) |
| Epochs | 1 |
| Steps | 5000 (bs=1, grad_accum=1) |
| Precision | bf16 mixed |
| Hardware | 1× NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB) |
| Wall time | 27:48 |
| Loss (final step) | ~5.9 (noisy; distillation + self-logit) |
| Train acc @ step 5000 | 5.79% (parallel_0_step_0 top-1) |

Training infra notes

Built with NVIDIA/TensorRT-Model-Optimizer#1211 (DFlash training mode, merged in modelopt 0.43.0rc2.dev). Two local patches were required:

  1. EagleTrainerWithAccLog._save override to save drafter-only (1.2 GB) instead of the full 50 GB frozen target + drafter tree — stock HF Trainer would OOM the filesystem.
  2. modeling_gemma4.py line 2027 — bypass the "mm_token_type_ids required when training" check for text-only training.
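Patch 1 boils down to filtering the combined state dict down to the drafter's tensors before the checkpoint is written. A minimal sketch of that filter, assuming a "drafter." key prefix (the actual prefix in the modelopt model tree may differ):

```python
def drafter_state_dict(state_dict, prefix="drafter."):
    """Keep only the drafter's tensors from the combined target+drafter
    state dict. In the patched trainer this filter runs inside the
    overridden EagleTrainerWithAccLog._save before serialization, so the
    checkpoint is ~1.2 GB (drafter only) instead of ~50 GB (frozen
    target + drafter). The "drafter." prefix is an assumption for
    illustration, not the verified key layout.
    """
    return {k: v for k, v in state_dict.items() if k.startswith(prefix)}
```

The override itself then just calls `save_file(drafter_state_dict(model.state_dict()), ...)` instead of serializing the whole tree.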

Deployment with vLLM

```shell
vllm serve AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4 \
    --speculative-config '{
        "method":"draft_model",
        "model":"AEON-7/supergemma4-26b-dflash-pilot",
        "num_speculative_tokens":5
    }' \
    --quantization modelopt \
    --max-model-len 65536
```

⚠️ With only 5.8% top-1, expect negative speedup (verification overhead swamps any gain). This pilot is for plumbing validation, not perf.
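The warning above is easy to quantify. Under a simplified i.i.d. model where each drafted token is accepted with probability roughly the top-1 match rate, the expected tokens emitted per target verification pass is a geometric sum (a back-of-envelope sketch, not a vLLM measurement):

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per verification pass when each of k
    drafted tokens is accepted independently with probability alpha:
    sum_{i=0}^{k} alpha^i = (1 - alpha^(k+1)) / (1 - alpha).
    (The +1 term is the target's free "bonus" token.)
    """
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Pilot: alpha ~= 0.058, k = 5 speculative tokens -> ~1.06 tokens/pass,
# i.e. barely better than plain decoding BEFORE paying drafter overhead.
pilot = expected_tokens_per_step(0.058, 5)

# Production target: alpha ~= 0.75 -> ~3.3 tokens/pass.
production = expected_tokens_per_step(0.75, 5)
```

At ~1.06 tokens per pass, the drafter's own forward cost pushes net throughput below the baseline, which is why this pilot is plumbing-only.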

Roadmap

This pilot proves the training stack. To make it actually fast:

| Stage | Data | Epochs | Expected top-1 | ETA |
|---|---|---|---|---|
| Pilot (this) | 5K | 1 | 5.8% | 28 min × 1 GPU |
| Small | 50K | 3 | ~25% | ~15 hr × 1 GPU |
| Medium | 500K | 3 | ~55% | ~6 days × 1 GPU (or 18 hr × 8 GPUs) |
| Production | 2M | 5-10 | 70-80% | ~1 week × 8 GPUs |

Production quality also hinges on domain-general data (a mix of ShareGPT, UltraChat, Magpie, LMSYS Chat, and code) rather than UltraChat alone.
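A mixed training stream like the one described can be planned with plain weighted sampling over sources; the names and weights below are illustrative placeholders, not the actual production recipe:

```python
import random
from collections import Counter

def plan_mix(weights, n, seed=0):
    """Decide which source each of n training examples is drawn from,
    given per-source mixing weights. Deterministic for a fixed seed.
    Weights need not sum to 1; random.choices normalizes them.
    """
    rng = random.Random(seed)
    names = list(weights)
    return Counter(rng.choices(names, weights=list(weights.values()), k=n))

# Hypothetical weights for a domain-general mix (not the real recipe):
mix = plan_mix({"sharegpt": 0.3, "ultrachat": 0.3, "magpie": 0.15,
                "lmsys_chat": 0.15, "code": 0.1}, n=10000)
```

In practice the same idea is what `datasets.interleave_datasets(..., probabilities=...)` implements over real corpora.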

Files

  • model.safetensors — 1.2 GB, 58 tensors (5-layer drafter)
  • config.json — Qwen3-style DFlashDraftModel config
  • tokenizer.json, tokenizer_config.json, chat_template.jinja — from target
  • trainer_state.json — loss/acc history per step
  • training_args.bin — exact training config for reproducibility

License

Apache 2.0. This is derived work from the SuperGemma4 26B target.
