SuperGemma4-26B DFlash Draft (pilot / PoC)
This is a proof-of-concept DFlash block-diffusion drafter for AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4, the NVFP4-quantized SuperGemma4 26B Abliterated Multimodal model. It was trained against that model's bf16 twin.
What is this?
A small draft model used for speculative decoding: instead of the big 26B target generating one token per step, the drafter proposes multiple tokens in parallel using block diffusion, and the target verifies them in a single pass. Once the drafter is accurate enough, this can raise generation throughput by 2-3×.
This release is a pilot (5K samples, 1 epoch, ~28 min on 1× RTX PRO 6000 Blackwell). Top-1 match rate with the target is only ~5.8%, too low to actually speed up generation. The artifact exists to validate the full training + export + deployment pipeline end-to-end. See Roadmap below.
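A rough back-of-the-envelope model shows why a match rate near 5.8% cannot pay for itself. The sketch below assumes each drafted token is accepted independently with probability `p` (a simplification: the card reports a top-1 training accuracy, not a measured acceptance rate) and that drafting plus the extra verification work costs a hypothetical fraction `overhead` of one target forward pass. Under those assumptions, the expected number of tokens emitted per target pass is the geometric sum (1 - p^(k+1)) / (1 - p).

```python
def expected_tokens_per_pass(p: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability p, acceptance stops at the first rejection, and the target
    always contributes one token of its own (the correction or bonus token).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if p == 1.0:
        return float(k + 1)
    return (1.0 - p ** (k + 1)) / (1.0 - p)


def estimated_speedup(p: float, k: int, overhead: float) -> float:
    """Speedup over plain decoding.

    `overhead` is a hypothetical per-step cost (drafting plus extra
    verification work), expressed in units of one target forward pass.
    """
    return expected_tokens_per_pass(p, k) / (1.0 + overhead)


if __name__ == "__main__":
    # 0.058 is the pilot's top-1 figure; the other rates are illustrative.
    for p in (0.058, 0.25, 0.55, 0.75):
        tokens = expected_tokens_per_pass(p, 5)
        speedup = estimated_speedup(p, 5, overhead=0.10)
        print(f"p={p:.3f}  tokens/pass={tokens:.2f}  speedup={speedup:.2f}x")
```

With `p = 0.058` and `k = 5` this model gives about 1.06 tokens per target pass, so any overhead above roughly 6% of a target pass becomes a net slowdown.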
Architecture
| Field | Value |
|---|---|
| Type | DFlash block-diffusion drafter (Qwen3-style) |
| Layers | 5 |
| Hidden | 2816 (matches target) |
| Heads | 22 attention / 22 KV |
| Head dim | 128 (vLLM-compatible) |
| Intermediate | 9728 |
| Vocab | 262144 |
| Max pos | 262144 |
| Block size | 8 |
| Target layer anchors | [1, 8, 14, 20, 27] |
| Parameters | ~570M |
| Dtype | bfloat16 |
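The ~570M figure can be sanity-checked from the table. The sketch below assumes a standard Qwen3-style decoder layer (bias-free Q/K/V/O projections plus a gated MLP with gate, up, and down matrices), ignores norm weights and any extra projection layers, and assumes the embedding and output head are not counted in the drafter's own parameters. None of these assumptions is confirmed by the card.

```python
HIDDEN = 2816
LAYERS = 5
HEADS = 22
KV_HEADS = 22
HEAD_DIM = 128
INTERMEDIATE = 9728


def attention_params(hidden: int, heads: int, kv_heads: int, head_dim: int) -> int:
    """Weights in the Q, K, V and O projections of one layer (no biases)."""
    q = hidden * heads * head_dim
    k = hidden * kv_heads * head_dim
    v = hidden * kv_heads * head_dim
    o = heads * head_dim * hidden
    return q + k + v + o


def mlp_params(hidden: int, intermediate: int) -> int:
    """Weights in a gated MLP: gate, up and down projections."""
    return 3 * hidden * intermediate


def drafter_params() -> int:
    """Approximate parameter count of the transformer layers only."""
    per_layer = attention_params(HIDDEN, HEADS, KV_HEADS, HEAD_DIM) + mlp_params(HIDDEN, INTERMEDIATE)
    return LAYERS * per_layer


if __name__ == "__main__":
    total = drafter_params()
    print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
    print(f"~{total * 2 / 1e9:.2f} GB in bfloat16")
```

Under these assumptions the layers alone account for about 569.5M parameters, or roughly 1.1 GB at 2 bytes per parameter, which is close to the 1.2 GB checkpoint.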
Training
| Field | Value |
|---|---|
| Target | SuperGemma4 26B (bf16, transformers 5.5.4 layout) |
| Data | HuggingFaceH4/ultrachat_200k train_sft, first 5000 conversations |
| Seq length | 2048 |
| Optimizer | AdamW (fused) |
| LR | 6e-4 (linear decay, 50 warmup steps) |
| Epochs | 1 |
| Steps | 5000 (bs=1, grad_accum=1) |
| Precision | bf16 mixed |
| Hardware | 1× NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB) |
| Wall time | 27:48 |
| Loss (final step) | ~5.9 (noisy; distillation + self-logit) |
| Train acc @ step 5000 | 5.79% (parallel_0_step_0 top-1) |
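For readers who want to picture the learning-rate row, here is a minimal sketch of a linear warmup followed by linear decay to zero. It assumes the same convention as Hugging Face's linear schedule with warmup; the exact schedule used for this run is the one recorded in `training_args.bin`, and the function name is only illustrative.

```python
PEAK_LR = 6e-4
WARMUP_STEPS = 50
TOTAL_STEPS = 5000


def linear_warmup_decay_lr(
    step: int,
    peak_lr: float = PEAK_LR,
    warmup: int = WARMUP_STEPS,
    total: int = TOTAL_STEPS,
) -> float:
    """Learning rate at `step`: linear ramp to peak_lr, then linear decay to 0."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total - step) / (total - warmup))


if __name__ == "__main__":
    for step in (0, 25, 50, 2525, 5000):
        print(f"step {step:>4}: lr = {linear_warmup_decay_lr(step):.2e}")
```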
Training infra notes
Built with NVIDIA/TensorRT-Model-Optimizer#1211 (DFlash training mode, merged in modelopt 0.43.0rc2.dev). Required two local patches:
- `EagleTrainerWithAccLog._save` override to save the drafter only (1.2 GB) instead of the full 50 GB frozen target + drafter tree; the stock HF Trainer would run the filesystem out of space.
- `modeling_gemma4.py` line 2027: bypass the `mm_token_type_ids required when training` check for text-only training.
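The patches themselves are not included in this card. The idea behind the first one, filtering the combined state dict down to the drafter's parameters before anything is written to disk, might look roughly like the sketch below. The `drafter.` prefix and the key names are hypothetical, and a real override would hand the filtered dict to the trainer's usual save routine.

```python
from typing import Dict, Mapping

# Hypothetical prefix: the real parameter naming is not stated in the card.
DRAFTER_PREFIX = "drafter."


def drafter_only(
    state_dict: Mapping[str, object], prefix: str = DRAFTER_PREFIX
) -> Dict[str, object]:
    """Keep only entries under `prefix`, with the prefix stripped from each key."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


if __name__ == "__main__":
    # Placeholder strings stand in for tensors; the key names are made up.
    combined = {
        "target.layers.0.mlp.weight": "frozen target tensor",
        "drafter.layers.0.mlp.weight": "trainable drafter tensor",
        "drafter.norm.weight": "trainable drafter tensor",
    }
    print(sorted(drafter_only(combined)))
```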
Deployment with vLLM
vllm serve AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4 \
--speculative-config '{
"method":"draft_model",
"model":"AEON-7/supergemma4-26b-dflash-pilot",
"num_speculative_tokens":5
}' \
--quantization modelopt \
--max-model-len 65536
⚠️ With only 5.8% top-1, expect a net slowdown: verification overhead outweighs the few accepted tokens. This pilot is for plumbing validation, not performance.
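Once the server is running, speculative decoding is transparent to clients: requests name the target model and go to vLLM's OpenAI-compatible endpoint as usual. The following is a minimal standard-library client sketch, assuming the default port 8000 on localhost; the helper names are illustrative only.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes vLLM's default port; adjust if --port was set
TARGET_MODEL = "AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4"


def build_chat_request(
    prompt: str, max_tokens: int = 64, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": TARGET_MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, max_tokens: int = 64, base_url: str = BASE_URL) -> str:
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(prompt, max_tokens=max_tokens, base_url=base_url)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("What is the capital of France?")` against a running server returns the reply text. Timing the same call with and without `--speculative-config` is a simple way to check the expected slowdown.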
Roadmap
This pilot proves the training stack. To make it actually fast:
| Stage | Data | Epochs | Expected top-1 | ETA |
|---|---|---|---|---|
| Pilot (this) | 5K | 1 | 5.8% | 28 min × 1 GPU |
| Small | 50K | 3 | ~25% | ~15 hr × 1 GPU |
| Medium | 500K | 3 | ~55% | ~6 days × 1 GPU (or 18 hr × 8 GPUs) |
| Production | 2M | 5-10 | 70-80% | ~1 week × 8 GPUs |
A production-quality drafter also needs domain-general data (a mix of ShareGPT, UltraChat, Magpie, LMSYS Chat, and code) rather than UltraChat alone.
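The ETA column can be roughly reproduced by extrapolating from the pilot's throughput. The sketch below assumes wall time scales linearly with the number of samples processed (same sequence length and batch settings) and that multi-GPU scaling is perfect. Both are optimistic simplifications, so treat the results as lower bounds.

```python
PILOT_SAMPLES = 5_000
PILOT_MINUTES = 27 + 48 / 60  # pilot wall time of 27:48


def eta_hours(samples: int, epochs: int, gpus: int = 1) -> float:
    """Estimated wall time in hours, extrapolated linearly from the pilot run."""
    minutes = samples * epochs / PILOT_SAMPLES * PILOT_MINUTES / gpus
    return minutes / 60


if __name__ == "__main__":
    print(f"Small:      {eta_hours(50_000, 3):.1f} hr on 1 GPU")
    print(f"Medium:     {eta_hours(500_000, 3) / 24:.1f} days on 1 GPU, "
          f"{eta_hours(500_000, 3, gpus=8):.1f} hr on 8 GPUs")
    print(f"Production: {eta_hours(2_000_000, 5, gpus=8) / 24:.1f} to "
          f"{eta_hours(2_000_000, 10, gpus=8) / 24:.1f} days on 8 GPUs")
```

This gives about 13.9 hr for Small, about 5.8 days (17.4 hr on 8 GPUs) for Medium, and roughly 4.8 to 9.7 days on 8 GPUs for Production, in line with the table's rounded figures.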
Files
- `model.safetensors`: 1.2 GB, 58 tensors (5-layer drafter)
- `config.json`: Qwen3-style DFlashDraftModel config
- `tokenizer.json`, `tokenizer_config.json`, `chat_template.jinja`: from target
- `trainer_state.json`: loss/acc history per step
- `training_args.bin`: exact training config for reproducibility
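To inspect the training history, `trainer_state.json` can be read with the standard library. The sketch below relies on the `log_history` list that the Hugging Face Trainer writes, where each entry carries a `step` and whatever metrics were logged at that step. The key under which this run logged accuracy is not stated in the card, so it is left as a parameter.

```python
import json
from typing import Any, Dict, List, Tuple


def load_trainer_state(path: str) -> Dict[str, Any]:
    """Read a trainer_state.json file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def metric_curve(
    trainer_state: Dict[str, Any], key: str = "loss"
) -> List[Tuple[int, float]]:
    """Return (step, value) pairs for every log entry that contains `key`."""
    return [
        (entry["step"], entry[key])
        for entry in trainer_state.get("log_history", [])
        if "step" in entry and key in entry
    ]
```

For example, `metric_curve(load_trainer_state("trainer_state.json"))` returns the loss per logged step. Pass a different `key` to extract the accuracy series once its name is known from the file.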
License
Apache 2.0. This is a derived work of the SuperGemma4 26B target.
☕ Support the work
If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.
- ₿ Bitcoin (BTC): bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4
- Ξ Ethereum (ETH): 0x1512667F6D61454ad531d2E45C0a5d1fd82D0500
- ◎ Solana (SOL): DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t
- ⓜ Monero (XMR): 836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd
Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.