Feature Extraction
Transformers
Safetensors
qwen3
speculative-decoding
dflash
draft-model
vllm
math
custom_code
Instructions to use noctuashap/Confucius3-Math-DFlash with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use noctuashap/Confucius3-Math-DFlash with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("feature-extraction", model="noctuashap/Confucius3-Math-DFlash", trust_remote_code=True)# Load model directly from transformers import AutoTokenizer, AutoModel tokenizer = AutoTokenizer.from_pretrained("noctuashap/Confucius3-Math-DFlash", trust_remote_code=True) model = AutoModel.from_pretrained("noctuashap/Confucius3-Math-DFlash", trust_remote_code=True) - Notebooks
- Google Colab
- Kaggle
| license: apache-2.0 | |
| base_model: netease-youdao/Confucius3-Math | |
| tags: | |
| - speculative-decoding | |
| - dflash | |
| - draft-model | |
| - vllm | |
| - math | |
| library_name: transformers | |
| # Confucius3-Math-DFlash (draft model) | |
| A **DFlash** block-diffusion speculative-decoding **draft model** for | |
| [`netease-youdao/Confucius3-Math`](https://huggingface.co/netease-youdao/Confucius3-Math). | |
| Use it as the `--speculative-config` model to accelerate Confucius3-Math inference (especially | |
| single-stream / low-latency math reasoning). | |
| - **Target model:** `netease-youdao/Confucius3-Math` (Qwen2 arch, 48 layers, DeepSeek-R1-distill thinking format) | |
| - **Draft:** 5-layer `DFlashDraftModel`, block size 16, ~1.5B params, taps target hidden states from layers [1,12,23,34,45] | |
| - **Trained with:** [SpecForge](https://github.com/sgl-project/SpecForge), **D-PACE** loss, 6 epochs | |
| ## Results (acceptance length = mean tokens accepted per draft+verify step, thinking mode) | |
| | dataset | accept length | draft accept rate | tok/s (single stream) | | |
| |----------|--------------:|------------------:|----------------------:| | |
| | GSM8K | **5.47** | 30% | 493 | | |
| | MATH-500 | **5.79** | 32% | 526 | | |
| Higher acceptance ⇒ more tokens emitted per target forward ⇒ larger speedup. Profiled on 1×H200, vLLM 0.22, temperature 0. | |
| ## Usage (vLLM) | |
| ```bash | |
| vllm serve netease-youdao/Confucius3-Math \ | |
| --speculative-config '{"method": "dflash", "model": "noctuashap/Confucius3-Math-DFlash", "num_speculative_tokens": 15}' \ | |
| --trust-remote-code | |
| ``` | |
| DFlash is supported in vLLM ≥ 0.20.1. `--trust-remote-code` is required (the draft is a custom | |
| `DFlashDraftModel`, included as `dflash.py`). | |
| ## Training data | |
| ~148k math-leaning prompts (NuminaMath / MATH / GSM8K / OpenMathReasoning + some code/reasoning/general), | |
| **regenerated by Confucius3-Math itself** (thinking traces kept inline) so the draft matches the target's | |
| own output distribution. No correctness filtering (distribution matching, not correctness). | |
| *Built with [Claude Code](https://claude.com/claude-code).* | |