noctuashap's picture
Add Confucius3-Math DFlash D-PACE drafter
e7549f0 verified
|
Raw
History Blame Contribute Delete
2.11 kB
---
license: apache-2.0
base_model: netease-youdao/Confucius3-Math
tags:
- speculative-decoding
- dflash
- draft-model
- vllm
- math
library_name: transformers
---
# Confucius3-Math-DFlash (draft model)
A **DFlash** block-diffusion speculative-decoding **draft model** for
[`netease-youdao/Confucius3-Math`](https://huggingface.co/netease-youdao/Confucius3-Math).
Use it as the `--speculative-config` model to accelerate Confucius3-Math inference (especially
single-stream / low-latency math reasoning).
- **Target model:** `netease-youdao/Confucius3-Math` (Qwen2 arch, 48 layers, DeepSeek-R1-distill thinking format)
- **Draft:** 5-layer `DFlashDraftModel`, block size 16, ~1.5B params, taps target hidden states from layers [1,12,23,34,45]
- **Trained with:** [SpecForge](https://github.com/sgl-project/SpecForge), **D-PACE** loss, 6 epochs
## Results (acceptance length = mean tokens accepted per draft+verify step, thinking mode)
| dataset | accept length | draft accept rate | tok/s (single stream) |
|----------|--------------:|------------------:|----------------------:|
| GSM8K | **5.47** | 30% | 493 |
| MATH-500 | **5.79** | 32% | 526 |
Higher acceptance ⇒ more tokens emitted per target forward ⇒ larger speedup. Profiled on 1×H200, vLLM 0.22, temperature 0.
## Usage (vLLM)
```bash
vllm serve netease-youdao/Confucius3-Math \
--speculative-config '{"method": "dflash", "model": "noctuashap/Confucius3-Math-DFlash", "num_speculative_tokens": 15}' \
--trust-remote-code
```
DFlash is supported in vLLM ≥ 0.20.1. `--trust-remote-code` is required (the draft is a custom
`DFlashDraftModel`, included as `dflash.py`).
## Training data
~148k math-leaning prompts (NuminaMath / MATH / GSM8K / OpenMathReasoning + some code/reasoning/general),
**regenerated by Confucius3-Math itself** (thinking traces kept inline) so the draft matches the target's
own output distribution. No correctness filtering (distribution matching, not correctness).
*Built with [Claude Code](https://claude.com/claude-code).*