noctuashap
/

Confucius3-Math-DFlash

Feature Extraction

speculative-decoding

Model card Files Files and versions

Confucius3-Math-DFlash / README.md

noctuashap's picture

Add Confucius3-Math DFlash D-PACE drafter

e7549f0 verified 7 days ago

|

History Blame Contribute Delete

2.11 kB

	---
	license: apache-2.0
	base_model: netease-youdao/Confucius3-Math
	tags:
	- speculative-decoding
	- dflash
	- draft-model
	- vllm
	- math
	library_name: transformers
	---

	# Confucius3-Math-DFlash (draft model)

	A DFlash block-diffusion speculative-decoding draft model for
	[`netease-youdao/Confucius3-Math`](https://huggingface.co/netease-youdao/Confucius3-Math).
	Use it as the `--speculative-config` model to accelerate Confucius3-Math inference (especially
	single-stream / low-latency math reasoning).

	- Target model: `netease-youdao/Confucius3-Math` (Qwen2 arch, 48 layers, DeepSeek-R1-distill thinking format)
	- Draft: 5-layer `DFlashDraftModel`, block size 16, ~1.5B params, taps target hidden states from layers [1,12,23,34,45]
	- Trained with: [SpecForge](https://github.com/sgl-project/SpecForge), D-PACE loss, 6 epochs

	## Results (acceptance length = mean tokens accepted per draft+verify step, thinking mode)

	\| dataset \| accept length \| draft accept rate \| tok/s (single stream) \|
	\|----------\|--------------:\|------------------:\|----------------------:\|
	\| GSM8K \| 5.47 \| 30% \| 493 \|
	\| MATH-500 \| 5.79 \| 32% \| 526 \|

	Higher acceptance ⇒ more tokens emitted per target forward ⇒ larger speedup. Profiled on 1×H200, vLLM 0.22, temperature 0.

	## Usage (vLLM)

	```bash
	vllm serve netease-youdao/Confucius3-Math \
	--speculative-config '{"method": "dflash", "model": "noctuashap/Confucius3-Math-DFlash", "num_speculative_tokens": 15}' \
	--trust-remote-code
	```

	DFlash is supported in vLLM ≥ 0.20.1. `--trust-remote-code` is required (the draft is a custom
	`DFlashDraftModel`, included as `dflash.py`).

	## Training data

	~148k math-leaning prompts (NuminaMath / MATH / GSM8K / OpenMathReasoning + some code/reasoning/general),
	regenerated by Confucius3-Math itself (thinking traces kept inline) so the draft matches the target's
	own output distribution. No correctness filtering (distribution matching, not correctness).

	Built with [Claude Code](https://claude.com/claude-code).