Uraion Labs
Foundational systems research.
UraionSpec
Faithful DSpark-style Speculative Decoding: modular, runnable, verified.
UraionSpec is a clean, modular, and runnable implementation of DSpark (Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation), accepted at ICML 2026. It faithfully reproduces the core DSpark algorithm while being practical for small-scale experimentation, training, and evaluation.
This is research infrastructure, not a model checkpoint. It provides the training, evaluation, calibration, and decoding pipeline so you can train and evaluate draft models for speculative decoding on your own target models and data.
Intelligence is a systems problem. This codebase is one piece of that system.
What is DSpark?
DSpark is a state-of-the-art speculative decoding framework from DeepSeek-AI that introduces two key innovations:
- Semi-Autoregressive Generation: A parallel backbone handles bulk compute while a lightweight sequential head (Markov or RNN) injects inter-token dependency, combining the speed of parallel drafters with the quality of autoregressive ones.
- Confidence-Scheduled Verification: A confidence head predicts per-position acceptance probabilities, and a hardware-aware scheduler dynamically tailors the verification length based on prefix survival probabilities and engine throughput profiles. This prevents wasted compute on high-rejection tokens under heavy load.
Architecture
UraionSpec/
├── src/uraionspec/
│   ├── models/                    # DSpark draft model
│   │   ├── markov_head.py         # Low-rank transition bias (r=256)
│   │   ├── rnn_head.py            # GRU-like recurrent sequential head
│   │   ├── confidence_head.py     # Per-position acceptance predictor
│   │   ├── dflash_backbone.py     # DFlash-style backbone with KV injection
│   │   └── draft_model.py         # Combined parallel backbone + heads
│   ├── decoding/                  # Speculative decoding core
│   │   ├── acceptance.py          # Lossless rejection sampling (min ratio)
│   │   ├── scheduler.py           # Algorithm 1: Hardware-aware prefix scheduler
│   │   └── speculative.py         # Orchestration: draft → verify → accept
│   ├── training/                  # Training pipeline
│   │   ├── dataset.py             # Anchor-block dataset preparation
│   │   ├── losses.py              # CE + TV + Confidence (position-weighted)
│   │   ├── train_drafter.py       # Training loop (frozen target)
│   │   └── cache_targets.py       # Target logit cache generation
│   ├── calibration/               # Sequential Temperature Scaling
│   │   └── sts.py                 # Left-to-right ECE minimization
│   ├── evaluation/                # Evaluation & benchmarking
│   │   ├── eval_acceptance.py     # Acceptance rate / length metrics
│   │   └── benchmark_latency.py   # Vanilla vs speculative latency
│   └── utils/                     # HF helpers, logging, seeding
├── scripts/                       # Runnable entry points
│   ├── smoke_train.py
│   ├── smoke_eval.py
│   └── run_benchmark.py
├── tests/                         # 80 unit & integration tests
└── docs/                          # Implementation notes, reports
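The draft → verify → accept flow rests on the standard min-ratio acceptance rule of speculative sampling. The sketch below illustrates that rule for one drafted block using only the standard library; `verify_block` and its argument layout are hypothetical and are not taken from `acceptance.py`. It omits the bonus token that is usually sampled from the target when every drafted token is accepted.

```python
import random


def verify_block(draft_tokens, p_draft, p_target, rng=random):
    """Verify one drafted block with min-ratio rejection sampling.

    draft_tokens: token ids proposed by the drafter.
    p_draft / p_target: per-position probability lists over the vocabulary.
    Returns the accepted prefix, plus one corrected token if a draft is rejected.
    """
    out = []
    for k, x in enumerate(draft_tokens):
        q, p = p_draft[k], p_target[k]
        # Accept the drafted token with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)
            continue
        # On rejection, resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        weights = residual if sum(residual) > 0 else p  # defensive fallback
        out.append(rng.choices(range(len(p)), weights=weights)[0])
        break
    return out
```

When the draft and target distributions agree, every token is accepted; when the target assigns zero probability to a drafted token, it is always rejected and replaced by a sample from the residual.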
Installation
# Install directly from HuggingFace
pip install git+https://huggingface.co/UraionLabs/UraionSpec
# Or clone from HuggingFace
git clone https://huggingface.co/UraionLabs/UraionSpec
cd UraionSpec
pip install -e .
# With development dependencies (tests, linting)
pip install -e ".[dev]"
Quick Start
Smoke Training
Train a DSpark draft model on a tiny dataset to verify end-to-end gradient flow:
python scripts/smoke_train.py \
--target Qwen/Qwen2.5-0.5B-Instruct \
--samples 32 \
--steps 5 \
--batch-size 2 \
--block-size 4
Smoke Evaluation
Evaluate a trained draft model's acceptance characteristics:
python scripts/smoke_eval.py \
--target Qwen/Qwen2.5-0.5B-Instruct \
--checkpoint /path/to/checkpoint.pt \
--gamma 7 \
--steps 5
Benchmark
Compare speculative decoding against vanilla autoregressive generation:
python scripts/run_benchmark.py \
--target Qwen/Qwen2.5-0.5B-Instruct \
--prompts examples/prompts.jsonl \
--gamma 7 \
--steps 10
Run Tests
pytest tests/ -v
Key Components
Markov Sequential Head
Implements low-rank transition bias B(x_{k-1}, x_k) = W1[x_{k-1}] @ W2 where W1 ∈ R^{V×r}, W2 ∈ R^{r×V} (r=256 default). Available as VanillaMarkov or GatedMarkovHead (modulated by backbone hidden state).
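As a rough illustration of the factorization, the dependency-free sketch below computes one row of the bias, B(x_{k-1}, ·), and adds it to a vector of backbone logits. The function names are hypothetical, and adding the bias directly to the logits is an assumption about how the head is applied; the gated variant is not shown.

```python
def transition_bias(w1, w2, prev_token):
    """Return the row B(prev_token, :) = W1[prev_token] @ W2 (length V)."""
    row = w1[prev_token]                      # r-dimensional embedding of x_{k-1}
    vocab_size = len(w2[0])
    return [sum(row[i] * w2[i][v] for i in range(len(row)))
            for v in range(vocab_size)]


def biased_logits(backbone_logits, w1, w2, prev_token):
    """Assumed usage: add the transition bias to the parallel backbone's logits."""
    bias = transition_bias(w1, w2, prev_token)
    return [logit + b for logit, b in zip(backbone_logits, bias)]


# Tiny example with V=3, r=2.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # V x r
w2 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]       # r x V
print(transition_bias(w1, w2, prev_token=2))
```

The low-rank form stores 2·V·r parameters instead of the V² entries of a full transition matrix.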
RNN Sequential Head
GRU-like gated recurrent state across positions:
s_k = sigmoid(W_g z_k) ⊙ s_{k-1} + (1 - sigmoid(W_g z_k)) ⊙ tanh(W_c z_k)
where z_k = [s_{k-1}; W1[x_{k-1}]; h_k]. Captures full prefix history.
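A minimal sketch of one recurrence step, written with plain lists so it runs without dependencies. `rnn_head_step` and the weight layout (one row per state dimension) are illustrative assumptions, not the interface of `rnn_head.py`.

```python
import math


def sigmoid(a):
    # Numerically stable logistic function.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def matvec(w, z):
    return [sum(wi * zi for wi, zi in zip(row, z)) for row in w]


def rnn_head_step(s_prev, e_prev, h_k, w_g, w_c):
    """One update s_k = g * s_{k-1} + (1 - g) * tanh(W_c z_k), with g = sigmoid(W_g z_k).

    s_prev: previous state s_{k-1}; e_prev: embedding W1[x_{k-1}]; h_k: backbone hidden state.
    """
    z = s_prev + e_prev + h_k                 # concatenation [s_{k-1}; W1[x_{k-1}]; h_k]
    gate = [sigmoid(a) for a in matvec(w_g, z)]
    cand = [math.tanh(a) for a in matvec(w_c, z)]
    return [g * s + (1.0 - g) * c for g, s, c in zip(gate, s_prev, cand)]
```

Because s_{k-1} feeds into s_k at every position, the state can carry information from the whole drafted prefix rather than only the previous token.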
Confidence Head
Predicts per-position conditional acceptance probability:
c_k = sigmoid(w^T [h_k; W1[x_{k-1}]])
Supervised by analytical acceptance rate c*_k = 1 - 0.5 × ||p_d - p_t||_1.
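The supervision target equals the expected acceptance probability of min-ratio rejection sampling, because the expectation over x ~ p_d of min(1, p_t(x)/p_d(x)) is Σ_x min(p_d(x), p_t(x)) = 1 - 0.5 × ||p_d - p_t||_1. The short check below computes both sides for a toy distribution; the function names are illustrative only.

```python
def analytical_acceptance(p_draft, p_target):
    """c* = 1 - 0.5 * ||p_d - p_t||_1 for two distributions over the same vocabulary."""
    l1 = sum(abs(d - t) for d, t in zip(p_draft, p_target))
    return 1.0 - 0.5 * l1


def expected_min_ratio_acceptance(p_draft, p_target):
    """E_{x ~ p_d}[min(1, p_t(x) / p_d(x))] = sum_x min(p_d(x), p_t(x))."""
    return sum(min(d, t) for d, t in zip(p_draft, p_target))


p_d = [0.5, 0.3, 0.2]
p_t = [0.2, 0.3, 0.5]
print(analytical_acceptance(p_d, p_t), expected_min_ratio_acceptance(p_d, p_t))
```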
Hardware-Aware Prefix Scheduler (Algorithm 1)
Maximizes expected throughput Θ = τ × SPS(B) by:
- Computing prefix survival probabilities a_{r,j} = ∏_{i≤j} c_{r,i}
- Globally sorting candidates by a_{r,j}
- Greedily admitting tokens with early stopping to preserve the non-anticipating property
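The three steps above can be sketched as follows. This is an illustration under stated assumptions, not a transcription of Algorithm 1 or of `scheduler.py`: the function name, the throughput model (expected accepted tokens times a profiled rate at the current batch token count), the one always-verified token per request, and the `sps` callable are all hypothetical.

```python
def schedule_prefixes(conf, sps):
    """Choose a verification length per request.

    conf[r][j]: calibrated acceptance probability of draft position j of request r.
    sps(b): hypothetical engine profile, verification steps per second at b batched tokens.
    Returns (lengths, estimated_throughput).
    """
    # Step 1: prefix survival a_{r,j} = prod_{i<=j} c_{r,i}.
    candidates = []
    for r, row in enumerate(conf):
        survival = 1.0
        for j, c in enumerate(row):
            survival *= c
            candidates.append((survival, r, j))

    # Step 2: global sort, highest survival first. Survival never increases with j,
    # so breaking ties by position keeps each request's positions in prefix order.
    candidates.sort(key=lambda t: (-t[0], t[2]))

    # Step 3: greedy admission with early stopping.
    lengths = [0] * len(conf)
    batch_tokens = len(conf)                  # assumption: one base token per request
    expected_tokens = float(len(conf))
    best = expected_tokens * sps(batch_tokens)
    for survival, r, j in candidates:
        trial = (expected_tokens + survival) * sps(batch_tokens + 1)
        if trial <= best:
            break                             # stop at the first non-improving candidate
        lengths[r] = j + 1
        batch_tokens += 1
        expected_tokens += survival
        best = trial
    return lengths, best
```

With a profile that is flat up to four batched tokens and degrades beyond that, the sketch keeps the two high-confidence positions of one request and drops the low-confidence positions of another, which is the load-dependent truncation described above.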
Sequential Temperature Scaling
Calibrates cumulative confidence products left-to-right via 1D grid search minimizing Expected Calibration Error (ECE) at each position.
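One way to read this procedure is sketched below: for each position in turn, pick the temperature from a grid that minimizes the ECE of the cumulative product against observed prefix survival, holding earlier temperatures fixed. The function names, the equal-width binning, and applying the temperature to the confidence logit are assumptions made for illustration, not details taken from `sts.py`.

```python
import math


def sigmoid(a):
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def ece(probs, labels, n_bins=10):
    """Expected Calibration Error with equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    err = 0.0
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            mean_y = sum(y for _, y in b) / len(b)
            err += len(b) / len(probs) * abs(mean_p - mean_y)
    return err


def fit_sts(logits, survived, grid):
    """logits[n][k]: confidence logit; survived[n][k]: 1 if the prefix through k was accepted."""
    n = len(logits)
    prefix = [1.0] * n                        # calibrated product over earlier positions
    temps = []
    for k in range(len(logits[0])):
        labels = [row[k] for row in survived]

        def cumulative(t):
            return [prefix[i] * sigmoid(logits[i][k] / t) for i in range(n)]

        best_t = min(grid, key=lambda t: ece(cumulative(t), labels))
        temps.append(best_t)
        prefix = cumulative(best_t)           # freeze this position before moving right
    return temps
```

Fitting left to right means each temperature is a one-dimensional search, and later positions are calibrated on top of the already-calibrated prefix.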
Loss Functions (DSpark Eq. 12)
L = 0.1 × L_ce + 0.9 × L_tv + 1.0 × L_conf
- L_ce: Cross-entropy for next-token prediction
- L_tv: Total variation distance ||p_d - p_t||_1
- L_conf: Binary cross-entropy on confidence predictions
All terms are position-weighted by w_k = exp(-(k-1)/γ), which emphasizes earlier positions.
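The weighting can be illustrated with a few lines of plain Python. `combined_loss` is a hypothetical helper that takes per-position loss values; normalizing by the sum of the weights is an assumption, since the text above does not say how the weighted terms are reduced.

```python
import math


def position_weights(block_size, gamma):
    """w_k = exp(-(k - 1) / gamma) for k = 1..block_size."""
    return [math.exp(-(k - 1) / gamma) for k in range(1, block_size + 1)]


def combined_loss(l_ce, l_tv, l_conf, gamma, coeffs=(0.1, 0.9, 1.0)):
    """Position-weighted mean of 0.1 * L_ce + 0.9 * L_tv + 1.0 * L_conf.

    l_ce, l_tv, l_conf: per-position loss values of equal length.
    """
    weights = position_weights(len(l_ce), gamma)
    total = 0.0
    for k, w in enumerate(weights):
        total += w * (coeffs[0] * l_ce[k] + coeffs[1] * l_tv[k] + coeffs[2] * l_conf[k])
    return total / sum(weights)               # assumed normalization


print(position_weights(4, gamma=4.0))
```

The first position always has weight 1, and each later position is down-weighted by a constant factor exp(-1/γ), so a larger γ spreads the emphasis more evenly across the block.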
Verification
| Component | Status |
|---|---|
| 80 unit & integration tests | ✅ All passing |
| DFlash backbone with KV injection | ✅ 17 tests, all shapes & gradients verified |
| Sampling utilities (residual, GQA) | ✅ 8 tests |
| Package import | ✅ Clean |
| Linting (ruff) | ✅ All checks passed |
| Smoke training (CPU) | ✅ 3 steps, all losses decreasing |
| Confidence head training | ✅ Supervised by analytical acceptance rate |
Reproducing Paper Results
The DSpark paper trains on the full Open-PerfectBlend dataset (1.3M samples) across multiple GPUs. For production-scale reproduction, see the official DeepSpec repository.
For small-scale experimentation:
# Train on Colab A100
colab run -s uraionspec-train --gpu A100 --keep --timeout 28800 \
python scripts/smoke_train.py --target Qwen/Qwen3-4B --samples 10000 --steps 1000
Relation to DeepSpec
UraionSpec is an independent, faithful implementation of the DSpark algorithm described in the paper and the DeepSpec repository (MIT license). While DeepSpec is a production-grade codebase with multi-GPU training, 38 TB target caches, and vLLM integration, UraionSpec focuses on:
- Clarity: Modular, documented Python with clean separations
- Runnability: Smoke tests that work on a single GPU or CPU
- Completeness: Every algorithm component from the paper is implemented
Current Limitations
- Parallel backbone: Uses nn.TransformerEncoder, not the full DFlash-style backbone with target model KV injection described in Section 3.1 of the paper.
- No multi-GPU: Single-device only.
- Synthetic SPS profile: Uses a default throughput curve; for real systems, profile your engine and pass the table.
- No vLLM integration: For production serving, see DeepSpec's integration.
License
MIT License. Built with reference to DeepSpec (MIT) and the DSpark paper. Copyright © 2026 Uraion Labs.
Citation
@software{uraionspec2026,
author = {Uraion Labs},
title = {UraionSpec: Faithful DSpark-style Speculative Decoding},
year = {2026},
url = {https://huggingface.co/UraionLabs/UraionSpec}
}
@article{cheng2026dspark,
title={DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation},
author={Cheng, Xin and Yu, Xingkai and Shao, Chenze and Li, Jiashi and Xiong, Yunfan and others},
journal={ICML},
year={2026}
}