Initial public release: UraionSpec v0.1.0, faithful DSpark-style speculative decoding (commit 3c1da87)
DSpark Implementation Notes
Architecture Summary
DSpark introduces two key innovations over existing speculative decoding methods:
1. Semi-Autoregressive Generation
- Parallel backbone (DFlash-style): Processes all γ positions in a single forward pass, making drafting latency nearly independent of block size.
- Sequential head: Two variants:
  - Markov head (`VanillaMarkov`/`GatedMarkovHead`): Low-rank transition bias `B = W1[x_{k-1}] @ W2` (r=256).
  - RNN head (`RNNHead`): GRU-like recurrent state accumulating full prefix history.
- The sequential stage fixes the "multi-modal collision" problem where parallel drafters produce incoherent combinations like "of problem" instead of "of course".
2. Confidence-Scheduled Verification
- Confidence head: Predicts per-position conditional acceptance probability `c_k = σ(w^T[h_k; W1[x_{k-1}]])`.
- Sequential Temperature Scaling (STS): Calibrates cumulative prefix survival probabilities left-to-right via 1D grid search minimizing ECE.
- Hardware-Aware Prefix Scheduler (Algorithm 1): Maximizes `Θ = τ · SPS(B)` by greedily admitting highest-survival-probability tokens with early stopping.
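The confidence formula and the prefix survival probabilities it feeds can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `models/confidence_head.py`: the function names, the use of plain lists instead of tensors, and passing the looked-up row `W1[x_{k-1}]` as `prev_embedding` are all assumptions made for readability.

```python
import math


def sigmoid(z):
    """Logistic function used for the confidence score."""
    return 1.0 / (1.0 + math.exp(-z))


def confidence(h_k, prev_embedding, w):
    """c_k = sigmoid(w^T [h_k; e_prev]), with e_prev standing in for W1[x_{k-1}].

    h_k, prev_embedding and w are plain lists of floats (hypothetical layout).
    """
    features = list(h_k) + list(prev_embedding)  # concatenation [h_k; e_prev]
    if len(features) != len(w):
        raise ValueError("w must match the concatenated feature length")
    return sigmoid(sum(wi * fi for wi, fi in zip(w, features)))


def prefix_survival(confidences):
    """Cumulative products a_j = prod_{i<=j} c_i for one request."""
    survival, running = [], 1.0
    for c in confidences:
        running *= c
        survival.append(running)
    return survival
```

Because each `c_k` lies in (0, 1), the survival values are non-increasing along the block, which is what lets a scheduler treat them as prefix-ordered candidates.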
Algorithm Details
Acceptance Rule (Lossless)
P(accept token x_k) = min(1, p^t_k(x_k) / p^d_k(x_k))
- First rejection at position k discards tokens k+1..γ
- One bonus token sampled from residual distribution at rejection position
- Preserves exact target distribution
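The acceptance rule above can be sketched as follows. This is an illustrative stand-alone version, not the contents of `decoding/acceptance.py`: the function name `verify_block`, the list-of-lists probability layout, and returning `None` as the bonus token when nothing is rejected are assumptions. The residual distribution is taken to be the normalized positive part of `p_t - p_d`, as in standard speculative sampling.

```python
import random


def verify_block(draft_tokens, draft_probs, target_probs, rng=random):
    """Verify a drafted block left to right.

    draft_probs[k] and target_probs[k] are full distributions (lists of
    floats over the vocabulary) at position k. Returns (accepted, bonus),
    where bonus is the token resampled at the first rejection, or None
    if every drafted token was accepted.
    """
    accepted = []
    for k, x in enumerate(draft_tokens):
        p_d = draft_probs[k][x]
        p_t = target_probs[k][x]
        if p_d <= 0.0:
            raise ValueError("drafted token must have positive draft probability")
        # Accept with probability min(1, p_t / p_d).
        if rng.random() < min(1.0, p_t / p_d):
            accepted.append(x)
            continue
        # First rejection: later tokens are discarded, and one bonus token
        # is drawn from the residual distribution max(0, p_t - p_d).
        residual = [max(0.0, t - d) for t, d in zip(target_probs[k], draft_probs[k])]
        bonus = rng.choices(range(len(residual)), weights=residual)[0]
        return accepted, bonus
    return accepted, None
```

A rejection can only happen when `p_t(x) < p_d(x)`, so the residual always has positive mass at that position and the resampling step is well defined.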
Draft Distribution
p_k(v | x_0, x_<k) = exp(U_k(v) + B_k(x_0, x_<k, v)) / Σ_u exp(U_k(u) + B_k(...))
where U_k are base logits from parallel backbone and B_k is the sequential transition bias.
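A minimal sketch of this draft distribution for the Markov variant, where the bias depends only on the previous token, might look like the following. It uses plain Python lists rather than tensors, and the names `markov_bias` and `draft_block` are hypothetical; the repository's heads in `models/markov_head.py` may be organized differently.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def markov_bias(prev_token, W1, W2):
    """Low-rank bias B = W1[prev_token] @ W2.

    W1 has shape (vocab, r) and W2 has shape (r, vocab), as nested lists.
    """
    row = W1[prev_token]
    vocab = len(W2[0])
    return [sum(row[r] * W2[r][v] for r in range(len(row))) for v in range(vocab)]


def draft_block(anchor, base_logits, W1, W2, rng=random):
    """Sample a block sequentially from p_k = softmax(U_k + B_k).

    base_logits[k] plays the role of U_k and is assumed to come from one
    parallel backbone pass; only the cheap bias lookup is sequential.
    """
    tokens, probs, prev = [], [], anchor
    for U_k in base_logits:
        bias = markov_bias(prev, W1, W2)
        p_k = softmax([u + b for u, b in zip(U_k, bias)])
        x_k = rng.choices(range(len(p_k)), weights=p_k)[0]
        tokens.append(x_k)
        probs.append(p_k)
        prev = x_k  # the next position conditions on the sampled token
    return tokens, probs
```

The returned per-position distributions are the `p^d_k` values that the acceptance rule needs during verification.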
Training Objective (Eq. 12)
L = 0.1 · L_ce + 0.9 · L_tv + 1.0 · L_conf
- `L_ce`: Cross-entropy for next-token prediction
- `L_tv`: Total variation distance `||p_d - p_t||_1` (proxy for acceptance rate)
- `L_conf`: Binary cross-entropy on confidence predictions
- All position-weighted by `w_k = exp(-(k-1)/γ)`
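The weighted objective can be sketched for a single drafted block as below. Several details are assumptions rather than facts about `training/losses.py`: the per-position terms are combined as a weighted mean (weights normalized to sum to 1), the TV term uses the L1 form written above (the conventional total variation distance is half that value), and `accept_labels` are taken to be values in [0, 1] supplied by the caller.

```python
import math

EPS = 1e-12  # guards log(0); an illustrative choice


def position_weights(gamma):
    """w_k = exp(-(k-1)/gamma) for k = 1..gamma."""
    return [math.exp(-(k - 1) / gamma) for k in range(1, gamma + 1)]


def drafter_loss(draft_probs, target_probs, target_tokens, confidences, accept_labels):
    """L = 0.1 * L_ce + 0.9 * L_tv + 1.0 * L_conf for one block.

    draft_probs[k] / target_probs[k]: distributions at position k.
    target_tokens[k]: token index used for the cross-entropy term.
    confidences[k]: predicted c_k; accept_labels[k]: its target in [0, 1].
    """
    gamma = len(draft_probs)
    weights = position_weights(gamma)
    norm = sum(weights)
    l_ce = l_tv = l_conf = 0.0
    for k in range(gamma):
        w = weights[k] / norm  # assumption: weighted mean over positions
        l_ce += w * -math.log(max(draft_probs[k][target_tokens[k]], EPS))
        l_tv += w * sum(abs(d - t) for d, t in zip(draft_probs[k], target_probs[k]))
        c = min(max(confidences[k], EPS), 1.0 - EPS)
        y = accept_labels[k]
        l_conf += w * -(y * math.log(c) + (1.0 - y) * math.log(1.0 - c))
    return 0.1 * l_ce + 0.9 * l_tv + 1.0 * l_conf
```

The exponential weights decay from 1 at the first position to about `exp(-1)` near the end of the block, so early positions, which every accepted prefix must pass through, dominate the gradient.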
STS Calibration
For each position k = 1..γ:
- Compute cumulative product of calibrated scores up to k
- Find temperature T_k minimizing ECE of cumulative product
- Apply T_k to k-th position score
- Order-preserving: doesn't disrupt relative rankings
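The left-to-right procedure above can be sketched as a small grid search. This is an illustration under stated assumptions, not the code in `calibration/sts.py`: the temperature is applied to the logit of each score (`sigmoid(logit(c) / T)`), ECE uses ten equal-width bins, the grid spans 0.25 to 4.0, and `survived[n][k]` is assumed to be 1 when sample n's prefix through position k was fully accepted.

```python
import math


def _scale(p, T, eps=1e-6):
    """Temperature-scale a probability through its logit (monotone in p)."""
    p = min(max(p, eps), 1.0 - eps)
    logit = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-logit / T))


def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width bins."""
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]  # count, sum of p, sum of y
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += p
        bins[b][2] += y
    return sum(abs(sp - sy) for cnt, sp, sy in bins if cnt) / len(probs)


def fit_sts(scores, survived, grid=None):
    """Fit one temperature per position, left to right.

    scores[n][k]: raw confidence c_k for sample n.
    survived[n][k]: 1 if the prefix through position k was accepted, else 0.
    """
    if grid is None:
        grid = [0.25 * i for i in range(1, 17)]  # 0.25 .. 4.0
    n_samples, gamma = len(scores), len(scores[0])
    prefix = [1.0] * n_samples  # calibrated cumulative product so far
    temps = []
    for k in range(gamma):
        labels = [row[k] for row in survived]
        best_T, best_err = None, float("inf")
        for T in grid:
            cum = [prefix[n] * _scale(scores[n][k], T) for n in range(n_samples)]
            err = ece(cum, labels)
            if err < best_err:
                best_T, best_err = T, err
        temps.append(best_T)
        # Freeze T_k and carry the calibrated product to the next position.
        prefix = [prefix[n] * _scale(scores[n][k], best_T) for n in range(n_samples)]
    return temps
```

Scaling the logit by a positive temperature is a monotone map, so scores at a given position keep their relative order after calibration.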
Algorithm 1: Hardware-Aware Prefix Scheduler
1. Compute a_{r,j} = ∏_{i≤j} c_{r,i} for each request r, position j
2. Create candidate set E = {(r,j) | a_{r,j} > 0}
3. Sort E descending by a_{r,j}
4. Greedily add candidates:
- Update ℓ_r = j, B += 1, τ += a_{r,j}
- Compute Θ = τ * SPS(B)
- If Θ > best, update best lengths; else break
5. Return per-request verification lengths
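The five steps can be sketched directly in Python. This is a simplified illustration rather than the code in `decoding/scheduler.py`: `sps` is a hypothetical callable mapping the number of verified tokens B to engine steps per second, B and τ start at zero, and ties in survival probability are broken by request index and then position so that prefixes stay contiguous.

```python
def schedule_prefixes(confidences, sps):
    """Greedy hardware-aware prefix scheduling.

    confidences[r][j-1] is c_{r,j}; sps(B) is a caller-supplied throughput
    profile (hypothetical interface). Returns one verification length per
    request.
    """
    # Steps 1-2: survival probabilities a_{r,j} and the candidate set E.
    candidates = []
    for r, row in enumerate(confidences):
        a = 1.0
        for j, c in enumerate(row, start=1):
            a *= c
            if a > 0.0:
                candidates.append((a, r, j))
    # Step 3: sort descending by a_{r,j}; a is non-increasing in j for a
    # fixed r, so each request's positions are visited in prefix order.
    candidates.sort(key=lambda t: (-t[0], t[1], t[2]))

    lengths = [0] * len(confidences)
    best_lengths = list(lengths)
    best, B, tau = 0.0, 0, 0.0
    # Step 4: admit candidates while estimated throughput keeps improving.
    for a, r, j in candidates:
        lengths[r] = j
        B += 1
        tau += a
        theta = tau * sps(B)
        if theta > best:
            best = theta
            best_lengths = list(lengths)
        else:
            break  # early stopping
    # Step 5: per-request verification lengths.
    return best_lengths
```

For example, with `confidences = [[0.9, 0.8], [0.5, 0.1]]` and a toy profile `sps = lambda B: 1.0 / (1.0 + 0.5 * B)`, the scheduler admits three tokens and stops before the low-survival fourth one, giving lengths `[2, 1]`.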
What's Implemented Now
| Component | Status | Location |
|---|---|---|
| Markov head (vanilla, gated) | ✅ | models/markov_head.py |
| RNN head | ✅ | models/rnn_head.py |
| Confidence head | ✅ | models/confidence_head.py |
| Acceptance rule | ✅ | decoding/acceptance.py |
| Hardware-aware scheduler | ✅ | decoding/scheduler.py |
| Static scheduler (fallback) | ✅ | decoding/scheduler.py |
| Speculative decoding loop | ✅ | decoding/speculative.py |
| Training dataset | ✅ | training/dataset.py |
| Loss functions (CE + TV + Conf) | ✅ | training/losses.py |
| Training loop | ✅ | training/train_drafter.py |
| Target cache generation | ✅ | training/cache_targets.py |
| STS calibration | ✅ | calibration/sts.py |
| Acceptance evaluation | ✅ | evaluation/eval_acceptance.py |
| Latency benchmarking | ✅ | evaluation/benchmark_latency.py |
| Unit tests (55) | ✅ | tests/ |
| Smoke train script | ✅ | scripts/smoke_train.py |
| Smoke eval script | ✅ | scripts/smoke_eval.py |
| Benchmark script | ✅ | scripts/run_benchmark.py |
Deviations from Paper
- Parallel backbone: Uses `nn.TransformerEncoder` instead of the full DFlash-style backbone with target model KV injection.
- Training efficiency: No multi-GPU training, no 38 TB target cache. Target logits are computed on the fly.
- SPS profiling: Uses a synthetic throughput profile instead of real engine profiling.
- Mask token: Uses token 0 as placeholder; the paper uses learned mask embeddings.
- Anchor modification: Paper treats anchor as first prediction position (γ inputs → γ outputs). We follow the original DFlash style (γ inputs → γ outputs) where anchor is part of the input.
Future Work
- Implement full DFlash backbone with target model KV injection
- Multi-GPU distributed training (DeepSpeed/FSDP)
- Real engine throughput profiling
- Tree-based verification for autoregressive drafters
- Integration with vLLM or similar serving frameworks
- Qwen3-specific model classes with proper configuration