UraionSpec / docs /DSpark_implementation_notes.md

UraionLabs

Initial public release: UraionSpec v0.1.0 — Faithful DSpark-style speculative decoding

3c1da87 verified 6 days ago

preview code

Raw

History Blame Contribute Delete

4.76 kB

DSpark Implementation Notes

Architecture Summary

DSpark introduces two key innovations over existing speculative decoding methods:

1. Semi-Autoregressive Generation

Parallel backbone (DFlash-style): Processes all γ positions in a single forward pass, making drafting latency nearly independent of block size.
Sequential head: Two variants:
- Markov head (VanillaMarkov/GatedMarkovHead): Low-rank transition bias B = W1[x_{k-1}] @ W2 (r=256).
- RNN head (RNNHead): GRU-like recurrent state accumulating full prefix history.
The sequential stage fixes the "multi-modal collision" problem where parallel drafters produce incoherent combinations like "of problem" instead of "of course".

2. Confidence-Scheduled Verification

Confidence head: Predicts per-position conditional acceptance probability c_k = σ(w^T[h_k; W1[x_{k-1}]]).
Sequential Temperature Scaling (STS): Calibrates cumulative prefix survival probabilities left-to-right via 1D grid search minimizing ECE.
Hardware-Aware Prefix Scheduler (Algorithm 1): Maximizes Θ = τ · SPS(B) by greedily admitting highest-survival-probability tokens with early stopping.

Algorithm Details

Acceptance Rule (Lossless)

P(accept token x_k) = min(1, p^t_k(x_k) / p^d_k(x_k))

First rejection at position k discards tokens k+1..γ
One bonus token sampled from residual distribution at rejection position
Preserves exact target distribution

Draft Distribution

p_k(v | x_0, x_<k) = exp(U_k(v) + B_k(x_0, x_<k, v)) / Σ exp(U_k(u) + B_k(...))

where U_k are base logits from parallel backbone and B_k is the sequential transition bias.

Training Objective (Eq. 12)

L = 0.1 · L_ce + 0.9 · L_tv + 1.0 · L_conf

L_ce: Cross-entropy for next-token prediction
L_tv: Total variation distance ||p_d - p_t||_1 (proxy for acceptance rate)
L_conf: Binary cross-entropy on confidence predictions
All position-weighted by w_k = exp(-(k-1)/γ)

STS Calibration

For each position k = 1..γ:

Compute cumulative product of calibrated scores up to k
Find temperature T_k minimizing ECE of cumulative product
Apply T_k to k-th position score
Order-preserving: doesn't disrupt relative rankings

Algorithm 1: Hardware-Aware Prefix Scheduler

1. Compute a_{r,j} = ∏_{i≤j} c_{r,i} for each request r, position j
2. Create candidate set E = {(r,j) | a_{r,j} > 0}
3. Sort E descending by a_{r,j}
4. Greedily add candidates:
   - Update ℓ_r = j, B += 1, τ += a_{r,j}
   - Compute Θ = τ * SPS(B)
   - If Θ > best, update best lengths; else break
5. Return per-request verification lengths

What's Implemented Now

Component	Status	Location
Markov head (vanilla, gated)	✅	`models/markov_head.py`
RNN head	✅	`models/rnn_head.py`
Confidence head	✅	`models/confidence_head.py`
Acceptance rule	✅	`decoding/acceptance.py`
Hardware-aware scheduler	✅	`decoding/scheduler.py`
Static scheduler (fallback)	✅	`decoding/scheduler.py`
Speculative decoding loop	✅	`decoding/speculative.py`
Training dataset	✅	`training/dataset.py`
Loss functions (CE + TV + Conf)	✅	`training/losses.py`
Training loop	✅	`training/train_drafter.py`
Target cache generation	✅	`training/cache_targets.py`
STS calibration	✅	`calibration/sts.py`
Acceptance evaluation	✅	`evaluation/eval_acceptance.py`
Latency benchmarking	✅	`evaluation/benchmark_latency.py`
Unit tests (55)	✅	`tests/`
Smoke train script	✅	`scripts/smoke_train.py`
Smoke eval script	✅	`scripts/smoke_eval.py`
Benchmark script	✅	`scripts/run_benchmark.py`

Deviations from Paper

Parallel backbone: Uses nn.TransformerEncoder instead of the full DFlash-style backbone with target model KV injection.
Training efficiency: No multi-GPU training, no 38TB target cache. Target logits computed on-the-fly.
SPS profiling: Uses a synthetic throughput profile instead of real engine profiling.
Mask token: Uses token 0 as placeholder; the paper uses learned mask embeddings.
Anchor modification: Paper treats anchor as first prediction position (γ inputs → γ outputs). We follow the original DFlash style (γ inputs → γ outputs) where anchor is part of the input.

Future Work

Implement full DFlash backbone with target model KV injection
Multi-GPU distributed training (DeepSpeed/FSDP)
Real engine throughput profiling
Tree-based verification for autoregressive drafters
Integration with vLLM or similar serving frameworks
Qwen3-specific model classes with proper configuration