Speculative Pipeline Decoding: Speculation Head Checkpoints

This repository contains pre-trained pipeline speculation head weights for the paper Speculative Pipeline Decoding: Higher-Accruacy and Zero-Bubble Speculation via Pipeline Parallelism.

Speculative Pipeline Decoding (SPD) is a framework that uses pipeline parallelism to accelerate LLM decoding. By partitioning the target LLM into $n$ pipeline stages, SPD lets the model process $n$ tokens in parallel, achieving higher acceptance rates and zero latency bubbles.
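
As a rough illustration of what partitioning a model into $n$ stages can look like, the sketch below splits layer indices into contiguous, near-equal groups. The even split and the name partition_layers are assumptions made for illustration only; the paper and the official repository may assign layers to stages differently.

```python
from typing import List


def partition_layers(num_layers: int, num_stages: int) -> List[range]:
    """Split layer indices 0..num_layers-1 into contiguous stages.

    Hypothetical helper: an even split is assumed here, which is not
    necessarily the scheme used by SPD itself.
    """
    if not 1 <= num_stages <= num_layers:
        raise ValueError("num_stages must be between 1 and num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for stage_idx in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage_idx < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


if __name__ == "__main__":
    for idx, layers in enumerate(partition_layers(10, 4)):
        print(f"stage {idx}: layers {list(layers)}")
```

With one token in flight per stage, a pipeline of $n$ stages like this holds $n$ tokens at once, which is the parallelism the paragraph above describes.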

Quick Start (Inference)

To run inference using these checkpoints, clone the official repository and use the provided pipeline_inference.py script. You must pair the speculation head with the corresponding base model it was trained on.

python pipeline_inference.py \
  --spec_head_ckpt /path/to/checkpoint.pt \
  --base_model_path Qwen/Qwen3.5-4B \
  --max_new_tokens 100 \
  --temperature 0.0

Checkpoint Information

Each .pt file is a single checkpoint produced by training. For more details on training and evaluation, see the official repo.

Filename format

Files are named: {model}_s{num_stages}_l{num_spec_layers}.pt

| Part | Meaning |
|---|---|
| {model} | Base model tag (e.g. Qwen3.5-4B, Qwen3.5-9B) |
| s{...} | num_stages: pipeline depth (number of target-model stages) |
| l{...} | num_spec_layers: number of Transformer layers in the speculation module |

Example: Qwen3.5-9B_s16_l2.pt → Qwen3.5-9B base, 16 stages, 2 spec layers.
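
The naming scheme above can be parsed mechanically. The following is a minimal sketch, assuming the suffix is always _s<int>_l<int>.pt; the names CheckpointName and parse_checkpoint_name are hypothetical helpers and are not part of the official repository.

```python
import re
from typing import NamedTuple


class CheckpointName(NamedTuple):
    model: str            # base model tag, e.g. "Qwen3.5-9B"
    num_stages: int       # pipeline depth
    num_spec_layers: int  # Transformer layers in the speculation module


# The greedy (.+) lets the model tag itself contain underscores; the
# _s<int>_l<int>.pt suffix is anchored to the end of the name.
_NAME_RE = re.compile(r"^(?P<model>.+)_s(?P<stages>\d+)_l(?P<layers>\d+)\.pt$")


def parse_checkpoint_name(filename: str) -> CheckpointName:
    """Split a checkpoint filename into its three documented parts."""
    # Drop any directory prefix so full paths are accepted as well.
    basename = filename.replace("\\", "/").rsplit("/", 1)[-1]
    match = _NAME_RE.match(basename)
    if match is None:
        raise ValueError(f"not a recognised checkpoint name: {filename!r}")
    return CheckpointName(
        model=match["model"],
        num_stages=int(match["stages"]),
        num_spec_layers=int(match["layers"]),
    )


if __name__ == "__main__":
    print(parse_checkpoint_name("Qwen3.5-9B_s16_l2.pt"))
```

The parsed model tag can help with choosing the matching base model for --base_model_path, since the speculation head must be paired with the model it was trained on.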

Checkpoint contents

Each file is a PyTorch archive with two top-level keys:

{
    "state_dict": ...,  # weights of the speculation module
    "config": { ... },  # hyperparameters and metadata
}
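
To inspect a checkpoint outside of pipeline_inference.py, something like the sketch below may be enough. It assumes the file can be read with torch.load and relies only on the two top-level keys listed above; split_checkpoint and load_checkpoint are hypothetical helper names, not functions from the official repository.

```python
from typing import Any, Dict, Tuple


def split_checkpoint(ckpt: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (state_dict, config) after checking both documented keys exist."""
    missing = [key for key in ("state_dict", "config") if key not in ckpt]
    if missing:
        raise KeyError(f"checkpoint is missing top-level keys: {missing}")
    return ckpt["state_dict"], ckpt["config"]


def load_checkpoint(path: str) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Load a .pt file on CPU and split it into weights and config."""
    import torch  # imported lazily so split_checkpoint works without PyTorch

    # map_location="cpu" avoids needing a GPU just to look inside the file.
    ckpt = torch.load(path, map_location="cpu")
    return split_checkpoint(ckpt)


if __name__ == "__main__":
    import sys

    # Usage: python inspect_ckpt.py /path/to/checkpoint.pt
    state_dict, config = load_checkpoint(sys.argv[1])
    print(f"{len(state_dict)} tensors in state_dict")
    for name, value in config.items():
        print(f"{name}: {value}")
```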

config fields (always present)

| Field | Description |
|---|---|
| base_model_path | Base model path recorded at training time (can be overridden via --base_model_path at load time) |
| hidden_size | Hidden size (matches base model) |
| vocab_size | Base model vocabulary size |
| draft_vocab_size | Draft head output size (full vocab or draft subset) |
| num_stages | Pipeline depth (same as s in filename) |
| num_spec_layers | Speculation module depth (same as l in filename) |
| version | Checkpoint format version (10) |
| trained_with_use_deepest | Whether training used deepest-layer features |
| shallow_hidden_layer_indices | Which base layers feed the speculation module |
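
Because num_stages and num_spec_layers appear both in the filename and in config, a quick consistency check can catch a renamed or mismatched file. The sketch below is one possible check, assuming config behaves like a plain dict; check_config is a hypothetical helper, and the field names are copied from the list above.

```python
from typing import Any, Dict, List

# Field names copied from the "config fields" list.
REQUIRED_FIELDS = (
    "base_model_path",
    "hidden_size",
    "vocab_size",
    "draft_vocab_size",
    "num_stages",
    "num_spec_layers",
    "version",
    "trained_with_use_deepest",
    "shallow_hidden_layer_indices",
)


def check_config(config: Dict[str, Any], num_stages: int, num_spec_layers: int) -> List[str]:
    """Return a list of problems; an empty list means the config looks consistent.

    num_stages and num_spec_layers are the values read from the filename.
    """
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in config]
    expected = {"num_stages": num_stages, "num_spec_layers": num_spec_layers}
    for name, value in expected.items():
        # Only compare fields that are present; absent ones are reported above.
        if name in config and config[name] != value:
            problems.append(f"{name}: filename says {value}, config says {config[name]}")
    return problems


if __name__ == "__main__":
    # Placeholder values for illustration only, not taken from a real checkpoint.
    demo = {name: None for name in REQUIRED_FIELDS}
    demo.update(num_stages=16, num_spec_layers=2)
    print(check_config(demo, num_stages=16, num_spec_layers=2))
```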

Citation

If you use this work, please cite our paper:

@misc{yu2026speculativepipelinedecodinghigheraccruacy,
      title={Speculative Pipeline Decoding: Higher-Accruacy and Zero-Bubble Speculation via Pipeline Parallelism}, 
      author={Yijiong Yu and Huazheng Wang and Shuai Yuan and Ruilong Ren and Ji Pei},
      year={2026},
      eprint={2605.30852},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.30852}, 
}