Speculative Pipeline Decoding: Speculation Head Checkpoints
This repository contains pre-trained pipeline speculation head weights for the paper Speculative Pipeline Decoding: Higher-Accruacy and Zero-Bubble Speculation via Pipeline Parallelism.
Speculative Pipeline Decoding (SPD) is a framework that unlocks the potential of pipeline parallelism for LLM decoding acceleration. By partitioning the target LLM into $n$ pipeline stages, SPD allows the model to process $n$ tokens in parallel, achieving higher acceptance rates and zero latency bubbles.
- Paper: https://huggingface.co/papers/2605.30852
- Code: https://github.com/yuyijiong/speculative_pipeline_decoding
Quick Start (Inference)
To run inference using these checkpoints, clone the official repository and use the provided pipeline_inference.py script. You must pair the speculation head with the corresponding base model it was trained on.
```bash
python pipeline_inference.py \
  --spec_head_ckpt /path/to/checkpoint.pt \
  --base_model_path Qwen/Qwen3.5-4B \
  --max_new_tokens 100 \
  --temperature 0.0
```
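If you prefer to launch the script from Python (for example, to loop over several checkpoints), the command above can be assembled programmatically. The following is a minimal sketch, not part of the official repository: the helper name `build_inference_command` is hypothetical, and it only uses the four flags shown above.

```python
import shlex
import sys


def build_inference_command(spec_head_ckpt, base_model_path,
                            max_new_tokens=100, temperature=0.0):
    """Return the argv list for pipeline_inference.py (hypothetical helper).

    Only the flags shown in the Quick Start command are used; any other
    options the script may accept are not covered here.
    """
    return [
        sys.executable, "pipeline_inference.py",
        "--spec_head_ckpt", str(spec_head_ckpt),
        "--base_model_path", str(base_model_path),
        "--max_new_tokens", str(max_new_tokens),
        "--temperature", str(temperature),
    ]


if __name__ == "__main__":
    cmd = build_inference_command("/path/to/checkpoint.pt", "Qwen/Qwen3.5-4B")
    # Print the shell-quoted command; to execute it, pass `cmd` to
    # subprocess.run(cmd, check=True, cwd=<path to the cloned repository>).
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems when checkpoint paths contain spaces.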
Checkpoint Information
Each .pt file is a single checkpoint produced by training. For more details on training and evaluation, see the official repo.
Filename format
Files are named:
```
{model}_s{num_stages}_l{num_spec_layers}.pt
```
| Part | Meaning |
|---|---|
| {model} | Base model tag (e.g. Qwen3.5-4B, Qwen3.5-9B) |
| s{...} | num_stages: pipeline depth (number of target-model stages) |
| l{...} | num_spec_layers: number of Transformer layers in the speculation module |
Example: Qwen3.5-9B_s16_l2.pt → Qwen3.5-9B base, 16 stages, 2 spec layers.
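The naming scheme can be parsed mechanically, which is handy when selecting a checkpoint by stage count or matching it to a base model. The sketch below is illustrative only: `parse_checkpoint_name` and `CheckpointName` are hypothetical names, and the regular expression assumes filenames follow the format above exactly.

```python
import re
from pathlib import PurePath
from typing import NamedTuple


class CheckpointName(NamedTuple):
    model: str            # base model tag, e.g. "Qwen3.5-9B"
    num_stages: int       # pipeline depth (the s{...} part)
    num_spec_layers: int  # speculation module depth (the l{...} part)


# {model}_s{num_stages}_l{num_spec_layers}.pt
_NAME_RE = re.compile(r"^(?P<model>.+)_s(?P<stages>\d+)_l(?P<layers>\d+)\.pt$")


def parse_checkpoint_name(path: str) -> CheckpointName:
    """Split a checkpoint filename into its three documented parts."""
    filename = PurePath(path).name  # ignore any leading directories
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected checkpoint filename: {filename!r}")
    return CheckpointName(
        model=match["model"],
        num_stages=int(match["stages"]),
        num_spec_layers=int(match["layers"]),
    )


if __name__ == "__main__":
    print(parse_checkpoint_name("Qwen3.5-9B_s16_l2.pt"))
```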
Checkpoint contents
Each file is a PyTorch archive with two top-level keys:
```python
{
    "state_dict": ...,  # weights of the speculation module
    "config": { ... },  # hyperparameters and metadata
}
```
config fields (always present)
| Field | Description |
|---|---|
| base_model_path | Base model path recorded at training time (can be overridden via --base_model_path at load time) |
| hidden_size | Hidden size (matches base model) |
| vocab_size | Base model vocabulary size |
| draft_vocab_size | Draft head output size (full vocab or draft subset) |
| num_stages | Pipeline depth (same as s in filename) |
| num_spec_layers | Speculation module depth (same as l in filename) |
| version | Checkpoint format version (10) |
| trained_with_use_deepest | Whether training used deepest-layer features |
| shallow_hidden_layer_indices | Which base layers feed the speculation module |
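To inspect a checkpoint before running inference, you can load it on CPU and confirm that the fields listed above are present. This is a rough sketch under stated assumptions: `missing_config_fields` and `load_checkpoint` are hypothetical helpers, not functions from the official repository, and the file is assumed to be loadable with a plain `torch.load` call. The field names are taken from the table above.

```python
from typing import List

# Field names as listed in the config table above.
REQUIRED_CONFIG_FIELDS = (
    "base_model_path",
    "hidden_size",
    "vocab_size",
    "draft_vocab_size",
    "num_stages",
    "num_spec_layers",
    "version",
    "trained_with_use_deepest",
    "shallow_hidden_layer_indices",
)


def missing_config_fields(checkpoint: dict) -> List[str]:
    """Return the documented config fields absent from a loaded checkpoint."""
    config = checkpoint.get("config", {})
    return [name for name in REQUIRED_CONFIG_FIELDS if name not in config]


def load_checkpoint(path: str) -> dict:
    """Load a .pt checkpoint on CPU and check its config (hypothetical helper)."""
    import torch  # imported lazily so the field check above works without torch

    # Assumption: the default torch.load settings can read the file. Recent
    # PyTorch versions default to weights_only=True, which may reject
    # checkpoints containing non-tensor Python objects.
    checkpoint = torch.load(path, map_location="cpu")
    missing = missing_config_fields(checkpoint)
    if missing:
        raise KeyError(f"checkpoint config is missing fields: {missing}")
    return checkpoint
```

After loading, `checkpoint["config"]["num_stages"]` and `checkpoint["config"]["num_spec_layers"]` should agree with the `s` and `l` values in the filename.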
Citation
If you use this work, please cite our paper:
```bibtex
@misc{yu2026speculativepipelinedecodinghigheraccruacy,
  title={Speculative Pipeline Decoding: Higher-Accruacy and Zero-Bubble Speculation via Pipeline Parallelism},
  author={Yijiong Yu and Huazheng Wang and Shuai Yuan and Ruilong Ren and Ji Pei},
  year={2026},
  eprint={2605.30852},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.30852},
}
```