
---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- speculative-decoding
- pipeline-parallelism
- llm-acceleration
---

# Speculative Pipeline Decoding: Speculation Head Checkpoints

This repository contains pre-trained pipeline speculation head weights for the paper [Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism](https://huggingface.co/papers/2605.30852).

Speculative Pipeline Decoding (SPD) is a framework that uses pipeline parallelism to accelerate LLM decoding. By partitioning the target LLM into $n$ pipeline stages, SPD lets the model process $n$ tokens in parallel, achieving higher acceptance rates and zero latency bubbles.

- Paper: [https://huggingface.co/papers/2605.30852](https://huggingface.co/papers/2605.30852)
- Code: [https://github.com/yuyijiong/speculative_pipeline_decoding](https://github.com/yuyijiong/speculative_pipeline_decoding)

## Quick Start (Inference)

To run inference with these checkpoints, clone the official repository and use the provided `pipeline_inference.py` script. Each speculation head must be paired with the base model it was trained on.

```bash
python pipeline_inference.py \
  --spec_head_ckpt /path/to/checkpoint.pt \
  --base_model_path Qwen/Qwen3.5-4B \
  --max_new_tokens 100 \
  --temperature 0.0
```
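
When the script is launched from Python, for example to sweep several checkpoints, the command above can be assembled programmatically. The following is a minimal sketch under stated assumptions: `build_inference_command` is a hypothetical helper that is not part of the official repository, and it uses only the four flags shown in the command above.

```python
import shlex
import sys


def build_inference_command(spec_head_ckpt, base_model_path,
                            max_new_tokens=100, temperature=0.0,
                            script="pipeline_inference.py"):
    """Return the argv list for the Quick Start command.

    Hypothetical helper: only the flags shown in this README are used.
    """
    return [
        sys.executable, script,
        "--spec_head_ckpt", str(spec_head_ckpt),
        "--base_model_path", str(base_model_path),
        "--max_new_tokens", str(max_new_tokens),
        "--temperature", str(temperature),
    ]


cmd = build_inference_command("/path/to/checkpoint.pt", "Qwen/Qwen3.5-4B")
# Show the command without the interpreter path.
print(shlex.join(cmd[1:]))
# To actually run it from inside the cloned repository:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues when checkpoint paths contain spaces.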

## Checkpoint Information

Each `.pt` file is a single checkpoint produced by training. For more details on training and evaluation, see the [official repo](https://github.com/yuyijiong/speculative_pipeline_decoding).

### Filename format

Files are named:
`{model}_s{num_stages}_l{num_spec_layers}.pt`

| Part | Meaning |
|------|---------|
| `{model}` | Base model tag (e.g. `Qwen3.5-4B`, `Qwen3.5-9B`) |
| `s{...}` | `num_stages`: pipeline depth (number of target-model stages) |
| `l{...}` | `num_spec_layers`: number of Transformer layers in the speculation module |

Example: `Qwen3.5-9B_s16_l2.pt` → Qwen3.5-9B base, 16 stages, 2 spec layers.
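
The naming scheme above can be parsed mechanically. The sketch below is illustrative only: `CheckpointName` and `parse_checkpoint_name` are hypothetical names, not part of the official repository, and the regular expression assumes the `_s<digits>_l<digits>.pt` suffix is always present.

```python
import re
from typing import NamedTuple


class CheckpointName(NamedTuple):
    model: str
    num_stages: int
    num_spec_layers: int


# {model}_s{num_stages}_l{num_spec_layers}.pt
# The model tag may itself contain dots, dashes, or underscores,
# so it is matched greedily and the suffix is anchored at the end.
_PATTERN = re.compile(r"^(?P<model>.+)_s(?P<stages>\d+)_l(?P<layers>\d+)\.pt$")


def parse_checkpoint_name(filename: str) -> CheckpointName:
    """Split a checkpoint filename into its three documented parts."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected checkpoint filename: {filename!r}")
    return CheckpointName(match["model"], int(match["stages"]), int(match["layers"]))


info = parse_checkpoint_name("Qwen3.5-9B_s16_l2.pt")
print(info)
# CheckpointName(model='Qwen3.5-9B', num_stages=16, num_spec_layers=2)
```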

### Checkpoint contents

Each file is a PyTorch archive with two top-level keys:

```python
{
    "state_dict": ...,  # weights of the speculation module
    "config": { ... },  # hyperparameters and metadata
}
```

### `config` fields (always present)

| Field | Description |
|-------|-------------|
| `base_model_path` | Base model path recorded at training time (can be overridden via `--base_model_path` at load time) |
| `hidden_size` | Hidden size (matches base model) |
| `vocab_size` | Base model vocabulary size |
| `draft_vocab_size` | Draft head output size (full vocab or draft subset) |
| `num_stages` | Pipeline depth (same as `s` in filename) |
| `num_spec_layers` | Speculation module depth (same as `l` in filename) |
| `version` | Checkpoint format version (`10`) |
| `trained_with_use_deepest` | Whether training used deepest-layer features |
| `shallow_hidden_layer_indices` | Which base layers feed the speculation module |
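
Because these fields are always present, a loaded checkpoint can be sanity-checked before use. The sketch below is a minimal example under stated assumptions: `check_config` is a hypothetical helper, not part of the official repository, and it assumes the `config` dict has already been obtained, for instance with `torch.load(path, map_location="cpu")["config"]`. The example values are `None` placeholders except for the three values that appear in this README.

```python
REQUIRED_CONFIG_FIELDS = (
    "base_model_path", "hidden_size", "vocab_size", "draft_vocab_size",
    "num_stages", "num_spec_layers", "version",
    "trained_with_use_deepest", "shallow_hidden_layer_indices",
)


def check_config(config, num_stages=None, num_spec_layers=None):
    """Return a list of problems found in a checkpoint `config` dict.

    Hypothetical helper: reports missing documented fields and, if given,
    mismatches against the values encoded in the checkpoint filename.
    """
    problems = [f"missing field: {name}"
                for name in REQUIRED_CONFIG_FIELDS if name not in config]
    for name, expected in (("num_stages", num_stages),
                           ("num_spec_layers", num_spec_layers)):
        if expected is not None and name in config and config[name] != expected:
            problems.append(
                f"{name}: config has {config[name]}, filename says {expected}")
    return problems


# Placeholder config: every field present, values left as None except
# those documented above (16 stages, 2 spec layers, format version 10).
example_config = dict.fromkeys(REQUIRED_CONFIG_FIELDS)
example_config.update(num_stages=16, num_spec_layers=2, version=10)

print(check_config(example_config, num_stages=16, num_spec_layers=2))  # []
print(check_config({"num_stages": 8}, num_stages=16))
```

The second call lists each missing field and the `num_stages` mismatch, which is the kind of error that arises when a checkpoint file has been renamed.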

## Citation

If you use this work, please cite our paper:

```bibtex
@misc{yu2026speculativepipelinedecodinghigheraccruacy,
  title={Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism},
  author={Yijiong Yu and Huazheng Wang and Shuai Yuan and Ruilong Ren and Ji Pei},
  year={2026},
  eprint={2605.30852},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.30852},
}
```