yuyijiong

nielsr (HF Staff) committed about 8 hours ago

Commit

4484ec8

1 Parent(s): 2ecd604

Add metadata, paper/code links, and sample usage (#1)

- Add metadata, paper/code links, and sample usage (8dea2d08154e9d8cce9b6698ede7da9016cac450)

Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)

README.md +88 -90

@@ -1,90 +1,88 @@
-# Speculation head checkpoints
-Pre-trained **pipeline speculation head** weights. Each `.pt` file is a single checkpoint produced by training; pair it with the **same** base model architecture it was trained on (see `config["base_model_path"]` inside the file).
-For inference, evaluation, and training examples, see the official repo:
-**https://github.com/yuyijiong/speculative_pipeline_decoding**
-## Filename format
-Files are named:
-```text
-{model}_s{num_stages}_l{num_spec_layers}.pt
-```
-| Part | Meaning |
-|------|---------|
-| `{model}` | Base model tag from training config (e.g. `Qwen3.5-4B`, `Qwen3.5-9B`) |
-| `s{...}` | `num_stages` — pipeline depth (number of target-model stages) |
-| `l{...}` | `num_spec_layers` — number of Transformer layers in the speculation module |
-Example: `Qwen3.5-9B_s16_l2.pt` → Qwen3.5-9B base, 16 stages, 2 spec layers.
-## Checkpoint contents
-Each file is a PyTorch archive with two top-level keys:
-```python
-{
-    "state_dict": ...,  # weights of the speculation module
-    "config": { ... },  # hyperparameters and metadata
-}
-```
-### `config` fields (always present)
-| Field | Description |
-|-------|-------------|
-| `base_model_path` | Base model path recorded at training time (often a machine-local path; override at load time — see below) |
-| `hidden_size` | Hidden size (matches base model) |
-| `vocab_size` | Base model vocabulary size |
-| `draft_vocab_size` | Draft head output size (full vocab or draft subset) |
-| `num_stages` | Pipeline depth (same as `s` in filename) |
-| `num_spec_layers` | Speculation module depth (same as `l` in filename) |
-| `version` | Checkpoint format version (`10`) |
-| `trained_with_use_deepest` | Whether training used deepest-layer features |
-| `shallow_hidden_layer_indices` | Which base layers feed the speculation module |
-### `config` fields (optional)
-| Field | Description |
-|-------|-------------|
-| `spec_init_from_base_layers` | Base layers used to initialize the spec module (if any) |
-| `draft_token_ids` | Draft vocabulary token ids (only when trained with a draft vocab subset) |
-## Loading checkpoints
-`config["base_model_path"]` is often a **local path from the training machine** (e.g. `/share/models/Qwen3.5-4B`). On your machine, pass the correct Hugging Face id or local directory via `--base_model_path`; it **overrides** the path stored in the checkpoint:
-```bash
-python pipeline_inference.py \
-  --spec_head_ckpt /path/to/Qwen3.5-4B_s4_l2.pt \
-  --base_model_path Qwen/Qwen3.5-4B
-python eval.py \
-  --spec_head_ckpt /path/to/Qwen3.5-4B_s4_l2.pt \
-  --base_model_path /your/local/Qwen3.5-4B \
-  --data_dir eval_data \
-  --output_dir ./eval_output
-```
-If `--base_model_path` is omitted, the value from `config["base_model_path"]` is used as-is.
-More usage details: [speculative_pipeline_decoding](https://github.com/yuyijiong/speculative_pipeline_decoding).
-## Citation
-If you use this repo, please cite our paper:
-```bibtex
-@misc{yu2026speculativepipelinedecodinghigheraccruacy,
-      title={Speculative Pipeline Decoding: Higher-Accruacy and Zero-Bubble Speculation via Pipeline Parallelism},
-      author={Yijiong Yu and Huazheng Wang and Shuai Yuan and Ruilong Ren and Ji Pei},
-      year={2026},
-      eprint={2605.30852},
-      archivePrefix={arXiv},
-      primaryClass={cs.CL},
-      url={https://arxiv.org/abs/2605.30852},
-}
-```

+---
+license: apache-2.0
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- speculative-decoding
+- pipeline-parallelism
+- llm-acceleration
+---
+# Speculative Pipeline Decoding: Speculation Head Checkpoints
+This repository contains pre-trained **pipeline speculation head** weights for the paper [Speculative Pipeline Decoding: Higher-Accruacy and Zero-Bubble Speculation via Pipeline Parallelism](https://huggingface.co/papers/2605.30852).
+Speculative Pipeline Decoding (SPD) is a framework that unlocks the potential of pipeline parallelism for LLM decoding acceleration. By partitioning the target LLM into $n$ pipeline stages, SPD allows the model to process $n$ tokens in parallel, achieving higher acceptance rates and zero latency bubbles.
+- **Paper:** [https://huggingface.co/papers/2605.30852](https://huggingface.co/papers/2605.30852)
+- **Code:** [https://github.com/yuyijiong/speculative_pipeline_decoding](https://github.com/yuyijiong/speculative_pipeline_decoding)
+## Quick Start (Inference)
+To run inference using these checkpoints, clone the official repository and use the provided `pipeline_inference.py` script. You must pair the speculation head with the corresponding base model it was trained on.
+```bash
+python pipeline_inference.py \
+  --spec_head_ckpt /path/to/checkpoint.pt \
+  --base_model_path Qwen/Qwen3.5-4B \
+  --max_new_tokens 100 \
+  --temperature 0.0
+```
+## Checkpoint Information
+Each `.pt` file is a single checkpoint produced by training. For more details on training and evaluation, see the [official repo](https://github.com/yuyijiong/speculative_pipeline_decoding).
+### Filename format
+Files are named:
+`{model}_s{num_stages}_l{num_spec_layers}.pt`
+| Part | Meaning |
+|------|---------|
+| `{model}` | Base model tag (e.g. `Qwen3.5-4B`, `Qwen3.5-9B`) |
+| `s{...}` | `num_stages` — pipeline depth (number of target-model stages) |
+| `l{...}` | `num_spec_layers` — number of Transformer layers in the speculation module |
+Example: `Qwen3.5-9B_s16_l2.pt` → Qwen3.5-9B base, 16 stages, 2 spec layers.
+### Checkpoint contents
+Each file is a PyTorch archive with two top-level keys:
+```python
+{
+    "state_dict": ...,  # weights of the speculation module
+    "config": { ... },  # hyperparameters and metadata
+}
+```
+### `config` fields (always present)
+| Field | Description |
+|-------|-------------|
+| `base_model_path` | Base model path recorded at training time (can be overridden via `--base_model_path` at load time) |
+| `hidden_size` | Hidden size (matches base model) |
+| `vocab_size` | Base model vocabulary size |
+| `draft_vocab_size` | Draft head output size (full vocab or draft subset) |
+| `num_stages` | Pipeline depth (same as `s` in filename) |
+| `num_spec_layers` | Speculation module depth (same as `l` in filename) |
+| `version` | Checkpoint format version (`10`) |
+| `trained_with_use_deepest` | Whether training used deepest-layer features |
+| `shallow_hidden_layer_indices` | Which base layers feed the speculation module |
+## Citation
+If you use this work, please cite our paper:
+```bibtex
+@misc{yu2026speculativepipelinedecodinghigheraccruacy,
+      title={Speculative Pipeline Decoding: Higher-Accruacy and Zero-Bubble Speculation via Pipeline Parallelism},
+      author={Yijiong Yu and Huazheng Wang and Shuai Yuan and Ruilong Ren and Ji Pei},
+      year={2026},
+      eprint={2605.30852},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2605.30852},
+}
+```
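
The filename convention and the `--base_model_path` override rule described in the README above can be sketched in a few lines of Python. This is a minimal illustration, not code from the official repository: the helper names `parse_checkpoint_name`, `matches_config`, and `resolve_base_model_path` are hypothetical, and the sketch assumes the model tag never ends in something that itself looks like `_s<digits>_l<digits>`. The `config` argument stands in for the `"config"` dictionary stored in a checkpoint; loading the `.pt` file itself is left out.

```python
import os
import re
from typing import NamedTuple, Optional


class CheckpointName(NamedTuple):
    """Fields encoded in `{model}_s{num_stages}_l{num_spec_layers}.pt`."""
    model: str
    num_stages: int
    num_spec_layers: int


# The model tag may contain dots and dashes (e.g. "Qwen3.5-9B"), so it is
# matched greedily and the `_s.._l...pt` suffix is anchored at the end.
_NAME_RE = re.compile(r"^(?P<model>.+)_s(?P<stages>\d+)_l(?P<layers>\d+)\.pt$")


def parse_checkpoint_name(path: str) -> CheckpointName:
    """Split a checkpoint filename (hypothetical helper) into its parts."""
    filename = os.path.basename(path)
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected checkpoint filename: {filename!r}")
    return CheckpointName(match["model"], int(match["stages"]), int(match["layers"]))


def matches_config(name: CheckpointName, config: dict) -> bool:
    """Check that `s` and `l` in the filename agree with the stored config."""
    return (
        config.get("num_stages") == name.num_stages
        and config.get("num_spec_layers") == name.num_spec_layers
    )


def resolve_base_model_path(config: dict, override: Optional[str] = None) -> str:
    """An explicit override wins; otherwise use the path recorded at training time."""
    return override if override else config["base_model_path"]


if __name__ == "__main__":
    # Hand-written stand-in for a checkpoint's "config" entry (illustrative values).
    config = {"base_model_path": "/share/models/Qwen3.5-4B",
              "num_stages": 4, "num_spec_layers": 2}
    name = parse_checkpoint_name("/path/to/Qwen3.5-4B_s4_l2.pt")
    print(name, matches_config(name, config))
    print(resolve_base_model_path(config, "Qwen/Qwen3.5-4B"))
```

A check like `matches_config` is a cheap way to catch a renamed or mismatched file before pairing the head with a base model, since the README states that `num_stages` and `num_spec_layers` in the config mirror the `s` and `l` parts of the filename.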