---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- speculative-decoding
- pipeline-parallelism
- llm-acceleration
---

# Speculative Pipeline Decoding: Speculation Head Checkpoints

This repository contains pre-trained **pipeline speculation head** weights for the paper [Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism](https://huggingface.co/papers/2605.30852).

Speculative Pipeline Decoding (SPD) is a framework that uses pipeline parallelism to accelerate LLM decoding. By partitioning the target LLM into $n$ pipeline stages, SPD lets the model process $n$ tokens in parallel, achieving higher acceptance rates with zero latency bubbles.

- **Paper:** [https://huggingface.co/papers/2605.30852](https://huggingface.co/papers/2605.30852)
- **Code:** [https://github.com/yuyijiong/speculative_pipeline_decoding](https://github.com/yuyijiong/speculative_pipeline_decoding)

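
As a rough illustration of what partitioning a model into $n$ pipeline stages means, the sketch below splits a stack of layers into contiguous, near-equal stages. The even split, the 32-layer example, and the helper name `partition_layers` are assumptions made for illustration only; the stage assignment SPD actually uses is defined in the official code.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal stages."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("num_stages must be between 1 and num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


if __name__ == "__main__":
    # Hypothetical 32-layer model split into 16 stages.
    for i, layers in enumerate(partition_layers(32, 16)):
        print(f"stage {i}: layers {list(layers)}")
```
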
## Quick Start (Inference)

To run inference using these checkpoints, clone the official repository and use the provided `pipeline_inference.py` script. You must pair the speculation head with the corresponding base model it was trained on.

```bash
python pipeline_inference.py \
    --spec_head_ckpt /path/to/checkpoint.pt \
    --base_model_path Qwen/Qwen3.5-4B \
    --max_new_tokens 100 \
    --temperature 0.0
```

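
If you prefer to launch the script from Python, the command above can be assembled programmatically. This is a minimal sketch that uses only the four flags shown in the command; the helper names `build_inference_command` and `run_inference` are hypothetical, and `repo_dir` is assumed to be a local clone of the official repository.

```python
import subprocess
import sys


def build_inference_command(spec_head_ckpt, base_model_path,
                            max_new_tokens=100, temperature=0.0):
    """Return the argv list for pipeline_inference.py using the flags shown above."""
    return [
        sys.executable, "pipeline_inference.py",
        "--spec_head_ckpt", str(spec_head_ckpt),
        "--base_model_path", str(base_model_path),
        "--max_new_tokens", str(max_new_tokens),
        "--temperature", str(temperature),
    ]


def run_inference(repo_dir, **kwargs):
    """Run the script from inside a local clone of the repository (assumed layout)."""
    return subprocess.run(build_inference_command(**kwargs), cwd=repo_dir, check=True)
```

Building the argument list separately from running it makes it easy to log or inspect the exact command before any GPU work starts.
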
## Checkpoint Information

Each `.pt` file is a single checkpoint produced by training. For more details on training and evaluation, see the [official repo](https://github.com/yuyijiong/speculative_pipeline_decoding).

### Filename format

Files are named:

`{model}_s{num_stages}_l{num_spec_layers}.pt`

| Part | Meaning |
|------|---------|
| `{model}` | Base model tag (e.g. `Qwen3.5-4B`, `Qwen3.5-9B`) |
| `s{...}` | `num_stages`: pipeline depth (number of target-model stages) |
| `l{...}` | `num_spec_layers`: number of Transformer layers in the speculation module |

Example: `Qwen3.5-9B_s16_l2.pt` → Qwen3.5-9B base, 16 stages, 2 spec layers.

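
The naming scheme can be parsed mechanically. The sketch below is one way to do it with a regular expression; `CheckpointName` and `parse_checkpoint_name` are hypothetical helper names, not part of the official repository.

```python
import re
from pathlib import Path
from typing import NamedTuple


class CheckpointName(NamedTuple):
    model: str
    num_stages: int
    num_spec_layers: int


# {model}_s{num_stages}_l{num_spec_layers}.pt
_NAME_PATTERN = re.compile(r"^(?P<model>.+)_s(?P<stages>\d+)_l(?P<layers>\d+)\.pt$")


def parse_checkpoint_name(path: str) -> CheckpointName:
    """Extract the model tag, stage count, and spec-layer count from a filename."""
    name = Path(path).name  # ignore any leading directories
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised checkpoint name: {name!r}")
    return CheckpointName(match["model"], int(match["stages"]), int(match["layers"]))


if __name__ == "__main__":
    print(parse_checkpoint_name("Qwen3.5-9B_s16_l2.pt"))
```
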
### Checkpoint contents

Each file is a PyTorch archive with two top-level keys:

```python
{
    "state_dict": ...,  # weights of the speculation module
    "config": { ... },  # hyperparameters and metadata
}
```

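
To inspect a checkpoint without running the full inference script, something like the following should work. It is a sketch that assumes the file loads with a plain `torch.load` call onto the CPU; `load_checkpoint` and `summarize_checkpoint` are hypothetical helpers.

```python
def load_checkpoint(path):
    """Load a checkpoint onto the CPU (requires PyTorch)."""
    import torch  # imported lazily so summarize_checkpoint works without PyTorch

    return torch.load(path, map_location="cpu")


def summarize_checkpoint(ckpt):
    """Check the two documented top-level keys and return a small summary."""
    missing = {"state_dict", "config"} - set(ckpt)
    if missing:
        raise KeyError(f"checkpoint is missing keys: {sorted(missing)}")
    return {
        "num_tensors": len(ckpt["state_dict"]),
        "config_fields": sorted(ckpt["config"]),
    }
```

A typical call would be `summarize_checkpoint(load_checkpoint("/path/to/checkpoint.pt"))`, which reports how many tensors the speculation module holds and which `config` fields are present.
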
### `config` fields (always present)

| Field | Description |
|-------|-------------|
| `base_model_path` | Base model path recorded at training time (can be overridden via `--base_model_path` at load time) |
| `hidden_size` | Hidden size (matches base model) |
| `vocab_size` | Base model vocabulary size |
| `draft_vocab_size` | Draft head output size (full vocab or draft subset) |
| `num_stages` | Pipeline depth (same as `s` in filename) |
| `num_spec_layers` | Speculation module depth (same as `l` in filename) |
| `version` | Checkpoint format version (`10`) |
| `trained_with_use_deepest` | Whether training used deepest-layer features |
| `shallow_hidden_layer_indices` | Which base layers feed the speculation module |

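
Because the filename and the `config` dictionary both record `num_stages` and `num_spec_layers`, a quick consistency check can catch a renamed or mismatched file. The sketch below checks only what the table states; `check_config` is a hypothetical helper, and the `draft_vocab_size <= vocab_size` test is an assumption derived from the "full vocab or draft subset" description.

```python
REQUIRED_FIELDS = (
    "base_model_path", "hidden_size", "vocab_size", "draft_vocab_size",
    "num_stages", "num_spec_layers", "version",
    "trained_with_use_deepest", "shallow_hidden_layer_indices",
)


def check_config(config, expected_stages=None, expected_spec_layers=None):
    """Return a list of problems found; an empty list means none were detected."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in config]
    # Cross-check against the values encoded in the filename, if provided.
    if expected_stages is not None and config.get("num_stages") != expected_stages:
        problems.append(
            f"num_stages is {config.get('num_stages')}, filename says {expected_stages}")
    if (expected_spec_layers is not None
            and config.get("num_spec_layers") != expected_spec_layers):
        problems.append(
            f"num_spec_layers is {config.get('num_spec_layers')}, "
            f"filename says {expected_spec_layers}")
    # Assumption: the draft head never outputs more tokens than the base vocabulary.
    if ("vocab_size" in config and "draft_vocab_size" in config
            and config["draft_vocab_size"] > config["vocab_size"]):
        problems.append("draft_vocab_size exceeds vocab_size")
    return problems
```

For a file named `Qwen3.5-9B_s16_l2.pt`, calling `check_config(ckpt["config"], expected_stages=16, expected_spec_layers=2)` should return an empty list if the checkpoint matches its name.
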
## Citation

If you use this work, please cite our paper:

```bibtex
@misc{yu2026speculativepipelinedecodinghigheraccruacy,
      title={Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism},
      author={Yijiong Yu and Huazheng Wang and Shuai Yuan and Ruilong Ren and Ji Pei},
      year={2026},
      eprint={2605.30852},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.30852},
}
```