license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
JetSpec: Parallel Tree Drafting
JetSpec implements parallel tree drafting for fast speculative decoding of LLMs, with up to 10x acceptance length and 1000+ tokens per second (TPS) on coding and math tasks on B200 GPUs. This repository contains the draft head model presented in the paper "JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting".
A causal-parallel draft head proposes a token tree, and the frozen target model verifies the whole tree in a single forward pass under a tree-causal attention mask. The accepted path is selected according to the target model's own logits, so decoding is lossless by construction.
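The verification step described above can be illustrated with a small, framework-free sketch. It is not the JetSpec implementation: the function names `tree_causal_mask` and `accept_greedy`, the parent-array tree encoding, and the toy token ids are all assumptions made for illustration, and only greedy (temperature 0) acceptance is shown. Node 0 stands for the last committed token, and `target_next[i]` stands for the token the target model's logits would pick after node `i`.

```python
from typing import Dict, List


def tree_causal_mask(parents: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when node i may attend to node j,
    i.e. when j is node i itself or one of its ancestors."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # climb from node i up to the root
            mask[i][j] = True
            j = parents[j]
    return mask


def accept_greedy(
    parents: List[int], draft_tokens: List[int], target_next: List[int]
) -> List[int]:
    """Walk down from the root (node 0), following the child whose drafted
    token equals the target's greedy choice. Every emitted token is the
    target's own choice, so the output matches plain greedy decoding."""
    children: Dict[int, List[int]] = {}
    for node, parent in enumerate(parents):
        children.setdefault(parent, []).append(node)

    node, emitted = 0, []
    while True:
        wanted = target_next[node]
        emitted.append(wanted)  # the target's token is always emitted
        match = next(
            (c for c in children.get(node, []) if draft_tokens[c] == wanted),
            None,
        )
        if match is None:  # no drafted child agrees with the target: stop
            return emitted
        node = match


# Toy tree (hypothetical ids): node 0 is the root, nodes 1-2 are its children,
# nodes 3-4 are children of node 1, and node 5 is a child of node 2.
parents = [-1, 0, 0, 1, 1, 2]
draft_tokens = [10, 11, 12, 13, 14, 15]
target_next = [11, 14, 99, 7, 8, 9]

mask = tree_causal_mask(parents)
accepted = accept_greedy(parents, draft_tokens, target_next)
print(accepted)
```

In this toy run the path 0 → 1 → 4 is accepted, and the target's prediction after node 4 is emitted as an extra token, so three tokens are produced from one verification pass. Sibling branches never attend to each other under the mask, which is what lets all branches be scored together.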
For more details, please refer to the Project Webpage and the GitHub Repository.
Installation
Create an environment and, from the root of the cloned repository, install the package in editable mode with the bench and kernel extras:
pip install -e '.[bench,kernel]'
Usage
You can run speculative decoding using the lightweight Hugging Face-based reference implementation:
from jetspec import LLM, SamplingParams

llm = LLM("Qwen/Qwen3-8B", attn_implementation="flash_attention_2")
out = llm.generate(
    "The three primary colors are",
    SamplingParams(temperature=0.0, max_new_tokens=64),
)
print(out["text"])
Citation
@inproceedings{jetspec2026,
title = {JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting},
author = {Hu, Lanxiang and Feng, Zhaoxiang and Wu, Yulun and Yuan, Haoran and Zhao, Yujie and Qian, Yu-Yang and Wang, Bojun and Zhao, Peng and Jiang, Daxin and Zhu, Yibo and Rosing, Tajana and Zhang, Hao},
year = {2026},
url = {https://arxiv.org/abs/2606.18394},
eprint = {2606.18394},
note = {Preprint}
}