metadata
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation

JetSpec: Parallel Tree Drafting

JetSpec is an implementation of parallel tree drafting for fast speculative decoding in LLM inference, with up to 10x acceptance length and 1000+ tokens per second (TPS) on coding and math tasks using B200 GPUs. This repository contains the draft head model presented in the paper "JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting".

A causal-parallel draft head proposes a token tree, and the frozen target model verifies the whole tree in one forward pass under a tree-causal attention mask. The accepted path is selected according to the target model's own logits, so decoding is lossless by construction.
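To make the verification step concrete, here is a minimal sketch of generic tree verification under greedy decoding. It is not JetSpec's actual implementation: the function names, the parent-pointer tree encoding, and the `target_next` / `root_next` inputs are all assumptions made for illustration. A real mask would also let every tree node attend to the full prompt prefix, which is omitted here.

```python
def tree_attention_mask(parents):
    """Build a tree-causal mask (hypothetical helper, for illustration).

    parents[i] is the index of node i's parent, or -1 if node i hangs
    directly off the prefix. mask[i][j] is True iff node i may attend to
    node j, i.e. j is i itself or one of i's ancestors.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


def greedy_accept(parents, tokens, target_next, root_next):
    """Select the accepted path from the target's greedy predictions.

    tokens[i]      -- drafted token at node i
    target_next[i] -- target's argmax next token after node i, taken from
                      the single verification forward pass
    root_next      -- target's argmax next token after the prefix
    Returns the accepted draft tokens followed by one token supplied by
    the target itself.
    """
    children = {}
    for i, p in enumerate(parents):
        children.setdefault(p, []).append(i)

    accepted = []
    node, want = -1, root_next
    while True:
        # Accept a child only if it matches what the target would emit.
        match = next((c for c in children.get(node, []) if tokens[c] == want), None)
        if match is None:
            break
        accepted.append(tokens[match])
        node, want = match, target_next[match]
    accepted.append(want)  # the target's own token at the first mismatch
    return accepted


# Toy tree: nodes 0 and 1 follow the prefix; 2 and 3 follow 0; 4 follows 1.
parents = [-1, -1, 0, 0, 1]
tokens = ["a", "b", "c", "d", "e"]
mask = tree_attention_mask(parents)
result = greedy_accept(parents, tokens, ["d", "x", "y", "z", "w"], "a")
print(result)
```

Because every accepted token equals the target's own greedy choice at that position, the output in this sketch matches what plain greedy decoding with the target would produce; the tree only determines how many tokens are confirmed per forward pass.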

For more details, please refer to the Project Webpage and the GitHub Repository.

Installation

From a local clone of the GitHub repository, create an environment and install the package in editable mode:

pip install -e '.[bench,kernel]'

Usage

You can run speculative decoding using the lightweight Hugging Face-based reference implementation:

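from jetspec import LLM, SamplingParams

# Load Qwen3-8B with the FlashAttention-2 attention implementation.
llm = LLM("Qwen/Qwen3-8B", attn_implementation="flash_attention_2")

# temperature=0.0 requests greedy decoding; generate at most 64 new tokens.
out = llm.generate(
    "The three primary colors are",
    SamplingParams(temperature=0.0, max_new_tokens=64),
)
print(out["text"])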

Citation

@inproceedings{jetspec2026,
  title = {JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting},
  author = {Hu, Lanxiang and Feng, Zhaoxiang and Wu, Yulun and Yuan, Haoran and Zhao, Yujie and Qian, Yu-Yang and Wang, Bojun and Zhao, Peng and Jiang, Daxin and Zhu, Yibo and Rosing, Tajana and Zhang, Hao},
  year = {2026},
  url = {https://arxiv.org/abs/2606.18394},
  eprint = {2606.18394},
  note = {Preprint}
}