TreeFlash: Parallel AR-Approximation for Faster Speculative Decoding (https://arxiv.org/abs/2606.03819)
Peer Rheinboldt · Frédéric Berdoz · Roger Wattenhofer
Preprint, submitted June 2026
Loading TreeFlash requires `trust_remote_code=True` because the drafter architecture and its `spec_generate` method are defined in this repository rather than in the Transformers library.
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

drafter = AutoModel.from_pretrained(
    "peerrh/treeflash-qwen3-8b",
    trust_remote_code=True,
    dtype="bfloat16",
    device_map="cuda:0",
).eval()
target = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    trust_remote_code=True,
    dtype="bfloat16",
    device_map="cuda:0",
).eval()
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer([text], return_tensors="pt").to(drafter.device)

output_ids = drafter.spec_generate(
    target=target,
    input_ids=inputs["input_ids"],
    max_new_tokens=2048,
    stop_token_ids=[tokenizer.eos_token_id],
    temperature=0.0,
    drafter_temperature=1.0,
    tree_size=64,
    top_m=16,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
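For readers new to tree-based speculative decoding, the toy sketch below illustrates the generic greedy acceptance rule that applies when the target is decoded at `temperature=0.0`: drafted tokens are kept only while they match the target's own greedy choice. It is not TreeFlash's implementation. `DraftNode`, `greedy_accept`, and `toy_target` are hypothetical names introduced here, and the sketch queries the target one position at a time, whereas a real system would typically score the whole draft tree in a single target forward pass.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class DraftNode:
    """One drafted token; children are alternative continuations (hypothetical type)."""
    token: int
    children: List["DraftNode"] = field(default_factory=list)


def greedy_accept(
    prefix: Sequence[int],
    roots: List[DraftNode],
    target_argmax: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens emitted by one draft-and-verify step.

    Walks down the draft tree, following the child that equals the target's
    greedy token. At the first mismatch (or past a leaf) the target's own
    token is still emitted, so every step yields at least one token.
    """
    accepted: List[int] = []
    candidates = roots
    while True:
        want = target_argmax(list(prefix) + accepted)
        accepted.append(want)  # always the target's token, so output is exact
        match = next((n for n in candidates if n.token == want), None)
        if match is None:
            return accepted
        candidates = match.children


def toy_target(context: Sequence[int]) -> int:
    """Stand-in for a target model: the next token is last token + 1 (mod 10)."""
    return (context[-1] + 1) % 10


# Draft tree: 2 -> (3 -> 9 | 4), plus an alternative root 5.
roots = [
    DraftNode(2, [DraftNode(3, [DraftNode(9)]), DraftNode(4)]),
    DraftNode(5),
]
emitted = greedy_accept([1], roots, toy_target)
print(emitted)
```

In this toy run the drafted path `2, 3` matches the target and the drafted `9` does not, so the step emits `2, 3` followed by the target's own token `4`. The output is identical to plain greedy decoding with the target; the draft tree only affects how many tokens are produced per target step.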
| Target | Drafter |
|---|---|
| Qwen/Qwen3-4B | peerrh/treeflash-qwen3-4b |
| Qwen/Qwen3-8B | peerrh/treeflash-qwen3-8b |
| Qwen/Qwen3-Coder-30B-A3B-Instruct | peerrh/treeflash-qwen3-coder-30b-a3b |
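When scripting against several targets, the pairing in the table above can be kept in a small lookup so that a drafter is never combined with the wrong target. The following is a minimal sketch; `DRAFTERS` and `drafter_for` are hypothetical names introduced here and are not part of the repository.

```python
# Target -> drafter pairs copied from the table above.
DRAFTERS = {
    "Qwen/Qwen3-4B": "peerrh/treeflash-qwen3-4b",
    "Qwen/Qwen3-8B": "peerrh/treeflash-qwen3-8b",
    "Qwen/Qwen3-Coder-30B-A3B-Instruct": "peerrh/treeflash-qwen3-coder-30b-a3b",
}

# Case-insensitive index, so "qwen/qwen3-8b" resolves as well.
_BY_LOWER = {target.lower(): drafter for target, drafter in DRAFTERS.items()}


def drafter_for(target_id: str) -> str:
    """Return the drafter repo id listed for a target, or raise ValueError."""
    try:
        return _BY_LOWER[target_id.lower()]
    except KeyError:
        known = ", ".join(DRAFTERS)
        raise ValueError(
            f"No drafter listed for {target_id!r}; known targets: {known}"
        ) from None


print(drafter_for("Qwen/Qwen3-8B"))
```

The returned id can be passed to `AutoModel.from_pretrained(..., trust_remote_code=True)` exactly as in the usage example.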
If you use TreeFlash, please cite:
@article{rheinboldt2026treeflash,
title={TreeFlash: Parallel AR-Approximation for Faster Speculative Decoding},
author={Rheinboldt, Peer and Berdoz, Fr{\'e}d{\'e}ric and Wattenhofer, Roger},
journal={arXiv preprint arXiv:2606.03819},
year={2026}
}