	---
	license: apache-2.0
	language:
	- en
	tags:
	- speculative-decoding
	- sparse-kv-cache
	- long-context
	base_model: JackFram/llama-68m
	---

	# BudgetDraft — Released Checkpoints

	Drafter checkpoints for the paper "BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding".

	All three checkpoints fine-tune `JackFram/llama-68m` (68M parameters, fp32) for use as a drafter alongside the `NousResearch/Yarn-Llama-2-7b-128k` verifier under a sparse drafter-side KV cache.

	## Layout

	| Subfolder | Variant | Training loss | Use in paper |
	|---|---|---|---|
	| `main/` | A + 0.5·C (multi-view, λ=0.5) | full-cache + sparse-cache | main checkpoint reported in Table 1 / Fig 3 |
	| `aonly/` | A only | full-cache only (no sparse branch) | ablation: without the sparse-cache loss |
	| `ac/` | A + C (λ=1.0) | full-cache + sparse-cache | λ-sensitivity ablation |
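
Read as a formula, the "Variant" column suggests a weighted sum of a full-cache term (A) and a sparse-cache term (C). The sketch below spells out that reading, assuming the objective is `A + λ·C` and that "A only" means a weight of zero on the sparse branch. The names `LAMBDA_BY_SUBFOLDER` and `multi_view_loss` are illustrative and not taken from the BudgetDraft code; the precise definitions of A and C are in the paper.

```python
# Sparse-branch weight per released subfolder, as read from the table above.
# Assumption: "A only" corresponds to a weight of 0 on the sparse-cache term.
LAMBDA_BY_SUBFOLDER = {"main": 0.5, "aonly": 0.0, "ac": 1.0}


def multi_view_loss(loss_full: float, loss_sparse: float, lam: float) -> float:
    """Combine the full-cache loss (A) and sparse-cache loss (C) as A + lam * C."""
    return loss_full + lam * loss_sparse


# Made-up loss values, purely to show how the three variants differ.
for name, lam in LAMBDA_BY_SUBFOLDER.items():
    print(name, multi_view_loss(2.0, 3.0, lam))
```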

	Each subfolder is a standard Hugging Face checkpoint (`config.json`, `model.safetensors`, tokenizer files), so it loads with `AutoModelForCausalLM.from_pretrained(...)` when the matching `subfolder` argument is passed.

	## Download

	```bash
	# Whole repo (~786 MB):
	hf download qwe123wjb/BudgetDraft-checkpoints --local-dir ./ckpts

	# Just the main checkpoint:
	hf download qwe123wjb/BudgetDraft-checkpoints --include "main/*" --local-dir ./ckpts
	```
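
The same download can probably be scripted from Python with `huggingface_hub.snapshot_download`, whose `allow_patterns` argument plays the role of `--include`. This is a minimal sketch, not part of the released tooling: `patterns_for` and `download_checkpoints` are hypothetical helper names, and the default `./ckpts` target simply mirrors the commands above.

```python
from typing import List, Optional

REPO_ID = "qwe123wjb/BudgetDraft-checkpoints"


def patterns_for(variant: Optional[str]) -> Optional[List[str]]:
    """Return an allow-list for one subfolder, or None to fetch the whole repo."""
    return None if variant is None else [f"{variant}/*"]


def download_checkpoints(variant: Optional[str] = "main",
                         local_dir: str = "./ckpts") -> str:
    """Download one variant (or everything) and return the local path."""
    # Imported lazily so patterns_for() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=patterns_for(variant),
        local_dir=local_dir,
    )
```

Calling `download_checkpoints("aonly")` would fetch only the `aonly/` subfolder, while `download_checkpoints(None)` fetches the whole repo.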

	## Reproducing the paper

	Pair these checkpoints with the evaluation code at <https://github.com/ANTI-Tony/BudgetDraft>:

	```bash
	git clone https://github.com/ANTI-Tony/BudgetDraft.git
	cd BudgetDraft
	pip install -r requirements.txt

	hf download qwe123wjb/BudgetDraft-checkpoints --local-dir ./ckpts
	make eval-from-release CHECKPOINTS=./ckpts
	```

	The eval script orchestrates 96 configurations (main + ablation + λ-sensitivity) across three datasets (PG-19 / LongBench QMSum / NarrativeQA) and three context lengths (4K / 8K / 16K). Headline result: 6.55× speedup at 4K with 79.37% acceptance on NarrativeQA.
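
For readers new to speculative decoding, the acceptance figure can be read as the share of drafted tokens that the verifier keeps. The sketch below assumes greedy verification: a drafted token is accepted only if it matches the verifier's own choice at that position, and the first mismatch ends the round. The paper's exact metric may be defined differently, and the function names are illustrative.

```python
from typing import Iterable, Sequence, Tuple


def accepted_prefix_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count leading draft tokens that match the verifier's greedy choices."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


def acceptance_rate(rounds: Iterable[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Accepted draft tokens divided by proposed draft tokens over all rounds."""
    rounds = list(rounds)
    proposed = sum(len(draft) for draft, _ in rounds)
    accepted = sum(accepted_prefix_length(draft, target) for draft, target in rounds)
    return accepted / proposed if proposed else 0.0


# Toy token ids, made up for illustration: 2 of 4 accepted, then 4 of 4.
rounds = [([5, 9, 2, 7], [5, 9, 4, 1]), ([3, 3, 8, 6], [3, 3, 8, 6])]
print(acceptance_rate(rounds))  # 0.75
```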

	## Quick load (Python)

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	drafter = AutoModelForCausalLM.from_pretrained(
	    "qwe123wjb/BudgetDraft-checkpoints",
	    subfolder="main",
	    torch_dtype="float32",
	)
	tokenizer = AutoTokenizer.from_pretrained(
	    "qwe123wjb/BudgetDraft-checkpoints",
	    subfolder="main",
	)
	```

	Pair with the verifier:

	```python
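	from transformers import AutoModelForCausalLM

	# Load the 7B verifier in half precision to reduce memory use.
	verifier = AutoModelForCausalLM.from_pretrained(
	    "NousResearch/Yarn-Llama-2-7b-128k",
	    torch_dtype="float16",
	)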
	```
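
As a quick smoke test that the pair loads and generates together, the Transformers `generate` method accepts an `assistant_model` argument (assisted generation). The sketch below assumes the two models share a tokenizer and sit on the same device. It uses the stock drafter KV cache, so it does not reproduce the sparse-KV setup evaluated in the paper; use the evaluation code in the GitHub repo for that. `assisted_generate` is a hypothetical helper name.

```python
def assisted_generate(verifier, drafter, tokenizer, prompt, max_new_tokens=64):
    """Greedy generation from the verifier, with the drafter proposing tokens."""
    inputs = tokenizer(prompt, return_tensors="pt").to(verifier.device)
    output_ids = verifier.generate(
        **inputs,
        assistant_model=drafter,  # Transformers assisted generation
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage, with the objects loaded in the snippets above:
# print(assisted_generate(verifier, drafter, tokenizer, "Once upon a time"))
```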

	## Citation

	The citation will be added after acceptance. See the GitHub repo for the latest version.