Improve model card: Add SuffixDecoding context, vLLM usage, project & repo links

#5
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +139 -28
README.md CHANGED
@@ -1,55 +1,166 @@
1
  ---
2
- license: llama3.1
3
  base_model:
4
  - meta-llama/Llama-3.1-8B-Instruct
 
 
 
5
  ---
6
- # SwiftKV
7
 
8
- The Snowflake AI Research team is releasing a series of SwiftKV optimized Llama-3.1 models. [SwiftKV](https://arxiv.org/abs/2410.03960) is a series of inference optimizations that goes beyond traditional key-value (KV) cache compression. This method reduces computational overhead during prompt processing by combining model rewiring and knowledge-preserving self-distillation, allowing prefill tokens to skip up to half the model's layers. SwiftKV achieves up to 2x improvements in throughput, latency, and cost efficiency with minimal accuracy loss, making LLM deployments more performant and economically viable.
9
 
10
- For more details about SwiftKV and how to use it:
11
  * ❄️ [SwiftKV: Accelerating Enterprise LLM Workloads with Knowledge Preserving Compute Reduction (blog)](https://www.snowflake.com/engineering-blog/swiftkv-llm-compute-reduction/)
12
  * 📝 [SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation (arXiv)](https://arxiv.org/abs/2410.03960)
13
  * 🚀 [Getting started guide](https://github.com/Snowflake-Labs/vllm/tree/swiftkv/examples/swiftkv)
14
 
15
  ## Revisions
16
 
17
- * **release-2508 (Aug 2025):** Updated model weights for long-context up to 128K
18
- * **release-2412 (Dec 2024):** Initial model release
19
 
20
- ## Performance Metrics
21
 
22
  To evaluate SwiftKV’s performance, we focus on the following key metrics (see more details in our [blog](https://www.snowflake.com/engineering-blog/swiftkv-llm-compute-reduction/)):
23
- * Combined throughput: The total number of input and output tokens processed per second. This determines:
24
- * For batch processing, the time required to complete jobs.
25
- * For interactive use, the volume of concurrent requests a system can handle.
26
- * TTFT: The latency between a user request and receiving the first token in the response.
27
- * TPOT: The latency between subsequent tokens after the first token.
28
 
29
  Combined input and output throughput for Llama 3.1 70B (left) and Llama 3.1 405B (right) across a range of input lengths (bottom).
30
  <img src="figure-4-full.png" alt="performance plot of llama-405B w. swiftkv" width="800">
31
 
32
- TTFT (top) and TPOT (bottom) for input lengths 2000 (left), 8000 (middle), and 32000 (right) for Llama 3.1 405B fp8 model. For each experiment, a range of different request arrival rates is simulated. Each request generates 256 output tokens.
33
  <img src="figure-6.png" alt="performance plot of llama-405B w. swiftkv" width="700">
34
 
 
35
 
36
- ## Eval Metrics
37
-
38
- For a full breakdown on evaluation metrics and performance impact please refer to our [blog](https://www.snowflake.com/engineering-blog/swiftkv-llm-compute-reduction/) and [arXiv paper]((https://arxiv.org/abs/2410.03960)) but below we've outlined some relevant evaluation metrics.
39
 
40
  | Llama-3.1-405B-Instruct-FP8 | Arc Challenge | Winogrande | HellaSwag | TruthfulQA | MMLU | MMLU cot | GSM8K | Avg |
41
- |-----------|---------------|------------|-----------|------------|------|----------|-------|-----|
42
- | Baseline | 94.7 | 87.0 | 88.3 | 64.7 | 87.5 | 88.1 | 96.1 | **86.6** |
43
- | 50% SingleInputKV | 94.0 | 86.3 | 88.1 | 64.2 | 85.7 | 87.5 | 95.2 | **85.9** |
44
 
45
  | Llama-3.1-8B-Instruct | Arc Challenge | Winogrande | HellaSwag | TruthfulQA | MMLU | MMLU cot | GSM8K | Avg |
46
- |-----------|---------------|------------|-----------|------------|------|----------|-------|-----|
47
- | Baseline | 82.00 | 77.90 | 80.40 | 54.56 | 67.90 | 70.63 | 82.56 | **73.71** |
48
- | 50% SingleInputKV | 80.38 | 78.22 | 79.30 | 54.54 | 67.30 | 69.73 | 79.45 | **72.70** |
49
-
50
- ## Get started by serving SwiftKV on vLLM
51
-
52
- Instructions on how to use vLLM for both evaluation and performance benchmarks:
53
- https://github.com/Snowflake-Labs/vllm/tree/swiftkv/examples/swiftkv
54
-
55
  <img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=99c9bfbe-21ae-47b6-9cf3-5c881758bd27" />
 
1
  ---
 
2
  base_model:
3
  - meta-llama/Llama-3.1-8B-Instruct
4
+ license: llama3.1
5
+ pipeline_tag: text-generation
6
+ library_name: vllm
7
  ---
 
8
 
9
+ # Llama-3.1-SwiftKV-8B-Instruct: Accelerated with SuffixDecoding
10
+
11
+ The `Llama-3.1-SwiftKV-8B-Instruct` model is a version of Llama-3.1-8B-Instruct optimized with **SwiftKV**, an inference optimization technique that goes beyond traditional key-value (KV) cache compression. It can be further accelerated at inference time with **SuffixDecoding**, a speculative decoding method for emerging AI applications presented in the paper [SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications](https://huggingface.co/papers/2411.04975).
12
+
13
+ **Project Page (SuffixDecoding)**: [https://suffix-decoding.github.io](https://suffix-decoding.github.io)
14
+ **Code Repository (Arctic Inference)**: [https://github.com/snowflakedb/ArcticInference](https://github.com/snowflakedb/ArcticInference)
15
+
16
+ ## SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications
17
+
18
+ SuffixDecoding is a novel method that utilizes efficient suffix trees to cache long token sequences from prompts and previous outputs. By adaptively speculating more tokens when acceptance likelihood is high and fewer when it is low, SuffixDecoding effectively exploits opportunities for longer speculations while conserving computation when those opportunities are limited. Evaluations on agentic benchmarks, including SWE-Bench and Text-to-SQL, demonstrate that SuffixDecoding achieves speedups of up to 5.3x, outperforming state-of-the-art methods -- 2.8x faster than model-based approaches like EAGLE-2/3 and 1.9x faster than model-free approaches such as Token Recycling.
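
The idea described above can be illustrated with a toy sketch. This is not the SuffixDecoding implementation: it swaps the suffix tree for a plain dictionary of n-grams, and it uses the length of the matched suffix as a crude stand-in for acceptance likelihood. The function names and the `tokens_per_match` parameter are hypothetical.

```python
from typing import Dict, List, Tuple


def build_index(history: List[int], max_suffix_len: int = 4) -> Dict[Tuple[int, ...], List[int]]:
    """Map each n-gram (n <= max_suffix_len) in `history` to the positions that follow it."""
    index: Dict[Tuple[int, ...], List[int]] = {}
    for n in range(1, max_suffix_len + 1):
        # Stop early enough that every indexed n-gram has at least one following token.
        for i in range(len(history) - n):
            index.setdefault(tuple(history[i:i + n]), []).append(i + n)
    return index


def speculate(
    history: List[int],
    context: List[int],
    index: Dict[Tuple[int, ...], List[int]],
    max_suffix_len: int = 4,
    tokens_per_match: int = 2,  # hypothetical knob: draft tokens per matched token
) -> List[int]:
    """Propose draft tokens by matching the longest suffix of `context` against `history`."""
    for n in range(min(max_suffix_len, len(context)), 0, -1):
        positions = index.get(tuple(context[-n:]))
        if positions:
            start = positions[-1]          # continue from the most recent occurrence
            budget = n * tokens_per_match  # longer match -> speculate more tokens
            return history[start:start + budget]
    return []  # no match: speculate nothing and save the verification cost
```

With `history = [10, 11, 12, 13, 14, 15, 16]`, a context ending in `11, 12` matches two tokens and drafts four, while a context ending only in `12` drafts two. In a real system the target model would then verify the draft tokens and keep the accepted prefix.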
19
+
20
+ ## SwiftKV: Accelerating Enterprise LLM Workloads
21
+
22
+ SwiftKV is a series of inference optimizations that goes beyond traditional key-value (KV) cache compression. This method reduces computational overhead during prompt processing by combining model rewiring and knowledge-preserving self-distillation, allowing prefill tokens to skip up to half the model's layers. SwiftKV achieves up to 2x improvements in throughput, latency, and cost efficiency with minimal accuracy loss, making LLM deployments more performant and economically viable.
23
 
24
+ For more details about SwiftKV:
25
  * ❄️ [SwiftKV: Accelerating Enterprise LLM Workloads with Knowledge Preserving Compute Reduction (blog)](https://www.snowflake.com/engineering-blog/swiftkv-llm-compute-reduction/)
26
  * 📝 [SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation (arXiv)](https://arxiv.org/abs/2410.03960)
27
  * 🚀 [Getting started guide](https://github.com/Snowflake-Labs/vllm/tree/swiftkv/examples/swiftkv)
28
 
29
+ ## Arctic Inference: The Orchestrating Framework
30
+
31
+ SuffixDecoding, together with SwiftKV and other optimizations, is a component of **Arctic Inference**, an open-source vLLM plugin from Snowflake AI Research that delivers the fastest and most cost-effective open-source inference for LLMs and embeddings.
32
+
33
+ <p align="middle">
34
+ <img src="https://github.com/snowflakedb/ArcticInference/raw/main/projects/arctic_inference/imgs/figure1.png" alt="Arctic Inference overview diagram" width="800">
35
+ </p>
36
+
37
  ## Revisions
38
 
39
+ * **release-2508 (Aug 2025):** Updated model weights for long-context up to 128K
40
+ * **release-2412 (Dec 2024):** Initial model release
41
 
42
+ ## Performance Metrics (SwiftKV)
43
 
44
  To evaluate SwiftKV’s performance, we focus on the following key metrics (see more details in our [blog](https://www.snowflake.com/engineering-blog/swiftkv-llm-compute-reduction/)):
45
+ * Combined throughput: The total number of input and output tokens processed per second. This determines:
46
+ * For batch processing, the time required to complete jobs.
47
+ * For interactive use, the volume of concurrent requests a system can handle.
48
+ * TTFT: The latency between a user request and receiving the first token in the response.
49
+ * TPOT: The latency between subsequent tokens after the first token.
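
As a rough illustration of the three metrics above, the sketch below computes them from per-request timestamps. The `RequestTrace` record and its field names are hypothetical; the blog's benchmark harness may measure these quantities differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timing record for one request (hypothetical structure for illustration)."""
    num_input_tokens: int
    sent_at: float            # seconds: when the request was issued
    token_times: List[float]  # seconds: arrival time of each output token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between sending the request and the first output token."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap between consecutive tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def combined_throughput(traces: List[RequestTrace]) -> float:
    """Input plus output tokens processed per second over the whole run."""
    total_tokens = sum(t.num_input_tokens + len(t.token_times) for t in traces)
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return total_tokens / (end - start)
```

For a single request with 100 input tokens whose five output tokens arrive at 0.5 s, 0.6 s, ..., 0.9 s, this gives a TTFT of 0.5 s, a TPOT of 0.1 s, and a combined throughput of 105 / 0.9, or about 117 tokens per second.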
50
 
51
  Combined input and output throughput for Llama 3.1 70B (left) and Llama 3.1 405B (right) across a range of input lengths (bottom).
52
  <img src="figure-4-full.png" alt="performance plot of llama-405B w. swiftkv" width="800">
53
 
54
+ TTFT (top) and TPOT (bottom) for input lengths 2000 (left), 8000 (middle), and 32000 (right) for Llama 3.1 405B fp8 model. For each experiment, a range of different request arrival rates is simulated. Each request generates 256 output tokens.
55
  <img src="figure-6.png" alt="performance plot of llama-405B w. swiftkv" width="700">
56
 
57
+ ## Eval Metrics (SwiftKV)
58
 
59
+ For a full breakdown of evaluation metrics and performance impact, please refer to our [blog](https://www.snowflake.com/engineering-blog/swiftkv-llm-compute-reduction/) and [arXiv paper](https://arxiv.org/abs/2410.03960). Selected evaluation results are shown below.
 
 
60
 
61
  | Llama-3.1-405B-Instruct-FP8 | Arc Challenge | Winogrande | HellaSwag | TruthfulQA | MMLU | MMLU cot | GSM8K | Avg |
62
+ |:----------------------------|:--------------|:-----------|:----------|:-----------|:-----|:---------|:------|:----|
63
+ | Baseline | 94.7 | 87.0 | 88.3 | 64.7 | 87.5 | 88.1 | 96.1 | **86.6** |
64
+ | 50% SingleInputKV | 94.0 | 86.3 | 88.1 | 64.2 | 85.7 | 87.5 | 95.2 | **85.9** |
65
 
66
  | Llama-3.1-8B-Instruct | Arc Challenge | Winogrande | HellaSwag | TruthfulQA | MMLU | MMLU cot | GSM8K | Avg |
67
+ |:----------------------|:--------------|:-----------|:----------|:-----------|:-----|:---------|:------|:----|
68
+ | Baseline | 82.00 | 77.90 | 80.40 | 54.56 | 67.90 | 70.63 | 82.56 | **73.71** |
69
+ | 50% SingleInputKV | 80.38 | 78.22 | 79.30 | 54.54 | 67.30 | 69.73 | 79.45 | **72.70** |
70
+
71
+ ## Sample Usage (with SuffixDecoding via vLLM)
72
+
73
+ To leverage SuffixDecoding with this model, use `arctic-inference` to patch `vLLM`.
74
+
75
+ First, install the `arctic-inference` package:
76
+ ```bash
77
+ pip install arctic-inference[vllm]
78
+ ```
79
+
80
+ Once installed, Arctic Inference automatically patches vLLM. You can then use `vLLM`'s familiar APIs with SuffixDecoding and other optimizations enabled.
81
+
82
+ #### Serving
83
+
84
+ ```console
85
+ ARCTIC_INFERENCE_ENABLED=1 vllm serve Snowflake/Llama-3.1-SwiftKV-8B-Instruct \
86
+ --quantization "fp8" \
87
+ --tensor-parallel-size 1 \
88
+ --ulysses-sequence-parallel-size 2 \
89
+ --enable-shift-parallel \
90
+ --speculative-config '{
91
+ "method": "arctic",
92
+ "model":"Snowflake/Arctic-LSTM-Speculator-Llama-3.1-8B-Instruct",
93
+ "num_speculative_tokens": 3,
94
+ "enable_suffix_decoding": true,
95
+ "disable_by_batch_size": 64
96
+ }'
97
+ ```
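
Once the server is running, it can be queried through vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server listens on the default `http://localhost:8000`; the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8000",  # assumed default vLLM address
    model: str = "Snowflake/Llama-3.1-SwiftKV-8B-Instruct",
    max_tokens: int = 256,
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt to the running server and return the generated text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Write a haiku about snow.")` against a running server returns the generated text. Speculative decoding is applied on the server side, so the client request is the same as for any other vLLM deployment.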
98
+
99
+ #### Offline Inference
100
+
101
+ Save the following script to `arctic_example.py`:
102
+
103
+ ```python
104
+ import vllm
105
+ from vllm import LLM, SamplingParams
106
+
107
+ vllm.plugins.load_general_plugins()
108
+
109
+ llm = LLM(
110
+ model="Snowflake/Llama-3.1-SwiftKV-8B-Instruct",
111
+ quantization="fp8",
112
+ tensor_parallel_size=1,
113
+ ulysses_sequence_parallel_size=2,
114
+ enable_shift_parallel=True,
115
+ speculative_config={
116
+ "method": "arctic",
117
+ "model": "Snowflake/Arctic-LSTM-Speculator-Llama-3.1-8B-Instruct",
118
+ "num_speculative_tokens": 3,
119
+ "enable_suffix_decoding": True,
120
+ "disable_by_batch_size": 64,
121
+ },
122
+ )
123
+
124
+ conversation = [
125
+ {
126
+ "role": "user",
127
+ "content": "Write an essay about the importance of higher education.",
128
+ },
129
+ ]
130
+
131
+ sampling_params = SamplingParams(temperature=0.0, max_tokens=800)
132
+
133
+ outputs = llm.chat(conversation, sampling_params=sampling_params)
134
+
135
+ print(outputs[0].outputs[0].text)
136
+ ```
137
+
138
+ Run the script with Arctic Inference enabled:
139
+
140
+ ```console
141
+ ARCTIC_INFERENCE_ENABLED=1 python arctic_example.py
142
+ ```
143
+
144
+ ## Citation
145
+
146
+ If you find this model or the related work useful, please consider citing the Arctic Inference and SuffixDecoding papers:
147
+
148
+ ```bibtex
149
+ @misc{arcticinference2025,
150
+ title={Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI},
151
+ author={Samyam Rajbhandari and Mert Hidayetoglu and Aurick Qiao and Ye Wang and Juncheng Yang and Jeff Rasley and Michael Wyatt and Yuxiong He},
152
+ year={2025},
153
+ url={https://arxiv.org/abs/2507.11830},
154
+ }
155
+
156
+ @misc{suffixdecoding2024,
157
+ title={SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications},
158
+ author={Gabriele Oliaro and Zhihao Jia and Daniel Campos and Aurick Qiao},
159
+ year={2024},
160
+ eprint={2411.04975},
161
+ archivePrefix={arXiv},
162
+ primaryClass={cs.CL},
163
+ url={https://arxiv.org/abs/2411.04975},
164
+ }
165
+ ```
166
  <img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=99c9bfbe-21ae-47b6-9cf3-5c881758bd27" />