---
license: apache-2.0
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---
# Llama-3.1-8B-Instruct-KV-Cache-FP8
## Model Overview
- **Model Architecture:** LlamaForCausalLM
- **Input:** Text
- **Output:** Text
- **Release Date:**
- **Version:** 1.0
- **Model Developers:** Red Hat
FP8 KV Cache Quantization of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
### Model Optimizations
This model was obtained by quantizing the KV cache of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) to the FP8 data type. Storing cached keys and values in 8 bits instead of 16 roughly halves the memory the KV cache needs per token.
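As a rough illustration of the memory effect, the sketch below estimates the KV cache footprint per token at 16-bit and 8-bit precision. The layer count, number of KV heads, and head dimension are assumptions taken from the public Llama-3.1-8B configuration and should be checked against the model's `config.json`; the estimate ignores the small overhead of quantization scales.

```python
# Back-of-the-envelope KV cache size per token.
# ASSUMPTION: architecture values follow the public Llama-3.1-8B config
# (32 layers, 8 KV heads, head dimension 128); verify against config.json.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128


def kv_cache_bytes_per_token(bytes_per_element: int) -> int:
    """Bytes needed to cache one token's keys and values across all layers."""
    # The factor 2 accounts for one key and one value vector per layer and KV head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_element


bf16 = kv_cache_bytes_per_token(2)  # 16-bit cache: 2 bytes per element
fp8 = kv_cache_bytes_per_token(1)   # FP8 cache: 1 byte per element

print(f"BF16: {bf16 / 1024:.0f} KiB per token")
print(f"FP8:  {fp8 / 1024:.0f} KiB per token")
```

Under these assumptions the same GPU memory budget holds about twice as many cached tokens, which matters most for long-context workloads.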
## Deployment
### Use with vLLM
1. Initialize vLLM server:
```bash
vllm serve RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8 --tensor-parallel-size 1
```
2. Send requests to the server:
```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

# Send a chat completion request and print the generated reply.
outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
```
<!-- ## Creation
This model was quantized using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library as shown below.
<details>
<summary>Creation details</summary>
```python
from transformers import AutoProcessor, Qwen3ForCausalLM
from llmcompressor import oneshot
from llmcompressor.modeling import replace_modules_for_calibration
from llmcompressor.modifiers.quantization import QuantizationModifier
MODEL_ID = "Qwen/Qwen3-8B"
# Load model.
model = Qwen3ForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = replace_modules_for_calibration(model)
# Configure the quantization algorithm and scheme.
# In this case, we:
# * quantize the weights to fp8 with per-block quantization
# * quantize the activations to fp8 with dynamic token activations
recipe = QuantizationModifier(
targets="Linear",
scheme="FP8_BLOCK",
ignore=["lm_head"],
)
# Apply quantization.
oneshot(model=model, recipe=recipe)
# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-block"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
```
</details> -->
## Evaluation
The model was evaluated on two long-context benchmarks, RULER (needle-in-a-haystack tasks) and LongBench, using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
[vLLM](https://docs.vllm.ai/en/stable/) was used for all evaluations.
### Accuracy
<table>
<thead>
<tr>
<th>Category</th>
<th>Metric</th>
<th>meta-llama/Llama-3.1-8B-Instruct</th>
<th>nm-testing/Llama-3.1-8B-Instruct-KV-Cache-FP8</th>
<th>Recovery (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1"><b>LongBench V1</b></td>
<td>Task 1</td>
<td>abc</td>
<td>ijk</td>
<td>xyz</td>
</tr>
<tr>
<td rowspan="6"><b>NIAH</b></td>
<td>niah_single_1</td>
<td>abc</td>
<td>ijk</td>
<td>xyz</td>
</tr>
<tr>
<td>niah_single_2</td>
<td>abc</td>
<td>ijk</td>
<td>xyz</td>
</tr>
<tr>
<td>niah_single_3</td>
<td>abc</td>
<td>ijk</td>
<td>xyz</td>
</tr>
<tr>
<td>niah_multikey_1</td>
<td>abc</td>
<td>ijk</td>
<td>xyz</td>
</tr>
<tr>
<td>niah_multikey_2</td>
<td>abc</td>
<td>ijk</td>
<td>xyz</td>
</tr>
<tr>
<td>niah_multikey_3</td>
<td>abc</td>
<td>ijk</td>
<td>xyz</td>
</tr>
<tr>
<td colspan="2"><b>Average Score</b></td>
<td><b>abc</b></td>
<td><b>ijk</b></td>
<td><b>xyz</b></td>
</tr>
</tbody>
</table>
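
The table does not define how the Recovery (%) column is computed. A minimal sketch, assuming it is the quantized model's score expressed as a percentage of the baseline model's score; the function name and the scores below are hypothetical and are not results for this model:

```python
def recovery_percent(baseline_score: float, quantized_score: float) -> float:
    """Quantized score as a percentage of the baseline score (assumed definition)."""
    if baseline_score == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * quantized_score / baseline_score


# HYPOTHETICAL scores for illustration only, not measurements of this model.
print(f"{recovery_percent(80.0, 79.2):.1f}")
```

A value near 100 would indicate that FP8 KV cache quantization preserved the baseline accuracy on that task.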