---
license: apache-2.0
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# Llama-3.1-8B-Instruct-KV-Cache-FP8

## Model Overview
- **Model Architecture:** LlamaForCausalLM
  - **Input:** Text
  - **Output:** Text
- **Release Date:** 
- **Version:** 1.0
- **Model Developers:** Red Hat

FP8 KV Cache Quantization of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

### Model Optimizations

This model was obtained by quantizing the key-value (KV) cache activations of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) to the FP8 data type, leaving the model weights in their original precision. This roughly halves the KV cache's memory footprint during inference, freeing capacity for longer contexts or larger batches.
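
For intuition, the savings can be estimated from the model's published configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128); the figures below are a back-of-the-envelope sketch, not measured numbers:

```python
# Approximate KV-cache footprint for Llama-3.1-8B (config values taken
# from the published architecture: 32 layers, 8 KV heads, head_dim 128).
num_layers, num_kv_heads, head_dim = 32, 8, 128

def kv_bytes_per_token(bytes_per_elem: int) -> int:
    # Factor of 2 because both keys and values are cached.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

ctx = 128 * 1024  # full 128K context length
print(f"FP16 KV cache @ 128K: {kv_bytes_per_token(2) * ctx / 2**30:.0f} GiB")  # ~16 GiB
print(f"FP8  KV cache @ 128K: {kv_bytes_per_token(1) * ctx / 2**30:.0f} GiB")  # ~8 GiB
```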


## Deployment

### Use with vLLM

1. Initialize vLLM server:
```shell
vllm serve RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8 --tensor_parallel_size 1
```

2. Send requests to the server:

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]


outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
```
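
Requests can also be issued directly over HTTP against the OpenAI-compatible endpoint:

```shell
curl http://<your-server-host>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8",
    "messages": [
      {"role": "user", "content": "Explain quantum mechanics clearly and concisely."}
    ]
  }'
```

For offline (non-server) batch inference, vLLM's Python API can be used instead; a minimal sketch, assuming a recent vLLM release:

```python
from vllm import LLM, SamplingParams

# vLLM reads the FP8 KV-cache scheme from the checkpoint's
# quantization config, so no extra flags are required.
llm = LLM(model="RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8")
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# chat() applies the model's chat template before generating.
outputs = llm.chat(
    [{"role": "user", "content": "Explain quantum mechanics clearly and concisely."}],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```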

<!-- ## Creation

This model was quantized using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library as shown below.

<details>
  <summary>Creation details</summary>

The snippet follows llm-compressor's standard FP8 KV-cache example; the calibration dataset and sample counts shown here are representative defaults, not confirmed settings.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

# Load model and tokenizer.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# KV-cache scales are static, so a small calibration set is used to fit them.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

ds = load_dataset(
    "HuggingFaceH4/ultrachat_200k",
    split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]",
).shuffle(seed=42)

def process_and_tokenize(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(process_and_tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm and scheme.
# In this case, we quantize only the KV cache to FP8 with static,
# per-tensor, symmetric scales; weights and activations are untouched.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

# Apply quantization.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-KV-Cache-FP8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
</details> -->


## Evaluation


The model was evaluated on the long-context benchmarks RULER and LongBench using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), with [vLLM](https://docs.vllm.ai/en/stable/) as the inference engine for all evaluations.
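
A representative invocation is sketched below; exact task names and flags vary across harness versions:

```shell
# Sketch only: `ruler` task availability and naming depend on the
# installed lm-evaluation-harness version.
lm_eval \
  --model vllm \
  --model_args pretrained=RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8,tensor_parallel_size=1,gpu_memory_utilization=0.9 \
  --tasks ruler \
  --batch_size auto
```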

### Accuracy
<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Metric</th>
      <th>meta-llama/Llama-3.1-8B-Instruct</th>
      <th>RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8</th>
      <th>Recovery (%)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="1"><b>LongBench V1</b></td>
      <td>Task 1</td>
      <td>abc</td>
      <td>ijk</td>
      <td>xyz</td>
    </tr>
    <tr>
      <td rowspan="6"><b>NIAH</b></td>
      <td>niah_single_1</td>
      <td>abc</td>
      <td>ijk</td>
      <td>xyz</td>
    </tr>
    <tr>
      <td>niah_single_2</td>
      <td>abc</td>
      <td>ijk</td>
      <td>xyz</td>
    </tr>
    <tr>
      <td>niah_single_3</td>
      <td>abc</td>
      <td>ijk</td>
      <td>xyz</td>
    </tr>
    <tr>
      <td>niah_multikey_1</td>
      <td>abc</td>
      <td>ijk</td>
      <td>xyz</td>
    </tr>
    <tr>
      <td>niah_multikey_2</td>
      <td>abc</td>
      <td>ijk</td>
      <td>xyz</td>
    </tr>
    <tr>
      <td>niah_multikey_3</td>
      <td>abc</td>
      <td>ijk</td>
      <td>xyz</td>
    </tr>
    <tr>
      <td colspan="2"><b>Average Score</b></td>
      <td><b>abc</b></td>
      <td><b>ijk</b></td>
      <td><b>xyz</b></td>
    </tr>
  </tbody>
</table>