| --- |
| license: apache-2.0 |
| base_model: |
| - meta-llama/Llama-3.1-8B-Instruct |
| --- |
| |
| # Llama-3.1-8B-Instruct-KV-Cache-FP8 |
|
|
| ## Model Overview |
| - **Model Architecture:** LlamaForCausalLM |
| - **Input:** Text |
| - **Output:** Text |
| - **Release Date:** |
| - **Version:** 1.0 |
| - **Model Developers:** Red Hat |
|
|
| FP8 KV Cache Quantization of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct). |
|
|
| ### Model Optimizations |
|
|
| This model was obtained by quantizing the KV cache (the attention key and value states) of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) to the FP8 data type. |
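
To make the memory effect of an FP8 KV cache concrete, the following is a rough sizing sketch. It assumes the public Llama-3.1-8B configuration (32 layers, 8 key/value heads, head dimension 128) and a 16-bit baseline cache; these values are not stated in this model card, and the small overhead of storing quantization scales is ignored.

```python
# Rough KV-cache sizing sketch. The architecture values below are assumptions
# taken from the public Llama-3.1-8B configuration, not from this model card.
NUM_LAYERS = 32     # transformer layers
NUM_KV_HEADS = 8    # grouped-query attention: key/value heads, not query heads
HEAD_DIM = 128      # hidden size 4096 / 32 attention heads


def kv_cache_bytes_per_token(bytes_per_element: int) -> int:
    """Bytes needed to cache keys and values for one token across all layers."""
    # Factor 2 accounts for storing both a key and a value vector per head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_element


def kv_cache_gib(num_tokens: int, bytes_per_element: int) -> float:
    """Total KV-cache size in GiB for a given number of cached tokens."""
    return num_tokens * kv_cache_bytes_per_token(bytes_per_element) / 2**30


bf16_per_token = kv_cache_bytes_per_token(2)  # 16-bit baseline cache
fp8_per_token = kv_cache_bytes_per_token(1)   # FP8 cache

print(f"16-bit: {bf16_per_token} bytes/token, FP8: {fp8_per_token} bytes/token")
print(f"131072 tokens: {kv_cache_gib(131072, 2)} GiB vs {kv_cache_gib(131072, 1)} GiB")
```

Under these assumptions the cache footprint per token halves, so the same GPU memory budget can hold roughly twice as many cached tokens.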
|
|
|
|
| ## Deployment |
|
|
| ### Use with vLLM |
|
|
| 1. Start the vLLM server: |
| ``` |
| vllm serve RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8 --tensor-parallel-size 1 |
| ``` |
|
|
| 2. Send requests to the server: |
|
|
| ```python |
| from openai import OpenAI |
| |
| # Modify OpenAI's API key and API base to use vLLM's API server. |
| openai_api_key = "EMPTY" |
| openai_api_base = "http://<your-server-host>:8000/v1" |
| |
| client = OpenAI( |
| api_key=openai_api_key, |
| base_url=openai_api_base, |
| ) |
| |
| model = "RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8" |
| |
| messages = [ |
| {"role": "user", "content": "Explain quantum mechanics clearly and concisely."}, |
| ] |
| |
| |
| outputs = client.chat.completions.create( |
| model=model, |
| messages=messages, |
| ) |
| |
| generated_text = outputs.choices[0].message.content |
| print(generated_text) |
| ``` |
|
|
|
|
|
|
| ## Evaluation |
|
|
|
|
| The model was evaluated on the RULER (needle-in-a-haystack, NIAH) and LongBench long-context benchmarks using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). |
| [vLLM](https://docs.vllm.ai/en/stable/) was used as the inference backend for all evaluations. |
|
|
|
|
|
|
|
|
|
|
|
|
| ### Accuracy |
| <table> |
| <thead> |
| <tr> |
| <th>Category</th> |
| <th>Metric</th> |
| <th>meta-llama/Llama-3.1-8B-Instruct</th> |
| <th>nm-testing/Llama-3.1-8B-Instruct-KV-Cache-FP8</th> |
| <th>Recovery (%)</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td rowspan="1"><b>LongBench V1</b></td> |
| <td>Task 1</td> |
| <td>abc</td> |
| <td>ijk</td> |
| <td>xyz</td> |
| </tr> |
| <tr> |
| <td rowspan="6"><b>NIAH</b></td> |
| <td>niah_single_1</td> |
| <td>abc</td> |
| <td>ijk</td> |
| <td>xyz</td> |
| </tr> |
| <tr> |
| <td>niah_single_2</td> |
| <td>abc</td> |
| <td>ijk</td> |
| <td>xyz</td> |
| </tr> |
| <tr> |
| <td>niah_single_3</td> |
| <td>abc</td> |
| <td>ijk</td> |
| <td>xyz</td> |
| </tr> |
| <tr> |
| <td>niah_multikey_1</td> |
| <td>abc</td> |
| <td>ijk</td> |
| <td>xyz</td> |
| </tr> |
| <tr> |
| <td>niah_multikey_2</td> |
| <td>abc</td> |
| <td>ijk</td> |
| <td>xyz</td> |
| </tr> |
| <tr> |
| <td>niah_multikey_3</td> |
| <td>abc</td> |
| <td>ijk</td> |
| <td>xyz</td> |
| </tr> |
| <tr> |
| <td colspan="2"><b>Average Score</b></td> |
| <td><b>abc</b></td> |
| <td><b>ijk</b></td> |
| <td><b>xyz</b></td> |
| </tr> |
| </tbody> |
| </table> |
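
The Recovery (%) column is not defined in this card. A minimal sketch of the usual convention, assumed here, is the quantized model's score divided by the baseline score, times 100. The task names and scores below are hypothetical placeholders, not results for this model.

```python
def recovery_pct(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score (assumed definition)."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * quantized / baseline


# Hypothetical (baseline, quantized) scores, for illustration only.
example_scores = {
    "task_a": (80.0, 78.0),
    "task_b": (50.0, 50.0),
}

recoveries = {
    task: recovery_pct(baseline, quantized)
    for task, (baseline, quantized) in example_scores.items()
}

for task, value in recoveries.items():
    print(f"{task}: {value:.1f}%")
```

A value of 100% means the quantized model matches the baseline on that metric; values slightly above or below 100% are typically within evaluation noise.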
| |