---
license: apache-2.0
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# Llama-3.1-8B-Instruct-KV-Cache-FP8

## Model Overview
- **Model Architecture:** LlamaForCausalLM
- **Input:** Text
- **Output:** Text
- **Release Date:**
- **Version:** 1.0
- **Model Developers:** Red Hat

FP8 KV-cache quantization of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

### Model Optimizations

This model was obtained by quantizing the KV cache of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) to the FP8 data type; the weights and activations are left untouched. A hedged sketch of how such a checkpoint can be produced appears at the end of this card.

## Deployment

### Use with vLLM

1. Initialize the vLLM server:

```
vllm serve RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8 --tensor_parallel_size 1
```

2. Send requests to the server:

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
```

## Evaluation

The model was evaluated on the RULER and LongBench long-context benchmarks using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), with [vLLM](https://docs.vllm.ai/en/stable/) as the inference engine for all evaluations. An illustrative harness invocation is sketched below.
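The exact harness commands used for this card are not stated. As a minimal, hedged sketch of what a run might look like, using lm-evaluation-harness's vLLM backend (the `ruler` task identifier, `max_model_len`, and batch settings below are assumptions, not the settings that produced the table that follows):

```
# Hypothetical invocation: the task name and settings are assumptions;
# substitute the task list and context length you actually need.
lm_eval \
  --model vllm \
  --model_args pretrained=RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8,tensor_parallel_size=1,max_model_len=131072 \
  --tasks ruler \
  --batch_size auto
```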
### Accuracy

| Category | Metric | meta-llama/Llama-3.1-8B-Instruct | nm-testing/Llama-3.1-8B-Instruct-KV-Cache-FP8 | Recovery (%) |
| --- | --- | --- | --- | --- |
| LongBench V1 | Task 1 | abc | ijk | xyz |
| NIAH | niah_single_1 | abc | ijk | xyz |
| | niah_single_2 | abc | ijk | xyz |
| | niah_single_3 | abc | ijk | xyz |
| | niah_multikey_1 | abc | ijk | xyz |
| | niah_multikey_2 | abc | ijk | xyz |
| | niah_multikey_3 | abc | ijk | xyz |
| | **Average Score** | abc | ijk | xyz |
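## Creation

The recipe used to create this model is not included in this card. The following is a minimal sketch of how a KV-cache-FP8 checkpoint of this kind can be produced with [llm-compressor](https://github.com/vllm-project/llm-compressor), whose `oneshot` entry point accepts a recipe with a `kv_cache_scheme`. The calibration dataset, sample count, and sequence length below are illustrative assumptions, not the values used for this model.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Assumed calibration setup -- dataset and sizes are illustrative, not the
# settings used to build this checkpoint.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    # Render each chat sample with the model's chat template, then tokenize.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        padding=False,
        add_special_tokens=False,
    )

ds = ds.map(preprocess, remove_columns=ds.column_names)

# Quantize only the KV cache to FP8; weights and activations stay as-is.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-KV-Cache-FP8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The resulting directory can then be served with `vllm serve` as shown in the Deployment section above.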