---
license: apache-2.0
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# Llama-3.1-8B-Instruct-KV-Cache-FP8

## Model Overview
- **Model Architecture:** LlamaForCausalLM
- **Input:** Text
- **Output:** Text
- **Release Date:**
- **Version:** 1.0
- **Model Developers:** Red Hat

FP8 KV-cache quantization of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

### Model Optimizations

This model was obtained by quantizing the KV cache of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) to the FP8 data type; the weights and activations are left untouched. A hedged sketch of how such a checkpoint can be produced appears at the end of this card.

## Deployment

### Use with vLLM

1. Initialize the vLLM server:

```
vllm serve RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8 --tensor_parallel_size 1
```

2. Send requests to the server:

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
```

## Evaluation

The model was evaluated on the RULER and LongBench long-context benchmarks using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), with [vLLM](https://docs.vllm.ai/en/stable/) as the inference engine for all evaluations. An illustrative harness invocation is sketched below.
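The exact harness commands used for this card are not stated. As a minimal, hedged sketch of what a run might look like, using lm-evaluation-harness's vLLM backend (the `ruler` task identifier, `max_model_len`, and batch settings below are assumptions, not the settings that produced the table that follows):

```
# Hypothetical invocation: the task name and settings are assumptions;
# substitute the task list and context length you actually need.
lm_eval \
  --model vllm \
  --model_args pretrained=RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8,tensor_parallel_size=1,max_model_len=131072 \
  --tasks ruler \
  --batch_size auto
```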
### Accuracy

| Category | Metric | meta-llama/Llama-3.1-8B-Instruct | nm-testing/Llama-3.1-8B-Instruct-KV-Cache-FP8 | Recovery (%) |
| --- | --- | --- | --- | --- |
| LongBench V1 | Task 1 | abc | ijk | xyz |
| NIAH | niah_single_1 | abc | ijk | xyz |
| | niah_single_2 | abc | ijk | xyz |
| | niah_single_3 | abc | ijk | xyz |
| | niah_multikey_1 | abc | ijk | xyz |
| | niah_multikey_2 | abc | ijk | xyz |
| | niah_multikey_3 | abc | ijk | xyz |
| | **Average Score** | abc | ijk | xyz |
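## Creation

The recipe used to create this model is not included in this card. The following is a minimal sketch of how a KV-cache-FP8 checkpoint of this kind can be produced with [llm-compressor](https://github.com/vllm-project/llm-compressor), whose `oneshot` entry point accepts a recipe with a `kv_cache_scheme`. The calibration dataset, sample count, and sequence length below are illustrative assumptions, not the values used for this model.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Assumed calibration setup -- dataset and sizes are illustrative, not the
# settings used to build this checkpoint.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    # Render each chat sample with the model's chat template, then tokenize.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        padding=False,
        add_special_tokens=False,
    )

ds = ds.map(preprocess, remove_columns=ds.column_names)

# Quantize only the KV cache to FP8; weights and activations stay as-is.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-KV-Cache-FP8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The resulting directory can then be served with `vllm serve` as shown in the Deployment section above.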