---
license: apache-2.0
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---
# Llama-3.1-8B-Instruct-KV-Cache-FP8
## Model Overview
- **Model Architecture:** LlamaForCausalLM
- **Input:** Text
- **Output:** Text
- **Release Date:**
- **Version:** 1.0
- **Model Developers:** Red Hat

FP8 KV cache quantization of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
### Model Optimizations
This model was obtained by quantizing the key-value (KV) cache of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) to the FP8 data type; weights and activations are left unquantized. Storing the KV cache in FP8 roughly halves its memory footprint relative to a 16-bit cache, freeing GPU memory for longer contexts or more concurrent sequences.
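As a back-of-the-envelope illustration of the savings (using the Llama-3.1-8B configuration of 32 decoder layers, 8 grouped-query KV heads, and head dimension 128):

```python
# Approximate KV cache footprint per token for Llama-3.1-8B.
num_layers, num_kv_heads, head_dim = 32, 8, 128
elements_per_token = 2 * num_layers * num_kv_heads * head_dim  # 2 = keys + values

for name, bytes_per_elem in [("BF16", 2), ("FP8", 1)]:
    per_token_kib = elements_per_token * bytes_per_elem / 1024
    at_128k_gib = elements_per_token * bytes_per_elem * 131072 / 1024**3
    print(f"{name}: {per_token_kib:.0f} KiB/token, ~{at_128k_gib:.0f} GiB at a 128K-token context")
# BF16: 128 KiB/token, ~16 GiB at a 128K-token context
# FP8: 64 KiB/token, ~8 GiB at a 128K-token context
```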
## Deployment
### Use with vLLM
1. Initialize the vLLM server (`--kv-cache-dtype fp8` enables the FP8 KV cache):
```
vllm serve RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8 --tensor-parallel-size 1 --kv-cache-dtype fp8
```
2. Send requests to the server:
```python
from openai import OpenAI
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
model = "RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8"
messages = [
{"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]
outputs = client.chat.completions.create(
model=model,
messages=messages,
)
generated_text = outputs.choices[0].message.content
print(generated_text)
```
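
The model can also be used without a server through vLLM's offline `LLM` API. The sketch below is illustrative: the sampling parameters are arbitrary, and `kv_cache_dtype="fp8"` turns on the FP8 KV cache.

```python
from vllm import LLM, SamplingParams

# Load the model with the FP8 KV cache enabled.
llm = LLM(
    model="RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8",
    kv_cache_dtype="fp8",
    tensor_parallel_size=1,
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

# LLM.chat applies the model's chat template before generating.
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```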
<!-- ## Creation

This model was created with the [llm-compressor](https://github.com/vllm-project/llm-compressor) library. The snippet below follows the structure of llm-compressor's FP8 KV cache quantization example; the calibration dataset and sample counts shown are representative defaults, not a record of the exact run.

<details>
<summary>Creation details</summary>

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

# Load model and tokenizer.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data used to compute static KV cache scales.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(text, padding=False, max_length=MAX_SEQUENCE_LENGTH,
                     truncation=True, add_special_tokens=False)

ds = ds.map(preprocess, remove_columns=ds.column_names)

# Quantize only the KV cache to FP8 with static per-tensor scales;
# weights and activations are left untouched.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-KV-Cache-FP8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

</details> -->
## Evaluation
The model was evaluated on the RULER and LongBench long-context benchmarks using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), with [vLLM](https://docs.vllm.ai/en/stable/) as the inference backend for all evaluations.
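A representative command using the vLLM backend of lm-evaluation-harness is sketched below; the task names (`ruler`, `longbench`), context length, and engine arguments are assumptions rather than the exact settings behind the reported numbers.
```
lm_eval \
  --model vllm \
  --model_args pretrained=RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8,kv_cache_dtype=fp8,tensor_parallel_size=1,max_model_len=32768 \
  --tasks ruler,longbench \
  --batch_size auto
```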
### Accuracy
<table>
<thead>
<tr>
<th>Category</th>
<th>Metric</th>
<th>meta-llama/Llama-3.1-8B-Instruct</th>
<th>nm-testing/Llama-3.1-8B-Instruct-KV-Cache-FP8</th>
<th>Recovery (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1"><b>LongBench V1</b></td>
<td>Task 1</td>
<td>abc</td>
<td>ijk</td>
<td>xyz</td>
</tr>
<tr>
<td rowspan="6"><b>NIAH</b></td>
<td>niah_single_1</td>
<td>abc</td>
<td>ijk</td>
<td>xyz</td>
</tr>
<tr>
<td>niah_single_2</td>
<td>abc</td>
<td>ijk</td>
<td>xyz</td>
</tr>
<tr>
<td>niah_single_3</td>
<td>abc</td>
<td>ijk</td>
<td>xyz</td>
</tr>
<tr>
<td>niah_multikey_1</td>
<td>abc</td>
<td>ijk</td>
<td>xyz</td>
</tr>
<tr>
<td>niah_multikey_2</td>
<td>abc</td>
<td>ijk</td>
<td>xyz</td>
</tr>
<tr>
<td>niah_multikey_3</td>
<td>abc</td>
<td>ijk</td>
<td>xyz</td>
</tr>
<tr>
<td colspan="2"><b>Average Score</b></td>
<td><b>abc</b></td>
<td><b>ijk</b></td>
<td><b>xyz</b></td>
</tr>
</tbody>
</table>