krishnateja95 committed · Commit 3022aba · verified · Parent: 566cf12

Update README.md

Files changed (1): README.md (+109 −3)
---
license: apache-2.0
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# Llama-3.1-8B-Instruct-KV-Cache-FP8

## Model Overview
- **Model Architecture:** LlamaForCausalLM
- **Input:** Text
- **Output:** Text
- **Release Date:**
- **Version:** 1.0
- **Model Developers:** Red Hat

FP8 KV cache quantization of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
18
+
19
+ ### Model Optimizations
20
+
21
+ This model was obtained by quantizing the KV Cache of weights and activations of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) to FP8 data type.
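In [llm-compressor](https://github.com/vllm-project/llm-compressor), KV-cache-only quantization is expressed through the `kv_cache_scheme` field of `QuantizationModifier`. The fragment below is an illustrative sketch modeled on llm-compressor's published FP8 KV-cache example; it is not taken from this model's actual creation script, and field values may differ across library versions.

```yaml
# Illustrative llm-compressor recipe: quantize only the KV cache to FP8.
# Mirrors llm-compressor's FP8 KV-cache example; the exact recipe used to
# produce this checkpoint may differ.
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      kv_cache_scheme:
        num_bits: 8
        type: float
        strategy: tensor
        dynamic: false
        symmetric: true
```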


## Deployment

### Use with vLLM

1. Initialize vLLM server:
```shell
vllm serve RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8 --tensor_parallel_size 1
```

2. Send requests to the server:

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
```
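The client call above ultimately POSTs a JSON body to vLLM's OpenAI-compatible `/v1/chat/completions` route. As a minimal sketch (payload fields follow the OpenAI chat-completions schema; the host and port are placeholders), the same request body can be built with only the standard library:

```python
import json

# JSON body equivalent to the client.chat.completions.create(...) call above;
# POST it to http://<your-server-host>:8000/v1/chat/completions.
payload = {
    "model": "RedHatAI/Llama-3.1-8B-Instruct-KV-Cache-FP8",
    "messages": [
        {"role": "user", "content": "Explain quantum mechanics clearly and concisely."}
    ],
}

body = json.dumps(payload)
print(body)
```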

<!-- ## Creation

This model was quantized using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library as shown below.

<details>
<summary>Creation details</summary>

```python
from transformers import AutoProcessor, Qwen3ForCausalLM

from llmcompressor import oneshot
from llmcompressor.modeling import replace_modules_for_calibration
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-8B"

# Load model.
model = Qwen3ForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = replace_modules_for_calibration(model)

# Configure the quantization algorithm and scheme.
# In this case, we:
# * quantize the weights to fp8 with per-block quantization
# * quantize the activations to fp8 with dynamic token activations
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_BLOCK",
    ignore=["lm_head"],
)

# Apply quantization.
oneshot(model=model, recipe=recipe)

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-block"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
```
</details> -->


## Evaluation

The model was evaluated on the long-context benchmarks RULER and LongBench, using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
[vLLM](https://docs.vllm.ai/en/stable/) was used for all evaluations.