krishnateja95 committed
Commit d4decb4 · verified · 1 Parent(s): 7a72d55

Update README.md

Files changed (1): README.md (+185 −3)

---
license: apache-2.0
pipeline_tag: text-generation
tags:
- fp8
- quantized
- llm-compressor
- compressed-tensors
- red hat
base_model:
- Qwen/Qwen3-Next-80B-A3B-Instruct
---

# Qwen3-Next-80B-A3B-Instruct-FP8-dynamic

## Model Overview
- **Model Architecture:** Qwen3NextForCausalLM
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:**
- **Version:** 1.0
- **Model Developers:** Red Hat

Quantized version of [Qwen/Qwen3-Next-80B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct).

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Qwen/Qwen3-Next-80B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct) to the FP8 data type.
This optimization reduces the number of bits per parameter from 16 to 8, reducing disk size and GPU memory requirements by approximately 50%.
Only the weights and activations of the linear operators within the transformer blocks of the language model are quantized.
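
As a rough sanity check on that ~50% figure, a back-of-envelope estimate (a minimal sketch: 80B is the nominal parameter count from the model name, and the layers left unquantized are ignored):

```python
# Rough memory estimate for an 80B-parameter checkpoint (nominal count,
# ignoring embeddings, norms, and other modules kept at higher precision).
params = 80e9

bf16_gib = params * 2 / 1024**3  # 16-bit weights: 2 bytes per parameter
fp8_gib = params * 1 / 1024**3   # FP8 weights: 1 byte per parameter

print(f"BF16: ~{bf16_gib:.0f} GiB, FP8: ~{fp8_gib:.0f} GiB, "
      f"ratio: {fp8_gib / bf16_gib:.0%}")
# BF16: ~149 GiB, FP8: ~75 GiB, ratio: 50%
```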

## Deployment

### Use with vLLM

1. Initialize vLLM server:
```
vllm serve RedHatAI/Qwen3-Next-80B-A3B-Instruct-FP8-dynamic --tensor_parallel_size 2
```
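
   Set `--tensor_parallel_size` to match the number of GPUs you want to shard the model across; the value of 2 above is one workable configuration, not a requirement.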

2. Send requests to the server:

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/Qwen3-Next-80B-A3B-Instruct-FP8-dynamic"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
```
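
The server also supports token streaming through the same OpenAI client; a minimal variant of the request above, reusing the `client`, `model`, and `messages` objects already defined:

```python
# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model=model,
    messages=messages,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```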

## Creation

This model was quantized using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library as shown below.

<details>
<summary>Creation details</summary>

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation

# NOTE: Requires a minimum of transformers 4.57.0

MODEL_ID = "Qwen/Qwen3-Next-80B-A3B-Instruct"

# Load model.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to fp8 with per-channel scales via PTQ
#   * quantize the activations to fp8 with dynamic per-token scales
# The lm_head, the MoE router gates, and the linear-attention modules are
# excluded from quantization.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "lm_head",
        "re:.*mlp.gate$",
        "re:.*mlp.shared_expert_gate$",
        "re:.*linear_attn.*",
    ],
)

# Apply quantization.
oneshot(model=model, recipe=recipe)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
print("==========================================")

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```
</details>
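
Once saved, the scheme can be sanity-checked without loading any weights by reading the `quantization_config` that the compressed-tensors format writes into `config.json` (a minimal sketch; the exact keys present may vary by llm-compressor version):

```python
import json
import os

SAVE_DIR = "Qwen3-Next-80B-A3B-Instruct-FP8-dynamic"

# The compressed-tensors format records the quantization scheme in config.json.
with open(os.path.join(SAVE_DIR, "config.json")) as f:
    config = json.load(f)

qcfg = config["quantization_config"]
print(qcfg.get("quant_method"))  # expected: "compressed-tensors"
print(qcfg.get("ignore"))        # modules excluded from quantization
```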

## Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (versions 1 and 2) using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), and on the HumanEval and MBPP coding benchmarks using [evalplus](https://github.com/evalplus/evalplus).
[vLLM](https://docs.vllm.ai/en/stable/) was used as the inference engine for all evaluations.

<details>
<summary>Evaluation details</summary>

**OpenLLM v1**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-Next-80B-A3B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=16384,tensor_parallel_size=2,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --show_config
```

**OpenLLM v2**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-Next-80B-A3B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=False,max_model_len=16384,tensor_parallel_size=2,gpu_memory_utilization=0.7,disable_log_stats=True,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks leaderboard \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --write_out \
  --batch_size auto \
  --show_config
```

**Coding Benchmarks**
```
evalplus.evaluate --model "RedHatAI/Qwen3-Next-80B-A3B-Instruct-FP8-dynamic" \
  --dataset "humaneval" \
  --backend vllm \
  --tp 2 \
  --greedy

evalplus.evaluate --model "RedHatAI/Qwen3-Next-80B-A3B-Instruct-FP8-dynamic" \
  --dataset "mbpp" \
  --backend vllm \
  --tp 2 \
  --greedy
```

</details>