kvaishnavi committed
Commit 7ddca0d · verified · 1 Parent(s): 07e04a7

Upload README.md from AMD

Files changed (1)
  1. README.md +260 -0
README.md ADDED
@@ -0,0 +1,260 @@
+ <!--
+ Copyright (C) [2026] Advanced Micro Devices, Inc. All rights reserved. Portions of this file consist of AI generated content
+ -->
+
+ # InternLM2 Model Export for ONNX Runtime GenAI
+
+ This example demonstrates how to export InternLM2 models to ONNX format using ONNX Runtime GenAI.
+
+ ## Supported Models
+
+ All InternLM2 model sizes are supported:
+
+ - ✅ **InternLM2-1.8B** - Tested and verified
+ - ✅ **InternLM2-7B** - Tested and verified
+ - ✅ **InternLM2-20B** - Fully compatible
+ - ✅ **InternLM2-Chat variants** - All sizes supported
+
+ The implementation is architecture-based and automatically adapts to any InternLM2 model size.
+
+ ## Model Architecture
+
+ InternLM2 uses a Llama-based architecture with the following key features:
+
+ - **Attention**: Grouped Query Attention (GQA) with grouped/interleaved QKV layout
+ - **Normalization**: RMSNorm (eps: 1e-05)
+ - **Activation**: SiLU
+ - **Positional Encoding**: RoPE with theta=1,000,000
+
+ ### Architecture Specifications
+
+ | Parameter | 1.8B | 7B | 20B |
+ |-----------|------|-----|-----|
+ | **Hidden Size** | 2048 | 4096 | 6144 |
+ | **Num Layers** | 24 | 32 | 48 |
+ | **Q Heads** | 16 | 32 | 48 |
+ | **KV Heads** | 8 | 8 | 8 |
+ | **Head Dim** | 128 | 128 | 128 |
+ | **Intermediate Size** | 8192 | 14336 | 16384 |
+ | **GQA Ratio (Q:KV)** | 2:1 | 4:1 | 6:1 |
+ | **Context Length** | 32,768 | 32,768 | 32,768 |
+ | **Vocab Size** | 92,544 | 92,544 | 92,544 |
+
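The Head Dim and GQA Ratio rows follow from the hidden size and head counts. A minimal sketch that recomputes them from the table values; the `SPECS` dictionary and the `derived` helper are illustrative names, not part of the exporter:

```python
# Values copied from the Architecture Specifications table.
SPECS = {
    "1.8B": {"hidden_size": 2048, "q_heads": 16, "kv_heads": 8},
    "7B": {"hidden_size": 4096, "q_heads": 32, "kv_heads": 8},
    "20B": {"hidden_size": 6144, "q_heads": 48, "kv_heads": 8},
}


def derived(spec):
    """Return (head_dim, q_heads_per_kv_head) for one model size."""
    head_dim = spec["hidden_size"] // spec["q_heads"]
    gqa_ratio = spec["q_heads"] // spec["kv_heads"]
    return head_dim, gqa_ratio


for name, spec in SPECS.items():
    head_dim, ratio = derived(spec)
    print(f"{name}: head_dim={head_dim}, GQA ratio={ratio}:1")
```
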
+ ## Export Examples
+
+ These commands pass the model location with `--input`, which the builder documents as a path to a local Hugging Face folder. To download directly from Hugging Face, the builder also accepts the repository ID through `-m`/`--model_name`.
+
+ ### InternLM2-1.8B
+
+ **FP32 (Best quality baseline):**
+ ```bash
+ python -m onnxruntime_genai.models.builder \
+     --input internlm/internlm2-1_8b \
+     --output ./internlm2-1.8b-cpu-fp32 \
+     --precision fp32 \
+     --execution_provider cpu
+ ```
+
+ **INT4 RTN (Fast quantization):**
+ ```bash
+ python -m onnxruntime_genai.models.builder \
+     --input internlm/internlm2-1_8b \
+     --output ./internlm2-1.8b-cpu-int4 \
+     --precision int4 \
+     --execution_provider cpu
+ ```
+
+ **INT4 AWQ (Best quality, recommended):**
+ ```bash
+ python -m onnxruntime_genai.models.builder \
+     --input internlm/internlm2-1_8b \
+     --output ./internlm2-1.8b-cpu-int4-awq \
+     --precision int4 \
+     --execution_provider cpu \
+     --extra_options int4_accuracy_level=4
+ ```
+
+ Note: `int4_accuracy_level=4` sets the minimum activation precision of the INT4 MatMul to INT8. It does not by itself select a different weight quantization algorithm.
+
+ ### InternLM2-7B
+
+ **INT4 AWQ CPU (Recommended for most users):**
+ ```bash
+ python -m onnxruntime_genai.models.builder \
+     --input internlm/internlm2-7b \
+     --output ./internlm2-7b-cpu-int4-awq \
+     --precision int4 \
+     --execution_provider cpu \
+     --extra_options int4_accuracy_level=4
+ ```
+
+ **INT4 AWQ CUDA (For GPU inference):**
+ ```bash
+ python -m onnxruntime_genai.models.builder \
+     --input internlm/internlm2-7b \
+     --output ./internlm2-7b-cuda-int4-awq \
+     --precision int4 \
+     --execution_provider cuda \
+     --extra_options int4_accuracy_level=4
+ ```
+
+ **FP16 CUDA (Highest quality on GPU):**
+ ```bash
+ python -m onnxruntime_genai.models.builder \
+     --input internlm/internlm2-7b \
+     --output ./internlm2-7b-cuda-fp16 \
+     --precision fp16 \
+     --execution_provider cuda
+ ```
+
+ ### InternLM2-20B
+
+ **INT4 AWQ CUDA (Recommended):**
+ ```bash
+ python -m onnxruntime_genai.models.builder \
+     --input internlm/internlm2-20b \
+     --output ./internlm2-20b-cuda-int4-awq \
+     --precision int4 \
+     --execution_provider cuda \
+     --extra_options int4_accuracy_level=4
+ ```
+
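The export commands differ only in the model, output folder, precision, and execution provider. As a rough sketch, they can be assembled programmatically. `builder_argv` is a hypothetical helper that only mirrors the flags shown in this README and does not validate them against the builder:

```python
import sys


def builder_argv(model, output_dir, precision, execution_provider, extra_options=None):
    """Build the argument list for one export command from this README."""
    argv = [
        sys.executable, "-m", "onnxruntime_genai.models.builder",
        "--input", model,
        "--output", output_dir,
        "--precision", precision,
        "--execution_provider", execution_provider,
    ]
    if extra_options:
        # Extra options are passed as space-separated key=value pairs.
        argv.append("--extra_options")
        argv.extend(f"{key}={value}" for key, value in extra_options.items())
    return argv


argv = builder_argv(
    "internlm/internlm2-7b", "./internlm2-7b-cuda-int4-awq", "int4", "cuda",
    extra_options={"int4_accuracy_level": 4},
)
print(" ".join(argv))
# To run the export: import subprocess; subprocess.run(argv, check=True)
```

Building the list instead of a shell string avoids quoting problems when an output path contains spaces.
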
+ ## Model Size & Performance
+
+ | Model | Original Size | INT4 Quantized | FP16 | Recommended RAM |
+ |-------|--------------|----------------|------|-----------------|
+ | **InternLM2-1.8B** | ~3.6 GB | ~1.0 GB | ~3.6 GB | 4 GB |
+ | **InternLM2-7B** | ~14 GB | ~3.8 GB | ~14 GB | 8 GB |
+ | **InternLM2-20B** | ~40 GB | ~10.5 GB | ~40 GB | 24 GB |
+
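The sizes in the table are roughly what weight-only arithmetic predicts. The sketch below assumes nominal parameter counts (1.8, 7, and 20 billion) and, for INT4, block-wise quantization with one 16-bit scale per 32 weights. Both are assumptions made for illustration, and real exports differ because some tensors may be stored at other precisions:

```python
def estimated_size_gb(num_params, weight_bits, block_size=None, scale_bits=16):
    """Estimate weight storage in GB (10^9 bytes)."""
    bits_per_weight = weight_bits
    if block_size:
        # Amortize one scale value over each block of quantized weights.
        bits_per_weight += scale_bits / block_size
    return num_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts (assumed, not read from the checkpoints).
for name, params in [("1.8B", 1.8e9), ("7B", 7e9), ("20B", 20e9)]:
    fp16 = estimated_size_gb(params, 16)
    int4 = estimated_size_gb(params, 4, block_size=32)
    print(f"InternLM2-{name}: FP16 ~{fp16:.1f} GB, INT4 ~{int4:.1f} GB")
```

Under these assumptions the estimates land close to the table's figures, for example about 14 GB (FP16) and about 3.9 GB (INT4) for the 7B model.
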
+ **CPU Inference (Approximate):**
+
+ | Model | Min RAM | Recommended RAM | Typical Speed |
+ |-------|---------|-----------------|---------------|
+ | 1.8B INT4 | 4 GB | 8 GB | 8-12 tok/s |
+ | 7B INT4 | 8 GB | 16 GB | 2-4 tok/s |
+ | 20B INT4 | 16 GB | 32 GB | 0.5-1 tok/s |
+
+ **GPU Inference (CUDA, Approximate):**
+
+ | Model | Min VRAM | Recommended VRAM | Typical Speed |
+ |-------|----------|------------------|---------------|
+ | 1.8B INT4 | 2 GB | 4 GB | 50-80 tok/s |
+ | 7B INT4 | 6 GB | 8 GB | 30-50 tok/s |
+ | 7B FP16 | 14 GB | 16 GB | 40-60 tok/s |
+ | 20B INT4 | 12 GB | 16 GB | 20-30 tok/s |
+ | 20B FP16 | 40 GB | 48 GB | 25-35 tok/s |
+
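Throughput depends heavily on hardware, so these figures are best checked locally. A minimal timing sketch: `measure_tokens_per_second` is a hypothetical helper that accepts any object exposing `is_done()` and `generate_next_token()`, the two generator methods used in this README's inference example. A stand-in generator is used here so the snippet runs without a model:

```python
import time


def measure_tokens_per_second(generator, max_tokens=100):
    """Time the decode loop and return (tokens_generated, tokens_per_second)."""
    count = 0
    start = time.perf_counter()
    while not generator.is_done() and count < max_tokens:
        generator.generate_next_token()
        count += 1
    elapsed = time.perf_counter() - start
    return count, (count / elapsed if elapsed > 0 else float("inf"))


class FakeGenerator:
    """Stand-in that finishes after a fixed number of tokens."""

    def __init__(self, total):
        self.remaining = total

    def is_done(self):
        return self.remaining == 0

    def generate_next_token(self):
        self.remaining -= 1


count, rate = measure_tokens_per_second(FakeGenerator(20))
print(f"{count} tokens at {rate:.0f} tok/s")
```

With a real model, pass the `og.Generator` instance after appending the prompt tokens. Prompt processing then falls outside the timed loop, so the result reflects decode speed only.
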
+ ## Inference Example
+
+ ```python
+ import onnxruntime_genai as og
+
+ # Works with any InternLM2 size!
+ model = og.Model("./internlm2-7b-cpu-int4-awq")
+ tokenizer = og.Tokenizer(model)
+ tokenizer_stream = tokenizer.create_stream()
+
+ # Encode the prompt
+ prompt = "What is the meaning of life?"
+ tokens = tokenizer.encode(prompt)
+
+ # Set generation parameters
+ params = og.GeneratorParams(model)
+ params.set_search_options(
+     max_length=200,
+     do_sample=True,  # enable sampling so temperature, top_p, and top_k take effect
+     temperature=0.7,
+     top_p=0.9,
+     top_k=40
+ )
+
+ # Generate text, streaming each token as it is produced
+ generator = og.Generator(model, params)
+ generator.append_tokens(tokens)
+
+ print(prompt, end="", flush=True)
+ while not generator.is_done():
+     generator.generate_next_token()
+     new_token = generator.get_next_tokens()[0]
+     print(tokenizer_stream.decode(new_token), end="", flush=True)
+ print()
+ ```
+
+ ## Why Multi-Size Support Works
+
+ ### Architecture-Based Implementation
+
+ The implementation is **size-agnostic** because it:
+
+ 1. **Dynamically reads config parameters** from each model:
+    - `num_attention_heads`
+    - `num_key_value_heads`
+    - `hidden_size`
+    - `num_hidden_layers`
+    - `intermediate_size`
+
+ 2. **Uses config-driven weight splitting**:
+    ```python
+    # Reads from model config
+    num_q_heads = config.num_attention_heads    # 16 for 1.8B, 32 for 7B, 48 for 20B
+    num_kv_heads = config.num_key_value_heads   # Always 8 for InternLM2
+    head_dim = config.hidden_size // num_q_heads  # Always 128
+
+    # Calculates group size dynamically
+    num_kv_groups = num_q_heads // num_kv_heads  # 2 for 1.8B, 4 for 7B, 6 for 20B
+    group_size = num_kv_groups + 2  # Q heads in the group plus one K head and one V head
+    ```
+
+ 3. **Handles grouped QKV layout** for any GQA ratio:
+    - Layout: `[Group0: Q0,Q1,...,K0,V0 | Group1: Q2,Q3,...,K1,V1 | ...]` (Q indices shown for the 2:1 ratio of the 1.8B model)
+    - Each KV group contains multiple Q heads followed by K and V
+    - Correctly extracts weights regardless of the Q/KV head ratio
+
+ 4. **No hardcoded sizes** anywhere in the code
+
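The grouped layout can be illustrated with row indices alone. This is a sketch of the idea, not the code in `internlm.py`: `split_grouped_qkv_rows` is a hypothetical helper that assumes the fused QKV weight stores its output rows group by group, each group holding its Q heads followed by one K head and one V head, with `head_dim` rows per head:

```python
def split_grouped_qkv_rows(num_q_heads, num_kv_heads, head_dim):
    """Return (q_rows, k_rows, v_rows): row indices into the fused QKV weight."""
    q_per_group = num_q_heads // num_kv_heads
    group_size = q_per_group + 2  # Q heads, then one K head and one V head
    q_rows, k_rows, v_rows = [], [], []
    for group in range(num_kv_heads):
        for slot in range(group_size):
            start = (group * group_size + slot) * head_dim
            rows = range(start, start + head_dim)
            if slot < q_per_group:
                q_rows.extend(rows)      # leading slots are Q heads
            elif slot == q_per_group:
                k_rows.extend(rows)      # second-to-last slot is K
            else:
                v_rows.extend(rows)      # last slot is V
    return q_rows, k_rows, v_rows


# 1.8B configuration: 16 Q heads, 8 KV heads, head_dim 128.
q_rows, k_rows, v_rows = split_grouped_qkv_rows(16, 8, 128)
print(len(q_rows), len(k_rows), len(v_rows))
```

Indexing the fused weight with these row lists would yield separate Q, K, and V matrices in the contiguous order that a Llama-style attention block expects.
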
+ ### Key Implementation Notes
+
+ **Grouped QKV Layout:**
+ - InternLM2 uses a grouped/interleaved QKV weight layout for efficient Grouped Query Attention
+ - The implementation in `src/python/py/models/builders/internlm.py` correctly handles this layout during weight extraction
+
+ **Model Configuration:**
+ - The exported model uses `model_type: "llama"` for ONNX Runtime GenAI compatibility
+ - Tokenizer uses `tokenizer_class: "LlamaTokenizer"` (SentencePiece-based)
+
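One way to confirm the model type after an export is to inspect the generated `genai_config.json`. The sketch below assumes the type is stored under `model.type`, which may differ between ONNX Runtime GenAI versions. The sample dictionary is a trimmed stand-in, not real builder output, and `exported_model_type` is a hypothetical helper:

```python
import json


def exported_model_type(genai_config):
    """Return the model type recorded in a parsed genai_config.json, or None."""
    return genai_config.get("model", {}).get("type")


# Trimmed stand-in for the JSON file written next to the ONNX model (assumed layout).
sample = json.loads('{"model": {"type": "llama", "vocab_size": 92544}}')
print(exported_model_type(sample))
```

For a real export, parse the `genai_config.json` file in the output folder with `json.load` and pass the result to the helper.
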
+ ## Recommendations by Use Case
+
+ ### Development & Testing
+ - **InternLM2-1.8B INT4 AWQ** (1 GB)
+ - Fast iteration, quick testing
+ - Good for prototyping
+
+ ### Production Applications
+ - **InternLM2-7B INT4 AWQ** (3.8 GB)
+ - Best balance of quality and performance
+ - Suitable for most real-world applications
+
+ ### High-Quality Applications
+ - **InternLM2-7B FP16 CUDA** (14 GB) or
+ - **InternLM2-20B INT4 CUDA** (10.5 GB)
+ - Maximum quality for critical applications
+
+ ## Troubleshooting
+
+ ### "Out of Memory" errors
+ - Use INT4 quantization instead of FP16/FP32
+ - Enable GPU inference for larger models
+ - Use batch_size=1 for inference
+
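Memory use grows with context length as well as with weight size, because the key/value cache scales with the number of tokens. A back-of-the-envelope estimate using the 7B values from the Architecture Specifications table, assuming FP16 cache entries and a batch size of 1 (the runtime's actual allocation strategy may differ):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to hold keys and values for every layer and token."""
    # The factor 2 accounts for storing both K and V.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# InternLM2-7B: 32 layers, 8 KV heads, head_dim 128.
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)
short = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=2048)
print(f"32,768 tokens: {full / 2**30:.2f} GiB; 2,048 tokens: {short / 2**30:.2f} GiB")
```

Under these assumptions a full 32,768-token context needs about 4 GiB for the cache alone, so a smaller `max_length` may also help when memory is tight.
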
+ ### Slow inference on CPU
+ - This is expected for 7B+ models
+ - Consider GPU inference
+ - Use INT4 quantization (2-3x faster than FP16)
+
+ ### Model not loading
+ - Ensure you have enough RAM/VRAM
+ - Check that the model was exported with `--execution_provider cuda` if you intend to run it on a GPU
+ - Verify your ONNX Runtime GenAI installation
+
+ ## References
+
+ - Model Hub (1.8B): https://huggingface.co/internlm/internlm2-1_8b
+ - Model Hub (7B): https://huggingface.co/internlm/internlm2-7b
+ - Model Hub (20B): https://huggingface.co/internlm/internlm2-20b
+ - Paper: https://arxiv.org/abs/2403.17297
+ - GitHub: https://github.com/InternLM/InternLM