# InternLM2 Model Export for ONNX Runtime GenAI
This example demonstrates how to export InternLM2 models to ONNX format using ONNX Runtime GenAI.
## Supported Models
All InternLM2 model sizes are supported:
- ✅ InternLM2-1.8B - Tested and verified
- ✅ InternLM2-7B - Tested and verified
- ✅ InternLM2-20B - Fully compatible
- ✅ InternLM2-Chat variants - All sizes supported
The implementation is architecture-based and automatically adapts to any InternLM2 model size.
## Model Architecture
InternLM2 uses a Llama-based architecture with the following key features:
- Attention: Grouped Query Attention (GQA) with grouped/interleaved QKV layout
- Normalization: RMSNorm (eps: 1e-05)
- Activation: SiLU
- Positional Encoding: RoPE with theta=1,000,000
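For illustration, here is a minimal sketch (not taken from the exporter) of how the theta=1,000,000 base and the shared head dimension of 128 translate into RoPE rotation frequencies:

```python
import numpy as np

# Minimal RoPE frequency sketch for illustration only; the exporter and
# runtime derive this internally from the model config.
head_dim = 128          # per-head dimension shared by all InternLM2 sizes
rope_theta = 1_000_000  # rope_theta in the InternLM2 config

# Standard RoPE inverse frequencies: one frequency per pair of channels
inv_freq = 1.0 / (rope_theta ** (np.arange(0, head_dim, 2) / head_dim))

# Rotation angles for the first few positions
positions = np.arange(4)[:, None]        # shape (4, 1)
angles = positions * inv_freq[None, :]   # shape (4, head_dim // 2)
print(angles.shape)                      # (4, 64)
```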
### Architecture Specifications
| Parameter | 1.8B | 7B | 20B |
|---|---|---|---|
| Hidden Size | 2048 | 4096 | 6144 |
| Num Layers | 24 | 32 | 48 |
| Q Heads | 16 | 32 | 48 |
| KV Heads | 8 | 8 | 8 |
| Head Dim | 128 | 128 | 128 |
| Intermediate | 8192 | 14336 | 16384 |
| GQA Ratio | 2:1 | 4:1 | 6:1 |
| Context Length | 32,768 | 32,768 | 32,768 |
| Vocab Size | 92,544 | 92,544 | 92,544 |
## Export Examples

### InternLM2-1.8B

FP32 (Best quality baseline):

```bash
python -m onnxruntime_genai.models.builder \
  --input internlm/internlm2-1_8b \
  --output ./internlm2-1.8b-cpu-fp32 \
  --precision fp32 \
  --execution_provider cpu
```
INT4 RTN (Fast quantization):

```bash
python -m onnxruntime_genai.models.builder \
  --input internlm/internlm2-1_8b \
  --output ./internlm2-1.8b-cpu-int4 \
  --precision int4 \
  --execution_provider cpu
```
INT4 AWQ (Best quality, recommended):

```bash
python -m onnxruntime_genai.models.builder \
  --input internlm/internlm2-1_8b \
  --output ./internlm2-1.8b-cpu-int4-awq \
  --precision int4 \
  --execution_provider cpu \
  --extra_options int4_accuracy_level=4
```
### InternLM2-7B

INT4 AWQ CPU (Recommended for most users):

```bash
python -m onnxruntime_genai.models.builder \
  --input internlm/internlm2-7b \
  --output ./internlm2-7b-cpu-int4-awq \
  --precision int4 \
  --execution_provider cpu \
  --extra_options int4_accuracy_level=4
```
INT4 AWQ CUDA (For GPU inference):

```bash
python -m onnxruntime_genai.models.builder \
  --input internlm/internlm2-7b \
  --output ./internlm2-7b-cuda-int4-awq \
  --precision int4 \
  --execution_provider cuda \
  --extra_options int4_accuracy_level=4
```
FP16 CUDA (Highest quality on GPU):

```bash
python -m onnxruntime_genai.models.builder \
  --input internlm/internlm2-7b \
  --output ./internlm2-7b-cuda-fp16 \
  --precision fp16 \
  --execution_provider cuda
```
### InternLM2-20B

INT4 AWQ CUDA (Recommended):

```bash
python -m onnxruntime_genai.models.builder \
  --input internlm/internlm2-20b \
  --output ./internlm2-20b-cuda-int4-awq \
  --precision int4 \
  --execution_provider cuda \
  --extra_options int4_accuracy_level=4
```
## Model Size & Performance
| Model | Original Size | INT4 Quantized | FP16 | Recommended RAM |
|---|---|---|---|---|
| InternLM2-1.8B | ~3.6 GB | ~1.0 GB | ~3.6 GB | 4 GB |
| InternLM2-7B | ~14 GB | ~3.8 GB | ~14 GB | 8 GB |
| InternLM2-20B | ~40 GB | ~10.5 GB | ~40 GB | 24 GB |
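These figures follow roughly from bytes-per-weight arithmetic; a back-of-the-envelope sketch (illustrative only; real exports add quantization scales, zero points, and other overhead, which is why the INT4 files come out slightly larger):

```python
# Rough size estimates from bits per weight (illustrative only).
def approx_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for name, params in [("1.8B", 1.8), ("7B", 7.0), ("20B", 20.0)]:
    fp16 = approx_size_gb(params, 16)  # ~3.6 / 14 / 40 GB
    int4 = approx_size_gb(params, 4)   # ~0.9 / 3.5 / 10 GB before overhead
    print(f"{name}: FP16 ~{fp16:.1f} GB, INT4 ~{int4:.1f} GB")
```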
CPU Inference (Approximate):
| Model | Min RAM | Recommended RAM | Typical Speed |
|---|---|---|---|
| 1.8B INT4 | 4 GB | 8 GB | 8-12 tok/s |
| 7B INT4 | 8 GB | 16 GB | 2-4 tok/s |
| 20B INT4 | 16 GB | 32 GB | 0.5-1 tok/s |
GPU Inference (CUDA):
| Model | Min VRAM | Recommended VRAM | Typical Speed |
|---|---|---|---|
| 1.8B INT4 | 2 GB | 4 GB | 50-80 tok/s |
| 7B INT4 | 6 GB | 8 GB | 30-50 tok/s |
| 7B FP16 | 14 GB | 16 GB | 40-60 tok/s |
| 20B INT4 | 12 GB | 16 GB | 20-30 tok/s |
| 20B FP16 | 40 GB | 48 GB | 25-35 tok/s |
## Inference Example

```python
import onnxruntime_genai as og

# Works with any InternLM2 size!
model = og.Model("./internlm2-7b-cpu-int4-awq")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Set generation parameters
prompt = "What is the meaning of life?"
tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(
    max_length=200,
    temperature=0.7,
    top_p=0.9,
    top_k=40,
)

# Generate text
generator = og.Generator(model, params)
generator.append_tokens(tokens)

print(prompt, end="", flush=True)
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
print()
```
## Why Multi-Size Support Works

### Architecture-Based Implementation

The implementation is size-agnostic because it:

1. Dynamically reads config parameters from each model:
   - `num_attention_heads`
   - `num_key_value_heads`
   - `hidden_size`
   - `num_hidden_layers`
   - `intermediate_size`

2. Uses config-driven weight splitting:

   ```python
   # Reads from model config
   num_q_heads = config.num_attention_heads      # 16 for 1.8B, 32 for 7B, 48 for 20B
   num_kv_heads = config.num_key_value_heads     # Always 8 for InternLM2
   head_dim = config.hidden_size // num_q_heads  # Always 128

   # Calculates group size dynamically
   num_kv_groups = num_q_heads // num_kv_heads   # 2 for 1.8B, 4 for 7B, 6 for 20B
   group_size = num_kv_groups + 2
   ```

3. Handles the grouped QKV layout for any GQA ratio (see the sketch after this list):
   - Layout: `[Group0: Q0,Q1,...,K0,V0 | Group1: Q2,Q3,...,K1,V1 | ...]`
   - Each KV group contains multiple Q heads followed by K and V
   - Correctly extracts weights regardless of the Q/KV head ratio

4. Uses no hardcoded sizes anywhere in the code.
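To make the grouped layout concrete, here is a simplified sketch of splitting a fused `wqkv` weight into separate Q, K, and V projections for an arbitrary GQA ratio. It mirrors the grouping described above but is an illustration, not the builder's actual code:

```python
import numpy as np

# Simplified grouped-QKV split (illustrative only, not the builder's code).
# Assumed fused layout per KV group: [q_0 ... q_{groups-1}, k, v],
# each block spanning head_dim rows.
def split_wqkv(wqkv: np.ndarray, num_q_heads: int, num_kv_heads: int, head_dim: int):
    num_kv_groups = num_q_heads // num_kv_heads   # Q heads per KV head
    group_size = num_kv_groups + 2                # Q heads + 1 K + 1 V
    hidden_size = wqkv.shape[1]

    # View as (kv_heads, group_size, head_dim, hidden_size)
    w = wqkv.reshape(num_kv_heads, group_size, head_dim, hidden_size)

    q = w[:, :num_kv_groups]        # (kv_heads, kv_groups, head_dim, hidden)
    k = w[:, num_kv_groups]         # (kv_heads, head_dim, hidden)
    v = w[:, num_kv_groups + 1]     # (kv_heads, head_dim, hidden)

    q_proj = q.reshape(num_q_heads * head_dim, hidden_size)
    k_proj = k.reshape(num_kv_heads * head_dim, hidden_size)
    v_proj = v.reshape(num_kv_heads * head_dim, hidden_size)
    return q_proj, k_proj, v_proj

# Example with InternLM2-7B shapes: 32 Q heads, 8 KV heads, head_dim 128
wqkv = np.zeros((8 * (4 + 2) * 128, 4096), dtype=np.float32)
q_w, k_w, v_w = split_wqkv(wqkv, num_q_heads=32, num_kv_heads=8, head_dim=128)
print(q_w.shape, k_w.shape, v_w.shape)  # (4096, 4096) (1024, 4096) (1024, 4096)
```

The same function works unchanged for the 2:1 (1.8B) and 6:1 (20B) ratios, which is what makes the export size-agnostic.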
## Key Implementation Notes

Grouped QKV Layout:
- InternLM2 uses a grouped/interleaved QKV weight layout for efficient Grouped Query Attention
- The implementation in `src/python/py/models/builders/internlm.py` correctly handles this layout during weight extraction

Model Configuration:
- The exported model uses `model_type: "llama"` for ONNX Runtime GenAI compatibility
- The tokenizer uses `tokenizer_class: "LlamaTokenizer"` (SentencePiece-based)
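As a quick sanity check on an exported folder, a short script along these lines (assuming the builder's standard output files `genai_config.json` and `tokenizer_config.json`) can confirm those settings:

```python
import json
from pathlib import Path

# Inspect an exported model folder (illustrative; assumes the standard
# ONNX Runtime GenAI builder output layout).
model_dir = Path("./internlm2-7b-cpu-int4-awq")

genai_config = json.loads((model_dir / "genai_config.json").read_text())
print(genai_config["model"]["type"])      # expected: "llama"

tok_config = json.loads((model_dir / "tokenizer_config.json").read_text())
print(tok_config.get("tokenizer_class"))  # expected: "LlamaTokenizer"
```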
## Recommendations by Use Case

### Development & Testing
- InternLM2-1.8B INT4 AWQ (1 GB)
- Fast iteration, quick testing
- Good for prototyping
### Production Applications
- InternLM2-7B INT4 AWQ (3.8 GB)
- Best balance of quality and performance
- Suitable for most real-world applications
### High-Quality Applications
- InternLM2-7B FP16 CUDA (14 GB) or
- InternLM2-20B INT4 CUDA (10.5 GB)
- Maximum quality for critical applications
## Troubleshooting

### "Out of Memory" errors
- Use INT4 quantization instead of FP16/FP32
- Enable GPU inference for larger models
- Use batch_size=1 for inference
### Slow inference on CPU
- This is expected for 7B+ models
- Consider GPU inference
- Use INT4 quantization (2-3x faster than FP16)
### Model not loading
- Ensure you have enough RAM/VRAM
- Check that you're using `--execution_provider cuda` for GPU models
- Verify the ONNX Runtime GenAI installation
## References
- Model Hub (1.8B): https://huggingface.co/internlm/internlm2-1_8b
- Model Hub (7B): https://huggingface.co/internlm/internlm2-7b
- Model Hub (20B): https://huggingface.co/internlm/internlm2-20b
- Paper: https://arxiv.org/abs/2403.17297
- GitHub: https://github.com/InternLM/InternLM