# Qwen2.5-7B-Instruct Pre-Compiled for AWS Inferentia2 (TP=2)
Pre-compiled and pre-sharded Qwen2.5-7B-Instruct for AWS Neuron SDK 2.28, ready to load on inf2.xlarge (16 GB system RAM) or any larger Inferentia2/Trainium instance.
## Why Pre-Sharded?
The standard NxDI load path loads the full HuggingFace checkpoint (~14 GB BF16) into CPU RAM for weight conversion and sharding. On inf2.xlarge (16 GB system RAM), this causes an OOM kill at ~14 GB RSS.
Pre-sharded weights bypass this entirely: NxDI reads directly from the per-rank sharded files, peaking at ~13.5 GB RSS during load (tight but viable on 16 GB) and settling to ~4.3 GB RSS after device transfer.
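The back-of-envelope memory math above can be sketched as follows (parameter count from the Base Model section, BF16 width; the halving assumes the TP=2 shards split the weights roughly evenly, so the ~7.1 GiB estimate only approximates the ~7.3 GB files shipped here):

```python
# Rough memory math behind the pre-sharded load path (estimates, not measurements).
PARAMS = 7.6e9       # Qwen2.5-7B-Instruct parameter count
BYTES_PER_PARAM = 2  # BF16

full_checkpoint_gib = PARAMS * BYTES_PER_PARAM / 2**30
print(f"Full BF16 checkpoint: ~{full_checkpoint_gib:.1f} GiB")  # ~14.2 GiB; does not fit
                                                                # alongside OS on 16 GB RAM

# With TP=2 pre-sharding, each rank's file holds roughly half the weights, so the
# loader streams one shard per rank instead of materializing the whole checkpoint.
per_rank_gib = full_checkpoint_gib / 2
print(f"Per-rank shard: ~{per_rank_gib:.1f} GiB")
```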
## Contents

| File | Size | Description |
|---|---|---|
| `model.pt` | ~153 MB | Compiled Neuron NEFF graphs |
| `neuron_config.json` | ~8 KB | NxDI configuration (TP=2, BS=1, seq_len=8192, BF16) |
| `weights/tp0_sharded_checkpoint.safetensors` | ~7.3 GB | Pre-sharded model weights for rank 0 |
| `weights/tp1_sharded_checkpoint.safetensors` | ~7.3 GB | Pre-sharded model weights for rank 1 |
| `config.json` | <1 KB | HuggingFace model config |
| `tokenizer.json` | ~6.8 MB | Tokenizer |
| `tokenizer_config.json` | ~7 KB | Tokenizer configuration |
| `generation_config.json` | <1 KB | Default generation parameters |
| `vocab.json` | ~2.7 MB | Vocabulary |
| `merges.txt` | ~1.6 MB | BPE merges |
## Performance
Measured on inf2.xlarge (2 NeuronCores, 32 GB HBM, 4 vCPU, 16 GB system RAM):
| Metric | Value |
|---|---|
| Throughput | 24.1 tok/s |
| Latency (4K in / 4K out) | 169.8 s |
| Load time | ~330 s |
| Peak RSS during load | ~13.5 GB |
| RSS after load | ~4.3 GB |
| Cost | $8.76/M output tokens at $0.76/hr |
Benchmark: batch_size=1, 4095 input tokens, 4096 output tokens, greedy decoding, 2 warmup + 10 measured requests.
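The cost and latency rows in the table follow directly from the measured throughput and the hourly rate quoted there; a quick sanity check:

```python
# Derive the cost and decode-latency figures from measured throughput.
throughput_tok_s = 24.1  # measured decode throughput (table above)
hourly_rate_usd = 0.76   # inf2.xlarge hourly price used in the cost row

tokens_per_hour = throughput_tok_s * 3600
cost_per_m_output = hourly_rate_usd / tokens_per_hour * 1e6
print(f"${cost_per_m_output:.2f}/M output tokens")  # ~$8.76

decode_s = 4096 / throughput_tok_s
print(f"~{decode_s:.0f} s to decode 4096 tokens")  # ~170 s, consistent with 169.8 s measured
```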
## Quick Start

### Prerequisites

- AWS instance with Inferentia2: inf2.xlarge (minimum), inf2.8xlarge, or larger
- Deep Learning AMI Neuron (Ubuntu 24.04) 20260227 (SDK 2.28)
- Activate the pre-installed venv:

```bash
source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
```
### 1. Download the model

```bash
pip install -q huggingface_hub
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download('jburtoft/Qwen2.5-7B-Instruct-Neuron-TP2',
                  local_dir='/data/Qwen2.5-7B-Instruct-Neuron-TP2')
"
```
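Before the long load step, it can be worth confirming the download is complete. A minimal sketch of such a check (the helper and file list are illustrative, drawn from the Contents table, not part of NxDI):

```python
# Check that the key artifacts from the Contents table were downloaded.
import os

EXPECTED = [
    "model.pt",
    "neuron_config.json",
    "weights/tp0_sharded_checkpoint.safetensors",
    "weights/tp1_sharded_checkpoint.safetensors",
    "config.json",
    "tokenizer.json",
]

def missing_files(model_dir, expected=EXPECTED):
    """Return the expected artifacts absent from model_dir."""
    return [f for f in expected if not os.path.isfile(os.path.join(model_dir, f))]

# Usage after step 1:
# missing = missing_files("/data/Qwen2.5-7B-Instruct-Neuron-TP2")
# if missing:
#     raise SystemExit(f"Incomplete download, missing: {missing}")
```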
### 2. Load and run inference

```python
import os
import torch
from transformers import AutoTokenizer, GenerationConfig
from neuronx_distributed_inference.models.config import NeuronConfig, OnDeviceSamplingConfig
from neuronx_distributed_inference.models.qwen2.modeling_qwen2 import (
    NeuronQwen2ForCausalLM, Qwen2InferenceConfig,
)
from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config
from neuronx_distributed_inference.utils.accuracy import get_generate_outputs

MODEL_DIR = "/data/Qwen2.5-7B-Instruct-Neuron-TP2"
os.environ["NEURON_LOGICAL_NC_CONFIG"] = "1"

neuron_config = NeuronConfig(
    tp_degree=2,
    batch_size=1,
    seq_len=8192,
    n_positions=8192,
    max_context_length=8192,
    torch_dtype=torch.bfloat16,
    on_device_sampling_config=OnDeviceSamplingConfig(),
    fused_qkv=True,
    attn_kernel_enabled=False,  # inf2 does not support flash attention
    enable_bucketing=True,
    logical_nc_config=1,  # inf2 requires LNC=1
    save_sharded_checkpoint=True,  # must match how the model was compiled
)

config = Qwen2InferenceConfig(
    neuron_config,
    load_config=load_pretrained_config(MODEL_DIR),
)

model = NeuronQwen2ForCausalLM(MODEL_DIR, config)
model.load(MODEL_DIR)  # loads from pre-sharded weights

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, padding_side="right")
tokenizer.pad_token = tokenizer.eos_token

# Generate
prompt = "Explain quantum computing in simple terms."
generation_config = GenerationConfig(
    max_new_tokens=256,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
outputs, decoded_texts = get_generate_outputs(
    model, [prompt], tokenizer, is_hf=False, generation_config=generation_config,
)
print(decoded_texts[0])
```
### 3. Important notes

- LNC=1 is required on inf2. Set `NEURON_LOGICAL_NC_CONFIG=1` before loading.
- Flash attention is not supported on inf2 (trn1-era cores). Use `attn_kernel_enabled=False`.
- First load takes ~5-6 minutes as the sharded weights (14.6 GB total) are read from disk and transferred to device.
- First import may be slow (~3-5 min) on a fresh DLAMI instance due to library rehydration.
- `save_sharded_checkpoint=True` must be set in the NeuronConfig when loading: this tells NxDI to use the per-rank sharded files instead of the standard HF checkpoint loading path.
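Since the NeuronConfig you construct at load time must match the compiled artifact, it can help to inspect the shipped `neuron_config.json` before loading. The exact key names in that file depend on the NxDI version, so the sketch below just dumps whatever is there rather than assuming a schema (the helper is illustrative, not an NxDI API):

```python
# Dump the shipped NxDI config so you can eyeball TP degree, seq_len, dtype, etc.
import json

def dump_config(path):
    """Return the JSON config at path as a dict with sorted keys."""
    with open(path) as f:
        cfg = json.load(f)
    return {k: cfg[k] for k in sorted(cfg)}

# Usage on the downloaded model (path from step 1):
# for key, value in dump_config(
#     "/data/Qwen2.5-7B-Instruct-Neuron-TP2/neuron_config.json"
# ).items():
#     print(key, "=", value)
```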
## Compilation Details

| Parameter | Value |
|---|---|
| SDK | 2.28 (NxDI 0.8.0, neuronx-cc 2.22, torch-neuronx 2.9.0) |
| TP degree | 2 |
| Batch size | 1 |
| Sequence length | 8192 |
| Dtype | bfloat16 |
| Flash attention | Disabled (inf2 constraint) |
| LNC | 1 (inf2 constraint) |
| `save_sharded_checkpoint` | True |
| Compiled on | inf2.8xlarge (32 vCPU, 128 GB RAM) |
## Compiling Your Own
To compile for different configurations (e.g., different TP, batch size, or sequence length), use a larger instance (inf2.8xlarge or trn2.3xlarge):
```python
import os
import torch
from neuronx_distributed_inference.models.config import NeuronConfig, OnDeviceSamplingConfig
from neuronx_distributed_inference.models.qwen2.modeling_qwen2 import (
    NeuronQwen2ForCausalLM, Qwen2InferenceConfig,
)
from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config

# Download the base model first
# pip install huggingface_hub
# from huggingface_hub import snapshot_download
# snapshot_download("Qwen/Qwen2.5-7B-Instruct", local_dir="/data/models/Qwen2.5-7B-Instruct")

MODEL_PATH = "/data/models/Qwen2.5-7B-Instruct"
OUTPUT_PATH = "/data/compiled/Qwen2.5-7B-TP2-sharded"
os.environ["NEURON_LOGICAL_NC_CONFIG"] = "1"

neuron_config = NeuronConfig(
    tp_degree=2,   # adjust as needed
    batch_size=1,  # adjust as needed
    seq_len=8192,  # adjust as needed
    n_positions=8192,
    max_context_length=8192,
    torch_dtype=torch.bfloat16,
    on_device_sampling_config=OnDeviceSamplingConfig(),
    fused_qkv=True,
    attn_kernel_enabled=False,  # False for inf2, True for trn2
    enable_bucketing=True,
    logical_nc_config=1,  # 1 for inf2, 1 or 2 for trn2
    save_sharded_checkpoint=True,  # REQUIRED for pre-sharded deployment
)

config = Qwen2InferenceConfig(
    neuron_config,
    load_config=load_pretrained_config(MODEL_PATH),
)

model = NeuronQwen2ForCausalLM(MODEL_PATH, config)
model.compile(OUTPUT_PATH)

# Output:
#   OUTPUT_PATH/model.pt                                        (compiled NEFFs)
#   OUTPUT_PATH/neuron_config.json                              (NxDI config)
#   OUTPUT_PATH/weights/tp0_sharded_checkpoint.safetensors      (rank 0 weights)
#   OUTPUT_PATH/weights/tp1_sharded_checkpoint.safetensors      (rank 1 weights)
```
Compilation takes approximately 8-9 minutes on inf2.8xlarge.
## Base Model
- Model: Qwen/Qwen2.5-7B-Instruct
- Architecture: Qwen2 (decoder-only transformer)
- Parameters: 7.6B
- License: Apache 2.0
## Acknowledgments
Part of the Flav-benchmark project benchmarking Qwen2.5 inference across Neuron frameworks (NxDI, vLLM-neuron, optimum-neuron) and GPU baselines.