Qwen2.5-7B-Instruct Pre-Compiled for AWS Inferentia2 (TP=2)

Pre-compiled and pre-sharded Qwen2.5-7B-Instruct for AWS Neuron SDK 2.28, ready to load on inf2.xlarge (16 GB system RAM) or any larger Inferentia2/Trainium instance.

Why Pre-Sharded?

The standard NxDI load path reads the full HuggingFace checkpoint (~14 GB in BF16) into CPU RAM for weight conversion and sharding. On inf2.xlarge (16 GB system RAM), this triggers an OOM kill at ~14 GB RSS.

Pre-sharded weights bypass this entirely: NxDI reads directly from the per-rank sharded files, peaking at ~13.5 GB RSS during load (tight but viable on 16 GB) and settling to ~4.3 GB RSS after device transfer.
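A back-of-envelope memory budget makes the difference concrete. The peaks below are the RSS figures quoted above; the OS overhead constant is an assumption for illustration, not a measurement:

```python
# Rough memory budget for the two load paths on inf2.xlarge, using the
# RSS figures quoted above. Illustrative arithmetic, not a profiler.

SYSTEM_RAM_GB = 16.0       # inf2.xlarge host memory
STANDARD_PEAK_GB = 14.0    # full BF16 checkpoint resident during sharding
PRESHARDED_PEAK_GB = 13.5  # observed peak RSS with pre-sharded weights
OS_OVERHEAD_GB = 2.5       # assumed kernel/page-cache/runtime overhead

def fits(peak_gb: float) -> bool:
    """True if the load peak plus assumed OS overhead fits in host RAM."""
    return peak_gb + OS_OVERHEAD_GB <= SYSTEM_RAM_GB

print(fits(STANDARD_PEAK_GB))    # False: standard path is over budget
print(fits(PRESHARDED_PEAK_GB))  # True: pre-sharded path is tight but viable
```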

Contents

| File | Size | Description |
|------|------|-------------|
| model.pt | ~153 MB | Compiled Neuron NEFF graphs |
| neuron_config.json | ~8 KB | NxDI configuration (TP=2, BS=1, seq_len=8192, BF16) |
| weights/tp0_sharded_checkpoint.safetensors | ~7.3 GB | Pre-sharded model weights for rank 0 |
| weights/tp1_sharded_checkpoint.safetensors | ~7.3 GB | Pre-sharded model weights for rank 1 |
| config.json | <1 KB | HuggingFace model config |
| tokenizer.json | ~6.8 MB | Tokenizer |
| tokenizer_config.json | ~7 KB | Tokenizer configuration |
| generation_config.json | <1 KB | Default generation parameters |
| vocab.json | ~2.7 MB | Vocabulary |
| merges.txt | ~1.6 MB | BPE merges |

Performance

Measured on inf2.xlarge (2 NeuronCores, 32 GB HBM, 4 vCPU, 16 GB system RAM):

| Metric | Value |
|--------|-------|
| Throughput | 24.1 tok/s |
| Latency (4K in / 4K out) | 169.8 s |
| Load time | ~330 s |
| Peak RSS during load | ~13.5 GB |
| RSS after load | ~4.3 GB |
| Cost | $8.76/M output tokens at $0.76/hr |

Benchmark: batch_size=1, 4095 input tokens, 4096 output tokens, greedy decoding, 2 warmup + 10 measured requests.
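The throughput and cost figures follow directly from the benchmark latency and the instance's hourly rate; a quick sanity check:

```python
# Reproduce the headline throughput and cost numbers from the table
# above: 4096 output tokens in 169.8 s, inf2.xlarge at $0.76/hr.

OUTPUT_TOKENS = 4096
LATENCY_S = 169.8
HOURLY_RATE_USD = 0.76

throughput = round(OUTPUT_TOKENS / LATENCY_S, 1)          # tok/s, as reported
cost_per_m = HOURLY_RATE_USD / (throughput * 3600) * 1e6  # $ per M output tokens

print(f"{throughput} tok/s")                 # 24.1 tok/s
print(f"${cost_per_m:.2f}/M output tokens")  # $8.76/M output tokens
```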

Quick Start

Prerequisites

  • AWS instance with Inferentia2: inf2.xlarge (minimum), inf2.8xlarge, or larger
  • Deep Learning AMI Neuron (Ubuntu 24.04) 20260227 (SDK 2.28)
  • Activate the pre-installed venv:
    source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
    

1. Download the model

```shell
pip install -q huggingface_hub
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download('jburtoft/Qwen2.5-7B-Instruct-Neuron-TP2',
                  local_dir='/data/Qwen2.5-7B-Instruct-Neuron-TP2')
"
```

2. Load and run inference

```python
import os
import torch
from transformers import AutoTokenizer, GenerationConfig
from neuronx_distributed_inference.models.config import NeuronConfig, OnDeviceSamplingConfig
from neuronx_distributed_inference.models.qwen2.modeling_qwen2 import (
    NeuronQwen2ForCausalLM, Qwen2InferenceConfig,
)
from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config
from neuronx_distributed_inference.utils.accuracy import get_generate_outputs

MODEL_DIR = "/data/Qwen2.5-7B-Instruct-Neuron-TP2"

os.environ["NEURON_LOGICAL_NC_CONFIG"] = "1"

neuron_config = NeuronConfig(
    tp_degree=2,
    batch_size=1,
    seq_len=8192,
    n_positions=8192,
    max_context_length=8192,
    torch_dtype=torch.bfloat16,
    on_device_sampling_config=OnDeviceSamplingConfig(),
    fused_qkv=True,
    attn_kernel_enabled=False,   # inf2 does not support flash attention
    enable_bucketing=True,
    logical_nc_config=1,         # inf2 requires LNC=1
    save_sharded_checkpoint=True, # must match how the model was compiled
)

config = Qwen2InferenceConfig(
    neuron_config,
    load_config=load_pretrained_config(MODEL_DIR),
)

model = NeuronQwen2ForCausalLM(MODEL_DIR, config)
model.load(MODEL_DIR)  # loads from pre-sharded weights

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, padding_side="right")
tokenizer.pad_token = tokenizer.eos_token

# Generate
prompt = "Explain quantum computing in simple terms."
generation_config = GenerationConfig(
    max_new_tokens=256,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)

outputs, decoded_texts = get_generate_outputs(
    model, [prompt], tokenizer, is_hf=False, generation_config=generation_config,
)

print(decoded_texts[0])
```
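Qwen2.5-7B-Instruct is a chat model, so for best results wrap prompts in its chat template rather than passing raw text. In practice use `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`, which reads the authoritative template from `tokenizer_config.json`; the hand-rolled sketch below only illustrates the ChatML-style layout these models are trained on (the default system message here is a placeholder, not Qwen's actual default):

```python
# Illustrative ChatML layout for a single user turn. Prefer
# tokenizer.apply_chat_template in real code; the system message
# below is a placeholder assumption.

def build_chatml_prompt(user_msg: str,
                        system_msg: str = "You are a helpful assistant.") -> str:
    """Wrap one user turn in ChatML markers, ending with an open
    assistant turn so generation continues as the assistant."""
    return (
        f"<|im_start|>system\n{system_msg}<|im_end|>\n"
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt("Explain quantum computing in simple terms.")
print(prompt)
```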

3. Important notes

  • LNC=1 is required on inf2. Set NEURON_LOGICAL_NC_CONFIG=1 before loading.
  • Flash attention is not supported on inf2 (trn1-era cores). Use attn_kernel_enabled=False.
  • First load takes ~5-6 minutes as the sharded weights (14.6 GB total) are read from disk and transferred to device.
  • First import may be slow (~3-5 min) on a fresh DLAMI instance due to library rehydration.
  • save_sharded_checkpoint=True must be set in the NeuronConfig when loading: this tells NxDI to use the per-rank sharded files instead of the standard HF checkpoint loading path.
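Because the NeuronConfig at load time must match the compiled artifact, it can help to sanity-check the bundled neuron_config.json before loading. A minimal sketch; the key names mirror the NeuronConfig fields shown above but are assumptions about the JSON layout, so adjust them to the actual file:

```python
import json

# Sketch: check a parsed neuron_config.json against the inf2 settings
# used in this card. Key names are assumed to mirror NeuronConfig fields.

REQUIRED = {"tp_degree": 2, "batch_size": 1, "seq_len": 8192,
            "logical_nc_config": 1, "save_sharded_checkpoint": True}

def config_mismatches(cfg: dict) -> dict:
    """Return {key: (expected, actual)} for every constraint violated."""
    return {k: (v, cfg.get(k)) for k, v in REQUIRED.items() if cfg.get(k) != v}

# Usage (path per the download step above):
# cfg = json.load(open("/data/Qwen2.5-7B-Instruct-Neuron-TP2/neuron_config.json"))
# assert not config_mismatches(cfg), config_mismatches(cfg)
```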

Compilation Details

| Parameter | Value |
|-----------|-------|
| SDK | 2.28 (NxDI 0.8.0, neuronx-cc 2.22, torch-neuronx 2.9.0) |
| TP degree | 2 |
| Batch size | 1 |
| Sequence length | 8192 |
| Dtype | bfloat16 |
| Flash attention | Disabled (inf2 constraint) |
| LNC | 1 (inf2 constraint) |
| save_sharded_checkpoint | True |
| Compiled on | inf2.8xlarge (32 vCPU, 128 GB RAM) |

Compiling Your Own

To compile for different configurations (e.g., different TP, batch size, or sequence length), use a larger instance (inf2.8xlarge or trn2.3xlarge):

```python
import os
import torch
from neuronx_distributed_inference.models.config import NeuronConfig, OnDeviceSamplingConfig
from neuronx_distributed_inference.models.qwen2.modeling_qwen2 import (
    NeuronQwen2ForCausalLM, Qwen2InferenceConfig,
)
from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config

# Download the base model first
# pip install huggingface_hub
# from huggingface_hub import snapshot_download
# snapshot_download("Qwen/Qwen2.5-7B-Instruct", local_dir="/data/models/Qwen2.5-7B-Instruct")

MODEL_PATH = "/data/models/Qwen2.5-7B-Instruct"
OUTPUT_PATH = "/data/compiled/Qwen2.5-7B-TP2-sharded"

os.environ["NEURON_LOGICAL_NC_CONFIG"] = "1"

neuron_config = NeuronConfig(
    tp_degree=2,               # adjust as needed
    batch_size=1,              # adjust as needed
    seq_len=8192,              # adjust as needed
    n_positions=8192,
    max_context_length=8192,
    torch_dtype=torch.bfloat16,
    on_device_sampling_config=OnDeviceSamplingConfig(),
    fused_qkv=True,
    attn_kernel_enabled=False, # False for inf2, True for trn2
    enable_bucketing=True,
    logical_nc_config=1,       # 1 for inf2, 1 or 2 for trn2
    save_sharded_checkpoint=True,  # REQUIRED for pre-sharded deployment
)

config = Qwen2InferenceConfig(
    neuron_config,
    load_config=load_pretrained_config(MODEL_PATH),
)

model = NeuronQwen2ForCausalLM(MODEL_PATH, config)
model.compile(OUTPUT_PATH)

# Output:
#   OUTPUT_PATH/model.pt                                     (compiled NEFFs)
#   OUTPUT_PATH/neuron_config.json                           (NxDI config)
#   OUTPUT_PATH/weights/tp0_sharded_checkpoint.safetensors   (rank 0 weights)
#   OUTPUT_PATH/weights/tp1_sharded_checkpoint.safetensors   (rank 1 weights)
```

Compilation takes approximately 8-9 minutes on inf2.8xlarge.
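After compilation, you can confirm the output directory contains the artifact layout listed above before shipping it. A small helper sketch; `missing_artifacts` is a hypothetical name introduced here, with file paths taken from the Contents table:

```python
from pathlib import Path

# Check that a compiled output directory contains the expected artifacts:
# model.pt, neuron_config.json, and one sharded checkpoint per TP rank.

def missing_artifacts(output_dir: str, tp_degree: int = 2) -> list[str]:
    """Return the relative paths of any expected artifacts not on disk."""
    root = Path(output_dir)
    expected = ["model.pt", "neuron_config.json"] + [
        f"weights/tp{rank}_sharded_checkpoint.safetensors"
        for rank in range(tp_degree)
    ]
    return [rel for rel in expected if not (root / rel).exists()]

# Usage:
# assert not missing_artifacts("/data/compiled/Qwen2.5-7B-TP2-sharded")
```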

Base Model

Qwen/Qwen2.5-7B-Instruct

Acknowledgments

Part of the Flav-benchmark project, which benchmarks Qwen2.5 inference across Neuron frameworks (NxDI, vLLM-neuron, optimum-neuron) and GPU baselines.
