Qwen2.5-7B-Instruct Pre-Compiled for AWS Inferentia2 (TP=2)

Pre-compiled and pre-sharded Qwen2.5-7B-Instruct for AWS Neuron SDK 2.28, ready to load on inf2.xlarge (16 GB system RAM) or any larger Inferentia2/Trainium instance.

Why Pre-Sharded?

The standard NxDI load path reads the full HuggingFace checkpoint (~14 GB in BF16) into CPU RAM for weight conversion and sharding. On inf2.xlarge (16 GB system RAM), this triggers an OOM kill at ~14 GB RSS.

Pre-sharded weights bypass this entirely: NxDI reads directly from the per-rank sharded files, peaking at ~13.5 GB RSS during load (tight but viable on 16 GB) and settling to ~4.3 GB RSS after device transfer.
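A back-of-envelope memory budget makes the difference concrete. The peaks below are the RSS figures quoted above; the OS overhead constant is an assumption for illustration, not a measurement:

```python
# Rough memory budget for the two load paths on inf2.xlarge, using the
# RSS figures quoted above. Illustrative arithmetic, not a profiler.

SYSTEM_RAM_GB = 16.0       # inf2.xlarge host memory
STANDARD_PEAK_GB = 14.0    # full BF16 checkpoint resident during sharding
PRESHARDED_PEAK_GB = 13.5  # observed peak RSS with pre-sharded weights
OS_OVERHEAD_GB = 2.5       # assumed kernel/page-cache/runtime overhead

def fits(peak_gb: float) -> bool:
    """True if the load peak plus assumed OS overhead fits in host RAM."""
    return peak_gb + OS_OVERHEAD_GB <= SYSTEM_RAM_GB

print(fits(STANDARD_PEAK_GB))    # False: standard path is over budget
print(fits(PRESHARDED_PEAK_GB))  # True: pre-sharded path is tight but viable
```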

Contents

| File | Size | Description |
|------|------|-------------|
| model.pt | ~153 MB | Compiled Neuron NEFF graphs |
| neuron_config.json | ~8 KB | NxDI configuration (TP=2, BS=1, seq_len=8192, BF16) |
| weights/tp0_sharded_checkpoint.safetensors | ~7.3 GB | Pre-sharded model weights for rank 0 |
| weights/tp1_sharded_checkpoint.safetensors | ~7.3 GB | Pre-sharded model weights for rank 1 |
| config.json | <1 KB | HuggingFace model config |
| tokenizer.json | ~6.8 MB | Tokenizer |
| tokenizer_config.json | ~7 KB | Tokenizer configuration |
| generation_config.json | <1 KB | Default generation parameters |
| vocab.json | ~2.7 MB | Vocabulary |
| merges.txt | ~1.6 MB | BPE merges |

Performance

Measured on inf2.xlarge (2 NeuronCores, 32 GB HBM, 4 vCPU, 16 GB system RAM):

| Metric | Value |
|--------|-------|
| Throughput | 24.1 tok/s |
| Latency (4K in / 4K out) | 169.8 s |
| Load time | ~330 s |
| Peak RSS during load | ~13.5 GB |
| RSS after load | ~4.3 GB |
| Cost | $8.76/M output tokens at $0.76/hr |

Benchmark: batch_size=1, 4095 input tokens, 4096 output tokens, greedy decoding, 2 warmup + 10 measured requests.
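The throughput and cost figures follow directly from the benchmark latency and the instance's hourly rate; a quick sanity check:

```python
# Reproduce the headline throughput and cost numbers from the table
# above: 4096 output tokens in 169.8 s, inf2.xlarge at $0.76/hr.

OUTPUT_TOKENS = 4096
LATENCY_S = 169.8
HOURLY_RATE_USD = 0.76

throughput = round(OUTPUT_TOKENS / LATENCY_S, 1)          # tok/s, as reported
cost_per_m = HOURLY_RATE_USD / (throughput * 3600) * 1e6  # $ per M output tokens

print(f"{throughput} tok/s")                 # 24.1 tok/s
print(f"${cost_per_m:.2f}/M output tokens")  # $8.76/M output tokens
```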

Quick Start

Prerequisites

  • AWS instance with Inferentia2: inf2.xlarge (minimum), inf2.8xlarge, or larger
  • Deep Learning AMI Neuron (Ubuntu 24.04) 20260227 (SDK 2.28)
  • Activate the pre-installed venv:
    source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
    

1. Download the model

```shell
pip install -q huggingface_hub
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download('jburtoft/Qwen2.5-7B-Instruct-Neuron-TP2',
                  local_dir='/data/Qwen2.5-7B-Instruct-Neuron-TP2')
"
```

2. Load and run inference

```python
import os
import torch
from transformers import AutoTokenizer, GenerationConfig
from neuronx_distributed_inference.models.config import NeuronConfig, OnDeviceSamplingConfig
from neuronx_distributed_inference.models.qwen2.modeling_qwen2 import (
    NeuronQwen2ForCausalLM, Qwen2InferenceConfig,
)
from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config
from neuronx_distributed_inference.utils.accuracy import get_generate_outputs

MODEL_DIR = "/data/Qwen2.5-7B-Instruct-Neuron-TP2"

os.environ["NEURON_LOGICAL_NC_CONFIG"] = "1"

neuron_config = NeuronConfig(
    tp_degree=2,
    batch_size=1,
    seq_len=8192,
    n_positions=8192,
    max_context_length=8192,
    torch_dtype=torch.bfloat16,
    on_device_sampling_config=OnDeviceSamplingConfig(),
    fused_qkv=True,
    attn_kernel_enabled=False,   # inf2 does not support flash attention
    enable_bucketing=True,
    logical_nc_config=1,         # inf2 requires LNC=1
    save_sharded_checkpoint=True, # must match how the model was compiled
)

config = Qwen2InferenceConfig(
    neuron_config,
    load_config=load_pretrained_config(MODEL_DIR),
)

model = NeuronQwen2ForCausalLM(MODEL_DIR, config)
model.load(MODEL_DIR)  # loads from pre-sharded weights

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, padding_side="right")
tokenizer.pad_token = tokenizer.eos_token

# Generate
prompt = "Explain quantum computing in simple terms."
generation_config = GenerationConfig(
    max_new_tokens=256,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)

outputs, decoded_texts = get_generate_outputs(
    model, [prompt], tokenizer, is_hf=False, generation_config=generation_config,
)

print(decoded_texts[0])
```
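Qwen2.5-7B-Instruct is a chat model, so for best results wrap prompts in its chat template rather than passing raw text. In practice use `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`, which reads the authoritative template from `tokenizer_config.json`; the hand-rolled sketch below only illustrates the ChatML-style layout these models are trained on (the default system message here is a placeholder, not Qwen's actual default):

```python
# Illustrative ChatML layout for a single user turn. Prefer
# tokenizer.apply_chat_template in real code; the system message
# below is a placeholder assumption.

def build_chatml_prompt(user_msg: str,
                        system_msg: str = "You are a helpful assistant.") -> str:
    """Wrap one user turn in ChatML markers, ending with an open
    assistant turn so generation continues as the assistant."""
    return (
        f"<|im_start|>system\n{system_msg}<|im_end|>\n"
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt("Explain quantum computing in simple terms.")
print(prompt)
```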

3. Important notes

  • LNC=1 is required on inf2. Set NEURON_LOGICAL_NC_CONFIG=1 before loading.
  • Flash attention is not supported on inf2 (trn1-era cores). Use attn_kernel_enabled=False.
  • First load takes ~5-6 minutes as the sharded weights (14.6 GB total) are read from disk and transferred to device.
  • First import may be slow (~3-5 min) on a fresh DLAMI instance due to library rehydration.
  • save_sharded_checkpoint=True must be set in the NeuronConfig when loading: this tells NxDI to use the per-rank sharded files instead of the standard HF checkpoint loading path.
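Because the NeuronConfig at load time must match the compiled artifact, it can help to sanity-check the bundled neuron_config.json before loading. A minimal sketch; the key names mirror the NeuronConfig fields shown above but are assumptions about the JSON layout, so adjust them to the actual file:

```python
import json

# Sketch: check a parsed neuron_config.json against the inf2 settings
# used in this card. Key names are assumed to mirror NeuronConfig fields.

REQUIRED = {"tp_degree": 2, "batch_size": 1, "seq_len": 8192,
            "logical_nc_config": 1, "save_sharded_checkpoint": True}

def config_mismatches(cfg: dict) -> dict:
    """Return {key: (expected, actual)} for every constraint violated."""
    return {k: (v, cfg.get(k)) for k, v in REQUIRED.items() if cfg.get(k) != v}

# Usage (path per the download step above):
# cfg = json.load(open("/data/Qwen2.5-7B-Instruct-Neuron-TP2/neuron_config.json"))
# assert not config_mismatches(cfg), config_mismatches(cfg)
```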

Compilation Details

| Parameter | Value |
|-----------|-------|
| SDK | 2.28 (NxDI 0.8.0, neuronx-cc 2.22, torch-neuronx 2.9.0) |
| TP degree | 2 |
| Batch size | 1 |
| Sequence length | 8192 |
| Dtype | bfloat16 |
| Flash attention | Disabled (inf2 constraint) |
| LNC | 1 (inf2 constraint) |
| save_sharded_checkpoint | True |
| Compiled on | inf2.8xlarge (32 vCPU, 128 GB RAM) |

Compiling Your Own

To compile for different configurations (e.g., different TP, batch size, or sequence length), use a larger instance (inf2.8xlarge or trn2.3xlarge):

```python
import os
import torch
from neuronx_distributed_inference.models.config import NeuronConfig, OnDeviceSamplingConfig
from neuronx_distributed_inference.models.qwen2.modeling_qwen2 import (
    NeuronQwen2ForCausalLM, Qwen2InferenceConfig,
)
from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config

# Download the base model first
# pip install huggingface_hub
# from huggingface_hub import snapshot_download
# snapshot_download("Qwen/Qwen2.5-7B-Instruct", local_dir="/data/models/Qwen2.5-7B-Instruct")

MODEL_PATH = "/data/models/Qwen2.5-7B-Instruct"
OUTPUT_PATH = "/data/compiled/Qwen2.5-7B-TP2-sharded"

os.environ["NEURON_LOGICAL_NC_CONFIG"] = "1"

neuron_config = NeuronConfig(
    tp_degree=2,               # adjust as needed
    batch_size=1,              # adjust as needed
    seq_len=8192,              # adjust as needed
    n_positions=8192,
    max_context_length=8192,
    torch_dtype=torch.bfloat16,
    on_device_sampling_config=OnDeviceSamplingConfig(),
    fused_qkv=True,
    attn_kernel_enabled=False, # False for inf2, True for trn2
    enable_bucketing=True,
    logical_nc_config=1,       # 1 for inf2, 1 or 2 for trn2
    save_sharded_checkpoint=True,  # REQUIRED for pre-sharded deployment
)

config = Qwen2InferenceConfig(
    neuron_config,
    load_config=load_pretrained_config(MODEL_PATH),
)

model = NeuronQwen2ForCausalLM(MODEL_PATH, config)
model.compile(OUTPUT_PATH)

# Output:
#   OUTPUT_PATH/model.pt                                     (compiled NEFFs)
#   OUTPUT_PATH/neuron_config.json                           (NxDI config)
#   OUTPUT_PATH/weights/tp0_sharded_checkpoint.safetensors   (rank 0 weights)
#   OUTPUT_PATH/weights/tp1_sharded_checkpoint.safetensors   (rank 1 weights)
```

Compilation takes approximately 8-9 minutes on inf2.8xlarge.
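After compilation, you can confirm the output directory contains the artifact layout listed above before shipping it. A small helper sketch; `missing_artifacts` is a hypothetical name introduced here, with file paths taken from the Contents table:

```python
from pathlib import Path

# Check that a compiled output directory contains the expected artifacts:
# model.pt, neuron_config.json, and one sharded checkpoint per TP rank.

def missing_artifacts(output_dir: str, tp_degree: int = 2) -> list[str]:
    """Return the relative paths of any expected artifacts not on disk."""
    root = Path(output_dir)
    expected = ["model.pt", "neuron_config.json"] + [
        f"weights/tp{rank}_sharded_checkpoint.safetensors"
        for rank in range(tp_degree)
    ]
    return [rel for rel in expected if not (root / rel).exists()]

# Usage:
# assert not missing_artifacts("/data/compiled/Qwen2.5-7B-TP2-sharded")
```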

Base Model

Qwen/Qwen2.5-7B-Instruct

Acknowledgments

Part of the Flav-benchmark project, which benchmarks Qwen2.5 inference across Neuron frameworks (NxDI, vLLM-neuron, optimum-neuron) and GPU baselines.
