# Mistral-7B-Instruct-v0.3: Pre-compiled for AWS Inferentia2

Pre-compiled Neuron artifacts for Mistral-7B-Instruct-v0.3, ready to run on AWS Inferentia2 with vLLM + vllm-neuron.

No compilation needed: the artifacts load directly on inf2.xlarge or inf2.8xlarge.

## Compilation parameters

| Parameter | Value |
|---|---|
| `tensor-parallel-size` | 2 |
| `max-model-len` | 4096 |
| `max-num-seqs` | 4 |
| `block-size` | 32 |
| `save_sharded_checkpoint` | true |
| `NEURON_CC_FLAGS` | `-O1` |
| Neuron SDK | 2.x (NxDI >= 0.7) |
| vLLM | 0.13.0 |
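For intuition, the sequence and block settings above relate through vLLM's paged KV cache: each sequence's cache is carved into `block-size`-token blocks. The sketch below is plain arithmetic on the table's values, not a Neuron API call; the Neuron backend may budget blocks differently than this generic calculation suggests.

```python
# Generic paged-KV-cache arithmetic from the compilation parameters above.
max_model_len = 4096  # tokens per sequence
block_size = 32       # tokens per KV-cache block
max_num_seqs = 4      # concurrent sequences

blocks_per_seq = max_model_len // block_size
total_blocks = blocks_per_seq * max_num_seqs
print(blocks_per_seq, total_blocks)  # 128 512
```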

## Repo structure

```
.
├── config.json, tokenizer.json, ...        # Model config + tokenizer (from base model)
├── model.safetensors                       # Dummy (161 bytes); required by transformers validation
└── neuron-compiled-artifacts/
    ├── model.pt                            # Compiled NEFF (128 MB)
    ├── neuron_config.json                  # NxDI configuration
    └── weights/
        ├── tp0_sharded_checkpoint.safetensors   # 6.8 GB, rank 0
        └── tp1_sharded_checkpoint.safetensors   # 6.8 GB, rank 1
```

## Why `model.safetensors` is a dummy

The `transformers` library performs a hard-coded validation check for standard weight files (`model.safetensors`, `pytorch_model.bin`, etc.) before any custom model loader can take over. This 161-byte dummy file satisfies that check. The actual weights are the pre-sharded safetensors in `neuron-compiled-artifacts/weights/`.
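For reference, a placeholder like this can be produced with only the standard library, since the safetensors format is just an 8-byte little-endian header length, a JSON header, and raw tensor bytes. The `dummy` tensor name and exact byte count below are illustrative, not necessarily what this repo's file contains.

```python
import json
import struct

# Build a minimal valid safetensors file by hand:
# [8-byte LE header length][JSON header][tensor data].
header = json.dumps(
    {"dummy": {"dtype": "F32", "shape": [1], "data_offsets": [0, 4]}}
).encode()
payload = struct.pack("<Q", len(header)) + header + b"\x00" * 4  # one zero f32

with open("model.safetensors", "wb") as f:
    f.write(payload)
print(len(payload))  # well under 1 KB
```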

## Why `save_sharded_checkpoint`

With sharded checkpoints, each NeuronCore rank loads only its ~7 GB shard instead of the full 14 GB model. This cuts peak system RAM usage in half, making inf2.xlarge (16 GB RAM) viable for a 7B model.
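The back-of-the-envelope math behind that claim, assuming roughly 7.2B parameters stored in 16-bit precision (the shards in this repo are 6.8 GB each, so the approximation is close):

```python
# Per-rank memory estimate for tensor-parallel loading.
params = 7.2e9        # approximate parameter count of a 7B model
bytes_per_param = 2   # bf16/fp16
tp_degree = 2         # tensor-parallel-size used for this compilation

full_gb = params * bytes_per_param / 1e9
per_rank_gb = full_gb / tp_degree
print(f"full: {full_gb:.1f} GB, per rank: {per_rank_gb:.1f} GB")
# full: 14.4 GB, per rank: 7.2 GB
```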

## Usage with vLLM

This repo is designed to work with vllm-neuron-rosa, which provides a custom `entrypoint.sh` that:

1. Runs `snapshot_download` to fetch the full repo (including `neuron-compiled-artifacts/`)
2. Sets `NEURON_COMPILED_ARTIFACTS` to bypass NxDI's config-hash lookup
3. Launches vLLM with `--model` pointing to the local download

## Deploy on OpenShift / ROSA

```bash
# Prerequisites (NFD, KMM, Neuron operators)
oc apply -k https://github.com/fjcloud/vllm-neuron-rosa/deploy/prereqs

# Create namespace
oc new-project neuron-inference

# Deploy
oc apply -k https://github.com/fjcloud/vllm-neuron-rosa/deploy -n neuron-inference

# Build image (first time)
oc start-build vllm-neuron -n neuron-inference --follow
```

## Standalone vLLM (if you handle the download yourself)

```bash
# 1. Download the full repo
huggingface-cli download fjcloud/Mistral-7B-Instruct-v0.3-neuron-inf2-tp2 --local-dir ./model

# 2. Run vLLM with NEURON_COMPILED_ARTIFACTS set
export NEURON_COMPILED_ARTIFACTS=./model/neuron-compiled-artifacts
python -m vllm.entrypoints.openai.api_server \
  --model ./model \
  --tensor-parallel-size 2 \
  --max-model-len 4096 \
  --max-num-seqs 4 \
  --block-size 32 \
  --num-gpu-blocks-override 4 \
  --no-enable-prefix-caching \
  --additional-config '{"override_neuron_config": {"save_sharded_checkpoint": true}}'
```
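Once the server is up it speaks the standard OpenAI-compatible completions API on port 8000 (vLLM's default). A sketch of the request body you would POST to `/v1/completions`; the prompt uses Mistral's `[INST]` chat convention, and the prompt text and sampling values here are illustrative.

```python
import json

# Example body for POST http://localhost:8000/v1/completions
# (send with curl or any HTTP client once the server is running).
payload = {
    "model": "./model",  # matches the --model path passed to vLLM
    "prompt": "[INST] Summarize what AWS Inferentia2 is. [/INST]",
    "max_tokens": 128,
    "temperature": 0.2,
}
print(json.dumps(payload, indent=2))
```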

## Hardware requirements

| Instance | RAM | Works? | Notes |
|---|---|---|---|
| inf2.xlarge | 16 GB | Yes | With pre-sharded weights |
| inf2.8xlarge | 128 GB | Yes | Also suitable for recompilation |

## License

Same as the base model: Mistral License.
