# Mistral-7B-Instruct-v0.3: Pre-compiled for AWS Inferentia2
Pre-compiled Neuron artifacts for Mistral-7B-Instruct-v0.3, ready to run on AWS Inferentia2 with vLLM + vllm-neuron.
No compilation needed: it loads directly on inf2.xlarge or inf2.8xlarge.
## Compilation parameters
| Parameter | Value |
|---|---|
| tensor-parallel-size | 2 |
| max-model-len | 4096 |
| max-num-seqs | 4 |
| block-size | 32 |
| save_sharded_checkpoint | true |
| NEURON_CC_FLAGS | -O1 |
| Neuron SDK | 2.x (NxDI >= 0.7) |
| vLLM | 0.13.0 |
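A quick way to confirm that downloaded artifacts match this table is to compare `neuron-compiled-artifacts/neuron_config.json` against the expected values. This is only a sketch: the key names used below (`tp_degree`, `max_context_length`, `max_batch_size`) are assumptions about the NxDI config schema, so adjust them to whatever your `neuron_config.json` actually contains.

```python
import json
from pathlib import Path

# Expected values from the compilation-parameters table above.
# NOTE: these key names are an assumption about the NxDI schema,
# not taken from this repo's neuron_config.json.
EXPECTED = {
    "tp_degree": 2,
    "max_context_length": 4096,
    "max_batch_size": 4,
}

def check_neuron_config(config: dict, expected: dict = EXPECTED) -> list:
    """Return a list of mismatches between a compiled config and expectations.

    Keys missing from the config are skipped rather than reported, since
    the exact schema varies across NxDI versions.
    """
    mismatches = []
    for key, want in expected.items():
        have = config.get(key)
        if have is not None and have != want:
            mismatches.append(f"{key}: compiled={have}, expected={want}")
    return mismatches

if __name__ == "__main__":
    path = Path("neuron-compiled-artifacts/neuron_config.json")
    if path.exists():
        problems = check_neuron_config(json.loads(path.read_text()))
        print(problems or "config matches compilation parameters")
```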
## Repo structure

```
.
├── config.json, tokenizer.json, ...   # Model config + tokenizer (from base model)
├── model.safetensors                  # Dummy (161 bytes), required by transformers validation
└── neuron-compiled-artifacts/
    ├── model.pt                       # Compiled NEFF (128 MB)
    ├── neuron_config.json             # NxDI configuration
    └── weights/
        ├── tp0_sharded_checkpoint.safetensors  # 6.8 GB, rank 0
        └── tp1_sharded_checkpoint.safetensors  # 6.8 GB, rank 1
```
## Why `model.safetensors` is a dummy

The transformers library performs a hard-coded validation check for standard weight files (`model.safetensors`, `pytorch_model.bin`, etc.) before any custom model loader can take over. The 161-byte dummy file satisfies that check. The actual weights are the pre-sharded safetensors in `neuron-compiled-artifacts/weights/`.
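For illustration, a valid-but-empty safetensors file can be produced with nothing but the stdlib: the format is an 8-byte little-endian header length followed by a JSON header, with tensor data (none here) after it. The exact contents of this repo's 161-byte dummy are not reproduced here, so the `metadata` payload below is an assumption.

```python
import json
import struct

def write_dummy_safetensors(path, metadata=None):
    """Write a structurally valid safetensors file containing zero tensors.

    Format: 8-byte little-endian u64 header length, then a UTF-8 JSON
    header, then tensor data (empty here). Returns the file size in bytes.
    The __metadata__ payload is illustrative; the repo's actual 161-byte
    dummy may carry different metadata.
    """
    header = {"__metadata__": metadata} if metadata else {}
    blob = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)))
        f.write(blob)
    return 8 + len(blob)

size = write_dummy_safetensors("model.safetensors", {"format": "pt"})
print(size)  # a few dozen bytes: enough to pass the existence check
```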
## Why `save_sharded_checkpoint`

With sharded checkpoints, each NeuronCore rank loads only its own ~7 GB shard instead of the full 14 GB model. This halves peak system RAM usage, making inf2.xlarge (16 GB RAM) viable for a 7B model.
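The back-of-envelope math behind that claim, assuming an approximate 7.2B-parameter model held in 2-byte (bf16) precision:

```python
# Rough host-RAM estimate; 7.2e9 is an assumed approximate parameter
# count for Mistral-7B, not an exact figure from this repo.
PARAMS = 7.2e9
BYTES_PER_PARAM = 2    # bf16
TP_DEGREE = 2

full_checkpoint_gb = PARAMS * BYTES_PER_PARAM / 1e9  # ~14.4 GB loaded by one process
per_rank_gb = full_checkpoint_gb / TP_DEGREE         # ~7.2 GB per rank when pre-sharded

print(f"unsharded peak ~{full_checkpoint_gb:.1f} GB, "
      f"sharded peak per rank ~{per_rank_gb:.1f} GB")
```

Loading ~14 GB into a 16 GB instance leaves almost no headroom for the OS and runtime, while ~7 GB per rank does.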
## Usage with vLLM

This repo is designed to work with [vllm-neuron-rosa](https://github.com/fjcloud/vllm-neuron-rosa), which provides a custom `entrypoint.sh` that:

- runs `snapshot_download` to fetch the full repo (including `neuron-compiled-artifacts/`)
- sets `NEURON_COMPILED_ARTIFACTS` to bypass NxDI's config hash lookup
- launches vLLM with `--model` pointing to the local download
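The steps above can be sketched in Python as a helper that assembles the environment and command line such an entrypoint would launch. This is a minimal illustration, not the actual `entrypoint.sh`: the function name and model-dir path are hypothetical, and the real script may differ.

```python
import os
import shlex

def build_vllm_launch(model_dir):
    """Assemble the env + argv an entrypoint like vllm-neuron-rosa's would
    use after snapshot_download has fetched the repo into model_dir.
    Flag values mirror the compilation parameters in this README."""
    env = dict(os.environ)
    # Point NxDI at the pre-compiled artifacts so it skips the hash lookup
    env["NEURON_COMPILED_ARTIFACTS"] = os.path.join(model_dir, "neuron-compiled-artifacts")
    argv = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_dir,
        "--tensor-parallel-size", "2",
        "--max-model-len", "4096",
        "--max-num-seqs", "4",
        "--block-size", "32",
    ]
    return env, argv

env, argv = build_vllm_launch("/models/mistral-7b-neuron")  # hypothetical path
print(shlex.join(argv))
```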
## Deploy on OpenShift / ROSA

```bash
# Prerequisites (NFD, KMM, Neuron operators)
oc apply -k https://github.com/fjcloud/vllm-neuron-rosa/deploy/prereqs

# Create namespace
oc new-project neuron-inference

# Deploy
oc apply -k https://github.com/fjcloud/vllm-neuron-rosa/deploy -n neuron-inference

# Build image (first time)
oc start-build vllm-neuron -n neuron-inference --follow
```
## Standalone vLLM (if you handle the download yourself)

```bash
# 1. Download the full repo
huggingface-cli download fjcloud/Mistral-7B-Instruct-v0.3-neuron-inf2-tp2 --local-dir ./model

# 2. Run vLLM with NEURON_COMPILED_ARTIFACTS set
export NEURON_COMPILED_ARTIFACTS=./model/neuron-compiled-artifacts
python -m vllm.entrypoints.openai.api_server \
  --model ./model \
  --tensor-parallel-size 2 \
  --max-model-len 4096 \
  --max-num-seqs 4 \
  --block-size 32 \
  --num-gpu-blocks-override 4 \
  --no-enable-prefix-caching \
  --additional-config '{"override_neuron_config": {"save_sharded_checkpoint": true}}'
```
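Once the server is up, it speaks the OpenAI-compatible API on vLLM's default port 8000 (an assumption; pass `--port` to change it). A stdlib-only sketch of building a completion request — the prompt format shown is Mistral's `[INST]` convention:

```python
import json
import urllib.request

def completion_request(prompt, base_url="http://localhost:8000"):
    """Build an OpenAI-style /v1/completions request for the local server.
    The model name must match the --model value passed to vLLM."""
    payload = {
        "model": "./model",
        "prompt": prompt,
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = completion_request("[INST] What is AWS Inferentia2? [/INST]")
# urllib.request.urlopen(req) would return the OpenAI-style JSON response
print(req.full_url)
```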
## Hardware requirements

| Instance | RAM | Works? | Notes |
|---|---|---|---|
| inf2.xlarge | 16 GB | Yes | With pre-sharded weights |
| inf2.8xlarge | 128 GB | Yes | Also suitable for recompilation |
## License

Same as the base model: Mistral License.
## Base model

mistralai/Mistral-7B-v0.3