Mistral-7B-Instruct-v0.3 — Pre-compiled for AWS Inferentia2
Pre-compiled Neuron artifacts for Mistral-7B-Instruct-v0.3, ready to run on AWS Inferentia2 with vLLM + vllm-neuron.
No compilation needed — loads directly on inf2.xlarge or inf2.8xlarge.
Compilation parameters
| Parameter | Value |
|---|---|
| tensor-parallel-size | 2 |
| max-model-len | 4096 |
| max-num-seqs | 4 |
| block-size | 32 |
| save_sharded_checkpoint | true |
| NEURON_CC_FLAGS | -O1 |
| Neuron SDK | 2.x (NxDI >= 0.7) |
| vLLM | 0.13.0 |
Repo structure
.
├── config.json, tokenizer.json, ... # Model config + tokenizer (from base model)
├── model.safetensors # Dummy (161 bytes) — required by transformers validation
└── neuron-compiled-artifacts/
    ├── model.pt # Compiled NEFF (128 MB)
    ├── neuron_config.json # NxDI configuration
    └── weights/
        ├── tp0_sharded_checkpoint.safetensors # 6.8 GB, rank 0
        └── tp1_sharded_checkpoint.safetensors # 6.8 GB, rank 1
Why model.safetensors is a dummy
The transformers library performs a hard-coded validation check for standard weight files (model.safetensors, pytorch_model.bin, etc.) before any custom model loader can take over. This 161-byte dummy file satisfies that check. The actual weights are the pre-sharded safetensors in neuron-compiled-artifacts/weights/.
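To check that a local download has this layout, one option is to list the weight files and their sizes. The following is a minimal sketch, assuming the repo has already been downloaded to a local directory. The function name `summarize_weights` and the 1 KB threshold for treating a file as a placeholder are illustrative choices, not something the repo defines.

```python
from pathlib import Path


def summarize_weights(model_dir, placeholder_max_bytes=1024):
    """Return (relative_path, size_in_bytes, is_placeholder) per .safetensors file.

    placeholder_max_bytes is an illustrative threshold chosen for this sketch.
    """
    root = Path(model_dir)
    rows = []
    for path in sorted(root.rglob("*.safetensors")):
        size = path.stat().st_size
        rel_path = path.relative_to(root).as_posix()
        rows.append((rel_path, size, size <= placeholder_max_bytes))
    return rows


if __name__ == "__main__":
    # "./model" is a hypothetical download location.
    for rel_path, size, is_placeholder in summarize_weights("./model"):
        label = "placeholder" if is_placeholder else "real weights"
        print(f"{rel_path}: {size} bytes ({label})")
```

On a complete download, the top-level model.safetensors should be flagged as the placeholder and the two files under neuron-compiled-artifacts/weights/ as real weights.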
Why save_sharded_checkpoint
With sharded checkpoints, each NeuronCore rank loads only its ~7 GB shard instead of the full 14 GB model. This cuts peak system RAM usage in half, making inf2.xlarge (16 GB RAM) viable for a 7B model.
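The arithmetic behind these figures can be sketched as follows. The parameter count (about 7.25 billion) and the 2 bytes per parameter (16-bit weights) are assumptions used for a rough estimate. Neither is stated in this README, and real peak usage also depends on runtime overhead.

```python
def weight_bytes_per_rank(num_params, bytes_per_param, tp_size):
    """Rough size of the weights each tensor-parallel rank must load.

    Assumes the weights split evenly across ranks, which is an approximation.
    """
    total = num_params * bytes_per_param
    return total / tp_size


if __name__ == "__main__":
    NUM_PARAMS = 7.25e9   # assumed approximate parameter count
    BYTES_PER_PARAM = 2   # assumed 16-bit weights
    GIB = 1024 ** 3

    full = weight_bytes_per_rank(NUM_PARAMS, BYTES_PER_PARAM, tp_size=1)
    shard = weight_bytes_per_rank(NUM_PARAMS, BYTES_PER_PARAM, tp_size=2)
    print(f"full model: {full / GIB:.1f} GiB")
    print(f"per-rank shard at tp=2: {shard / GIB:.1f} GiB")
```

Under these assumptions the per-rank estimate lands close to the shard size listed in the repo structure, which is what leaves headroom on a 16 GB host.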
Usage with vLLM
This repo is designed to work with vllm-neuron-rosa, which provides a custom entrypoint.sh that:
- Runs snapshot_download to fetch the full repo (including neuron-compiled-artifacts/)
- Sets NEURON_COMPILED_ARTIFACTS to bypass NxDI's config hash lookup
- Launches vLLM with --model pointing to the local download
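The three steps above can be approximated in Python as shown below. This is a sketch only, not the actual entrypoint.sh from vllm-neuron-rosa. The helper names `build_launch` and `main` and the `./model` target directory are hypothetical, and only a subset of the vLLM flags from the compilation parameters is passed.

```python
import os
from pathlib import Path


def build_launch(model_dir):
    """Return (env, argv) for starting vLLM against a local copy of the repo.

    Flag values mirror the compilation parameters listed in this README.
    """
    model_dir = Path(model_dir)
    env = dict(os.environ)
    # Step 2: point the runtime at the pre-compiled artifacts in the download.
    env["NEURON_COMPILED_ARTIFACTS"] = str(model_dir / "neuron-compiled-artifacts")
    # Step 3: --model points at the local download, not the Hub repo id.
    argv = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", str(model_dir),
        "--tensor-parallel-size", "2",
        "--max-model-len", "4096",
        "--max-num-seqs", "4",
        "--block-size", "32",
    ]
    return env, argv


def main():
    # Imported lazily so build_launch can be used without these packages.
    import subprocess
    from huggingface_hub import snapshot_download

    # Step 1: fetch the full repo, including neuron-compiled-artifacts/.
    local_dir = snapshot_download(
        repo_id="fjcloud/Mistral-7B-Instruct-v0.3-neuron-inf2-tp2",
        local_dir="./model",  # hypothetical target directory
    )
    env, argv = build_launch(local_dir)
    subprocess.run(argv, env=env, check=True)


if __name__ == "__main__":
    main()
```

Keeping the command construction separate from the download makes it possible to inspect the environment and arguments without network access or Neuron hardware.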
Deploy on OpenShift / ROSA
# Prerequisites (NFD, KMM, Neuron operators)
oc apply -k https://github.com/fjcloud/vllm-neuron-rosa/deploy/prereqs
# Create namespace
oc new-project neuron-inference
# Deploy
oc apply -k https://github.com/fjcloud/vllm-neuron-rosa/deploy -n neuron-inference
# Build image (first time)
oc start-build vllm-neuron -n neuron-inference --follow
Standalone vLLM (if you handle the download yourself)
# 1. Download the full repo
huggingface-cli download fjcloud/Mistral-7B-Instruct-v0.3-neuron-inf2-tp2 --local-dir ./model
# 2. Run vLLM with NEURON_COMPILED_ARTIFACTS set
export NEURON_COMPILED_ARTIFACTS=./model/neuron-compiled-artifacts
python -m vllm.entrypoints.openai.api_server \
--model ./model \
--tensor-parallel-size 2 \
--max-model-len 4096 \
--max-num-seqs 4 \
--block-size 32 \
--num-gpu-blocks-override 4 \
--no-enable-prefix-caching \
--additional-config '{"override_neuron_config": {"save_sharded_checkpoint": true}}'
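Once the server is running, it can be queried through its OpenAI-compatible API. The sketch below uses only the Python standard library and rests on two assumptions: that the server listens on port 8000 (no --port is given in the command above), and that the served model name equals the value passed to --model, here ./model. The helper name `build_chat_request` and the `max_tokens` value are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt, model="./model", base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,  # assumed to match the --model value used at launch
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,  # illustrative limit, well under max-model-len
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_chat_request("What is the capital of France?")
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    print(body["choices"][0]["message"]["content"])
```

If the server was started with a different port or served model name, adjust `base_url` and `model` to match.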
Hardware requirements
| Instance | RAM | Works? | Notes |
|---|---|---|---|
| inf2.xlarge | 16 GB | Yes | With pre-sharded weights |
| inf2.8xlarge | 128 GB | Yes | Also suitable for recompilation |
License
Same as the base model: Mistral License.