# Qwen3-8B: Pre-compiled for AWS Inferentia2 (TP=2)
Pre-compiled Qwen3-8B artifacts for AWS Inferentia2 with tensor parallelism 2.
## Compilation Parameters
| Parameter | Value |
|---|---|
| Model | Qwen/Qwen3-8B |
| Tensor Parallel | 2 |
| Max Model Length | 4096 |
| Max Num Seqs | 8 |
| Block Size | 32 |
| Data Type | bf16 (auto-cast matmul) |
| Compiler Opt | -O1 |
## Hardware
- Instance: inf2.xlarge or inf2.8xlarge (1 Inferentia2 chip, 2 NeuronCores, 32 GB HBM)
- HBM Usage: ~16.4 GB (model weights) + KV cache (see the estimate below)
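The weight footprint follows directly from the parameter count, and a rough KV-cache ceiling from the model architecture. A back-of-the-envelope sketch, assuming Qwen3-8B's ~8.2B parameters and its published config (36 layers, 8 KV heads, head dim 128); these are estimates, not measurements:

```python
# Rough memory arithmetic for the figures above.
# Assumptions: ~8.2B params; 36 layers, 8 KV heads, head_dim 128; bf16 = 2 bytes.
params = 8.2e9
weights_gb = params * 2 / 1e9
print(f"weights: ~{weights_gb:.1f} GB")  # ~16.4 GB, split across the 2 TP shards

# KV cache per token: K and V, per layer, per KV head, per head dim, in bf16.
kv_bytes_per_token = 2 * 36 * 8 * 128 * 2          # ~147 KB per token
max_kv_gb = 8 * 4096 * kv_bytes_per_token / 1e9    # 8 seqs at full 4096 context
print(f"KV cache (worst case): ~{max_kv_gb:.1f} GB")  # ~4.8 GB
```

Both together stay comfortably under the 32 GB of accelerator memory on a single Inferentia2 chip.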
## Usage with vLLM-Neuron
Deploy with `fjcloud/vllm-neuron-rosa`, or launch the OpenAI-compatible server directly:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/this/repo \
  --max-model-len 4096 \
  --tensor-parallel-size 2 \
  --max-num-seqs 8 \
  --block-size 32 \
  --num-gpu-blocks-override 8 \
  --no-enable-prefix-caching
```
Set `NEURON_COMPILED_ARTIFACTS` to `neuron-compiled-artifacts/` to skip recompilation.
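Once the server is up, it speaks the standard OpenAI API. A minimal client sketch, assuming the server runs on `localhost:8000` with default settings and the `openai` Python package is installed:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server accepts any API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/path/to/this/repo",  # must match the --model value passed to the server
    messages=[{"role": "user", "content": "Briefly explain tensor parallelism."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```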
## Repo Structure
```
config.json, tokenizer.json, ...        # Model config & tokenizer
model.safetensors                       # Dummy (satisfies transformers validation)
neuron-compiled-artifacts/
  neuron_config.json                    # Neuron compilation config
  model.pt                              # Compiled NEFF (~159 MB)
  weights/
    tp0_sharded_checkpoint.safetensors  # Shard 0 (~8 GB)
    tp1_sharded_checkpoint.safetensors  # Shard 1 (~8 GB)
```
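To materialize this tree locally before launching vLLM, one option is `huggingface_hub`; the repo id below is a placeholder for wherever these artifacts are actually hosted:

```python
from huggingface_hub import snapshot_download

# Hypothetical repo id -- substitute the actual location of these artifacts.
local_dir = snapshot_download("your-org/qwen3-8b-inf2-tp2")
print(local_dir)  # pass as --model, with NEURON_COMPILED_ARTIFACTS pointing inside it
```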