Qwen3-8B: Pre-compiled for AWS Inferentia2 (TP=2)

Pre-compiled Qwen3-8B artifacts for AWS Inferentia2 with tensor parallelism 2.

Compilation Parameters

Parameter           Value
Model               Qwen/Qwen3-8B
Tensor Parallel     2
Max Model Length    4096
Max Num Seqs        8
Block Size          32
Data Type           bf16 (auto-cast matmul)
Compiler Opt        -O1
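If you recompile rather than reuse the bundled artifacts, the Neuron compiler picks up extra options from the NEURON_CC_FLAGS environment variable; a config sketch using the -O1 value from the table above:

```shell
# Config sketch: pass the optimization level from the table to the Neuron compiler.
# -O1 trades some runtime optimization for faster compilation.
export NEURON_CC_FLAGS="-O1"
```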

Hardware

  • Instance: inf2.xlarge or inf2.8xlarge (1 Inferentia2 chip, 2 NeuronCores, 32 GB HBM)
  • HBM Usage: ~16.4 GB (model weights) + KV cache
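The ~16.4 GB weight figure follows from bf16 storage (2 bytes per parameter). A back-of-envelope sketch of the HBM budget; the architecture numbers are assumptions (~8.2B parameters, 36 layers, 8 KV heads, head dim 128), so check config.json before relying on them:

```python
# Back-of-envelope HBM budget for these artifacts.
BYTES_BF16 = 2  # bf16 element size in bytes

def weight_bytes(n_params: float) -> float:
    """Weight memory for bf16-stored parameters."""
    return n_params * BYTES_BF16

def kv_block_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   block_size: int) -> int:
    """KV-cache memory for one paged-attention block of block_size tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * BYTES_BF16  # K and V
    return per_token * block_size

print(round(weight_bytes(8.2e9) / 1e9, 1))               # 16.4 (GB of weights)
print(round(kv_block_bytes(36, 8, 128, 32) / 2**20, 1))  # 4.5 (MiB per 32-token block)
```

The remainder of the 32 GB HBM (after weights and runtime overhead) is what the KV cache has to fit into.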

Usage with vLLM-Neuron

Deploy using fjcloud/vllm-neuron-rosa, or launch vLLM directly:

python -m vllm.entrypoints.openai.api_server \
  --model /path/to/this/repo \
  --max-model-len 4096 \
  --tensor-parallel-size 2 \
  --max-num-seqs 8 \
  --block-size 32 \
  --num-gpu-blocks-override 8 \
  --no-enable-prefix-caching

Set the NEURON_COMPILED_ARTIFACTS environment variable to the neuron-compiled-artifacts/ directory to skip recompilation.
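When launching from outside the repo directory, the cache path can be made explicit; the path below is a placeholder for your local checkout:

```shell
# Config sketch: point vLLM's Neuron backend at the bundled compiled graph
# so it is loaded directly instead of triggering a fresh compilation.
export NEURON_COMPILED_ARTIFACTS=/path/to/this/repo/neuron-compiled-artifacts
```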

Repo Structure

config.json, tokenizer.json, ...    # Model config & tokenizer
model.safetensors                    # Dummy (satisfies transformers validation)
neuron-compiled-artifacts/
  neuron_config.json                 # Neuron compilation config
  model.pt                           # Compiled NEFF (~159 MB)
  weights/
    tp0_sharded_checkpoint.safetensors  # Shard 0 (~8 GB)
    tp1_sharded_checkpoint.safetensors  # Shard 1 (~8 GB)
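Before pointing vLLM at a checkout, it can help to verify that the layout above is complete. A small pre-flight sketch; the file list is taken from the tree above, and the missing_files helper is hypothetical:

```python
import os
import tempfile

# Expected files from the repo tree above.
EXPECTED = [
    "config.json",
    "model.safetensors",
    "neuron-compiled-artifacts/neuron_config.json",
    "neuron-compiled-artifacts/model.pt",
    "neuron-compiled-artifacts/weights/tp0_sharded_checkpoint.safetensors",
    "neuron-compiled-artifacts/weights/tp1_sharded_checkpoint.safetensors",
]

def missing_files(repo_root):
    """Return expected paths that are absent under repo_root."""
    return [p for p in EXPECTED
            if not os.path.exists(os.path.join(repo_root, p))]

# Demo against a scratch skeleton standing in for a real checkout.
with tempfile.TemporaryDirectory() as root:
    for rel in EXPECTED:
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    print(missing_files(root))  # -> []
```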
Model tree for fjcloud/Qwen3-8B-neuron-inf2-tp2

  • Base model: Qwen/Qwen3-8B-Base
  • Finetuned: Qwen/Qwen3-8B
  • This model: compiled from Qwen/Qwen3-8B