---
license: apache-2.0
tags:
  - neuron
  - inferentia2
  - vllm
  - qwen3
base_model: Qwen/Qwen3-8B
---

# Qwen3-8B — Pre-compiled for AWS Inferentia2 (TP=2)

Pre-compiled Qwen3-8B artifacts for AWS Inferentia2, built with tensor parallelism 2 (one shard per NeuronCore).

## Compilation Parameters

| Parameter        | Value                   |
|------------------|-------------------------|
| Model            | Qwen/Qwen3-8B           |
| Tensor Parallel  | 2                       |
| Max Model Length | 4096                    |
| Max Num Seqs     | 8                       |
| Block Size       | 32                      |
| Data Type        | bf16 (auto-cast matmul) |
| Compiler Opt     | -O1                     |
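
These values should also be recorded in the shipped `neuron-compiled-artifacts/neuron_config.json` (see Repo Structure below). A minimal sketch for double-checking them before serving, assuming the repo has been downloaded to the current directory:

```bash
# Pretty-print the Neuron compilation config shipped with the artifacts
python3 -m json.tool neuron-compiled-artifacts/neuron_config.json
```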

## Hardware

- **Instance:** inf2.xlarge or inf2.8xlarge (1 Inferentia2 chip, 2 NeuronCores, 32 GB HBM)
- **HBM Usage:** ~16.4 GB (model weights) + KV cache

You can verify the chip layout before launching, as shown below.
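
A quick sanity check that the instance exposes the expected hardware, assuming the `aws-neuronx-tools` package is installed (it ships with the Neuron DLAMI):

```bash
# List Inferentia devices; expect 1 device with 2 NeuronCores on inf2.xlarge
neuron-ls
```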

## Usage with vLLM-Neuron

Deploy using fjcloud/vllm-neuron-rosa or directly:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/this/repo \
  --max-model-len 4096 \
  --tensor-parallel-size 2 \
  --max-num-seqs 8 \
  --block-size 32 \
  --num-gpu-blocks-override 8 \
  --no-enable-prefix-caching
```
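
Once up, the server exposes the standard OpenAI-compatible API. A sketch of a test request, assuming the default port 8000 and that the model name matches the path passed to `--model` (vLLM's default when no `--served-model-name` is set):

```bash
# Send a chat completion request to the OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/path/to/this/repo",
        "messages": [{"role": "user", "content": "Hello from Inferentia2!"}],
        "max_tokens": 64
      }'
```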

Set `NEURON_COMPILED_ARTIFACTS` to `neuron-compiled-artifacts/` to skip recompilation.
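
For example (a sketch; the artifacts path is relative to wherever this repo was downloaded):

```bash
# Point vLLM at the shipped pre-compiled artifacts so startup skips compilation
export NEURON_COMPILED_ARTIFACTS=/path/to/this/repo/neuron-compiled-artifacts

# Then launch the server with the same flags as above
python -m vllm.entrypoints.openai.api_server --model /path/to/this/repo \
  --max-model-len 4096 --tensor-parallel-size 2 --max-num-seqs 8 \
  --block-size 32 --num-gpu-blocks-override 8 --no-enable-prefix-caching
```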

## Repo Structure

```text
config.json, tokenizer.json, ...        # Model config & tokenizer
model.safetensors                       # Dummy (satisfies transformers validation)
neuron-compiled-artifacts/
  neuron_config.json                    # Neuron compilation config
  model.pt                              # Compiled NEFF (~159 MB)
  weights/
    tp0_sharded_checkpoint.safetensors  # Shard 0 (~8 GB)
    tp1_sharded_checkpoint.safetensors  # Shard 1 (~8 GB)
```
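
To fetch everything, including the ~16 GB of weight shards, something like the following works (`<this-repo-id>` stands in for this repository's Hub id; `huggingface-cli` ships with the `huggingface_hub` package):

```bash
# Download the full repo, compiled artifacts and weight shards included
huggingface-cli download <this-repo-id> --local-dir qwen3-8b-neuron
```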