---
license: apache-2.0
tags:
  - neuron
  - inferentia2
  - vllm
  - qwen3
base_model: Qwen/Qwen3-8B
---

# Qwen3-8B — Pre-compiled for AWS Inferentia2 (TP=2)

Pre-compiled Qwen3-8B artifacts for AWS Inferentia2, built with tensor parallelism 2 (one shard per NeuronCore).

## Compilation Parameters

| Parameter        | Value                   |
|------------------|-------------------------|
| Model            | Qwen/Qwen3-8B           |
| Tensor Parallel  | 2                       |
| Max Model Length | 4096                    |
| Max Num Seqs     | 8                       |
| Block Size       | 32                      |
| Data Type        | bf16 (auto-cast matmul) |
| Compiler Opt     | -O1                     |
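
These values should also be recorded in the shipped `neuron-compiled-artifacts/neuron_config.json` (see Repo Structure below). A minimal sketch for double-checking them before serving, assuming the repo has been downloaded to the current directory:

```bash
# Pretty-print the Neuron compilation config shipped with the artifacts
python3 -m json.tool neuron-compiled-artifacts/neuron_config.json
```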

## Hardware

- **Instance:** inf2.xlarge or inf2.8xlarge (1 Inferentia2 chip, 2 NeuronCores, 32 GB HBM)
- **HBM Usage:** ~16.4 GB (model weights) + KV cache

You can verify the chip layout before launching, as shown below.
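
A quick sanity check that the instance exposes the expected hardware, assuming the `aws-neuronx-tools` package is installed (it ships with the Neuron DLAMI):

```bash
# List Inferentia devices; expect 1 device with 2 NeuronCores on inf2.xlarge
neuron-ls
```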

## Usage with vLLM-Neuron

Deploy using fjcloud/vllm-neuron-rosa or directly:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/this/repo \
  --max-model-len 4096 \
  --tensor-parallel-size 2 \
  --max-num-seqs 8 \
  --block-size 32 \
  --num-gpu-blocks-override 8 \
  --no-enable-prefix-caching
```
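
Once up, the server exposes the standard OpenAI-compatible API. A sketch of a test request, assuming the default port 8000 and that the model name matches the path passed to `--model` (vLLM's default when no `--served-model-name` is set):

```bash
# Send a chat completion request to the OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/path/to/this/repo",
        "messages": [{"role": "user", "content": "Hello from Inferentia2!"}],
        "max_tokens": 64
      }'
```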

Set `NEURON_COMPILED_ARTIFACTS` to `neuron-compiled-artifacts/` to skip recompilation.
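
For example (a sketch; the artifacts path is relative to wherever this repo was downloaded):

```bash
# Point vLLM at the shipped pre-compiled artifacts so startup skips compilation
export NEURON_COMPILED_ARTIFACTS=/path/to/this/repo/neuron-compiled-artifacts

# Then launch the server with the same flags as above
python -m vllm.entrypoints.openai.api_server --model /path/to/this/repo \
  --max-model-len 4096 --tensor-parallel-size 2 --max-num-seqs 8 \
  --block-size 32 --num-gpu-blocks-override 8 --no-enable-prefix-caching
```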

## Repo Structure

```text
config.json, tokenizer.json, ...        # Model config & tokenizer
model.safetensors                       # Dummy (satisfies transformers validation)
neuron-compiled-artifacts/
  neuron_config.json                    # Neuron compilation config
  model.pt                              # Compiled NEFF (~159 MB)
  weights/
    tp0_sharded_checkpoint.safetensors  # Shard 0 (~8 GB)
    tp1_sharded_checkpoint.safetensors  # Shard 1 (~8 GB)
```
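
To fetch everything, including the ~16 GB of weight shards, something like the following works (`<this-repo-id>` stands in for this repository's Hub id; `huggingface-cli` ships with the `huggingface_hub` package):

```bash
# Download the full repo, compiled artifacts and weight shards included
huggingface-cli download <this-repo-id> --local-dir qwen3-8b-neuron
```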