# How to run

Clone the repository:

```bash
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
```

Build the image:

```bash
DOCKER_BUILDKIT=1 docker build \
  -f docker/Dockerfile.cuda \
  --build-arg BASE_IMAGE=vllm/vllm-openai:latest \
  -t vllm-omni-custom:latest \
  .
```

The build above targets NVIDIA GPUs (it uses `docker/Dockerfile.cuda`).

Start the server:

```bash
# Remove any previous container named "vllm" before starting a new one
docker rm -f vllm

docker run -d \
  --name vllm \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN="$HF_TOKEN" \
  -e CUDA_VISIBLE_DEVICES=0 \
  --entrypoint /bin/bash \
  vllm-omni-custom:latest \
  -lc 'pip install --no-cache-dir "vllm[audio]" torchdiffeq && \
    vllm serve minchyeom/StarVoice \
      --omni \
      --served-model-name starlette \
      --dtype float32 \
      --max-model-len 32768 \
      --gpu-memory-utilization 0.5 \
      --trust-remote-code \
      --host 0.0.0.0 \
      --port 8000'
```

Set `--gpu-memory-utilization` (the fraction of GPU memory vLLM is allowed to use, `0.5` in the command above) according to your GPU VRAM budget. I tested this on a single RTX 5090.
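
If you think of your VRAM budget in GiB rather than as a fraction, a small helper can do the conversion for `--gpu-memory-utilization`. This is only a sketch: `gpu_mem_fraction` is a hypothetical name, not part of vLLM or this repository, and the 16 GiB budget and 32 GiB total are assumed example values (32 GiB matches an RTX 5090).

```shell
# Hypothetical helper (not part of vLLM): convert a VRAM budget into the
# fraction expected by --gpu-memory-utilization.
# Usage: gpu_mem_fraction <budget_gib> <total_gib>
gpu_mem_fraction() {
  awk -v budget="$1" -v total="$2" 'BEGIN {
    # Reject budgets that are non-positive or larger than the card itself
    if (total <= 0 || budget <= 0 || budget > total) { exit 1 }
    printf "%.2f\n", budget / total
  }'
}

# Assumed example: give vLLM 16 GiB out of a 32 GiB card.
# Total memory (in MiB) can be read with:
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
gpu_mem_fraction 16 32
```

The printed value can then be passed as `--gpu-memory-utilization` in the `vllm serve` command. Leave some headroom if other processes share the same GPU.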