# How to run

```bash
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
```

```bash
DOCKER_BUILDKIT=1 docker build \
  -f docker/Dockerfile.cuda \
  --build-arg BASE_IMAGE=vllm/vllm-openai:latest \
  -t vllm-omni-custom:latest \
  .
```
The build above targets NVIDIA GPUs (CUDA).

```bash
# Remove any previous container named "vllm" so the name can be reused
docker rm -f vllm

docker run -d \
  --name vllm \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN="$HF_TOKEN" \
  -e CUDA_VISIBLE_DEVICES=0 \
  --entrypoint /bin/bash \
  vllm-omni-custom:latest \
  -lc 'pip install --no-cache-dir "vllm[audio]" torchdiffeq && \
  vllm serve minchyeom/StarVoice \
    --omni \
    --served-model-name starlette \
    --dtype float32 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.5 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000'
```
Set `--gpu-memory-utilization` to match your GPU's VRAM budget. This setup was tested on a single RTX 5090.
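
`--gpu-memory-utilization` takes a fraction of total GPU memory rather than an absolute size, so it can help to derive it from the amount of VRAM you want to give the server. The snippet below is a minimal sketch of that arithmetic. The function name `gpu_mem_fraction` is a hypothetical helper, not part of vLLM or this repo, and the 32 GiB total is an assumed round figure for a card like the RTX 5090.

```bash
# Hypothetical helper (not part of vLLM): convert a VRAM budget into the
# fraction expected by --gpu-memory-utilization.
# Usage: gpu_mem_fraction <budget_gib> <total_gib>
gpu_mem_fraction() {
  local budget_gib="$1" total_gib="$2"
  awk -v b="$budget_gib" -v t="$total_gib" 'BEGIN {
    if (t <= 0) exit 1            # guard against division by zero
    printf "%.2f\n", b / t
  }'
}

# Optional: read the real total (in MiB) from the driver instead of assuming it.
# nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits

# Example: a 16 GiB budget on an assumed 32 GiB card.
gpu_mem_fraction 16 32   # prints 0.50
```

The result can be passed straight to `--gpu-memory-utilization` in the `docker run` command. Leave headroom if other processes share the GPU.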