# How to run

```bash
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
```

```bash
DOCKER_BUILDKIT=1 docker build \
  -f docker/Dockerfile.cuda \
  --build-arg BASE_IMAGE=vllm/vllm-openai:latest \
  -t vllm-omni-custom:latest \
  .
```
The build command above targets NVIDIA GPUs.

```bash
# Remove any previous container named "vllm" (prints an error if none exists)
docker rm -f vllm

# Install the audio dependencies at container start, then launch the server
docker run -d \
  --name vllm \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN="$HF_TOKEN" \
  -e CUDA_VISIBLE_DEVICES=0 \
  --entrypoint /bin/bash \
  vllm-omni-custom:latest \
  -lc 'pip install --no-cache-dir "vllm[audio]" torchdiffeq && \
    vllm serve minchyeom/StarVoice \
      --omni \
      --served-model-name starlette \
      --dtype float32 \
      --max-model-len 32768 \
      --gpu-memory-utilization 0.5 \
      --trust-remote-code \
      --host 0.0.0.0 \
      --port 8000'
```
Set `--gpu-memory-utilization` to fit your GPU VRAM budget; it is the fraction of GPU memory the server is allowed to use (0.5 in the command above). I tested this on a single RTX 5090.
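Once the container is running, a quick way to confirm the server came up is to query it from the host. The sketch below assumes the port mapping and served model name from the `docker run` command above, and that the server exposes vLLM's usual OpenAI-compatible `/v1/models` route. The `BASE_URL` and `MODEL_NAME` variables and both helper functions are illustrative names, not part of vllm-omni.

```bash
# Assumed values, taken from the docker run command above.
BASE_URL="http://localhost:8000"
MODEL_NAME="starlette"

# Hypothetical helper: list the models the server reports.
# -f makes curl exit non-zero on HTTP errors; -s hides the progress output.
list_models() {
  curl -sf "$BASE_URL/v1/models"
}

# Hypothetical helper: succeed only if the served model name appears in the listing.
model_is_served() {
  list_models | grep -q "\"$MODEL_NAME\""
}
```

Model loading can take a while on first start, so it may help to follow `docker logs -f vllm` and run `model_is_served && echo ready` once the logs show the server listening.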