vllm / sglang support?
#4
by mtcl - opened
Is there support for sglang/vLLM?
vLLM is working on it, per their GitHub.
I think the PR was merged just an hour ago.
I hope the official instructions in the docs here are updated soon.
Here is a custom vLLM image I've built. It works as intended: https://hub.docker.com/r/infantryman77/vllm-gemma4. Tested with Cline and Open-WebUI. Not completely production-ready, but it works.
```yaml
services:
  vllm:
    image: infantryman77/vllm-gemma4:nightly-20260402
    container_name: gemma4
    command:
      - /models/gemma-4-31B-it-AWQ-8bit
      - --served-model-name
      - gemma4-31b
      - --max-model-len
      - "131072"
      - --tensor-parallel-size
      - "4"
      - --gpu-memory-utilization
      - "0.97"
      - --reasoning-parser
      - gemma4
      - --enable-auto-tool-choice
      - --tool-call-parser
      - gemma4
      - --host
      - 0.0.0.0
      - --limit-mm-per-prompt
      - '{"image":4}'
      - --max-num-batched-tokens
      - "2096"
      - --max-num-seqs
      - "4"
      - --port
      - "8080"
      - --disable-custom-all-reduce
      - --override-generation-config
      - '{"temperature":1.0,"top_p":0.95,"top_k":64}'
    volumes:
      - /home/infantryman/vllm/models:/models
    ports:
      - "8080:8080"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - PYTORCH_ALLOC_CONF=expandable_segments:True
      - LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64
      - OMP_NUM_THREADS=1
      - PYTHONWARNINGS=ignore::FutureWarning
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
    ipc: host
    restart: unless-stopped
```
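With the container up, vLLM serves an OpenAI-compatible API. A minimal Python sketch of a chat request against it, using only the standard library; the model name and port are taken from the `--served-model-name` and `--port` flags in the compose file above, and the helper names are my own:

```python
import json
import urllib.request

# Values taken from the compose flags above:
# --served-model-name gemma4-31b, --port 8080 (mapped to host 8080).
BASE_URL = "http://localhost:8080/v1"
MODEL = "gemma4-31b"


def build_chat_request(prompt: str) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }


def chat(prompt: str) -> str:
    """POST the request to the vLLM server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Say hello in one sentence."))
```

The same endpoint is what Cline and Open-WebUI point at: set the base URL to `http://<host>:8080/v1` and the model to `gemma4-31b`.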