vllm / sglang support?

#4
by mtcl - opened

Is there support for sglang/vllm?

vLLM is working on it, per their GitHub.

I think the PR was just merged an hour ago.

I hope the official instructions in the docs here are updated soon.

Here is a custom vLLM image I've built. It works as intended: https://hub.docker.com/r/infantryman77/vllm-gemma4. Tested with Cline and Open WebUI. It's not completely production-ready, but it works.

```yaml
services:
  vllm:
    image: infantryman77/vllm-gemma4:nightly-20260402
    container_name: gemma4
    command:
      - /models/gemma-4-31B-it-AWQ-8bit
      - --served-model-name
      - gemma4-31b
      - --max-model-len
      - "131072"
      - --tensor-parallel-size
      - "4"
      - --gpu-memory-utilization
      - "0.97"
      - --reasoning-parser
      - gemma4
      - --enable-auto-tool-choice
      - --tool-call-parser
      - gemma4
      - --host
      - 0.0.0.0
      - --limit-mm-per-prompt
      - '{"image":4}'
      - --max-num-batched-tokens
      - "2096"
      - --max-num-seqs
      - "4"
      - --port
      - "8080"
      - --disable-custom-all-reduce
      - --override-generation-config
      - '{"temperature":1.0,"top_p":0.95,"top_k":64}'
    volumes:
      - /home/infantryman/vllm/models:/models
    ports:
      - "8080:8080"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - PYTORCH_ALLOC_CONF=expandable_segments:True
      - LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64
      - OMP_NUM_THREADS=1
      - PYTHONWARNINGS=ignore::FutureWarning
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
    ipc: host
    restart: unless-stopped
```
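The compose file above exposes vLLM's OpenAI-compatible API on port 8080 under the served name `gemma4-31b`. As a quick sanity check, here's a minimal sketch of a chat-completions request that matches those settings; it only builds the JSON payload (the URL and sampling values mirror the compose flags, and actually sending it assumes the container is running):

```python
import json

# Endpoint from the compose file: --host 0.0.0.0, --port 8080,
# served via vLLM's OpenAI-compatible /v1/chat/completions route.
BASE_URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "gemma4-31b",  # must match --served-model-name
    "messages": [{"role": "user", "content": "Hello"}],
    # sampling values mirror --override-generation-config
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
}

# Serialize the request body; POST this to BASE_URL with any HTTP client
# (curl, requests, or the official openai client pointed at the base URL).
body = json.dumps(payload)
print(body)
```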
