Vertex AI & vLLM Deployment Guide for Gemma 4 26B-A4B-it (MoE) + Known Limitations
Hi everyone,
I recently spent time stabilizing a production deployment of Gemma 4 26B-A4B-it on Vertex AI using vLLM on a single A100 80GB.
During this process, I documented 20 distinct failure modes, ranging from dependency conflicts to MoE quantization incompatibilities. I've compiled a full Forensic Runbook, a working GCSFUSE-enabled Dockerfile, and a Dependency Matrix proof in this repository: https://github.com/Manzela/gemma4-vllm-deployment.
Current working state: BF16 precision, 8K context, A100 80GB.
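For reference, the working state above corresponds to a launch command along these lines. This is a sketch: the model ID, memory utilization, and port are assumptions from my setup, not values you should copy blindly.

```shell
# Sketch of the known-good BF16 / 8K-context launch on a single A100 80GB.
# Model ID, utilization, and port are assumptions; tune for your environment.
vllm serve google/gemma-4-26b-a4b-it \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --port 8080
```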
Current blockers identified:
- NF4 quantization is unsupported (the model implementation is missing `get_expert_mapping()`).
- LoRA fine-tuning is blocked (the model class in vLLM is missing the LoRA support mixin).
- The dependency triangle between vLLM, transformers 5.5, and huggingface_hub makes PyPI-only deployments currently impossible.
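Because of that dependency triangle, the image builds at least one of the three packages from source instead of PyPI. A minimal sketch of the install order (the version bound and `@main` refs below are placeholders, not the exact known-good pins — those live in the repository's dependency matrix):

```shell
# Sketch: break the vLLM / transformers / huggingface_hub triangle by
# installing from source. Refs and bounds are placeholders; substitute
# the exact pins from the dependency matrix in the repo.
pip install --no-cache-dir "huggingface_hub" \
  && pip install --no-cache-dir "git+https://github.com/huggingface/transformers.git@main" \
  && pip install --no-cache-dir "git+https://github.com/vllm-project/vllm.git@main"
```

Order matters here: letting vLLM's own resolver pull a transformers wheel from PyPI reintroduces the conflict the source install was meant to avoid.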
I will be opening targeted issues on the vLLM and Transformers repositories to address the code-level blockers, but I wanted to share the deployment scripts and runbooks here to save other developers time. Contributions and workarounds for the vision chat template are welcome!
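Until those upstream fixes land, a cheap preflight can tell you whether a given vLLM build has grown the `get_expert_mapping()` hook before you burn time on an NF4 load attempt. The stand-in class below is a placeholder to show the shape of the check; locally you would import the real model class from your vLLM build instead:

```shell
# Shape of the preflight: vLLM's NF4 path needs the model class to expose
# get_expert_mapping(). StandInMoEModel is a placeholder; swap in the
# actual vLLM model class when running this against your build.
python -c '
class StandInMoEModel:
    pass

has_mapping = callable(getattr(StandInMoEModel, "get_expert_mapping", None))
print("get_expert_mapping present:", has_mapping)
'
# Prints: get_expert_mapping present: False
```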
If you hit failure modes I haven't documented, I'd appreciate you sharing your findings.