Vertex AI & vLLM Deployment Guide for Gemma 4 26B-A4B-it (MoE) + Known Limitations
Hi everyone,
I recently spent time stabilizing a production deployment of Gemma 4 26B-A4B-it on Vertex AI using vLLM on a single A100 80GB.
During this process, I documented 20 distinct failure modes, ranging from dependency conflicts to MoE quantization incompatibilities. I've compiled a full Forensic Runbook, a working GCSFUSE-enabled Dockerfile, and a Dependency Matrix proof in this repository: https://github.com/Manzela/gemma4-vllm-deployment.
Current working state: BF16 precision, 8K context, A100 80GB.
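For reference, the working state above corresponds to a launch command along these lines. This is a sketch: the model ID, memory utilization, and port are assumptions from my setup, not values you should copy blindly.

```shell
# Sketch of the known-good BF16 / 8K-context launch on a single A100 80GB.
# Model ID, utilization, and port are assumptions; tune for your environment.
vllm serve google/gemma-4-26b-a4b-it \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --port 8080
```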
Current blockers identified:
- NF4 quantization is unsupported (the model implementation is missing `get_expert_mapping()`).
- LoRA fine-tuning is blocked (the model class in vLLM is missing the LoRA support mixin).
- The dependency triangle between vLLM, transformers 5.5, and huggingface_hub makes PyPI-only deployments currently impossible.
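Because of that dependency triangle, the image builds at least one of the three packages from source instead of PyPI. A minimal sketch of the install order (the version bound and `@main` refs below are placeholders, not the exact known-good pins — those live in the repository's dependency matrix):

```shell
# Sketch: break the vLLM / transformers / huggingface_hub triangle by
# installing from source. Refs and bounds are placeholders; substitute
# the exact pins from the dependency matrix in the repo.
pip install --no-cache-dir "huggingface_hub" \
  && pip install --no-cache-dir "git+https://github.com/huggingface/transformers.git@main" \
  && pip install --no-cache-dir "git+https://github.com/vllm-project/vllm.git@main"
```

Order matters here: letting vLLM's own resolver pull a transformers wheel from PyPI reintroduces the conflict the source install was meant to avoid.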
I will be opening targeted issues on the vLLM and Transformers repositories to address the code-level blockers, but I wanted to share the deployment scripts and runbooks here to save other developers time. Contributions and workarounds for the vision chat template are welcome!
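Until those upstream fixes land, a cheap preflight can tell you whether a given vLLM build has grown the `get_expert_mapping()` hook before you burn time on an NF4 load attempt. The stand-in class below is a placeholder to show the shape of the check; locally you would import the real model class from your vLLM build instead:

```shell
# Shape of the preflight: vLLM's NF4 path needs the model class to expose
# get_expert_mapping(). StandInMoEModel is a placeholder; swap in the
# actual vLLM model class when running this against your build.
python -c '
class StandInMoEModel:
    pass

has_mapping = callable(getattr(StandInMoEModel, "get_expert_mapping", None))
print("get_expert_mapping present:", has_mapping)
'
# Prints: get_expert_mapping present: False
```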
If you hit failure modes I haven't documented, I'd appreciate you sharing your findings.