Vertex AI & vLLM Deployment Guide for Gemma 4 26B-A4B-it (MoE) + Known Limitations

Discussion #19 · opened by Manzela-D

Hi everyone,
I recently spent time stabilizing a production deployment of Gemma 4 26B-A4B-it on Vertex AI using vLLM on a single A100 80GB.

During this process, I documented 20 distinct failure modes, ranging from dependency conflicts to MoE quantization incompatibilities. I've compiled a full Forensic Runbook, a working GCSFUSE-enabled Dockerfile, and a Dependency Matrix proof in this repository: https://github.com/Manzela/gemma4-vllm-deployment.

Current working state: BF16 precision, 8K context, A100 80GB.
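As a quick sanity check on why BF16 on a single A100 80GB is the working configuration, here is a back-of-the-envelope memory estimate. The 26B parameter count is taken from the model name; the overhead interpretation is an illustrative assumption, not a measured value:

```python
# Rough BF16 memory estimate for a ~26B-parameter model, to sanity-check
# whether the weights fit on a single A100 80GB. Illustrative math only --
# real usage also depends on the KV cache, activations, and CUDA overhead.

def bf16_weight_gib(n_params: float) -> float:
    """Approximate weight memory in GiB at 2 bytes per BF16 parameter."""
    return n_params * 2 / 1024**3

weights = bf16_weight_gib(26e9)
print(f"weights: {weights:.1f} GiB")  # roughly 48 GiB of raw weights
# Whatever HBM remains (~80 GiB minus weights) must hold the KV cache,
# which is why an 8K context is workable here but not especially roomy.
```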
Current blockers identified:

  • NF4 Quantization is unsupported (missing get_expert_mapping()).

  • LoRA fine-tuning is blocked (missing mixin in vLLM).

  • The dependency triangle between vLLM, transformers 5.5, and huggingface_hub makes PyPI-only deployments currently impossible.

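The first two blockers above can be probed before launch rather than discovered at serve time. The sketch below is a generic attribute-presence check; the names it would be pointed at (`get_expert_mapping`, the LoRA mixin) come from the blocker list and may differ across vLLM versions, so treat this as a diagnostic template, not an API reference:

```python
# Hedged preflight sketch: verify that a dot-separated attribute path
# resolves inside an installed module, so missing hooks (e.g. an MoE
# expert-mapping method or a LoRA support mixin) are caught early.
import importlib


def check_attr(module_name: str, attr_path: str) -> bool:
    """Return True if `attr_path` (dot-separated) resolves inside the module."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False  # module itself is not installed
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return False  # attribute chain breaks here
        obj = getattr(obj, part)
    return True


# Example against the stdlib, since vLLM internals vary by version:
print(check_attr("os", "path.join"))  # True
```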
I will be opening targeted issues on the vLLM and Transformers repositories to address the code-level blockers, but I wanted to share the deployment scripts and runbooks here to save other developers time. Contributions and workarounds for the vision chat template are welcome!


I appreciate you sharing your findings.
