gemma4-31b-it with MTP enabled works on dual DGX SPARKs now
I was excited that MTP is now supported by pairing the base model with an assistant model, but I could not make it work. So I tried deploying the base model without MTP from scratch again, using this command:
./launch-cluster.sh \
-t vllm-node-tf5:latest \
-e HF_HUB_OFFLINE=1 \
exec vllm serve \
google/gemma-4-31B-it \
--port 8000 --host 0.0.0.0 \
--served-model-name gemma-4-31b \
--default-chat-template-kwargs '{"enable_thinking":true}' \
--gpu-memory-utilization 0.7 \
-tp 2 \
--distributed-executor-backend ray \
--max-model-len 262144 \
--load-format fastsafetensors \
--enable-auto-tool-choice --tool-call-parser gemma4 \
--reasoning-parser gemma4
There was an error:
(APIServer pid=525) raise ValueError(
(APIServer pid=525) ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is larger than max_num_batched_tokens (2048). Please increase max_num_batched_tokens.
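The message describes a size constraint: with chunked multimodal input disabled, a single multimodal item has to fit inside one batch. The sketch below restates that in Python using the numbers from the log. `mm_item_fits` is a hypothetical helper written for illustration, not vLLM's actual check.

```python
def mm_item_fits(max_tokens_per_mm_item: int, max_num_batched_tokens: int) -> bool:
    """Illustrative restatement of the constraint named in the error message.

    With chunked multimodal input disabled, one multimodal item must fit
    entirely inside a single batch, so the batch budget has to be at least
    as large as the biggest item.
    """
    return max_tokens_per_mm_item <= max_num_batched_tokens


# 2496 and 2048 are the values printed in the error message.
print(mm_item_fits(2496, 2048))   # False: the item is larger than the batch budget
# Any budget of 2496 or more satisfies the constraint, for example 16384.
print(mm_item_fits(2496, 16384))  # True
```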
So I added --max-num-batched-tokens 16384 to the command. After that, startup got as far as these messages and went no further:
(EngineCore pid=608) (RayWorkerProc pid=269, ip=169.254.84.17) (Worker_TP1 pid=269) INFO 05-09 14:31:39 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
Loading safetensors using Fastsafetensor loader: 0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore pid=608) (RayWorkerProc pid=809) [W509 14:31:33.421063622 socket.cpp:207] [c10d] The hostname of the client socket cannot be retrieved. err=-3
This time the head DGX simply hung. I could not even abort the launch, so I had to force a reboot. As I remember, this setup used to work fine on two DGXs, even though MTP was not supported back then.
This is really annoying! My guess is that a recent vLLM update caused it, but I am not sure. For comparison, I also deployed Qwen3.6-27B with MTP enabled:
./launch-cluster.sh \
-d \
-t vllm-node-tf5:latest \
-e HF_HUB_OFFLINE=1 \
exec vllm serve \
Qwen/Qwen3.6-27B \
--port 8000 --host 0.0.0.0 \
-tp 2 \
--served-model-name qwen3 \
--gpu-memory-utilization 0.7 \
--distributed-executor-backend ray \
--max-model-len 262144 \
--load-format fastsafetensors \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":4}'
This one works as usual.
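One way to confirm that a deployment like this actually answers requests is a short smoke test against the OpenAI-compatible endpoint that vLLM exposes. The sketch below is a minimal version using only the standard library. The port (8000) and the served model names (`qwen3`, `gemma-4-31b`) come from the commands in this post; the `localhost` address and the helper names are assumptions for illustration.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(base_url: str, served_model_name: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": served_model_name,  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url: str, served_model_name: str, timeout_s: float = 120.0) -> Optional[str]:
    """Send one short prompt and return the reply text (needs a running server).

    The content can be None if the token budget is used up before the
    final answer is produced.
    """
    request = build_chat_request(base_url, served_model_name, "Reply with the word: ready")
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call, assuming the server runs on this machine:
# print(smoke_test("http://localhost:8000", "qwen3"))
```

A request that times out here while the server logs look healthy points toward a network problem rather than a model problem.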
I am not advertising for anyone; I am just eager for a better way to try these extraordinary models. I believe gemma4-31b-it is the one I should work with for my research. I just want it working again, and I will try again tomorrow. In any case, I appreciate the altruistic contributions of Google DeepMind and Sir Hassabis.
gemma-4-31B-it with MTP enabled now works
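An alchemy guide for Chinese users
Because of certain environmental factors, the tools we use on our computers are in most cases not quite the same as those used in the outside world. When working with these models, beyond the essential skills, what matters more is cultivating a kind of intuition: sensing why these techniques work for others while we struggle at every step. Only then can we see through the illusion and reach the same level. For example, eugr's tools usually need network access, and because of our network conditions we often stumble. If you cannot find a good way to download a model and cannot endure the long wait, your patience gets worn away.
Some pitfalls I ran into: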
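- The IP of the QSFP wired port must differ from the IP of the DGX SPARK host, and it should be on a different subnet. This is mentioned in NETWORKING, but I did not understand it well enough at first. Later, my model finally deployed normally but never returned a response in CC-Switch, so I suspected the network. Ping failed, yet ssh could connect, and the server itself could reach external websites. Earlier, after deploying the Qwen model, I had been able to call the model API without trouble. The only change in between was the network setup: the two DGX SPARKs were originally configured following the Nvidia playbook, and to find out why the Gemma deployment failed I redid the configuration following eugr's scheme. In doing so I put enp1s0f0np0 on the same subnet as the server IP, differing only in the last number. I then realized that under Nvidia's scheme, 40-cx7.yaml does not specify an IPv4 address at all, yet after deployment you get two port addresses completely different from the LAN IPs, and the two-machine interconnect works. The conclusion: as long as your server IP differs from the example IPs listed in NETWORKING, you can copy those examples verbatim. The address of enp1s0f0np0 does not matter; it is only the carrier for machine-to-machine communication.
- Do not install Homebrew on DGX SPARK. I was used to managing packages with Homebrew on the Mac, so I installed it on the DGX at the very start. When I ran ./run-recipe.sh ..., it kept telling me to install PyYAML, but Homebrew has no such package, so Kimi walked me through a pile of virtual-environment setup. I figured those environments had to be unnecessary: Netplan could load 40-cx7.yaml normally, and nobody else had reported the script failing, so the problem had to be on my side. I uninstalled Homebrew and cleaned up its traces. After that, run-recipe.sh ran normally.
- People on the official NVIDIA forum and on GitHub often suggest adding --rebuild_vllm when running the build_and_copy.sh script. This is wrong. eugr himself said in a forum reply not to add it, because it pulls the latest docker image from the main branch of the vllm project. I got burned by this and could not run gemma4 at all. Qwen3.6-27B and 35B, however, seem unaffected.
- Before deploying a model, we usually have to download it from HF to local storage first. But run-recipe.sh goes to HF to find and download the specified model on its own, so I kept getting network-unreachable messages. I remembered that when deploying directly with launch-cluster.sh I had explicitly passed the environment variable -e HF_HUB_OFFLINE=1, telling it not to go online and to use the local copy. So the command I ran became: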
./run-recipe.sh ../gemma-4-31b-it-mtp.yaml --config .env \
-e HF_ENDPOINT=https://hf-mirror.com \
-e HF_HUB_OFFLINE=1 \
-e TRANSFORMERS_OFFLINE=1
After that everything worked, and the model deployed successfully once again.
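Offline mode only helps if the model is already on disk, so it can be worth checking the local Hub cache before launching. The sketch below assumes the standard Hub cache layout (`models--<org>--<name>/snapshots/<revision>/`) and the default cache location under the home directory; if `HF_HOME` or `HF_HUB_CACHE` points elsewhere, or the container mounts a different path, the directory has to be adjusted. `find_cached_snapshot` is a hypothetical helper, not part of any library.

```python
from pathlib import Path
from typing import Optional


def find_cached_snapshot(repo_id: str, hub_cache: Path) -> Optional[Path]:
    """Return the most recently modified snapshot directory for repo_id, or None.

    Assumes the standard Hub cache layout:
    <hub_cache>/models--<org>--<name>/snapshots/<revision>/
    """
    repo_dir = hub_cache / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    if not snapshots.is_dir():
        return None
    revisions = [p for p in snapshots.iterdir() if p.is_dir()]
    if not revisions:
        return None
    return max(revisions, key=lambda p: p.stat().st_mtime)


# Default cache location (an assumption; HF_HOME or HF_HUB_CACHE may override it).
default_cache = Path.home() / ".cache" / "huggingface" / "hub"
snapshot = find_cached_snapshot("google/gemma-4-31B-it", default_cache)
if snapshot is None:
    print("Model not found in the local cache; download it before setting HF_HUB_OFFLINE=1.")
else:
    print(f"Found local snapshot: {snapshot}")
```

The check only confirms that a snapshot directory exists; it does not verify that every weight file finished downloading.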
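My experience builds on the contents of eugr's repository, and I also consulted isolitude's approach. Thanks to them, I went from copying a long string of launch-cluster.sh commands to running a single run-recipe.sh command. Thank you!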
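The first pitfall in the list above, putting the QSFP port on the same subnet as the host's LAN address, can be caught before applying a network configuration. The sketch below uses Python's standard `ipaddress` module. All addresses are hypothetical values chosen for illustration, and `same_subnet` is an illustrative helper, not part of any of the tools mentioned here.

```python
import ipaddress


def same_subnet(cidr_a: str, cidr_b: str) -> bool:
    """Return True if two interface addresses (in CIDR form) share a network."""
    net_a = ipaddress.ip_interface(cidr_a).network
    net_b = ipaddress.ip_interface(cidr_b).network
    return net_a.overlaps(net_b)


# Hypothetical addresses for illustration only.
lan_ip = "192.168.1.20/24"          # the host's address on the local network
bad_qsfp_ip = "192.168.1.120/24"    # the mistake: same subnet, different last number
good_qsfp_ip = "192.168.177.11/24"  # a separate subnet for the direct link

print(same_subnet(lan_ip, bad_qsfp_ip))   # True: conflicts with the LAN
print(same_subnet(lan_ip, good_qsfp_ip))  # False: safe to use for the interconnect
```

When two interfaces share a subnet, the kernel may route LAN traffic out of the wrong port, which is one plausible explanation for ping failing while the deployment itself looked healthy.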