Instructions for using google/gemma-4-31B-it with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
  - Transformers
How to use google/gemma-4-31B-it with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-4-31B-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("google/gemma-4-31B-it")
model = AutoModelForImageTextToText.from_pretrained("google/gemma-4-31B-it")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Inference
  - HuggingChat
- Notebooks
  - Google Colab
  - Kaggle
- Local Apps
  - vLLM
How to use google/gemma-4-31B-it with vLLM:
Install from pip and serve the model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "google/gemma-4-31B-it"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "google/gemma-4-31B-it",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
```

Use Docker
```bash
docker model run hf.co/google/gemma-4-31B-it
```
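You can also call the vLLM server from Python instead of curl. This is a minimal sketch, assuming the `vllm serve` command above is running locally on the default port 8000 and the `openai` client package is installed:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the api_key is unused unless you configured one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```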
  - SGLang
How to use google/gemma-4-31B-it with SGLang:
Install from pip and serve the model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "google/gemma-4-31B-it" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "google/gemma-4-31B-it",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
```

Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "google/gemma-4-31B-it" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "google/gemma-4-31B-it",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
```

  - Docker Model Runner
How to use google/gemma-4-31B-it with Docker Model Runner:
```bash
docker model run hf.co/google/gemma-4-31B-it
```
junk outputs
If I eval this model via the GCP Vertex AI Model Garden, it's great and there are no junk outputs.
If I use vLLM myself, I see a huge number of junk outputs and my eval metrics decline.
e.g.:
"*Ayano's eyes light up, her eyes expressionlijkly shifting"
""he doesn't even torightly look at the screen"
"He stays exactly where you're lean against him"
I have tried a lot of different settings, including copying the vLLM settings that Vertex AI uses AND using the same Docker container that Vertex AI uses, but I still get issues.
It might be the slightly different weights that are used by Vertex AI.
My vLLM arguments look like:
"engine_args": {
'gpu_memory_utilization': 0.92,
'language_model_only': True,
'max_model_len': 10240,
'max_num_batched_tokens': 10240,
'max_num_seqs': 64,
'tensor_parallel_size': 1,
'trust_remote_code': True,
'tool-call-parser': 'gemma4',
'reasoning-parser': 'gemma4'
},
I'm using the default sampling settings from the config (temp 1.0, top_k 64, top_p 0.95, etc. --> they get set automatically by vLLM).
Hi @rirv938,
Thanks for reporting the issue.
To help us investigate why you are seeing these junk outputs, could you please provide more details about your specific environment? We would specifically like to see the evaluation script you are using to understand how the model's output is being handled.
Additionally, could you please provide the exact steps to reproduce this behavior?
Gemma 4's instruction-tuned format terminates turns with <end_of_turn> (106), not <eos> (1). The model's generation_config.json lists multiple stop tokens, but transformers overrides that list with the tokenizer's scalar eos_token_id=1 on load. After that override, vLLM only sees 1 as a stop token, so when the model emits 106 to end its turn, generation keeps going and decodes garbage from the post-turn distribution. That matches the symptoms you're seeing ("expressionlijkly", "torightly" are off-manifold tokens past the natural stop).
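You can verify this before changing your serving setup. A minimal check (just a sketch that inspects the two configs; it assumes nothing beyond transformers and access to the gated repo):

```python
from transformers import AutoTokenizer, GenerationConfig

model_id = "google/gemma-4-31B-it"

# The single eos id the tokenizer reports (the value that can clobber the list on load).
tok = AutoTokenizer.from_pretrained(model_id)
print("tokenizer eos_token_id:", tok.eos_token_id)

# The stop tokens the model's own generation_config.json declares.
gen_cfg = GenerationConfig.from_pretrained(model_id)
print("generation_config eos_token_id:", gen_cfg.eos_token_id)
```

If the tokenizer reports a bare 1 while the model's generation config also lists 106, you're hitting the override described above.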
Fix: pass the full stop set to vLLM explicitly:

```python
from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=1.0, top_k=64, top_p=0.95,
    stop_token_ids=[1, 106],  # <eos>, <end_of_turn>
    max_tokens=...,
)
```
Or on the server: --override-generation-config '{"stop_token_ids":[1,106]}'.
Vertex AI's serving stack likely honors the model's generation_config.json directly and doesn't hit the transformers override path,
which is why it works there.
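If you're hitting the server through the OpenAI-compatible endpoint rather than building SamplingParams yourself, the same stop set can be sent per request. A minimal sketch, assuming the `openai` client and your vLLM server on localhost:8000; stop_token_ids and top_k are vLLM-specific extra parameters that go through extra_body:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[{"role": "user", "content": "Write one sentence about the Statue of Liberty."}],
    temperature=1.0,
    top_p=0.95,
    # vLLM accepts engine-specific sampling fields via extra_body;
    # stop_token_ids adds <end_of_turn> (106) alongside <eos> (1).
    extra_body={"stop_token_ids": [1, 106], "top_k": 64},
)
print(response.choices[0].message.content)
```

Either route (per-request stop_token_ids or the server-side --override-generation-config) should stop generation at the end-of-turn token and clear up the junk continuations.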