Instructions to use google/gemma-3-4b-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use google/gemma-3-4b-it with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-3-4b-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")
model = AutoModelForImageTextToText.from_pretrained("google/gemma-3-4b-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use google/gemma-3-4b-it with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "google/gemma-3-4b-it"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-3-4b-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
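
The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `http://localhost:8000`. The helper names `build_chat_request` and `send` are hypothetical, introduced here for illustration; they are not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, image_url):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires a running server, so it is left commented out):
#   req = build_chat_request(
#       "http://localhost:8000",
#       "google/gemma-3-4b-it",
#       "Describe this image in one sentence.",
#       "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
#   )
#   print(send(req)["choices"][0]["message"]["content"])
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.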

Use Docker

docker model run hf.co/google/gemma-3-4b-it

SGLang

How to use google/gemma-3-4b-it with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "google/gemma-3-4b-it" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-3-4b-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "google/gemma-3-4b-it" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use google/gemma-3-4b-it with Docker Model Runner:
```
docker model run hf.co/google/gemma-3-4b-it
```

VRAM not freed during long generations (Gemma, max_new_tokens=3000)

#29

by Nessit - opened Mar 25, 2025

Discussion

Nessit

Mar 25, 2025

When using the official Gemma example code but changing max_new_tokens=200 to 3000, I get a CUDA error:
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED during cublasSgemm call.

Additionally, even when the model gives a short response, VRAM remains occupied until all 3000 tokens are processed.

GopiUppari

Google org Mar 26, 2025

Hi @Nessit,

Specifying max_new_tokens=3000 tells the model to prepare memory for generating up to 3000 tokens, regardless of how many are actually generated.
Even if the model replies with only a few tokens, the full memory buffer is still allocated, and that memory stays locked until the process is done.

To solve this issue, try increasing max_new_tokens gradually: 200 → 500 → 1000, and monitor usage.
Also, using half-precision or quantized versions of the model can help save memory and improve performance.
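
To make the half-precision point concrete, here is a rough back-of-the-envelope sketch of how a key/value cache grows with the number of tokens and the element width. The layer, head, and dimension values are illustrative assumptions, not figures taken from this thread; read the real ones from `model.config`. The estimate ignores prompt and image tokens, model weights, activations, and any layers that use a shorter attention window, so treat it as an upper bound for the generated tokens only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value):
    """Upper-bound size in bytes of a key/value cache for one sequence.

    Each layer stores one key and one value vector per KV head per token,
    hence the leading factor of 2.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Illustrative shape values (assumptions; check model.config for the real ones).
LAYERS, KV_HEADS, HEAD_DIM = 34, 4, 256

for label, width in (("float32", 4), ("float16/bfloat16", 2)):
    size = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, num_tokens=3000, bytes_per_value=width)
    print(f"{label}: {size / 2**20:.0f} MiB for 3000 tokens")
```

Halving the element width halves the estimate, which is why half-precision loading is a common first step when memory is tight.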

I successfully ran the official Gemma example code in Google Colab on a T4 GPU runtime with max_new_tokens=3000; please refer to this gist file.

Thank you.

Nessit

Mar 26, 2025

Thank you for your answer! I understand your explanation, but I'm still seeing an issue with GPU utilization. When I ask short questions, I receive short responses, but the GPU remains occupied for an extended period after the answer is complete. I can't perform any other operations until this process finishes, which suggests the stop token might not be functioning properly.

For comparison:

With Qwen, using 3000 tokens allows me to ask both long and short questions - the GPU releases immediately after the answer appears.

With Gemma, regardless of question length or answer size, the GPU stays busy for the full duration needed to process 3000 tokens, blocking further operations.

This behavior significantly impacts workflow efficiency. Is there a way to make Gemma release GPU resources immediately after generating the complete answer, like Qwen does?
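
One way to check the stop-token hypothesis is to look at the generated ids directly. The sketch below is a diagnostic only, not a fix: the helper `first_stop_index` is hypothetical, and the commented usage assumes the variables from the "Load model directly" snippet. If the number of new tokens is close to 3000 and no stop id appears among them, generation ran to the limit without ever emitting a stop token.

```python
def first_stop_index(token_ids, stop_ids):
    """Return the position of the first stop token in token_ids, or None."""
    if stop_ids is None:
        return None
    if isinstance(stop_ids, int):  # a config may store a single id or a list
        stop_ids = [stop_ids]
    stop_set = set(stop_ids)
    for index, token in enumerate(token_ids):
        if token in stop_set:
            return index
    return None


# Usage with a loaded model (not run here):
#   prompt_len = inputs["input_ids"].shape[-1]
#   new_ids = outputs[0][prompt_len:].tolist()
#   stop = model.generation_config.eos_token_id
#   print(len(new_ids), first_stop_index(new_ids, stop))

# Hypothetical token ids, only to show the return values.
print(first_stop_index([5, 9, 2, 7], stop_ids=[2]))  # 2
print(first_stop_index([5, 9, 7], stop_ids=2))       # None
```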

Ayorinha

Mar 28, 2025

It seems the issue is with memory allocation for 3000 tokens; try gradually reducing max_new_tokens, using half-precision (FP16) or quantized models, and manually releasing memory with torch.cuda.empty_cache() after each generation.
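
A minimal sketch of the manual release step mentioned above, assuming PyTorch is installed. The helper name `release_cuda_memory` is hypothetical. Note that `torch.cuda.empty_cache()` only returns cached blocks that no live tensor is using; the model weights and any tensors still referenced stay on the GPU, and the call does not interrupt a `generate` call that is still running.

```python
import gc


def release_cuda_memory():
    """Run garbage collection, then ask PyTorch to return cached GPU blocks.

    Returns True if the CUDA cache was emptied, False if PyTorch or CUDA
    is unavailable.
    """
    gc.collect()  # drop unreachable Python objects that may still hold tensors
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True


# Usage after a generation (not run here):
#   del inputs, outputs      # remove references to the tensors first
#   release_cuda_memory()
```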

warlock76

Apr 25, 2025

I see it's possible to do memory optimizations. I am using a quantized 27B model (GGUF), which works fantastically on a 24 GB passive Quadro RTX card in LM Studio. It's certainly possible to push this by tweaking token allocation; for example, increasing the context window notably increases memory use with respect to the chat log, which was quite surprising. However, I wonder if it's possible for the model to "forget", possibly with a smart selection of what is important, and maybe compress information somehow. Since it's a fully vision-enabled model, you can overload it fairly quickly by showing it some high-res visual data. Is there any other mechanism to lose tokens except the PyTorch cache cleanup?
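
One simple form of "forgetting" happens outside the model: drop the oldest turns before each request so the prompt stays within a token budget. The sketch below is a generic illustration, not a Gemma or LM Studio feature. The helper `trim_history` is hypothetical, it assumes each message's `content` is a plain string, and the default whitespace token counter is a stand-in for a real tokenizer.

```python
def trim_history(messages, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the most recent messages whose combined token count fits max_tokens."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message["content"])
        if used + cost > max_tokens:
            break  # everything older than this point is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"role": "user", "content": "one two three four"},
    {"role": "assistant", "content": "five six"},
    {"role": "user", "content": "seven eight nine"},
]
print(trim_history(history, max_tokens=5))
```

With a budget of 5 whitespace tokens, the oldest message is dropped and the two most recent ones are kept. Summarizing the dropped turns into a short note instead of discarding them would be one way to approach the "compress" idea, at the cost of an extra generation step.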
