Too much VRAM in vLLM

#75

by cbrug - opened Jun 11, 2025

Discussion

cbrug

Jun 11, 2025

I'm trying to deploy the Gemma model on 4 A100 (40GB) GPUs.
This should be more than enough hardware, but vLLM runs out of memory (OOM) during startup.

This is the output for a single GPU (the other 3 report more or less the same).

the current vLLM instance can use total_gpu_memory (39.38GiB) x gpu_memory_utilization (0.90) = 35.44GiB
model weights take 13.17GiB; non_torch_memory takes 2.09GiB; PyTorch activation peak memory takes 17.91GiB; the rest of the memory reserved for KV Cache is 2.28GiB.
ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (19264).
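
The figures in this log can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the numbers printed above; the variable names are illustrative and it is not how vLLM computes them internally. Because the log rounds each value to two decimals, the result lands within about 0.01GiB of the logged KV cache size rather than matching it exactly.

```python
# Per-GPU figures copied from the vLLM log above (GiB).
total_gpu_memory = 39.38
gpu_memory_utilization = 0.90
model_weights = 13.17
non_torch_memory = 2.09
activation_peak = 17.91

# Memory this vLLM instance allows itself on one GPU.
budget = total_gpu_memory * gpu_memory_utilization

# What remains after weights, non-torch overhead and the profiled
# activation peak is what the log reports as reserved for the KV cache.
kv_cache = budget - model_weights - non_torch_memory - activation_peak

print(f"budget:   {budget:.2f} GiB")
print(f"KV cache: {kv_cache:.2f} GiB")
print(f"activation share of budget: {activation_peak / budget:.0%}")
```

Read this way, roughly half of the per-GPU budget goes to the activation peak, which is why so little is left for the KV cache.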

I don't understand why it takes up so much space. It should need much less, leaving more than enough room for the model's max len.
The cause might be the PyTorch activation peak memory of 18GB, which is unusually high. Any advice?

Libraries

accelerate                               1.7.0
torch                                    2.7.0
torchaudio                               2.7.0
torchvision                              0.22.0
transformers                             4.52.4
vllm                                     0.9.1

lyalyukev

Jun 11, 2025

Change context 131072, set 4096 for test

cbrug

Jun 12, 2025

edited Jun 12, 2025

> Change context 131072, set 4096 for test

OK, reducing the context length does reduce the activation memory.

model weights take 13.17GiB; non_torch_memory takes 1.95GiB; PyTorch activation peak memory takes 1.41GiB; the rest of the memory reserved for KV Cache is 18.92GiB.

So the only solution is to not use it to its full potential? That seems odd.
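
A rough way to compare the two runs is to convert each KV cache size into a token capacity. The sketch below takes the ratio from the first run's error message (2.28GiB held 19264 tokens) and assumes capacity scales linearly with KV cache size. That linearity is an assumption, not something the logs confirm, and `capacity_tokens` is a hypothetical helper name.

```python
# Per-GPU KV cache sizes (GiB) copied from the two logs in this thread,
# keyed by the max model length used for that run.
kv_cache_gib = {131072: 2.28, 4096: 18.92}

# From the first run's error message: 2.28 GiB could store 19264 tokens.
TOKENS_PER_GIB = 19264 / 2.28


def capacity_tokens(kv_gib: float) -> int:
    """Estimate KV cache capacity in tokens.

    ASSUMPTION: capacity grows linearly with the KV cache size.
    """
    return round(kv_gib * TOKENS_PER_GIB)


for max_len, kv_gib in kv_cache_gib.items():
    capacity = capacity_tokens(kv_gib)
    verdict = "fits" if capacity >= max_len else "does not fit"
    print(f"max len {max_len}: ~{capacity} tokens of KV cache, "
          f"a full-length sequence {verdict}")
```

Under that assumption, the KV cache from the 4096 run would be large enough for a 131072-token sequence; what changes between the runs is the activation peak (17.91GiB versus 1.41GiB). Whether an intermediate max length would start successfully is not something these two logs can answer.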

philtimmes

Jun 13, 2025

Set --max-num-seqs to below 8.
4 is good.
1 is better for RAM usage.
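
For the offline Python API, the same limit can be passed as the `max_num_seqs` keyword. This is a minimal sketch rather than a tested configuration: `engine_kwargs` and `build_llm` are hypothetical helper names, and the values simply mirror the ones mentioned in this thread (4 GPUs, a 4096-token test context, 4 concurrent sequences).

```python
def engine_kwargs(max_num_seqs: int = 4, max_model_len: int = 4096) -> dict:
    """Keyword arguments for vllm.LLM; the values are illustrative."""
    return {
        "model": "google/gemma-3-27b-it",
        "tensor_parallel_size": 4,       # one shard per A100, as in the 4-GPU setup
        "max_model_len": max_model_len,  # 4096 is the test value used in this thread
        "max_num_seqs": max_num_seqs,    # Python counterpart of --max-num-seqs
    }


def build_llm(**overrides):
    """Create the engine. Needs vLLM installed and the GPUs available."""
    # Imported lazily so engine_kwargs() can be inspected on any machine.
    from vllm import LLM

    return LLM(**engine_kwargs(**overrides))
```

Calling `build_llm(max_num_seqs=1)` would try the most conservative setting suggested above; comparing the startup logs between settings shows how much activation memory each one frees.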

lkv

Google org Jul 30, 2025

Hi @cbrug, sorry for the late response. You need to explicitly set max_model_len during model initialization to a practical value (e.g., Gemma's standard 8192). Additionally, to use all four of your A100s efficiently, you must enable tensor parallelism.

The code below uses all 4 of your GPUs and manually sets a reasonable max length.

from vllm import LLM
llm = LLM(model="google/gemma-3-27b-it", tensor_parallel_size=4, max_model_len=8192)

Please try it and let us know if you have any concerns; we will assist you. Thank you.
