Instructions to use google/gemma-3-27b-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use google/gemma-3-27b-it with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-3-27b-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("google/gemma-3-27b-it")
model = AutoModelForImageTextToText.from_pretrained("google/gemma-3-27b-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
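
The slice in the last line is there because, for a decoder-only model, `generate` returns the prompt tokens followed by the newly generated ones. A minimal sketch of that slicing, using plain Python lists with made-up token ids in place of real tensors (the helper name `new_tokens_only` is hypothetical):

```python
def new_tokens_only(output_ids, prompt_length):
    """Drop the first prompt_length ids and keep only the continuation."""
    return output_ids[prompt_length:]


# Made-up ids standing in for a tokenized prompt.
prompt_ids = [2, 105, 2364, 107]
# Stand-in for what generate() returns: the prompt, then the new tokens.
output_ids = prompt_ids + [9259, 236888, 106]

print(new_tokens_only(output_ids, len(prompt_ids)))  # [9259, 236888, 106]
```

With real tensors, `inputs["input_ids"].shape[-1]` plays the role of `len(prompt_ids)`.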

Local Apps

vLLM

How to use google/gemma-3-27b-it with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "google/gemma-3-27b-it"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-3-27b-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
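
The same request can also be sent from Python. The sketch below uses only the standard library and mirrors the JSON body of the curl call above. The function names `build_chat_request` and `send` are hypothetical, and it assumes the server is already running at the base URL you pass in.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, image_url):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """POST the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example (requires a running server, so it is left commented out):
# request = build_chat_request(
#     "http://localhost:8000",
#     "google/gemma-3-27b-it",
#     "Describe this image in one sentence.",
#     "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
# )
# reply = send(request)
# print(reply["choices"][0]["message"]["content"])
```

The last commented line assumes the server answers in the OpenAI chat completions response format. For the SGLang server shown further down, the same sketch should work with the base URL changed to `http://localhost:30000`.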

Use Docker

docker model run hf.co/google/gemma-3-27b-it

SGLang

How to use google/gemma-3-27b-it with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "google/gemma-3-27b-it" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-3-27b-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "google/gemma-3-27b-it" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use google/gemma-3-27b-it with Docker Model Runner:
```
docker model run hf.co/google/gemma-3-27b-it
```

Problem with quantization conversion. "SOLVED"

#15

by josef2600 - opened Mar 13, 2025

Discussion

josef2600

Mar 13, 2025

edited Mar 13, 2025

For anyone who has this problem too!
I haven't loaded the models yet, but I was having trouble converting them to q8 (or anything else) via "llama.cpp". It gave me this error:
INFO:hf-to-gguf:Loading model: gemma-3-27b-it
ERROR:hf-to-gguf:Model Gemma3ForConditionalGeneration is not supported
I had updated them around 15 hours ago, but I found I had to do this too:
pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3
After that, I updated everything else too, including llama.cpp:
pip install --upgrade huggingface-hub
pip install --upgrade datasets huggingface-hub
pip install numpy pandas
pip install --upgrade datasets transformers huggingface-hub
python -m venv venv
venv\Scripts\activate

Then the conversion (quantization!) started. For q8_0, it looks like this:

python convert_hf_to_gguf.py "D:\AI\gemma-3-27b-it" --outfile "C:\ai\llama.cpp\new_model\new.gguf" --outtype q8_0

Hope this is helpful.
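
A rough way to check for this situation before starting a long conversion is sketched below. It assumes the class name in the error message comes from the `architectures` field of the model's `config.json`, and it uses a crude heuristic: if the local `convert_hf_to_gguf.py` source never mentions that class name, the llama.cpp checkout is probably too old. All three helper names are hypothetical, and the text search is not how llama.cpp itself decides support.

```python
import json
from pathlib import Path


def read_architectures(model_dir):
    """Return the "architectures" list from a Hugging Face config.json."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config.get("architectures", [])


def script_mentions(script_path, architecture):
    """Heuristic: does the conversion script's source mention this class name?"""
    source = Path(script_path).read_text(encoding="utf-8", errors="replace")
    return architecture in source


def build_convert_command(script_path, model_dir, outfile, outtype="q8_0"):
    """Assemble the command line from the post as an argument list."""
    return [
        "python", str(script_path), str(model_dir),
        "--outfile", str(outfile),
        "--outtype", outtype,
    ]


# Example with the paths from the post (left commented out):
# import subprocess
# model_dir = r"D:\AI\gemma-3-27b-it"
# script = "convert_hf_to_gguf.py"
# if all(script_mentions(script, arch) for arch in read_architectures(model_dir)):
#     subprocess.run(
#         build_convert_command(script, model_dir, r"C:\ai\llama.cpp\new_model\new.gguf"),
#         check=True,
#     )
# else:
#     print("convert_hf_to_gguf.py does not mention this architecture; update llama.cpp first")
```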

Renu11

Google org Mar 13, 2025

Thanks for the confirmation, @josef2600.

josef2600

Mar 13, 2025

@Renu11, thank you.
Also, I can now confirm that it worked! Right now I am using it to write code for Arduino. I am running the version converted to 8-bit (q8), and if I give it correct and clear instructions, it does a good job at coding, at least so far in testing. It does hallucinate a bit on my specific code, though. I think that is because its training data is more than a month older than that code, which hadn't been properly ported to Arduino yet but was already in Espressif's libraries.
Also, a big thanks to Google and everybody who was and is involved in this project,
and to whoever else is helping everyone for free.

nispa

Mar 14, 2025

This comment has been hidden (marked as Resolved)

deathknight0

Aug 17, 2025

edited Aug 17, 2025

I'm running into issues using BitsAndBytes for quantization. I keep getting this cryptic CUDA error:
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_dir,
    quantization_config=nf4_config
).eval()

..... (rest of code)
output = model.generate(**inputs, max_new_tokens=100)

Error:
output = model.generate(**inputs, max_new_tokens=100)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

When I don't use BitsAndBytes (I can't actually do this to practically use the model as I only have a single RTX 3090; I just did it for debugging), I get this (presumably) CUDA error:
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_dir,
    device_map='auto',
    torch_dtype=torch.bfloat16
).eval()
....
output = model.generate(**inputs, max_new_tokens=100)

Error:
output = model.generate(**inputs, max_new_tokens=100)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: cutlassF: no kernel found to launch!

Libraries:
torch 2.6.0+cu124
transformers 4.55.2
triton-windows 3.4.0.post20
accelerate 1.10.0
bitsandbytes 0.47.0

Is there a minimum torch/CUDA requirement to use this model? I'm running CUDA 12.4.

Thanks in advance!
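
For anyone trying to reproduce a report like this, a small script along the following lines can collect the same version list. It is only a sketch: the package tuple is taken from the list above, the helper names are hypothetical, and setting `CUDA_LAUNCH_BLOCKING=1` is a general PyTorch debugging aid that makes a device-side assert surface at the call that caused it, not a confirmed fix for either error.

```python
import os
from importlib import metadata

# Make CUDA kernel launches synchronous so the Python traceback points at the
# failing call. This has to be set before CUDA is initialized, so it sits at
# the top of the script, before any "import torch".
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

# Packages from the version list in the post.
PACKAGES = ("torch", "transformers", "accelerate", "bitsandbytes")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def format_report(versions):
    """Render the versions one per line, in the style used in the post."""
    return "\n".join(
        f"{name} {version or 'not installed'}" for name, version in versions.items()
    )


if __name__ == "__main__":
    print(format_report(installed_versions()))
```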
