Problem with quantization conversion (SOLVED)
For anyone else who has this problem!
I haven't loaded the model yet, but I was having trouble converting it to q8 (or any other type) with llama.cpp. It gave me this error:
INFO:hf-to-gguf:Loading model: gemma-3-27b-it
ERROR:hf-to-gguf:Model Gemma3ForConditionalGeneration is not supported
I had already updated everything about 15 hours earlier, but I found I also had to do this:
pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3
After that, I updated everything else too, including llama.cpp:
pip install --upgrade huggingface-hub
pip install --upgrade datasets huggingface-hub
pip install numpy pandas
pip install --upgrade datasets transformers huggingface-hub
python -m venv venv
venv\Scripts\activate
Then the conversion (quantization) started working. For q8_0, I ran it like this:
python convert_hf_to_gguf.py "D:\AI\gemma-3-27b-it" --outfile "C:\ai\llama.cpp\new_model\new.gguf" --outtype q8_0
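If you want to script the conversion command above (for example to repeat it for another output type), a minimal sketch could look like this. It assumes it is run from the llama.cpp folder where `convert_hf_to_gguf.py` lives; the helper names `build_convert_command` and `convert` are made up for illustration and are not part of llama.cpp.

```python
import subprocess
import sys


def build_convert_command(model_dir, outfile, outtype="q8_0",
                          script="convert_hf_to_gguf.py"):
    """Build the same command line as above as an argument list."""
    return [
        sys.executable, script, str(model_dir),
        "--outfile", str(outfile),
        "--outtype", outtype,
    ]


def convert(model_dir, outfile, outtype="q8_0"):
    """Run the converter; raises CalledProcessError if the script exits non-zero."""
    subprocess.run(build_convert_command(model_dir, outfile, outtype), check=True)


# Example (paths taken from the post above):
# convert(r"D:\AI\gemma-3-27b-it", r"C:\ai\llama.cpp\new_model\new.gguf")
```

Passing the arguments as a list avoids quoting problems with Windows paths that contain spaces.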
Hope this helps.
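A note on the "is not supported" error quoted earlier in this post: the name in it (`Gemma3ForConditionalGeneration`) is the architecture the checkpoint declares in its `config.json`. As far as I understand, the converter looks that name up in its own list of supported models, which is why updating llama.cpp matters. The sketch below only reads the declared name from a local checkpoint folder; `read_architectures` is a hypothetical helper, not a llama.cpp or transformers function.

```python
import json
from pathlib import Path


def read_architectures(model_dir):
    """Return the architecture names declared in <model_dir>/config.json."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as fh:
        config = json.load(fh)
    # Hugging Face checkpoints list their model class names under "architectures".
    return config.get("architectures", [])


# Example: print(read_architectures(r"D:\AI\gemma-3-27b-it"))
```

If the name printed here is the one in the error, the converter version in use most likely predates support for that architecture.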
@Renu11, thank you.
Also, I can now confirm that it worked! Right now I'm using it to write Arduino code. I'm running the 8-bit (q8) conversion, and when I give it correct, clear instructions it does a good job at coding, at least in my testing so far. It does hallucinate a bit on my specific code, though. I think that's because its training data is more than a month out of date for that code, which hadn't been properly ported to Arduino yet and only existed in Espressif's libraries.
Also, a big thanks to Google and everybody who was and is involved in this project,
and to whoever else is helping everyone for free.
I'm running into issues using BitsAndBytes for quantization. I keep getting this cryptic CUDA error:
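import torch
from transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration

# 4-bit NF4 quantization with double quantization and bfloat16 compute
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_dir, quantization_config=nf4_config
).eval()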
..... (rest of code)
output = model.generate(**inputs, max_new_tokens=100)
Error:
output = model.generate(**inputs, max_new_tokens=100)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
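CUDA errors like this one are reported asynchronously, so the Python line shown in the traceback is not necessarily the operation that failed. A common first debugging step is to set the `CUDA_LAUNCH_BLOCKING` environment variable before any CUDA work starts, so the traceback points closer to the real failing op. This is a minimal sketch of that step only, not a fix; `enable_sync_cuda_errors` is a hypothetical helper name.

```python
import os


def enable_sync_cuda_errors(env=os.environ):
    """Make CUDA kernel launches synchronous so errors surface at the failing call."""
    # Must be set before the first CUDA call, ideally before importing torch.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


enable_sync_cuda_errors()
# Import torch and load the model only after this point.
```

Device-side asserts are often caused by an out-of-range index (for example a token ID larger than the embedding table), so the more precise traceback is worth reading before changing library versions.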
When I skip BitsAndBytes (not practical for real use, since I only have a single RTX 3090; I did it only for debugging), I get a different error, presumably also from CUDA:
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_dir, device_map='auto', torch_dtype=torch.bfloat16
).eval()
....
output = model.generate(**inputs, max_new_tokens=100)
Error:
output = model.generate(**inputs, max_new_tokens=100)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: cutlassF: no kernel found to launch!
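The `cutlassF: no kernel found to launch!` message comes from PyTorch's memory-efficient scaled-dot-product-attention path when it cannot find a kernel for the inputs it was given. One workaround that may be worth trying is to load the model with `attn_implementation="eager"`, which skips the fused attention kernels. The sketch below mirrors the call above with that one change; whether it resolves the error on this particular setup is an assumption, and `attention_kwargs` and `load_model` are hypothetical helper names.

```python
def attention_kwargs(use_eager=True):
    """Extra from_pretrained kwargs selecting the attention implementation."""
    # Assumption: eager attention avoids the fused SDPA kernel lookup that fails.
    return {"attn_implementation": "eager"} if use_eager else {}


def load_model(model_dir, use_eager=True):
    """Load Gemma 3 as in the post above, optionally with eager attention."""
    # Imported lazily so attention_kwargs can be used without torch installed.
    import torch
    from transformers import Gemma3ForConditionalGeneration

    return Gemma3ForConditionalGeneration.from_pretrained(
        model_dir,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        **attention_kwargs(use_eager),
    ).eval()
```

Eager attention is slower and uses more memory than the fused kernels, so this is mainly useful for confirming where the failure comes from.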
Libraries:
torch 2.6.0+cu124
transformers 4.55.2
triton-windows 3.4.0.post20
accelerate 1.10.0
bitsandbytes 0.47.0
Is there a minimum torch/CUDA requirement to use this model? I'm running CUDA 12.4.
Thanks in advance!
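For anyone answering the version question, the details that usually matter are the CUDA version torch was built against, the GPU's compute capability, and whether bfloat16 is reported as supported. A minimal sketch for collecting them is below; `cuda_report` is a hypothetical helper, and it only reports what PyTorch sees, without deciding whether the setup meets any minimum requirement.

```python
def cuda_report():
    """Collect torch/CUDA facts relevant to the question above."""
    try:
        import torch
    except ImportError:
        return {"torch": None}

    report = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # CUDA version torch was built against
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device"] = torch.cuda.get_device_name(0)
        report["capability"] = torch.cuda.get_device_capability(0)  # (major, minor)
        report["bf16_supported"] = torch.cuda.is_bf16_supported()
    return report


print(cuda_report())
```

Pasting this output alongside the library list makes it easier for others to compare setups.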