Too much VRAM in vLLM
I'm trying to deploy the Gemma model on 4 A100 (40GB) GPUs.
This should be more than enough for the model, but it goes OOM during startup.
This is the output for a single GPU (the other 3 report more or less the same).
the current vLLM instance can use total_gpu_memory (39.38GiB) x gpu_memory_utilization (0.90) = 35.44GiB
model weights take 13.17GiB; non_torch_memory takes 2.09GiB; PyTorch activation peak memory takes 17.91GiB; the rest of the memory reserved for KV Cache is 2.28GiB.
ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (19264).
I don't understand why it takes up so much space. It should need much less, leaving more than enough room to use the model's max length.
The cause might be the PyTorch activation peak memory of 18GB, which is unusually high. Any advice?
Libraries
accelerate 1.7.0
torch 2.7.0
torchaudio 2.7.0
torchvision 0.22.0
transformers 4.52.4
vllm 0.9.1
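As a sanity check on the log above, here is a minimal sketch of the per-GPU memory budget. It assumes the KV cache simply gets whatever is left after weights, non-torch memory, and the activation peak are subtracted, which is what the log wording suggests. The function name is made up for illustration; the numbers are copied from the log.

```python
# Figures copied from the vLLM log above (GiB, for a single GPU).
total_gpu_memory = 39.38
gpu_memory_utilization = 0.90
model_weights = 13.17
non_torch_memory = 2.09
activation_peak = 17.91


def kv_cache_budget(total, utilization, weights, non_torch, activation):
    """Hypothetical helper: memory left for the KV cache after everything else."""
    usable = total * utilization
    return usable - weights - non_torch - activation


usable = total_gpu_memory * gpu_memory_utilization
kv = kv_cache_budget(
    total_gpu_memory,
    gpu_memory_utilization,
    model_weights,
    non_torch_memory,
    activation_peak,
)

print(f"usable memory:   {usable:.2f} GiB")
print(f"KV cache budget: {kv:.2f} GiB")
```

The result lands within about 0.01 GiB of the 2.28GiB in the log (the logged inputs are rounded), so the activation peak really is what squeezes the KV cache here.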
Reduce the context length from 131072 to 4096 as a test.
OK, reducing the context length does reduce the activation memory.
model weights take 13.17GiB; non_torch_memory takes 1.95GiB; PyTorch activation peak memory takes 1.41GiB; the rest of the memory reserved for KV Cache is 18.92GiB.
So the only solution is to not use the model to its full potential? That seems odd.
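To make the difference between the two runs easier to see, here is a small sketch that parses the two memory lines quoted in this thread and prints the change per category. The regular expression is written against the exact wording of those two lines and is an assumption about the log format, which may differ in other vLLM versions.

```python
import re

# Memory lines copied from the two runs quoted in this thread.
RUN_FULL_CONTEXT = (
    "model weights take 13.17GiB; non_torch_memory takes 2.09GiB; "
    "PyTorch activation peak memory takes 17.91GiB; "
    "the rest of the memory reserved for KV Cache is 2.28GiB."
)
RUN_4096_CONTEXT = (
    "model weights take 13.17GiB; non_torch_memory takes 1.95GiB; "
    "PyTorch activation peak memory takes 1.41GiB; "
    "the rest of the memory reserved for KV Cache is 18.92GiB."
)

# Assumed log format, matched against the wording of the lines above.
PATTERN = re.compile(
    r"model weights take (?P<weights>[\d.]+)GiB; "
    r"non_torch_memory takes (?P<non_torch>[\d.]+)GiB; "
    r"PyTorch activation peak memory takes (?P<activation>[\d.]+)GiB; "
    r"the rest of the memory reserved for KV Cache is (?P<kv_cache>[\d.]+)GiB"
)


def parse_memory_line(line):
    """Return the four GiB figures from one memory line as floats."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("line does not match the expected log format")
    return {name: float(value) for name, value in match.groupdict().items()}


full = parse_memory_line(RUN_FULL_CONTEXT)
short = parse_memory_line(RUN_4096_CONTEXT)

for name in full:
    delta = short[name] - full[name]
    print(f"{name:>10}: {full[name]:6.2f} -> {short[name]:6.2f} GiB ({delta:+.2f})")
```

The weights stay the same, while nearly all of the memory freed from the activation peak shows up as extra KV cache.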
Set --max-num-seqs to below 8.
4 is good.
1 is better for RAM usage.
Hi @cbrug, sorry for the late response. You need to explicitly set max_model_len during model initialization to a practical value (e.g., Gemma's standard 8192). Additionally, to use all four of your A100s efficiently, you must enable tensor parallelism.
Please find the code below, which uses all 4 of your GPUs and manually sets a reasonable max length.
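from vllm import LLM

model_name = "google/gemma-3-27b-it"  # assumed: the model this thread is about

llm = LLM(
    model=model_name,
    tensor_parallel_size=4,  # shard the model across all 4 GPUs
    max_model_len=8192,      # cap the context length instead of the 131072 default
)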
Please give it a try and let us know if you have any concerns; we will be happy to assist. Thank you.
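For anyone running the server from the command line rather than the Python API, the suggestions in this thread could be combined roughly as follows. This is only a sketch: the helper name is made up, the model id is assumed from the model page, and the values (4 GPUs, 8192 context, 4 sequences) are just the ones mentioned in the replies above. It builds and prints the command without starting vLLM.

```python
import shlex


def build_vllm_serve_command(model, tensor_parallel_size, max_model_len, max_num_seqs):
    """Hypothetical helper: assemble a `vllm serve` command as an argument list."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),  # split across GPUs
        "--max-model-len", str(max_model_len),                # cap the context length
        "--max-num-seqs", str(max_num_seqs),                  # limit concurrent sequences
    ]


command = build_vllm_serve_command(
    model="google/gemma-3-27b-it",  # assumed model id
    tensor_parallel_size=4,
    max_model_len=8192,
    max_num_seqs=4,
)

# Print a copy-pasteable shell command; nothing is executed here.
print(shlex.join(command))
```

If memory allows, max_model_len can then be raised step by step while watching the KV cache line in the startup log.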