Instructions for using google/gemma-3-1b-it with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use google/gemma-3-1b-it with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="google/gemma-3-1b-it")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```
```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-it")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use google/gemma-3-1b-it with vLLM:
Install with pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "google/gemma-3-1b-it"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "google/gemma-3-1b-it",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
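The same endpoint can be called from Python as well. A minimal sketch using the `requests` package, assuming the `vllm serve` command above is still running on localhost:8000:

```python
# Query the vLLM server's OpenAI-compatible chat endpoint,
# mirroring the curl call above.
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "google/gemma-3-1b-it",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
    },
)
# The generated reply is in the first choice's message content.
print(response.json()["choices"][0]["message"]["content"])
```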
Use Docker
```shell
docker model run hf.co/google/gemma-3-1b-it
```
- SGLang
How to use google/gemma-3-1b-it with SGLang:
Install with pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "google/gemma-3-1b-it" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "google/gemma-3-1b-it",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
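Because the server is OpenAI-compatible, the official `openai` Python client also works against it. A minimal streaming sketch, assuming the launch command above is serving on localhost:30000 (the api_key value is a placeholder; local servers generally ignore it):

```python
# Stream a chat completion from the local SGLang server via the OpenAI client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="google/gemma-3-1b-it",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stream=True,  # yield incremental chunks instead of one final message
)
for chunk in stream:
    # Each chunk carries an incremental piece of the reply.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```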
Use Docker images
```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model-path "google/gemma-3-1b-it" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "google/gemma-3-1b-it",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
- Docker Model Runner
How to use google/gemma-3-1b-it with Docker Model Runner:
```shell
docker model run hf.co/google/gemma-3-1b-it
```
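Docker Model Runner also exposes an OpenAI-compatible API, so the pulled model can be queried programmatically. A minimal sketch with the `openai` Python client; note that the base URL below (host port 12434, path /engines/v1) and the need to enable the TCP endpoint are assumptions, so check the Docker Model Runner documentation for your setup:

```python
# Query Docker Model Runner's OpenAI-compatible endpoint.
# ASSUMPTION: the host-side TCP endpoint is enabled and listens at
# localhost:12434 under /engines/v1; verify against the Docker docs.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:12434/engines/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="hf.co/google/gemma-3-1b-it",  # same identifier used with `docker model run`
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```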
Remove development branch of transformers
The development branch of transformers for Gemma 3 (v4.49.0-Gemma-3) is no longer required, and in practice that branch is quite buggy. I suggest removing the branch requirement.
It actually took me a long time to weed through the bugs before realizing that the main branch of transformers already has Gemma 3 in working condition.
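For reference, dropping the dev branch is just a reinstall; a minimal sketch (the git URL assumes the standard huggingface/transformers GitHub repository, and a plain upgrade works once Gemma 3 support is in a stable release):

```shell
# Remove the v4.49.0-Gemma-3 dev-branch install:
pip uninstall -y transformers
# Install the current main branch from source (assumes the standard GitHub repo):
pip install git+https://github.com/huggingface/transformers
# ...or, once a stable release includes Gemma 3, upgrade normally:
# pip install -U transformers
```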
How were you able to use Gemma 3 with the main branch of transformers?
Actually, maybe not completely. I'm not interested in the image modality, so I haven't explored that, but for text I could get both the 1B and the 4B models working.
Even the text part causes trouble for me on the Gemma 3 branch (not to mention the bugs it causes for other models).
This is a bit hacky because 1) the tokenizer was not being loaded and 2) I had trouble with the image-text-to-text pipeline, but the normal text-generation pipeline works:
```python
import torch
import transformers
use_4b_model = True

# Workaround 1): the tokenizer was not being loaded automatically,
# so attach it to the pipeline explicitly.
pipe = transformers.pipeline("text-generation", model="google/gemma-3-1b-it", device="cuda", torch_dtype=torch.bfloat16)
pipe.tokenizer = transformers.AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

if use_4b_model:
    # Workaround 2): the image-text-to-text pipeline was problematic, so load
    # the full 4B checkpoint and swap only its language model into the pipeline.
    model = transformers.AutoModelForPreTraining.from_pretrained("google/gemma-3-4b-it", device_map="cuda", torch_dtype=torch.bfloat16)
    pipe.model = model.language_model
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a poem on Hugging Face, the company"}
]
output = pipe(messages, max_new_tokens=50)
print(output[0]["generated_text"][-1]['content'])
# outputs:
# Okay, here's a poem about Hugging Face, aiming to capture its spirit and impact:
#
# **The Open Embrace**
#
# In realms of code, a vibrant hue,
# Hugging Face emerges, fresh and new.
# Not just a
```