Instructions to use google/gemma-3-270m-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use google/gemma-3-270m-it with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="google/gemma-3-270m-it")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m-it")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use google/gemma-3-270m-it with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "google/gemma-3-270m-it"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-3-270m-it",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/google/gemma-3-270m-it

SGLang

How to use google/gemma-3-270m-it with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "google/gemma-3-270m-it" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-3-270m-it",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "google/gemma-3-270m-it" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-3-270m-it",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use google/gemma-3-270m-it with Docker Model Runner:
```
docker model run hf.co/google/gemma-3-270m-it
```

head_dim in config

by stevenshinechen - opened Aug 28, 2025

Discussion

stevenshinechen

Aug 28, 2025

The head_dim in the config is 256, how was this calculated?

I thought the head_dim was hidden_size / num_attention_heads?
hidden_size = 640, num_attention_heads = 4 so head_dim = 160?

From the Gemma docs: https://developers.googleblog.com/en/gemma-explained-overview-gemma-model-family-architectures/
They also use this formula:
Head size (2B: 256, 7B: 256)

It refers to the dimensionality of each attention head within the multi-head attention mechanism. It is calculated by dividing the embedding dimension by the number of heads. For example, if the embedding dimension is 2048 and there are 8 heads, then each head would have a size of 256.

I'm running into issues with dimensionality mismatch between 160 and 256 for KV cache

BalakrishnaCh

Google org Sep 23, 2025

Hi,

The problem is the discrepancy between the model's configuration and the values your code is using. The model's config.json file dictates the true dimensions. Your calculation of 160 is based on an incorrect hidden_size value of 640. The head_dim of 256 is an explicit parameter in Gemma's configuration and is not derived from a simple division in some of its smaller variants.

To fix your issue, you need to ensure your code is correctly loading the model's configuration and using the official head_dim of 256 when creating or accessing the KV cache. You must use the values exactly as they are defined in the model's config.json to prevent these mismatches.

Thanks.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment