head_dim in config
The head_dim in the config is 256. How was this calculated?
I thought head_dim was hidden_size / num_attention_heads?
With hidden_size = 640 and num_attention_heads = 4, that would give head_dim = 160.
From the Gemma docs: https://developers.googleblog.com/en/gemma-explained-overview-gemma-model-family-architectures/
They also use this formula:
Head size (2B: 256, 7B: 256)
It refers to the dimensionality of each attention head within the multi-head attention mechanism. It is calculated by dividing the embedding dimension by the number of heads. For example, if the embedding dimension is 2048 and there are 8 heads, then each head would have a size of 256.
I'm running into a dimensionality mismatch between 160 and 256 when building the KV cache.
Hi,
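The problem is the discrepancy between the model's configuration and the value your code derives. The model's config.json dictates the true dimensions. Your 160 assumes head_dim = hidden_size / num_attention_heads, but that relation does not hold for this model: with the hidden_size of 640 and num_attention_heads of 4 that you quote, the config still sets head_dim to 256. head_dim is an explicit parameter in Gemma's configuration and is not derived from a simple division in some of its smaller variants. As a consequence, the query projection has width num_attention_heads * head_dim = 1024, not hidden_size.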
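To make the arithmetic concrete, here is a minimal sketch that prefers an explicit head_dim and only falls back to the division when the config does not provide one. The dictionary holds the values quoted in this thread rather than a downloaded config, and `resolve_head_dim` is a hypothetical helper name, not part of any library.

```python
def resolve_head_dim(config: dict) -> int:
    """Prefer an explicit head_dim; fall back to the division only if it is absent."""
    explicit = config.get("head_dim")
    if explicit is not None:
        return explicit
    return config["hidden_size"] // config["num_attention_heads"]


# Values quoted in this thread for google/gemma-3-270m-it (not read from disk here).
cfg = {"hidden_size": 640, "num_attention_heads": 4, "head_dim": 256}

head_dim = resolve_head_dim(cfg)
# Width of the query projection output: one head_dim-sized slice per attention head.
q_width = cfg["num_attention_heads"] * head_dim
# What the division would have given if head_dim were not explicit.
naive = cfg["hidden_size"] // cfg["num_attention_heads"]

print(head_dim, q_width, naive)  # 256 1024 160
```

The fallback branch reproduces the blog's example: a config with an embedding dimension of 2048, 8 heads, and no explicit head_dim resolves to 256.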
To fix your issue, make sure your code loads the model's configuration and uses its head_dim of 256 when creating or accessing the KV cache, instead of recomputing the value from hidden_size. Use the values exactly as they are defined in the model's config.json to prevent these mismatches.
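As a rough sketch of that fix, the helper below builds the per-layer key (or value) cache shape from config attributes instead of a derived head size. Several things here are assumptions: `kv_cache_shape` is a hypothetical helper, the (batch, kv_heads, seq_len, head_dim) layout is the common convention but your cache code may order dimensions differently, and num_key_value_heads=1 is a placeholder you should confirm in config.json. The `SimpleNamespace` stands in for a loaded config so the snippet runs without downloading the model; with Transformers you would pass the object returned by `AutoConfig.from_pretrained("google/gemma-3-270m-it")` instead.

```python
from types import SimpleNamespace


def kv_cache_shape(config, batch_size: int, seq_len: int) -> tuple:
    """Shape of one layer's key (or value) cache: (batch, kv_heads, seq_len, head_dim)."""
    # Use the explicit head_dim; divide only when the config does not define one.
    head_dim = getattr(config, "head_dim", None)
    if head_dim is None:
        head_dim = config.hidden_size // config.num_attention_heads
    # Configs without grouped-query attention may omit num_key_value_heads.
    kv_heads = getattr(config, "num_key_value_heads", None) or config.num_attention_heads
    return (batch_size, kv_heads, seq_len, head_dim)


# Stand-in for the loaded config. num_key_value_heads=1 is an assumption; check config.json.
cfg = SimpleNamespace(
    hidden_size=640,
    num_attention_heads=4,
    num_key_value_heads=1,
    head_dim=256,
)

shape = kv_cache_shape(cfg, batch_size=1, seq_len=128)
print(shape)  # last dimension is 256, not 160
```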
Thanks.