Generate unknown output

#42

by raminh921 - opened Feb 16, 2025

Discussion

raminh921

Feb 16, 2025 (edited Feb 16, 2025)

The model generates unintelligible output.

python 3.10
bitsandbytes 0.45.2
transformers 4.48.3
CUDA Version: 12.5

# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)


model_id = "/home/models/gemma-2-27b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))

Output:

<bos>Write me a poem about Machine Learning.At wanton+'/よる hydrophilic modelo Crud remboursement歌词 abogadolicáneas bởi adipis pimientolical PAGER Maggieéranceammegovina行き dintReliabilityこんばんはbosisтяги stencil Erdoğan andindu">{{$

avoroshilov

Apr 13, 2025

I had the same issue, and adding torch_dtype=torch.bfloat16 helped. In your case, the loading code needs to be changed to:

import torch  # needed for torch.bfloat16

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,  # Missing this was the culprit
)

lkv

Google org May 14, 2025

Hi @raminh921, please update the bitsandbytes example to load the model with torch_dtype=torch.bfloat16. I have tested this and reproduced the issue. Please refer to this gist file for reference. If you have any concerns, let me know and I will assist you.

Thank you.
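The gist is not reproduced in this thread, so the following is only a sketch of what the combined script might look like: the original poster's code with the suggested torch_dtype=torch.bfloat16 added. The function name `generate` and its parameters are illustrative and not taken from the gist. Running it assumes a GPU with native bfloat16 support and enough memory for the 8-bit model.

```python
def generate(model_id: str, prompt: str, max_new_tokens: int = 32) -> str:
    """Load the model in 8-bit with bfloat16 and return the decoded generation."""
    # Heavy imports live inside the function so that merely defining it is cheap.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        torch_dtype=torch.bfloat16,  # the change suggested in this thread
    )

    # Move the tokenized prompt to wherever the quantized model was placed.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

A call would look like `print(generate("/home/models/gemma-2-27b-it", "Write me a poem about Machine Learning."))`, using the local path from the original post.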

raminh921

May 20, 2025

Thanks for the help.
I ran this script on a V100. Later, I found that the V100 does not support bfloat16, so it tried to simulate bfloat16 with float32, which caused some problems.
I tried an A100 and it works correctly.
Best
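As a rough illustration of the hardware point above, the sketch below picks a dtype from the GPU's CUDA compute capability. Native bfloat16 arrived with Ampere (compute capability 8.0, e.g. A100), while the V100 is 7.0. The helper names are hypothetical, and falling back to float32 on older GPUs is an assumption that was not tested in this thread.

```python
from typing import Tuple

# Compute capability at which NVIDIA GPUs gained native bfloat16 support (Ampere).
NATIVE_BF16_CAPABILITY = (8, 0)


def has_native_bf16(capability: Tuple[int, int]) -> bool:
    """Return True if a GPU with this (major, minor) capability has native bfloat16."""
    return tuple(capability) >= NATIVE_BF16_CAPABILITY


def pick_dtype_name(capability: Tuple[int, int]) -> str:
    """Choose a dtype name; the float32 fallback is an untested assumption."""
    return "bfloat16" if has_native_bf16(capability) else "float32"


def dtype_for_current_gpu():
    """Map the current CUDA device to a torch dtype (requires torch and a GPU)."""
    import torch  # imported lazily so the helpers above work without torch

    capability = torch.cuda.get_device_capability()  # (7, 0) on V100, (8, 0) on A100
    return getattr(torch, pick_dtype_name(capability))
```

The result of `dtype_for_current_gpu()` could then be passed as `torch_dtype` when loading the model, instead of hard-coding `torch.bfloat16`.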

lkv

Google org Jun 27, 2025

Hi @raminh921, could you please confirm whether the issue is resolved? If it is, feel free to close this discussion. If you have any concerns, let us know and we will assist you. Thank you.

xujfcn

Feb 24

Great discussion! For anyone wanting to quickly test this, Crazyrouter offers API access to this model. No infrastructure setup needed — just an API key and the standard OpenAI SDK.

xujfcn

Feb 26

If you are building with frameworks like LangChain, AutoGen, or CrewAI — this model works seamlessly through OpenAI-compatible APIs. No special adapters needed.

I documented the integration patterns here: Framework Guide
