Generate unknown output
The model generates garbled output with the setup below.
Python 3.10
bitsandbytes 0.45.2
transformers 4.48.3
CUDA Version: 12.5
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model_id = "/home/models/gemma-2-27b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
)
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
- Output:
<bos>Write me a poem about Machine Learning.At wanton+'/よる hydrophilic modelo Crud remboursement歌词 abogadolicáneas bởi adipis pimientolical PAGER Maggieéranceammegovina行き dintReliabilityこんばんはbosisтяги stencil Erdoğan andindu">{{$
This seems similar to https://huggingface.co/google/gemma-2-27b-it/discussions/32
I had the same issue, and adding torch_dtype=torch.bfloat16 helped. In your case, the model-loading call needs to be changed to:
import torch  # needed for torch.bfloat16

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,  # Missing this was the culprit
)
Hi @raminh921, please update the bitsandbytes example to load the model with torch_dtype=torch.bfloat16. I have tested this and reproduced the issue. Please refer to this gist file for reference. If you have any concerns, let me know and I will assist you.
Thank you.
Thanks for the help.
I used a V100 for this script. Later, I found that the V100 does not support bfloat16, so it tried to simulate bfloat16 with float32, which caused some problems.
I tried an A100 and it works correctly.
Best
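For anyone hitting the same problem, here is a minimal sketch of a pre-flight check for the hardware difference described above. It assumes that native bfloat16 support corresponds to CUDA compute capability 8.0 or higher (the A100 is 8.0, the V100 is 7.0). The helper names `supports_native_bf16` and `check_current_gpu` are hypothetical and not part of any library. The sketch only reports the situation and does not suggest an alternative dtype, since this thread does not establish one that works for this model.

```python
from typing import Tuple


def supports_native_bf16(capability: Tuple[int, int]) -> bool:
    """Return True if a GPU with this (major, minor) compute capability
    is assumed to support bfloat16 natively (Ampere, 8.0, or newer)."""
    major, _minor = capability
    return major >= 8


def check_current_gpu() -> bool:
    """Check the first visible CUDA device (hypothetical helper).

    torch is imported lazily so supports_native_bf16 can be used
    on a machine without PyTorch installed.
    """
    import torch

    if not torch.cuda.is_available():
        print("No CUDA device found.")
        return False

    capability = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    ok = supports_native_bf16(capability)
    if ok:
        print(f"{name} (compute capability {capability}): native bfloat16 expected.")
    else:
        print(
            f"{name} (compute capability {capability}): no native bfloat16; "
            "torch_dtype=torch.bfloat16 may not behave as expected here."
        )
    return ok
```

Calling `check_current_gpu()` before `from_pretrained` would flag a V100 up front instead of after a long model load.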
Hi @raminh921, could you please confirm whether the issue is resolved? If so, feel free to close this discussion. If you have any concerns, let us know and we will assist you. Thank you.