Instructions to use google/gemma-7b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use google/gemma-7b with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="google/gemma-7b")
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
- llama-cpp-python
How to use google/gemma-7b with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama
llm = Llama.from_pretrained(
    repo_id="google/gemma-7b",
    filename="gemma-7b.gguf",
)
output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True
)
print(output)
- Inference
- Local Apps Settings
- llama.cpp
How to use google/gemma-7b with llama.cpp:
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf google/gemma-7b
# Run inference directly in the terminal:
llama cli -hf google/gemma-7b
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf google/gemma-7b
# Run inference directly in the terminal:
llama cli -hf google/gemma-7b
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf google/gemma-7b
# Run inference directly in the terminal:
./llama-cli -hf google/gemma-7b
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf google/gemma-7b
# Run inference directly in the terminal:
./build/bin/llama-cli -hf google/gemma-7b
Use Docker
docker model run hf.co/google/gemma-7b
- LM Studio
- Jan
- vLLM
How to use google/gemma-7b with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "google/gemma-7b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "google/gemma-7b",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
Use Docker
docker model run hf.co/google/gemma-7b
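The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library, assuming a vLLM server is already listening on `localhost:8000` as in the serve command; the helper names `build_request` and `complete` are hypothetical and chosen only for illustration.

```python
import json
import urllib.request

# Assumption: the server was started with `vllm serve "google/gemma-7b"`.
BASE_URL = "http://localhost:8000/v1/completions"


def build_request(prompt: str, max_tokens: int = 512,
                  temperature: float = 0.5) -> urllib.request.Request:
    """Build the same POST request that the curl example sends."""
    payload = {
        "model": "google/gemma-7b",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the request and return the first completion's text."""
    with urllib.request.urlopen(build_request(prompt)) as response:
        body = json.loads(response.read().decode("utf-8"))
    # OpenAI-compatible completions responses carry the text in choices[i].text.
    return body["choices"][0]["text"]
```

With the server running, `print(complete("Once upon a time,"))` should print the generated continuation. For the SGLang server shown further down, the same sketch should work after changing the port in `BASE_URL` to 30000.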
- SGLang
How to use google/gemma-7b with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "google/gemma-7b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "google/gemma-7b",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "google/gemma-7b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "google/gemma-7b",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
- Ollama
How to use google/gemma-7b with Ollama:
ollama run hf.co/google/gemma-7b
- Unsloth Studio
How to use google/gemma-7b with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for google/gemma-7b to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for google/gemma-7b to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for google/gemma-7b to start chatting
- Atomic Chat
- Docker Model Runner
How to use google/gemma-7b with Docker Model Runner:
docker model run hf.co/google/gemma-7b
- Lemonade
How to use google/gemma-7b with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull google/gemma-7b
Run and chat with the model
lemonade run user.gemma-7b-{{QUANT_TAG}}
List all available models
lemonade list
Don't download, Google scuttled this model
By comparing 5 implementations, I found the following issues:
Must add <bos> or else losses will be very high.
There’s a typo for model in the technical report!
sqrt(3072)=55.4256 but bfloat16 is 55.5.
Layernorm (w+1) must be in float32.
Keras mixed_bfloat16 RoPE is wrong.
RoPE is sensitive to y*(1/x) vs y/x.
RoPE should be float32 - already pushed to transformers 4.38.2.
GELU should be approx tanh not exact.
Directly from Reddit:
[P] How I found 8 bugs in Google's Gemma 6T token model
Hey r/MachineLearning! Maybe you might have seen me post on Twitter, but I'll just post here in case you don't know about the 8 bugs in multiple implementations of Google's Gemma :) The fixes should already be pushed into HF's transformers main branch, and Keras, PyTorch Gemma, and vLLM should have gotten the fix :) https://github.com/huggingface/transformers/pull/29402 I run an OSS package called Unsloth which also makes Gemma finetuning 2.5x faster and uses 70% less VRAM :)
By comparing 5 implementations, I found the following issues:
Must add <bos> or else losses will be very high.
There’s a typo for model in the technical report!
sqrt(3072)=55.4256 but bfloat16 is 55.5.
Layernorm (w+1) must be in float32.
Keras mixed_bfloat16 RoPE is wrong.
RoPE is sensitive to y*(1/x) vs y/x.
RoPE should be float32 - already pushed to transformers 4.38.2.
GELU should be approx tanh not exact.
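The `sqrt(3072)` item in the list can be checked without any ML framework. The following is a minimal sketch, assuming the usual bfloat16 layout (the upper 16 bits of a float32) and round-to-nearest-even; `to_bfloat16` is a hypothetical helper written for illustration and is not taken from any Gemma implementation.

```python
import math
import struct


def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value (hypothetical helper).

    Handles finite inputs only; NaN and infinity are not treated specially.
    """
    # Encode as float32 and reinterpret the bytes as an unsigned 32-bit int.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the upper 16 bits; the bias gives round-to-nearest-even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", rounded))[0]


hidden_size = 3072  # the value quoted in the list above
exact = math.sqrt(hidden_size)
print(exact)               # about 55.4256
print(to_bfloat16(exact))  # 55.5
```

Between 32 and 64, bfloat16 values are spaced 0.25 apart, so 55.4256 lands on 55.5 rather than on anything closer.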
Adding all these changes allows the Log L2 Norm to decrease from the red line to the black line (lower is better). Remember this is Log scale! So the error decreased from 10_000 to 100 now - a factor of 100! The fixes are primarily for long sequence lengths.
The most glaring one was that adding BOS tokens to finetuning runs tames the training loss at the start. No BOS causes losses to become very high.
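A minimal sketch of that BOS fix at the token-id level follows. The helper `ensure_bos` is hypothetical, the example ids are made up, and the default BOS id of 2 is an assumption about the Gemma tokenizer; with Transformers it is safer to read `tokenizer.bos_token_id` than to hard-code it.

```python
from typing import List

GEMMA_BOS_ID = 2  # assumption: confirm with tokenizer.bos_token_id


def ensure_bos(token_ids: List[int], bos_id: int = GEMMA_BOS_ID) -> List[int]:
    """Return a copy of token_ids that starts with exactly one BOS token."""
    if token_ids and token_ids[0] == bos_id:
        # Already starts with BOS; do not add a second one.
        return list(token_ids)
    return [bos_id] + list(token_ids)


example = [1596, 603, 476, 2121]  # made-up token ids, for illustration only
print(ensure_bos(example))              # [2, 1596, 603, 476, 2121]
print(ensure_bos(ensure_bos(example)))  # same list; the call is idempotent
```

The guard against a second BOS matters when a chat template or the tokenizer has already added one, since a doubled BOS is a different input from the one the model was trained on.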
Another very problematic issue was that RoPE embeddings were computed in bfloat16 rather than float32. This ruined very long context lengths, since the positions [8190, 8191] were both rounded to [8192, 8192]. This destroyed finetunes on very long sequence lengths.
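The position collapse can be reproduced in plain Python. This is a sketch under the assumption that bfloat16 is the upper 16 bits of a float32 with round-to-nearest-even; `to_bfloat16` is a hypothetical helper, and the 8192 context length is taken from the positions quoted above.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round a finite x to the nearest bfloat16 value (hypothetical helper)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even bias
    rounded = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", rounded))[0]


# The last two positions of an 8192-token context.
print([to_bfloat16(float(p)) for p in (8190, 8191)])  # [8192.0, 8192.0]

# Count how many of the 8192 position indices stay distinct in bfloat16.
distinct = {to_bfloat16(float(p)) for p in range(8192)}
print(len(distinct), "distinct values out of 8192 positions")
```

Near 8192 the bfloat16 grid spacing is 32, so many neighbouring positions map to the same rotary angle, which is why keeping the position indices in float32 matters for long sequences.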
Another major issue was that nearly all implementations except the JAX-type ones used exact GELU, whilst approx (tanh) GELU is the correct choice.
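For reference, the two GELU variants can be compared side by side. This sketch uses the standard textbook formulas in pure Python; it is not lifted from any of the Gemma implementations, and the function names are chosen for illustration.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Print both variants and their absolute difference at a few inputs.
for x in (-3.0, -1.0, 0.5, 1.0, 3.0):
    exact, approx = gelu_exact(x), gelu_tanh(x)
    print(f"x={x:+.1f} exact={exact:.6f} tanh={approx:.6f} "
          f"diff={abs(exact - approx):.2e}")
```

The per-activation difference is small, but it is applied in every MLP layer, so using a different variant from the one the model was trained with can still shift the outputs. In PyTorch the tanh variant is selected with `torch.nn.functional.gelu(x, approximate="tanh")`.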
I also have a Twitter thread on the fixes: https://twitter.com/danielhanchen/status/1765446273661075609, and a full Colab notebook walking through more issues: https://colab.research.google.com/drive/1fxDWAfPIbC-bHwDSVj5SBmEJ6KG3bUu5?usp=sharing Also a longer blog post: https://unsloth.ai/blog/gemma-bugs
I also made Gemma finetuning 2.5x faster, use 60% less VRAM as well in a colab notebook: https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing There's also a $50K Kaggle competition https://www.kaggle.com/competitions/data-assistants-with-gemma specifically for Gemma :)
Hello! Surya from the Gemma team here -- I've been in touch with Daniel over the last few weeks, and he's done really excellent work in finding these! We've been pushing fixes to all the issues he's found and hopefully it should be much more stable to finetune Gemma models now. For context, the issues were inconsistencies between the different Jax / PyTorch / Flax / Keras implementations that impacted finetuning performance in particular. Note that most of these don't affect regular inference-only workloads from our finetuned models.
We'll do our best to make sure that all the different implementations and notebooks we put out are consistent and work out of the box -- if you or others find more issues, please let us know! We know things won't always be perfect, but we are very eager to improve the models and engage with folk about how to make them more useful :)
So it's OK to try and use this one from this page now?
Yes, please give it a try! It'll also depend on how you're using them (are you interested in your finetuning, just generations, what frameworks, etc.)?
@suryabhupa if I simply want to benchmark the model on EleutherAI's LM Evaluation Harness, should the model be good to go?
yes, I believe so! If you find that there are strange generations, please let us know.
All of our internal dev has been on TPU so we don't have any examples of PyTorch FSDP unfortunately :( There might be other forks or GitHub repos where folks try this, or other model derivatives from Gemma; I would suggest trying those.
I saw this: https://huggingface.co/google/gemma-7b/blob/main/examples/example_fsdp.py, but not sure if it was what you're looking for
I can't make it work with llama.cpp. I tried everything, and at best the model talks to itself :(
I know. I already joined it some time ago.