
Don't download, Google scuttled this model

#77

by Tom-Neverwinter - opened Mar 21, 2024

Discussion

Tom-Neverwinter

Mar 21, 2024

•

edited Mar 21, 2024

By comparing 5 implementations, I found the following issues:

1. Must add <BOS> or else losses will be very high.
2. There’s a typo for model in the technical report!
3. sqrt(3072)=55.4256 but bfloat16 is 55.5.
4. Layernorm (w+1) must be in float32.
5. Keras mixed_bfloat16 RoPE is wrong.
6. RoPE is sensitive to y*(1/x) vs y/x.
7. RoPE should be float32 - already pushed to transformers 4.38.2.
8. GELU should be approx tanh not exact.

https://www.reddit.com/r/MachineLearning/comments/1bipsqj/p_how_i_found_8_bugs_in_googles_gemma_6t_token/

Tom-Neverwinter

Mar 21, 2024

Directly from Reddit:

[P] How I found 8 bugs in Google's Gemma 6T token model
Hey r/MachineLearning! You might have seen me post on Twitter, but I'll post here too in case you don't know about the 8 bugs in multiple implementations of Google's Gemma :) The fixes should already be pushed into HF's transformers main branch, and Keras, PyTorch Gemma, and vLLM should have gotten the fix :) https://github.com/huggingface/transformers/pull/29402

I run an OSS package called Unsloth, which also makes Gemma finetuning 2.5x faster and uses 70% less VRAM :)

By comparing 5 implementations, I found the following issues:

1. Must add <BOS> or else losses will be very high.
2. There’s a typo for model in the technical report!
3. sqrt(3072)=55.4256 but bfloat16 is 55.5.
4. Layernorm (w+1) must be in float32.
5. Keras mixed_bfloat16 RoPE is wrong.
6. RoPE is sensitive to y*(1/x) vs y/x.
7. RoPE should be float32 - already pushed to transformers 4.38.2.
8. GELU should be approx tanh not exact.
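The "y*(1/x) vs y/x" item comes down to the fact that multiplying by a reciprocal rounds twice while a direct division rounds once, so the two can disagree in the last bit. The sketch below is only an illustration in plain Python float64, not the RoPE code from any Gemma implementation; `count_mismatches` is a hypothetical helper written for this example. In lower-precision formats such as bfloat16 the same effect would be expected to be much larger.

```python
def count_mismatches(limit: int) -> int:
    """Count integer pairs (y, x) in 1..limit where y / x != y * (1 / x)."""
    mismatches = 0
    for y in range(1, limit + 1):
        for x in range(1, limit + 1):
            # One rounding (division) vs two roundings (reciprocal, then multiply).
            if y / x != y * (1.0 / x):
                mismatches += 1
    return mismatches


direct = 3 / 10
via_reciprocal = 3 * (1.0 / 10)
print(direct == via_reciprocal)  # False
print(count_mismatches(20), "of", 20 * 20, "pairs differ")
```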

Adding all these changes allows the Log L2 Norm to decrease from the red line to the black line (lower is better). Remember this is Log scale! So the error decreased from 10_000 to 100 - a factor of 100! The fixes are primarily for long sequence lengths.

The most glaring one was that adding BOS tokens to finetuning runs tames the training loss at the start. No BOS causes losses to become very high.
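A minimal sketch of the BOS fix, assuming the training examples are already tokenized into lists of integer IDs. `ensure_bos` is a hypothetical helper, and `BOS_ID = 2` is a placeholder assumption; in a real pipeline the value would be read from the tokenizer (for example `tokenizer.bos_token_id` in transformers) rather than hard-coded.

```python
from typing import List

BOS_ID = 2  # assumption for illustration only; read the real value from the tokenizer


def ensure_bos(token_ids: List[int], bos_id: int = BOS_ID) -> List[int]:
    """Return a copy of token_ids that starts with exactly one BOS token."""
    if token_ids and token_ids[0] == bos_id:
        return list(token_ids)  # already has BOS, do not add a second one
    return [bos_id] + list(token_ids)


print(ensure_bos([10, 11, 12]))     # [2, 10, 11, 12]
print(ensure_bos([2, 10, 11, 12]))  # [2, 10, 11, 12]
```

Checking for an existing BOS first matters because some tokenizers add it automatically, and a doubled BOS is a different input from a single one.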

Another very problematic issue was that RoPE embeddings were computed in bfloat16 rather than float32. This ruined very long context lengths, since positions [8190, 8191] were rounded to [8192, 8192]. This destroyed finetunes on very long sequence lengths.
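The rounding described above can be reproduced without any ML library. bfloat16 keeps only 8 significant bits, so between 4096 and 8192 the representable values are 32 apart. The sketch below emulates bfloat16 by rounding a float32 bit pattern to its top 16 bits (round-to-nearest-even); `to_bfloat16` is a hypothetical helper written for this illustration, not a function from transformers or Keras, and it ignores NaN and infinity.

```python
import math
import struct


def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value and return it as a Python float."""
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round-to-nearest-even on the 16 low bits that bfloat16 discards.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


positions = [8190.0, 8191.0, 8192.0]
print([to_bfloat16(p) for p in positions])  # [8192.0, 8192.0, 8192.0]

# The same rounding explains the sqrt(3072) item in the list of issues.
print(round(math.sqrt(3072), 4))       # 55.4256
print(to_bfloat16(math.sqrt(3072)))    # 55.5
```

Three distinct positions collapsing onto the same value means their rotary embeddings become identical, which is consistent with the problem showing up mainly at long sequence lengths.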

Another major issue was that nearly all implementations except the JAX-type ones used exact GELU, whilst approx GELU is the correct choice.
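For reference, the two GELU variants can be written side by side. This is a sketch using the standard textbook formulas (the erf-based exact form and the tanh approximation), not code taken from any Gemma implementation. The per-activation gap is small, but the thread's point is that a model should be run with the same variant it was trained with.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Largest absolute gap between the two variants on a grid over [-6, 6].
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"max |exact - tanh| on [-6, 6]: {max_gap:.2e}")
```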

I also have a Twitter thread on the fixes: https://twitter.com/danielhanchen/status/1765446273661075609, and a full Colab notebook walking through more issues: https://colab.research.google.com/drive/1fxDWAfPIbC-bHwDSVj5SBmEJ6KG3bUu5?usp=sharing Also a longer blog post: https://unsloth.ai/blog/gemma-bugs

I also made Gemma finetuning 2.5x faster with 60% less VRAM in a Colab notebook: https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing There's also a $50K Kaggle competition specifically for Gemma: https://www.kaggle.com/competitions/data-assistants-with-gemma :)

suryabhupa

Google org Mar 21, 2024

Hello! Surya from the Gemma team here -- I've been in touch with Daniel over the last few weeks, and he's done really excellent work in finding these! We've been pushing fixes for all the issues he's found, and hopefully it should be much more stable to finetune Gemma models now. For context, the issues were inconsistencies between the different Jax / PyTorch / Flax / Keras implementations that impacted finetuning performance in particular. Note that most of these don't affect regular inference-only workloads from our finetuned models.

We'll do our best to make sure that all the different implementations and notebooks we put out are consistent and work out of the box -- if you or others find more issues, please let us know! We know things won't always be perfect, but we are very eager to improve the models and engage with folk about how to make them more useful :)

SameOldName

Mar 21, 2024

So it's OK to try and use this one from this page now?

suryabhupa

Google org Mar 21, 2024

Yes, please give it a try! It'll also depend on how you're using them (are you interested in finetuning or just generations, which frameworks, etc.).

RylanSchaeffer

Mar 23, 2024

@suryabhupa if I simply want to benchmark the model on EleutherAI's LM Evaluation Harness, should the model be good to go?

suryabhupa

Google org Mar 25, 2024

yes, I believe so! If you find that there are strange generations, please let us know.

Iamexperimenting

Apr 4, 2024

@suryabhupa can you please provide an FSDP example for Nvidia GPUs? Right now there is only one for TPU.

suryabhupa

Google org Apr 5, 2024

All of our internal dev has been on TPU, so we don't have any examples of PyTorch FSDP unfortunately :( There might be other forks or GitHub repos where folks try this, or other model derivatives from Gemma; I would suggest trying those.

suryabhupa

Google org Apr 5, 2024

I saw this: https://huggingface.co/google/gemma-7b/blob/main/examples/example_fsdp.py, but not sure if it was what you're looking for

ZeroWw

Jun 16, 2024

I can't make it work with llama.cpp. I tried everything, and at best the model talks to itself :(

gusthema

Google org Jul 17, 2024

@ZeroWw can you share the prompts?
I've tested Gemma 7B IT with llama.cpp and ollama and it seems to work just fine

ZeroWw

Jul 17, 2024

> @ZeroWw can you share the prompts?
> I've tested Gemma 7B IT with llama.cpp and ollama and it seem to work just fine

yep. now it does.

gusthema

Google org Jul 18, 2024

Thanks for the update @ZeroWw !

ZeroWw

Jul 18, 2024

> Thanks for the update @ZeroWw !

Is there any way we can have a chat? Facebook/Discord/WhatsApp?

gusthema

Google org Jul 18, 2024

sure, there is a Gemma discord channel: https://ai.google.dev/gemma/docs/discord

ZeroWw

Jul 19, 2024

•

edited Jul 19, 2024

I know. I already joined it some time ago.
