Instructions to use google/gemma-7b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use google/gemma-7b with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="google/gemma-7b")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
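A quick way to check that generation works, continuing from the tokenizer and model loaded above (the prompt and max_new_tokens value are only illustrative):

# Generate a short continuation with the model loaded above
inputs = tokenizer("Once upon a time,", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))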
- llama-cpp-python
How to use google/gemma-7b with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="google/gemma-7b",
    filename="gemma-7b.gguf",
)
output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True
)
print(output)
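To stream tokens as they are generated instead, the same llm object can be called with stream=True. A minimal sketch (the prompt and max_tokens value are only illustrative):

# Stream the completion chunk by chunk
for chunk in llm("Once upon a time,", max_tokens=256, stream=True):
    # Each chunk carries the newly generated text
    print(chunk["choices"][0]["text"], end="", flush=True)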
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use google/gemma-7b with llama.cpp:
Install from brew
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf google/gemma-7b

# Run inference directly in the terminal:
llama-cli -hf google/gemma-7b
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf google/gemma-7b

# Run inference directly in the terminal:
llama-cli -hf google/gemma-7b
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf google/gemma-7b

# Run inference directly in the terminal:
./llama-cli -hf google/gemma-7b
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf google/gemma-7b

# Run inference directly in the terminal:
./build/bin/llama-cli -hf google/gemma-7b
Use Docker
docker model run hf.co/google/gemma-7b
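Once llama-server is running via any of the install options above, it can also be queried from Python. A minimal sketch, assuming the server listens on the default port 8080 and using llama.cpp's native /completion endpoint (the prompt and n_predict value are only illustrative):

import requests

# Ask the local llama-server for a completion
resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "Once upon a time,", "n_predict": 128},
)
print(resp.json()["content"])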
- LM Studio
- Jan
- vLLM
How to use google/gemma-7b with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "google/gemma-7b"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "google/gemma-7b",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'

Use Docker
docker model run hf.co/google/gemma-7b
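Because vLLM exposes an OpenAI-compatible API, the server started with the pip instructions above can also be called from the openai Python client. A minimal sketch, assuming the openai package is installed and the server is listening on localhost:8000 (the api_key value is a placeholder, since the local server does not require one by default):

from openai import OpenAI

# Point the OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="google/gemma-7b",
    prompt="Once upon a time,",
    max_tokens=64,
)
print(completion.choices[0].text)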
- SGLang
How to use google/gemma-7b with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "google/gemma-7b" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "google/gemma-7b",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'

Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model-path "google/gemma-7b" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "google/gemma-7b",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
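The same OpenAI-compatible endpoint can also be called from Python. A minimal sketch using requests, assuming the SGLang server above is listening on port 30000 (the prompt and max_tokens value are only illustrative):

import requests

# Request a completion from the local SGLang server
resp = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "google/gemma-7b",
        "prompt": "Once upon a time,",
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["text"])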
- Ollama
How to use google/gemma-7b with Ollama:
ollama run hf.co/google/gemma-7b
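Ollama also exposes a local REST API, so the model can be called programmatically once it has been pulled with the command above. A minimal sketch, assuming Ollama's default port 11434 and that the model is registered under the hf.co/google/gemma-7b name used above:

import requests

# Ask the local Ollama server for a single, non-streaming response
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "hf.co/google/gemma-7b",
        "prompt": "Once upon a time,",
        "stream": False,
    },
)
print(resp.json()["response"])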
- Unsloth Studio
How to use google/gemma-7b with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for google/gemma-7b to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for google/gemma-7b to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for google/gemma-7b to start chatting
- Docker Model Runner
How to use google/gemma-7b with Docker Model Runner:
docker model run hf.co/google/gemma-7b
- Lemonade
How to use google/gemma-7b with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull google/gemma-7b
Run and chat with the model
lemonade run user.gemma-7b-{{QUANT_TAG}}
List all available models
lemonade list
Error "shape '[1, 9, 3072]' is invalid for input of size 36864" while running Gemma 7b using torch.float16
Hello,
I'm trying to run the Gemma 7b example from this model's card using torch.float16, but I keep getting the error shape '[1, 9, 3072]' is invalid for input of size 36864.
I just copy/pasted the example from the card page to a Google Colab notebook (and installed the necessary dependencies of course).
Am I doing something wrong?
EDIT: I tried using 8-bit precision but got the same error.
I got the same error on Google Colab with a T4.
I found that gemma-2b and gemma-2b-it worked, but gemma-7b and gemma-7b-it failed with RuntimeError: shape '[1, 9, 3072]' is invalid for input of size 36864.
!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.0
import os
from google.colab import userdata
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
# quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model_id = "google/gemma-7b" # gemma-2b and gemma-2b-it worked, but gemma-7b and gemma-7b-it got error
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
# model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
# model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)
# model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
Hey all! We're looking into it! Things work with torch 2.2.0 but not 2.1.0. We'll update here once we find the issue.
Hey all! The source of the issue is the difference in the attention implementation. Any torch version before 2.1.1 will use eager, as sdpa isn't supported in torch in those versions. We will fix the models to work with these versions in transformers ASAP and release a patch; but in the meantime, we recommend using a torch version that satisfies torch>=2.1.1 in order to leverage the sdpa attention implementation, which works correctly.
Here is the necessary line to install the relevant pytorch version in colab:
pip install "torch>=2.1.1" -U
Please restart your runtime afterwards for it to leverage the updated pytorch version!
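For reference, the attention implementation can also be requested explicitly when loading the model. A minimal sketch, assuming transformers >= 4.38 and torch >= 2.1.1 (the "sdpa" choice simply mirrors the workaround described above; "eager" would force the older implementation):

import torch
from transformers import AutoModelForCausalLM

# Explicitly select the SDPA attention implementation (requires torch>=2.1.1)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b",
    device_map="auto",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)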
Thank you so much!
https://huggingface.co/google/gemma-7b-it/discussions/13
@osanseviero @lysandre Thank you!
I tested on Google Colab with a T4 and confirmed that it works without error after adding this cell at the top of the notebook.
!pip3 install -q -U torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 torchdata==0.7.1 torchtext==0.16.1 --index-url https://download.pytorch.org/whl/cu121
By the way, the example prompt Write me a poem about Machine Learning. does not seem suitable for the non-instruct model gemma-7b: it generates nonsense output, which makes it hard to tell whether the model is working correctly.
<bos>Write me a poem about Machine Learning.
<bos><bos><bos><bos><bos><bos><bos><bos><bos><bos>
But it actually works well if "Because" is appended to the prompt, i.e. Write me a poem about Machine Learning. Because:
<bos>Write me a poem about Machine Learning. Because I’m a poet. And I’m
Hi all! We just did a new release in transformers that fixes the issue being discussed in this thread. Make sure to upgrade. Thanks everyone!
@osanseviero Thank you so much!
I tested on Google Colab (torch 2.1.0+cu121) with transformers==4.38.1 and confirmed that the example worked well.
!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.1 # NOT 4.38.0
Great to hear! I'll close this discussion, but feel free to comment if you still face the issue!