Saving, loading, and running inference with the fine-tuned Gemma model
Hi team, thanks for the model and examples. I noticed that the example notebook below has no section on saving the fine-tuned model, loading it back, and running inference with it.
Notebook: https://huggingface.co/google/gemma-7b/blob/main/examples/notebook_sft_peft.ipynb
Could you please add those sections? It would be very helpful.
Additionally, I noticed the example for distributed training on TPU devices. However, I use Nvidia GPUs and don't have access to a TPU machine. Could you please provide an example of distributed training on Nvidia GPUs?
Example for TPU: https://huggingface.co/google/gemma-7b/blob/main/examples/example_fsdp.py
hello!
re: loading the IT model; the use case you have in mind seems somewhat specific, do you think it might be possible for you to add a PR to do this?
re: GPUs: unfortunately I don't have any examples on hand about distributed training on GPUs (our internal stack in Google is TPUs), maybe others have pointers here? cc @pengchong @osanseviero
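For readers on Nvidia GPUs, one common starting point is to launch the same fine-tuning script once per GPU with `torchrun` and keep each process's copy of the model on its own device. This is data-parallel training, a different technique from the FSDP setup in the TPU example, and it is only a sketch under assumptions, not something taken from the Gemma examples. The helper name `device_map_for_local_rank` and the script name `train.py` are hypothetical; `LOCAL_RANK` is the environment variable that `torchrun` sets for each worker.

```python
import os


def device_map_for_local_rank(default_rank: int = 0) -> dict:
    """Build a device_map that places the whole model on this process's GPU.

    torchrun sets LOCAL_RANK for every worker it starts. When the script is
    run directly as a single process, the variable is absent and we fall
    back to `default_rank`.
    """
    local_rank = int(os.environ.get("LOCAL_RANK", default_rank))
    # The empty-string key means "the entire model".
    return {"": local_rank}


if __name__ == "__main__":
    # Assumed usage: pass the result as device_map=... to
    # AutoModelForCausalLM.from_pretrained, then launch with, for example:
    #   torchrun --nproc_per_node=2 train.py
    print(device_map_for_local_rank())
```

With this in place, the `device_map={"":0}` argument used later in this thread would become `device_map=device_map_for_local_rank()`, so that each worker does not try to load onto GPU 0.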
@suryabhupa, I'm new to Hugging Face, which is why I'm looking for a concrete example. I searched as much as I could and found that there are several ways to save a model, and that a model fine-tuned with PEFT is saved and loaded differently. That is why I recommended that the Gemma team add those sections to their example.
I believe saving and loading work the same way for every use case.
@suryabhupa @ybelkada I fine-tuned the 7b model, and when I run inference right after fine-tuning I get correct results. I then saved the fine-tuned model, loaded it back, and ran inference again, and the model hallucinates like the base model that was never fine-tuned. Its output is nowhere near the fine-tuned model's result.
I'm not sure whether I'm saving, merging, and loading the model correctly. Can you please guide me here?
import os

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig

# LoRA adapters on the attention and MLP projection layers
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

model_id = "google/gemma-7b"
# Load the base model in 4-bit NF4 and compute in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])
import transformers
from trl import SFTTrainer

# `data` is the training dataset, loaded earlier (not shown in this snippet)
trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        num_train_epochs = 50,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,  # a positive max_steps overrides num_train_epochs
        learning_rate=2e-4,
        fp16=True,
        logging_steps=5,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
)
trainer.train()
# Save the model locally (with no argument, this writes to output_dir, i.e. "outputs")
trainer.save_model()
# Empty VRAM
del model
del trainer
import gc
gc.collect()
torch.cuda.empty_cache()
from peft import AutoPeftModelForCausalLM

# Reload the base model together with the adapter saved in "outputs"
new_model = AutoPeftModelForCausalLM.from_pretrained(
    'outputs',
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
# Fold the LoRA weights into the base weights
merged_model = new_model.merge_and_unload()
# Save the merged model
merged_model.save_pretrained("custom-fine-tuned-merged", safe_serialization=True)
tokenizer.save_pretrained("custom-fine-tuned-merged")

text = "Quote: Imagination is"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.cuda.amp.autocast():
    outputs = merged_model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Hi @Iamexperimenting
Thanks for the experiments! Can you show us how you create the LoraConfig?
Oh, my bad. I have updated the code:
from peft import LoraConfig
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
Hi @Iamexperimenting
Thanks! What happens if you generate with the un-merged model? Can you also try to reload the model in 4-bit precision or torch.bfloat16? I wonder if there is something off with the model precision.
What happens if you generate with the un-merged model?
Answer: It generates the expected output.
Can you also try to reload the model in 4-bit precision or torch.bfloat16?
Can you please provide a code snippet to reload the model in 4-bit precision or torch.bfloat16?
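A minimal sketch of what such a reload could look like, assuming the adapter directory was written by `trainer.save_model(...)` and that `AutoPeftModelForCausalLM.from_pretrained` forwards extra keyword arguments to the base model loader. The function name `reload_finetuned` and the precision labels `"4bit"` and `"bf16"` are hypothetical; the quantization settings simply mirror the `BitsAndBytesConfig` used for training earlier in this thread.

```python
def reload_finetuned(adapter_dir: str, precision: str = "4bit"):
    """Reload a saved LoRA adapter on top of its base model.

    precision="4bit" mirrors the quantized setup used during training;
    precision="bf16" loads the base model unquantized in bfloat16.
    """
    if precision not in ("4bit", "bf16"):
        raise ValueError(f"precision must be '4bit' or 'bf16', got {precision!r}")

    # Heavy imports are deferred so that argument errors surface immediately.
    import torch
    from peft import AutoPeftModelForCausalLM
    from transformers import BitsAndBytesConfig

    if precision == "4bit":
        load_kwargs = {
            "quantization_config": BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_compute_dtype=torch.bfloat16,
            )
        }
    else:
        load_kwargs = {"torch_dtype": torch.bfloat16}

    # Assumption: a single GPU at index 0, as in the training script.
    return AutoPeftModelForCausalLM.from_pretrained(
        adapter_dir, device_map={"": 0}, **load_kwargs
    )


# Hypothetical usage, with the directory name used later in this thread:
# model = reload_finetuned("finetuned_model", precision="4bit")
```

Loading with the same quantization as training keeps the base weights that the adapter was fitted against, which makes it easier to tell whether a difference in output comes from precision or from the save and load steps.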
I would just carefully suggest checking that everything (the precision, the weights, the activations) is what you expect at every step of the way, all the way to inference, whether it's from the un-merged model or not. I would suspect something is getting overwritten or something is not being loaded properly.
@suryabhupa @ybelkada, I feel the same. Can you please help me here? I have put my training and inference script below.
Please find the reproducible code below.
Here I'm fine-tuning the Gemma model with my dataset.
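One way to make that check concrete is to summarize the parameters of the model at each stage (right after training, after reloading, after merging) and compare the summaries. The sketch below is only an illustration: `summarize_parameters` is a hypothetical helper, and it relies on the assumption that PEFT names its LoRA tensors with a `lora_` prefix (for example `lora_A` and `lora_B`).

```python
from collections import Counter


def summarize_parameters(model) -> dict:
    """Count parameter tensors per dtype and count LoRA adapter tensors.

    Works on any object exposing named_parameters(), such as a torch module.
    """
    dtype_counts = Counter()
    num_lora_tensors = 0
    for name, param in model.named_parameters():
        dtype_counts[str(param.dtype)] += 1
        # Assumption: PEFT LoRA tensors contain "lora_" in their names.
        if "lora_" in name:
            num_lora_tensors += 1
    return {"dtypes": dict(dtype_counts), "num_lora_tensors": num_lora_tensors}


# Hypothetical usage:
# print(summarize_parameters(model))                # right after trainer.train()
# print(summarize_parameters(new_finetuned_model))  # after reloading
```

If the reloaded model reports zero LoRA tensors before any merge, or a different mix of dtypes than the model used during training, that points at the loading step rather than at training.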
import os

import pandas as pd
import torch
import transformers
from trl import SFTTrainer
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from datasets import Dataset, load_dataset

model_id = "google/gemma-2b"
# Load the base model in 4-bit NF4 and compute in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])

# The CSV is expected to have a "text" column (see dataset_text_field below)
data = pd.read_csv('train_data.csv')
train_df = Dataset.from_pandas(data)

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
trainer = SFTTrainer(
    model=model,
    train_dataset=train_df,
    dataset_text_field = "text",
    max_seq_length = 512,
    args=transformers.TrainingArguments(
        num_train_epochs = 10,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=16,
        warmup_steps=2,
        max_steps=10,  # a positive max_steps overrides num_train_epochs
        learning_rate=2e-4,
        fp16=True,
        seed = 12,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
)
trainer.train()
After training finished, I immediately tested the model on two examples to check its predictions, and it generated the expected output.
# Below is with example 1 input
text = "trained input example text 1"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Below is with example 2 input
text = "trained input example text 2"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
After checking the performance of the fine-tuned model, I saved it with the step below.
trainer.save_model("finetuned_model")
After saving the model, I restarted the kernel and loaded the fine-tuned model.
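import os

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# The kernel was restarted, so the tokenizer has to be loaded again
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", token=os.environ['HF_TOKEN'])
new_finetuned_model = AutoPeftModelForCausalLM.from_pretrained(
    "finetuned_model",
    low_cpu_mem_usage=True,
    return_dict = True,
    torch_dtype = torch.float16,
    device_map = "cuda:0",
)
After loading the fine-tuned model, I tested it with the same example input and noticed that the generated output is different.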
text = "trained input example text 1"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = new_finetuned_model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
I'm not sure where I'm making a mistake. Could you please help me here?
It seems like the problem might lie between restarting the kernel and re-loading the fine-tuned model; something seems broken there.
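To narrow down where things diverge across a kernel restart, it can help to write the outputs generated right after training to disk and compare them with the outputs of the reloaded model. The sketch below uses only the standard library; the function names and the file name are hypothetical, and it assumes generation is deterministic (for example by passing `do_sample=False` to `generate`), since sampled outputs would differ from run to run anyway.

```python
import json
from pathlib import Path


def save_reference(path: str, prompt_to_output: dict) -> None:
    """Store prompt -> generated text pairs before restarting the kernel."""
    Path(path).write_text(json.dumps(prompt_to_output, indent=2), encoding="utf-8")


def diff_against_reference(path: str, prompt_to_output: dict) -> list:
    """Return the prompts whose new output differs from the stored reference."""
    reference = json.loads(Path(path).read_text(encoding="utf-8"))
    return [
        prompt
        for prompt, expected in reference.items()
        if prompt_to_output.get(prompt) != expected
    ]


# Hypothetical usage:
# Before the restart:  save_reference("reference_outputs.json", {text: decoded})
# After the reload:    print(diff_against_reference("reference_outputs.json", {text: decoded}))
```

An empty list after the reload means the saved adapter reproduces the in-memory model for those prompts; a non-empty list gives a fixed set of prompts to retry while changing one loading option at a time.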