Instructions to use google/gemma-7b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use google/gemma-7b with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="google/gemma-7b")# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b") model = AutoModelForCausalLM.from_pretrained("google/gemma-7b") - llama-cpp-python
How to use google/gemma-7b with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="google/gemma-7b", filename="gemma-7b.gguf", )
output = llm( "Once upon a time,", max_tokens=512, echo=True ) print(output)
- Inference
- Local Apps Settings
- llama.cpp
How to use google/gemma-7b with llama.cpp:
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh # Start a local OpenAI-compatible server with a web UI: llama serve -hf google/gemma-7b # Run inference directly in the terminal: llama cli -hf google/gemma-7b
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama serve -hf google/gemma-7b # Run inference directly in the terminal: llama cli -hf google/gemma-7b
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf google/gemma-7b # Run inference directly in the terminal: ./llama-cli -hf google/gemma-7b
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf google/gemma-7b # Run inference directly in the terminal: ./build/bin/llama-cli -hf google/gemma-7b
Use Docker
docker model run hf.co/google/gemma-7b
- LM Studio
- Jan
- vLLM
How to use google/gemma-7b with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "google/gemma-7b" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "google/gemma-7b", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/google/gemma-7b
- SGLang
How to use google/gemma-7b with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "google/gemma-7b" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "google/gemma-7b", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "google/gemma-7b" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "google/gemma-7b", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Ollama
How to use google/gemma-7b with Ollama:
ollama run hf.co/google/gemma-7b
- Unsloth Studio
How to use google/gemma-7b with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for google/gemma-7b to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for google/gemma-7b to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for google/gemma-7b to start chatting
- Atomic Chat new
- Docker Model Runner
How to use google/gemma-7b with Docker Model Runner:
docker model run hf.co/google/gemma-7b
- Lemonade
How to use google/gemma-7b with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull google/gemma-7b
Run and chat with the model
lemonade run user.gemma-7b-{{QUANT_TAG}}List all available models
lemonade list
Fine tuning precision task of transliteration on Gemma-2b-it model
Facing issues with finetuning gemma-2b-it model for transliteration task.
I want create a lora for transliteration task from Roman to Devnagri and vice versa. But even after multiple combination of
- Lora rank
- Lora decay
- Modules
- Lora drop out rate
etc
When I inferred the results for the same Lora using oogabooga, it is not generated desired result, rather it just repeated user content. Even though I have used similar parameters for training lora for same task on gemma-9b model, it is working fine.
Can someone help here or have some thoughts.
Hi @grishi911991 ,
There might be the below reasons for an above issue:
The Gemma-9B model has a greater capacity to understand and generate complex language patterns, potentially leading to more accurate and contextually appropriate outputs.
Example: Input Prompt:"Translate the following English sentence to French: 'The quick brown fox jumps over the lazy dog.'" Gemma-2B-IT Output:"Le rapide renard brun saute par-dessus le chien paresseux." Gemma-9B Output:"Le rapide renard brun bondit par-dessus le chien paresseux."
In this example, both models provide correct translations. However, the Gemma-9B model uses the verb "bondit" (leaps) instead of "saute" (jumps), which may be considered a more contextually appropriate choice in certain contexts.
Parameter size of **Gemma-9B model **is greater than Gemma-2B-IT model, so Gemma-9B model allows for more nuanced language understanding and generation, potentially leading to more refined outputs.
- Ensure that the dataset contains a sufficient number of training examples for a 2B parameter model. A larger dataset helps the model learn the intricate patterns within the data, leading to a deeper understanding and more accurate results.
Thank you.
@GopiUppari thanks for responding but I am working on transliteration rather than translation. Even with that, I would expect some dip in quality but in my case, it is not performing at all as expected.
I used model rank 4 & alpha 8 and in other option rank 2 & alpha 4
Lora decay 0.02, 0.05, 0.1
Lora drop out 0.1 & 0.2
This is the script, I used
#os.environ["CUDA_VISIBLE_DEVICES"] = "1"
hf_token = "xxx"
login(token = hf_token)
wandb.login(key="xxxx")
run = wandb.init(
project='Fine-tune gemma2-2b on PF',
job_type="training",
anonymous="allow"
)
model_id = "google/gemma-2-2b-it"
#quantization_config_loading = GPTQConfig(bits=8, disable_exllama=True)
dataset = load_dataset("yyyy", data_files={'train': "yyyy", 'validation': "yyy"})
max_seq_length = 2048
#model = AutoModelForCausalLM.from_pretrained(model_id,quantization_config=quantization_config_loading)
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.config.use_cache = False
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()
#model = prepare_model_for_kbit_training(model)
from peft import LoraConfig, get_peft_model
config = LoraConfig(
r=8,
lora_alpha=16,
target_modules= ["q_proj", "k_proj", "v_proj", "o_proj"],
#layers_to_transform = list(range(12, 26)),
lora_dropout=0.1,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
import transformers
args=transformers.TrainingArguments(
per_device_train_batch_size = 10,
per_device_eval_batch_size = 5,
gradient_accumulation_steps = 1,
warmup_steps = 100,
num_train_epochs=2,
eval_strategy="steps",
eval_steps=500,
save_steps=500,
learning_rate=2e-4,
fp16=True, #use mixed precision training
logging_steps=10,
lr_scheduler_type = "cosine",
weight_decay = 0.02,
output_dir="gemma2_2b_training_hn",
report_to="wandb",
optim="adamw_hf"
)
from trl import SFTTrainer
trainer = SFTTrainer(
model=model,
args=args,
train_dataset=dataset["train"],
eval_dataset=dataset["validation"],
peft_config=config,
dataset_text_field="text",
tokenizer=tokenizer,
packing=False,
max_seq_length=max_seq_length)
trainer.train(
Hi @grishi911991 ,
Could you please try with the below parameters:
- Increase rank (r=16) and adjust alpha if needed.
- Try lower learning rates (learning_rate = 1e-4 or 5e-5).
- Increase gradient_accumulation_steps to mitigate GPU memory issues.
- Consider more epochs (e.g., 3–5) for better convergence and check if validation metrics improve.
Thank you.