#!/usr/bin/env python3
"""
Evaluate a fine-tuned Gemma 4 model.

Usage:
    python scripts/evaluate.py --model checkpoints/finetuned/lora_adapter \
        --eval-data data/processed/train_eval.jsonl
"""

import argparse
import json
import time

import torch
from unsloth import FastModel


def parse_args():
    parser = argparse.ArgumentParser(description="Evaluate fine-tuned Gemma 4")
    parser.add_argument("--model", type=str, required=True,
                        help="Path to fine-tuned LoRA adapter or model name")
    # NOTE: --base-model is accepted but not used anywhere below.
    parser.add_argument("--base-model", type=str, default=None,
                        help="Base model name (if loading LoRA adapter separately)")
    parser.add_argument("--eval-data", type=str, required=True,
                        help="Path to evaluation JSONL file")
    parser.add_argument("--max-samples", type=int, default=100)
    parser.add_argument("--max-new-tokens", type=int, default=512)
    parser.add_argument("--max-seq-length", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.0,
                        help="0.0 for greedy (deterministic)")
    return parser.parse_args()


def load_eval_data(path, max_samples):
    """Load up to max_samples chat examples from a JSONL file."""
    data = []
    # Read as UTF-8 explicitly so non-ASCII text loads the same on every platform.
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines instead of failing in json.loads
            item = json.loads(line)
            # Keep only records with at least one input message and one answer.
            if "messages" in item and len(item["messages"]) >= 2:
                data.append(item)
            if len(data) >= max_samples:
                break
    return data


def exact_match(prediction, expected):
    """Simple exact match after normalization (strip + lowercase)."""
    pred_clean = prediction.strip().lower()
    exp_clean = expected.strip().lower()
    return pred_clean == exp_clean


def contains_match(prediction, expected):
    """Check if expected answer is contained in prediction."""
    pred_clean = prediction.strip().lower()
    exp_clean = expected.strip().lower()
    return exp_clean in pred_clean


def main():
    args = parse_args()

    print("=" * 60)
    print("Gemma 4 Evaluation")
    print("=" * 60)
    print(f"Model: {args.model}")
    print(f"Eval data: {args.eval_data}")
    print(f"Max samples: {args.max_samples}")
    print("=" * 60)

    # Load model
    print("\nLoading model...")
    model, tokenizer = FastModel.from_pretrained(
        model_name=args.model,
        max_seq_length=args.max_seq_length,
        load_in_4bit=True,
    )
    FastModel.for_inference(model)

    # Load eval data
    eval_data = load_eval_data(args.eval_data, args.max_samples)
    print(f"Loaded {len(eval_data)} evaluation examples")

    # Evaluate
    results = []
    total_tokens = 0
    start_time = time.time()

    for i, sample in enumerate(eval_data):
        messages = sample["messages"]

        # Use all messages except the last (expected answer) as input
        input_messages = messages[:-1]
        expected = messages[-1]["content"]

        # Tokenize
        inputs = tokenizer.apply_chat_template(
            input_messages,
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(model.device)

        # Generate (greedy decoding when temperature is 0, sampling otherwise)
        with torch.no_grad():
            outputs = model.generate(
                input_ids=inputs,
                max_new_tokens=args.max_new_tokens,
                temperature=args.temperature if args.temperature > 0 else None,
                do_sample=args.temperature > 0,
            )

        # Decode only the new tokens
        new_tokens = outputs[0][inputs.shape[1]:]
        prediction = tokenizer.decode(new_tokens, skip_special_tokens=True)
        total_tokens += len(new_tokens)

        # Score
        em = exact_match(prediction, expected)
        cm = contains_match(prediction, expected)
        results.append({
            "idx": i,
            "exact_match": em,
            "contains_match": cm,
            "prediction_len": len(prediction),
            "expected_len": len(expected),
        })

        # Show the first three examples for a quick sanity check
        if i < 3:
            print(f"\n--- Example {i+1} ---")
            print(f"Input: {input_messages[-1]['content'][:100]}...")
            print(f"Expected: {expected[:100]}...")
            print(f"Got: {prediction[:100]}...")
            print(f"EM: {em} | Contains: {cm}")

        if (i + 1) % 10 == 0:
            print(f" Evaluated {i+1}/{len(eval_data)}...")

    # Compute metrics
    elapsed = time.time() - start_time
    n = len(results)
    exact_match_acc = sum(r["exact_match"] for r in results) / n if n else 0
    contains_match_acc = sum(r["contains_match"] for r in results) / n if n else 0
    avg_pred_len = sum(r["prediction_len"] for r in results) / n if n else 0
    tokens_per_sec = total_tokens / elapsed if elapsed > 0 else 0

    print("\n" + "=" * 60)
    print("Results")
    print("=" * 60)
    print(f" Samples evaluated: {n}")
    print(f" Exact match: {exact_match_acc:.4f} ({sum(r['exact_match'] for r in results)}/{n})")
    print(f" Contains match: {contains_match_acc:.4f} ({sum(r['contains_match'] for r in results)}/{n})")
    print(f" Avg prediction len: {avg_pred_len:.0f} chars")
    print(f" Inference speed: {tokens_per_sec:.1f} tokens/sec")
    print(f" Total time: {elapsed:.1f}s")

    # Parseable metrics line for AutoResearch
    print(f"\nMETRICS: exact_match={exact_match_acc:.4f} "
          f"contains_match={contains_match_acc:.4f} "
          f"tokens_per_sec={tokens_per_sec:.1f} "
          f"eval_time={elapsed:.1f}")


if __name__ == "__main__":
    main()
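
The script ends by printing a single `METRICS:` line of `key=value` pairs so that another tool can read the results. A minimal sketch of how a consumer might parse that line from captured output is given here. The helper names `parse_metrics_line` and `find_last_metrics`, the regular expression, and the sample numbers are all hypothetical illustrations and are not part of this repository or of AutoResearch.

```python
import re

# Matches key=value pairs such as "exact_match=0.2500" (assumed format,
# based on the f-string that prints the METRICS line).
_PAIR = re.compile(r"(\w+)=([-+]?\d+(?:\.\d+)?)")
_PREFIX = "METRICS:"


def parse_metrics_line(line):
    """Return {name: float} for a METRICS line, or None for any other line."""
    line = line.strip()
    if not line.startswith(_PREFIX):
        return None
    return {key: float(value) for key, value in _PAIR.findall(line[len(_PREFIX):])}


def find_last_metrics(stdout_text):
    """Scan captured stdout and return the last METRICS line, parsed."""
    parsed = None
    for line in stdout_text.splitlines():
        result = parse_metrics_line(line)
        if result is not None:
            parsed = result
    return parsed


if __name__ == "__main__":
    # Made-up output for illustration only; these are not real results.
    sample_stdout = (
        "Results\n"
        " Samples evaluated: 4\n"
        "\n"
        "METRICS: exact_match=0.2500 contains_match=0.5000 "
        "tokens_per_sec=31.4 eval_time=12.3\n"
    )
    print(find_last_metrics(sample_stdout))
```

Taking the last matching line rather than the first is a design choice for this sketch: if several runs are concatenated into one log, the most recent result is used.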