s23deepak
/

grandgemma-eval

Model card Files Files and versions

grandgemma-eval / README.md

s23deepak's picture

Upload README.md

edf8f8f verified 29 days ago

|

history blame contribute delete

3.2 kB

	# GrandgemMa — Gemma 4 Scam Detection Eval & Fine-Tune Kit

	> Goal: Test `google/gemma-4-E2B-it` (2B params) on real scam-call transcripts.
	> If accuracy < 90 % or F1(SCAM) < 85 % → fine-tune with Unsloth 4-bit LoRA, then convert to LiteRT for phone.

	## Model Size Reference

	\| Model \| Params \| FP32 RAM \| 4-bit LiteRT RAM \| Phone? \|
	\|---\|---\|---\|---\|---\|
	\| `gemma-4-31B-it` \| 31B \| ~124 GB \| ~16 GB \| ❌ No \|
	\| `gemma-4-26B-A4B-it` \| 26B \| ~104 GB \| ~13 GB \| ❌ No \|
	\| `gemma-4-E4B-it` \| 4B \| ~16 GB \| ~2 GB \| ⚠️ Flagship only \|
	\| `gemma-4-E2B-it` \| 2B \| ~8 GB \| ~1.5 GB \| ✅ Mid-tier + budget \|

	We use `gemma-4-E2B-it` (2B) — smallest Gemma 4, fits on most phones after LiteRT quantization.

	## Datasets

	- Primary: [`BothBosu/scam-dialogue`](https://huggingface.co/datasets/BothBosu/scam-dialogue) — 800+ labeled transcripts (1=SCAM, 0=LEGIT).
	- Secondary: [`BothBosu/Scammer-Conversation`](https://huggingface.co/datasets/BothBosu/Scammer-Conversation) — extra mixed conversations.

	## Quick Start

	### Step 1: Zero-shot eval (CPU, no GPU needed)

	```bash
	# Quick test — 20 samples, ~2-3 min on laptop CPU
	python eval_zero_shot_cpu.py --limit 20

	# Full test split — ~400 samples, ~30-45 min on CPU
	python eval_zero_shot_cpu.py --limit -1

	# If you have plenty of RAM, use fp16 to halve memory (~4 GB)
	python eval_zero_shot_cpu.py --limit 20 --dtype fp16
	```

	Output: `results_zero_shot_cpu.json` + console report.

	### Step 2: Read the verdict

	\| Accuracy \| F1(SCAM) \| Verdict \| Action \|
	\|---\|---\|---\|---\|
	\| ≥ 90 % \| ≥ 85 % \| ✅ PASS \| Base model good. Go straight to LiteRT conversion. \|
	\| 75–89 % \| 70–84 % \| ⚠️ MARGINAL \| Fine-tune, then LiteRT convert. \|
	\| < 75 % \| < 70 % \| ❌ FAIL \| Fine-tune REQUIRED before phone deployment. \|

	### Step 3: Fine-tune (if needed)

	```bash
	# Install
	pip install unsloth transformers datasets trl peft accelerate

	# Train on GPU (Kaggle T4×2 free, or Colab, or local GPU)
	python train_sft_unsloth.py --push_to_hub s23deepak/grandgemma-scam-sft

	# Then re-eval the fine-tuned model
	python eval_zero_shot_cpu.py \
	--model s23deepak/grandgemma-scam-sft \
	--limit -1
	```

	### Step 4: Convert to LiteRT for Android

	After fine-tuning (or if base passes), convert the 2B model to `.litertlm`:

	```bash
	# Use litert-community tools
	pip install litert

	litert-convert \
	--model s23deepak/grandgemma-scam-sft \
	--output grandgemma-scam.litertlm \
	--quantization int4
	```

	Target RAM on phone: ~1.5 GB for the 2B 4-bit model.

	## Files in This Repo

	\| File \| Purpose \|
	\|---\|---\|
	\| `eval_zero_shot_cpu.py` \| CPU-only zero-shot eval (default, no GPU) \|
	\| `eval_zero_shot.py` \| GPU version (faster, same logic) \|
	\| `train_sft_unsloth.py` \| Unsloth 4-bit LoRA fine-tune \|
	\| `format_dataset.py` \| Convert dataset → ChatML JSONL \|

	## Phone Deployment Checklist

	- [ ] Zero-shot eval passes (≥90% acc, ≥85% F1)
	- [ ] OR fine-tuned model passes same threshold
	- [ ] Convert to `.litertlm` (int4 quantization)
	- [ ] Benchmark on target phone tier (mid-tier / budget)
	- [ ] Measure cold-start load time (<2s target)
	- [ ] Measure inference latency (<500ms per classification)