grandgemma-eval / README.md
s23deepak's picture
Upload README.md
edf8f8f verified
# GrandgemMa β€” Gemma 4 Scam Detection Eval & Fine-Tune Kit
> **Goal:** Test `google/gemma-4-E2B-it` (2B params) on real scam-call transcripts.
> If accuracy < 90 % or F1(SCAM) < 85 % β†’ fine-tune with Unsloth 4-bit LoRA, then convert to LiteRT for phone.
## Model Size Reference
| Model | Params | FP32 RAM | 4-bit LiteRT RAM | Phone? |
|---|---|---|---|---|
| `gemma-4-31B-it` | 31B | ~124 GB | ~16 GB | ❌ No |
| `gemma-4-26B-A4B-it` | 26B | ~104 GB | ~13 GB | ❌ No |
| `gemma-4-E4B-it` | 4B | ~16 GB | ~2 GB | ⚠️ Flagship only |
| **`gemma-4-E2B-it`** | **2B** | **~8 GB** | **~1.5 GB** | βœ… **Mid-tier + budget** |
**We use `gemma-4-E2B-it` (2B)** β€” smallest Gemma 4, fits on most phones after LiteRT quantization.
## Datasets
- **Primary:** [`BothBosu/scam-dialogue`](https://huggingface.co/datasets/BothBosu/scam-dialogue) β€” 800+ labeled transcripts (1=SCAM, 0=LEGIT).
- **Secondary:** [`BothBosu/Scammer-Conversation`](https://huggingface.co/datasets/BothBosu/Scammer-Conversation) β€” extra mixed conversations.
## Quick Start
### Step 1: Zero-shot eval (CPU, no GPU needed)
```bash
# Quick test β€” 20 samples, ~2-3 min on laptop CPU
python eval_zero_shot_cpu.py --limit 20
# Full test split β€” ~400 samples, ~30-45 min on CPU
python eval_zero_shot_cpu.py --limit -1
# If you have plenty of RAM, use fp16 to halve memory (~4 GB)
python eval_zero_shot_cpu.py --limit 20 --dtype fp16
```
**Output:** `results_zero_shot_cpu.json` + console report.
### Step 2: Read the verdict
| Accuracy | F1(SCAM) | Verdict | Action |
|---|---|---|---|
| β‰₯ 90 % | β‰₯ 85 % | βœ… PASS | Base model good. Go straight to LiteRT conversion. |
| 75–89 % | 70–84 % | ⚠️ MARGINAL | Fine-tune, then LiteRT convert. |
| < 75 % | < 70 % | ❌ FAIL | Fine-tune REQUIRED before phone deployment. |
### Step 3: Fine-tune (if needed)
```bash
# Install
pip install unsloth transformers datasets trl peft accelerate
# Train on GPU (Kaggle T4Γ—2 free, or Colab, or local GPU)
python train_sft_unsloth.py --push_to_hub s23deepak/grandgemma-scam-sft
# Then re-eval the fine-tuned model
python eval_zero_shot_cpu.py \
--model s23deepak/grandgemma-scam-sft \
--limit -1
```
### Step 4: Convert to LiteRT for Android
After fine-tuning (or if base passes), convert the 2B model to `.litertlm`:
```bash
# Use litert-community tools
pip install litert
litert-convert \
--model s23deepak/grandgemma-scam-sft \
--output grandgemma-scam.litertlm \
--quantization int4
```
Target RAM on phone: **~1.5 GB** for the 2B 4-bit model.
## Files in This Repo
| File | Purpose |
|---|---|
| `eval_zero_shot_cpu.py` | **CPU-only** zero-shot eval (default, no GPU) |
| `eval_zero_shot.py` | GPU version (faster, same logic) |
| `train_sft_unsloth.py` | Unsloth 4-bit LoRA fine-tune |
| `format_dataset.py` | Convert dataset β†’ ChatML JSONL |
## Phone Deployment Checklist
- [ ] Zero-shot eval passes (β‰₯90% acc, β‰₯85% F1)
- [ ] OR fine-tuned model passes same threshold
- [ ] Convert to `.litertlm` (int4 quantization)
- [ ] Benchmark on target phone tier (mid-tier / budget)
- [ ] Measure cold-start load time (<2s target)
- [ ] Measure inference latency (<500ms per classification)