paath-e4b-merged

Paath is an offline multilingual NCERT tutor for Indian students, built for the Gemma 4 Good Hackathon on Kaggle. This is the QLoRA fine-tuned Gemma 4 E4B model — merged weights ready for
inference.

Kaggle Writeup: https://www.kaggle.com/competitions/gemma-4-good-hackathon/writeups/new-writeup-1779040319794

Airplane mode for 150 million students.

The full system runs CPU-only on a mid-range Android phone (₹20,000-class, ~Pixel 7): no internet, no cloud, ₹0 marginal inference cost per student. This model handles the generation side; retrieval is done on-device with embeddinggemma-300m over a SQLite NCERT
chunk index.

What it does

Every response follows a four-part teaching structure regardless of subject, language, or grade:

One sentence — what the concept is, simply stated
Three to four numbered steps — flat, two sentences each
One Indian analogy — grounded in everyday Indian experience
"Try this:" — one follow-up question, never the answer

Language is mirrored per message: Devanagari triggers full Hindi; Hinglish keywords route to the casual mix students use when texting; everything else is English.

Example — Class 10 Biology, Hinglish: "yaar samjha de human heart kaise kaam karta hai"

Dil ek double pump hai jo blood ko lungs aur poore body mein simultaneously push karta hai.

Jab atria relax karte hain, blood veins se fill hota hai — right atrium mein deoxygenated blood, left mein oxygenated.

Atria contract karte hain aur blood ventricles mein push hota hai; tricuspid aur bicuspid valves close hote hain — yahi "lub" sound hai stethoscope mein.

Ventricles contract karte hain — right side lungs ko bhejta hai re-oxygenation ke liye, left side aorta se poore body mein; yahi "dub" sound hai.

Socho jaise railway station ke do platforms — ek platform passengers ko "oxygen refill center" (lungs) bhejta hai, doosra unhe seedha destination (body
tissues) pe drop karta hai — dono simultaneously, kabhi mix nahi hote.

Try this: Left ventricle ki wall right se zyada thick kyun hoti hai?

Training

Parameter	Value
Base model	`unsloth/gemma-4-e4b-it-unsloth-bnb-4bit`
Method	QLoRA (Unsloth)
LoRA rank / alpha	r=128, lora_alpha=256
Target modules	All projection layers
Epochs	3
Hardware	A100 40GB (~3.5 hours, ~$25)
Training examples	52,255 pairs
Language split	47% Hinglish / 47% English / 6% Hindi

Data pipeline: 40 NCERT PDFs → Docling extraction (59,112 sections) → Gemini 2.5 Flash Q&A generation (428K pairs, ~$24) → filtering by length, dedup,
Hinglish leakage, and rebalancing → 52,255 training pairs.

Key training finding: Training on a shorter system prompt than the inference prompt causes silent format regression — the model learns formatting conditioned on the training prompt length and reverts to base style when a longer inference prompt is used. V2 fixed this by matching training prompt length to inference exactly, producing a 3× format compliance jump (0.44 → 1.36/3).

Evaluation

Scored by Gemini 2.5 Pro on five dimensions (0–3 each, max 15) across 25 hand-crafted NCERT discriminator test cases, independently validated on 75 additional cases (Claude Sonnet judge).

Dimension (max 3)	Base	Base + RAG	FT	FT + RAG
Factual accuracy	1.48	1.92	0.84	1.92
Format compliance	0.44	0.60	1.36	1.40
Language match	2.92	2.92	2.92	3.00
Pedagogical clarity	1.16	1.16	0.80	1.32
Scope handling	2.52	2.68	2.40	2.76
Total /15	8.52	9.28	8.32	10.40

RAG makes it correct (+0.44/3, +30% relative). Fine-tuning makes it teach (2–3× format compliance, Hinglish fidelity, curriculum refusals).
FT without RAG degrades factual accuracy (1.48 → 0.84/3) — the model overwrites factual associations from pretraining. RAG fully recovers it. Fine-tuning adds what RAG cannot: idiomatic Hinglish delivery and clean out-of-scope refusals.

Intended use

This model is designed for use with a RAG pipeline over NCERT content. Without retrieval it will produce correctly formatted but factually weaker responses. For on-device deployment, pair with: