- paath-e4b-merged
- What it does
- Training
- Evaluation
- RAG makes it correct (+0.44/3, +30% relative). Fine-tuning makes it teach (2–3× format compliance, Hinglish fidelity, curriculum refusals).
FT without RAG degrades factual accuracy (1.48 → 0.84/3) — the model overwrites factual associations from pretraining. RAG fully recovers it. Fine-tuning adds what RAG cannot: idiomatic Hinglish delivery and clean out-of-scope refusals. - Intended use
- Out of scope: Non-NCERT content. The model is trained to refuse queries outside the NCERT curriculum; use with a curriculum-scoped RAG system.
- Notebooks
- What it does
paath-e4b-merged
Paath is an offline multilingual NCERT tutor for Indian students, built for the Gemma 4 Good
Hackathon on Kaggle. This is the QLoRA fine-tuned Gemma 4 E4B model — merged weights ready for
inference.
Kaggle Writeup: https://www.kaggle.com/competitions/gemma-4-good-hackathon/writeups/new-writeup-1779040319794
Airplane mode for 150 million students.
The full system runs CPU-only on a mid-range Android phone (₹20,000-class, ~Pixel 7): no internet, no cloud, ₹0 marginal inference cost per student. This model
handles the generation side; retrieval is done on-device with embeddinggemma-300m over a SQLite NCERT
chunk index.
What it does
Every response follows a four-part teaching structure regardless of subject, language, or grade:
- One sentence — what the concept is, simply stated
- Three to four numbered steps — flat, two sentences each
- One Indian analogy — grounded in everyday Indian experience
- "Try this:" — one follow-up question, never the answer
Language is mirrored per message: Devanagari triggers full Hindi; Hinglish keywords route to the casual mix students use when texting; everything else is English.
Example — Class 10 Biology, Hinglish: "yaar samjha de human heart kaise kaam karta hai"
Dil ek double pump hai jo blood ko lungs aur poore body mein simultaneously push karta hai.
- Jab atria relax karte hain, blood veins se fill hota hai — right atrium mein deoxygenated blood, left mein oxygenated.
- Atria contract karte hain aur blood ventricles mein push hota hai; tricuspid aur bicuspid valves close hote hain — yahi "lub" sound hai stethoscope mein.
- Ventricles contract karte hain — right side lungs ko bhejta hai re-oxygenation ke liye, left side aorta se poore body mein; yahi "dub" sound hai.
Socho jaise railway station ke do platforms — ek platform passengers ko "oxygen refill center" (lungs) bhejta hai, doosra unhe seedha destination (body
tissues) pe drop karta hai — dono simultaneously, kabhi mix nahi hote.Try this: Left ventricle ki wall right se zyada thick kyun hoti hai?
Training
| Parameter | Value |
|---|---|
| Base model | unsloth/gemma-4-e4b-it-unsloth-bnb-4bit |
| Method | QLoRA (Unsloth) |
| LoRA rank / alpha | r=128, lora_alpha=256 |
| Target modules | All projection layers |
| Epochs | 3 |
| Hardware | A100 40GB (~3.5 hours, ~$25) |
| Training examples | 52,255 pairs |
| Language split | 47% Hinglish / 47% English / 6% Hindi |
Data pipeline: 40 NCERT PDFs → Docling extraction (59,112 sections) → Gemini 2.5 Flash Q&A generation (428K pairs, ~$24) → filtering by length, dedup,
Hinglish leakage, and rebalancing → 52,255 training pairs.
Key training finding: Training on a shorter system prompt than the inference prompt causes silent format regression — the model learns formatting conditioned on the training prompt length and reverts to base style when a longer inference prompt is used. V2 fixed this by matching training prompt length to inference exactly, producing a 3× format compliance jump (0.44 → 1.36/3).
Evaluation
Scored by Gemini 2.5 Pro on five dimensions (0–3 each, max 15) across 25 hand-crafted NCERT discriminator test cases, independently validated on 75 additional cases (Claude Sonnet judge).
| Dimension (max 3) | Base | Base + RAG | FT | FT + RAG |
|---|---|---|---|---|
| Factual accuracy | 1.48 | 1.92 | 0.84 | 1.92 |
| Format compliance | 0.44 | 0.60 | 1.36 | 1.40 |
| Language match | 2.92 | 2.92 | 2.92 | 3.00 |
| Pedagogical clarity | 1.16 | 1.16 | 0.80 | 1.32 |
| Scope handling | 2.52 | 2.68 | 2.40 | 2.76 |
| Total /15 | 8.52 | 9.28 | 8.32 | 10.40 |
RAG makes it correct (+0.44/3, +30% relative). Fine-tuning makes it teach (2–3× format compliance, Hinglish fidelity, curriculum refusals).
FT without RAG degrades factual accuracy (1.48 → 0.84/3) — the model overwrites factual associations from pretraining. RAG fully recovers it. Fine-tuning adds
what RAG cannot: idiomatic Hinglish delivery and clean out-of-scope refusals.
Intended use
This model is designed for use with a RAG pipeline over NCERT content. Without retrieval it will produce correctly formatted but factually weaker responses. For on-device deployment, pair with:
- embeddinggemma-300m for retrieval
- LiteRT-LM for Android CPU inference (pending bundle layout fix — see paath-e4b-litertlm)
Out of scope: Non-NCERT content. The model is trained to refuse queries outside the NCERT curriculum; use with a curriculum-scoped RAG system.
Notebooks
| Resource | Link |
|---|---|
| Fine-tuning (Unsloth QLoRA on A100) | Colab |
| Inference + eval (all 4 configs) | Colab |
| LiteRT conversion | Colab |
| Kaggle writeup | Gemma 4 Good submission |
Trained with Unsloth and Hugging Face TRL.
- Downloads last month
- 4
