Gemmacademy: Fractions v1

Built for the Google Gemma 4 Hackathon (May 2026).

A fine-tuned Gemma 4 E2B that runs on a student's Android phone offline, grounded in a teacher's specific lesson content.

What this is

Gemmacademy is an end-to-end pipeline that lets a teacher fine-tune Gemma 4 E2B on their week's lesson materials and ship the resulting model to students' phones. Students then use it offline at home, getting tutoring grounded in what their teacher actually taught, with no internet required.

This repo contains the first deployed model from that pipeline: a Gemma 4 E2B fine-tuned to teach 4th-grade fractions in the style of a fictional teacher named "Mrs. Henderson," using a specific procedural method called the "Pizza Method."

The on-device artifact (gemmacademy-fractions-v1-wi8.litertlm) runs via LiteRT-LM on Android. The student app downloads it once over Wi-Fi, then runs it entirely offline.

Why local AI for this use case

Roughly 15-20 million children in the US live in households without reliable broadband. Globally that number is in the hundreds of millions. These kids attend school, sit through lessons, then go home to environments where every cloud-based tutoring tool and every homework helper simply doesn't work. Beyond connectivity, on-device inference also addresses FERPA constraints around student data leaving school networks, and per-API-call pricing that makes serving low-income students at scale economically infeasible.

A fine-tuned model that fits on a $50 phone has fundamentally different unit economics. That's the bet of this project.

What's in this repo

| File / Directory | Contents | Size |
| --- | --- | --- |
| gemmacademy-fractions-v1-wi8.litertlm | The deployment artifact. Quantized via dynamic_wi8_afp32. This is what runs on student phones. | 4.8 GB |
| bf16/ | Full-precision merged weights (Unsloth merge of LoRA + base into BF16 safetensors). | 9.6 GB |
| lora-adapter/ | The LoRA adapter on its own; apply to base Gemma 4 E2B for a different quantization or platform. | ~240 MB |
| qa-fractions.jsonl | The 500 synthetic training Q&A pairs the model was fine-tuned on. | ~600 KB |
| lesson-content/fractions-pizza-method.txt | The original ~2,000-word lesson description that seeded the synthetic data. | small |
| qa_generation_prompt.md | The system prompt used to generate Q&A from the lesson with Gemma 4 26B. | small |
| eval-results-ship.md | Side-by-side eval of base vs. fine-tuned on 20 questions. | small |
| eval-results-compare.md | Full quantization shootout (BF16 / wi4 / wi8) at rank 32. | small |
| eval-results-r128.md | Quantization shootout (BF16 / wi4 / wi8 / weight_only_wi4) at rank 128. | small |
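
The weight-file sizes above line up with a simple estimate of parameter count times bytes per weight. The sketch below is only a sanity check. It assumes the approximate 4.8B parameter figure quoted under Training details and ignores file metadata and any tensors kept at higher precision, so real artifacts may differ somewhat.

```python
# Back-of-the-envelope artifact size: parameter count x bytes per weight.
# PARAMS is the approximate figure quoted in this card, not a measured value.
PARAMS = 4.8e9

# Bits stored per weight for each format discussed in this card.
BITS_PER_WEIGHT = {"bf16": 16, "wi8": 8, "wi4": 4}


def estimated_size_gb(params: float, bits: int) -> float:
    """Return the weight payload in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{estimated_size_gb(PARAMS, bits):.1f} GB")
```

Under these assumptions the estimate gives about 9.6 GB for BF16, 4.8 GB for 8-bit weights, and 2.4 GB for 4-bit weights, matching the sizes reported in this card.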

How to use

On Android via LiteRT-LM

This is the deployment target. See the LiteRT-LM Android Getting Started guide. Download gemmacademy-fractions-v1-wi8.litertlm to the device's app-private storage, then load it with the LiteRT-LM Kotlin API.

On a desktop via the litert-lm CLI

uv tool install litert-lm
litert-lm run \
  --from-huggingface-repo=jtmuller/gemmacademy-fractions-v1 \
  gemmacademy-fractions-v1-wi8.litertlm \
  --prompt="What is the Henderson Pizza Method?"

Verified working on Apple Silicon MacBooks at usable decode speeds.
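
For scripted runs, such as sending a list of evaluation prompts through the desktop CLI, the command above can be wrapped in a small Python helper. This is a minimal sketch: it reuses only the flags shown above, the function names are hypothetical, and it assumes the CLI writes the model's reply to standard output, which is not confirmed here.

```python
import shlex
import subprocess

REPO = "jtmuller/gemmacademy-fractions-v1"
ARTIFACT = "gemmacademy-fractions-v1-wi8.litertlm"


def build_command(prompt: str) -> list[str]:
    """Build the same litert-lm invocation shown above as an argument list."""
    return [
        "litert-lm",
        "run",
        f"--from-huggingface-repo={REPO}",
        ARTIFACT,
        f"--prompt={prompt}",
    ]


def ask(prompt: str, timeout_s: float = 600.0) -> str:
    """Run the CLI once and return its stdout (assumed to hold the reply)."""
    result = subprocess.run(
        build_command(prompt),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
        timeout=timeout_s,
    )
    return result.stdout


# Print the shell-quoted command without executing it.
print(shlex.join(build_command("What is the Henderson Pizza Method?")))
```

Passing the arguments as a list avoids shell-quoting problems when prompts contain quotes or apostrophes.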

As a base for further fine-tuning

Use the bf16/ directory as a transformers-compatible checkpoint, or apply lora-adapter/ on top of google/gemma-4-E2B-it.

Training details

| Field | Value |
| --- | --- |
| Base model | google/gemma-4-E2B-it (~4.8B params) |
| Method | LoRA via Unsloth |
| LoRA rank | 128 |
| LoRA alpha | 128 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Precision | BF16 |
| Optimizer | AdamW (8-bit) |
| Learning rate | 2e-4, linear schedule |
| Epochs | 3 |
| Batch size | 4 (× 2 grad accum = effective 8) |
| Hardware | 1× NVIDIA RTX 5090 (32 GB) |
| Training time | ~80 seconds |
| Final train loss | 0.76 |
| Final eval loss | 2.43 (held-out 50 examples) |

The notable train/eval loss gap (~1.7) is documented as an open question in our engineering notes. Our hypothesis is that the synthetic dataset has stylistic regularity the model memorizes faster than it generalizes; behavioral evaluation (see eval-results-*.md) was used as the primary quality signal rather than raw loss.
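
To make the gap more concrete, the two losses can be converted to perplexities. The sketch below assumes both figures are mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not stated explicitly here.

```python
import math

TRAIN_LOSS = 0.76  # final train loss reported above
EVAL_LOSS = 2.43   # final eval loss reported above


def perplexity(mean_nll_nats: float) -> float:
    """Convert a mean negative log-likelihood in nats to perplexity."""
    return math.exp(mean_nll_nats)


gap = EVAL_LOSS - TRAIN_LOSS
print(f"loss gap: {gap:.2f} nats")
print(f"train perplexity: {perplexity(TRAIN_LOSS):.1f}")
print(f"eval perplexity:  {perplexity(EVAL_LOSS):.1f}")
print(f"eval/train perplexity ratio: {math.exp(gap):.1f}x")
```

Under that assumption, held-out perplexity is roughly five times the training perplexity, which is consistent with treating behavioral evaluation rather than loss as the primary signal.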

Data

The 500 training Q&A pairs in qa-fractions.jsonl were synthetically generated. We wrote ~2,000 words of "lesson content" describing a fictional 4th-grade fractions class taught by a fictional teacher (Mrs. Henderson) using a fictional procedural method (the "Pizza Method"), then generated diverse Q&A pairs from that content using Gemma 4 26B (AWQ-4bit) served via vLLM as the data generator. The full prompt is in qa_generation_prompt.md.
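
A minimal sketch of loading the Q&A file and carving out a held-out split is shown below. The field names `question` and `answer` are assumptions about the JSONL schema, and the shuffle-and-slice split is illustrative only; it is not necessarily how the 50 held-out examples were chosen.

```python
import json
import random


def load_pairs(lines):
    """Parse JSONL lines into dicts, skipping blank lines."""
    pairs = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        row = json.loads(line)
        # "question" and "answer" are assumed field names, not confirmed.
        if not row.get("question") or not row.get("answer"):
            raise ValueError(f"line {number}: missing question or answer")
        pairs.append(row)
    return pairs


def split_holdout(pairs, holdout_size, seed=0):
    """Shuffle deterministically and return (train, holdout) lists."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[holdout_size:], shuffled[:holdout_size]


# Tiny in-memory stand-in for the real file; the rows are placeholders.
sample_lines = [
    '{"question": "What is 1/2 + 1/4?", "answer": "3/4"}',
    "",
    '{"question": "Which is larger, 1/3 or 1/4?", "answer": "1/3"}',
]
train, holdout = split_holdout(load_pairs(sample_lines), holdout_size=1)
print(len(train), "train,", len(holdout), "held out")
```

With the real file, the same functions can be fed from `open("qa-fractions.jsonl", encoding="utf-8")`.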

We use a fictional classroom rather than a real one because (a) we don't have access to a real classroom's materials at this stage, and (b) the fictional method gives us a clean evaluation signal: the base Gemma 4 E2B has never heard of "the Pizza Method," so any correct answer about it can be attributed to fine-tuning rather than recall from pretraining.

In real deployment, Gemmacademy will produce models from actual teachers' actual lesson materials. The fictional setup is a controlled stand-in for evaluation.

Evaluation

We evaluated 20 questions across three categories:

  • 10 classroom-specific: about the Pizza Method, Mrs. Henderson's specific catchphrases, etc. Only the fine-tune should know these.
  • 5 general 4th-grade fractions: both base and fine-tune should handle these.
  • 5 off-topic: capital of France, World War 2, etc. Both should respond reasonably; we want the fine-tune to not lose general capability.

Detailed results in eval-results-ship.md.
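
As a rough illustration of how responses in the three categories could be tallied automatically, the sketch below counts a response as a hit when it contains any expected phrase. This is a hypothetical proxy, not the procedure used for the reported results: substring matching misses the close paraphrases that a human reader would accept, and the sample responses are made up for the example.

```python
from collections import Counter


def contains_any(response: str, phrases: list[str]) -> bool:
    """Case-insensitive check for any expected phrase in the response."""
    text = response.lower()
    return any(phrase.lower() in text for phrase in phrases)


def tally(results):
    """Count hits per category.

    results: iterable of (category, response, expected_phrases) tuples.
    Returns {category: (hits, total)}.
    """
    hits, totals = Counter(), Counter()
    for category, response, phrases in results:
        totals[category] += 1
        if contains_any(response, phrases):
            hits[category] += 1
    return {category: (hits[category], totals[category]) for category in totals}


# Illustrative responses only; these are not real model outputs.
results = [
    ("classroom-specific", "Remember: equal slices, equal fractions!",
     ["equal slices, equal fractions"]),
    ("classroom-specific", "I don't know what that is.",
     ["equal slices, equal fractions"]),
    ("off-topic", "The capital of France is Paris.", ["paris"]),
]
print(tally(results))
```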

Headline finding: the deployed wi8 model captures the lesson essence on roughly 6 of 10 classroom-specific questions (e.g., produces "equal slices, equal fractions" or close paraphrases, draws procedurally correct fraction diagrams), while the base model uniformly responds with "I don't know what that is" or asks for context. On general fractions and off-topic questions, the fine-tune is on par with base. The fine-tune is a clear improvement on the target task without obvious capability regression elsewhere.
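
With only 10 classroom-specific questions, a score of 6 out of 10 carries wide statistical uncertainty. The sketch below computes a 95% Wilson score interval for that proportion. It treats each question as an independent pass/fail trial, which is a simplifying assumption.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Return the (low, high) Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials**2)) / denom
    return center - half, center + half


low, high = wilson_interval(6, 10)
print(f"6/10 -> 95% interval: {low:.2f} to {high:.2f}")
```

The interval spans roughly 31% to 83%, so the headline number is best read as evidence that the fine-tune learned the material, not as a precise pass rate.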

A finding worth flagging: int4 doesn't survive small-magnitude LoRA fine-tunes

The hackathon's "ship to the cheapest possible phone" framing pushed us hard toward int4 quantization (dynamic_wi4_afp32, ~2.4 GB). It didn't work for our fine-tune.

We tested four configurations with LoRA rank 128 (see eval-results-r128.md) and found:

  • dynamic_wi4_afp32 (4-bit dynamic, the official Gemma 4 E2B recipe): The fine-tune signal is largely lost. The model produces pizza-fractions content but doesn't reproduce the specific catchphrases or follow the procedural rules from training. Base Gemma 4 E2B in this same recipe also degrades on world knowledge ("Australia won the 2022 World Cup," which is fabricated).
  • weight_only_wi4_afp32 (4-bit weight-only, alternative algorithm): Catastrophically broken: produces degenerate token loops on every prompt.
  • dynamic_wi8_afp32 (8-bit, doubles file size to 4.8 GB): Captures the lesson essence cleanly, restores base capability on general questions. This is what we shipped.
  • BF16 (no quantization, 9.6 GB): Reproduces near-verbatim training-set phrasing. Confirms the LoRA learned what we wanted at training time.

Diagnosis: rank-128 LoRA produces weight deltas that are large enough to land in different bins after 8-bit quantization (256 bins per channel) but get rounded back to base values after 4-bit quantization (16 bins per channel). The base Gemma 4 E2B survives int4 because Google trained it with quantization-aware training; that benefit doesn't transfer to a post-hoc LoRA on top.
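
The rounding effect described in this diagnosis can be reproduced on toy data. The sketch below uses plain symmetric per-channel round-to-nearest quantization, a simplification of the actual dynamic_wi4_afp32 and dynamic_wi8_afp32 recipes. The weight range and the delta size of 0.01 are arbitrary illustrative choices, not measurements from this model.

```python
import random


def quantize(channel, bits):
    """Symmetric per-channel quantization to signed integer bins."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in channel) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) for w in channel]


def fraction_changed(base, merged, bits):
    """Fraction of weights whose quantized bin differs after adding the delta."""
    q_base, q_merged = quantize(base, bits), quantize(merged, bits)
    return sum(a != b for a, b in zip(q_base, q_merged)) / len(base)


rng = random.Random(0)
# One toy channel: the first weight pins the channel maximum at 1.0.
base = [1.0] + [rng.uniform(-0.9, 0.9) for _ in range(999)]
# Small LoRA-like update: +/-0.01 on every weight except the pinned one.
delta = [0.0] + [rng.choice((-0.01, 0.01)) for _ in range(999)]
merged = [w + d for w, d in zip(base, delta)]

for bits in (4, 8):
    changed = fraction_changed(base, merged, bits)
    print(f"{bits}-bit: {changed:.1%} of weights land in a different bin")
```

In this toy setup the 8-bit step (about 0.008) is smaller than the delta, so nearly every updated weight moves to a new bin. The 4-bit step (about 0.14) is far larger, so most updates round back to the base value.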

Implication for others: If you're fine-tuning Gemma 4 with LoRA and need int4 deployment, you'll likely need either (a) much higher LoRA ranks than 128, (b) quantization-aware fine-tuning, or (c) full fine-tuning. Standard LoRA ranks of 8-32 will almost certainly produce a model that looks fine in BF16 and breaks at int4. We landed on shipping the wi8 artifact (4.8 GB) as the right quality/size trade-off for our use case.

Limitations

  • Single subject, single grade. This v1 model only knows 4th-grade fractions content with the Pizza Method. Out-of-distribution questions get reasonable but generic responses.
  • Fictional content. Mrs. Henderson does not exist. The Pizza Method is invented. Real teachers using the pipeline get models trained on their real materials; this artifact is a demonstration.
  • Synthetic data limitations. The training data is generator-quality, not human-curated. There will be subtle inaccuracies and stylistic regularities the model picks up.
  • Eval set size. 50 held-out examples is small for reliable loss measurement. Behavioral eval is the more trustworthy signal at this scale.
  • No safety tuning beyond Gemma 4 base. The base model's safety properties pass through; we have not added or evaluated additional safety guarantees.
  • English only. Multilingual support is on the v2 roadmap.

License

This model is a derivative of google/gemma-4-E2B-it and is governed by the Gemma Terms of Use.

Citation

@misc{gemmacademy-fractions-v1,
  title  = {Gemmacademy: A Fine-Tuned On-Device Tutor for 4th Grade Fractions},
  author = {Muller, Joseph},
  year   = {2026},
  note   = {Built for the Google Gemma 4 Hackathon},
  url    = {https://huggingface.co/jtmuller/gemmacademy-fractions-v1}
}

Project links

  • Hackathon writeup: (link to Kaggle writeup once published)
  • Demo video: (link to YouTube demo once published)
  • Source code: (link to GitHub once published)