# T2.1 – Compressed Crop Disease Classifier

AIMS KTT Hackathon · Tier 2 · AgriTech / Computer Vision / Quantization / Serving

A compact MobileNetV2 classifier for 5 maize/cassava/bean leaf classes (four diseases plus healthy), quantized to INT8 ONNX (< 10 MB) and served through a FastAPI `/predict` endpoint, with a Kinyarwanda + French USSD/SMS fallback for feature-phone farmers.
## Submission artefacts

| Artefact | Location |
|---|---|
| Source repo | this repository |
| Model (INT8 ONNX) | model/model.onnx (also mirrored on Hugging Face; see link below) |
| Dataset | regenerated by scripts/download_data.py (≤ 2 minutes on a laptop) |
| Service | service/app.py + service/Dockerfile |
| Product artefact | ussd_fallback.md |
| Process log | process_log.md |
| Honor-code sign-off | SIGNED.md |
| 4-minute video | see link at the bottom of this README |
- Model hub: https://huggingface.co/<your-handle>/t2.1-crop-disease-mbv3s-int8 (update before submission)
- Video: https://youtu.be/<your-unlisted-id> (update before submission)
## Results
Source-of-truth numbers live in `model/metrics.json` (produced by
`scripts/evaluate.py`). Snapshot of the submitted run:

| Split | macro-F1 | accuracy | # images | latency mean (CPU) |
|---|---|---|---|---|
| clean test | 0.967 | 0.967 | 150 | 225 ms |
| field-noisy test | 0.899 | 0.900 | 60 | 214 ms |
| drop (clean → field) | 6.7 pp | – | – | – |

Model file size on disk: `model/model.onnx` = 2.29 MB (budget was < 10 MB).
FP32 → INT8 shrinkage: 8.48 MB → 2.29 MB (3.7×). FP32 vs INT8 top-1 parity on the
calibration batch: 1.000.
Per-class precision / recall / F1 on the clean test split:

| Class | precision | recall | F1 |
|---|---|---|---|
| bean_spot | 0.964 | 0.900 | 0.931 |
| cassava_mosaic | 0.968 | 1.000 | 0.984 |
| healthy | 0.906 | 0.967 | 0.935 |
| maize_blight | 1.000 | 0.967 | 0.983 |
| maize_rust | 1.000 | 1.000 | 1.000 |
Latency is measured end-to-end (including preprocessing), single-threaded on CPU
(an RTX 2050 laptop host, but running onnxruntime's CPUExecutionProvider); p95 is
375 ms on the clean split. On the target 2-vCPU district VM described in
ussd_fallback.md, latencies should sit in the same range, well within the
SMS-relay time budget.
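For orientation, the measurement in `scripts/evaluate.py` boils down to the loop below. This is a minimal sketch rather than the shipped script: the 224×224 resize, the plain `/255` normalisation, and the assumption that `model/class_names.json` holds the ordered class list are mine, not guaranteed details of the real pipeline.

```python
# Minimal sketch of the evaluation loop; scripts/evaluate.py is the
# source of truth. Preprocessing constants here are assumptions.
import json
import time
from pathlib import Path

import numpy as np
import onnxruntime as ort
from PIL import Image
from sklearn.metrics import accuracy_score, f1_score

sess = ort.InferenceSession("model/model.onnx", providers=["CPUExecutionProvider"])
class_names = json.loads(Path("model/class_names.json").read_text())
input_name = sess.get_inputs()[0].name

def predict(path: Path) -> tuple[int, float]:
    """Return (predicted class index, end-to-end latency in ms)."""
    t0 = time.perf_counter()
    img = Image.open(path).convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32)[None].transpose(0, 3, 1, 2) / 255.0
    logits = sess.run(None, {input_name: x})[0]
    return int(logits.argmax()), (time.perf_counter() - t0) * 1000.0

y_true, y_pred, lat_ms = [], [], []
for cls_dir in sorted(Path("data/mini_plant_set/test").iterdir()):
    for img_path in cls_dir.glob("*.jpg"):
        pred, ms = predict(img_path)
        y_true.append(class_names.index(cls_dir.name))
        y_pred.append(pred)
        lat_ms.append(ms)

print("macro-F1 :", f1_score(y_true, y_pred, average="macro"))
print("accuracy :", accuracy_score(y_true, y_pred))
print("latency  : mean %.0f ms, p95 %.0f ms" % (np.mean(lat_ms), np.percentile(lat_ms, 95)))
```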
## Reproducing the whole run in ≤ 2 commands
```bash
pip install -r requirements.txt
python scripts/run_all.py   # downloads data, trains, quantizes, evaluates, writes metrics.json
```
`scripts/run_all.py` is a thin wrapper that runs the five pipeline scripts
in order. If you'd rather step through them:
```bash
python scripts/download_data.py --out data/mini_plant_set --per-class 300
python scripts/make_field_set.py --src data/mini_plant_set/test --out data/test_field --per-class 12
python scripts/train.py --data data/mini_plant_set --out model --epochs 12
python scripts/quantize_export.py --ckpt model/best_fp32.pt --calib data/mini_plant_set/train --out model/model.onnx
python scripts/evaluate.py --onnx model/model.onnx --clean data/mini_plant_set/test --field data/test_field
```
Colab's free CPU tier is fine: the whole pipeline finishes in ~25 minutes because the dataset is small and MobileNetV2 is tiny.
## Running the service
Native Python:

```bash
uvicorn service.app:app --host 0.0.0.0 --port 8000
```

Docker:

```bash
docker build -f service/Dockerfile -t crop-classifier .
docker run --rm -p 8000:8000 crop-classifier
```
Hit it (macOS / Linux / WSL / Git-Bash):

```bash
curl -X POST -F "image=@service/samples/maize_rust_1.jpg" http://localhost:8000/predict
```
On Windows PowerShell, use `curl.exe` explicitly; the bare `curl` alias
maps to `Invoke-WebRequest` and handles `-F` differently:

```powershell
curl.exe -X POST -F "image=@service/samples/maize_rust_1.jpg" http://localhost:8000/predict
```
A ready-to-run PowerShell companion lives at `service/curl_examples.ps1`
(bash version: `service/curl_examples.sh`).
Sample response (real output from the shipped model on
`service/samples/maize_rust_1.jpg`):

```json
{
"label": "maize_rust",
"confidence": 0.8955,
"top3": [
{"label": "maize_rust", "confidence": 0.8955},
{"label": "cassava_mosaic", "confidence": 0.0421},
{"label": "healthy", "confidence": 0.0233}
],
"latency_ms": 37.4,
"rationale": {
"summary_en": "Dense small rust-coloured pustules consistent with Puccinia sorghi (common rust).",
"summary_fr": "Petites pustules rouille nombreuses, caractΓ©ristiques de Puccinia sorghi (rouille commune).",
"summary_rw": "Udutonga duto tw'ibara ry'uburinga bwinshi, ibimenyetso bya Puccinia sorghi.",
"recommended_action": "Apply Mancozeb 75% WP at 2.5 g/L, 10β14 day interval. Remove volunteer maize nearby.",
"low_confidence": false,
"escalation_hint": null
}
}
```
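For context, the core of that endpoint is only a screenful of code. A minimal sketch of what `service/app.py` does, assuming the same preprocessing as training; the hard-coded class order, the 0.60 low-confidence threshold, and the omission of the rationale block are simplifications (the real service reads `model/class_names.json` and fills in the bilingual rationale):

```python
# Minimal sketch of the /predict endpoint; service/app.py is the real
# implementation. Threshold and class order below are assumptions.
import io
import time

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()
sess = ort.InferenceSession("model/model.onnx", providers=["CPUExecutionProvider"])
CLASSES = ["bean_spot", "cassava_mosaic", "healthy", "maize_blight", "maize_rust"]

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

@app.post("/predict")
async def predict(image: UploadFile = File(...)):
    t0 = time.perf_counter()
    img = Image.open(io.BytesIO(await image.read())).convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32)[None].transpose(0, 3, 1, 2) / 255.0
    probs = softmax(sess.run(None, {sess.get_inputs()[0].name: x})[0][0])
    order = probs.argsort()[::-1][:3]
    top3 = [{"label": CLASSES[i], "confidence": round(float(probs[i]), 4)} for i in order]
    return {
        "label": top3[0]["label"],
        "confidence": top3[0]["confidence"],
        "top3": top3,
        "latency_ms": round((time.perf_counter() - t0) * 1000.0, 1),
        # The shipped service also builds the rationale block here;
        # low_confidence gates the USSD/SMS escalation path.
        "low_confidence": bool(probs[order[0]] < 0.60),
    }
```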
And a field-noisy example that triggers the low-confidence escalation (this is the pair we demo on camera):
curl -X POST -F "image=@service/samples/maize_blight_field_1.jpg" http://localhost:8000/predict
# -> confidence 0.497, low_confidence=true, escalation_hint prompts a second photo
PowerShell equivalent:

```powershell
curl.exe -X POST -F "image=@service/samples/maize_blight_field_1.jpg" http://localhost:8000/predict
```
Set `RATIONALE_MODE=gradcam` (and keep `model/best_fp32.pt` around) to add a
Grad-CAM heat-map as `rationale.gradcam_png_b64`. It's off by default so
the shipped image stays small.
## Repo map
```text
.
├── README.md                 <- this file
├── SIGNED.md                 <- honor-code sign-off
├── process_log.md            <- hour-by-hour timeline + LLM declarations
├── ussd_fallback.md          <- Product & Business adaptation artefact
├── requirements.txt          <- training + eval deps
├── scripts/
│   ├── download_data.py      <- builds mini_plant_set (HF + procedural fallback)
│   ├── make_field_set.py     <- builds test_field (blur / JPEG / brightness)
│   ├── train.py              <- MobileNetV2 fine-tuning
│   ├── quantize_export.py    <- static INT8 ONNX export (< 10 MB)
│   ├── evaluate.py           <- macro-F1 on clean + field splits
│   └── run_all.py            <- one-shot pipeline wrapper (≤ 2-command repro)
├── service/
│   ├── app.py                <- FastAPI /predict
│   ├── Dockerfile            <- CPU-only image (~350 MB)
│   ├── requirements-service.txt
│   ├── curl_examples.sh      <- bash example calls
│   ├── curl_examples.ps1     <- PowerShell example calls
│   └── samples/              <- example images used in the demo
├── model/
│   ├── model.onnx            <- shipped INT8 artefact
│   ├── best_fp32.pt          <- (git-ignored) FP32 checkpoint, for Grad-CAM
│   ├── class_names.json
│   ├── metrics.json          <- evaluation report
│   ├── quant_report.json     <- quantization parity numbers
│   └── train_metrics.json    <- training curves + per-class breakdown
├── data/                     <- (git-ignored) regenerated in < 2 min
│   ├── mini_plant_set/{train,val,test}/{class}/*.jpg
│   └── test_field/{class}/*.jpg
└── T2.1_Compressed_Crop_Disease_Classifier.pdf   <- original brief (reference)
```
## Technical choices, in one screen
- **Backbone.** MobileNetV2 (width = 1.0). I ran a three-way bake-off
  against the candidates in the brief, and only MobileNetV2 survives both
  the size budget (< 10 MB) and the accuracy floor (≥ 0.80 macro-F1)
  after INT8 quantisation. Full experiment record in
  `model/experiments/README.md`; short version: MobileNetV3-Small INT8 →
  clean F1 0.393 (hard-swish breaks PTQ); EfficientNet-B0 INT8 → clean F1
  0.690 and its FP32 checkpoint is 15.3 MB (over budget); MobileNetV2 INT8 →
  clean F1 0.967 at 2.29 MB. ReLU6 is the only activation function in the
  candidate set that survives onnxruntime's INT8 rescalers cleanly.
- **Fine-tuning.** Two-phase: 2 epochs with the backbone frozen (head only)
  to stabilise gradients, then 10 epochs fully unfrozen at LR × 0.3. AdamW,
  cosine schedule, label smoothing 0.05.
- **Augmentation.** Flip, 15° rotation, mild ColorJitter. Intentionally
  conservative: the `test_field` split already provides a hard robustness probe.
- **Quantization.** ONNX Runtime static PTQ (QDQ format, per-channel INT8
  weights, INT8 activations). Calibration on 128 random train images. Parity
  check: top-1 agreement ≥ 0.98 vs FP32 on the calibration batch. A sketch of
  this step follows the list.
- **Serving.** onnxruntime inside FastAPI; no torch in the Docker image, so
  the container weighs ~350 MB. Grad-CAM is a lazy, opt-in path.
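For reference, the quantization step follows onnxruntime's standard static-PTQ recipe. A minimal sketch of what `scripts/quantize_export.py` does; the FP32 export path, the input tensor name `"input"`, and the preprocessing are assumptions:

```python
# Minimal sketch of static INT8 PTQ in the QDQ format;
# scripts/quantize_export.py is the source of truth.
from pathlib import Path

import numpy as np
from PIL import Image
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static,
)

class ImageFolderReader(CalibrationDataReader):
    """Feeds a random sample of train images to the calibrator."""
    def __init__(self, folder: str, input_name: str, n: int = 128):
        paths = list(Path(folder).rglob("*.jpg"))
        rng = np.random.default_rng(0)
        picks = rng.choice(paths, size=min(n, len(paths)), replace=False)
        self.batches = iter({input_name: self._load(p)} for p in picks)

    @staticmethod
    def _load(path: Path) -> np.ndarray:
        img = Image.open(path).convert("RGB").resize((224, 224))
        return np.asarray(img, dtype=np.float32)[None].transpose(0, 3, 1, 2) / 255.0

    def get_next(self):
        return next(self.batches, None)

quantize_static(
    model_input="model/model_fp32.onnx",   # FP32 ONNX export of best_fp32.pt (assumed path)
    model_output="model/model.onnx",
    calibration_data_reader=ImageFolderReader("data/mini_plant_set/train", "input"),
    quant_format=QuantFormat.QDQ,
    per_channel=True,                      # per-channel scales for the weights
    weight_type=QuantType.QInt8,
    activation_type=QuantType.QInt8,
)
```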
Field robustness & the question "which augmentation next?"
Clean → field drop is 6.7 pp (0.967 → 0.899), comfortably under the
12 pp robustness-bonus threshold in the brief. Where it does cost us is
bean_spot recall (0.90 → 0.75): JPEG compression at q=50–65 blurs the
small angular spots just enough that the model tips them over into
healthy. Three of the four misses in the field-noisy bean_spot bucket go
to healthy, which matches the clean-test confusion pattern (the only
bean_spot misses on the clean test also went to healthy).
The single augmentation I would add next to close the gap is a RandomJPEG
re-compression transform (quality 40–80) during training: it is the exact
distortion our test_field generator applies, and color-jitter alone does not
teach the backbone to tolerate JPEG block artefacts. I deliberately did not
add it to the submitted run, so the reported drop is an honest measurement
against a held-out robustness probe rather than a number I trained toward.
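For concreteness, such a transform is a few lines on top of PIL. A hypothetical `RandomJPEG` sketch matching the quality range above; it would slot into the torchvision transform pipeline before `ToTensor`:

```python
# Hypothetical RandomJPEG augmentation (not in the submitted run, as
# explained above): re-encode at a random JPEG quality so the model
# learns to tolerate block artefacts.
import io
import random

from PIL import Image

class RandomJPEG:
    def __init__(self, quality=(40, 80), p: float = 0.5):
        self.quality, self.p = quality, p

    def __call__(self, img: Image.Image) -> Image.Image:
        if random.random() > self.p:
            return img
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=random.randint(*self.quality))
        buf.seek(0)
        return Image.open(buf).convert("RGB")
```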
## Grad-CAM rationale (stretch goal)
Set `RATIONALE_MODE=gradcam` before starting the service and the
`/predict` response picks up an extra `gradcam_png_b64` field: a
base64-encoded PNG overlay showing where the model "looked" on the leaf.
This is the asset an extension officer can show the farmer on a tablet
when they need to justify the diagnosis visually.
```bash
RATIONALE_MODE=gradcam uvicorn service.app:app --host 0.0.0.0 --port 8000
# A sample Grad-CAM PNG lives at service/samples/_gradcam_maize_rust.png
```
Grad-CAM is off by default because enabling it requires torch inside
the Docker image (≈ 1.1 GB vs the ≈ 350 MB lean image). See the "hardest
decision" note in process_log.md. The FP32 checkpoint `model/best_fp32.pt`
needed for Grad-CAM is not committed (git-ignored because of its size); run
`python scripts/run_all.py` once to rebuild it locally.
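For the curious, the Grad-CAM path reduces to a forward/backward hook pair on the last convolutional block. A minimal sketch assuming the FP32 checkpoint is a torchvision MobileNetV2; the actual opt-in path in `service/app.py` may differ in detail:

```python
# Minimal Grad-CAM sketch for a torchvision MobileNetV2 checkpoint
# (assumed architecture; the served implementation may differ).
import base64
import io

import numpy as np
import torch
from PIL import Image
from torchvision.models import mobilenet_v2

model = mobilenet_v2(num_classes=5)
model.load_state_dict(torch.load("model/best_fp32.pt", map_location="cpu"))
model.eval()

acts, grads = {}, {}
layer = model.features[-1]  # last conv block, 7x7 spatial resolution
layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

def gradcam_png_b64(x: torch.Tensor) -> str:
    """x: (1, 3, 224, 224) float tensor -> base64-encoded PNG heat-map."""
    logits = model(x)
    logits[0, logits.argmax()].backward()            # gradient of the top class
    w = grads["g"].mean(dim=(2, 3), keepdim=True)    # channel importance weights
    cam = torch.relu((w * acts["a"]).sum(dim=1))[0]  # weighted activation map
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    heat = Image.fromarray((cam.detach().numpy() * 255).astype(np.uint8)).resize((224, 224))
    buf = io.BytesIO()
    heat.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()
```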
Honor code & LLM declarations
See `SIGNED.md` for the signed honor code and `process_log.md` for the
full hour-by-hour timeline, plus the sample prompts I sent the assistant
and the one I discarded.