license: apache-2.0
tags:
  - image-classification
  - tibetan
  - uchen
  - ume
library_name: transformers
pipeline_tag: image-classification

Uchen vs Umê classifier (DINOv3 ViT-S)

Binary Tibetan script classifier: uchen (printed) vs ume (cursive).

Dataset (splits, Parquet, inference): openpecha/uchen-ume-classification-benchmark

Training preprocess (from `config.yaml` + `train.py`)

`train.py` builds three dataloaders, each with its own preprocess setting (resolved by `preprocess_for_split` in `common.py`):

| Split | `with_preprocess` config | Effect in `ScriptImageDataset.__getitem__` |
| --- | --- | --- |
| train | `train_preprocess: center_crop_whole_page` | Center crop before augmentation + DINO processor |
| val | `val_preprocess: center_crop_whole_page` | Center crop before DINO processor |
| test | `test_preprocess: none` | Full page: no crop, only DINO processor |

So the high validation scores for `with_preprocess` (val F1 ~0.99) are measured on center-cropped pages, while the test split evaluated during training uses full pages (test F1 ~0.51). This train/val versus test mismatch is intentional in the code, not a bug.
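The per-split behaviour in the table above can be sketched as a small dispatch on the config values. This is a minimal illustration, not the code in `common.py`: the helper names `center_crop_box` and `crop_box_for_split` are hypothetical, and the 50% crop fraction is an assumption, since the actual crop geometry is not stated here.

```python
# Per-split settings as listed in the table (with_preprocess variant).
SPLIT_PREPROCESS = {
    "train": "center_crop_whole_page",
    "val": "center_crop_whole_page",
    "test": "none",
}


def center_crop_box(width, height, frac=0.5):
    """Return (left, top, right, bottom) of a centered crop.

    `frac` is an assumed crop fraction per side, not the repo's real value.
    """
    if not 0.0 < frac <= 1.0:
        raise ValueError("frac must be in (0, 1]")
    crop_w = max(1, round(width * frac))
    crop_h = max(1, round(height * frac))
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return (left, top, left + crop_w, top + crop_h)


def crop_box_for_split(split, width, height):
    """Pick the region of the page that a given split would see."""
    mode = SPLIT_PREPROCESS[split]
    if mode == "none":
        return (0, 0, width, height)  # full page, no crop
    if mode == "center_crop_whole_page":
        return center_crop_box(width, height)
    raise ValueError(f"unknown preprocess mode: {mode!r}")


# Hypothetical 2000x600 page: train/val see the center, test sees everything.
for split in ("train", "val", "test"):
    print(split, crop_box_for_split(split, 2000, 600))
```

The returned box has the `(left, upper, right, lower)` layout that `PIL.Image.Image.crop` accepts. The point is only that train and val see a different region of the page than test does, which is why validation and test scores are not directly comparable.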

Benchmark eval must use test_preprocess: none (same as the test split) unless you are deliberately measuring crop-to-crop generalization.

Recommended weights for full manuscript pages

`without_preprocess/final_model.pt`: trained without a runtime crop on any split.

Results summary

Benchmark = 60 held-out images (30 uchen + 30 ume). Test = 867 images (work-stratified), full pages.

| Variant | Train/val preprocess | Test & benchmark eval preprocess | Test acc | Test macro-F1 | Benchmark acc | Benchmark macro-F1 | Benchmark AUC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `without_preprocess/` | none | none (full page) | 80.7% | 0.708 | 85.0% | 0.848 | 0.970 |
| `with_preprocess/` | center crop | none (full page) | 56.1% | 0.506 | 68.3% | 0.648 | 0.953 |
| ~~with_preprocess~~ | center crop | ~~center crop at inference~~ (not comparable to test) | n/a | n/a | ~~98.3%~~ | ~~0.983~~ | n/a |

The ~~98.3%~~ benchmark number only appears if you center-crop at inference, which matches val but not how the model was evaluated on test during training.
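For readers comparing the accuracy and macro-F1 columns, the sketch below shows how the two metrics can diverge when classes are unevenly represented. It uses the standard definition of macro-F1 (unweighted mean of per-class F1). Whether the repo computes it this way or through a library is an assumption, and the toy labels are invented for illustration, not taken from the test or benchmark sets.

```python
def per_class_f1(y_true, y_pred, cls):
    """F1 for one class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels that occur."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, c) for c in classes) / len(classes)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, imbalanced example (made-up labels): 8 uchen pages, 2 ume pages,
# with one ume page misclassified as uchen.
y_true = ["uchen"] * 8 + ["ume"] * 2
y_pred = ["uchen"] * 9 + ["ume"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case a single error on the minority class leaves accuracy at 0.90 but pulls macro-F1 down to about 0.80, which is the same kind of gap visible between the test accuracy and test macro-F1 columns.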

Benchmark evaluation (60 images)

Fair eval — full pages (`preprocess none`, matches `test_preprocess`)

without_preprocess (recommended):

python inference_uchen_ume.py \
  --benchmark-dir benchmark \
  --weights without_preprocess/final_model.pt \
  --preprocess none

with_preprocess (same protocol as training test split):

python inference_uchen_ume.py \
  --benchmark-dir benchmark \
  --weights with_preprocess/final_model.pt \
  --preprocess none

From this repo:

python experiments/uchen_ume_binary/eval_benchmark.py \
  --checkpoint without_preprocess/final_model.pt --benchmark-dir benchmark/benchmark

python experiments/uchen_ume_binary/eval_benchmark.py \
  --checkpoint with_preprocess/final_model.pt --benchmark-dir benchmark/benchmark
# default test-preprocess is none — do NOT pass center_crop for fair comparison

Parquet dataset

openpecha/uchen-ume-classification-benchmark

from datasets import load_dataset
bench = load_dataset("openpecha/uchen-ume-classification-benchmark", split="benchmark")
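After loading, a quick check of the class balance can confirm that the split matches the 30 uchen + 30 ume description. The helper below is a sketch: the column name `label` is an assumption about the Parquet schema (inspect `bench.features` for the real column names), and `label_counts` is a hypothetical helper, not part of the repo.

```python
from collections import Counter


def label_counts(rows, label_key="label"):
    """Count label values across an iterable of dict-like rows.

    `label_key` is an assumed column name; adjust it to the real schema.
    """
    return Counter(row[label_key] for row in rows)


# Stand-in rows so the sketch runs without downloading anything.
demo_rows = [{"label": "uchen"}, {"label": "ume"}, {"label": "uchen"}]
print(label_counts(demo_rows))
```

Iterating a `datasets.Dataset` yields one dict per row, so `label_counts(bench)` should work unchanged once the column name is confirmed. The labels may be stored as integer class ids rather than strings.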

Load weights

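from huggingface_hub import hf_hub_download
import torch

# Download (or reuse from the local cache) the recommended full-page checkpoint.
path = hf_hub_download("openpecha/uchen-ume-classifier", "without_preprocess/final_model.pt", repo_type="model")
# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
ckpt = torch.load(path, map_location="cpu", weights_only=False)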
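The layout of the loaded object is not documented here, so it helps to inspect it before building a model around it. The helper below is a minimal sketch that only summarizes top-level structure. `summarize_checkpoint` is a hypothetical name, and the keys in the demo dictionary are invented placeholders, not the actual contents of `final_model.pt`.

```python
def summarize_checkpoint(ckpt, max_items=10):
    """Return human-readable lines describing a checkpoint's top level."""
    if not isinstance(ckpt, dict):
        return [f"<{type(ckpt).__name__}>"]
    lines = []
    for key, value in list(ckpt.items())[:max_items]:
        shape = getattr(value, "shape", None)
        if shape is not None:
            # Tensors and arrays expose .shape; report it without importing torch.
            lines.append(f"{key}: tensor-like, shape={tuple(shape)}")
        elif isinstance(value, dict):
            lines.append(f"{key}: dict with {len(value)} entries")
        else:
            lines.append(f"{key}: {type(value).__name__}")
    return lines


class _FakeTensor:
    """Stand-in for a tensor so the demo runs without torch."""
    shape = (2, 3)


# Invented placeholder keys, for demonstration only.
demo_ckpt = {"weights": {"layer.w": _FakeTensor()}, "bias": _FakeTensor(), "epoch": 3}
for line in summarize_checkpoint(demo_ckpt):
    print(line)
```

Calling `summarize_checkpoint(ckpt)` on the real checkpoint shows whether it is a bare state dict or a wrapper dictionary with extra metadata, which determines how the weights should be passed to a model.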