	---
	license: apache-2.0
	tags:
	- image-classification
	- tibetan
	- uchen
	- ume
	library_name: transformers
	pipeline_tag: image-classification
	---

	# Uchen vs Umê classifier (DINOv3 ViT-S)

	Binary Tibetan script classifier: uchen (printed) vs ume (cursive).

	Dataset (splits, Parquet, inference): [openpecha/uchen-ume-classification-benchmark](https://huggingface.co/datasets/openpecha/uchen-ume-classification-benchmark)

	## Training preprocess (from `config.yaml` + `train.py`)

	`train.py` builds three dataloaders with per-split preprocess (`preprocess_for_split` in `common.py`):

	| Split | `with_preprocess` config | Effect in `ScriptImageDataset.__getitem__` |
	|-------|--------------------------|------------------------------------------|
	| train | `train_preprocess: center_crop_whole_page` | Center crop before augment + DINO processor |
	| val | `val_preprocess: center_crop_whole_page` | Center crop before DINO processor |
	| test | `test_preprocess: none` | Full page: no crop, only DINO processor |

	The high validation scores for `with_preprocess` (val F1 ~0.99) are therefore measured on center-cropped pages, while the test split evaluated during training uses full pages (test F1 ~0.51). This mismatch between val and test is intentional in the code, not a bug.

	Benchmark eval must use `test_preprocess: none` (same as the test split) unless you are deliberately measuring crop-to-crop generalization.
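
The split-dependent behaviour described above can be sketched as follows. This is an illustration only, not the code in `common.py`: the names `SPLIT_PREPROCESS`, `center_crop_box`, and `crop_box_for_split` are hypothetical, and the crop fraction of 0.5 is an assumption, since the actual crop size is not stated here. Only the split names and the `center_crop_whole_page` / `none` values come from the table.

```python
# Hypothetical sketch: NOT the repository's preprocess_for_split implementation.
# Split names and preprocess values mirror the with_preprocess config table.
SPLIT_PREPROCESS = {
    "train": "center_crop_whole_page",
    "val": "center_crop_whole_page",
    "test": "none",
}


def center_crop_box(width, height, fraction=0.5):
    """Return (left, top, right, bottom) for a centered crop.

    `fraction` is an assumed parameter; the real crop size is not documented here.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    crop_w = max(1, round(width * fraction))
    crop_h = max(1, round(height * fraction))
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return (left, top, left + crop_w, top + crop_h)


def crop_box_for_split(split, width, height, fraction=0.5):
    """Region of the page a given split would see; 'none' keeps the full page."""
    if SPLIT_PREPROCESS[split] == "none":
        return (0, 0, width, height)
    return center_crop_box(width, height, fraction)


print(crop_box_for_split("val", 2000, 600))   # (500, 150, 1500, 450)
print(crop_box_for_split("test", 2000, 600))  # (0, 0, 2000, 600)
```

The returned tuple has the layout expected by Pillow's `Image.crop`, so the same box could be applied to a page image before it reaches the DINO processor. The point of the sketch is that a model trained and validated on the `val` region never sees the margins that the `test` region includes.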

	## Recommended weights for full manuscript pages

	Use `without_preprocess/final_model.pt`, which was trained without a runtime crop on any split.

	## Results summary

	Benchmark = 60 held-out images (30 uchen + 30 ume). Test = 867 images (work-stratified), full pages.

	| Variant | Train/val preprocess | Test & benchmark eval preprocess | Test acc | Test macro-F1 | Benchmark acc | Benchmark macro-F1 | Benchmark AUC |
	|---------|---------------------|----------------------------------|----------|---------------|---------------|-------------------|---------------|
	| `without_preprocess/` | none | none (full page) | 80.7% | 0.708 | 85.0% | 0.848 | 0.970 |
	| `with_preprocess/` | center crop | none (full page) | 56.1% | 0.506 | 68.3% | 0.648 | 0.953 |
	| ~~with_preprocess~~ | center crop | ~~center crop at inference~~ (not comparable to test) | n/a | n/a | ~~98.3%~~ | ~~0.983~~ | n/a |

	The 98.3% benchmark accuracy (struck through in the table) appears only if you center-crop at inference. That matches the validation setup, not the full-page protocol used to evaluate the test split during training.
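
To make the table easier to read, the sketch below converts the reported accuracies back into counts of correct images and shows how macro-F1 can sit well below accuracy. It assumes the usual definitions (accuracy = correct / total, macro-F1 = unweighted mean of per-class F1); the evaluation scripts are not shown here, and the function names are hypothetical. The confusion counts in the last line are made up for illustration and are not the model's predictions.

```python
def implied_correct(accuracy_pct, total):
    """Number of correct predictions implied by a rounded accuracy percentage."""
    return round(accuracy_pct / 100 * total)


def f1(tp, fp, fn):
    """F1 for one class from its true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1_binary(tp, fp, fn, tn):
    """Unweighted mean of the F1 scores of the two classes."""
    f1_pos = f1(tp, fp, fn)
    # For the negative class the roles swap: its false positives are the
    # positive class's false negatives, and vice versa.
    f1_neg = f1(tn, fn, fp)
    return (f1_pos + f1_neg) / 2


# Accuracies from the results table, converted back to image counts.
for name, pct, total in [
    ("without_preprocess, benchmark", 85.0, 60),
    ("with_preprocess, benchmark", 68.3, 60),
    ("with_preprocess, benchmark, cropped at inference", 98.3, 60),
    ("without_preprocess, test", 80.7, 867),
    ("with_preprocess, test", 56.1, 867),
]:
    print(f"{name}: {implied_correct(pct, total)}/{total}")

# Made-up confusion counts (NOT the model's predictions): accuracy is
# 110/150 = 0.733, but the weaker class pulls macro-F1 down.
print(round(macro_f1_binary(tp=90, fp=30, fn=10, tn=20), 3))  # 0.659
```

On the 60-image benchmark each image is worth about 1.7 percentage points, so the gap between the two full-page rows corresponds to 10 images.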

	## Benchmark evaluation (60 images)

	### Fair eval — full pages (`preprocess none`, matches `test_preprocess`)

	`without_preprocess` (recommended):

	```bash
	python inference_uchen_ume.py \
	--benchmark-dir benchmark \
	--weights without_preprocess/final_model.pt \
	--preprocess none
	```

	`with_preprocess` (same protocol as training test split):

	```bash
	python inference_uchen_ume.py \
	--benchmark-dir benchmark \
	--weights with_preprocess/final_model.pt \
	--preprocess none
	```
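
If you want to script both full-page runs together, a small wrapper along these lines may help. It is a sketch, not part of the repository: `build_eval_command` and `run_fair_eval` are hypothetical helpers, and the only flags used are the ones shown in the commands above (`--benchmark-dir`, `--weights`, `--preprocess`). It defaults to a dry run that prints the commands instead of executing them.

```python
import subprocess
import sys

# Both checkpoints are evaluated with the same full-page protocol.
VARIANTS = ("without_preprocess", "with_preprocess")


def build_eval_command(variant, benchmark_dir="benchmark", preprocess="none"):
    """Build the inference command for one checkpoint variant."""
    return [
        sys.executable, "inference_uchen_ume.py",
        "--benchmark-dir", benchmark_dir,
        "--weights", f"{variant}/final_model.pt",
        "--preprocess", preprocess,
    ]


def run_fair_eval(dry_run=True):
    """Print (dry run) or execute the full-page evaluation for every variant."""
    commands = [build_eval_command(v) for v in VARIANTS]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if the script exits non-zero.
            subprocess.run(cmd, check=True)
    return commands


run_fair_eval(dry_run=True)
```

Keeping `preprocess="none"` as the default makes it harder to compare the two variants under different protocols by accident.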

	From this repo:

	```bash
	python experiments/uchen_ume_binary/eval_benchmark.py \
	--checkpoint without_preprocess/final_model.pt --benchmark-dir benchmark/benchmark

	python experiments/uchen_ume_binary/eval_benchmark.py \
	--checkpoint with_preprocess/final_model.pt --benchmark-dir benchmark/benchmark
	# default test-preprocess is none — do NOT pass center_crop for fair comparison
	```

	## Parquet dataset

	[openpecha/uchen-ume-classification-benchmark](https://huggingface.co/datasets/openpecha/uchen-ume-classification-benchmark)

	```python
	from datasets import load_dataset
	bench = load_dataset("openpecha/uchen-ume-classification-benchmark", split="benchmark")
	```
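
After loading, it is worth checking that the split really contains 30 images per class. The helper below is a sketch that works on any iterable of dict-like rows. The column name `label` is an assumption, so inspect `bench.column_names` first and adjust it. The rows used here are stand-ins so the snippet runs without downloading anything.

```python
from collections import Counter


def label_counts(rows, label_column="label"):
    """Count how many rows carry each label value.

    `label_column` is an assumed name; check the dataset's columns first.
    """
    return Counter(row[label_column] for row in rows)


# Stand-in rows for illustration only (not taken from the dataset).
rows = [{"label": "uchen"}, {"label": "ume"}, {"label": "uchen"}]
print(label_counts(rows))

# With the real split (column name to be confirmed via bench.column_names):
# counts = label_counts(bench, label_column="label")
```

Iterating over full rows also decodes each image, so for a larger split it may be faster to count over the label column alone.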

	## Load weights

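	```python
	import torch
	from huggingface_hub import hf_hub_download

	# Download the recommended full-page checkpoint from the model repo.
	path = hf_hub_download(
	    repo_id="openpecha/uchen-ume-classifier",
	    filename="without_preprocess/final_model.pt",
	    repo_type="model",
	)
	# weights_only=False unpickles arbitrary Python objects: only load checkpoints you trust.
	ckpt = torch.load(path, map_location="cpu", weights_only=False)
	```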
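
The layout of `ckpt` is not documented here, so inspect it before building a model around it. The sketch below assumes the checkpoint is a dictionary that either is a state dict itself or wraps one under a common key. The candidate key names are guesses, not confirmed for this checkpoint, and `extract_state_dict` is a hypothetical helper.

```python
def extract_state_dict(ckpt, candidate_keys=("model_state_dict", "state_dict", "model")):
    """Return the nested state dict if one of the assumed keys holds it.

    Falls back to `ckpt` itself, which covers checkpoints saved directly
    as a state dict. The key names are assumptions, not confirmed.
    """
    if not isinstance(ckpt, dict):
        raise TypeError(f"expected a dict-like checkpoint, got {type(ckpt).__name__}")
    for key in candidate_keys:
        if isinstance(ckpt.get(key), dict):
            return ckpt[key]
    return ckpt


# Stand-in checkpoints for illustration (plain values instead of tensors).
wrapped = {"epoch": 10, "model_state_dict": {"head.weight": [0.1, 0.2]}}
bare = {"head.weight": [0.1, 0.2]}
print(sorted(extract_state_dict(wrapped)))  # ['head.weight']
print(sorted(extract_state_dict(bare)))     # ['head.weight']
```

Printing `ckpt.keys()` on the real checkpoint shows which case applies. If `ckpt` turns out to be a pickled module rather than a dictionary, the helper raises `TypeError` and the object can be used directly.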