dchen0 committed on
Commit 07c6d94 · verified · 1 Parent(s): d3d01f4

Replace auto-generated model card with detailed description

Files changed (1)
  1. README.md +92 -106
README.md CHANGED
@@ -1,115 +1,101 @@
  ---
  license: apache-2.0
  pipeline_tag: image-classification
- library_name: transformers # ← change “peft” → “transformers”
  tags:
  - dinov2
  - image-classification
  - fonts
  ---

- # dchen0/font-classifier
- Merged DINOv2‑base checkpoint with LoRA weights for font classification.
-
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- This model is a fine-tuned version of [facebook/dinov2-base-imagenet1k-1-layer](https://huggingface.co/facebook/dinov2-base-imagenet1k-1-layer) on the imagefolder dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.2637
- - Model Preparation Time: 0.0016
- - Accuracy: 0.9163
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 0.0001
- - train_batch_size: 32
- - eval_batch_size: 32
- - seed: 42
- - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: linear
- - num_epochs: 1
-
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss | Model Preparation Time | Accuracy |
- |:-------------:|:------:|:----:|:---------------:|:----------------------:|:--------:|
- | 0.7099 | 0.0182 | 50 | 0.6595 | 0.0016 | 0.7594 |
- | 0.7084 | 0.0363 | 100 | 0.6175 | 0.0016 | 0.7806 |
- | 0.7638 | 0.0545 | 150 | 0.7014 | 0.0016 | 0.7337 |
- | 0.6451 | 0.0727 | 200 | 0.6177 | 0.0016 | 0.7757 |
- | 0.6852 | 0.0908 | 250 | 0.5691 | 0.0016 | 0.7971 |
- | 0.5753 | 0.1090 | 300 | 0.5666 | 0.0016 | 0.8048 |
- | 0.5925 | 0.1272 | 350 | 0.5235 | 0.0016 | 0.8204 |
- | 0.6969 | 0.1453 | 400 | 0.5725 | 0.0016 | 0.7922 |
- | 0.6096 | 0.1635 | 450 | 0.5103 | 0.0016 | 0.8173 |
- | 0.5994 | 0.1817 | 500 | 0.5075 | 0.0016 | 0.8183 |
- | 0.5272 | 0.1999 | 550 | 0.5116 | 0.0016 | 0.8229 |
- | 0.5193 | 0.2180 | 600 | 0.4952 | 0.0016 | 0.8244 |
- | 0.5689 | 0.2362 | 650 | 0.4662 | 0.0016 | 0.8388 |
- | 0.5126 | 0.2544 | 700 | 0.4651 | 0.0016 | 0.8327 |
- | 0.5301 | 0.2725 | 750 | 0.5080 | 0.0016 | 0.8158 |
- | 0.5424 | 0.2907 | 800 | 0.4573 | 0.0016 | 0.8357 |
- | 0.4357 | 0.3089 | 850 | 0.4412 | 0.0016 | 0.8486 |
- | 0.5522 | 0.3270 | 900 | 0.4755 | 0.0016 | 0.8256 |
- | 0.5639 | 0.3452 | 950 | 0.4463 | 0.0016 | 0.8339 |
- | 0.4522 | 0.3634 | 1000 | 0.4347 | 0.0016 | 0.8458 |
- | 0.5548 | 0.3815 | 1050 | 0.4112 | 0.0016 | 0.8560 |
- | 0.4815 | 0.3997 | 1100 | 0.4300 | 0.0016 | 0.8514 |
- | 0.5028 | 0.4179 | 1150 | 0.3840 | 0.0016 | 0.8713 |
- | 0.4417 | 0.4360 | 1200 | 0.4364 | 0.0016 | 0.8462 |
- | 0.4465 | 0.4542 | 1250 | 0.3731 | 0.0016 | 0.8740 |
- | 0.3935 | 0.4724 | 1300 | 0.3672 | 0.0016 | 0.8753 |
- | 0.5306 | 0.4906 | 1350 | 0.4480 | 0.0016 | 0.8388 |
- | 0.3991 | 0.5087 | 1400 | 0.3718 | 0.0016 | 0.8698 |
- | 0.483 | 0.5269 | 1450 | 0.3916 | 0.0016 | 0.8652 |
- | 0.4323 | 0.5451 | 1500 | 0.3948 | 0.0016 | 0.8648 |
- | 0.3664 | 0.5632 | 1550 | 0.3400 | 0.0016 | 0.8796 |
- | 0.4941 | 0.5814 | 1600 | 0.3531 | 0.0016 | 0.8765 |
- | 0.4185 | 0.5996 | 1650 | 0.3481 | 0.0016 | 0.8820 |
- | 0.4506 | 0.6177 | 1700 | 0.3332 | 0.0016 | 0.8866 |
- | 0.4015 | 0.6359 | 1750 | 0.3468 | 0.0016 | 0.8768 |
- | 0.3919 | 0.6541 | 1800 | 0.3421 | 0.0016 | 0.8897 |
- | 0.4281 | 0.6722 | 1850 | 0.3141 | 0.0016 | 0.8937 |
- | 0.3659 | 0.6904 | 1900 | 0.3424 | 0.0016 | 0.8823 |
- | 0.345 | 0.7086 | 1950 | 0.3172 | 0.0016 | 0.8912 |
- | 0.3157 | 0.7267 | 2000 | 0.3226 | 0.0016 | 0.8903 |
- | 0.3456 | 0.7449 | 2050 | 0.3178 | 0.0016 | 0.8909 |
- | 0.3643 | 0.7631 | 2100 | 0.2988 | 0.0016 | 0.8983 |
- | 0.4043 | 0.7812 | 2150 | 0.3036 | 0.0016 | 0.8992 |
- | 0.3486 | 0.7994 | 2200 | 0.2974 | 0.0016 | 0.9053 |
- | 0.3735 | 0.8176 | 2250 | 0.3026 | 0.0016 | 0.8964 |
- | 0.4032 | 0.8358 | 2300 | 0.2990 | 0.0016 | 0.9019 |
- | 0.3825 | 0.8539 | 2350 | 0.2938 | 0.0016 | 0.9062 |
- | 0.345 | 0.8721 | 2400 | 0.2871 | 0.0016 | 0.9059 |
- | 0.3528 | 0.8903 | 2450 | 0.2777 | 0.0016 | 0.9093 |
- | 0.3207 | 0.9084 | 2500 | 0.2764 | 0.0016 | 0.9111 |
- | 0.2664 | 0.9266 | 2550 | 0.2741 | 0.0016 | 0.9099 |
- | 0.3496 | 0.9448 | 2600 | 0.2720 | 0.0016 | 0.9151 |
- | 0.3274 | 0.9629 | 2650 | 0.2724 | 0.0016 | 0.9136 |
- | 0.3014 | 0.9811 | 2700 | 0.2659 | 0.0016 | 0.9136 |
- | 0.3235 | 0.9993 | 2750 | 0.2637 | 0.0016 | 0.9163 |
-
-
- ### Framework versions
-
- - PEFT 0.15.2
- - Transformers 4.52.4
- - Pytorch 2.7.1
- - Datasets 3.6.0
- - Tokenizers 0.21.1
 
  ---
  license: apache-2.0
  pipeline_tag: image-classification
+ library_name: transformers
  tags:
  - dinov2
  - image-classification
  - fonts
+ - lora
+ - vision-transformer
+ datasets:
+ - dchen0/font_crops_v5
+ base_model: facebook/dinov2-base-imagenet1k-1-layer
  ---

+ # Font Classifier
+
+ A DINOv2 Vision Transformer fine-tuned with LoRA for font classification across 394 font variants from 32 Google Fonts families.
+
+ ## How it was made
+
+ 1. **Base model**: [facebook/dinov2-base-imagenet1k-1-layer](https://huggingface.co/facebook/dinov2-base-imagenet1k-1-layer) (87.2M parameters, frozen).
+ 2. **Fine-tuning**: [LoRA](https://arxiv.org/abs/2106.09685) (rank 8, alpha 16) applied to the query and value projections in each ViT attention block, plus a trainable classification head, for ~900K trainable parameters (about 1% of the total).
+ 3. **Merge**: After training, the LoRA adapter weights were merged into the base model with `merge_and_unload()`, producing this standalone checkpoint. Neither the adapter files nor the PEFT library is needed at inference time.
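Step 3 works because a LoRA update is an additive low-rank term. For a frozen weight `W`, the adapter contributes `(alpha / r) · B · A`. Folding that into `W` once therefore gives the same outputs as running the adapter path separately. The pure-Python sketch below checks that identity on tiny hand-written matrices. The `alpha / r` scaling follows PEFT's default (non-rsLoRA) convention. The shapes, rank, and values are illustrative assumptions, not weights from this checkpoint.

```python
# Minimal pure-Python illustration of merging a LoRA update into a frozen weight.
# Shapes and values are made up; real DINOv2-base projections are 768x768.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def add(a, b, scale=1.0):
    """Return a + scale * b for equally shaped matrices."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

r, alpha = 2, 4          # hypothetical toy values; this model uses r=8, alpha=16
scaling = alpha / r      # PEFT's default LoRA scaling (without rsLoRA)

W = [[1.0, 0.0, 2.0],    # frozen base weight, shape (out=3, in=3)
     [0.5, 1.0, 0.0],
     [0.0, -1.0, 1.0]]
A = [[0.1, 0.2, 0.0],    # LoRA down-projection, shape (r, in)
     [0.0, 0.3, -0.1]]
B = [[1.0, 0.0],         # LoRA up-projection, shape (out, r)
     [0.0, 2.0],
     [1.0, 1.0]]

x = [1.0, 2.0, 3.0]

# Unmerged forward pass: base output plus the scaled low-rank path.
y_adapter = [base + scaling * delta
             for base, delta in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Merged forward pass: fold scaling * B @ A into W once, then use a plain linear map.
W_merged = add(W, matmul(B, A), scale=scaling)
y_merged = matvec(W_merged, x)

print(y_adapter)
print(y_merged)
```

The two printed vectors agree up to floating-point rounding. A merged checkpoint can therefore be loaded as an ordinary `Dinov2ForImageClassification` model.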
+
+ ## Performance
+
+ - **99.0% top-1 accuracy** on 394 font classes (held-out test set)
+ - **99.8% family-level accuracy** (collapsing weight variants into their parent families)
+ - Errors are overwhelmingly within-family weight confusions (e.g. Roboto-400 vs. Roboto-500) rather than cross-family misidentifications
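Family-level accuracy maps each weight variant to its parent family before comparing predictions with labels. The sketch below assumes labels follow the `Roboto-400` pattern used above: a family name, a hyphen, then the weight. The actual strings in `config.id2label` may differ, and `family_of` and `accuracies` are hypothetical helpers, not part of this repository.

```python
def family_of(label: str) -> str:
    """Map 'Roboto-400' -> 'Roboto' (assumes a trailing '-<weight>' suffix)."""
    family, sep, _weight = label.rpartition("-")
    return family if sep else label


def accuracies(predictions: list[str], targets: list[str]) -> tuple[float, float]:
    """Return (top-1 accuracy, family-level accuracy) over paired label lists."""
    if not targets or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equally long")
    pairs = list(zip(predictions, targets))
    exact = sum(p == t for p, t in pairs)
    family = sum(family_of(p) == family_of(t) for p, t in pairs)
    return exact / len(pairs), family / len(pairs)


# Toy example: one within-family weight confusion and one cross-family error.
preds = ["Roboto-500", "Inter-400", "Lora-700", "Sora-400"]
targets = ["Roboto-400", "Inter-400", "Lora-700", "Poppins-400"]
top1, fam = accuracies(preds, targets)
print(f"top-1: {top1:.2f}, family: {fam:.2f}")  # top-1: 0.50, family: 0.75
```

A within-family mistake such as Roboto-500 for Roboto-400 lowers top-1 accuracy but not family-level accuracy. This is why the family-level figure is higher.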
+
+ | Method | Trainable Params | Top-1 Accuracy |
+ |---|---|---|
+ | **LoRA r=8 (this model)** | **900K** | **99.0%** |
+ | ResNet-50 | 25.6M | 98.8% |
+ | LoRA r=16 | 1.2M | 98.9% |
+ | LoRA r=4 | 753K | 97.9% |
+ | Full Fine-Tuning | 87.2M | 95.9% |
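The LoRA rows of the trainable-parameter column can be reproduced from the architecture under a few assumptions:

- DINOv2-base has a hidden size of 768 and 12 layers.
- LoRA adds `r × (in + out)` weights to each adapted linear layer.
- The Transformers `Dinov2ForImageClassification` head maps the concatenated CLS and mean-patch embeddings (2 × 768 features) to the 394 classes.

These assumptions describe the setup; the resulting counts are not read from the training logs. Under them, the counts land on the table's 753K, 900K and 1.2M.

```python
HIDDEN = 768        # DINOv2-base hidden size
LAYERS = 12         # transformer blocks in DINOv2-base
NUM_CLASSES = 394   # font variants


def lora_trainable_params(rank: int) -> int:
    """Count LoRA weights on query/value plus the classification head."""
    # Each adapted Linear(768 -> 768) gets A (rank x 768) and B (768 x rank).
    per_module = rank * (HIDDEN + HIDDEN)
    lora = LAYERS * 2 * per_module                     # query and value in every block
    head = (2 * HIDDEN) * NUM_CLASSES + NUM_CLASSES    # classifier weight + bias
    return lora + head


for r in (4, 8, 16):
    print(f"r={r:>2}: {lora_trainable_params(r):,}")
# r= 4: 753,034
# r= 8: 900,490
# r=16: 1,195,402
```

The classifier head alone accounts for about 606K of these parameters. Doubling the LoRA rank therefore increases the trainable total by much less than 2×.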
+
+ ## Training data
+
+ [dchen0/font_crops_v5](https://huggingface.co/datasets/dchen0/font_crops_v5): ~225K synthetic images generated by rendering random text in each font variant, with ~575 training images and 40 test images per class. Images are augmented with color variation, layout variation (left/center/right alignment, multi-line text), and Gaussian noise.
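Multiplying the per-class figures by the 394 classes gives a rough sense of the split sizes. This is a back-of-the-envelope check using the approximate counts quoted above, not numbers read from the dataset itself.

```python
NUM_CLASSES = 394
TRAIN_PER_CLASS = 575   # approximate, as quoted in the card
TEST_PER_CLASS = 40

train_images = NUM_CLASSES * TRAIN_PER_CLASS
test_images = NUM_CLASSES * TEST_PER_CLASS
print(train_images, test_images, train_images + test_images)
# 226550 15760 242310
```

On these numbers, the ~225K figure is closest to the training split alone. One percentage point of test accuracy corresponds to about 158 test images.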
+
+ ### Font families (32)
+
+ BigShouldersText, BricolageGrotesque, CrimsonPro, DMSans, Geist, HedvigLettersSerif, InstrumentSans, InstrumentSerif, Inter, JetBrainsMono, LexendDeca, Lora, Merriweather, Montserrat, Newsreader, NunitoSans, Onest, OpenSans, Petrona, PlayfairDisplay, PlusJakartaSans, Poppins, PT Serif Caption, RethinkSans, Roboto, RobotoSerif, ShipporiMincho, Sora, SpaceGrotesk, Ultra, Urbanist, WorkSans
+
+ ## Training details
+
+ | Hyperparameter | Value |
+ |---|---|
+ | Optimizer | AdamW |
+ | Learning rate | 1e-4 |
+ | Batch size | 64 |
+ | Epochs | 100 |
+ | LR scheduler | Linear decay |
+ | Precision | FP16 |
+ | LoRA rank | 8 |
+ | LoRA alpha | 16 |
+ | LoRA dropout | 0.1 |
+ | LoRA targets | query, value |
+ | GPU | NVIDIA RTX 3090 (24 GB) |
+ | Training time | ~33 hours |
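The hyperparameters in the table map roughly onto PEFT and Transformers configuration objects as follows. This is a sketch, not the project's training script; the linked training code is authoritative. A few details are assumptions:

- `modules_to_save=["classifier"]` is assumed to be how the trainable head was kept.
- `output_dir` is a placeholder.
- FP16 is enabled only when a CUDA GPU is available, so the snippet also constructs on CPU-only machines.

```python
import torch
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=8,                                  # LoRA rank
    lora_alpha=16,                        # scaling = lora_alpha / r = 2
    lora_dropout=0.1,
    target_modules=["query", "value"],    # attention projections in each DINOv2 block
    modules_to_save=["classifier"],       # assumption: head is trained in full
)

training_args = TrainingArguments(
    output_dir="font-classifier-lora",    # placeholder path
    learning_rate=1e-4,
    per_device_train_batch_size=64,
    num_train_epochs=100,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    fp16=torch.cuda.is_available(),       # the reported run used FP16 on an RTX 3090
)
```

These objects would typically be combined as follows:

1. Wrap the base model with `peft.get_peft_model(model, lora_config)`.
2. Train with `transformers.Trainer`.
3. Call `merge_and_unload()` on the resulting PEFT model to produce a standalone checkpoint like this one.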
+
+ ## Preprocessing
+
+ Preprocessing is implemented in `handler.py`, and inference inputs must go through the same steps:
+
+ 1. Convert to RGB
+ 2. Pad to square (black fill, centered)
+ 3. Resize to 224×224
+ 4. Normalize with ImageNet stats (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
+
+ ## Usage
+
+ ```python
+ from transformers import Dinov2ForImageClassification, AutoImageProcessor
+ from handler import get_inference_transform  # assumes this repo's handler.py is importable
+ from PIL import Image
+ import torch
+
+ # Load the merged checkpoint; no PEFT adapter is required.
+ model = Dinov2ForImageClassification.from_pretrained("dchen0/font-classifier")
+ processor = AutoImageProcessor.from_pretrained("dchen0/font-classifier")
+ model.eval()
+
+ # Apply the same preprocessing that handler.py uses at inference time.
+ transform = get_inference_transform(processor, processor.size["shortest_edge"])
+ image = Image.open("font_sample.png").convert("RGB")
+ pixel_values = transform(image).unsqueeze(0)  # add a batch dimension
+
+ with torch.no_grad():
+     logits = model(pixel_values=pixel_values).logits
+
+ predicted_class = logits.argmax(-1).item()
+ print(model.config.id2label[predicted_class])
+ ```
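Most errors are confusions between weights of the same family, so the runner-up classes are often more informative than the argmax alone. The pure-Python sketch below returns the top k labels with their probabilities. It assumes `logits` has been converted to a plain list, for example with `logits[0].tolist()`. The small `id2label` dictionary is a hypothetical stand-in for `model.config.id2label`. With PyTorch tensors, the equivalent is `logits.softmax(-1).topk(k)`.

```python
import math


def top_k(logits: list[float], id2label: dict[int, str], k: int = 3) -> list[tuple[str, float]]:
    """Return the k most probable labels with their softmax probabilities."""
    peak = max(logits)                                   # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical 4-class example; the real model has 394 classes.
id2label = {0: "Roboto-400", 1: "Roboto-500", 2: "Inter-400", 3: "Lora-700"}
for label, p in top_k([2.0, 1.5, -1.0, 0.0], id2label, k=2):
    print(f"{label}: {p:.3f}")
```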
+
+ ## Source
+
+ - Training code: [github.com/Create-Inc/font-model](https://github.com/Create-Inc/font-model)
+ - Results repo (checkpoints, logs): [dchen0/font-model-results](https://huggingface.co/dchen0/font-model-results)
+ - Dataset: [dchen0/font_crops_v5](https://huggingface.co/datasets/dchen0/font_crops_v5)