---
license: mit
language:
- en
tags:
- scene-text-recognition
- ocr
- vision-transformer
- mae
- image-to-text
- pytorch
library_name: pytorch
---

# STR-Lite

STR-Lite is a lightweight scene text recognition (STR) model that combines **Masked Autoencoder (MAE) pretraining** with an **autoregressive decoder** for text generation. With only about **6M parameters**, it reaches **93.82%** weighted-average accuracy on six common STR benchmarks (see Results).

- **GitHub:** [balaboom123/STR-Lite](https://github.com/balaboom123/STR-Lite)
- **Author:** Kuanwei Chen
- **License:** MIT

## Model Architecture

| Component | Details |
| --------- | ------- |
| Backbone | ViT-Tiny (embed=192, depth=12, heads=12) |
| Decoder | 1-layer autoregressive transformer (embed=192, heads=12) |
| Input size | 32 × 128 (H × W) |
| Patch size | 4 × 8 (H × W) |
| Parameters | ~6M |
| Precision | bfloat16 |
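
As a quick sanity check on the architecture table, the following sketch computes the patch grid and encoder sequence length implied by the input and patch sizes. It assumes non-overlapping patches and ignores any extra tokens (such as a class token), neither of which this card specifies; the variable names are illustrative only.

```python
# Patch-grid arithmetic for the sizes listed in the architecture table.
# Assumption: patches do not overlap, so each dimension must divide evenly.
input_h, input_w = 32, 128      # input size (H x W)
patch_h, patch_w = 4, 8         # patch size (H x W)
embed_dim, num_heads = 192, 12  # backbone embedding width and attention heads

assert input_h % patch_h == 0 and input_w % patch_w == 0

grid_h = input_h // patch_h           # patches along the height
grid_w = input_w // patch_w           # patches along the width
num_patches = grid_h * grid_w         # patch tokens fed to the encoder
pixels_per_patch = patch_h * patch_w  # pixels per channel in one patch
head_dim = embed_dim // num_heads     # width of each attention head

print(f"grid: {grid_h} x {grid_w} -> {num_patches} patches")
print(f"pixels per patch (per channel): {pixels_per_patch}")
print(f"attention head dim: {head_dim}")
```

With these values the encoder sees a short sequence of 128 patch tokens, which is part of what keeps the model small and fast.
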

## Training

**Stage 1: MAE Pretraining**
- Dataset: U14M-Unlabeled
- Epochs: 40
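
For readers unfamiliar with MAE pretraining, the core idea is to hide a random subset of image patches and train the model to reconstruct them from the visible ones. The sketch below illustrates only the masking step, in plain Python. The mask ratio of 0.75 is the default from the original MAE work and is an assumption here, since this card does not state the ratio used for STR-Lite; `random_mask` is a hypothetical helper, not a function from the repository.

```python
import random


def random_mask(num_patches, mask_ratio, seed=0):
    """Split patch indices into visible and masked sets, MAE-style."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)                       # random permutation of patches
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(indices[:num_keep])       # patches the encoder sees
    masked = sorted(indices[num_keep:])        # patches to be reconstructed
    return visible, masked


# 32 x 128 input with 4 x 8 patches gives an 8 x 16 grid of patches.
NUM_PATCHES = (32 // 4) * (128 // 8)
MASK_RATIO = 0.75  # ASSUMPTION: not specified in this model card

visible, masked = random_mask(NUM_PATCHES, MASK_RATIO)
print(f"visible patches: {len(visible)}, masked patches: {len(masked)}")
```

Under this assumed ratio the encoder would process only a quarter of the patches during pretraining.
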

**Stage 2: Fine-tuning**
- Dataset: U14M-L-Filtered
- Epochs: 20, batch size: 256, learning rate: 1e-3, weight decay: 0.01

## Checkpoints

| Model | Description | Epochs | Acc. (%, common benchmarks, weighted avg.) | Download |
| ----- | ----------- | :----: | :-: | :------: |
| MAE ViT-Tiny | Pretrained encoder only | 40 | n/a | [pretrain/checkpoint-last.pth](https://huggingface.co/balaboom123/STRLite/resolve/main/pretrain/checkpoint-last.pth) |
| STRLite | Full fine-tuned model | 20 | 93.82 | [finetune/checkpoint-best.pth](https://huggingface.co/balaboom123/STRLite/resolve/main/finetune/checkpoint-best.pth) |

## Results

All values are accuracy (%), reported with and without MAE pretraining.

**Common STR Benchmarks**

| Subset | w/ pretrain | w/o pretrain |
| ------ | :---------: | :----------: |
| CUTE80 | 95.83 | 94.79 |
| IC13 | 96.85 | 96.50 |
| IC15 | 86.80 | 86.25 |
| IIIT5k | 96.97 | 96.47 |
| SVT | 95.36 | 94.90 |
| SVTP | 92.40 | 89.77 |
| **Weighted avg.** | **93.82** | **93.12** |
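
The weighted averages for the common benchmarks are consistent with weighting each subset by its number of test images. The sketch below recomputes them under that assumption. The subset sizes are not stated in this card: they are sizes commonly used for these benchmarks in the STR literature and are an assumption here, as is the helper name `weighted_avg`.

```python
# Reported accuracies (%) for the common benchmarks: (w/ pretrain, w/o pretrain).
reported = {
    "CUTE80": (95.83, 94.79),
    "IC13":   (96.85, 96.50),
    "IC15":   (86.80, 86.25),
    "IIIT5k": (96.97, 96.47),
    "SVT":    (95.36, 94.90),
    "SVTP":   (92.40, 89.77),
}

# ASSUMPTION: number of test images per subset. These values are not given
# in the model card; they are commonly used benchmark sizes.
assumed_sizes = {
    "CUTE80": 288,
    "IC13": 857,
    "IC15": 1811,
    "IIIT5k": 3000,
    "SVT": 647,
    "SVTP": 645,
}


def weighted_avg(column):
    """Average one accuracy column, weighting each subset by its size."""
    total = sum(assumed_sizes.values())
    weighted_sum = sum(reported[name][column] * n for name, n in assumed_sizes.items())
    return weighted_sum / total


with_pretrain = weighted_avg(0)
without_pretrain = weighted_avg(1)
print(f"w/ pretrain:  {with_pretrain:.2f}")
print(f"w/o pretrain: {without_pretrain:.2f}")
```

Under these assumed sizes, both results round to the reported 93.82 and 93.12.
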

**U14M Benchmarks**

| Subset | w/ pretrain | w/o pretrain |
| ------ | :---------: | :----------: |
| artistic | 67.78 | 62.11 |
| contextless | 78.95 | 77.43 |
| curve | 82.19 | 78.97 |
| general | 81.07 | 79.96 |
| multi oriented | 82.91 | 78.57 |
| multi words | 76.72 | 74.31 |
| salient | 78.17 | 75.33 |
| **Weighted avg.** | **81.03** | **79.88** |

## Usage

**Download and evaluate:**

```bash
git clone https://github.com/balaboom123/STR-Lite
cd STR-Lite

# Download the fine-tuned checkpoint (requires `pip install huggingface_hub`)
# and capture its local path in a shell variable
path=$(python -c "from huggingface_hub import hf_hub_download; print(hf_hub_download('balaboom123/STRLite', 'finetune/checkpoint-best.pth'))")

# Evaluate
python eval.py \
  resume="$path" \
  test_data_path='[/path/to/lmdb_test]'
```

**Fine-tune from MAE pretrained weights:**

```bash
# Download the MAE-pretrained encoder checkpoint and capture its local path
path=$(python -c "from huggingface_hub import hf_hub_download; print(hf_hub_download('balaboom123/STRLite', 'pretrain/checkpoint-last.pth'))")

python main_finetune.py \
  train_data_path='[/path/to/lmdb_train]' \
  val_data_path='[/path/to/lmdb_val]' \
  pretrained_mae="$path"
```

See the [GitHub repo](https://github.com/balaboom123/STR-Lite) for full installation and dataset preparation instructions.