---
license: mit
language:
- en
tags:
- scene-text-recognition
- ocr
- vision-transformer
- mae
- image-to-text
- pytorch
library_name: pytorch
---

# STR-Lite

STR-Lite is an ultra-lightweight scene text recognition model that combines **Masked Autoencoder (MAE) pretraining** with an **autoregressive decoder** for text generation. With only **~6M parameters**, it reaches a weighted average accuracy of 93.82% on six common STR benchmarks while staying small enough for practical deployment.

- **GitHub:** [balaboom123/STR-Lite](https://github.com/balaboom123/STR-Lite)
- **Author:** Kuanwei Chen
- **License:** MIT

## Model Architecture

| Component | Details |
| --------- | ------- |
| Backbone | ViT-Tiny (embed=192, depth=12, heads=12) |
| Decoder | 1-layer autoregressive transformer (embed=192, heads=12) |
| Input size | 32 × 128 (H × W) |
| Patch size | 4 × 8 (H × W) |
| Parameters | ~6M |
| Precision | bfloat16 |

## Training

**Stage 1: MAE Pretraining**

- Dataset: U14M-Unlabeled
- Epochs: 40

**Stage 2: Fine-tuning**

- Dataset: U14M-L-Filtered
- Epochs: 20, Batch size: 256, LR: 1e-3, Weight decay: 0.01

## Checkpoints

The accuracy column reports the weighted average over the common STR benchmarks listed under Results.

| Model | Description | Epochs | Accuracy | Download |
| ----- | ----------- | :----: | :------: | :------: |
| MAE ViT-Tiny | Pretrained encoder only | 40 | - | [pretrain/checkpoint-last.pth](https://huggingface.co/balaboom123/STRLite/resolve/main/pretrain/checkpoint-last.pth) |
| STRLite | Full fine-tuned model | 20 | 93.82% | [finetune/checkpoint-best.pth](https://huggingface.co/balaboom123/STRLite/resolve/main/finetune/checkpoint-best.pth) |

## Results

All values are accuracy in percent, with and without MAE pretraining.

**Common STR Benchmarks**

| Subset | w/ pretrain | w/o pretrain |
| ------ | :---------: | :----------: |
| CUTE80 | 95.83 | 94.79 |
| IC13 | 96.85 | 96.50 |
| IC15 | 86.80 | 86.25 |
| IIIT5k | 96.97 | 96.47 |
| SVT | 95.36 | 94.90 |
| SVTP | 92.40 | 89.77 |
| **Weighted avg.** | **93.82** | **93.12** |

**U14M Benchmarks**

| Subset | w/ pretrain | w/o pretrain |
| --------------- | :---------: | :----------: |
| artistic | 67.78 | 62.11 |
| contextless | 78.95 | 77.43 |
| curve | 82.19 | 78.97 |
| general | 81.07 | 79.96 |
| multi oriented | 82.91 | 78.57 |
| multi words | 76.72 | 74.31 |
| salient | 78.17 | 75.33 |
| **Weighted avg.** | **81.03** | **79.88** |

## Usage

**Download and evaluate:**

```bash
git clone https://github.com/balaboom123/STR-Lite
cd STR-Lite

# Download the fine-tuned checkpoint (requires the huggingface_hub Python package)
path=$(python -c 'from huggingface_hub import hf_hub_download; print(hf_hub_download("balaboom123/STRLite", "finetune/checkpoint-best.pth"))')

# Evaluate
python eval.py \
  resume="$path" \
  test_data_path='[/path/to/lmdb_test]'
```

**Fine-tune from MAE pretrained weights:**

```bash
# Download the MAE-pretrained encoder (requires the huggingface_hub Python package)
path=$(python -c 'from huggingface_hub import hf_hub_download; print(hf_hub_download("balaboom123/STRLite", "pretrain/checkpoint-last.pth"))')

python main_finetune.py \
  train_data_path='[/path/to/lmdb_train]' \
  val_data_path='[/path/to/lmdb_val]' \
  pretrained_mae="$path"
```

See the [GitHub repo](https://github.com/balaboom123/STR-Lite) for full installation and dataset preparation instructions.
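
**Sanity-check the patch grid:**

When preparing inputs, it can help to confirm what the numbers in the Model Architecture table imply. The snippet below is a minimal sketch that derives the patch grid and per-head dimension from the stated input size, patch size, embedding width, and head count. The helper name `patch_grid` is hypothetical and not part of the STR-Lite repository, and the count excludes any extra tokens the actual model may prepend.

```python
# Values copied from the Model Architecture table.
IMG_H, IMG_W = 32, 128      # input size (H x W)
PATCH_H, PATCH_W = 4, 8     # patch size (H x W)
EMBED_DIM, NUM_HEADS = 192, 12


def patch_grid(img_h, img_w, patch_h, patch_w):
    """Return (rows, cols) of non-overlapping patches; hypothetical helper."""
    if img_h % patch_h or img_w % patch_w:
        raise ValueError("image size must be divisible by patch size")
    return img_h // patch_h, img_w // patch_w


rows, cols = patch_grid(IMG_H, IMG_W, PATCH_H, PATCH_W)
num_patches = rows * cols            # image tokens fed to the encoder
head_dim = EMBED_DIM // NUM_HEADS    # channels per attention head

print(f"patch grid: {rows} x {cols} = {num_patches} patches")
print(f"per-head dimension: {head_dim}")
```

Under these values each 32 × 128 image becomes an 8 × 16 grid of 128 patches, and each of the 12 heads works on 16 channels.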