	---
	license: mit
	language:
	- en
	tags:
	- scene-text-recognition
	- ocr
	- vision-transformer
	- mae
	- image-to-text
	- pytorch
	library_name: pytorch
	---

	# STR-Lite

	STR-Lite is a lightweight scene text recognition model that pairs a ViT-Tiny encoder, pretrained as a Masked Autoencoder (MAE), with an autoregressive decoder for text generation. With roughly 6M parameters, it reaches 93.82% weighted average accuracy on six common STR benchmarks, which makes it small enough for resource-constrained deployment.

	- GitHub: [balaboom123/STR-Lite](https://github.com/balaboom123/STR-Lite)
	- Author: Kuanwei Chen
	- License: MIT

	## Model Architecture

	| Component | Details |
	| --------- | ------- |
	| Backbone | ViT-Tiny (embed=192, depth=12, heads=12) |
	| Decoder | 1-layer autoregressive transformer (embed=192, heads=12) |
	| Input size | 32 × 128 (H × W) |
	| Patch size | 4 × 8 (H × W) |
	| Parameters | ~6M |
	| Precision | bfloat16 |
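As a quick sanity check on the table, the sketch below derives the patch grid and per-head width implied by the listed sizes. The class and property names are hypothetical and are not taken from the STR-Lite code; only the numbers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchGrid:
    """Sizes copied from the architecture table; the names are illustrative."""
    img_h: int = 32
    img_w: int = 128
    patch_h: int = 4
    patch_w: int = 8
    embed_dim: int = 192
    num_heads: int = 12

    @property
    def grid(self) -> tuple[int, int]:
        # The image must tile exactly into patches along both axes.
        if self.img_h % self.patch_h or self.img_w % self.patch_w:
            raise ValueError("image size is not divisible by patch size")
        return self.img_h // self.patch_h, self.img_w // self.patch_w

    @property
    def num_patches(self) -> int:
        rows, cols = self.grid
        return rows * cols

    @property
    def head_dim(self) -> int:
        # Width of each attention head when the embedding is split evenly.
        return self.embed_dim // self.num_heads


cfg = PatchGrid()
print(cfg.grid, cfg.num_patches, cfg.head_dim)  # (8, 16) 128 16
```

Under these sizes the encoder sees 128 patch tokens per image (an 8 × 16 grid), and each of the 12 heads works on 16 channels.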

	## Training

	Stage 1 — MAE Pretraining
	- Dataset: U14M-Unlabeled
	- Epochs: 40

	Stage 2 — Fine-tuning
	- Dataset: U14M-L-Filtered
	- Epochs: 20, Batch: 256, LR: 1e-3, Weight decay: 0.01

	## Checkpoints

	| Model | Description | Epochs | Acc | Download |
	| ----- | ----------- | :----: | :-: | :------: |
	| MAE ViT-Tiny | Pretrained encoder only | 40 | n/a | [pretrain/checkpoint-last.pth](https://huggingface.co/balaboom123/STRLite/resolve/main/pretrain/checkpoint-last.pth) |
	| STRLite | Full fine-tuned model | 20 | 93.82% | [finetune/checkpoint-best.pth](https://huggingface.co/balaboom123/STRLite/resolve/main/finetune/checkpoint-best.pth) |

	## Results

	Common STR Benchmarks

	| Subset | w/ pretrain | w/o pretrain |
	| ------ | :---------: | :----------: |
	| CUTE80 | 95.83 | 94.79 |
	| IC13 | 96.85 | 96.50 |
	| IC15 | 86.80 | 86.25 |
	| IIIT5k | 96.97 | 96.47 |
	| SVT | 95.36 | 94.90 |
	| SVTP | 92.40 | 89.77 |
	| Weighted avg. | 93.82 | 93.12 |

	U14M Benchmarks

	| Subset | w/ pretrain | w/o pretrain |
	| --------------- | :---------: | :----------: |
	| artistic | 67.78 | 62.11 |
	| contextless | 78.95 | 77.43 |
	| curve | 82.19 | 78.97 |
	| general | 81.07 | 79.96 |
	| multi oriented | 82.91 | 78.57 |
	| multi words | 76.72 | 74.31 |
	| salient | 78.17 | 75.33 |
	| Weighted avg. | 81.03 | 79.88 |
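To make the effect of MAE pretraining easier to read off, the following sketch computes the per-subset accuracy gain (with pretraining minus without) from the two tables above. The helper name `pretraining_gain` is made up for illustration. The weighted averages are not recomputed, since the per-subset sample counts are not listed here.

```python
# Accuracies copied from the tables above: (with pretrain, without pretrain).
COMMON = {
    "CUTE80": (95.83, 94.79),
    "IC13": (96.85, 96.50),
    "IC15": (86.80, 86.25),
    "IIIT5k": (96.97, 96.47),
    "SVT": (95.36, 94.90),
    "SVTP": (92.40, 89.77),
}
U14M = {
    "artistic": (67.78, 62.11),
    "contextless": (78.95, 77.43),
    "curve": (82.19, 78.97),
    "general": (81.07, 79.96),
    "multi oriented": (82.91, 78.57),
    "multi words": (76.72, 74.31),
    "salient": (78.17, 75.33),
}


def pretraining_gain(table: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Accuracy gain in percentage points from MAE pretraining, per subset."""
    return {name: round(w - wo, 2) for name, (w, wo) in table.items()}


for title, table in (("Common", COMMON), ("U14M", U14M)):
    gains = pretraining_gain(table)
    best = max(gains, key=gains.get)  # subset that benefits most
    print(f"{title}: largest gain on {best} (+{gains[best]:.2f})")
```

On these numbers pretraining helps every subset, with the largest gains on SVTP (+2.63) among the common benchmarks and on artistic (+5.67) in U14M.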

	## Usage

	Download and evaluate:

	```bash
	git clone https://github.com/balaboom123/STR-Lite
	cd STR-Lite

	# Download the fine-tuned checkpoint (requires the huggingface_hub Python package)
	path=$(python -c 'from huggingface_hub import hf_hub_download; print(hf_hub_download("balaboom123/STRLite", "finetune/checkpoint-best.pth"))')

	# Evaluate
	python eval.py \
	  resume="$path" \
	  test_data_path='[/path/to/lmdb_test]'
	```

	Fine-tune from MAE pretrained weights:

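	```bash
	# Download the MAE-pretrained encoder checkpoint (requires the huggingface_hub Python package)
	path=$(python -c 'from huggingface_hub import hf_hub_download; print(hf_hub_download("balaboom123/STRLite", "pretrain/checkpoint-last.pth"))')

	python main_finetune.py \
	  train_data_path='[/path/to/lmdb_train]' \
	  val_data_path='[/path/to/lmdb_val]' \
	  pretrained_mae="$path"
	```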
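The same two workflows can also be driven from Python. This is a minimal sketch, not part of the STR-Lite repo: `build_command`, `fetch_checkpoint`, `evaluate` and `finetune` are hypothetical helpers. It assumes `huggingface_hub` is installed and that `eval.py` and `main_finetune.py` accept the `key=value` overrides shown in the shell commands above.

```python
import subprocess
import sys

REPO_ID = "balaboom123/STRLite"


def build_command(script: str, **overrides: str) -> list[str]:
    """Build `python <script> key=value ...` as an argument list (no shell quoting needed)."""
    return [sys.executable, script] + [f"{key}={value}" for key, value in overrides.items()]


def fetch_checkpoint(filename: str) -> str:
    """Download a checkpoint from the Hub and return its local cache path."""
    # Imported lazily so build_command works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=REPO_ID, filename=filename)


def evaluate(test_lmdb: str, repo_dir: str = ".") -> None:
    """Run eval.py on the fine-tuned checkpoint; repo_dir is the STR-Lite clone."""
    ckpt = fetch_checkpoint("finetune/checkpoint-best.pth")
    cmd = build_command("eval.py", resume=ckpt, test_data_path=f"[{test_lmdb}]")
    subprocess.run(cmd, cwd=repo_dir, check=True)


def finetune(train_lmdb: str, val_lmdb: str, repo_dir: str = ".") -> None:
    """Run main_finetune.py starting from the MAE-pretrained encoder."""
    ckpt = fetch_checkpoint("pretrain/checkpoint-last.pth")
    cmd = build_command(
        "main_finetune.py",
        train_data_path=f"[{train_lmdb}]",
        val_data_path=f"[{val_lmdb}]",
        pretrained_mae=ckpt,
    )
    subprocess.run(cmd, cwd=repo_dir, check=True)
```

For example, `evaluate("/path/to/lmdb_test", repo_dir="STR-Lite")` would download the fine-tuned checkpoint and run the evaluation inside the cloned repository.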

	See the [GitHub repo](https://github.com/balaboom123/STR-Lite) for full installation and dataset preparation instructions.