---
license: mit
language:
- en
tags:
- scene-text-recognition
- ocr
- vision-transformer
- mae
- image-to-text
- pytorch
library_name: pytorch
---

# STR-Lite

STR-Lite is a lightweight scene text recognition model that combines **Masked Autoencoder (MAE) pretraining** with an **autoregressive decoder** for text generation. With only **~6M parameters**, it reaches **93.82%** weighted-average accuracy on six common STR benchmarks while staying small enough for efficient real-world deployment.

- **GitHub:** [balaboom123/STR-Lite](https://github.com/balaboom123/STR-Lite)
- **Author:** Kuanwei Chen
- **License:** MIT

## Model Architecture

| Component | Details |
| --------- | ------- |
| Backbone | ViT-Tiny (embed=192, depth=12, heads=12) |
| Decoder | 1-layer autoregressive transformer (embed=192, heads=12) |
| Input size | 32 × 128 (H × W) |
| Patch size | 4 × 8 |
| Parameters | ~6M |
| Precision | bfloat16 |
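
The figures in the table determine the token grid and allow a rough sanity check of the parameter count. The sketch below is back-of-the-envelope arithmetic, not code from the repository. It assumes an MLP ratio of 4 and a decoder layer made of self-attention, cross-attention, and an MLP (neither is stated in this card), and it ignores biases, norms, and embeddings.

```python
# Values taken from the architecture table.
IMG_H, IMG_W = 32, 128
PATCH_H, PATCH_W = 4, 8
EMBED, DEPTH, HEADS = 192, 12, 12

# Assumption (not stated in the card): standard transformer MLP ratio of 4.
MLP_RATIO = 4

# Each non-overlapping patch becomes one encoder token.
grid_h, grid_w = IMG_H // PATCH_H, IMG_W // PATCH_W
num_patches = grid_h * grid_w
head_dim = EMBED // HEADS

# Weight matrices only; biases, layer norms, and embeddings are ignored.
attn_weights = 4 * EMBED * EMBED             # Q, K, V, and output projections
mlp_weights = 2 * MLP_RATIO * EMBED * EMBED  # two linear layers
encoder_weights = DEPTH * (attn_weights + mlp_weights)
# Assumption: the single decoder layer has self-attention, cross-attention, and an MLP.
decoder_weights = 2 * attn_weights + mlp_weights
total_weights = encoder_weights + decoder_weights

print(f"patch grid: {grid_h} x {grid_w} = {num_patches} tokens")  # 8 x 16 = 128
print(f"head dim: {head_dim}")                                    # 16
print(f"approx. weights: {total_weights / 1e6:.1f}M")             # 5.9M
```

Under these assumptions the estimate lands close to the ~6M parameters listed in the table.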

## Training

**Stage 1 — MAE Pretraining**
- Dataset: U14M-Unlabeled
- Epochs: 40
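
For readers unfamiliar with MAE, pretraining hides a random subset of patches and trains the model to reconstruct them from the visible ones. The snippet below is a minimal, framework-free sketch of the masking step only. The function name `random_masking` is hypothetical, and the mask ratio of 0.75 is an assumption borrowed from the original MAE recipe; this card does not state the ratio STR-Lite uses. The 128 patches follow from a 32 × 128 input with 4 × 8 patches.

```python
import random


def random_masking(num_patches, mask_ratio, seed=None):
    """Split patch indices into visible and masked sets (hypothetical helper)."""
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)  # a random permutation decides which patches stay visible
    visible = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return visible, masked


# 128 patches = (32 / 4) * (128 / 8); mask_ratio=0.75 is an assumed value.
visible, masked = random_masking(num_patches=128, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 32 96
```

Only the visible patches would be passed through the encoder; the masked ones serve as reconstruction targets.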

**Stage 2 — Fine-tuning**
- Dataset: U14M-L-Filtered
- Epochs: 20, Batch size: 256, Learning rate: 1e-3, Weight decay: 0.01

## Checkpoints

| Model | Description | Epochs | Acc (common benchmarks, weighted avg.) | Download |
| ----- | ----------- | :----: | :-: | :------: |
| MAE ViT-Tiny | Pretrained encoder only | 40 | — | [pretrain/checkpoint-last.pth](https://huggingface.co/balaboom123/STRLite/resolve/main/pretrain/checkpoint-last.pth) |
| STRLite | Full fine-tuned model | 20 | 93.82% | [finetune/checkpoint-best.pth](https://huggingface.co/balaboom123/STRLite/resolve/main/finetune/checkpoint-best.pth) |

## Results

**Common STR Benchmarks** (accuracy, %)

| Subset | w/ pretrain | w/o pretrain |
| ------ | :---------: | :----------: |
| CUTE80 | 95.83 | 94.79 |
| IC13 | 96.85 | 96.50 |
| IC15 | 86.80 | 86.25 |
| IIIT5k | 96.97 | 96.47 |
| SVT | 95.36 | 94.90 |
| SVTP | 92.40 | 89.77 |
| **Weighted avg.** | **93.82** | **93.12** |
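
The weighted average weights each subset by its number of test images. The sketch below recomputes it from the table, assuming the commonly used test-set sizes for these benchmarks (288, 857, 1811, 3000, 647, 645). These sizes are not stated in this card, so treat them as an assumption; the names `results`, `assumed_sizes`, and `weighted_average` are illustrative.

```python
# Accuracies (%) copied from the table: (with pretraining, without pretraining).
results = {
    "CUTE80": (95.83, 94.79),
    "IC13": (96.85, 96.50),
    "IC15": (86.80, 86.25),
    "IIIT5k": (96.97, 96.47),
    "SVT": (95.36, 94.90),
    "SVTP": (92.40, 89.77),
}

# Assumption: commonly used test-set sizes; the card does not list them.
assumed_sizes = {
    "CUTE80": 288, "IC13": 857, "IC15": 1811,
    "IIIT5k": 3000, "SVT": 647, "SVTP": 645,
}


def weighted_average(column):
    """Average one results column, weighting each subset by its sample count."""
    total = sum(assumed_sizes.values())
    return sum(results[name][column] * assumed_sizes[name] for name in results) / total


print(f"w/ pretrain:  {weighted_average(0):.2f}")  # 93.82
print(f"w/o pretrain: {weighted_average(1):.2f}")  # 93.12
```

With these assumed sizes, both recomputed averages match the values reported in the table.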

**U14M Benchmarks** (accuracy, %)

| Subset | w/ pretrain | w/o pretrain |
| --------------- | :---------: | :----------: |
| artistic | 67.78 | 62.11 |
| contextless | 78.95 | 77.43 |
| curve | 82.19 | 78.97 |
| general | 81.07 | 79.96 |
| multi oriented | 82.91 | 78.57 |
| multi words | 76.72 | 74.31 |
| salient | 78.17 | 75.33 |
| **Weighted avg.** | **81.03** | **79.88** |
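
To see where MAE pretraining helps most, the per-subset gain can be computed directly from the table. This is plain arithmetic on the reported numbers; the variable names are illustrative.

```python
# Accuracies (%) copied from the U14M table: (with pretraining, without pretraining).
u14m = {
    "artistic": (67.78, 62.11),
    "contextless": (78.95, 77.43),
    "curve": (82.19, 78.97),
    "general": (81.07, 79.96),
    "multi oriented": (82.91, 78.57),
    "multi words": (76.72, 74.31),
    "salient": (78.17, 75.33),
}

# Gain in percentage points from pretraining, rounded to avoid float noise.
gains = {name: round(with_pt - without_pt, 2) for name, (with_pt, without_pt) in u14m.items()}

# List subsets from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:15s} +{gain:.2f}")

best_subset = max(gains, key=gains.get)
```

The largest gains appear on the artistic (+5.67) and multi oriented (+4.34) subsets, and the smallest on general (+1.11).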

## Usage

**Download and evaluate:**

```bash
git clone https://github.com/balaboom123/STR-Lite
cd STR-Lite

# Download checkpoint (requires `pip install huggingface_hub`) and capture its local path
path=$(python -c 'from huggingface_hub import hf_hub_download; print(hf_hub_download("balaboom123/STRLite", "finetune/checkpoint-best.pth"))')

# Evaluate
python eval.py \
  resume="$path" \
  test_data_path='[/path/to/lmdb_test]'
```
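
The same download-and-evaluate flow can also be driven from Python. The sketch below is one possible wrapper, not part of the repository: `download_checkpoint` and `build_eval_command` are hypothetical helper names, and the `eval.py` arguments simply mirror the `key=value` overrides shown above. It assumes `huggingface_hub` is installed and that the script runs from the cloned `STR-Lite` directory.

```python
import sys


def download_checkpoint(filename, repo_id="balaboom123/STRLite"):
    """Fetch a checkpoint from the Hub and return its local path (hypothetical helper)."""
    # Imported lazily so the command builder below works without huggingface_hub.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id, filename)


def build_eval_command(checkpoint_path, test_data_path):
    """Build the eval.py invocation using the key=value overrides from the shell example."""
    return [
        sys.executable,
        "eval.py",
        f"resume={checkpoint_path}",
        f"test_data_path=[{test_data_path}]",
    ]


# Example with placeholder paths; no download or evaluation happens here.
command = build_eval_command("checkpoint-best.pth", "/path/to/lmdb_test")
print(" ".join(command[1:]))
```

To run an actual evaluation, pass the result of `download_checkpoint("finetune/checkpoint-best.pth")` to `build_eval_command` and hand the returned list to `subprocess.run(..., check=True)`.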

**Fine-tune from MAE pretrained weights:**

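```bash
# Download the MAE-pretrained encoder (requires `pip install huggingface_hub`) and capture its local path
path=$(python -c 'from huggingface_hub import hf_hub_download; print(hf_hub_download("balaboom123/STRLite", "pretrain/checkpoint-last.pth"))')

python main_finetune.py \
  train_data_path='[/path/to/lmdb_train]' \
  val_data_path='[/path/to/lmdb_val]' \
  pretrained_mae="$path"
```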

See the [GitHub repo](https://github.com/balaboom123/STR-Lite) for full installation and dataset preparation instructions.