Scene Text Recognition with Permuted Autoregressive Sequence Models
Paper • arXiv:2207.06966
This is a modified version of CLPRNet where the original CNN-based recognition branch has been replaced with PARSeq Tiny (Scene Text Recognition with Permuted Autoregressive Sequence Models, ECCV 2022).
      Input Image (1024×1024)
                 │
                 ▼
┌─────────────────────────────────┐
│ Shared FPN Backbone (unchanged) │
│ BasicBlock stack → multi-scale  │
│ features → FPN upsampling       │
└──────┬──────────────┬───────────┘
       │              │
       ▼              ▼
┌──────────────┐ ┌─────────────┐
│ at_head (1ch)│ │ Detection   │
│ LP attention │ │ SEBasicBlock│
└──────────────┘ │ → 5ch head  │
                 └────┬────────┘
                      │
                      ▼
            Bounding Boxes (NMS)
                      │
                      ▼
            ┌───────────────────┐
            │ Plate Cropping    │
            │ (grid_sample)     │
            │ → (32, 128) RGB   │
            └─────────┬─────────┘
                      │
                      ▼
            ┌───────────────────┐
            │ PARSeq Tiny       │
            │ ViT Encoder       │
            │ (192d, 12 layers) │
            │ + Transformer     │
            │ Decoder (1 layer) │
            └─────────┬─────────┘
                      │
                      ▼
              Character Logits
                 (B, 9, 74)
| Component | Original | Modified |
|---|---|---|
| Recognition backbone | 4× SEBasicBlock CNN | PARSeq Tiny (ViT) |
| Recognition head | Conv2d 256→73 (dense spatial) | Linear 192→74 (sequence) |
| Character attention | 8-channel learned spatial masks | Internal Transformer attention |
| at_head output | 9 channels (1 LP + 8 char) | 1 channel (LP only) |
| Recognition output | (B, 16, 16, 584) dense grid | (B, 9, 74) sequence logits |
| Decoding | Argmax per grid cell | Autoregressive/parallel sequence |
| Plate cropping | Not needed (attention-based) | Differentiable grid_sample |
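The "Differentiable grid_sample" entry in the table can be illustrated with a minimal sketch. It assumes axis-aligned boxes given in pixel coordinates as `[left, top, right, bottom]` and a single image per call; the function name `crop_plates` and the `align_corners=False` convention are choices made for this example, and the cropping code in `model_parseq.py` may differ.

```python
import torch
import torch.nn.functional as F


def crop_plates(image, boxes_ltrb, out_hw=(32, 128)):
    """Resample axis-aligned boxes from one image to a fixed size.

    image:      (C, H, W) float tensor
    boxes_ltrb: (N, 4) tensor of [left, top, right, bottom] in pixels, N >= 1
    returns:    (N, C, out_h, out_w) tensor, differentiable w.r.t. both inputs
    """
    c, h, w = image.shape
    boxes = boxes_ltrb.to(image.dtype)
    l, t, r, b = boxes.unbind(dim=1)
    # grid_sample works in normalized coordinates in [-1, 1]; with
    # align_corners=False, -1 and +1 are the outer edges of the image.
    cx = (l + r) / w - 1.0  # box centre, normalized
    cy = (t + b) / h - 1.0
    sx = (r - l) / w        # box half-width, normalized
    sy = (b - t) / h
    zero = torch.zeros_like(sx)
    # One 2x3 affine matrix per box: scale and translate, no rotation.
    theta = torch.stack(
        [torch.stack([sx, zero, cx], dim=1),
         torch.stack([zero, sy, cy], dim=1)],
        dim=1,
    )  # (N, 2, 3)
    n = boxes.shape[0]
    grid = F.affine_grid(theta, size=(n, c, *out_hw), align_corners=False)
    batch = image.unsqueeze(0).expand(n, -1, -1, -1)
    return F.grid_sample(batch, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=False)


# Toy check: every pixel stores its own x coordinate.
image = torch.arange(256, dtype=torch.float32).expand(3, 64, 256).clone()
boxes = torch.tensor([[10.0, 20.0, 138.0, 52.0]], requires_grad=True)
crops = crop_plates(image, boxes)  # (1, 3, 32, 128)
crops.sum().backward()             # gradients reach the box coordinates
```

Because the crop is built from bilinear sampling rather than integer slicing, gradients can flow from the recognition loss back through the sampling grid, which is what makes the operation usable inside an end-to-end training step.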
Total: 8,134,550 params
PARSeq Tiny: 6,007,178 params (recognition)
Detection: 2,127,372 params (backbone + FPN + detection head)
| File | Purpose |
|---|---|
| `model_parseq.py` | Main model with PARSeq Tiny integrated |
| `train_parseq.py` | Training script (updated losses) |
| `inference_parseq.py` | Inference script (two-stage) |
from model_parseq import create_clprnet_parseq

# Create model
model = create_clprnet_parseq(max_label_length=8)

# Training forward pass (with GT boxes + labels)
y_det, y_rec, at_lp, plate_counts = model(
    images,                  # (B, 3, 1024, 1024)
    boxes_lurd=gt_boxes,     # list of (N_i, 4) tensors [l,t,r,b]
    plate_labels=gt_labels,  # list of plate strings
)

# Inference (detection only)
y_det, _, at_lp, _ = model(images)

# Full inference (detect + recognize)
plates, confs = model.recognize_plates(images, detected_boxes)
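How `recognize_plates` turns the `(9, 74)` logits of one plate into a string and a confidence is not shown here. A framework-free sketch of one plausible scheme, greedy decoding, follows. It rests on assumptions: the EOS class sits at index 0, characters occupy indices 1 and up, and the confidence is the product of the per-step maximum probabilities up to and including EOS. The names `EOS_ID`, `softmax`, and `greedy_decode` are hypothetical, and a toy 3-character charset stands in for the 73 plate characters.

```python
import math

EOS_ID = 0  # assumption: index of the end-of-sequence class


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(logits, charset):
    """Decode one sequence of logits, shape (steps, len(charset) + 1).

    Picks the most likely class at each step, stops at EOS, and returns
    (text, confidence), where confidence is the product of the chosen
    probabilities, EOS step included.
    """
    chars, conf = [], 1.0
    for row in logits:
        probs = softmax(row)
        idx = max(range(len(probs)), key=probs.__getitem__)
        conf *= probs[idx]
        if idx == EOS_ID:
            break
        chars.append(charset[idx - 1])  # class i maps to charset[i - 1]
    return "".join(chars), conf


# Toy example: classes are [EOS, "A", "B", "C"].
toy_logits = [
    [0.0, 10.0, 0.0, 0.0],  # "A"
    [0.0, 0.0, 0.0, 10.0],  # "C"
    [10.0, 0.0, 0.0, 0.0],  # EOS, decoding stops here
    [0.0, 10.0, 0.0, 0.0],  # ignored
]
text, conf = greedy_decode(toy_logits, "ABC")
```

With real model output, each plate's logits would be converted to nested lists (or the same logic written with tensor operations) and `charset` replaced by the model's actual 73-character table.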
Based on the DeiT-Ti configuration from the PARSeq paper:
| Parameter | Value |
|---|---|
| embed_dim | 192 |
| Encoder heads | 3 |
| Encoder depth | 12 |
| Decoder heads | 6 |
| Decoder depth | 1 |
| MLP ratio | 4 |
| Patch size | (4, 8) |
| Input size | (32, 128) |
| Max label length | 8 |
| Charset | 73 Chinese LP chars + EOS |
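The values in the table determine several derived sizes, including the `(B, 9, 74)` output shape quoted earlier. The sketch below collects them in a dataclass to make the arithmetic explicit; the class and field names are hypothetical and are not taken from `model_parseq.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParseqTinyConfig:
    """Hypothetical container for the configuration table values."""
    embed_dim: int = 192
    enc_heads: int = 3
    enc_depth: int = 12
    dec_heads: int = 6
    dec_depth: int = 1
    mlp_ratio: int = 4
    patch_size: tuple = (4, 8)
    img_size: tuple = (32, 128)
    max_label_length: int = 8
    charset_size: int = 73  # plate characters, EOS not included

    @property
    def num_patches(self):
        # (32 / 4) * (128 / 8) = 8 * 16 encoder tokens per crop
        return (self.img_size[0] // self.patch_size[0]) * (
            self.img_size[1] // self.patch_size[1])

    @property
    def enc_head_dim(self):
        return self.embed_dim // self.enc_heads

    @property
    def dec_head_dim(self):
        return self.embed_dim // self.dec_heads

    @property
    def mlp_hidden(self):
        return self.embed_dim * self.mlp_ratio

    def output_shape(self, batch):
        # One extra step and one extra class for EOS.
        return (batch, self.max_label_length + 1, self.charset_size + 1)


cfg = ParseqTinyConfig()
print(cfg.num_patches, cfg.enc_head_dim, cfg.dec_head_dim, cfg.mlp_hidden)
print(cfg.output_shape(batch=2))
```

Each 32×128 crop therefore becomes 128 patch tokens of width 192, and the decoder emits 9 positions (8 characters plus EOS) over 74 classes (73 characters plus EOS).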
torch >= 2.0
torchvision
numpy
opencv-python
Pillow
Uses the same CCPD + CRPD datasets as the original CLPRNet. See train_parseq.py.
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.