metadata
license: apache-2.0
tags:
- gguf
- ocr
- scene-text
- parseq
- crispembed
base_model: baudm/parseq
PARSeq — Scene Text Recognition (GGUF)
GGUF conversions of PARSeq (ECCV 2022) for use with CrispEmbed.
PARSeq is a scene text recognition model that reads text from natural images (signs, labels, documents). It recognizes 94 printable ASCII characters (digits, letters, punctuation).
Architecture
- Encoder: 12-layer pre-LN ViT (patch 4×8, input 32×128 RGB, 128 tokens, GELU FFN)
- Decoder: 1-layer two-stream Transformer (XLNet-style position queries + context self-attention, then cross-attention to encoder memory)
- Head: Linear → 95 classes (94 printable ASCII chars + EOS)
- Inference: Autoregressive greedy decode (max 25 characters)
Variants
| File | Variant | Params | Size | Notes |
|---|---|---|---|---|
parseq-f32.gguf |
Base | 24M | 91 MB | Full precision |
parseq-q8_0.gguf |
Base | 24M | 24 MB | Best quantized |
parseq-q4_k.gguf |
Base | 24M | 13 MB | Smallest base |
parseq-tiny-f16.gguf |
Tiny | 6M | 12 MB | Half precision |
parseq-tiny-q8_0.gguf |
Tiny | 6M | 6 MB | Smallest overall |
All quantization levels produce identical output on test images.
Usage
# CLI
crispembed -m parseq-q8_0.gguf --ocr image.png
# Auto-download
crispembed -m parseq --auto-download --ocr image.png
from crispembed import CrispMathOcr
ocr = CrispMathOcr("parseq-q8_0.gguf")
text = ocr.recognize("sign.png")
Benchmark (94-char, PARSeq-base)
| Dataset | Accuracy |
|---|---|
| IIIT5k | 99.1% |
| SVT | 97.9% |
| IC13-1015 | 98.1% |
| IC15-2077 | 89.2% |
| SVTP | 96.9% |
| CUTE80 | 98.6% |
Source
- Paper: Scene Text Recognition with Permuted Autoregressive Sequence Models (ECCV 2022)
- Code: baudm/parseq (Apache-2.0)
- Converted with
models/convert-parseq-to-gguf.pyfrom CrispEmbed