metadata
license: apache-2.0
tags:
  - gguf
  - ocr
  - scene-text
  - parseq
  - crispembed
base_model: baudm/parseq

PARSeq — Scene Text Recognition (GGUF)

GGUF conversions of PARSeq (ECCV 2022) for use with CrispEmbed.

PARSeq is a scene text recognition model that reads text from natural images (signs, labels, documents). It recognizes 94 printable ASCII characters (digits, letters, punctuation).

Architecture

  • Encoder: 12-layer pre-LN ViT (patch 4×8, input 32×128 RGB, 128 tokens, GELU FFN)
  • Decoder: 1-layer two-stream Transformer (XLNet-style position queries + context self-attention, then cross-attention to encoder memory)
  • Head: Linear → 95 classes (94 printable ASCII chars + EOS)
  • Inference: Autoregressive greedy decode (max 25 characters)
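
The token count and the greedy decode loop described above can be sketched in a few lines of Python. This is an illustration only, not CrispEmbed's implementation: `step_logits` stands in for the real encoder and decoder, `fake_model` is a hypothetical stub, and the charset ordering with EOS at index 0 is an assumption.

```python
import string

# Encoder geometry taken from the architecture list.
IMG_H, IMG_W = 32, 128
PATCH_H, PATCH_W = 4, 8
NUM_TOKENS = (IMG_H // PATCH_H) * (IMG_W // PATCH_W)  # 8 rows * 16 columns

# 94 printable ASCII characters; this ordering is an assumption.
CHARSET = string.digits + string.ascii_letters + string.punctuation
EOS_ID = 0      # assumed: class 0 is EOS, classes 1..94 are characters
MAX_LEN = 25    # decode limit from the architecture list


def greedy_decode(step_logits, max_len=MAX_LEN):
    """Pick the highest-scoring class at each step until EOS or max_len.

    step_logits(prefix_ids) must return 95 scores for the next position.
    """
    ids = []
    for _ in range(max_len):
        logits = step_logits(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == EOS_ID:
            break
        ids.append(next_id)
    return "".join(CHARSET[i - 1] for i in ids)


def fake_model(word):
    """Hypothetical stub that always 'reads' the given word."""
    def step(prefix_ids):
        logits = [0.0] * (len(CHARSET) + 1)
        pos = len(prefix_ids)
        if pos >= len(word):
            target = EOS_ID
        else:
            target = CHARSET.index(word[pos]) + 1
        logits[target] = 1.0
        return logits
    return step


if __name__ == "__main__":
    print(NUM_TOKENS, greedy_decode(fake_model("STOP")))
```

In a real run, `step_logits` would evaluate the decoder against the 128 encoder tokens; the loop structure stays the same.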

Variants

| File | Variant | Params | Size | Notes |
| --- | --- | --- | --- | --- |
| parseq-f32.gguf | Base | 24M | 91 MB | Full precision |
| parseq-q8_0.gguf | Base | 24M | 24 MB | Best quantized |
| parseq-q4_k.gguf | Base | 24M | 13 MB | Smallest base |
| parseq-tiny-f16.gguf | Tiny | 6M | 12 MB | Half precision |
| parseq-tiny-q8_0.gguf | Tiny | 6M | 6 MB | Smallest overall |

All quantization levels produce identical output on test images.
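
The file sizes listed for each variant roughly follow from parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate, assuming the standard GGML block layouts (Q8_0: 34 bytes per 32 weights, Q4_K: 144 bytes per 256 weights) and the rounded parameter counts from the table. Real files also carry metadata and may keep some tensors at higher precision, so the numbers will only match approximately.

```python
# Effective bits per weight for each storage type.
BITS_PER_WEIGHT = {
    "f32": 32.0,
    "f16": 16.0,
    "q8_0": 34 * 8 / 32,    # fp16 scale + 32 int8 values per block
    "q4_k": 144 * 8 / 256,  # 256-weight super-block with scales and mins
}


def estimate_mib(n_params, quant):
    """Approximate tensor payload in MiB, ignoring metadata and mixed types."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**20


# Rounded parameter counts as listed in the variants table.
VARIANTS = [
    ("parseq-f32.gguf", 24_000_000, "f32"),
    ("parseq-q8_0.gguf", 24_000_000, "q8_0"),
    ("parseq-q4_k.gguf", 24_000_000, "q4_k"),
    ("parseq-tiny-f16.gguf", 6_000_000, "f16"),
    ("parseq-tiny-q8_0.gguf", 6_000_000, "q8_0"),
]

if __name__ == "__main__":
    for name, n_params, quant in VARIANTS:
        print(f"{name}: ~{estimate_mib(n_params, quant):.1f} MiB")
```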

Usage

# CLI
crispembed -m parseq-q8_0.gguf --ocr image.png

# Auto-download
crispembed -m parseq --auto-download --ocr image.png

# Python
from crispembed import CrispMathOcr
ocr = CrispMathOcr("parseq-q8_0.gguf")
text = ocr.recognize("sign.png")
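
For batch jobs, the CLI invocation shown above can be wrapped from Python with `subprocess`. This is a minimal sketch: `build_ocr_command` and `ocr_folder` are hypothetical helper names, only the `-m` and `--ocr` flags from this README are used, and it assumes the CLI prints the recognized text to stdout, which is not confirmed here.

```python
import subprocess
from pathlib import Path


def build_ocr_command(model, image, binary="crispembed"):
    """Build the argument list for one CLI call (flags as shown in Usage)."""
    return [binary, "-m", str(model), "--ocr", str(image)]


def ocr_folder(model, folder, pattern="*.png"):
    """Run the CLI once per matching image and collect its stdout.

    Assumption: the recognized text is written to stdout.
    """
    results = {}
    for image in sorted(Path(folder).glob(pattern)):
        proc = subprocess.run(
            build_ocr_command(model, image),
            capture_output=True,
            text=True,
            check=True,  # raise if the CLI exits with a non-zero status
        )
        results[image.name] = proc.stdout.strip()
    return results
```

Calling `ocr_folder("parseq-q8_0.gguf", "crops/")` would return a dict mapping each file name to whatever the CLI printed for it.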

Benchmark (94-char, PARSeq-base)

| Dataset | Accuracy |
| --- | --- |
| IIIT5k | 99.1% |
| SVT | 97.9% |
| IC13-1015 | 98.1% |
| IC15-2077 | 89.2% |
| SVTP | 96.9% |
| CUTE80 | 98.6% |
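
To summarize the per-dataset numbers above as a single figure, one option is a plain mean and a sample-weighted mean. The sketch below uses the accuracies from the table. The sample counts are the commonly used test-split sizes for these benchmarks and are an assumption here, since this README does not state them.

```python
# (accuracy in %, assumed number of test samples)
RESULTS = {
    "IIIT5k": (99.1, 3000),
    "SVT": (97.9, 647),
    "IC13-1015": (98.1, 1015),
    "IC15-2077": (89.2, 2077),
    "SVTP": (96.9, 645),
    "CUTE80": (98.6, 288),
}


def mean_accuracy(results):
    """Unweighted mean over datasets."""
    return sum(acc for acc, _ in results.values()) / len(results)


def weighted_accuracy(results):
    """Mean weighted by the number of samples in each dataset."""
    total = sum(n for _, n in results.values())
    return sum(acc * n for acc, n in results.values()) / total


if __name__ == "__main__":
    print(f"mean: {mean_accuracy(RESULTS):.2f}%")
    print(f"weighted: {weighted_accuracy(RESULTS):.2f}%")
```

The weighted figure is pulled down by IC15-2077, which is both the hardest and one of the largest sets.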

Source