Sam20202's picture
Fix README: GLM-OCR + restore HF frontmatter
363e6f1
metadata
title: GLM OCR
emoji: πŸ“„
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false

GLM-OCR β€” Self-Hosted OCR Engine

A full-stack portfolio project: self-hosted OCR backend powered by zai-org/GLM-OCR, a 0.9B-param vision-language model ranked #1 on OmniDocBench V1.5.

πŸ”— Live Demo β†’ https://huggingface.co/spaces/Sam20202/GLMOCR_Text_extraction


What is GLM-OCR?

GLM-OCR is a state-of-the-art open-source OCR model from
zai-org (arXiv:2603.10910).

Unlike traditional OCR (Tesseract, etc.) it uses a vision encoder + language model architecture:

Image β†’ [CogViT Vision Encoder] β†’ [GLM-0.5B LM Backbone] β†’ Text

It handles:

  • Plain text from documents, screenshots, photos
  • Tables (preserved structure)
  • Mathematical equations (LaTeX output)
  • Code blocks (syntax preserved)
  • Multilingual text
  • Handwriting

Project Structure

GLMOCR_Text_extraction/
β”œβ”€β”€ main.py              # FastAPI server β€” routes, CORS, request handling
β”œβ”€β”€ ocr_engine.py        # Model loading, inference, OcrResult dataclass
β”œβ”€β”€ requirements.txt     # Python dependencies
β”œβ”€β”€ frontend/
β”‚   └── index.html       # Single-file frontend (served by FastAPI)
β”œβ”€β”€ Extension/           # Browser extension (Chrome/Edge)
β”œβ”€β”€ Dockerfile           # HF Spaces deployment
β”œβ”€β”€ docker-compose.yml   # Local Docker deployment
└── README.md

Architecture Diagram

Browser
  β”‚
  β”‚  POST /ocr (multipart image + mode)
  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  FastAPI  (main.py)                 β”‚
β”‚  ─ CORS middleware                  β”‚
β”‚  ─ file validation (type, size)     β”‚
β”‚  ─ session metrics                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
               β”‚  image_bytes, mode
               β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  GlmOcrEngine  (ocr_engine.py)     β”‚
β”‚  ─ PIL validation & RGB conversion β”‚
β”‚  ─ writes temp PNG to disk         β”‚
β”‚  ─ processor.apply_chat_template() β”‚
β”‚  ─ returns OcrResult dataclass     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
               β”‚  (torch.inference_mode)
               β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  GLM-OCR 0.9B Model                β”‚
β”‚  zai-org/GLM-OCR                   β”‚
β”‚  ─ Vision encoder (CogViT)        β”‚
β”‚  ─ LM backbone (GLM-0.5B)         β”‚
β”‚  ─ Runs on CUDA / CPU / MPS       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

References