---
title: GLM OCR
emoji: 📄
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# GLM-OCR: Self-Hosted OCR Engine
> A full-stack portfolio project: a self-hosted OCR backend powered by **zai-org/GLM-OCR**, a 0.9B-parameter vision-language model ranked #1 on OmniDocBench V1.5.
### 🔗 [Live Demo → https://huggingface.co/spaces/Sam20202/GLMOCR_Text_extraction](https://huggingface.co/spaces/Sam20202/GLMOCR_Text_extraction)
---
## What is GLM-OCR?
GLM-OCR is a state-of-the-art open-source OCR model from **zai-org** (arXiv:2603.10910).
Unlike traditional OCR engines such as Tesseract, it uses a **vision encoder + language model** architecture:
```
Image → [CogViT Vision Encoder] → [GLM-0.5B LM Backbone] → Text
```
It handles:
- Plain text from documents, screenshots, photos
- **Tables** (preserved structure)
- **Mathematical equations** (LaTeX output)
- **Code blocks** (syntax preserved)
- **Multilingual** text
- **Handwriting**
---
## Project Structure
```
GLMOCR_Text_extraction/
├── main.py              # FastAPI server: routes, CORS, request handling
├── ocr_engine.py        # Model loading, inference, OcrResult dataclass
├── requirements.txt     # Python dependencies
├── frontend/
│   └── index.html       # Single-file frontend (served by FastAPI)
├── Extension/           # Browser extension (Chrome/Edge)
├── Dockerfile           # HF Spaces deployment
├── docker-compose.yml   # Local Docker deployment
└── README.md
```
## Architecture Diagram
```
Browser
   │
   │  POST /ocr  (multipart image + mode)
   ▼
┌───────────────────────────────────────┐
│  FastAPI (main.py)                    │
│  ─ CORS middleware                    │
│  ─ file validation (type, size)       │
│  ─ session metrics                    │
└──────────────┬────────────────────────┘
               │  image_bytes, mode
               ▼
┌───────────────────────────────────────┐
│  GlmOcrEngine (ocr_engine.py)         │
│  ─ PIL validation & RGB conversion    │
│  ─ writes temp PNG to disk            │
│  ─ processor.apply_chat_template()    │
│  ─ returns OcrResult dataclass        │
└──────────────┬────────────────────────┘
               │  (torch.inference_mode)
               ▼
┌───────────────────────────────────────┐
│  GLM-OCR 0.9B Model                   │
│  zai-org/GLM-OCR                      │
│  ─ Vision encoder (CogViT)            │
│  ─ LM backbone (GLM-0.5B)             │
│  ─ Runs on CUDA / CPU / MPS           │
└───────────────────────────────────────┘
```
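
The first hop in the diagram (validate the upload, then pass `image_bytes` and `mode` to the engine) can be sketched roughly as below. This is a minimal, framework-free illustration and not the code in `main.py` or `ocr_engine.py`: the allowed content types, the size limit, the `OcrResult` fields, and the helper names `validate_upload`, `handle_request`, and `fake_engine` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical limits: the real values in main.py may differ.
ALLOWED_TYPES = {"image/png", "image/jpeg", "image/webp"}
MAX_BYTES = 10 * 1024 * 1024  # 10 MiB


@dataclass
class OcrResult:
    """Illustrative result container; the real field names may differ."""
    text: str
    mode: str
    elapsed_s: float


def validate_upload(content_type: str, data: bytes) -> None:
    """Raise ValueError if the upload fails the type or size check."""
    if content_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported content type: {content_type}")
    if not data:
        raise ValueError("empty upload")
    if len(data) > MAX_BYTES:
        raise ValueError(f"file too large: {len(data)} bytes")


def handle_request(
    content_type: str,
    data: bytes,
    mode: str,
    engine: Callable[[bytes, str], OcrResult],
) -> OcrResult:
    """Validate first, then hand the raw bytes and mode to the engine."""
    validate_upload(content_type, data)
    return engine(data, mode)


def fake_engine(image_bytes: bytes, mode: str) -> OcrResult:
    """Stand-in for the model call so the sketch runs without GLM-OCR."""
    return OcrResult(
        text=f"stub output ({len(image_bytes)} bytes)",
        mode=mode,
        elapsed_s=0.0,
    )
```

Calling `handle_request("image/png", some_bytes, "text", fake_engine)` returns an `OcrResult`, while an unsupported content type or an oversized payload raises `ValueError` before the engine is ever invoked. Rejecting bad input at the API layer keeps the expensive model call off the path for requests that cannot succeed.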
## References
- Paper: [GLM-OCR](https://arxiv.org/abs/2603.10910)
- Model: [zai-org/GLM-OCR](https://huggingface.co/zai-org/GLM-OCR)