| title: GLM OCR | |
| emoji: π | |
| colorFrom: indigo | |
| colorTo: purple | |
| sdk: docker | |
| app_port: 7860 | |
| pinned: false | |
| # GLM-OCR: Self-Hosted OCR Engine | |
| > A full-stack portfolio project: a self-hosted OCR backend powered by **zai-org/GLM-OCR**, a 0.9B-parameter vision-language model ranked #1 on OmniDocBench V1.5. | |
| ### [Live Demo: https://huggingface.co/spaces/Sam20202/GLMOCR_Text_extraction](https://huggingface.co/spaces/Sam20202/GLMOCR_Text_extraction) | |
| --- | |
| ## What is GLM-OCR? | |
| GLM-OCR is a state-of-the-art open-source OCR model from | |
| **zai-org** (arXiv:2603.10910). | |
| Unlike traditional OCR engines such as Tesseract, it uses a **vision encoder + language model** architecture: | |
| ``` | |
| Image → [CogViT Vision Encoder] → [GLM-0.5B LM Backbone] → Text | |
| ``` | |
| It handles: | |
| - Plain text from documents, screenshots, photos | |
| - **Tables** (preserved structure) | |
| - **Mathematical equations** (LaTeX output) | |
| - **Code blocks** (syntax preserved) | |
| - **Multilingual** text | |
| - **Handwriting** | |
| --- | |
| ## Project Structure | |
| ``` | |
| GLMOCR_Text_extraction/ | |
| ├── main.py              # FastAPI server: routes, CORS, request handling | |
| ├── ocr_engine.py        # Model loading, inference, OcrResult dataclass | |
| ├── requirements.txt     # Python dependencies | |
| ├── frontend/ | |
| │   └── index.html       # Single-file frontend (served by FastAPI) | |
| ├── Extension/           # Browser extension (Chrome/Edge) | |
| ├── Dockerfile           # HF Spaces deployment | |
| ├── docker-compose.yml   # Local Docker deployment | |
| └── README.md | |
| ``` | |
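The tree above says `ocr_engine.py` defines an `OcrResult` dataclass, but this README does not list its fields. As a rough sketch only, a result container of that kind might look like the following; every field name here (`text`, `mode`, `elapsed_ms`, `device`) and the `to_json_dict` helper are assumptions made for illustration, not the project's actual definition.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class OcrResult:
    """Hypothetical result container; the real field set may differ."""

    text: str          # recognized text (plain text, table markup, or LaTeX)
    mode: str          # recognition mode that was sent with the request
    elapsed_ms: float  # inference time in milliseconds
    device: str        # device label such as "cuda", "mps", or "cpu"

    def to_json_dict(self) -> dict:
        """Return a plain dict that an API route could serialize as JSON."""
        return asdict(self)


# Example values are made up for the demonstration.
result = OcrResult(text="Hello", mode="text", elapsed_ms=412.5, device="cpu")
payload = result.to_json_dict()
print(payload)
```

Freezing the dataclass keeps a result immutable once inference has finished, which is a convenient property when the same object is both logged and returned to the client.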
| ## Architecture Diagram | |
| ``` | |
|             Browser | |
|                │ | |
|                │ POST /ocr (multipart image + mode) | |
|                ▼ | |
| ┌─────────────────────────────────────┐ | |
| │  FastAPI (main.py)                  │ | |
| │  - CORS middleware                  │ | |
| │  - file validation (type, size)     │ | |
| │  - session metrics                  │ | |
| └──────────────┬──────────────────────┘ | |
|                │ image_bytes, mode | |
|                ▼ | |
| ┌─────────────────────────────────────┐ | |
| │  GlmOcrEngine (ocr_engine.py)       │ | |
| │  - PIL validation & RGB conversion  │ | |
| │  - writes temp PNG to disk          │ | |
| │  - processor.apply_chat_template()  │ | |
| │  - returns OcrResult dataclass      │ | |
| └──────────────┬──────────────────────┘ | |
|                │ (torch.inference_mode) | |
|                ▼ | |
| ┌─────────────────────────────────────┐ | |
| │  GLM-OCR 0.9B Model                 │ | |
| │  zai-org/GLM-OCR                    │ | |
| │  - Vision encoder (CogViT)          │ | |
| │  - LM backbone (GLM-0.5B)           │ | |
| │  - Runs on CUDA / CPU / MPS         │ | |
| └─────────────────────────────────────┘ | |
| ``` | |
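The diagram lists "file validation (type, size)" as a step in `main.py` without showing how it is done. The sketch below is one plausible way to implement such a check using only the standard library. The 10 MiB limit, the PNG/JPEG allow-list, and the names `validate_upload` and `UploadError` are all assumptions for illustration; the project's real limits and helper names are not given in this README.

```python
from typing import Optional

# Assumed values for illustration; the README does not state the real limits.
MAX_BYTES = 10 * 1024 * 1024  # 10 MiB
SIGNATURES = {
    "image/png": b"\x89PNG\r\n\x1a\n",
    "image/jpeg": b"\xff\xd8\xff",
}


class UploadError(ValueError):
    """Validation failure carrying the HTTP status a route could return."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def validate_upload(data: bytes, content_type: Optional[str]) -> None:
    """Raise UploadError if the upload has an unsupported type or size."""
    signature = SIGNATURES.get(content_type or "")
    if signature is None:
        raise UploadError(415, f"unsupported content type: {content_type!r}")
    if not data:
        raise UploadError(400, "empty upload")
    if len(data) > MAX_BYTES:
        raise UploadError(413, f"file exceeds {MAX_BYTES} bytes")
    # The declared type is client-controlled, so also check the magic bytes.
    if not data.startswith(signature):
        raise UploadError(415, "file content does not match its declared type")
```

In a FastAPI route, an `UploadError` could be caught and re-raised as `fastapi.HTTPException(status_code=err.status_code, detail=err.detail)`, so that the image decoding step in the engine only ever sees bytes that already passed these cheap checks.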
| ## References | |
| - Paper: [GLM-OCR](https://arxiv.org/abs/2603.10910) | |
| - Model: [zai-org/GLM-OCR](https://huggingface.co/zai-org/GLM-OCR) | |