---
title: GLM OCR
emoji: 📄
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# GLM-OCR: Self-Hosted OCR Engine

> A full-stack portfolio project: self-hosted OCR backend powered by **zai-org/GLM-OCR**, a 0.9B-param vision-language model ranked #1 on OmniDocBench V1.5.

### 🔗 [Live Demo → https://huggingface.co/spaces/Sam20202/GLMOCR_Text_extraction](https://huggingface.co/spaces/Sam20202/GLMOCR_Text_extraction)

---

## What is GLM-OCR?

GLM-OCR is a state-of-the-art open-source OCR model from **zai-org** (arXiv:2603.10910). Unlike traditional OCR (Tesseract, etc.) it uses a **vision encoder + language model** architecture:

```
Image → [CogViT Vision Encoder] → [GLM-0.5B LM Backbone] → Text
```

It handles:

- Plain text from documents, screenshots, photos
- **Tables** (preserved structure)
- **Mathematical equations** (LaTeX output)
- **Code blocks** (syntax preserved)
- **Multilingual** text
- **Handwriting**

---

## Project Structure

```
GLMOCR_Text_extraction/
├── main.py              # FastAPI server: routes, CORS, request handling
├── ocr_engine.py        # Model loading, inference, OcrResult dataclass
├── requirements.txt     # Python dependencies
├── frontend/
│   └── index.html       # Single-file frontend (served by FastAPI)
├── Extension/           # Browser extension (Chrome/Edge)
├── Dockerfile           # HF Spaces deployment
├── docker-compose.yml   # Local Docker deployment
└── README.md
```

## Architecture Diagram

```
            Browser
               │
               │  POST /ocr (multipart image + mode)
               ▼
┌─────────────────────────────────────┐
│          FastAPI (main.py)          │
│  ─ CORS middleware                  │
│  ─ file validation (type, size)     │
│  ─ session metrics                  │
└──────────────┬──────────────────────┘
               │  image_bytes, mode
               ▼
┌─────────────────────────────────────┐
│    GlmOcrEngine (ocr_engine.py)     │
│  ─ PIL validation & RGB conversion  │
│  ─ writes temp PNG to disk          │
│  ─ processor.apply_chat_template()  │
│  ─ returns OcrResult dataclass      │
└──────────────┬──────────────────────┘
               │  (torch.inference_mode)
               ▼
┌─────────────────────────────────────┐
│         GLM-OCR 0.9B Model          │
│           zai-org/GLM-OCR           │
│  ─ Vision encoder (CogViT)          │
│  ─ LM backbone (GLM-0.5B)           │
│  ─ Runs on CUDA / CPU / MPS         │
└─────────────────────────────────────┘
```

## References

- Paper: [GLM-OCR](https://arxiv.org/abs/2603.10910)
- Model: [zai-org/GLM-OCR](https://huggingface.co/zai-org/GLM-OCR)
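
## Example: Upload Validation Sketch

The `file validation (type, size)` step in the architecture diagram could look roughly like the following. This is a minimal sketch and not the code in `main.py`: the helper names `sniff_image_type` and `validate_upload`, the 10 MB limit, and the list of accepted formats are all assumptions chosen for illustration. It checks only leading magic bytes using the standard library, whereas the diagram indicates that the engine performs a fuller PIL-based validation afterwards.

```python
from typing import Optional

# Assumed size limit for illustration; the project's real limit may differ.
MAX_BYTES = 10 * 1024 * 1024

# Leading signature bytes for a few common image formats (assumed allow-list).
MAGIC_PREFIXES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
    "gif": b"GIF8",
}


def sniff_image_type(data: bytes) -> Optional[str]:
    """Return a format name if the bytes start with a known signature."""
    for name, prefix in MAGIC_PREFIXES.items():
        if data.startswith(prefix):
            return name
    return None


def validate_upload(data: bytes, max_bytes: int = MAX_BYTES) -> str:
    """Raise ValueError for empty, oversized, or unrecognized uploads."""
    if not data:
        raise ValueError("empty upload")
    if len(data) > max_bytes:
        raise ValueError(
            f"file too large: {len(data)} bytes (limit {max_bytes})"
        )
    kind = sniff_image_type(data)
    if kind is None:
        raise ValueError("unsupported or unrecognized image type")
    return kind
```

In a FastAPI route, a `ValueError` raised by a helper like this would typically be translated into a 4xx response before the bytes are handed to the OCR engine.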