Spaces:
Sleeping
Sleeping
File size: 3,227 Bytes
0533780 363e6f1 0533780 363e6f1 0533780 886ad57 0533780 363e6f1 0533780 363e6f1 0533780 363e6f1 0533780 363e6f1 0533780 363e6f1 0533780 363e6f1 0533780 363e6f1 0533780 363e6f1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 | ---
title: GLM OCR
emoji: π
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# GLM-OCR β Self-Hosted OCR Engine
> A full-stack portfolio project: self-hosted OCR backend powered by **zai-org/GLM-OCR**, a 0.9B-param vision-language model ranked #1 on OmniDocBench V1.5.
### π [Live Demo β https://huggingface.co/spaces/Sam20202/GLMOCR_Text_extraction](https://huggingface.co/spaces/Sam20202/GLMOCR_Text_extraction)
---
## What is GLM-OCR?
GLM-OCR is a state-of-the-art open-source OCR model from
**zai-org** (arXiv:2603.10910).
Unlike traditional OCR (Tesseract, etc.) it uses a **vision encoder + language model** architecture:
```
Image β [CogViT Vision Encoder] β [GLM-0.5B LM Backbone] β Text
```
It handles:
- Plain text from documents, screenshots, photos
- **Tables** (preserved structure)
- **Mathematical equations** (LaTeX output)
- **Code blocks** (syntax preserved)
- **Multilingual** text
- **Handwriting**
---
## Project Structure
```
GLMOCR_Text_extraction/
βββ main.py # FastAPI server β routes, CORS, request handling
βββ ocr_engine.py # Model loading, inference, OcrResult dataclass
βββ requirements.txt # Python dependencies
βββ frontend/
β βββ index.html # Single-file frontend (served by FastAPI)
βββ Extension/ # Browser extension (Chrome/Edge)
βββ Dockerfile # HF Spaces deployment
βββ docker-compose.yml # Local Docker deployment
βββ README.md
```
## Architecture Diagram
```
Browser
β
β POST /ocr (multipart image + mode)
βΌ
βββββββββββββββββββββββββββββββββββββββ
β FastAPI (main.py) β
β β CORS middleware β
β β file validation (type, size) β
β β session metrics β
ββββββββββββββββ¬βββββββββββββββββββββββ
β image_bytes, mode
βΌ
βββββββββββββββββββββββββββββββββββββββ
β GlmOcrEngine (ocr_engine.py) β
β β PIL validation & RGB conversion β
β β writes temp PNG to disk β
β β processor.apply_chat_template() β
β β returns OcrResult dataclass β
ββββββββββββββββ¬βββββββββββββββββββββββ
β (torch.inference_mode)
βΌ
βββββββββββββββββββββββββββββββββββββββ
β GLM-OCR 0.9B Model β
β zai-org/GLM-OCR β
β β Vision encoder (CogViT) β
β β LM backbone (GLM-0.5B) β
β β Runs on CUDA / CPU / MPS β
βββββββββββββββββββββββββββββββββββββββ
```
## References
- Paper: [GLM-OCR](https://arxiv.org/abs/2603.10910)
- Model: [zai-org/GLM-OCR](https://huggingface.co/zai-org/GLM-OCR)
|