---
title: GLM OCR
emoji: 📄
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# GLM-OCR — Self-Hosted OCR Engine

> A full-stack portfolio project: a self-hosted OCR backend powered by **zai-org/GLM-OCR**, a 0.9B-parameter vision-language model ranked #1 on OmniDocBench V1.5.

### 🔗 [Live Demo → https://huggingface.co/spaces/Sam20202/GLMOCR_Text_extraction](https://huggingface.co/spaces/Sam20202/GLMOCR_Text_extraction)

---

## What is GLM-OCR?

GLM-OCR is a state-of-the-art open-source OCR model from **zai-org** (arXiv:2603.10910).

Unlike traditional OCR engines such as Tesseract, it uses a **vision encoder + language model** architecture:

```
Image → [CogViT Vision Encoder] → [GLM-0.5B LM Backbone] → Text
```

It handles:
- Plain text from documents, screenshots, photos
- **Tables** (preserved structure)
- **Mathematical equations** (LaTeX output)
- **Code blocks** (syntax preserved)
- **Multilingual** text
- **Handwriting**

---

## Project Structure

```
GLMOCR_Text_extraction/
├── main.py              # FastAPI server — routes, CORS, request handling
├── ocr_engine.py        # Model loading, inference, OcrResult dataclass
├── requirements.txt     # Python dependencies
├── frontend/
│   └── index.html       # Single-file frontend (served by FastAPI)
├── Extension/           # Browser extension (Chrome/Edge)
├── Dockerfile           # HF Spaces deployment
├── docker-compose.yml   # Local Docker deployment
└── README.md
```
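
The request handling in `main.py` includes type and size checks on the uploaded file. The following is a minimal, standard-library sketch of what such a check could look like; it is not the project's actual code. The names `validate_upload`, `sniff_image_type`, `MAX_BYTES`, the 10 MiB limit, and the list of accepted formats are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical size limit; the project's real limit is not documented here.
MAX_BYTES = 10 * 1024 * 1024

# Leading "magic" bytes for two common image formats (assumed accepted set).
MAGIC_PREFIXES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
}


def sniff_image_type(data: bytes) -> Optional[str]:
    """Return a MIME type when the leading bytes match a known signature."""
    for prefix, mime in MAGIC_PREFIXES.items():
        if data.startswith(prefix):
            return mime
    # WebP files start with "RIFF", a 4-byte size, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


def validate_upload(data: bytes, max_bytes: int = MAX_BYTES) -> str:
    """Return the detected MIME type, or raise ValueError for a bad upload."""
    if not data:
        raise ValueError("empty upload")
    if len(data) > max_bytes:
        raise ValueError(f"file too large: {len(data)} bytes (limit {max_bytes})")
    mime = sniff_image_type(data)
    if mime is None:
        raise ValueError("unsupported or unrecognized image type")
    return mime
```

Sniffing the leading bytes rather than trusting the client-supplied `Content-Type` header is one possible design choice. In a FastAPI route, a `ValueError` like this would typically be translated into an HTTP 4xx response.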

## Architecture Diagram

```
Browser
  │
  │  POST /ocr (multipart image + mode)
  ▼
┌─────────────────────────────────────┐
│  FastAPI  (main.py)                 │
│  ─ CORS middleware                  │
│  ─ file validation (type, size)     │
│  ─ session metrics                  │
└──────────────┬──────────────────────┘
               │  image_bytes, mode
               ▼
┌─────────────────────────────────────┐
│  GlmOcrEngine  (ocr_engine.py)      │
│  ─ PIL validation & RGB conversion  │
│  ─ writes temp PNG to disk          │
│  ─ processor.apply_chat_template()  │
│  ─ returns OcrResult dataclass      │
└──────────────┬──────────────────────┘
               │  (torch.inference_mode)
               ▼
┌─────────────────────────────────────┐
│  GLM-OCR 0.9B Model                 │
│  zai-org/GLM-OCR                    │
│  ─ Vision encoder (CogViT)          │
│  ─ LM backbone (GLM-0.5B)           │
│  ─ Runs on CUDA / CPU / MPS         │
└─────────────────────────────────────┘
```
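
The first hop in the diagram, `POST /ocr` with a multipart image and a `mode` value, can be exercised from a script as well as from the browser. The sketch below builds such a request using only the Python standard library. The form field name `file`, the mode value `"text"`, and the helper name `build_ocr_request` are assumptions for illustration; check `main.py` for the names the server actually expects. Port 7860 comes from `app_port` in the Space configuration.

```python
import urllib.request
import uuid


def build_ocr_request(base_url: str, image_bytes: bytes, filename: str,
                      mode: str = "text") -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST to /ocr."""
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode()
    parts = [
        # Text field carrying the OCR mode ("text" is a hypothetical value).
        delimiter,
        b'Content-Disposition: form-data; name="mode"',
        b"",
        mode.encode(),
        # File field carrying the raw image bytes ("file" is an assumed name).
        delimiter,
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        image_bytes,
        # Closing delimiter, followed by a final CRLF.
        delimiter + b"--",
        b"",
    ]
    body = b"\r\n".join(parts)
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/ocr",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

With a server running locally, the request could be sent with `urllib.request.urlopen(build_ocr_request("http://localhost:7860", data, "page.png"))`. The shape of the response body is not documented in this README, so inspect it before parsing.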


## References

- Paper: [GLM-OCR](https://arxiv.org/abs/2603.10910)
- Model: [zai-org/GLM-OCR](https://huggingface.co/zai-org/GLM-OCR)