---
title: GLM OCR
emoji: 📄
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# GLM-OCR: Self-Hosted OCR Engine

> A full-stack portfolio project: a self-hosted OCR backend powered by **zai-org/GLM-OCR**, a 0.9B-parameter vision-language model ranked #1 on OmniDocBench V1.5.

### 🔗 [Live Demo → https://huggingface.co/spaces/Sam20202/GLMOCR_Text_extraction](https://huggingface.co/spaces/Sam20202/GLMOCR_Text_extraction)

---

## What is GLM-OCR?

GLM-OCR is a state-of-the-art open-source OCR model from **zai-org** (arXiv:2603.10910).

Unlike traditional OCR engines such as Tesseract, it uses a **vision encoder + language model** architecture:

```
Image → [CogViT Vision Encoder] → [GLM-0.5B LM Backbone] → Text
```

It handles:
- Plain text from documents, screenshots, photos
- **Tables** (preserved structure)
- **Mathematical equations** (LaTeX output)
- **Code blocks** (syntax preserved)
- **Multilingual** text
- **Handwriting**

---

## Project Structure

```
GLMOCR_Text_extraction/
├── main.py              # FastAPI server: routes, CORS, request handling
├── ocr_engine.py        # Model loading, inference, OcrResult dataclass
├── requirements.txt     # Python dependencies
├── frontend/
│   └── index.html       # Single-file frontend (served by FastAPI)
├── Extension/           # Browser extension (Chrome/Edge)
├── Dockerfile           # HF Spaces deployment
├── docker-compose.yml   # Local Docker deployment
└── README.md
```

## Architecture Diagram

```
Browser
  │
  │  POST /ocr (multipart image + mode)
  ▼
┌─────────────────────────────────────┐
│  FastAPI  (main.py)                 │
│  ─ CORS middleware                  │
│  ─ file validation (type, size)     │
│  ─ session metrics                  │
└──────────────┬──────────────────────┘
               │  image_bytes, mode
               ▼
┌─────────────────────────────────────┐
│  GlmOcrEngine  (ocr_engine.py)      │
│  ─ PIL validation & RGB conversion  │
│  ─ writes temp PNG to disk          │
│  ─ processor.apply_chat_template()  │
│  ─ returns OcrResult dataclass      │
└──────────────┬──────────────────────┘
               │  (torch.inference_mode)
               ▼
┌─────────────────────────────────────┐
│  GLM-OCR 0.9B Model                 │
│  zai-org/GLM-OCR                    │
│  ─ Vision encoder (CogViT)          │
│  ─ LM backbone (GLM-0.5B)           │
│  ─ Runs on CUDA / CPU / MPS         │
└─────────────────────────────────────┘
```
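
The request path in the diagram (file validation in `main.py`, then a hand-off of `image_bytes, mode` to the engine) can be sketched roughly as follows. This is an illustrative, standard-library-only sketch, not the code in this repository: the allowed MIME types, the size limit, the `OcrResult` fields, and the helper names `validate_upload` and `handle_request` are all assumptions.

```python
from dataclasses import dataclass

# Assumed limits for illustration; the real values in main.py may differ.
ALLOWED_TYPES = {"image/png", "image/jpeg", "image/webp"}
MAX_BYTES = 10 * 1024 * 1024  # 10 MiB


@dataclass
class OcrResult:
    """Hypothetical shape of the engine's result; field names are assumed."""
    text: str
    mode: str
    elapsed_ms: float


class UploadRejected(ValueError):
    """Raised when an upload fails type or size validation."""


def validate_upload(content_type: str, data: bytes) -> None:
    """Reject uploads with an unsupported MIME type or an out-of-range size."""
    if content_type not in ALLOWED_TYPES:
        raise UploadRejected(f"unsupported content type: {content_type!r}")
    if not data:
        raise UploadRejected("empty upload")
    if len(data) > MAX_BYTES:
        raise UploadRejected(f"file too large: {len(data)} bytes")


def handle_request(content_type: str, data: bytes, mode: str, run_ocr) -> OcrResult:
    """Validate first, then delegate to an engine callable.

    `run_ocr` stands in for the engine: any callable taking
    (image_bytes, mode) and returning an OcrResult.
    """
    validate_upload(content_type, data)
    return run_ocr(data, mode)
```

Passing the engine in as a callable keeps the validation logic testable without loading the model; in a real FastAPI route, an `UploadRejected` error would typically be translated into an HTTP 4xx response.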


## References

- Paper: [GLM-OCR](https://arxiv.org/abs/2603.10910)
- Model: [zai-org/GLM-OCR](https://huggingface.co/zai-org/GLM-OCR)