--- |
|
|
license: mit |
|
|
language: |
|
|
- zh |
|
|
- en |
|
|
- fr |
|
|
- es |
|
|
- ru |
|
|
- de |
|
|
- ja |
|
|
- ko |
|
|
pipeline_tag: image-to-text |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# ONNX model for [GLM-OCR](https://huggingface.co/zai-org/GLM-OCR) |
|
|
|
|
|
## Try it with [ningpp/flux](https://github.com/ningpp/flux)
|
|
|
|
|
Flux is a Java-based OCR library that can run this ONNX export; see the linked repository for setup details.
|
|
|
|
|
|
|
|
|
|
|
# GLM-OCR |
|
|
|
|
|
<div align="center"> |
|
|
<img src="https://raw.githubusercontent.com/zai-org/GLM-OCR/refs/heads/main/resources/logo.svg" width="40%"/>
|
|
</div> |
|
|
<p align="center"> |
|
|
👋 Join our <a href="https://raw.githubusercontent.com/zai-org/GLM-OCR/refs/heads/main/resources/wechat.png" target="_blank">WeChat</a> and <a href="https://discord.gg/8KFjEec7" target="_blank">Discord</a> community |
|
|
<br> |
|
|
📍 Use GLM-OCR's <a href="https://docs.z.ai/guides/image/glm-ocr" target="_blank">API</a> |
|
|
</p> |
|
|
|
|
|
|
|
|
## Introduction |
|
|
|
|
|
GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts. |
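
The two-stage pipeline can be pictured as follows. This is an illustrative sketch, not the actual implementation: `detect_layout` and `recognize_region` are hypothetical stand-ins for PP-DocLayout-V3 and the GLM-OCR decoder.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

def detect_layout(image) -> List[dict]:
    """Hypothetical stand-in for PP-DocLayout-V3: split the page into regions."""
    return [{"bbox": (0, 0, 100, 40), "type": "text"}]

def recognize_region(region: dict) -> str:
    """Hypothetical stand-in for the GLM-OCR decoder on a single region."""
    return "recognized text"

def ocr_document(image) -> str:
    regions = detect_layout(image)            # stage 1: layout analysis
    with ThreadPoolExecutor() as pool:        # stage 2: parallel recognition
        results = list(pool.map(recognize_region, regions))
    return "\n".join(results)                 # merge results in reading order
```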
|
|
|
|
|
**Key Features** |
|
|
|
|
|
- **State-of-the-Art Performance**: Achieves a score of 94.62 on OmniDocBench V1.5, ranking #1 overall, and delivers state-of-the-art results across major document understanding benchmarks, including formula recognition, table recognition, and information extraction. |
|
|
|
|
|
- **Optimized for Real-World Scenarios**: Designed and optimized for practical business use cases, maintaining robust performance on complex tables, code-heavy documents, seals, and other challenging real-world layouts. |
|
|
|
|
|
- **Efficient Inference**: With only 0.9B parameters, GLM-OCR supports deployment via vLLM, SGLang, and Ollama, significantly reducing inference latency and compute cost, making it ideal for high-concurrency services and edge deployments. |
|
|
|
|
|
- **Easy to Use**: Fully open-sourced and equipped with a comprehensive [SDK](https://github.com/zai-org/GLM-OCR) and inference toolchain, offering simple installation, one-line invocation, and smooth integration into existing production pipelines. |
|
|
|
|
|
## Usage |
|
|
|
|
|
### vLLM |
|
|
|
|
|
1. Install the latest vLLM:
|
|
|
|
|
```bash |
|
|
pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly |
|
|
``` |
|
|
|
|
|
or pull the Docker image:

```bash
|
|
docker pull vllm/vllm-openai:nightly |
|
|
``` |
|
|
|
|
|
2. Install the latest Transformers and serve the model:
|
|
|
|
|
```bash |
|
|
pip install git+https://github.com/huggingface/transformers.git |
|
|
vllm serve zai-org/GLM-OCR --allowed-local-media-path / --port 8080 |
|
|
``` |
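
Once the server is up, it speaks the OpenAI-compatible chat API. A minimal client sketch (the image path is an assumption; the model name and port match the command above):

```python
# pip install openai
import base64

from openai import OpenAI

# Point the client at the local vLLM server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

# Send the document image as a base64 data URL.
with open("test_image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="zai-org/GLM-OCR",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Text Recognition:"},
        ],
    }],
)
print(response.choices[0].message.content)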
|
|
|
|
|
### SGLang |
|
|
|
|
|
|
|
|
1. Pull the Docker image:
|
|
|
|
|
```bash |
|
|
docker pull lmsysorg/sglang:dev |
|
|
``` |
|
|
|
|
|
or install it from source with:
|
|
|
|
|
```bash |
|
|
pip install git+https://github.com/sgl-project/sglang.git#subdirectory=python |
|
|
``` |
|
|
|
|
|
2. Install the latest Transformers and launch the server:
|
|
|
|
|
```bash |
|
|
pip install git+https://github.com/huggingface/transformers.git |
|
|
python -m sglang.launch_server --model zai-org/GLM-OCR --port 8080 |
|
|
``` |
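
SGLang likewise exposes an OpenAI-compatible `/v1` endpoint on the chosen port, so the Python client sketch from the vLLM section should work unchanged against `http://localhost:8080/v1`.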
|
|
|
|
|
### Ollama |
|
|
|
|
|
1. Download [Ollama](https://ollama.com/download). |
|
|
2. Run the model:
|
|
|
|
|
```bash |
|
|
ollama run glm-ocr |
|
|
``` |
|
|
|
|
|
Ollama automatically fills in the image file path when an image is dragged into the terminal:
|
|
|
|
|
```bash |
|
|
ollama run glm-ocr "Text Recognition: ./image.png"
|
|
``` |
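
You can also call Ollama programmatically. A minimal sketch using the official `ollama` Python package (assuming the `glm-ocr` tag pulled in step 2 and a local `./image.png`):

```python
# pip install ollama
import ollama

response = ollama.chat(
    model="glm-ocr",
    messages=[{
        "role": "user",
        "content": "Text Recognition:",
        "images": ["./image.png"],  # local path to the document image
    }],
)
print(response["message"]["content"])
```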
|
|
|
|
|
### Transformers |
|
|
|
|
|
```bash
|
|
pip install git+https://github.com/huggingface/transformers.git |
|
|
``` |
|
|
|
|
|
```python |
|
|
from transformers import AutoProcessor, AutoModelForImageTextToText |
|
|
import torch |
|
|
|
|
|
MODEL_PATH = "zai-org/GLM-OCR" |
|
|
messages = [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{ |
|
|
"type": "image", |
|
|
"url": "test_image.png" |
|
|
}, |
|
|
{ |
|
|
"type": "text", |
|
|
"text": "Text Recognition:" |
|
|
} |
|
|
], |
|
|
} |
|
|
] |
|
|
processor = AutoProcessor.from_pretrained(MODEL_PATH) |
|
|
model = AutoModelForImageTextToText.from_pretrained( |
|
|
pretrained_model_name_or_path=MODEL_PATH, |
|
|
torch_dtype="auto", |
|
|
device_map="auto", |
|
|
) |
|
|
inputs = processor.apply_chat_template( |
|
|
messages, |
|
|
tokenize=True, |
|
|
add_generation_prompt=True, |
|
|
return_dict=True, |
|
|
return_tensors="pt" |
|
|
).to(model.device) |
|
|
inputs.pop("token_type_ids", None) |
|
|
generated_ids = model.generate(**inputs, max_new_tokens=8192) |
|
|
output_text = processor.decode(generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False) |
|
|
print(output_text) |
|
|
``` |
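
The `"Text Recognition:"` prompt can be swapped for any of the task prompts listed under Supported Prompts below.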
|
|
|
|
|
### Supported Prompts
|
|
|
|
|
GLM-OCR currently supports two types of prompt scenarios: |
|
|
|
|
|
1. **Document Parsing** – extract raw content from documents. Supported tasks include: |
|
|
|
|
|
```python |
|
|
{ |
|
|
"text": "Text Recognition:", |
|
|
"formula": "Formula Recognition:", |
|
|
"table": "Table Recognition:" |
|
|
} |
|
|
``` |
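
Each of these values goes directly into the `"text"` field of the user message shown in the Transformers example above; for instance, swap in `"Table Recognition:"` to parse a table.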
|
|
|
|
|
2. **Information Extraction** – extract structured information from documents. Prompts must follow a strict JSON schema. For example, to extract personal ID information: |
|
|
|
|
|
```text
Output the information in the image in the following JSON format:
|
|
{ |
|
|
"id_number": "", |
|
|
"last_name": "", |
|
|
"first_name": "", |
|
|
"date_of_birth": "", |
|
|
"address": { |
|
|
"street": "", |
|
|
"city": "", |
|
|
"state": "", |
|
|
"zip_code": "" |
|
|
}, |
|
|
"dates": { |
|
|
"issue_date": "", |
|
|
"expiration_date": "" |
|
|
}, |
|
|
"sex": "" |
|
|
} |
|
|
``` |
|
|
|
|
|
⚠️ Note: When using information extraction, the output must strictly adhere to the defined JSON schema to ensure downstream processing compatibility. |
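
Downstream code can then parse and sanity-check the reply. A minimal sketch (the helper and field names are illustrative, mirroring the ID-card schema above):

```python
import json

REQUIRED_FIELDS = {"id_number", "last_name", "first_name", "date_of_birth",
                   "address", "dates", "sex"}

def parse_id_card(reply: str) -> dict:
    """Parse the model's JSON reply and verify the expected top-level fields."""
    data = json.loads(reply)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"reply is missing fields: {sorted(missing)}")
    return data
```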
|
|
|
|
|
## GLM-OCR SDK |
|
|
|
|
|
We provide an easy-to-use SDK for working with GLM-OCR more efficiently. Please check our [GitHub](https://github.com/zai-org/GLM-OCR) for more details.
|
|
|
|
|
## Acknowledgement |
|
|
|
|
|
This project is inspired by the excellent work of the following projects and communities: |
|
|
|
|
|
- [PP-DocLayout-V3](https://huggingface.co/PaddlePaddle/PP-DocLayoutV3) |
|
|
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) |
|
|
- [MinerU](https://github.com/opendatalab/MinerU) |
|
|
|
|
|
## License |
|
|
|
|
|
The GLM-OCR model is released under the MIT License. |
|
|
|
|
|
The complete OCR pipeline integrates [PP-DocLayoutV3](https://huggingface.co/PaddlePaddle/PP-DocLayoutV3) for document layout analysis, which is licensed under the Apache License 2.0. Users should comply with both licenses when using this project. |
|
|
|