---
license: mit
language:
  - zh
  - en
  - fr
  - es
  - ru
  - de
  - ja
  - ko
pipeline_tag: image-to-text
library_name: transformers
---

# ONNX model for [GLM-OCR](https://huggingface.co/zai-org/GLM-OCR)

## Try with [ningpp/flux](https://github.com/ningpp/flux)

Flux is a Java-based OCR tool.

# GLM-OCR

👋 Join our WeChat and Discord community
📍 Use GLM-OCR's API

## Introduction

GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces a Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust, high-quality OCR across diverse document layouts.

**Key Features**

- **State-of-the-Art Performance**: Achieves a score of 94.62 on OmniDocBench V1.5, ranking #1 overall, and delivers state-of-the-art results across major document understanding benchmarks, including formula recognition, table recognition, and information extraction.
- **Optimized for Real-World Scenarios**: Designed and optimized for practical business use cases, maintaining robust performance on complex tables, code-heavy documents, seals, and other challenging real-world layouts.
- **Efficient Inference**: With only 0.9B parameters, GLM-OCR supports deployment via vLLM, SGLang, and Ollama, significantly reducing inference latency and compute cost, making it ideal for high-concurrency services and edge deployments.
- **Easy to Use**: Fully open-sourced and equipped with a comprehensive [SDK](https://github.com/zai-org/GLM-OCR) and inference toolchain, offering simple installation, one-line invocation, and smooth integration into existing production pipelines.

## Usage

### vLLM

1. Install vLLM:

   ```bash
   pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
   ```

   or use the Docker image:

   ```bash
   docker pull vllm/vllm-openai:nightly
   ```

2. Launch the server:

   ```bash
   pip install git+https://github.com/huggingface/transformers.git
   vllm serve zai-org/GLM-OCR --allowed-local-media-path / --port 8080
   ```

### SGLang

1. Use the Docker image:

   ```bash
   docker pull lmsysorg/sglang:dev
   ```

   or build from source:

   ```bash
   pip install git+https://github.com/sgl-project/sglang.git#subdirectory=python
   ```

2. Launch the server:

   ```bash
   pip install git+https://github.com/huggingface/transformers.git
   python -m sglang.launch_server --model zai-org/GLM-OCR --port 8080
   ```
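Once a vLLM or SGLang server is running, it can be queried through the OpenAI-compatible chat completions endpoint it exposes. The snippet below is a minimal client sketch, assuming the server started with the commands above is listening on `http://localhost:8080/v1`; the image path `test_image.png` is a placeholder, and the image is sent inline as a base64 data URL.

```python
import base64

from openai import OpenAI

# Assumption: the vLLM/SGLang server from the commands above serves the
# OpenAI-compatible API on localhost:8080; no real API key is needed locally.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

# Encode a local image (placeholder path) as a data URL so the server does not
# need access to local files.
with open("test_image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="zai-org/GLM-OCR",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
                {"type": "text", "text": "Text Recognition:"},
            ],
        }
    ],
    max_tokens=8192,
)
print(response.choices[0].message.content)
```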
### Ollama

1. Download [Ollama](https://ollama.com/download).
2. Run with:

   ```bash
   ollama run glm-ocr
   ```

   Ollama will automatically use the image file path when an image is dragged into the terminal:

   ```bash
   ollama run glm-ocr
   Text Recognition: ./image.png
   ```

### Transformers

Install Transformers from source:

```bash
pip install git+https://github.com/huggingface/transformers.git
```

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_PATH = "zai-org/GLM-OCR"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "test_image.png"},
            {"type": "text", "text": "Text Recognition:"},
        ],
    }
]

processor = AutoProcessor.from_pretrained(MODEL_PATH)
model = AutoModelForImageTextToText.from_pretrained(
    pretrained_model_name_or_path=MODEL_PATH,
    torch_dtype="auto",
    device_map="auto",
)

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
inputs.pop("token_type_ids", None)

generated_ids = model.generate(**inputs, max_new_tokens=8192)
output_text = processor.decode(
    generated_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=False,
)
print(output_text)
```

### Supported Prompts

GLM-OCR currently supports two types of prompt scenarios:

1. **Document Parsing** – extract raw content from documents. Supported task prompts include:

   ```python
   {
       "text": "Text Recognition:",
       "formula": "Formula Recognition:",
       "table": "Table Recognition:"
   }
   ```

2. **Information Extraction** – extract structured information from documents. Prompts must follow a strict JSON schema. For example, to extract personal ID information (the prompt opens with the Chinese instruction 请按下列JSON格式输出图中信息, i.e. "output the information in the image in the following JSON format"):

   ```text
   请按下列JSON格式输出图中信息:
   {
       "id_number": "",
       "last_name": "",
       "first_name": "",
       "date_of_birth": "",
       "address": {
           "street": "",
           "city": "",
           "state": "",
           "zip_code": ""
       },
       "dates": {
           "issue_date": "",
           "expiration_date": ""
       },
       "sex": ""
   }
   ```

⚠️ Note: When using information extraction, the output must strictly adhere to the defined JSON schema to ensure downstream processing compatibility.

## GLM-OCR SDK

We provide an easy-to-use SDK for working with GLM-OCR more efficiently and conveniently. Please see our [GitHub repository](https://github.com/zai-org/GLM-OCR) for more details.

## Acknowledgement

This project is inspired by the excellent work of the following projects and communities:

- [PP-DocLayout-V3](https://huggingface.co/PaddlePaddle/PP-DocLayoutV3)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [MinerU](https://github.com/opendatalab/MinerU)

## License

The GLM-OCR model is released under the MIT License. The complete OCR pipeline integrates [PP-DocLayoutV3](https://huggingface.co/PaddlePaddle/PP-DocLayoutV3) for document layout analysis, which is licensed under the Apache License 2.0. Users should comply with both licenses when using this project.