No text output when running inference with ollama?

#8
by frankslin - opened

I'm using the example image from https://cdn.bigmodel.cn/static/logo/introduction.png.

Apple M2, macOS 15.7.3.

$ ollama -v
ollama version is 0.15.5-rc1
$ sha256sum introduction.png
b01139689cb0682b3a1d6f3f3eda3f481101571654e3a23a1de5bf5520f3c6f5  introduction.png
$ ollama run glm-ocr Text Recognition: ./introduction.png
Added image './introduction.png'
```markdown

```markdown

```markdown

```text

When I tried with another file, I got this error:

Error: an error was encountered while running the model: GGML_ASSERT(a->ne[2] * 4 == b->ne[0]) failed
WARNING: Using native backtrace. Set GGML_BACKTRACE_LLDB for more info.
WARNING: GGML_BACKTRACE_LLDB may cause native MacOS Terminal.app to crash.
See: https://github.com/ggml-org/llama.cpp/pull/17869
0 ollama 0x0000000100e6d620 ggml_print_backtrace + 276


What am I missing?

I ran into the same problem. I hope the researchers take this seriously and get the model working properly.

I am having the same problem serving glm-ocr from ollama.

  • Macbook Air M2, Tahoe 26.2
  • Ollama pre-release 0.15.5

Output for first three images of a non-complex PDF (https://www.fiw.uni-bonn.de/de/forschung/demokratieforschung/team/prof-dr-rudolf-stichweh/papers/pdfs/81_stw_niklas-luhmann-blackwell-companion-to-major-social-theorists.pdf):

Page 1

"<table bordered, no table, but no table, but no table, but no table, no table, no table, no layout, but no layout, or any other content is present. No layout, or any other content is present. No text is present. No layout. No layout. No layout. No layout. No layout."


Page 2

"<table bordered, no table, but no table, but no table, but no table, no table, no layout, rows, columns, rows, columns, rows, or any other content within the image content is present. No text, or any other content is present. No text is present. No layout, just a single row, just a single row, just a single row, just a single row, just a single row, no layout. No layout. No layout. No layout. No layout. No layout. No layout. No layout."


Page 3

Error extracting page: an error was encountered while running the model: GGML_ASSERT(a->ne[2] * 4 == b->ne[0]) failed
WARNING: Using native backtrace. Set GGML_BACKTRACE_LLDB for more info.
WARNING: GGML_BACKTRACE_LLDB may cause native MacOS Terminal.app to crash.
See: https://github.com/ggml-org/llama.cpp/pull/17869
0 ollama 0x0000000101b050c0 ggml_print_backtrace + 276
1 ollama 0x0000000101b052ac ggml_abort + 156
2 ollama 0x0000000101b0d8e8 ggml_rope + 300
3 ollama 0x0000000101b0db44 ggml_rope_multi + 20
4 ollama 0x0000000101aaf2b0 _cgo_7ebcd35a9797_Cfunc_ggml_rope_multi + 64
5 ollama 0x0000000100e6549c ollama + 513180 (status code: 500)

Either increase the context size for the model or reduce the image size. Depending on your available VRAM, Ollama defaults to a context of only 4096 tokens, which is not enough for a large image like introduction.png. You can check the active context after loading the model with ollama ps.

I can confirm that increasing the context size works. I am using a context size of 10240 and have successfully extracted text from images of up to 12 megapixels. Here is the Modelfile I am using:

# Modelfile generated by "ollama show"
FROM glm-ocr:latest

# Increase context size
PARAMETER num_ctx 10240

TEMPLATE {{ .Prompt }}
RENDERER glm-ocr
PARSER glm-ocr
PARAMETER temperature 0
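If you'd rather not create a new model from a Modelfile, the same override can be passed per request through Ollama's REST API (`/api/generate` accepts an `options` object with `num_ctx`). A minimal sketch, assuming a local Ollama server on the default port and the `glm-ocr` model pulled as above; the image path is a placeholder:

```python
import base64
import json
import urllib.request


def build_payload(image_path: str, num_ctx: int = 10240) -> dict:
    """Build an /api/generate request body that overrides the context size."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": "glm-ocr",
        "prompt": "Text Recognition:",
        "images": [image_b64],
        "stream": False,
        # Per-request context override; avoids the 4096-token default.
        "options": {"num_ctx": num_ctx, "temperature": 0},
    }


def run_ocr(image_path: str, host: str = "http://localhost:11434") -> str:
    """Send the OCR request to a running Ollama server and return the text."""
    data = json.dumps(build_payload(image_path)).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With this, `run_ocr("./introduction.png")` should return the recognized text without touching the Modelfile, since per-request options take precedence over the model's defaults.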

Yeah, you're right, mate.

Thank you all for providing the "increase context window" solution.
I did the same and went from gibberish output to the clean, exact text I expected.


It's working for me, but I'd like to have the text as Markdown. Is that possible with only this model?

@marcofal This is the community for the original model, not the Ollama version, so questions here should probably relate to the original model.

Anyway, as far as I know, the model adds Markdown tags automatically when there are multiple sections or code; otherwise the MD output is effectively plaintext. If you need further help, I'd suggest joining their Discord server (link on the model's main page) or an Ollama-related community (GitHub issues is usually not a good place for general questions unrelated to the code).

Regards

Hi NeoHuggingF,
I tried it with both Ollama and vllm, and the output is the same. I wonder why it doesn't mark headers that have a larger font, for example.
Anyway, it is a very good model.

@marcofal
You may want to take a look into these:
https://github.com/zai-org/GLM-OCR?tab=readme-ov-file#configuration
https://github.com/zai-org/GLM-OCR/blob/main/examples/ollama-deploy/README.md

Disclaimer: I am not related to GLM-OCR, just a regular user.
