Add OlmOCRBench evaluation results

#43
by staghado - opened

This PR ensures your model shows up at https://huggingface.co/datasets/allenai/olmOCR-bench
The evaluation was done through the official SDK.

Z.ai org

@staghado Thanks for running the evaluation and submitting this PR! We appreciate you taking the time to benchmark the model.

However, the reported metrics look a bit unusual to us. We’re planning to rerun the evaluation on our side using the official SDK to double-check the results. We’ll follow up once we’ve reproduced and verified the numbers.

Thanks again for the effort!

Z.ai org

Also, could you confirm the inference setup you used? For example, did you run inference via the MaaS API, or through the SDK provided in our GitHub repo (https://github.com/zai-org/GLM-OCR)? Knowing the exact setup would help us reproduce the evaluation more accurately.

Z.ai org

It seems the evaluation was run using the ZAI API for inference. We’ll try reproducing the results with the same setup on our side. Thanks!

Thanks for looking into this! Here's what I did:

I used the ZAI Python SDK (zai-sdk==0.2.2) with the layout_parsing.create endpoint. The olmOCR-bench PDFs were pre-rendered to PNG at 200 DPI with a max side length of 1540px (aspect ratio preserved; native resolution kept if smaller). Each image was processed 3 times and test pass rates were averaged across repeats. I then ran the official olmocr.bench.benchmark evaluation script with the standard test JSONL files.
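The resizing rule described above (longest side capped at 1540 px, aspect ratio preserved, native resolution kept if smaller) can be sketched as a small helper. This is an illustrative reconstruction, not the exact script used; the full extraction script is in the gist mentioned below.

```python
MAX_SIDE = 1540  # longest-side cap applied before sending rendered pages to the API

def target_size(width: int, height: int, max_side: int = MAX_SIDE) -> tuple[int, int]:
    """Compute the output size for a rendered page image.

    Preserves the aspect ratio and caps the longest side at max_side;
    images already within the limit keep their native resolution.
    """
    longest = max(width, height)
    if longest <= max_side:
        return (width, height)
    scale = max_side / longest
    return (round(width * scale), round(height * scale))
```

For example, a 3000x1500 render comes out as 1540x770, while a 1000x800 render is passed through unchanged.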

For context, I had previously run GLM-OCR standalone using vLLM with just the "Text Recognition:" prompt (no layout detection), which scored 67.5% overall (excl. h&f). The per-category scores largely match between the two setups, except for tables (42.5% → 77.6%), which makes sense since the API includes layout detection that routes table regions to the "Table Recognition:" prompt. The other categories show only minor differences, which suggests the evaluation is consistent across setups.

| Category | vLLM (w/o layout) | ZAI API (with layout) |
|---|---|---|
| arxiv_math | 80.4% | 80.7% |
| multi_column | 79.9% | 76.7% |
| old_scans_math | 74.9% | 68.3% |
| old_scans | 39.9% | 37.6% |
| long_tiny_text | 87.6% | 86.9% |
| table_tests | 42.5% | 77.6% |
| **Overall (excl. h&f)** | **67.5%** | **75.2%** |

The full extraction script is available as a gist.

Hope this helps reproduce!

Z.ai org

@staghado Hi, thanks a lot for the detailed explanation and for sharing your setup and results.
We really appreciate you taking the time to run such a thorough evaluation of our model and to document the pipeline so clearly. This will be very helpful as we iterate on and improve future versions of the model.

@iyuge2 Hello,
Glad this helps. Please merge this once you have reproduced the results with this setup, or post your results and I will update.
Also, for the future, I think olmOCR-bench should be one of the primary benchmarks to report, as it does not suffer from edit-distance biases and is closer to how a human would evaluate OCR (i.e., it tests that various facts check out, such as reading order, table cells, formulas, etc.).
Similarly, comparing against existing SOTA models would also help the community. For instance, LightOnOCR-2-1B was released before GLM-OCR and scores considerably higher on olmOCR-bench, but was not included in the comparison.
Thanks
