evaluate it on the OLMOCR benchmark
Are there plans to evaluate it on the olmOCR benchmark and compare it with LightOnOCR-2?
We found quite a few issues with olmOCR-bench during our evaluation, as mentioned in my response to this issue.
@yeekal
Currently, the evaluation metrics used by olmOCR-bench have some limitations and cannot effectively and fairly assess a model’s true capability in document parsing. For example, as shown in the figure, olmOCR-bench splits multi-line formulas into individual single-line formulas for evaluation, which does not comply with common standards for formula recognition. This results in abnormally high accuracy for models whose outputs match the training data distribution.
Additionally, using pass rate as the sole metric is too strict. For a complex formula, if the model misrecognizes just one character, the score is 0; if it misrecognizes 100 characters, the score is still 0. This fails to reflect the actual performance differences between models.
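The pass-rate criticism above can be made concrete with a small sketch (illustrative only, not olmOCR-bench's actual scoring code; the formula strings and function names are hypothetical): a binary exact-match score treats a one-character slip and a fully garbled output identically, while a graded character-level similarity separates them.

```python
# Illustrative comparison of a binary pass/fail metric vs. a graded
# character-level similarity for formula-recognition outputs.
from difflib import SequenceMatcher

def pass_rate(pred: str, target: str) -> float:
    """Binary exact-match scoring: any single wrong character yields 0."""
    return 1.0 if pred == target else 0.0

def char_similarity(pred: str, target: str) -> float:
    """Graded scoring in [0, 1] based on matching character runs."""
    return SequenceMatcher(None, pred, target).ratio()

target       = r"\frac{a+b}{c} = \sum_{i=1}^{n} x_i"
one_char_off = r"\frac{a+b}{c} = \sum_{i=1}^{n} x_j"  # one wrong character
garbled      = r"frac a b c sum x"                    # heavily damaged

# Pass rate scores both errors identically as total failures...
assert pass_rate(one_char_off, target) == 0.0
assert pass_rate(garbled, target) == 0.0
# ...while a character-level metric reflects the difference in quality.
assert char_similarity(one_char_off, target) > 0.9
assert char_similarity(garbled, target) < 0.7
```

`difflib.SequenceMatcher` from the standard library is used here only for convenience; a normalized Levenshtein distance or a tree-edit distance over parsed LaTeX would serve the same argument.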
Their capabilities should be compared on more difficult documents; documents that are too simple cannot distinguish between the models. Also, can the model output bounding boxes (bbox) for images?



