Muliltipage documents
Our biggest issue with all OCR models so far is handling multip-page documents correctly. These are mutu-page invoice documents which have tables spanning multiple pages (some even have hundreds of pages). So far we have been getting around this with post-processing, but it's flaky at best. The issue is, when running inference one page at a time, there is no context from previous pages (a continued table from a previous page could then get an appropriate header row for example).
Some ideas that would be really nice in a future version to help solve this type of issue:
- multipage support (send in multiple page images per request)
- the ability to supply the previous page image along with the current page to OCR. The previous page would not be extracted and would only serve as additional context
- Similar to the previous page idea above, perhaps instead of supplying the previous page image, instead have the ability to provide the current extraction (all markdown from all pages extracted so far, concatenated together for example) along with an image of the current page to extract. This would greatly help where tables and such span multiple pages.
Ivae tried a few of these during inference, but prompting seems to make LightOn OCR2 pretty fragile
Thanks for taking the time to consider these types of scenarios!
Some tests on v1 revealed that multi-page was working even without explicit multi-page training. But we didn't dig this deeper for lack of benchmarks for this kind of task!
One important thing, LightOnOCR is NOT meant to be prompted; adding any input text will just degrade performance.
It would be interesting to test this out by sending multiple images with no prompt.