Apple users, fear not: you can use MLX too!
If you are like me, love this model, but use a Mac for work, you may have been sad to discover that a long PDF extraction took 2 hours and was unusable. Well, that's OK, because you can also use mlx-vlm as a backend! The model uses Pixtral inside, which mlx-vlm already supports, so the long PyTorch inference code can be condensed down to something as easy as:
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("lightonai/LightOnOCR-2-1B", fix_mistral_regex=True)

url = "https://huggingface.co/datasets/hf-internal-testing/fixtures_ocr/resolve/main/SROIE-receipt.jpeg"
prompt = apply_chat_template(processor, config=model.config, prompt="", num_images=1)

output = generate(model, processor, prompt, image=[url], max_tokens=1024, verbose=False)
print(output.text)
```
I did notice the Pixtral implementation there wasn't using scaled dot-product attention, so I put together a PR to update that, which will hopefully be merged shortly; it results in ~20% faster vision encoding.
In any case, a 2-hour PDF extraction turned into ~25 minutes with this backend instead. You can also easily quantize the model to 4 or 8 bits. After some testing I found that for my use case I could use a 4-bit quantization, turn the scale down to 1 for the PDFs, and spawn multiple workers in parallel (I tried adding batched vision embedding support, but it wasn't very effective, and MLX obviously doesn't support paged attention, so not much to be done there). That got the overall conversion down to ~7 minutes with no practical quality deterioration either: most of the hits were in things like bolding text properly, using the right Markdown heading level, or turning the table of contents into an actual table, none of which I really care about, and performance on my little eval dataset of questions about the document didn't change.
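For the curious, the parallel-worker part is just standard multiprocessing; here's a rough sketch, where `ocr_pages()` is a hypothetical stand-in for each worker loading its own quantized model with mlx-vlm and running inference per page:

```python
# Rough sketch of the parallel-worker setup. ocr_pages() is a placeholder:
# in the real run each worker loads its own copy of the 4-bit model and
# calls generate() per page.
from concurrent.futures import ProcessPoolExecutor

def chunk(pages, n_workers):
    """Split a list of page numbers into up to n_workers contiguous chunks."""
    k, r = divmod(len(pages), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        end = start + k + (1 if i < r else 0)
        chunks.append(pages[start:end])
        start = end
    return [c for c in chunks if c]

def ocr_pages(pages):
    # placeholder for per-worker model load + inference
    return [f"page {p}: <extracted text>" for p in pages]

if __name__ == "__main__":
    pages = list(range(200))
    with ProcessPoolExecutor(max_workers=4) as pool:
        texts = [t for part in pool.map(ocr_pages, chunk(pages, 4)) for t in part]
```

One process per worker matters here since each worker holds its own model weights; threads wouldn't buy you anything for this workload.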
## a100_gpu vs mlx_parallel_opt
Common pages: 214
### Overall Similarity
- Character similarity: 94.05%
- Word similarity: 97.1%
- Line similarity: 76.8%
- Total chars: 523,063 vs 521,690
- Total words: 80,923 vs 81,078
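Incidentally, similarity numbers like these are straightforward to produce with the standard library; a sketch of the kind of comparison I ran (via `difflib.SequenceMatcher`, not necessarily the exact metric behind the numbers above):

```python
# Sketch: sequence similarity at character, word, and line granularity.
# The exact script behind the percentages above may have differed,
# but the idea is the same.
from difflib import SequenceMatcher

def ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()

def compare(text_a: str, text_b: str) -> dict:
    return {
        "char": ratio(text_a, text_b),
        "word": ratio(text_a.split(), text_b.split()),
        "line": ratio(text_a.splitlines(), text_b.splitlines()),
    }
```

Run it page-by-page over both backends' outputs and average, and you get character/word/line percentages of the sort reported above.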
Pretty incredible: it actually came in ~1-2 minutes faster than the full-precision, full-scale model on an A100! Obviously, if I also reduced the quantization and scale on the A100 side it would probably blow past the Mac, but beggars can't be choosers; those things are expensive for a reason. The quantization process is really easy with mlx-vlm, but if there's any interest I could throw the versions I made up onto Hugging Face as well.
Anyways, thanks for the HTML support in this version; my use case is heavily built around nested, incredibly ugly tables split across multiple pages, which Markdown always struggled to reproduce. This has worked perfectly for me so far and enabled a new project idea with just a tiny bit of post-processing to connect split tables at lookup time. Cheers to my goats at LightOn!
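For anyone wondering what that split-table post-processing looks like, the core idea is simple: if one page's HTML ends with a table and the next page opens with one, splice the continuation's rows into the first table. A toy sketch (flat tables only; `stitch_tables()` is my own helper, not anything LightOnOCR ships, and a real version would also reconcile repeated header rows):

```python
# Sketch of "connect split tables at lookup time": merge a table that the
# OCR output split across a page boundary. Handles only flat <table> markup;
# nested tables and repeated headers need more care.
import re

def stitch_tables(pages):
    """Merge tables split across consecutive page boundaries."""
    out = [pages[0]]
    for page in pages[1:]:
        prev = out[-1].rstrip()
        m = re.match(r"\s*<table>(.*?)</table>", page, re.S)
        if prev.endswith("</table>") and m:
            # drop the previous closing tag, append the continuation's rows
            out[-1] = prev[: -len("</table>")] + m.group(1) + "</table>" + page[m.end():]
        else:
            out.append(page)
    return out
```

Running this at lookup time rather than at conversion time means the raw per-page outputs stay untouched on disk.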
EDIT: Tried optimizing my A100 workflow, and yeah, it gets down to ~45 seconds for ~200-page documents, so the Mac is definitely not going to beat that, but the fact that it's within an order of magnitude is still pretty good.
🐐