---
license: apache-2.0
pipeline_tag: image-text-to-text
language:
- en
- fr
- de
- es
- it
- nl
- pt
- sv
- da
- zh
- ja
library_name: transformers
tags:
- ocr
- document-understanding
- vision-language
- pdf
- tables
- forms
---
*LightOnOCR-2-1B-base banner*
# LightOnOCR-2-1B-base

**Base model for fine-tuning.** This is the pre-RLVR checkpoint with strong OCR capabilities, ideal as a starting point for domain adaptation and custom fine-tuning.

## About LightOnOCR-2

LightOnOCR-2 is an efficient end-to-end 1B-parameter vision-language model for converting documents (PDFs, scans, images) into clean, naturally ordered text without relying on brittle pipelines. This second version is trained on a larger and higher-quality corpus with stronger French, arXiv, and scan coverage, improved LaTeX handling, and cleaner normalization.

LightOnOCR-2 achieves state-of-the-art performance on OlmOCR-Bench while being ~9× smaller and significantly faster than competing approaches.

## Highlights

* ⚡ **Speed:** 3.3× faster than Chandra OCR, 1.7× faster than OlmOCR, 5× faster than dots.ocr, 2× faster than PaddleOCR-VL-0.9B, 1.73× faster than DeepSeekOCR
* 💸 **Efficiency:** Processes 5.71 pages/s on a single H100 (~493k pages/day) for **<$0.01 per 1,000 pages**
* 🧠 **End-to-End:** Fully differentiable, no external OCR pipeline
* 🧾 **Versatile:** Handles tables, receipts, forms, multi-column layouts, and math notation
* 📍 **Image detection:** Predicts bounding boxes for embedded images (bbox variants)

---

📄 **[Paper](https://arxiv.org/pdf/2601.14251)** | 📝 **[Blog Post](https://huggingface.co/blog/lightonai/lightonocr-2)** | 🚀 **[Demo](https://huggingface.co/spaces/lightonai/LightOnOCR-2-1B-Demo)** | 📊 **[Dataset](https://huggingface.co/datasets/lightonai/LightOnOCR-mix-0126)** | 📊 **[BBox Dataset](https://huggingface.co/datasets/lightonai/LightOnOCR-bbox-mix-0126)** | 📓 **[Finetuning Notebook](https://colab.research.google.com/drive/1WjbsFJZ4vOAAlKtcCauFLn_evo5UBRNa?usp=sharing)**

---

## Model Variants

| Variant | Description |
|---------|-------------|
| **[LightOnOCR-2-1B](https://huggingface.co/lightonai/LightOnOCR-2-1B)** | Best OCR model |
| **[LightOnOCR-2-1B-base](https://huggingface.co/lightonai/LightOnOCR-2-1B-base)** | Base model, ideal for fine-tuning |
| **[LightOnOCR-2-1B-bbox](https://huggingface.co/lightonai/LightOnOCR-2-1B-bbox)** | Best model with image bounding boxes |
| **[LightOnOCR-2-1B-bbox-base](https://huggingface.co/lightonai/LightOnOCR-2-1B-bbox-base)** | Base bbox model, ideal for fine-tuning |
| **[LightOnOCR-2-1B-ocr-soup](https://huggingface.co/lightonai/LightOnOCR-2-1B-ocr-soup)** | Merged variant for extra robustness |
| **[LightOnOCR-2-1B-bbox-soup](https://huggingface.co/lightonai/LightOnOCR-2-1B-bbox-soup)** | Merged variant: OCR + bbox combined |

---

## Benchmarks
*OlmOCR-Bench results (benchmark figure)*
*See the [paper](https://arxiv.org/pdf/2601.14251) for full benchmark details and methodology.*

---

## Usage with Transformers

> **Note:** LightOnOCR-2 requires transformers installed from source (not yet in a stable release).

```bash
uv pip install git+https://github.com/huggingface/transformers
uv pip install pillow pypdfium2
```

```python
import torch
from transformers import LightOnOcrForConditionalGeneration, LightOnOcrProcessor

# Prefer Apple MPS, then CUDA, then CPU; MPS requires float32.
device = "mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float32 if device == "mps" else torch.bfloat16

model = LightOnOcrForConditionalGeneration.from_pretrained("lightonai/LightOnOCR-2-1B-base", torch_dtype=dtype).to(device)
processor = LightOnOcrProcessor.from_pretrained("lightonai/LightOnOCR-2-1B-base")

url = "https://huggingface.co/datasets/hf-internal-testing/fixtures_ocr/resolve/main/SROIE-receipt.jpeg"
conversation = [{"role": "user", "content": [{"type": "image", "url": url}]}]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
# Cast floating-point tensors to the model dtype; keep integer tensors (ids, masks) as-is.
inputs = {k: v.to(device=device, dtype=dtype) if v.is_floating_point() else v.to(device) for k, v in inputs.items()}

output_ids = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt.
generated_ids = output_ids[0, inputs["input_ids"].shape[1]:]
output_text = processor.decode(generated_ids, skip_special_tokens=True)
print(output_text)
```

---

## Usage with vLLM

```bash
vllm serve lightonai/LightOnOCR-2-1B-base \
  --limit-mm-per-prompt '{"image": 1}' \
  --mm-processor-cache-gb 0 \
  --no-enable-prefix-caching
```

```python
import base64
import requests
import pypdfium2 as pdfium
import io

ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "lightonai/LightOnOCR-2-1B-base"

# Download PDF from arXiv
pdf_url = "https://arxiv.org/pdf/2412.13663"
pdf_data = requests.get(pdf_url).content

# Open PDF and convert first page to image
pdf = pdfium.PdfDocument(pdf_data)
page = pdf[0]

# Render at 200 DPI (scale factor = 200/72 ≈ 2.77)
pil_image = page.render(scale=2.77).to_pil()

# Convert to base64
buffer = io.BytesIO()
pil_image.save(buffer, format="PNG")
image_base64 = base64.b64encode(buffer.getvalue()).decode('utf-8')

# Make request
payload = {
    "model": MODEL,
    "messages": [{
        "role": "user",
        "content": [{
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{image_base64}"}
        }]
    }],
    "max_tokens": 4096,
    "temperature": 0.2,
    "top_p": 0.9,
}

response = requests.post(ENDPOINT, json=payload)
text = response.json()['choices'][0]['message']['content']
print(text)
```

---

## Rendering and Preprocessing Tips

* Render PDFs to **PNG** or **JPEG** with a target longest dimension of **1540px** (see the rendering sketch at the end of this card)
* Maintain the aspect ratio to preserve text geometry
* Use one image per page; batching is supported by vLLM

---

## Fine-tuning

LightOnOCR-2-1B-base is fully differentiable and supports:

* LoRA fine-tuning (a minimal sketch is included at the end of this card; the fine-tuning notebook linked above covers the full workflow)
* Domain adaptation (receipts, scientific articles, forms, etc.)
* Multilingual fine-tuning with task-specific corpora
* Custom RLVR training with your own reward functions

---

## License

Apache License 2.0

---

## Citation

```bibtex
@misc{lightonocr2_2026,
  title = {LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR},
  author = {Said Taghadouini and Adrien Cavaill\`{e}s and Baptiste Aubertin},
  year = {2026},
  howpublished = {\url{https://arxiv.org/pdf/2601.14251}}
}
```
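---

## Example: Rendering at the Recommended Resolution

The preprocessing tips above recommend a longest dimension of 1540px. The sketch below shows one way to hit that target with pypdfium2 (the same library used in the vLLM example); the constant name and the choice of sample PDF are illustrative assumptions, not an official preprocessing script.

```python
import io

import pypdfium2 as pdfium
import requests

TARGET_LONGEST_SIDE = 1540  # px, per the rendering tips above (illustrative constant)

# Fetch a sample PDF (same arXiv paper as in the vLLM example) and open its first page.
pdf_data = requests.get("https://arxiv.org/pdf/2412.13663").content
pdf = pdfium.PdfDocument(pdf_data)
page = pdf[0]

# Page size in PDF points (1 pt = 1/72 inch).
width_pt, height_pt = page.get_size()

# Scale so the longest side of the rendered bitmap lands near 1540px;
# scaling both axes by the same factor keeps the aspect ratio intact.
scale = TARGET_LONGEST_SIDE / max(width_pt, height_pt)
pil_image = page.render(scale=scale).to_pil()

# Encode to an in-memory PNG, ready to be base64-encoded as in the vLLM example.
buffer = io.BytesIO()
pil_image.save(buffer, format="PNG")
print(pil_image.size)  # the longest side should be close to 1540px
```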
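---

## Example: LoRA Fine-tuning Sketch

The fine-tuning section above lists LoRA as one supported approach. Below is a minimal, unofficial sketch using the `peft` library; the hyperparameters and the `target_modules` names are assumptions that may need adjusting to this model's actual module names, so treat the fine-tuning notebook linked above as the authoritative reference.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import LightOnOcrForConditionalGeneration

# Load the base (pre-RLVR) checkpoint as the starting point for adaptation.
model = LightOnOcrForConditionalGeneration.from_pretrained(
    "lightonai/LightOnOCR-2-1B-base",
    torch_dtype=torch.bfloat16,
)

# Attach LoRA adapters to the attention projections.
# NOTE: the target module names below are common defaults and an assumption here;
# inspect model.named_modules() to confirm the right names for this architecture.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

From here, the adapted model can be trained with a standard `Trainer` loop on inputs built with `LightOnOcrProcessor`, as in the Transformers usage example above.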