	---
	license: apache-2.0
	pipeline_tag: image-text-to-text
	language:
	- en
	- fr
	- de
	- es
	- it
	- nl
	- pt
	- sv
	- da
	- zh
	- ja
	library_name: transformers
	tags:
	- ocr
	- document-understanding
	- vision-language
	- pdf
	- tables
	- forms
	---

	<div align="center">
	<img src="lightonocr-banner.png" alt="LightOnOCR-2-1B Banner" width="600"/>
	</div>

	# LightOnOCR-2-1B

	**Best OCR model.** LightOnOCR-2-1B is our flagship OCR model, refined with RLVR (reinforcement learning with verifiable rewards) training for maximum accuracy. We recommend this variant for most OCR tasks.

	## About LightOnOCR-2

	LightOnOCR-2 is an efficient end-to-end 1B-parameter vision-language model for converting documents (PDFs, scans, images) into clean, naturally ordered text without relying on brittle pipelines. This second version is trained on a larger and higher-quality corpus with stronger French, arXiv, and scan coverage, improved LaTeX handling, and cleaner normalization. LightOnOCR-2 achieves state-of-the-art performance on OlmOCR-Bench while being ~9× smaller and significantly faster than competing approaches.

	## Highlights

	* ⚡ Speed: 3.3× faster than Chandra OCR, 1.7× faster than OlmOCR, 5× faster than dots.ocr, 2× faster than PaddleOCR-VL-0.9B, 1.73× faster than DeepSeekOCR
	* 💸 Efficiency: Processes 5.71 pages/s on a single H100 (~493k pages/day) for <$0.01 per 1,000 pages
	* 🧠 End-to-End: Fully differentiable, no external OCR pipeline
	* 🧾 Versatile: Handles tables, receipts, forms, multi-column layouts, and math notation
	* 📍 Image detection: Predicts bounding boxes for embedded images (bbox variants)
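
The pages-per-day figure in the efficiency bullet follows from the per-second throughput by simple arithmetic. A minimal sketch, assuming the throughput is sustained for a full 24 hours; the function names are illustrative and not part of any LightOn tooling:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def pages_per_day(pages_per_second: float) -> int:
    """Daily page volume at a sustained throughput, rounded to whole pages."""
    return round(pages_per_second * SECONDS_PER_DAY)


def seconds_per_thousand_pages(pages_per_second: float) -> float:
    """Wall-clock seconds needed to process 1,000 pages."""
    return 1000 / pages_per_second


print(pages_per_day(5.71))  # 493344, i.e. the ~493k pages/day quoted above
print(round(seconds_per_thousand_pages(5.71)))  # 175
```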

	---

	📄 [Paper](https://arxiv.org/pdf/2601.14251) | 📝 [Blog Post](https://huggingface.co/blog/lightonai/lightonocr-2) | 🚀 [Demo](https://huggingface.co/spaces/lightonai/LightOnOCR-2-1B-Demo) | 📊 [Dataset](https://huggingface.co/datasets/lightonai/LightOnOCR-mix-0126) | 📊 [BBox Dataset](https://huggingface.co/datasets/lightonai/LightOnOCR-bbox-mix-0126) | 📓 [Finetuning Notebook](https://colab.research.google.com/drive/1WjbsFJZ4vOAAlKtcCauFLn_evo5UBRNa?usp=sharing)

	---

	## Model Variants

	| Variant | Description |
	|---------|-------------|
	| [LightOnOCR-2-1B](https://huggingface.co/lightonai/LightOnOCR-2-1B) | Best OCR model |
	| [LightOnOCR-2-1B-base](https://huggingface.co/lightonai/LightOnOCR-2-1B-base) | Base model, ideal for fine-tuning |
	| [LightOnOCR-2-1B-bbox](https://huggingface.co/lightonai/LightOnOCR-2-1B-bbox) | Best model with image bounding boxes |
	| [LightOnOCR-2-1B-bbox-base](https://huggingface.co/lightonai/LightOnOCR-2-1B-bbox-base) | Base bbox model, ideal for fine-tuning |
	| [LightOnOCR-2-1B-ocr-soup](https://huggingface.co/lightonai/LightOnOCR-2-1B-ocr-soup) | Merged variant for extra robustness |
	| [LightOnOCR-2-1B-bbox-soup](https://huggingface.co/lightonai/LightOnOCR-2-1B-bbox-soup) | Merged variant: OCR + bbox combined |

	---

	## Benchmarks

	<div align="center">
	<img src="benchmark.png" alt="OlmOCR-Bench Results" width="900"/>
	</div>

	See the [paper](https://arxiv.org/pdf/2601.14251) for full benchmark details and methodology.

	---

	## Usage with Transformers

	> Note: LightOnOCR-2 requires transformers installed from source (not yet in a stable release).

	```bash
	uv pip install git+https://github.com/huggingface/transformers
	uv pip install pillow pypdfium2
	```

	```python
	import torch
	from transformers import LightOnOcrForConditionalGeneration, LightOnOcrProcessor

	device = "mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"
	dtype = torch.float32 if device == "mps" else torch.bfloat16

	model = LightOnOcrForConditionalGeneration.from_pretrained("lightonai/LightOnOCR-2-1B", torch_dtype=dtype).to(device)
	processor = LightOnOcrProcessor.from_pretrained("lightonai/LightOnOCR-2-1B")

	url = "https://huggingface.co/datasets/hf-internal-testing/fixtures_ocr/resolve/main/SROIE-receipt.jpeg"

	conversation = [{"role": "user", "content": [{"type": "image", "url": url}]}]

	inputs = processor.apply_chat_template(
	    conversation,
	    add_generation_prompt=True,
	    tokenize=True,
	    return_dict=True,
	    return_tensors="pt",
	)
	inputs = {k: v.to(device=device, dtype=dtype) if v.is_floating_point() else v.to(device) for k, v in inputs.items()}

	output_ids = model.generate(**inputs, max_new_tokens=1024)
	generated_ids = output_ids[0, inputs["input_ids"].shape[1]:]
	output_text = processor.decode(generated_ids, skip_special_tokens=True)
	print(output_text)
	```

	---

	## Usage with vLLM

	```bash
	vllm serve lightonai/LightOnOCR-2-1B \
	    --limit-mm-per-prompt '{"image": 1}' --mm-processor-cache-gb 0 --no-enable-prefix-caching
	```

	```python
	import base64
	import io

	import pypdfium2 as pdfium
	import requests

	ENDPOINT = "http://localhost:8000/v1/chat/completions"
	MODEL = "lightonai/LightOnOCR-2-1B"

	# Download PDF from arXiv
	pdf_url = "https://arxiv.org/pdf/2412.13663"
	pdf_response = requests.get(pdf_url, timeout=60)
	pdf_response.raise_for_status()  # fail early on a bad download
	pdf_data = pdf_response.content

	# Open PDF and convert first page to image
	pdf = pdfium.PdfDocument(pdf_data)
	page = pdf[0]
	# Render at 200 DPI (PDF units are 1/72 inch, so scale = 200/72 ≈ 2.78)
	pil_image = page.render(scale=200 / 72).to_pil()

	# Convert to base64
	buffer = io.BytesIO()
	pil_image.save(buffer, format="PNG")
	image_base64 = base64.b64encode(buffer.getvalue()).decode('utf-8')

	# Make request
	payload = {
	    "model": MODEL,
	    "messages": [{
	        "role": "user",
	        "content": [{
	            "type": "image_url",
	            "image_url": {"url": f"data:image/png;base64,{image_base64}"}
	        }]
	    }],
	    "max_tokens": 4096,
	    "temperature": 0.2,
	    "top_p": 0.9,
	}

	response = requests.post(ENDPOINT, json=payload)
	response.raise_for_status()  # surface HTTP errors instead of a KeyError below
	text = response.json()['choices'][0]['message']['content']
	print(text)
	```
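
Because the serve command above limits each prompt to one image, a multi-page document becomes one request per page. The sketch below is one possible way to send those requests concurrently so that vLLM can batch them on the server side. It assumes the same local endpoint as the example above and uses only the standard library for the HTTP call; the helper names (`build_payload`, `ocr_page`, `ocr_pages`) and the default of 8 workers are illustrative choices, not part of vLLM or LightOnOCR.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "lightonai/LightOnOCR-2-1B"


def build_payload(image_base64: str, max_tokens: int = 4096) -> dict:
    """Build a chat-completions payload carrying exactly one page image."""
    return {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [{
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_base64}"},
            }],
        }],
        "max_tokens": max_tokens,
        "temperature": 0.2,
        "top_p": 0.9,
    }


def ocr_page(image_base64: str) -> str:
    """POST one page to the server and return the transcribed text."""
    request = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_payload(image_base64)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


def ocr_pages(pages_base64: list[str], max_workers: int = 8) -> list[str]:
    """OCR several pages concurrently; results keep the input page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_page, pages_base64))
```

Calling `ocr_pages` with a list of base64-encoded PNG pages returns one text string per page, in page order. The worker count is a starting point to tune against your GPU and page sizes.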

	---

	## Rendering and Preprocessing Tips

	* Render PDFs to PNG or JPEG at a target longest dimension of 1540px
	* Maintain aspect ratio to preserve text geometry
	* Use one image per page; batching is supported by vLLM
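
The first two tips can be combined into a single render scale. A minimal sketch, assuming pages are rendered with pypdfium2 as in the vLLM example; `render_scale` and `render_page` are hypothetical helper names, and `page` is expected to be a pypdfium2 `PdfPage`:

```python
def render_scale(width_pt: float, height_pt: float, target_px: int = 1540) -> float:
    """Scale factor that maps the longer page side to `target_px` pixels.

    pypdfium2 reports page sizes in PDF points (1/72 inch), and at scale=1
    it renders one pixel per point.
    """
    return target_px / max(width_pt, height_pt)


def render_page(page, target_px: int = 1540):
    """Render a pypdfium2 `PdfPage` to a PIL image at the target size."""
    width_pt, height_pt = page.get_size()
    scale = render_scale(width_pt, height_pt, target_px)
    # A single uniform scale keeps the aspect ratio intact.
    return page.render(scale=scale).to_pil()


# US Letter is 612 x 792 pt, so the longer side needs a scale of about 1.94.
print(round(render_scale(612, 792), 2))  # 1.94
```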

	---

	## Fine-tuning

	LightOnOCR-2 is fully differentiable and supports:

	* LoRA fine-tuning
	* Domain adaptation (receipts, scientific articles, forms, etc.)
	* Multilingual fine-tuning with task-specific corpora

	For fine-tuning, we recommend starting with the [LightOnOCR-2-1B-base](https://huggingface.co/lightonai/LightOnOCR-2-1B-base) variant.

	---

	## License

	Apache License 2.0

	---

	## Citation

	```bibtex
	@misc{lightonocr2_2026,
	  title        = {LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR},
	  author       = {Said Taghadouini and Adrien Cavaill\`{e}s and Baptiste Aubertin},
	  year         = {2026},
	  howpublished = {\url{https://arxiv.org/pdf/2601.14251}}
	}
	```