Commit ffa9487 by Bapt120 · 1 Parent(s): 30d9e24

Create README.md

  1. README.md +198 -0
README.md ADDED
@@ -0,0 +1,198 @@
+ ---
+ license: apache-2.0
+ pipeline_tag: image-to-text
+ language:
+ - en
+ - fr
+ - de
+ - es
+ - it
+ - nl
+ - pt
+ - sv
+ - da
+ - zh
+ - ja
+ library_name: transformers
+ tags:
+ - ocr
+ - document-understanding
+ - vision-language
+ - pdf
+ - tables
+ - forms
+ ---
+
+ <div align="center">
+ <img src="lightonocr-banner.png" alt="LightOnOCR-2-1B-base Banner" width="600"/>
+ </div>
+
+ # LightOnOCR-2-1B-base
+
+ **Base model for fine-tuning.** This is the pre-RLVR checkpoint with strong OCR capabilities, ideal as a starting point for domain adaptation and custom fine-tuning.
+
+ ## Highlights
+
+ * ⚡ **Speed:** 5× faster than dots.ocr, 2× faster than PaddleOCR-VL-0.9B, 1.73× faster than DeepSeekOCR
+ * 💸 **Efficiency:** Processes 5.71 pages/s on a single H100 (~493k pages/day) for **<$0.01 per 1,000 pages**
+ * 🧠 **End-to-End:** Fully differentiable, no external OCR pipeline
+ * 🧾 **Versatile:** Handles tables, receipts, forms, multi-column layouts, and math notation
+ * 📍 **Image detection:** Predicts bounding boxes for embedded images (bbox variants)
+
+ ---
+
+ 📄 **[Paper](https://huggingface.co/papers/lightonocr-2)** | 📝 **[Blog Post](https://huggingface.co/blog/lightonai/lightonocr-2)** | 🚀 **[Demo](https://huggingface.co/spaces/lightonai/LightOnOCR-2-Demo)** | 📊 **[Dataset](https://huggingface.co/datasets/lightonai/LightOnOCR-mix-0126)**
+
+ ---
+
+ ## Model Variants
+
+ | Variant | Description |
+ |---------|-------------|
+ | **[LightOnOCR-2-1B](https://huggingface.co/lightonai/LightOnOCR-2-1B)** | Best OCR model (recommended) |
+ | **[LightOnOCR-2-1B-base](https://huggingface.co/lightonai/LightOnOCR-2-1B-base)** | Base model, ideal for fine-tuning |
+ | **[LightOnOCR-2-1B-bbox](https://huggingface.co/lightonai/LightOnOCR-2-1B-bbox)** | Best model with image bounding boxes |
+ | **[LightOnOCR-2-1B-bbox-base](https://huggingface.co/lightonai/LightOnOCR-2-1B-bbox-base)** | Base bbox model, ideal for fine-tuning |
+ | **[LightOnOCR-2-1B-ocr-soup](https://huggingface.co/lightonai/LightOnOCR-2-1B-ocr-soup)** | Merged variant for extra robustness |
+ | **[LightOnOCR-2-1B-bbox-soup](https://huggingface.co/lightonai/LightOnOCR-2-1B-bbox-soup)** | Merged variant: OCR + bbox combined |
+
+ ---
+
+ ## Benchmarks
+
+ <div align="center">
+ <img src="benchmark_placeholder.png" alt="OlmOCR-Bench Results" width="900"/>
+ </div>
+
+ *See the [paper](https://huggingface.co/papers/lightonocr-2) for full benchmark details and methodology.*
+
+ ---
+
+ ## Usage with Transformers
+
+ > **Note:** LightOnOCR-2 requires transformers installed from source (not yet in a stable release).
+
+ ```bash
+ uv pip install git+https://github.com/huggingface/transformers
+ uv pip install pillow pypdfium2
+ ```
+
+ ```python
+ import torch
+ from transformers import LightOnOcrForConditionalGeneration, LightOnOcrProcessor
+
+ # Pick the best available device; MPS runs in float32, other devices in bfloat16.
+ device = "mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"
+ dtype = torch.float32 if device == "mps" else torch.bfloat16
+
+ model = LightOnOcrForConditionalGeneration.from_pretrained("lightonai/LightOnOCR-2-1B-base", torch_dtype=dtype).to(device)
+ processor = LightOnOcrProcessor.from_pretrained("lightonai/LightOnOCR-2-1B-base")
+
+ url = "https://huggingface.co/datasets/hf-internal-testing/fixtures_ocr/resolve/main/SROIE-receipt.jpeg"
+
+ # The prompt is a single user turn that contains only the page image.
+ conversation = [{"role": "user", "content": [{"type": "image", "url": url}]}]
+
+ inputs = processor.apply_chat_template(
+     conversation,
+     add_generation_prompt=True,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+ )
+ # Move every tensor to the device; cast only floating-point tensors to the model dtype.
+ inputs = {k: v.to(device=device, dtype=dtype) if v.is_floating_point() else v.to(device) for k, v in inputs.items()}
+
+ output_ids = model.generate(**inputs, max_new_tokens=1024)
+ # Drop the prompt tokens so that only the newly generated text is decoded.
+ generated_ids = output_ids[0, inputs["input_ids"].shape[1]:]
+ output_text = processor.decode(generated_ids, skip_special_tokens=True)
+ print(output_text)
+ ```
+
+ ---
+
+ ## Usage with vLLM
+
+ ```bash
+ vllm serve lightonai/LightOnOCR-2-1B-base \
+     --limit-mm-per-prompt '{"image": 1}' --mm-processor-cache-gb 0 --no-enable-prefix-caching
+ ```
+
+ ```python
+ import base64
+ import io
+
+ import pypdfium2 as pdfium
+ import requests
+
+ ENDPOINT = "http://localhost:8000/v1/chat/completions"
+ MODEL = "lightonai/LightOnOCR-2-1B-base"
+
+ # Download PDF from arXiv
+ pdf_url = "https://arxiv.org/pdf/2412.13663"
+ pdf_response = requests.get(pdf_url, timeout=60)
+ pdf_response.raise_for_status()
+ pdf_data = pdf_response.content
+
+ # Open PDF and convert first page to image
+ pdf = pdfium.PdfDocument(pdf_data)
+ page = pdf[0]
+ # Render at 200 DPI (PDF units are 1/72 inch, so scale = 200/72 ≈ 2.78)
+ pil_image = page.render(scale=200 / 72).to_pil()
+
+ # Convert to base64
+ buffer = io.BytesIO()
+ pil_image.save(buffer, format="PNG")
+ image_base64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
+
+ # Make request
+ payload = {
+     "model": MODEL,
+     "messages": [{
+         "role": "user",
+         "content": [{
+             "type": "image_url",
+             "image_url": {"url": f"data:image/png;base64,{image_base64}"}
+         }]
+     }],
+     "max_tokens": 4096,
+     "temperature": 0.2,
+     "top_p": 0.9,
+ }
+
+ response = requests.post(ENDPOINT, json=payload, timeout=300)
+ response.raise_for_status()  # fail loudly on HTTP errors instead of a KeyError below
+ text = response.json()["choices"][0]["message"]["content"]
+ print(text)
+ ```
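
Each request carries one page, so a multi-page document becomes a list of independent requests. The following is a minimal sketch of building those payloads, assuming every page has already been rendered to PNG bytes. `build_page_payload` and `build_document_payloads` are hypothetical helper names (not part of vLLM or LightOnOCR), and the sampling values are copied from the request in the example above.

```python
import base64


def build_page_payload(png_bytes: bytes, model: str, max_tokens: int = 4096) -> dict:
    """Build one chat-completions payload for a single rendered page (hypothetical helper)."""
    image_base64 = base64.b64encode(png_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [{
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_base64}"},
            }],
        }],
        "max_tokens": max_tokens,
        "temperature": 0.2,
        "top_p": 0.9,
    }


def build_document_payloads(pages: list[bytes], model: str) -> list[dict]:
    """Return one payload per page, preserving page order."""
    return [build_page_payload(png, model) for png in pages]


# Placeholder bytes stand in for real PNG data here; a real caller would pass
# the bytes produced by saving each rendered page as PNG.
payloads = build_document_payloads([b"page-1", b"page-2"], "lightonai/LightOnOCR-2-1B-base")
print(len(payloads))
```

Each payload can then be posted to the endpoint on its own, sequentially or from a small thread pool, which leaves request scheduling to the server.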
+
+ ---
+
+ ## Rendering and Preprocessing Tips
+
+ * Render PDFs to **PNG** or **JPEG** at a target longest dimension of **1540px**
+ * Maintain aspect ratio to preserve text geometry
+ * Use one image per page; batching supported by vLLM
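
The first two tips can be combined into a single render scale. The following is a minimal sketch, assuming the page size is given in PDF points (1/72 inch), which is what pypdfium2's `page.get_size()` returns. `scale_for_longest_side` is a hypothetical helper name, and 1540 is the target from the list above.

```python
def scale_for_longest_side(width_pt: float, height_pt: float, target_px: int = 1540) -> float:
    """Return the uniform scale that maps the longer page side to target_px pixels."""
    if width_pt <= 0 or height_pt <= 0:
        raise ValueError("page dimensions must be positive")
    # A single scale factor for both axes keeps the aspect ratio unchanged.
    return target_px / max(width_pt, height_pt)


# US Letter is 612 x 792 points, so the 792-point side maps to 1540 px.
scale = scale_for_longest_side(612, 792)
print(round(scale, 3))  # 1.944
```

With pypdfium2 the result would be passed as `page.render(scale=scale).to_pil()`; because the same factor applies to both axes, text geometry is preserved.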
+
+ ---
+
+ ## Fine-tuning
+
+ LightOnOCR-2-1B-base is fully differentiable and supports:
+
+ * LoRA fine-tuning
+ * Domain adaptation (receipts, scientific articles, forms, etc.)
+ * Multilingual fine-tuning with task-specific corpora
+ * Custom RLVR training with your own reward functions
+
+ ---
+
+ ## License
+
+ Apache License 2.0
+
+ ---
+
+ ## Citation
+
+ ```bibtex
+ @misc{lightonocr2_2025,
+   title = {LightOnOCR: End-to-End, Multilingual, Efficient, State-of-the-Art Vision-Language Model for OCR},
+   author = {Said Taghadouini and Adrien Cavaill\`{e}s and Baptiste Aubertin},
+   year = {2025},
+   howpublished = {\url{https://huggingface.co/blog/lightonai/lightonocr-2}}
+ }
+ ```