Commit 429e471 (verified) by ToluClassics · 1 parent: e5aab0f

Update README.md

  1. README.md +184 -2
README.md CHANGED
@@ -1,7 +1,7 @@
  ---
  license: apache-2.0
  datasets:
- - allenai/olmo-mix-1124
  language:
  - yo
  - sw
@@ -19,4 +19,186 @@ tags:
  - Text-Generation
  - Optical-Character-Recognition
  - Low-Resource-Languages
- ---
  ---
  license: apache-2.0
  datasets:
+ - allenai/olmOCR-mix-0225
  language:
  - yo
  - sw

  - Text-Generation
  - Optical-Character-Recognition
  - Low-Resource-Languages
+ ---
+
+ # KarantaOCR: Efficient Document Processing for African Languages
+
+ ## Model Description
+
+ **KarantaOCR** is an open-source document OCR and processing model designed for **high-accuracy text extraction in African languages**.
+ The model focuses on preserving language-specific characters and diacritics that are often lost, normalized, or mis-transcribed by existing OCR systems.
+
+ KarantaOCR is fine-tuned from [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct), a vision-language model that combines a strong vision encoder with a large language model.
+ Through targeted curriculum fine-tuning, KarantaOCR extends these capabilities to robust document understanding across diverse PDF formats and multilingual settings.
+
+ ## Training Data
+
+ KarantaOCR was trained using a **two-stage curriculum fine-tuning strategy**.
+
+ ### Stage 1: General OCR Training
+
+ * **100,000 documents** sampled from [AllenAI olmOCR-mix-0225](https://huggingface.co/datasets/allenai/olmOCR-mix-0225)
+ * Purpose: learn general OCR skills across layouts, fonts, tables, and document structures
+
+ ### Stage 2: African Language Fine-Tuning
+
+ * **50,000 PDFs** containing text in **10 African languages**, crawled from the web
+ * Domains include:
+
+   * Religious texts
+   * Legal documents
+   * Dictionaries
+   * Novels
+   * Other long-form and structured documents
+
+ This stage emphasizes accurate transcription of **diacritics, special characters, and region-specific typography**.
+
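One way to see whether an OCR output has lost the diacritics that Stage 2 targets is to count Unicode combining marks and to compare normalized strings. The sketch below is illustrative only and is not part of KarantaOCR or its evaluation; the helper names `count_combining_marks` and `diacritics_preserved` are hypothetical, and the sample word is written with explicit escape sequences.

```python
import unicodedata


def count_combining_marks(text: str) -> int:
    """Count combining marks (accents, underdots) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if unicodedata.combining(ch))


def diacritics_preserved(reference: str, hypothesis: str) -> bool:
    """Hypothetical check: True when both strings are equal after NFC normalization.

    NFC makes precomposed and decomposed spellings of the same text compare equal.
    """
    return unicodedata.normalize("NFC", reference) == unicodedata.normalize("NFC", hypothesis)


# Yoruba-style word with explicit combining marks (dot below U+0323, grave U+0300)
reference = "o\u0323\u0300ro\u0323\u0300"
stripped = "oro"  # what a diacritic-dropping OCR system might return

print(count_combining_marks(reference))
print(count_combining_marks(stripped))
print(diacritics_preserved(reference, stripped))
```

A strict equality check like this is deliberately simple; a character error rate over NFC-normalized text would give a graded score instead.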
+ ---
+
+ ## Capabilities
+
+ KarantaOCR supports:
+
+ * High-accuracy **text extraction** from PDFs
+ * **Table extraction** and structured document understanding
+ * Robust handling of:
+
+   * Multi-column layouts
+   * Headers and footers
+   * Mixed scanned and digital PDFs
+
+ While improved performance on African languages was our priority, KarantaOCR **maintains strong performance on English and other high-resource languages**, making it suitable for mixed-language document collections.
+
+ ## Evaluation
+
+ KarantaOCR is evaluated on the olmOCR benchmark using pass-rate accuracy. Scores are reported as averages across JSONL files with 95% confidence intervals.
+
+ | Model           | Avg Score ↑ | 95% CI |
+ | --------------- | ----------- | ------ |
+ | **KarantaOCR**  | **74.1%**   | ± 1.1  |
+ | RoLMOCR         | 74.4%       | ± 1.0  |
+ | NanoNetsOCR-2   | 68.8%       | ± 1.1  |
+ | OLMOCR          | 65.8%       | ± 0.9  |
+
+ ### Results by Document Type (%)
+
+ | JSONL File      | KarantaOCR | RoLMOCR  | NanoNetsOCR-2 | OLMOCR   |
+ | --------------- | ---------- | -------- | ------------- | -------- |
+ | arxiv_math      | 74.2       | **76.8** | 73.7          | 68.9     |
+ | baseline        | **99.4**   | 97.9     | **99.5**      | 85.0     |
+ | headers_footers | **95.3**   | 94.1     | 32.8          | **96.4** |
+ | long_tiny_text  | 72.2       | 61.3     | **92.1**      | 81.9     |
+ | multi_column    | 75.6       | 70.0     | **82.5**      | **84.0** |
+ | old_scans       | 41.3       | 42.4     | 41.4          | **42.0** |
+ | old_scans_math  | 70.3       | **80.1** | 44.1          | 0.0      |
+ | table_tests     | 64.3       | 72.2     | **84.2**      | 68.3     |
+
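The average scores in the summary table are consistent, to rounding, with an unweighted mean of the eight per-file pass rates listed by document type. The sketch below recomputes those means from the reported numbers; the variable and function names are illustrative, and the 95% confidence intervals are not recomputed because they depend on per-test results that are not listed here.

```python
# Per-file pass rates (%) copied from the per-document-type table, in row order:
# arxiv_math, baseline, headers_footers, long_tiny_text,
# multi_column, old_scans, old_scans_math, table_tests
scores = {
    "KarantaOCR": [74.2, 99.4, 95.3, 72.2, 75.6, 41.3, 70.3, 64.3],
    "RoLMOCR": [76.8, 97.9, 94.1, 61.3, 70.0, 42.4, 80.1, 72.2],
    "NanoNetsOCR-2": [73.7, 99.5, 32.8, 92.1, 82.5, 41.4, 44.1, 84.2],
    "OLMOCR": [68.9, 85.0, 96.4, 81.9, 84.0, 42.0, 0.0, 68.3],
}


def macro_average(values):
    """Unweighted mean across JSONL files (each file counts equally)."""
    return sum(values) / len(values)


averages = {model: macro_average(vals) for model, vals in scores.items()}
for model, avg in averages.items():
    print(f"{model}: {avg:.2f}")
```

Because every file carries equal weight, a single category such as `old_scans_math` (0.0 for OLMOCR) moves a model's average by up to 12.5 points.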
+ ## How to Use
+
+ KarantaOCR processes PDF documents by rendering pages into images and combining them with structured prompts for inference.
+
+ ### Load the Model and Processor
+
+ ```python
+ import torch
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+
+ def load_model(model_path: str, device_map: str = "auto", dtype: str = "auto"):
+     # dtype is either "auto" or the name of a torch dtype, e.g. "bfloat16"
+     model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+         model_path,
+         torch_dtype=getattr(torch, dtype) if dtype != "auto" else "auto",
+         device_map=device_map,
+     )
+     return model
+
+ def load_processor(processor_name: str, min_pixels=None, max_pixels=None):
+     # Pass the pixel bounds only when both are given; otherwise use the processor defaults
+     if min_pixels and max_pixels:
+         return AutoProcessor.from_pretrained(
+             processor_name, min_pixels=min_pixels, max_pixels=max_pixels
+         )
+     return AutoProcessor.from_pretrained(processor_name)
+ ```
+
+ ### Prepare a PDF Page for Inference
+
+ ```python
+ import base64
+ import subprocess
+
+ # Assumption: the original snippet does not define this helper; it is taken here from the olmocr package.
+ from olmocr.data.renderpdf import get_pdf_media_box_width_height
+
+ # Assumption: the original snippet does not define this constant; 2048 mirrors the default below.
+ TARGET_IMAGE_DIM = 2048
+
+ def render_pdf_to_base64png(
+     local_pdf_path: str, page_num: int, target_longest_image_dim: int = 2048
+ ) -> str:
+     # Longest side of the page's media box, in PDF points
+     longest_dim = max(get_pdf_media_box_width_height(local_pdf_path, page_num))
+
+     # Convert PDF page to PNG using pdftoppm
+     pdftoppm_result = subprocess.run(
+         [
+             "pdftoppm",
+             "-png",
+             "-f",
+             str(page_num),
+             "-l",
+             str(page_num),
+             "-r",
+             str(
+                 target_longest_image_dim * 72 / longest_dim
+             ),  # resolution in DPI; a PDF point is 1/72 inch
+             local_pdf_path,
+         ],
+         timeout=120,
+         stdout=subprocess.PIPE,
+         stderr=subprocess.PIPE,
+     )
+     assert pdftoppm_result.returncode == 0, pdftoppm_result.stderr
+     return base64.b64encode(pdftoppm_result.stdout).decode("utf-8")
+
+ def build_message(image_url: str, system_prompt: str, page: int = 1):
+     # image_url is the local path of the PDF; page is 1-indexed, as pdftoppm expects
+     image_base64 = render_pdf_to_base64png(image_url, page, TARGET_IMAGE_DIM)
+
+     prompt = [
+         {
+             "role": "user",
+             "content": [
+                 {
+                     "type": "text",
+                     "text": system_prompt
+                 },
+                 {
+                     "type": "image",
+                     "image": f"data:image/png;base64,{image_base64}",
+                 },
+             ],
+         }
+     ]
+     return prompt
+ ```
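The `-r` value passed to `pdftoppm` above is a resolution in DPI, chosen so that the longest page side renders to roughly `target_longest_image_dim` pixels. Here is a small worked example of that arithmetic, using a hypothetical helper name `dpi_for_target` and a US Letter page as the assumed input:

```python
POINTS_PER_INCH = 72  # PDF user-space unit: 1 point = 1/72 inch


def dpi_for_target(target_longest_image_dim: int, longest_dim_points: float) -> float:
    """Resolution (DPI) at which the longest page side spans the target pixel count."""
    return target_longest_image_dim * POINTS_PER_INCH / longest_dim_points


# US Letter is 612 x 792 points (8.5 x 11 inches)
width_pt, height_pt = 612.0, 792.0
dpi = dpi_for_target(2048, max(width_pt, height_pt))

# Rendered size in pixels = size in inches * DPI
height_px = height_pt / POINTS_PER_INCH * dpi
width_px = width_pt / POINTS_PER_INCH * dpi
print(f"dpi={dpi:.1f}, rendered ~{width_px:.0f} x {height_px:.0f} px")
```

Scaling by the longest side keeps the aspect ratio intact, so only one dimension reaches the target and the other comes out smaller.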
+
+ ### Run OCR Inference
+
+ ```python
+ from qwen_vl_utils import process_vision_info
+
+ def run_inference(model, processor, messages, max_new_tokens=128, device="cuda"):
+     # Render the chat messages into the model's prompt format
+     text = processor.apply_chat_template(
+         messages, tokenize=False, add_generation_prompt=True
+     )
+
+     # Load the base64-encoded page image referenced in the messages
+     image_inputs, _ = process_vision_info(messages)
+     inputs = processor(
+         text=[text],
+         images=image_inputs,
+         padding=False,
+         return_tensors="pt",
+     ).to(device)
+
+     generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
+     # generate() returns prompt + completion; keep only the newly generated tokens
+     trimmed_ids = [
+         out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+     ]
+
+     outputs = processor.batch_decode(
+         trimmed_ids,
+         skip_special_tokens=True,
+         clean_up_tokenization_spaces=False,
+     )
+     return outputs[0]
+ ```
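The `trimmed_ids` step above is needed because `generate` returns each prompt followed by its completion. The sketch below applies the same slicing to plain Python lists with made-up token ids, so it runs without a model; the ids and variable names are purely illustrative.

```python
# Made-up token ids standing in for inputs.input_ids (one prompt per row)
input_ids = [
    [101, 7, 8, 9],
    [101, 4, 5],
]

# Shape of what generate() returns: prompt tokens followed by new tokens
generated_ids = [
    [101, 7, 8, 9, 42, 43],
    [101, 4, 5, 77],
]

# Drop the prompt prefix from each row, keeping only the newly generated tokens
trimmed_ids = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
]
print(trimmed_ids)
```

Without this step the decoded output would begin with the prompt text, including the system prompt, before the transcription.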