ricklon and Claude Sonnet 4.6 committed
Commit 25ba1bf · 0 Parent(s)

Initial commit — DeepSeek-OCR-2 Math Rendering Edition


MathJax rendering, ZeroGPU support, updated examples and docs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

.gitattributes ADDED
@@ -0,0 +1,43 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ ocr.jpg filter=lfs diff=lfs merge=lfs -text
+ reachy-mini.jpg filter=lfs diff=lfs merge=lfs -text
+ examples/ocr.jpg filter=lfs diff=lfs merge=lfs -text
+ examples/reachy-mini.jpg filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
+ *.jpg filter=lfs diff=lfs merge=lfs -text
+ *.jpeg filter=lfs diff=lfs merge=lfs -text
+ *.pdf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,63 @@
+ ---
+ title: DeepSeek OCR 2 — Math Rendering Edition
+ emoji: 🧮
+ colorFrom: red
+ colorTo: blue
+ sdk: gradio
+ sdk_version: 6.8.0
+ app_file: app.py
+ pinned: true
+ short_description: DeepSeek-OCR-2 with MathJax math rendering
+ license: mit
+ python_version: "3.12"
+ ---
+
+ # DeepSeek-OCR-2 — Math Rendering Edition
+
+ Built on top of the excellent [DeepSeek-OCR-2 Demo](https://huggingface.co/spaces/merterbak/DeepSeek-OCR-2) by **Mert Erbak**. Many thanks for the clean foundation — the OCR pipeline, PDF support, bounding box visualisation, and grounding features are all his work.
+
+ ## What's new in this fork
+
+ - **MathJax rendering** — the Markdown Preview tab now renders LaTeX math notation (inline `$...$` and display `$$...$$`) using MathJax 3, so equations from scanned papers and textbooks display as proper math rather than raw LaTeX source.
+
+ ## Features (inherited + extended)
+
+ | Feature | Description |
+ |---|---|
+ | 📋 Markdown | Convert documents to structured markdown with layout detection |
+ | 📝 Free OCR | Simple text extraction without layout analysis |
+ | 📍 Locate | Find and highlight specific text or elements with bounding boxes |
+ | 🔍 Describe | General image description |
+ | ✏️ Custom | Provide your own prompt |
+ | 🧮 Math Preview | Rendered MathJax output for equations and formulas *(new)* |
+
+ ## Model
+
+ Uses `deepseek-ai/DeepSeek-OCR-2` with DeepEncoder v2. Achieves **91.09% on OmniDocBench** (+3.73% over v1).
+
+ Configuration: 1024 base + 768 patches with dynamic cropping (2–6 patches). 144 tokens per patch + 256 base tokens.
+
+ ## How it works
+
+ The model processes images and PDFs using a prompt-based interface with special tokens that control its behaviour:
+
+ - **`<image>`** — replaced at inference time with visual patch embeddings from the input
+ - **`<|grounding|>`** — activates layout detection; the model then annotates every element it finds with a label and bounding box coordinates
+ - **`<|ref|>label<|/ref|><|det|>[[x1,y1,x2,y2]]<|/det|>`** — the format the model uses to output detected regions
+
+ When grounding is active, the model self-labels regions as `title`, `text`, `image`, `table`, etc. Regions labelled `image` are automatically cropped out and appear in the **Cropped Images** tab. All regions get bounding boxes drawn in the **Boxes** tab.
+
+ See [TECHNICAL.md](TECHNICAL.md) for a full breakdown of the pipeline, including some non-obvious implementation details.
+
+ ## Running locally
+
+ ```bash
+ python3 -m venv .venv
+ source .venv/bin/activate
+ pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
+ pip install -r requirements.txt
+ pip install gradio spaces markdown pymdown-extensions
+ python app.py
+ ```
+
+ Requires a CUDA-capable GPU. The model is downloaded from HuggingFace on first run.
TECHNICAL.md ADDED
@@ -0,0 +1,432 @@
+ # Technical Documentation
+
+ This document covers the implementation details of the DeepSeek-OCR-2 Math Rendering Edition. It is intended for developers who want to understand, extend, or debug the pipeline.
+
+ ---
+
+ ## Table of Contents
+
+ 1. [Architecture Overview](#architecture-overview)
+ 2. [Prompts and Special Tokens](#prompts-and-special-tokens)
+ 3. [Grounding and Layout Detection](#grounding-and-layout-detection)
+ 4. [Figure and Graph Extraction](#figure-and-graph-extraction)
+ 5. [stdout Capture Pattern](#stdout-capture-pattern)
+ 6. [PDF Rendering](#pdf-rendering) — image conversion, 300 DPI rationale, one page at a time, digital vs scanned
+ 7. [Dual-pass Output Cleaning](#dual-pass-output-cleaning)
+ 8. [Bounding Box Rendering](#bounding-box-rendering)
+ 9. [Math Rendering Pipeline](#math-rendering-pipeline)
+ 10. [Known Quirks and Workarounds](#known-quirks-and-workarounds)
+ 11. [VRAM Usage and Quantized Models](#vram-usage-and-quantized-models)
+
+ ---
+
+ ## Architecture Overview
+
+ ```
+ User input (image or PDF)
+         │
+         ▼
+ PDF? ─── fitz renders page at 300 DPI ──► PIL Image
+ No?  ─── PIL Image directly
+         │
+         ▼
+ model.infer() called with prompt + image path
+         │
+         ▼  (stdout captured)
+ Raw model output (text + grounding tokens)
+         │
+         ├──► clean_output(include_images=False) ──► Text tab
+         │
+         ├──► clean_output(include_images=True)
+         │        │
+         │        ▼
+         │    embed_images() ──► Markdown string with base64 figures
+         │        │
+         │        ▼
+         │    to_math_html() ──► HTML with MathJax ──► Markdown Preview tab
+         │
+         ├──► extract_grounding_references()
+         │        │
+         │        ▼
+         │    draw_bounding_boxes() ──► Boxes tab
+         │        └──► crops ──► Cropped Images tab
+         │
+         └──► raw result ──► Raw Text tab
+ ```
+
+ ---
+
+ ## Prompts and Special Tokens
+
+ Each task sends a different prompt to the model. The prompt controls both what the model outputs and whether it performs layout detection.
+
+ | Task | Prompt | Grounding |
+ |---|---|---|
+ | Markdown | `<image>\n<\|grounding\|>Convert the document to markdown.` | Yes |
+ | Free OCR | `<image>\nFree OCR.` | No |
+ | Locate | `<image>\nLocate <\|ref\|>text<\|/ref\|> in the image.` | Yes |
+ | Describe | `<image>\nDescribe this image in detail.` | No |
+ | Custom | User-defined | Optional |
+
+ ### Special tokens
+
+ | Token | Purpose |
+ |---|---|
+ | `<image>` | Replaced at inference time with visual patch embeddings from the input image |
+ | `<\|grounding\|>` | Activates layout detection mode — the model annotates every detected region with a label and bounding box |
+ | `<\|ref\|>label<\|/ref\|>` | Wraps the label of a detected region (e.g. `title`, `text`, `image`, `table`) |
+ | `<\|det\|>coords<\|/det\|>` | Wraps the bounding box coordinates for that region |
+
+ ### Locate task
+
+ When using Locate, the user's input is embedded directly into the prompt:
+
+ ```python
+ prompt = f"<image>\nLocate <|ref|>{custom_prompt.strip()}<|/ref|> in the image."
+ ```
+
+ This asks the model to find a specific string or element and return its bounding box coordinates.
+
+ ---
+
+ ## Grounding and Layout Detection
+
+ When `<|grounding|>` is present, the model interleaves its text output with region annotations. A typical raw output looks like:
+
+ ```
+ # Introduction
+ <|ref|>title<|/ref|><|det|>[[45, 12, 820, 48]]<|/det|>
+
+ This paper presents a method for...
+ <|ref|>text<|/ref|><|det|>[[45, 60, 820, 340]]<|/det|>
+
+ <|ref|>image<|/ref|><|det|>[[45, 360, 820, 680]]<|/det|>
+
+ | A | B |
+ |---|---|
+ <|ref|>table<|/ref|><|det|>[[45, 700, 820, 900]]<|/det|>
+ ```
+
+ The labels (`title`, `text`, `image`, `table`) are part of the model's training vocabulary — the model assigns them based on what it detects, not from any hardcoded list in the app.
+
+ The regex that parses this is:
+
+ ```python
+ pattern = r'(<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>)'
+ ```
+
+ This returns a list of tuples: `(full_match, label, coordinates_string)`.
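+
+ A minimal sketch of what the parsed tuples look like (the sample string is made up, but follows the raw output format shown above):
+
+ ```python
+ import re
+
+ raw = "# Introduction\n<|ref|>title<|/ref|><|det|>[[45, 12, 820, 48]]<|/det|>"
+ pattern = r'(<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>)'
+
+ refs = re.findall(pattern, raw, re.DOTALL)
+ # refs[0] == ('<|ref|>title<|/ref|><|det|>[[45, 12, 820, 48]]<|/det|>',
+ #             'title', '[[45, 12, 820, 48]]')
+ ```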
+
+ ### Coordinate system
+
+ Bounding box coordinates are normalised to a **0–999 scale**, not pixel coordinates. The app scales them back at render time:
+
+ ```python
+ x1 = int(box[0] / 999 * img_w)
+ y1 = int(box[1] / 999 * img_h)
+ ```
+
+ This means coordinates are resolution-independent — the same model output works regardless of the original image size.
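+
+ For completeness, a sketch of the full mapping for one box (the `scale_box` helper name is ours — app.py inlines this arithmetic):
+
+ ```python
+ def scale_box(box, img_w, img_h):
+     """Map a model box on the 0-999 grid to pixel coordinates."""
+     x1, y1, x2, y2 = box
+     return (int(x1 / 999 * img_w), int(y1 / 999 * img_h),
+             int(x2 / 999 * img_w), int(y2 / 999 * img_h))
+
+ scale_box([45, 12, 820, 48], img_w=2550, img_h=3300)  # -> (114, 39, 2093, 158)
+ ```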
+
+ ---
+
+ ## Figure and Graph Extraction
+
+ Graph and figure extraction is a side effect of bounding box processing. Inside `draw_bounding_boxes()`:
+
+ ```python
+ if extract_images and label == 'image':
+     crops.append(image.crop((x1, y1, x2, y2)))
+ ```
+
+ Only regions the model labels as `'image'` are cropped. Text blocks, titles, and tables get bounding boxes drawn but are not extracted.
+
+ These crops are then:
+ 1. Added to the **Cropped Images** gallery tab
+ 2. Base64-encoded and embedded into the markdown as `![Figure N](data:image/png;base64,...)` by `embed_images()`, so they appear inline in the **Markdown Preview** tab
+
+ ---
+
+ ## stdout Capture Pattern
+
+ The model's `infer()` method was designed as a CLI tool — it `print()`s its output rather than returning it. The app captures this by temporarily replacing `sys.stdout`:
+
+ ```python
+ stdout = sys.stdout
+ sys.stdout = StringIO()
+
+ model.infer(...)
+
+ raw = sys.stdout.getvalue()
+ sys.stdout = stdout
+ ```
+
+ The model also prints internal diagnostics alongside the actual output. These are filtered out by checking for known debug strings:
+
+ ```python
+ debug_filters = ['PATCHES', '====', 'BASE:', 'directly resize',
+                  'NO PATCHES', 'torch.Size', '%|']
+
+ result = '\n'.join([
+     l for l in raw.split('\n')
+     if l.strip() and not any(s in l for s in debug_filters)
+ ])
+ ```
+
+ If inference ever produces unexpected empty output, checking what the model is printing to stdout (by temporarily removing the capture) is the first debugging step.
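+
+ A tidier variant of the same capture (a sketch, not what app.py currently does) uses `contextlib.redirect_stdout`, which restores stdout even if `infer()` raises:
+
+ ```python
+ from contextlib import redirect_stdout
+ from io import StringIO
+
+ def captured_infer(model, **infer_kwargs):
+     """Run model.infer() and return everything it printed."""
+     buf = StringIO()
+     with redirect_stdout(buf):  # stdout restored on exit, even on exceptions
+         model.infer(**infer_kwargs)
+     return buf.getvalue()
+ ```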
+
+ ---
+
+ ## PDF Rendering
+
+ ### PDFs are converted to images — the text layer is never read
+
+ The app does not extract embedded text from PDFs. Every page is rasterised to a PNG image first, then passed through the exact same pipeline as a directly uploaded image:
+
+ ```python
+ def process_pdf(path, task, custom_prompt, page_num):
+     doc = fitz.open(path)
+     page = doc.load_page(page_num - 1)
+     pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72), alpha=False)
+     img = Image.open(BytesIO(pix.tobytes("png")))
+     doc.close()
+     return process_image(img, task, custom_prompt)
+ ```
+
+ This means the model reads pixels, not characters. It has no access to the PDF's internal text layer, font metadata, or document structure.
+
+ ### Why 300 DPI
+
+ PDF geometry is natively specified at 72 units per inch, so the `fitz.Matrix(300/72, 300/72)` call scales the render up by ~4.17× (see the sketch after this list):
+
+ - At 72 DPI, small text, subscripts, superscripts, and fine math symbols are too coarse for the model to read reliably
+ - At 300 DPI, characters are sharp enough for accurate OCR even at small point sizes
+ - 300 DPI is the standard used by document scanners for archival quality
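+
+ As a quick sanity check, a sketch (the input filename is hypothetical) that prints the pixel size a page is rendered at; a US Letter page (8.5 × 11 in) comes out at 2550 × 3300 px:
+
+ ```python
+ import fitz  # PyMuPDF
+
+ doc = fitz.open("example.pdf")  # hypothetical input
+ page = doc.load_page(0)
+ zoom = 300 / 72  # ~4.17x over the native 72-unit grid
+ pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
+ print(pix.width, pix.height)  # US Letter: 2550 3300
+ doc.close()
+ ```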
+
+ ### One page at a time
+
+ The current implementation processes one page per submission. There is no batch mode. For a multi-page document the user selects a page number, submits, then moves to the next page.
+
+ The page selector UI is only shown when a PDF is uploaded:
+
+ ```python
+ def update_page_selector(file_path):
+     if file_path.lower().endswith('.pdf'):
+         page_count = get_pdf_page_count(file_path)
+         return gr.update(visible=True, maximum=page_count, value=1, minimum=1)
+     return gr.update(visible=False)
+ ```
+
+ ### Digital vs scanned PDFs
+
+ Both go through the same rasterisation path:
+
+ | PDF type | What's inside | Result |
+ |---|---|---|
+ | Digital (text-based) | Vector fonts and geometry | PyMuPDF re-rasterises from vectors — output is perfectly sharp at any DPI |
+ | Scanned | Embedded raster images | PyMuPDF extracts the raster — output quality depends on the original scan resolution |
+
+ For scanned PDFs with low source resolution (e.g. 150 DPI originals), upscaling to 300 DPI will not recover detail that was never there. In those cases inference accuracy may be lower than with high-quality digital PDFs.
+
+ ---
+
+ ## Dual-pass Output Cleaning
+
+ `clean_output()` is called twice on the same raw result to produce two different outputs (a worked example follows the lists below):
+
+ ```python
+ cleaned = clean_output(result, include_images=False)   # → Text tab
+ markdown = clean_output(result, include_images=True)   # → Markdown Preview
+ ```
+
+ With `include_images=False`:
+ - Grounding tokens are stripped
+ - `<|ref|>image<|/ref|>` regions are removed entirely
+ - Result is clean plain text
+
+ With `include_images=True`:
+ - Text grounding tokens are stripped
+ - `<|ref|>image<|/ref|>` regions are replaced with `**[Figure N]**` placeholders
+ - `embed_images()` then swaps those placeholders for actual base64-encoded PNGs
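+
+ A worked example of the two passes on a made-up raw string (note that `from app import clean_output` would also trigger the model download at import time, so for experiments it is easier to copy the function into a scratch file):
+
+ ```python
+ raw = (
+     "# Title\n"
+     "<|ref|>title<|/ref|><|det|>[[45, 12, 820, 48]]<|/det|>\n"
+     "Body text.\n"
+     "<|ref|>image<|/ref|><|det|>[[45, 360, 820, 680]]<|/det|>"
+ )
+
+ print(clean_output(raw, include_images=False))
+ # # Title
+ # Body text.
+
+ print(clean_output(raw, include_images=True))
+ # # Title
+ # Body text.
+ #
+ #
+ # **[Figure 1]**
+ ```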
+
+ ---
+
+ ## Bounding Box Rendering
+
+ Bounding boxes are drawn in two layers using Pillow:
+
+ 1. **Solid outline** — drawn directly on a copy of the image
+ 2. **Semi-transparent fill** — drawn on a separate RGBA overlay, then composited
+
+ ```python
+ overlay = Image.new('RGBA', img_draw.size, (0, 0, 0, 0))
+ # ... draw filled rectangles on overlay with alpha=60 ...
+ img_draw.paste(overlay, (0, 0), overlay)
+ ```
+
+ The alpha value of 60 (out of 255) gives a ~24% opacity fill, keeping the underlying content readable.
+
+ ### Colour assignment
+
+ Each unique label gets a random RGB colour, generated once per session:
+
+ ```python
+ np.random.seed(42)
+ color_map[label] = (
+     np.random.randint(50, 255),
+     np.random.randint(50, 255),
+     np.random.randint(50, 255)
+ )
+ ```
+
+ The seed is fixed at 42, so the colour sequence is deterministic across runs; which colour a given label receives depends only on the order in which labels are first encountered in the output. The lower bound of 50 avoids near-black colours that are hard to see.
+
+ Title regions get a thicker outline (width=5) than other regions (width=3) to give them visual prominence.
+
+ ---
+
+ ## Math Rendering Pipeline
+
+ Getting LaTeX from the model to display correctly in the browser involves three components working together.
+
+ ### The markdown/math conflict
+
+ Standard markdown processors treat `_` and `*` as emphasis markers. Raw LaTeX like `$a_1 + a_2^*$` would be mangled before MathJax ever sees it.
+
+ The solution is `pymdownx.arithmatex` — a markdown extension that extracts math expressions **before** markdown processing, processes the surrounding text, then reinserts the math wrapped in MathJax-compatible delimiters:
+
+ ```
+ Input: Some text with $a_1 + a_2$ inline.
+
+ After arithmatex + markdown:
+ <p>Some text with <span class="arithmatex">\(a_1 + a_2\)</span> inline.</p>
+ ```
+
+ The `_` inside the math is never touched by the markdown processor.
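+
+ This is easy to verify directly (assuming `markdown` and `pymdown-extensions` are installed, as requirements.txt specifies):
+
+ ```python
+ import markdown as md_lib
+
+ src = "Emphasis _works_ here, but $a_1 + a_2^*$ stays intact."
+ html = md_lib.markdown(
+     src,
+     extensions=['pymdownx.arithmatex'],
+     extension_configs={'pymdownx.arithmatex': {'generic': True}},
+ )
+ print(html)
+ # (approximately)
+ # <p>Emphasis <em>works</em> here, but
+ # <span class="arithmatex">\(a_1 + a_2^*\)</span> stays intact.</p>
+ ```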
+
+ ### MathJax configuration
+
+ MathJax is loaded once in the page `<head>` and configured to process `\(...\)` for inline math and `\[...\]` for display math — matching the output format of arithmatex:
+
+ ```javascript
+ window.MathJax = {
+     tex: {
+         inlineMath: [['\\(', '\\)']],
+         displayMath: [['\\[', '\\]']],
+         processEscapes: true,
+         tags: 'ams'
+     }
+ };
+ ```
+
+ `tags: 'ams'` enables automatic equation numbering for `align`, `equation`, and similar environments.
+
+ ### Re-typesetting on update
+
+ MathJax normally typesets once at page load (the app disables even that with `startup: { typeset: false }`, since the preview is empty at load time). When Gradio swaps new content into the HTML component, MathJax has to be told explicitly to typeset it:
+
+ ```python
+ submit_event.then(fn=None, js="""() => {
+     const tryTypeset = () => {
+         if (!window.MathJax || !MathJax.typesetPromise) { setTimeout(tryTypeset, 100); return; }
+         const el = document.querySelector('.math-preview');
+         if (!el) return;
+         MathJax.typesetClear([el]);
+         MathJax.typesetPromise([el]);
+     };
+     setTimeout(tryTypeset, 100);
+ }""")
+ ```
+
+ The initial 100 ms delay gives Gradio time to finish updating the DOM; the retry loop covers the case where the MathJax script itself has not finished loading, and `typesetClear` discards stale typeset state from the previous result before re-rendering.
+
+ ---
+
+ ## Known Quirks and Workarounds
+
+ ### `\coloneqq` and `\eqqcolon`
+
+ These LaTeX commands (`≔` and `=:`) from the `mathtools` package appear frequently in academic papers but are not available in MathJax's default TeX configuration. Rather than loading the full `mathtools` package, they are substituted at the text level:
+
+ ```python
+ text = text.replace('\\coloneqq', ':=').replace('\\eqqcolon', '=:')
+ ```
+
+ If you need proper rendering of these symbols, load the extension instead: add `loader: { load: ['[tex]/mathtools'] }` to the MathJax configuration and include `'mathtools'` in `tex.packages` (the extension ships with MathJax 3.2+).
+
+ ### Flash Attention initialisation warning
+
+ On startup you will see:
+
+ ```
+ You are attempting to use Flash Attention 2.0 with a model not initialized on GPU.
+ ```
+
+ This is because the model loads onto CPU first (`from_pretrained`) then moves to GPU (`.cuda()`). Flash Attention 2 prefers direct GPU initialisation. The warning is harmless — inference works correctly. To silence it locally, add `device_map="cuda"` to the `from_pretrained` call (not an option on ZeroGPU, where no GPU is attached at load time).
+
+ ### Model type mismatch warning
+
+ ```
+ You are using a model of type deepseek_vl_v2 to instantiate a model of type DeepseekOCR2.
+ ```
+
+ The model's config file on HuggingFace declares `model_type: deepseek_vl_v2` but the custom code registers a `DeepseekOCR2` class. Because `trust_remote_code=True` is set, the correct class is loaded regardless. The warning can be ignored.
+
+ ### `eval()` on model output
+
+ Bounding box coordinates are parsed with Python's `eval()`:
+
+ ```python
+ coords = eval(ref[2])
+ ```
+
+ The model outputs coordinates as a Python list literal, e.g. `[[45, 12, 820, 48]]`. This works because the model runs locally and its output is well-formed in practice, but `eval()` will execute anything that parses, so it is worth revisiting if the architecture ever changes to process untrusted model output.
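+
+ A drop-in hardening (a sketch): `ast.literal_eval` accepts only Python literals, so anything else raises instead of executing:
+
+ ```python
+ import ast
+
+ def parse_coords(s):
+     """Parse '[[x1, y1, x2, y2], ...]' without executing arbitrary code."""
+     coords = ast.literal_eval(s)
+     # minimal shape check: a list of 4-element boxes
+     if not all(isinstance(b, (list, tuple)) and len(b) == 4 for b in coords):
+         raise ValueError(f"unexpected box format: {s!r}")
+     return coords
+ ```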
+
+ ### Zone.Identifier files in examples/
+
+ Files copied from Windows to WSL2 may have accompanying `.Zone.Identifier` metadata files (e.g. `image.png:Zone.Identifier`). These are Windows security zone markers and are harmless — Gradio ignores them when loading examples.
+
+ ---
+
+ ## VRAM Usage and Quantized Models
+
+ ### The 8GB problem
+
+ The full-precision BF16 model (`deepseek-ai/DeepSeek-OCR-2`) consumes approximately **7.9GB of VRAM** on load, leaving only ~250MB free on an 8GB GPU (e.g. RTX 3070). This headroom is insufficient for inference on complex documents:
+
+ - Each patch adds tokens to the KV cache and activations
+ - A 6-patch document can exhaust the remaining VRAM
+ - When VRAM is full, the driver spills allocations to system RAM — which is 50–100× slower
+ - Symptom: GPU-Util drops to ~24%, power draw falls to ~47W (waiting on memory, not computing)
+
+ You can confirm this with `watch -n 1 nvidia-smi` during inference. Near-full VRAM with low GPU utilisation is the telltale sign.
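+
+ The same numbers can be logged from inside Python (a sketch using standard `torch.cuda` calls; call it before and after `model.infer(...)`):
+
+ ```python
+ import torch
+
+ def log_vram(tag):
+     """Print allocated/reserved/free VRAM in GiB for the current device."""
+     gib = 1024 ** 3
+     free, total = torch.cuda.mem_get_info()
+     print(f"[{tag}] allocated={torch.cuda.memory_allocated() / gib:.2f} GiB, "
+           f"reserved={torch.cuda.memory_reserved() / gib:.2f} GiB, "
+           f"free={free / gib:.2f}/{total / gib:.2f} GiB")
+ ```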
+
+ ### Quantized alternatives on HuggingFace
+
+ To switch models, change `MODEL_NAME` in `app.py`. Three options are available as of March 2026:
+
+ | Model | Format | VRAM | Notes |
+ |---|---|---|---|
+ | `deepseek-ai/DeepSeek-OCR-2` | BF16 (full) | ~8GB | Original, highest accuracy |
+ | `richarddavison/DeepSeek-OCR-2-FP8` | FP8 dynamic | ~3.5GB | ~50% reduction; requires Ampere GPU or newer (RTX 30xx qualifies); 3,750 downloads/mo |
+ | `mzbac/DeepSeek-OCR-2-8bit` | 8-bit | ~4GB | Same stack (torch 2.6, flash-attn 2.7.3, Python 3.12); explicitly supports dynamic resolution (0–6 patches); 140 downloads/mo |
+
+ **Not applicable to NVIDIA GPUs:**
+ - `mlx-community/DeepSeek-OCR-2-*` — Apple Silicon only (MLX framework)
+
+ **Not recommended:**
+ - `WHY2001/DeepSeek-OCR-4bit-Quantized` — 17 downloads/month, not well tested
+
+ ### What does not exist (as of March 2026)
+
+ - GGUF of DeepSeek-OCR-2 (GGUF repos on HuggingFace are for v1 only)
+ - GPTQ of DeepSeek-OCR-2
+ - AWQ of DeepSeek-OCR-2
+
+ ### Switching models
+
+ Change the single constant in `app.py` and restart:
+
+ ```python
+ # FP8 — recommended first try for 8GB GPUs
+ MODEL_NAME = 'richarddavison/DeepSeek-OCR-2-FP8'
+
+ # 8-bit — alternative with same toolchain
+ MODEL_NAME = 'mzbac/DeepSeek-OCR-2-8bit'
+ ```
+
+ The model will be downloaded from HuggingFace on first use and cached locally.
app.py ADDED
@@ -0,0 +1,420 @@
+ import gradio as gr
+ from transformers import AutoModel, AutoTokenizer
+ import torch
+ import spaces
+ import os
+ import sys
+ import tempfile
+ import shutil
+ from PIL import Image, ImageDraw, ImageFont, ImageOps
+ import fitz
+ import re
+ import numpy as np
+ import base64
+ import markdown as md_lib
+ from io import StringIO, BytesIO
+
+ # Model options — swap MODEL_NAME to reduce VRAM usage on GPUs with <= 8GB
+ #
+ # Full precision BF16 (~8GB VRAM) — original, highest accuracy
+ MODEL_NAME = 'deepseek-ai/DeepSeek-OCR-2'
+ #
+ # FP8 dynamic quantization (~3.5GB VRAM) — ~50% VRAM reduction, 3750 downloads/mo
+ # Requires Ampere GPU or newer (RTX 3070 is supported)
+ # MODEL_NAME = 'richarddavison/DeepSeek-OCR-2-FP8'
+ #
+ # 8-bit quantization (~4GB VRAM) — same stack (torch 2.6, flash-attn 2.7.3, py3.12)
+ # Explicitly supports dynamic resolution (0-6 patches), 140 downloads/mo
+ # MODEL_NAME = 'mzbac/DeepSeek-OCR-2-8bit'
+
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
+ model = AutoModel.from_pretrained(MODEL_NAME, _attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16, trust_remote_code=True, use_safetensors=True).eval()
+ # .cuda() is NOT called here — on ZeroGPU, GPU is only available inside @spaces.GPU
+ # functions. Locally, model.cuda() is called inside process_image on first run.
+
+ BASE_SIZE = 1024
+ IMAGE_SIZE = 768
+ CROP_MODE = True
+
+ TASK_PROMPTS = {
+     "📋 Markdown": {"prompt": "<image>\n<|grounding|>Convert the document to markdown.", "has_grounding": True},
+     "📝 Free OCR": {"prompt": "<image>\nFree OCR.", "has_grounding": False},
+     "📍 Locate": {"prompt": "<image>\nLocate <|ref|>text<|/ref|> in the image.", "has_grounding": True},
+     "🔍 Describe": {"prompt": "<image>\nDescribe this image in detail.", "has_grounding": False},
+     "✏️ Custom": {"prompt": "", "has_grounding": False}
+ }
+
+ def extract_grounding_references(text):
+     pattern = r'(<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>)'
+     return re.findall(pattern, text, re.DOTALL)
+
+ def draw_bounding_boxes(image, refs, extract_images=False):
+     img_w, img_h = image.size
+     img_draw = image.copy()
+     draw = ImageDraw.Draw(img_draw)
+     overlay = Image.new('RGBA', img_draw.size, (0, 0, 0, 0))
+     draw2 = ImageDraw.Draw(overlay)
+     try:
+         font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 15)
+     except OSError:
+         font = ImageFont.load_default()  # fallback when DejaVu is not installed
+     crops = []
+
+     color_map = {}
+     np.random.seed(42)
+
+     for ref in refs:
+         label = ref[1]
+         if label not in color_map:
+             color_map[label] = (np.random.randint(50, 255), np.random.randint(50, 255), np.random.randint(50, 255))
+
+         color = color_map[label]
+         coords = eval(ref[2])  # see TECHNICAL.md: "eval() on model output"
+         color_a = color + (60,)
+
+         for box in coords:
+             x1, y1, x2, y2 = int(box[0]/999*img_w), int(box[1]/999*img_h), int(box[2]/999*img_w), int(box[3]/999*img_h)
+
+             if extract_images and label == 'image':
+                 crops.append(image.crop((x1, y1, x2, y2)))
+
+             width = 5 if label == 'title' else 3
+             draw.rectangle([x1, y1, x2, y2], outline=color, width=width)
+             draw2.rectangle([x1, y1, x2, y2], fill=color_a)
+
+             text_bbox = draw.textbbox((0, 0), label, font=font)
+             tw, th = text_bbox[2] - text_bbox[0], text_bbox[3] - text_bbox[1]
+             ty = max(0, y1 - 20)
+             draw.rectangle([x1, ty, x1 + tw + 4, ty + th + 4], fill=color)
+             draw.text((x1 + 2, ty + 2), label, font=font, fill=(255, 255, 255))
+
+     img_draw.paste(overlay, (0, 0), overlay)
+     return img_draw, crops
+
+ def clean_output(text, include_images=False):
+     if not text:
+         return ""
+     pattern = r'(<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>)'
+     matches = re.findall(pattern, text, re.DOTALL)
+     img_num = 0
+
+     for match in matches:
+         if '<|ref|>image<|/ref|>' in match[0]:
+             if include_images:
+                 text = text.replace(match[0], f'\n\n**[Figure {img_num + 1}]**\n\n', 1)
+                 img_num += 1
+             else:
+                 text = text.replace(match[0], '', 1)
+         else:
+             # Non-image tags: drop the whole source line containing the tag
+             text = re.sub(rf'(?m)^[^\n]*{re.escape(match[0])}[^\n]*\n?', '', text)
+
+     text = text.replace('\\coloneqq', ':=').replace('\\eqqcolon', '=:')
+
+     return text.strip()
+
+ MATHJAX_HEAD = """
+ <script>
+ window.MathJax = {
+     tex: {
+         inlineMath: [['\\\\(', '\\\\)']],
+         displayMath: [['\\\\[', '\\\\]']],
+         processEscapes: true,
+         tags: 'ams'
+     },
+     options: {
+         skipHtmlTags: ['script', 'noscript', 'style', 'textarea', 'pre']
+     },
+     startup: {
+         typeset: false
+     }
+ };
+ </script>
+ <script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js" async></script>
+ <style>
+ .math-preview {
+     padding: 1.5em;
+     font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
+     font-size: 15px;
+     line-height: 1.8;
+     color: #1a1a1a;
+     max-width: 100%;
+     overflow-x: auto;
+ }
+ .math-preview h1 { font-size: 1.8em; font-weight: 700; margin: 1em 0 0.4em; border-bottom: 2px solid #e0e0e0; padding-bottom: 0.3em; }
+ .math-preview h2 { font-size: 1.4em; font-weight: 600; margin: 1em 0 0.4em; border-bottom: 1px solid #e0e0e0; padding-bottom: 0.2em; }
+ .math-preview h3 { font-size: 1.15em; font-weight: 600; margin: 0.9em 0 0.3em; }
+ .math-preview h4, .math-preview h5, .math-preview h6 { font-weight: 600; margin: 0.8em 0 0.3em; }
+ .math-preview p { margin: 0.6em 0; }
+ .math-preview ul, .math-preview ol { padding-left: 1.8em; margin: 0.5em 0; }
+ .math-preview li { margin: 0.25em 0; }
+ .math-preview table { border-collapse: collapse; width: 100%; margin: 1em 0; font-size: 0.95em; }
+ .math-preview th, .math-preview td { border: 1px solid #ccc; padding: 0.45em 0.75em; text-align: left; }
+ .math-preview th { background: #f2f2f2; font-weight: 600; }
+ .math-preview tr:nth-child(even) { background: #fafafa; }
+ .math-preview code { background: #f4f4f4; padding: 0.15em 0.4em; border-radius: 3px; font-family: 'Courier New', monospace; font-size: 0.88em; }
+ .math-preview pre { background: #f4f4f4; padding: 1em; border-radius: 5px; overflow-x: auto; margin: 0.8em 0; }
+ .math-preview pre code { background: none; padding: 0; }
+ .math-preview blockquote { border-left: 4px solid #ccc; margin: 0.8em 0; padding: 0.4em 1em; color: #555; background: #fafafa; }
+ .math-preview img { max-width: 100%; height: auto; display: block; margin: 0.8em 0; }
+ .math-preview .arithmatex { overflow-x: auto; }
+ .math-preview mjx-container[display="true"] { display: block; overflow-x: auto; padding: 0.5em 0; }
+ </style>
+ """
+
+ def to_math_html(text):
+     if not text:
+         return ""
+     html = md_lib.markdown(text, extensions=[
+         'pymdownx.arithmatex',
+         'tables',
+         'fenced_code',
+         'sane_lists',
+     ], extension_configs={
+         'pymdownx.arithmatex': {'generic': True}
+     })
+     return f'<div class="math-preview">{html}</div>'
+
+ def embed_images(markdown, crops):
+     if not crops:
+         return markdown
+     for i, img in enumerate(crops):
+         buf = BytesIO()
+         img.save(buf, format="PNG")
+         b64 = base64.b64encode(buf.getvalue()).decode()
+         markdown = markdown.replace(f'**[Figure {i + 1}]**', f'\n\n![Figure {i + 1}](data:image/png;base64,{b64})\n\n', 1)
+     return markdown
+
+ @spaces.GPU(duration=90)
+ def process_image(image, task, custom_prompt):
+     model.cuda()  # GPU is available here — works on ZeroGPU and locally
+     if image is None:
+         return "Error: Upload an image", "", "", None, []
+     if task in ["✏️ Custom", "📍 Locate"] and not custom_prompt.strip():
+         return "Please enter a prompt", "", "", None, []
+
+     if image.mode in ('RGBA', 'LA', 'P'):
+         image = image.convert('RGB')
+     image = ImageOps.exif_transpose(image)
+
+     if task == "✏️ Custom":
+         prompt = f"<image>\n{custom_prompt.strip()}"
+         has_grounding = '<|grounding|>' in custom_prompt
+     elif task == "📍 Locate":
+         prompt = f"<image>\nLocate <|ref|>{custom_prompt.strip()}<|/ref|> in the image."
+         has_grounding = True
+     else:
+         prompt = TASK_PROMPTS[task]["prompt"]
+         has_grounding = TASK_PROMPTS[task]["has_grounding"]
+
+     tmp = tempfile.NamedTemporaryFile(delete=False, suffix='.jpg')
+     image.save(tmp.name, 'JPEG', quality=95)
+     tmp.close()
+     out_dir = tempfile.mkdtemp()
+
+     stdout = sys.stdout
+     sys.stdout = StringIO()
+
+     model.infer(
+         tokenizer=tokenizer,
+         prompt=prompt,
+         image_file=tmp.name,
+         output_path=out_dir,
+         base_size=BASE_SIZE,
+         image_size=IMAGE_SIZE,
+         crop_mode=CROP_MODE,
+         save_results=False
+     )
+
+     debug_filters = ['PATCHES', '====', 'BASE:', 'directly resize', 'NO PATCHES', 'torch.Size', '%|']
+     result = '\n'.join([l for l in sys.stdout.getvalue().split('\n')
+                         if l.strip() and not any(s in l for s in debug_filters)]).strip()
+     sys.stdout = stdout
+
+     os.unlink(tmp.name)
+     shutil.rmtree(out_dir, ignore_errors=True)
+
+     if not result:
+         return "No text detected", "", "", None, []
+
+     cleaned = clean_output(result, False)
+     markdown = clean_output(result, True)
+
+     img_out = None
+     crops = []
+
+     if has_grounding and '<|ref|>' in result:
+         refs = extract_grounding_references(result)
+         if refs:
+             img_out, crops = draw_bounding_boxes(image, refs, True)
+
+     markdown = embed_images(markdown, crops)
+
+     return cleaned, markdown, result, img_out, crops
+
+ @spaces.GPU(duration=90)
+ def process_pdf(path, task, custom_prompt, page_num):
+     doc = fitz.open(path)
+     total_pages = len(doc)
+     if page_num < 1 or page_num > total_pages:
+         doc.close()
+         return f"Invalid page number. PDF has {total_pages} pages.", "", "", None, []
+     page = doc.load_page(page_num - 1)
+     pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72), alpha=False)
+     img = Image.open(BytesIO(pix.tobytes("png")))
+     doc.close()
+
+     return process_image(img, task, custom_prompt)
+
+ def process_file(path, task, custom_prompt, page_num):
+     if not path:
+         return "Error: Upload a file", "", "", None, []
+     if path.lower().endswith('.pdf'):
+         return process_pdf(path, task, custom_prompt, page_num)
+     else:
+         return process_image(Image.open(path), task, custom_prompt)
+
+ def toggle_prompt(task):
+     if task == "✏️ Custom":
+         return gr.update(visible=True, label="Custom Prompt", placeholder="Add <|grounding|> for bounding boxes")
+     elif task == "📍 Locate":
+         return gr.update(visible=True, label="Text to Locate", placeholder="Enter text to locate")
+     return gr.update(visible=False)
+
+ def select_boxes(task):
+     if task == "📍 Locate":
+         return gr.update(selected="tab_boxes")
+     return gr.update()
+
+ def get_pdf_page_count(file_path):
+     if not file_path or not file_path.lower().endswith('.pdf'):
+         return 1
+     doc = fitz.open(file_path)
+     count = len(doc)
+     doc.close()
+     return count
+
+ def load_image(file_path, page_num=1):
+     if not file_path:
+         return None
+     if file_path.lower().endswith('.pdf'):
+         doc = fitz.open(file_path)
+         page_idx = max(0, min(int(page_num) - 1, len(doc) - 1))
+         page = doc.load_page(page_idx)
+         pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72), alpha=False)
+         img = Image.open(BytesIO(pix.tobytes("png")))
+         doc.close()
+         return img
+     else:
+         return Image.open(file_path)
+
+ def update_page_selector(file_path):
+     if not file_path:
+         return gr.update(visible=False)
+     if file_path.lower().endswith('.pdf'):
+         page_count = get_pdf_page_count(file_path)
+         return gr.update(visible=True, maximum=page_count, value=1, minimum=1,
+                          label=f"Select Page (1-{page_count})")
+     return gr.update(visible=False)
+
+ with gr.Blocks(title="DeepSeek-OCR-2", head=MATHJAX_HEAD, theme=gr.themes.Soft()) as demo:
+     gr.Markdown("""
+     # 🧮 DeepSeek-OCR-2 — Math Rendering Edition
+     **Convert documents to markdown, extract text, parse figures, and locate specific content with bounding boxes.**
+     **Model uses DeepEncoder v2 and achieves 91.09% on OmniDocBench (+3.73% over v1).**
+
+     Built on the original [DeepSeek-OCR-2 Demo](https://huggingface.co/spaces/merterbak/DeepSeek-OCR-2) by **Mert Erbak** — thank you for the excellent foundation.
+     This fork adds **MathJax rendering** in the Markdown Preview tab so that equations from scanned papers and textbooks display as proper math notation.
+     """)
+
+     with gr.Row():
+         with gr.Column(scale=1):
+             file_in = gr.File(label="Upload Image or PDF", file_types=["image", ".pdf"], type="filepath")
+             input_img = gr.Image(label="Input Image", type="pil", height=300)
+             page_selector = gr.Number(label="Select Page", value=1, minimum=1, step=1, visible=False)
+             task = gr.Dropdown(list(TASK_PROMPTS.keys()), value="📋 Markdown", label="Task")
+             prompt = gr.Textbox(label="Prompt", lines=2, visible=False)
+             btn = gr.Button("Extract", variant="primary", size="lg")
+
+         with gr.Column(scale=2):
+             with gr.Tabs() as tabs:
+                 with gr.Tab("Text", id="tab_text"):
+                     text_out = gr.Textbox(lines=20, buttons=["copy"], show_label=False)
+                 with gr.Tab("Markdown Preview", id="tab_markdown"):
+                     md_out = gr.HTML("")
+                 with gr.Tab("Boxes", id="tab_boxes"):
+                     img_out = gr.Image(type="pil", height=500, show_label=False)
+                 with gr.Tab("Cropped Images", id="tab_crops"):
+                     gallery = gr.Gallery(show_label=False, columns=3, height=400)
+                 with gr.Tab("Raw Text", id="tab_raw"):
+                     raw_out = gr.Textbox(lines=20, buttons=["copy"], show_label=False)
+
+     with gr.Accordion("Image Examples", open=True):
+         gr.Examples(
+             examples=[
+                 ["examples/2022-0922 Section 13 Notes.png", "📋 Markdown", ""],
+                 ["examples/2022-0922 Section 14 Notes.png", "📋 Markdown", ""],
+                 ["examples/2022-0922 Section 15 Notes.png", "📋 Markdown", ""],
+             ],
+             inputs=[input_img, task, prompt],
+             cache_examples=False
+         )
+
+     with gr.Accordion("PDF Examples", open=True):
+         gr.Examples(
+             examples=[
+                 ["examples/Gursoy Class Notes_ Accessibility Sandbox.pdf", "📋 Markdown", ""],
+             ],
+             inputs=[file_in, task, prompt],
+             cache_examples=False
+         )
+
+     with gr.Accordion("ℹ️ Info", open=False):
+         gr.Markdown("""
+         ### Configuration
+         1024 base + 768 patches with dynamic cropping (2-6 patches). 144 tokens per patch + 256 base tokens.
+
+         ### Tasks
+         - **Markdown**: Convert document to structured markdown with layout detection (grounding ✅)
+         - **Free OCR**: Simple text extraction without layout
+         - **Locate**: Find and highlight specific text/elements in image (grounding ✅)
+         - **Describe**: General image description
+         - **Custom**: Your own prompt
+
+         ### Special Tokens
+         - `<image>` - Placeholder where visual tokens are inserted
+         - `<|grounding|>` - Enables layout detection with bounding boxes
+         - `<|ref|>text<|/ref|>` - Reference text to locate in the image
+         """)
+
+     file_in.change(load_image, [file_in, page_selector], [input_img])
+     file_in.change(update_page_selector, [file_in], [page_selector])
+     page_selector.change(load_image, [file_in, page_selector], [input_img])
+     task.change(toggle_prompt, [task], [prompt])
+     task.change(select_boxes, [task], [tabs])
+
+     def run(image, file_path, task, custom_prompt, page_num):
+         if file_path:
+             cleaned, markdown, raw, img_out, crops = process_file(file_path, task, custom_prompt, int(page_num))
+         elif image is not None:
+             cleaned, markdown, raw, img_out, crops = process_image(image, task, custom_prompt)
+         else:
+             return "Error: Upload a file or image", "", "", None, []
+         return cleaned, to_math_html(markdown), raw, img_out, crops
+
+     submit_event = btn.click(run, [input_img, file_in, task, prompt, page_selector],
+                              [text_out, md_out, raw_out, img_out, gallery])
+     submit_event.then(select_boxes, [task], [tabs])
+     submit_event.then(fn=None, js="""() => {
+         const tryTypeset = () => {
+             if (!window.MathJax || !MathJax.typesetPromise) { setTimeout(tryTypeset, 100); return; }
+             const el = document.querySelector('.math-preview');
+             if (!el) return;
+             MathJax.typesetClear([el]);
+             MathJax.typesetPromise([el]);
+         };
+         setTimeout(tryTypeset, 100);
+     }""")
+
+ if __name__ == "__main__":
+     # server_name="0.0.0.0" is needed locally (WSL2 → Windows access)
+     # On HuggingFace Spaces, SPACE_ID is set and Gradio handles binding automatically
+     local = not os.environ.get("SPACE_ID")
+     demo.queue(max_size=20).launch(server_name="0.0.0.0" if local else None)
examples/2022-0922 Section 13 Notes.png ADDED

Git LFS Details

  • SHA256: e344e03a5967c604e2ce4ddb99ec5ab8d4939b6b692785be461213bad7e6c067
  • Pointer size: 131 Bytes
  • Size of remote file: 500 kB
examples/2022-0922 Section 14 Notes.png ADDED

Git LFS Details

  • SHA256: 7b27aec83556e709fff5f027155ba2a7f76a349cf0ca23334b9c864f5bbfcaf2
  • Pointer size: 131 Bytes
  • Size of remote file: 482 kB
examples/2022-0922 Section 15 Notes.png ADDED

Git LFS Details

  • SHA256: 9bbe0459e0d2035da455bee7977d1513708e57a6bc9fc8fab93c591b5950f0ce
  • Pointer size: 131 Bytes
  • Size of remote file: 746 kB
examples/Gursoy Class Notes_ Accessibility Sandbox.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5b7a6494d07f714db7d268d1301382a899f24a2536660672ea12c5ab69ae2c9e
+ size 180760
examples/ocr.jpg ADDED

Git LFS Details

  • SHA256: 339d7b11d51ecaa10db3ab721b0d8bbeb03aed60109bc42760089013924fb7d6
  • Pointer size: 131 Bytes
  • Size of remote file: 281 kB
requirements.txt ADDED
@@ -0,0 +1,13 @@
+ torch==2.6.0
+ transformers==4.46.3
+ tokenizers==0.20.3
+ accelerate
+ einops
+ addict
+ easydict
+ torchvision
+ flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
+ PyMuPDF
+ hf_transfer
+ markdown
+ pymdown-extensions