ricklon and Claude Sonnet 4.6 committed
Commit 25ba1bf · 0 Parent(s)

Initial commit — DeepSeek-OCR-2 Math Rendering Edition


MathJax rendering, ZeroGPU support, updated examples and docs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

.gitattributes ADDED
@@ -0,0 +1,43 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ ocr.jpg filter=lfs diff=lfs merge=lfs -text
+ reachy-mini.jpg filter=lfs diff=lfs merge=lfs -text
+ examples/ocr.jpg filter=lfs diff=lfs merge=lfs -text
+ examples/reachy-mini.jpg filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
+ *.jpg filter=lfs diff=lfs merge=lfs -text
+ *.jpeg filter=lfs diff=lfs merge=lfs -text
+ *.pdf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,63 @@
+ ---
+ title: DeepSeek OCR 2 — Math Rendering Edition
+ emoji: 🧮
+ colorFrom: red
+ colorTo: blue
+ sdk: gradio
+ sdk_version: 6.8.0
+ app_file: app.py
+ pinned: true
+ short_description: DeepSeek-OCR-2 with MathJax math rendering
+ license: mit
+ python_version: "3.12"
+ ---
+
+ # DeepSeek-OCR-2 — Math Rendering Edition
+
+ Built on top of the excellent [DeepSeek-OCR-2 Demo](https://huggingface.co/spaces/merterbak/DeepSeek-OCR-2) by **Mert Erbak**. Many thanks for the clean foundation — the OCR pipeline, PDF support, bounding box visualisation, and grounding features are all his work.
+
+ ## What's new in this fork
+
+ - **MathJax rendering** — the Markdown Preview tab now renders LaTeX math notation (inline `$...$` and display `$$...$$`) using MathJax 3, so equations from scanned papers and textbooks display as proper math rather than raw LaTeX source.
+
+ ## Features (inherited + extended)
+
+ | Feature | Description |
+ |---|---|
+ | 📋 Markdown | Convert documents to structured markdown with layout detection |
+ | 📝 Free OCR | Simple text extraction without layout analysis |
+ | 📍 Locate | Find and highlight specific text or elements with bounding boxes |
+ | 🔍 Describe | General image description |
+ | ✏️ Custom | Provide your own prompt |
+ | 🧮 Math Preview | Rendered MathJax output for equations and formulas *(new)* |
+
+ ## Model
+
+ Uses `deepseek-ai/DeepSeek-OCR-2` with DeepEncoder v2. Achieves **91.09% on OmniDocBench** (+3.73% over v1).
+
+ Configuration: 1024 base + 768 patches with dynamic cropping (2–6 patches). 144 tokens per patch + 256 base tokens.
+
+ ## How it works
+
+ The model processes images and PDFs using a prompt-based interface with special tokens that control its behaviour:
+
+ - **`<image>`** — replaced at inference time with visual patch embeddings from the input
+ - **`<|grounding|>`** — activates layout detection; the model then annotates every element it finds with a label and bounding box coordinates
+ - **`<|ref|>label<|/ref|><|det|>[[x1,y1,x2,y2]]<|/det|>`** — the format the model uses to output detected regions
+
+ When grounding is active, the model self-labels regions as `title`, `text`, `image`, `table`, etc. Regions labelled `image` are automatically cropped out and appear in the **Cropped Images** tab. All regions get bounding boxes drawn in the **Boxes** tab.
+
+ See [TECHNICAL.md](TECHNICAL.md) for a full breakdown of the pipeline, including some non-obvious implementation details.
+
+ ## Running locally
+
+ ```bash
+ python3 -m venv .venv
+ source .venv/bin/activate
+ pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
+ pip install -r requirements.txt
+ pip install gradio spaces markdown pymdown-extensions
+ python app.py
+ ```
+
+ Requires a CUDA-capable GPU. The model is downloaded from HuggingFace on first run.
TECHNICAL.md ADDED
@@ -0,0 +1,432 @@
+ # Technical Documentation
+
+ This document covers the implementation details of the DeepSeek-OCR-2 Math Rendering Edition. It is intended for developers who want to understand, extend, or debug the pipeline.
+
+ ---
+
+ ## Table of Contents
+
+ 1. [Architecture Overview](#architecture-overview)
+ 2. [Prompts and Special Tokens](#prompts-and-special-tokens)
+ 3. [Grounding and Layout Detection](#grounding-and-layout-detection)
+ 4. [Figure and Graph Extraction](#figure-and-graph-extraction)
+ 5. [stdout Capture Pattern](#stdout-capture-pattern)
+ 6. [PDF Rendering](#pdf-rendering) — image conversion, 300 DPI rationale, one page at a time, digital vs scanned
+ 7. [Dual-pass Output Cleaning](#dual-pass-output-cleaning)
+ 8. [Bounding Box Rendering](#bounding-box-rendering)
+ 9. [Math Rendering Pipeline](#math-rendering-pipeline)
+ 10. [Known Quirks and Workarounds](#known-quirks-and-workarounds)
+ 11. [VRAM Usage and Quantized Models](#vram-usage-and-quantized-models)
+
+ ---
+
+ ## Architecture Overview
+
+ ```
+ User input (image or PDF)
+         │
+         ▼
+ PDF? ─── fitz renders page at 300 DPI ──► PIL Image
+ No?  ─── PIL Image directly
+         │
+         ▼
+ model.infer() called with prompt + image path
+         │
+         ▼  (stdout captured)
+ Raw model output (text + grounding tokens)
+         │
+         ├──► clean_output(include_images=False) ──► Text tab
+         │
+         ├──► clean_output(include_images=True)
+         │        │
+         │        ▼
+         │    embed_images() ──► Markdown string with base64 figures
+         │        │
+         │        ▼
+         │    to_math_html() ──► HTML with MathJax ──► Markdown Preview tab
+         │
+         ├──► extract_grounding_references()
+         │        │
+         │        ▼
+         │    draw_bounding_boxes() ──► Boxes tab
+         │        └──► crops ──► Cropped Images tab
+         │
+         └──► raw result ──► Raw Text tab
+ ```
+
+ ---
+
+ ## Prompts and Special Tokens
+
+ Each task sends a different prompt to the model. The prompt controls both what the model outputs and whether it performs layout detection.
+
+ | Task | Prompt | Grounding |
+ |---|---|---|
+ | Markdown | `<image>\n<\|grounding\|>Convert the document to markdown.` | Yes |
+ | Free OCR | `<image>\nFree OCR.` | No |
+ | Locate | `<image>\nLocate <\|ref\|>text<\|/ref\|> in the image.` | Yes |
+ | Describe | `<image>\nDescribe this image in detail.` | No |
+ | Custom | User-defined | Optional |
+
+ ### Special tokens
+
+ | Token | Purpose |
+ |---|---|
+ | `<image>` | Replaced at inference time with visual patch embeddings from the input image |
+ | `<\|grounding\|>` | Activates layout detection mode — the model annotates every detected region with a label and bounding box |
+ | `<\|ref\|>label<\|/ref\|>` | Wraps the label of a detected region (e.g. `title`, `text`, `image`, `table`) |
+ | `<\|det\|>coords<\|/det\|>` | Wraps the bounding box coordinates for that region |
+
+ ### Locate task
+
+ When using Locate, the user's input is embedded directly into the prompt:
+
+ ```python
+ prompt = f"<image>\nLocate <|ref|>{custom_prompt.strip()}<|/ref|> in the image."
+ ```
+
+ This asks the model to find a specific string or element and return its bounding box coordinates.
+
+ ---
+
+ ## Grounding and Layout Detection
+
+ When `<|grounding|>` is present, the model interleaves its text output with region annotations. A typical raw output looks like:
+
+ ```
+ # Introduction
+ <|ref|>title<|/ref|><|det|>[[45, 12, 820, 48]]<|/det|>
+
+ This paper presents a method for...
+ <|ref|>text<|/ref|><|det|>[[45, 60, 820, 340]]<|/det|>
+
+ <|ref|>image<|/ref|><|det|>[[45, 360, 820, 680]]<|/det|>
+
+ | A | B |
+ |---|---|
+ <|ref|>table<|/ref|><|det|>[[45, 700, 820, 900]]<|/det|>
+ ```
+
+ The labels (`title`, `text`, `image`, `table`) are part of the model's training vocabulary — the model assigns them based on what it detects, not from any hardcoded list in the app.
+
+ The regex that parses this is:
+
+ ```python
+ pattern = r'(<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>)'
+ ```
+
+ This returns a list of tuples: `(full_match, label, coordinates_string)`.
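+
+ A minimal sketch of what the parsed tuples look like (the sample string is made up, but follows the raw output format shown above):
+
+ ```python
+ import re
+
+ raw = "# Introduction\n<|ref|>title<|/ref|><|det|>[[45, 12, 820, 48]]<|/det|>"
+ pattern = r'(<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>)'
+
+ refs = re.findall(pattern, raw, re.DOTALL)
+ # refs[0] == ('<|ref|>title<|/ref|><|det|>[[45, 12, 820, 48]]<|/det|>',
+ #             'title', '[[45, 12, 820, 48]]')
+ ```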
+
+ ### Coordinate system
+
+ Bounding box coordinates are normalised to a **0–999 scale**, not pixel coordinates. The app scales them back at render time:
+
+ ```python
+ x1 = int(box[0] / 999 * img_w)
+ y1 = int(box[1] / 999 * img_h)
+ ```
+
+ This means coordinates are resolution-independent — the same model output works regardless of the original image size.
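+
+ For completeness, a sketch of the full mapping for one box (the `scale_box` helper name is ours — app.py inlines this arithmetic):
+
+ ```python
+ def scale_box(box, img_w, img_h):
+     """Map a model box on the 0-999 grid to pixel coordinates."""
+     x1, y1, x2, y2 = box
+     return (int(x1 / 999 * img_w), int(y1 / 999 * img_h),
+             int(x2 / 999 * img_w), int(y2 / 999 * img_h))
+
+ scale_box([45, 12, 820, 48], img_w=2550, img_h=3300)  # -> (114, 39, 2093, 158)
+ ```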
+
+ ---
+
+ ## Figure and Graph Extraction
+
+ Graph and figure extraction is a side effect of bounding box processing. Inside `draw_bounding_boxes()`:
+
+ ```python
+ if extract_images and label == 'image':
+     crops.append(image.crop((x1, y1, x2, y2)))
+ ```
+
+ Only regions the model labels as `'image'` are cropped. Text blocks, titles, and tables get bounding boxes drawn but are not extracted.
+
+ These crops are then:
+ 1. Added to the **Cropped Images** gallery tab
+ 2. Base64-encoded and embedded into the markdown as `![Figure N](data:image/png;base64,...)` by `embed_images()`, so they appear inline in the **Markdown Preview** tab
+
+ ---
+
+ ## stdout Capture Pattern
+
+ The model's `infer()` method was designed as a CLI tool — it `print()`s its output rather than returning it. The app captures this by temporarily replacing `sys.stdout`:
+
+ ```python
+ stdout = sys.stdout
+ sys.stdout = StringIO()
+
+ model.infer(...)
+
+ raw = sys.stdout.getvalue()
+ sys.stdout = stdout
+ ```
+
+ The model also prints internal diagnostics alongside the actual output. These are filtered out by checking for known debug strings:
+
+ ```python
+ debug_filters = ['PATCHES', '====', 'BASE:', 'directly resize',
+                  'NO PATCHES', 'torch.Size', '%|']
+
+ result = '\n'.join([
+     l for l in raw.split('\n')
+     if l.strip() and not any(s in l for s in debug_filters)
+ ])
+ ```
+
+ If inference ever produces unexpected empty output, checking what the model is printing to stdout (by temporarily removing the capture) is the first debugging step.
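+
+ A tidier variant of the same capture (a sketch, not what app.py currently does) uses `contextlib.redirect_stdout`, which restores stdout even if `infer()` raises:
+
+ ```python
+ from contextlib import redirect_stdout
+ from io import StringIO
+
+ def captured_infer(model, **infer_kwargs):
+     """Run model.infer() and return everything it printed."""
+     buf = StringIO()
+     with redirect_stdout(buf):  # stdout restored on exit, even on exceptions
+         model.infer(**infer_kwargs)
+     return buf.getvalue()
+ ```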
+
+ ---
+
+ ## PDF Rendering
+
+ ### PDFs are converted to images — the text layer is never read
+
+ The app does not extract embedded text from PDFs. Every page is rasterised to a PNG image first, then passed through the exact same pipeline as a directly uploaded image:
+
+ ```python
+ def process_pdf(path, task, custom_prompt, page_num):
+     doc = fitz.open(path)
+     page = doc.load_page(page_num - 1)
+     pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72), alpha=False)
+     img = Image.open(BytesIO(pix.tobytes("png")))
+     doc.close()
+     return process_image(img, task, custom_prompt)
+ ```
+
+ This means the model reads pixels, not characters. It has no access to the PDF's internal text layer, font metadata, or document structure.
+
+ ### Why 300 DPI
+
+ PDF geometry is natively specified at 72 units per inch, so the `fitz.Matrix(300/72, 300/72)` call scales the render up by ~4.17× (see the sketch after this list):
+
+ - At 72 DPI, small text, subscripts, superscripts, and fine math symbols are too coarse for the model to read reliably
+ - At 300 DPI, characters are sharp enough for accurate OCR even at small point sizes
+ - 300 DPI is the standard used by document scanners for archival quality
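+
+ As a quick sanity check, a sketch (the input filename is hypothetical) that prints the pixel size a page is rendered at; a US Letter page (8.5 × 11 in) comes out at 2550 × 3300 px:
+
+ ```python
+ import fitz  # PyMuPDF
+
+ doc = fitz.open("example.pdf")  # hypothetical input
+ page = doc.load_page(0)
+ zoom = 300 / 72  # ~4.17x over the native 72-unit grid
+ pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
+ print(pix.width, pix.height)  # US Letter: 2550 3300
+ doc.close()
+ ```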
+
+ ### One page at a time
+
+ The current implementation processes one page per submission. There is no batch mode. For a multi-page document the user selects a page number, submits, then moves to the next page.
+
+ The page selector UI is only shown when a PDF is uploaded:
+
+ ```python
+ def update_page_selector(file_path):
+     if file_path.lower().endswith('.pdf'):
+         page_count = get_pdf_page_count(file_path)
+         return gr.update(visible=True, maximum=page_count, value=1, minimum=1)
+     return gr.update(visible=False)
+ ```
+
+ ### Digital vs scanned PDFs
+
+ Both go through the same rasterisation path:
+
+ | PDF type | What's inside | Result |
+ |---|---|---|
+ | Digital (text-based) | Vector fonts and geometry | PyMuPDF re-rasterises from vectors — output is perfectly sharp at any DPI |
+ | Scanned | Embedded raster images | PyMuPDF extracts the raster — output quality depends on the original scan resolution |
+
+ For scanned PDFs with low source resolution (e.g. 150 DPI originals), upscaling to 300 DPI will not recover detail that was never there. In those cases inference accuracy may be lower than with high-quality digital PDFs.
+
+ ---
+
+ ## Dual-pass Output Cleaning
+
+ `clean_output()` is called twice on the same raw result to produce two different outputs (a worked example follows the lists below):
+
+ ```python
+ cleaned = clean_output(result, include_images=False)   # → Text tab
+ markdown = clean_output(result, include_images=True)   # → Markdown Preview
+ ```
+
+ With `include_images=False`:
+ - Grounding tokens are stripped
+ - `<|ref|>image<|/ref|>` regions are removed entirely
+ - Result is clean plain text
+
+ With `include_images=True`:
+ - Text grounding tokens are stripped
+ - `<|ref|>image<|/ref|>` regions are replaced with `**[Figure N]**` placeholders
+ - `embed_images()` then swaps those placeholders for actual base64-encoded PNGs
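+
+ A worked example of the two passes on a made-up raw string (note that `from app import clean_output` would also trigger the model download at import time, so for experiments it is easier to copy the function into a scratch file):
+
+ ```python
+ raw = (
+     "# Title\n"
+     "<|ref|>title<|/ref|><|det|>[[45, 12, 820, 48]]<|/det|>\n"
+     "Body text.\n"
+     "<|ref|>image<|/ref|><|det|>[[45, 360, 820, 680]]<|/det|>"
+ )
+
+ print(clean_output(raw, include_images=False))
+ # # Title
+ # Body text.
+
+ print(clean_output(raw, include_images=True))
+ # # Title
+ # Body text.
+ #
+ #
+ # **[Figure 1]**
+ ```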
+
+ ---
+
+ ## Bounding Box Rendering
+
+ Bounding boxes are drawn in two layers using Pillow:
+
+ 1. **Solid outline** — drawn directly on a copy of the image
+ 2. **Semi-transparent fill** — drawn on a separate RGBA overlay, then composited
+
+ ```python
+ overlay = Image.new('RGBA', img_draw.size, (0, 0, 0, 0))
+ # ... draw filled rectangles on overlay with alpha=60 ...
+ img_draw.paste(overlay, (0, 0), overlay)
+ ```
+
+ The alpha value of 60 (out of 255) gives a ~24% opacity fill, keeping the underlying content readable.
+
+ ### Colour assignment
+
+ Each unique label gets a random RGB colour, generated once per session:
+
+ ```python
+ np.random.seed(42)
+ color_map[label] = (
+     np.random.randint(50, 255),
+     np.random.randint(50, 255),
+     np.random.randint(50, 255)
+ )
+ ```
+
+ The seed is fixed at 42, so the colour sequence is deterministic across runs; which colour a given label receives depends only on the order in which labels are first encountered in the output. The lower bound of 50 avoids near-black colours that are hard to see.
+
+ Title regions get a thicker outline (width=5) than other regions (width=3) to give them visual prominence.
+
+ ---
+
+ ## Math Rendering Pipeline
+
+ Getting LaTeX from the model to display correctly in the browser involves three components working together.
+
+ ### The markdown/math conflict
+
+ Standard markdown processors treat `_` and `*` as emphasis markers. Raw LaTeX like `$a_1 + a_2^*$` would be mangled before MathJax ever sees it.
+
+ The solution is `pymdownx.arithmatex` — a markdown extension that extracts math expressions **before** markdown processing, processes the surrounding text, then reinserts the math wrapped in MathJax-compatible delimiters:
+
+ ```
+ Input: Some text with $a_1 + a_2$ inline.
+
+ After arithmatex + markdown:
+ <p>Some text with <span class="arithmatex">\(a_1 + a_2\)</span> inline.</p>
+ ```
+
+ The `_` inside the math is never touched by the markdown processor.
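+
+ This is easy to verify directly (assuming `markdown` and `pymdown-extensions` are installed, as requirements.txt specifies):
+
+ ```python
+ import markdown as md_lib
+
+ src = "Emphasis _works_ here, but $a_1 + a_2^*$ stays intact."
+ html = md_lib.markdown(
+     src,
+     extensions=['pymdownx.arithmatex'],
+     extension_configs={'pymdownx.arithmatex': {'generic': True}},
+ )
+ print(html)
+ # (approximately)
+ # <p>Emphasis <em>works</em> here, but
+ # <span class="arithmatex">\(a_1 + a_2^*\)</span> stays intact.</p>
+ ```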
+
+ ### MathJax configuration
+
+ MathJax is loaded once in the page `<head>` and configured to process `\(...\)` for inline math and `\[...\]` for display math — matching the output format of arithmatex:
+
+ ```javascript
+ window.MathJax = {
+     tex: {
+         inlineMath: [['\\(', '\\)']],
+         displayMath: [['\\[', '\\]']],
+         processEscapes: true,
+         tags: 'ams'
+     }
+ };
+ ```
+
+ `tags: 'ams'` enables automatic equation numbering for `align`, `equation`, and similar environments.
+
+ ### Re-typesetting on update
+
+ MathJax normally typesets once at page load (the app disables even that with `startup: { typeset: false }`, since the preview is empty at load time). When Gradio swaps new content into the HTML component, MathJax has to be told explicitly to typeset it:
+
+ ```python
+ submit_event.then(fn=None, js="""() => {
+     const tryTypeset = () => {
+         if (!window.MathJax || !MathJax.typesetPromise) { setTimeout(tryTypeset, 100); return; }
+         const el = document.querySelector('.math-preview');
+         if (!el) return;
+         MathJax.typesetClear([el]);
+         MathJax.typesetPromise([el]);
+     };
+     setTimeout(tryTypeset, 100);
+ }""")
+ ```
+
+ The initial 100 ms delay gives Gradio time to finish updating the DOM; the retry loop covers the case where the MathJax script itself has not finished loading, and `typesetClear` discards stale typeset state from the previous result before re-rendering.
+
+ ---
+
+ ## Known Quirks and Workarounds
+
+ ### `\coloneqq` and `\eqqcolon`
+
+ These LaTeX commands (`≔` and `=:`) from the `mathtools` package appear frequently in academic papers but are not available in MathJax's default TeX configuration. Rather than loading the full `mathtools` package, they are substituted at the text level:
+
+ ```python
+ text = text.replace('\\coloneqq', ':=').replace('\\eqqcolon', '=:')
+ ```
+
+ If you need proper rendering of these symbols, load the extension instead: add `loader: { load: ['[tex]/mathtools'] }` to the MathJax configuration and include `'mathtools'` in `tex.packages` (the extension ships with MathJax 3.2+).
+
+ ### Flash Attention initialisation warning
+
+ On startup you will see:
+
+ ```
+ You are attempting to use Flash Attention 2.0 with a model not initialized on GPU.
+ ```
+
+ This is because the model loads onto CPU first (`from_pretrained`) then moves to GPU (`.cuda()`). Flash Attention 2 prefers direct GPU initialisation. The warning is harmless — inference works correctly. To silence it locally, add `device_map="cuda"` to the `from_pretrained` call (not an option on ZeroGPU, where no GPU is attached at load time).
+
+ ### Model type mismatch warning
+
+ ```
+ You are using a model of type deepseek_vl_v2 to instantiate a model of type DeepseekOCR2.
+ ```
+
+ The model's config file on HuggingFace declares `model_type: deepseek_vl_v2` but the custom code registers a `DeepseekOCR2` class. Because `trust_remote_code=True` is set, the correct class is loaded regardless. The warning can be ignored.
+
+ ### `eval()` on model output
+
+ Bounding box coordinates are parsed with Python's `eval()`:
+
+ ```python
+ coords = eval(ref[2])
+ ```
+
+ The model outputs coordinates as a Python list literal, e.g. `[[45, 12, 820, 48]]`. This works because the model runs locally and its output is well-formed in practice, but `eval()` will execute anything that parses, so it is worth revisiting if the architecture ever changes to process untrusted model output.
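+
+ A drop-in hardening (a sketch): `ast.literal_eval` accepts only Python literals, so anything else raises instead of executing:
+
+ ```python
+ import ast
+
+ def parse_coords(s):
+     """Parse '[[x1, y1, x2, y2], ...]' without executing arbitrary code."""
+     coords = ast.literal_eval(s)
+     # minimal shape check: a list of 4-element boxes
+     if not all(isinstance(b, (list, tuple)) and len(b) == 4 for b in coords):
+         raise ValueError(f"unexpected box format: {s!r}")
+     return coords
+ ```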
+
+ ### Zone.Identifier files in examples/
+
+ Files copied from Windows to WSL2 may have accompanying `.Zone.Identifier` metadata files (e.g. `image.png:Zone.Identifier`). These are Windows security zone markers and are harmless — Gradio ignores them when loading examples.
+
+ ---
+
+ ## VRAM Usage and Quantized Models
+
+ ### The 8GB problem
+
+ The full-precision BF16 model (`deepseek-ai/DeepSeek-OCR-2`) consumes approximately **7.9GB of VRAM** on load, leaving only ~250MB free on an 8GB GPU (e.g. RTX 3070). This headroom is insufficient for inference on complex documents:
+
+ - Each patch adds tokens to the KV cache and activations
+ - A 6-patch document can exhaust the remaining VRAM
+ - When VRAM is full, the driver spills allocations to system RAM — which is 50–100× slower
+ - Symptom: GPU-Util drops to ~24%, power draw falls to ~47W (waiting on memory, not computing)
+
+ You can confirm this with `watch -n 1 nvidia-smi` during inference. Near-full VRAM with low GPU utilisation is the telltale sign.
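+
+ The same numbers can be logged from inside Python (a sketch using standard `torch.cuda` calls; call it before and after `model.infer(...)`):
+
+ ```python
+ import torch
+
+ def log_vram(tag):
+     """Print allocated/reserved/free VRAM in GiB for the current device."""
+     gib = 1024 ** 3
+     free, total = torch.cuda.mem_get_info()
+     print(f"[{tag}] allocated={torch.cuda.memory_allocated() / gib:.2f} GiB, "
+           f"reserved={torch.cuda.memory_reserved() / gib:.2f} GiB, "
+           f"free={free / gib:.2f}/{total / gib:.2f} GiB")
+ ```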
+
+ ### Quantized alternatives on HuggingFace
+
+ To switch models, change `MODEL_NAME` in `app.py`. Three options are available as of March 2026:
+
+ | Model | Format | VRAM | Notes |
+ |---|---|---|---|
+ | `deepseek-ai/DeepSeek-OCR-2` | BF16 (full) | ~8GB | Original, highest accuracy |
+ | `richarddavison/DeepSeek-OCR-2-FP8` | FP8 dynamic | ~3.5GB | ~50% reduction; requires Ampere GPU or newer (RTX 30xx qualifies); 3,750 downloads/mo |
+ | `mzbac/DeepSeek-OCR-2-8bit` | 8-bit | ~4GB | Same stack (torch 2.6, flash-attn 2.7.3, Python 3.12); explicitly supports dynamic resolution (0–6 patches); 140 downloads/mo |
+
+ **Not applicable to NVIDIA GPUs:**
+ - `mlx-community/DeepSeek-OCR-2-*` — Apple Silicon only (MLX framework)
+
+ **Not recommended:**
+ - `WHY2001/DeepSeek-OCR-4bit-Quantized` — 17 downloads/month, not well tested
+
+ ### What does not exist (as of March 2026)
+
+ - GGUF of DeepSeek-OCR-2 (GGUF repos on HuggingFace are for v1 only)
+ - GPTQ of DeepSeek-OCR-2
+ - AWQ of DeepSeek-OCR-2
+
+ ### Switching models
+
+ Change the single constant in `app.py` and restart:
+
+ ```python
+ # FP8 — recommended first try for 8GB GPUs
+ MODEL_NAME = 'richarddavison/DeepSeek-OCR-2-FP8'
+
+ # 8-bit — alternative with same toolchain
+ MODEL_NAME = 'mzbac/DeepSeek-OCR-2-8bit'
+ ```
+
+ The model will be downloaded from HuggingFace on first use and cached locally.
app.py ADDED
@@ -0,0 +1,420 @@
+ import gradio as gr
+ from transformers import AutoModel, AutoTokenizer
+ import torch
+ import spaces
+ import os
+ import sys
+ import tempfile
+ import shutil
+ from PIL import Image, ImageDraw, ImageFont, ImageOps
+ import fitz
+ import re
+ import numpy as np
+ import base64
+ import markdown as md_lib
+ from io import StringIO, BytesIO
+
+ # Model options — swap MODEL_NAME to reduce VRAM usage on GPUs with <= 8GB
+ #
+ # Full precision BF16 (~8GB VRAM) — original, highest accuracy
+ MODEL_NAME = 'deepseek-ai/DeepSeek-OCR-2'
+ #
+ # FP8 dynamic quantization (~3.5GB VRAM) — ~50% VRAM reduction, 3750 downloads/mo
+ # Requires Ampere GPU or newer (RTX 3070 is supported)
+ # MODEL_NAME = 'richarddavison/DeepSeek-OCR-2-FP8'
+ #
+ # 8-bit quantization (~4GB VRAM) — same stack (torch 2.6, flash-attn 2.7.3, py3.12)
+ # Explicitly supports dynamic resolution (0-6 patches), 140 downloads/mo
+ # MODEL_NAME = 'mzbac/DeepSeek-OCR-2-8bit'
+
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
+ model = AutoModel.from_pretrained(MODEL_NAME, _attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16, trust_remote_code=True, use_safetensors=True).eval()
+ # .cuda() is NOT called here — on ZeroGPU, GPU is only available inside @spaces.GPU
+ # functions. Locally, model.cuda() is called inside process_image on first run.
+
+ BASE_SIZE = 1024
+ IMAGE_SIZE = 768
+ CROP_MODE = True
+
+ TASK_PROMPTS = {
+     "📋 Markdown": {"prompt": "<image>\n<|grounding|>Convert the document to markdown.", "has_grounding": True},
+     "📝 Free OCR": {"prompt": "<image>\nFree OCR.", "has_grounding": False},
+     "📍 Locate": {"prompt": "<image>\nLocate <|ref|>text<|/ref|> in the image.", "has_grounding": True},
+     "🔍 Describe": {"prompt": "<image>\nDescribe this image in detail.", "has_grounding": False},
+     "✏️ Custom": {"prompt": "", "has_grounding": False}
+ }
+
+ def extract_grounding_references(text):
+     pattern = r'(<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>)'
+     return re.findall(pattern, text, re.DOTALL)
+
+ def draw_bounding_boxes(image, refs, extract_images=False):
+     img_w, img_h = image.size
+     img_draw = image.copy()
+     draw = ImageDraw.Draw(img_draw)
+     overlay = Image.new('RGBA', img_draw.size, (0, 0, 0, 0))
+     draw2 = ImageDraw.Draw(overlay)
+     try:
+         font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 15)
+     except OSError:
+         font = ImageFont.load_default()  # fallback when DejaVu is not installed
+     crops = []
+
+     color_map = {}
+     np.random.seed(42)
+
+     for ref in refs:
+         label = ref[1]
+         if label not in color_map:
+             color_map[label] = (np.random.randint(50, 255), np.random.randint(50, 255), np.random.randint(50, 255))
+
+         color = color_map[label]
+         coords = eval(ref[2])  # see TECHNICAL.md: "eval() on model output"
+         color_a = color + (60,)
+
+         for box in coords:
+             x1, y1, x2, y2 = int(box[0]/999*img_w), int(box[1]/999*img_h), int(box[2]/999*img_w), int(box[3]/999*img_h)
+
+             if extract_images and label == 'image':
+                 crops.append(image.crop((x1, y1, x2, y2)))
+
+             width = 5 if label == 'title' else 3
+             draw.rectangle([x1, y1, x2, y2], outline=color, width=width)
+             draw2.rectangle([x1, y1, x2, y2], fill=color_a)
+
+             text_bbox = draw.textbbox((0, 0), label, font=font)
+             tw, th = text_bbox[2] - text_bbox[0], text_bbox[3] - text_bbox[1]
+             ty = max(0, y1 - 20)
+             draw.rectangle([x1, ty, x1 + tw + 4, ty + th + 4], fill=color)
+             draw.text((x1 + 2, ty + 2), label, font=font, fill=(255, 255, 255))
+
+     img_draw.paste(overlay, (0, 0), overlay)
+     return img_draw, crops
+
+ def clean_output(text, include_images=False):
+     if not text:
+         return ""
+     pattern = r'(<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>)'
+     matches = re.findall(pattern, text, re.DOTALL)
+     img_num = 0
+
+     for match in matches:
+         if '<|ref|>image<|/ref|>' in match[0]:
+             if include_images:
+                 text = text.replace(match[0], f'\n\n**[Figure {img_num + 1}]**\n\n', 1)
+                 img_num += 1
+             else:
+                 text = text.replace(match[0], '', 1)
+         else:
+             # Non-image tags: drop the whole source line containing the tag
+             text = re.sub(rf'(?m)^[^\n]*{re.escape(match[0])}[^\n]*\n?', '', text)
+
+     text = text.replace('\\coloneqq', ':=').replace('\\eqqcolon', '=:')
+
+     return text.strip()
+
+ MATHJAX_HEAD = """
+ <script>
+ window.MathJax = {
+     tex: {
+         inlineMath: [['\\\\(', '\\\\)']],
+         displayMath: [['\\\\[', '\\\\]']],
+         processEscapes: true,
+         tags: 'ams'
+     },
+     options: {
+         skipHtmlTags: ['script', 'noscript', 'style', 'textarea', 'pre']
+     },
+     startup: {
+         typeset: false
+     }
+ };
+ </script>
+ <script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js" async></script>
+ <style>
+ .math-preview {
+     padding: 1.5em;
+     font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
+     font-size: 15px;
+     line-height: 1.8;
+     color: #1a1a1a;
+     max-width: 100%;
+     overflow-x: auto;
+ }
+ .math-preview h1 { font-size: 1.8em; font-weight: 700; margin: 1em 0 0.4em; border-bottom: 2px solid #e0e0e0; padding-bottom: 0.3em; }
+ .math-preview h2 { font-size: 1.4em; font-weight: 600; margin: 1em 0 0.4em; border-bottom: 1px solid #e0e0e0; padding-bottom: 0.2em; }
+ .math-preview h3 { font-size: 1.15em; font-weight: 600; margin: 0.9em 0 0.3em; }
+ .math-preview h4, .math-preview h5, .math-preview h6 { font-weight: 600; margin: 0.8em 0 0.3em; }
+ .math-preview p { margin: 0.6em 0; }
+ .math-preview ul, .math-preview ol { padding-left: 1.8em; margin: 0.5em 0; }
+ .math-preview li { margin: 0.25em 0; }
+ .math-preview table { border-collapse: collapse; width: 100%; margin: 1em 0; font-size: 0.95em; }
+ .math-preview th, .math-preview td { border: 1px solid #ccc; padding: 0.45em 0.75em; text-align: left; }
+ .math-preview th { background: #f2f2f2; font-weight: 600; }
+ .math-preview tr:nth-child(even) { background: #fafafa; }
+ .math-preview code { background: #f4f4f4; padding: 0.15em 0.4em; border-radius: 3px; font-family: 'Courier New', monospace; font-size: 0.88em; }
+ .math-preview pre { background: #f4f4f4; padding: 1em; border-radius: 5px; overflow-x: auto; margin: 0.8em 0; }
+ .math-preview pre code { background: none; padding: 0; }
+ .math-preview blockquote { border-left: 4px solid #ccc; margin: 0.8em 0; padding: 0.4em 1em; color: #555; background: #fafafa; }
+ .math-preview img { max-width: 100%; height: auto; display: block; margin: 0.8em 0; }
+ .math-preview .arithmatex { overflow-x: auto; }
+ .math-preview mjx-container[display="true"] { display: block; overflow-x: auto; padding: 0.5em 0; }
+ </style>
+ """
+
+ def to_math_html(text):
+     if not text:
+         return ""
+     html = md_lib.markdown(text, extensions=[
+         'pymdownx.arithmatex',
+         'tables',
+         'fenced_code',
+         'sane_lists',
+     ], extension_configs={
+         'pymdownx.arithmatex': {'generic': True}
+     })
+     return f'<div class="math-preview">{html}</div>'
+
+ def embed_images(markdown, crops):
+     if not crops:
+         return markdown
+     for i, img in enumerate(crops):
+         buf = BytesIO()
+         img.save(buf, format="PNG")
+         b64 = base64.b64encode(buf.getvalue()).decode()
+         markdown = markdown.replace(f'**[Figure {i + 1}]**', f'\n\n![Figure {i + 1}](data:image/png;base64,{b64})\n\n', 1)
+     return markdown
+
+ @spaces.GPU(duration=90)
+ def process_image(image, task, custom_prompt):
+     model.cuda()  # GPU is available here — works on ZeroGPU and locally
+     if image is None:
+         return "Error: Upload an image", "", "", None, []
+     if task in ["✏️ Custom", "📍 Locate"] and not custom_prompt.strip():
+         return "Please enter a prompt", "", "", None, []
+
+     if image.mode in ('RGBA', 'LA', 'P'):
+         image = image.convert('RGB')
+     image = ImageOps.exif_transpose(image)
+
+     if task == "✏️ Custom":
+         prompt = f"<image>\n{custom_prompt.strip()}"
+         has_grounding = '<|grounding|>' in custom_prompt
+     elif task == "📍 Locate":
+         prompt = f"<image>\nLocate <|ref|>{custom_prompt.strip()}<|/ref|> in the image."
+         has_grounding = True
+     else:
+         prompt = TASK_PROMPTS[task]["prompt"]
+         has_grounding = TASK_PROMPTS[task]["has_grounding"]
+
+     tmp = tempfile.NamedTemporaryFile(delete=False, suffix='.jpg')
+     image.save(tmp.name, 'JPEG', quality=95)
+     tmp.close()
+     out_dir = tempfile.mkdtemp()
+
+     stdout = sys.stdout
+     sys.stdout = StringIO()
+
+     model.infer(
+         tokenizer=tokenizer,
+         prompt=prompt,
+         image_file=tmp.name,
+         output_path=out_dir,
+         base_size=BASE_SIZE,
+         image_size=IMAGE_SIZE,
+         crop_mode=CROP_MODE,
+         save_results=False
+     )
+
+     debug_filters = ['PATCHES', '====', 'BASE:', 'directly resize', 'NO PATCHES', 'torch.Size', '%|']
+     result = '\n'.join([l for l in sys.stdout.getvalue().split('\n')
+                         if l.strip() and not any(s in l for s in debug_filters)]).strip()
+     sys.stdout = stdout
+
+     os.unlink(tmp.name)
+     shutil.rmtree(out_dir, ignore_errors=True)
+
+     if not result:
+         return "No text detected", "", "", None, []
+
+     cleaned = clean_output(result, False)
+     markdown = clean_output(result, True)
+
+     img_out = None
+     crops = []
+
+     if has_grounding and '<|ref|>' in result:
+         refs = extract_grounding_references(result)
+         if refs:
+             img_out, crops = draw_bounding_boxes(image, refs, True)
+
+     markdown = embed_images(markdown, crops)
+
+     return cleaned, markdown, result, img_out, crops
+
+ @spaces.GPU(duration=90)
+ def process_pdf(path, task, custom_prompt, page_num):
+     doc = fitz.open(path)
+     total_pages = len(doc)
+     if page_num < 1 or page_num > total_pages:
+         doc.close()
+         return f"Invalid page number. PDF has {total_pages} pages.", "", "", None, []
+     page = doc.load_page(page_num - 1)
+     pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72), alpha=False)
+     img = Image.open(BytesIO(pix.tobytes("png")))
+     doc.close()
+
+     return process_image(img, task, custom_prompt)
+
+ def process_file(path, task, custom_prompt, page_num):
+     if not path:
+         return "Error: Upload a file", "", "", None, []
+     if path.lower().endswith('.pdf'):
+         return process_pdf(path, task, custom_prompt, page_num)
+     else:
+         return process_image(Image.open(path), task, custom_prompt)
+
+ def toggle_prompt(task):
+     if task == "✏️ Custom":
+         return gr.update(visible=True, label="Custom Prompt", placeholder="Add <|grounding|> for bounding boxes")
+     elif task == "📍 Locate":
+         return gr.update(visible=True, label="Text to Locate", placeholder="Enter text to locate")
+     return gr.update(visible=False)
+
+ def select_boxes(task):
+     if task == "📍 Locate":
+         return gr.update(selected="tab_boxes")
+     return gr.update()
+
+ def get_pdf_page_count(file_path):
+     if not file_path or not file_path.lower().endswith('.pdf'):
+         return 1
+     doc = fitz.open(file_path)
+     count = len(doc)
+     doc.close()
+     return count
+
+ def load_image(file_path, page_num=1):
+     if not file_path:
+         return None
+     if file_path.lower().endswith('.pdf'):
+         doc = fitz.open(file_path)
+         page_idx = max(0, min(int(page_num) - 1, len(doc) - 1))
+         page = doc.load_page(page_idx)
+         pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72), alpha=False)
+         img = Image.open(BytesIO(pix.tobytes("png")))
+         doc.close()
+         return img
+     else:
+         return Image.open(file_path)
+
+ def update_page_selector(file_path):
+     if not file_path:
+         return gr.update(visible=False)
+     if file_path.lower().endswith('.pdf'):
+         page_count = get_pdf_page_count(file_path)
+         return gr.update(visible=True, maximum=page_count, value=1, minimum=1,
+                          label=f"Select Page (1-{page_count})")
+     return gr.update(visible=False)
+
+ with gr.Blocks(title="DeepSeek-OCR-2", head=MATHJAX_HEAD, theme=gr.themes.Soft()) as demo:
+     gr.Markdown("""
+     # 🧮 DeepSeek-OCR-2 — Math Rendering Edition
+     **Convert documents to markdown, extract text, parse figures, and locate specific content with bounding boxes.**
+     **Model uses DeepEncoder v2 and achieves 91.09% on OmniDocBench (+3.73% over v1).**
+
+     Built on the original [DeepSeek-OCR-2 Demo](https://huggingface.co/spaces/merterbak/DeepSeek-OCR-2) by **Mert Erbak** — thank you for the excellent foundation.
+     This fork adds **MathJax rendering** in the Markdown Preview tab so that equations from scanned papers and textbooks display as proper math notation.
+     """)
+
+     with gr.Row():
+         with gr.Column(scale=1):
+             file_in = gr.File(label="Upload Image or PDF", file_types=["image", ".pdf"], type="filepath")
+             input_img = gr.Image(label="Input Image", type="pil", height=300)
+             page_selector = gr.Number(label="Select Page", value=1, minimum=1, step=1, visible=False)
+             task = gr.Dropdown(list(TASK_PROMPTS.keys()), value="📋 Markdown", label="Task")
+             prompt = gr.Textbox(label="Prompt", lines=2, visible=False)
+             btn = gr.Button("Extract", variant="primary", size="lg")
+
+         with gr.Column(scale=2):
+             with gr.Tabs() as tabs:
+                 with gr.Tab("Text", id="tab_text"):
+                     text_out = gr.Textbox(lines=20, buttons=["copy"], show_label=False)
+                 with gr.Tab("Markdown Preview", id="tab_markdown"):
+                     md_out = gr.HTML("")
+                 with gr.Tab("Boxes", id="tab_boxes"):
+                     img_out = gr.Image(type="pil", height=500, show_label=False)
+                 with gr.Tab("Cropped Images", id="tab_crops"):
+                     gallery = gr.Gallery(show_label=False, columns=3, height=400)
+                 with gr.Tab("Raw Text", id="tab_raw"):
+                     raw_out = gr.Textbox(lines=20, buttons=["copy"], show_label=False)
+
+     with gr.Accordion("Image Examples", open=True):
+         gr.Examples(
+             examples=[
+                 ["examples/2022-0922 Section 13 Notes.png", "📋 Markdown", ""],
+                 ["examples/2022-0922 Section 14 Notes.png", "📋 Markdown", ""],
+                 ["examples/2022-0922 Section 15 Notes.png", "📋 Markdown", ""],
+             ],
+             inputs=[input_img, task, prompt],
+             cache_examples=False
+         )
+
+     with gr.Accordion("PDF Examples", open=True):
+         gr.Examples(
+             examples=[
+                 ["examples/Gursoy Class Notes_ Accessibility Sandbox.pdf", "📋 Markdown", ""],
+             ],
+             inputs=[file_in, task, prompt],
+             cache_examples=False
+         )
+
+     with gr.Accordion("ℹ️ Info", open=False):
+         gr.Markdown("""
+         ### Configuration
+         1024 base + 768 patches with dynamic cropping (2-6 patches). 144 tokens per patch + 256 base tokens.
+
+         ### Tasks
+         - **Markdown**: Convert document to structured markdown with layout detection (grounding ✅)
+         - **Free OCR**: Simple text extraction without layout
+         - **Locate**: Find and highlight specific text/elements in image (grounding ✅)
+         - **Describe**: General image description
+         - **Custom**: Your own prompt
+
+         ### Special Tokens
+         - `<image>` - Placeholder where visual tokens are inserted
+         - `<|grounding|>` - Enables layout detection with bounding boxes
+         - `<|ref|>text<|/ref|>` - Reference text to locate in the image
+         """)
+
+     file_in.change(load_image, [file_in, page_selector], [input_img])
+     file_in.change(update_page_selector, [file_in], [page_selector])
+     page_selector.change(load_image, [file_in, page_selector], [input_img])
+     task.change(toggle_prompt, [task], [prompt])
+     task.change(select_boxes, [task], [tabs])
+
+     def run(image, file_path, task, custom_prompt, page_num):
+         if file_path:
+             cleaned, markdown, raw, img_out, crops = process_file(file_path, task, custom_prompt, int(page_num))
+         elif image is not None:
+             cleaned, markdown, raw, img_out, crops = process_image(image, task, custom_prompt)
+         else:
+             return "Error: Upload a file or image", "", "", None, []
+         return cleaned, to_math_html(markdown), raw, img_out, crops
+
+     submit_event = btn.click(run, [input_img, file_in, task, prompt, page_selector],
+                              [text_out, md_out, raw_out, img_out, gallery])
+     submit_event.then(select_boxes, [task], [tabs])
+     submit_event.then(fn=None, js="""() => {
+         const tryTypeset = () => {
+             if (!window.MathJax || !MathJax.typesetPromise) { setTimeout(tryTypeset, 100); return; }
+             const el = document.querySelector('.math-preview');
+             if (!el) return;
+             MathJax.typesetClear([el]);
+             MathJax.typesetPromise([el]);
+         };
+         setTimeout(tryTypeset, 100);
+     }""")
+
+ if __name__ == "__main__":
+     # server_name="0.0.0.0" is needed locally (WSL2 → Windows access)
+     # On HuggingFace Spaces, SPACE_ID is set and Gradio handles binding automatically
+     local = not os.environ.get("SPACE_ID")
+     demo.queue(max_size=20).launch(server_name="0.0.0.0" if local else None)
examples/2022-0922 Section 13 Notes.png ADDED

Git LFS Details

  • SHA256: e344e03a5967c604e2ce4ddb99ec5ab8d4939b6b692785be461213bad7e6c067
  • Pointer size: 131 Bytes
  • Size of remote file: 500 kB
examples/2022-0922 Section 14 Notes.png ADDED

Git LFS Details

  • SHA256: 7b27aec83556e709fff5f027155ba2a7f76a349cf0ca23334b9c864f5bbfcaf2
  • Pointer size: 131 Bytes
  • Size of remote file: 482 kB
examples/2022-0922 Section 15 Notes.png ADDED

Git LFS Details

  • SHA256: 9bbe0459e0d2035da455bee7977d1513708e57a6bc9fc8fab93c591b5950f0ce
  • Pointer size: 131 Bytes
  • Size of remote file: 746 kB
examples/Gursoy Class Notes_ Accessibility Sandbox.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5b7a6494d07f714db7d268d1301382a899f24a2536660672ea12c5ab69ae2c9e
+ size 180760
examples/ocr.jpg ADDED

Git LFS Details

  • SHA256: 339d7b11d51ecaa10db3ab721b0d8bbeb03aed60109bc42760089013924fb7d6
  • Pointer size: 131 Bytes
  • Size of remote file: 281 kB
requirements.txt ADDED
@@ -0,0 +1,13 @@
+ torch==2.6.0
+ transformers==4.46.3
+ tokenizers==0.20.3
+ accelerate
+ einops
+ addict
+ easydict
+ torchvision
+ flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
+ PyMuPDF
+ hf_transfer
+ markdown
+ pymdown-extensions