Technical Documentation
This document covers the implementation details of the DeepSeek-OCR-2 Math Rendering Edition. It is intended for developers who want to understand, extend, or debug the pipeline.
Table of Contents
- Architecture Overview
- Prompts and Special Tokens
- Grounding and Layout Detection
- Figure and Graph Extraction
- stdout Capture Pattern
- PDF Rendering
- Dual-pass Output Cleaning
- Bounding Box Rendering
- Math Rendering Pipeline
- Known Quirks and Workarounds
- VRAM Usage and Quantized Models
Architecture Overview
User input (image or PDF)
│
▼
PDF? ─── fitz renders page at 300 DPI ──► PIL Image
No? ─── PIL Image directly
│
▼
model.infer() called with prompt + image path
│
▼ (stdout captured)
Raw model output (text + grounding tokens)
│
├──► clean_output(include_images=False) ──► Text tab
│
├──► clean_output(include_images=True)
│ │
│ ▼
│ embed_images() ──► Markdown string with base64 figures
│ │
│ ▼
│ to_math_html() ──► HTML with MathJax ──► Markdown Preview tab
│
├──► extract_grounding_references()
│ │
│ ▼
│ draw_bounding_boxes() ──► Boxes tab
│ └──► crops ──► Cropped Images tab
│
└──► raw result ──► Raw Text tab
Prompts and Special Tokens
Each task sends a different prompt to the model. The prompt controls both what the model outputs and whether it performs layout detection.
| Task | Prompt | Grounding |
|---|---|---|
| Markdown | <image>\n<|grounding|>Convert the document to markdown. | Yes |
| Free OCR | <image>\nFree OCR. | No |
| Locate | <image>\nLocate <|ref|>text<|/ref|> in the image. | Yes |
| Describe | <image>\nDescribe this image in detail. | No |
| Custom | User-defined | Optional |
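For reference, the task-to-prompt dispatch could be sketched as a simple mapping. The names TASK_PROMPTS and build_prompt are illustrative, not the app's actual identifiers:

```python
# Illustrative sketch of the task-to-prompt mapping described above.
TASK_PROMPTS = {
    "Markdown": "<image>\n<|grounding|>Convert the document to markdown.",
    "Free OCR": "<image>\nFree OCR.",
    "Describe": "<image>\nDescribe this image in detail.",
}

def build_prompt(task: str, custom_prompt: str = "") -> str:
    """Return the prompt for a task; Locate embeds the user's query."""
    if task == "Locate":
        return f"<image>\nLocate <|ref|>{custom_prompt.strip()}<|/ref|> in the image."
    if task == "Custom":
        return f"<image>\n{custom_prompt.strip()}"
    return TASK_PROMPTS[task]
```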
Special tokens
| Token | Purpose |
|---|---|
| <image> | Replaced at inference time with visual patch embeddings from the input image |
| <|grounding|> | Activates layout detection mode: the model annotates every detected region with a label and bounding box |
| <|ref|>label<|/ref|> | Wraps the label of a detected region (e.g. title, text, image, table) |
| <|det|>coords<|/det|> | Wraps the bounding box coordinates for that region |
Locate task
When using Locate, the user's input is embedded directly into the prompt:
prompt = f"<image>\nLocate <|ref|>{custom_prompt.strip()}<|/ref|> in the image."
This asks the model to find a specific string or element and return its bounding box coordinates.
Grounding and Layout Detection
When <|grounding|> is present, the model interleaves its text output with region annotations. A typical raw output looks like:
# Introduction
<|ref|>title<|/ref|><|det|>[[45, 12, 820, 48]]<|/det|>
This paper presents a method for...
<|ref|>text<|/ref|><|det|>[[45, 60, 820, 340]]<|/det|>
<|ref|>image<|/ref|><|det|>[[45, 360, 820, 680]]<|/det|>
| A | B |
|---|---|
<|ref|>table<|/ref|><|det|>[[45, 700, 820, 900]]<|/det|>
The labels (title, text, image, table) are part of the model's training vocabulary — the model assigns them based on what it detects, not from any hardcoded list in the app.
The regex that parses this is:
pattern = r'(<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>)'
This returns a list of tuples: (full_match, label, coordinates_string).
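A quick way to sanity-check the pattern against a snippet of raw output:

```python
import re

pattern = r'(<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>)'

sample = (
    "<|ref|>title<|/ref|><|det|>[[45, 12, 820, 48]]<|/det|>\n"
    "<|ref|>image<|/ref|><|det|>[[45, 360, 820, 680]]<|/det|>"
)

refs = re.findall(pattern, sample)
# Each tuple is (full_match, label, coordinates_string)
print(refs[0][1], refs[0][2])  # → title [[45, 12, 820, 48]]
print(refs[1][1])              # → image
```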
Coordinate system
Bounding box coordinates are normalised to a 0–999 scale, not pixel coordinates. The app scales them back at render time:
x1 = int(box[0] / 999 * img_w)
y1 = int(box[1] / 999 * img_h)
This means coordinates are resolution-independent — the same model output works regardless of the original image size.
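A minimal helper showing the full four-coordinate conversion (the name scale_box is illustrative):

```python
def scale_box(box, img_w, img_h):
    """Map a 0-999 normalised box [x1, y1, x2, y2] to pixel coordinates."""
    x1 = int(box[0] / 999 * img_w)
    y1 = int(box[1] / 999 * img_h)
    x2 = int(box[2] / 999 * img_w)
    y2 = int(box[3] / 999 * img_h)
    return x1, y1, x2, y2

# The same normalised box lands proportionally on any resolution:
print(scale_box([45, 12, 820, 48], 1000, 1414))  # small render
print(scale_box([45, 12, 820, 48], 2550, 3300))  # 300 DPI letter page
```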
Figure and Graph Extraction
Graph and figure extraction is a side effect of bounding box processing. Inside draw_bounding_boxes():
if extract_images and label == 'image':
crops.append(image.crop((x1, y1, x2, y2)))
Only regions the model labels as 'image' are cropped. Text blocks, titles, and tables get bounding boxes drawn but are not extracted.
These crops are then:
- Added to the Cropped Images gallery tab
- Base64-encoded and embedded into the markdown by embed_images(), so they appear inline in the Markdown Preview tab
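The base64 embedding step amounts to wrapping the crop's PNG bytes in a markdown data URI. A minimal stdlib sketch (embed_png is an illustrative name; in the app the bytes come from saving the PIL crop to an in-memory buffer):

```python
import base64

def embed_png(png_bytes: bytes, index: int) -> str:
    """Wrap raw PNG bytes in an inline markdown data-URI image (illustrative helper)."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return f"![Figure {index}](data:image/png;base64,{b64})"

# Placeholder bytes stand in for a real crop saved as PNG.
tag = embed_png(b"\x89PNG_fake_bytes", 1)
print(tag.startswith("![Figure 1](data:image/png;base64,"))  # → True
```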
stdout Capture Pattern
The model's infer() method was designed as a CLI tool — it print()s its output rather than returning it. The app captures this by temporarily replacing sys.stdout:
import sys
from io import StringIO

stdout = sys.stdout
sys.stdout = StringIO()
model.infer(...)
raw = sys.stdout.getvalue()
sys.stdout = stdout
The model also prints internal diagnostics alongside the actual output. These are filtered out by checking for known debug strings:
debug_filters = ['PATCHES', '====', 'BASE:', 'directly resize',
'NO PATCHES', 'torch.Size', '%|']
result = '\n'.join([
l for l in raw.split('\n')
if l.strip() and not any(s in l for s in debug_filters)
])
If inference ever produces unexpected empty output, checking what the model is printing to stdout (by temporarily removing the capture) is the first debugging step.
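The same capture can be written with the stdlib's contextlib.redirect_stdout, which restores stdout even if inference raises. A sketch with a stand-in for model.infer():

```python
import contextlib
from io import StringIO

def fake_infer():
    # Stand-in for model.infer(), which print()s rather than returns.
    print("PATCHES: 4")          # diagnostic line
    print("# Detected heading")  # actual output

buf = StringIO()
with contextlib.redirect_stdout(buf):
    fake_infer()
raw = buf.getvalue()
print(repr(raw))  # both lines captured; stdout restored automatically
```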
PDF Rendering
PDFs are converted to images — the text layer is never read
The app does not extract embedded text from PDFs. Every page is rasterised to a PNG image first, then passed through the exact same pipeline as a directly uploaded image:
import fitz  # PyMuPDF
from io import BytesIO
from PIL import Image

def process_pdf(path, task, custom_prompt, page_num):
    doc = fitz.open(path)
    page = doc.load_page(page_num - 1)  # 1-indexed in the UI, 0-indexed in fitz
    pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72), alpha=False)
    img = Image.open(BytesIO(pix.tobytes("png")))
    doc.close()
    return process_image(img, task, custom_prompt)
This means the model reads pixels, not characters. It has no access to the PDF's internal text layer, font metadata, or document structure.
Why 300 DPI
PDF page geometry is specified in points (72 per inch), so the default render is effectively 72 DPI. The fitz.Matrix(300/72, 300/72) call scales the render up by ~4.17×:
- At 72 DPI, small text, subscripts, superscripts, and fine math symbols are too coarse for the model to read reliably
- At 300 DPI, characters are sharp enough for accurate OCR even at small point sizes
- 300 DPI is the standard used by document scanners for archival quality
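The arithmetic behind the scale factor, for a US Letter page (612 × 792 points):

```python
# A 72-DPI PDF page is measured in points (1 pt = 1/72 inch).
# Rendering with fitz.Matrix(300/72, 300/72) multiplies both axes by ~4.17.
scale = 300 / 72
letter_pts = (612, 792)  # US Letter: 8.5 x 11 inches at 72 pt/inch

px = (round(letter_pts[0] * scale), round(letter_pts[1] * scale))
print(scale)  # 4.166666666666667
print(px)     # (2550, 3300)
```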
One page at a time
The current implementation processes one page per submission. There is no batch mode. For a multi-page document the user selects a page number, submits, then moves to the next page.
The page selector UI is only shown when a PDF is uploaded:
def update_page_selector(file_path):
if file_path.lower().endswith('.pdf'):
page_count = get_pdf_page_count(file_path)
return gr.update(visible=True, maximum=page_count, value=1, minimum=1)
return gr.update(visible=False)
Digital vs scanned PDFs
Both work identically:
| PDF type | What's inside | Result |
|---|---|---|
| Digital (text-based) | Vector fonts and geometry | PyMuPDF re-rasterises from vectors — output is perfectly sharp at any DPI |
| Scanned | Embedded raster images | PyMuPDF extracts the raster — output quality depends on the original scan resolution |
For scanned PDFs with low source resolution (e.g. 150 DPI originals), upscaling to 300 DPI will not recover detail that was never there. In those cases inference accuracy may be lower than with high-quality digital PDFs.
Dual-pass Output Cleaning
clean_output() is called twice on the same raw result to produce two different outputs:
cleaned = clean_output(result, include_images=False) # → Text tab
markdown = clean_output(result, include_images=True) # → Markdown Preview
With include_images=False:
- Grounding tokens are stripped
- <|ref|>image<|/ref|> regions are removed entirely
- Result is clean plain text
With include_images=True:
- Text grounding tokens are stripped
- <|ref|>image<|/ref|> regions are replaced with **[Figure N]** placeholders
- embed_images() then swaps those placeholders for actual base64-encoded PNGs
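A minimal sketch of how the dual-pass behaviour could be implemented; the real clean_output() in app.py may differ in details:

```python
import re

# Illustrative sketch: strip grounding annotations, optionally leaving
# numbered placeholders where image regions were.
REF_DET = r'<\|ref\|>(.*?)<\|/ref\|><\|det\|>.*?<\|/det\|>'

def clean_output_sketch(raw: str, include_images: bool) -> str:
    counter = 0
    def sub(m):
        nonlocal counter
        if include_images and m.group(1) == 'image':
            counter += 1
            return f'**[Figure {counter}]**'
        return ''  # strip all other grounding annotations
    return re.sub(REF_DET, sub, raw).strip()

raw = "# Title\n<|ref|>image<|/ref|><|det|>[[1,2,3,4]]<|/det|>\nBody"
print(clean_output_sketch(raw, include_images=False))
print(clean_output_sketch(raw, include_images=True))
```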
Bounding Box Rendering
Bounding boxes are drawn in two layers using Pillow:
- Solid outline — drawn directly on a copy of the image
- Semi-transparent fill — drawn on a separate RGBA overlay, then composited
overlay = Image.new('RGBA', img_draw.size, (0, 0, 0, 0))
# ... draw filled rectangles on overlay with alpha=60 ...
img_draw.paste(overlay, (0, 0), overlay)
The alpha value of 60 (out of 255) gives a ~24% opacity fill, keeping the underlying content readable.
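The ~24% figure falls out of standard alpha compositing. A quick check of the per-pixel arithmetic, independent of Pillow:

```python
# Per-pixel "over" compositing for the semi-transparent fill layer:
# out = fill * a + background * (1 - a), with a = alpha / 255.
def blend(background: int, fill: int, alpha: int) -> int:
    a = alpha / 255
    return round(fill * a + background * (1 - a))

# alpha=60 → the fill contributes ~24% of the final colour.
print(60 / 255)           # ≈ 0.235
print(blend(255, 0, 60))  # black fill over white paper → light grey (195)
```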
Colour assignment
Each unique label gets a random RGB colour, generated once per session:
np.random.seed(42)
color_map[label] = (
np.random.randint(50, 255),
np.random.randint(50, 255),
np.random.randint(50, 255)
)
The seed is fixed at 42, so colour assignment is deterministic across runs for a given sequence of labels: title will get the same colour each time as long as labels are encountered in the same order. The lower bound of 50 prevents colours that are too dark to see against the fill.
Title regions get a thicker outline (width=5) than other regions (width=3) to give them visual prominence.
Math Rendering Pipeline
Getting LaTeX from the model to display correctly in the browser involves three components working together.
The markdown/math conflict
Standard markdown processors interpret _ as italic and * as bold. Raw LaTeX like $a_1 + a_2^*$ would be mangled before MathJax ever sees it.
The solution is pymdownx.arithmatex — a markdown extension that extracts math expressions before markdown processing, processes the surrounding text, then reinserts the math wrapped in MathJax-compatible delimiters:
Input: Some text with $a_1 + a_2$ inline.
After arithmatex + markdown:
<p>Some text with <span class="arithmatex">\(a_1 + a_2\)</span> inline.</p>
The _ inside the math is never touched by the markdown processor.
Delimiter pre-conversion
The model outputs \[...\] for display math and \(...\) for inline math. But pymdownx.arithmatex only recognises $...$ and $$...$$ by default. Worse, if \[...\] is passed directly to the markdown processor, the backslashes are stripped first — before arithmatex can intercept them — leaving bare [...] brackets in the output.
to_math_html() therefore pre-converts the model's native delimiters before calling markdown():
text = re.sub(r'\\\[(.+?)\\\]', r'$$\1$$', text, flags=re.DOTALL)
text = re.sub(r'\\\((.+?)\\\)', r'$\1$', text)
After this step, arithmatex sees $$...$$ and $...$, protects the content from markdown, and wraps it in \[...\] and \(...\) for MathJax to render.
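The two substitutions can be verified in isolation:

```python
import re

def convert_delimiters(text: str) -> str:
    """Pre-convert the model's \\[...\\] / \\(...\\) delimiters to $-style."""
    text = re.sub(r'\\\[(.+?)\\\]', r'$$\1$$', text, flags=re.DOTALL)
    text = re.sub(r'\\\((.+?)\\\)', r'$\1$', text)
    return text

sample = r"Inline \(a_1\) and display \[E = mc^2\] math."
print(convert_delimiters(sample))
# → Inline $a_1$ and display $$E = mc^2$$ math.
```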
MathJax configuration
MathJax is loaded once in the page <head> and configured to process \(...\) for inline math and \[...\] for display math — matching the output format of arithmatex:
window.MathJax = {
tex: {
inlineMath: [['\\(', '\\)']],
displayMath: [['\\[', '\\]']],
processEscapes: true,
tags: 'ams'
}
};
tags: 'ams' enables automatic equation numbering for align, equation, and similar environments.
Re-typesetting on update
MathJax processes the page once on load. When Gradio updates the HTML component with new content, MathJax needs to be told to process the new content:
submit_event.then(
fn=None,
js="() => setTimeout(() => { if(window.MathJax) MathJax.typesetPromise(); }, 300)"
)
The 300ms delay gives Gradio time to finish updating the DOM before MathJax scans it.
Known Quirks and Workarounds
\coloneqq and \eqqcolon
These LaTeX commands (≔ and =:) from the mathtools package appear frequently in academic papers but are not available in MathJax's default TeX configuration. Rather than loading the full mathtools package, they are substituted at the text level:
text = text.replace('\\coloneqq', ':=').replace('\\eqqcolon', '=:')
If you need proper rendering of these symbols, load the mathtools extension instead: add '[tex]/mathtools' to loader.load and 'mathtools' to tex.packages in the MathJax configuration.
Flash Attention initialisation warning
On startup you will see:
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU.
This is because the model loads onto CPU first (from_pretrained) then moves to GPU (.cuda()). Flash Attention 2 prefers direct GPU initialisation. The warning is harmless — inference works correctly. To silence it, add device_map="cuda" to the from_pretrained call.
Model type mismatch warning
You are using a model of type deepseek_vl_v2 to instantiate a model of type DeepseekOCR2.
The model's config file on HuggingFace declares model_type: deepseek_vl_v2 but the custom code registers a DeepseekOCR2 class. Because trust_remote_code=True is set, the correct class is loaded regardless. The warning can be ignored.
eval() on model output
Bounding box coordinates are parsed with Python's eval():
coords = eval(ref[2])
The model outputs coordinates as a Python list literal, e.g. [[45, 12, 820, 48]]. This is safe in the current context, since the model runs locally, but worth revisiting if the architecture ever changes to process untrusted remote model output.
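The stdlib's ast.literal_eval is a drop-in replacement that only accepts Python literals, so a hostile string raises instead of executing:

```python
import ast

coords_str = "[[45, 12, 820, 48]]"

# literal_eval parses literals only; arbitrary expressions raise ValueError
# instead of being executed.
coords = ast.literal_eval(coords_str)
print(coords)  # → [[45, 12, 820, 48]]

try:
    ast.literal_eval("__import__('os').system('ls')")
except ValueError:
    print("rejected non-literal input")
```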
Zone.Identifier files in examples/
Files copied from Windows to WSL2 may have accompanying .Zone.Identifier metadata files (e.g. image.png:Zone.Identifier). These are Windows security zone markers and are harmless — Gradio ignores them when loading examples.
VRAM Usage and Quantized Models
The 8GB problem
The full-precision BF16 model (deepseek-ai/DeepSeek-OCR-2) consumes approximately 7.9GB of VRAM on load, leaving only ~250MB free on an 8GB GPU (e.g. RTX 3070). This headroom is insufficient for inference on complex documents:
- Each patch adds tokens to the KV cache and activations
- A 6-patch document can exhaust the remaining VRAM
- When VRAM is full, PyTorch spills to system RAM — which is 50–100× slower
- Symptom: GPU-Util drops to ~24%, power draw falls to ~47W (waiting on memory, not computing)
You can confirm this with watch -n 1 nvidia-smi during inference. Near-full VRAM with low GPU utilisation is the telltale sign.
Quantized alternatives on HuggingFace
To switch models, change MODEL_NAME in app.py. Three options are available as of March 2026:
| Model | Format | VRAM | Notes |
|---|---|---|---|
| deepseek-ai/DeepSeek-OCR-2 | BF16 (full) | ~8GB | Original, highest accuracy |
| richarddavison/DeepSeek-OCR-2-FP8 | FP8 dynamic | ~3.5GB | ~50% reduction; requires Ampere GPU or newer (RTX 30xx qualifies); 3,750 downloads/mo |
| mzbac/DeepSeek-OCR-2-8bit | 8-bit | ~4GB | Same stack (torch 2.6, flash-attn 2.7.3, Python 3.12); explicitly supports dynamic resolution (0–6 patches); 140 downloads/mo |
Not applicable to NVIDIA GPUs:
- mlx-community/DeepSeek-OCR-2-* (Apple Silicon only, MLX framework)
Not recommended:
- WHY2001/DeepSeek-OCR-4bit-Quantized (17 downloads/month, not well tested)
What does not exist (as of March 2026)
- GGUF of DeepSeek-OCR-2 (GGUF repos on HuggingFace are for v1 only)
- GPTQ of DeepSeek-OCR-2
- AWQ of DeepSeek-OCR-2
Switching models
Change the single constant in app.py and restart:
# FP8 — recommended first try for 8GB GPUs
MODEL_NAME = 'richarddavison/DeepSeek-OCR-2-FP8'
# 8-bit — alternative with same toolchain
MODEL_NAME = 'mzbac/DeepSeek-OCR-2-8bit'
The model will be downloaded from HuggingFace on first use and cached locally.