Technical Documentation
This document covers the implementation details of the DeepSeek-OCR-2 Math Rendering Edition. It is intended for developers who want to understand, extend, or debug the pipeline.
Table of Contents
- Architecture Overview
- Prompts and Special Tokens
- Grounding and Layout Detection
- Figure and Graph Extraction
- stdout Capture Pattern
- PDF Rendering
- Dual-pass Output Cleaning
- Bounding Box Rendering
- Math Rendering Pipeline
- Known Quirks and Workarounds
- VRAM Usage and Quantized Models
Architecture Overview
User input (image or PDF)
│
▼
PDF? ─── fitz renders page at 300 DPI ──► PIL Image
No? ─── PIL Image directly
│
▼
model.infer() called with prompt + image path
│
▼ (stdout captured)
Raw model output (text + grounding tokens)
│
├──► clean_output(include_images=False) ──► Text tab
│
├──► clean_output(include_images=True)
│ │
│ ▼
│ embed_images() ──► Markdown string with base64 figures
│ │
│ ▼
│ to_math_html() ──► HTML with MathJax ──► Markdown Preview tab
│
├──► extract_grounding_references()
│ │
│ ▼
│ draw_bounding_boxes() ──► Boxes tab
│ └──► crops ──► Cropped Images tab
│
└──► raw result ──► Raw Text tab
Prompts and Special Tokens
Each task sends a different prompt to the model. The prompt controls both what the model outputs and whether it performs layout detection.
| Task | Prompt | Grounding |
|---|---|---|
| Markdown | <image>\n<|grounding|>Convert the document to markdown. | Yes |
| Free OCR | <image>\nFree OCR. | No |
| Locate | <image>\nLocate <|ref|>text<|/ref|> in the image. | Yes |
| Describe | <image>\nDescribe this image in detail. | No |
| Custom | User-defined | Optional |
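For reference, the task-to-prompt dispatch could be sketched as a simple mapping. The names TASK_PROMPTS and build_prompt are illustrative, not the app's actual identifiers:

```python
# Illustrative sketch of the task-to-prompt mapping described above.
TASK_PROMPTS = {
    "Markdown": "<image>\n<|grounding|>Convert the document to markdown.",
    "Free OCR": "<image>\nFree OCR.",
    "Describe": "<image>\nDescribe this image in detail.",
}

def build_prompt(task: str, custom_prompt: str = "") -> str:
    """Return the prompt for a task; Locate embeds the user's query."""
    if task == "Locate":
        return f"<image>\nLocate <|ref|>{custom_prompt.strip()}<|/ref|> in the image."
    if task == "Custom":
        return f"<image>\n{custom_prompt.strip()}"
    return TASK_PROMPTS[task]
```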
Special tokens
| Token | Purpose |
|---|---|
| <image> | Replaced at inference time with visual patch embeddings from the input image |
| <|grounding|> | Activates layout detection mode: the model annotates every detected region with a label and bounding box |
| <|ref|>label<|/ref|> | Wraps the label of a detected region (e.g. title, text, image, table) |
| <|det|>coords<|/det|> | Wraps the bounding box coordinates for that region |
Locate task
When using Locate, the user's input is embedded directly into the prompt:
prompt = f"<image>\nLocate <|ref|>{custom_prompt.strip()}<|/ref|> in the image."
This asks the model to find a specific string or element and return its bounding box coordinates.
Grounding and Layout Detection
When <|grounding|> is present, the model interleaves its text output with region annotations. A typical raw output looks like:
# Introduction
<|ref|>title<|/ref|><|det|>[[45, 12, 820, 48]]<|/det|>
This paper presents a method for...
<|ref|>text<|/ref|><|det|>[[45, 60, 820, 340]]<|/det|>
<|ref|>image<|/ref|><|det|>[[45, 360, 820, 680]]<|/det|>
| A | B |
|---|---|
<|ref|>table<|/ref|><|det|>[[45, 700, 820, 900]]<|/det|>
The labels (title, text, image, table) are part of the model's training vocabulary — the model assigns them based on what it detects, not from any hardcoded list in the app.
The regex that parses this is:
pattern = r'(<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>)'
This returns a list of tuples: (full_match, label, coordinates_string).
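A quick way to sanity-check the pattern against a snippet of raw output:

```python
import re

pattern = r'(<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>)'

sample = (
    "<|ref|>title<|/ref|><|det|>[[45, 12, 820, 48]]<|/det|>\n"
    "<|ref|>image<|/ref|><|det|>[[45, 360, 820, 680]]<|/det|>"
)

refs = re.findall(pattern, sample)
# Each tuple is (full_match, label, coordinates_string)
print(refs[0][1], refs[0][2])  # → title [[45, 12, 820, 48]]
print(refs[1][1])              # → image
```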
Coordinate system
Bounding box coordinates are normalised to a 0–999 scale, not pixel coordinates. The app scales them back at render time:
x1 = int(box[0] / 999 * img_w)
y1 = int(box[1] / 999 * img_h)
This means coordinates are resolution-independent — the same model output works regardless of the original image size.
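A minimal helper showing the full four-coordinate conversion (the name scale_box is illustrative):

```python
def scale_box(box, img_w, img_h):
    """Map a 0-999 normalised box [x1, y1, x2, y2] to pixel coordinates."""
    x1 = int(box[0] / 999 * img_w)
    y1 = int(box[1] / 999 * img_h)
    x2 = int(box[2] / 999 * img_w)
    y2 = int(box[3] / 999 * img_h)
    return x1, y1, x2, y2

# The same normalised box lands proportionally on any resolution:
print(scale_box([45, 12, 820, 48], 1000, 1414))  # small render
print(scale_box([45, 12, 820, 48], 2550, 3300))  # 300 DPI letter page
```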
Figure and Graph Extraction
Graph and figure extraction is a side effect of bounding box processing. Inside draw_bounding_boxes():
if extract_images and label == 'image':
crops.append(image.crop((x1, y1, x2, y2)))
Only regions the model labels as 'image' are cropped. Text blocks, titles, and tables get bounding boxes drawn but are not extracted.
These crops are then:
- Added to the Cropped Images gallery tab
- Base64-encoded and embedded into the markdown by embed_images(), so they appear inline in the Markdown Preview tab
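The base64 embedding step amounts to wrapping the crop's PNG bytes in a markdown data URI. A minimal stdlib sketch (embed_png is an illustrative name; in the app the bytes come from saving the PIL crop to an in-memory buffer):

```python
import base64

def embed_png(png_bytes: bytes, index: int) -> str:
    """Wrap raw PNG bytes in an inline markdown data-URI image (illustrative helper)."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return f"![Figure {index}](data:image/png;base64,{b64})"

# Placeholder bytes stand in for a real crop saved as PNG.
tag = embed_png(b"\x89PNG_fake_bytes", 1)
print(tag.startswith("![Figure 1](data:image/png;base64,"))  # → True
```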
stdout Capture Pattern
The model's infer() method was designed as a CLI tool — it print()s its output rather than returning it. The app captures this by temporarily replacing sys.stdout:
import sys
from io import StringIO

stdout = sys.stdout
sys.stdout = StringIO()
model.infer(...)
raw = sys.stdout.getvalue()
sys.stdout = stdout
The model also prints internal diagnostics alongside the actual output. These are filtered out by checking for known debug strings:
debug_filters = ['PATCHES', '====', 'BASE:', 'directly resize',
'NO PATCHES', 'torch.Size', '%|']
result = '\n'.join([
l for l in raw.split('\n')
if l.strip() and not any(s in l for s in debug_filters)
])
If inference ever produces unexpected empty output, checking what the model is printing to stdout (by temporarily removing the capture) is the first debugging step.
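The same capture can be written with the stdlib's contextlib.redirect_stdout, which restores stdout even if inference raises. A sketch with a stand-in for model.infer():

```python
import contextlib
from io import StringIO

def fake_infer():
    # Stand-in for model.infer(), which print()s rather than returns.
    print("PATCHES: 4")          # diagnostic line
    print("# Detected heading")  # actual output

buf = StringIO()
with contextlib.redirect_stdout(buf):
    fake_infer()
raw = buf.getvalue()
print(repr(raw))  # both lines captured; stdout restored automatically
```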
PDF Rendering
PDFs are converted to images — the text layer is never read
The app does not extract embedded text from PDFs. Every page is rasterised to a PNG image first, then passed through the exact same pipeline as a directly uploaded image:
import fitz  # PyMuPDF
from io import BytesIO
from PIL import Image

def process_pdf(path, task, custom_prompt, page_num):
    doc = fitz.open(path)
    page = doc.load_page(page_num - 1)  # 1-indexed in the UI, 0-indexed in fitz
    pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72), alpha=False)
    img = Image.open(BytesIO(pix.tobytes("png")))
    doc.close()
    return process_image(img, task, custom_prompt)
This means the model reads pixels, not characters. It has no access to the PDF's internal text layer, font metadata, or document structure.
Why 300 DPI
PDF page geometry is specified in points (72 per inch), so the default render is effectively 72 DPI. The fitz.Matrix(300/72, 300/72) call scales the render up by ~4.17×:
- At 72 DPI, small text, subscripts, superscripts, and fine math symbols are too coarse for the model to read reliably
- At 300 DPI, characters are sharp enough for accurate OCR even at small point sizes
- 300 DPI is the standard used by document scanners for archival quality
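The arithmetic behind the scale factor, for a US Letter page (612 × 792 points):

```python
# A 72-DPI PDF page is measured in points (1 pt = 1/72 inch).
# Rendering with fitz.Matrix(300/72, 300/72) multiplies both axes by ~4.17.
scale = 300 / 72
letter_pts = (612, 792)  # US Letter: 8.5 x 11 inches at 72 pt/inch

px = (round(letter_pts[0] * scale), round(letter_pts[1] * scale))
print(scale)  # 4.166666666666667
print(px)     # (2550, 3300)
```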
One page at a time
The current implementation processes one page per submission. There is no batch mode. For a multi-page document the user selects a page number, submits, then moves to the next page.
The page selector UI is only shown when a PDF is uploaded:
def update_page_selector(file_path):
if file_path.lower().endswith('.pdf'):
page_count = get_pdf_page_count(file_path)
return gr.update(visible=True, maximum=page_count, value=1, minimum=1)
return gr.update(visible=False)
Digital vs scanned PDFs
Both work identically:
| PDF type | What's inside | Result |
|---|---|---|
| Digital (text-based) | Vector fonts and geometry | PyMuPDF re-rasterises from vectors — output is perfectly sharp at any DPI |
| Scanned | Embedded raster images | PyMuPDF extracts the raster — output quality depends on the original scan resolution |
For scanned PDFs with low source resolution (e.g. 150 DPI originals), upscaling to 300 DPI will not recover detail that was never there. In those cases inference accuracy may be lower than with high-quality digital PDFs.
Dual-pass Output Cleaning
clean_output() is called twice on the same raw result to produce two different outputs:
cleaned = clean_output(result, include_images=False) # → Text tab
markdown = clean_output(result, include_images=True) # → Markdown Preview
With include_images=False:
- Grounding tokens are stripped
- <|ref|>image<|/ref|> regions are removed entirely
- Result is clean plain text
With include_images=True:
- Text grounding tokens are stripped
- <|ref|>image<|/ref|> regions are replaced with **[Figure N]** placeholders
- embed_images() then swaps those placeholders for actual base64-encoded PNGs
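A minimal sketch of how the dual-pass behaviour could be implemented; the real clean_output() in app.py may differ in details:

```python
import re

# Illustrative sketch: strip grounding annotations, optionally leaving
# numbered placeholders where image regions were.
REF_DET = r'<\|ref\|>(.*?)<\|/ref\|><\|det\|>.*?<\|/det\|>'

def clean_output_sketch(raw: str, include_images: bool) -> str:
    counter = 0
    def sub(m):
        nonlocal counter
        if include_images and m.group(1) == 'image':
            counter += 1
            return f'**[Figure {counter}]**'
        return ''  # strip all other grounding annotations
    return re.sub(REF_DET, sub, raw).strip()

raw = "# Title\n<|ref|>image<|/ref|><|det|>[[1,2,3,4]]<|/det|>\nBody"
print(clean_output_sketch(raw, include_images=False))
print(clean_output_sketch(raw, include_images=True))
```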
Bounding Box Rendering
Bounding boxes are drawn in two layers using Pillow:
- Solid outline — drawn directly on a copy of the image
- Semi-transparent fill — drawn on a separate RGBA overlay, then composited
overlay = Image.new('RGBA', img_draw.size, (0, 0, 0, 0))
# ... draw filled rectangles on overlay with alpha=60 ...
img_draw.paste(overlay, (0, 0), overlay)
The alpha value of 60 (out of 255) gives a ~24% opacity fill, keeping the underlying content readable.
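The ~24% figure falls out of standard alpha compositing. A quick check of the per-pixel arithmetic, independent of Pillow:

```python
# Per-pixel "over" compositing for the semi-transparent fill layer:
# out = fill * a + background * (1 - a), with a = alpha / 255.
def blend(background: int, fill: int, alpha: int) -> int:
    a = alpha / 255
    return round(fill * a + background * (1 - a))

# alpha=60 → the fill contributes ~24% of the final colour.
print(60 / 255)           # ≈ 0.235
print(blend(255, 0, 60))  # black fill over white paper → light grey (195)
```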
Colour assignment
Each unique label gets a random RGB colour, generated once per session:
np.random.seed(42)
color_map[label] = (
np.random.randint(50, 255),
np.random.randint(50, 255),
np.random.randint(50, 255)
)
The seed is fixed at 42, so colour assignment is deterministic across runs for a given sequence of labels: title will get the same colour each time as long as labels are encountered in the same order. The lower bound of 50 prevents colours that are too dark to see against the fill.
Title regions get a thicker outline (width=5) than other regions (width=3) to give them visual prominence.
Math Rendering Pipeline
Getting LaTeX from the model to display correctly in the browser involves three components working together.
The markdown/math conflict
Standard markdown processors interpret _ as italic and * as bold. Raw LaTeX like $a_1 + a_2^*$ would be mangled before MathJax ever sees it.
The solution is pymdownx.arithmatex — a markdown extension that extracts math expressions before markdown processing, processes the surrounding text, then reinserts the math wrapped in MathJax-compatible delimiters:
Input: Some text with $a_1 + a_2$ inline.
After arithmatex + markdown:
<p>Some text with <span class="arithmatex">\(a_1 + a_2\)</span> inline.</p>
The _ inside the math is never touched by the markdown processor.
Delimiter pre-conversion
The model outputs \[...\] for display math and \(...\) for inline math. But pymdownx.arithmatex only recognises $...$ and $$...$$ by default. Worse, if \[...\] is passed directly to the markdown processor, the backslashes are stripped first — before arithmatex can intercept them — leaving bare [...] brackets in the output.
to_math_html() therefore pre-converts the model's native delimiters before calling markdown():
text = re.sub(r'\\\[(.+?)\\\]', r'$$\1$$', text, flags=re.DOTALL)
text = re.sub(r'\\\((.+?)\\\)', r'$\1$', text)
After this step, arithmatex sees $$...$$ and $...$, protects the content from markdown, and wraps it in \[...\] and \(...\) for MathJax to render.
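The two substitutions can be verified in isolation:

```python
import re

def convert_delimiters(text: str) -> str:
    """Pre-convert the model's \\[...\\] / \\(...\\) delimiters to $-style."""
    text = re.sub(r'\\\[(.+?)\\\]', r'$$\1$$', text, flags=re.DOTALL)
    text = re.sub(r'\\\((.+?)\\\)', r'$\1$', text)
    return text

sample = r"Inline \(a_1\) and display \[E = mc^2\] math."
print(convert_delimiters(sample))
# → Inline $a_1$ and display $$E = mc^2$$ math.
```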
MathJax configuration
MathJax is loaded once in the page <head> and configured to process \(...\) for inline math and \[...\] for display math — matching the output format of arithmatex:
window.MathJax = {
tex: {
inlineMath: [['\\(', '\\)']],
displayMath: [['\\[', '\\]']],
processEscapes: true,
tags: 'ams'
}
};
tags: 'ams' enables automatic equation numbering for align, equation, and similar environments.
Re-typesetting on update
MathJax processes the page once on load. When Gradio updates the HTML component with new content, MathJax needs to be told to process the new content:
submit_event.then(
fn=None,
js="() => setTimeout(() => { if(window.MathJax) MathJax.typesetPromise(); }, 300)"
)
The 300ms delay gives Gradio time to finish updating the DOM before MathJax scans it.
Known Quirks and Workarounds
\coloneqq and \eqqcolon
These LaTeX commands (≔ and =:) from the mathtools package appear frequently in academic papers but are not available in MathJax's default TeX configuration. Rather than loading the full mathtools package, they are substituted at the text level:
text = text.replace('\\coloneqq', ':=').replace('\\eqqcolon', '=:')
If you need proper rendering of these symbols, load the mathtools extension instead: add '[tex]/mathtools' to loader.load and 'mathtools' to tex.packages in the MathJax configuration.
Flash Attention initialisation warning
On startup you will see:
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU.
This is because the model loads onto CPU first (from_pretrained) then moves to GPU (.cuda()). Flash Attention 2 prefers direct GPU initialisation. The warning is harmless — inference works correctly. To silence it, add device_map="cuda" to the from_pretrained call.
Model type mismatch warning
You are using a model of type deepseek_vl_v2 to instantiate a model of type DeepseekOCR2.
The model's config file on HuggingFace declares model_type: deepseek_vl_v2 but the custom code registers a DeepseekOCR2 class. Because trust_remote_code=True is set, the correct class is loaded regardless. The warning can be ignored.
eval() on model output
Bounding box coordinates are parsed with Python's eval():
coords = eval(ref[2])
The model outputs coordinates as a Python list literal, e.g. [[45, 12, 820, 48]]. This is safe in the current context, since the model runs locally, but worth revisiting if the architecture ever changes to process untrusted remote model output.
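The stdlib's ast.literal_eval is a drop-in replacement that only accepts Python literals, so a hostile string raises instead of executing:

```python
import ast

coords_str = "[[45, 12, 820, 48]]"

# literal_eval parses literals only; arbitrary expressions raise ValueError
# instead of being executed.
coords = ast.literal_eval(coords_str)
print(coords)  # → [[45, 12, 820, 48]]

try:
    ast.literal_eval("__import__('os').system('ls')")
except ValueError:
    print("rejected non-literal input")
```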
Zone.Identifier files in examples/
Files copied from Windows to WSL2 may have accompanying .Zone.Identifier metadata files (e.g. image.png:Zone.Identifier). These are Windows security zone markers and are harmless — Gradio ignores them when loading examples.
VRAM Usage and Quantized Models
The 8GB problem
The full-precision BF16 model (deepseek-ai/DeepSeek-OCR-2) consumes approximately 7.9GB of VRAM on load, leaving only ~250MB free on an 8GB GPU (e.g. RTX 3070). This headroom is insufficient for inference on complex documents:
- Each patch adds tokens to the KV cache and activations
- A 6-patch document can exhaust the remaining VRAM
- When VRAM is full, PyTorch spills to system RAM — which is 50–100× slower
- Symptom: GPU-Util drops to ~24%, power draw falls to ~47W (waiting on memory, not computing)
You can confirm this with watch -n 1 nvidia-smi during inference. Near-full VRAM with low GPU utilisation is the telltale sign.
Quantized alternatives on HuggingFace
To switch models, change MODEL_NAME in app.py. Three options are available as of March 2026:
| Model | Format | VRAM | Notes |
|---|---|---|---|
| deepseek-ai/DeepSeek-OCR-2 | BF16 (full) | ~8GB | Original, highest accuracy |
| richarddavison/DeepSeek-OCR-2-FP8 | FP8 dynamic | ~3.5GB | ~50% reduction; requires Ampere GPU or newer (RTX 30xx qualifies); 3,750 downloads/mo |
| mzbac/DeepSeek-OCR-2-8bit | 8-bit | ~4GB | Same stack (torch 2.6, flash-attn 2.7.3, Python 3.12); explicitly supports dynamic resolution (0–6 patches); 140 downloads/mo |
Not applicable to NVIDIA GPUs:
- mlx-community/DeepSeek-OCR-2-* (Apple Silicon only, MLX framework)
Not recommended:
- WHY2001/DeepSeek-OCR-4bit-Quantized (17 downloads/month, not well tested)
What does not exist (as of March 2026)
- GGUF of DeepSeek-OCR-2 (GGUF repos on HuggingFace are for v1 only)
- GPTQ of DeepSeek-OCR-2
- AWQ of DeepSeek-OCR-2
Switching models
Change the single constant in app.py and restart:
# FP8 — recommended first try for 8GB GPUs
MODEL_NAME = 'richarddavison/DeepSeek-OCR-2-FP8'
# 8-bit — alternative with same toolchain
MODEL_NAME = 'mzbac/DeepSeek-OCR-2-8bit'
The model will be downloaded from HuggingFace on first use and cached locally.