---
title: DeepSeek OCR 2 Math Rendering Edition
emoji: 🧮
colorFrom: red
colorTo: blue
sdk: gradio
sdk_version: 6.8.0
app_file: app.py
pinned: true
short_description: DeepSeek-OCR-2 with MathJax math rendering
license: mit
python_version: '3.12'
suggested_hardware: zero-a10g
---

# DeepSeek-OCR-2 — Math Rendering Edition

Built on top of the excellent DeepSeek-OCR-2 Demo by Mert Erbak. Many thanks for the clean foundation — the OCR pipeline, PDF support, bounding box visualisation, and grounding features are all his work.

## What's new in this fork

- **MathJax rendering** — the Markdown Preview tab now renders LaTeX math notation (inline `$...$` and display `$$...$$`) using MathJax 3, so equations from scanned papers and textbooks display as properly typeset math rather than raw LaTeX source.
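One way to get MathJax into a Gradio preview is to serve the rendered markdown inside an HTML fragment that loads MathJax 3 and declares the `$`/`$$` delimiters. A minimal sketch, not the Space's actual code (the function name and CDN wiring are illustrative):

```python
# MathJax 3 config: enable $...$ for inline math (off by default) and
# keep the standard \( \) / \[ \] delimiters as well.
MATHJAX_HEAD = """
<script>
  window.MathJax = {
    tex: {
      inlineMath: [['$', '$'], ['\\\\(', '\\\\)']],
      displayMath: [['$$', '$$'], ['\\\\[', '\\\\]']]
    }
  };
</script>
<script async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
"""

def wrap_for_preview(markdown_html: str) -> str:
    """Wrap already-rendered markdown HTML so MathJax typesets its math."""
    return f"<html><head>{MATHJAX_HEAD}</head><body>{markdown_html}</body></html>"
```

The key detail is that MathJax 3 does not recognize single-dollar inline math out of the box; it must be enabled explicitly via `tex.inlineMath`.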

## Features (inherited + extended)

| Feature | Description |
| --- | --- |
| 📋 Markdown | Convert documents to structured markdown with layout detection |
| 📝 Free OCR | Simple text extraction without layout analysis |
| 📍 Locate | Find and highlight specific text or elements with bounding boxes |
| 🔍 Describe | General image description |
| ✏️ Custom | Provide your own prompt |
| 🧮 Math Preview | Rendered MathJax output for equations and formulas (new) |

## Model

Uses [`deepseek-ai/DeepSeek-OCR-2`](https://huggingface.co/deepseek-ai/DeepSeek-OCR-2) with DeepEncoder v2. Achieves 91.09% on OmniDocBench (+3.73% over v1).

Configuration: a 1024 base view plus 768 crops selected by dynamic cropping (2–6 patches). Each patch contributes 144 visual tokens, on top of 256 tokens for the base view.
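The resulting visual-token budget follows directly from those numbers; a small sketch using the per-patch and base counts quoted above:

```python
def visual_tokens(num_patches: int, tokens_per_patch: int = 144, base_tokens: int = 256) -> int:
    """Total visual tokens for one image: base view plus dynamic crops."""
    if not 2 <= num_patches <= 6:
        raise ValueError("dynamic cropping selects between 2 and 6 patches")
    return base_tokens + num_patches * tokens_per_patch

# Token budget across the allowed patch range
print(visual_tokens(2))  # 544
print(visual_tokens(6))  # 1120
```

So a single page costs between 544 and 1120 visual tokens depending on how many crops the dynamic-cropping step selects.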

## How it works

The model processes images and PDFs using a prompt-based interface with special tokens that control its behaviour:

- `<image>` — replaced at inference time with visual patch embeddings from the input
- `<|grounding|>` — activates layout detection; the model then annotates every element it finds with a label and bounding box coordinates
- `<|ref|>label<|/ref|><|det|>[[x1,y1,x2,y2]]<|/det|>` — the format the model uses to output detected regions
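The detection format above can be pulled apart with a simple regular expression. A minimal sketch (the exact output format may vary between model versions, so treat the pattern as a starting point):

```python
import re

# Matches <|ref|>label<|/ref|><|det|>[[x1,y1,x2,y2]]<|/det|> spans in model output
DET_RE = re.compile(
    r"<\|ref\|>(?P<label>.+?)<\|/ref\|>"
    r"<\|det\|>\[\[(?P<coords>[\d,\s]+)\]\]<\|/det\|>"
)

def parse_regions(text: str) -> list[tuple[str, list[int]]]:
    """Extract (label, [x1, y1, x2, y2]) pairs from grounded OCR output."""
    regions = []
    for m in DET_RE.finditer(text):
        coords = [int(c) for c in m.group("coords").split(",")]
        regions.append((m.group("label"), coords))
    return regions

sample = "<|ref|>title<|/ref|><|det|>[[120, 40, 880, 95]]<|/det|>"
print(parse_regions(sample))  # [('title', [120, 40, 880, 95])]
```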

When grounding is active, the model self-labels regions as `title`, `text`, `image`, `table`, etc. Regions labelled `image` are automatically cropped out and appear in the Cropped Images tab. All regions get bounding boxes drawn in the Boxes tab.
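Assuming the box coordinates come back normalized to a 0–999 grid (an assumption about this model's convention — verify against real detections before relying on it), the cropping step scales them to pixel coordinates roughly like this:

```python
def box_to_pixels(box, width, height, grid=1000):
    """Scale a [x1, y1, x2, y2] box on a 0..(grid-1) grid to pixel coords.

    NOTE: the 0-999 normalized grid is an assumption about the model's
    output convention, not something the repo documents.
    """
    x1, y1, x2, y2 = box
    return (
        round(x1 * width / grid),
        round(y1 * height / grid),
        round(x2 * width / grid),
        round(y2 * height / grid),
    )

# A full-width banner region on a 1600x1200 page
print(box_to_pixels([0, 0, 999, 120], 1600, 1200))  # (0, 0, 1598, 144)
```

The resulting pixel tuple can be passed straight to something like PIL's `Image.crop`.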

See `TECHNICAL.md` for a full breakdown of the pipeline, including some non-obvious implementation details.

## Running locally

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
pip install gradio spaces markdown pymdown-extensions
python app.py
```

Requires a CUDA-capable GPU. The model weights are downloaded from Hugging Face on first run.

## Secrets (Spaces + local)

For Hugging Face Spaces, store tokens under **Settings → Variables and secrets**. Use `HF_TOKEN` as the secret name.

For local workflows:

```bash
cp .env.example .env
# edit .env and set HF_TOKEN=...
set -a; source .env; set +a
```

The `.env` file is listed in `.gitignore`, so `HF_TOKEN` is never committed.

To stream Space logs with token-based auth:

```bash
./scripts/fetch_space_logs.sh ricklon/DeepSeek-OCR-2-Math run
./scripts/fetch_space_logs.sh ricklon/DeepSeek-OCR-2-Math build
```

## TODO / Backlog

- Add a LaTeX lint/correction pipeline for OCR output:
  - Detect malformed math with `chktex` (or equivalent).
  - Normalize equivalent expressions (for example `^2` vs `^{2}`) before display/export.
  - Apply safe auto-fixes for common OCR-LaTeX artifacts.
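The normalization step could start as small as collapsing single-character script variants. A hypothetical sketch of that one rule (the function name and rule set are illustrative, not code from this repo):

```python
import re

def normalize_latex(s: str) -> str:
    """Canonicalize a few equivalent LaTeX spellings before display/export.

    Illustrative rules only; a real pipeline would need a proper LaTeX
    tokenizer to avoid rewriting inside \\verb or \\text blocks.
    """
    # ^2 -> ^{2}, _i -> _{i}: brace single-character sub/superscripts
    s = re.sub(r"([_^])([A-Za-z0-9])(?![}\w])", r"\1{\2}", s)
    # Collapse runs of spaces/tabs
    s = re.sub(r"[ \t]+", " ", s)
    return s

print(normalize_latex("x^2 + y_i"))  # x^{2} + y_{i}
```

Already-braced scripts like `a^{2}` pass through unchanged, which is the property that makes a rule like this safe to apply repeatedly.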