---
title: DeepSeek OCR 2 — Math Rendering Edition
emoji: 🧮
colorFrom: red
colorTo: blue
sdk: gradio
sdk_version: 6.8.0
app_file: app.py
pinned: true
short_description: DeepSeek-OCR-2 with MathJax math rendering
license: mit
python_version: "3.12"
suggested_hardware: zero-a10g
---

# DeepSeek-OCR-2 — Math Rendering Edition

Built on top of the excellent [DeepSeek-OCR-2 Demo](https://huggingface.co/spaces/merterbak/DeepSeek-OCR-2) by **Mert Erbak**. Many thanks for the clean foundation — the OCR pipeline, PDF support, bounding box visualisation, and grounding features are all his work.

## What's new in this fork

- **MathJax rendering** — the Markdown Preview tab now renders LaTeX math notation (inline `$...$` and display `$$...$$`) using MathJax 3, so equations from scanned papers and textbooks display as proper math rather than raw LaTeX source.
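
As a rough illustration of the two delimiter styles, the sketch below finds inline and display math spans in OCR output using only the standard library. It is not the code this Space uses; `MATH_PATTERN` and `find_math_spans` are hypothetical names, and the regex ignores escaped dollar signs and math inside code blocks.

```python
import re

# Display math is tried first so "$$...$$" is not read as two inline spans.
# Inline math must not start or end with whitespace and may not span lines.
MATH_PATTERN = re.compile(
    r"\$\$(.+?)\$\$|\$(?!\s)([^$\n]+?)(?<!\s)\$",
    re.DOTALL,
)


def find_math_spans(text):
    """Return a list of (kind, latex) tuples for each math span in text."""
    spans = []
    for match in MATH_PATTERN.finditer(text):
        if match.group(1) is not None:
            spans.append(("display", match.group(1).strip()))
        else:
            spans.append(("inline", match.group(2)))
    return spans


sample = "Energy is $E = mc^2$ and\n$$\\int_0^1 x\\,dx = \\frac{1}{2}$$"
for kind, latex in find_math_spans(sample):
    print(kind, "->", latex)
```

Requiring non-whitespace next to the inline delimiters keeps ordinary currency text such as `$5 and $10` from being treated as math.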

## Features (inherited + extended)

| Feature | Description |
|---|---|
| 📋 Markdown | Convert documents to structured markdown with layout detection |
| 📝 Free OCR | Simple text extraction without layout analysis |
| 📍 Locate | Find and highlight specific text or elements with bounding boxes |
| 🔍 Describe | General image description |
| ✏️ Custom | Provide your own prompt |
| 🧮 Math Preview | Rendered MathJax output for equations and formulas *(new)* |

## Model

Uses `deepseek-ai/DeepSeek-OCR-2` with DeepEncoder v2. Achieves **91.09% on OmniDocBench** (+3.73% over v1).

Configuration: a 1024-pixel base view plus 768-pixel patches selected by dynamic cropping (2–6 patches per image). Each patch contributes 144 visual tokens and the base view contributes 256.
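
The token figures above imply a simple budget per image. The sketch below works through the arithmetic, assuming the counts simply add and that no separator or other special tokens are included; `visual_token_count` is a hypothetical helper, not part of the model's API.

```python
BASE_TOKENS = 256        # tokens for the base view
TOKENS_PER_PATCH = 144   # tokens for each cropped patch
MIN_PATCHES, MAX_PATCHES = 2, 6  # dynamic cropping range stated above


def visual_token_count(num_patches):
    """Return the assumed visual token total for a given patch count."""
    if not MIN_PATCHES <= num_patches <= MAX_PATCHES:
        raise ValueError(f"expected {MIN_PATCHES}-{MAX_PATCHES} patches, got {num_patches}")
    return BASE_TOKENS + TOKENS_PER_PATCH * num_patches


for n in range(MIN_PATCHES, MAX_PATCHES + 1):
    print(n, "patches ->", visual_token_count(n), "tokens")
# 2 patches -> 544 tokens
# 3 patches -> 688 tokens
# 4 patches -> 832 tokens
# 5 patches -> 976 tokens
# 6 patches -> 1120 tokens
```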

## How it works

The model processes images and PDFs using a prompt-based interface with special tokens that control its behaviour:

- **`<image>`** — replaced at inference time with visual patch embeddings from the input
- **`<|grounding|>`** — activates layout detection; the model then annotates every element it finds with a label and bounding box coordinates
- **`<|ref|>label<|/ref|><|det|>[[x1,y1,x2,y2]]<|/det|>`** — the format the model uses to output detected regions

When grounding is active, the model self-labels regions as `title`, `text`, `image`, `table`, etc. Regions labelled `image` are automatically cropped out and appear in the **Cropped Images** tab. All regions get bounding boxes drawn in the **Boxes** tab.
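
The region format described above can be parsed with a regular expression. The following is a minimal sketch, not the Space's actual implementation: `REGION_PATTERN` and `parse_regions` are hypothetical names, the sample string is made up for illustration, and the coordinates are returned exactly as emitted, with no assumption about their scale.

```python
import json
import re

# Matches <|ref|>label<|/ref|><|det|>[[x1,y1,x2,y2]]<|/det|>.
REGION_PATTERN = re.compile(
    r"<\|ref\|>(?P<label>.*?)<\|/ref\|>"
    r"<\|det\|>(?P<boxes>\[\[.*?\]\])<\|/det\|>",
    re.DOTALL,
)


def parse_regions(raw_output):
    """Return one dict per detected box, with its label and coordinates."""
    regions = []
    for match in REGION_PATTERN.finditer(raw_output):
        label = match.group("label").strip()
        # The box list is valid JSON, e.g. [[x1, y1, x2, y2]].
        for box in json.loads(match.group("boxes")):
            regions.append({"label": label, "box": tuple(box)})
    return regions


# Made-up model output for illustration only.
sample = (
    "<|ref|>title<|/ref|><|det|>[[50,40,950,90]]<|/det|>\n# Heading\n"
    "<|ref|>image<|/ref|><|det|>[[100,200,500,600]]<|/det|>\n"
)
regions = parse_regions(sample)
image_regions = [r for r in regions if r["label"] == "image"]
print(image_regions)
```

Filtering on the `image` label mirrors the step that decides which regions are cropped, while the full list corresponds to everything that would get a box drawn.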

See [TECHNICAL.md](TECHNICAL.md) for a full breakdown of the pipeline, including some non-obvious implementation details.

## Running locally

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
pip install gradio spaces markdown pymdown-extensions
python app.py
```

Requires a CUDA-capable GPU. The model is downloaded from Hugging Face on first run.

## Secrets (Spaces + local)

For Hugging Face Spaces, store tokens in **Space Settings -> Variables and secrets**.
Use `HF_TOKEN` as the secret name.

For local workflows:

```bash
cp .env.example .env
# edit .env and set HF_TOKEN=...
set -a; source .env; set +a
```

`.env` is excluded from git via `.gitignore`, so the `HF_TOKEN` value stays out of version control.
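
On the Python side, the token can then be read from the environment. The sketch below shows one way to fail early with a clear message when it is missing; `get_hf_token` is a hypothetical helper and is not taken from `app.py`.

```python
import os


def get_hf_token(env=None):
    """Return HF_TOKEN from the given mapping (default: os.environ)."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export it or source your .env file first."
        )
    return token


if __name__ == "__main__":
    try:
        get_hf_token()
        print("HF_TOKEN found")  # never print the token itself
    except RuntimeError as err:
        print(err)
```

Accepting an optional mapping makes the helper easy to test without touching the real environment.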

To stream Space logs with token-based auth:

```bash
./scripts/fetch_space_logs.sh ricklon/DeepSeek-OCR-2-Math run
./scripts/fetch_space_logs.sh ricklon/DeepSeek-OCR-2-Math build
```

## TODO / Backlog

- Add a LaTeX lint/correction pipeline for OCR output:
  - Detect malformed math with `chktex` (or equivalent).
  - Normalize equivalent expressions (for example `^2` vs `^{2}`) before display/export.
  - Apply safe auto-fixes for common OCR-LaTeX artifacts.
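
As a possible starting point for the normalization item, the sketch below rewrites bare single-character superscripts and subscripts into braced form. It is an illustration only, not an implemented part of this project: `normalize_scripts` is a hypothetical name, and the regex does not account for escaped underscores, `\text{...}` contents, or command arguments such as `^\alpha`.

```python
import re

# A ^ or _ followed directly by one letter or digit (not a brace or backslash).
BARE_SCRIPT = re.compile(r"([\^_])([A-Za-z0-9])")


def normalize_scripts(latex):
    """Wrap bare one-character scripts in braces, e.g. x^2 -> x^{2}."""
    return BARE_SCRIPT.sub(r"\1{\2}", latex)


print(normalize_scripts("x^2 + y_i^{3}"))  # x^{2} + y_{i}^{3}
```

Because TeX only binds the next single token to `^` or `_`, bracing that one character leaves the rendered result unchanged, and already-braced scripts are left alone, so the function is safe to apply repeatedly.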