---
title: DeepSeek OCR 2 Math Rendering Edition
emoji: 🧮
colorFrom: red
colorTo: blue
sdk: gradio
sdk_version: 6.8.0
app_file: app.py
pinned: true
short_description: DeepSeek-OCR-2 with MathJax math rendering
license: mit
python_version: "3.12"
suggested_hardware: zero-a10g
---
# DeepSeek-OCR-2 — Math Rendering Edition
Built on top of the excellent [DeepSeek-OCR-2 Demo](https://huggingface.co/spaces/merterbak/DeepSeek-OCR-2) by **Mert Erbak**. Many thanks for the clean foundation — the OCR pipeline, PDF support, bounding box visualisation, and grounding features are all his work.
## What's new in this fork
- **MathJax rendering** — the Markdown Preview tab now renders LaTeX math notation (inline `$...$` and display `$$...$$`) using MathJax 3, so equations from scanned papers and textbooks display as proper math rather than raw LaTeX source.
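
How this fork wires MathJax into the preview is not shown in this README. As a rough sketch only: MathJax 3 does not treat single-dollar `$...$` as inline math by default, so a configuration along these lines is usually needed. The constant name `MATHJAX_HEAD` is hypothetical, and the jsDelivr URL is the standard MathJax 3 CDN bundle, not necessarily what `app.py` loads.

```python
# Hypothetical HTML snippet that enables $...$ and $$...$$ delimiters in MathJax 3.
# A raw string keeps the backslashes exactly as the JavaScript config expects them.
MATHJAX_HEAD = r"""
<script>
window.MathJax = {
  tex: {
    inlineMath: [['$', '$'], ['\\(', '\\)']],
    displayMath: [['$$', '$$'], ['\\[', '\\]']]
  }
};
</script>
<script async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
"""
```

The snippet would have to be injected into the page head by whatever mechanism the app uses; that part is left out here because it depends on the Gradio version.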
## Features (inherited + extended)
| Feature | Description |
|---|---|
| 📋 Markdown | Convert documents to structured markdown with layout detection |
| 📝 Free OCR | Simple text extraction without layout analysis |
| 📍 Locate | Find and highlight specific text or elements with bounding boxes |
| 🔍 Describe | General image description |
| ✏️ Custom | Provide your own prompt |
| 🧮 Math Preview | Rendered MathJax output for equations and formulas *(new)* |
## Model
Uses `deepseek-ai/DeepSeek-OCR-2` with DeepEncoder v2. Achieves **91.09% on OmniDocBench** (+3.73% over v1).
Configuration: a 1024 base view plus 768 patches with dynamic cropping (2–6 patches). Each patch contributes 144 tokens, on top of 256 base tokens.
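
As a worked example of the token arithmetic, the sketch below assumes the visual token count is simply the 256 base tokens plus 144 per patch; the function name `visual_tokens` is made up for illustration.

```python
BASE_TOKENS = 256        # tokens for the base view (from the figures above)
TOKENS_PER_PATCH = 144   # tokens contributed by each cropped patch
MIN_PATCHES, MAX_PATCHES = 2, 6


def visual_tokens(num_patches: int) -> int:
    """Return the assumed visual token count for a given number of patches."""
    if not MIN_PATCHES <= num_patches <= MAX_PATCHES:
        raise ValueError(f"expected {MIN_PATCHES}-{MAX_PATCHES} patches, got {num_patches}")
    return BASE_TOKENS + TOKENS_PER_PATCH * num_patches


# Token budget for every allowed patch count.
budget = {n: visual_tokens(n) for n in range(MIN_PATCHES, MAX_PATCHES + 1)}
print(budget)  # {2: 544, 3: 688, 4: 832, 5: 976, 6: 1120}
```

Under that assumption, a page costs between 544 and 1120 visual tokens depending on how many patches the dynamic cropping selects.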
## How it works
The model processes images and PDFs using a prompt-based interface with special tokens that control its behaviour:
- **`<image>`** — replaced at inference time with visual patch embeddings from the input
- **`<|grounding|>`** — activates layout detection; the model then annotates every element it finds with a label and bounding box coordinates
- **`<|ref|>label<|/ref|><|det|>[[x1,y1,x2,y2]]<|/det|>`** — the format the model uses to output detected regions
When grounding is active, the model self-labels regions as `title`, `text`, `image`, `table`, etc. Regions labelled `image` are automatically cropped out and appear in the **Cropped Images** tab. All regions get bounding boxes drawn in the **Boxes** tab.
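
The region format above can be parsed with a regular expression. This is a minimal sketch, not the code in `app.py`: `parse_regions` and the sample string are invented for illustration, and the coordinates are returned exactly as emitted, with no rescaling to pixel space.

```python
import json
import re

# Matches <|ref|>label<|/ref|><|det|>[[x1,y1,x2,y2], ...]<|/det|>
REGION_RE = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(\[\[.*?\]\])<\|/det\|>",
    re.DOTALL,
)


def parse_regions(text: str) -> list[tuple[str, tuple[int, ...]]]:
    """Return (label, box) pairs for every detected region in the model output."""
    regions = []
    for label, boxes_json in REGION_RE.findall(text):
        # The box list is valid JSON, so one <|det|> may carry several boxes.
        for box in json.loads(boxes_json):
            regions.append((label, tuple(box)))
    return regions


sample = (
    "<|ref|>title<|/ref|><|det|>[[10,20,300,60]]<|/det|>\n# Heading\n"
    "<|ref|>image<|/ref|><|det|>[[50,100,400,500]]<|/det|>"
)
regions = parse_regions(sample)

# Regions labelled `image` are the ones that would be cropped out.
image_boxes = [box for label, box in regions if label == "image"]
```

Filtering on the `image` label mirrors the cropping behaviour described above; every other label would only get a box drawn.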
See [TECHNICAL.md](TECHNICAL.md) for a full breakdown of the pipeline, including some non-obvious implementation details.
## Running locally
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
pip install gradio spaces markdown pymdown-extensions
python app.py
```
Requires a CUDA-capable GPU. The model is downloaded from Hugging Face on first run.
## Secrets (Spaces + local)
For Hugging Face Spaces, store tokens in **Space Settings -> Variables and secrets**.
Use `HF_TOKEN` as the secret name.
For local workflows:
```bash
cp .env.example .env
# edit .env and set HF_TOKEN=...
set -a; source .env; set +a
```
The `.env` file is ignored by git via `.gitignore`, so `HF_TOKEN` stays out of version control.
To stream Space logs with token-based auth:
```bash
./scripts/fetch_space_logs.sh ricklon/DeepSeek-OCR-2-Math run
./scripts/fetch_space_logs.sh ricklon/DeepSeek-OCR-2-Math build
```
## TODO / Backlog
- Add a LaTeX lint/correction pipeline for OCR output:
  - Detect malformed math with `chktex` (or equivalent).
  - Normalize equivalent expressions (for example `^2` vs `^{2}`) before display/export.
  - Apply safe auto-fixes for common OCR-LaTeX artifacts.
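
One possible shape for the normalization item, sketched as a starting point and not as an implemented feature: wrap single-character superscripts and subscripts in braces so that `^2` and `^{2}` compare equal. The function name `normalize_scripts` is hypothetical, and the sketch assumes it is applied only to text already known to be inside math delimiters.

```python
import re

# A ^ or _ followed by a single letter or digit (not already braced, not a \command).
_BARE_SCRIPT_RE = re.compile(r"([\^_])([A-Za-z0-9])")


def normalize_scripts(latex: str) -> str:
    """Rewrite bare single-character scripts such as ^2 or _i as ^{2} or _{i}."""
    return _BARE_SCRIPT_RE.sub(r"\1{\2}", latex)


print(normalize_scripts("x^2 + y_i^{3}"))  # x^{2} + y_{i}^{3}
```

Only the first character after `^` or `_` is wrapped, which matches how LaTeX itself reads an unbraced script: `x^10` becomes `x^{1}0`.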