---
license: apache-2.0
base_model:
- PaddlePaddle/PaddleOCR-VL
base_model_relation: quantized
---
# deepseek-ocr.rs 🚀
Rust implementation of the DeepSeek-OCR inference stack with a fast CLI and an OpenAI-compatible HTTP server. The workspace packages multiple OCR backends, prompt tooling, and a serving layer so you can build document understanding pipelines that run locally on CPU, Apple Metal, or (alpha) NVIDIA CUDA GPUs.
> 中文文档请看 [README_CN.md](README_CN.md)。
> Want ready-made binaries? Latest macOS (Metal-enabled) and Windows bundles live in the [build-binaries workflow artifacts](https://github.com/TimmyOVO/deepseek-ocr.rs/actions/workflows/build-binaries.yml). Grab them from the newest green run.
## Choosing a Model 🔬
| Model | Memory footprint* | Best on | When to pick it |
| --- | --- | --- | --- |
| **DeepSeek‑OCR** | **≈6.3 GB** FP16 weights, **≈13 GB** RAM/VRAM with cache & activations (512-token budget) | Apple Silicon + Metal (FP16), high-VRAM NVIDIA GPUs, 32 GB+ RAM desktops | Highest accuracy, SAM+CLIP global/local context, MoE DeepSeek‑V2 decoder (3 B params, ~570 M active per token). Use when latency is secondary to quality. |
| **PaddleOCR‑VL** | **≈4.7 GB** FP16 weights, **≈9 GB** RAM/VRAM with cache & activations | 16 GB laptops, CPU-only boxes, mid-range GPUs | Dense 0.9 B Ernie decoder with SigLIP vision tower. Faster startup, lower memory, great for batch jobs or lightweight deployments. |
\*Measured from the default FP16 safetensors. Runtime footprint varies with sequence length.
Guidance:
- **Need maximum fidelity, multi-region reasoning, or already have 16–24 GB VRAM?** Use **DeepSeek‑OCR**. The hybrid SAM+CLIP tower plus DeepSeek‑V2 MoE decoder handles complex layouts best, but expect higher memory/latency.
- **Deploying to CPU-only nodes, 16 GB laptops, or latency-sensitive services?** Choose **PaddleOCR‑VL**. Its dense Ernie decoder (18 layers, hidden 1024) activates fewer parameters per token and keeps memory under 10 GB while staying close in quality on most docs.
## Why Rust? 💡
The original DeepSeek-OCR ships as a Python + Transformers stack—powerful, but hefty to deploy and awkward to embed. Rewriting the pipeline in Rust gives us:
- Smaller deployable artifacts with zero Python runtime or conda baggage.
- Memory-safe, thread-friendly infrastructure that blends into native Rust backends.
- Unified tooling (CLI + server) running on Candle + Rocket without the Python GIL overhead.
- Drop-in compatibility with OpenAI-style clients while tuned for single-turn OCR prompts.
## Technical Stack ⚙️
- **Candle** for tensor compute, with Metal and CUDA backends and FlashAttention support.
- **Rocket** + async streaming for OpenAI-compatible `/v1/responses` and `/v1/chat/completions`.
- **tokenizers** (upstream DeepSeek release) wrapped by `crates/assets` for deterministic caching via Hugging Face and ModelScope mirrors.
- **Pure Rust vision/prompt pipeline** shared by CLI and server to avoid duplicated logic.
## Advantages over the Python Release 🥷
- Faster cold-start on Apple Silicon, lower RSS, and native binary distribution.
- Deterministic dual-source (Hugging Face + ModelScope) asset download + verification built into the workspace.
- Automatic single-turn chat compaction so OCR outputs stay stable even when clients send history.
- Ready-to-use OpenAI compatibility for tools like Open WebUI without adapters.
## Highlights ✨
- **One repo, two entrypoints** – a batteries-included CLI for batch jobs and a Rocket-based server that speaks `/v1/responses` and `/v1/chat/completions`.
- **Works out of the box** – pulls model weights, configs, and tokenizer from whichever of Hugging Face or ModelScope responds fastest on first run.
- **Optimised for Apple Silicon** – optional Metal backend with FP16 execution for real-time OCR on laptops.
- **CUDA (alpha)** – experimental support via `--features cuda` + `--device cuda --dtype f16`; expect rough edges while we finish kernel coverage.
- **Intel MKL (preview)** – faster BLAS on x86 via `--features mkl` (install Intel oneMKL beforehand).
- **OpenAI client compatibility** – drop-in replacement for popular SDKs; the server automatically collapses chat history to the latest user turn for OCR-friendly prompts.
## Quick Start 🏁
### Prerequisites
- Rust toolchain with edition 2024 support (1.85 or newer)
- Git
- Optional: Apple Silicon running macOS 13+ for Metal acceleration
- Optional: CUDA 12.2+ toolkit + driver for experimental NVIDIA GPU acceleration on Linux/Windows
- Optional: Intel oneAPI MKL for preview x86 acceleration (see below)
- (Recommended) Hugging Face account with `HF_TOKEN` when pulling from the `deepseek-ai/DeepSeek-OCR` repo (ModelScope is used automatically when it’s faster/reachable).
### Clone the Workspace
```bash
git clone https://github.com/TimmyOVO/deepseek-ocr.rs.git
cd deepseek-ocr.rs
cargo fetch
```
### Model Assets
The first invocation of the CLI or server downloads the config, tokenizer, and `model-00001-of-000001.safetensors` (~6.3GB) into `DeepSeek-OCR/`. To prefetch manually:
```bash
cargo run -p deepseek-ocr-cli --release -- --help # dev profile is extremely slow; always prefer --release
```
> Always include `--release` when running from source; debug builds on this model are extremely slow.
Set `HF_HOME`/`HF_TOKEN` if you store Hugging Face caches elsewhere (ModelScope downloads land alongside the same asset tree). The full model package is ~6.3GB on disk and typically requires ~13GB of RAM headroom during inference (model + activations).
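For example, a minimal sketch of prefetching into a custom cache with authentication; the cache path and token value are placeholders, and this assumes the downloader honours the standard `HF_HOME`/`HF_TOKEN` variables described above:
```bash
# Placeholder cache location and token; adjust for your environment.
export HF_HOME=/data/hf-cache
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx   # only needed for authenticated Hugging Face pulls

# The first --release invocation downloads the config, tokenizer, and weights.
cargo run -p deepseek-ocr-cli --release -- --help
```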
## Configuration & Overrides 🗂️
The CLI and server share the same configuration. On first launch we create a `config.toml` populated with defaults; later runs reuse it so both entrypoints stay in sync.
| Platform | Config file (default) | Model cache root |
| --- | --- | --- |
| Linux | `~/.config/deepseek-ocr/config.toml` | `~/.cache/deepseek-ocr/models/<id>/…` |
| macOS | `~/Library/Application Support/deepseek-ocr/config.toml` | `~/Library/Caches/deepseek-ocr/models/<id>/…` |
| Windows | `%APPDATA%\deepseek-ocr\config.toml` | `%LOCALAPPDATA%\deepseek-ocr\models\<id>\…` |
- Override the location with `--config /path/to/config.toml` (available on both CLI and server). Missing files are created automatically.
- Each `[models.entries."<id>"]` record can point to custom `config`, `tokenizer`, or `weights` files. When omitted we fall back to the cache directory above and download/update assets as required.
- Runtime values resolve in this order: command-line flags → values stored in `config.toml` → built-in defaults. The HTTP API adds a final layer where request payload fields (for example `max_tokens`) override everything else for that call.
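As a concrete illustration of that precedence, here is a hedged sketch of a run where a command-line flag overrides the persisted default (the config path, prompt, and image are placeholders):
```bash
# config.toml may persist max_new_tokens = 512; the flag below wins for this invocation only.
deepseek-ocr-cli \
  --config ~/.config/deepseek-ocr/config.toml \
  --max-new-tokens 1024 \
  --prompt "<image>\nConvert this page to markdown." \
  --image ./page.png
```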
The generated file starts with the defaults below; adjust them to persistently change behaviour:
```toml
[models]
active = "deepseek-ocr"
[models.entries.deepseek-ocr]
[inference]
device = "cpu"
template = "plain"
base_size = 1024
image_size = 640
crop_mode = true
max_new_tokens = 512
use_cache = true
[server]
host = "0.0.0.0"
port = 8000
```
- `[models]` picks the active model and lets you add more entries (each entry can point to its own config/tokenizer/weights).
- `[inference]` controls the runtime defaults shared by the CLI and server (device, template, vision sizing, decoding budget, cache usage).
- `[server]` sets the network binding and the model identifier reported by `/v1/models`.
See `crates/cli/README.md` and `crates/server/README.md` for concise override tables.
## Benchmark Snapshot 📊
Single-request Rust CLI (Accelerate backend on macOS) compared with the reference Python pipeline on the same prompt and image:
| Stage | Rust total (ms) | Rust avg (ms) | Python total (ms) | Python/Rust |
|---------------------------------------------------|----------------|--------------|--------------|------------|
| Decode – Overall (`decode.generate`) | 30077.840 | 30077.840 | 56554.873 | 1.88x |
| Decode – Token Loop (`decode.iterative`) | 26930.216 | 26930.216 | 39227.974 | 1.46x |
| Decode – Prompt Prefill (`decode.prefill`) | 3147.337 | 3147.337 | 5759.684 | 1.83x |
| Prompt – Build Tokens (`prompt.build_tokens`) | 0.466 | 0.466 | 45.434 | 97.42x |
| Prompt – Render Template (`prompt.render`) | 0.005 | 0.005 | 0.019 | 3.52x |
| Vision – Embed Images (`vision.compute_embeddings`)| 6391.435 | 6391.435 | 3953.459 | 0.62x |
| Vision – Prepare Inputs (`vision.prepare_inputs`) | 62.524 | 62.524 | 45.438 | 0.73x |
## Command-Line Interface 🖥️
Build and run directly from the workspace:
```bash
cargo run -p deepseek-ocr-cli --release -- \
--prompt "<image>\n<|grounding|>Convert this receipt to markdown." \
--image baselines/sample/images/test.png \
--device cpu --max-new-tokens 512
```
> Tip: `--release` is required for reasonable throughput; debug builds can be 10x slower.
> macOS tip: append `--features metal` to the `cargo run`/`cargo build` commands to compile with Accelerate + Metal backends.
>
> CUDA tip (Linux/Windows): append `--features cuda` and run with `--device cuda --dtype f16` to target NVIDIA GPUs—feature is still alpha, so be ready for quirks.
>
> Intel MKL preview: install Intel oneMKL, then build with `--features mkl` for faster CPU matmuls on x86.
Install the CLI as a binary:
```bash
cargo install --path crates/cli
deepseek-ocr-cli --help
```
Key flags:
- `--prompt` / `--prompt-file`: text with `<image>` slots
- `--image`: path(s) matching `<image>` placeholders
- `--device` and `--dtype`: choose `metal` + `f16` on Apple Silicon or `cuda` + `f16` on NVIDIA GPUs
- `--max-new-tokens`: decoding budget
- Sampling controls: `--do-sample`, `--temperature`, `--top-p`, `--top-k`, `--repetition-penalty`, `--no-repeat-ngram-size`, `--seed`
- By default decoding stays deterministic (`do_sample=false`, `temperature=0.0`, `no_repeat_ngram_size=20`)
- To use stochastic sampling set `--do-sample true --temperature 0.8` (and optionally adjust the other knobs)
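For example, a sampled run might combine the knobs above like this; the values are illustrative, not tuned recommendations:
```bash
deepseek-ocr-cli \
  --prompt "<image>\n<|grounding|>Convert this receipt to markdown." \
  --image baselines/sample/images/test.png \
  --device cpu \
  --do-sample true --temperature 0.8 --top-p 0.9 --top-k 50 \
  --repetition-penalty 1.1 --seed 42
```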
### Switching Models
The autogenerated `config.toml` now contains two model entries:
- `deepseek-ocr` (default) – the original DeepSeek vision-language stack.
- `paddleocr-vl` – the PaddleOCR-VL 0.9B SigLIP + Ernie release.
Pick which one to load via `--model`:
```bash
deepseek-ocr-cli --model paddleocr-vl --prompt "<image> Summarise"
```
The CLI (and server) will download the matching config/tokenizer/weights from the appropriate repository (`deepseek-ai/DeepSeek-OCR` or `PaddlePaddle/PaddleOCR-VL`) into your cache on first use. You can still override paths with `--model-config`, `--tokenizer`, or `--weights` if you maintain local fine-tunes.
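For example, a local fine-tune could be wired in like this; the directory layout and file names below are hypothetical:
```bash
deepseek-ocr-cli \
  --model paddleocr-vl \
  --model-config ./finetunes/paddleocr-vl/config.json \
  --tokenizer ./finetunes/paddleocr-vl/tokenizer.json \
  --weights ./finetunes/paddleocr-vl/model.safetensors \
  --prompt "<image> Summarise" \
  --image ./doc.png
```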
## HTTP Server ☁️
Launch an OpenAI-compatible endpoint:
```bash
cargo run -p deepseek-ocr-server --release -- \
--host 0.0.0.0 --port 8000 \
--device cpu --max-new-tokens 512
```
> Keep `--release` on the server as well; the debug profile is far too slow for inference workloads.
> macOS tip: add `--features metal` to the `cargo run -p deepseek-ocr-server` command when you want the server binary to link against Accelerate + Metal (and pair it with `--device metal` at runtime).
>
> CUDA tip: add `--features cuda` and start the server with `--device cuda --dtype f16` to offload inference to NVIDIA GPUs (alpha-quality support).
>
> Intel MKL preview: install Intel oneMKL before building with `--features mkl` to accelerate CPU workloads on x86.
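Once the server is up, a quick sanity check is to list the exposed model IDs (assuming the default host/port above):
```bash
curl http://localhost:8000/v1/models
```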
Notes:
- Use `data:` URLs or remote `http(s)` links; local paths are rejected.
- The server collapses multi-turn chat inputs to the latest user message to keep prompts OCR-friendly.
- Works out of the box with tools such as [Open WebUI](https://github.com/open-webui/open-webui) or any OpenAI-compatible client—just point the base URL to your server (`http://localhost:8000/v1`) and select either the `deepseek-ocr` or `paddleocr-vl` model ID exposed in `/v1/models`.
- Adjust the request body limit with Rocket config if you routinely send large images.
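As a request sketch, the snippet below assumes the standard OpenAI chat-completions payload shape with an inline `data:` image; the base64 body is a truncated placeholder:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ocr",
    "max_tokens": 512,
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Convert this page to markdown."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_IMAGE>"}}
      ]
    }]
  }'
```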

## GPU Acceleration ⚡
- **Metal (macOS 13+ Apple Silicon)** – pass `--device metal --dtype f16` and build binaries with `--features metal` so Candle links against Accelerate + Metal.
- **CUDA (alpha, NVIDIA GPUs)** – install CUDA 12.2+ toolkits, build with `--features cuda`, and launch the CLI/server with `--device cuda --dtype f16`; still experimental.
- **Intel MKL (preview)** – install Intel oneMKL and build with `--features mkl` to speed up CPU workloads on x86.
- For any of these backends, prefer release builds (e.g. `cargo build --release -p deepseek-ocr-cli --features metal|cuda`) to maximise throughput.
- Combine GPU runs with `--max-new-tokens` and crop tuning flags to balance latency vs. quality.
## Repository Layout 🗂️
- `crates/core` – shared inference pipeline, model loaders, conversation templates.
- `crates/cli` – command-line frontend (`deepseek-ocr-cli`).
- `crates/server` – Rocket server exposing OpenAI-compatible endpoints.
- `crates/assets` – asset management (configuration, tokenizer, Hugging Face + ModelScope download helpers).
- `baselines/` – reference inputs and outputs for regression testing.
Detailed CLI usage lives in [`crates/cli/README.md`](crates/cli/README.md). The server’s OpenAI-compatible interface is covered in [`crates/server/README.md`](crates/server/README.md).
## Troubleshooting 🛠️
- **Where do assets come from?** – downloads automatically pick between Hugging Face and ModelScope based on latency; the CLI prints the chosen source for each file.
- **Slow first response** – model load and GPU warm-up (Metal/CUDA alpha) happen on the initial request; later runs are faster.
- **Large image rejection** – increase Rocket JSON limits in `crates/server/src/main.rs` or downscale the input.
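If the server builds its Rocket instance from the default figment (an assumption; check how `crates/server/src/main.rs` configures Rocket), the JSON limit can also be raised via Rocket's standard environment overrides instead of editing code:
```bash
# Assumes Rocket's default figment, which merges ROCKET_* environment variables.
ROCKET_LIMITS='{json="20MiB"}' \
  cargo run -p deepseek-ocr-server --release -- --host 0.0.0.0 --port 8000
```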
## Roadmap 🗺️
- ✅ Apple Metal backend with FP16 support and CLI/server parity on macOS.
- ✅ NVIDIA CUDA backend (alpha) – build with `--features cuda`, run with `--device cuda --dtype f16` for Linux/Windows GPUs; polishing in progress.
- 🔄 **Parity polish** – finish projector normalisation + crop tiling alignment; extend intermediate-tensor diff suite beyond the current sample baseline.
- 🔄 **Grounding & streaming** – port the Python post-processing helpers (box extraction, markdown polish) and refine SSE streaming ergonomics.
- 🔄 **Cross-platform acceleration** – continue tuning CUDA kernels, add automatic device detection across CPU/Metal/CUDA, and publish opt-in GPU benchmarks.
- 🔄 **Packaging & Ops** – ship binary releases with deterministic asset checksums, richer logging/metrics, and Helm/docker references for server deploys.
- 🔜 **Structured outputs** – optional JSON schema tools for downstream automation once parity gaps close.
## License 📄
This repository inherits the licenses of its dependencies and the upstream DeepSeek-OCR model. Refer to `DeepSeek-OCR/LICENSE` for model terms and apply the same restrictions to downstream use.