---
license: mit
language:
- en
- ja
- multilingual
tags:
- core-ai
- coreai
- on-device
- ocr
- document-ai
- vision-language
- apple
pipeline_tag: image-to-text
base_model: baidu/Unlimited-OCR
library_name: coreai
---

# Unlimited-OCR: Core AI (on-device document OCR)

**On-device document → structured-markdown OCR, end-to-end on Apple Core AI.** A port of
[`baidu/Unlimited-OCR`](https://huggingface.co/baidu/Unlimited-OCR) (3B-A0.5B MoE, MIT): drop a
document image, get back **markdown**, with tables as HTML (`<table><tr><td>…`), formulas as **LaTeX**,
reading order, and `<|det|>` layout boxes. Japanese + English + multilingual.

Runs on the **stock `coreai.runtime`** with **no engine patch**: the decoder is driven directly
on `inputs_embeds`, so this is a pure-export port (not the static-input-buffer VLM path).

<!-- gen-cards:use-it begin id=unlimited-ocr (managed by scripts/gen-cards; edit cards.json / QuickStart.swift, not this block) -->
## Use it

▶️ **Run it (source)**: the [ReadDoc runner](https://github.com/john-rocky/coreai-kit/tree/main/Examples/ReadDoc)
(GUI + CLI, one app for every document-OCR model in the catalog):

```bash
git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/ReadDoc/ReadDoc.xcodeproj
# Run, then pick "Unlimited-OCR" in the model picker

# agents / headless (macOS):
cd coreai-kit/Examples/ReadDoc
swift run readdoc-cli --model unlimited-ocr --image sample.png
```

💻 **Build with it**: the snippet is complete; the glue is kit API, so it runs when copy-pasted:

```swift
import CoreAIKit

let reader = try await KitDocReader(catalog: "unlimited-ocr")
let markdown = try await reader.read(imageAt: imageURL)
// markdown: the document as structured text (tables as <table>/<tr>/<td>,
// <|det|> layout boxes, reading order), produced fully on-device
```

The take-home is [`Examples/ReadDoc/Sources/QuickStart.swift`](https://github.com/john-rocky/coreai-kit/blob/main/Examples/ReadDoc/Sources/QuickStart.swift):
this exact code as one typed function, no UI. The CLI is an argument shell over it, and
the GUI drives the same `KitDocReader(catalog:)` on the image you pick.
Make one `read(imageAt:)` call per page, so split a PDF into page images first. The output keeps
the model's structural markup (tables as HTML, formulas as LaTeX, `<|det|>` boxes);
strip or render it as your app prefers.

**Integration checklist**

- SPM: `https://github.com/john-rocky/coreai-kit`, product **CoreAIKit**
- Info.plist: none needed
- Entitlements: none needed
- First run downloads the model (4.5 GB on Mac), then it loads from the
  local cache (Application Support; progress via the `downloadProgress` callback)
- Measure in Release: Debug is ~3× slower on per-token host work
<!-- gen-cards:use-it end -->

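
If an app only needs the text and tables, the "strip" option mentioned above could look roughly like the sketch below. It assumes each layout box is written as `<|det|>...<|/det|>`; this card does not state the closing tag, so check real output first. `stripLayoutBoxes` is a hypothetical helper, not part of CoreAIKit.

```swift
import Foundation

/// Removes layout-box spans from OCR output, leaving tables and LaTeX untouched.
/// ASSUMPTION: each box is written as `<|det|>...<|/det|>`; the closing tag is
/// a guess, so inspect real model output before relying on this pattern.
func stripLayoutBoxes(_ markdown: String) -> String {
    // Non-greedy match so two boxes on the same line are removed separately.
    let pattern = #"<\|det\|>.*?<\|/det\|>"#
    return markdown.replacingOccurrences(of: pattern, with: "", options: .regularExpression)
}

let sample = "Total<|det|>[[10,20,300,40]]<|/det|> 42"
print(stripLayoutBoxes(sample))
```

Keeping the boxes instead is just as valid when the app wants to highlight regions on the page image; the regex only covers the "text only" case.
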
## What's exciting (why you'd use it)

- **Private OCR**: invoices, receipts, contracts, papers, forms never leave the device.
- **Structured, not just text**: tables → HTML, equations → LaTeX, layout → boxes. RAG-ready ingestion.
- **Flat latency**: a static-shape decode graph (data-driven KV write + fixed-buffer R-SWA mask)
  keeps every tensor shape constant, so the runtime compiles once and decode stays **flat at
  ~12.7 ms/token (~79 tok/s on M4 Max)**, with no growing-cache recompilation stalls.
- **SOTA quality**: the source model tops OmniDocBench v1.6 (93.92); this port is byte-faithful
  to the fp32 reference (decoder: 0 flips at the sampled steps; vision encoder: cosine similarity 1.000000).

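
To make the "fixed-buffer mask" idea concrete, here is a generic sliding-window mask over a fixed-capacity cache. It is only an illustration of why the shape stays constant: the window, capacity, and the plain causal layout are assumptions, and the exported graph's actual R-SWA mask may be laid out differently.

```swift
/// Builds an additive attention mask of constant length `capacity` for the
/// token at `position`: 0 for visible cache slots, -infinity for the rest.
/// ASSUMPTION: a plain causal sliding window; the real R-SWA layout may differ.
func slidingWindowMask(position: Int, window: Int, capacity: Int) -> [Float] {
    precondition(position >= 0 && position < capacity && window > 0)
    let oldestVisible = max(0, position - window + 1)
    return (0..<capacity).map { slot in
        (oldestVisible...position).contains(slot) ? 0 : -Float.infinity
    }
}

// The array length never changes as decoding advances; only its values do.
let early = slidingWindowMask(position: 2, window: 4, capacity: 8)
let late = slidingWindowMask(position: 6, window: 4, capacity: 8)
print(early.count == late.count)
```

Because the position only changes the values and never the length, a runtime that specializes on tensor shapes has nothing new to compile from one decode step to the next.
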
## Bundles

| path | what | dtype | size |
|---|---|---|---|
| `vision/unlimited_ocr_vision.aimodel` | DeepEncoder (SAM-ViT + CLIP-ViT cascade) → 100 visual tokens | fp16 | 762 MB |
| `decoder/unlimited_ocr_decoder.aimodel` | DeepseekV2 R-SWA MoE decoder, functions **`prefill`** + **`decode`** sharing one weight set + KV state | sym8 | 3.2 GB |
| `assets/embed_tokens.f16` | token embedding table `[129280,1280]` (host row-gather) | fp16 | 316 MB |
| `assets/{image_newline,view_seperator}.f16`, `assets/prompt_input_ids.i32`, `assets/recipe.json` | arrangement constants + the assembly recipe | - | tiny |
| `tokenizer/` | fast tokenizer (`tokenizer.json` + configs) | - | - |

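
The "host row-gather" on `assets/embed_tokens.f16` can be sketched as a byte-offset lookup. This assumes the file is a headerless, row-major, little-endian fp16 array, which the card does not spell out; `gatherRow` is a hypothetical name, and the values are returned as raw 16-bit patterns rather than converted floats.

```swift
import Foundation

/// Copies one embedding row out of a flat fp16 table as raw 16-bit patterns.
/// ASSUMPTION: row-major layout, no header, little-endian, 2 bytes per value.
func gatherRow(_ table: Data, tokenID: Int, hiddenSize: Int) -> [UInt16]? {
    let rowBytes = hiddenSize * 2
    let start = tokenID * rowBytes
    guard tokenID >= 0, start + rowBytes <= table.count else { return nil }
    return table.withUnsafeBytes { (raw: UnsafeRawBufferPointer) -> [UInt16] in
        (0..<hiddenSize).map { i -> UInt16 in
            let lo = UInt16(raw[start + 2 * i])
            let hi = UInt16(raw[start + 2 * i + 1])
            return lo | (hi << 8)
        }
    }
}

// Tiny stand-in table: 3 rows x 2 values holding the bit patterns 1...6.
let tinyTable = Data([1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0])
print(gatherRow(tinyTable, tokenID: 1, hiddenSize: 2) ?? [])
```

For the real table, `hiddenSize` would be 1280 and the file would be 129280 × 1280 × 2 = 330,956,800 bytes, which matches the listed 316 MB when read as MiB. Loading it with `Data(contentsOf:options:)` and `.mappedIfSafe` avoids copying the whole table into memory.
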
## Pipeline (Base mode, 640px)

```
image → preprocess (pad to 640², normalize mean=std=0.5)
  → vision .aimodel → visual tokens [1,100,1280]
  → arrange (10×10 + image_newline per row + view_seperator) → [111,1280]
  → scatter into embed_tokens(prompt_ids) → prefix [1,115,1280]
  → decoder: prefill(prefix) + greedy decode (no_repeat_ngram=35) → tokens
  → detokenize (keep special tokens) → markdown
```

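
The arrange step's token count (100 → 111) follows from one `image_newline` per grid row plus one trailing `view_seperator`. A minimal sketch, assuming the 100 visual tokens arrive in row-major order (`assets/recipe.json` is the authority on the real ordering) and using a small stand-in hidden size:

```swift
/// Interleaves a grid of visual tokens with one newline vector per row,
/// then appends a single separator vector.
/// ASSUMPTION: tokens arrive in row-major order; the recipe file is authoritative.
func arrange(tokens: [[Float]], grid: Int, newline: [Float], separator: [Float]) -> [[Float]] {
    precondition(tokens.count == grid * grid)
    var out: [[Float]] = []
    out.reserveCapacity(grid * (grid + 1) + 1)
    for row in 0..<grid {
        out.append(contentsOf: tokens[(row * grid)..<((row + 1) * grid)])
        out.append(newline)
    }
    out.append(separator)
    return out
}

let dim = 4 // stand-in for the real hidden size of 1280
let visualTokens = [[Float]](repeating: [Float](repeating: 0, count: dim), count: 100)
let arranged = arrange(tokens: visualTokens, grid: 10,
                       newline: [Float](repeating: 1, count: dim),
                       separator: [Float](repeating: 2, count: dim))
print(arranged.count) // 10 * (10 + 1) + 1 = 111
```
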
The exact, verified recipe is in `assets/recipe.json`. Reference implementations (Python end-to-end
+ a macOS app, **CoreAIOCR**, driving the stock runtime) are in the
[Core AI Model Zoo](https://github.com/john-rocky/coreai-model-zoo): `conversion/unlimited_ocr/` and
`apps/CoreAIOCR/`.

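
For readers reimplementing the decode loop, the no-repeat-n-gram rule used with greedy decoding can be sketched as follows. This is the usual formulation of the rule (ban any token that would complete an n-gram already present in the output), not code taken from the reference implementations; the function names are hypothetical.

```swift
/// Returns the token IDs that would complete an n-gram already present in
/// `history`, i.e. the IDs a no-repeat-n-gram rule bans at the next step.
func bannedTokens(history: [Int], ngram n: Int) -> Set<Int> {
    guard n > 0, history.count >= n else { return [] }
    let prefix = Array(history.suffix(n - 1))
    var banned = Set<Int>()
    for start in 0...(history.count - n) {
        if Array(history[start..<(start + n - 1)]) == prefix {
            banned.insert(history[start + n - 1])
        }
    }
    return banned
}

/// Greedy pick: the highest-scoring token that is not banned.
func greedyPick(logits: [Float], banned: Set<Int>) -> Int? {
    logits.indices.filter { !banned.contains($0) }.max { logits[$0] < logits[$1] }
}

let history = [1, 2, 3, 1, 2]
let banned = bannedTokens(history: history, ngram: 3)
print(banned, greedyPick(logits: [0.1, 0.2, 0.3, 0.9, 0.5], banned: banned) ?? -1)
```

With the card's setting of 35, the ban only triggers once a 34-token run repeats exactly, so it acts as a guard against runaway loops (for example endless table rows) rather than a constraint on ordinary text.
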
## Notes

- **Appropriate input**: clean single-page documents (invoice / paper / report / table / formula),
  roughly square or portrait, with text still legible when fit to 640². Very dense small-text scans
  (newspaper) need the tiled `crop_mode` vision export, which is not included here (this bundle is Base mode only).
- Prompt is fixed to `document parsing` (layout + structured extraction).
- License: **MIT** (inherited from `baidu/Unlimited-OCR`).

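
To judge whether a page stays legible at 640², it helps to compute the downscale factor up front. The sketch below assumes an aspect-preserving resize followed by centered padding (the card only says "pad to 640²", so the centering is a guess) and shows the mean = std = 0.5 normalization for one 8-bit channel value.

```swift
/// Size and offset of an image after an aspect-preserving fit into a square canvas.
/// ASSUMPTION: the resized image is centered; the card only says "pad to 640²".
func fit(width: Int, height: Int, canvas: Int = 640) -> (w: Int, h: Int, x: Int, y: Int) {
    let scale = Double(canvas) / Double(max(width, height))
    let w = Int((Double(width) * scale).rounded())
    let h = Int((Double(height) * scale).rounded())
    return (w, h, (canvas - w) / 2, (canvas - h) / 2)
}

/// Maps an 8-bit channel value to [-1, 1] using mean = std = 0.5.
func normalize(_ value: UInt8) -> Float {
    (Float(value) / 255 - 0.5) / 0.5
}

let a4 = fit(width: 2480, height: 3508) // A4 scan at 300 dpi
print(a4, normalize(0), normalize(255))
```

An A4 scan at 300 dpi is scaled by about 0.18, so 10 pt text (roughly 42 px tall in the scan) ends up under 8 px tall. That is the kind of input where legibility becomes marginal in Base mode.
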
*Community port, not affiliated with Apple or Baidu.*