---
license: mit
language:
- en
- ja
- multilingual
tags:
- core-ai
- coreai
- on-device
- ocr
- document-ai
- vision-language
- apple
pipeline_tag: image-to-text
base_model: baidu/Unlimited-OCR
library_name: coreai
---
# Unlimited-OCR → Core AI (on-device document OCR)
**On-device document → structured-markdown OCR, end-to-end on Apple Core AI.** A port of
[`baidu/Unlimited-OCR`](https://huggingface.co/baidu/Unlimited-OCR) (3B-A0.5B MoE, MIT): drop a
document image, get back **markdown**, with tables as HTML (`<table><tr><td>…`), formulas as **LaTeX**,
reading order, and `<|det|>` layout boxes. Japanese + English + multilingual.

Runs on the **stock `coreai.runtime`** with **no engine patch**: the decoder is driven directly
on `inputs_embeds`, so this is a pure-export port (not the static-input-buffer VLM path).
<!-- gen-cards:use-it begin id=unlimited-ocr (managed by scripts/gen-cards; edit cards.json / QuickStart.swift, not this block) -->
## Use it
▶️ **Run it (source)** with the [ReadDoc runner](https://github.com/john-rocky/coreai-kit/tree/main/Examples/ReadDoc)
(GUI + CLI, one app for every document-OCR model in the catalog):
```bash
git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/ReadDoc/ReadDoc.xcodeproj
# → Run, then pick "Unlimited-OCR" in the model picker
# agents / headless (macOS):
cd coreai-kit/Examples/ReadDoc
swift run readdoc-cli --model unlimited-ocr --image sample.png
```
💻 **Build with it.** A complete example; the glue is kit API, so it runs as pasted:
```swift
import CoreAIKit
import Foundation

// Path to one page image; replace with your own file.
let imageURL = URL(fileURLWithPath: "sample.png")
let reader = try await KitDocReader(catalog: "unlimited-ocr")
let markdown = try await reader.read(imageAt: imageURL)
// markdown: the document as structured text (tables as <table>/<tr>/<td>,
// <|det|> layout boxes, reading order), produced fully on-device
```
The take-home is [`Examples/ReadDoc/Sources/QuickStart.swift`](https://github.com/john-rocky/coreai-kit/blob/main/Examples/ReadDoc/Sources/QuickStart.swift):
this exact code as one typed function, no UI. The CLI is an argument shell over it, and
the GUI drives the same `KitDocReader(catalog:)` on the image you pick.

One `read(imageAt:)` call handles one page, so split a PDF into page images first. The output keeps
the model's structural markup (tables as HTML, formulas as LaTeX, `<|det|>` boxes);
strip or render it as your app prefers.
**Integration checklist**
- SPM: `https://github.com/john-rocky/coreai-kit` → product **CoreAIKit**
- Info.plist: none needed
- Entitlements: none needed
- First run downloads the model (4.5 GB on Mac); after that it loads from the
  local cache (Application Support; progress via the `downloadProgress` callback)
- Measure in Release: Debug is ~3× slower on per-token host work
<!-- gen-cards:use-it end -->
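If your app only needs the text, the layout boxes in the returned markdown can be removed with a regular expression. The sketch below is a minimal example under one loud assumption: that each box is serialized on a single line as `<|det|>...<|/det|>`. The closing `<|/det|>` tag and the helper name `stripLayoutBoxes` are illustrative and not part of the kit API, so check a real output string before relying on it.

```swift
import Foundation

/// Hypothetical helper: removes `<|det|>...<|/det|>` spans from model output.
/// Assumption: boxes are closed with `<|/det|>` and do not span lines.
func stripLayoutBoxes(_ markdown: String) -> String {
    // `|` is a regex metacharacter, so it is escaped; `.*?` is non-greedy
    // so that two boxes on one line are removed separately.
    let pattern = #"<\|det\|>.*?<\|/det\|>"#
    return markdown.replacingOccurrences(of: pattern,
                                         with: "",
                                         options: .regularExpression)
}

// Made-up sample string, not real model output.
let sample = "<|det|>[[10, 20, 300, 60]]<|/det|># Invoice"
print(stripLayoutBoxes(sample))  // prints "# Invoice"
```

Tables and formulas are left untouched here, since HTML and LaTeX are usually worth rendering rather than stripping.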
## What's exciting (why you'd use it)
- **Private OCR**: invoices, receipts, contracts, papers, forms never leave the device.
- **Structured, not just text**: tables → HTML, equations → LaTeX, layout → boxes. RAG-ready ingestion.
- **Flat latency**: a static-shape decode graph (data-driven KV write + fixed-buffer R-SWA mask)
  keeps every tensor shape constant, so the runtime compiles once and decode stays **flat at
  ~12.7 ms/token (~79 tok/s on M4 Max)**, with no growing-cache recompilation stalls.
- **SOTA quality**: the source model tops OmniDocBench v1.6 (93.92); this port is byte-faithful
  to the fp32 reference (decoder: 0 token flips at the sampled steps; vision encoder: cosine similarity 1.000000).
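Because the per-token cost is flat, decode time scales linearly with output length. The following sketch only restates the quoted 12.7 ms/token figure as arithmetic; the function name is hypothetical, and vision encoding and prefill are excluded because this card does not quantify them.

```swift
let msPerToken = 12.7                      // figure quoted above (M4 Max)
let tokensPerSecond = 1000.0 / msPerToken  // about 78.7, i.e. the ~79 tok/s above

/// Hypothetical helper: decode-only time in seconds for `tokens` output tokens.
/// Excludes the vision encoder and prefill.
func decodeSeconds(tokens: Int, msPerToken: Double = 12.7) -> Double {
    Double(tokens) * msPerToken / 1000.0
}

print(tokensPerSecond)
print(decodeSeconds(tokens: 1000))
```

Under these assumptions a page that decodes to 1,000 tokens spends roughly 12.7 s in decode on that hardware.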
## Bundles
| path | what | dtype | size |
|---|---|---|---|
| `vision/unlimited_ocr_vision.aimodel` | DeepEncoder (SAM-ViT + CLIP-ViT cascade) → 100 visual tokens | fp16 | 762 MB |
| `decoder/unlimited_ocr_decoder.aimodel` | DeepseekV2 R-SWA MoE decoder, functions **`prefill`** + **`decode`** sharing one weight set + KV state | sym8 | 3.2 GB |
| `assets/embed_tokens.f16` | token embedding table `[129280,1280]` (host row-gather) | fp16 | 316 MB |
| `assets/{image_newline,view_seperator}.f16`, `assets/prompt_input_ids.i32`, `assets/recipe.json` | arrangement constants + the assembly recipe | n/a | tiny |
| `tokenizer/` | fast tokenizer (`tokenizer.json` + configs) | n/a | n/a |
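The "host row-gather" for `embed_tokens.f16` can be sketched as plain byte slicing. This assumes the file is a headerless, row-major array of 2-byte fp16 values with shape `[129280, 1280]`; that layout and the helper name `gatherRows` are assumptions, so confirm them against `assets/recipe.json`. The arithmetic also accounts for the listed size: 129280 × 1280 × 2 bytes is about 316 MiB.

```swift
import Foundation

let vocabSize = 129_280, hiddenSize = 1_280, bytesPerFP16 = 2
let rowBytes = hiddenSize * bytesPerFP16   // 2,560 bytes per token row
let tableBytes = vocabSize * rowBytes      // 330,956,800 bytes, about 316 MiB

/// Hypothetical helper: copies the raw fp16 rows for `ids` out of a
/// row-major table. Bytes are left undecoded; the caller decides how to
/// hand them to the runtime.
func gatherRows(from table: Data, ids: [Int], rowBytes: Int) -> Data {
    var out = Data(capacity: ids.count * rowBytes)
    for id in ids {
        let start = table.startIndex + id * rowBytes
        out.append(table[start ..< start + rowBytes])
    }
    return out
}

// Tiny stand-in table: 3 rows of 2 bytes each.
let toy = Data([0, 1, 2, 3, 4, 5])
print(Array(gatherRows(from: toy, ids: [2, 0], rowBytes: 2)))  // [4, 5, 0, 1]
```

For the real file, loading with `Data(contentsOf:options:)` and `.mappedIfSafe` avoids reading the whole table into memory when only a few rows are needed.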
## Pipeline (Base mode, 640px)
```
image → preprocess (pad to 640², normalize mean=std=0.5)
      → vision .aimodel → visual tokens [1,100,1280]
      → arrange (10×10 + image_newline per row + view_seperator) → [111,1280]
      → scatter into embed_tokens(prompt_ids) → prefix [1,115,1280]
      → decoder: prefill(prefix) + greedy decode (no_repeat_ngram=35) → tokens
      → detokenize (keep special tokens) → markdown
```
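The token counts in the arrange step follow from the grid layout: 10 rows of 10 visual tokens, one `image_newline` after each row, and one `view_seperator`, giving 10 × (10 + 1) + 1 = 111. The sketch below builds that slot order symbolically. The exact ordering (newline at the end of each row, separator last) is an assumption here, and the `Slot` type and `arrange` function are illustrative names; `assets/recipe.json` is the authoritative description.

```swift
/// Symbolic slots for the arranged image sequence (illustrative type).
enum Slot: Equatable {
    case visual(Int)    // index into the 100 visual tokens, row-major
    case imageNewline   // learned row terminator vector
    case viewSeparator  // learned end-of-view vector
}

/// Assumed layout: each grid row followed by a newline, then one separator.
func arrange(gridSide: Int) -> [Slot] {
    var slots: [Slot] = []
    for row in 0..<gridSide {
        for col in 0..<gridSide {
            slots.append(.visual(row * gridSide + col))
        }
        slots.append(.imageNewline)
    }
    slots.append(.viewSeparator)
    return slots
}

let slots = arrange(gridSide: 10)
print(slots.count)  // 111 = 10 * (10 + 1) + 1
```

The prefix length of 115 then implies 115 − 111 = 4 non-image prompt positions; which tokens those are is not stated in this card.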
The exact, verified recipe is in `assets/recipe.json`. Reference implementations (Python end-to-end
+ a macOS app, **CoreAIOCR**, driving the stock runtime) are in the
[Core AI Model Zoo](https://github.com/john-rocky/coreai-model-zoo): `conversion/unlimited_ocr/` and
`apps/CoreAIOCR/`.
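For readers unfamiliar with the `no_repeat_ngram=35` setting in the decode step, here is a minimal sketch of the usual no-repeat-n-gram rule combined with greedy selection: a token is banned if emitting it would recreate an n-gram that already appears in the output. This is the textbook form with no sliding window; whether the reference implementation matches it exactly is an assumption, and both function names are hypothetical.

```swift
/// Tokens that would complete an n-gram already present in `history`.
func bannedTokens(history: [Int], ngramSize n: Int) -> Set<Int> {
    guard n >= 1, history.count >= n else { return [] }
    let prefix = Array(history.suffix(n - 1))
    var banned = Set<Int>()
    // Visit every earlier n-gram; if its first n-1 tokens equal the
    // current suffix, its last token would produce a repeat.
    for start in 0...(history.count - n) {
        if Array(history[start ..< start + n - 1]) == prefix {
            banned.insert(history[start + n - 1])
        }
    }
    return banned
}

/// Greedy choice: highest-scoring token that is not banned.
func greedyPick(logits: [Float], banned: Set<Int>) -> Int? {
    var best: Int? = nil
    for (token, score) in logits.enumerated() where !banned.contains(token) {
        if best == nil || score > logits[best!] { best = token }
    }
    return best
}

// With n = 3 and history 1 2 3 1 2, emitting 3 would repeat "1 2 3".
print(bannedTokens(history: [1, 2, 3, 1, 2], ngramSize: 3))  // [3]
```

With a size as large as 35, the rule mainly stops long degenerate loops while leaving ordinary repeated phrases and table rows alone.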
## Notes
- **Appropriate input**: clean single-page documents (invoice / paper / report / table / formula),
  roughly square or portrait, with text still legible when fit to 640². Very dense small-text scans
  (newspaper) need the tiled `crop_mode` vision export (not included here; Base mode only).
- Prompt is fixed to `document parsing` (layout + structured extraction).
- License: **MIT** (inherited from `baidu/Unlimited-OCR`).
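To judge whether a page stays legible at 640², it helps to compute how much it shrinks. The sketch below assumes an aspect-preserving resize of the longer side to 640 followed by centered padding, plus the `mean=std=0.5` normalization from the pipeline; the card only says "pad to 640²", so the resize and centering details, and the names `Fit`, `fit`, and `normalize`, are assumptions.

```swift
/// Resized content size and padding inside the square canvas (illustrative).
struct Fit { let width: Int; let height: Int; let padX: Int; let padY: Int }

/// Assumed geometry: scale the longer side to `side`, center the rest.
func fit(width: Int, height: Int, side: Int = 640) -> Fit {
    let scale = Double(side) / Double(max(width, height))
    let w = max(1, Int((Double(width) * scale).rounded()))
    let h = max(1, Int((Double(height) * scale).rounded()))
    return Fit(width: w, height: h, padX: (side - w) / 2, padY: (side - h) / 2)
}

/// mean = std = 0.5 on a [0, 1] pixel value maps 0...255 to -1...1.
func normalize(_ byte: UInt8) -> Float {
    (Float(byte) / 255 - 0.5) / 0.5
}

// An A4 scan at 300 dpi is 2480 x 3508 pixels.
let a4 = fit(width: 2480, height: 3508)
print(a4.width, a4.height, a4.padX, a4.padY)  // 452 640 94 0
```

In this example the page shrinks by a factor of about 5.5, so glyphs that were 40 px tall in the scan end up around 7 px, which is a useful rule of thumb for when Base mode is enough.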
*Community port, not affiliated with Apple or Baidu.*