	---
	license: mit
	language:
	- en
	- ja
	- multilingual
	tags:
	- core-ai
	- coreai
	- on-device
	- ocr
	- document-ai
	- vision-language
	- apple
	pipeline_tag: image-to-text
	base_model: baidu/Unlimited-OCR
	library_name: coreai
	---

	# Unlimited-OCR → Core AI (on-device document OCR)

	On-device document → structured-markdown OCR, end-to-end on Apple Core AI. A port of
	[`baidu/Unlimited-OCR`](https://huggingface.co/baidu/Unlimited-OCR) (3B-A0.5B MoE, MIT): drop a
	document image, get back markdown — tables as HTML (`<table><tr><td>…`), formulas as LaTeX,
	reading order, and `<|det|>` layout boxes. Japanese + English + multilingual.

	Runs on the stock `coreai.runtime` with no engine patch — the decoder is driven directly
	on `inputs_embeds`, so this is a pure-export port (not the static-input-buffer VLM path).

	<!-- gen-cards:use-it begin id=unlimited-ocr (managed by scripts/gen-cards — edit cards.json / QuickStart.swift, not this block) -->
	## Use it

	▶️ Run it (source) — the [ReadDoc runner](https://github.com/john-rocky/coreai-kit/tree/main/Examples/ReadDoc)
	(GUI + CLI, one app for every document-OCR model in the catalog):

	```bash
	git clone https://github.com/john-rocky/coreai-kit
	open coreai-kit/Examples/ReadDoc/ReadDoc.xcodeproj
	# → Run, then pick "Unlimited-OCR" in the model picker

	# agents / headless (macOS):
	cd coreai-kit/Examples/ReadDoc
	swift run readdoc-cli --model unlimited-ocr --image sample.png
	```

	💻 Build with it. The snippet is complete: the glue is kit API, so it runs as pasted:

	```swift
	import CoreAIKit
	import Foundation

	// Any local page image; sample.png is a placeholder path.
	let imageURL = URL(fileURLWithPath: "sample.png")

	let reader = try await KitDocReader(catalog: "unlimited-ocr")
	let markdown = try await reader.read(imageAt: imageURL)
	// markdown: the document as structured text (tables as <table>/<tr>/<td>,
	// <|det|> layout boxes, reading order), produced fully on-device
	```

	The take-home is [`Examples/ReadDoc/Sources/QuickStart.swift`](https://github.com/john-rocky/coreai-kit/blob/main/Examples/ReadDoc/Sources/QuickStart.swift)
	— this exact code as one typed function, no UI; the CLI is an argument shell over it, and
	the GUI drives the same `KitDocReader(catalog:)` on the image you pick.
	One `read(imageAt:)` call per page; chunk a PDF into page images first. The output keeps
	the model's structural markup (tables as HTML, formulas as LaTeX, `<|det|>` boxes);
	strip or render it as your app prefers.
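
	For the PDF case, one way to produce page images on macOS is PDFKit. This is a minimal
	sketch, not part of the kit: `pageImages(from:into:longSide:)`, the `page-N.png` naming, and
	the 1280-pixel long side are illustrative assumptions; only `read(imageAt:)` comes from the
	snippet above.

	```swift
	import AppKit
	import PDFKit

	enum PDFChunkError: Error { case cannotOpen(URL) }

	/// Renders each page of a PDF to a PNG file and returns the file URLs in page order.
	/// Hypothetical helper; `longSide` is an arbitrary render size, not a kit requirement.
	func pageImages(from pdfURL: URL, into directory: URL, longSide: CGFloat = 1280) throws -> [URL] {
	    guard let document = PDFDocument(url: pdfURL) else { throw PDFChunkError.cannotOpen(pdfURL) }
	    var urls: [URL] = []
	    for index in 0..<document.pageCount {
	        guard let page = document.page(at: index) else { continue }
	        let bounds = page.bounds(for: .mediaBox)
	        let longest = max(bounds.width, bounds.height)
	        guard longest > 0 else { continue }              // skip degenerate pages
	        let scale = longSide / longest
	        let size = CGSize(width: bounds.width * scale, height: bounds.height * scale)
	        let image = page.thumbnail(of: size, for: .mediaBox)   // NSImage on macOS
	        guard let tiff = image.tiffRepresentation,
	              let bitmap = NSBitmapImageRep(data: tiff),
	              let png = bitmap.representation(using: .png, properties: [:]) else { continue }
	        let url = directory.appendingPathComponent("page-\(index + 1).png")
	        try png.write(to: url)
	        urls.append(url)
	    }
	    return urls
	}
	```

	Each returned URL can then go through `reader.read(imageAt:)` in order, and the per-page
	markdown can be joined however your app prefers.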

	Integration checklist

	- SPM: `https://github.com/john-rocky/coreai-kit` → product CoreAIKit
	- Info.plist: none needed
	- Entitlements: none needed
	- First run downloads the model — 4.5 GB (Mac) — then it loads from the
	local cache (Application Support; progress via the `downloadProgress` callback)
	- Measure in Release — Debug is ~3× slower on per-token host work
	<!-- gen-cards:use-it end -->
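
	For a package-based project, the SPM item in the checklist might look like the manifest
	below. A sketch under assumptions: the `MyOCRTool` names, the tools version, and tracking the
	`main` branch are placeholders, and any platform minimum the kit requires is not shown.

	```swift
	// swift-tools-version:5.9
	import PackageDescription

	let package = Package(
	    name: "MyOCRTool",                        // placeholder package name
	    dependencies: [
	        // Repository URL from the checklist; pin a tag or revision if the kit publishes one.
	        .package(url: "https://github.com/john-rocky/coreai-kit", branch: "main")
	    ],
	    targets: [
	        .executableTarget(
	            name: "MyOCRTool",                // placeholder target name
	            dependencies: [
	                // Product name from the checklist.
	                .product(name: "CoreAIKit", package: "coreai-kit")
	            ]
	        )
	    ]
	)
	```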

	## What's exciting (why you'd use it)

	- Private OCR: invoices, receipts, contracts, papers, forms never leave the device.
	- Structured, not just text: tables → HTML, equations → LaTeX, layout → boxes. RAG-ready ingestion.
	- Flat latency: a static-shape decode graph (data-driven KV write + fixed-buffer R-SWA mask)
	keeps every tensor shape constant, so the runtime compiles once and decode stays **flat at
	~12.7 ms/token (~79 tok/s on M4 Max)** — no growing-cache recompilation stalls.
	- SOTA quality: the source model tops OmniDocBench v1.6 (93.92); this port reproduces the
	fp32 reference output (0 decoder token flips at the sampled steps; vision-encoder cosine similarity 1.000000).
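
	To illustrate why constant shapes avoid recompilation, here is a generic fixed-buffer
	sliding-window cache. It is not the port's actual graph: `FixedWindowCache`, the ring-slot
	policy, and the window and dimension values are assumptions for illustration, since this
	card does not spell out the R-SWA layout.

	```swift
	/// Generic sketch: a key cache whose buffers never change size.
	struct FixedWindowCache {
	    let window: Int                    // number of slots (hypothetical value)
	    let dim: Int                       // per-token key width (hypothetical value)
	    private(set) var keys: [Float]     // window * dim elements, allocated once
	    private(set) var mask: [Float]     // additive mask: 0 = attend, -inf = ignore

	    init(window: Int, dim: Int) {
	        self.window = window
	        self.dim = dim
	        keys = [Float](repeating: 0, count: window * dim)
	        mask = [Float](repeating: -.infinity, count: window)
	    }

	    /// The write location is computed from the position value, so the buffer is
	    /// overwritten in place instead of growing by concatenation.
	    mutating func write(key: [Float], at position: Int) {
	        precondition(key.count == dim, "key width mismatch")
	        let slot = position % window
	        keys.replaceSubrange(slot * dim ..< (slot + 1) * dim, with: key)
	        mask[slot] = 0                 // slot now holds a real token
	    }
	}
	```

	Because `keys` and `mask` keep the same length at every step, a graph compiled for those
	shapes can be reused for the whole decode. At the quoted 12.7 ms/token, 1000 / 12.7 gives
	roughly 79 tokens per second.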

	## Bundles

	| path | what | dtype | size |
	|---|---|---|---|
	| `vision/unlimited_ocr_vision.aimodel` | DeepEncoder (SAM-ViT + CLIP-ViT cascade) → 100 visual tokens | fp16 | 762 MB |
	| `decoder/unlimited_ocr_decoder.aimodel` | DeepseekV2 R-SWA MoE decoder, functions `prefill` + `decode` sharing one weight set + KV state | sym8 | 3.2 GB |
	| `assets/embed_tokens.f16` | token embedding table `[129280,1280]` (host row-gather) | fp16 | 316 MB |
	| `assets/{image_newline,view_seperator}.f16`, `assets/prompt_input_ids.i32`, `assets/recipe.json` | arrangement constants + the assembly recipe | n/a | tiny |
	| `tokenizer/` | fast tokenizer (`tokenizer.json` + configs) | n/a | n/a |
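
	The host row-gather over `assets/embed_tokens.f16` can be sketched as below. The layout is
	an assumption: raw little-endian fp16, row-major, no header. That is consistent with the
	listed size (129280 × 1280 × 2 bytes = 330,956,800 bytes, about 316 MiB) but should be
	checked against `assets/recipe.json`. `EmbeddingTable` is a hypothetical name.

	```swift
	import Foundation

	/// Hypothetical reader for a raw fp16 table stored row-major with no header.
	struct EmbeddingTable {
	    let data: Data
	    let vocab: Int
	    let hidden: Int

	    init(data: Data, vocab: Int = 129_280, hidden: Int = 1_280) {
	        precondition(data.count == vocab * hidden * 2, "size does not match [vocab, hidden] fp16")
	        self.data = data
	        self.vocab = vocab
	        self.hidden = hidden
	    }

	    /// Returns one row as raw fp16 bit patterns (convert to Float16 where available).
	    /// Uses loadUnaligned, which needs Swift 5.7 or later.
	    func rowBits(_ tokenID: Int) -> [UInt16] {
	        precondition((0..<vocab).contains(tokenID), "token id out of range")
	        let start = data.startIndex + tokenID * hidden * 2
	        return data[start ..< start + hidden * 2].withUnsafeBytes { raw in
	            (0..<hidden).map {
	                UInt16(littleEndian: raw.loadUnaligned(fromByteOffset: $0 * 2, as: UInt16.self))
	            }
	        }
	    }
	}
	```

	Loading the file with `Data(contentsOf:options:)` and `.mappedIfSafe` lets the system map
	it instead of copying it all into memory, so gathering a few rows touches only those pages.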

	## Pipeline (Base mode, 640px)

	```
	image → preprocess (pad to 640², normalize mean=std=0.5)
	→ vision .aimodel → visual tokens [1,100,1280]
	→ arrange (10×10 + image_newline per row + view_seperator) → [111,1280]
	→ scatter into embed_tokens(prompt_ids) → prefix [1,115,1280]
	→ decoder: prefill(prefix) + greedy decode (no_repeat_ngram=35) → tokens
	→ detokenize (keep special tokens) → markdown
	```
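
	The arrange and scatter steps above can be sketched in plain Swift. The function names,
	the placeholder token id, and placing the separator last are assumptions for illustration;
	the verified ordering lives in `assets/recipe.json`. The shapes follow the diagram:
	10 rows × (10 tokens + 1 newline) + 1 separator = 111 rows, and a 115-row prefix leaves
	4 non-image positions if all 111 rows are scattered.

	```swift
	/// Interleaves a square grid of visual tokens with a newline vector after each row,
	/// then appends one separator vector (separator position is an assumption).
	func arrange(visual: [[Float]], gridSide: Int,
	             newline: [Float], separator: [Float]) -> [[Float]] {
	    precondition(visual.count == gridSide * gridSide, "expected a full grid")
	    var out: [[Float]] = []
	    out.reserveCapacity(gridSide * (gridSide + 1) + 1)
	    for row in 0..<gridSide {
	        out.append(contentsOf: visual[row * gridSide ..< (row + 1) * gridSide])
	        out.append(newline)
	    }
	    out.append(separator)
	    return out
	}

	/// Overwrites the prompt embeddings at placeholder positions with the arranged rows,
	/// in order. `placeholderID` is a hypothetical marker for image slots.
	func scatter(arranged: [[Float]], into promptEmbeds: [[Float]],
	             promptIDs: [Int32], placeholderID: Int32) -> [[Float]] {
	    precondition(promptEmbeds.count == promptIDs.count, "one embedding per prompt id")
	    precondition(promptIDs.filter { $0 == placeholderID }.count == arranged.count,
	                 "placeholder count must match arranged rows")
	    var prefix = promptEmbeds
	    var next = 0
	    for (index, id) in promptIDs.enumerated() where id == placeholderID {
	        prefix[index] = arranged[next]
	        next += 1
	    }
	    return prefix
	}
	```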

	The exact, verified recipe is in `assets/recipe.json`. Reference implementations (Python end-to-end
	+ a macOS app, CoreAIOCR, driving the stock runtime) are in the
	[Core AI Model Zoo](https://github.com/john-rocky/coreai-model-zoo): `conversion/unlimited_ocr/` and
	`apps/CoreAIOCR/`.
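
	The greedy step with `no_repeat_ngram=35` can be illustrated with the common no-repeat
	n-gram rule: a token is banned if choosing it would complete an n-gram that already occurs
	in the generated history. This is a sketch of that general rule, not the reference
	implementation; `bannedTokens` and `greedyStep` are hypothetical names, and any window or
	whitelist the real recipe applies is not modeled.

	```swift
	/// Tokens that would repeat an n-gram already present in `history`.
	func bannedTokens(history: [Int], ngram n: Int) -> Set<Int> {
	    guard n >= 1, history.count >= n else { return [] }
	    let prefix = history.suffix(n - 1)            // the last n-1 generated tokens
	    var banned = Set<Int>()
	    // i is the start of an earlier n-gram; its final token sits at i + n - 1.
	    for i in 0...(history.count - n) {
	        if history[i ..< i + n - 1].elementsEqual(prefix) {
	            banned.insert(history[i + n - 1])
	        }
	    }
	    return banned
	}

	/// Picks the highest-logit token that is not banned.
	func greedyStep(logits: [Float], history: [Int], ngram: Int = 35) -> Int {
	    let banned = bannedTokens(history: history, ngram: ngram)
	    var best = -1
	    var bestLogit = -Float.infinity
	    for (token, logit) in logits.enumerated() where !banned.contains(token) {
	        if best == -1 || logit > bestLogit {
	            best = token
	            bestLogit = logit
	        }
	    }
	    precondition(best >= 0, "every token is banned")
	    return best
	}
	```

	With a size of 35, the ban only triggers once the last 34 tokens have appeared before in
	the same order, so it targets long repetition loops rather than ordinary repeated words.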

	## Notes

	- Appropriate input: clean single-page documents (invoice / paper / report / table / formula),
	roughly square or portrait, with text still legible when fit to 640². Very dense small-text scans
	(such as newspapers) need the tiled `crop_mode` vision export, which is not included here; this bundle is Base mode only.
	- Prompt is fixed to `document parsing` (layout + structured extraction).
	- License: MIT (inherited from `baidu/Unlimited-OCR`).
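
	To judge whether a page stays legible at 640², it helps to compute the fit geometry. The
	sketch below assumes an aspect-preserving resize followed by centered padding; the card
	states only "pad to 640²" and mean = std = 0.5, so the centering and the `letterbox` and
	`normalize` names are assumptions.

	```swift
	struct Letterbox {
	    let scale: Double      // resize factor applied to both axes
	    let width: Int         // resized content width in pixels
	    let height: Int        // resized content height in pixels
	    let padX: Int          // left padding (assumes centered content)
	    let padY: Int          // top padding (assumes centered content)
	}

	/// Fits an image inside a target × target square while keeping its aspect ratio.
	func letterbox(width: Int, height: Int, target: Int = 640) -> Letterbox {
	    precondition(width > 0 && height > 0, "image must be non-empty")
	    let scale = Double(target) / Double(max(width, height))
	    let w = max(1, Int((Double(width) * scale).rounded()))
	    let h = max(1, Int((Double(height) * scale).rounded()))
	    return Letterbox(scale: scale, width: w, height: h,
	                     padX: (target - w) / 2, padY: (target - h) / 2)
	}

	/// Maps an 8-bit channel value to [-1, 1] using mean = std = 0.5.
	func normalize(_ byte: UInt8) -> Float {
	    (Float(byte) / 255 - 0.5) / 0.5
	}
	```

	For example, an A4 scan at 300 dpi (2480 × 3508 px) gets a scale of about 0.18, so a text
	line 42 px tall in the scan shrinks to roughly 7.7 px after the fit.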

	Community port, not affiliated with Apple or Baidu.