Unlimited-OCR β†’ Core AI (on-device document OCR)

On-device document β†’ structured-markdown OCR, end-to-end on Apple Core AI. A port of baidu/Unlimited-OCR (3B-A0.5B MoE, MIT): drop a document image, get back markdown β€” tables as HTML (<table><tr><td>…), formulas as LaTeX, reading order, and <|det|> layout boxes. Japanese + English + multilingual.

Runs on the stock coreai.runtime with no engine patch β€” the decoder is driven directly on inputs_embeds, so this is a pure-export port (not the static-input-buffer VLM path).

What's exciting (why you'd use it)

  • Private OCR: invoices, receipts, contracts, papers, forms never leave the device.
  • Structured, not just text: tables β†’ HTML, equations β†’ LaTeX, layout β†’ boxes. RAG-ready ingestion.
  • Flat latency: a static-shape decode graph (data-driven KV write + fixed-buffer R-SWA mask) keeps every tensor shape constant, so the runtime compiles once and decode stays flat at 12.7 ms/token (79 tok/s on M4 Max) β€” no growing-cache recompilation stalls.
  • SOTA quality: the source model tops OmniDocBench v1.6 (93.92); this port is byte-faithful to the fp32 reference (decoder 0 flips at the sampled steps; vision encoder cos 1.000000).

Bundles

path what dtype size
vision/unlimited_ocr_vision.aimodel DeepEncoder (SAM-ViT + CLIP-ViT cascade) β†’ 100 visual tokens fp16 762 MB
decoder/unlimited_ocr_decoder.aimodel DeepseekV2 R-SWA MoE decoder, functions prefill + decode sharing one weight set + KV state sym8 3.2 GB
assets/embed_tokens.f16 token embedding table [129280,1280] (host row-gather) fp16 316 MB
assets/{image_newline,view_seperator}.f16, assets/prompt_input_ids.i32, assets/recipe.json arrangement constants + the assembly recipe β€” tiny
tokenizer/ fast tokenizer (tokenizer.json + configs) β€” β€”

Pipeline (Base mode, 640px)

image β†’ preprocess (pad to 640Β², normalize mean=std=0.5)
      β†’ vision .aimodel                         β†’ visual tokens [1,100,1280]
      β†’ arrange (10Γ—10 + image_newline per row + view_seperator) β†’ [111,1280]
      β†’ scatter into embed_tokens(prompt_ids)   β†’ prefix [1,115,1280]
      β†’ decoder: prefill(prefix) + greedy decode (no_repeat_ngram=35) β†’ tokens
      β†’ detokenize (keep special tokens)        β†’ markdown

The exact, verified recipe is in assets/recipe.json. Reference implementations (Python end-to-end

  • a macOS app, CoreAIOCR, driving the stock runtime) are in the Core AI Model Zoo: conversion/unlimited_ocr/ and apps/CoreAIOCR/.

Notes

  • Appropriate input: clean single-page documents (invoice / paper / report / table / formula), roughly square or portrait, with text still legible when fit to 640Β². Very dense small-text scans (newspaper) want the tiled crop_mode vision export (not included here; Base mode only).
  • Prompt is fixed to document parsing (layout + structured extraction).
  • License: MIT (inherited from baidu/Unlimited-OCR).

Community port β€” not affiliated with Apple or baidu.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ 1 Ask for provider support

Model tree for mlboydaisuke/Unlimited-OCR-CoreAI

Finetuned
(3)
this model