Surya OCR 2 CoreML Runtime

This repository contains an early CoreML runtime bundle derived from datalab-to/surya-ocr-2 at source commit 3b3d4cdf88d6928b0acdc75181b13206ea67c4a3.

It is intended for native Apple OCR experiments and future iOS/macOS demo-app work. It is not yet a drop-in app runtime: the model packages pass canary runtime gates, but host-side glue is still required.

Included packages

File	Purpose	Precision / quantization	Shape contract
`surya_vision_fp16.mlpackage`	Vision tower	CoreML FP16 compute	`pixel_values [1024,1536] -> image_embeds [256,1024]`
`surya_vision_int8.mlpackage`	Quantized vision tower	CoreML linear INT8 weight compression	`pixel_values [1024,1536] -> image_embeds [256,1024]`
`surya_prefill_fp16_seq300_cache512.mlpackage`	Language prefill, logits, and initial cache	CoreML FP16 compute	fixed prefill length `300`, cache length `512`
`surya_decode_step_fp16_cache512.mlpackage`	One-token cached decode step	CoreML FP16 compute	one token at a time, cache length `512`

Processor/tokenizer assets are included under processor/. Validation JSONs are under validation/.

Current validation

All validation below was run on a Mac Studio with CoreML package prediction, using the canary prompt/image:

OCR this image to HTML.

The canary image contains:

Invoice 123
Total $42.00

Gate	Result
Prefill parity before CoreML export	native/custom first token `1039`; logits max diff `2.6702880859375e-05`
Prefill CoreML smoke	Torch/CoreML first token `1039`; logits max diff `0.3057253360748291`; mean diff `0.03853870555758476`
Decode CoreML iterative smoke	9/9 tokens match native; text `<p>Invoice`
CoreML prefill -> CoreML decode	9/9 tokens match native; text `<p>Invoice`
CoreML FP16 vision -> CoreML prefill -> CoreML decode	9/9 tokens match native; text `<p>Invoice`
CoreML INT8 vision -> CoreML prefill -> CoreML decode	9/9 tokens match native; text `<p>Invoice`

The INT8 vision package has mean absolute diff 0.021211756393313408 vs the PyTorch vision tower on the canary.

What still lives in host code

This is a runtime component bundle, not a complete Swift package yet. The current host/application layer must handle:

tokenization
initial text token embedding lookup
image placeholder insertion
rotary position embedding generation
generated-token embedding lookup
full-attention KV cache insertion
stopping criteria and HTML/post-processing

The included scripts/export_surya_coreml_runtime.py shows the exact Python reference glue used for the validation runs.

Example: run the current validation harness

pip install coremltools torch transformers pillow qwen-vl-utils huggingface_hub
python scripts/export_surya_coreml_runtime.py vision-combined-runtime-smoke \
  --model-id datalab-to/surya-ocr-2 \
  --vision-package surya_vision_int8.mlpackage \
  --prefill-package surya_prefill_fp16_seq300_cache512.mlpackage \
  --decode-package surya_decode_step_fp16_cache512.mlpackage \
  --output validation/local_vision_int8_prefill_decode_smoke.json \
  --max-cache-length 512 \
  --steps 8

Expected canary output today:

{
  "all_tokens_match": true,
  "coreml_text": "<p>Invoice ",
  "native_text": "<p>Invoice "
}

Important limitations

Fixed-shape canary export: the current packages are specialized to the traced 512x512 sample preprocessing path, with pixel_values [1024,1536], prefill length 300, and full-attention cache length 512.
The language prefill/decode packages are FP16 CoreML, not INT8/4-bit yet.
Only the vision tower has an INT8 package in this release.
This has not yet passed full allenai/olmOCR-bench.
This is not a complete native OCR app. It is the validated CoreML model core we will build the app around.

Provenance

Generated non-destructively from datalab-to/surya-ocr-2. No fine-tuning was performed.

Downloads last month: -

Inference Providers NEW

Image-Text-to-Text

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Reza2kn/surya-ocr-2-coreml-runtime

Base model

datalab-to/surya-ocr-2

Quantized

(5)

this model