Surya OCR 2 CoreML Runtime

This repository contains an early CoreML runtime bundle derived from datalab-to/surya-ocr-2 at source commit 3b3d4cdf88d6928b0acdc75181b13206ea67c4a3.

It is intended for native Apple OCR experiments and future iOS/macOS demo-app work. It is not yet a drop-in app runtime: the model packages pass canary runtime gates, but host-side glue is still required.

Included packages

| File | Purpose | Precision / quantization | Shape contract |
|---|---|---|---|
| surya_vision_fp16.mlpackage | Vision tower | CoreML FP16 compute | pixel_values [1024,1536] -> image_embeds [256,1024] |
| surya_vision_int8.mlpackage | Quantized vision tower | CoreML linear INT8 weight compression | pixel_values [1024,1536] -> image_embeds [256,1024] |
| surya_prefill_fp16_seq300_cache512.mlpackage | Language prefill, logits, and initial cache | CoreML FP16 compute | fixed prefill length 300, cache length 512 |
| surya_decode_step_fp16_cache512.mlpackage | One-token cached decode step | CoreML FP16 compute | one token at a time, cache length 512 |
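
Because the vision packages accept exactly one input shape, a host can cheaply reject mis-sized tensors before calling CoreML. The sketch below encodes the two vision shapes from the table and checks a shape tuple against them. The names `VISION_CONTRACT` and `check_shape` are hypothetical; only the tensor names and dimensions come from the table.

```python
from typing import Dict, Sequence, Tuple

# Shapes copied from the package table. The dict and function names are
# illustrative and do not come from the repository.
VISION_CONTRACT: Dict[str, Tuple[int, ...]] = {
    "pixel_values": (1024, 1536),  # vision tower input
    "image_embeds": (256, 1024),   # vision tower output
}


def check_shape(name: str, shape: Sequence[int]) -> None:
    """Raise ValueError if `shape` differs from the fixed contract for `name`."""
    expected = VISION_CONTRACT[name]
    if tuple(shape) != expected:
        raise ValueError(f"{name}: expected {expected}, got {tuple(shape)}")


check_shape("pixel_values", (1024, 1536))  # matches the contract, no error
try:
    check_shape("pixel_values", (900, 1536))  # wrong first dimension
except ValueError as err:
    print(err)
```

With NumPy arrays, the same check would typically be called as `check_shape("pixel_values", array.shape)`.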

Processor/tokenizer assets are included under processor/. Validation JSONs are under validation/.

Current validation

All validation below was run on a Mac Studio with CoreML package prediction, using the canary prompt/image:

OCR this image to HTML.

The canary image contains:

Invoice 123
Total $42.00

| Gate | Result |
|---|---|
| Prefill parity before CoreML export | native/custom first token 1039; logits max diff 2.6702880859375e-05 |
| Prefill CoreML smoke | Torch/CoreML first token 1039; logits max diff 0.3057253360748291; mean diff 0.03853870555758476 |
| Decode CoreML iterative smoke | 9/9 tokens match native; text <p>Invoice |
| CoreML prefill -> CoreML decode | 9/9 tokens match native; text <p>Invoice |
| CoreML FP16 vision -> CoreML prefill -> CoreML decode | 9/9 tokens match native; text <p>Invoice |
| CoreML INT8 vision -> CoreML prefill -> CoreML decode | 9/9 tokens match native; text <p>Invoice |

On the canary, the INT8 vision package output differs from the PyTorch vision tower output by a mean absolute difference of 0.021211756393313408.
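
The logit gates report a maximum and a mean difference, and the decode gates report a token match count. A minimal sketch of how such numbers can be computed follows, assuming the differences are element-wise absolute differences (this README does not say so explicitly). The function names and sample values are made up for illustration and are not the harness's own.

```python
from typing import Sequence, Tuple


def _require_same_length(a: Sequence, b: Sequence) -> None:
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two logit vectors."""
    _require_same_length(a, b)
    return max(abs(x - y) for x, y in zip(a, b))


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Average element-wise absolute difference between two logit vectors."""
    _require_same_length(a, b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def token_match(native: Sequence[int], candidate: Sequence[int]) -> Tuple[int, int]:
    """Return (matching positions, total positions), e.g. (9, 9) for a full match."""
    _require_same_length(native, candidate)
    matched = sum(1 for n, c in zip(native, candidate) if n == c)
    return matched, len(native)


# Made-up values, chosen only so the arithmetic is easy to follow.
native_logits = [1.0, 2.0, 3.0]
coreml_logits = [1.5, 2.0, 2.0]
print(max_abs_diff(native_logits, coreml_logits))   # 1.0
print(mean_abs_diff(native_logits, coreml_logits))  # 0.5
print(token_match([1039, 7, 8], [1039, 7, 9]))      # (2, 3)
```

The two kinds of gate measure different things: FP16 logits can drift noticeably while the greedy token sequence stays identical, which is the pattern the table shows for the prefill smoke and decode gates.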

What still lives in host code

This is a runtime component bundle, not a complete Swift package yet. The current host/application layer must handle:

  • tokenization
  • initial text token embedding lookup
  • image placeholder insertion
  • rotary position embedding generation
  • generated-token embedding lookup
  • full-attention KV cache insertion
  • stopping criteria and HTML/post-processing

The included scripts/export_surya_coreml_runtime.py shows the exact Python reference glue used for the validation runs.
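
To make the last few items in the list concrete, here is a minimal sketch of a host-side greedy loop that inserts each step's cache entry into a fixed-length cache and applies simple stopping criteria. It is not taken from the reference script: `greedy_decode`, the `step` callback signature, and the toy model are all hypothetical stand-ins for the decode package, whose real input and output names are not described in this README. Embedding lookup and rotary position generation are omitted.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical step signature: (token_id, position, cache) -> (logits, kv_entry).
# A real host would call the CoreML decode package here instead.
StepFn = Callable[[int, int, List[Optional[float]]], Tuple[Sequence[float], float]]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest logit (greedy choice)."""
    return max(range(len(values)), key=lambda i: values[i])


def greedy_decode(
    step: StepFn,
    first_token: int,      # chosen from the prefill logits
    prefill_len: int,      # cache slots already filled by prefill
    eos_id: int,
    max_tokens: int,
    cache_len: int = 512,  # fixed cache length from the package names
) -> Tuple[List[int], List[Optional[float]]]:
    cache: List[Optional[float]] = [None] * cache_len
    for i in range(prefill_len):
        cache[i] = 0.0  # placeholder for the entries prefill would produce
    tokens = [first_token]
    position = prefill_len
    # Stop on token budget, end-of-sequence, or a full cache.
    while len(tokens) < max_tokens and tokens[-1] != eos_id and position < cache_len:
        logits, kv_entry = step(tokens[-1], position, cache)
        cache[position] = kv_entry  # host-side cache insertion
        position += 1
        tokens.append(argmax(logits))
    return tokens, cache


def toy_step(token_id: int, position: int, cache: List[Optional[float]]):
    """Toy model over a 5-token vocabulary: always predicts token_id + 1."""
    logits = [0.0] * 5
    logits[(token_id + 1) % 5] = 1.0
    return logits, float(token_id)


tokens, cache = greedy_decode(toy_step, first_token=1, prefill_len=300, eos_id=4, max_tokens=9)
print(tokens)  # [1, 2, 3, 4]
```

In this sketch the first token is not produced by the loop; it comes from prefill, and each later token costs one decode call and one cache slot.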

Example: run the current validation harness

pip install coremltools torch transformers pillow qwen-vl-utils huggingface_hub
python scripts/export_surya_coreml_runtime.py vision-combined-runtime-smoke \
  --model-id datalab-to/surya-ocr-2 \
  --vision-package surya_vision_int8.mlpackage \
  --prefill-package surya_prefill_fp16_seq300_cache512.mlpackage \
  --decode-package surya_decode_step_fp16_cache512.mlpackage \
  --output validation/local_vision_int8_prefill_decode_smoke.json \
  --max-cache-length 512 \
  --steps 8

Expected canary output today:

{
  "all_tokens_match": true,
  "coreml_text": "<p>Invoice ",
  "native_text": "<p>Invoice "
}

Important limitations

  • Fixed-shape canary export: the current packages are specialized to the traced 512x512 sample preprocessing path, with pixel_values [1024,1536], prefill length 300, and full-attention cache length 512.
  • The language prefill/decode packages are FP16 CoreML, not INT8/4-bit yet.
  • Only the vision tower has an INT8 package in this release.
  • The bundle has not yet passed the full allenai/olmOCR-bench.
  • This is not a complete native OCR app. It is the canary-validated CoreML model core that the app will be built around.
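
The fixed shapes in the first limitation are tied together by simple arithmetic, which the following sketch works through. The image side, prefill length, and cache length come from this README. The patch side, temporal patch size, channel count, and merge factor are assumptions chosen because they reproduce the stated shapes; they are not documented here and may not match the actual preprocessing. The sketch also assumes each decoded token occupies one cache slot after the prefill positions.

```python
from typing import Tuple

# Values stated in this README.
IMAGE_SIDE = 512
PREFILL_LEN = 300
CACHE_LEN = 512

# Assumed patchifier parameters (NOT stated in this README).
PATCH_SIDE = 16
TEMPORAL_PATCH = 2
CHANNELS = 3
MERGE_SIDE = 2


def vision_shapes(image_side: int) -> Tuple[int, int, int]:
    """Return (patch count, values per patch, merged token count) for a square image."""
    grid = image_side // PATCH_SIDE
    patches = grid * grid
    patch_dim = CHANNELS * TEMPORAL_PATCH * PATCH_SIDE * PATCH_SIDE
    merged_tokens = patches // (MERGE_SIDE * MERGE_SIDE)
    return patches, patch_dim, merged_tokens


def free_cache_slots(prefill_len: int, cache_len: int) -> int:
    """Cache positions left for decode steps once prefill has run."""
    if prefill_len > cache_len:
        raise ValueError("prefill does not fit in the cache")
    return cache_len - prefill_len


print(vision_shapes(IMAGE_SIDE))                  # (1024, 1536, 256)
print(free_cache_slots(PREFILL_LEN, CACHE_LEN))   # 212
```

Under the same assumptions, a 768x768 input would produce 2304 patches and no longer fit the [1024,1536] input, which is why the packages are described as specialized to the traced 512x512 path.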

Provenance

Generated non-destructively from datalab-to/surya-ocr-2. No fine-tuning was performed.
