Surya OCR 2 CoreML Runtime
This repository contains an early CoreML runtime bundle derived from datalab-to/surya-ocr-2 at source commit 3b3d4cdf88d6928b0acdc75181b13206ea67c4a3.
It is intended for native Apple OCR experiments and future iOS/macOS demo-app work. It is not yet a drop-in app runtime: the model packages pass canary runtime gates, but host-side glue is still required.
Included packages
| File | Purpose | Precision / quantization | Shape contract |
|---|---|---|---|
| surya_vision_fp16.mlpackage | Vision tower | CoreML FP16 compute | pixel_values [1024,1536] -> image_embeds [256,1024] |
| surya_vision_int8.mlpackage | Quantized vision tower | CoreML linear INT8 weight compression | pixel_values [1024,1536] -> image_embeds [256,1024] |
| surya_prefill_fp16_seq300_cache512.mlpackage | Language prefill, logits, and initial cache | CoreML FP16 compute | fixed prefill length 300, cache length 512 |
| surya_decode_step_fp16_cache512.mlpackage | One-token cached decode step | CoreML FP16 compute | one token at a time, cache length 512 |
Processor/tokenizer assets are included under processor/. Validation JSONs are under validation/.
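Because every package is fixed-shape, host code can reject a badly shaped input before it reaches CoreML. The following is a minimal pure-Python sketch of such a check. `CONTRACTS`, `shape_of`, and `check_input` are hypothetical names that are not part of this repository, and only the vision shapes listed in the table are encoded.

```python
# Shapes copied from the package table; the dict layout itself is illustrative.
CONTRACTS = {
    "surya_vision_fp16.mlpackage": {"pixel_values": (1024, 1536), "image_embeds": (256, 1024)},
    "surya_vision_int8.mlpackage": {"pixel_values": (1024, 1536), "image_embeds": (256, 1024)},
}


def shape_of(nested):
    """Return the shape of a rectangular nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(shape)


def check_input(package, name, value):
    """Raise ValueError if `value` does not match the documented shape."""
    expected = CONTRACTS[package][name]
    actual = shape_of(value)
    if actual != expected:
        raise ValueError(f"{package}: {name} has shape {actual}, expected {expected}")
    return actual
```

In a real host the same comparison would more likely run against an array's shape attribute; nested lists are used here only to keep the sketch dependency-free.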
Current validation
All validation below was run on a Mac Studio with CoreML package prediction, using the canary prompt/image:
OCR this image to HTML.
The canary image contains:
Invoice 123
Total $42.00
| Gate | Result |
|---|---|
| Prefill parity before CoreML export | native/custom first token 1039; logits max diff 2.6702880859375e-05 |
| Prefill CoreML smoke | Torch/CoreML first token 1039; logits max diff 0.3057253360748291; mean diff 0.03853870555758476 |
| Decode CoreML iterative smoke | 9/9 tokens match native; text <p>Invoice |
| CoreML prefill -> CoreML decode | 9/9 tokens match native; text <p>Invoice |
| CoreML FP16 vision -> CoreML prefill -> CoreML decode | 9/9 tokens match native; text <p>Invoice |
| CoreML INT8 vision -> CoreML prefill -> CoreML decode | 9/9 tokens match native; text <p>Invoice |
The INT8 vision package has mean absolute diff 0.021211756393313408 vs the PyTorch vision tower on the canary.
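For readers who want to reproduce comparable numbers, the sketch below shows one common way to compute these gates over flattened outputs. It assumes the reported diffs are elementwise absolute differences and that token agreement is a position-by-position comparison; the exact definitions used by the export script are not stated here, and all function names are hypothetical.

```python
def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two flat sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def mean_abs_diff(a, b):
    """Average elementwise absolute difference between two flat sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def argmax(values):
    """Index of the largest value, i.e. the greedy token for a logits row."""
    return max(range(len(values)), key=values.__getitem__)


def count_matching_tokens(native_ids, coreml_ids):
    """Number of positions where both runs produced the same token id."""
    return sum(1 for n, c in zip(native_ids, coreml_ids) if n == c)


# Toy data, not taken from the validation runs.
native_logits = [1.0, 2.0, 4.0]
coreml_logits = [1.5, 2.0, 3.0]
same_first_token = argmax(native_logits) == argmax(coreml_logits)
```

A logits diff on its own does not show that greedy decoding agrees, which is presumably why the gates report both a first-token check and a diff.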
What still lives in host code
This is a runtime component bundle, not a complete Swift package yet. The current host/application layer must handle:
- tokenization
- initial text token embedding lookup
- image placeholder insertion
- rotary position embedding generation
- generated-token embedding lookup
- full-attention KV cache insertion
- stopping criteria and HTML/post-processing
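As an illustration of the image placeholder step, the sketch below expands a single placeholder token into one slot per image embedding row and pads the result to the fixed prefill length. The counts 256 and 300 come from the package table. The placeholder id, the pad id, the choice of right padding, and both function names are assumptions made for illustration, and the bundled processor may already perform part of this.

```python
def expand_image_placeholder(token_ids, placeholder_id, num_image_tokens=256):
    """Replace each placeholder token with `num_image_tokens` copies of itself."""
    expanded = []
    for tok in token_ids:
        if tok == placeholder_id:
            expanded.extend([placeholder_id] * num_image_tokens)
        else:
            expanded.append(tok)
    return expanded


def pad_to_prefill(token_ids, pad_id, prefill_len=300):
    """Right-pad to the fixed prefill length and return (ids, attention_mask)."""
    if len(token_ids) > prefill_len:
        raise ValueError(f"prompt has {len(token_ids)} tokens, limit is {prefill_len}")
    padding = prefill_len - len(token_ids)
    mask = [1] * len(token_ids) + [0] * padding
    return token_ids + [pad_id] * padding, mask


# Hypothetical ids: 99 stands in for the image placeholder, 0 for padding.
ids = expand_image_placeholder([1, 99, 2], placeholder_id=99)
padded, mask = pad_to_prefill(ids, pad_id=0)
```

The image embedding rows would then overwrite the text embeddings at the placeholder positions before prefill runs.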
The included scripts/export_surya_coreml_runtime.py shows the exact Python reference glue used for the validation runs.
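The control flow of the generation loop can be sketched independently of CoreML. In the example below, `step_fn` is a hypothetical callable standing in for everything one decode step needs (embedding lookup, rotary embeddings, the decode package prediction, and cache insertion). It also assumes that prefill fills cache slots `[0, prompt_len)` and that decoding is greedy, neither of which is confirmed here.

```python
def greedy_decode(first_token, step_fn, eos_id, prompt_len, cache_len=512, max_steps=8):
    """Run up to `max_steps` decode steps, stopping on EOS or a full cache."""
    tokens = [first_token]
    position = prompt_len  # next free cache slot (assumption: prefill used 0..prompt_len-1)
    for _ in range(max_steps):
        if tokens[-1] == eos_id:
            break  # stopping criterion: end-of-sequence token was produced
        if position >= cache_len:
            break  # the exported cache is fixed-length and cannot grow
        tokens.append(step_fn(tokens[-1], position))
        position += 1
    return tokens


# Toy step function for illustration only: emits the previous token id plus one.
demo = greedy_decode(1039, lambda tok, pos: tok + 1, eos_id=-1, prompt_len=300)
```

Keeping the position counter in host code makes the fixed cache limit explicit instead of leaving it to a shape error inside the model call.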
Example: run the current validation harness
pip install coremltools torch transformers pillow qwen-vl-utils huggingface_hub
python scripts/export_surya_coreml_runtime.py vision-combined-runtime-smoke \
    --model-id datalab-to/surya-ocr-2 \
    --vision-package surya_vision_int8.mlpackage \
    --prefill-package surya_prefill_fp16_seq300_cache512.mlpackage \
    --decode-package surya_decode_step_fp16_cache512.mlpackage \
    --output validation/local_vision_int8_prefill_decode_smoke.json \
    --max-cache-length 512 \
    --steps 8
Expected canary output today:
{
  "all_tokens_match": true,
  "coreml_text": "<p>Invoice ",
  "native_text": "<p>Invoice "
}
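A run can be checked programmatically rather than by eye. The sketch below reads a report and flags a mismatch. It relies only on the three keys shown in the expected output, since the full JSON may contain additional fields that are not documented here. `check_report` and `check_report_file` are hypothetical helper names.

```python
import json
from pathlib import Path


def check_report(report):
    """Return a list of human-readable problems; an empty list means the gate passed."""
    problems = []
    if report.get("all_tokens_match") is not True:
        problems.append("all_tokens_match is not true")
    if report.get("coreml_text") != report.get("native_text"):
        problems.append("coreml_text differs from native_text")
    return problems


def check_report_file(path):
    """Load a validation JSON from disk and run the same checks."""
    return check_report(json.loads(Path(path).read_text(encoding="utf-8")))
```

For the command shown earlier, this would be called as `check_report_file("validation/local_vision_int8_prefill_decode_smoke.json")`.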
Important limitations
- Fixed-shape canary export: the current packages are specialized to the traced 512x512 sample preprocessing path, with pixel_values [1024,1536], prefill length 300, and full-attention cache length 512.
- The language prefill/decode packages are FP16 CoreML, not INT8/4-bit yet.
- Only the vision tower has an INT8 package in this release.
- This has not yet passed full allenai/olmOCR-bench.
- This is not a complete native OCR app. It is the validated CoreML model core we will build the app around.
Provenance
Generated non-destructively from datalab-to/surya-ocr-2. No fine-tuning was performed.