Surya CoreML Runtime Status
Generated on 2026-06-19 in studio@100.102.185.54:~/datalab-quants-cairo.
Packages
- artifacts/coreml/surya-ocr-2-coreml-8bit/surya_vision_fp16.mlpackage
- artifacts/coreml/surya-ocr-2-coreml-8bit/surya_vision_int8.mlpackage
- artifacts/coreml/surya-ocr-2-coreml-8bit/surya_prefill_fp16_seq300_cache512.mlpackage
- artifacts/coreml/surya-ocr-2-coreml-8bit/surya_decode_step_fp16_cache512.mlpackage
Passing Runtime Gates
- Prefill parity before CoreML export: prefill_parity.json
  - native/custom first token: 1039
  - logits max diff: 2.6702880859375e-05
  - logits mean diff: 8.332146421707876e-07
- Prefill CoreML smoke: prefill_coreml_smoke.json
  - Torch/CoreML first token: 1039
  - logits max diff: 0.3057253360748291
  - logits mean diff: 0.03853870555758476
- Decode CoreML iterative smoke with advancing native cache: decode_step_iterative_smoke_fixed_native_cache.json
  - tokens match 9/9
  - text: <p>Invoice
- Combined language runtime smoke: combined_prefill_decode_smoke.json
  - CoreML prefill cache -> CoreML decode step
  - tokens match 9/9
  - text: <p>Invoice
- Vision-inclusive runtime smoke, FP16 vision: vision_fp16_prefill_decode_smoke.json
  - CoreML vision -> CoreML prefill -> CoreML decode
  - tokens match 9/9
  - vision mean diff vs torch: 0.019221976399421692
- Vision-inclusive runtime smoke, INT8 vision: vision_int8_prefill_decode_smoke.json
  - CoreML INT8 vision -> CoreML prefill -> CoreML decode
  - tokens match 9/9
  - vision mean diff vs torch: 0.021211756393313408
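
The gate metrics above (first token, logits max/mean diff, tokens match) can be illustrated with a minimal sketch. It assumes the diffs are element-wise absolute differences over a single logits vector and that the first token is the argmax of that vector; the function names `logits_parity` and `token_match` are hypothetical and are not taken from the smoke scripts.

```python
from typing import Dict, Sequence, Union


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first index wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def logits_parity(
    reference: Sequence[float], candidate: Sequence[float]
) -> Dict[str, Union[int, float]]:
    """Compare two logits vectors of equal length.

    Assumption: "max diff" and "mean diff" are absolute differences.
    """
    if len(reference) != len(candidate) or not reference:
        raise ValueError("logits vectors must be non-empty and equal in length")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    return {
        "reference_first_token": argmax(reference),
        "candidate_first_token": argmax(candidate),
        "logits_max_diff": max(diffs),
        "logits_mean_diff": sum(diffs) / len(diffs),
    }


def token_match(reference_ids: Sequence[int], candidate_ids: Sequence[int]) -> str:
    """Report matching positions as 'matched/total', e.g. '9/9'."""
    matched = sum(1 for r, c in zip(reference_ids, candidate_ids) if r == c)
    return f"{matched}/{len(reference_ids)}"
```

Under this reading, a gate passes when both first tokens agree and the generated token sequences match position by position, even if the logits diffs are well above float rounding noise, as in the prefill CoreML smoke.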
Current Host Responsibilities
- Tokenization.
- Initial text token embedding lookup.
- Image placeholder insertion.
- Rotary position embedding generation.
- Generated-token embedding lookup.
- Full-attention KV cache insertion.
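
Two of the host-side responsibilities above, rotary position embedding generation and KV cache insertion, can be sketched as follows. This is an illustration under stated assumptions, not the project's host code: it uses the standard RoPE frequency schedule with base 10000, a single head, and a plain list-backed cache. `rotary_tables`, `FixedKVCache`, and the `head_dim` values are hypothetical; the model's real dimensions and cache layout are not given in this report.

```python
import math
from typing import List, Sequence, Tuple


def rotary_tables(
    positions: Sequence[int], head_dim: int, base: float = 10000.0
) -> Tuple[List[List[float]], List[List[float]]]:
    """Build cos/sin tables of shape [len(positions)][head_dim // 2].

    Assumption: standard RoPE frequencies, inv_freq[i] = base ** (-2i / head_dim).
    """
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in positions]
    sin = [[math.sin(p * f) for f in inv_freq] for p in positions]
    return cos, sin


class FixedKVCache:
    """Fixed-length key/value cache that the host fills one slot per step."""

    def __init__(self, cache_len: int, head_dim: int) -> None:
        self.keys = [[0.0] * head_dim for _ in range(cache_len)]
        self.values = [[0.0] * head_dim for _ in range(cache_len)]
        self.length = 0  # number of slots already written

    def insert(self, key: Sequence[float], value: Sequence[float]) -> int:
        """Write one step's key/value at the next free slot; return its index."""
        if self.length >= len(self.keys):
            raise IndexError("KV cache is full")
        slot = self.length
        self.keys[slot] = list(key)
        self.values[slot] = list(value)
        self.length += 1
        return slot
```

In a decode loop of this shape, the host would generate the rotary tables for the next position, run one decode step, and insert the returned key/value pair before the following step.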
Export Notes
- Prefill export uses skip_model_load=True and compute_units=CPU_ONLY during ct.convert to avoid CoreMLTools eagerly compiling the large MLProgram through ANE before saving.
- Runtime smokes still instantiate .mlpackage files with CPU_ONLY and run real predictions.
- The current prefill package is fixed to the canary prompt sequence length 300 and full-attention cache length 512.
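
The export settings described in these notes might look roughly like the sketch below. It assumes a traced PyTorch module and a caller-supplied list of `ct.TensorType` input specs; the helper names (`prefill_package_name`, `export_prefill`, `load_for_smoke`) are hypothetical, and the actual input names and shapes of the prefill model are not stated in this report, so they are left to the caller. coremltools is imported lazily so the naming helper can be used without it installed.

```python
import os

# Fixed shapes taken from the notes above.
PREFILL_SEQ_LEN = 300
CACHE_LEN = 512


def prefill_package_name(seq_len: int = PREFILL_SEQ_LEN, cache_len: int = CACHE_LEN) -> str:
    """Package file name following the pattern used in the Packages list."""
    return f"surya_prefill_fp16_seq{seq_len}_cache{cache_len}.mlpackage"


def export_prefill(traced_model, inputs, out_dir: str) -> str:
    """Convert a traced prefill module and save it without loading it.

    `inputs` is a list of ct.TensorType specs supplied by the caller
    (assumption: the real input names/shapes live in the export script).
    """
    import coremltools as ct  # lazy import; only needed for the conversion

    mlmodel = ct.convert(
        traced_model,
        inputs=inputs,
        convert_to="mlprogram",
        skip_model_load=True,  # do not compile/load the model after conversion
        compute_units=ct.ComputeUnit.CPU_ONLY,
    )
    path = os.path.join(out_dir, prefill_package_name())
    mlmodel.save(path)
    return path


def load_for_smoke(path: str):
    """Instantiate a saved .mlpackage on CPU only, as the runtime smokes do."""
    import coremltools as ct

    return ct.models.MLModel(path, compute_units=ct.ComputeUnit.CPU_ONLY)
```

Because the sequence and cache lengths are fixed into the package, a prompt of any other length would need a separate export or padding on the host side; this report does not say which approach is planned.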