# Surya CoreML Runtime Status

Generated on 2026-06-19 in `studio@100.102.185.54:~/datalab-quants-cairo`.

## Packages

- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_vision_fp16.mlpackage`
- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_vision_int8.mlpackage`
- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_prefill_fp16_seq300_cache512.mlpackage`
- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_decode_step_fp16_cache512.mlpackage`

## Passing Runtime Gates

- Prefill parity before CoreML export:
  - `prefill_parity.json`
  - native/custom first token: `1039`
  - logits max diff: `2.6702880859375e-05`
  - logits mean diff: `8.332146421707876e-07`
- Prefill CoreML smoke:
  - `prefill_coreml_smoke.json`
  - Torch/CoreML first token: `1039`
  - logits max diff: `0.3057253360748291`
  - logits mean diff: `0.03853870555758476`
- Decode CoreML iterative smoke with advancing native cache:
  - `decode_step_iterative_smoke_fixed_native_cache.json`
  - tokens match 9/9
  - text: `<p>Invoice `
- Combined language runtime smoke:
  - `combined_prefill_decode_smoke.json`
  - CoreML prefill cache -> CoreML decode step
  - tokens match 9/9
  - text: `<p>Invoice `
- Vision-inclusive runtime smoke, FP16 vision:
  - `vision_fp16_prefill_decode_smoke.json`
  - CoreML vision -> CoreML prefill -> CoreML decode
  - tokens match 9/9
  - vision mean diff vs torch: `0.019221976399421692`
- Vision-inclusive runtime smoke, INT8 vision:
  - `vision_int8_prefill_decode_smoke.json`
  - CoreML INT8 vision -> CoreML prefill -> CoreML decode
  - tokens match 9/9
  - vision mean diff vs torch: `0.021211756393313408`
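
The gate metrics above (first token, logits max/mean diff, token match counts) can be sketched in a few lines. This is a minimal illustration, not the smoke scripts themselves: the function names are hypothetical, and it assumes the reported diffs are absolute elementwise differences and that the first token is a greedy argmax over the logits.

```python
from typing import Sequence


def logit_diffs(reference: Sequence[float], candidate: Sequence[float]) -> tuple[float, float]:
    """Return (max, mean) absolute difference between two logit vectors.

    Assumption: the "max diff" and "mean diff" gates are absolute differences.
    """
    if len(reference) != len(candidate) or not reference:
        raise ValueError("logit vectors must be non-empty and the same length")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    return max(diffs), sum(diffs) / len(diffs)


def first_token(logits: Sequence[float]) -> int:
    """Greedy pick: index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


def token_match(reference: Sequence[int], candidate: Sequence[int]) -> str:
    """Format positional agreement as 'matched/total'."""
    matched = sum(r == c for r, c in zip(reference, candidate))
    return f"{matched}/{max(len(reference), len(candidate))}"
```

A gate of this shape passes when `first_token` agrees between the two runtimes and `token_match` reports every position as matched; the diff values are recorded for inspection rather than compared against a threshold stated in this report.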

## Current Host Responsibilities

- Tokenization.
- Initial text token embedding lookup.
- Image placeholder insertion.
- Rotary position embedding generation.
- Generated-token embedding lookup.
- Full-attention KV cache insertion.
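
Two of the host-side steps listed above, rotary position embedding generation and KV cache insertion, can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, `base=10000.0` is the common RoPE default rather than a value confirmed for this model, and the cache is modeled as a plain list of per-position vectors instead of the tensor layout the packages actually expect.

```python
import math

CACHE_LEN = 512  # full-attention cache length used by the current packages


def rotary_tables(position: int, head_dim: int, base: float = 10000.0):
    """Return (cos, sin) lists of length head_dim // 2 for one position.

    Assumption: standard RoPE frequencies base ** (-2i / head_dim).
    """
    if head_dim % 2:
        raise ValueError("head_dim must be even")
    half = head_dim // 2
    angles = [position * base ** (-2.0 * i / head_dim) for i in range(half)]
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]


def insert_kv(cache: list, position: int, new_entry) -> int:
    """Write one step's key or value vector into a fixed-length cache in place.

    Returns the next write position so the caller can advance the cache.
    """
    if not 0 <= position < len(cache):
        raise IndexError(f"position {position} is outside the cache of length {len(cache)}")
    cache[position] = list(new_entry)
    return position + 1
```

In a decode loop the host would call `rotary_tables` for the current position, run the decode step, then call `insert_kv` with the returned key and value before moving to the next position.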

## Export Notes

- Prefill export passes `skip_model_load=True` and `compute_units=CPU_ONLY` to `ct.convert` so that coremltools does not eagerly compile the large MLProgram through ANE before saving.
- Runtime smokes still instantiate the `.mlpackage` files with `CPU_ONLY` and run real predictions.
- The current prefill package is fixed to the canary prompt's sequence length of `300` and a full-attention cache length of `512`.
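
The export and load settings described in these notes can be sketched as below. This is a hedged outline, not the project's export script: the function names, the single `inputs_embeds` input, and the `hidden_size` parameter are hypothetical placeholders (the real prefill graph takes more inputs), and `decode_budget` assumes the 512-slot cache must hold the 300 prompt positions plus every generated token.

```python
PREFILL_SEQ_LEN = 300  # canary prompt sequence length the prefill package is fixed to
CACHE_LEN = 512        # full-attention cache length


def export_prefill(traced_model, hidden_size: int, out_path: str) -> None:
    """Convert a traced prefill module and save it without loading it."""
    import coremltools as ct  # imported lazily: heavy, macOS-oriented dependency

    mlmodel = ct.convert(
        traced_model,
        # Hypothetical single input; the real graph has additional inputs.
        inputs=[ct.TensorType(name="inputs_embeds",
                              shape=(1, PREFILL_SEQ_LEN, hidden_size))],
        convert_to="mlprogram",
        compute_units=ct.ComputeUnit.CPU_ONLY,
        skip_model_load=True,  # skip compiling and loading the converted model
    )
    mlmodel.save(out_path)


def load_for_smoke(package_path: str):
    """Instantiate a saved .mlpackage on CPU only, ready for predict()."""
    import coremltools as ct

    return ct.models.MLModel(package_path, compute_units=ct.ComputeUnit.CPU_ONLY)


def decode_budget(prompt_len: int) -> int:
    """Return how many decode steps fit in the cache after prefill.

    Assumption: prompt and generated tokens share the same cache slots.
    """
    if prompt_len != PREFILL_SEQ_LEN:
        raise ValueError(f"prefill package only accepts {PREFILL_SEQ_LEN} positions")
    return CACHE_LEN - prompt_len
```

Because the shapes are baked into the package, any prompt that is not exactly `300` positions long would need padding or a re-export with different fixed lengths.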