# Surya CoreML Runtime Status

Generated on 2026-06-19 in `studio@100.102.185.54:~/datalab-quants-cairo`.

## Packages

- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_vision_fp16.mlpackage`
- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_vision_int8.mlpackage`
- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_prefill_fp16_seq300_cache512.mlpackage`
- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_decode_step_fp16_cache512.mlpackage`

## Passing Runtime Gates

- Prefill parity before CoreML export:
  - `prefill_parity.json`
  - native/custom first token: `1039`
  - logits max diff: `2.6702880859375e-05`
  - logits mean diff: `8.332146421707876e-07`
- Prefill CoreML smoke:
  - `prefill_coreml_smoke.json`
  - Torch/CoreML first token: `1039`
  - logits max diff: `0.3057253360748291`
  - logits mean diff: `0.03853870555758476`
- Decode CoreML iterative smoke with advancing native cache:
  - `decode_step_iterative_smoke_fixed_native_cache.json`
  - tokens match 9/9
  - text: `"\nInvoice "`
- Combined language runtime smoke:
  - `combined_prefill_decode_smoke.json`
  - CoreML prefill cache -> CoreML decode step
  - tokens match 9/9
  - text: `"\nInvoice "`
- Vision-inclusive runtime smoke, FP16 vision:
  - `vision_fp16_prefill_decode_smoke.json`
  - CoreML vision -> CoreML prefill -> CoreML decode
  - tokens match 9/9
  - vision mean diff vs torch: `0.019221976399421692`
- Vision-inclusive runtime smoke, INT8 vision:
  - `vision_int8_prefill_decode_smoke.json`
  - CoreML INT8 vision -> CoreML prefill -> CoreML decode
  - tokens match 9/9
  - vision mean diff vs torch: `0.021211756393313408`

## Current Host Responsibilities

- Tokenization.
- Initial text token embedding lookup.
- Image placeholder insertion.
- Rotary position embedding generation.
- Generated-token embedding lookup.
- Full-attention KV cache insertion.

## Export Notes

- Prefill export uses `skip_model_load=True` and `compute_units=CPU_ONLY` during `ct.convert` to avoid CoreMLTools eagerly compiling the large MLProgram through ANE before saving.
- Runtime smokes still instantiate `.mlpackage` files with `CPU_ONLY` and run real predictions.
- The current prefill package is fixed to the canary prompt sequence length `300` and full-attention cache length `512`.
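
## Parity Metric Sketch

The first-token, max-diff, and mean-diff figures reported under Passing Runtime Gates can be reproduced in spirit with a small helper like the one below. This is a minimal sketch, not the script that produced the JSON files: it assumes the diffs are element-wise absolute differences over one logits vector, and the function name `parity_report` and its dictionary keys are hypothetical rather than the schema of `prefill_parity.json` or `prefill_coreml_smoke.json`.

```python
from typing import Sequence


def argmax(values: Sequence[float]) -> int:
    """Return the index of the largest value (the first index wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def parity_report(reference: Sequence[float], candidate: Sequence[float]) -> dict:
    """Compare two logits vectors of equal length.

    Assumption: "max diff" and "mean diff" mean the maximum and the mean of
    the element-wise absolute differences. The key names are illustrative.
    """
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("logits vectors must be non-empty and of equal length")

    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    ref_token = argmax(reference)
    cand_token = argmax(candidate)
    return {
        "reference_first_token": ref_token,
        "candidate_first_token": cand_token,
        "first_token_match": ref_token == cand_token,
        "logits_max_diff": max(diffs),
        "logits_mean_diff": sum(diffs) / len(diffs),
    }


# Toy vectors standing in for reference (e.g. Torch) and CoreML logits.
reference_logits = [0.0, 2.0, 1.0, -1.0]
candidate_logits = [0.5, 2.0, 0.5, -1.0]
report = parity_report(reference_logits, candidate_logits)
print(report)
```

A gate built on such a report would likely treat the first-token match as the hard requirement and the diff values as diagnostics, since the FP16 CoreML smoke above passes with a much larger max diff than the pre-export parity check.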