# Surya CoreML Runtime Status

Generated on 2026-06-19 in `studio@100.102.185.54:~/datalab-quants-cairo`.

## Packages

- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_vision_fp16.mlpackage`
- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_vision_int8.mlpackage`
- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_prefill_fp16_seq300_cache512.mlpackage`
- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_decode_step_fp16_cache512.mlpackage`

## Passing Runtime Gates

- Prefill parity before CoreML export:
  - `prefill_parity.json`
  - native/custom first token: `1039`
  - logits max diff: `2.6702880859375e-05`
  - logits mean diff: `8.332146421707876e-07`
- Prefill CoreML smoke:
  - `prefill_coreml_smoke.json`
  - Torch/CoreML first token: `1039`
  - logits max diff: `0.3057253360748291`
  - logits mean diff: `0.03853870555758476`
- Decode CoreML iterative smoke with advancing native cache:
  - `decode_step_iterative_smoke_fixed_native_cache.json`
  - tokens match 9/9
  - text: `<p>Invoice `
- Combined language runtime smoke:
  - `combined_prefill_decode_smoke.json`
  - CoreML prefill cache -> CoreML decode step
  - tokens match 9/9
  - text: `<p>Invoice `
- Vision-inclusive runtime smoke, FP16 vision:
  - `vision_fp16_prefill_decode_smoke.json`
  - CoreML vision -> CoreML prefill -> CoreML decode
  - tokens match 9/9
  - vision mean diff vs torch: `0.019221976399421692`
- Vision-inclusive runtime smoke, INT8 vision:
  - `vision_int8_prefill_decode_smoke.json`
  - CoreML INT8 vision -> CoreML prefill -> CoreML decode
  - tokens match 9/9
  - vision mean diff vs torch: `0.021211756393313408`
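
The gate metrics above (logits max/mean diff, first-token agreement, `tokens match 9/9`) could be computed along the following lines. This is a minimal sketch, assuming the diffs are element-wise absolute differences over flattened logits. The helper names `logits_diff`, `argmax`, and `token_match` are hypothetical and are not taken from the smoke scripts.

```python
def logits_diff(reference, candidate):
    """Max and mean absolute difference between two flat logit lists (assumed metric)."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("logit lists must be non-empty and the same length")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    return {"max_diff": max(diffs), "mean_diff": sum(diffs) / len(diffs)}


def argmax(values):
    """Index of the largest logit, i.e. the greedy token id."""
    return max(range(len(values)), key=values.__getitem__)


def token_match(reference_tokens, candidate_tokens):
    """Return (matched, total) so a gate can report e.g. 9/9."""
    total = max(len(reference_tokens), len(candidate_tokens))
    matched = sum(1 for r, c in zip(reference_tokens, candidate_tokens) if r == c)
    return matched, total
```

A gate would then pass when `argmax` agrees on both sides and `token_match` returns equal counts; any tolerance on the diffs is a separate choice that this report does not state.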

## Current Host Responsibilities

- Tokenization.
- Initial text token embedding lookup.
- Image placeholder insertion.
- Rotary position embedding generation.
- Generated-token embedding lookup.
- Full-attention KV cache insertion.
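
Two of these host-side steps, rotary table generation and cache insertion, can be illustrated with a short sketch. It assumes the standard rotary formulation with base `10000`, an even head dimension, and a cache modeled as a fixed-length Python list. None of these details are confirmed by this report, and `rotary_tables` and `insert_kv` are hypothetical names.

```python
import math


def rotary_tables(position, head_dim, base=10000.0):
    """Cos/sin tables for one position (standard RoPE; base is an assumption)."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    half = head_dim // 2
    # One inverse frequency per rotated pair of channels.
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(half)]
    angles = [position * f for f in inv_freq]
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]


def insert_kv(cache, position, new_entry):
    """Write one decode step's key/value entry into a fixed-length cache.

    Returns the next write position. Raises IndexError when the cache is full,
    since a fixed-shape package cannot grow its cache.
    """
    if not 0 <= position < len(cache):
        raise IndexError("position %d is outside the cache" % position)
    cache[position] = new_entry
    return position + 1
```

In a real runtime the cache would be a tensor per layer rather than a list, but the bookkeeping (write at the current position, then advance) has the same shape.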

## Export Notes

- Prefill export passes `skip_model_load=True` and `compute_units=CPU_ONLY` to `ct.convert` so that coremltools does not eagerly compile the large MLProgram for the ANE before saving.
- The runtime smoke tests still load the saved `.mlpackage` files with `CPU_ONLY` compute units and run real predictions.
- The current prefill package is fixed to a sequence length of `300` (the canary prompt) and a full-attention cache length of `512`.
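
The export and load settings described in these notes might look roughly like the sketch below. The `ct.convert` arguments and the `ct.models.MLModel` constructor are real coremltools calls, but the wrapper functions (`export_prefill`, `load_for_smoke`, `decode_budget`) and their parameters are hypothetical. The decode budget assumes the 300 prompt positions occupy the first slots of the 512-entry cache, which this report does not confirm.

```python
PREFILL_SEQ_LEN = 300  # fixed prompt length of the current prefill package
CACHE_LEN = 512        # fixed full-attention cache length


def decode_budget(prompt_len, seq_len=PREFILL_SEQ_LEN, cache_len=CACHE_LEN):
    """Decode steps that fit in the cache (assumes the prompt fills the first slots)."""
    if prompt_len != seq_len:
        raise ValueError(
            "prefill package is fixed to %d tokens, got %d" % (seq_len, prompt_len)
        )
    return cache_len - seq_len


def export_prefill(traced_model, inputs, out_path):
    """Convert and save without loading or compiling the model during conversion."""
    import coremltools as ct  # imported lazily; decode_budget works without it

    mlmodel = ct.convert(
        traced_model,
        inputs=inputs,
        convert_to="mlprogram",
        compute_units=ct.ComputeUnit.CPU_ONLY,
        skip_model_load=True,  # skip the eager compile step before saving
    )
    mlmodel.save(out_path)
    return mlmodel


def load_for_smoke(path):
    """Load a saved .mlpackage on CPU only, as the runtime smokes do."""
    import coremltools as ct

    return ct.models.MLModel(path, compute_units=ct.ComputeUnit.CPU_ONLY)
```

Under that assumption, `decode_budget(300)` gives the number of generated tokens the current package could hold before the cache is full.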