# Surya CoreML Runtime Status
Generated on 2026-06-19 in `studio@100.102.185.54:~/datalab-quants-cairo`.
## Packages
- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_vision_fp16.mlpackage`
- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_vision_int8.mlpackage`
- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_prefill_fp16_seq300_cache512.mlpackage`
- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_decode_step_fp16_cache512.mlpackage`
## Passing Runtime Gates
- Prefill parity before CoreML export:
  - `prefill_parity.json`
  - native/custom first token: `1039`
  - logits max diff: `2.6702880859375e-05`
  - logits mean diff: `8.332146421707876e-07`
- Prefill CoreML smoke:
  - `prefill_coreml_smoke.json`
  - Torch/CoreML first token: `1039`
  - logits max diff: `0.3057253360748291`
  - logits mean diff: `0.03853870555758476`
- Decode CoreML iterative smoke with advancing native cache:
  - `decode_step_iterative_smoke_fixed_native_cache.json`
  - tokens match 9/9
  - text: `<p>Invoice `
- Combined language runtime smoke:
  - `combined_prefill_decode_smoke.json`
  - CoreML prefill cache -> CoreML decode step
  - tokens match 9/9
  - text: `<p>Invoice `
- Vision-inclusive runtime smoke, FP16 vision:
  - `vision_fp16_prefill_decode_smoke.json`
  - CoreML vision -> CoreML prefill -> CoreML decode
  - tokens match 9/9
  - vision mean diff vs torch: `0.019221976399421692`
- Vision-inclusive runtime smoke, INT8 vision:
  - `vision_int8_prefill_decode_smoke.json`
  - CoreML INT8 vision -> CoreML prefill -> CoreML decode
  - tokens match 9/9
  - vision mean diff vs torch: `0.021211756393313408`
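
The gates above report a first-token check, max/mean logit differences, and a position-wise token match count. A minimal sketch of how such numbers could be computed is shown here; the function names and the toy data are assumptions for illustration and are not taken from the canary scripts.

```python
def compare_logits(reference, candidate):
    """Return max/mean absolute difference and whether the argmax token agrees."""
    if len(reference) != len(candidate):
        raise ValueError("logit vectors must have the same length")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    ref_token = max(range(len(reference)), key=reference.__getitem__)
    cand_token = max(range(len(candidate)), key=candidate.__getitem__)
    return {
        "max_diff": max(diffs),
        "mean_diff": sum(diffs) / len(diffs),
        "reference_token": ref_token,
        "candidate_token": cand_token,
        "first_token_match": ref_token == cand_token,
    }


def count_token_matches(reference_tokens, candidate_tokens):
    """Count position-wise matches, in the style of a 'tokens match 9/9' report."""
    matched = sum(r == c for r, c in zip(reference_tokens, candidate_tokens))
    return matched, max(len(reference_tokens), len(candidate_tokens))


# Toy data only; these are not the canary's logits or tokens.
torch_logits = [0.10, 2.50, -1.00, 0.30]
coreml_logits = [0.12, 2.45, -0.90, 0.31]
report = compare_logits(torch_logits, coreml_logits)
matched, total = count_token_matches([5, 7, 9], [5, 7, 9])
print(report["first_token_match"], f"{matched}/{total}")
```

A gate written this way can pass on token agreement even when the logit differences are in the FP16 range reported for the CoreML smokes, since only the argmax has to match.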
## Current Host Responsibilities
- Tokenization.
- Initial text token embedding lookup.
- Image placeholder insertion.
- Rotary position embedding generation.
- Generated-token embedding lookup.
- Full-attention KV cache insertion.
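
Two of the host-side steps listed above, rotary position embedding generation and KV cache insertion, can be sketched in plain Python. This is only an illustration under stated assumptions: it uses the standard RoPE frequency schedule with base `10000`, a hypothetical `HEAD_DIM`, and assumes the prompt occupies cache slots `0..299` so the first decode step writes at slot `300`. None of these details are confirmed by this document beyond the sequence length `300` and cache length `512`.

```python
import math


def rope_cos_sin(position, head_dim, base=10000.0):
    """Cos/sin values for one position using the standard RoPE frequency schedule."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    half = head_dim // 2
    angles = [position / (base ** (2 * i / head_dim)) for i in range(half)]
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]


def insert_kv(cache, entry, position):
    """Write one step's key (or value) vector into a fixed-length cache in place."""
    if not 0 <= position < len(cache):
        raise IndexError(f"position {position} outside cache of length {len(cache)}")
    cache[position] = list(entry)
    return position + 1  # index of the next free slot


CACHE_LEN = 512   # full-attention cache length stated in this document
PROMPT_LEN = 300  # fixed prefill sequence length stated in this document
HEAD_DIM = 8      # hypothetical value, for illustration only

key_cache = [[0.0] * HEAD_DIM for _ in range(CACHE_LEN)]
next_pos = PROMPT_LEN  # assumption: decode starts right after the prompt slots
cos, sin = rope_cos_sin(next_pos, HEAD_DIM)
next_pos = insert_kv(key_cache, [1.0] * HEAD_DIM, next_pos)
```

The bounds check in `insert_kv` matters with a fixed-length cache: under these assumptions, at most `CACHE_LEN - PROMPT_LEN` decode steps fit before the cache is full.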
## Export Notes
- Prefill export passes `skip_model_load=True` and `compute_units=ct.ComputeUnit.CPU_ONLY` to `ct.convert`, so that coremltools does not eagerly compile the large ML Program through the ANE before saving.
- Runtime smoke tests still instantiate the `.mlpackage` files with `CPU_ONLY` and run real predictions.
- The current prefill package is fixed to the canary prompt sequence length `300` and full-attention cache length `512`.
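
The export and reload pattern described in these notes could look roughly like the sketch below. It assumes a traced PyTorch prefill module and a single hypothetical input named `inputs_embeds`; the real prefill package very likely takes more inputs (for example rotary embeddings), and the helper names are illustrative. The `ct.convert` and `ct.models.MLModel` parameters used are existing coremltools options.

```python
def export_prefill(traced_model, hidden_size, out_path, seq_len=300):
    """Convert a traced prefill module without loading the compiled model."""
    import coremltools as ct  # imported lazily so this file loads without coremltools

    mlmodel = ct.convert(
        traced_model,
        # Hypothetical single input; fixed shape mirrors the fixed sequence length.
        inputs=[ct.TensorType(name="inputs_embeds", shape=(1, seq_len, hidden_size))],
        convert_to="mlprogram",
        compute_units=ct.ComputeUnit.CPU_ONLY,
        skip_model_load=True,  # do not compile/load here; predict() is unavailable on this object
    )
    mlmodel.save(out_path)
    return out_path


def load_for_smoke(package_path):
    """Reload a saved .mlpackage on CPU only so real predictions can be run."""
    import coremltools as ct

    return ct.models.MLModel(package_path, compute_units=ct.ComputeUnit.CPU_ONLY)
```

Because a model converted with `skip_model_load=True` cannot run predictions, a smoke test would reload the saved package with `load_for_smoke` before calling `predict`.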