	# Surya CoreML Runtime Status

	Generated on 2026-06-19 in `studio@100.102.185.54:~/datalab-quants-cairo`.

	## Packages

	- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_vision_fp16.mlpackage`
	- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_vision_int8.mlpackage`
	- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_prefill_fp16_seq300_cache512.mlpackage`
	- `artifacts/coreml/surya-ocr-2-coreml-8bit/surya_decode_step_fp16_cache512.mlpackage`

	## Passing Runtime Gates

	- Prefill parity before CoreML export:
	  - `prefill_parity.json`
	  - native/custom first token: `1039`
	  - logits max diff: `2.6702880859375e-05`
	  - logits mean diff: `8.332146421707876e-07`
	- Prefill CoreML smoke:
	  - `prefill_coreml_smoke.json`
	  - Torch/CoreML first token: `1039`
	  - logits max diff: `0.3057253360748291`
	  - logits mean diff: `0.03853870555758476`
	- Decode CoreML iterative smoke with advancing native cache:
	  - `decode_step_iterative_smoke_fixed_native_cache.json`
	  - tokens match 9/9
	  - text: `<p>Invoice `
	- Combined language runtime smoke:
	  - `combined_prefill_decode_smoke.json`
	  - CoreML prefill cache -> CoreML decode step
	  - tokens match 9/9
	  - text: `<p>Invoice `
	- Vision-inclusive runtime smoke, FP16 vision:
	  - `vision_fp16_prefill_decode_smoke.json`
	  - CoreML vision -> CoreML prefill -> CoreML decode
	  - tokens match 9/9
	  - vision mean diff vs torch: `0.019221976399421692`
	- Vision-inclusive runtime smoke, INT8 vision:
	  - `vision_int8_prefill_decode_smoke.json`
	  - CoreML INT8 vision -> CoreML prefill -> CoreML decode
	  - tokens match 9/9
	  - vision mean diff vs torch: `0.021211756393313408`
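
The gate values above (first token, logits max/mean diff, tokens match N/N) could be produced by a comparison along the following lines. This is a minimal sketch, not the script that generated the JSON reports: it assumes the diffs are element-wise absolute differences and that the first token is the argmax of the logits, and the function names `logits_parity` and `tokens_match` are hypothetical.

```python
import json


def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def logits_parity(reference, candidate):
    """Compare two equal-length logit vectors for the first generated token.

    Assumption: "diff" means element-wise absolute difference.
    """
    if len(reference) != len(candidate):
        raise ValueError("logit vectors must have the same length")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    return {
        "reference_first_token": argmax(reference),
        "candidate_first_token": argmax(candidate),
        "logits_max_diff": max(diffs),
        "logits_mean_diff": sum(diffs) / len(diffs),
    }


def tokens_match(reference_ids, candidate_ids):
    """Return (matching positions, compared positions).

    A missing token in the shorter sequence counts as a mismatch.
    """
    matched = sum(1 for r, c in zip(reference_ids, candidate_ids) if r == c)
    return matched, max(len(reference_ids), len(candidate_ids))


if __name__ == "__main__":
    # Toy values for illustration only; not taken from the canary run.
    report = logits_parity([0.1, 2.0, -1.0], [0.15, 1.9, -1.0])
    matched, total = tokens_match([5, 6, 7], [5, 6, 7])
    report["tokens_match"] = f"{matched}/{total}"
    print(json.dumps(report, indent=2))
```

Reporting both the argmax token and the diff statistics is useful because a quantized or converted model can show a visibly larger logits diff while still selecting the same token.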

	## Current Host Responsibilities

	- Tokenization.
	- Initial text token embedding lookup.
	- Image placeholder insertion.
	- Rotary position embedding generation.
	- Generated-token embedding lookup.
	- Full-attention KV cache insertion.
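
Two of the host-side steps listed above, rotary position embedding generation and KV cache insertion, can be sketched in plain Python. The source does not specify the rotary variant or the cache layout, so everything here is an assumption: the conventional RoPE frequency schedule with base `10000`, one cache slot per position, and the hypothetical helper names `rotary_cos_sin`, `new_cache`, and `insert_kv`. Only the cache length `512` comes from this document.

```python
import math

CACHE_LEN = 512  # full-attention cache length of the current packages


def rotary_cos_sin(position, head_dim, base=10000.0):
    """cos/sin tables for one position.

    Assumption: conventional RoPE, angle_i = position / base**(2*i/head_dim).
    """
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    half = head_dim // 2
    angles = [position / (base ** (2 * i / head_dim)) for i in range(half)]
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]


def new_cache(cache_len=CACHE_LEN):
    """Fixed-length cache; None marks a slot that has not been written yet."""
    return [None] * cache_len


def insert_kv(cache, position, key, value):
    """Write one (key, value) pair at `position` and return the next position.

    Assumption: the host owns the write index and the cache never grows.
    """
    if not 0 <= position < len(cache):
        raise IndexError(f"position {position} is outside the cache")
    cache[position] = (key, value)
    return position + 1
```

In a decode loop of this shape, the host would call `rotary_cos_sin` for the current position, run the decode step, and then call `insert_kv` with the returned key and value before advancing.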

	## Export Notes

	- Prefill export passes `skip_model_load=True` and `compute_units=CPU_ONLY` to `ct.convert` so that coremltools does not eagerly compile the large MLProgram for the ANE before saving.
	- Runtime smokes still instantiate `.mlpackage` files with `CPU_ONLY` and run real predictions.
	- The current prefill package is fixed to the canary prompt sequence length `300` and full-attention cache length `512`.
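
The export and reload pattern described in these notes might look like the sketch below. It is not the project's export script: `export_prefill`, `load_for_smoke`, and `fits_fixed_shapes` are hypothetical helpers, the traced model and its `inputs` list are supplied by the caller, and the shape check assumes that the prompt plus generated tokens must fit inside the cache. `coremltools` is imported inside the functions so the module can be imported on hosts where it is not installed.

```python
PREFILL_SEQ_LEN = 300  # canary prompt sequence length
CACHE_LEN = 512        # full-attention cache length


def export_prefill(traced_model, inputs, out_path):
    """Convert a traced model to an ML Program without loading it after conversion."""
    import coremltools as ct

    mlmodel = ct.convert(
        traced_model,
        inputs=inputs,                          # e.g. a list of ct.TensorType
        convert_to="mlprogram",
        compute_units=ct.ComputeUnit.CPU_ONLY,
        skip_model_load=True,                   # skip eager compile/load before saving
    )
    mlmodel.save(out_path)
    return out_path


def load_for_smoke(package_path):
    """Load a saved .mlpackage on CPU only so a smoke test can run real predictions."""
    import coremltools as ct

    return ct.models.MLModel(package_path, compute_units=ct.ComputeUnit.CPU_ONLY)


def fits_fixed_shapes(prompt_len, new_tokens):
    """True if a request matches the fixed prefill length and stays within the cache.

    Assumption: the cache must hold the prompt plus every generated token.
    """
    return prompt_len == PREFILL_SEQ_LEN and prompt_len + new_tokens <= CACHE_LEN
```

A model converted with `skip_model_load=True` cannot be used for prediction directly, which is why the smoke step reloads the saved package through `load_for_smoke` rather than reusing the object returned by `ct.convert`.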