zenz-CoreML

Core ML export of Miwa-Keita/zenz-v3.1-small for Apple platforms.

See CHANGELOG.md for version-to-version changes, docs/stateful-runtime-notes.md for the current stateful runtime contract, and docs/performance.md for the detailed benchmark log.

This repo is organized for Hugging Face Hub delivery, not GitHub Releases or SwiftPM binary targets. The intended upload payload is:

  • Artifacts/stateless/zenz-stateless-fp16.mlpackage
  • Artifacts/stateless/zenz-stateless-8bit.mlpackage
  • Artifacts/stateful/zenz-stateful-fp16.mlpackage
  • Artifacts/stateful/zenz-stateful-8bit.mlpackage
  • tokenizer/*
  • hf_manifest.json

The original model remains the source of truth for tokenizer semantics, weights provenance, and training lineage. This Core ML port should be linked back to the upstream model when published on Hugging Face.

Runtime shape

  • stateless is the whole-sequence baseline.
  • stateful is the single-model cached generation path.

The stateful model keeps the same Core ML state layout:

  • keyCache
  • valueCache

The current stateful runtime contract is:

  • prefill incrementally over the prompt
  • decode one token at a time
  • reuse the same Core ML state
  • provide an attention_mask that reflects the active sequence length during decode
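The contract above can be sketched end to end. This is a minimal sketch, not the exported model's actual I/O: `predict`, `MAX_LEN`, and the mask shape are illustrative assumptions standing in for the stateful Core ML prediction call.

```python
import numpy as np

MAX_LEN = 64  # illustrative fixed sequence length of the exported model


def make_mask(active_len: int, max_len: int = MAX_LEN) -> np.ndarray:
    """attention_mask with 1s over the active positions, 0s elsewhere."""
    mask = np.zeros((1, max_len), dtype=np.int32)
    mask[0, :active_len] = 1
    return mask


def run(prompt_ids, predict, steps=8):
    """Prefill one token at a time, then decode greedily.

    `predict(token_id, position, mask)` stands in for a stateful Core ML
    prediction that returns next-token logits; the KV cache lives inside
    the reused Core ML state, so each step only feeds the newest token
    and a mask reflecting the active sequence length.
    """
    pos = 0
    logits = None
    for tok in prompt_ids:                      # incremental prefill
        logits = predict(tok, pos, make_mask(pos + 1))
        pos += 1
    out = []
    for _ in range(steps):                      # incremental decode
        nxt = int(np.argmax(logits))
        out.append(nxt)
        logits = predict(nxt, pos, make_mask(pos + 1))
        pos += 1
    return out
```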

Compute Units

  • stateless: .all
  • stateful: .cpuAndGPU
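As a sketch, the same compute-unit split can be expressed when loading the packages with coremltools for host-side checks (the paths are the upload payload above; on-device Swift code would set `MLModelConfiguration.computeUnits` instead):

```python
import coremltools as ct

# Stateless baseline: let Core ML pick any compute unit.
stateless = ct.models.MLModel(
    "Artifacts/stateless/zenz-stateless-fp16.mlpackage",
    compute_units=ct.ComputeUnit.ALL,
)

# Stateful cached-generation path: pin to CPU+GPU, matching the
# recommendation in this README (the recorded .all FP16 run on
# iPhone 12 showed degraded outputs).
stateful = ct.models.MLModel(
    "Artifacts/stateful/zenz-stateful-fp16.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_GPU,
)
```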

Benchmark

Summary

Mean latency (lower is better; units and methodology in docs/performance.md):

  Device        Stateful FP16   Stateful 8-bit
  iPhone Air    0.436           0.431
  iPhone 12     1.124           1.041
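For context, the relative gap implied by the table (same numbers, just rearranged):

```python
# Mean latencies from the summary table above.
bench = {
    "iPhone Air": {"fp16": 0.436, "8bit": 0.431},
    "iPhone 12":  {"fp16": 1.124, "8bit": 1.041},
}


def gain_percent(fp16: float, int8: float) -> float:
    """How much faster 8-bit is than FP16, as a percentage."""
    return round((fp16 - int8) / fp16 * 100, 1)


for device, t in bench.items():
    print(f"{device}: 8-bit is {gain_percent(t['fp16'], t['8bit'])}% faster")
# iPhone Air: ~1% faster; iPhone 12: ~7% faster
```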

Recommendation

  • iPhone 15 Pro and newer: start with the Stateful FP16 model.
  • Devices older than the iPhone 15 Pro: start with the Stateful 8-bit model.

Notes

  • On iPhone Air, both stateful variants produced correct outputs in the recorded run.
  • On iPhone Air, 8-bit was slightly faster on mean latency than FP16.
  • On iPhone 12, 8-bit was faster on mean latency than FP16.
  • On iPhone 12, the recorded FP16 run under .all showed degraded outputs while the recorded 8-bit run remained correct.
  • The current recommendation is based on stateful running under .cpuAndGPU, not .all.
  • My current read: FP16 is the better top-end option when the device can sustain it cleanly, but 8-bit is the safer deployment default when broader device coverage matters.
  • In short: FP16 is the premium path, 8-bit is the compatibility path.

See docs/performance.md for the detailed tables and case-level notes.

Local export

python -m pip install -r requirements.txt
python Scripts/export_all.py

Or run each stage separately:

python convert-to-CoreML.py            # stateless export
python convert-to-CoreML-Stateful.py   # stateful export