---
license: cc-by-nc-4.0
language:
- multilingual
tags:
- coreml
- asr
- speech-recognition
- wav2vec2
- ctc
- ios
- on-device
- apple-neural-engine
pipeline_tag: automatic-speech-recognition
library_name: coremltools
---

# Omni-ASR CTC CoreML Models

CoreML-optimized versions of [Meta's Omni-ASR](https://ai.meta.com/research/publications/scaling-speech-technology-to-1000-languages/) CTC models for on-device speech recognition on Apple platforms (iOS 17+, macOS 14+).

These models run entirely on-device using Apple's Neural Engine (ANE), with no cloud dependency.

## Available Models

| Model | Parameters | Precision | Size | Recommended |
|-------|-----------|-----------|------|-------------|
| `OmniASR_CTC_300M_int8` | 300M | INT8 | 312 MB | **Yes** |
| `OmniASR_CTC_300M_fp16` | 300M | FP16 | 621 MB | |
| `OmniASR_CTC_1B_int8` | 1B | INT8 | 933 MB | |
| `OmniASR_CTC_1B_fp16` | 1B | FP16 | 1.8 GB | |

The **300M INT8** variant offers the best trade-off between accuracy and latency for real-time use on iPhone.

## Architecture

- **Backbone:** wav2vec2 Conformer encoder (fairseq2)
- **Head:** CTC (Connectionist Temporal Classification)
- **Feature extractor:** Convolutional, stride 320 (20 ms per frame at 16 kHz)
- **Vocabulary:** 9,813 multilingual SentencePiece tokens (shared across all variants)
- **Training:** Dynamic Chunk Training with ~10% full-context passes

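The stride-320 extractor fixes the audio-to-frame mapping, which the decoder relies on when trimming padded output. A quick sanity check of the arithmetic (helper names here are illustrative, not part of the model API):

```swift
// Stride-320 feature extractor: one output frame per 320 input samples,
// i.e. 20 ms of audio per frame at a 16 kHz sample rate.
let sampleRate = 16_000
let hopSamples = 320

func frameCount(forSamples n: Int) -> Int { n / hopSamples }

print(frameCount(forSamples: 160_000))                  // 500 frames for 10 s of audio
print(Double(hopSamples) / Double(sampleRate) * 1000)   // 20.0 ms per frame
```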
## Input / Output

| | Description |
|---|---|
| **Input** | `audio`: Float16 MultiArray `[1, T]` – raw 16 kHz mono audio samples |
| **Output** | `logits`: Float16 MultiArray `[1, T/320, 9813]` – CTC log-probabilities |

Supported input lengths (enumerated shapes):
- `[1, 160000]` – 10 seconds
- `[1, 320000]` – 20 seconds
- `[1, 640000]` – 40 seconds

Shorter audio is zero-padded to the nearest shape; the CTC decoder trims to the actual length.

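The shape-selection and padding rule can be sketched as a small helper; the function name is illustrative, and the actual app code may differ:

```swift
// Enumerated input lengths from the table above (10 s / 20 s / 40 s at 16 kHz).
let enumeratedLengths = [160_000, 320_000, 640_000]

// Hypothetical helper: zero-pad samples up to the smallest enumerated shape
// that fits. Returns nil for audio longer than 40 s (chunk it upstream).
func padToEnumeratedShape(_ samples: [Float]) -> [Float]? {
    guard let target = enumeratedLengths.first(where: { samples.count <= $0 })
    else { return nil }
    return samples + Array(repeating: 0, count: target - samples.count)
}
```

Keep the original sample count around: after inference, only the first `originalCount / 320` output frames carry real audio, and the rest is padding.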
## Performance (iPhone 15 Pro, ANE)

| Model | 4s audio | 20s audio | 40s audio |
|-------|----------|-----------|-----------|
| 300M INT8 | ~100 ms | ~500 ms | ~1.2 s |
| 1B INT8 | ~300 ms | ~1.5 s | ~3.5 s |

## Usage

### Download a model

```bash
pip install huggingface_hub
# Download 300M INT8 (recommended)
huggingface-cli download ChipCracker/omni-asr-coreml \
  OmniASR_CTC_300M_int8.mlmodelc --local-dir ./models
```

### Load in Swift

```swift
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

let model = try await MLModel.load(
    contentsOf: modelURL,
    configuration: config
)
```

### Decode with greedy CTC

```swift
// After model.prediction(from: features):
// 1. Argmax over the vocabulary dimension
// 2. Remove consecutive duplicates
// 3. Remove the blank token (index 0)
// 4. Map indices to vocabulary tokens
// 5. Join and replace the SentencePiece boundary (▁) with a space
```

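Steps 2 and 3 are the standard CTC collapse over the greedy argmax path. A minimal sketch, assuming blank index 0 as stated above (the helper name is hypothetical):

```swift
// CTC path collapse: merge consecutive repeats, then drop blanks.
// Repeats separated by a blank are real repeated tokens and survive.
func ctcCollapse(_ path: [Int], blank: Int = 0) -> [Int] {
    var out: [Int] = []
    var prev = -1  // sentinel: no valid token index is negative
    for id in path {
        if id != prev && id != blank { out.append(id) }
        prev = id
    }
    return out
}

print(ctcCollapse([0, 5, 5, 0, 5, 7, 7, 0]))  // [5, 5, 7]
```

The surviving indices are then mapped through the 9,813-token SentencePiece vocabulary and joined, replacing ▁ with spaces (steps 4 and 5).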
### iOS App

These models are used by the [omni-asr iOS app](https://github.com/ChipCracker/omni-asr), which provides:
- Live transcription with growing context
- On-demand model download from this repository
- Full offline operation after download

## Export

Models were exported from PyTorch using [coremltools](https://github.com/apple/coremltools) 9.0:

```bash
omni-asr-export \
  --model-card omniASR_CTC_300M \
  --output OmniASR_CTC_300M_int8.mlpackage
# INT8 quantization is applied by default
```

INT8 variants use post-training linear symmetric weight quantization, reducing size by ~2x with minimal accuracy loss.

## File Structure

Each `.mlmodelc` directory contains:
```
OmniASR_CTC_300M_int8.mlmodelc/
├── coremldata.bin         # Model graph serialization
├── metadata.json          # CoreML metadata
├── model.mil              # ML Intermediate Language
├── analytics/coremldata.bin
└── weights/weight.bin     # Model weights (largest file)
```

## Citation

```bibtex
@article{pratap2023scaling,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Pratap, Vineel and others},
  journal={arXiv preprint arXiv:2305.13516},
  year={2023}
}
```

## License

The CoreML conversion and app code are provided under CC-BY-NC-4.0.
The original Omni-ASR model weights are subject to Meta's license terms.