---
license: cc-by-nc-4.0
language:
- multilingual
tags:
- coreml
- asr
- speech-recognition
- wav2vec2
- ctc
- ios
- on-device
- apple-neural-engine
pipeline_tag: automatic-speech-recognition
library_name: coremltools
---

# Omni-ASR CTC CoreML Models

CoreML-optimized versions of [Meta's Omni-ASR](https://ai.meta.com/research/publications/scaling-speech-technology-to-1000-languages/) CTC models for on-device speech recognition on Apple platforms (iOS 17+, macOS 14+).

These models run entirely on-device using Apple's Neural Engine (ANE), with no cloud dependency.

## Available Models

| Model | Parameters | Precision | Size | Recommended |
|-------|-----------|-----------|------|-------------|
| `OmniASR_CTC_300M_int8` | 300M | INT8 | 312 MB | **Yes** |
| `OmniASR_CTC_300M_fp16` | 300M | FP16 | 621 MB | |
| `OmniASR_CTC_1B_int8` | 1B | INT8 | 933 MB | |
| `OmniASR_CTC_1B_fp16` | 1B | FP16 | 1.8 GB | |

The **300M INT8** variant offers the best trade-off between accuracy and latency for real-time use on iPhone.

## Architecture

- **Backbone:** wav2vec2 Conformer encoder (fairseq2)
- **Head:** CTC (Connectionist Temporal Classification)
- **Feature extractor:** convolutional, stride 320 (20 ms per frame at 16 kHz)
- **Vocabulary:** 9,813 multilingual SentencePiece tokens (shared across all variants)
- **Training:** Dynamic Chunk Training with ~10% full-context passes

## Input / Output

| | Description |
|---|---|
| **Input** | `audio`: Float16 MultiArray `[1, T]` – raw 16 kHz mono audio samples |
| **Output** | `logits`: Float16 MultiArray `[1, T/320, 9813]` – CTC log-probabilities |

Supported input lengths (enumerated shapes):
- `[1, 160000]` – 10 seconds
- `[1, 320000]` – 20 seconds
- `[1, 640000]` – 40 seconds

Shorter audio is zero-padded up to the nearest enumerated shape; the CTC decoder trims the output back to the actual audio length.
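
The padding and frame accounting described above can be sketched as follows. This is an illustrative Python sketch, not part of the released tooling; the shape list and stride come from this README, while `pad_to_enumerated` and `num_valid_frames` are hypothetical helper names:

```python
# Enumerated input lengths (samples at 16 kHz) and the encoder stride,
# both taken from the tables in this README.
ENUMERATED = [160_000, 320_000, 640_000]
STRIDE = 320  # one logits frame per 20 ms of audio

def pad_to_enumerated(samples: list) -> list:
    """Zero-pad audio up to the nearest enumerated shape (<= 40 s)."""
    target = next(n for n in ENUMERATED if n >= len(samples))
    return samples + [0.0] * (target - len(samples))

def num_valid_frames(sample_count: int) -> int:
    """Number of logits frames that cover the un-padded audio."""
    return sample_count // STRIDE

audio = [0.0] * 64_000           # 4 seconds of (silent) audio
padded = pad_to_enumerated(audio)
print(len(padded))               # 160000 (padded to the 10 s shape)
print(num_valid_frames(64_000))  # 200 frames (4 s / 20 ms)
```

Frames beyond `num_valid_frames` correspond to padding and are dropped before decoding.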

## Performance (iPhone 15 Pro, ANE)

| Model | 4 s audio | 20 s audio | 40 s audio |
|-------|----------|-----------|-----------|
| 300M INT8 | ~100 ms | ~500 ms | ~1.2 s |
| 1B INT8 | ~300 ms | ~1.5 s | ~3.5 s |
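
A quick real-time-factor (RTF) check on the 300M INT8 figures above (RTF = inference time / audio duration; values well below 1.0 mean comfortably faster than real time):

```python
# Latency figures (seconds) for the 300M INT8 model, from the table above.
timings = {4.0: 0.100, 20.0: 0.500, 40.0: 1.2}

for audio_s, infer_s in timings.items():
    rtf = infer_s / audio_s
    print(f"{audio_s:>4.0f} s audio -> RTF {rtf:.3f}")
# RTFs land around 0.025-0.030, i.e. roughly 30-40x faster than real time.
```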

## Usage

### Download a model

```bash
pip install huggingface_hub

# Download 300M INT8 (recommended)
huggingface-cli download ChipCracker/omni-asr-coreml \
  OmniASR_CTC_300M_int8.mlmodelc --local-dir ./models
```

### Load in Swift

```swift
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

let model = try await MLModel.load(
    contentsOf: modelURL,
    configuration: config
)
```

### Decode with greedy CTC

```swift
// After model.prediction(from: features):
// 1. Argmax over the vocabulary dimension
// 2. Collapse consecutive duplicate indices
// 3. Remove the blank token (index 0)
// 4. Map the remaining indices to vocabulary tokens
// 5. Join and replace the SentencePiece boundary (▁) with a space
```
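
The same five steps in runnable form, sketched in Python with a toy vocabulary (the real model uses the 9,813-token SentencePiece vocabulary; `vocab` and the logits below are made up for illustration):

```python
def greedy_ctc_decode(logits, vocab, blank_id=0):
    """Greedy CTC: argmax per frame, collapse repeats, drop blanks,
    map to tokens, and turn the SentencePiece boundary into a space."""
    # 1. Argmax over the vocabulary dimension, per frame
    ids = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    # 2. Collapse consecutive duplicate indices
    collapsed = [i for n, i in enumerate(ids) if n == 0 or i != ids[n - 1]]
    # 3. Remove the blank token, 4. map indices to tokens
    tokens = [vocab[i] for i in collapsed if i != blank_id]
    # 5. Join and replace the SentencePiece boundary (U+2581) with a space
    return "".join(tokens).replace("\u2581", " ").strip()

# Toy example: blank at index 0, "\u2581hi" repeated across two frames.
vocab = ["<blank>", "\u2581hi", "\u2581there"]
logits = [
    [0.1, 0.8, 0.1],    # -> "\u2581hi"
    [0.1, 0.9, 0.0],    # -> "\u2581hi" (duplicate, collapsed)
    [0.9, 0.05, 0.05],  # -> blank
    [0.0, 0.1, 0.9],    # -> "\u2581there"
]
print(greedy_ctc_decode(logits, vocab))  # hi there
```

The blank frame between the two tokens is what lets CTC distinguish a repeated token from a held one.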

### iOS App

These models are used by the [omni-asr iOS app](https://github.com/ChipCracker/omni-asr), which provides:
- Live transcription with growing context
- On-demand model download from this repository
- Full offline operation after download

## Export

Models were exported from PyTorch using [coremltools](https://github.com/apple/coremltools) 9.0:

```bash
omni-asr-export \
  --model-card omniASR_CTC_300M \
  --output OmniASR_CTC_300M_int8.mlpackage
# INT8 quantization is applied by default
```

INT8 variants use post-training linear symmetric weight quantization, reducing model size by roughly 2x relative to FP16 with minimal accuracy loss.
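
The ~2x figure follows from storage alone: FP16 stores 2 bytes per weight, INT8 about 1 (plus small scale-factor overhead). A back-of-the-envelope check against the nominal parameter counts, close to the 621/312 MB reported in the size table:

```python
# Nominal parameter counts; FP16 = 2 bytes/param, INT8 ~ 1 byte/param.
for name, params in [("300M", 300e6), ("1B", 1e9)]:
    fp16_mb = params * 2 / 1e6
    int8_mb = params * 1 / 1e6
    print(f"{name}: FP16 ~{fp16_mb:.0f} MB, INT8 ~{int8_mb:.0f} MB")
# 300M: FP16 ~600 MB, INT8 ~300 MB
# 1B:   FP16 ~2000 MB, INT8 ~1000 MB
```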

## File Structure

Each `.mlmodelc` directory contains:

```
OmniASR_CTC_300M_int8.mlmodelc/
├── coremldata.bin            # Model graph serialization
├── metadata.json             # CoreML metadata
├── model.mil                 # ML Intermediate Language
├── analytics/coremldata.bin
└── weights/weight.bin        # Model weights (largest file)
```

## Citation

```bibtex
@article{pratap2023scaling,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Pratap, Vineel and others},
  journal={arXiv preprint arXiv:2305.13516},
  year={2023}
}
```

## License

The CoreML conversion and app code are provided under CC-BY-NC-4.0. The original Omni-ASR model weights are subject to Meta's license terms.