---
license: cc-by-nc-4.0
language:
- multilingual
tags:
- coreml
- asr
- speech-recognition
- wav2vec2
- ctc
- ios
- on-device
- apple-neural-engine
pipeline_tag: automatic-speech-recognition
library_name: coremltools
---

# Omni-ASR CTC CoreML Models

CoreML-optimized versions of [Meta's Omni-ASR](https://ai.meta.com/research/publications/scaling-speech-technology-to-1000-languages/) CTC models for on-device speech recognition on Apple platforms (iOS 17+, macOS 14+).

These models run entirely on-device using Apple's Neural Engine (ANE), with no cloud dependency.

## Available Models

| Model | Parameters | Precision | Size | Recommended |
|-------|-----------|-----------|------|-------------|
| `OmniASR_CTC_300M_int8` | 300M | INT8 | 312 MB | **Yes** |
| `OmniASR_CTC_300M_fp16` | 300M | FP16 | 621 MB | |
| `OmniASR_CTC_1B_int8` | 1B | INT8 | 933 MB | |
| `OmniASR_CTC_1B_fp16` | 1B | FP16 | 1.8 GB | |

The **300M INT8** variant offers the best trade-off between accuracy and latency for real-time use on iPhone.

## Architecture

- **Backbone:** wav2vec2 Conformer encoder (fairseq2)
- **Head:** CTC (Connectionist Temporal Classification)
- **Feature extractor:** Convolutional, stride 320 (20 ms per frame at 16 kHz)
- **Vocabulary:** 9,813 multilingual SentencePiece tokens (shared across all variants)
- **Training:** Dynamic Chunk Training with ~10% full-context passes

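The stride-320 extractor fixes the audio-to-frame mapping, which the decoder relies on when trimming padded output. A quick sanity check of the arithmetic (helper names here are illustrative, not part of the model API):

```swift
// Stride-320 feature extractor: one output frame per 320 input samples,
// i.e. 20 ms of audio per frame at a 16 kHz sample rate.
let sampleRate = 16_000
let hopSamples = 320

func frameCount(forSamples n: Int) -> Int { n / hopSamples }

print(frameCount(forSamples: 160_000))                  // 500 frames for 10 s of audio
print(Double(hopSamples) / Double(sampleRate) * 1000)   // 20.0 ms per frame
```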
## Input / Output

| | Description |
|---|---|
| **Input** | `audio`: Float16 MultiArray `[1, T]` – raw 16 kHz mono audio samples |
| **Output** | `logits`: Float16 MultiArray `[1, T/320, 9813]` – CTC log-probabilities |

Supported input lengths (enumerated shapes):
- `[1, 160000]` – 10 seconds
- `[1, 320000]` – 20 seconds
- `[1, 640000]` – 40 seconds

Shorter audio is zero-padded to the nearest shape; the CTC decoder trims to the actual length.

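The shape-selection and padding rule can be sketched as a small helper; the function name is illustrative, and the actual app code may differ:

```swift
// Enumerated input lengths from the table above (10 s / 20 s / 40 s at 16 kHz).
let enumeratedLengths = [160_000, 320_000, 640_000]

// Hypothetical helper: zero-pad samples up to the smallest enumerated shape
// that fits. Returns nil for audio longer than 40 s (chunk it upstream).
func padToEnumeratedShape(_ samples: [Float]) -> [Float]? {
    guard let target = enumeratedLengths.first(where: { samples.count <= $0 })
    else { return nil }
    return samples + Array(repeating: 0, count: target - samples.count)
}
```

Keep the original sample count around: after inference, only the first `originalCount / 320` output frames carry real audio, and the rest is padding.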
## Performance (iPhone 15 Pro, ANE)

| Model | 4s audio | 20s audio | 40s audio |
|-------|----------|-----------|-----------|
| 300M INT8 | ~100 ms | ~500 ms | ~1.2 s |
| 1B INT8 | ~300 ms | ~1.5 s | ~3.5 s |

## Usage

### Download a model

```bash
pip install huggingface_hub
# Download 300M INT8 (recommended)
huggingface-cli download ChipCracker/omni-asr-coreml \
  OmniASR_CTC_300M_int8.mlmodelc --local-dir ./models
```

### Load in Swift

```swift
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

let model = try await MLModel.load(
    contentsOf: modelURL,
    configuration: config
)
```

### Decode with greedy CTC

```swift
// After model.prediction(from: features):
// 1. Argmax over the vocabulary dimension
// 2. Remove consecutive duplicates
// 3. Remove the blank token (index 0)
// 4. Map indices to vocabulary tokens
// 5. Join and replace the SentencePiece boundary (▁) with a space
```

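Steps 2 and 3 are the standard CTC collapse over the greedy argmax path. A minimal sketch, assuming blank index 0 as stated above (the helper name is hypothetical):

```swift
// CTC path collapse: merge consecutive repeats, then drop blanks.
// Repeats separated by a blank are real repeated tokens and survive.
func ctcCollapse(_ path: [Int], blank: Int = 0) -> [Int] {
    var out: [Int] = []
    var prev = -1  // sentinel: no valid token index is negative
    for id in path {
        if id != prev && id != blank { out.append(id) }
        prev = id
    }
    return out
}

print(ctcCollapse([0, 5, 5, 0, 5, 7, 7, 0]))  // [5, 5, 7]
```

The surviving indices are then mapped through the 9,813-token SentencePiece vocabulary and joined, replacing ▁ with spaces (steps 4 and 5).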
### iOS App

These models are used by the [omni-asr iOS app](https://github.com/ChipCracker/omni-asr), which provides:
- Live transcription with growing context
- On-demand model download from this repository
- Full offline operation after download

## Export

Models were exported from PyTorch using [coremltools](https://github.com/apple/coremltools) 9.0:

```bash
omni-asr-export \
  --model-card omniASR_CTC_300M \
  --output OmniASR_CTC_300M_int8.mlpackage
# INT8 quantization is applied by default
```

INT8 variants use post-training linear symmetric weight quantization, reducing size by ~2x with minimal accuracy loss.

## File Structure

Each `.mlmodelc` directory contains:
```
OmniASR_CTC_300M_int8.mlmodelc/
├── coremldata.bin         # Model graph serialization
├── metadata.json          # CoreML metadata
├── model.mil              # ML Intermediate Language
├── analytics/coremldata.bin
└── weights/weight.bin     # Model weights (largest file)
```

## Citation

```bibtex
@article{pratap2023scaling,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Pratap, Vineel and others},
  journal={arXiv preprint arXiv:2305.13516},
  year={2023}
}
```

## License

The CoreML conversion and app code are provided under CC-BY-NC-4.0.
The original Omni-ASR model weights are subject to Meta's license terms.