| --- |
| license: other |
| license_name: sensevoice-upstream |
| license_link: https://github.com/FunAudioLLM/SenseVoice |
| language: |
| - zh |
| - en |
| - ja |
| - ko |
| - yue |
| library_name: coreml |
| tags: |
| - coreml |
| - ane |
| - speech-recognition |
| - sensevoice |
| - funasr |
| - fluidaudio |
| pipeline_tag: automatic-speech-recognition |
| --- |
| |
| # SenseVoiceSmall — CoreML (Apple Neural Engine) |
|
|
| CoreML conversion of [FunAudioLLM/SenseVoiceSmall](https://huggingface.co/FunAudioLLM/SenseVoiceSmall) |
| for on-device inference on Apple Silicon, intended for |
| [FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio) |
| (tracks issues #645 / #646). |
|
|
| SenseVoiceSmall is a **non-autoregressive** multilingual ASR model (~234M params, |
| SANM encoder + single CTC head) covering 50+ languages, with emotion and |
| audio-event tags. One forward pass yields all output tokens. |
|
|
| ## Files (3-stage pipeline) |
|
|
| | File | Precision | Compute unit | Size | Role | |
| |------|-----------|--------------|------|------| |
| | `SenseVoicePreprocessor.mlmodelc` | FLOAT32 | CPU | 3 MB | front-end: waveform → 560-d LFR features | |
| | `SenseVoiceSmall.mlmodelc` | FLOAT16 | **`CPU_AND_NE` (ANE)** | 447 MB | **default** encoder+CTC | |
| | `SenseVoiceSmall_int8.mlmodelc` | INT8 (weights) | `CPU_AND_NE` (ANE) | 225 MB | ~half size, accuracy-neutral | |
| | `SenseVoiceSmall_fp32.mlmodelc` | FLOAT32 | any | 897 MB | encoder fallback (non-ANE) | |
| | `vocab.json` | — | — | — | 25055 SentencePiece tokens (array form) | |
|
|
| **int8** is post-training weight quantization (`linear_symmetric`) and is accuracy-neutral
| vs fp16 on the full canonical test sets (fp16 → int8): LibriSpeech test-clean WER 3.22% → 3.25%
| (2,620 utterances, Δ +0.03 pp) and AISHELL-1 test CER 3.09% → 3.09% (7,176 utterances, Δ 0.00 pp),
| with 0 NaN outputs on ANE. Peak RAM drops from 0.54 GB to 0.32 GB. Pick it for roughly half the on-disk and memory footprint.
|
|
| Pipeline: `waveform → [Preprocessor, fp32/CPU] → features → [encoder+CTC, fp16/ANE] → logits → host greedy-CTC decode`. |
|
|
| > ⚠️ **Compute-unit requirement.** The FLOAT16 encoder is numerically correct on |
| > the **Neural Engine** but produces **NaN on the CPU/GPU fp16 path**. Load it |
| > with `MLModelConfiguration.computeUnits = .cpuAndNeuralEngine`. On hardware |
| > without ANE (or under ANE fallback), use `SenseVoiceSmall_fp32`. The |
| > preprocessor must run **fp32** (power-spectrum/log exceed fp16 range). |
| |
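The compute-unit rule above can be written down as a small selection helper. This is only a sketch: `choose_encoder`, `EncoderChoice`, and the `has_ane` flag are hypothetical names (detecting the Neural Engine is left to the caller), and the returned strings simply mirror the `ALL` and `CPU_AND_NE` member names of coremltools' `ComputeUnit` enum.

```python
from typing import NamedTuple


class EncoderChoice(NamedTuple):
    filename: str        # file name from the Files table
    compute_units: str   # name of a coremltools ComputeUnit member


def choose_encoder(has_ane: bool, prefer_int8: bool = False) -> EncoderChoice:
    """Pick an encoder file and compute-unit setting (hypothetical helper).

    has_ane is supplied by the caller; this sketch does not detect hardware.
    """
    if not has_ane:
        # fp16 produces NaN off the Neural Engine, so fall back to fp32,
        # which the Files table lists as runnable on any compute unit.
        return EncoderChoice("SenseVoiceSmall_fp32.mlmodelc", "ALL")
    name = "SenseVoiceSmall_int8.mlmodelc" if prefer_int8 else "SenseVoiceSmall.mlmodelc"
    return EncoderChoice(name, "CPU_AND_NE")


print(choose_encoder(has_ane=True))
print(choose_encoder(has_ane=False))
```

The preprocessor is not part of this choice: per the note above it always runs in fp32.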
| ## I/O |
| |
| **`SenseVoicePreprocessor`** — in: `waveform [1, N]` fp32 (16 kHz, scaled ×32768 |
| like kaldi; flexible length). out: `features [1, T, 560]` fp32. |
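
To relate the waveform length `N` to the feature length `T`, the following sketch estimates the frame count. It assumes the usual Kaldi fbank settings (25 ms window, 10 ms shift, edge frames dropped), which this card does not state, together with the LFR stride n=6 from the pre-processing section. `to_kaldi_scale` and `lfr_frames` are hypothetical helper names.

```python
import math

SAMPLE_RATE = 16_000   # from this card
WIN_SAMPLES = 400      # assumption: 25 ms analysis window
HOP_SAMPLES = 160      # assumption: 10 ms frame shift
LFR_N = 6              # LFR stride (n=6), from this card


def to_kaldi_scale(samples):
    """Scale float samples in [-1, 1] by 32768, as the preprocessor expects."""
    return [s * 32768.0 for s in samples]


def lfr_frames(num_samples: int) -> int:
    """Estimate T, the number of 560-d LFR frames for a waveform of N samples."""
    if num_samples < WIN_SAMPLES:
        return 0
    fbank_frames = 1 + (num_samples - WIN_SAMPLES) // HOP_SAMPLES
    return math.ceil(fbank_frames / LFR_N)


# A 5.55 s clip at 16 kHz:
print(lfr_frames(int(5.55 * SAMPLE_RATE)))
```

Under these assumptions one LFR frame covers about 60 ms of audio, so the largest bucket (1800) corresponds to roughly 108 s.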
| |
| **`SenseVoiceSmall`** (encoder+CTC): |
| |
| | name | shape | dtype | notes | |
| |------|-------|-------|-------| |
| | `speech` | `[1, T, 560]` | fp32 | preprocessor output; `T` ∈ enumerated buckets `[128,256,512,1024,1800]` (pad up) | |
| | `speech_lengths` | `[1]` | int32 | valid frame count (before padding) | |
| | `language` | `[1]` | int32 | embed index; `0` = auto | |
| | `textnorm` | `[1]` | int32 | `15` = no inverse text-norm (woitn), `14` = withitn | |
|
|
| **Output:** `ctc_logits` `[1, T+4, 25055]` — the 4 leading positions are the |
| language/emotion/event/itn query tokens; the rest are the transcript. |
|
|
| ## Host pre/post-processing |
|
|
| **Pre:** handled by `SenseVoicePreprocessor` (Kaldi-style 80-bin fbank → LFR m=7, n=6 → CMVN,
| matching FunASR `WavFrontend` to a maximum absolute difference of about 2e-5). Pad its output up to the smallest
| encoder bucket ≥ `T`.
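
The bucket padding step might look like the sketch below. The bucket list comes from the I/O table; padding with zeros is an assumption (the card only says "pad up"), and `pick_bucket` and `pad_features` are hypothetical names. The unpadded length is returned so it can be passed as `speech_lengths`.

```python
BUCKETS = (128, 256, 512, 1024, 1800)  # enumerated T values from the I/O table
FEAT_DIM = 560


def pick_bucket(t: int) -> int:
    """Return the smallest enumerated bucket that holds t frames."""
    if t < 1:
        raise ValueError("need at least one frame")
    for bucket in BUCKETS:
        if bucket >= t:
            return bucket
    raise ValueError(f"{t} frames exceed the largest bucket ({BUCKETS[-1]})")


def pad_features(features):
    """Zero-pad a [T][560] list of frames up to its bucket.

    Returns (padded_features, valid_length).
    """
    t = len(features)
    bucket = pick_bucket(t)
    padding = [[0.0] * FEAT_DIM for _ in range(bucket - t)]
    return [list(row) for row in features] + padding, t


padded, valid = pad_features([[0.1] * FEAT_DIM for _ in range(93)])
print(len(padded), valid)
```

Audio longer than the largest bucket has to be split by the host before it reaches the encoder; this sketch only raises an error in that case.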
|
|
| **Post (decode):** greedy CTC over `ctc_logits` → collapse repeats → drop blank |
| (id 0) → SentencePiece detokenize → strip `<|...|>` tags for the clean |
| transcript. Reference Python in the repo's `decode.py`. |
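
The decode steps above can be sketched in plain Python. This is not the repo's `decode.py`: the function names are hypothetical, the detokenizer is a simplification that only joins pieces and turns the SentencePiece word marker (U+2581) into a space, and using `speech_lengths + 4` as the valid output length is an assumption based on the 4 leading query positions.

```python
import re

BLANK_ID = 0                           # blank id, from this card
TAG_RE = re.compile(r"<\|[^|]*\|>")    # matches tags such as <|zh|> or <|woitn|>


def greedy_ctc(logits, valid_len):
    """Argmax per frame, collapse repeats, drop blanks.

    logits: list of per-frame score lists; valid_len: frames to decode
    (assumed to be speech_lengths + 4).
    """
    ids, prev = [], None
    for frame in logits[:valid_len]:
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != prev and best != BLANK_ID:
            ids.append(best)
        prev = best
    return ids


def detokenize(ids, vocab):
    """Simplified SentencePiece detokenization over an index -> piece array."""
    return "".join(vocab[i] for i in ids).replace("\u2581", " ").strip()


def strip_tags(text):
    """Remove <|...|> tags to get the clean transcript."""
    return TAG_RE.sub("", text).strip()
```

With `vocab` loaded from `vocab.json` (array form, so the list index is the token id), `strip_tags(detokenize(greedy_ctc(logits, t + 4), vocab))` would give the clean transcript, and skipping `strip_tags` keeps the language, emotion, event, and ITN tags.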
|
|
| `language`/`textnorm` are **embed indices**, mapped on the host: |
| ```python
| # token id (<|zh|> etc.) -> embed index
| lid_int_dict = {24884: 3, 24885: 4, 24888: 7, 24892: 11, 24896: 12, 24992: 13}
| textnorm_int_dict = {25016: 14, 25017: 15}  # 14 = withitn, 15 = woitn
| # a language not in lid_int_dict maps to 0 (auto)
| ```
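
A minimal sketch of how a host might apply these tables when building the encoder's scalar inputs. The helper names and the returned dictionary layout are hypothetical; only the ids, the `0` = auto fallback, and the input names come from this card.

```python
LID_INT_DICT = {24884: 3, 24885: 4, 24888: 7, 24892: 11, 24896: 12, 24992: 13}
TEXTNORM_INT_DICT = {25016: 14, 25017: 15}
AUTO_LANGUAGE = 0


def language_index(token_id=None) -> int:
    """Map a language token id to its embed index; unknown or None -> 0 (auto)."""
    return LID_INT_DICT.get(token_id, AUTO_LANGUAGE)


def textnorm_index(use_itn: bool) -> int:
    """14 = withitn (inverse text normalization on), 15 = woitn (off)."""
    return 14 if use_itn else 15


def scalar_inputs(valid_frames: int, language_token_id=None, use_itn=False):
    """Build the three [1]-shaped int inputs next to `speech`."""
    return {
        "speech_lengths": [valid_frames],
        "language": [language_index(language_token_id)],
        "textnorm": [textnorm_index(use_itn)],
    }


print(scalar_inputs(93, language_token_id=24884))
```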
|
|
| ## Verification & benchmarks |
|
|
| Conversion = PyTorch (FunASR) → `torch.jit.trace` → coremltools (FLOAT16, |
| `EnumeratedShapes`, iOS17). Measured on a single Apple M-series machine with FunASR 1.3.9 and
| coremltools 8.3.
|
|
| - **End-to-end correctness:** on the cached zh sample, the CoreML(ANE) → |
| greedy-CTC pipeline reproduces FunASR `am.generate` **exactly**: |
| `<|zh|><|NEUTRAL|><|Speech|><|woitn|>欢迎大家来体验达摩院推出的语音识别模型` |
| - **Parity (torch ↔ CoreML, ANE):** CTC argmax token agreement **100%** on real audio. |
| - **LibriSpeech test-clean (the set used in the official chart):** CoreML(ANE)
| **3.21% WER** (torch 3.26%) on a 100-utterance subset, vs the published SenseVoice-Small **~3.1%**.
| This indicates that the full pipeline (front-end + CoreML + decode) reproduces the published result to within about 0.1 pp.
| (Full 2,620-utterance split number: see repo README.)
| - **FLEURS WER/CER (CoreML ANE vs torch), 100 samples/lang; the conversion is accuracy-neutral:**
|
|
| | lang | torch | CoreML (ANE) | Δ | RTFx | |
| |------|-------|--------------|---|------| |
| | en_us (WER) | 9.52% | 9.52% | +0.00pp | 402 | |
| | cmn_hans_cn (CER) | 9.60% | 9.57% | −0.03pp | 372 | |
| |
| > FLEURS is a harder, different read-speech set than LibriSpeech/AISHELL-1, so its |
| > absolute numbers are not comparable to the official benchmark chart; it is |
| > used here only for cross-language CoreML↔torch parity. |
| |
| - **RTFx (5.55 s clip, by bucket, ANE):** 128→524, 256→274, 512→97, 1024→36, 1800→14.5. |
| (M-series; iPhone ANE not yet measured.) |
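
The per-bucket RTFx figures can be converted into approximate per-call latency. This sketch assumes RTFx means audio duration divided by wall-clock processing time, which is the usual reading but is not defined in this card; `latency_ms` is a hypothetical helper.

```python
CLIP_SECONDS = 5.55
RTFX_BY_BUCKET = {128: 524, 256: 274, 512: 97, 1024: 36, 1800: 14.5}  # from this card


def latency_ms(audio_seconds: float, rtfx: float) -> float:
    """Processing time in milliseconds, assuming RTFx = audio time / processing time."""
    return 1000.0 * audio_seconds / rtfx


for bucket, rtfx in RTFX_BY_BUCKET.items():
    print(f"bucket {bucket:>4}: {latency_ms(CLIP_SECONDS, rtfx):7.1f} ms")
```

Under that assumption the same clip takes about 10.6 ms in the 128 bucket and about 383 ms in the 1800 bucket, which is why padding to the smallest bucket that fits matters.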
| |
| ## License & attribution |
| |
| Weights derive from [FunAudioLLM/SenseVoiceSmall](https://huggingface.co/FunAudioLLM/SenseVoiceSmall); |
| the upstream model license applies. This repo only contains a format conversion |
| (no retraining). See the [SenseVoice](https://github.com/FunAudioLLM/SenseVoice) |
| and [FunASR](https://github.com/modelscope/FunASR) projects. |
| |