---
license: other
license_name: sensevoice-upstream
license_link: https://github.com/FunAudioLLM/SenseVoice
language:
- zh
- en
- ja
- ko
- yue
library_name: coreml
tags:
- coreml
- ane
- speech-recognition
- sensevoice
- funasr
- fluidaudio
pipeline_tag: automatic-speech-recognition
---
# SenseVoiceSmall — CoreML (Apple Neural Engine)
CoreML conversion of [FunAudioLLM/SenseVoiceSmall](https://huggingface.co/FunAudioLLM/SenseVoiceSmall)
for on-device inference on Apple Silicon, intended for
[FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio)
(tracks issues #645 / #646).
SenseVoiceSmall is a **non-autoregressive** multilingual ASR model (~234M params,
SANM encoder + single CTC head) covering 50+ languages, with emotion and
audio-event tags. One forward pass yields all output tokens.
## Files (3-stage pipeline)
| File | Precision | Compute unit | Size | Role |
|------|-----------|--------------|------|------|
| `SenseVoicePreprocessor.mlmodelc` | FLOAT32 | CPU | 3 MB | front-end: waveform → 560-d LFR features |
| `SenseVoiceSmall.mlmodelc` | FLOAT16 | **`CPU_AND_NE` (ANE)** | 447 MB | **default** encoder+CTC |
| `SenseVoiceSmall_int8.mlmodelc` | INT8 (weights) | `CPU_AND_NE` (ANE) | 225 MB | ~half size, accuracy-neutral |
| `SenseVoiceSmall_fp32.mlmodelc` | FLOAT32 | any | 897 MB | encoder fallback (non-ANE) |
| `vocab.json` | — | — | — | 25055 SentencePiece tokens (array form) |
**int8** is post-training weight quantization (`linear_symmetric`). It is accuracy-neutral
relative to fp16 on the full test sets: LibriSpeech test-clean WER 3.22% → 3.25%
(2,620 utterances, Δ +0.03 pp) and AISHELL-1 test CER 3.09% → 3.09% (7,176 utterances,
Δ 0.00 pp), with no NaN outputs on the ANE. Peak RAM drops from 0.54 GB to 0.32 GB.
Choose it when you want roughly half the on-disk and memory footprint.
Pipeline: `waveform → [Preprocessor, fp32/CPU] → features → [encoder+CTC, fp16/ANE] → logits → host greedy-CTC decode`.
> ⚠️ **Compute-unit requirement.** The FLOAT16 encoder is numerically correct on
> the **Neural Engine** but produces **NaN on the CPU/GPU fp16 path**. Load it
> with `MLModelConfiguration.computeUnits = .cpuAndNeuralEngine`. On hardware
> without ANE (or under ANE fallback), use `SenseVoiceSmall_fp32`. The
> preprocessor must run in **fp32**, because its power-spectrum/log stage exceeds the fp16 range.
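
The compute-unit rule above can be written as a small selection helper. This is a minimal sketch in Python with coremltools rather than the Swift API. The names `pick_encoder` and `load_encoder` and the `has_ane` flag are hypothetical, and the loader assumes coremltools 7 or later (for `CompiledMLModel`) on macOS.

```python
from typing import Tuple


def pick_encoder(has_ane: bool, prefer_small: bool = False) -> Tuple[str, str]:
    """Return (model file, coremltools ComputeUnit name) following the rule above."""
    if not has_ane:
        # The fp16 encoder is documented to produce NaN off the ANE,
        # so fall back to the fp32 build, which may run on any compute unit.
        return "SenseVoiceSmall_fp32.mlmodelc", "ALL"
    name = "SenseVoiceSmall_int8.mlmodelc" if prefer_small else "SenseVoiceSmall.mlmodelc"
    return name, "CPU_AND_NE"


def load_encoder(path: str, unit_name: str):
    """Load a compiled model; coremltools is imported lazily (assumed installed)."""
    import coremltools as ct

    return ct.models.CompiledMLModel(path, compute_units=ct.ComputeUnit[unit_name])
```

How `has_ane` is determined is left to the host application; this sketch does not detect the hardware.
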
## I/O
**`SenseVoicePreprocessor`** — in: `waveform [1, N]` fp32 (16 kHz, scaled ×32768
like kaldi; flexible length). out: `features [1, T, 560]` fp32.
**`SenseVoiceSmall`** (encoder+CTC):
| name | shape | dtype | notes |
|------|-------|-------|-------|
| `speech` | `[1, T, 560]` | fp32 | preprocessor output; `T` ∈ enumerated buckets `[128,256,512,1024,1800]` (pad up) |
| `speech_lengths` | `[1]` | int32 | valid frame count (before padding) |
| `language` | `[1]` | int32 | embed index; `0` = auto |
| `textnorm` | `[1]` | int32 | `14` = with inverse text normalization (`withitn`), `15` = without it (`woitn`) |
**Output:** `ctc_logits` `[1, T+4, 25055]` — the 4 leading positions are the
language/emotion/event/itn query tokens; the rest are the transcript.
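
As an illustration of the input contract, the sketch below picks the smallest bucket that fits `T` and assembles the four named inputs. Zero padding is an assumption (the card only says to pad up), nested lists stand in for fp32/int32 arrays, and `pick_bucket` and `build_inputs` are hypothetical helper names.

```python
from typing import Dict, List

BUCKETS = [128, 256, 512, 1024, 1800]  # enumerated T values from the table above
FEAT_DIM = 560


def pick_bucket(num_frames: int) -> int:
    """Smallest enumerated bucket that fits num_frames."""
    for bucket in BUCKETS:
        if num_frames <= bucket:
            return bucket
    raise ValueError(f"{num_frames} frames exceeds the largest bucket ({BUCKETS[-1]})")


def build_inputs(features: List[List[float]], language: int = 0,
                 textnorm: int = 15) -> Dict[str, list]:
    """Pad [T, 560] features up to a bucket and assemble the named inputs."""
    num_frames = len(features)
    bucket = pick_bucket(num_frames)
    # Assumption: padding frames are all zeros.
    padding = [[0.0] * FEAT_DIM for _ in range(bucket - num_frames)]
    return {
        "speech": [features + padding],   # [1, bucket, 560]
        "speech_lengths": [num_frames],   # valid frames, before padding
        "language": [language],           # 0 = auto
        "textnorm": [textnorm],           # 15 = woitn, 14 = withitn
    }
```

With a padded input of `bucket` frames, the stated output shape becomes `[1, bucket + 4, 25055]`.
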
## Host pre/post-processing
**Pre:** handled by `SenseVoicePreprocessor` (kaldi fbank80 → LFR m=7,n=6 → CMVN,
matching FunASR `WavFrontend` to max|Δ|≈2e-5). Pad its output up to the smallest
encoder bucket ≥ `T`.
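
The 560-d feature size in the **Pre** step follows from LFR stacking: m=7 consecutive 80-d fbank frames are concatenated (7 × 80 = 560), and the window advances n=6 frames per output. The pure-Python sketch below shows only that stacking step. The edge handling (left-pad with `(m - 1) // 2` copies of the first frame, repeat the last frame at the tail) is an assumption modeled on common LFR implementations, `apply_lfr` is an illustrative name rather than the packaged preprocessor, and fbank extraction and CMVN are omitted.

```python
import math
from typing import List


def apply_lfr(frames: List[List[float]], m: int = 7, n: int = 6) -> List[List[float]]:
    """Stack m frames per output step, advancing n frames each time."""
    if not frames:
        return []
    num_out = math.ceil(len(frames) / n)
    # Assumed edge handling: repeat the first frame (m - 1) // 2 times on the left.
    padded = [frames[0]] * ((m - 1) // 2) + list(frames)
    out = []
    for i in range(num_out):
        window = padded[i * n : i * n + m]
        # Assumed tail handling: repeat the last frame until the window is full.
        window += [padded[-1]] * (m - len(window))
        out.append([value for frame in window for value in frame])
    return out
```

If the fbank hop is the usual 10 ms (an assumption; the hop is not stated here), each LFR frame covers 60 ms, so the 1800 bucket corresponds to roughly 108 s of audio.
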
**Post (decode):** greedy CTC over `ctc_logits` → collapse repeats → drop blank
(id 0) → SentencePiece detokenize → strip `<|...|>` tags for the clean
transcript. Reference Python in the repo's `decode.py`.
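
The decode steps above can be sketched in a few lines of plain Python. This is not the repo's `decode.py`. It assumes `vocab` is the token array loaded from `vocab.json`, that pieces use the standard SentencePiece word-boundary marker (U+2581), and that the caller passes only the valid rows of `ctc_logits` (an assumption based on the `speech_lengths` and `T+4` layout described earlier).

```python
import re
from typing import List, Sequence

BLANK_ID = 0
TAG_RE = re.compile(r"<\|[^|]*\|>")  # matches tags such as <|zh|> or <|woitn|>


def greedy_ctc_ids(logits: Sequence[Sequence[float]]) -> List[int]:
    """Argmax per frame, collapse consecutive repeats, then drop blanks."""
    ids: List[int] = []
    prev = None
    for frame in logits:
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != prev and best != BLANK_ID:
            ids.append(best)
        prev = best
    return ids


def detokenize(ids: Sequence[int], vocab: Sequence[str]) -> str:
    """Join pieces and turn the SentencePiece marker into spaces."""
    return "".join(vocab[i] for i in ids).replace("\u2581", " ").strip()


def strip_tags(text: str) -> str:
    """Remove <|...|> tags to get the clean transcript."""
    return TAG_RE.sub("", text).strip()
```

Because `prev` is updated on every frame, a blank between two identical tokens keeps both of them, which is the standard CTC collapse rule.
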
`language`/`textnorm` are **embed indices**, mapped on the host:
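```python
# Token id (<|zh|> etc.) -> language embed index
lid_int_dict = {24884: 3, 24885: 4, 24888: 7, 24892: 11, 24896: 12, 24992: 13}
# Token id -> text-norm embed index (14 = withitn, 15 = woitn)
textnorm_int_dict = {25016: 14, 25017: 15}
# A language that is not in lid_int_dict maps to 0 (auto).
```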
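
A minimal sketch of applying these tables on the host, with the fallback to `0` (auto) made explicit. The function names `language_index` and `textnorm_index` are hypothetical; the dictionary contents are copied from the block above.

```python
from typing import Optional

LID_INT_DICT = {24884: 3, 24885: 4, 24888: 7, 24892: 11, 24896: 12, 24992: 13}
TEXTNORM_INT_DICT = {25016: 14, 25017: 15}


def language_index(token_id: Optional[int] = None) -> int:
    """Embed index for a language token id; unknown or missing -> 0 (auto)."""
    return LID_INT_DICT.get(token_id, 0)


def textnorm_index(token_id: int) -> int:
    """Embed index for a text-norm token id; raises KeyError if unknown."""
    return TEXTNORM_INT_DICT[token_id]
```

The results go into the `language` and `textnorm` inputs as one-element int32 tensors.
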
## Verification & benchmarks
Conversion path: PyTorch (FunASR) → `torch.jit.trace` → coremltools (FLOAT16,
`EnumeratedShapes`, iOS17). All numbers below were measured on an Apple M-series
machine with FunASR 1.3.9 and coremltools 8.3.
- **End-to-end correctness:** on the cached zh sample, the CoreML(ANE) →
greedy-CTC pipeline reproduces FunASR `am.generate` **exactly**:
`<|zh|><|NEUTRAL|><|Speech|><|woitn|>欢迎大家来体验达摩院推出的语音识别模型`
- **Parity (torch ↔ CoreML, ANE):** CTC argmax token agreement **100%** on real audio.
- **LibriSpeech test-clean (the same test set as the official chart):** on a
100-utterance subset, CoreML(ANE) scores **3.21% WER** (torch 3.26%) against the
published SenseVoice-Small figure of **~3.1%**. This indicates that the full pipeline
(front-end + CoreML + decode) reproduces the published result.
(For the full 2,620-utterance split, see the repo README.)
- **FLEURS WER (CoreML ANE vs torch), 100 samples/lang — conversion is accuracy-neutral:**
| lang | torch | CoreML (ANE) | Δ | RTFx |
|------|-------|--------------|---|------|
| en_us (WER) | 9.52% | 9.52% | +0.00pp | 402 |
| cmn_hans_cn (CER) | 9.60% | 9.57% | −0.03pp | 372 |
> FLEURS is a different, harder read-speech set than LibriSpeech/AISHELL-1, so its
> absolute numbers are not comparable to the official benchmark chart; it is
> used here only for cross-language CoreML↔torch parity.
- **RTFx (5.55 s clip, by bucket, ANE):** 128→524, 256→274, 512→97, 1024→36, 1800→14.5.
(M-series; iPhone ANE not yet measured.)
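
To read the RTFx figures as latency, the sketch below uses the usual definition RTFx = audio duration / processing time. Whether the reported numbers cover the encoder alone or the whole pipeline is not stated here, so treat the results as implied processing time for whatever was timed.

```python
CLIP_SECONDS = 5.55
RTFX_BY_BUCKET = {128: 524, 256: 274, 512: 97, 1024: 36, 1800: 14.5}


def latency_ms(audio_seconds: float, rtfx: float) -> float:
    """Processing time implied by RTFx = audio duration / processing time."""
    return audio_seconds / rtfx * 1000.0


for bucket, rtfx in RTFX_BY_BUCKET.items():
    print(f"bucket {bucket:>4}: {latency_ms(CLIP_SECONDS, rtfx):6.1f} ms")
```

For the same 5.55 s clip this works out to roughly 11 ms in the 128 bucket and roughly 383 ms in the 1800 bucket, which is why padding to the smallest bucket that fits matters.
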
## License & attribution
Weights derive from [FunAudioLLM/SenseVoiceSmall](https://huggingface.co/FunAudioLLM/SenseVoiceSmall);
the upstream model license applies. This repo only contains a format conversion
(no retraining). See the [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
and [FunASR](https://github.com/modelscope/FunASR) projects.