| --- |
| license: other |
| license_name: paraformer-upstream |
| license_link: https://github.com/modelscope/FunASR |
| language: |
| - zh |
| library_name: coreml |
| tags: |
| - coreml |
| - ane |
| - speech-recognition |
| - paraformer |
| - funasr |
| - fluidaudio |
| pipeline_tag: automatic-speech-recognition |
| --- |
| |
| # Paraformer-large (zh) - CoreML (Apple Neural Engine) |
|
|
| CoreML conversion of FunASR's **Paraformer-large** (Mandarin Chinese) for on-device
| inference on Apple Silicon, intended for use with [FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio).
|
|
| Paraformer is a **non-autoregressive** ASR model: a SANM encoder, a CIF predictor |
| that emits one acoustic-embedding token per output character, and a parallel |
| (single-pass) decoder. Upstream: |
| [iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch](https://www.modelscope.cn/models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch). |
|
|
| ## Files (3 CoreML stages + host CIF) |
|
|
| | File | Precision | Compute unit | Size | Role | |
| |------|-----------|--------------|------|------| |
| | `ParaformerPreprocessor.mlmodelc` | FP32 | CPU | 3 MB | front-end: waveform → 560-d LFR features |
| | `ParaformerEncoder.mlmodelc` | FP16 | **ANE** | 302 MB | SANM encoder (enumerated buckets `[128,256,512,1024,1800]`) | |
| | `ParaformerDecoder.mlmodelc` | FP16 | **ANE** | 109 MB | parallel decoder (enc 512, tokens 128) | |
| | `vocab.json` | - | - | - | 8404 CharTokenizer tokens (array form) |
|
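The enumerated encoder buckets imply that the host pads each utterance to one of the listed lengths before calling the encoder. A minimal sketch of that selection, assuming zero padding at the end of the feature sequence; `pick_bucket` and `pad_features` are hypothetical helper names, not part of FluidAudio or the conversion repo:

```python
from bisect import bisect_left

# Enumerated encoder sequence lengths, taken from the Files table.
ENCODER_BUCKETS = [128, 256, 512, 1024, 1800]


def pick_bucket(num_frames, buckets=ENCODER_BUCKETS):
    """Return the smallest bucket that can hold `num_frames` LFR frames."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    idx = bisect_left(buckets, num_frames)
    if idx == len(buckets):
        raise ValueError(
            f"{num_frames} frames exceed the largest bucket ({buckets[-1]})"
        )
    return buckets[idx]


def pad_features(features, feature_dim=560):
    """Zero-pad a [T][feature_dim] list of frames up to its bucket length.

    Returns the padded frames and the number of valid (unpadded) frames.
    Zero padding at the end is an assumption of this sketch.
    """
    bucket = pick_bucket(len(features))
    padding = [[0.0] * feature_dim for _ in range(bucket - len(features))]
    return features + padding, len(features)
```

For example, a 130-frame utterance would be padded to the 256 bucket, and the valid length (130) would be kept so that downstream masks can ignore the padding.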
|
| The **CIF predictor** runs on the host between the encoder and decoder because it
| emits a *dynamic* token count, which a fixed-shape CoreML graph can't express. It
| is a small conv1d + linear + sigmoid stack followed by integrate-and-fire; a NumPy
| reference (`cif_numpy.py`) in the conversion repo serves as the blueprint for the
| Swift implementation.
|
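The integrate-and-fire step can be sketched in a few lines. This is a simplified illustration, not a copy of `cif_numpy.py`: it assumes the per-frame weights (`alphas`, the sigmoid output) are already computed and lie in [0, 1), uses a firing threshold of 1.0, and omits any end-of-utterance handling the real predictor may apply. The function name `cif_fire` is hypothetical.

```python
def cif_fire(alphas, hidden, threshold=1.0):
    """Integrate per-frame weights and emit one embedding per firing.

    alphas: list of T weights, each assumed to lie in [0, 1).
    hidden: list of T encoder frames, each a list of D floats.
    Returns a list of L fired embeddings; L depends on the data.
    """
    if len(alphas) != len(hidden):
        raise ValueError("alphas and hidden must have the same length")
    dim = len(hidden[0]) if hidden else 0
    integrate = 0.0          # weight accumulated for the current token
    frame = [0.0] * dim      # weighted sum of frames for the current token
    fired = []
    for alpha, h in zip(alphas, hidden):
        needed = threshold - integrate  # weight still missing for this token
        integrate += alpha
        if integrate >= threshold:
            # Spend only `needed` of this frame's weight on the current token.
            fired.append([f + needed * x for f, x in zip(frame, h)])
            # Carry the remaining weight of this frame into the next token.
            integrate -= threshold
            remainder = alpha - needed
            frame = [remainder * x for x in h]
        else:
            frame = [f + alpha * x for f, x in zip(frame, h)]
    return fired
```

The length of the returned list is the token count `L`, which is why this step cannot live inside a fixed-shape graph: it is only known after the weights have been accumulated.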
|
| ## Pipeline |
|
|
| ``` |
| waveform → [Preprocessor fp32/CPU] → features [1,T,560]
| → [Encoder fp16/ANE] → enc_out [1,T,512]
| → [host CIF] → acoustic_embeds [1,L,512], token_count L
| → [Decoder fp16/ANE] → logits [1,L,8404]
| → argmax per token → drop sos(1)/eos(2)/blank(0) → CharTokenizer
| ``` |
|
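The last pipeline step (argmax, drop special tokens, map to characters) is small enough to sketch directly. The snippet assumes `vocab` is the array from `vocab.json` loaded as a Python list indexed by token id, and that tokens are joined by plain concatenation; the real CharTokenizer may do more. `greedy_decode` is a hypothetical name.

```python
# Token ids to drop, as listed in the pipeline: blank(0), sos(1), eos(2).
SPECIAL_IDS = {0, 1, 2}


def greedy_decode(logits, vocab):
    """Turn decoder logits into text.

    logits: list of L rows, each with len(vocab) scores.
    vocab:  list mapping token id -> token string.
    """
    tokens = []
    for row in logits:
        # argmax over the vocabulary axis
        token_id = max(range(len(row)), key=row.__getitem__)
        if token_id in SPECIAL_IDS:
            continue
        tokens.append(vocab[token_id])
    # Plain concatenation is an assumption of this sketch.
    return "".join(tokens)
```

Because the decoder is non-autoregressive, all `L` rows are available at once and each row is decoded independently.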
|
| > Both the fp16 encoder and the fp16 decoder produce correct output on the Neural
| > Engine. The front-end runs in FP32 on the CPU because the power spectrum and log
| > exceed the FP16 range. Run the encoder and decoder with
| > `MLModelConfiguration.computeUnits = .cpuAndNeuralEngine`.
|
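To make the FP16 range concern concrete: the largest finite half-precision value is 65504, and a power spectrum can pass that easily. The toy example below assumes a 400-sample frame (25 ms at 16 kHz, an assumption rather than a documented setting of this front-end) of full-scale samples and looks only at the DC bin; `fits_fp16` and `dc_power` are hypothetical helpers.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_fp16(value):
    """True if `value` is finite and packs as IEEE 754 binary16 without overflow."""
    if not math.isfinite(value):
        return False
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
    except (OverflowError, struct.error):
        return False
    return True


def dc_power(frame):
    """Power of the DC bin of a frame: the squared sum of its samples."""
    return sum(frame) ** 2


# Toy frame: 400 samples at amplitude 1.0 (frame length is an assumption).
toy_power = dc_power([1.0] * 400)
```

Here `toy_power` is 160000, which does not fit in FP16, while its logarithm (about 12) does. The overflow happens in the intermediate power value, before the log is taken.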
|
| ## Conversion notes |
|
|
| Two SANM-specific fixes were required for fp16/ANE under bucket padding (see the
| conversion repo): an fp16-safe attention mask fill (`-inf` → `-1e4`), and building
| the encoder/decoder pad masks from the **input tensor's seq dim** (so that
| `EnumeratedShapes` generalize) rather than from `lengths.max()`.
|
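Both fixes can be illustrated on a single attention row. This is a plain-Python sketch of the idea, not the conversion code: `pad_mask` sizes the mask from the sequence length of the tensor it is applied to, and `masked_softmax` fills masked positions with a finite value instead of `-inf`. Both function names are hypothetical.

```python
import math

MASK_FILL = -1e4  # finite stand-in for -inf; well inside the FP16 range


def pad_mask(seq_len, valid_len):
    """1.0 for real frames, 0.0 for padding.

    `seq_len` is the padded tensor's own sequence dimension (the bucket size),
    so the mask always matches the input shape.
    """
    return [1.0 if i < valid_len else 0.0 for i in range(seq_len)]


def masked_softmax(scores, mask, fill=MASK_FILL):
    """Softmax over `scores` with masked positions replaced by `fill`."""
    filled = [s if m else fill for s, m in zip(scores, mask)]
    peak = max(filled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in filled]
    total = sum(exps)
    return [e / total for e in exps]
```

With a fill of `-1e4` the padded positions still receive effectively zero attention weight. As a general property of this formulation, a fully masked row also stays finite, whereas an `-inf` fill yields NaN there.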
|
| ## Benchmark: AISHELL-1 test (CoreML on ANE)
|
|
| Full test set (7,176 utts), full-CoreML pipeline on M5 Pro ANE: |
|
|
| | Precision | size (enc+dec) | CER | median RTFx | peak RAM | |
| |-----------|----------------|-----|-------------|----------| |
| | fp16 (default) | 411 MB | **2.12%** | 85× | 0.38 GB |
| | int8 | 207 MB | **2.12%** | 84× | 0.24 GB |
|
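For readers who want to check numbers like these on their own data, the two metrics can be computed as follows. The sketch assumes the usual definitions: CER as total character-level edit distance divided by total reference characters, and RTFx as audio duration divided by processing time. Whether the benchmark above aggregates exactly this way is not stated, and the function names are hypothetical.

```python
from statistics import median


def edit_distance(ref, hyp):
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def corpus_cer(pairs):
    """pairs: iterable of (reference, hypothesis). Returns edits / reference chars."""
    pairs = list(pairs)
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    chars = sum(len(ref) for ref, _ in pairs)
    return errors / chars


def median_rtfx(durations_s, processing_s):
    """Median of per-utterance (audio seconds / processing seconds)."""
    return median(d / p for d, p in zip(durations_s, processing_s))
```

An RTFx of 85× under this definition means that, for the median utterance, one second of audio is transcribed in roughly 1/85 of a second.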
|
| Official Paraformer-large AISHELL-1 ≈ 1.95% CER (the ~0.17 pp gap is attributed to
| fp16 and the fixed-shape decoder padding). int8 weight quantization is
| accuracy-neutral (CER unchanged) while roughly halving the model size (411 MB → 207 MB)
| and lowering peak RAM from 0.38 GB to 0.24 GB.
|
|
| The result closely matches the published Paraformer-large AISHELL-1 number, which
| indicates that the conversion (front-end + encoder + CIF + decoder) is faithful.
|
|
| ## License & attribution |
|
|
| Weights derive from FunASR's Paraformer-large; the upstream license applies. This |
| repo is a format conversion only (no retraining). See |
| [FunASR](https://github.com/modelscope/FunASR). |
|
|