---
license: other
license_name: paraformer-upstream
license_link: https://github.com/modelscope/FunASR
language:
- zh
library_name: coreml
tags:
- coreml
- ane
- speech-recognition
- paraformer
- funasr
- fluidaudio
pipeline_tag: automatic-speech-recognition
---

# Paraformer-large (zh): CoreML (Apple Neural Engine)

CoreML conversion of FunASR's **Paraformer-large** (Mandarin Chinese) for on-device inference on Apple Silicon, for [FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio).

Paraformer is a **non-autoregressive** ASR model: a SANM encoder, a CIF predictor that emits one acoustic-embedding token per output character, and a parallel (single-pass) decoder.

Upstream: [iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch](https://www.modelscope.cn/models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch).

## Files (3 CoreML stages + host CIF)

| File | Precision | Compute unit | Size | Role |
|------|-----------|--------------|------|------|
| `ParaformerPreprocessor.mlmodelc` | FP32 | CPU | 3 MB | front-end: waveform → 560-d LFR features |
| `ParaformerEncoder.mlmodelc` | FP16 | **ANE** | 302 MB | SANM encoder (enumerated buckets `[128,256,512,1024,1800]`) |
| `ParaformerDecoder.mlmodelc` | FP16 | **ANE** | 109 MB | parallel decoder (enc 512, tokens 128) |
| `vocab.json` | n/a | n/a | n/a | 8404 CharTokenizer tokens (array form) |

The **CIF predictor** runs on the host between encoder and decoder (it emits a *dynamic* token count, which a fixed-shape CoreML graph can't express). It's a small conv1d + linear + sigmoid → integrate-and-fire; a numpy reference (`cif_numpy.py`) is in the conversion repo as the Swift blueprint.
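To make the integrate-and-fire step concrete, here is a minimal sketch in plain Python. It is not the `cif_numpy.py` reference from the conversion repo. The function name `cif_fire`, the firing threshold of `1.0`, and the choice to drop any leftover weight at the end of the utterance are all assumptions made for illustration. The per-frame weights `alphas` are taken as input; in the real predictor they would come from the conv1d + linear + sigmoid head.

```python
def cif_fire(hidden, alphas, threshold=1.0):
    """Continuous integrate-and-fire (illustrative sketch, not the repo's code).

    hidden: list of T encoder frames, each a list of D floats.
    alphas: list of T weights in [0, 1] (assumed to be the sigmoid output).
    Returns a list of L fired embeddings, each a list of D floats;
    L is data-dependent, which is why this step runs on the host.
    """
    if len(hidden) != len(alphas):
        raise ValueError("hidden and alphas must have the same length")
    dim = len(hidden[0]) if hidden else 0
    integrate = 0.0           # weight accumulated since the last fire
    frame = [0.0] * dim       # weighted sum of frames since the last fire
    fired = []
    for h, alpha in zip(hidden, alphas):
        if integrate + alpha < threshold:
            # Not enough weight yet: keep integrating.
            integrate += alpha
            frame = [f + alpha * x for f, x in zip(frame, h)]
        else:
            # Spend just enough of this frame's weight to reach the threshold,
            # emit one token embedding, and carry the rest into the next token.
            used = threshold - integrate
            fired.append([f + used * x for f, x in zip(frame, h)])
            rest = alpha - used
            integrate = rest
            frame = [rest * x for x in h]
    # Assumption: any residual weight below the threshold is dropped here.
    return fired
```

The number of returned embeddings plays the role of the token count `L`. Before the fixed-shape decoder can consume them, the embeddings would still need padding to its token dimension (128 per the table above).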
## Pipeline

```
waveform
  → [Preprocessor fp32/CPU] → features [1,T,560]
  → [Encoder fp16/ANE] → enc_out [1,T,512]
  → [host CIF] → acoustic_embeds [1,L,512], token_count L
  → [Decoder fp16/ANE] → logits [1,L,8404]
  → argmax per token → drop sos(1)/eos(2)/blank(0) → CharTokenizer
```

> Both the fp16 encoder and the fp16 decoder produce correct output on the Neural Engine.
> The front-end runs FP32/CPU because the power-spectrum and log steps exceed the FP16 range.
> Run the encoder/decoder with `MLModelConfiguration.computeUnits = .cpuAndNeuralEngine`.

## Conversion notes

Two SANM-specific fixes were required for fp16/ANE under bucket padding (see the conversion repo):

- a fp16-safe attention mask fill (`-inf` → `-1e4`);
- building the encoder/decoder pad-masks from the **input tensor's seq dim** (so `EnumeratedShapes` generalize) rather than from `lengths.max()`.

## Benchmark: AISHELL-1 test (CoreML on ANE)

Full test set (7,176 utts), full-CoreML pipeline on M5 Pro ANE:

| Precision | Size (enc+dec) | CER | Median RTFx | Peak RAM |
|-----------|----------------|-----|-------------|----------|
| fp16 (default) | 411 MB | **2.12%** | 85× | 0.38 GB |
| int8 | 207 MB | **2.12%** | 84× | 0.24 GB |

Official Paraformer-large AISHELL-1 ≈ 1.95% CER; the ~0.17 pp gap is attributed to fp16 plus the fixed-shape decoder padding. int8 weight quantization is accuracy-neutral (CER unchanged) at roughly half the size and memory.

The result is within ~0.17 pp of the published Paraformer-large AISHELL-1 number, which indicates that the conversion (front-end + encoder + CIF + decoder) is faithful.

## License & attribution

Weights derive from FunASR's Paraformer-large; the upstream license applies. This repo is a format conversion only (no retraining). See [FunASR](https://github.com/modelscope/FunASR).