license: other
license_name: paraformer-upstream
license_link: https://github.com/modelscope/FunASR
language:
- zh
library_name: coreml
tags:
- coreml
- ane
- speech-recognition
- paraformer
- funasr
- fluidaudio
pipeline_tag: automatic-speech-recognition
Paraformer-large (zh): CoreML (Apple Neural Engine)
CoreML conversion of FunASR's Paraformer-large (Mandarin Chinese) for on-device inference on Apple Silicon, intended for use with FluidInference/FluidAudio.
Paraformer is a non-autoregressive ASR model: a SANM encoder, a CIF predictor that emits one acoustic-embedding token per output character, and a parallel (single-pass) decoder. Upstream: iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch.
Files (3 CoreML stages + host CIF)
| File | Precision | Compute unit | Size | Role |
|---|---|---|---|---|
| ParaformerPreprocessor.mlmodelc | FP32 | CPU | 3 MB | front-end: waveform → 560-d LFR features |
| ParaformerEncoder.mlmodelc | FP16 | ANE | 302 MB | SANM encoder (enumerated buckets [128,256,512,1024,1800]) |
| ParaformerDecoder.mlmodelc | FP16 | ANE | 109 MB | parallel decoder (enc 512, tokens 128) |
| vocab.json | - | - | - | 8404 CharTokenizer tokens (array form) |
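Because the encoder accepts only the enumerated sequence lengths listed in the table, the host has to pick a bucket and pad the features before each call. A minimal sketch of that step, assuming the buckets are measured in feature frames (the T in [1,T,560]) and that zero padding is acceptable; `pick_bucket` and `pad_features` are hypothetical helper names, and the behaviour for inputs longer than 1800 frames is not specified by this card, so the sketch simply raises:

```python
ENCODER_BUCKETS = (128, 256, 512, 1024, 1800)  # from the Files table


def pick_bucket(num_frames, buckets=ENCODER_BUCKETS):
    """Return the smallest enumerated length that fits num_frames."""
    for size in buckets:
        if num_frames <= size:
            return size
    # Assumption: longer inputs need chunking upstream of this helper.
    raise ValueError(f"{num_frames} frames exceeds the largest bucket {buckets[-1]}")


def pad_features(features, bucket, feat_dim=560):
    """Right-pad a [T][feat_dim] list of frames with zero rows up to bucket."""
    missing = bucket - len(features)
    return features + [[0.0] * feat_dim for _ in range(missing)]
```

The caller should keep the original frame count alongside the padded features, since the later stages need to know which frames are real.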
The CIF predictor runs on the host between encoder and decoder (it emits a
dynamic token count, which a fixed-shape CoreML graph can't express). It's a
small conv1d + linear + sigmoid → integrate-and-fire; a numpy reference
(cif_numpy.py) is in the conversion repo as the Swift blueprint.
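The integrate-and-fire step described above can be sketched in plain Python. This is an illustration of the general CIF idea, not the contents of cif_numpy.py: it assumes the per-frame weights (`alphas`, the sigmoid outputs) are already computed, uses a firing threshold of 1.0, and leaves out any tail handling or weight rescaling the upstream implementation may apply. The function name `cif` and its signature are hypothetical.

```python
def cif(alphas, frames, threshold=1.0):
    """Integrate encoder frames by weight; emit one embedding per firing.

    alphas: list of floats in [0, threshold], one per encoder frame.
    frames: list of equal-length float lists (encoder outputs).
    Returns a list of embeddings; its length is the token count.
    """
    if not frames:
        return []
    integrate = 0.0                      # weight accumulated toward the current token
    acc = [0.0] * len(frames[0])         # weighted sum of frames for the current token
    embeds = []
    for alpha, frame in zip(alphas, frames):
        needed = threshold - integrate   # weight still missing to complete the token
        integrate += alpha
        if integrate >= threshold:
            # Fire: this frame completes the token with only the needed share.
            acc = [a + needed * x for a, x in zip(acc, frame)]
            embeds.append(acc)
            integrate -= threshold
            rest = alpha - needed        # leftover share starts the next token
            acc = [rest * x for x in frame]
        else:
            acc = [a + alpha * x for a, x in zip(acc, frame)]
    return embeds
```

Any weight left over after the last frame is dropped here, which is why the number of emitted embeddings depends on the input and cannot be fixed in the graph ahead of time.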
Pipeline
waveform → [Preprocessor fp32/CPU] → features [1,T,560]
→ [Encoder fp16/ANE] → enc_out [1,T,512]
→ [host CIF] → acoustic_embeds [1,L,512], token_count L
→ [Decoder fp16/ANE] → logits [1,L,8404]
→ argmax per token → drop sos(1)/eos(2)/blank(0) → CharTokenizer
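The last pipeline step (argmax, special-token filtering, vocabulary lookup) might look like the following. It is a sketch under a few assumptions: `logits` is a plain list of rows, `vocab` is the array loaded from vocab.json, decoder slots beyond `token_count` are padding and can be ignored, and tokens are joined without separators. `greedy_decode` is a hypothetical name.

```python
BLANK_ID, SOS_ID, EOS_ID = 0, 1, 2          # IDs as listed in the pipeline
SPECIAL_IDS = {BLANK_ID, SOS_ID, EOS_ID}


def greedy_decode(logits, vocab, token_count):
    """Map decoder logits to text with a per-token argmax."""
    pieces = []
    for row in logits[:token_count]:        # skip padded decoder slots
        best = max(range(len(row)), key=row.__getitem__)
        if best in SPECIAL_IDS:
            continue
        pieces.append(vocab[best])
    return "".join(pieces)                  # assumption: no separator between tokens


# Toy usage with a 5-entry vocabulary instead of the real 8404 entries.
toy_vocab = ["<blank>", "<s>", "</s>", "你", "好"]
toy_logits = [
    [0.0, 9.0, 0.0, 0.0, 0.0],   # sos, dropped
    [0.0, 0.0, 0.0, 9.0, 0.0],   # 你
    [9.0, 0.0, 0.0, 0.0, 0.0],   # blank, dropped
    [0.0, 0.0, 0.0, 0.0, 9.0],   # 好
    [0.0, 0.0, 9.0, 0.0, 0.0],   # eos, dropped
    [0.0, 0.0, 0.0, 9.0, 0.0],   # padded slot beyond token_count
]
text = greedy_decode(toy_logits, toy_vocab, token_count=5)
```

Because the decoder is non-autoregressive, every row is decoded independently; there is no beam or history to carry between tokens.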
The fp16 encoder and decoder both produce correct output on the Neural Engine. The front-end runs in FP32 on the CPU because the power spectrum and log exceed the FP16 range. Run the encoder and decoder with
MLModelConfiguration.computeUnits = .cpuAndNeuralEngine.
Conversion notes
Two SANM-specific fixes were required for fp16/ANE under bucket padding (see the
conversion repo): an fp16-safe attention mask fill (-inf → -1e4), and building
the encoder/decoder pad-masks from the input tensor's seq dim (so
EnumeratedShapes generalize) rather than from lengths.max().
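To illustrate the two fixes together, the sketch below builds a pad mask from the padded sequence length (the tensor's seq dim) plus the true length, then applies a finite -1e4 fill before softmax. It runs in double precision and only demonstrates the logic, not fp16 numerics; -1e4 is well inside the fp16 finite range (largest magnitude 65504). The helper names are hypothetical and this is not the conversion repo's code.

```python
import math

MASK_FILL = -1e4  # finite stand-in for -inf, representable in fp16


def pad_mask(seq_len, valid_len):
    """True for real positions, False for padding; seq_len is the padded size."""
    return [i < valid_len for i in range(seq_len)]


def masked_softmax(scores, mask, fill=MASK_FILL):
    """Softmax over scores with masked positions replaced by a finite fill."""
    filled = [s if keep else fill for s, keep in zip(scores, mask)]
    peak = max(filled)                       # subtract the max for stability
    exps = [math.exp(v - peak) for v in filled]
    total = sum(exps)
    return [e / total for e in exps]


# Four attention scores where only the first two positions are real.
weights = masked_softmax([1.0, 2.0, 3.0, 4.0], pad_mask(4, 2))
```

Here the padded positions receive zero weight even though their raw scores were the largest, and the mask length follows whichever bucket the input was padded to rather than the longest real sequence in the batch.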
Benchmark: AISHELL-1 test (CoreML on ANE)
Full test set (7,176 utts), full-CoreML pipeline on M5 Pro ANE:
| Precision | size (enc+dec) | CER | median RTFx | peak RAM |
|---|---|---|---|---|
| fp16 (default) | 411 MB | 2.12% | 85× | 0.38 GB |
| int8 | 207 MB | 2.12% | 84× | 0.24 GB |
Official Paraformer-large AISHELL-1 CER: 1.95% (the ~0.17 pp gap is attributed to fp16 and the fixed-shape decoder padding). int8 weight quantization is accuracy-neutral (CER unchanged) at roughly half the model size and about a third less peak RAM.
The result is within about 0.17 pp of the published Paraformer-large AISHELL-1 number, which indicates that the conversion (front-end + encoder + CIF + decoder) is faithful.
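For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how they are commonly computed: CER as character-level edit distance divided by reference length, pooled over the corpus, and RTFx as audio duration divided by processing time. The exact text normalization and aggregation used for the numbers above are not stated in this card, so treat these helpers (all hypothetical names) as definitions rather than the benchmark script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def corpus_cer(pairs):
    """pairs: iterable of (reference, hypothesis) strings."""
    pairs = list(pairs)
    errors = sum(edit_distance(r, h) for r, h in pairs)
    chars = sum(len(r) for r, _ in pairs)
    return errors / chars


def rtfx(audio_seconds, wall_seconds):
    """Inverse real-time factor: values above 1 mean faster than real time."""
    return audio_seconds / wall_seconds
```

With these definitions, one dropped character across seven reference characters gives a CER of 1/7, and transcribing 85 seconds of audio in one second gives an RTFx of 85.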
License & attribution
Weights derive from FunASR's Paraformer-large; the upstream license applies. This repo is a format conversion only (no retraining). See FunASR.