---
license: other
license_name: paraformer-upstream
license_link: https://github.com/modelscope/FunASR
language:
  - zh
library_name: coreml
tags:
  - coreml
  - ane
  - speech-recognition
  - paraformer
  - funasr
  - fluidaudio
pipeline_tag: automatic-speech-recognition
---

Paraformer-large (zh) — CoreML (Apple Neural Engine)

CoreML conversion of FunASR's Paraformer-large (Mandarin Chinese) for on-device inference on Apple Silicon, intended for use with FluidInference/FluidAudio.

Paraformer is a non-autoregressive ASR model: a SANM encoder, a CIF predictor that emits one acoustic-embedding token per output character, and a parallel (single-pass) decoder. Upstream: iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch.

Files (3 CoreML stages + host CIF)

| File | Precision | Compute unit | Size | Role |
|---|---|---|---|---|
| `ParaformerPreprocessor.mlmodelc` | FP32 | CPU | 3 MB | front-end: waveform → 560-d LFR features |
| `ParaformerEncoder.mlmodelc` | FP16 | ANE | 302 MB | SANM encoder (enumerated buckets `[128,256,512,1024,1800]`) |
| `ParaformerDecoder.mlmodelc` | FP16 | ANE | 109 MB | parallel decoder (enc 512, tokens 128) |
| `vocab.json` | — | — | — | 8404 CharTokenizer tokens (array form) |

The CIF predictor runs on the host between the encoder and the decoder, because it emits a dynamic token count that a fixed-shape CoreML graph can't express. It is a small conv1d + linear + sigmoid stage followed by integrate-and-fire; a numpy reference (`cif_numpy.py`) in the conversion repo serves as the blueprint for the Swift implementation.
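The integrate-and-fire step can be sketched in a few lines of plain Python. This is an illustrative sketch, not the repo's `cif_numpy.py`: the function name `cif_fire`, the firing threshold of 1.0, and the decision to drop any leftover weight at the end of the sequence are all assumptions, and it takes the per-frame weights (the sigmoid output) as given rather than computing them.

```python
def cif_fire(hidden, alphas, threshold=1.0):
    """Integrate-and-fire over encoder frames (hypothetical helper).

    hidden: list of T frames, each a list of D floats (encoder output).
    alphas: list of T weights in [0, 1] (predictor sigmoid output).
    Returns the list of fired embeddings; its length is the token count L.
    Assumes each alpha <= threshold, so a frame fires at most once.
    """
    if not hidden:
        return []
    dim = len(hidden[0])
    fired = []
    acc_weight = 0.0          # weight integrated since the last fire
    acc_frame = [0.0] * dim   # weighted sum of frames since the last fire
    for frame, alpha in zip(hidden, alphas):
        if acc_weight + alpha < threshold:
            # Not enough weight yet: keep integrating.
            acc_weight += alpha
            acc_frame = [a + alpha * h for a, h in zip(acc_frame, frame)]
        else:
            # Fire: spend just enough of this frame's weight to reach the
            # threshold, and carry the remainder into the next token.
            used = threshold - acc_weight
            fired.append([a + used * h for a, h in zip(acc_frame, frame)])
            leftover = alpha - used
            acc_weight = leftover
            acc_frame = [leftover * h for h in frame]
    # Assumption: leftover weight below the threshold is discarded here.
    return fired
```

Because the number of fired embeddings depends on the weights, the output length varies per utterance, which is why this step sits on the host rather than in a fixed-shape graph.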

Pipeline

waveform → [Preprocessor fp32/CPU] → features [1,T,560]
        → [Encoder fp16/ANE] → enc_out [1,T,512]
        → [host CIF] → acoustic_embeds [1,L,512], token_count L
        → [Decoder fp16/ANE] → logits [1,L,8404]
        → argmax per token → drop sos(1)/eos(2)/blank(0) → CharTokenizer
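The last pipeline step (argmax, drop special ids, map to characters) might look like the sketch below. The special ids 0, 1 and 2 come from the pipeline above; the function name `greedy_decode`, the toy vocabulary in the usage example, and joining tokens with no separator are assumptions, and the real CharTokenizer may treat non-Chinese tokens differently.

```python
SPECIAL_IDS = {0, 1, 2}  # blank(0), sos(1), eos(2), as listed in the pipeline


def greedy_decode(logits, vocab):
    """Turn decoder logits into text (hypothetical helper).

    logits: list of L rows, each a list of vocabulary scores.
    vocab:  list of token strings indexed by id (the array form of vocab.json).
    """
    # Non-autoregressive decoding: every position is decided independently.
    ids = [max(range(len(row)), key=row.__getitem__) for row in logits]
    # Assumption: tokens are concatenated without separators.
    return "".join(vocab[i] for i in ids if i not in SPECIAL_IDS)
```

For example, with a toy five-entry vocabulary `["<blank>", "<s>", "</s>", "你", "好"]`, rows whose maxima fall at ids 3, 2 and 4 decode to `你好`, since id 2 (eos) is dropped.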

The fp16 encoder and decoder both produce correct output on the Neural Engine. The front-end runs in FP32 on the CPU because the power-spectrum and log steps exceed the FP16 range. Run the encoder and decoder with `MLModelConfiguration.computeUnits = .cpuAndNeuralEngine`.
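Since the encoder is exported with a fixed set of sequence lengths (128, 256, 512, 1024, 1800), the host has to pad each feature sequence up to one of them before calling it. A minimal sketch of that selection follows; the helper names, zero-padding, and raising an error for inputs longer than the largest bucket are assumptions, not documented behavior of this repo.

```python
BUCKETS = (128, 256, 512, 1024, 1800)  # enumerated encoder lengths from the files table


def pick_bucket(num_frames, buckets=BUCKETS):
    """Return the smallest bucket that fits num_frames (hypothetical helper)."""
    for size in buckets:
        if num_frames <= size:
            return size
    # Assumption: longer inputs need handling outside this sketch.
    raise ValueError(f"{num_frames} frames exceeds the largest bucket {buckets[-1]}")


def pad_features(features, bucket, feat_dim=560):
    """Zero-pad a [T, feat_dim] list of frames up to [bucket, feat_dim]."""
    padding = [[0.0] * feat_dim for _ in range(bucket - len(features))]
    return features + padding
```

The caller should keep the true frame count alongside the padded tensor, since the padded positions must be masked out or ignored downstream.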

Conversion notes

Two SANM-specific fixes were required for fp16/ANE under bucket padding (see the conversion repo): an fp16-safe attention mask fill (`-inf` → `-1e4`), and building the encoder/decoder pad-masks from the input tensor's seq dim rather than from `lengths.max()`, so that `EnumeratedShapes` generalize.
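Both fixes can be illustrated in plain Python. The sketch below builds the pad mask from the padded sequence length and applies a finite fill before softmax. It shows one failure mode that a finite fill avoids (a fully masked row turning into NaN), which is not necessarily the exact issue hit during this conversion; the helper names are hypothetical. The value `-1e4` is well inside the fp16 range (largest finite magnitude 65504).

```python
import math

MASK_FILL = -1e4  # finite stand-in for -inf, as in the note above


def pad_mask(lengths, seq_len):
    """1.0 for real frames, 0.0 for padding.

    seq_len is taken from the padded tensor's sequence dimension, so the
    mask always matches the bucket size rather than the longest utterance.
    """
    return [[1.0 if t < n else 0.0 for t in range(seq_len)] for n in lengths]


def masked_softmax(scores, mask, fill=MASK_FILL):
    """Softmax over one row of attention scores with masked positions filled."""
    filled = [s if m else fill for s, m in zip(scores, mask)]
    peak = max(filled)
    exps = [math.exp(v - peak) for v in filled]
    total = sum(exps)
    return [e / total for e in exps]
```

With `fill=float("-inf")`, a row whose positions are all masked computes `-inf - (-inf)`, which is NaN, and the NaN propagates through the softmax; with the finite fill the same row degrades to a uniform distribution instead.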

Benchmark — AISHELL-1 test (CoreML on ANE)

Full test set (7,176 utts), full-CoreML pipeline on M5 Pro ANE:

| Precision | Size (enc+dec) | CER | Median RTFx | Peak RAM |
|---|---|---|---|---|
| fp16 (default) | 411 MB | 2.12% | 85× | 0.38 GB |
| int8 | 207 MB | 2.12% | 84× | 0.24 GB |

The official Paraformer-large result on AISHELL-1 is ≈ 1.95% CER; the ~0.17 pp gap is attributed to fp16 and the fixed-shape decoder padding. int8 weight quantization is accuracy-neutral (CER unchanged) at roughly half the size (207 vs 411 MB) and lower peak RAM (0.24 vs 0.38 GB).
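For readers who want to check numbers like these on their own data, the two metrics can be computed as sketched below. CER is character-level edit distance divided by reference length, and RTFx is audio duration divided by processing time (higher is faster). The helper names are hypothetical, and pooling errors over the whole corpus is an assumption; this card does not state how its CER was aggregated.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def corpus_cer(pairs):
    """CER over (reference, hypothesis) pairs, pooled across the corpus."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    chars = sum(len(ref) for ref, _ in pairs)
    return errors / chars


def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    return audio_seconds / processing_seconds
```

A median RTFx like the one in the table would then be the median of `rtfx(...)` over individual utterances, for instance via `statistics.median`.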

This is within ~0.17 pp of the published Paraformer-large AISHELL-1 number, which indicates that the conversion (front-end + encoder + CIF + decoder) is faithful.

License & attribution

Weights derive from FunASR's Paraformer-large; the upstream license applies. This repo is a format conversion only (no retraining). See FunASR (https://github.com/modelscope/FunASR).