---
license: other
license_name: paraformer-upstream
license_link: https://github.com/modelscope/FunASR
language:
- zh
library_name: coreml
tags:
- coreml
- ane
- speech-recognition
- paraformer
- funasr
- fluidaudio
pipeline_tag: automatic-speech-recognition
---

# Paraformer-large (zh): CoreML (Apple Neural Engine)

CoreML conversion of FunASR's Paraformer-large (Mandarin Chinese) for on-device
inference on Apple Silicon, built for use with
[FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio).

Paraformer is a non-autoregressive ASR model with three parts: a SANM encoder,
a CIF predictor that emits one acoustic embedding per output character, and a
parallel (single-pass) decoder. Upstream model:
[iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch](https://www.modelscope.cn/models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch).

## Files (3 CoreML stages + host CIF)

| File | Precision | Compute unit | Size | Role |
|------|-----------|--------------|------|------|
| `ParaformerPreprocessor.mlmodelc` | FP32 | CPU | 3 MB | front-end: waveform → 560-d LFR features |
| `ParaformerEncoder.mlmodelc` | FP16 | ANE | 302 MB | SANM encoder (enumerated buckets `[128,256,512,1024,1800]`) |
| `ParaformerDecoder.mlmodelc` | FP16 | ANE | 109 MB | parallel decoder (enc 512, tokens 128) |
| `vocab.json` | - | - | - | 8404 CharTokenizer tokens (array form) |
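
Because the encoder only accepts the enumerated sequence lengths listed in the table, a feature sequence of `T` frames has to be padded to one of them before inference. The following is a minimal sketch of one plausible approach: pick the smallest bucket that fits and zero-pad. The helper names `pick_bucket` and `pad_features` are hypothetical, and the handling of inputs longer than 1800 frames (an error here) is an assumption, since this README does not say how such inputs are treated.

```python
from bisect import bisect_left

# Enumerated encoder sequence lengths, taken from the file table.
ENCODER_BUCKETS = [128, 256, 512, 1024, 1800]
FEATURE_DIM = 560  # LFR feature width produced by the preprocessor


def pick_bucket(num_frames, buckets=ENCODER_BUCKETS):
    """Return the smallest bucket that can hold `num_frames` frames."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    idx = bisect_left(buckets, num_frames)
    if idx == len(buckets):
        # Assumption: longer inputs must be chunked by the caller.
        raise ValueError(f"{num_frames} frames exceeds the largest bucket")
    return buckets[idx]


def pad_features(features, bucket):
    """Zero-pad a [T][560] list of frames up to [bucket][560]."""
    padding = [[0.0] * FEATURE_DIM for _ in range(bucket - len(features))]
    return features + padding


# Toy input: 300 frames land in the 512-frame bucket.
frames = [[0.1] * FEATURE_DIM for _ in range(300)]
bucket = pick_bucket(len(frames))
padded = pad_features(frames, bucket)
print(bucket, len(padded))
```

The number of valid frames (300 in this toy case) needs to be kept alongside the padded tensor so that later stages can ignore the padding.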

The CIF predictor runs on the host, between the encoder and the decoder,
because it emits a dynamic token count that a fixed-shape CoreML graph can't
express. It is small: conv1d + linear + sigmoid, followed by
integrate-and-fire. A numpy reference (`cif_numpy.py`) in the conversion repo
serves as the blueprint for the Swift implementation.
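
To make the integrate-and-fire step concrete, here is a minimal pure-Python sketch of the generic CIF firing rule. It is not the contents of `cif_numpy.py`: the function name `cif_fire` is hypothetical, the batch dimension is dropped, the conv1d + linear + sigmoid that produce the weights are omitted, and any tail handling the real predictor may apply is left out. It also assumes each weight is at most the threshold, so a frame fires at most once.

```python
def cif_fire(hidden, alphas, threshold=1.0):
    """Integrate per-frame weights and emit one embedding per firing.

    hidden: list of T frames, each a list of D floats (encoder output).
    alphas: list of T weights in [0, threshold] (predictor output).
    Returns a list of L embeddings; L is decided by the weights.
    """
    dim = len(hidden[0])
    accumulated = 0.0      # weight integrated since the last firing
    frame = [0.0] * dim    # weighted sum of frames since the last firing
    embeddings = []
    for h, alpha in zip(hidden, alphas):
        room = threshold - accumulated  # weight still needed to fire
        accumulated += alpha
        if accumulated >= threshold:
            # Spend only `room` of this frame's weight on the current token
            frame = [f + room * x for f, x in zip(frame, h)]
            embeddings.append(frame)
            # and carry the remainder into the next token.
            accumulated -= threshold
            leftover = alpha - room
            frame = [leftover * x for x in h]
        else:
            frame = [f + alpha * x for f, x in zip(frame, h)]
    return embeddings


# Toy example: 4 frames of width 2 collapse into 2 token embeddings.
hidden = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
alphas = [0.5, 0.5, 0.25, 0.75]
tokens = cif_fire(hidden, alphas)
print(len(tokens), tokens)
```

The output length depends on the data (here the weights sum to 2, so two embeddings fire), which is the property that keeps this step outside the fixed-shape CoreML graphs.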

## Pipeline

```
waveform → [Preprocessor fp32/CPU] → features [1,T,560]
→ [Encoder fp16/ANE] → enc_out [1,T,512]
→ [host CIF] → acoustic_embeds [1,L,512], token_count L
→ [Decoder fp16/ANE] → logits [1,L,8404]
→ argmax per token → drop sos(1)/eos(2)/blank(0) → CharTokenizer
```

> Both the FP16 encoder and the FP16 decoder produce correct output on the
> Neural Engine. The front-end runs in FP32 on the CPU because its
> power-spectrum and log steps exceed the FP16 range. Run the encoder and
> decoder with `MLModelConfiguration.computeUnits = .cpuAndNeuralEngine`.
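
The last pipeline stage (argmax, special-token removal, token lookup) can be sketched in a few lines of Python. This is an illustration under assumptions rather than the FluidAudio implementation: the function name `greedy_decode` is hypothetical, the toy vocabulary stands in for the array loaded from `vocab.json`, only the `L` valid logit rows are passed in, and plain concatenation of tokens is assumed to be sufficient.

```python
SPECIAL_IDS = {0, 1, 2}  # blank(0), sos(1), eos(2), as listed in the pipeline


def greedy_decode(logits, vocab):
    """Turn L rows of per-token scores into text.

    logits: list of L rows, each with one score per vocabulary entry.
    vocab:  list mapping token id -> token string (array form).
    """
    # Argmax per token: index of the highest score in each row.
    ids = [max(range(len(row)), key=row.__getitem__) for row in logits]
    # Drop blank/sos/eos, then look the remaining ids up in the vocabulary.
    kept = [i for i in ids if i not in SPECIAL_IDS]
    return "".join(vocab[i] for i in kept)


# Toy 5-entry vocabulary (made up for illustration; the real one has 8404).
vocab = ["<blank>", "<s>", "</s>", "你", "好"]
logits = [
    [0.0, 9.0, 0.0, 0.0, 0.0],  # argmax 1 -> sos, dropped
    [0.1, 0.0, 0.0, 5.0, 1.0],  # argmax 3
    [0.0, 0.0, 0.2, 1.0, 4.0],  # argmax 4
    [0.0, 0.0, 7.0, 0.0, 0.0],  # argmax 2 -> eos, dropped
]
text = greedy_decode(logits, vocab)
print(text)
```

Because the decoder is non-autoregressive, every row is decoded independently and no beam search or token history is involved in this step.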

## Conversion notes

Two SANM-specific fixes were required for FP16/ANE execution under bucket
padding (see the conversion repo). First, the attention mask fill was made
FP16-safe (`-inf` → `-1e4`). Second, the encoder and decoder pad masks are
built from the input tensor's sequence dimension rather than `lengths.max()`,
so that they generalize across `EnumeratedShapes`.
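
The sketch below illustrates both ideas on a single attention row in plain Python. It is a generic illustration, not the conversion code: the helper names `pad_mask` and `masked_softmax` are hypothetical, and the NaN case shown is only one common reason to avoid `-inf` as a fill value; the README does not describe the exact failure seen on the ANE.

```python
import math

MASK_FILL = -1e4  # finite stand-in for -inf, well inside the fp16 range (max 65504)


def pad_mask(seq_dim, valid_len):
    """True for real frames, False for padding.

    The mask width follows the tensor's sequence dimension (the bucket size),
    not the longest valid length in the batch.
    """
    return [i < valid_len for i in range(seq_dim)]


def masked_softmax(scores, mask, fill=MASK_FILL):
    """Softmax over one attention row after replacing masked scores with `fill`."""
    filled = [s if keep else fill for s, keep in zip(scores, mask)]
    peak = max(filled)
    exps = [math.exp(v - peak) for v in filled]
    total = sum(exps)
    return [e / total for e in exps]


# A 6-frame utterance padded to an 8-frame bucket (toy sizes).
scores = [0.0] * 8
mask = pad_mask(seq_dim=len(scores), valid_len=6)
weights = masked_softmax(scores, mask)

# A row with no valid positions: -inf yields NaN, the finite fill does not.
empty = [False] * 8
nan_row = masked_softmax(scores, empty, fill=float("-inf"))
safe_row = masked_softmax(scores, empty)
print(weights, nan_row, safe_row)
```

With the finite fill, padded positions still receive effectively zero attention weight, while a fully masked row degrades to a uniform distribution instead of NaN.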

## Benchmark: AISHELL-1 test (CoreML on ANE)

Full test set (7,176 utterances), full-CoreML pipeline on M5 Pro ANE:

| Precision | Size (enc + dec) | CER | Median RTFx | Peak RAM |
|-----------|------------------|-----|-------------|----------|
| fp16 (default) | 411 MB | 2.12% | 85× | 0.38 GB |
| int8 | 207 MB | 2.12% | 84× | 0.24 GB |

The official Paraformer-large result on AISHELL-1 is ≈ 1.95% CER; the ~0.17 pp
gap comes from FP16 and the fixed-shape decoder padding. int8 weight
quantization is accuracy-neutral (CER unchanged) at roughly half the size and
about 37% less peak RAM (0.24 GB vs 0.38 GB).

The result is within ~0.17 pp of the published Paraformer-large AISHELL-1
number, which indicates that the conversion (front-end + encoder + CIF +
decoder) is faithful.
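
For readers who want to compute comparable metrics, the sketch below shows the usual definitions: CER as total character edit distance over total reference characters, and RTFx as audio duration divided by processing time. The scoring script behind the table is not part of this README, so the corpus-level aggregation, the per-utterance median, the absence of text normalization, and all function names here are assumptions.

```python
from statistics import median


def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free on a match)
            ))
        prev = cur
    return prev[-1]


def corpus_cer(pairs):
    """Total character errors divided by total reference characters."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    chars = sum(len(ref) for ref, _ in pairs)
    return errors / chars


def median_rtfx(timings):
    """timings: (audio_seconds, processing_seconds) per utterance."""
    return median(audio / proc for audio, proc in timings)


# Toy data: one substitution across 10 reference characters.
pairs = [("abcd", "abed"), ("abcdef", "abcdef")]
timings = [(8.0, 0.125), (4.0, 0.0625), (2.0, 0.25)]
print(corpus_cer(pairs), median_rtfx(timings))
```

An RTFx of 85× under this definition means one second of audio is processed in roughly 1/85 of a second.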

## License & attribution

Weights derive from FunASR's Paraformer-large; the upstream license applies. This
repo is a format conversion only (no retraining). See
[FunASR](https://github.com/modelscope/FunASR).