---
license: other
license_name: paraformer-upstream
license_link: https://github.com/modelscope/FunASR
language:
- zh
library_name: coreml
tags:
- coreml
- ane
- speech-recognition
- paraformer
- funasr
- fluidaudio
pipeline_tag: automatic-speech-recognition
---

# Paraformer-large (zh): CoreML (Apple Neural Engine)

CoreML conversion of FunASR's **Paraformer-large** (Mandarin Chinese) for on-device
inference on Apple Silicon with [FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio).

Paraformer is a **non-autoregressive** ASR model: a SANM encoder, a CIF predictor
that emits one acoustic-embedding token per output character, and a parallel
(single-pass) decoder. Upstream:
[iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch](https://www.modelscope.cn/models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch).

## Files (3 CoreML stages + host CIF)

| File | Precision | Compute unit | Size | Role |
|------|-----------|--------------|------|------|
| `ParaformerPreprocessor.mlmodelc` | FP32 | CPU | 3 MB | front-end: waveform → 560-d LFR features |
| `ParaformerEncoder.mlmodelc` | FP16 | **ANE** | 302 MB | SANM encoder (enumerated buckets `[128,256,512,1024,1800]`) |
| `ParaformerDecoder.mlmodelc` | FP16 | **ANE** | 109 MB | parallel decoder (enc 512, tokens 128) |
| `vocab.json` | - | - | - | 8404 CharTokenizer tokens (array form) |

The **CIF predictor** runs on the host between the encoder and the decoder,
because it emits a *dynamic* token count that a fixed-shape CoreML graph can't
express. It is small: conv1d + linear + sigmoid, followed by integrate-and-fire.
A numpy reference (`cif_numpy.py`) in the conversion repo serves as the blueprint
for the Swift implementation.
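
The integrate-and-fire step can be sketched in a few lines of plain Python. This is a minimal illustration, not the contents of `cif_numpy.py`: it assumes a firing threshold of 1.0 and per-frame weights already produced by the sigmoid, and it ignores any tail handling the upstream predictor may apply. The function and parameter names are hypothetical.

```python
def cif_fire(alphas, hidden, threshold=1.0):
    """Integrate per-frame weights and emit one embedding each time they reach threshold.

    alphas: T floats in [0, 1] (assumed to be the sigmoid outputs)
    hidden: T frame vectors of equal length (encoder outputs)
    Returns a list of fired embeddings; its length is the dynamic token count.
    """
    dim = len(hidden[0]) if hidden else 0
    acc = 0.0                 # weight integrated since the last fire
    frame = [0.0] * dim       # weighted sum of frames since the last fire
    fired = []
    for a, h in zip(alphas, hidden):
        if acc + a < threshold:
            # Not enough weight yet: keep integrating.
            acc += a
            frame = [f + a * x for f, x in zip(frame, h)]
        else:
            # Fire: spend just enough of this frame's weight to reach the threshold...
            used = threshold - acc
            fired.append([f + used * x for f, x in zip(frame, h)])
            # ...and carry the remainder into the next token.
            rest = a - used
            acc = rest
            frame = [rest * x for x in h]
    return fired


embeds = cif_fire([0.5, 0.5, 0.25, 0.75], [[1.0], [3.0], [2.0], [4.0]])
token_count = len(embeds)
```

Because the number of fired embeddings depends on the input, the result has to be padded or truncated on the host before it is handed to the fixed-shape decoder.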

## Pipeline

```
waveform → [Preprocessor fp32/CPU] → features [1,T,560]
        → [Encoder fp16/ANE] → enc_out [1,T,512]
        → [host CIF] → acoustic_embeds [1,L,512], token_count L
        → [Decoder fp16/ANE] → logits [1,L,8404]
        → argmax per token → drop sos(1)/eos(2)/blank(0) → CharTokenizer
```
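
The last pipeline stage (argmax, dropping special tokens, mapping ids to characters) could look roughly like the sketch below. The special ids come from the diagram above; the function name and the tiny vocabulary are placeholders for illustration, and the real `vocab.json` holds 8404 entries.

```python
SPECIAL_IDS = {0, 1, 2}  # blank(0), sos(1), eos(2), as listed in the pipeline


def greedy_decode(logits, vocab):
    """Pick the highest-scoring id per token, drop special ids, join the characters.

    logits: L rows of per-vocabulary scores (only the first token_count rows
            of the decoder output should be passed in)
    vocab:  list of token strings indexed by id (the array form of vocab.json)
    """
    ids = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return "".join(vocab[i] for i in ids if i not in SPECIAL_IDS)


# Placeholder vocabulary: the token strings here are made up for the example.
toy_vocab = ["<blank>", "<s>", "</s>", "ni", "hao"]
toy_logits = [
    [0.1, 0.0, 0.0, 2.0, 0.3],  # argmax -> id 3
    [0.0, 0.0, 0.0, 0.1, 3.0],  # argmax -> id 4
    [0.0, 0.0, 5.0, 0.0, 0.0],  # argmax -> id 2 (eos), dropped
]
text = greedy_decode(toy_logits, toy_vocab)
```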

> Both the FP16 encoder and the FP16 decoder produce correct output on the Neural
> Engine. The front-end runs in FP32 on the CPU because the power-spectrum and log
> steps exceed the FP16 range. Run the encoder and decoder with
> `MLModelConfiguration.computeUnits = .cpuAndNeuralEngine`.

## Conversion notes

Two SANM-specific fixes were required for fp16/ANE under bucket padding (see the
conversion repo): an fp16-safe attention mask fill (`-inf` → `-1e4`), and building
the encoder/decoder pad-masks from the **input tensor's seq dim** (so that
`EnumeratedShapes` generalize) rather than from `lengths.max()`.
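
A host-side view of the same idea, as a rough sketch: pick the smallest enumerated bucket that fits, pad to it, and size the mask from the padded tensor itself rather than from the longest valid length. The bucket list and the `-1e4` fill come from this card; zero-padding, the additive-bias form of the mask, and all function names are assumptions made for illustration.

```python
BUCKETS = [128, 256, 512, 1024, 1800]  # encoder buckets from the Files table
MASK_FILL = -1e4  # finite fill from the notes above; well inside the fp16 range (max ~65504)


def pick_bucket(num_frames, buckets=BUCKETS):
    """Return the smallest enumerated sequence length that fits num_frames."""
    for size in sorted(buckets):
        if num_frames <= size:
            return size
    raise ValueError(f"{num_frames} frames exceeds the largest bucket {max(buckets)}")


def pad_features(features, feat_dim=560):
    """Pad [T, feat_dim] features up to the chosen bucket (zero-padding is an assumption)."""
    bucket = pick_bucket(len(features))
    padding = [[0.0] * feat_dim for _ in range(bucket - len(features))]
    return features + padding


def attention_bias(padded, valid_len):
    """Additive bias: 0 for real frames, MASK_FILL for padding.

    The length comes from the padded tensor's own sequence dimension, so the
    mask always matches whichever bucket is in use.
    """
    seq_len = len(padded)
    return [0.0 if i < valid_len else MASK_FILL for i in range(seq_len)]


feats = [[0.0] * 560 for _ in range(100)]
padded = pad_features(feats)
bias = attention_bias(padded, valid_len=100)
```

A finite fill still drives the softmax weight of padded positions to effectively zero, while avoiding infinities in fp16 arithmetic.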

## Benchmark: AISHELL-1 test (CoreML on ANE)

Full test set (7,176 utts), full-CoreML pipeline on M5 Pro ANE:

| Precision | size (enc+dec) | CER | median RTFx | peak RAM |
|-----------|----------------|-----|-------------|----------|
| fp16 (default) | 411 MB | **2.12%** | 85× | 0.38 GB |
| int8 | 207 MB | **2.12%** | 84× | 0.24 GB |

The official Paraformer-large result on AISHELL-1 is ≈ 1.95% CER; the ~0.17 pp gap
is attributed to fp16 and the fixed-shape decoder padding. int8 weight
quantization is accuracy-neutral (CER unchanged) at about half the size (207 MB
vs 411 MB) and with lower peak RAM (0.24 GB vs 0.38 GB).

The result is within ~0.17 pp of the published Paraformer-large AISHELL-1 number,
which indicates that the conversion (front-end + encoder + CIF + decoder) is
faithful.
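
For readers who want to check a CER figure on their own transcripts, the metric can be computed as character-level edit distance divided by the number of reference characters. The sketch below pools errors over the whole corpus; whether the numbers in the table were pooled this way or averaged per utterance is not stated here, so treat the aggregation as an assumption. Function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free if the characters match)
            ))
        prev = cur
    return prev[-1]


def corpus_cer(pairs):
    """Total character errors over total reference characters for (ref, hyp) pairs."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    chars = sum(len(ref) for ref, _ in pairs)
    return errors / chars


cer = corpus_cer([("abcd", "abed"), ("abcd", "abcd")])  # 1 error over 8 characters
```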

## License & attribution

Weights derive from FunASR's Paraformer-large; the upstream license applies. This
repo is a format conversion only (no retraining). See
[FunASR](https://github.com/modelscope/FunASR).