# Parakeet CTC 0.6B zh-CN CoreML - Full Pipeline

Complete end-to-end CoreML conversion of NVIDIA's Parakeet-CTC-0.6B Mandarin Chinese (Simplified) model for on-device speech recognition on Apple Silicon.

Pure CoreML pipeline: no PyTorch/NeMo dependencies are required at inference time.
## Model Description
- Architecture: FastConformer-CTC (Parakeet-CTC-XL) - Full pipeline
- Parameters: ~600M (encoder) + CTC decoder head
- Languages: Mandarin Chinese (Simplified) + English
- Training Data: 17,000+ hours of zh-CN and en-US speech
- Vocabulary: 7000 SentencePiece tokens + blank token
- Input: Raw audio waveform [1, 240000] (exactly 15 seconds @ 16kHz; shorter audio must be zero-padded)
- Output: Transcribed Chinese text
## Pipeline Components

1. Preprocessor (816KB) - Audio → mel spectrogram
2. Encoder - Mel spectrogram → encoder features [1, 1024, 188]
   - v1 (fp32): 1.1GB, 10.45% CER
   - v2 (int8): 0.55GB, 10.54% CER ⭐ Recommended
3. Decoder (14MB) - Encoder features → CTC logits [1, 188, 7001]
## Performance

Benchmarked on 100 FLEURS Mandarin Chinese test samples (1,104.8s of audio):

### Encoder v2 (int8 - Recommended)
| Metric | Value |
|---|---|
| Mean CER (normalized) | 10.54% |
| Median CER | 5.97% |
| Latency (full pipeline) | 49.3ms |
| Real-Time Factor | 229x |
| Model Size | 0.55GB (2x smaller than fp32) |
| Samples < 10% CER | 64.0% |
### Encoder v1 (fp32)
| Metric | Value |
|---|---|
| Mean CER (normalized) | 10.45% |
| Median CER | 5.97% |
| Latency (full pipeline) | 53.2ms |
| Real-Time Factor | 214.7x |
| Model Size | 1.1GB |
### Quantization Comparison
| Aspect | v1 (fp32) | v2 (int8) | Difference |
|---|---|---|---|
| Size | 1.1GB | 0.55GB | -50% ✅ |
| Mean CER | 10.45% | 10.54% | +0.09% ✅ |
| Median CER | 5.97% | 5.97% | 0.00% ✅ |
| Latency | 53.2ms | 49.3ms | -7.3% faster ✅ |
Recommendation: use v2 (int8) for production. It is 2x smaller and slightly faster, with near-identical accuracy (+0.09% mean CER).
### CER Distribution (v2 int8)
- <5% CER: 44.0% (44 samples) - Excellent transcription
- 5-10% CER: 20.0% (20 samples) - Very good
- 10-20% CER: 17.0% (17 samples) - Good
- 20-30% CER: 10.0% (10 samples) - Acceptable
- >30% CER: 9.0% (9 samples) - Poor (mostly foreign names/terms)
Note on CER: We report normalized CER, which removes punctuation and normalizes number formats (digits → Chinese characters). Raw CER, including these formatting differences, is ~19-20%.
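As a rough illustration of the metric, normalized CER can be computed as character-level Levenshtein distance divided by reference length, after stripping punctuation and whitespace. This is a minimal sketch, not the project's actual scoring script, and it omits the digit-to-Chinese-character step:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Strip punctuation and whitespace before scoring
    (digit conversion omitted in this sketch)."""
    text = "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))
    return re.sub(r"\s+", "", text)

def cer(ref: str, hyp: str) -> float:
    """Character error rate = edit distance / reference length."""
    ref, hyp = normalize(ref), normalize(hyp)
    # Standard dynamic-programming Levenshtein distance over characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

print(cer("我爱北京天安门", "我 爱 北京 天安门 。"))  # 0.0
```

After normalization the spaces and the trailing "。" disappear, so the hypothesis matches the reference exactly.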
## Hardware Requirements
- Platform: macOS 14.0+ or iOS 17.0+
- Processor: Apple Silicon (M1/M2/M3/M4/A15+) recommended
- Neural Engine: Optimized for ANE execution
- Memory: ~600MB runtime (int8 encoder)
## Usage

### Swift (Recommended)
```swift
import CoreML
import AVFoundation

// Load models
let preprocessor = try MLModel(contentsOf: preprocessorURL)
let encoder = try MLModel(contentsOf: encoderV2URL)  // Use v2 (int8)
let decoder = try MLModel(contentsOf: decoderURL)

// Prepare audio (16kHz, mono, up to 15 seconds)
let audioArray: [Float] = ...  // Your audio samples
let audioPadded = pad(audioArray, to: 240000)  // Pad to 15s

// Step 1: Preprocessor
let preprocInput = PreprocessorInput(
    audio_signal: try MLMultiArray(audioPadded),
    audio_length: try MLMultiArray([240000])
)
let preprocOutput = try preprocessor.prediction(from: preprocInput)

// Step 2: Encoder
let encoderInput = EncoderInput(
    audio_signal: preprocOutput.mel,
    length: preprocOutput.mel_length
)
let encoderOutput = try encoder.prediction(from: encoderInput)

// Step 3: Decoder (CTC head)
let decoderInput = DecoderInput(encoder_output: encoderOutput.encoder_output)
let decoderOutput = try decoder.prediction(from: decoderInput)

// Step 4: Greedy CTC decode
let logits = decoderOutput.ctc_logits
let text = ctcGreedyDecode(logProbs: logits, vocabulary: vocab, blankId: 7000)
print(text)  // "你好世界"
```
### Python (Validation/Testing)
```python
import json

import coremltools as ct
import numpy as np
import librosa

# Load models
preprocessor = ct.models.MLModel("Preprocessor.mlpackage")
encoder = ct.models.MLModel("Encoder-v2-int8.mlpackage")  # Use v2
decoder = ct.models.MLModel("Decoder.mlpackage")

# Load vocabulary (index -> token string mapping)
with open("vocab.json") as f:
    vocab = json.load(f)

# Load audio (16kHz mono), pad/trim to exactly 15 s
audio, sr = librosa.load("audio.wav", sr=16000, mono=True)
audio = np.pad(audio, (0, max(0, 240000 - len(audio))))[:240000]
audio = audio[np.newaxis, :].astype(np.float32)
audio_length = np.array([240000], dtype=np.int32)

# Step 1: Preprocessor
preproc_out = preprocessor.predict({
    "audio_signal": audio,
    "audio_length": audio_length,
})

# Step 2: Encoder
encoder_out = encoder.predict({
    "audio_signal": preproc_out["mel"],
    "length": preproc_out["mel_length"],
})

# Step 3: Decoder (CTC head)
decoder_out = decoder.predict({
    "encoder_output": encoder_out["encoder_output"],
})

# Step 4: Greedy CTC decode
logits = decoder_out["ctc_logits"][0]  # [188, 7001]
labels = np.argmax(logits, axis=-1)

# Collapse repeats and remove blanks (blank id = 7000)
decoded = []
prev = None
for label in labels:
    if label != 7000 and label != prev:
        decoded.append(label)
    prev = label

# Convert token ids to text
text = "".join(vocab[i] for i in decoded)
text = text.replace("▁", " ").strip()
print(text)
```
## Files

### Models (.mlpackage - source, .mlmodelc - compiled)
- Preprocessor - Audio preprocessing (mel spectrogram, 816KB)
- Encoder-v1-fp32 - FastConformer encoder (float32, 1.1GB)
- Encoder-v2-int8 - FastConformer encoder (int8 quantized, 0.55GB) ⭐ Recommended
- Decoder - CTC decoder head (linear projection, 14MB)
### Supporting Files

- `vocab.json` - 7000 SentencePiece tokens (index → string mapping)
- `pipeline_metadata.json` - Model specifications and benchmark results
Note: Both .mlpackage (uncompiled source) and .mlmodelc (compiled) versions are provided. Use .mlmodelc for faster loading.
## Text Normalization
For accurate CER evaluation on Chinese text:
- Remove punctuation: Model outputs punctuation, but many datasets don't include it
- Normalize numbers: Convert digits to Chinese characters (15 → 十五, 2011 → 二零一一)
- Normalize whitespace: Collapse multiple spaces
Example:

```text
Reference:  "我爱北京天安门"
Hypothesis: "我 爱 北京 天安门 。"
Normalized: "我爱北京天安门"  (both match after normalization)
```
See text_normalizer.py for implementation.
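The digit step described above can be sketched as follows. This is a hypothetical, simplified helper for illustration only (not the repo's `text_normalizer.py`): two-digit numbers are read positionally (15 → 十五), everything else digit by digit (2011 → 二零一一).

```python
import re

DIGITS = "零一二三四五六七八九"

def digits_to_chinese(num: str) -> str:
    """Simplified digit normalization (hypothetical sketch)."""
    if len(num) == 2 and num[0] != "0":
        # Positional reading: 15 -> 十五, 30 -> 三十
        tens = "" if num[0] == "1" else DIGITS[int(num[0])]
        ones = "" if num[1] == "0" else DIGITS[int(num[1])]
        return tens + "十" + ones
    # Digit-by-digit reading: 2011 -> 二零一一
    return "".join(DIGITS[int(d)] for d in num)

def normalize_numbers(text: str) -> str:
    """Replace every digit run in the text with its Chinese reading."""
    return re.sub(r"\d+", lambda m: digits_to_chinese(m.group()), text)

print(normalize_numbers("15"))    # 十五
print(normalize_numbers("2011"))  # 二零一一
```

A production normalizer would also need to handle measure words, decimals, and context-dependent readings, which this sketch ignores.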
## Quantization Details

The v2 (int8) encoder uses linear symmetric per-channel weight quantization via `coremltools.optimize`:

- Method: `linear_quantize_weights` with `mode="linear_symmetric"`
- Granularity: per-channel (better accuracy than per-tensor)
- Weight threshold: 512 elements (only large tensors are quantized)
- Accuracy impact: +0.09% mean CER (minimal degradation)
- Performance gain: 7.3% faster inference and 50% memory reduction
The int8 quantization provides excellent compression with virtually no accuracy loss, making it ideal for on-device deployment.
## Limitations

- Fixed input duration: requires exactly 15 seconds of audio (240,000 samples @ 16kHz). Shorter audio must be padded; longer audio must be split into chunks.
- Language: optimized for Mandarin Chinese (Simplified). Performance on Traditional Chinese or dialects may vary.
- Foreign words: struggles with transliterated foreign names (e.g., "Rolando Mendoza" → incorrect transliteration).
- Punctuation: while the model can output punctuation, it was not extensively trained for punctuation prediction.
- Apple Silicon required: models are optimized for the Apple Neural Engine. Performance on Intel Macs or non-Apple hardware will be degraded.
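To work around the fixed input duration, longer recordings can be split into fixed 15-second windows, zero-padding the last one. A minimal sketch, assuming numpy and non-overlapping windows (real deployments may want overlapping windows and transcript stitching to avoid cutting words at chunk boundaries):

```python
import numpy as np

CHUNK = 240_000  # 15 s at 16 kHz

def chunk_audio(audio: np.ndarray) -> list[np.ndarray]:
    """Split arbitrary-length mono audio into fixed 15 s windows,
    zero-padding the final (or only) chunk to exactly 240,000 samples."""
    chunks = []
    for start in range(0, max(len(audio), 1), CHUNK):
        piece = audio[start:start + CHUNK]
        piece = np.pad(piece, (0, CHUNK - len(piece)))
        chunks.append(piece.astype(np.float32))
    return chunks

print(len(chunk_audio(np.zeros(400_000))))  # 2
```

Each returned chunk can then be fed through the preprocessor/encoder/decoder pipeline exactly as in the usage examples above.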
## Conversion Details

Converted from the NVIDIA NeMo checkpoint:

- Source: `parakeet-ctc-riva-0-6b-unified-zh-cn_vtrainable_v3.0`
- Method: PyTorch → TorchScript → CoreML
- Compute units: CPU + Neural Engine
- Precision: float32 (v1), int8 (v2)
- Target: iOS 17+ / macOS 14+
- Quantization: `coremltools.optimize.coreml.linear_quantize_weights`

Conversion script and validation: `mobius/models/stt/parakeet-ctc-0.6b-zh-cn`
## Citation

If you use this model, please cite:

```bibtex
@software{parakeet_ctc_zh_cn_coreml_2026,
  title={Parakeet CTC 0.6B zh-CN CoreML - Full Pipeline},
  author={Fluid Inference},
  year={2026},
  url={https://huggingface.co/FluidInference/parakeet-ctc-0.6b-zh-cn-coreml}
}

@article{parakeet_nvidia_2024,
  title={Parakeet: NVIDIA Speech AI Models},
  author={NVIDIA},
  year={2024},
  url={https://catalog.ngc.nvidia.com/orgs/nvidia/collections/parakeet-ctc-0.6b-zh-cn}
}
```
## License
CC-BY-4.0 (following NVIDIA's Parakeet model license)
## Acknowledgments
- Original Model: NVIDIA NeMo Team
- Conversion: Fluid Inference
- Test Dataset: Google FLEURS (Mandarin Chinese)
## Related Models
- parakeet-tdt-0.6b-v3-coreml - English TDT decoder (full pipeline)
- parakeet-ctc-0.6b-coreml - English CTC decoder
- parakeet-tdt-ctc-110m-coreml - Smaller 110M model