Parakeet CTC 0.6B zh-CN CoreML - Full Pipeline

Complete end-to-end CoreML conversion of NVIDIA's Parakeet-CTC-0.6B Mandarin Chinese (Simplified) for on-device speech recognition on Apple Silicon.

Pure CoreML pipeline - No PyTorch/NeMo dependencies required at inference time.

Model Description

  • Architecture: FastConformer-CTC (Parakeet-CTC-XL) - Full pipeline
  • Parameters: ~600M (encoder) + CTC decoder head
  • Languages: Mandarin Chinese (Simplified) + English
  • Training Data: 17,000+ hours of zh-CN and en-US speech
  • Vocabulary: 7000 SentencePiece tokens + blank token
  • Input: Raw audio waveform [1, 240000] (15 seconds @ 16kHz; shorter audio must be padded)
  • Output: Transcribed Chinese text

Pipeline Components

  1. Preprocessor (816KB) - Audio → Mel spectrogram
  2. Encoder - Mel → encoder features [1, 1024, 188]
    • v1 (fp32): 1.1GB, 10.45% CER
    • v2 (int8): 0.55GB, 10.54% CER ⭐ Recommended
  3. Decoder (14MB) - Encoder features → CTC logits [1, 188, 7001]

Performance

Benchmarked on 100 FLEURS Mandarin Chinese test samples (1,104.8s audio):

Encoder v2 (int8 - Recommended)

Metric                     Value
Mean CER (normalized)      10.54%
Median CER                 5.97%
Latency (full pipeline)    49.3ms
Real-Time Factor           229x
Model Size                 0.55GB (2x smaller than fp32)
Samples < 10% CER          64.0%

Encoder v1 (fp32)

Metric                     Value
Mean CER (normalized)      10.45%
Median CER                 5.97%
Latency (full pipeline)    53.2ms
Real-Time Factor           214.7x
Model Size                 1.1GB

Quantization Comparison

Aspect        v1 (fp32)   v2 (int8)   Difference
Size          1.1GB       0.55GB      -50%
Mean CER      10.45%      10.54%      +0.09% ✅
Median CER    5.97%       5.97%       0.00% ✅
Latency       53.2ms      49.3ms      -7.3% faster

Recommendation: Use v2 (int8) for production - 2x smaller, same accuracy, faster inference.

CER Distribution (v2 int8)

  • <5% CER: 44.0% (44 samples) - Excellent transcription
  • 5-10% CER: 20.0% (20 samples) - Very good
  • 10-20% CER: 17.0% (17 samples) - Good
  • 20-30% CER: 10.0% (10 samples) - Acceptable
  • >30% CER: 9.0% (9 samples) - Poor (mostly foreign names/terms)

Note on CER: We report normalized CER which removes punctuation and normalizes number formats (digits → Chinese characters). Raw CER including these format differences is ~19-20%.
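For reference, the normalized CER reported here is character-level Levenshtein distance divided by reference length. A minimal self-contained implementation (an illustrative sketch, not the project's evaluation script):

```python
def edit_distance(ref: str, hyp: str) -> int:
    # Character-level Levenshtein distance with a rolling 1-D table.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (r != h),  # substitution (cost 0 if chars match)
            )
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    # Character error rate: edits to turn hyp into ref, per reference char.
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

Both strings should be normalized (punctuation stripped, digits converted) before scoring, as described below under Text Normalization.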

Hardware Requirements

  • Platform: macOS 14.0+ or iOS 17.0+
  • Processor: Apple Silicon (M1/M2/M3/M4/A15+) recommended
  • Neural Engine: Optimized for ANE execution
  • Memory: ~600MB runtime (int8 encoder)

Usage

Swift (Recommended)

import CoreML
import AVFoundation

// Load models (use the compiled .mlmodelc URLs for faster loading)
let preprocessor = try MLModel(contentsOf: preprocessorURL)
let encoder = try MLModel(contentsOf: encoderV2URL)  // Use v2 (int8)
let decoder = try MLModel(contentsOf: decoderURL)

// Prepare audio (16kHz, mono, padded to 15 seconds = 240,000 samples)
let audioArray: [Float] = ... // Your audio samples
let audioPadded = pad(audioArray, to: 240000)  // Your padding helper

// Step 1: Preprocessor (audio → mel spectrogram)
let preprocInput = try MLDictionaryFeatureProvider(dictionary: [
    "audio_signal": MLMultiArray(MLShapedArray(scalars: audioPadded, shape: [1, 240000])),
    "audio_length": MLMultiArray(MLShapedArray<Int32>(scalars: [240000], shape: [1]))
])
let preprocOutput = try preprocessor.prediction(from: preprocInput)

// Step 2: Encoder (mel → encoder features)
let encoderInput = try MLDictionaryFeatureProvider(dictionary: [
    "audio_signal": preprocOutput.featureValue(for: "mel")!,
    "length": preprocOutput.featureValue(for: "mel_length")!
])
let encoderOutput = try encoder.prediction(from: encoderInput)

// Step 3: Decoder (CTC head)
let decoderInput = try MLDictionaryFeatureProvider(dictionary: [
    "encoder_output": encoderOutput.featureValue(for: "encoder_output")!
])
let decoderOutput = try decoder.prediction(from: decoderInput)

// Step 4: CTC greedy decode (your helper; see the Python example below)
let logits = decoderOutput.featureValue(for: "ctc_logits")!.multiArrayValue!
let text = ctcGreedyDecode(logits: logits, vocabulary: vocab, blankId: 7000)
print(text)  // e.g. "你好世界"

Python (Validation/Testing)

import coremltools as ct
import numpy as np
import librosa
import json

# Load models
preprocessor = ct.models.MLModel("Preprocessor.mlpackage")
encoder = ct.models.MLModel("Encoder-v2-int8.mlpackage")  # Use v2
decoder = ct.models.MLModel("Decoder.mlpackage")

# Load vocabulary (index -> token string; used by the decode step below)
with open("vocab.json") as f:
    raw_vocab = json.load(f)
vocab = {int(k): v for k, v in raw_vocab.items()} if isinstance(raw_vocab, dict) else raw_vocab

# Load audio (16kHz mono)
audio, sr = librosa.load("audio.wav", sr=16000, mono=True)
audio = np.pad(audio, (0, max(0, 240000 - len(audio))))[:240000]
audio = audio[np.newaxis, :].astype(np.float32)
audio_length = np.array([240000], dtype=np.int32)

# Step 1: Preprocessor
preproc_out = preprocessor.predict({
    "audio_signal": audio,
    "audio_length": audio_length
})

# Step 2: Encoder
encoder_out = encoder.predict({
    "audio_signal": preproc_out["mel"],
    "length": preproc_out["mel_length"]
})

# Step 3: Decoder (CTC Head)
decoder_out = decoder.predict({
    "encoder_output": encoder_out["encoder_output"]
})

# Step 4: Decode (greedy CTC)
logits = decoder_out["ctc_logits"][0]  # [188, 7001]
labels = np.argmax(logits, axis=-1)

# Collapse repeats and remove blanks
decoded = []
prev = None
for label in labels:
    if label != 7000 and label != prev:
        decoded.append(label)
    prev = label

# Convert to text
text = "".join([vocab[i] for i in decoded])
text = text.replace("▁", " ").strip()
print(text)

Files

Models (.mlpackage - source, .mlmodelc - compiled)

  • Preprocessor - Audio preprocessing (mel spectrogram, 816KB)
  • Encoder-v1-fp32 - FastConformer encoder (float32, 1.1GB)
  • Encoder-v2-int8 - FastConformer encoder (int8 quantized, 0.55GB) ⭐ Recommended
  • Decoder - CTC decoder head (linear projection, 14MB)

Supporting Files

  • vocab.json - 7000 SentencePiece tokens (index → string mapping)
  • pipeline_metadata.json - Model specifications and benchmark results

Note: Both .mlpackage (uncompiled source) and .mlmodelc (compiled) versions are provided. Use .mlmodelc for faster loading.

Text Normalization

For accurate CER evaluation on Chinese text:

  1. Remove punctuation: Model outputs punctuation, but many datasets don't include it
  2. Normalize numbers: Convert digits to Chinese characters (15 → 十五, 2011 → 二零一一)
  3. Normalize whitespace: Collapse multiple spaces

Example:

Reference:    "我爱北京天安门"
Hypothesis:   "我 爱 北京 天安门 。"
Normalized:   "我爱北京天安门"  (both matched)
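These rules can be sketched in a few lines of Python (an illustration only, not the repository's text_normalizer.py; it reads digits out one by one, e.g. 2011 → 二零一一, and does not implement full place-value reading such as 15 → 十五):

```python
import re

# Chinese characters for digits 0-9.
DIGITS = "零一二三四五六七八九"

def normalize(text: str) -> str:
    # 1. Digits -> Chinese characters, read digit by digit.
    text = re.sub(r"\d", lambda m: DIGITS[int(m.group())], text)
    # 2. Remove whitespace and common ASCII/CJK punctuation.
    text = re.sub(r"[\s,.!?;:'\"()，。！？；：、（）“”‘’]", "", text)
    return text
```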

See text_normalizer.py for implementation.

Quantization Details

v2 (int8) encoder uses linear symmetric per-channel quantization via coremltools.optimize:

  • Method: linear_quantize_weights with mode="linear_symmetric"
  • Granularity: Per-channel (better accuracy than per-tensor)
  • Weight threshold: 512 elements (only large tensors quantized)
  • Accuracy impact: +0.09% CER (minimal degradation)
  • Performance gain: 7.3% faster inference + 50% memory reduction

The int8 quantization provides excellent compression with virtually no accuracy loss, making it ideal for on-device deployment.
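The v1 → v2 step can be reproduced with coremltools' optimization API; a sketch, assuming the fp32 encoder package is available locally (a config-level illustration that requires the model artifact to run, not the exact conversion script):

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

# Load the fp32 encoder and quantize weights to int8.
# Per-channel granularity is the coremltools default; only tensors with
# more than `weight_threshold` elements are quantized.
model = ct.models.MLModel("Encoder-v1-fp32.mlpackage")
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(
        mode="linear_symmetric",
        weight_threshold=512,
    )
)
quantized = linear_quantize_weights(model, config=config)
quantized.save("Encoder-v2-int8.mlpackage")
```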

Limitations

  1. Fixed input duration: Requires exactly 15 seconds of audio (240,000 samples @ 16kHz). Shorter audio must be padded, longer audio must be split into chunks.

  2. Language: Optimized for Mandarin Chinese (Simplified). Performance on Traditional Chinese or dialects may vary.

  3. Foreign words: Struggles with transliterated foreign names (e.g., "Rolando Mendoza" → incorrect transliteration).

  4. Punctuation: While the model can output punctuation, it was not extensively trained for punctuation prediction.

  5. Apple Silicon required: Models are optimized for Apple Neural Engine. Performance on Intel Macs or non-Apple hardware will be degraded.
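For limitation 1, longer recordings can be handled with a fixed-window splitter; a naive sketch (no overlap — real pipelines often overlap windows to avoid cutting words at chunk boundaries):

```python
import numpy as np

CHUNK = 240_000  # 15 s at 16 kHz

def chunk_audio(audio: np.ndarray) -> list[np.ndarray]:
    # Split into fixed 15 s windows; zero-pad the final (or only) window.
    chunks = []
    for start in range(0, max(len(audio), 1), CHUNK):
        piece = audio[start:start + CHUNK]
        piece = np.pad(piece, (0, CHUNK - len(piece)))
        chunks.append(piece.astype(np.float32))
    return chunks
```

Each chunk is then run through the three-model pipeline above and the transcripts concatenated.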

Conversion Details

Converted from NVIDIA NeMo checkpoint using:

  • Source: parakeet-ctc-riva-0-6b-unified-zh-cn_vtrainable_v3.0
  • Method: PyTorch → TorchScript → CoreML
  • Compute Units: CPU + Neural Engine
  • Precision: Float32 (v1), Int8 (v2)
  • Target: iOS 17+ / macOS 14+
  • Quantization: coremltools.optimize.coreml.linear_quantize_weights

Conversion script and validation: mobius/models/stt/parakeet-ctc-0.6b-zh-cn

Citation

If you use this model, please cite:

@software{parakeet_ctc_zh_cn_coreml_2026,
  title={Parakeet CTC 0.6B zh-CN CoreML - Full Pipeline},
  author={Fluid Inference},
  year={2026},
  url={https://huggingface.co/FluidInference/parakeet-ctc-0.6b-zh-cn-coreml}
}

@article{parakeet_nvidia_2024,
  title={Parakeet: NVIDIA Speech AI Models},
  author={NVIDIA},
  year={2024},
  url={https://catalog.ngc.nvidia.com/orgs/nvidia/collections/parakeet-ctc-0.6b-zh-cn}
}

License

CC-BY-4.0 (following NVIDIA's Parakeet model license)

Acknowledgments

  • Original Model: NVIDIA NeMo Team
  • Conversion: Fluid Inference
  • Test Dataset: Google FLEURS (Mandarin Chinese)

