Parakeet CTC 0.6B zh-CN CoreML - Full Pipeline

Complete end-to-end CoreML conversion of NVIDIA's Parakeet-CTC-0.6B Mandarin Chinese (Simplified) for on-device speech recognition on Apple Silicon.

Pure CoreML pipeline - No PyTorch/NeMo dependencies required at inference time.

Model Description

  • Architecture: FastConformer-CTC (Parakeet-CTC-XL) - Full pipeline
  • Parameters: ~600M (encoder) + CTC decoder head
  • Languages: Mandarin Chinese (Simplified) + English
  • Training Data: 17,000+ hours of zh-CN and en-US speech
  • Vocabulary: 7000 SentencePiece tokens + blank token
  • Input: Raw audio waveform [1, 240000] (15 seconds @ 16kHz; shorter audio must be padded)
  • Output: Transcribed Chinese text

Pipeline Components

  1. Preprocessor (816KB) - Audio → Mel spectrogram
  2. Encoder - Mel → encoder features [1, 1024, 188]
    • v1 (fp32): 1.1GB, 10.45% CER
    • v2 (int8): 0.55GB, 10.54% CER ⭐ Recommended
  3. Decoder (14MB) - Encoder features → CTC logits [1, 188, 7001]

Performance

Benchmarked on 100 FLEURS Mandarin Chinese test samples (1,104.8s audio):

Encoder v2 (int8 - Recommended)

Metric                     Value
Mean CER (normalized)      10.54%
Median CER                 5.97%
Latency (full pipeline)    49.3ms
Real-Time Factor           229x
Model Size                 0.55GB (2x smaller than fp32)
Samples < 10% CER          64.0%

Encoder v1 (fp32)

Metric                     Value
Mean CER (normalized)      10.45%
Median CER                 5.97%
Latency (full pipeline)    53.2ms
Real-Time Factor           214.7x
Model Size                 1.1GB

Quantization Comparison

Aspect        v1 (fp32)   v2 (int8)   Difference
Size          1.1GB       0.55GB      -50%
Mean CER      10.45%      10.54%      +0.09% ✅
Median CER    5.97%       5.97%       0.00% ✅
Latency       53.2ms      49.3ms      -7.3% faster

Recommendation: Use v2 (int8) for production - 2x smaller, same accuracy, faster inference.

CER Distribution (v2 int8)

  • <5% CER: 44.0% (44 samples) - Excellent transcription
  • 5-10% CER: 20.0% (20 samples) - Very good
  • 10-20% CER: 17.0% (17 samples) - Good
  • 20-30% CER: 10.0% (10 samples) - Acceptable
  • >30% CER: 9.0% (9 samples) - Poor (mostly foreign names/terms)

Note on CER: We report normalized CER which removes punctuation and normalizes number formats (digits → Chinese characters). Raw CER including these format differences is ~19-20%.
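For reference, the normalized CER reported here is character-level Levenshtein distance divided by reference length. A minimal self-contained implementation (an illustrative sketch, not the project's evaluation script):

```python
def edit_distance(ref: str, hyp: str) -> int:
    # Character-level Levenshtein distance with a rolling 1-D table.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (r != h),  # substitution (cost 0 if chars match)
            )
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    # Character error rate: edits to turn hyp into ref, per reference char.
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

Both strings should be normalized (punctuation stripped, digits converted) before scoring, as described below under Text Normalization.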

Hardware Requirements

  • Platform: macOS 14.0+ or iOS 17.0+
  • Processor: Apple Silicon (M1/M2/M3/M4/A15+) recommended
  • Neural Engine: Optimized for ANE execution
  • Memory: ~600MB runtime (int8 encoder)

Usage

Swift (Recommended)

import CoreML
import AVFoundation

// Load models (use the compiled .mlmodelc URLs for faster loading)
let preprocessor = try MLModel(contentsOf: preprocessorURL)
let encoder = try MLModel(contentsOf: encoderV2URL)  // Use v2 (int8)
let decoder = try MLModel(contentsOf: decoderURL)

// Prepare audio (16kHz, mono, padded to 15 seconds = 240,000 samples)
let audioArray: [Float] = ... // Your audio samples
let audioPadded = pad(audioArray, to: 240000)  // Your padding helper

// Step 1: Preprocessor (audio → mel spectrogram)
let preprocInput = try MLDictionaryFeatureProvider(dictionary: [
    "audio_signal": MLMultiArray(MLShapedArray(scalars: audioPadded, shape: [1, 240000])),
    "audio_length": MLMultiArray(MLShapedArray<Int32>(scalars: [240000], shape: [1]))
])
let preprocOutput = try preprocessor.prediction(from: preprocInput)

// Step 2: Encoder (mel → encoder features)
let encoderInput = try MLDictionaryFeatureProvider(dictionary: [
    "audio_signal": preprocOutput.featureValue(for: "mel")!,
    "length": preprocOutput.featureValue(for: "mel_length")!
])
let encoderOutput = try encoder.prediction(from: encoderInput)

// Step 3: Decoder (CTC head)
let decoderInput = try MLDictionaryFeatureProvider(dictionary: [
    "encoder_output": encoderOutput.featureValue(for: "encoder_output")!
])
let decoderOutput = try decoder.prediction(from: decoderInput)

// Step 4: CTC greedy decode (your helper; see the Python example below)
let logits = decoderOutput.featureValue(for: "ctc_logits")!.multiArrayValue!
let text = ctcGreedyDecode(logits: logits, vocabulary: vocab, blankId: 7000)
print(text)  // e.g. "你好世界"

Python (Validation/Testing)

import coremltools as ct
import numpy as np
import librosa
import json

# Load models
preprocessor = ct.models.MLModel("Preprocessor.mlpackage")
encoder = ct.models.MLModel("Encoder-v2-int8.mlpackage")  # Use v2
decoder = ct.models.MLModel("Decoder.mlpackage")

# Load vocabulary (index -> token string; used by the decode step below)
with open("vocab.json") as f:
    raw_vocab = json.load(f)
vocab = {int(k): v for k, v in raw_vocab.items()} if isinstance(raw_vocab, dict) else raw_vocab

# Load audio (16kHz mono)
audio, sr = librosa.load("audio.wav", sr=16000, mono=True)
audio = np.pad(audio, (0, max(0, 240000 - len(audio))))[:240000]
audio = audio[np.newaxis, :].astype(np.float32)
audio_length = np.array([240000], dtype=np.int32)

# Step 1: Preprocessor
preproc_out = preprocessor.predict({
    "audio_signal": audio,
    "audio_length": audio_length
})

# Step 2: Encoder
encoder_out = encoder.predict({
    "audio_signal": preproc_out["mel"],
    "length": preproc_out["mel_length"]
})

# Step 3: Decoder (CTC Head)
decoder_out = decoder.predict({
    "encoder_output": encoder_out["encoder_output"]
})

# Step 4: Decode (greedy CTC)
logits = decoder_out["ctc_logits"][0]  # [188, 7001]
labels = np.argmax(logits, axis=-1)

# Collapse repeats and remove blanks
decoded = []
prev = None
for label in labels:
    if label != 7000 and label != prev:
        decoded.append(label)
    prev = label

# Convert to text
text = "".join([vocab[i] for i in decoded])
text = text.replace("▁", " ").strip()
print(text)

Files

Models (.mlpackage - source, .mlmodelc - compiled)

  • Preprocessor - Audio preprocessing (mel spectrogram, 816KB)
  • Encoder-v1-fp32 - FastConformer encoder (float32, 1.1GB)
  • Encoder-v2-int8 - FastConformer encoder (int8 quantized, 0.55GB) ⭐ Recommended
  • Decoder - CTC decoder head (linear projection, 14MB)

Supporting Files

  • vocab.json - 7000 SentencePiece tokens (index → string mapping)
  • pipeline_metadata.json - Model specifications and benchmark results

Note: Both .mlpackage (uncompiled source) and .mlmodelc (compiled) versions are provided. Use .mlmodelc for faster loading.

Text Normalization

For accurate CER evaluation on Chinese text:

  1. Remove punctuation: Model outputs punctuation, but many datasets don't include it
  2. Normalize numbers: Convert digits to Chinese characters (15 → 十五, 2011 → 二零一一)
  3. Normalize whitespace: Collapse multiple spaces

Example:

Reference:    "我爱北京天安门"
Hypothesis:   "我 爱 北京 天安门 。"
Normalized:   "我爱北京天安门"  (both matched)
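These rules can be sketched in a few lines of Python (an illustration only, not the repository's text_normalizer.py; it reads digits out one by one, e.g. 2011 → 二零一一, and does not implement full place-value reading such as 15 → 十五):

```python
import re

# Chinese characters for digits 0-9.
DIGITS = "零一二三四五六七八九"

def normalize(text: str) -> str:
    # 1. Digits -> Chinese characters, read digit by digit.
    text = re.sub(r"\d", lambda m: DIGITS[int(m.group())], text)
    # 2. Remove whitespace and common ASCII/CJK punctuation.
    text = re.sub(r"[\s,.!?;:'\"()，。！？；：、（）“”‘’]", "", text)
    return text
```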

See text_normalizer.py for implementation.

Quantization Details

v2 (int8) encoder uses linear symmetric per-channel quantization via coremltools.optimize:

  • Method: linear_quantize_weights with mode="linear_symmetric"
  • Granularity: Per-channel (better accuracy than per-tensor)
  • Weight threshold: 512 elements (only large tensors quantized)
  • Accuracy impact: +0.09% CER (minimal degradation)
  • Performance gain: 7.3% faster inference + 50% memory reduction

The int8 quantization provides excellent compression with virtually no accuracy loss, making it ideal for on-device deployment.
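The v1 → v2 step can be reproduced with coremltools' optimization API; a sketch, assuming the fp32 encoder package is available locally (a config-level illustration that requires the model artifact to run, not the exact conversion script):

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

# Load the fp32 encoder and quantize weights to int8.
# Per-channel granularity is the coremltools default; only tensors with
# more than `weight_threshold` elements are quantized.
model = ct.models.MLModel("Encoder-v1-fp32.mlpackage")
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(
        mode="linear_symmetric",
        weight_threshold=512,
    )
)
quantized = linear_quantize_weights(model, config=config)
quantized.save("Encoder-v2-int8.mlpackage")
```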

Limitations

  1. Fixed input duration: Requires exactly 15 seconds of audio (240,000 samples @ 16kHz). Shorter audio must be padded, longer audio must be split into chunks.

  2. Language: Optimized for Mandarin Chinese (Simplified). Performance on Traditional Chinese or dialects may vary.

  3. Foreign words: Struggles with transliterated foreign names (e.g., "Rolando Mendoza" → incorrect transliteration).

  4. Punctuation: While the model can output punctuation, it was not extensively trained for punctuation prediction.

  5. Apple Silicon required: Models are optimized for Apple Neural Engine. Performance on Intel Macs or non-Apple hardware will be degraded.
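For limitation 1, longer recordings can be handled with a fixed-window splitter; a naive sketch (no overlap — real pipelines often overlap windows to avoid cutting words at chunk boundaries):

```python
import numpy as np

CHUNK = 240_000  # 15 s at 16 kHz

def chunk_audio(audio: np.ndarray) -> list[np.ndarray]:
    # Split into fixed 15 s windows; zero-pad the final (or only) window.
    chunks = []
    for start in range(0, max(len(audio), 1), CHUNK):
        piece = audio[start:start + CHUNK]
        piece = np.pad(piece, (0, CHUNK - len(piece)))
        chunks.append(piece.astype(np.float32))
    return chunks
```

Each chunk is then run through the three-model pipeline above and the transcripts concatenated.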

Conversion Details

Converted from NVIDIA NeMo checkpoint using:

  • Source: parakeet-ctc-riva-0-6b-unified-zh-cn_vtrainable_v3.0
  • Method: PyTorch → TorchScript → CoreML
  • Compute Units: CPU + Neural Engine
  • Precision: Float32 (v1), Int8 (v2)
  • Target: iOS 17+ / macOS 14+
  • Quantization: coremltools.optimize.coreml.linear_quantize_weights

Conversion script and validation: mobius/models/stt/parakeet-ctc-0.6b-zh-cn

Citation

If you use this model, please cite:

@software{parakeet_ctc_zh_cn_coreml_2026,
  title={Parakeet CTC 0.6B zh-CN CoreML - Full Pipeline},
  author={Fluid Inference},
  year={2026},
  url={https://huggingface.co/FluidInference/parakeet-ctc-0.6b-zh-cn-coreml}
}

@article{parakeet_nvidia_2024,
  title={Parakeet: NVIDIA Speech AI Models},
  author={NVIDIA},
  year={2024},
  url={https://catalog.ngc.nvidia.com/orgs/nvidia/collections/parakeet-ctc-0.6b-zh-cn}
}

License

CC-BY-4.0 (following NVIDIA's Parakeet model license)

Acknowledgments

  • Original Model: NVIDIA NeMo Team
  • Conversion: Fluid Inference
  • Test Dataset: Google FLEURS (Mandarin Chinese)

