# Parakeet CTC 0.6B Japanese - CoreML

CoreML conversion of nvidia/parakeet-tdt_ctc-0.6b-ja for on-device Japanese speech recognition on Apple Silicon.

## Model Description

- Language: Japanese (日本語)
- Parameters: 600M (0.6B)
- Architecture: Hybrid FastConformer-TDT-CTC
- Vocabulary: 3,072 Japanese SentencePiece BPE tokens
- Sample Rate: 16 kHz
- Max Duration: 15 seconds per chunk
- Platform: iOS 17+ / macOS 14+ (Apple Silicon recommended)
- ANE Utilization: 100% (0 CPU fallbacks)
## Performance

Benchmark on FluidInference/fleurs-full (650 Japanese samples):

- CER: 10.29% (within the expected 10-13% range)
- RTFx: 136.85x (far exceeds real time)
- Avg Latency: 91.34 ms per sample on M-series chips
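RTFx here is the inverse real-time factor: seconds of audio processed per second of compute (the exact averaging used in the benchmark is an assumption). A quick back-of-the-envelope check, with a hypothetical average clip length:

```python
# RTFx = total audio duration / total processing time (values > 1 beat real time).
# The clip length below is a hypothetical average, not a benchmark figure.
num_samples = 650
avg_clip_seconds = 12.0          # assumed average FLEURS clip length
avg_latency_seconds = 0.09134    # 91.34 ms average latency from above

rtfx = (num_samples * avg_clip_seconds) / (num_samples * avg_latency_seconds)
print(f"RTFx ~ {rtfx:.1f}x")     # lands in the same ballpark as the reported 136.85x
```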
Expected CER by dataset (from the NeMo paper):
| Dataset | CER |
|---|---|
| JSUT basic5000 | 6.5% |
| Mozilla Common Voice 8.0 test | 7.2% |
| Mozilla Common Voice 16.1 dev | 10.2% |
| Mozilla Common Voice 16.1 test | 13.3% |
| TEDxJP-10k | 9.1% |
## Critical Implementation Note: Raw Logits Output

**IMPORTANT**: The CTC decoder outputs raw logits (not log-probabilities). You must apply log_softmax before CTC decoding.

### Why?

During CoreML conversion, we discovered that log_softmax failed to convert correctly, producing extreme values (-45440 instead of -67). The solution was to output raw logits and apply log_softmax in post-processing.
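If you prefer not to pull in PyTorch just for this step, an equivalent log_softmax is easy to write in NumPy with the standard max-subtraction trick for numerical stability (a sketch, not shipped with this repo):

```python
import numpy as np

def log_softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable log_softmax: x - logsumexp(x) along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

# Shape matches the CTC head's output: [batch, time, vocab + blank]
raw_logits = np.random.randn(1, 188, 3073).astype(np.float32)
log_probs = log_softmax(raw_logits)
# Sanity check: probabilities at each time step sum to 1
assert np.allclose(np.exp(log_probs).sum(axis=-1), 1.0, atol=1e-4)
```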
## Usage Example

```python
import json

import coremltools as ct
import numpy as np
import torch

# Load the three CoreML models
preprocessor = ct.models.MLModel('Preprocessor.mlpackage')
encoder = ct.models.MLModel('Encoder.mlpackage')
ctc_decoder = ct.models.MLModel('CtcDecoder.mlpackage')

# Prepare audio (16 kHz, mono, max 15 seconds);
# audio_samples is your raw waveform as a float sequence
audio = np.array(audio_samples, dtype=np.float32).reshape(1, -1)
audio_length = np.array([audio.shape[1]], dtype=np.int32)

# Pad or truncate to 240,000 samples (15 seconds)
if audio.shape[1] < 240000:
    audio = np.pad(audio, ((0, 0), (0, 240000 - audio.shape[1])))
else:
    audio = audio[:, :240000]

# Step 1: Preprocessor (audio → mel)
prep_out = preprocessor.predict({
    'audio_signal': audio,
    'length': audio_length
})

# Step 2: Encoder (mel → features)
enc_out = encoder.predict({
    'mel_features': prep_out['mel_features'],
    'mel_length': prep_out['mel_length']
})

# Step 3: CTC decoder (features → raw logits)
ctc_out = ctc_decoder.predict({
    'encoder_output': enc_out['encoder_output']
})
raw_logits = ctc_out['ctc_logits']  # [1, 188, 3073]

# Apply log_softmax (CRITICAL: the decoder emits raw logits)
logits_tensor = torch.from_numpy(raw_logits)
log_probs = torch.nn.functional.log_softmax(logits_tensor, dim=-1)

# Greedy CTC decoding
labels = torch.argmax(log_probs, dim=-1)[0].numpy()  # [188]

# Collapse repeats and remove blanks
blank_id = 3072
decoded = []
prev = None
for label in labels:
    if label != blank_id and label != prev:
        decoded.append(label)
    prev = label

# Convert token IDs to text using the vocabulary
with open('vocab.json', 'r') as f:
    vocab = json.load(f)
tokens = [vocab[i] for i in decoded if i < len(vocab)]
text = ''.join(tokens).replace('▁', ' ').strip()
print(text)
```
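Since each model invocation handles at most 15 seconds, longer recordings must be split and the chunk transcripts joined. Below is a minimal fixed-window splitter with no overlap (a sketch; `transcribe_chunk` is a hypothetical wrapper around the three `predict` calls above, and naive splitting can cut words at chunk boundaries):

```python
import numpy as np

CHUNK_SAMPLES = 240000  # 15 s at 16 kHz

def split_into_chunks(audio: np.ndarray) -> list:
    """Split a mono waveform [n] into zero-padded [1, 240000] chunks."""
    chunks = []
    for start in range(0, len(audio), CHUNK_SAMPLES):
        chunk = audio[start:start + CHUNK_SAMPLES]
        padded = np.zeros(CHUNK_SAMPLES, dtype=np.float32)
        padded[:len(chunk)] = chunk
        chunks.append(padded.reshape(1, -1))
    return chunks

# Usage sketch (transcribe_chunk is hypothetical):
# text = ''.join(transcribe_chunk(c) for c in split_into_chunks(long_audio))
```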
## Files Included

### CoreML Models

**Preprocessor.mlpackage** - Audio → Mel spectrogram
- Input: `audio_signal [1, 240000]`, `length [1]`
- Output: `mel_features [1, 80, 1501]`, `mel_length [1]`

**Encoder.mlpackage** - Mel → Encoder features (FastConformer)
- Input: `mel_features [1, 80, 1501]`, `mel_length [1]`
- Output: `encoder_output [1, 1024, 188]`

**CtcDecoder.mlpackage** - Features → Raw CTC logits
- Input: `encoder_output [1, 1024, 188]`
- Output: `ctc_logits [1, 188, 3073]` (raw logits, not log-softmax!)

Note: Chain these three components together for full audio → text transcription (see the usage example above).
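The chaining described above can live in one helper. Here is a sketch that takes the three loaded models (anything with a `predict(dict) -> dict` method, such as the coremltools `MLModel` objects from the usage example) and returns the raw logits:

```python
import numpy as np

def run_pipeline(preprocessor, encoder, ctc_decoder, audio: np.ndarray) -> np.ndarray:
    """Chain Preprocessor → Encoder → CtcDecoder; returns raw CTC logits.

    `audio` must already be padded/truncated to shape [1, 240000].
    Remember to apply log_softmax to the result before CTC decoding.
    """
    length = np.array([audio.shape[1]], dtype=np.int32)
    prep = preprocessor.predict({'audio_signal': audio, 'length': length})
    enc = encoder.predict({'mel_features': prep['mel_features'],
                           'mel_length': prep['mel_length']})
    return ctc_decoder.predict({'encoder_output': enc['encoder_output']})['ctc_logits']
```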
### Supporting Files

- vocab.json - 3,072 Japanese SentencePiece BPE tokens (index → token mapping)
- metadata.json - Model metadata and shapes
## Model Architecture

```
Audio [1, 240000] @ 16kHz
    ↓ Preprocessor (STFT, Mel filterbank)
Mel Spectrogram [1, 80, 1501]
    ↓ Encoder (FastConformer, 8x downsampling)
Encoder Features [1, 1024, 188]
    ↓ CTC Decoder (Conv1d 1024→3073, kernel_size=1)
Raw Logits [1, 188, 3073]
    ↓ log_softmax (YOUR CODE - required!)
Log Probabilities [1, 188, 3073]
    ↓ CTC Beam Search / Greedy Decoding
Transcription
```
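The tensor shapes in the diagram are mutually consistent if the preprocessor uses a 10 ms hop (160 samples at 16 kHz), which is typical for NeMo mel front ends but is an assumption here, together with the encoder's 8x time downsampling:

```python
import math

samples = 240000                       # 15 s at 16 kHz
hop = 160                              # assumed 10 ms hop length
mel_frames = samples // hop + 1        # center-padded STFT framing
encoder_frames = math.ceil(mel_frames / 8)  # FastConformer 8x downsampling

print(mel_frames)      # 1501, matching mel_features [1, 80, 1501]
print(encoder_frames)  # 188, matching encoder_output [1, 1024, 188]
```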
## Compilation (Optional but Recommended)

Compile the models for faster loading:

```shell
xcrun coremlcompiler compile Preprocessor.mlpackage .
xcrun coremlcompiler compile Encoder.mlpackage .
xcrun coremlcompiler compile CtcDecoder.mlpackage .
```

This generates .mlmodelc directories that load ~20x faster on first run.
## Validation Results

All models were validated against the original NeMo implementation:
| Component | Max Diff | Relative Error | ANE % |
|---|---|---|---|
| Preprocessor | 0.148 | < 0.001% | 100% |
| Encoder | 0.109 | 1.03e-07% | 100% |
| CTC Decoder | 0.011 | < 0.001% | 100% |
| Full Pipeline | 0.482 | 1.44% | 100% |
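For reference, the max-diff and relative-error columns can be reproduced with an elementwise comparison along these lines (the normalization by the reference tensor's scale is an assumption about how the figures were computed):

```python
import numpy as np

def compare_outputs(reference: np.ndarray, candidate: np.ndarray):
    """Return (max absolute difference, error relative to the reference's scale)."""
    diff = np.abs(reference.astype(np.float64) - candidate.astype(np.float64))
    max_diff = diff.max()
    rel_error = max_diff / (np.abs(reference).max() + 1e-12)
    return max_diff, rel_error

# Toy example: a 0.1 absolute error on values of magnitude ~1000
max_diff, rel = compare_outputs(np.array([1000.0, -500.0]),
                                np.array([1000.1, -500.0]))
```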
## System Requirements
- Minimum: macOS 14.0 / iOS 17.0
- Recommended: Apple Silicon (M1/M2/M3/M4) for optimal performance
- Intel Macs: Will run on CPU only (slower, higher power consumption)
## Conversion Details

This CoreML conversion includes a critical fix for a log_softmax conversion failure.

### The Problem

Initial attempts to convert the CTC decoder's forward() method (which includes log_softmax) produced catastrophically wrong outputs:

- Expected: [-67.31, -0.00]
- CoreML: [-45440.00, 0.00]
- Max difference: 45,422 ❌
### The Solution

Bypass NeMo's forward() method and call only the underlying decoder_layers (a Conv1d):

```python
# Instead of:
log_probs = ctc_decoder(encoder_output)  # Broken in CoreML

# We do:
raw_logits = ctc_decoder_layers(encoder_output)  # Works correctly
log_probs = torch.nn.functional.log_softmax(raw_logits, dim=-1)
```

This achieves near-identical results (0.011 max diff) while avoiding the CoreML conversion bug.
## Citation

```bibtex
@misc{parakeet-ctc-ja-coreml,
  title={Parakeet CTC 0.6B Japanese - CoreML},
  author={FluidInference},
  year={2026},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/FluidInference/parakeet-ctc-0.6b-ja-coreml}}
}

@misc{parakeet2024,
  title={Parakeet: NVIDIA's Automatic Speech Recognition Toolkit},
  author={NVIDIA},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja}}
}
```
## License
CC-BY-4.0 (following the original NVIDIA Parakeet model license)
## Acknowledgments
- Original model by NVIDIA NeMo team
- Converted to CoreML by FluidInference
- Benchmarked on FluidInference/fleurs-full dataset
## Links
- Original Model: https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja
- Benchmark Dataset: https://huggingface.co/datasets/FluidInference/fleurs-full
- Conversion Repository: https://github.com/FluidInference/mobius