---
language:
- ja
license: cc-by-4.0
tags:
- speech
- audio
- automatic-speech-recognition
- coreml
- parakeet
- ctc
- japanese
library_name: coreml
pipeline_tag: automatic-speech-recognition
base_model:
- nvidia/parakeet-tdt_ctc-0.6b-ja
---

# Parakeet CTC 0.6B Japanese - CoreML

CoreML conversion of [nvidia/parakeet-tdt_ctc-0.6b-ja](https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja) for on-device Japanese speech recognition on Apple Silicon.

## Model Description

- **Language**: Japanese (日本語)
- **Parameters**: 600M (0.6B)
- **Architecture**: Hybrid FastConformer-TDT-CTC
- **Vocabulary**: 3,072 Japanese SentencePiece BPE tokens
- **Sample Rate**: 16 kHz
- **Max Duration**: 15 seconds per chunk
- **Platform**: iOS 17+ / macOS 14+ (Apple Silicon recommended)
- **ANE Utilization**: 100% (0 CPU fallbacks)

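Because the models accept at most 15 seconds (240,000 samples) per call, longer recordings must be split into chunks before inference. A minimal sketch of such splitting, using the chunk size from the spec above (the helper function and overlap-free strategy are illustrative assumptions, not part of the release):

```python
import numpy as np

CHUNK_SAMPLES = 240_000  # 15 s at 16 kHz, the model's fixed input length

def split_into_chunks(audio: np.ndarray, chunk_samples: int = CHUNK_SAMPLES):
    """Split a mono float32 waveform into fixed-size, zero-padded chunks.

    Returns (chunk, valid_samples) pairs; valid_samples counts the real
    (non-padding) samples, which can be fed to the preprocessor's `length` input.
    """
    chunks = []
    for start in range(0, len(audio), chunk_samples):
        chunk = audio[start:start + chunk_samples]
        valid = len(chunk)
        if valid < chunk_samples:
            chunk = np.pad(chunk, (0, chunk_samples - valid))
        chunks.append((chunk.reshape(1, -1), valid))
    return chunks

# Example: 40 s of audio -> three chunks, the last one padded
audio = np.zeros(640_000, dtype=np.float32)
chunks = split_into_chunks(audio)
print(len(chunks))         # 3
print(chunks[0][0].shape)  # (1, 240000)
print(chunks[-1][1])       # 160000 valid samples in the last chunk
```

Naive concatenation of per-chunk transcripts can split a word at a chunk boundary; overlapping chunks would mitigate that but are beyond this sketch.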
## Performance

**Benchmark on FluidInference/fleurs-full (650 Japanese samples)**:

- **CER**: 10.29% (within the expected 10-13% range)
- **RTFx**: 136.85x (far exceeds real-time)
- **Avg Latency**: 91.34 ms per sample on M-series chips

**Expected CER by Dataset** (from the NeMo paper):

| Dataset | CER |
|---------|-----|
| JSUT basic5000 | 6.5% |
| Mozilla Common Voice 8.0 test | 7.2% |
| Mozilla Common Voice 16.1 dev | 10.2% |
| Mozilla Common Voice 16.1 test | 13.3% |
| TEDxJP-10k | 9.1% |

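CER is the character error rate: the edit (Levenshtein) distance between hypothesis and reference transcripts, divided by the reference length in characters. A minimal sketch of how such a score can be computed (this is not the benchmark harness used above):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # Classic dynamic-programming edit distance, one row at a time
    dp = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            if r[i - 1] == h[j - 1]:
                dp[j] = prev
            else:
                dp[j] = 1 + min(prev, dp[j], dp[j - 1])
            prev = cur
    return dp[len(h)] / max(len(r), 1)

print(cer("こんにちは", "こんにちわ"))  # 1 substitution / 5 chars = 0.2
```

Japanese ASR is conventionally scored with CER rather than WER because the script has no whitespace word boundaries.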
## Critical Implementation Note: Raw Logits Output

**IMPORTANT**: The CTC decoder outputs **raw logits** (not log-probabilities). You **must** apply `log_softmax` before CTC decoding.

### Why?

During CoreML conversion, we discovered that `log_softmax` failed to convert correctly, producing extreme values (-45440 instead of -67). The solution was to output raw logits and apply `log_softmax` in post-processing.

### Usage Example

```python
import coremltools as ct
import numpy as np
import torch

# Load the three CoreML models
preprocessor = ct.models.MLModel('Preprocessor.mlpackage')
encoder = ct.models.MLModel('Encoder.mlpackage')
ctc_decoder = ct.models.MLModel('CtcDecoder.mlpackage')

# Prepare audio (16 kHz, mono, max 15 seconds)
audio = np.array(audio_samples, dtype=np.float32).reshape(1, -1)
audio_length = np.array([audio.shape[1]], dtype=np.int32)

# Pad or truncate to 240,000 samples (15 seconds)
if audio.shape[1] < 240000:
    audio = np.pad(audio, ((0, 0), (0, 240000 - audio.shape[1])))
else:
    audio = audio[:, :240000]

# Step 1: Preprocessor (audio -> mel)
prep_out = preprocessor.predict({
    'audio_signal': audio,
    'length': audio_length
})

# Step 2: Encoder (mel -> features)
enc_out = encoder.predict({
    'mel_features': prep_out['mel_features'],
    'mel_length': prep_out['mel_length']
})

# Step 3: CTC Decoder (features -> raw logits)
ctc_out = ctc_decoder.predict({
    'encoder_output': enc_out['encoder_output']
})
raw_logits = ctc_out['ctc_logits']  # [1, 188, 3073]

# Apply log_softmax (CRITICAL!)
logits_tensor = torch.from_numpy(raw_logits)
log_probs = torch.nn.functional.log_softmax(logits_tensor, dim=-1)

# Now use log_probs for CTC decoding
# Greedy decoding example:
labels = torch.argmax(log_probs, dim=-1)[0].numpy()  # [188]

# Collapse repeats and remove blanks (blank is the last index, 3072)
blank_id = 3072
decoded = []
prev = None
for label in labels:
    if label != blank_id and label != prev:
        decoded.append(int(label))
    prev = label

# Convert to text using the vocabulary
import json
with open('vocab.json', 'r') as f:
    vocab = json.load(f)
tokens = [vocab[i] for i in decoded if i < len(vocab)]
# SentencePiece marks word boundaries with U+2581 ("▁")
text = ''.join(tokens).replace('▁', ' ').strip()
print(text)
```

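If torch is not a desired dependency, `log_softmax` can be applied with NumPy alone; subtracting the per-frame maximum keeps it numerically stable. Note also that `log_softmax` is monotonic within each frame, so the greedy `argmax` labels are identical on raw logits; the normalization only matters when you score paths (beam search, confidence estimates). A sketch under those observations (not part of the release):

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable log-softmax: x - logsumexp(x)."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))

# Fake logits with the decoder's output shape [1, 188, 3073]
rng = np.random.default_rng(0)
raw_logits = rng.normal(size=(1, 188, 3073)).astype(np.float32)

log_probs = log_softmax(raw_logits)
# Each frame's exp(log_probs) sums to 1, as a proper distribution should
print(np.allclose(np.exp(log_probs).sum(axis=-1), 1.0, atol=1e-5))  # True
# Greedy labels are unchanged by the monotonic normalization
print(np.array_equal(raw_logits.argmax(-1), log_probs.argmax(-1)))  # True
```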
## Files Included

### CoreML Models

- **Preprocessor.mlpackage** - Audio → Mel spectrogram
  - Input: `audio_signal` [1, 240000], `length` [1]
  - Output: `mel_features` [1, 80, 1501], `mel_length` [1]
- **Encoder.mlpackage** - Mel → Encoder features (FastConformer)
  - Input: `mel_features` [1, 80, 1501], `mel_length` [1]
  - Output: `encoder_output` [1, 1024, 188]
- **CtcDecoder.mlpackage** - Features → Raw CTC logits
  - Input: `encoder_output` [1, 1024, 188]
  - Output: `ctc_logits` [1, 188, 3073] (RAW logits, not log-softmax!)

**Note**: Chain these three components together for full audio → text transcription (see usage example above).

### Supporting Files

- **vocab.json** - 3,072 Japanese SentencePiece BPE tokens (index → token mapping)
- **metadata.json** - Model metadata and shapes

## Model Architecture

```
Audio [1, 240000] @ 16kHz
  ↓ Preprocessor (STFT, Mel filterbank)
Mel Spectrogram [1, 80, 1501]
  ↓ Encoder (FastConformer, 8x downsampling)
Encoder Features [1, 1024, 188]
  ↓ CTC Decoder (Conv1d 1024→3073, kernel_size=1)
Raw Logits [1, 188, 3073]
  ↓ log_softmax (YOUR CODE - required!)
Log Probabilities [1, 188, 3073]
  ↓ CTC Beam Search / Greedy Decoding
Transcription
```

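The intermediate shapes in the diagram follow from the frontend and downsampling parameters. Assuming a standard 10 ms hop (160 samples at 16 kHz) for the mel frontend, which is consistent with the shapes above, the frame counts work out as:

```python
# 15 s of 16 kHz audio
samples = 240_000

# Mel frontend: one frame per 160-sample hop, plus one (assumed 10 ms hop)
hop = 160
mel_frames = samples // hop + 1
print(mel_frames)  # 1501

# FastConformer applies 8x temporal downsampling (ceil division)
encoder_frames = -(-mel_frames // 8)
print(encoder_frames)  # 188

# So each encoder frame (and each CTC logit row) covers roughly 80 ms of audio
print(round(15.0 / encoder_frames * 1000, 1))  # 79.8
```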
## Compilation (Optional but Recommended)

Compile models for faster loading:

```bash
xcrun coremlcompiler compile Preprocessor.mlpackage .
xcrun coremlcompiler compile Encoder.mlpackage .
xcrun coremlcompiler compile CtcDecoder.mlpackage .
```

This generates `.mlmodelc` directories that load ~20x faster on first run.

## Validation Results

All models were validated against the original NeMo implementation:

| Component | Max Diff | Relative Error | ANE % |
|-----------|----------|----------------|-------|
| Preprocessor | 0.148 | < 0.001% | 100% |
| Encoder | 0.109 | 1.03e-07% | 100% |
| CTC Decoder | 0.011 | < 0.001% | 100% |
| Full Pipeline | 0.482 | 1.44% | 100% |

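The figures above correspond to element-wise comparisons between CoreML and reference outputs. A sketch of one plausible form of that check (the exact validation harness and its error definitions are not included in this release):

```python
import numpy as np

def compare(coreml_out: np.ndarray, reference: np.ndarray):
    """Max absolute difference, and that difference relative to the
    reference's peak magnitude, expressed in percent."""
    max_diff = np.max(np.abs(coreml_out - reference))
    rel_err = max_diff / (np.max(np.abs(reference)) + 1e-12) * 100.0
    return float(max_diff), float(rel_err)

# Toy example with a small known perturbation
ref = np.linspace(-1.0, 1.0, 1000)
out = ref + 0.001
max_diff, rel_err = compare(out, ref)
print(round(max_diff, 6))  # 0.001
print(round(rel_err, 6))   # 0.1 (percent)
```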
## System Requirements

- **Minimum**: macOS 14.0 / iOS 17.0
- **Recommended**: Apple Silicon (M1/M2/M3/M4) for optimal performance
- **Intel Macs**: Will run on CPU only (slower, higher power consumption)

## Conversion Details

This CoreML conversion includes a critical fix for the `log_softmax` conversion failure:

### The Problem

Initial attempts to convert the CTC decoder's `forward()` method (which includes `log_softmax`) produced catastrophically wrong outputs:

- Expected: `[-67.31, -0.00]`
- CoreML: `[-45440.00, 0.00]`
- Max difference: **45,422** ❌

### The Solution

Bypass NeMo's `forward()` method and call only the underlying `decoder_layers` (Conv1d):

```python
# Instead of:
log_probs = ctc_decoder(encoder_output)  # Broken in CoreML

# We do:
raw_logits = ctc_decoder_layers(encoder_output)  # Works correctly
log_probs = torch.nn.functional.log_softmax(raw_logits, dim=-1)
```

This achieves near-identical results (0.011 max diff) while avoiding the CoreML conversion bug.

## Citation

```bibtex
@misc{parakeet-ctc-ja-coreml,
  title={Parakeet CTC 0.6B Japanese - CoreML},
  author={FluidInference},
  year={2026},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/FluidInference/parakeet-ctc-0.6b-ja-coreml}}
}

@misc{parakeet2024,
  title={Parakeet: NVIDIA's Automatic Speech Recognition Toolkit},
  author={NVIDIA},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja}}
}
```

## License

CC-BY-4.0 (following the original NVIDIA Parakeet model license)

## Acknowledgments

- Original model by the NVIDIA NeMo team
- Converted to CoreML by FluidInference
- Benchmarked on the FluidInference/fleurs-full dataset

## Links

- **Original Model**: https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja
- **Benchmark Dataset**: https://huggingface.co/datasets/FluidInference/fleurs-full
- **Conversion Repository**: https://github.com/FluidInference/mobius