---
language:
- ja
license: cc-by-4.0
tags:
- speech
- audio
- automatic-speech-recognition
- coreml
- parakeet
- ctc
- japanese
library_name: coreml
pipeline_tag: automatic-speech-recognition
base_model:
- nvidia/parakeet-tdt_ctc-0.6b-ja
---

# Parakeet CTC 0.6B Japanese - CoreML

CoreML conversion of [nvidia/parakeet-tdt_ctc-0.6b-ja](https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja) for on-device Japanese speech recognition on Apple Silicon.

## Model Description

- **Language**: Japanese (日本語)
- **Parameters**: 600M (0.6B)
- **Architecture**: Hybrid FastConformer-TDT-CTC
- **Vocabulary**: 3,072 Japanese SentencePiece BPE tokens
- **Sample Rate**: 16 kHz
- **Max Duration**: 15 seconds per chunk
- **Platform**: iOS 17+ / macOS 14+ (Apple Silicon recommended)
- **ANE Utilization**: 100% (0 CPU fallbacks)

## Performance

**Benchmark on FluidInference/fleurs-full (650 Japanese samples)**:

- **CER**: 10.29% (within the expected 10-13% range)
- **RTFx**: 136.85x (far exceeds real-time)
- **Avg Latency**: 91.34 ms per sample on M-series chips

**Expected CER by Dataset** (from the NeMo paper):

| Dataset | CER |
|---------|-----|
| JSUT basic5000 | 6.5% |
| Mozilla Common Voice 8.0 test | 7.2% |
| Mozilla Common Voice 16.1 dev | 10.2% |
| Mozilla Common Voice 16.1 test | 13.3% |
| TEDxJP-10k | 9.1% |

## Critical Implementation Note: Raw Logits Output

**IMPORTANT**: The CTC decoder outputs **raw logits** (not log-probabilities). You **must** apply `log_softmax` before CTC decoding.

### Why?

During CoreML conversion, we discovered that `log_softmax` failed to convert correctly, producing extreme values (-45440 instead of -67). The solution was to output raw logits and apply `log_softmax` in post-processing.
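If you prefer not to pull in a torch dependency just for this step, `log_softmax` is straightforward to compute in NumPy. A minimal sketch (the full usage example below uses `torch.nn.functional.log_softmax`, which is equivalent):

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    x_max = np.max(x, axis=axis, keepdims=True)
    shifted = x - x_max
    return shifted - np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))

# Example with the CTC decoder's output shape [1, 188, 3073]
raw_logits = np.random.randn(1, 188, 3073).astype(np.float32)
log_probs = log_softmax(raw_logits)

# Sanity check: each frame's probabilities sum to 1 after exponentiating
assert np.allclose(np.exp(log_probs).sum(axis=-1), 1.0, atol=1e-4)
```

Subtracting the per-frame maximum keeps `exp` from overflowing on large logits, which matters here since the raw logits are not normalized.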
### Usage Example

```python
import coremltools as ct
import numpy as np
import torch

# Load the three CoreML models
preprocessor = ct.models.MLModel('Preprocessor.mlpackage')
encoder = ct.models.MLModel('Encoder.mlpackage')
ctc_decoder = ct.models.MLModel('CtcDecoder.mlpackage')

# Prepare audio (16kHz, mono, max 15 seconds)
audio = np.array(audio_samples, dtype=np.float32).reshape(1, -1)
audio_length = np.array([audio.shape[1]], dtype=np.int32)

# Pad or truncate to 240,000 samples (15 seconds)
if audio.shape[1] < 240000:
    audio = np.pad(audio, ((0, 0), (0, 240000 - audio.shape[1])))
else:
    audio = audio[:, :240000]

# Step 1: Preprocessor (audio → mel)
prep_out = preprocessor.predict({
    'audio_signal': audio,
    'length': audio_length
})

# Step 2: Encoder (mel → features)
enc_out = encoder.predict({
    'mel_features': prep_out['mel_features'],
    'mel_length': prep_out['mel_length']
})

# Step 3: CTC Decoder (features → raw logits)
ctc_out = ctc_decoder.predict({
    'encoder_output': enc_out['encoder_output']
})
raw_logits = ctc_out['ctc_logits']  # [1, 188, 3073]

# Apply log_softmax (CRITICAL!)
logits_tensor = torch.from_numpy(raw_logits)
log_probs = torch.nn.functional.log_softmax(logits_tensor, dim=-1)

# Now use log_probs for CTC decoding
# Greedy decoding example:
labels = torch.argmax(log_probs, dim=-1)[0].numpy()  # [188]

# Collapse repeats, then remove blanks
blank_id = 3072
decoded = []
prev = None
for label in labels:
    if label != blank_id and label != prev:
        decoded.append(label)
    prev = label  # update every frame, so a blank can separate repeated tokens

# Convert to text using vocabulary
import json
with open('vocab.json', 'r') as f:
    vocab = json.load(f)
tokens = [vocab[i] for i in decoded if i < len(vocab)]
text = ''.join(tokens).replace('▁', ' ').strip()
print(text)
```

## Files Included

### CoreML Models

- **Preprocessor.mlpackage** - Audio → Mel spectrogram
  - Input: `audio_signal` [1, 240000], `length` [1]
  - Output: `mel_features` [1, 80, 1501], `mel_length` [1]
- **Encoder.mlpackage** - Mel → Encoder features (FastConformer)
  - Input: `mel_features` [1, 80, 1501], `mel_length` [1]
  - Output: `encoder_output` [1, 1024, 188]
- **CtcDecoder.mlpackage** - Features → Raw CTC logits
  - Input: `encoder_output` [1, 1024, 188]
  - Output: `ctc_logits` [1, 188, 3073] (RAW logits, not log-softmax!)

**Note**: Chain these three components together for full audio → text transcription (see the usage example above).

### Supporting Files

- **vocab.json** - 3,072 Japanese SentencePiece BPE tokens (index → token mapping)
- **metadata.json** - Model metadata and shapes

## Model Architecture

```
Audio [1, 240000] @ 16kHz
    ↓ Preprocessor (STFT, Mel filterbank)
Mel Spectrogram [1, 80, 1501]
    ↓ Encoder (FastConformer, 8x downsampling)
Encoder Features [1, 1024, 188]
    ↓ CTC Decoder (Conv1d 1024→3073, kernel_size=1)
Raw Logits [1, 188, 3073]
    ↓ log_softmax (YOUR CODE - required!)
Log Probabilities [1, 188, 3073]
    ↓ CTC Beam Search / Greedy Decoding
Transcription
```

## Compilation (Optional but Recommended)

Compile models for faster loading:

```bash
xcrun coremlcompiler compile Preprocessor.mlpackage .
xcrun coremlcompiler compile Encoder.mlpackage .
xcrun coremlcompiler compile CtcDecoder.mlpackage .
```

This generates `.mlmodelc` directories that load ~20x faster on first run.

## Validation Results

All models validated against the original NeMo implementation:

| Component | Max Diff | Relative Error | ANE % |
|-----------|----------|----------------|-------|
| Preprocessor | 0.148 | < 0.001% | 100% |
| Encoder | 0.109 | 1.03e-07% | 100% |
| CTC Decoder | 0.011 | < 0.001% | 100% |
| Full Pipeline | 0.482 | 1.44% | 100% |

## System Requirements

- **Minimum**: macOS 14.0 / iOS 17.0
- **Recommended**: Apple Silicon (M1/M2/M3/M4) for optimal performance
- **Intel Macs**: Run on CPU only (slower, with higher power consumption)

## Conversion Details

This CoreML conversion includes a critical fix for the `log_softmax` conversion failure:

### The Problem

Initial attempts to convert the CTC decoder's `forward()` method (which includes `log_softmax`) produced catastrophically wrong outputs:

- Expected: `[-67.31, -0.00]`
- CoreML: `[-45440.00, 0.00]`
- Max difference: **45,422** ❌

### The Solution

Bypass NeMo's `forward()` method and access only the underlying `decoder_layers` (Conv1d):

```python
# Instead of:
log_probs = ctc_decoder(encoder_output)  # Broken in CoreML

# We do:
raw_logits = ctc_decoder_layers(encoder_output)  # Converts correctly
log_probs = torch.nn.functional.log_softmax(raw_logits, dim=-1)
```

This achieves near-identical results (0.011 max diff) while avoiding the CoreML conversion bug.
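The max-diff numbers in the validation table can be reproduced with a short comparison helper. A minimal sketch; the exact relative-error definition used for the table is not documented here, so the one below (max diff normalized by the reference's max magnitude) is an assumption, and `nemo_out` / `coreml_out` are hypothetical placeholder arrays:

```python
import numpy as np

def compare_outputs(reference: np.ndarray, candidate: np.ndarray) -> tuple:
    """Return (max absolute difference, relative error in percent).

    Relative error here is max diff divided by the reference's max
    magnitude -- an assumed definition, for illustration only.
    """
    max_diff = float(np.max(np.abs(reference - candidate)))
    rel_err = 100.0 * max_diff / (float(np.max(np.abs(reference))) + 1e-12)
    return max_diff, rel_err

# Hypothetical example: NeMo reference vs. CoreML output for one tensor
nemo_out = np.array([-67.31, -0.00])
coreml_out = np.array([-67.30, -0.00])
max_diff, rel_err = compare_outputs(nemo_out, coreml_out)
```

Running this over each component's output (and the chained pipeline) gives per-component numbers comparable to the table above.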
## Citation

```bibtex
@misc{parakeet-ctc-ja-coreml,
  title={Parakeet CTC 0.6B Japanese - CoreML},
  author={FluidInference},
  year={2026},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/FluidInference/parakeet-ctc-0.6b-ja-coreml}}
}

@misc{parakeet2024,
  title={Parakeet: NVIDIA's Automatic Speech Recognition Toolkit},
  author={NVIDIA},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja}}
}
```

## License

CC-BY-4.0 (following the original NVIDIA Parakeet model license)

## Acknowledgments

- Original model by the NVIDIA NeMo team
- Converted to CoreML by FluidInference
- Benchmarked on the FluidInference/fleurs-full dataset

## Links

- **Original Model**: https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja
- **Benchmark Dataset**: https://huggingface.co/datasets/FluidInference/fleurs-full
- **Conversion Repository**: https://github.com/FluidInference/mobius