Parakeet CTC 0.6B Japanese - CoreML

CoreML conversion of nvidia/parakeet-tdt_ctc-0.6b-ja for on-device Japanese speech recognition on Apple Silicon.

Model Description

  • Language: Japanese (ζ—₯本θͺž)
  • Parameters: 600M (0.6B)
  • Architecture: Hybrid FastConformer-TDT-CTC
  • Vocabulary: 3,072 Japanese SentencePiece BPE tokens
  • Sample Rate: 16 kHz
  • Max Duration: 15 seconds per chunk
  • Platform: iOS 17+ / macOS 14+ (Apple Silicon recommended)
  • ANE Utilization: 100% (0 CPU fallbacks)

Performance

Benchmark on FluidInference/fleurs-full (650 Japanese samples):

  • CER: 10.29% (within expected 10-13% range)
  • RTFx: 136.85x (far exceeds real-time)
  • Avg Latency: 91.34ms per sample on M-series chips
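For reference, CER is character-level edit distance divided by reference length, and RTFx is audio duration divided by wall-clock processing time. A minimal sketch of both metrics (not the benchmark harness used above):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein edit distance / reference length."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        cur = [i]
        for j, hc in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (rc != hc)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: seconds of audio processed per wall-clock second."""
    return audio_seconds / wall_seconds
```

For example, a one-character substitution in a five-character reference gives a CER of 0.2, and processing 15 s of audio in 0.1 s gives an RTFx of 150.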

Expected CER by Dataset (from NeMo paper):

Dataset                          CER
JSUT basic5000                   6.5%
Mozilla Common Voice 8.0 test    7.2%
Mozilla Common Voice 16.1 dev    10.2%
Mozilla Common Voice 16.1 test   13.3%
TEDxJP-10k                       9.1%

Critical Implementation Note: Raw Logits Output

IMPORTANT: The CTC decoder outputs raw logits (not log-probabilities). You must apply log_softmax before CTC decoding.

Why?

During CoreML conversion, the log_softmax operation converted incorrectly, producing extreme values (e.g. -45440 where -67 was expected). The workaround is to export raw logits from the model and apply log_softmax in post-processing.
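If you want to avoid a PyTorch dependency at inference time, log_softmax is straightforward to apply in NumPy. A numerically stable sketch (the dummy logits tensor below just mirrors the CTC output shape):

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the max before exponentiating for numerical stability
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

# Dummy logits shaped like the CTC decoder output [1, 188, 3073]
logits = np.random.randn(1, 188, 3073).astype(np.float32)
log_probs = log_softmax(logits)  # exp(log_probs) sums to 1 along the last axis
```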

Usage Example

import coremltools as ct
import numpy as np
import torch

# Load the three CoreML models
preprocessor = ct.models.MLModel('Preprocessor.mlpackage')
encoder = ct.models.MLModel('Encoder.mlpackage')
ctc_decoder = ct.models.MLModel('CtcDecoder.mlpackage')

# Prepare audio: `audio_samples` is your 16 kHz mono waveform (1-D float array)
audio = np.array(audio_samples, dtype=np.float32).reshape(1, -1)
audio_length = np.array([audio.shape[1]], dtype=np.int32)

# Pad or truncate to 240,000 samples (15 seconds)
if audio.shape[1] < 240000:
    audio = np.pad(audio, ((0, 0), (0, 240000 - audio.shape[1])))
else:
    audio = audio[:, :240000]

# Step 1: Preprocessor (audio β†’ mel)
prep_out = preprocessor.predict({
    'audio_signal': audio,
    'length': audio_length
})

# Step 2: Encoder (mel β†’ features)
enc_out = encoder.predict({
    'mel_features': prep_out['mel_features'],
    'mel_length': prep_out['mel_length']
})

# Step 3: CTC Decoder (features β†’ raw logits)
ctc_out = ctc_decoder.predict({
    'encoder_output': enc_out['encoder_output']
})
raw_logits = ctc_out['ctc_logits']  # [1, 188, 3073]

# Apply log_softmax (CRITICAL!)
logits_tensor = torch.from_numpy(raw_logits)
log_probs = torch.nn.functional.log_softmax(logits_tensor, dim=-1)

# Now use log_probs for CTC decoding
# Greedy decoding example:
labels = torch.argmax(log_probs, dim=-1)[0].numpy()  # [188]

# Collapse repeats and remove blanks
blank_id = 3072
decoded = []
prev = None
for label in labels:
    if label != blank_id and label != prev:
        decoded.append(label)
    prev = label

# Convert to text using the vocabulary
import json
with open('vocab.json', 'r') as f:
    vocab = json.load(f)
# JSON object keys are strings; normalize an {index: token} dict to a list
if isinstance(vocab, dict):
    vocab = [vocab[str(i)] for i in range(len(vocab))]
tokens = [vocab[int(i)] for i in decoded if i < len(vocab)]
text = ''.join(tokens).replace('▁', ' ').strip()
print(text)

Files Included

CoreML Models

  • Preprocessor.mlpackage - Audio β†’ Mel spectrogram

    • Input: audio_signal [1, 240000], length [1]
    • Output: mel_features [1, 80, 1501], mel_length [1]
  • Encoder.mlpackage - Mel β†’ Encoder features (FastConformer)

    • Input: mel_features [1, 80, 1501], mel_length [1]
    • Output: encoder_output [1, 1024, 188]
  • CtcDecoder.mlpackage - Features β†’ Raw CTC logits

    • Input: encoder_output [1, 1024, 188]
    • Output: ctc_logits [1, 188, 3073] (RAW logits, not log-softmax!)

Note: Chain these three components together for full audio β†’ text transcription (see usage example above).

Supporting Files

  • vocab.json - 3,072 Japanese SentencePiece BPE tokens (index β†’ token mapping)
  • metadata.json - Model metadata and shapes

Model Architecture

Audio [1, 240000] @ 16kHz
  ↓ Preprocessor (STFT, Mel filterbank)
Mel Spectrogram [1, 80, 1501]
  ↓ Encoder (FastConformer, 8x downsampling)
Encoder Features [1, 1024, 188]
  ↓ CTC Decoder (Conv1d 1024β†’3073, kernel_size=1)
Raw Logits [1, 188, 3073]
  ↓ log_softmax (YOUR CODE - required!)
Log Probabilities [1, 188, 3073]
  ↓ CTC Beam Search / Greedy Decoding
Transcription
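The frame counts in the diagram follow from a 10 ms mel hop (160 samples at 16 kHz, an assumption consistent with the shapes above) and the encoder's 8x downsampling:

```python
SAMPLE_RATE = 16_000
HOP = 160                # 10 ms hop at 16 kHz (assumed from the shapes above)
num_samples = 240_000    # 15 s of audio

mel_frames = num_samples // HOP + 1   # mel frames, including the initial frame
enc_frames = -(-mel_frames // 8)      # ceil division for 8x downsampling
print(mel_frames, enc_frames)         # 1501 188
```

Each encoder frame therefore covers roughly 80 ms of audio.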

Compilation (Optional but Recommended)

Compile models for faster loading:

xcrun coremlcompiler compile Preprocessor.mlpackage .
xcrun coremlcompiler compile Encoder.mlpackage .
xcrun coremlcompiler compile CtcDecoder.mlpackage .

This generates .mlmodelc directories that load ~20x faster on first run.

Validation Results

All models validated against original NeMo implementation:

Component       Max Diff   Relative Error   ANE %
Preprocessor    0.148      < 0.001%         100%
Encoder         0.109      1.03e-07%        100%
CTC Decoder     0.011      < 0.001%         100%
Full Pipeline   0.482      1.44%            100%
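The max-diff and relative-error columns can be reproduced with a simple NumPy comparison between CoreML and NeMo outputs. A sketch with dummy arrays (the exact relative-error definition used in the table is an assumption; here it is max diff over the reference's dynamic range):

```python
import numpy as np

def compare(coreml_out: np.ndarray, nemo_out: np.ndarray):
    """Return (max absolute difference, relative error vs. reference range)."""
    max_diff = np.abs(coreml_out - nemo_out).max()
    rel_err = max_diff / (np.abs(nemo_out).max() + 1e-12)
    return max_diff, rel_err

ref = np.array([1.0, -2.0, 3.0])      # stand-in for the NeMo output
test = np.array([1.01, -2.0, 3.0])    # stand-in for the CoreML output
max_diff, rel_err = compare(test, ref)
```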

System Requirements

  • Minimum: macOS 14.0 / iOS 17.0
  • Recommended: Apple Silicon (M1/M2/M3/M4) for optimal performance
  • Intel Macs: Will run on CPU only (slower, higher power consumption)

Conversion Details

This CoreML conversion includes a critical fix for log_softmax conversion failure:

The Problem

Initial attempts to convert the CTC decoder's forward() method (which includes log_softmax) produced catastrophically wrong outputs:

  • Expected: [-67.31, -0.00]
  • CoreML: [-45440.00, 0.00]
  • Max difference: 45,422 ❌

The Solution

Bypass NeMo's forward() method and access only the underlying decoder_layers (Conv1d):

# Instead of tracing NeMo's forward(), whose log_softmax breaks in CoreML:
log_probs = ctc_decoder(encoder_output)  # Broken after conversion

# Trace only the underlying Conv1d projection and defer log_softmax:
raw_logits = ctc_decoder.decoder_layers(encoder_output)  # Converts correctly
log_probs = torch.nn.functional.log_softmax(raw_logits, dim=-1)

This achieves identical results (0.011 max diff) while avoiding the CoreML conversion bug.

Citation

@misc{parakeet-ctc-ja-coreml,
  title={Parakeet CTC 0.6B Japanese - CoreML},
  author={FluidInference},
  year={2026},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/FluidInference/parakeet-ctc-0.6b-ja-coreml}}
}

@misc{parakeet2024,
  title={Parakeet: NVIDIA's Automatic Speech Recognition Toolkit},
  author={NVIDIA},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja}}
}

License

CC-BY-4.0 (following the original NVIDIA Parakeet model license)

Acknowledgments

  • Original model by NVIDIA NeMo team
  • Converted to CoreML by FluidInference
  • Benchmarked on FluidInference/fleurs-full dataset
