	---
	language:
	- ja
	license: cc-by-4.0
	tags:
	- speech
	- audio
	- automatic-speech-recognition
	- coreml
	- parakeet
	- ctc
	- japanese
	library_name: coreml
	pipeline_tag: automatic-speech-recognition
	base_model:
	- nvidia/parakeet-tdt_ctc-0.6b-ja
	---

	# Parakeet CTC 0.6B Japanese - CoreML

	CoreML conversion of [nvidia/parakeet-tdt_ctc-0.6b-ja](https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja) for on-device Japanese speech recognition on Apple Silicon.

	## Model Description

	- Language: Japanese (日本語)
	- Parameters: 600M (0.6B)
	- Architecture: Hybrid FastConformer-TDT-CTC
	- Vocabulary: 3,072 Japanese SentencePiece BPE tokens (plus one CTC blank, for 3,073 output classes)
	- Sample Rate: 16 kHz
	- Max Duration: 15 seconds per chunk
	- Platform: iOS 17+ / macOS 14+ (Apple Silicon recommended)
	- ANE Utilization: 100% (0 CPU fallbacks)
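
At 16 kHz, the 15-second limit corresponds to 240,000 samples per chunk, so longer recordings have to be split before inference. A minimal sketch, assuming simple non-overlapping chunks with zero padding on the last one; `chunk_audio` is a hypothetical helper name that is not part of this repository, and a real pipeline may prefer overlapping windows or splitting at silences:

```python
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_SAMPLES = 15 * SAMPLE_RATE  # 240,000 samples per chunk


def chunk_audio(samples, chunk_samples=CHUNK_SAMPLES):
    """Split mono audio into fixed-size chunks, zero-padding the last one.

    Returns a list of (chunk, valid_length) pairs, where chunk has shape
    [1, chunk_samples] (float32) and valid_length has shape [1] (int32).
    """
    samples = np.asarray(samples, dtype=np.float32).reshape(-1)
    chunks = []
    for start in range(0, len(samples), chunk_samples):
        piece = samples[start:start + chunk_samples]
        valid = len(piece)
        piece = np.pad(piece, (0, chunk_samples - valid))
        chunks.append((piece.reshape(1, -1), np.array([valid], dtype=np.int32)))
    return chunks


# 40 seconds of silence -> three chunks, the last holding 10 s of real audio
chunks = chunk_audio(np.zeros(40 * SAMPLE_RATE))
```

Each pair matches the `audio_signal` and `length` inputs documented for the preprocessor.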

	## Performance

	Benchmark on FluidInference/fleurs-full (650 Japanese samples):
	- CER: 10.29% (within the expected 10-13% range)
	- RTFx: 136.85x (well above real time)
	- Avg Latency: 91.34 ms per sample on M-series chips
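
RTFx is commonly defined as audio duration divided by processing time, so the two figures above can be cross-checked against each other. A short worked calculation, assuming that definition and that the latency is averaged over the same samples (the benchmark script itself is not shown here):

```python
def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    return audio_seconds / processing_seconds


avg_latency_s = 0.09134   # 91.34 ms, from the benchmark above
reported_rtfx = 136.85    # from the benchmark above

# Implied average clip length if RTFx = audio duration / latency
implied_clip_s = reported_rtfx * avg_latency_s
print(f"implied average clip length: {implied_clip_s:.2f} s")  # 12.50 s
```

Under that assumption the average benchmark clip is about 12.5 seconds long, which fits within the 15-second chunk limit.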

	Expected CER by Dataset (from NeMo paper):

	| Dataset | CER |
	|---------|-----|
	| JSUT basic5000 | 6.5% |
	| Mozilla Common Voice 8.0 test | 7.2% |
	| Mozilla Common Voice 16.1 dev | 10.2% |
	| Mozilla Common Voice 16.1 test | 13.3% |
	| TEDxJP-10k | 9.1% |
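
CER here is the character error rate. A minimal sketch of the usual definition (character-level Levenshtein distance divided by reference length), assuming no text normalization; published numbers typically depend on normalization choices such as punctuation removal, which are not specified here:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance over reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted character out of five
print(cer("こんにちは", "こんにちわ"))  # 0.2
```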

	## Critical Implementation Note: Raw Logits Output

	IMPORTANT: The CTC decoder outputs raw logits (not log-probabilities). You must apply `log_softmax` before CTC decoding.

	### Why?

	During CoreML conversion, we discovered that `log_softmax` failed to convert correctly, producing extreme values (-45440 instead of -67). The solution was to output raw logits and apply `log_softmax` in post-processing.
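
The post-processing step does not require PyTorch. A minimal NumPy sketch of a numerically stable `log_softmax`, assuming logits shaped `[1, 188, 3073]` as documented for `ctc_logits` (random values stand in for real model output here):

```python
import numpy as np


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    logits = np.asarray(logits, dtype=np.float32)
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


# Stand-in for ctc_out['ctc_logits']; real values come from CtcDecoder.mlpackage
rng = np.random.default_rng(0)
raw_logits = rng.normal(scale=10.0, size=(1, 188, 3073)).astype(np.float32)

log_probs = log_softmax(raw_logits)

# Each frame's probabilities sum to 1 after exponentiation
frame_sums = np.exp(log_probs).sum(axis=-1)
```

Because `log_softmax` only subtracts a per-frame constant, the per-frame argmax is unchanged, so greedy decoding picks the same labels either way. The normalization matters for beam search, language-model fusion, and confidence scores.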

	### Usage Example

	```python
	import json

	import coremltools as ct
	import numpy as np
	import torch

	# Load the three CoreML models
	preprocessor = ct.models.MLModel('Preprocessor.mlpackage')
	encoder = ct.models.MLModel('Encoder.mlpackage')
	ctc_decoder = ct.models.MLModel('CtcDecoder.mlpackage')

	# Prepare audio (16 kHz, mono, max 15 seconds).
	# `audio_samples` is your own 1-D sequence of float samples.
	audio = np.array(audio_samples, dtype=np.float32).reshape(1, -1)

	# Valid (unpadded) length, capped at the 15-second limit
	audio_length = np.array([min(audio.shape[1], 240000)], dtype=np.int32)

	# Pad or truncate to 240,000 samples (15 seconds)
	if audio.shape[1] < 240000:
	    audio = np.pad(audio, ((0, 0), (0, 240000 - audio.shape[1])))
	else:
	    audio = audio[:, :240000]

	# Step 1: Preprocessor (audio → mel)
	prep_out = preprocessor.predict({
	    'audio_signal': audio,
	    'length': audio_length
	})

	# Step 2: Encoder (mel → features)
	enc_out = encoder.predict({
	    'mel_features': prep_out['mel_features'],
	    'mel_length': prep_out['mel_length']
	})

	# Step 3: CTC Decoder (features → raw logits)
	ctc_out = ctc_decoder.predict({
	    'encoder_output': enc_out['encoder_output']
	})
	raw_logits = ctc_out['ctc_logits']  # [1, 188, 3073]

	# Apply log_softmax over the vocabulary axis (CRITICAL!)
	logits_tensor = torch.from_numpy(raw_logits)
	log_probs = torch.nn.functional.log_softmax(logits_tensor, dim=-1)

	# Now use log_probs for CTC decoding
	# Greedy decoding example:
	labels = torch.argmax(log_probs, dim=-1)[0].numpy()  # [188]

	# Collapse repeats and remove blanks
	blank_id = 3072
	decoded = []
	prev = None
	for label in labels:
	    if label != blank_id and label != prev:
	        decoded.append(int(label))
	    prev = label  # updated every frame, so a blank separates repeated tokens

	# Convert to text using vocabulary
	# (indexing assumes vocab.json parses to a list ordered by token index)
	with open('vocab.json', 'r', encoding='utf-8') as f:
	    vocab = json.load(f)
	tokens = [vocab[i] for i in decoded if i < len(vocab)]
	text = ''.join(tokens).replace('▁', ' ').strip()
	print(text)
	```
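
The collapse step in the example above can also be written as a small reusable function. A sketch using NumPy only, assuming `blank_id = 3072` (the last of the 3,073 output classes, as in the example); `ctc_greedy_collapse` is a hypothetical helper name:

```python
import numpy as np


def ctc_greedy_collapse(labels, blank_id=3072):
    """Collapse consecutive repeats, then drop blanks (standard CTC greedy rule)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return []
    # Keep a frame only if it differs from the previous frame
    keep = np.ones(labels.shape, dtype=bool)
    keep[1:] = labels[1:] != labels[:-1]
    collapsed = labels[keep]
    return collapsed[collapsed != blank_id].tolist()


# A blank between two identical labels keeps both; adjacent repeats merge
frames = [3072, 5, 5, 3072, 5, 7, 7, 3072]
print(ctc_greedy_collapse(frames))  # [5, 5, 7]
```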

	## Files Included

	### CoreML Models

	- Preprocessor.mlpackage - Audio → Mel spectrogram
	  - Input: `audio_signal` [1, 240000], `length` [1]
	  - Output: `mel_features` [1, 80, 1501], `mel_length` [1]

	- Encoder.mlpackage - Mel → Encoder features (FastConformer)
	  - Input: `mel_features` [1, 80, 1501], `mel_length` [1]
	  - Output: `encoder_output` [1, 1024, 188]

	- CtcDecoder.mlpackage - Features → Raw CTC logits
	  - Input: `encoder_output` [1, 1024, 188]
	  - Output: `ctc_logits` [1, 188, 3073] (RAW logits, not log-softmax!)

	Note: Chain these three components together for full audio → text transcription (see usage example above).

	### Supporting Files

	- vocab.json - 3,072 Japanese SentencePiece BPE tokens (index → token mapping)
	- metadata.json - Model metadata and shapes
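
The exact JSON layout of `vocab.json` is not described above. A defensive loading sketch, assuming it is either a JSON array of tokens or a JSON object whose keys are stringified indices (JSON object keys are always strings); `normalize_vocab`, `load_vocab`, and `ids_to_text` are hypothetical helper names:

```python
import json


def normalize_vocab(data):
    """Turn parsed JSON (list, or {"0": token, ...} object) into {int id: token}."""
    if isinstance(data, list):
        return dict(enumerate(data))
    return {int(index): token for index, token in data.items()}


def load_vocab(path):
    """Read a vocabulary file as UTF-8 and normalize it."""
    with open(path, "r", encoding="utf-8") as f:
        return normalize_vocab(json.load(f))


def ids_to_text(ids, vocab):
    """Join SentencePiece pieces; '▁' (U+2581) marks a word boundary."""
    pieces = [vocab[int(i)] for i in ids if int(i) in vocab]
    return "".join(pieces).replace("▁", " ").strip()


# Toy vocabulary for illustration only; the real file has 3,072 entries
toy_vocab = normalize_vocab({"0": "▁こん", "1": "にちは"})
text = ids_to_text([0, 1], toy_vocab)
```

Converting ids with `int(...)` also keeps lookups working when the ids arrive as NumPy integers from `argmax`.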

	## Model Architecture

	```
	Audio [1, 240000] @ 16kHz
	↓ Preprocessor (STFT, Mel filterbank)
	Mel Spectrogram [1, 80, 1501]
	↓ Encoder (FastConformer, 8x downsampling)
	Encoder Features [1, 1024, 188]
	↓ CTC Decoder (Conv1d 1024→3073, kernel_size=1)
	Raw Logits [1, 188, 3073]
	↓ log_softmax (YOUR CODE - required!)
	Log Probabilities [1, 188, 3073]
	↓ CTC Beam Search / Greedy Decoding
	Transcription
	```
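
The shapes in the diagram are consistent with a 10 ms hop (160 samples at 16 kHz) followed by 8x downsampling. A short calculation, assuming those values and ceiling rounding in the encoder (both inferred from the shapes above, not confirmed by the conversion code); it can also estimate how many of the 188 output frames correspond to real audio rather than padding:

```python
import math

HOP_SAMPLES = 160   # assumed: 10 ms hop at 16 kHz
DOWNSAMPLING = 8    # from the diagram


def mel_frames(num_samples, hop=HOP_SAMPLES):
    """Mel frame count, assuming one frame per hop plus one (centered STFT)."""
    return num_samples // hop + 1


def encoder_frames(num_samples, hop=HOP_SAMPLES, factor=DOWNSAMPLING):
    """Approximate encoder frame count after downsampling."""
    return math.ceil(mel_frames(num_samples, hop) / factor)


print(mel_frames(240_000))      # 1501
print(encoder_frames(240_000))  # 188
print(encoder_frames(80_000))   # 63 (5 s of audio)
```

Under these assumptions each encoder frame spans about 80 ms, so a 5-second clip padded to 15 seconds has roughly 63 meaningful frames out of 188.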

	## Compilation (Optional but Recommended)

	Compile models for faster loading:

	```bash
	xcrun coremlcompiler compile Preprocessor.mlpackage .
	xcrun coremlcompiler compile Encoder.mlpackage .
	xcrun coremlcompiler compile CtcDecoder.mlpackage .
	```

	This generates `.mlmodelc` directories that load ~20x faster on first run.

	## Validation Results

	All models were validated against the original NeMo implementation:

	| Component | Max Diff | Relative Error | ANE % |
	|-----------|----------|----------------|-------|
	| Preprocessor | 0.148 | < 0.001% | 100% |
	| Encoder | 0.109 | 1.03e-07% | 100% |
	| CTC Decoder | 0.011 | < 0.001% | 100% |
	| Full Pipeline | 0.482 | 1.44% | 100% |
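
The table does not define its metrics. One plausible reading, stated here as an assumption, is that Max Diff is the largest absolute element-wise difference and Relative Error is a ratio of L2 norms. A NumPy sketch of that comparison, with toy arrays standing in for the NeMo and CoreML outputs:

```python
import numpy as np


def compare_outputs(reference, candidate):
    """Return (max absolute difference, relative L2 error in percent)."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    diff = candidate - reference
    max_diff = float(np.abs(diff).max())
    rel_err_pct = float(100.0 * np.linalg.norm(diff) / np.linalg.norm(reference))
    return max_diff, rel_err_pct


# Toy data: a reference tensor and a slightly perturbed copy
reference = np.array([[1.0, -2.0], [3.0, 4.0]])
candidate = reference + np.array([[0.0, 0.01], [-0.02, 0.0]])
max_diff, rel_err_pct = compare_outputs(reference, candidate)
```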

	## System Requirements

	- Minimum: macOS 14.0 / iOS 17.0
	- Recommended: Apple Silicon (M1/M2/M3/M4) for optimal performance
	- Intel Macs: Will run on CPU only (slower, higher power consumption)

	## Conversion Details

	This CoreML conversion includes a critical fix for `log_softmax` conversion failure:

	### The Problem

	Initial attempts to convert the CTC decoder's `forward()` method (which includes `log_softmax`) produced catastrophically wrong outputs:
	- Expected: `[-67.31, -0.00]`
	- CoreML: `[-45440.00, 0.00]`
	- Max difference: 45,422 ❌

	### The Solution

	Bypass NeMo's `forward()` method and access only the underlying `decoder_layers` (Conv1d):

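	```python
	# Instead of:
	log_probs = ctc_decoder(encoder_output)  # forward() includes log_softmax; broken after CoreML conversion

	# We do:
	raw_logits = ctc_decoder_layers(encoder_output)  # Conv1d only; converts cleanly
	log_probs = torch.nn.functional.log_softmax(raw_logits, dim=-1)  # applied in post-processing, over the vocabulary axis
	```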

	This matches the reference output to within a maximum difference of 0.011 while avoiding the CoreML conversion bug.
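
Since the decoder head is a `Conv1d` with `kernel_size=1`, it is equivalent to applying the same linear projection to every encoder frame. A NumPy sketch of that equivalence with small toy dimensions (the real layer maps 1024 → 3073); the random weights are placeholders, and the transpose to `[batch, time, classes]` is an assumption based on the documented `ctc_logits` shape:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, channels, frames, classes = 1, 8, 5, 4  # toy sizes

encoder_output = rng.normal(size=(batch, channels, frames))  # [B, C, T]
weight = rng.normal(size=(classes, channels))  # Conv1d weight, kernel axis squeezed
bias = rng.normal(size=(classes,))

# Conv1d(kernel_size=1): out[b, v, t] = sum_c weight[v, c] * x[b, c, t] + bias[v]
conv_out = np.einsum("vc,bct->bvt", weight, encoder_output) + bias[None, :, None]

# Move classes to the last axis so log_softmax can run over axis -1
raw_logits = conv_out.transpose(0, 2, 1)  # [B, T, V]

# Same result computed frame by frame as a plain matrix-vector product
per_frame = np.stack(
    [weight @ encoder_output[0, :, t] + bias for t in range(frames)]
)
```

This is why exporting only the projection loses nothing: the remaining `log_softmax` is a cheap per-frame operation that can run outside CoreML.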

	## Citation

	```bibtex
	@misc{parakeet-ctc-ja-coreml,
	  title={Parakeet CTC 0.6B Japanese - CoreML},
	  author={FluidInference},
	  year={2026},
	  publisher={HuggingFace},
	  howpublished={\url{https://huggingface.co/FluidInference/parakeet-ctc-0.6b-ja-coreml}}
	}

	@misc{parakeet2024,
	  title={Parakeet: NVIDIA's Automatic Speech Recognition Toolkit},
	  author={NVIDIA},
	  year={2024},
	  publisher={HuggingFace},
	  howpublished={\url{https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja}}
	}
	```

	## License

	CC-BY-4.0 (following the original NVIDIA Parakeet model license)

	## Acknowledgments

	- Original model by NVIDIA NeMo team
	- Converted to CoreML by FluidInference
	- Benchmarked on FluidInference/fleurs-full dataset

	## Links

	- Original Model: https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja
	- Benchmark Dataset: https://huggingface.co/datasets/FluidInference/fleurs-full
	- Conversion Repository: https://github.com/FluidInference/mobius