# Parakeet-TDT-CTC-110M CoreML

NVIDIA's Parakeet-TDT-CTC-110M model converted to CoreML format for efficient inference on Apple Silicon.
## Model Description

This is a hybrid ASR model with a shared Conformer encoder and two decoder heads:

- **CTC Head**: Fast greedy decoding, ideal for keyword spotting
- **TDT Head**: Token-and-Duration Transducer for high-quality transcription
### Architecture

| Component | Description | Size |
|-----------|-------------|------|
| Preprocessor | Mel spectrogram extraction | ~1 MB |
| Encoder | Conformer encoder (shared) | ~400 MB |
| CTCHead | CTC output projection | ~4 MB |
| Decoder | TDT prediction network (LSTM) | ~25 MB |
| JointDecision | TDT joint network | ~6 MB |

**Total size**: ~436 MB
### Performance

Benchmarked on the Earnings22 dataset (772 audio files):

| Metric | Value |
|--------|-------|
| Keyword Recall | 100% (1309/1309) |
| WER | 17.97% |
| RTFx (M4 Pro) | 358x real-time |
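The two headline metrics follow standard definitions: WER is the word-level edit distance divided by the number of reference words, and RTFx is seconds of audio processed per second of wall-clock time. A minimal sketch (function names are ours, not part of this repo):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # min of deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref)

def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """RTFx: how many seconds of audio are processed per second of compute."""
    return audio_seconds / wall_seconds
```

For example, an RTFx of 358x means one hour of audio transcribes in roughly ten seconds.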
## Requirements

- macOS 13+ (Ventura or later)
- Apple Silicon (M1/M2/M3/M4)
- Python 3.10+
## Installation

```bash
# Using uv (recommended)
uv sync

# Or using pip
pip install -e .

# For audio file support (WAV, MP3, etc.)
pip install -e ".[audio]"
```
## Usage

### Python Inference

```python
from scripts.inference import ParakeetCoreML

# Load model (from current directory with .mlpackage files)
model = ParakeetCoreML(".")

# Transcribe with TDT (higher quality)
text = model.transcribe("audio.wav", mode="tdt")
print(text)

# Or use CTC for faster keyword spotting
text = model.transcribe("audio.wav", mode="ctc")
print(text)
```
### Command Line

```bash
# TDT decoding (default, higher quality)
uv run scripts/inference.py --audio audio.wav

# CTC decoding (faster, good for keyword spotting)
uv run scripts/inference.py --audio audio.wav --mode ctc
```
## Model Conversion

To convert from the original NeMo model:

```bash
# Install conversion dependencies
uv sync --extra convert

# Run conversion
uv run scripts/convert_nemo_to_coreml.py --output-dir ./model
```

This will:

1. Download the original model from NVIDIA (`nvidia/parakeet-tdt_ctc-110m`)
2. Convert each component to CoreML format
3. Extract vocabulary and create metadata
## File Structure

```
./
├── Preprocessor.mlpackage   # Audio → Mel spectrogram
├── Encoder.mlpackage        # Mel → Encoder features
├── CTCHead.mlpackage        # Encoder → CTC log probs
├── Decoder.mlpackage        # TDT prediction network
├── JointDecision.mlpackage  # TDT joint network
├── vocab.json               # Token vocabulary (1024 tokens)
├── metadata.json            # Model configuration
├── pyproject.toml           # Python dependencies
├── uv.lock                  # Locked dependencies
└── scripts/                 # Inference & conversion scripts
```
## Decoding Modes

### TDT Mode (Recommended for Transcription)

- Uses Token-and-Duration Transducer decoding
- Higher accuracy (17.97% WER)
- Predicts both tokens and durations
- Best for full transcription tasks
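The duration-prediction idea can be illustrated with a toy greedy loop (a conceptual sketch only; the real joint network, decoder state, and blank id live in the CoreML components):

```python
BLANK_ID = 0  # placeholder blank id for this toy example

def tdt_greedy(encoder_frames, joint):
    """Toy TDT-style greedy loop: the predicted duration decides how many
    encoder frames to skip, instead of stepping one frame at a time."""
    tokens = []
    last_token = BLANK_ID
    t = 0
    while t < len(encoder_frames):
        # `joint` stands in for the joint network: given an encoder frame
        # and the previous token, it returns (token_id, duration).
        token, duration = joint(encoder_frames[t], last_token)
        if token != BLANK_ID:
            tokens.append(token)
            last_token = token
        t += max(duration, 1)  # always advance at least one frame
    return tokens
```

Skipping frames via predicted durations is what lets TDT decoding evaluate the joint network far fewer times than a plain transducer loop that steps through every frame.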
### CTC Mode (Recommended for Keyword Spotting)

- Greedy CTC decoding
- Faster inference
- 100% keyword recall on Earnings22
- Best for detecting specific words/phrases
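Greedy CTC decoding reduces to taking the argmax token per frame, merging consecutive repeats, and dropping blanks. A minimal sketch (the blank id is passed in explicitly; the real value comes from the model's vocabulary):

```python
def ctc_greedy_collapse(frame_ids, blank_id):
    """Merge consecutive duplicate ids, then drop blanks."""
    out = []
    prev = None
    for tid in frame_ids:
        if tid != prev and tid != blank_id:
            out.append(tid)
        prev = tid
    return out
```

Note that a genuinely repeated token only survives the collapse because a blank frame separates its two occurrences.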
## Custom Vocabulary / Keyword Spotting

For keyword spotting, CTC mode with custom vocabulary boosting achieves 100% recall:

```python
import json

def is_subsequence(needle, haystack):
    """True if `needle` occurs in order (not necessarily contiguously) in `haystack`."""
    it = iter(haystack)
    return all(token in it for token in needle)

# Load custom vocabulary with token IDs
with open("custom_vocab.json") as f:
    keywords = json.load(f)  # {"keyword": [token_ids], ...}

# Run CTC decoding
tokens = model.decode_ctc(encoder_output)

# Check for keyword matches
for keyword, expected_ids in keywords.items():
    if is_subsequence(expected_ids, tokens):
        print(f"Found keyword: {keyword}")
```
## License

This model conversion is released under the Apache 2.0 License, the same license as the original NVIDIA model.

## Citation

If you use this model, please cite the original NVIDIA work:
```bibtex
@misc{nvidia_parakeet_tdt_ctc,
  title={Parakeet-TDT-CTC-110M},
  author={NVIDIA},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/nvidia/parakeet-tdt_ctc-110m}
}
```
## Acknowledgments

- Original model by [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)
- CoreML conversion by [FluidInference](https://github.com/FluidInference)