---
license: cc-by-4.0
language:
- en
tags:
- speech
- asr
- coreml
- parakeet
- transducer
base_model: nvidia/parakeet-tdt-0.6b-v2
---

# Parakeet TDT v3 — CoreML INT4

CoreML conversion of [NVIDIA Parakeet-TDT 0.6B v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) with an INT4-quantized encoder for Apple Neural Engine acceleration.

## Models

| Model | Description | Compute | Quantization |
|-------|-------------|---------|--------------|
| `encoder.mlmodelc` | FastConformer encoder (24 layers, 1024 hidden) | CPU + Neural Engine | INT4 palettized |
| `decoder.mlmodelc` | LSTM prediction network (2 layers, 640 hidden) | CPU + Neural Engine | FP16 |
| `joint.mlmodelc` | TDT dual-head joint (token + duration logits) | CPU + Neural Engine | FP16 |

## Additional Files

| File | Description |
|------|-------------|
| `vocab.json` | SentencePiece vocabulary (1024 tokens) |
| `config.json` | Model configuration |

## Notes

- **Mel preprocessing** is done in Swift using Accelerate/vDSP (not CoreML), because tracing `torch.stft` bakes the audio length in as a constant, which breaks per-feature normalization for variable-length inputs.
- **Encoder** uses `EnumeratedShapes` (100–3000 mel frames, covering 1–30 s of audio) to avoid BNNS crashes with dynamic shapes.
- **Performance**: ~110x real-time factor (RTF) on an M4 Pro via the Neural Engine.

## Usage

Used by the `ParakeetASR` module of [qwen3-asr-swift](https://github.com/AufklarerStudios/qwen3-asr-swift):

```swift
let model = try await ParakeetASRModel.fromPretrained()
let text = try model.transcribeAudio(samples, sampleRate: 16000)
```
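The per-feature normalization mentioned in the Notes — the reason mel preprocessing lives in Swift rather than in the traced CoreML graph — can be sketched with `vDSP_normalize`, which subtracts the mean and divides by the standard deviation in one pass. The function name, matrix layout, and bin count below are illustrative assumptions, not the actual qwen3-asr-swift code:

```swift
import Accelerate

/// Sketch: normalize each mel bin independently across time.
/// Assumes `mels` is a flattened [binCount x frameCount] log-mel matrix,
/// row-major (each bin's frames are contiguous). Illustrative only.
func normalizePerFeature(_ mels: inout [Float], binCount: Int, frameCount: Int) {
    mels.withUnsafeMutableBufferPointer { buf in
        for bin in 0..<binCount {
            let row = buf.baseAddress! + bin * frameCount
            var mean: Float = 0
            var std: Float = 0
            // In-place zero-mean / unit-variance normalization over this bin.
            vDSP_normalize(row, 1, row, 1, &mean, &std, vDSP_Length(frameCount))
        }
    }
}
```

Because the frame count is an ordinary runtime parameter here, the statistics adapt to each clip's true length instead of a length baked in at trace time.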
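With `EnumeratedShapes`, the encoder accepts only a fixed list of frame counts, so inputs must be zero-padded up to the nearest enumerated size before inference. A minimal sketch of that selection and padding — the shape list here is an assumption for illustration, not the model's actual enumeration:

```swift
/// Illustrative enumerated sizes; the real model enumerates shapes
/// somewhere in the 100-3000 frame range.
let enumeratedFrameCounts = [100, 200, 400, 800, 1600, 3000]

/// Smallest enumerated size that fits; clamp to the maximum otherwise.
func paddedFrameCount(for frames: Int) -> Int {
    enumeratedFrameCounts.first { $0 >= frames } ?? enumeratedFrameCounts.last!
}

/// Zero-pad each mel bin's time axis out to `target` frames.
func padMels(_ mels: [[Float]], to target: Int) -> [[Float]] {
    mels.map { row in
        row + [Float](repeating: 0, count: max(0, target - row.count))
    }
}
```

Padding to a handful of fixed sizes trades a little wasted compute for stable Neural Engine execution, avoiding the BNNS crashes seen with fully dynamic shapes.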
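The three-model split (encoder, decoder, joint) implies a greedy TDT decoding loop on the host: the joint's duration head tells the loop how many encoder frames to skip, which is where TDT's speedup over frame-by-frame RNN-T decoding comes from. Below is a simplified sketch of that loop; the `step` closure stands in for the actual decoder + joint CoreML calls, and the blank ID and duration table are illustrative assumptions:

```swift
/// Sketch of greedy TDT decoding. `step` wraps the (hypothetical)
/// decoder + joint invocations and returns the argmax token and the
/// argmax index into the duration table for encoder frame `t`.
func greedyTDTDecode(
    frameCount: Int,
    blankId: Int,
    durations: [Int],
    step: (_ t: Int, _ lastToken: Int) -> (token: Int, durationIndex: Int)
) -> [Int] {
    var tokens: [Int] = []
    var t = 0
    while t < frameCount {
        let (token, dIdx) = step(t, tokens.last ?? blankId)
        if token != blankId { tokens.append(token) }
        // TDT: jump ahead by the predicted duration instead of visiting
        // every frame. Simplified here to always advance at least one
        // frame so the loop is guaranteed to terminate.
        t += max(1, durations[dIdx])
    }
    return tokens
}
```

In the full algorithm a duration of 0 lets the model emit several tokens on one frame (with a max-symbols guard); the simplification above drops that case for clarity.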