---
license: cc-by-4.0
language:
- en
tags:
- speech
- asr
- coreml
- parakeet
- transducer
base_model: nvidia/parakeet-tdt-0.6b-v2
---

# Parakeet TDT v3 — CoreML INT4

CoreML conversion of [NVIDIA Parakeet-TDT 0.6B v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) with an INT4-quantized encoder for Apple Neural Engine acceleration.

## Models

| Model | Description | Compute | Quantization |
|-------|-------------|---------|--------------|
| `encoder.mlmodelc` | FastConformer encoder (24L, 1024 hidden) | CPU + Neural Engine | INT4 palettized |
| `decoder.mlmodelc` | LSTM prediction network (2L, 640 hidden) | CPU + Neural Engine | FP16 |
| `joint.mlmodelc` | TDT dual-head joint (token + duration logits) | CPU + Neural Engine | FP16 |

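To illustrate how the three models fit together, here is a minimal sketch of greedy TDT decoding: at each encoder frame the joint produces token logits and duration logits, and the decoder advances by the predicted duration instead of one frame at a time. The `joint` closure, the `JointOutput` type, the duration bins `[0, 1, 2, 3, 4]`, and the per-frame emission cap are assumptions made for illustration, not the actual interface of the CoreML models or of the Swift module.

```swift
/// Output of one joint call: logits over vocabulary + blank, and logits
/// over the allowed frame durations. Hypothetical type for illustration.
struct JointOutput {
    var tokenLogits: [Float]
    var durationLogits: [Float]
}

/// Index of the largest element (the first one wins on ties).
func argmax(_ values: [Float]) -> Int {
    var best = 0
    for i in values.indices where values[i] > values[best] {
        best = i
    }
    return best
}

/// Greedy TDT decoding over `frameCount` encoder frames.
/// `joint(frame, lastToken)` stands in for the decoder + joint models;
/// `lastToken` is nil until the first token has been emitted.
func greedyTDTDecode(
    frameCount: Int,
    blankID: Int,
    durations: [Int] = [0, 1, 2, 3, 4],   // assumed duration bins
    maxSymbolsPerFrame: Int = 10,          // assumed safety cap
    joint: (_ frame: Int, _ lastToken: Int?) -> JointOutput
) -> [Int] {
    var tokens: [Int] = []
    var t = 0
    var emittedAtFrame = 0
    while t < frameCount {
        let out = joint(t, tokens.last)
        let token = argmax(out.tokenLogits)
        var skip = durations[argmax(out.durationLogits)]
        if token != blankID {
            tokens.append(token)
            emittedAtFrame += 1
        }
        // A blank with duration 0 would never advance, and too many
        // emissions on one frame could loop forever: force one frame.
        if (token == blankID && skip == 0) || emittedAtFrame >= maxSymbolsPerFrame {
            skip = max(skip, 1)
        }
        if skip > 0 { emittedAtFrame = 0 }
        t += skip
    }
    return tokens
}
```

The returned token IDs would then be mapped to text through `vocab.json`.
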
## Additional Files

| File | Description |
|------|-------------|
| `vocab.json` | SentencePiece vocabulary (1024 tokens) |
| `config.json` | Model configuration |

## Notes

- **Mel preprocessing** runs in Swift with Accelerate/vDSP rather than in CoreML, because tracing `torch.stft` bakes the audio length in as a constant, which breaks per-feature normalization for variable-length inputs.
- **Encoder** uses `EnumeratedShapes` (100–3000 mel frames, covering 1–30 s of audio) to avoid BNNS crashes with dynamic shapes.
- **Performance**: roughly 110x faster than real time on an M4 Pro using the Neural Engine.

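The first two notes can be sketched as follows: normalize each mel bin using only the frames that belong to the real audio, then zero-pad to one of the encoder's enumerated lengths. This is a plain-Swift sketch for readability rather than the Accelerate/vDSP code the module uses. The function names, the `mel[bin][frame]` layout, the epsilon, and the example list of allowed lengths are all assumptions; the actual enumerated shapes are not listed here.

```swift
/// Smallest allowed encoder length that fits `frameCount`,
/// or nil if the clip is longer than every allowed length.
func enumeratedLength(for frameCount: Int, allowed: [Int]) -> Int? {
    allowed.sorted().first { $0 >= frameCount }
}

/// Normalizes each mel bin to zero mean and unit variance using only the
/// first `validFrames` frames, then zero-pads every row to `paddedLength`.
/// Statistics come from the true length, so padding does not skew them.
func normalizePerFeature(
    _ mel: [[Float]],            // assumed layout: mel[bin][frame]
    validFrames: Int,
    paddedLength: Int,
    epsilon: Float = 1e-5        // assumed stabilizer
) -> [[Float]] {
    precondition(validFrames > 1 && validFrames <= paddedLength)
    return mel.map { row -> [Float] in
        precondition(row.count >= validFrames)
        let valid = row.prefix(validFrames)
        let mean = valid.reduce(Float(0), +) / Float(validFrames)
        let sumSq = valid.reduce(Float(0)) { $0 + ($1 - mean) * ($1 - mean) }
        // Unbiased standard deviation over the valid frames only.
        let std = (sumSq / Float(validFrames - 1)).squareRoot() + epsilon
        var out = valid.map { ($0 - mean) / std }
        out.append(contentsOf: [Float](repeating: 0, count: paddedLength - validFrames))
        return out
    }
}
```

A caller would first pick the target length, for example `enumeratedLength(for: 250, allowed: [100, 500, 3000])`, and pass it as `paddedLength`.
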
## Usage

Used by the `ParakeetASR` module of [qwen3-asr-swift](https://github.com/AufklarerStudios/qwen3-asr-swift):

```swift
let model = try await ParakeetASRModel.fromPretrained()
let text = try model.transcribeAudio(samples, sampleRate: 16000)
```

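If the audio starts out as 16-bit PCM, it may need converting before the call above. The sketch below assumes that `samples` is a mono `[Float]` buffer scaled to roughly [-1, 1]; the expected input type is not stated here, so check it against the module. `floatSamples(fromPCM16:)` is a hypothetical helper, not part of the library.

```swift
/// Converts 16-bit PCM samples to floats in [-1, 1).
/// Hypothetical helper; assumes the model expects normalized Float input.
func floatSamples(fromPCM16 pcm: [Int16]) -> [Float] {
    pcm.map { Float($0) / 32768.0 }
}
```
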