---
license: cc-by-4.0
language:
- en
tags:
- speech
- asr
- coreml
- parakeet
- transducer
base_model: nvidia/parakeet-tdt-0.6b-v2
---

# Parakeet TDT 0.6B v2 — CoreML INT4

CoreML conversion of [NVIDIA Parakeet-TDT 0.6B v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) with INT4-quantized encoder for Apple Neural Engine acceleration.

## Models

| Model | Description | Compute | Quantization |
|-------|-------------|---------|-------------|
| `encoder.mlmodelc` | FastConformer encoder (24L, 1024 hidden) | CPU + Neural Engine | INT4 palettized |
| `decoder.mlmodelc` | LSTM prediction network (2L, 640 hidden) | CPU + Neural Engine | FP16 |
| `joint.mlmodelc` | TDT dual-head joint (token + duration logits) | CPU + Neural Engine | FP16 |

## Additional Files

| File | Description |
|------|-------------|
| `vocab.json` | SentencePiece vocabulary (1024 tokens) |
| `config.json` | Model configuration |

## Notes

- **Mel preprocessing** is done in Swift using Accelerate/vDSP (not CoreML) because `torch.stft` tracing bakes audio length as a constant, breaking per-feature normalization for variable-length inputs.
- **Encoder** uses `EnumeratedShapes` (100–3000 mel frames, covering 1–30s audio) to avoid BNNS crashes with dynamic shapes.
- **Performance**: ~110× faster than real time (RTFx ≈ 110) on an M4 Pro via the Neural Engine.
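To illustrate the enumerated-shape constraint above, the helper below picks the smallest enumerated mel-frame length that fits an input. Both the 10 ms hop (160 samples at 16 kHz, typical for FastConformer front ends) and the 100-frame step between enumerated shapes are assumptions, not confirmed details of this conversion.

```swift
import Foundation

// Sketch: map an audio buffer onto one of the encoder's enumerated
// input lengths (100–3000 mel frames, covering 1–30 s).
// Assumptions: 160-sample hop at 16 kHz; shapes enumerated in steps of 100.
let enumeratedLengths = Array(stride(from: 100, through: 3000, by: 100))

func paddedFrameCount(sampleCount: Int, hop: Int = 160) -> Int? {
    let frames = sampleCount / hop
    // Returns nil for audio longer than the largest enumerated shape (> 30 s).
    return enumeratedLengths.first { $0 >= frames }
}
```

Inputs would then be zero-padded up to the returned length before being handed to the encoder.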

## Usage

Used by the `ParakeetASR` module of [qwen3-asr-swift](https://github.com/AufklarerStudios/qwen3-asr-swift):

```swift
let model = try await ParakeetASRModel.fromPretrained()
let text = try model.transcribeAudio(samples, sampleRate: 16000)
```