Parakeet TDT v3 — CoreML INT4

CoreML conversion of NVIDIA Parakeet-TDT 0.6B v3 with an INT4-quantized encoder for Apple Neural Engine acceleration.

Models

| Model | Description | Compute | Quantization |
|---|---|---|---|
| `encoder.mlmodelc` | FastConformer encoder (24 layers, 1024 hidden) | CPU + Neural Engine | INT4 palettized |
| `decoder.mlmodelc` | LSTM prediction network (2 layers, 640 hidden) | CPU + Neural Engine | FP16 |
| `joint.mlmodelc` | TDT dual-head joint (token + duration logits) | CPU + Neural Engine | FP16 |

| File | Description |
|---|---|
| `vocab.json` | SentencePiece vocabulary (1024 tokens) |
| `config.json` | Model configuration |
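
The joint network listed above emits two sets of logits per call: one over tokens and one over durations. A minimal sketch of how a greedy TDT decoder might consume them follows. It is not taken from the ParakeetASR source: `JointOutput`, `greedyTDTDecode`, the `joint` closure, the duration set `[0, 1, 2, 3, 4]`, and the per-frame symbol cap are all assumptions made for illustration.

```swift
/// Output of one joint-network call (hypothetical type for this sketch).
struct JointOutput {
    var tokenLogits: [Float]     // one logit per token, including the blank
    var durationLogits: [Float]  // one logit per entry in `durations`
}

/// Index of the largest element; the array is assumed to be non-empty.
func argmax(_ values: [Float]) -> Int {
    var best = 0
    for i in values.indices where values[i] > values[best] { best = i }
    return best
}

/// Greedy TDT decoding over `frameCount` encoder frames.
/// `joint` stands in for decoder + joint inference: it receives the current
/// frame index and the tokens emitted so far.
func greedyTDTDecode(
    frameCount: Int,
    blankID: Int,
    durations: [Int] = [0, 1, 2, 3, 4],  // assumed duration set
    maxSymbolsPerFrame: Int = 10,        // assumed safety cap
    joint: (Int, [Int]) -> JointOutput
) -> [Int] {
    var tokens: [Int] = []
    var frame = 0
    var symbolsAtFrame = 0
    while frame < frameCount {
        let out = joint(frame, tokens)
        let token = argmax(out.tokenLogits)
        var skip = durations[argmax(out.durationLogits)]
        if token != blankID {
            tokens.append(token)
            symbolsAtFrame += 1
        }
        // A blank, or too many symbols on one frame, must still move time forward.
        if skip == 0 && (token == blankID || symbolsAtFrame >= maxSymbolsPerFrame) {
            skip = 1
        }
        if skip > 0 { symbolsAtFrame = 0 }
        frame += skip
    }
    return tokens
}
```

The duration head is what lets the decoder jump several encoder frames at once instead of stepping one frame at a time, which is where much of TDT's speed advantage over a plain transducer comes from.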

Mel preprocessing runs in Swift with Accelerate/vDSP rather than in CoreML: tracing `torch.stft` bakes the audio length in as a constant, which breaks per-feature normalization for variable-length inputs.
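
To illustrate why the audio length matters here, the sketch below normalizes each mel bin over only the valid frames of an utterance. It is an assumption-laden illustration, not the module's code: `normalizePerFeature` is a hypothetical name, the `[bin][frame]` layout, sample standard deviation, and `epsilon` value are assumed, and plain Swift loops are used instead of vDSP so the block stays self-contained.

```swift
/// Normalizes each mel bin to zero mean and unit variance over the valid frames.
/// `mel` is laid out as [bin][frame]; frames at or beyond `validFrames` are
/// treated as padding and set to zero.
func normalizePerFeature(
    _ mel: [[Float]],
    validFrames: Int,
    epsilon: Float = 1e-5  // assumed stabilizer added to the standard deviation
) -> [[Float]] {
    precondition(validFrames > 1, "need at least two frames to estimate variance")
    return mel.map { row in
        let valid = row.prefix(validFrames)
        let mean = valid.reduce(Float(0), +) / Float(validFrames)
        let sumSquares = valid.reduce(Float(0)) { $0 + ($1 - mean) * ($1 - mean) }
        let std = (sumSquares / Float(validFrames - 1)).squareRoot() + epsilon
        return row.enumerated().map { index, value in
            index < validFrames ? (value - mean) / std : 0
        }
    }
}
```

Because the mean and standard deviation depend on `validFrames`, a graph that froze the length at trace time would compute the wrong statistics for any clip of a different duration.
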
The encoder uses EnumeratedShapes (100–3000 mel frames, covering 1–30 s of audio) to avoid the BNNS crashes seen with fully dynamic shapes.
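
With enumerated shapes, the caller has to pick one of the allowed input lengths and pad the mel features up to it. A minimal sketch of that selection step is shown below; the bucket list and the `frameBucket` helper are hypothetical, since the actual enumerated sizes are fixed at conversion time and are not listed here.

```swift
/// Hypothetical bucket list; the real enumerated frame counts may differ.
let enumeratedFrameCounts = [100, 200, 500, 1000, 1500, 2000, 3000]

/// Returns the smallest allowed frame count that fits `frames`,
/// or nil when the input is longer than the largest bucket.
func frameBucket(for frames: Int, in buckets: [Int]) -> Int? {
    return buckets.sorted().first { $0 >= frames }
}
```

For example, a 7.5 s clip at 100 frames per second yields 750 frames and would be padded to the 1000-frame bucket under this list, while anything above 3000 frames would need to be split into chunks first.
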
Performance: roughly 110x faster than real time on an M4 Pro via the Neural Engine, so a 30 s clip transcribes in about 0.27 s.

Used by the ParakeetASR module of qwen3-asr-swift:

```swift
// Load the model, then transcribe audio samples at a 16 kHz sample rate.
let model = try await ParakeetASRModel.fromPretrained()
let text = try model.transcribeAudio(samples, sampleRate: 16000)
```
