	---
	license: cc-by-4.0
	language:
	- en
	tags:
	- speech
	- asr
	- coreml
	- parakeet
	- transducer
	base_model: nvidia/parakeet-tdt-0.6b-v2
	---

	# Parakeet TDT v3 — CoreML INT4

	CoreML conversion of [NVIDIA Parakeet-TDT 0.6B v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) with an INT4-quantized encoder for Apple Neural Engine acceleration.

	## Models

	| Model | Description | Compute | Quantization |
	|-------|-------------|---------|--------------|
	| `encoder.mlmodelc` | FastConformer encoder (24 layers, 1024 hidden) | CPU + Neural Engine | INT4 palettized |
	| `decoder.mlmodelc` | LSTM prediction network (2 layers, 640 hidden) | CPU + Neural Engine | FP16 |
	| `joint.mlmodelc` | TDT dual-head joint (token + duration logits) | CPU + Neural Engine | FP16 |
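To illustrate how the token and duration heads of the joint network work together, here is a minimal sketch of greedy TDT decoding. It is not the code used by any particular runtime. The `joint` closure is a hypothetical stand-in for running the decoder and joint models, the duration bins `[0, 1, 2, 3, 4]` and the per-frame symbol cap are assumptions, and a real implementation would carry LSTM state instead of passing only the last token.

```swift
/// Index of the largest element; returns 0 for an empty array.
func argmax(_ xs: [Float]) -> Int {
    var best = 0
    for i in xs.indices where xs[i] > xs[best] { best = i }
    return best
}

/// Greedy TDT decoding over `frameCount` encoder frames.
/// `joint` (hypothetical) returns token logits and duration logits for a given
/// encoder frame and the last emitted token (nil before the first emission).
func greedyTDT(
    frameCount: Int,
    blankID: Int,
    durations: [Int] = [0, 1, 2, 3, 4],   // assumed duration bins
    maxSymbolsPerFrame: Int = 10,          // assumed safety cap
    joint: (_ frame: Int, _ lastToken: Int?) -> (token: [Float], duration: [Float])
) -> [Int] {
    var tokens: [Int] = []
    var t = 0
    var emittedHere = 0
    while t < frameCount {
        let out = joint(t, tokens.last)
        let token = argmax(out.token)
        var skip = durations[argmax(out.duration)]
        if token != blankID {
            tokens.append(token)
            emittedHere += 1
        }
        // Force progress on a blank with zero duration, or after too many
        // symbols have been emitted at the same frame.
        if skip == 0 && (token == blankID || emittedHere >= maxSymbolsPerFrame) {
            skip = 1
        }
        if skip > 0 { emittedHere = 0 }
        t += skip
    }
    return tokens
}
```

The duration head is what lets the decoder jump several encoder frames at once instead of advancing one frame per blank, as a plain transducer would.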

	## Additional Files

	| File | Description |
	|------|-------------|
	| `vocab.json` | SentencePiece vocabulary (1024 tokens) |
	| `config.json` | Model configuration |
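As a rough sketch of how decoded token IDs could be turned back into text with the vocabulary file: the JSON layout assumed here (an object mapping ID strings to pieces) is a guess and may not match the actual `vocab.json`, and `loadVocab` and `detokenize` are hypothetical helper names. The only fixed convention used is that SentencePiece marks word boundaries with "▁" (U+2581).

```swift
import Foundation

/// Parses a vocabulary stored as {"0": "<unk>", "1": "▁the", ...}.
/// This layout is an assumption; adjust it if the real file differs.
func loadVocab(from data: Data) throws -> [Int: String] {
    let raw = try JSONDecoder().decode([String: String].self, from: data)
    var vocab: [Int: String] = [:]
    for (key, piece) in raw {
        if let id = Int(key) { vocab[id] = piece }
    }
    return vocab
}

/// Joins SentencePiece pieces and converts the word-boundary marker into spaces.
/// IDs missing from the vocabulary (for example, blank) are skipped.
func detokenize(_ ids: [Int], vocab: [Int: String]) -> String {
    let joined = ids.compactMap { vocab[$0] }.joined()
    return joined
        .replacingOccurrences(of: "\u{2581}", with: " ")
        .trimmingCharacters(in: .whitespaces)
}
```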

	## Notes

	- Mel preprocessing is done in Swift using Accelerate/vDSP (not CoreML) because tracing `torch.stft` bakes the audio length in as a constant, which breaks per-feature normalization for variable-length inputs.
	- The encoder uses `EnumeratedShapes` (100–3000 mel frames, covering 1–30 s of audio) to avoid BNNS crashes with dynamic shapes.
	- Performance: roughly 110x faster than real time (RTFx ≈ 110) on an M4 Pro using the Neural Engine.
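The first two notes can be sketched in plain Swift: normalize each mel bin over the real frames only, then zero-pad to the smallest enumerated length that fits. This is an illustration rather than the project's implementation, which uses Accelerate/vDSP. The `[bin][frame]` layout, the unbiased variance with a `1e-5` offset, and the example length list are all assumptions, since the source only states the 100–3000 range.

```swift
/// Normalizes each mel bin to zero mean and unit variance over its valid frames.
/// `mel` is laid out as [bin][frame] (assumed layout).
func normalizePerFeature(_ mel: [[Float]], epsilon: Float = 1e-5) -> [[Float]] {
    return mel.map { (row: [Float]) -> [Float] in
        guard row.count > 1 else { return row.map { _ in 0 } }
        let n = Float(row.count)
        let mean = row.reduce(Float(0), +) / n
        let sumSquares = row.reduce(Float(0)) { $0 + ($1 - mean) * ($1 - mean) }
        // Unbiased variance plus a small offset on the standard deviation (assumed).
        let std = (sumSquares / (n - 1)).squareRoot() + epsilon
        return row.map { ($0 - mean) / std }
    }
}

/// Returns the smallest enumerated length that fits `frames`, or nil if none does.
func bucketLength(for frames: Int, in enumerated: [Int]) -> Int? {
    return enumerated.sorted().first(where: { $0 >= frames })
}

/// Zero-pads every mel bin along the frame axis up to `length`.
func padFrames(_ mel: [[Float]], to length: Int) -> [[Float]] {
    return mel.map { row in
        row + [Float](repeating: 0, count: max(0, length - row.count))
    }
}

// Example with a hypothetical set of enumerated lengths.
let exampleLengths = [100, 500, 1000, 2000, 3000]
let exampleBucket = bucketLength(for: 120, in: exampleLengths)
```

Normalizing before padding keeps the statistics tied to the actual audio length, which is the property that is lost when the length is baked in as a constant at trace time.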

	## Usage

	Used by the `ParakeetASR` module of [qwen3-asr-swift](https://github.com/AufklarerStudios/qwen3-asr-swift):

	```swift
	let model = try await ParakeetASRModel.fromPretrained()
	let text = try model.transcribeAudio(samples, sampleRate: 16000)
	```
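A possible way to prepare `samples` for the call above, assuming it expects mono `[Float]` audio in the range [-1, 1) at 16 kHz (the source does not state the expected type). Because the encoder shapes cover at most about 30 s of audio, the sketch also splits longer input into chunks. Whether the library already does this internally is unknown, and `floatSamples` and `chunk` are hypothetical helpers, not part of `ParakeetASR`.

```swift
/// Converts 16-bit PCM samples to floats in [-1, 1).
func floatSamples(fromPCM16 pcm: [Int16]) -> [Float] {
    return pcm.map { Float($0) / 32768 }
}

/// Splits samples into consecutive chunks of at most `maxSeconds` each.
func chunk(_ samples: [Float], sampleRate: Int = 16000, maxSeconds: Int = 30) -> [[Float]] {
    let size = sampleRate * maxSeconds
    precondition(size > 0, "sampleRate and maxSeconds must be positive")
    return stride(from: 0, to: samples.count, by: size).map { start in
        Array(samples[start..<min(start + size, samples.count)])
    }
}
```

Each chunk could then be passed to `transcribeAudio` in turn. Cutting at fixed boundaries can split words, so a real pipeline would likely cut at silences or overlap the chunks.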