	---
	license: cc-by-4.0
	language:
	- en
	- es
	- it
	- de
	- fr
	- pt
	library_name: litert
	base_model: nvidia/parakeet-tdt-0.6b-v3
	tags:
	- automatic-speech-recognition
	- speech
	- audio
	- parakeet
	- tdt
	- litert
	- tflite
	- on-device
	- mobile
	- android
	- streaming
	pipeline_tag: automatic-speech-recognition
	---

	# Parakeet-TDT-0.6B-v3 — LiteRT (TFLite) port

	LiteRT (TFLite) port of
	[`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3),
	packaged for on-device inference (Android, macOS, embedded) with no Python
	or NeMo runtime dependency.

	For model capabilities, languages, training data, license, and benchmarks,
	see the upstream model card. This card only documents what's specific to the
	LiteRT port.

	## What's in this bundle

	\| File \| Size \| Purpose \|
	\|---\|---\|---\|
	\| `encoder_T1500.tflite` \| 1.15 GB \| FP16 encoder, fixed `T_mel = 1500` (15 s window) \|
	\| `decoder_step.tflite` \| 23 MB \| Single-step LSTM prediction network \|
	\| `joint_step.tflite` \| 12 MB \| TDT joint network (token + duration logits) \|
	\| `tokenizer.model` \| 353 KB \| SentencePiece BPE tokenizer (vocab=8192) \|
	\| `manifest.json` \| — \| All metadata the runtime needs \|

	Total: ~1.18 GB (FP16). FP32 reference is ~2.37 GB.

	## Encoder I/O contract

	```
	inputs:
	audio_signal : float32 [1, 128, 1500] # log-mel features (NeMo preproc)
	length : int32 [1] # actual mel frames used (≤ 1500)
	outputs:
	encoded : float32 [1, 1024, 188] # 188 = ceil(1500 / 8)
	encoded_lengths : int32 [1]
	```

	Pad shorter inputs with zeros at the tail (the encoder was trained with
	audio anchored at position 0; left-padding causes hallucinations) and pass
	the true length.
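
The padding rule above can be sketched in plain Python. This is a minimal illustration, not the bundle's runtime code: the helper name `pad_mel_tail` is hypothetical, and the features are assumed to arrive as a `[128][T]` list of rows rather than a tensor.

```python
from typing import List, Tuple

N_MELS = 128   # mel bins, from the encoder I/O contract
T_MEL = 1500   # fixed encoder window in mel frames


def pad_mel_tail(mel: List[List[float]], t_mel: int = T_MEL) -> Tuple[List[List[float]], int]:
    """Right-pad a [n_mels][T] feature matrix with zeros to [n_mels][t_mel].

    Returns the padded matrix and the true frame count to pass as `length`.
    """
    if len(mel) != N_MELS:
        raise ValueError(f"expected {N_MELS} mel bins, got {len(mel)}")
    true_len = len(mel[0])
    if any(len(row) != true_len for row in mel):
        raise ValueError("all mel rows must have the same number of frames")
    if true_len > t_mel:
        raise ValueError(f"{true_len} frames exceed the {t_mel}-frame window")
    # Zeros go at the tail only, so the audio stays anchored at position 0.
    padded = [row + [0.0] * (t_mel - true_len) for row in mel]
    return padded, true_len
```

A 4.2 s clip (420 mel frames) would come back as a 1500-frame matrix plus `420`, which is the value to feed into the `length` input.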

	The 1500-mel bucket covers ≤ 15 s of audio. For long-form input, run the
	encoder in a sliding-window streaming loop — see "Streaming usage" below.

	Why int32, not int64. LiteRT's GPU/NPU delegates (LiteRT-CL / OpenCL,
	NPU accelerator) reject int64 tensors entirely. With int64 length, every
	internal CAST node touching it falls back to CPU, and `CompiledModel.create()`
	fails outright on Android with the GPU backend. This bundle is exported with
	int32 length end-to-end (input → internal mask arange/comparisons → output
	`encoded_lengths`). int32 covers more than 2 billion mel frames, roughly
	6,000 hours of audio at 100 mel frames per second, so there is no practical
	range loss.
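
As a quick check of the int32 headroom, the arithmetic can be worked through directly. The only assumption is the mel frame rate of 100 frames per second, which follows from a 160-sample hop at 16 kHz.

```python
INT32_MAX = 2**31 - 1
MEL_FRAMES_PER_SECOND = 100  # 16 kHz sample rate / hop_length 160

# Longest audio duration whose mel frame count still fits in an int32.
max_seconds = INT32_MAX / MEL_FRAMES_PER_SECOND
max_hours = max_seconds / 3600
max_days = max_hours / 24

print(f"{max_hours:,.0f} hours (~{max_days:.0f} days)")
```

A single 15 s window uses at most 1500 of those frames, so the int32 length never comes close to overflowing.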

	## Why a single bucket and not multi-signature

	An earlier revision shipped a multi-signature encoder with 4 buckets
	(300/500/700/1500) sharing weights inside one `.tflite`. The disk savings
	were real (~1.2 GB instead of 4.8 GB for 4 separate files), but on Android
	the LiteRT `CompiledModel.create()` API prepares every signature's
	subgraph at load time — each one going through the full delegate-partition
	pass. With 4 signatures × ~7 s of XNNPACK / GPU partition prep, app cold
	start was ~28 s.

	A single-bucket file is one subgraph: ~7 s init, then ready. If you need
	multiple bucket sizes for latency reasons, ship them as separate `.tflite`
	files (TFLite has no cross-file weight sharing) and load on demand.
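
If you do ship several single-bucket files, picking one per request might look like the sketch below. Only `encoder_T1500.tflite` exists in this bundle; the other bucket sizes and the `encoder_T{N}.tflite` naming pattern are hypothetical, borrowed from the earlier multi-signature revision.

```python
from typing import Optional, Sequence


def pick_bucket(n_mel_frames: int, buckets: Sequence[int]) -> Optional[int]:
    """Return the smallest bucket that fits n_mel_frames, or None if none does."""
    fitting = [b for b in sorted(buckets) if b >= n_mel_frames]
    return fitting[0] if fitting else None


def encoder_filename(bucket: int) -> str:
    # Hypothetical naming pattern, modeled on `encoder_T1500.tflite`.
    return f"encoder_T{bucket}.tflite"
```

A runtime could load the chosen file lazily and cache the compiled model, so each bucket pays its one-time init cost only when it is first used.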

	## Decoder + joint contract

	```
	decoder_step:
	inputs: token int64 [1,1], h float32 [2,1,640], c float32 [2,1,640]
	outputs: g float32 [1,1,640], h float32 [2,1,640], c float32 [2,1,640]

	joint_step:
	inputs: enc_frame float32 [1,1024,1], pred_frame float32 [1,640,1]
	outputs: logits float32 [1,1,1,8198]
	# logits[..., 0:8193] → token logits (8192 BPE + 1 blank)
	# logits[..., 8193:8198] → duration logits over [0,1,2,3,4]
	```

	`decoder_step.token` is `int64` because it's an embedding lookup; that op
	runs on CPU regardless of delegate, so int64 there is harmless.

	Greedy TDT decoding (per encoder frame):

	1. Run joint with the current `enc_frame` and, as `pred_frame`, the latest decoder output `g`.
	2. `token = argmax(token_logits)`; `dur = durations[argmax(duration_logits)] ∈ {0,1,2,3,4}`.
	3. If `token != blank_id (8192)`: emit token, advance `dur` encoder frames,
	re-prime decoder with the emitted token (h, c update).
	4. Else: advance `max(dur, 1)` encoder frames; do not advance the decoder.
	5. Repeat until all `encoded_lengths` encoder frames are consumed.

	Cap at ~10 non-blank emissions per encoder frame to guard against the
	pathological `dur=0` decode loop.
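
The greedy loop above can be sketched as follows. This is an illustration under stated assumptions, not the reference decoder: `joint` and `decoder` are hypothetical callables standing in for invocations of `joint_step.tflite` and `decoder_step.tflite`, the prediction network is assumed to be primed with the blank id as a start token, and forcing a one-frame advance when the emission cap is hit is one possible way to apply the guard.

```python
from typing import Any, Callable, List, Tuple

VOCAB = 8192                 # BPE tokens
BLANK_ID = 8192              # blank follows the BPE tokens
DURATIONS = [0, 1, 2, 3, 4]  # frames to skip, indexed by duration argmax
MAX_SYMBOLS_PER_FRAME = 10   # guard against a dur=0 loop


def argmax(xs: List[float]) -> int:
    return max(range(len(xs)), key=xs.__getitem__)


def greedy_tdt(
    enc_len: int,
    joint: Callable[[int, Any], List[float]],       # (frame index, pred_frame) -> 8198 logits
    decoder: Callable[[int, Any], Tuple[Any, Any]], # (token, state) -> (pred_frame, new state)
    init_state: Any,
) -> List[int]:
    tokens: List[int] = []
    # Assumption: prime the prediction network with blank as the start token.
    pred, state = decoder(BLANK_ID, init_state)
    t = 0
    emitted_here = 0
    while t < enc_len:
        logits = joint(t, pred)
        token = argmax(logits[: VOCAB + 1])
        dur = DURATIONS[argmax(logits[VOCAB + 1:])]
        if token != BLANK_ID:
            tokens.append(token)
            pred, state = decoder(token, state)  # h, c update
            emitted_here += 1
            if dur == 0 and emitted_here >= MAX_SYMBOLS_PER_FRAME:
                dur = 1  # cap reached: force progress
        else:
            dur = max(dur, 1)  # blank always advances; decoder untouched
        if dur > 0:
            t += dur
            emitted_here = 0
    return tokens
```

Keeping the model calls behind two callables lets the same loop be exercised with stubs before it is wired to the real interpreters.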

	## Audio preprocessing

	LiteRT itself does not produce mel features — your runtime must compute
	them. Match NeMo's preprocessor exactly:

	```
	sample_rate : 16000 Hz (resample if needed)
	n_fft : 512
	hop_length : 160 → 100 mel frames / second
	win_length : 400
	n_mels : 128
	preemph : 0.97
	log : log(mel + 1e-5), per-feature normalized
	mel_scale : slaney
	```
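
Two of the settings listed above, pre-emphasis and the log plus per-feature normalization, can be sketched in plain Python. The STFT and mel filterbank are omitted. The function names are hypothetical, and the normalization details (sample standard deviation with N-1, a small epsilon added to it) are assumptions that should be checked against the output of NeMo's preprocessor.

```python
import math
from typing import List

PREEMPH = 0.97
LOG_GUARD = 1e-5


def preemphasize(samples: List[float], coeff: float = PREEMPH) -> List[float]:
    """y[0] = x[0]; y[t] = x[t] - coeff * x[t-1]."""
    if not samples:
        return []
    return [samples[0]] + [
        samples[t] - coeff * samples[t - 1] for t in range(1, len(samples))
    ]


def log_and_normalize(mel_power: List[List[float]], eps: float = 1e-5) -> List[List[float]]:
    """Apply log(mel + 1e-5), then normalize each mel bin over time."""
    out = []
    for row in mel_power:
        logs = [math.log(v + LOG_GUARD) for v in row]
        mean = sum(logs) / len(logs)
        # Assumption: unbiased (N-1) variance, epsilon added to the std.
        var = sum((v - mean) ** 2 for v in logs) / max(len(logs) - 1, 1)
        std = math.sqrt(var) + eps
        out.append([(v - mean) / std for v in logs])
    return out
```

Statistics should be computed over the valid frames only, before any tail padding is added; otherwise the zero padding shifts the mean and variance.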

	Encoder frame rate after the 8× subsampler: 12.5 fps (1 enc frame = 80 ms).

	## Streaming usage

	This bundle supports chunked streaming inference using a left+chunk+right
	context window that fits inside 15 s. A reference Python implementation is
	in the upstream repo (`transcribe_litert_streaming.py`). Recommended config
	for Android UX:

	\| Knob \| Value \| Reason \|
	\|---\|---\|---\|
	\| `chunk_seconds` \| 5 \| committed per step \|
	\| `left_context_seconds` \| 5 \| past-side context for the encoder \|
	\| `right_context_seconds` \| 2 \| future-side context (lookahead); latency ≈ chunk + right = 7 s \|
	\| `window total` \| 12 s \| (5 + 5 + 2) × 100 = 1200 mel ≤ 1500 \|
	\| `carry_state` \| false \| offline-trained model; carrying LSTM state across chunks tends to hurt \|
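
The window layout in the table can be sketched as a small planner that says which mel frames to feed the encoder for each chunk and which encoder frames to keep. This is a minimal sketch under assumptions, not the reference script: the `Window` type and `plan_windows` name are hypothetical, and commit boundaries are rounded up to whole encoder frames, so they can be off by less than one encoder frame (80 ms).

```python
from dataclasses import dataclass
from typing import List

MEL_FPS = 100     # mel frames per second
SUBSAMPLING = 8   # encoder subsampling factor
T_MEL = 1500      # fixed encoder window in mel frames


@dataclass
class Window:
    mel_start: int     # first mel frame fed to the encoder
    mel_end: int       # one past the last mel frame fed
    commit_start: int  # first encoder frame (window-relative) to keep
    commit_end: int    # one past the last encoder frame to keep


def _ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def plan_windows(total_mel: int, chunk_s: float = 5.0,
                 left_s: float = 5.0, right_s: float = 2.0) -> List[Window]:
    chunk = int(round(chunk_s * MEL_FPS))
    left = int(round(left_s * MEL_FPS))
    right = int(round(right_s * MEL_FPS))
    if left + chunk + right > T_MEL:
        raise ValueError("left + chunk + right must fit in the encoder window")
    windows = []
    for chunk_start in range(0, total_mel, chunk):
        chunk_end = min(chunk_start + chunk, total_mel)
        # Clamp the context at the edges of the recording.
        mel_start = max(0, chunk_start - left)
        mel_end = min(total_mel, chunk_end + right)
        windows.append(Window(
            mel_start=mel_start,
            mel_end=mel_end,
            commit_start=_ceil_div(chunk_start - mel_start, SUBSAMPLING),
            commit_end=_ceil_div(chunk_end - mel_start, SUBSAMPLING),
        ))
    return windows
```

Each window's mel slice starts at index 0 of the encoder input and is zero-padded at the tail, and only the tokens decoded from the `commit_start..commit_end` encoder frames are appended to the transcript.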

	We measured ~27 % WER on multilingual long-form audio (EN/ES/IT
	code-switching) with this config, and ~22 % WER on clean English clips of
	≤ 15 s decoded offline.

	## Quantization

	- All `.tflite` weights are FP16. Activations remain FP32.
	- Token output is identical to the upstream FP32 model on a 99-clip eval
	set.

	## Conversion provenance

	Built from upstream `nvidia/parakeet-tdt-0.6b-v3.nemo` via:

	1. NeMo → torch.export ExportedProgram (per encoder/decoder/joint module).
	2. ExportedProgram → TFLite via
	[`litert-torch`](https://github.com/google-ai-edge/LiteRT) 0.8.0.
	3. FP32 → FP16 via `ai_edge_quantizer` `FLOAT_CASTING` algorithm on
	FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.

	Several NeMo internals required export-time monkey-patches:

	- `MaskedConvSequential.{forward,_create_mask}` and `apply_channel_mask` — to
	remove `.expand(...)` patterns rejected by the TFLite broadcast checker.
	- `RelPositionMultiHeadAttentionLongformer._get_invalid_locations_mask` — to
	build masks in `bool` instead of `uint8` (litert-torch has no uint8
	lowering).
	- `ConformerEncoder.{forward_internal,_create_masks}` and
	`MaskedConvSequential.{forward,_create_mask}` — to keep the entire length
	pipeline in `int32` instead of NeMo's default `int64`, so LiteRT's
	GPU/NPU delegates can compile the graph without falling back to CPU.

	## Limitations

	1. Audio at position 0. The encoder expects audio anchored at the start
	of its input window. Padding before the audio causes hallucinations.
	2. 15 s max per call. Use the streaming chunker for longer clips.
	3. No VAD or diarization. Pair with an external VAD or a diarizer
	(e.g. Sortformer) for speaker-attributed transcripts.
	4. Multilingual but no language token. Code-switching works, but the
	model doesn't emit a language ID. Run a separate classifier if you need it.

	## License

	Inherits the upstream `nvidia/parakeet-tdt-0.6b-v3` license (CC-BY-4.0).

	## Citation

	```bibtex
	@misc{nvidia_parakeet_tdt_0_6b_v3,
	title = {Parakeet-TDT-0.6B-v3},
	author = {NVIDIA},
	year = {2025},
	url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3},
	}
	```