---
license: cc-by-4.0
language:
- en
- es
- it
- de
- fr
- pt
library_name: litert
base_model: nvidia/parakeet-tdt-0.6b-v3
tags:
- automatic-speech-recognition
- speech
- audio
- parakeet
- tdt
- litert
- tflite
- on-device
- mobile
- android
- streaming
pipeline_tag: automatic-speech-recognition
---

# Parakeet-TDT-0.6B-v3: LiteRT (TFLite) port

LiteRT (TFLite) port of [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3), packaged for on-device inference (Android / Mac / embedded) without a Python or NeMo runtime dependency.

For **model capabilities, languages, training data, license, and benchmarks**, see the upstream model card. This card only documents what's specific to the LiteRT port.

## What's in this bundle

| File | Size | Purpose |
|---|---|---|
| `encoder_T1500.tflite` | 1.15 GB | FP16 encoder, fixed `T_mel = 1500` (15 s window) |
| `decoder_step.tflite` | 23 MB | Single-step LSTM prediction network |
| `joint_step.tflite` | 12 MB | TDT joint network (token + duration logits) |
| `tokenizer.model` | 353 KB | SentencePiece BPE tokenizer (vocab=8192) |
| `manifest.json` | - | All metadata the runtime needs |

Total: **~1.18 GB** (FP16). FP32 reference is ~2.37 GB.

## Encoder I/O contract

```
inputs:
  audio_signal    : float32 [1, 128, 1500]   # log-mel features (NeMo preproc)
  length          : int32   [1]              # actual mel frames used (≤ 1500)
outputs:
  encoded         : float32 [1, 1024, 188]   # 188 = ceil(1500 / 8)
  encoded_lengths : int32   [1]
```

Pad shorter inputs with zeros at the **tail** (the encoder was trained with audio anchored at position 0; left-padding causes hallucinations) and pass the true length.

The 1500-mel bucket covers ≤ 15 s of audio. For long-form input, run the encoder in a sliding-window streaming loop; see "Streaming usage" below.

**Why int32, not int64.** LiteRT's GPU/NPU delegates (LiteRT-CL / OpenCL, NPU accelerator) reject int64 tensors entirely.
With int64 length, every internal CAST node touching it falls back to CPU, and `CompiledModel.create()` fails outright on Android with the GPU backend. This bundle is exported with int32 length end-to-end (input → internal mask arange/comparisons → output `encoded_lengths`). int32 covers > 2 billion mel frames (~5,965 hours of audio at 100 mel frames per second), so there is no practical range loss.

## Why a single bucket and not multi-signature

An earlier revision shipped a multi-signature encoder with 4 buckets (300/500/700/1500) sharing weights inside one `.tflite`. The disk savings were real (~1.2 GB instead of 4.8 GB for 4 separate files), but on Android the LiteRT `CompiledModel.create()` API prepares **every** signature's subgraph at load time, each one going through the full delegate-partition pass. With 4 signatures × ~7 s of XNNPACK / GPU partition prep, app cold start was ~28 s.

A single-bucket file is one subgraph: ~7 s init, then ready. If you need multiple bucket sizes for latency reasons, ship them as separate `.tflite` files (TFLite has no cross-file weight sharing) and load on demand.

## Decoder + joint contract

```
decoder_step:
  inputs:  token int64 [1,1], h float32 [2,1,640], c float32 [2,1,640]
  outputs: g float32 [1,1,640], h float32 [2,1,640], c float32 [2,1,640]

joint_step:
  inputs:  enc_frame float32 [1,1024,1], pred_frame float32 [1,640,1]
  outputs: logits float32 [1,1,1,8198]
    # logits[..., 0:8193]    → token logits (8192 BPE + 1 blank)
    # logits[..., 8193:8198] → duration logits over [0,1,2,3,4]
```

`decoder_step.token` is `int64` because it's an embedding lookup; that op runs on CPU regardless of delegate, so int64 there is harmless.

Greedy TDT decoding (per encoder frame):

1. Run joint with current `enc_frame` and last predicted `pred_frame`.
2. `token = argmax(token_logits)`; `dur = durations[argmax(duration_logits)] ∈ {0,1,2,3,4}`.
3. If `token != blank_id (8192)`: emit token, advance `dur` encoder frames, re-prime decoder with the emitted token (h, c update).
4. Else: advance `max(dur, 1)` encoder frames; do not advance the decoder.
5. Repeat until `enc_lengths` is exhausted.

Cap at ~10 non-blank emissions per encoder frame to guard against the pathological `dur=0` decode loop.

## Audio preprocessing

LiteRT itself does not produce mel features; your runtime must compute them. Match NeMo's preprocessor exactly:

```
sample_rate : 16000 Hz (resample if needed)
n_fft       : 512
hop_length  : 160   → 100 mel frames / second
win_length  : 400
n_mels      : 128
preemph     : 0.97
log         : log(mel + 1e-5), per-feature normalized
mel_scale   : slaney
```

Encoder frame rate after the 8× subsampler: **12.5 fps** (1 enc frame = 80 ms).

## Streaming usage

This bundle supports chunked streaming inference using a left+chunk+right context window that fits inside 15 s. A reference Python implementation is in the upstream repo (`transcribe_litert_streaming.py`).

Recommended config for Android UX:

| Knob | Value | Reason |
|---|---|---|
| `chunk_seconds` | 5 | committed per step |
| `left_context_seconds` | 5 | encoder bilateral context |
| `right_context_seconds` | 2 | end-to-end latency ≈ 7 s |
| `window total` | 12 s | (5 + 5 + 2) × 100 = 1200 mel ≤ 1500 |
| `carry_state` | false | offline-trained model; carrying LSTM state across chunks tends to hurt |

We measured ~27 % WER on multilingual long-form audio (EN/ES/IT code-switching) with this config, and ~22 % on clean offline ≤15 s English.

## Quantization

- All `.tflite` weights are FP16. Activations remain FP32.
- Bit-identical token output vs the upstream FP32 model on a 99-clip eval set.

## Conversion provenance

Built from upstream `nvidia/parakeet-tdt-0.6b-v3.nemo` via:

1. **NeMo → torch.export ExportedProgram** (per encoder/decoder/joint module).
2. **ExportedProgram → TFLite** via [`litert-torch`](https://github.com/google-ai-edge/LiteRT) 0.8.0.
3. **FP32 → FP16** via `ai_edge_quantizer` `FLOAT_CASTING` algorithm on FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.
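
As a rough illustration of what the FP32 → FP16 step does to stored weights, here is a minimal standard-library sketch. It is not the `ai_edge_quantizer` API: the helper names `to_fp32` and `fp16_roundtrip`, the random weight values, and the weight count are all assumptions made up for the example. It rounds FP32 values to IEEE 754 half precision, reads them back, and compares storage size and worst-case relative rounding error.

```python
import random
import struct

def to_fp32(value: float) -> float:
    """Round a Python float to single precision (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def fp16_roundtrip(value: float) -> float:
    """Store a value as IEEE 754 half precision and read it back (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

random.seed(0)
# Made-up weights, kept well inside FP16's normal range so nothing overflows
# or becomes subnormal. Real model weights are not used here.
weights = [to_fp32(random.uniform(0.01, 1.0)) for _ in range(4096)]
cast = [fp16_roundtrip(w) for w in weights]

# Serialized size of the same weights in each format.
bytes_fp32 = len(struct.pack(f"<{len(weights)}f", *weights))
bytes_fp16 = len(struct.pack(f"<{len(weights)}e", *weights))

# Worst-case relative rounding error introduced by the FP16 storage format.
max_rel_err = max(abs(c - w) / w for c, w in zip(cast, weights))

print(f"FP32 bytes: {bytes_fp32}, FP16 bytes: {bytes_fp16}")
print(f"max relative weight error: {max_rel_err:.2e}")
```

For values in FP16's normal range the relative error stays below 2^-11, and storage halves, which matches the FP16 vs FP32 bundle sizes listed earlier. Whether token output stays identical after casting still has to be checked empirically on an eval set.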

Several NeMo internals required export-time monkey-patches:

- `MaskedConvSequential.{forward,_create_mask}` and `apply_channel_mask`: to remove `.expand(...)` patterns rejected by the TFLite broadcast checker.
- `RelPositionMultiHeadAttentionLongformer._get_invalid_locations_mask`: to build masks in `bool` instead of `uint8` (litert-torch has no uint8 lowering).
- `ConformerEncoder.{forward_internal,_create_masks}` and `MaskedConvSequential.{forward,_create_mask}`: to keep the entire length pipeline in `int32` instead of NeMo's default `int64`, so LiteRT's GPU/NPU delegates can compile the graph without falling back to CPU.

## Limitations

1. **Audio at position 0.** The encoder expects audio anchored at the start of its input window. Padding before the audio causes hallucinations.
2. **15 s max per call.** Use the streaming chunker for longer clips.
3. **No VAD or diarization.** Pair with an external VAD or a diarizer (e.g. Sortformer) for speaker-attributed transcripts.
4. **Multilingual but no language token.** Code-switching works, but the model doesn't emit a language ID. Run a separate classifier if you need it.

## License

Inherits the upstream `nvidia/parakeet-tdt-0.6b-v3` license (CC-BY-4.0).

## Citation

```bibtex
@misc{nvidia_parakeet_tdt_0_6b_v3,
  title  = {Parakeet-TDT-0.6B-v3},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3},
}
```
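
## Appendix: greedy TDT decoding sketch

The greedy TDT steps listed under "Decoder + joint contract" can be sketched in plain Python as below. This is a hedged illustration, not the reference implementation: `joint_step` and `decoder_step` are hypothetical callables standing in for the two `.tflite` models, the joint output is assumed to be already split into token and duration logits, priming the decoder with the blank id is an assumption, and forcing a one-frame advance when the emission cap is hit is only one possible reading of the "~10 emissions" guard. The fake models at the bottom exist purely to make the sketch runnable.

```python
from typing import Any, Callable, List, Sequence, Tuple

BLANK_ID = 8192               # blank index from the joint contract
DURATIONS = (0, 1, 2, 3, 4)   # duration bins from the joint contract
MAX_SYMBOLS_PER_FRAME = 10    # guard against a dur=0 loop

def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)

def greedy_tdt_decode(
    enc_frames: Sequence[Any],
    enc_length: int,
    joint_step: Callable[[Any, Any], Tuple[Sequence[float], Sequence[float]]],
    decoder_step: Callable[[int, Any], Tuple[Any, Any]],
    init_state: Any,
) -> List[int]:
    tokens: List[int] = []
    # Assumption: the prediction network is primed with the blank id.
    pred_frame, state = decoder_step(BLANK_ID, init_state)
    t = 0
    emitted_here = 0  # non-blank emissions since the last frame advance
    while t < enc_length:
        token_logits, duration_logits = joint_step(enc_frames[t], pred_frame)
        token = argmax(token_logits)
        dur = DURATIONS[argmax(duration_logits)]
        if token != BLANK_ID:
            tokens.append(token)
            pred_frame, state = decoder_step(token, state)  # h, c update
            emitted_here += 1
            if dur == 0 and emitted_here >= MAX_SYMBOLS_PER_FRAME:
                dur = 1  # cap reached: force progress (assumed policy)
        else:
            dur = max(dur, 1)  # blank always advances; decoder untouched
        if dur > 0:
            t += dur
            emitted_here = 0
    return tokens

# --- Fake models, for demonstration only ---------------------------------
def one_hot(index: int, size: int) -> List[float]:
    row = [0.0] * size
    row[index] = 1.0
    return row

# (encoder frame index, last token) -> (next token, duration)
SCRIPT = {(0, BLANK_ID): (5, 0), (0, 5): (7, 2), (2, 7): (BLANK_ID, 0), (3, 7): (9, 4)}

def fake_joint(enc_frame: int, pred_frame: int):
    token, dur = SCRIPT[(enc_frame, pred_frame)]
    return one_hot(token, BLANK_ID + 1), one_hot(dur, len(DURATIONS))

def fake_decoder(token: int, state: int):
    return token, state + 1  # pred_frame is just the last token here

demo_tokens = greedy_tdt_decode(list(range(6)), 6, fake_joint, fake_decoder, 0)
print(demo_tokens)
```

In a real runtime, `joint_step` would slice `logits[..., 0:8193]` and `logits[..., 8193:8198]` from the single output tensor, and `decoder_step` would carry the `h` and `c` LSTM tensors as its state.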