---
license: cc-by-4.0
language:
- en
- es
- it
- de
- fr
- pt
library_name: litert
base_model: nvidia/parakeet-tdt-0.6b-v3
tags:
- automatic-speech-recognition
- speech
- audio
- parakeet
- tdt
- litert
- tflite
- on-device
- mobile
- android
- streaming
pipeline_tag: automatic-speech-recognition
---
# Parakeet-TDT-0.6B-v3: LiteRT (TFLite) port
LiteRT (TFLite) port of
[`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3),
packaged for on-device inference (Android / Mac / embedded) without a Python
or NeMo runtime dependency.
For **model capabilities, languages, training data, license, and benchmarks**,
see the upstream model card. This card only documents what's specific to the
LiteRT port.
## What's in this bundle
| File | Size | Purpose |
|---|---|---|
| `encoder_T1500.tflite` | 1.15 GB | FP16 encoder, fixed `T_mel = 1500` (15 s window) |
| `decoder_step.tflite` | 23 MB | Single-step LSTM prediction network |
| `joint_step.tflite` | 12 MB | TDT joint network (token + duration logits) |
| `tokenizer.model` | 353 KB | SentencePiece BPE tokenizer (vocab=8192) |
| `manifest.json` | - | All metadata the runtime needs |
Total: **~1.18 GB** (FP16). FP32 reference is ~2.37 GB.
## Encoder I/O contract
```
inputs:
audio_signal : float32 [1, 128, 1500] # log-mel features (NeMo preproc)
length : int32 [1] # actual mel frames used (≤ 1500)
outputs:
encoded : float32 [1, 1024, 188] # 188 = ceil(1500 / 8)
encoded_lengths : int32 [1]
```
Pad shorter inputs with zeros at the **tail** (the encoder was trained with
audio anchored at position 0; left-padding causes hallucinations) and pass
the true length.
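The padding rule can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the bundle: `pad_to_bucket` is a hypothetical helper name, and it assumes the log-mel features arrive as a `[128, T]` array.

```python
import numpy as np

T_MEL = 1500   # fixed encoder window from the I/O contract
N_MELS = 128


def pad_to_bucket(mel: np.ndarray, t_mel: int = T_MEL):
    """Right-pad a [128, T] log-mel array to [1, 128, t_mel].

    Returns (audio_signal, length) in the dtypes the encoder expects.
    """
    n_mels, t = mel.shape
    if n_mels != N_MELS:
        raise ValueError(f"expected {N_MELS} mel bins, got {n_mels}")
    if t > t_mel:
        raise ValueError(f"{t} frames exceed the {t_mel}-frame window; chunk first")
    audio_signal = np.zeros((1, n_mels, t_mel), dtype=np.float32)
    audio_signal[0, :, :t] = mel             # audio anchored at position 0, zeros at the tail
    length = np.array([t], dtype=np.int32)   # true frame count, int32
    return audio_signal, length
```

The two returned arrays correspond to the `audio_signal` and `length` inputs listed in the contract.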
The 1500-mel bucket covers ≤ 15 s of audio. For long-form input, run the
encoder in a sliding-window streaming loop (see "Streaming usage" below).
**Why int32, not int64.** LiteRT's GPU/NPU delegates (LiteRT-CL / OpenCL,
NPU accelerator) reject int64 tensors entirely. With int64 length, every
internal CAST node touching it falls back to CPU, and `CompiledModel.create()`
fails outright on Android with the GPU backend. This bundle is exported with
int32 length end-to-end (input → internal mask arange/comparisons → output
`encoded_lengths`). int32 covers > 2 billion mel frames (roughly 6,000 hours
of audio at 100 frames per second), so no practical range loss.
## Why a single bucket and not multi-signature
An earlier revision shipped a multi-signature encoder with 4 buckets
(300/500/700/1500) sharing weights inside one `.tflite`. The disk savings
were real (~1.2 GB instead of 4.8 GB for 4 separate files), but on Android
the LiteRT `CompiledModel.create()` API prepares **every** signature's
subgraph at load time, each one going through the full delegate-partition
pass. With 4 signatures × ~7 s of XNNPACK / GPU partition prep, app cold
start was ~28 s.
A single-bucket file is one subgraph: ~7 s init, then ready. If you need
multiple bucket sizes for latency reasons, ship them as separate `.tflite`
files (TFLite has no cross-file weight sharing) and load on demand.
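If you do go the separate-files route, bucket selection and on-demand loading might look like the sketch below. It rests on assumptions: only `encoder_T1500.tflite` ships in this bundle, so the other file names follow a hypothetical naming scheme, and `load_model` is a placeholder for whatever loader your runtime provides.

```python
from bisect import bisect_left
from functools import lru_cache
from typing import Callable, Sequence

BUCKETS = (300, 500, 700, 1500)  # mel-frame buckets, sorted ascending


def pick_bucket(n_frames: int, buckets: Sequence[int] = BUCKETS) -> int:
    """Return the smallest bucket that can hold n_frames mel frames."""
    i = bisect_left(buckets, n_frames)
    if i == len(buckets):
        raise ValueError(f"{n_frames} frames exceed the largest bucket ({buckets[-1]})")
    return buckets[i]


def make_lazy_loader(load_model: Callable[[str], object]):
    """Wrap a caller-supplied loader so each bucket file is loaded at most once."""

    @lru_cache(maxsize=None)
    def get_encoder(bucket: int):
        # Hypothetical file naming, modelled on encoder_T1500.tflite.
        return load_model(f"encoder_T{bucket}.tflite")

    return get_encoder
```

Each bucket then pays its initialization cost only the first time it is requested, instead of all of them at cold start.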
## Decoder + joint contract
```
decoder_step:
inputs: token int64 [1,1], h float32 [2,1,640], c float32 [2,1,640]
outputs: g float32 [1,1,640], h float32 [2,1,640], c float32 [2,1,640]
joint_step:
inputs: enc_frame float32 [1,1024,1], pred_frame float32 [1,640,1]
outputs: logits float32 [1,1,1,8198]
# logits[..., 0:8193] → token logits (8192 BPE + 1 blank)
# logits[..., 8193:8198] → duration logits over [0,1,2,3,4]
```
`decoder_step.token` is `int64` because it's an embedding lookup; that op
runs on CPU regardless of delegate, so int64 there is harmless.
Greedy TDT decoding (per encoder frame):
1. Run joint with the current `enc_frame` and the latest `pred_frame` (the decoder's `g` output, reshaped to `[1,640,1]`).
2. `token = argmax(token_logits)`; `dur = durations[argmax(duration_logits)] ∈ {0,1,2,3,4}`.
3. If `token != blank_id (8192)`: emit token, advance `dur` encoder frames,
re-prime decoder with the emitted token (h, c update).
4. Else: advance `max(dur, 1)` encoder frames; do not advance the decoder.
5. Repeat until `encoded_lengths` is exhausted.
Cap at ~10 non-blank emissions per encoder frame to guard against the
pathological `dur=0` decode loop.
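The steps above, including the emission cap, can be sketched as a plain Python loop. This is an illustration under stated assumptions, not the reference decoder: `joint` and `decoder` are hypothetical caller-supplied wrappers around the two `.tflite` models, and priming the prediction network with the blank ID as a start symbol is an assumption that the text above does not confirm.

```python
import numpy as np

BLANK_ID = 8192
DURATIONS = (0, 1, 2, 3, 4)
MAX_SYMBOLS_PER_FRAME = 10


def greedy_tdt(encoded, enc_len, joint, decoder, init_state):
    """Greedy TDT decode.

    encoded:    [1024, T] encoder output for one utterance
    joint:      (enc_frame, pred_frame) -> 8198 logits (hypothetical wrapper)
    decoder:    (token, state) -> (pred_frame, new_state) (hypothetical wrapper)
    """
    tokens = []
    # Assumption: the blank ID doubles as the start-of-sequence symbol.
    pred, state = decoder(BLANK_ID, init_state)
    t, emitted_here = 0, 0
    while t < enc_len:
        logits = np.asarray(joint(encoded[:, t], pred)).reshape(-1)
        token = int(np.argmax(logits[: BLANK_ID + 1]))           # 8192 BPE + blank
        dur = DURATIONS[int(np.argmax(logits[BLANK_ID + 1:]))]   # 5 duration bins
        if token != BLANK_ID:
            tokens.append(token)
            pred, state = decoder(token, state)   # re-prime with the emitted token
            emitted_here += 1
            if dur == 0 and emitted_here >= MAX_SYMBOLS_PER_FRAME:
                dur = 1                           # force progress out of a dur=0 loop
        else:
            dur = max(dur, 1)                     # blank always advances
        if dur > 0:
            t += dur
            emitted_here = 0
    return tokens
```

Passing the models in as callables keeps the loop independent of the runtime, so the same logic can be unit-tested with stub functions before it is wired to real interpreters.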
## Audio preprocessing
LiteRT itself does not produce mel features; your runtime must compute
them. Match NeMo's preprocessor exactly:
```
sample_rate : 16000 Hz (resample if needed)
n_fft : 512
hop_length : 160 → 100 mel frames / second
win_length : 400
n_mels : 128
preemph : 0.97
log : log(mel + 1e-5), per-feature normalized
mel_scale : slaney
```
Encoder frame rate after the 8× subsampler: **12.5 fps** (1 enc frame = 80 ms).
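Two of the listed steps, pre-emphasis and the log plus per-feature normalization, are easy to get subtly wrong, so here is a hedged NumPy sketch of just those. The STFT and slaney mel filterbank are left to your DSP library. The use of the sample standard deviation (`ddof=1`) with a small additive constant is an assumption about NeMo's behaviour; verify it against the upstream preprocessor before relying on it.

```python
import numpy as np


def preemphasis(x: np.ndarray, coeff: float = 0.97) -> np.ndarray:
    """y[0] = x[0]; y[t] = x[t] - coeff * x[t-1]."""
    x = np.asarray(x, dtype=np.float32)
    return np.concatenate([x[:1], x[1:] - coeff * x[:-1]])


def log_and_normalize(mel_power: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Apply log(mel + 1e-5), then normalize each mel bin over time.

    mel_power: [128, T] mel filterbank energies for the valid frames only,
    so that tail padding never enters the statistics.
    """
    log_mel = np.log(mel_power + eps)
    mean = log_mel.mean(axis=1, keepdims=True)
    # Assumption: sample std (ddof=1) plus a small constant for stability.
    std = log_mel.std(axis=1, ddof=1, keepdims=True)
    return ((log_mel - mean) / (std + eps)).astype(np.float32)
```

Normalizing before padding matters: zeros appended at the tail would otherwise shift the per-bin mean and variance.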
## Streaming usage
This bundle supports chunked streaming inference using a left+chunk+right
context window that fits inside 15 s. A reference Python implementation is
in the upstream repo (`transcribe_litert_streaming.py`). Recommended config
for Android UX:
| Knob | Value | Reason |
|---|---|---|
| `chunk_seconds` | 5 | committed per step |
| `left_context_seconds` | 5 | encoder bilateral context |
| `right_context_seconds` | 2 | end-to-end latency ≈ 7 s |
| `window total` | 12 s | (5 + 5 + 2) × 100 = 1200 mel ≤ 1500 |
| `carry_state` | false | offline-trained model; carrying LSTM state across chunks tends to hurt |
We measured ~27 % WER on multilingual long-form audio (EN/ES/IT
code-switching) with this config, ~22 % on clean offline ≤15 s English.
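The window arithmetic behind the recommended configuration can be sketched as a small planner that works in mel frames. This is not the reference implementation: `Window` and `plan_windows` are hypothetical names, and mapping the committed chunk onto encoder frames is left to the caller.

```python
from dataclasses import dataclass
from typing import Iterator

MEL_FPS = 100  # 16 kHz sample rate / hop length 160


@dataclass(frozen=True)
class Window:
    start: int        # first mel frame fed to the encoder
    chunk_start: int  # first mel frame whose text is committed
    chunk_end: int    # one past the last committed mel frame
    end: int          # one past the last mel frame fed to the encoder


def plan_windows(total_frames: int, chunk_s: float = 5, left_s: float = 5,
                 right_s: float = 2, max_frames: int = 1500) -> Iterator[Window]:
    """Yield left+chunk+right windows covering total_frames mel frames."""
    chunk, left, right = (round(s * MEL_FPS) for s in (chunk_s, left_s, right_s))
    if chunk <= 0:
        raise ValueError("chunk_s must be positive")
    if left + chunk + right > max_frames:
        raise ValueError("left + chunk + right must fit in the encoder window")
    for chunk_start in range(0, total_frames, chunk):
        chunk_end = min(chunk_start + chunk, total_frames)
        yield Window(
            start=max(0, chunk_start - left),          # clip left context at the beginning
            chunk_start=chunk_start,
            chunk_end=chunk_end,
            end=min(chunk_end + right, total_frames),  # clip right context at the end
        )
```

Each window's frames `[start, end)` are sliced out, placed at position 0 of the encoder input, and zero-padded at the tail. With `carry_state` set to false, the decoder state is reset for every window.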
## Quantization
- All `.tflite` weights are FP16. Activations remain FP32.
- Bit-identical token output vs the upstream FP32 model on a 99-clip eval
set.
## Conversion provenance
Built from upstream `nvidia/parakeet-tdt-0.6b-v3.nemo` via:
1. **NeMo → torch.export ExportedProgram** (per encoder/decoder/joint module).
2. **ExportedProgram → TFLite** via
[`litert-torch`](https://github.com/google-ai-edge/LiteRT) 0.8.0.
3. **FP32 → FP16** via `ai_edge_quantizer` `FLOAT_CASTING` algorithm on
FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.
Several NeMo internals required export-time monkey-patches:
- `MaskedConvSequential.{forward,_create_mask}` and `apply_channel_mask`: to
remove `.expand(...)` patterns rejected by the TFLite broadcast checker.
- `RelPositionMultiHeadAttentionLongformer._get_invalid_locations_mask`: to
build masks in `bool` instead of `uint8` (litert-torch has no uint8
lowering).
- `ConformerEncoder.{forward_internal,_create_masks}` and
`MaskedConvSequential.{forward,_create_mask}`: to keep the entire length
pipeline in `int32` instead of NeMo's default `int64`, so LiteRT's
GPU/NPU delegates can compile the graph without falling back to CPU.
## Limitations
1. **Audio at position 0.** The encoder expects audio anchored at the start
of its input window. Padding before the audio causes hallucinations.
2. **15 s max per call.** Use the streaming chunker for longer clips.
3. **No VAD or diarization.** Pair with an external VAD or a diarizer
(e.g. Sortformer) for speaker-attributed transcripts.
4. **Multilingual but no language token.** Code-switching works, but the
model doesn't emit a language ID. Run a separate classifier if you need it.
## License
Inherits the upstream `nvidia/parakeet-tdt-0.6b-v3` license (CC-BY-4.0).
## Citation
```bibtex
@misc{nvidia_parakeet_tdt_0_6b_v3,
title = {Parakeet-TDT-0.6B-v3},
author = {NVIDIA},
year = {2025},
url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3},
}
```