---
license: cc-by-4.0
language:
- en
- es
- it
- de
- fr
- pt
library_name: litert
base_model: nvidia/parakeet-tdt-0.6b-v3
tags:
- automatic-speech-recognition
- speech
- audio
- parakeet
- tdt
- litert
- tflite
- on-device
- mobile
- android
- streaming
pipeline_tag: automatic-speech-recognition
---
# Parakeet-TDT-0.6B-v3: LiteRT (TFLite) port

LiteRT (TFLite) port of
[`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3),
packaged for on-device inference (Android / Mac / embedded) without a Python
or NeMo runtime dependency.

For **model capabilities, languages, training data, license, and benchmarks**,
see the upstream model card. This card only documents what's specific to the
LiteRT port.
## What's in this bundle

| File | Size | Purpose |
|---|---|---|
| `encoder_T1500.tflite` | 1.15 GB | FP16 encoder, fixed `T_mel = 1500` (15 s window) |
| `decoder_step.tflite` | 23 MB | Single-step LSTM prediction network |
| `joint_step.tflite` | 12 MB | TDT joint network (token + duration logits) |
| `tokenizer.model` | 353 KB | SentencePiece BPE tokenizer (vocab=8192) |
| `manifest.json` | - | All metadata the runtime needs |

Total: **~1.18 GB** (FP16). FP32 reference is ~2.37 GB.
## Encoder I/O contract

```
inputs:
  audio_signal    : float32 [1, 128, 1500]   # log-mel features (NeMo preproc)
  length          : int32   [1]              # actual mel frames used (≤ 1500)
outputs:
  encoded         : float32 [1, 1024, 188]   # 188 = ceil(1500 / 8)
  encoded_lengths : int32   [1]
```
Pad shorter inputs with zeros at the **tail** (the encoder was trained with
audio anchored at position 0; left-padding causes hallucinations) and pass
the true length.
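
The tail-padding rule above can be sketched with NumPy as follows. This is a minimal illustration, not code from this bundle: `pad_to_bucket` is a hypothetical helper name, and it assumes the log-mel features have already been computed as a `[128, T]` array.

```python
import numpy as np

N_MELS = 128   # mel bins expected by the encoder
T_MEL = 1500   # fixed bucket size of encoder_T1500.tflite


def pad_to_bucket(mel: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Right-pad log-mel features [128, T] to the encoder's fixed input shape.

    Returns (audio_signal, length) matching the encoder I/O contract:
    float32 [1, 128, 1500] and int32 [1].
    """
    n_mels, t = mel.shape
    if n_mels != N_MELS:
        raise ValueError(f"expected {N_MELS} mel bins, got {n_mels}")
    if t > T_MEL:
        raise ValueError(f"{t} frames exceed the {T_MEL}-frame bucket")

    # Audio stays anchored at position 0; zeros go at the tail only.
    audio_signal = np.zeros((1, N_MELS, T_MEL), dtype=np.float32)
    audio_signal[0, :, :t] = mel

    # The true length is passed as int32 so accelerator delegates accept it.
    length = np.array([t], dtype=np.int32)
    return audio_signal, length
```
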

The 1500-mel bucket covers ≤ 15 s of audio. For long-form input, run the
encoder in a sliding-window streaming loop; see "Streaming usage" below.

**Why int32, not int64.** LiteRT's GPU/NPU delegates (LiteRT-CL / OpenCL,
NPU accelerator) reject int64 tensors entirely. With int64 length, every
internal CAST node touching it falls back to CPU, and `CompiledModel.create()`
fails outright on Android with the GPU backend. This bundle is exported with
int32 length end-to-end (input → internal mask arange/comparisons → output
`encoded_lengths`). int32 covers more than 2 billion mel frames (about 5,965
hours of audio at 100 frames per second), so there is no practical range loss.
## Why a single bucket and not multi-signature

An earlier revision shipped a multi-signature encoder with 4 buckets
(300/500/700/1500) sharing weights inside one `.tflite`. The disk savings
were real (~1.2 GB instead of 4.8 GB for 4 separate files), but on Android
the LiteRT `CompiledModel.create()` API prepares **every** signature's
subgraph at load time, each one going through the full delegate-partition
pass. With 4 signatures × ~7 s of XNNPACK / GPU partition prep, app cold
start was ~28 s.

A single-bucket file is one subgraph: ~7 s init, then ready. If you need
multiple bucket sizes for latency reasons, ship them as separate `.tflite`
files (TFLite has no cross-file weight sharing) and load on demand.
## Decoder + joint contract

```
decoder_step:
  inputs:  token int64 [1,1], h float32 [2,1,640], c float32 [2,1,640]
  outputs: g float32 [1,1,640], h float32 [2,1,640], c float32 [2,1,640]
joint_step:
  inputs:  enc_frame float32 [1,1024,1], pred_frame float32 [1,640,1]
  outputs: logits float32 [1,1,1,8198]
  # logits[..., 0:8193]    → token logits (8192 BPE + 1 blank)
  # logits[..., 8193:8198] → duration logits over [0,1,2,3,4]
```

`decoder_step.token` is `int64` because it's an embedding lookup; that op
runs on CPU regardless of delegate, so int64 there is harmless.

Greedy TDT decoding (per encoder frame):

1. Run joint with current `enc_frame` and last predicted `pred_frame`.
2. `token = argmax(token_logits)`; `dur = durations[argmax(duration_logits)] ∈ {0,1,2,3,4}`.
3. If `token != blank_id (8192)`: emit token, advance `dur` encoder frames,
   re-prime decoder with the emitted token (h, c update).
4. Else: advance `max(dur, 1)` encoder frames; do not advance the decoder.
5. Repeat until `encoded_lengths` is exhausted.

Cap at ~10 non-blank emissions per encoder frame to guard against the
pathological `dur=0` decode loop.
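
The greedy loop above can be sketched in Python as follows. This is an illustration under stated assumptions, not the bundle's reference decoder: `decoder_step` and `joint_step` are hypothetical callables wrapping the two `.tflite` models with the shapes from the contract, the prediction network is assumed to be primed with the blank id and zero LSTM state, and hitting the emission cap is assumed to force a one-frame advance.

```python
import numpy as np

BLANK_ID = 8192
NUM_TOKEN_LOGITS = 8193            # 8192 BPE tokens + 1 blank
DURATIONS = (0, 1, 2, 3, 4)
MAX_SYMBOLS_PER_FRAME = 10         # guard against a dur=0 loop


def greedy_tdt_decode(encoded, encoded_length, decoder_step, joint_step):
    """Greedy TDT decode of `encoded` (float32 [1, 1024, T_enc]).

    decoder_step(token, h, c) -> (g, h, c) and joint_step(enc, pred) -> logits
    are hypothetical wrappers around the two models. Returns token ids.
    """
    h = np.zeros((2, 1, 640), dtype=np.float32)
    c = np.zeros((2, 1, 640), dtype=np.float32)
    # Assumption: the blank id serves as the start-of-sequence token.
    g, h, c = decoder_step(np.array([[BLANK_ID]], dtype=np.int64), h, c)

    tokens = []
    t = 0                # current encoder frame
    emitted_here = 0     # non-blank emissions on the current frame
    while t < encoded_length:
        enc_frame = encoded[:, :, t:t + 1]           # [1, 1024, 1]
        pred_frame = np.transpose(g, (0, 2, 1))      # [1,1,640] -> [1,640,1]
        logits = joint_step(enc_frame, pred_frame).reshape(-1)

        token = int(np.argmax(logits[:NUM_TOKEN_LOGITS]))
        dur = DURATIONS[int(np.argmax(logits[NUM_TOKEN_LOGITS:]))]

        if token != BLANK_ID:
            tokens.append(token)
            # Re-prime the prediction network with the emitted token.
            g, h, c = decoder_step(np.array([[token]], dtype=np.int64), h, c)
            emitted_here += 1
            if dur == 0 and emitted_here >= MAX_SYMBOLS_PER_FRAME:
                dur = 1  # assumption: force progress once the cap is hit
        else:
            dur = max(dur, 1)  # blank always advances; decoder state is kept

        if dur > 0:
            t += dur
            emitted_here = 0
    return tokens
```

The returned ids would then go through the SentencePiece tokenizer in `tokenizer.model` to produce text.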
## Audio preprocessing

LiteRT itself does not produce mel features; your runtime must compute
them. Match NeMo's preprocessor exactly:

```
sample_rate : 16000 Hz (resample if needed)
n_fft       : 512
hop_length  : 160    → 100 mel frames / second
win_length  : 400
n_mels      : 128
preemph     : 0.97
log         : log(mel + 1e-5), per-feature normalized
mel_scale   : slaney
```
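
Two of the steps in this config, pre-emphasis and the log plus per-feature normalization, are easy to get subtly wrong, so here is a minimal NumPy sketch of them. It is not the bundle's preprocessor: the STFT and slaney mel filterbank are omitted, the function names are hypothetical, and the normalization details (sample standard deviation with `1e-5` added, padded frames zeroed) are assumptions to verify against the NeMo version you target.

```python
import numpy as np

PREEMPH = 0.97
LOG_GUARD = 1e-5   # from the config above: log(mel + 1e-5)
STD_EPS = 1e-5     # assumption: small constant added to the std


def preemphasize(audio: np.ndarray) -> np.ndarray:
    """y[0] = x[0]; y[t] = x[t] - 0.97 * x[t-1] for t >= 1."""
    audio = audio.astype(np.float32)
    out = audio.copy()
    out[1:] = audio[1:] - PREEMPH * audio[:-1]
    return out


def log_and_normalize(mel: np.ndarray, length: int) -> np.ndarray:
    """Log-compress mel energies [n_mels, T] and normalize each mel bin.

    Statistics are computed over the first `length` (valid) frames only;
    frames past `length` are set to zero.
    """
    if length < 2:
        raise ValueError("need at least 2 valid frames for a sample std")
    logmel = np.log(mel.astype(np.float32) + LOG_GUARD)
    valid = logmel[:, :length]
    mean = valid.mean(axis=1, keepdims=True)
    std = valid.std(axis=1, ddof=1, keepdims=True) + STD_EPS
    out = np.zeros_like(logmel)
    out[:, :length] = (valid - mean) / std
    return out
```
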
Encoder frame rate after the 8× subsampler: **12.5 fps** (1 enc frame = 80 ms).
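
The frame arithmetic can be sketched as below. The helper names are hypothetical, and the length formula assumes the 8× subsampler is three stride-2 convolutions with kernel 3 and padding 1; that assumption reproduces the 188 output frames of the 1500-frame bucket.

```python
SAMPLE_RATE = 16000
HOP_LENGTH = 160        # 100 mel frames per second
SUBSAMPLING = 8         # encoder subsampling factor


def encoder_frames(mel_frames: int) -> int:
    """Encoder output frames for a given number of mel frames.

    Assumes three stride-2 stages (kernel 3, padding 1), each mapping
    n -> floor((n - 1) / 2) + 1.
    """
    n = mel_frames
    for _ in range(3):
        n = (n - 1) // 2 + 1
    return n


def enc_frame_to_seconds(index: int) -> float:
    """Start time of an encoder frame: 8 hops of 160 samples = 80 ms each."""
    return index * SUBSAMPLING * HOP_LENGTH / SAMPLE_RATE
```

For example, `encoder_frames(1500)` gives 188 and `encoder_frames(1200)` gives 150, so the 12 s streaming window described in the next section yields 150 encoder frames.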
## Streaming usage

This bundle supports chunked streaming inference using a left+chunk+right
context window that fits inside 15 s. A reference Python implementation is
in the upstream repo (`transcribe_litert_streaming.py`). Recommended config
for Android UX:

| Knob | Value | Reason |
|---|---|---|
| `chunk_seconds` | 5 | committed per step |
| `left_context_seconds` | 5 | encoder bilateral context |
| `right_context_seconds` | 2 | end-to-end latency ≈ 7 s |
| `window total` | 12 s | (5 + 5 + 2) × 100 = 1200 mel ≤ 1500 |
| `carry_state` | false | offline-trained model; carrying LSTM state across chunks tends to hurt |

With this config we measured ~27 % WER on multilingual long-form audio
(EN/ES/IT code-switching), and ~22 % WER on clean English clips of ≤ 15 s
decoded offline.
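
The window arithmetic from the table can be sketched as a small planner. This is not the upstream `transcribe_litert_streaming.py`; `Window` and `plan_windows` are hypothetical names, and the sketch only computes which mel frames to feed and which to commit for each step.

```python
from dataclasses import dataclass
from typing import Iterator

MEL_FPS = 100     # mel frames per second (hop_length 160 at 16 kHz)
MAX_MEL = 1500    # encoder bucket size


@dataclass(frozen=True)
class Window:
    start: int         # first mel frame fed to the encoder
    end: int           # one past the last mel frame fed
    commit_start: int  # first mel frame whose tokens are committed
    commit_end: int    # one past the last committed mel frame


def plan_windows(total_mel: int, chunk_s: float = 5, left_s: float = 5,
                 right_s: float = 2) -> Iterator[Window]:
    """Yield left+chunk+right windows covering `total_mel` mel frames."""
    chunk, left, right = (int(s * MEL_FPS) for s in (chunk_s, left_s, right_s))
    if chunk <= 0:
        raise ValueError("chunk_s must be positive")
    if left + chunk + right > MAX_MEL:
        raise ValueError("window does not fit the 1500-frame bucket")

    for commit_start in range(0, total_mel, chunk):
        commit_end = min(commit_start + chunk, total_mel)
        yield Window(
            start=max(0, commit_start - left),     # less left context at the start
            end=min(total_mel, commit_end + right),  # less right context at the end
            commit_start=commit_start,
            commit_end=commit_end,
        )
```

Each window's mel slice `[start, end)` is placed at position 0 of the encoder input and zero-padded at the tail; only tokens that fall inside `[commit_start, commit_end)` are kept from that step.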
## Quantization

- All `.tflite` weights are FP16. Activations remain FP32.
- Bit-identical token output vs the upstream FP32 model on a 99-clip eval
  set.
## Conversion provenance

Built from upstream `nvidia/parakeet-tdt-0.6b-v3.nemo` via:

1. **NeMo → torch.export ExportedProgram** (per encoder/decoder/joint module).
2. **ExportedProgram → TFLite** via
   [`litert-torch`](https://github.com/google-ai-edge/LiteRT) 0.8.0.
3. **FP32 → FP16** via `ai_edge_quantizer` `FLOAT_CASTING` algorithm on
   FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.

Several NeMo internals required export-time monkey-patches:

- `MaskedConvSequential.{forward,_create_mask}` and `apply_channel_mask`, to
  remove `.expand(...)` patterns rejected by the TFLite broadcast checker.
- `RelPositionMultiHeadAttentionLongformer._get_invalid_locations_mask`, to
  build masks in `bool` instead of `uint8` (litert-torch has no uint8
  lowering).
- `ConformerEncoder.{forward_internal,_create_masks}` and
  `MaskedConvSequential.{forward,_create_mask}`, to keep the entire length
  pipeline in `int32` instead of NeMo's default `int64`, so LiteRT's
  GPU/NPU delegates can compile the graph without falling back to CPU.
## Limitations

1. **Audio at position 0.** The encoder expects audio anchored at the start
   of its input window. Padding before the audio causes hallucinations.
2. **15 s max per call.** Use the streaming chunker for longer clips.
3. **No VAD or diarization.** Pair with an external VAD or a diarizer
   (e.g. Sortformer) for speaker-attributed transcripts.
4. **Multilingual but no language token.** Code-switching works, but the
   model doesn't emit a language ID. Run a separate classifier if you need it.
## License

Inherits the upstream `nvidia/parakeet-tdt-0.6b-v3` license (CC-BY-4.0).
## Citation

```bibtex
@misc{nvidia_parakeet_tdt_0_6b_v3,
  title  = {Parakeet-TDT-0.6B-v3},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3},
}
```