Parakeet TDT v3 + Streaming Sortformer — ONNX bundle for Vernacula

Combined ONNX shipping bundle used by Vernacula as its default ASR + diarization + VAD stack. Three upstream models are co-located here so a single download brings up the full pipeline.

Conversion scripts: scripts/nemo_export/
Vernacula: github.com/christopherthompson81/vernacula
Upstream models: Parakeet TDT v3, Streaming Sortformer, Silero VAD

Highlights

DFT-basis mel frontend replaces torch.stft. ORT's STFT op diverged from the NeMo reference (cosine similarity ≈ 0.23 on first inspection), so the replacement computes the transform with precomputed cos/sin basis matrices used as Conv1D weights, with center-padded windows and standard ops only. This restored bit-for-bit parity with the NeMo reference.
Streaming Sortformer 6→3 ONNX contract (six inputs, three outputs). NeMo's concat_and_pad() isn't ONNX-traceable, so the custom exporter replaces dynamic per-batch slicing with fixed-shape ops at 992 chunk frames (124 frames after 8× subsampling). Inputs: chunk, chunk_lengths, spkcache, spkcache_lengths, fifo, fifo_lengths. Outputs: spkcache_fifo_chunk_preds, chunk_pre_encode_embs, chunk_pre_encode_lengths.
Dynamic-batch encoder + dynamic-batch joint decoder for Parakeet TDT (preprocessor still batch-1 post-export). INT8 variants of encoder, decoder-joint, and Sortformer ship for CPU-only inference.
Chunk-by-chunk parity diagnostic (compare_sortformer_chunk_outputs.py) compares NeMo vs ONNX state evolution across streaming chunks to localise drift to either model output or carry-state.
Preprocessor export sweep (tune_nemo128_export.py) scores wrapper / custom / DFT modes against a legacy reference with feature-level and encoder-output deltas — the tooling that picked the DFT path in the first bullet.
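The DFT-basis idea from the first highlight can be sketched in plain Python. This is an illustration only, not the exporter's code: the Hann window, the reflect padding, and the function names (`dft_basis`, `stft_power`, `reference_power`, `cosine`) are assumptions chosen for the example. Each windowed cos/sin row plays the role of one Conv1D kernel, and sliding it with stride `hop` is the convolution.

```python
import cmath
import math

def hann(n):
    # Periodic Hann window (an assumption; the real frontend's window may differ).
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def dft_basis(n_fft, window):
    # One cos row and one sin row per frequency bin, with the window folded in.
    # Each row has length n_fft and can serve as a Conv1D kernel.
    bins = n_fft // 2 + 1
    cos_rows = [[window[n] * math.cos(2 * math.pi * k * n / n_fft)
                 for n in range(n_fft)] for k in range(bins)]
    sin_rows = [[-window[n] * math.sin(2 * math.pi * k * n / n_fft)
                 for n in range(n_fft)] for k in range(bins)]
    return cos_rows, sin_rows

def reflect_pad(x, pad):
    # Centre the first frame on sample 0 (reflect mode is assumed here).
    return x[1:pad + 1][::-1] + x + x[-pad - 1:-1][::-1]

def stft_power(x, n_fft, hop):
    # Strided dot products against the basis rows: a Conv1D with stride `hop`.
    cos_rows, sin_rows = dft_basis(n_fft, hann(n_fft))
    padded = reflect_pad(x, n_fft // 2)
    frames = []
    for start in range(0, len(padded) - n_fft + 1, hop):
        seg = padded[start:start + n_fft]
        re = [sum(w * v for w, v in zip(row, seg)) for row in cos_rows]
        im = [sum(w * v for w, v in zip(row, seg)) for row in sin_rows]
        frames.append([r * r + i * i for r, i in zip(re, im)])
    return frames

def reference_power(x, n_fft, hop):
    # Direct windowed DFT, used only to check the basis formulation.
    window = hann(n_fft)
    padded = reflect_pad(x, n_fft // 2)
    frames = []
    for start in range(0, len(padded) - n_fft + 1, hop):
        seg = [w * v for w, v in zip(window, padded[start:start + n_fft])]
        frames.append([
            abs(sum(v * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n, v in enumerate(seg))) ** 2
            for k in range(n_fft // 2 + 1)
        ])
    return frames

def cosine(a, b):
    # Cosine similarity between two flat feature vectors.
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))
```

Comparing `stft_power` against `reference_power` with `cosine` on flattened outputs mirrors, in miniature, the kind of feature-level check the export sweep describes.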

File	Source	Purpose
`encoder-model.onnx` (+ `.data`)	`nvidia/parakeet-tdt-0.6b-v3`	Parakeet TDT FastConformer encoder
`encoder-model.int8.onnx`	(quantized from above)	INT8-quantized encoder for CPU
`decoder_joint-model.onnx`	Parakeet TDT v3	Joint decoder + prediction network
`decoder_joint-model.int8.onnx`	(quantized)	INT8-quantized decoder for CPU
`vocab.txt`	Parakeet TDT v3	Subword vocabulary
`nemo128.onnx`	NeMo preprocessor	128-mel log-FBANK frontend (128-dim feature config)
`diar_streaming_sortformer_4spk-v2.1.onnx`	`nvidia/diar_streaming_sortformer_4spk-v2.1`	Streaming 4-speaker diarization
`diar_streaming_sortformer_4spk-v2.1_int8.onnx`	(quantized)	INT8-quantized diarization
`sortformer/diar_streaming_sortformer_4spk-v2.1.onnx`	Sortformer	Same graph in subdir layout for legacy clients
`silero_vad.onnx`	snakers4/silero-vad	Voice activity detection
`config.json`, `manifest.json`	Vernacula	Runtime config + per-file MD5 hashes
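Outside Vernacula, the per-file MD5 hashes can be checked by hand. The sketch below assumes the hashes have already been read into a plain `{filename: md5_hex}` mapping; the actual layout of manifest.json is not documented here, so how that mapping is extracted from it is left open. The helper names `md5_of` and `verify_bundle` are hypothetical.

```python
import hashlib
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    # Stream the file in 1 MiB blocks so large .onnx files are never fully in memory.
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def verify_bundle(root, expected):
    """Return the names whose file is missing or whose MD5 differs.

    `expected` is an assumed {filename: md5_hex} mapping, not the
    documented manifest.json schema.
    """
    bad = []
    for name, want in sorted(expected.items()):
        path = Path(root) / name
        if not path.is_file() or md5_of(path) != want.lower():
            bad.append(name)
    return bad
```

An empty list from `verify_bundle` means every listed file is present and matches its hash.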

Export provenance

Exported via scripts/nemo_export/ in the Vernacula repo, which contains:

export_parakeet_nemo_to_onnx.py — Parakeet .nemo → split ONNX with TDT decoder state wired explicitly
export_sortformer_nemo_to_onnx.py — Streaming Sortformer .nemo → six-input / three-output ONNX contract
export_silero_vad_to_onnx.py — Silero VAD → ONNX

The Parakeet export traces the RNNT/TDT decoder loop into a separate joint graph so each step is a fixed-shape ORT call. Sortformer is exported as a streaming graph that takes incoming frames + carry state and returns diarization logits chunk-by-chunk.
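To show why a per-step joint graph is convenient, here is a minimal sketch of a greedy TDT decode loop driven from the host. It is not Vernacula's decoder: `joint_step` is a hypothetical callable standing in for one fixed-shape ORT call on the decoder-joint graph, and the `durations` table, `blank_id`, and `max_symbols` cap are assumptions for illustration.

```python
def argmax(values):
    # Index of the largest element.
    return max(range(len(values)), key=values.__getitem__)

def tdt_greedy_decode(encoder_frames, joint_step, blank_id, durations, max_symbols=10):
    """Greedy TDT decoding over a sequence of encoder frames.

    joint_step(frame, last_token, state) must return
    (token_logits, duration_logits, new_state); it is a stand-in for
    one call into the exported joint graph.
    """
    tokens = []
    state = None
    last_token = blank_id
    t = 0
    emitted_here = 0  # symbols emitted without advancing the frame index
    while t < len(encoder_frames):
        token_logits, duration_logits, new_state = joint_step(
            encoder_frames[t], last_token, state
        )
        token = argmax(token_logits)
        skip = durations[argmax(duration_logits)]
        if token != blank_id:
            # Only non-blank tokens update the prediction-network state.
            tokens.append(token)
            last_token = token
            state = new_state
            emitted_here += 1
        # Guarantee progress: a blank with zero duration, or too many symbols
        # on one frame, forces a one-frame advance.
        if skip == 0 and (token == blank_id or emitted_here >= max_symbols):
            skip = 1
        if skip > 0:
            t += skip
            emitted_here = 0
    return tokens
```

Because every call has the same shapes, the loop logic (token choice, duration skip, state carry) stays outside the graph.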

License

This bundle aggregates three upstream models, plus the NeMo mel-frontend code, under four different licenses. Each component retains its upstream license; redistribution here does not change those terms.

Component	Upstream license
Parakeet TDT v3 weights	CC-BY-4.0
Streaming Sortformer weights	NVIDIA Open Model License
Silero VAD weights	MIT
NeMo mel-frontend code (`nemo128.onnx`)	Apache-2.0

If you redistribute this bundle, propagate all four licenses with it.

Using these files

The cleanest path is via Vernacula, which downloads, caches, and validates this package against manifest.json automatically. Outside Vernacula, pull the package with huggingface_hub and load each .onnx with onnxruntime directly — input / output tensor contracts are documented in scripts/nemo_export/README.md.

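```python
from huggingface_hub import snapshot_download

# Download (or reuse the cached copy of) the whole bundle; returns the local directory.
path = snapshot_download(repo_id="christopherthompson81/sortformer_parakeet_onnx")
```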
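Once the files are local, a quick way to confirm a graph's tensor contract is to list its inputs and outputs. The helpers below are a sketch with hypothetical names (`describe_io`, `open_session`); they rely only on onnxruntime's `InferenceSession`, `get_inputs()` and `get_outputs()`, and assume CPU execution.

```python
def describe_io(session):
    """Summarise an ORT-style session as {"inputs": [...], "outputs": [...]}.

    Each entry is (name, shape, dtype). Dynamic dimensions appear as
    strings or None, exactly as the session reports them.
    """
    def rows(args):
        return [(a.name, list(a.shape), a.type) for a in args]
    return {"inputs": rows(session.get_inputs()),
            "outputs": rows(session.get_outputs())}

def open_session(model_path):
    # Imported lazily so describe_io stays usable without onnxruntime installed.
    import onnxruntime as ort
    return ort.InferenceSession(str(model_path), providers=["CPUExecutionProvider"])
```

For example, `describe_io(open_session(f"{path}/diar_streaming_sortformer_4spk-v2.1.onnx"))` would be expected to list the six inputs and three outputs named under Highlights, if the shipped graph matches that description.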

Limitations

The float32 graphs preserve the numerical behavior of the upstream PyTorch checkpoints distributed via NVIDIA NeMo. Accuracy, language coverage, and known failure modes are inherited from the upstream models; see the upstream model cards (Parakeet, Sortformer) for the authoritative discussion. INT8 variants trade a small amount of WER for ~2× CPU throughput; use the float32 variants where accuracy is the priority.
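The float32-versus-INT8 choice can be reduced to a small lookup over the filenames in the file table. This is only a sketch: the role keys (`encoder`, `decoder_joint`, `diarizer`, and so on) and the `select_files` helper are made up for illustration and are not part of Vernacula's config.

```python
# Filenames are taken from the file table; the role keys are hypothetical.
FLOAT32 = {
    "encoder": "encoder-model.onnx",
    "decoder_joint": "decoder_joint-model.onnx",
    "diarizer": "diar_streaming_sortformer_4spk-v2.1.onnx",
}
INT8 = {
    "encoder": "encoder-model.int8.onnx",
    "decoder_joint": "decoder_joint-model.int8.onnx",
    "diarizer": "diar_streaming_sortformer_4spk-v2.1_int8.onnx",
}
# Files with a single variant in the bundle.
SHARED = {
    "preprocessor": "nemo128.onnx",
    "vad": "silero_vad.onnx",
    "vocab": "vocab.txt",
}

def select_files(prefer_accuracy):
    """Pick float32 graphs when accuracy matters, INT8 graphs for CPU throughput."""
    chosen = dict(SHARED)
    chosen.update(FLOAT32 if prefer_accuracy else INT8)
    return chosen
```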

Citation

For the underlying models, please cite the upstream authors as directed in the upstream model cards (Parakeet TDT v3, Streaming Sortformer, Silero VAD).

Acknowledgments

Original Parakeet TDT v3 and Streaming Sortformer: NVIDIA NeMo team
Original Silero VAD: Silero Team (snakers4)
ONNX repackaging: Chris Thompson for Vernacula

Issues with the ONNX export specifically: open an issue on the Vernacula repo. Issues with the underlying models: see the upstream model cards.
