Nemotron Speech Streaming en 0.6B — int8 (parakeet-rs compatible)

Dynamic int8 quantization (int8 weights, activations quantized at run time) of the encoder of NVIDIA's nemotron-speech-streaming-en-0.6b, an English FastConformer encoder with a fused RNN-T decoder/joint network and punctuation and capitalization output. Packaged for the parakeet-rs ONNX runtime.

~876 MB total vs ~2.34 GB for the FP32 export (≈2.74× smaller).
Only the encoder's MatMul weights are quantized to int8 (the graph uses MatMulInteger, with DynamicQuantizeLinear on the activations). Conv and all other tensors stay FP32 to avoid gaps in ConvInteger kernel support, and the fused decoder_joint and the tokenizer are unchanged.
Encoder graph I/O is preserved (processed_signal in, encoded out), so the encoder is a drop-in replacement in runtimes that load the FP32 export. Streaming framing is unchanged (560 ms).
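To make the MatMulInteger / DynamicQuantizeLinear pairing concrete, here is a minimal pure-Python sketch of the arithmetic involved. It is an illustration, not the ONNX Runtime implementation: the function names are hypothetical, the weight scheme is assumed to be symmetric with one scale per output column, and the runtime's exact rounding and range handling may differ.

```python
def dynamic_quantize_linear(x):
    """Quantize a float vector to uint8 at run time (DynamicQuantizeLinear-style)."""
    lo = min(0.0, min(x))  # the quantized range must always include zero
    hi = max(0.0, max(x))
    scale = (hi - lo) / 255.0
    if scale == 0.0:  # all-zero input: any positive scale works
        scale = 1.0
    zero_point = max(0, min(255, round(-lo / scale)))
    q = [max(0, min(255, round(v / scale) + zero_point)) for v in x]
    return q, scale, zero_point


def quantize_weight_per_channel(w):
    """Symmetric int8 quantization with one scale per output column (assumed scheme)."""
    n_cols = len(w[0])
    scales = []
    for j in range(n_cols):
        peak = max(abs(row[j]) for row in w)
        scales.append(peak / 127.0 if peak > 0 else 1.0)
    w_q = [
        [max(-127, min(127, round(row[j] / scales[j]))) for j in range(n_cols)]
        for row in w
    ]
    return w_q, scales


def quantized_matmul(x, w_q, w_scales):
    """Compute x @ W with integer accumulation, then rescale the result to float."""
    x_q, x_scale, x_zp = dynamic_quantize_linear(x)
    out = []
    for j, w_scale in enumerate(w_scales):
        # Integer dot product, analogous to what MatMulInteger accumulates.
        acc = sum((x_q[k] - x_zp) * w_q[k][j] for k in range(len(x_q)))
        out.append(acc * x_scale * w_scale)
    return out


# Toy data: a 3-element activation row and a 3x2 weight matrix.
x = [0.5, -1.0, 2.0]
w = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
w_q, w_scales = quantize_weight_per_channel(w)
approx = quantized_matmul(x, w_q, w_scales)
exact = [sum(x[k] * w[k][j] for k in range(3)) for j in range(2)]
print("quantized:", approx)
print("float:    ", exact)
```

The weights are quantized once, offline, which is where the size reduction comes from. The activations are quantized per call, so no calibration data is needed.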

Files

| File | Precision | Notes |
| --- | --- | --- |
| `encoder.onnx` (+ `.data`) | int8 (MatMul) | cache-aware streaming FastConformer encoder |
| `decoder_joint.onnx` | fp32 | fused RNN-T prediction net + joint (verbatim copy) |
| `tokenizer.model` | n/a | SentencePiece (verbatim copy) |

Provenance

Base model: nvidia/nemotron-speech-streaming-en-0.6b (NVIDIA Open Model License).
FP32 ONNX export source: altunenes/parakeet-rs (subdir nemotron-speech-streaming-en-0.6b).
int8 quantization: onnxruntime.quantization.quantize_dynamic (op_types_to_quantize=['MatMul'], weight_type=QInt8, per_channel=True).
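
The quantization step listed above could be reproduced roughly as follows. This is a sketch under stated assumptions rather than the exact script used: the function name `quantize_encoder` and the file paths are hypothetical, and `use_external_data_format=True` is an assumption based on the encoder shipping with a separate `.data` file.

```python
from pathlib import Path

# Settings quoted in the Provenance section.
OP_TYPES = ["MatMul"]
PER_CHANNEL = True


def quantize_encoder(fp32_path: str, int8_path: str) -> Path:
    """Dynamically quantize the MatMul weights of an FP32 encoder export to int8."""
    # Imported here so the dependency is only required when quantizing.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=fp32_path,
        model_output=int8_path,
        op_types_to_quantize=OP_TYPES,   # leave Conv and other ops in FP32
        weight_type=QuantType.QInt8,
        per_channel=PER_CHANNEL,
        use_external_data_format=True,   # assumption: weights live in a .data file
    )
    return Path(int8_path)


if __name__ == "__main__":
    # Hypothetical paths; point these at the FP32 export and the output location.
    quantize_encoder("fp32/encoder.onnx", "int8/encoder.onnx")
```

`decoder_joint.onnx` and `tokenizer.model` are not passed through this step. They are copied unchanged from the FP32 export.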

License

NVIDIA Open Model License Agreement: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/. The upstream NVIDIA license governs the weights regardless of export or quantization source. See NOTICE.txt.
