Nemotron Speech Streaming en 0.6B - int8 (parakeet-rs compatible)
Weight-only int8 quantization of the encoder of NVIDIA's
nemotron-speech-streaming-en-0.6b
(English FastConformer encoder + fused RNN-T decoder/joint, with punctuation & capitalization),
packaged for the parakeet-rs ONNX runtime.
- ~876 MB total vs ~2.34 GB for the FP32 export (~2.74× smaller).
- Quantized only the encoder's MatMul weights to int8 (`MatMulInteger`/`DynamicQuantizeLinear`); Conv and other tensors stay FP32 to avoid `ConvInteger` kernel gaps, and the fused `decoder_joint` + tokenizer are unchanged.
- Encoder graph I/O is preserved (`processed_signal` in, `encoded` out), so it is a drop-in for runtimes that load the FP32 export. Streaming framing is unchanged (560 ms).
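
The size ratio and the streaming chunk length quoted above can be checked with a little arithmetic. This is a minimal sketch, not part of the model's tooling: it assumes the sizes use binary units (1 GB = 1024 MB), and the 16 kHz sample rate is an assumption that is not stated in this card. The helper names are hypothetical.

```python
# Sizes quoted in this card.
FP32_SIZE_GB = 2.34
INT8_SIZE_MB = 876
MB_PER_GB = 1024  # assumption: binary units

# Streaming chunk length quoted in this card.
CHUNK_MS = 560
SAMPLE_RATE_HZ = 16_000  # assumption: not stated in this card


def size_ratio(fp32_gb: float, int8_mb: float, mb_per_gb: int = MB_PER_GB) -> float:
    """Return how many times smaller the int8 package is than the FP32 export."""
    return fp32_gb * mb_per_gb / int8_mb


def samples_per_chunk(chunk_ms: int = CHUNK_MS, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Return the number of audio samples in one streaming chunk."""
    return chunk_ms * sample_rate_hz // 1000


if __name__ == "__main__":
    print(f"{size_ratio(FP32_SIZE_GB, INT8_SIZE_MB):.2f}x smaller")  # 2.74x smaller
    print(samples_per_chunk(), "samples per chunk")  # 8960 samples per chunk
```

With decimal units (1 GB = 1000 MB) the same sizes give roughly 2.67×, so the quoted figure depends on the unit convention.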
Files
| file | precision | notes |
|---|---|---|
| encoder.onnx (+ .data) | int8 (MatMul) | cache-aware streaming FastConformer encoder |
| decoder_joint.onnx | fp32 | fused RNN-T prediction net + joint (verbatim copy) |
| tokenizer.model | n/a | SentencePiece (verbatim copy) |
Provenance
- Base model: `nvidia/nemotron-speech-streaming-en-0.6b` (NVIDIA Open Model License).
- FP32 ONNX export source: `altunenes/parakeet-rs` (subdir `nemotron-speech-streaming-en-0.6b`).
- int8 quantization: `onnxruntime.quantization.quantize_dynamic` (`op_types_to_quantize=['MatMul']`, `weight_type=QInt8`, `per_channel=True`).
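
The quantization step listed above could be reproduced along these lines. This is a hedged sketch rather than the exact script used for this repository: the three quantization settings come from the provenance note, while the file paths, the helper names, and `use_external_data_format=True` (suggested by the separate `.data` file) are assumptions.

```python
from pathlib import Path


def build_quantize_args(fp32_encoder: str, int8_encoder: str) -> dict:
    """Collect the keyword arguments for quantize_dynamic as a plain dict.

    op_types_to_quantize, weight_type and per_channel follow the provenance
    note; the paths and use_external_data_format are assumptions.
    """
    return {
        "model_input": Path(fp32_encoder),
        "model_output": Path(int8_encoder),
        "op_types_to_quantize": ["MatMul"],  # Conv and other ops stay FP32
        "per_channel": True,
        "weight_type": "QInt8",  # resolved to QuantType.QInt8 below
        "use_external_data_format": True,  # assumption: weights in a side file
    }


def quantize_encoder(fp32_encoder: str, int8_encoder: str) -> None:
    """Run dynamic int8 quantization on the encoder graph only."""
    # Imported lazily so the argument helper works without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    args = build_quantize_args(fp32_encoder, int8_encoder)
    args["weight_type"] = getattr(QuantType, args["weight_type"])
    quantize_dynamic(**args)


if __name__ == "__main__":
    # Hypothetical paths; point these at the FP32 export and an output location.
    quantize_encoder("fp32/encoder.onnx", "int8/encoder.onnx")
```

In this sketch `decoder_joint.onnx` and `tokenizer.model` are not passed through the quantizer at all; they would simply be copied next to the quantized encoder.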
License
NVIDIA Open Model License Agreement:
https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/.
The upstream NVIDIA license governs the weights regardless of export or quantization source.
See NOTICE.txt.