Nemotron Speech Streaming en 0.6B: int8 (parakeet-rs compatible)

Weight-only int8 quantization of the encoder of NVIDIA's nemotron-speech-streaming-en-0.6b (English FastConformer encoder + fused RNN-T decoder/joint, with punctuation & capitalization), packaged for the parakeet-rs ONNX runtime.

  • ~876 MB total vs ~2.34 GB for the FP32 export (≈2.74× smaller).
  • Only the encoder's MatMul weights are quantized to int8 (MatMulInteger / DynamicQuantizeLinear); Conv and other tensors stay FP32 to avoid gaps in ConvInteger kernel support, and the fused decoder_joint and tokenizer are unchanged.
  • Encoder graph I/O is preserved (processed_signal in, encoded out), so it is a drop-in replacement in runtimes that load the FP32 export. Streaming framing is unchanged (560 ms).
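
To make the weight-only int8 idea in the list above concrete, here is a minimal pure-Python sketch of symmetric per-channel quantization. It is an illustration only: the helper names `quantize_per_channel` and `dequantize` and the toy weights are hypothetical, and ONNX Runtime's actual rounding and range handling may differ.

```python
def quantize_per_channel(weights):
    """Symmetric int8 quantization with one scale per row (channel).

    Illustrative only; this is not ONNX Runtime's implementation.
    """
    quantized, scales = [], []
    for row in weights:
        amax = max(abs(w) for w in row)
        # One scale per channel maps the largest magnitude in the row to 127.
        scale = amax / 127.0 if amax > 0 else 1.0
        q_row = [max(-127, min(127, round(w / scale))) for w in row]
        quantized.append(q_row)
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Recover approximate float weights: w is roughly q * scale."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Toy weights: the second channel is 100x smaller than the first.
weights = [[1.27, -0.64, 0.0], [0.0127, 0.0, -0.0127]]
q, scales = quantize_per_channel(weights)
restored = dequantize(q, scales)
```

With a single scale shared by the whole tensor (0.01 here), the second row would quantize to [1, 0, -1] and lose most of its precision, which is the motivation for per-channel scales.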

Files

| File | Precision | Notes |
|---|---|---|
| encoder.onnx (+ .data) | int8 (MatMul) | cache-aware streaming FastConformer encoder |
| decoder_joint.onnx | fp32 | fused RNN-T prediction net + joint (verbatim copy) |
| tokenizer.model | n/a | SentencePiece (verbatim copy) |
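
One way to check the drop-in claim for encoder.onnx is to compare its input and output names against the ones this card states. The sketch below is hedged: `io_names`, `has_expected_io`, and `open_encoder` are hypothetical helpers, and the check only requires that `processed_signal` and `encoded` are present, since a cache-aware streaming encoder may expose additional tensors that this card does not list.

```python
def io_names(session):
    """Return (input names, output names) for an ONNX Runtime session."""
    inputs = [node.name for node in session.get_inputs()]
    outputs = [node.name for node in session.get_outputs()]
    return inputs, outputs


def has_expected_io(session, required_input="processed_signal",
                    required_output="encoded"):
    """True if the names stated in this card appear in the graph I/O."""
    inputs, outputs = io_names(session)
    return required_input in inputs and required_output in outputs


def open_encoder(path="encoder.onnx"):
    """Open the encoder on CPU (assumes onnxruntime is installed)."""
    # Imported lazily so the helpers above work without onnxruntime.
    import onnxruntime as ort
    return ort.InferenceSession(path, providers=["CPUExecutionProvider"])
```

Calling `has_expected_io(open_encoder("encoder.onnx"))` on both the FP32 and the int8 encoder should give the same answer if the I/O is preserved. The external .data file is expected to sit next to encoder.onnx when the session is created.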

Provenance

  • Base model: nvidia/nemotron-speech-streaming-en-0.6b (NVIDIA Open Model License).
  • FP32 ONNX export source: altunenes/parakeet-rs (subdir nemotron-speech-streaming-en-0.6b).
  • int8 quantization: onnxruntime.quantization.quantize_dynamic (op_types_to_quantize=['MatMul'], weight_type=QInt8, per_channel=True).
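
The quantization settings listed above can be sketched as a single call. This is a reconstruction under stated assumptions, not the exact script used: the function name `quantize_encoder` and the file paths are hypothetical, and `use_external_data_format=True` is inferred from the "(+ .data)" note in the Files table rather than stated in the settings.

```python
def quantize_encoder(fp32_path, int8_path):
    """Weight-only dynamic int8 quantization of MatMul nodes."""
    # Imported lazily; requires the onnxruntime package.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=fp32_path,
        model_output=int8_path,
        op_types_to_quantize=["MatMul"],  # Conv and other ops stay FP32
        per_channel=True,
        weight_type=QuantType.QInt8,
        # Assumption: weights are stored in an external data file because
        # large models can exceed the 2 GB protobuf limit.
        use_external_data_format=True,
    )
```

A call such as `quantize_encoder("fp32/encoder.onnx", "int8/encoder.onnx")` would write the quantized graph to the second path; decoder_joint.onnx and tokenizer.model are copied as-is.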

License

NVIDIA Open Model License Agreement: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/. The upstream NVIDIA license governs the weights regardless of export or quantization source. See NOTICE.txt.

