Nemotron Stream Quants?

#2
by lokkju - opened

Do you happen to have, or does your pipeline support generating, int4 and int8 quants for nemotron-streaming? It doesn't quite perform well enough CPU-only for my real-time use case on my hardware.

My quants seem to be working well enough; code pipeline is at https://github.com/lokkju/nemotron-speech-streaming-onnx-quant
int4 and int8 quants are now linked to this repo

lokkju changed discussion status to closed

Good! I currently have hardware limitations and can't export quantized models (at least for now) :-(. But glad to hear you managed it :-).
Is the CPU improvement significant?

I tested on the Rust side. It now uses 1.2 GB of RAM (the default model uses 2 GB). For CPU usage, the improvement seems slight.
I have some validation audios for my real-world scenarios. Normally one takes 29 s to complete, but with the int8 model it takes 11.68 s.

RTF: default: 3.7x
int8: 9.09x
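For reference, RTF here is audio seconds divided by wall-clock processing seconds, so higher is faster. A minimal sketch, assuming a hypothetical ~107 s validation clip consistent with the 29 s and 11.68 s timings above:

```rust
/// Real-time factor: seconds of audio transcribed per second of wall-clock
/// time. Values above 1.0 mean faster than real time.
fn rtf(audio_secs: f64, processing_secs: f64) -> f64 {
    audio_secs / processing_secs
}

fn main() {
    // Hypothetical ~107 s clip, consistent with the timings in this thread.
    let audio_secs = 107.3;
    println!("default: {:.2}x", rtf(audio_secs, 29.0));
    println!("int8:    {:.2}x", rtf(audio_secs, 11.68));
}
```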

Really nice! Quality hasn't dropped that much, btw. It will probably mostly affect grammatical structure, numbers, and special names. I will also run some tests to get additional insights.

Additional insights: the "quality drop" in the int8 model is minor and mostly phonetic. Some examples (base is correct):

Phonetic substitutions
Base: "...mutilation of children."
Int8: "...mutilation of Chelton."

Loss of Fine Detail (Conversational fillers)
Base: "I actually ask my people today. Yes, sir, because people..."
Int8: "I actually ask my people today because people..."

Stutters
Base: "...but it's much more cumbersome..."
Int8: "...but it's more much more cumbersome..."

Note that those are minor things, and numbers and special names actually look good.
I think that at 9x real-time, this quality drop is acceptable for most applications.

I was hoping that the int quants would speed up CPU inference enough for proper real-time streaming output, but at least on my 7840U it's not quite there. I'm exploring either adding some missing operations to the upstream WebGPU EP (to give us Vulkan support), or possibly alternative model formats (ncnn) or backends (burn-rs with fp16). I'd love your thoughts; at its root, I'd like to get a decent speedup via Vulkan, in the same way whisper does.

Interestingly, I'm not seeing that significant of a speed increase on CPU, though I'm testing via vox type; it's possible my harness is slowing it down somewhere. Did you see that speedup with CPU, or with a different backend?

I didn't say there was a significant improvement in CPU usage; I only said there were significant improvements in RAM and speed. You can look here for the most "balanced" settings for the CPU:
https://docs.rs/ort/latest/ort/session/builder/struct.SessionBuilder.html

In parakeet-rs you can use `with_custom_configure`:

let config = ExecutionConfig::new()
    .with_custom_configure(|builder| builder.with_memory_pattern(false));

`with_intra_threads(n)` in particular may affect CPU usage.
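As a rough illustration of picking `n` for `with_intra_threads` (the halving heuristic below is only an assumption, not from the ort docs; `available_parallelism` is std Rust):

```rust
use std::thread;

// On SMT or hybrid CPUs, using fewer intra-op threads than the logical core
// count can reduce contention for streaming workloads. The halving heuristic
// here is only an illustrative starting point; benchmark on your own hardware.
fn suggested_intra_threads() -> usize {
    let logical = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    (logical / 2).max(1)
}

fn main() {
    println!("intra threads: {}", suggested_intra_threads());
}
```

The resulting value would then be passed to `builder.with_intra_threads(...)` inside the `with_custom_configure` closure.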

The last time I checked (last year), Burn didn't support many ONNX operators. I wish it did, but for now ORT is the best option for ONNX. WebGPU support is coming to ort soon (it's already there, but limited to browsers), so I assume we'll be able to use WebGPU as an EP on desktop with ORT:
https://github.com/microsoft/onnxruntime/issues/26328#issuecomment-3786348916

I've never tried ncnn, so no comment.
