Any quantization released to reduce memory to fit into 8GB GPU RAM?

#33
by atolfia - opened

Fantastic work!!! But it was impossible for me to fit PersonaPlex into my 8GB of GPU RAM, even when trying the Rust implementation (moshi). Any plans for a quantized release to reduce memory so it fits into 8GB of GPU RAM?
Thanks!!!

This is great work. For me personally, I might try using Rust.

I am also trying to quantize it to 4-bit and run it on my RTX 3050 with GPU offloading and other methods, but I'm still facing issues. If you want, I can give you access to my GDrive where I have stored everything, so you can see what more can be done: nirooph1@gmail.com. Let's connect.

No plans for an official 4-bit quantization that will fit in 8GB of VRAM. However, I am working on an FP8 weights-only quantization that should fit in 16GB of VRAM. I will leave this discussion open in case others successfully make a 4-bit quantized version. No plans for official Rust support either; my resources are quite limited and focused on making a smarter future model.

I’m building a voice tutor platform and trying to use nvidia/personaplex-7b-v1 for speech-to-speech. I’m on a T4 cloud GPU (~15.5GB usable VRAM), but the fp16 model is ~16.7GB, so it doesn’t even fit during loading.

My goal is to quantize to int8 to reduce VRAM with minimal quality loss.
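Back-of-envelope arithmetic suggests this should work, assuming the weights dominate VRAM (activations and the KV cache add on top of this):

```python
# Weights-only VRAM estimate: int8 stores 1 byte per parameter vs 2 for fp16.
fp16_gb = 16.7                 # observed fp16 checkpoint size
params = fp16_gb * 1e9 / 2     # 2 bytes per fp16 parameter
int8_gb = params / 1e9         # 1 byte per int8 parameter
print(round(int8_gb, 2))       # → 8.35, which fits in 15.5 GB with headroom
```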

The issue I hit:
Even after quantization, dequantized fp16 tensors were being placed on the GPU during loading. So the weights were effectively piling up as fp16 in VRAM, and I still ran out of memory before inference even started.

So the real problem isn’t inference. It’s the loading phase.
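One way to avoid fp16 ever touching the GPU is to do the quantization entirely on the CPU and only then copy the 1-byte payload over. A minimal sketch, assuming simple symmetric absmax quantization (a real checkpoint would be loaded with `map_location="cpu"` and processed tensor by tensor):

```python
import torch

def absmax_quantize_int8(w: torch.Tensor):
    """Symmetric per-row int8 quantization, done on CPU before any GPU copy."""
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

# Demo on a stand-in weight matrix.
w = torch.randn(1024, 1024)
q, scale = absmax_quantize_int8(w)

# Round-trip check: the dequantized weight stays close to the original.
err = (q.float() * scale - w).abs().max().item()

# Only now move the compact int8 payload (plus tiny scales) to the GPU.
if torch.cuda.is_available():
    q, scale = q.cuda(), scale.cuda()
```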

What I now understand:

Conceptually, int8 should halve memory compared to fp16.

But if weights are dequantized to fp16 and moved to GPU during quantization, VRAM usage stays basically the same.

I need true int8 storage on GPU, where weights remain int8 and are dequantized on-the-fly during matmul.

The options I see:

True int8 inference with custom int8 linear layers (weights stay int8, dequant during forward pass).

CPU offloading for some layers.

Replace Linear layers with proper quantized modules that store int8 weights.
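The first and third options above can be sketched as one custom module: the weights live in VRAM as int8 buffers and are dequantized only inside the forward pass. This is a minimal illustration, not a drop-in for any particular model's architecture:

```python
import torch
import torch.nn as nn

class Int8Linear(nn.Module):
    """Linear layer that stores int8 weights and dequantizes per forward."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.detach()
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        # Buffers, not Parameters: they stay int8 and move with .cuda()/.to().
        self.register_buffer(
            "weight_int8",
            torch.round(w / scale).clamp(-127, 127).to(torch.int8),
        )
        self.register_buffer("scale", scale)
        self.bias = linear.bias

    def forward(self, x):
        # Dequantize on the fly; only the int8 copy lives in VRAM permanently.
        w = self.weight_int8.to(x.dtype) * self.scale.to(x.dtype)
        return nn.functional.linear(x, w, self.bias)

def swap_linears(module: nn.Module):
    """Recursively replace every nn.Linear child with Int8Linear."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, Int8Linear(child))
        else:
            swap_linears(child)
```

The trade-off is extra dequantization work per forward pass; the win is that VRAM holds 1 byte per weight instead of 2.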

Right now I’m considering PyTorch dynamic quantization to replace Linear layers with quantized versions that keep weights in int8 and dequantize during forward pass. But I need to confirm the model architecture supports that cleanly.

If anyone has handled int8 loading for 7B models on T4 without OOM during load, I’d appreciate guidance.
