Pre-quantized 4-bit checkpoint + ComfyUI Node for 16GB GPUs

#7 by PavonicDev

Hey everyone!

We got HeartMuLa-oss-3B running on 16 GB consumer GPUs (tested on an RTX 5070 Ti) using bitsandbytes 4-bit NF4 quantization.

Along the way we fixed several compatibility issues with current library versions and packaged everything into a ready-to-use solution:

Pre-quantized 4-bit Checkpoint

PavonicAI/HeartMuLa-3B-4bit

  • Pre-quantized (NF4), loads in seconds instead of quantizing on-the-fly
  • ~4.87 GB instead of ~12 GB
  • Runs on 16 GB VRAM GPUs

ComfyUI Custom Node (All-in-One)

GitHub: PavonicAI/ForgeAI-HeartMuLa

  • All-in-one music generation node (lyrics + tags -> audio file)
  • WAV and MP3 export
  • Built-in quantization selection (4bit / 8bit / none)
  • Lyrics transcriber node included

Compatibility Fixes Included

All fixes are baked into the ComfyUI node; no manual patching needed:

  1. transformers 5.x: ignore_mismatched_sizes=True for from_pretrained()
  2. torchtune >= 0.5 RoPE fix: Monkey-patch setup_caches() to call rope_init()
  3. OOM fix: model.cpu() offload before codec decoding
  4. torchaudio/torchcodec: Replaced with soundfile (no torchcodec dependency)
  5. bitsandbytes 4-bit: Full NF4 quantization support

Hardware Tested

  • RTX 5070 Ti (16 GB) - works perfectly at 4-bit
  • ~10 it/s generation speed, ~76 seconds for 60s of audio

Hope this helps others who want to run HeartMuLa on consumer hardware!

Made by ForgeAI / PavonicAI

