Pre-quantized 4-bit checkpoint + ComfyUI Node for 16GB GPUs
#6 · by PavonicDev · opened
Hey everyone! 👋
We got HeartMuLa-oss-3B running on 16 GB consumer GPUs (tested on an RTX 5070 Ti) using bitsandbytes 4-bit NF4 quantization.
Along the way we fixed several compatibility issues with current library versions and packaged everything into a ready-to-use solution:
🔥 Pre-quantized 4-bit Checkpoint
- Pre-quantized (NF4), loads in seconds instead of quantizing on-the-fly
- ~4.87 GB instead of ~12 GB
- Runs on 16 GB VRAM GPUs
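
A pre-quantized checkpoint normally ships its quantization settings inside its `config.json`, so `from_pretrained` can pick them up directly. For reference, here is a minimal sketch of the NF4 settings such a checkpoint implies, expressed as the kwargs you would pass to `transformers.BitsAndBytesConfig`. The helper name and the `<quantized-repo>` placeholder are illustrative, not the node's actual code:

```python
# Sketch of NF4 quantization settings (hypothetical helper; the node's
# actual loading code may differ).

def nf4_config_kwargs():
    """Kwargs for transformers.BitsAndBytesConfig matching 4-bit NF4."""
    return {
        "load_in_4bit": True,                  # store weights in 4 bits
        "bnb_4bit_quant_type": "nf4",          # NormalFloat4 data type
        "bnb_4bit_compute_dtype": "bfloat16",  # matmuls still run in bf16
        "bnb_4bit_use_double_quant": True,     # also quantize the quant constants
    }

# Usage (requires transformers, bitsandbytes, and a CUDA GPU):
#   from transformers import AutoModelForCausalLM, BitsAndBytesConfig
#   cfg = BitsAndBytesConfig(**nf4_config_kwargs())
#   model = AutoModelForCausalLM.from_pretrained(
#       "<quantized-repo>", quantization_config=cfg, device_map="auto")
```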
🎛️ ComfyUI Custom Node (All-in-One)
GitHub: PavonicAI/ForgeAI-HeartMuLa
- All-in-one music generation node (lyrics + tags → audio file)
- WAV and MP3 export
- Built-in quantization selection (4bit / 8bit / none)
- Lyrics transcriber node included
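
For readers new to ComfyUI custom nodes, here is a sketch of the interface such an all-in-one node typically exposes (`INPUT_TYPES`, `RETURN_TYPES`, `FUNCTION`, `CATEGORY` are the standard ComfyUI node hooks; the class name and field names below are illustrative, not the repo's actual code):

```python
# Illustrative ComfyUI node skeleton; names are hypothetical.
class HeartMuLaGenerate:
    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "lyrics": ("STRING", {"multiline": True}),
                "tags": ("STRING", {"default": "pop, female vocal"}),
                "quantization": (["4bit", "8bit", "none"],),
                "duration_s": ("INT", {"default": 60, "min": 10, "max": 240}),
            }
        }

    RETURN_TYPES = ("AUDIO",)
    FUNCTION = "generate"
    CATEGORY = "audio/ForgeAI"

    def generate(self, lyrics, tags, quantization, duration_s):
        # Real node: load the (optionally quantized) model, run generation,
        # decode tokens with the codec, and return the waveform.
        raise NotImplementedError
```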
🔧 Compatibility Fixes Included
All fixes are baked into the ComfyUI node; no manual patching needed:
- transformers 5.x: compatibility fix for the current API
- torchtune >= 0.5: RoPE fix via a monkey-patch
- OOM fix: the model is offloaded before codec decoding
- torchaudio/torchcodec: audio export reworked to drop the torchcodec dependency
- bitsandbytes 4-bit: Full NF4 quantization support
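
The OOM fix follows a common pattern: move the language model off the GPU before the audio codec decodes, so both never occupy VRAM at once. A minimal sketch (function and names hypothetical; the real node would also call `torch.cuda.empty_cache()` after offloading):

```python
# Offload-before-decode pattern behind the OOM fix (illustrative sketch).

def decode_with_offload(model, codec, tokens, device="cuda"):
    model.to("cpu")               # move LM weights out of VRAM first
    # torch.cuda.empty_cache()    # with torch: release cached VRAM blocks
    audio = codec.decode(tokens)  # codec now has (almost) the whole GPU
    model.to(device)              # bring the LM back for the next run
    return audio
```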
Hardware Tested
- RTX 5070 Ti (16 GB): works perfectly at 4-bit
- ~10 it/s generation speed; ~76 seconds to generate 60 s of audio
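
As a quick sanity check, those two figures are consistent (assuming the it/s rate holds for the whole run):

```python
# Cross-checking the reported throughput numbers.
wall_s, audio_s, it_per_s = 76.0, 60.0, 10.0

rtf = wall_s / audio_s     # real-time factor; >1 means slower than real time
iters = wall_s * it_per_s  # implied total decoding iterations

print(f"RTF ~{rtf:.2f}, ~{iters:.0f} iterations")
```

So a 60 s clip takes roughly 1.27x real time on this card.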
Hope this helps others who want to run HeartMuLa on consumer hardware! 🎉
Made with ❤️ by ForgeAI / PavonicAI