Pre-quantized 4-bit checkpoint + ComfyUI Node for 16GB GPUs

#7 by PavonicDev

Hey everyone!

We got HeartMuLa-oss-3B running on 16 GB consumer GPUs (tested on an RTX 5070 Ti) using bitsandbytes 4-bit NF4 quantization.

Along the way we fixed several compatibility issues with current library versions and packaged everything into a ready-to-use solution:

Pre-quantized 4-bit Checkpoint

PavonicAI/HeartMuLa-3B-4bit

  • Pre-quantized (NF4), loads in seconds instead of quantizing on-the-fly
  • ~4.87 GB instead of ~12 GB
  • Runs on 16 GB VRAM GPUs

ComfyUI Custom Node (All-in-One)

GitHub: PavonicAI/ForgeAI-HeartMuLa

  • All-in-one music generation node (lyrics + tags -> audio file)
  • WAV and MP3 export
  • Built-in quantization selection (4bit / 8bit / none)
  • Lyrics transcriber node included

Compatibility Fixes Included

All fixes are baked into the ComfyUI node; no manual patching needed:

  1. transformers 5.x: ignore_mismatched_sizes=True for from_pretrained()
  2. torchtune >= 0.5 RoPE fix: Monkey-patch setup_caches() to call rope_init()
  3. OOM fix: model.cpu() offload before codec decoding
  4. torchaudio/torchcodec: Replaced with soundfile (no torchcodec dependency)
  5. bitsandbytes 4-bit: Full NF4 quantization support

Hardware Tested

  • RTX 5070 Ti (16 GB) - works perfectly at 4-bit
  • ~10 it/s generation speed, ~76 seconds for 60s of audio

Hope this helps others who want to run HeartMuLa on consumer hardware!

Made by ForgeAI / PavonicAI

