---
license: apache-2.0
tags:
  - music-generation
  - heartmula
  - 4bit
  - quantized
  - bitsandbytes
  - nf4
  - comfyui
base_model: HeartMuLa/HeartMuLa-oss-3B
library_name: transformers
---

# HeartMuLa 3B - 4-bit NF4 Quantized

Pre-quantized 4-bit (NF4) checkpoint of HeartMuLa-oss-3B for 16 GB VRAM GPUs (RTX 4060 Ti, RTX 5070 Ti, etc.).

## Demo Songs

All songs generated with this checkpoint on an RTX 5070 Ti (16 GB) using our ForgeAI ComfyUI Node:

| Song | Genre | Duration | CFG |
|------|-------|----------|-----|
| Codigo del Alma (CFG 2) | Spanish Pop, Emotional | 3:00 | 2.0 |
| Codigo del Alma (CFG 3) | Spanish Pop, Emotional | 3:00 | 3.0 |
| Codigo del Alma (60s) | Spanish Pop | 1:00 | 2.0 |
| Codigo del Alma (Latin) | Latin Pop | 1:00 | 2.0 |
| Runtime | Chill, R&B | 3:00 | 2.0 |
| Forged in Code | Country Pop | 2:00 | 2.0 |
| Digital Rain | Electronic | 1:00 | 2.0 |
| Pixel Life | Pop | 1:00 | 2.0 |

## The Problem

The original HeartMuLa 3B model requires 15 GB VRAM in bfloat16. Together with HeartCodec (1.5 GB), it exceeds 16 GB VRAM, making it impossible to run on consumer GPUs like RTX 4060 Ti, RTX 5070 Ti, etc.

On top of that, the original code has several compatibility issues with modern PyTorch/transformers/torchtune versions (see fixes below).

## What This Checkpoint Does

- 4-bit NF4 quantized HeartMuLa 3B (~4.9 GB instead of ~6 GB)
- Fits on 16 GB VRAM together with HeartCodec
- Works with PyTorch 2.4+, transformers 4.57+/5.x, torchtune 0.4+

## ComfyUI Usage (Recommended)

Use our ForgeAI HeartMuLa ComfyUI Node for the easiest setup. All compatibility fixes are applied automatically.

Also available on the ComfyUI Registry.

### Setup

1. Install via ComfyUI Manager or clone into `custom_nodes`:

   ```
   cd ComfyUI/custom_nodes
   git clone https://github.com/PavonicAI/ForgeAI-HeartMuLa.git
   pip install -r ForgeAI-HeartMuLa/requirements.txt
   ```

2. Download this checkpoint into your ComfyUI models folder:

   ```
   ComfyUI/models/HeartMuLa/HeartMuLa-oss-3B/
   ```

3. You still need the original HeartCodec and tokenizer from the original repo:

   ```
   ComfyUI/models/HeartMuLa/
   ├── HeartMuLa-oss-3B/     ← this checkpoint
   ├── HeartCodec-oss/       ← from original repo
   ├── tokenizer.json        ← from original repo
   └── gen_config.json       ← from original repo
   ```

## Tag Guide

HeartMuLa uses comma-separated tags to control style. Genre is the most important tag; always put it first.

```
genre:pop, emotional, synth, warm, female voice
```
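If you generate tag strings from a script, a small helper keeps the genre tag in front (`build_tags` is a hypothetical helper for illustration, not part of the HeartMuLa API):

```python
def build_tags(genre: str, *styles: str) -> str:
    """Assemble a comma-separated tag string with the genre tag first."""
    return ", ".join([f"genre:{genre}", *styles])

print(build_tags("pop", "emotional", "synth", "warm", "female voice"))
# genre:pop, emotional, synth, warm, female voice
```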

## CFG Scale

| CFG | Best For | Notes |
|-----|----------|-------|
| 2.0 | Pop, Ballads, Emotional | Sweet spot for clean vocals |
| 3.0 | Rock, Latin, Uptempo | More energy |
| 4.0+ | Electronic, Dance | May introduce artifacts |
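When driving generation from a script, the table above can be captured in a small lookup. This is illustrative only; the genre keywords and the 2.0 default are assumptions, not an official mapping:

```python
def suggest_cfg(genre: str) -> float:
    """Return a starting CFG scale for a genre, per the guide above."""
    recommendations = {
        "pop": 2.0, "ballad": 2.0, "emotional": 2.0,
        "rock": 3.0, "latin": 3.0, "uptempo": 3.0,
        "electronic": 4.0, "dance": 4.0,
    }
    # assumed default: 2.0, the sweet spot for clean vocals
    return recommendations.get(genre.lower(), 2.0)

print(suggest_cfg("Latin"))  # 3.0
```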

## Structure Tags (in Lyrics)

```
[intro]
[verse]
Your lyrics here...
[chorus]
Chorus lyrics...
[outro]
```
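If you build lyrics programmatically, the structure tags can be emitted from a list of sections. This is a sketch; `format_lyrics` is a hypothetical helper, not part of HeartMuLa:

```python
def format_lyrics(sections: list[tuple[str, str]]) -> str:
    """Render (tag, text) pairs as bracketed structure tags plus lyrics."""
    lines = []
    for tag, text in sections:
        lines.append(f"[{tag}]")
        if text:  # [intro]/[outro] usually carry no lyrics
            lines.append(text)
    return "\n".join(lines)

print(format_lyrics([("intro", ""), ("verse", "Your lyrics here...")]))
# [intro]
# [verse]
# Your lyrics here...
```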

## Manual Setup (Without ComfyUI)

If you want to use this checkpoint without ComfyUI, you need to apply several code fixes manually. See the sections below.

## Required Code Fixes

### 1. `ignore_mismatched_sizes` Error (transformers 5.x)

Add `ignore_mismatched_sizes=True` to ALL `from_pretrained()` calls:

```python
HeartCodec.from_pretrained(..., ignore_mismatched_sizes=True)
HeartMuLa.from_pretrained(..., ignore_mismatched_sizes=True)
```

### 2. `RoPE cache is not built` Error (torchtune >= 0.5)

In `modeling_heartmula.py`, add RoPE init to `setup_caches()`:

```python
def setup_caches(self, ...):
    # ... existing cache setup ...
    for m in self.modules():
        if hasattr(m, "rope_init"):
            m.rope_init()   # rebuild the RoPE cache
            m.to(device)    # keep the rebuilt cache on the target device
```

### 3. OOM at Codec Decode (16 GB GPUs)

Offload the language model to CPU before the codec decode:

```python
# move HeartMuLa off the GPU so HeartCodec has room to decode
self.model.cpu()
torch.cuda.empty_cache()
wav = self.audio_codec.detokenize(frames)
```

### 4. `torchcodec` Missing (torchaudio >= 2.10)

Replace the `torchaudio` save call with `soundfile`:

```python
import soundfile as sf

# wav_np: NumPy float array with shape (samples, channels)
sf.write(save_path, wav_np, 48000)
```

### 5. 4-bit Quantization Loading

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

# HeartMuLa is the model class from the repo's modeling_heartmula.py
model = HeartMuLa.from_pretrained(
    "PavonicAI/HeartMuLa-3B-4bit",
    quantization_config=bnb_config,
    device_map="cuda:0",
    ignore_mismatched_sizes=True,
)
```

## Hardware Tested

- NVIDIA RTX 5070 Ti (16 GB) with 4-bit quantization
- ~13 GB VRAM during generation, ~8 GB during encoding
- Stable for hours of continuous generation
- Output: 48 kHz stereo audio

## Credits

## License

Apache-2.0 (same as the original model)