PavonicAI/HeartMuLa-3B-4bit

@@ -12,10 +12,25 @@ base_model: HeartMuLa/HeartMuLa-oss-3B
 library_name: transformers
 ---
-# HeartMuLa 3B — 4-bit NF4 Quantized
 Pre-quantized 4-bit (NF4) checkpoint of [HeartMuLa-oss-3B](https://huggingface.co/HeartMuLa/HeartMuLa-oss-3B) for **16 GB VRAM GPUs** (RTX 4060 Ti, RTX 5070 Ti, etc.).
 ## The Problem
 The original HeartMuLa 3B model requires ~15 GB VRAM in bfloat16. Together with HeartCodec (~1.5 GB), it exceeds 16 GB VRAM, making it impossible to run on consumer GPUs like RTX 4060 Ti, RTX 5070 Ti, etc.
@@ -28,91 +43,110 @@ On top of that, the original code has several compatibility issues with modern P
 - Fits on **16 GB VRAM** together with HeartCodec
 - Works with **PyTorch 2.4+**, **transformers 4.57+/5.x**, **torchtune 0.4+**
-## ComfyUI Usage
-This checkpoint works with the [HeartMuLa ComfyUI custom nodes](https://github.com/BenjaminBurworworworton/HeartMuLa_ComfyUI), but you need to apply the code fixes listed below to make it work with modern package versions.
 ### Setup
-1. Download this checkpoint into your ComfyUI models folder:
-   ```
-   ComfyUI/models/HeartMuLa/HeartMuLa-4bit-3B/
    ```
-2. You still need the original [HeartCodec](https://huggingface.co/HeartMuLa/HeartMuLa-oss-3B) and tokenizer from the original repo
-3. Install required packages in ComfyUI's Python:
-   ```bash
-   pip install bitsandbytes soundfile
    ```
-## Required Code Fixes
-If you're using modern package versions (PyTorch 2.4+, transformers 5.x, torchtune 0.5+), you need these fixes in your heartlib code:
-### 1. `ignore_mismatched_sizes` Error (transformers 5.x)
-Add `ignore_mismatched_sizes=True` to ALL `from_pretrained()` calls in `music_generation.py` and `lyrics_transcription.py`:
 ```python
-# In music_generation.py - HeartCodec loading
 HeartCodec.from_pretrained(..., ignore_mismatched_sizes=True)
-# In music_generation.py - HeartMuLa loading
 HeartMuLa.from_pretrained(..., ignore_mismatched_sizes=True)
-# In lyrics_transcription.py - Whisper loading
-WhisperForConditionalGeneration.from_pretrained(..., ignore_mismatched_sizes=True)
 ```
-### 2. `RoPE cache is not built` Error (torchtune >= 0.5)
-In `modeling_heartmula.py`, add this to the `setup_caches()` method after the cache setup:
 ```python
 def setup_caches(self, ...):
-    # ... existing cache setup code ...
-    # ADD THIS: Initialize RoPE caches (required for torchtune >= 0.5)
     for m in self.modules():
-        if hasattr(m, 'rope_init'):
             m.rope_init()
             m.to(device)
 ```
-### 3. OOM at Codec Decode (16 GB GPUs)
-In `music_generation.py`, offload the model to CPU before running HeartCodec:
 ```python
-# After generating frames, BEFORE codec decode:
-frames = torch.stack(frames).permute(1, 2, 0).squeeze(0)
-self.model.reset_caches()
-self.model.cpu()           # <-- ADD THIS
-torch.cuda.empty_cache()   # <-- ADD THIS
 wav = self.audio_codec.detokenize(frames)
 ```
-### 4. `torchcodec` Missing (torchaudio >= 2.10)
-Replace `torchaudio.save()` and `torchaudio.load()` with `soundfile`:
 ```python
-# Instead of torchaudio.save():
 import soundfile as sf
-wav_np = wav.cpu().float().numpy()
-if wav_np.ndim == 2:
-    wav_np = wav_np.T
 sf.write(save_path, wav_np, 48000)
-# Instead of torchaudio.load():
-audio_data, sample_rate = sf.read(path, dtype='float32')
-waveform = torch.from_numpy(audio_data)
 ```
-### 5. 4-bit Quantization Loading
-When loading this checkpoint, use `device_map="cuda:0"`:
 ```python
 from transformers import BitsAndBytesConfig
@@ -131,24 +165,17 @@ model = HeartMuLa.from_pretrained(
 )
 ```
-## Requirements
-- `torch >= 2.4` with CUDA
-- `bitsandbytes >= 0.43`
-- `transformers >= 4.57`
-- `torchtune >= 0.4`
-- `soundfile`
-- HeartCodec + tokenizer weights from [original HeartMuLa repo](https://huggingface.co/HeartMuLa/HeartMuLa-oss-3B)
 ## Hardware Tested
-- NVIDIA RTX 5070 Ti (16 GB) — works with 4-bit quantization + CPU offload during codec decode
-- Output: 48kHz WAV audio
 ## Credits
 - Original model by [HeartMuLa Team](https://heartmula.github.io/) (Apache-2.0)
-- Quantization & compatibility fixes by ForgeAI / PavonicAI
 ## License

 library_name: transformers
 ---
+# HeartMuLa 3B - 4-bit NF4 Quantized
 Pre-quantized 4-bit (NF4) checkpoint of [HeartMuLa-oss-3B](https://huggingface.co/HeartMuLa/HeartMuLa-oss-3B) for **16 GB VRAM GPUs** (RTX 4060 Ti, RTX 5070 Ti, etc.).
+## Demo Songs
+All songs were generated with this checkpoint on an RTX 5070 Ti (16 GB) using our [ForgeAI ComfyUI Node](https://github.com/PavonicAI/ForgeAI-HeartMuLa):
+| Song | Genre | Duration | CFG |
+|---|---|---|---|
+| [Codigo del Alma (CFG 2)](demos/Codigo_del_Alma_cfg2.mp3) | Spanish Pop, Emotional | 3:00 | 2.0 |
+| [Codigo del Alma (CFG 3)](demos/Codigo_del_Alma_cfg3.mp3) | Spanish Pop, Emotional | 3:00 | 3.0 |
+| [Codigo del Alma (60s)](demos/Codigo_del_Alma_60s.mp3) | Spanish Pop | 1:00 | 2.0 |
+| [Codigo del Alma (Latin)](demos/Codigo_del_Alma_Latin.mp3) | Latin Pop | 1:00 | 2.0 |
+| [Runtime](demos/Runtime.mp3) | Chill, R&B | 3:00 | 2.0 |
+| [Forged in Code](demos/Forged_in_Code.mp3) | Country Pop | 2:00 | 2.0 |
+| [Digital Rain](demos/Digital_Rain.mp3) | Electronic | 1:00 | 2.0 |
+| [Pixel Life](demos/Pixel_Life.mp3) | Pop | 1:00 | 2.0 |
 ## The Problem
 The original HeartMuLa 3B model requires ~15 GB VRAM in bfloat16. Together with HeartCodec (~1.5 GB), it exceeds 16 GB VRAM, making it impossible to run on consumer GPUs like RTX 4060 Ti, RTX 5070 Ti, etc.
 - Fits on **16 GB VRAM** together with HeartCodec
 - Works with **PyTorch 2.4+**, **transformers 4.57+/5.x**, **torchtune 0.4+**
+## ComfyUI Usage (Recommended)
+Use our **[ForgeAI HeartMuLa ComfyUI Node](https://github.com/PavonicAI/ForgeAI-HeartMuLa)** for the easiest setup. All compatibility fixes are applied automatically.
+Also available on the [ComfyUI Registry](https://registry.comfy.org/publishers/forgeai/nodes/forgeai-heartmula).
 ### Setup
+1. Install via ComfyUI Manager, or clone the repository into `custom_nodes`:
+   ```bash
+   cd ComfyUI/custom_nodes
+   git clone https://github.com/PavonicAI/ForgeAI-HeartMuLa.git
+   pip install -r ForgeAI-HeartMuLa/requirements.txt
    ```
+2. Download this checkpoint into your ComfyUI models folder:
+   ```
+   ComfyUI/models/HeartMuLa/HeartMuLa-oss-3B/
+   ```
+3. You still need HeartCodec, the tokenizer, and `gen_config.json` from the [original repo](https://huggingface.co/HeartMuLa/HeartMuLa-oss-3B):
+   ```
+   ComfyUI/models/HeartMuLa/
+     ├── HeartMuLa-oss-3B/    ← this checkpoint
+     ├── HeartCodec-oss/       ← from original repo
+     ├── tokenizer.json        ← from original repo
+     └── gen_config.json       ← from original repo
    ```
+## Tag Guide
+HeartMuLa uses comma-separated tags to control style. **Genre is the most important tag** — always put it first.
+```
+genre:pop, emotional, synth, warm, female voice
+```
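As a minimal sketch of the ordering rule above, the hypothetical helper `build_tags` (not part of HeartMuLa or the ComfyUI node) joins a genre and a list of style tags into one comma-separated string with the genre first. The `genre:` prefix mirrors the example above; whether the model requires that prefix is an assumption.

```python
def build_tags(genre, styles=()):
    """Return a comma-separated tag string with the genre tag first.

    Hypothetical helper for illustration; not part of HeartMuLa.
    """
    genre = genre.strip().lower()
    if not genre:
        raise ValueError("a genre is required; it is the most important tag")
    tags = [f"genre:{genre}"]
    for style in styles:
        style = style.strip()
        # Skip empty entries and duplicates so the string stays clean.
        if style and style not in tags:
            tags.append(style)
    return ", ".join(tags)


print(build_tags("pop", ["emotional", "synth", "warm", "female voice"]))
# genre:pop, emotional, synth, warm, female voice
```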
+### CFG Scale
+| CFG | Best For | Notes |
+|---|---|---|
+| **2.0** | Pop, Ballads, Emotional | Sweet spot for clean vocals |
+| **3.0** | Rock, Latin, Uptempo | More energy |
+| **4.0+** | Electronic, Dance | May introduce artifacts |
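The table can also be kept as a small lookup when scripting batches of generations. This is only a sketch: the values are copied from the table above, while the name `suggest_cfg` and the fallback of 2.0 for unlisted genres are assumptions, not part of the node.

```python
# Starting points copied from the CFG table above; keys are lowercase.
CFG_BY_GENRE = {
    "pop": 2.0, "ballads": 2.0, "emotional": 2.0,
    "rock": 3.0, "latin": 3.0, "uptempo": 3.0,
    "electronic": 4.0, "dance": 4.0,
}


def suggest_cfg(genre, default=2.0):
    """Return a starting CFG scale for a genre (hypothetical helper).

    Unlisted genres fall back to `default`, which is an assumption.
    """
    return CFG_BY_GENRE.get(genre.strip().lower(), default)


print(suggest_cfg("Rock"))  # 3.0
```

Treat the result as a starting point and adjust by ear, since values of 4.0 and above may introduce artifacts.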
+### Structure Tags (in Lyrics)
+```
+[intro]
+[verse]
+Your lyrics here...
+[chorus]
+Chorus lyrics...
+[outro]
+```
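When lyrics are assembled programmatically, the layout above can be produced with a short helper. This is a sketch under assumptions: `build_lyrics` is a hypothetical name, and the only format it relies on is the bracketed section tag on its own line, as shown above.

```python
def build_lyrics(sections):
    """Join (tag, text) pairs into a lyrics string with bracketed section tags.

    Hypothetical helper; `text` may be empty for instrumental sections.
    """
    lines = []
    for tag, text in sections:
        lines.append(f"[{tag.strip().lower()}]")
        text = text.strip()
        if text:
            lines.append(text)
    return "\n".join(lines)


lyrics = build_lyrics([
    ("intro", ""),
    ("verse", "Your lyrics here..."),
    ("chorus", "Chorus lyrics..."),
    ("outro", ""),
])
print(lyrics)
```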
+## Manual Setup (Without ComfyUI)
+If you want to use this checkpoint without ComfyUI, you need to apply several code fixes manually. See the sections below.
+### Required Code Fixes
+#### 1. `ignore_mismatched_sizes` Error (transformers 5.x)
+Add `ignore_mismatched_sizes=True` to ALL `from_pretrained()` calls:
 ```python
 HeartCodec.from_pretrained(..., ignore_mismatched_sizes=True)
 HeartMuLa.from_pretrained(..., ignore_mismatched_sizes=True)
 ```
+#### 2. `RoPE cache is not built` Error (torchtune >= 0.5)
+In `modeling_heartmula.py`, add RoPE init to `setup_caches()`:
 ```python
 def setup_caches(self, ...):
+    # ... existing cache setup ...
     for m in self.modules():
+        if hasattr(m, "rope_init"):
             m.rope_init()
             m.to(device)
 ```
+#### 3. OOM at Codec Decode (16 GB GPUs)
+Offload the model to CPU and free cached GPU memory before the codec decode:
 ```python
+self.model.cpu()
+torch.cuda.empty_cache()
 wav = self.audio_codec.detokenize(frames)
 ```
+#### 4. `torchcodec` Missing (torchaudio >= 2.10)
+Replace `torchaudio.save()` and `torchaudio.load()` with `soundfile`:
 ```python
 import soundfile as sf
 sf.write(save_path, wav_np, 48000)
 ```
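The snippet above assumes `wav_np` already exists. Following the earlier revision of this README, the tensor is moved to CPU, cast to float32, and transposed from channels-first to the `(frames, channels)` layout that `soundfile.write` expects. A minimal sketch of that step; the helper names `to_soundfile_layout` and `save_wav` are hypothetical, and the NumPy array stands in for `wav.cpu().float().numpy()`:

```python
import numpy as np


def to_soundfile_layout(wav_np):
    """Convert (channels, frames) audio to the (frames, channels) layout."""
    wav_np = np.asarray(wav_np, dtype=np.float32)
    if wav_np.ndim == 2:
        # Assumes 2-D input is channels-first, as in the original fix.
        wav_np = wav_np.T
    return wav_np


def save_wav(save_path, wav_np, sample_rate=48000):
    """Write audio to disk with soundfile instead of torchaudio.save()."""
    # Imported lazily so the layout helper works without soundfile installed.
    import soundfile as sf
    sf.write(save_path, to_soundfile_layout(wav_np), sample_rate)


# Stand-in for `wav.cpu().float().numpy()`: one second of stereo silence.
stereo = np.zeros((2, 48000), dtype=np.float32)
print(to_soundfile_layout(stereo).shape)  # (48000, 2)
```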
+#### 5. 4-bit Quantization Loading
 ```python
 from transformers import BitsAndBytesConfig
 )
 ```
 ## Hardware Tested
+- NVIDIA RTX 5070 Ti (16 GB) with 4-bit quantization
+- ~13 GB VRAM during generation, ~8 GB during encoding
+- Stable for hours of continuous generation
+- Output: 48 kHz stereo audio
 ## Credits
 - Original model by [HeartMuLa Team](https://heartmula.github.io/) (Apache-2.0)
+- Quantization, compatibility fixes & ComfyUI node by [ForgeAI / PavonicAI](https://github.com/PavonicAI)
 ## License