PavonicAI committed (verified)
Commit db88514 · 1 Parent(s): 64aa05e

Update model card: add ComfyUI instructions and detailed fix guide

Files changed (1): README.md

README.md CHANGED
@@ -16,49 +16,139 @@ library_name: transformers
 
 Pre-quantized 4-bit (NF4) checkpoint of [HeartMuLa-oss-3B](https://huggingface.co/HeartMuLa/HeartMuLa-oss-3B) for **16 GB VRAM GPUs** (RTX 4060 Ti, RTX 5070 Ti, etc.).
 
-## Why?
 
-The original HeartMuLa 3B model requires ~15 GB VRAM in bfloat16, which doesn't leave enough room for the HeartCodec decoder on 16 GB cards. This pre-quantized checkpoint loads directly in 4-bit, skipping the expensive on-the-fly quantization step.
 
-## Quick Start
 
 ```python
-from heartlib.heartmula.modeling_heartmula import HeartMuLa
 
 model = HeartMuLa.from_pretrained(
-    "ForgeAI/HeartMuLa-3B-4bit",
     device_map="cuda:0",
     ignore_mismatched_sizes=True,
 )
 ```
 
-## Compatibility Fixes Included
-
-This checkpoint was created with the following environment fixes for modern PyTorch/torchtune/transformers stacks:
-
-| Issue | Fix |
-|---|---|
-| `ignore_mismatched_sizes` error (transformers 5.x) | Added `ignore_mismatched_sizes=True` to all `from_pretrained()` calls |
-| `RoPE cache is not built` (torchtune >= 0.5) | Explicit `rope_init()` + `.to(device)` in `setup_caches()` |
-| OOM at codec decode step | `model.cpu()` + `torch.cuda.empty_cache()` before HeartCodec decode |
-| `torchcodec` missing (torchaudio >= 2.10) | Replaced `torchaudio.save/load` with `soundfile` |
-
 ## Requirements
 
 - `torch >= 2.4` with CUDA
 - `bitsandbytes >= 0.43`
 - `transformers >= 4.57`
 - `torchtune >= 0.4`
-- HeartCodec weights (from original HeartMuLa repo)
 
 ## Hardware Tested
 
 - NVIDIA RTX 5070 Ti (16 GB) — works with 4-bit quantization + CPU offload during codec decode
 
 ## Credits
 
 - Original model by [HeartMuLa Team](https://heartmula.github.io/) (Apache-2.0)
-- Quantization & compatibility fixes by ForgeAI
 
 ## License
 
 
 Pre-quantized 4-bit (NF4) checkpoint of [HeartMuLa-oss-3B](https://huggingface.co/HeartMuLa/HeartMuLa-oss-3B) for **16 GB VRAM GPUs** (RTX 4060 Ti, RTX 5070 Ti, etc.).
 
+## The Problem
 
+The original HeartMuLa 3B model requires ~15 GB of VRAM in bfloat16. Together with HeartCodec (~1.5 GB), that exceeds 16 GB, so the pair cannot run on consumer GPUs such as the RTX 4060 Ti or RTX 5070 Ti.
 
+On top of that, the original code has several compatibility issues with modern PyTorch/transformers/torchtune versions (see the fixes below).
+
+## What This Checkpoint Does
+
+- **4-bit NF4 quantized** HeartMuLa 3B (~4.9 GB instead of ~6 GB)
+- Fits in **16 GB VRAM** together with HeartCodec
+- Works with **PyTorch 2.4+**, **transformers 4.57+/5.x**, **torchtune 0.4+**
+
+## ComfyUI Usage
+
+This checkpoint works with the [HeartMuLa ComfyUI custom nodes](https://github.com/BenjaminBurworworworton/HeartMuLa_ComfyUI), but you need to apply the code fixes listed below to make them work with modern package versions.
+
+### Setup
+
+1. Download this checkpoint into your ComfyUI models folder:
+   ```
+   ComfyUI/models/HeartMuLa/HeartMuLa-4bit-3B/
+   ```
+
+2. Download the HeartCodec and tokenizer weights from the [original HeartMuLa repo](https://huggingface.co/HeartMuLa/HeartMuLa-oss-3B).
+
+3. Install the required packages into ComfyUI's Python environment:
+   ```bash
+   pip install bitsandbytes soundfile
+   ```
+
+## Required Code Fixes
+
+If you're using modern package versions (PyTorch 2.4+, transformers 5.x, torchtune 0.5+), you need these fixes in your heartlib code:
+
+### 1. `ignore_mismatched_sizes` Error (transformers 5.x)
+
+Add `ignore_mismatched_sizes=True` to ALL `from_pretrained()` calls in `music_generation.py` and `lyrics_transcription.py`:
+
+```python
+# In music_generation.py - HeartCodec loading
+HeartCodec.from_pretrained(..., ignore_mismatched_sizes=True)
+
+# In music_generation.py - HeartMuLa loading
+HeartMuLa.from_pretrained(..., ignore_mismatched_sizes=True)
+
+# In lyrics_transcription.py - Whisper loading
+WhisperForConditionalGeneration.from_pretrained(..., ignore_mismatched_sizes=True)
+```
+
+### 2. `RoPE cache is not built` Error (torchtune >= 0.5)
+
+In `modeling_heartmula.py`, add this to the end of the `setup_caches()` method, after the existing cache setup:
 
 ```python
+def setup_caches(self, ...):
+    # ... existing cache setup code ...
+
+    # ADD THIS: initialize RoPE caches (required for torchtune >= 0.5)
+    for m in self.modules():
+        if hasattr(m, 'rope_init'):
+            m.rope_init()
+            m.to(device)
+```
+
+### 3. OOM at Codec Decode (16 GB GPUs)
+
+In `music_generation.py`, offload the model to CPU before running HeartCodec:
+
+```python
+# After generating frames, BEFORE codec decode:
+frames = torch.stack(frames).permute(1, 2, 0).squeeze(0)
+self.model.reset_caches()
+self.model.cpu()          # <-- ADD THIS: free the GPU for HeartCodec
+torch.cuda.empty_cache()  # <-- ADD THIS: release cached allocations
+wav = self.audio_codec.detokenize(frames)
+```
+
+### 4. `torchcodec` Missing (torchaudio >= 2.10)
+
+Replace `torchaudio.save()` and `torchaudio.load()` with `soundfile`:
+
+```python
+# Instead of torchaudio.save():
+import soundfile as sf
+wav_np = wav.cpu().float().numpy()
+if wav_np.ndim == 2:
+    wav_np = wav_np.T  # soundfile expects (samples, channels)
+sf.write(save_path, wav_np, 48000)
+
+# Instead of torchaudio.load():
+audio_data, sample_rate = sf.read(path, dtype='float32')
+waveform = torch.from_numpy(audio_data)
+```
+
+### 5. 4-bit Quantization Loading
+
+When loading this checkpoint, pass the quantization config and keep everything on one GPU with `device_map="cuda:0"`:
+
+```python
+import torch
+from transformers import BitsAndBytesConfig
+
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_compute_dtype=torch.bfloat16,
+    bnb_4bit_quant_type="nf4",
+)
 
 model = HeartMuLa.from_pretrained(
+    "PavonicAI/HeartMuLa-3B-4bit",
+    quantization_config=bnb_config,
     device_map="cuda:0",
     ignore_mismatched_sizes=True,
 )
 ```
 
 ## Requirements
 
 - `torch >= 2.4` with CUDA
 - `bitsandbytes >= 0.43`
 - `transformers >= 4.57`
 - `torchtune >= 0.4`
+- `soundfile`
+- HeartCodec + tokenizer weights from the [original HeartMuLa repo](https://huggingface.co/HeartMuLa/HeartMuLa-oss-3B)
 
 ## Hardware Tested
 
 - NVIDIA RTX 5070 Ti (16 GB) — works with 4-bit quantization + CPU offload during codec decode
+- Output: 48 kHz WAV audio
 
 ## Credits
 
 - Original model by [HeartMuLa Team](https://heartmula.github.io/) (Apache-2.0)
+- Quantization & compatibility fixes by ForgeAI / PavonicAI
 
 ## License