PavonicAI committed (verified)
Commit db88514 · 1 Parent(s): 64aa05e

Update model card: add ComfyUI instructions and detailed fix guide

Files changed (1): README.md

README.md CHANGED
@@ -16,49 +16,139 @@ library_name: transformers
 
 Pre-quantized 4-bit (NF4) checkpoint of [HeartMuLa-oss-3B](https://huggingface.co/HeartMuLa/HeartMuLa-oss-3B) for **16 GB VRAM GPUs** (RTX 4060 Ti, RTX 5070 Ti, etc.).
 
-## Why?
 
-The original HeartMuLa 3B model requires ~15 GB VRAM in bfloat16, which doesn't leave enough room for the HeartCodec decoder on 16 GB cards. This pre-quantized checkpoint loads directly in 4-bit, skipping the expensive on-the-fly quantization step.
 
-## Quick Start
 
 ```python
-from heartlib.heartmula.modeling_heartmula import HeartMuLa
 
 model = HeartMuLa.from_pretrained(
-    "ForgeAI/HeartMuLa-3B-4bit",
     device_map="cuda:0",
     ignore_mismatched_sizes=True,
 )
 ```
 
-## Compatibility Fixes Included
-
-This checkpoint was created with the following environment fixes for modern PyTorch/torchtune/transformers stacks:
-
-| Issue | Fix |
-|---|---|
-| `ignore_mismatched_sizes` error (transformers 5.x) | Added `ignore_mismatched_sizes=True` to all `from_pretrained()` calls |
-| `RoPE cache is not built` (torchtune >= 0.5) | Explicit `rope_init()` + `.to(device)` in `setup_caches()` |
-| OOM at codec decode step | `model.cpu()` + `torch.cuda.empty_cache()` before HeartCodec decode |
-| `torchcodec` missing (torchaudio >= 2.10) | Replaced `torchaudio.save/load` with `soundfile` |
-
 ## Requirements
 
 - `torch >= 2.4` with CUDA
 - `bitsandbytes >= 0.43`
 - `transformers >= 4.57`
 - `torchtune >= 0.4`
-- HeartCodec weights (from original HeartMuLa repo)
 
 ## Hardware Tested
 
 - NVIDIA RTX 5070 Ti (16 GB) — works with 4-bit quantization + CPU offload during codec decode
 
 ## Credits
 
 - Original model by [HeartMuLa Team](https://heartmula.github.io/) (Apache-2.0)
-- Quantization & compatibility fixes by ForgeAI
 
 ## License
 
 
 Pre-quantized 4-bit (NF4) checkpoint of [HeartMuLa-oss-3B](https://huggingface.co/HeartMuLa/HeartMuLa-oss-3B) for **16 GB VRAM GPUs** (RTX 4060 Ti, RTX 5070 Ti, etc.).
 
+## The Problem
 
+The original HeartMuLa 3B model requires ~15 GB of VRAM in bfloat16. Together with HeartCodec (~1.5 GB), that exceeds 16 GB, so the pair cannot run on consumer GPUs such as the RTX 4060 Ti or RTX 5070 Ti.
 
+On top of that, the original code has several compatibility issues with modern PyTorch/transformers/torchtune versions (see the fixes below).
+
+## What This Checkpoint Does
+
+- **4-bit NF4 quantized** HeartMuLa 3B (~4.9 GB instead of ~6 GB)
+- Fits in **16 GB VRAM** together with HeartCodec
+- Works with **PyTorch 2.4+**, **transformers 4.57+/5.x**, **torchtune 0.4+**
+
+## ComfyUI Usage
+
+This checkpoint works with the [HeartMuLa ComfyUI custom nodes](https://github.com/BenjaminBurworworworton/HeartMuLa_ComfyUI), but you need to apply the code fixes listed below to make them work with modern package versions.
+
+### Setup
+
+1. Download this checkpoint into your ComfyUI models folder:
+   ```
+   ComfyUI/models/HeartMuLa/HeartMuLa-4bit-3B/
+   ```
+
+2. Download the HeartCodec and tokenizer weights from the [original HeartMuLa repo](https://huggingface.co/HeartMuLa/HeartMuLa-oss-3B).
+
+3. Install the required packages into ComfyUI's Python environment:
+   ```bash
+   pip install bitsandbytes soundfile
+   ```
+
+## Required Code Fixes
+
+If you're using modern package versions (PyTorch 2.4+, transformers 5.x, torchtune 0.5+), you need these fixes in your heartlib code:
+
+### 1. `ignore_mismatched_sizes` Error (transformers 5.x)
+
+Add `ignore_mismatched_sizes=True` to ALL `from_pretrained()` calls in `music_generation.py` and `lyrics_transcription.py`:
+
+```python
+# In music_generation.py - HeartCodec loading
+HeartCodec.from_pretrained(..., ignore_mismatched_sizes=True)
+
+# In music_generation.py - HeartMuLa loading
+HeartMuLa.from_pretrained(..., ignore_mismatched_sizes=True)
+
+# In lyrics_transcription.py - Whisper loading
+WhisperForConditionalGeneration.from_pretrained(..., ignore_mismatched_sizes=True)
+```
+
+### 2. `RoPE cache is not built` Error (torchtune >= 0.5)
+
+In `modeling_heartmula.py`, add this to the end of the `setup_caches()` method, after the existing cache setup:
 
 ```python
+def setup_caches(self, ...):
+    # ... existing cache setup code ...
+
+    # ADD THIS: initialize RoPE caches (required for torchtune >= 0.5)
+    for m in self.modules():
+        if hasattr(m, 'rope_init'):
+            m.rope_init()
+            m.to(device)
+```
+
+### 3. OOM at Codec Decode (16 GB GPUs)
+
+In `music_generation.py`, offload the model to CPU before running HeartCodec:
+
+```python
+# After generating frames, BEFORE codec decode:
+frames = torch.stack(frames).permute(1, 2, 0).squeeze(0)
+self.model.reset_caches()
+self.model.cpu()          # <-- ADD THIS: free the GPU for HeartCodec
+torch.cuda.empty_cache()  # <-- ADD THIS: release cached allocations
+wav = self.audio_codec.detokenize(frames)
+```
+
+### 4. `torchcodec` Missing (torchaudio >= 2.10)
+
+Replace `torchaudio.save()` and `torchaudio.load()` with `soundfile`:
+
+```python
+# Instead of torchaudio.save():
+import soundfile as sf
+wav_np = wav.cpu().float().numpy()
+if wav_np.ndim == 2:
+    wav_np = wav_np.T  # soundfile expects (samples, channels)
+sf.write(save_path, wav_np, 48000)
+
+# Instead of torchaudio.load():
+audio_data, sample_rate = sf.read(path, dtype='float32')
+waveform = torch.from_numpy(audio_data)
+```
+
+### 5. 4-bit Quantization Loading
+
+When loading this checkpoint, pass the quantization config and keep everything on one GPU with `device_map="cuda:0"`:
+
+```python
+import torch
+from transformers import BitsAndBytesConfig
+
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_compute_dtype=torch.bfloat16,
+    bnb_4bit_quant_type="nf4",
+)
 
 model = HeartMuLa.from_pretrained(
+    "PavonicAI/HeartMuLa-3B-4bit",
+    quantization_config=bnb_config,
     device_map="cuda:0",
     ignore_mismatched_sizes=True,
 )
 ```
 
 ## Requirements
 
 - `torch >= 2.4` with CUDA
 - `bitsandbytes >= 0.43`
 - `transformers >= 4.57`
 - `torchtune >= 0.4`
+- `soundfile`
+- HeartCodec + tokenizer weights from the [original HeartMuLa repo](https://huggingface.co/HeartMuLa/HeartMuLa-oss-3B)
 
 ## Hardware Tested
 
 - NVIDIA RTX 5070 Ti (16 GB) — works with 4-bit quantization + CPU offload during codec decode
+- Output: 48 kHz WAV audio
 
 ## Credits
 
 - Original model by [HeartMuLa Team](https://heartmula.github.io/) (Apache-2.0)
+- Quantization & compatibility fixes by ForgeAI / PavonicAI
 
 ## License