# GitHub Copilot instructions for heartlib

## What this repo is
- HeartMuLa music generation stack: converts lyrics + style tags → audio via two-stage pipeline (HeartMuLa LLM → audio tokens, HeartCodec flow-matching codec → waveform).
- Supports lyrics transcription via Whisper-based HeartTranscriptor.
- Python package with main entry points: `heartlib.HeartMuLaGenPipeline` and `heartlib.HeartTranscriptorPipeline` (see `src/heartlib/__init__.py`).
- Examples/reference CLIs in `examples/`; for production use, install via `pip install -e .`
## Core architecture & data flow

- Music generation: inputs (lyrics, tags) → tokenizer → HeartMuLa (frame-by-frame token generation) → HeartCodec (flow-matching detokenization) → MP3.
- HeartMuLa (LLaMA3.2 backbone, 3B/300M/7B flavors): generates 8 parallel audio codebook streams + 1 prompt guidance stream (9 total, `_parallel_number=9`).
- HeartCodec: VQ codec that reconstructs waveforms from codebook frames in overlapping windows; fixed 48 kHz output.
- HeartTranscriptor: Whisper variant fine-tuned for vocal transcription; works on 30-second chunks, batch size 16.
## Repo map

- `src/heartlib/pipelines/music_generation.py`: orchestrates tokenization → HeartMuLa inference → HeartCodec detokenize → `torchaudio.save()`
  - `HeartMuLaGenPipeline.from_pretrained()`: factory with device/dtype/lazy_load config
  - `_resolve_paths()`: validates checkpoint layout early (hard error if missing)
  - `_resolve_devices()`: handles scalar/dict device specs; forces `lazy_load=False` for multi-device
- `src/heartlib/heartmula/modeling_heartmula.py`: backbone (`llama3_2_3B/7B/300M` factory functions), token generator with CFG support
- `src/heartlib/heartcodec/modeling_heartcodec.py`: VQ codec + flow-matching decoder, detokenizes `(codebooks, time)` frames
- `src/heartlib/pipelines/lyrics_transcription.py`: wraps transformers' Whisper; fixed chunk=30s, batch=16
- `src/heartlib/heartmula/configuration_heartmula.py`, `src/heartlib/heartcodec/configuration_heartcodec.py`: model configs
## Checkpoints & required layout

Directory structure after downloads (see README or hf download commands):

```
./ckpt/
  HeartMuLa-oss-3B/        (or -7B, -300M)
    config.json
    model-*.safetensors
    model.safetensors.index.json
  HeartCodec-oss/
    config.json
    model-*.safetensors
    model.safetensors.index.json
  HeartTranscriptor-oss/
    config.json
    pytorch_model.bin      (or safetensors)
    tokenizer.json
    gen_config.json
```
- `_resolve_paths(pretrained_path, version)` validates all required files; raises FileNotFoundError if any are missing.
- Latest checkpoint: HeartMuLa-RL-oss-3B-20260123 (RL-tuned, recommended for style control).
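The fail-fast validation described above can be sketched as follows. This is an illustrative re-implementation, not the library's actual `_resolve_paths`; the `REQUIRED` file list and directory naming are assumptions based on the layout shown, so check the real function for the authoritative contract.

```python
from pathlib import Path

# Illustrative subset of the required files; the real list also covers
# the codec and transcriptor directories.
REQUIRED = ["config.json", "model.safetensors.index.json"]

def resolve_paths(pretrained_path: str, version: str = "3B") -> Path:
    """Validate the checkpoint layout early and hard-error if files are missing."""
    root = Path(pretrained_path) / f"HeartMuLa-oss-{version}"
    for name in REQUIRED:
        if not (root / name).exists():
            raise FileNotFoundError(root / name)
    return root
```

Mirroring this style in new entry points keeps path errors at startup rather than mid-generation.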
## Generation pipeline behaviors to know

- Inputs: dict with `lyrics`, `tags` (both strings or file paths); auto-lowercased, tags wrapped with `<tag>...</tag>` if missing.
- Tokenization: uses `tokenizers.Tokenizer` (from `tokenizer.json`); token IDs from `HeartMuLaGenConfig` (text_bos_id=128000, text_eos_id=128001, audio_eos_id=8193).
- CFG (classifier-free guidance): if `cfg_scale != 1.0`, the batch is duplicated for an unconditional pass (batch size becomes 2×); enables the style-control tradeoff.
- Audio generation loop: runs at most `max_audio_length_ms // 80` frames (~12.5 Hz generation rate); stops early if any token ≥ audio_eos_id.
- Memory optimization: `lazy_load=True` defers model loading and unloads models after generation (saves CUDA memory between uses).
- `lazy_load=False` is forced if mula_device ≠ codec_device (models on different devices can't swap).
- Uses `torch.autocast` with the specified dtype to reduce memory footprint.
- Output: via `torchaudio.save(save_path, wav, 48000)` at a fixed 48 kHz sample rate.
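The input-normalization and sizing rules above can be captured in a few pure functions. These are illustrative re-implementations of the behaviors described, not the pipeline's actual code (which lives in `src/heartlib/pipelines/music_generation.py`); function names are hypothetical.

```python
def normalize_tags(tags: str) -> str:
    """Lowercase and wrap with <tag>...</tag> if the wrapper is missing."""
    tags = tags.lower().strip()
    if not tags.startswith("<tag>"):
        tags = f"<tag>{tags}</tag>"
    return tags

def max_frames(max_audio_length_ms: int) -> int:
    """One frame per 80 ms of audio, i.e. the ~12.5 Hz generation rate."""
    return max_audio_length_ms // 80

def effective_batch(batch_size: int, cfg_scale: float) -> int:
    """CFG duplicates the batch for the unconditional pass when cfg_scale != 1.0."""
    return batch_size * 2 if cfg_scale != 1.0 else batch_size
```

For example, a 4-minute cap (`max_audio_length_ms=240000`) gives a 3000-frame budget, and `cfg_scale=1.5` doubles the effective batch.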
## Codec specifics

- `HeartCodec.detokenize(frames)` expects shape `(codebooks, time)` where codebooks ≤ 8; pads/repeats to uniform length internally.
- Uses flow-matching inference in overlapping windows (reduces boundary artifacts), then a scalar decoder → PCM waveform.
- Fixed 48 kHz output; non-standard rates must be resampled post-generation.
- Model config (`config.json`) defines the number of codebooks and the codec architecture.
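To make the `(codebooks, time)` contract concrete, here is a minimal sketch of the shape check and padding step, using plain lists in place of tensors. This is an assumption-laden illustration of the described behavior (the real codec pads/repeats tensors internally); `pad_frames` and `pad_id` are hypothetical names.

```python
def pad_frames(frames: list[list[int]], pad_id: int = 0) -> list[list[int]]:
    """Pad codebook streams to a uniform time length so the result
    forms a rectangular (codebooks, time) array with codebooks <= 8."""
    assert 0 < len(frames) <= 8, "HeartCodec expects at most 8 codebooks"
    t = max(len(row) for row in frames)
    return [row + [pad_id] * (t - len(row)) for row in frames]
```

Anything that feeds `detokenize` should preserve this rectangular shape and the codebook count of 8.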
## Transcription pipeline behaviors

- `HeartTranscriptorPipeline.from_pretrained(model_path, device, dtype)` wraps `WhisperForConditionalGeneration` from `HeartTranscriptor-oss`.
- Fixed at 30-second chunks, batch size 16; no dynamic chunking.
- Note: trained on separated vocals; best results with source-separated inputs (use demucs or a similar pre-pipeline step).
- Supports beam search and temperature kwargs via `__call__()` decoding_kwargs (see `examples/run_lyrics_transcription.py`).
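Since chunking is fixed rather than dynamic, the work for a given track is easy to reason about. A small sketch of the arithmetic (helper names are hypothetical, not library API):

```python
import math

def num_chunks(duration_s: float, chunk_s: float = 30.0) -> int:
    """Number of fixed 30-second windows covering the track."""
    return math.ceil(duration_s / chunk_s)

def num_batches(chunks: int, batch_size: int = 16) -> int:
    """Forward passes needed at the fixed batch size of 16."""
    return math.ceil(chunks / batch_size)
```

A 9-minute song (540 s) yields 18 chunks and therefore 2 forward passes.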
## Dev workflows & commands

- Install: `pip install -e .` (Python ≥3.9, 3.10 recommended; CUDA deps: torch 2.4.1, torchaudio 2.4.1, torchtune 0.4.0, bitsandbytes 0.49.0)
- Generate: `python examples/run_music_generation.py --model_path ./ckpt --version 3B --lyrics ./assets/lyrics.txt --tags ./assets/tags.txt`
- Key flags: `--mula_device cuda --codec_device cuda` (or separate devices); `--lazy_load true` (single-GPU VRAM relief); `--cfg_scale 1.5` (style strength)
- Transcribe: `python examples/run_lyrics_transcription.py --model_path ./ckpt --music_path ./assets/output.mp3`
- No test suite; validate changes via the example scripts.
## Coding conventions & critical patterns

- Device specs: `from_pretrained(..., device=X)` accepts a `torch.device` (both models → X) or a dict `{"mula": dev1, "codec": dev2}` (forces `lazy_load=False`).
- Dtype specs mirror device specs: a scalar dtype or a dict with `"mula"`/`"codec"` keys.
- Token/text handling: always lowercase inputs, auto-wrap tags with `<tag>...</tag>`, append BOS/EOS via the tokenizer config (callers depend on this).
- Unimplemented: a reference-audio path exists but raises `NotImplementedError`; don't add a stub without a full end-to-end implementation.
- Generation loop internals: `tqdm` progress and the `torch.autocast` scope; avoid breaking these or the model cache setup (`setup_caches()`).
- Memory patterns: properties `self.mula`/`self.codec` lazy-load on first access if `lazy_load=True`, and can then be unloaded via `_unload_models()`.
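The lazy-load/unload pattern described above follows a standard property idiom, sketched here with stand-in names. This is not the pipeline's actual code; the real properties load checkpoints and also free the CUDA cache on unload.

```python
class LazyPipeline:
    """Minimal sketch of the lazy_load pattern: the model loads on first
    attribute access and can be released to reclaim memory between uses."""

    def __init__(self, lazy_load: bool = True):
        self.lazy_load = lazy_load
        # Eager mode loads immediately; lazy mode defers to first access.
        self._mula = None if lazy_load else self._load_mula()

    def _load_mula(self):
        return object()  # stand-in for the real checkpoint load

    @property
    def mula(self):
        if self._mula is None:
            self._mula = self._load_mula()
        return self._mula

    def _unload_models(self):
        self._mula = None  # real code would also empty the CUDA cache
```

New code should go through the `self.mula`/`self.codec` properties rather than the private fields, so lazy loading keeps working.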
## Quick pointers for agents

- Extend pipelines, not models, unless changing core LLM/codec logic.
- Validate paths early (mirror the `_resolve_paths` style) for new entry points.
- Preserve the 48 kHz sample rate and codebook count (8) in outputs.
- When modifying tokenization or BOS/EOS logic, verify the examples still run end-to-end.
- Device/dtype flexibility is intentional; test multi-GPU configs if changing device dispatch logic.