
GitHub Copilot instructions for heartlib

What this repo is

  • HeartMuLa music generation stack: converts lyrics + style tags → audio via two-stage pipeline (HeartMuLa LLM → audio tokens, HeartCodec flow-matching codec → waveform).
  • Supports lyrics transcription via Whisper-based HeartTranscriptor.
  • Python package with main entry points: heartlib.HeartMuLaGenPipeline and heartlib.HeartTranscriptorPipeline (see src/heartlib/__init__.py).
  • Examples/reference CLIs in examples/; production use: install via pip install -e .

Core architecture & data flow

Music generation: Inputs(lyrics, tags) → Tokenizer → HeartMuLa (frame-by-frame token generation) → HeartCodec (flow-matching detokenization) → MP3

  • HeartMuLa (LLaMA3.2 backbone, 3B/300M/7B flavors): generates 8 parallel audio codebook streams + 1 prompt guidance stream (9 total, _parallel_number=9)
  • HeartCodec: VQ codec that reconstructs waveforms from codebook frames in overlapping windows, fixed 48 kHz output
  • HeartTranscriptor: Whisper variant fine-tuned for vocal transcription; works on 30-second chunks, batch=16
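A minimal sketch of the 9-stream layout described above (8 codebook streams plus 1 guidance stream). The function name and nested-list representation are illustrative, not repo code; the real layout lives in `modeling_heartmula.py`:

```python
# Illustrative only: split HeartMuLa's 9 parallel per-frame tokens into the
# 8 codebook streams the codec consumes plus the 1 prompt-guidance stream.
PARALLEL_NUMBER = 9   # matches _parallel_number=9 in the model
NUM_CODEBOOKS = 8

def split_streams(frames):
    """frames: list of per-frame token lists, each of length PARALLEL_NUMBER.

    Returns (codebook_streams, guidance_stream), where codebook_streams is a
    (NUM_CODEBOOKS, time) nested list and guidance_stream has length `time`.
    """
    for f in frames:
        assert len(f) == PARALLEL_NUMBER, "each frame carries 9 parallel tokens"
    codebooks = [[f[i] for f in frames] for i in range(NUM_CODEBOOKS)]
    guidance = [f[NUM_CODEBOOKS] for f in frames]
    return codebooks, guidance
```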

Repo map

  • src/heartlib/pipelines/music_generation.py: orchestrates tokenization → HeartMuLa inference → HeartCodec detokenize → torchaudio.save()
    • HeartMuLaGenPipeline.from_pretrained(): factory with device/dtype/lazy_load config
    • _resolve_paths(): validates checkpoint layout early (hard error if missing)
    • _resolve_devices(): handles scalar/dict device specs; forces lazy_load=False for multi-device
  • src/heartlib/heartmula/modeling_heartmula.py: backbone (llama3_2_3B/7B/300M factory functions), token generator with CFG support
  • src/heartlib/heartcodec/modeling_heartcodec.py: VQ codec + flow-matching decoder, detokenizes (codebooks, time) frames
  • src/heartlib/pipelines/lyrics_transcription.py: wraps transformers' Whisper; fixed chunk=30s, batch=16
  • src/heartlib/heartmula/configuration_heartmula.py, src/heartlib/heartcodec/configuration_heartcodec.py: model configs

Checkpoints & required layout

Directory structure after downloads (see README or hf download commands):

./ckpt/
  HeartMuLa-oss-3B/  (or -7B, -300M)
    config.json
    model-*.safetensors
    model.safetensors.index.json
  HeartCodec-oss/
    config.json
    model-*.safetensors
    model.safetensors.index.json
  HeartTranscriptor-oss/
    config.json
    pytorch_model.bin (or safetensors)
  tokenizer.json
  gen_config.json

  • _resolve_paths(pretrained_path, version) validates all required files; raises FileNotFoundError if missing
  • Latest checkpoint: HeartMuLa-RL-oss-3B-20260123 (RL-tuned, recommended for style control)
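The early-validation pattern is worth mirroring in new entry points. A hedged sketch (the helper name and required-file list below are illustrative; the authoritative list is in `_resolve_paths()`):

```python
# Sketch of fail-fast checkpoint validation: check the expected layout up
# front and raise FileNotFoundError before any model loading begins.
from pathlib import Path

REQUIRED_SUBDIRS = {
    "HeartMuLa-oss-{version}": ["config.json", "model.safetensors.index.json"],
    "HeartCodec-oss": ["config.json", "model.safetensors.index.json"],
}
ROOT_FILES = ["tokenizer.json", "gen_config.json"]

def resolve_paths(pretrained_path, version="3B"):
    root = Path(pretrained_path)
    missing = [str(root / f) for f in ROOT_FILES if not (root / f).is_file()]
    for subdir_tpl, files in REQUIRED_SUBDIRS.items():
        sub = root / subdir_tpl.format(version=version)
        missing += [str(sub / f) for f in files if not (sub / f).is_file()]
    if missing:
        raise FileNotFoundError(f"missing checkpoint files: {missing}")
    return root
```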

Generation pipeline behaviors to know

  • Inputs: dict with lyrics and tags keys (each a raw string or a path to a text file); inputs are auto-lowercased, and tags are wrapped with <tag>...</tag> if the markers are missing
  • Tokenization: uses tokenizers.Tokenizer (from tokenizer.json); token IDs from HeartMuLaGenConfig (text_bos_id=128000, text_eos_id=128001, audio_eos_id=8193)
  • CFG (classifier-free guidance): if cfg_scale != 1.0, the batch is duplicated for an unconditional pass (batch size doubles); enables the style-control tradeoff
  • Audio generation loop: runs at most max_audio_length_ms // 80 frames (one frame per 80 ms, ~12.5 Hz generation rate); stops early if any token ≥ audio_eos_id
  • Memory optimization:
    • lazy_load=True defers model loading, unloads after generation (saves CUDA between uses)
    • Forced lazy_load=False if mula_device ≠ codec_device (models on different devices cannot be swapped in and out)
    • Uses torch.autocast with specified dtype to reduce memory footprint
  • Output: via torchaudio.save(save_path, wav, 48000) at fixed 48 kHz sample rate
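The input normalization and frame budget above can be sketched as follows (function names are hypothetical; the real logic lives in the pipeline):

```python
# Illustrative: lowercase inputs, wrap tags in <tag>...</tag> when the
# markers are absent, and derive the frame budget from max_audio_length_ms.

def normalize_inputs(lyrics: str, tags: str) -> dict:
    lyrics, tags = lyrics.lower(), tags.lower()
    if not tags.startswith("<tag>"):
        tags = f"<tag>{tags}</tag>"
    return {"lyrics": lyrics, "tags": tags}

def max_frames(max_audio_length_ms: int) -> int:
    # One audio frame every 80 ms -> ~12.5 frames per second of audio.
    return max_audio_length_ms // 80
```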

Codec specifics

  • HeartCodec.detokenize(frames) expects shape (codebooks, time) where codebooks ≤ 8; pads/repeats to uniform length internally
  • Uses flow-matching inference in overlapping windows (reduces boundary artifacts), then scalar decoder → PCM waveform
  • Fixed 48 kHz output; non-standard rates must be resampled post-generation
  • Model config (config.json) defines number of codebooks and codec architecture
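A sketch of the "pads/repeats to uniform length" behavior described above. The exact padding rule inside HeartCodec may differ; this illustrative version repeats the last token of each short codebook stream:

```python
# Illustrative: make the (codebooks, time) input rectangular before
# detokenization by repeating each short stream's last token.

def pad_codebooks(frames, num_codebooks=8):
    """frames: list of up to `num_codebooks` token lists of varying length.

    Returns a rectangular (len(frames), time) nested list.
    """
    assert 0 < len(frames) <= num_codebooks, "codebooks must be <= 8"
    time = max(len(s) for s in frames)
    return [s + [s[-1]] * (time - len(s)) for s in frames]
```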

Transcription pipeline behaviors

  • HeartTranscriptorPipeline.from_pretrained(model_path, device, dtype) wraps WhisperForConditionalGeneration from HeartTranscriptor-oss
  • Fixed at 30-second chunks, batch size 16; no dynamic chunking
  • Note: trained on separated vocals; best results with source-separated inputs (use demucs or similar pre-pipeline)
  • Supports beam search and temperature kwargs via __call__() decoding_kwargs (see examples/run_lyrics_transcription.py)
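The fixed chunking described above can be sketched as simple arithmetic (helper names are illustrative; the real chunking happens inside the transformers Whisper wrapper):

```python
# Illustrative: 30-second windows over an audio file, grouped into
# batches of 16 for inference.
CHUNK_S = 30
BATCH_SIZE = 16

def chunk_spans(duration_s: float):
    """Return (start, end) second offsets covering the whole duration."""
    spans, start = [], 0.0
    while start < duration_s:
        spans.append((start, min(start + CHUNK_S, duration_s)))
        start += CHUNK_S
    return spans

def batches(spans, batch_size=BATCH_SIZE):
    return [spans[i:i + batch_size] for i in range(0, len(spans), batch_size)]
```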

Dev workflows & commands

  • Install: pip install -e . (Python ≥3.9, 3.10 recommended; CUDA deps: torch 2.4.1, torchaudio 2.4.1, torchtune 0.4.0, bitsandbytes 0.49.0)
  • Generate: python examples/run_music_generation.py --model_path ./ckpt --version 3B --lyrics ./assets/lyrics.txt --tags ./assets/tags.txt
    • Key flags: --mula_device cuda --codec_device cuda (or separate); --lazy_load true (single-GPU VRAM relief); --cfg_scale 1.5 (style strength)
  • Transcribe: python examples/run_lyrics_transcription.py --model_path ./ckpt --music_path ./assets/output.mp3
  • No test suite; validate changes via example scripts

Coding conventions & critical patterns

  • Device specs: from_pretrained(..., device=X) accepts torch.device (both models→X) or dict {"mula": dev1, "codec": dev2} (forces lazy_load=False)
  • Dtype specs: mirror device specs; pass a scalar dtype or a dict with "mula" and "codec" keys
  • Token/text handling: always lowercase inputs, auto-wrap tags with <tag>...</tag>, append BOS/EOS via tokenizer config (callers depend on this)
  • Unimplemented: reference audio path exists but raises NotImplementedError; don't add stub without full end-to-end implementation
  • Generation loop internals: tqdm progress and the torch.autocast scope; avoid breaking these or the model cache setup (setup_caches())
  • Memory patterns: properties self.mula / self.codec lazy-load on first access if lazy_load=True, then can unload via _unload_models()
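The scalar-or-dict device convention can be sketched as below (strings stand in for torch.device, and the helper name is hypothetical; the real logic is in `_resolve_devices()`):

```python
# Illustrative: accept a single device (both models) or a
# {"mula": ..., "codec": ...} dict; mixed devices force lazy_load=False
# because models on different devices cannot be swapped in and out.

def resolve_devices(device, lazy_load: bool):
    if isinstance(device, dict):
        mula, codec = device["mula"], device["codec"]
    else:
        mula = codec = device
    if mula != codec:
        lazy_load = False
    return mula, codec, lazy_load
```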

Quick pointers for agents

  • Extend pipelines, not models, unless changing core LLM/codec logic
  • Validate paths early (mirror _resolve_paths style) for new entry points
  • Preserve 48 kHz sample rate and codebook count (8) in outputs
  • When modifying tokenization or BOS/EOS logic, verify examples still run end-to-end
  • Device/dtype flexibility is intentional; test multi-GPU configs if changing device dispatch logic