
GitHub Copilot instructions for heartlib

What this repo is

  • HeartMuLa music generation stack: converts lyrics + style tags → audio via two-stage pipeline (HeartMuLa LLM → audio tokens, HeartCodec flow-matching codec → waveform).
  • Supports lyrics transcription via Whisper-based HeartTranscriptor.
  • Python package with main entry points: heartlib.HeartMuLaGenPipeline and heartlib.HeartTranscriptorPipeline (see src/heartlib/__init__.py).
  • Examples/reference CLIs in examples/; production use: install via pip install -e .

Core architecture & data flow

Music generation: Inputs(lyrics, tags) → Tokenizer → HeartMuLa (frame-by-frame token generation) → HeartCodec (flow-matching detokenization) → MP3

  • HeartMuLa (LLaMA3.2 backbone, 3B/300M/7B flavors): generates 8 parallel audio codebook streams + 1 prompt guidance stream (9 total, _parallel_number=9)
  • HeartCodec: VQ codec that reconstructs waveforms from codebook frames in overlapping windows, fixed 48 kHz output
  • HeartTranscriptor: Whisper variant fine-tuned for vocal transcription; works on 30-second chunks, batch=16
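A minimal sketch of the 9-stream layout described above (8 codebook streams plus 1 guidance stream). The function name and nested-list representation are illustrative, not repo code; the real layout lives in `modeling_heartmula.py`:

```python
# Illustrative only: split HeartMuLa's 9 parallel per-frame tokens into the
# 8 codebook streams the codec consumes plus the 1 prompt-guidance stream.
PARALLEL_NUMBER = 9   # matches _parallel_number=9 in the model
NUM_CODEBOOKS = 8

def split_streams(frames):
    """frames: list of per-frame token lists, each of length PARALLEL_NUMBER.

    Returns (codebook_streams, guidance_stream), where codebook_streams is a
    (NUM_CODEBOOKS, time) nested list and guidance_stream has length `time`.
    """
    for f in frames:
        assert len(f) == PARALLEL_NUMBER, "each frame carries 9 parallel tokens"
    codebooks = [[f[i] for f in frames] for i in range(NUM_CODEBOOKS)]
    guidance = [f[NUM_CODEBOOKS] for f in frames]
    return codebooks, guidance
```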

Repo map

  • src/heartlib/pipelines/music_generation.py: orchestrates tokenization → HeartMuLa inference → HeartCodec detokenize → torchaudio.save()
    • HeartMuLaGenPipeline.from_pretrained(): factory with device/dtype/lazy_load config
    • _resolve_paths(): validates checkpoint layout early (hard error if missing)
    • _resolve_devices(): handles scalar/dict device specs; forces lazy_load=False for multi-device
  • src/heartlib/heartmula/modeling_heartmula.py: backbone (llama3_2_3B/7B/300M factory functions), token generator with CFG support
  • src/heartlib/heartcodec/modeling_heartcodec.py: VQ codec + flow-matching decoder, detokenizes (codebooks, time) frames
  • src/heartlib/pipelines/lyrics_transcription.py: wraps transformers' Whisper; fixed chunk=30s, batch=16
  • src/heartlib/heartmula/configuration_heartmula.py, src/heartlib/heartcodec/configuration_heartcodec.py: model configs

Checkpoints & required layout

Directory structure after downloads (see README or hf download commands):

./ckpt/
  HeartMuLa-oss-3B/  (or -7B, -300M)
    config.json
    model-*.safetensors
    model.safetensors.index.json
  HeartCodec-oss/
    config.json
    model-*.safetensors
    model.safetensors.index.json
  HeartTranscriptor-oss/
    config.json
    pytorch_model.bin (or safetensors)
  tokenizer.json
  gen_config.json

  • _resolve_paths(pretrained_path, version) validates all required files; raises FileNotFoundError if missing
  • Latest checkpoint: HeartMuLa-RL-oss-3B-20260123 (RL-tuned, recommended for style control)
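The early-validation pattern is worth mirroring in new entry points. A hedged sketch (the helper name and required-file list below are illustrative; the authoritative list is in `_resolve_paths()`):

```python
# Sketch of fail-fast checkpoint validation: check the expected layout up
# front and raise FileNotFoundError before any model loading begins.
from pathlib import Path

REQUIRED_SUBDIRS = {
    "HeartMuLa-oss-{version}": ["config.json", "model.safetensors.index.json"],
    "HeartCodec-oss": ["config.json", "model.safetensors.index.json"],
}
ROOT_FILES = ["tokenizer.json", "gen_config.json"]

def resolve_paths(pretrained_path, version="3B"):
    root = Path(pretrained_path)
    missing = [str(root / f) for f in ROOT_FILES if not (root / f).is_file()]
    for subdir_tpl, files in REQUIRED_SUBDIRS.items():
        sub = root / subdir_tpl.format(version=version)
        missing += [str(sub / f) for f in files if not (sub / f).is_file()]
    if missing:
        raise FileNotFoundError(f"missing checkpoint files: {missing}")
    return root
```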

Generation pipeline behaviors to know

  • Inputs: dict with lyrics and tags keys (each a raw string or a path to a text file); inputs are auto-lowercased, and tags are wrapped with <tag>...</tag> if the markers are missing
  • Tokenization: uses tokenizers.Tokenizer (from tokenizer.json); token IDs from HeartMuLaGenConfig (text_bos_id=128000, text_eos_id=128001, audio_eos_id=8193)
  • CFG (classifier-free guidance): if cfg_scale != 1.0, the batch is duplicated for an unconditional pass (batch size doubles); enables the style-control tradeoff
  • Audio generation loop: runs at most max_audio_length_ms // 80 frames (one frame per 80 ms, ~12.5 Hz generation rate); stops early if any token ≥ audio_eos_id
  • Memory optimization:
    • lazy_load=True defers model loading, unloads after generation (saves CUDA between uses)
    • Forced lazy_load=False if mula_device ≠ codec_device (models on different devices cannot be swapped in and out)
    • Uses torch.autocast with specified dtype to reduce memory footprint
  • Output: via torchaudio.save(save_path, wav, 48000) at fixed 48 kHz sample rate
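The input normalization and frame budget above can be sketched as follows (function names are hypothetical; the real logic lives in the pipeline):

```python
# Illustrative: lowercase inputs, wrap tags in <tag>...</tag> when the
# markers are absent, and derive the frame budget from max_audio_length_ms.

def normalize_inputs(lyrics: str, tags: str) -> dict:
    lyrics, tags = lyrics.lower(), tags.lower()
    if not tags.startswith("<tag>"):
        tags = f"<tag>{tags}</tag>"
    return {"lyrics": lyrics, "tags": tags}

def max_frames(max_audio_length_ms: int) -> int:
    # One audio frame every 80 ms -> ~12.5 frames per second of audio.
    return max_audio_length_ms // 80
```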

Codec specifics

  • HeartCodec.detokenize(frames) expects shape (codebooks, time) where codebooks ≤ 8; pads/repeats to uniform length internally
  • Uses flow-matching inference in overlapping windows (reduces boundary artifacts), then scalar decoder → PCM waveform
  • Fixed 48 kHz output; non-standard rates must be resampled post-generation
  • Model config (config.json) defines number of codebooks and codec architecture
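A sketch of the "pads/repeats to uniform length" behavior described above. The exact padding rule inside HeartCodec may differ; this illustrative version repeats the last token of each short codebook stream:

```python
# Illustrative: make the (codebooks, time) input rectangular before
# detokenization by repeating each short stream's last token.

def pad_codebooks(frames, num_codebooks=8):
    """frames: list of up to `num_codebooks` token lists of varying length.

    Returns a rectangular (len(frames), time) nested list.
    """
    assert 0 < len(frames) <= num_codebooks, "codebooks must be <= 8"
    time = max(len(s) for s in frames)
    return [s + [s[-1]] * (time - len(s)) for s in frames]
```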

Transcription pipeline behaviors

  • HeartTranscriptorPipeline.from_pretrained(model_path, device, dtype) wraps WhisperForConditionalGeneration from HeartTranscriptor-oss
  • Fixed at 30-second chunks, batch size 16; no dynamic chunking
  • Note: trained on separated vocals; best results with source-separated inputs (use demucs or similar pre-pipeline)
  • Supports beam search and temperature kwargs via __call__() decoding_kwargs (see examples/run_lyrics_transcription.py)
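The fixed chunking described above can be sketched as simple arithmetic (helper names are illustrative; the real chunking happens inside the transformers Whisper wrapper):

```python
# Illustrative: 30-second windows over an audio file, grouped into
# batches of 16 for inference.
CHUNK_S = 30
BATCH_SIZE = 16

def chunk_spans(duration_s: float):
    """Return (start, end) second offsets covering the whole duration."""
    spans, start = [], 0.0
    while start < duration_s:
        spans.append((start, min(start + CHUNK_S, duration_s)))
        start += CHUNK_S
    return spans

def batches(spans, batch_size=BATCH_SIZE):
    return [spans[i:i + batch_size] for i in range(0, len(spans), batch_size)]
```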

Dev workflows & commands

  • Install: pip install -e . (Python ≥3.9, 3.10 recommended; CUDA deps: torch 2.4.1, torchaudio 2.4.1, torchtune 0.4.0, bitsandbytes 0.49.0)
  • Generate: python examples/run_music_generation.py --model_path ./ckpt --version 3B --lyrics ./assets/lyrics.txt --tags ./assets/tags.txt
    • Key flags: --mula_device cuda --codec_device cuda (or separate); --lazy_load true (single-GPU VRAM relief); --cfg_scale 1.5 (style strength)
  • Transcribe: python examples/run_lyrics_transcription.py --model_path ./ckpt --music_path ./assets/output.mp3
  • No test suite; validate changes via example scripts

Coding conventions & critical patterns

  • Device specs: from_pretrained(..., device=X) accepts torch.device (both models→X) or dict {"mula": dev1, "codec": dev2} (forces lazy_load=False)
  • Dtype specs: mirror device specs; pass a scalar dtype or a dict with "mula" and "codec" keys
  • Token/text handling: always lowercase inputs, auto-wrap tags with <tag>...</tag>, append BOS/EOS via tokenizer config (callers depend on this)
  • Unimplemented: reference audio path exists but raises NotImplementedError; don't add stub without full end-to-end implementation
  • Generation loop internals: tqdm progress and the torch.autocast scope; avoid breaking these or the model cache setup (setup_caches())
  • Memory patterns: properties self.mula / self.codec lazy-load on first access if lazy_load=True, then can unload via _unload_models()
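The scalar-or-dict device convention can be sketched as below (strings stand in for torch.device, and the helper name is hypothetical; the real logic is in `_resolve_devices()`):

```python
# Illustrative: accept a single device (both models) or a
# {"mula": ..., "codec": ...} dict; mixed devices force lazy_load=False
# because models on different devices cannot be swapped in and out.

def resolve_devices(device, lazy_load: bool):
    if isinstance(device, dict):
        mula, codec = device["mula"], device["codec"]
    else:
        mula = codec = device
    if mula != codec:
        lazy_load = False
    return mula, codec, lazy_load
```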

Quick pointers for agents

  • Extend pipelines, not models, unless changing core LLM/codec logic
  • Validate paths early (mirror _resolve_paths style) for new entry points
  • Preserve 48 kHz sample rate and codebook count (8) in outputs
  • When modifying tokenization or BOS/EOS logic, verify examples still run end-to-end
  • Device/dtype flexibility is intentional; test multi-GPU configs if changing device dispatch logic