# MelodyFlow: High Fidelity Text-Guided Music Editing via Single-Stage Flow Matching
AudioCraft provides the code and models for MelodyFlow, High Fidelity Text-Guided Music Editing via Single-Stage Flow Matching.
MelodyFlow is a text-guided music generation and editing model capable of generating high-quality stereo samples conditioned on text descriptions. It is a Flow Matching Diffusion Transformer trained over a 48 kHz stereo (resp. 32 kHz mono) quantizer-free EnCodec tokenizer sampled at 25 Hz (resp. 20 Hz). Unlike prior work on Flow Matching for music generation such as MusicFlow: Cascaded Flow Matching for Text Guided Music Generation, MelodyFlow doesn't require model cascading, which makes it very convenient for music editing.
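As quick arithmetic on the latent rates above (an illustrative calculation, not from the paper), a 30-second clip corresponds to 750 latent frames with the 25 Hz stereo tokenizer, or 600 with the 20 Hz mono one:

```python
def latent_frames(duration_s: float, latent_hz: float) -> int:
    """Number of latent frames the tokenizer produces for a clip of this duration."""
    return int(duration_s * latent_hz)

print(latent_frames(30, 25))  # 48 kHz stereo tokenizer at 25 Hz -> 750 frames
print(latent_frames(30, 20))  # 32 kHz mono tokenizer at 20 Hz -> 600 frames
```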
Check out our [sample page][melodyflow_samples] or test the available demo!
We use 16K hours of licensed music to train MelodyFlow. Specifically, we rely on an internal dataset of 10K high-quality music tracks, and on the ShutterStock and Pond5 music data.
## Local Inference Warning
If you are running MelodyFlow locally, the main failure mode we hit was not prompt quality or solver choice. The real problem was using the wrong code path.
- Successful local text-to-music generation depended on using the official MelodyFlow Space implementation or a maintained fork of it.
- A stale generic AudioCraft checkout can produce structurally valid files that still sound like buzz or hum because the latent contract does not match the released MelodyFlow implementation.
- On PyTorch 2.6 and newer, trusted local checkpoint loads may require `weights_only=False`.
Read the sections below before debugging sampler settings.
## Known Good Local Shape
The local inference path that worked reliably had these pieces:
- a dedicated MelodyFlow Python environment
- a local checkout of the official MelodyFlow Space or a maintained fork
- a local checkpoint directory for `facebook/melodyflow-t24-30secs`
- imports resolved from the Space checkout, not from an older generic AudioCraft clone
- a trusted checkpoint load path that can force `weights_only=False` on PyTorch 2.6+
If any of those pieces differ, treat that as a setup issue first.
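One of the pieces above, "imports resolved from the Space checkout", can be verified programmatically. A minimal sketch, assuming only the standard library (the function name `imported_from` is ours, not part of AudioCraft):

```python
import importlib
from pathlib import Path

def imported_from(module_name: str, expected_root: str) -> bool:
    """Return True if module_name resolves to a file under expected_root.

    Catches the 'stale generic AudioCraft clone' failure mode: the module
    imports fine, but from the wrong checkout on disk.
    """
    module = importlib.import_module(module_name)
    module_file = getattr(module, "__file__", None)
    if module_file is None:  # built-in or namespace package
        return False
    root = Path(expected_root).resolve()
    return root in Path(module_file).resolve().parents

# Demonstrated with a stdlib module; in a real setup, substitute
# "audiocraft" and the path of your MelodyFlow Space checkout:
import sysconfig
print(imported_from("json", sysconfig.get_paths()["stdlib"]))
```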
## Model Card
See the model card.
## Installation
Please follow the AudioCraft installation instructions from the README.
AudioCraft requires a GPU with at least 16 GB of memory for running inference with the medium-sized models (~1.5B parameters).
## Usage
We currently offer two ways to interact with MelodyFlow:
- You can use the gradio demo locally by running `python -m demos.melodyflow_app --share`.
- You can play with MelodyFlow by running the jupyter notebook at `demos/melodyflow_demo.ipynb` locally (also works on CPU).
## Local Fork Maintenance Notes
If you maintain a derived MelodyFlow repo for local development or deployment, the practical git layout is the same as any normal fork workflow:
- Keep the official MelodyFlow Space as `upstream`.
- Keep your writable Hugging Face repo as `origin`.
- Rebase or merge from `upstream/main` on a regular cadence so compatibility fixes do not drift.
Example remote layout:
```shell
git remote rename origin upstream
git remote add origin https://huggingface.co/ericleigh007/MelodyFlow
git fetch upstream
```
## PyTorch 2.6 Checkpoint Compatibility
Some local consumers loading older MelodyFlow checkpoints under PyTorch 2.6 or newer may need to override the new default `torch.load(..., weights_only=True)` behavior.
For trusted local checkpoint files, a compatibility wrapper like the following may be required:

```python
import torch

original_load = torch.load

def trusted_load(*args, **kwargs):
    # Only use this for checkpoints you trust: weights_only=False allows
    # arbitrary pickled objects (e.g. omegaconf configs) to be deserialized.
    kwargs.setdefault("weights_only", False)
    return original_load(*args, **kwargs)

torch.load = trusted_load
```
Without that override, loading can fail with errors involving `omegaconf.dictconfig.DictConfig` or other non-tensor objects serialized by older releases.
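If you prefer not to patch `torch.load` globally, the same override can be scoped to the load call. Below is a dependency-free sketch of that pattern; the `forced_kwarg` context manager and the `FakeTorch` stand-in are illustrative, not part of AudioCraft or PyTorch. In a real setup you would pass the `torch` module and `"load"` instead of the stand-in:

```python
import contextlib

@contextlib.contextmanager
def forced_kwarg(obj, attr, **forced):
    """Temporarily wrap obj.attr so the given keyword arguments default to forced values."""
    original = getattr(obj, attr)

    def wrapper(*args, **kwargs):
        for key, value in forced.items():
            kwargs.setdefault(key, value)
        return original(*args, **kwargs)

    setattr(obj, attr, wrapper)
    try:
        yield
    finally:
        setattr(obj, attr, original)  # restore the unpatched function

# Stand-in for the torch module, so the pattern is runnable without PyTorch.
class FakeTorch:
    @staticmethod
    def load(path, weights_only=True):
        return {"path": path, "weights_only": weights_only}

with forced_kwarg(FakeTorch, "load", weights_only=False):
    print(FakeTorch.load("ckpt.bin"))  # weights_only defaults to False inside the block
```

The context-manager form limits the blast radius: untrusted loads elsewhere in the process keep the safe `weights_only=True` default.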
## Local Regression Checks
If local outputs suddenly regress into buzz, hum, or other clearly invalid audio, check these before tuning solver parameters:
- Confirm you are importing the official `MelodyFlow` class and not reconstructing the model from an older generic AudioCraft checkout.
- Confirm the checkpoint directory still includes both `state_dict.bin` and `compression_state_dict.bin`.
- Confirm the local runner or application is still using the intended fork checkout instead of a stale clone elsewhere on disk.
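The checkpoint-file check above is easy to automate. A minimal sketch, assuming only the standard library (the directory path and function name are placeholders; the two filenames come from the checklist):

```python
from pathlib import Path

# Required files per the checklist above.
REQUIRED_FILES = ("state_dict.bin", "compression_state_dict.bin")

def missing_checkpoint_files(checkpoint_dir: str) -> list[str]:
    """Return the names of required checkpoint files absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]

# Placeholder path; point this at your local checkpoint directory.
missing = missing_checkpoint_files("checkpoints/melodyflow-t24-30secs")
if missing:
    print(f"Checkpoint directory incomplete, missing: {missing}")
```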
## API
We provide a simple API and 1 pre-trained model:
- `facebook/melodyflow-t24-30secs`: 1B model, text to music, generates 30-second samples - [🤗 Hub](https://huggingface.co/facebook/melodyflow-t24-30secs)
Below is a quick example of using the API.
```python
import torchaudio
from audiocraft.models import MelodyFlow
from audiocraft.data.audio import audio_write

model = MelodyFlow.get_pretrained('facebook/melodyflow-t24-30secs')
descriptions = ['disco beat', 'energetic EDM', 'funky groove']
wav = model.generate(descriptions)  # generates 3 samples.

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 dB LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
```
## Training
Coming later...
## Citation
```bibtex
@misc{lan2024high,
  title={High fidelity text-guided music generation and editing via single-stage flow matching},
  author={Le Lan, Gael and Shi, Bowen and Ni, Zhaoheng and Srinivasan, Sidd and Kumar, Anurag and Ellis, Brian and Kant, David and Nagaraja, Varun and Chang, Ernie and Hsu, Wei-Ning and others},
  year={2024},
  eprint={2407.03648},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}
```
## License
See license information in the model card.