# MelodyFlow: High Fidelity Text-Guided Music Editing via Single-Stage Flow Matching
AudioCraft provides the code and models for MelodyFlow, High Fidelity Text-Guided Music Editing via Single-Stage Flow Matching.
MelodyFlow is a text-guided music generation and editing model capable of generating high-quality stereo samples conditioned on text descriptions. It is a Flow Matching Diffusion Transformer trained over a 48 kHz stereo (resp. 32 kHz mono) quantizer-free EnCodec tokenizer sampled at 25 Hz (resp. 20 Hz). Unlike prior work on Flow Matching for music generation such as MusicFlow: Cascaded Flow Matching for Text Guided Music Generation, MelodyFlow doesn't require model cascading, which makes it very convenient for music editing.
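As quick arithmetic on the latent rates above (an illustrative calculation, not from the paper), a 30-second clip corresponds to 750 latent frames with the 25 Hz stereo tokenizer, or 600 with the 20 Hz mono one:

```python
def latent_frames(duration_s: float, latent_hz: float) -> int:
    """Number of latent frames the tokenizer produces for a clip of this duration."""
    return int(duration_s * latent_hz)

print(latent_frames(30, 25))  # 48 kHz stereo tokenizer at 25 Hz -> 750 frames
print(latent_frames(30, 20))  # 32 kHz mono tokenizer at 20 Hz -> 600 frames
```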
Check out our [sample page][melodyflow_samples] or test the available demo!
We use 16K hours of licensed music to train MelodyFlow. Specifically, we rely on an internal dataset of 10K high-quality music tracks, and on the ShutterStock and Pond5 music data.
## Local Inference Warning
If you are running MelodyFlow locally, the main failure mode we hit was not prompt quality or solver choice. The real problem was using the wrong code path.
- Successful local text-to-music generation depended on using the official MelodyFlow Space implementation or a maintained fork of it.
- A stale generic AudioCraft checkout can produce structurally valid files that still sound like buzz or hum because the latent contract does not match the released MelodyFlow implementation.
- On PyTorch 2.6 and newer, trusted local checkpoint loads may require `weights_only=False`.
Read the sections below before debugging sampler settings.
## Known Good Local Shape
The local inference path that worked reliably had these pieces:
- a dedicated MelodyFlow Python environment
- a local checkout of the official MelodyFlow Space or a maintained fork
- a local checkpoint directory for `facebook/melodyflow-t24-30secs`
- imports resolved from the Space checkout, not from an older generic AudioCraft clone
- a trusted checkpoint load path that can force `weights_only=False` on PyTorch 2.6+
If any of those pieces differ, treat that as a setup issue first.
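One of the pieces above, "imports resolved from the Space checkout", can be verified programmatically. A minimal sketch, assuming only the standard library (the function name `imported_from` is ours, not part of AudioCraft):

```python
import importlib
from pathlib import Path

def imported_from(module_name: str, expected_root: str) -> bool:
    """Return True if module_name resolves to a file under expected_root.

    Catches the 'stale generic AudioCraft clone' failure mode: the module
    imports fine, but from the wrong checkout on disk.
    """
    module = importlib.import_module(module_name)
    module_file = getattr(module, "__file__", None)
    if module_file is None:  # built-in or namespace package
        return False
    root = Path(expected_root).resolve()
    return root in Path(module_file).resolve().parents

# Demonstrated with a stdlib module; in a real setup, substitute
# "audiocraft" and the path of your MelodyFlow Space checkout:
import sysconfig
print(imported_from("json", sysconfig.get_paths()["stdlib"]))
```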
## Model Card
See the model card.
## Installation
Please follow the AudioCraft installation instructions from the README.
AudioCraft requires a GPU with at least 16 GB of memory for running inference with the medium-sized models (~1.5B parameters).
## Usage
We currently offer two ways to interact with MelodyFlow:
- You can use the gradio demo locally by running `python -m demos.melodyflow_app --share`.
- You can play with MelodyFlow by running the jupyter notebook at `demos/melodyflow_demo.ipynb` locally (also works on CPU).
## Local Fork Maintenance Notes
If you maintain a derived MelodyFlow repo for local development or deployment, the practical git layout is the same as any normal fork workflow:
- Keep the official MelodyFlow Space as `upstream`.
- Keep your writable Hugging Face repo as `origin`.
- Rebase or merge from `upstream/main` on a regular cadence so compatibility fixes do not drift.
Example remote layout:
```shell
git remote rename origin upstream
git remote add origin https://huggingface.co/ericleigh007/MelodyFlow
git fetch upstream
```
## PyTorch 2.6 Checkpoint Compatibility
Some local consumers loading older MelodyFlow checkpoints under PyTorch 2.6 or newer may need to override the new default `torch.load(..., weights_only=True)` behavior.
For trusted local checkpoint files, a compatibility wrapper like the following may be required:

```python
import torch

original_load = torch.load

def trusted_load(*args, **kwargs):
    # Only use this for checkpoints you trust: weights_only=False allows
    # arbitrary pickled objects (e.g. omegaconf configs) to be deserialized.
    kwargs.setdefault("weights_only", False)
    return original_load(*args, **kwargs)

torch.load = trusted_load
```
Without that override, loading can fail with errors involving `omegaconf.dictconfig.DictConfig` or other non-tensor objects serialized by older releases.
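If you prefer not to patch `torch.load` globally, the same override can be scoped to the load call. Below is a dependency-free sketch of that pattern; the `forced_kwarg` context manager and the `FakeTorch` stand-in are illustrative, not part of AudioCraft or PyTorch. In a real setup you would pass the `torch` module and `"load"` instead of the stand-in:

```python
import contextlib

@contextlib.contextmanager
def forced_kwarg(obj, attr, **forced):
    """Temporarily wrap obj.attr so the given keyword arguments default to forced values."""
    original = getattr(obj, attr)

    def wrapper(*args, **kwargs):
        for key, value in forced.items():
            kwargs.setdefault(key, value)
        return original(*args, **kwargs)

    setattr(obj, attr, wrapper)
    try:
        yield
    finally:
        setattr(obj, attr, original)  # restore the unpatched function

# Stand-in for the torch module, so the pattern is runnable without PyTorch.
class FakeTorch:
    @staticmethod
    def load(path, weights_only=True):
        return {"path": path, "weights_only": weights_only}

with forced_kwarg(FakeTorch, "load", weights_only=False):
    print(FakeTorch.load("ckpt.bin"))  # weights_only defaults to False inside the block
```

The context-manager form limits the blast radius: untrusted loads elsewhere in the process keep the safe `weights_only=True` default.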
## Local Regression Checks
If local outputs suddenly regress into buzz, hum, or other clearly invalid audio, check these before tuning solver parameters:
- Confirm you are importing the official `MelodyFlow` class and not reconstructing the model from an older generic AudioCraft checkout.
- Confirm the checkpoint directory still includes both `state_dict.bin` and `compression_state_dict.bin`.
- Confirm the local runner or application is still using the intended fork checkout instead of a stale clone elsewhere on disk.
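The checkpoint-file check above is easy to automate. A minimal sketch, assuming only the standard library (the directory path and function name are placeholders; the two filenames come from the checklist):

```python
from pathlib import Path

# Required files per the checklist above.
REQUIRED_FILES = ("state_dict.bin", "compression_state_dict.bin")

def missing_checkpoint_files(checkpoint_dir: str) -> list[str]:
    """Return the names of required checkpoint files absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]

# Placeholder path; point this at your local checkpoint directory.
missing = missing_checkpoint_files("checkpoints/melodyflow-t24-30secs")
if missing:
    print(f"Checkpoint directory incomplete, missing: {missing}")
```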
## API
We provide a simple API and 1 pre-trained model:
- `facebook/melodyflow-t24-30secs`: 1B model, text to music, generates 30-second samples - [🤗 Hub](https://huggingface.co/facebook/melodyflow-t24-30secs)
Below is a quick example of using the API.
```python
import torchaudio
from audiocraft.models import MelodyFlow
from audiocraft.data.audio import audio_write

model = MelodyFlow.get_pretrained('facebook/melodyflow-t24-30secs')
descriptions = ['disco beat', 'energetic EDM', 'funky groove']
wav = model.generate(descriptions)  # generates 3 samples.

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 dB LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
```
## Training
Coming later...
## Citation
```bibtex
@misc{lan2024high,
  title={High fidelity text-guided music generation and editing via single-stage flow matching},
  author={Le Lan, Gael and Shi, Bowen and Ni, Zhaoheng and Srinivasan, Sidd and Kumar, Anurag and Ellis, Brian and Kant, David and Nagaraja, Varun and Chang, Ernie and Hsu, Wei-Ning and others},
  year={2024},
  eprint={2407.03648},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}
```
## License
See license information in the model card.