---
library_name: "pytorch"
language:
- en
tags:
- audio
- diffusion
- editing
license: "other"
---

# SAO-Instruct: Free-form Audio Editing using Natural Language Instructions

[Paper](https://www.arxiv.org/abs/2510.22795) | [Sample Page](https://eth-disco.github.io/sao-instruct) | [Code](https://github.com/eth-disco/sao-instruct)

SAO-Instruct is a model based on Stable Audio Open capable of editing audio clips using any free-form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions.
## Inference

To get started, clone the repository and install the dependencies:

```shell
git clone https://github.com/ETH-DISCO/sao-instruct.git
cd sao-instruct
pip install -r model/requirements.txt && pip install model/stable-audio-tools
```

Use the following script to perform inference with SAO-Instruct weights from 🤗 Hugging Face. When `encode_audio` is set to `True`, the provided audio is encoded into the latent space and used as a starting point for generation. You can control the amount of noise added to the encoded audio using the `encoded_audio_noise` parameter. Experiment with different configurations to achieve optimal results.

```python
import torch
from IPython.display import Audio, display

from model.sao_instruct import SAOInstruct

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained SAO-Instruct weights from Hugging Face.
model = SAOInstruct.from_pretrained("disco-eth/sao-instruct").eval().to(device)

audio_path = "path/to/audio.wav"

# Edit the clip according to a free-form natural language instruction.
edited_clips = model.edit_audio(
    instructions=["add a cat meowing"],
    audio_path=audio_path,
    encode_audio=True,      # start generation from the encoded input audio
    cfg_scale=6,
    encoded_audio_noise=4,  # amount of noise added to the encoded audio
)

# Play the original clip followed by the edited versions.
display(Audio(audio_path))
for clip in edited_clips:
    display(Audio(clip, rate=model.sample_rate, normalize=False))
```
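You can also disable the latent initialization and generate from noise instead. A minimal sketch continuing the example above; it assumes that `encoded_audio_noise` is simply ignored when `encode_audio` is `False`:

```python
# Sketch: start from noise instead of the encoded input audio
# (assumes encoded_audio_noise is unused/defaulted in this mode).
edited_clips = model.edit_audio(
    instructions=["add a cat meowing"],
    audio_path=audio_path,
    encode_audio=False,
    cfg_scale=6,
)
```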
## Data Generation

The files required to generate audio editing triplets are in the `dataset/` folder.

### Prompt Generation

The script `generate_prompts.py` can be used for prompt generation. It accepts a `.jsonl` file as input in the following form:

```jsonl
{"caption": "Audio Caption", "metadata": {}}
```

This input `.jsonl` file can be created using the `prepare_captions.py` script for AudioCaps, WavCaps, and AudioSetSL. If you download audio clips from captioning datasets (e.g., if you want to use DDPM inversion for paired sample generation), the `metadata` field can be used to match each caption to its audio filename. The script outputs a `.jsonl` file of processed prompts, each containing the input caption, the edit instruction, and the output caption.
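For captions from other sources, you can build the input file yourself; a minimal sketch following the format above (the `filename` metadata key is illustrative, any fields you need for matching can go in `metadata`):

```python
import json

# Hypothetical captions; in practice these come from a captioning dataset.
captions = [
    {"caption": "a dog barking in the distance", "metadata": {"filename": "dog_001.wav"}},
    {"caption": "rain falling on a tin roof", "metadata": {}},
]

# Write one JSON object per line, matching the input format of generate_prompts.py.
with open("captions.jsonl", "w") as f:
    for entry in captions:
        f.write(json.dumps(entry) + "\n")
```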
### Paired Sample Generation

#### Prompt-to-Prompt

After generating prompts, you can use Prompt-to-Prompt to generate a synthetic dataset of edited audio pairs. The Prompt-to-Prompt pipeline consists of two parts:

- Candidate Search: searching for ideal candidates (CFG scale, seed) for all prompts in the prompt file.
- Sample Generation: generating the edited audio pairs using the candidates found in the previous step.

Use the script `generate_candidates.py` for the candidate search. The script `generate_samples.py` can be used for Prompt-to-Prompt sample generation (use the mode `p2p`).
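A hypothetical two-step invocation; the flag names are placeholders, so check each script's `--help` for the actual interface:

```shell
# 1) Search (CFG scale, seed) candidates for every prompt in the prompt file.
python dataset/generate_candidates.py --prompts prompts.jsonl --out candidates.jsonl

# 2) Generate the edited audio pairs from the selected candidates.
python dataset/generate_samples.py --mode p2p --candidates candidates.jsonl --out samples/
```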
We have included the source code of [Stable Audio Open](https://github.com/Stability-AI/stable-audio-tools) with the adaptations made for Prompt-to-Prompt in `audio_generation/p2p/stable-audio-tools` (particularly in `audio_generation/p2p/stable-audio-tools/models/transformer.py`). You can install its requirements using:

```shell
pip install audio_generation/p2p/stable-audio-tools
```
Make sure that the `k_diffusion` package is configured to use the same starting noise for all samples in a batch, so that both clips of an edited pair share their stochastic noise. Change the function `sample_dpmpp_3m_sde` in the `k_diffusion/sampling.py` file to:

```python
if eta:
    # Draw the noise once, then repeat it across the batch so every
    # sample in the pair receives identical stochastic noise.
    noise = noise_sampler(sigmas[i], sigmas[i + 1])[0].unsqueeze(dim=0)
    noise = noise.repeat(x.shape[0], 1, 1)
    x = x + noise * sigmas[i + 1] * (-2 * h * eta).expm1().neg().sqrt() * s_noise
```
#### DDPM Inversion

The script `generate_samples.py` can be used to create samples using DDPM inversion (use the mode `edit`). We follow the implementation from the paper [Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion](https://github.com/HilaManor/AudioEditingCode/). Clone the repository and install its dependencies using:

```shell
cd audio_generation && git clone https://github.com/HilaManor/AudioEditingCode.git
cd AudioEditingCode && pip install -r requirements.txt
```
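As with Prompt-to-Prompt, a hypothetical invocation with placeholder flags; see the script's `--help` for the real interface:

```shell
# Generate paired samples via DDPM inversion of real audio clips.
python dataset/generate_samples.py --mode edit --prompts prompts.jsonl --out samples/
```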
#### Manual Edits

For generating manual edits, use the script `manual_edits/generate_manual_samples.py`.
## Fine-tuning Stable Audio Open

We provide training and data loading scripts to enable fine-tuning on audio editing triplets:

- `model/stable-audio-tools/train_edit.py` - Modified training script for audio editing tasks
- `model/stable-audio-tools/stable_audio_tools/data/dataset_edit.py` - Custom dataset loader for editing triplets
- `model/stable-audio-tools/stable_audio_tools/configs` - Configuration files for both the model and the dataset

Otherwise, follow the official recommendations from [Stable Audio Open](https://github.com/Stability-AI/stable-audio-tools) to fine-tune the model.
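A minimal launch sketch, assuming `train_edit.py` keeps the upstream `stable-audio-tools` CLI (`--dataset-config`, `--model-config`, `--name`, `--pretrained-ckpt-path`); the config file names and checkpoint path are placeholders:

```shell
cd model/stable-audio-tools
python train_edit.py \
  --dataset-config stable_audio_tools/configs/dataset_config.json \
  --model-config stable_audio_tools/configs/model_config.json \
  --name sao-instruct-finetune \
  --pretrained-ckpt-path /path/to/stable-audio-open.ckpt
```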
## Attribution and License

This repository builds upon **Stable Audio Open**, a model developed by [Stability AI](https://stability.ai). It uses checkpoints and components from [`stabilityai/stable-audio-open-1.0`](https://huggingface.co/stabilityai/stable-audio-open-1.0) that are licensed under the **[Stability AI Community License](./LICENSE-StabilityAI)**. Please see the [NOTICE](./NOTICE) file for required attribution.

**Powered by Stability AI**

This repository and its contents are released for **academic research and non-commercial use only**.