---
library_name: "pytorch"
language:
- en
tags:
- audio
- diffusion
- editing
license: "other"
---
# SAO-Instruct: Free-form Audio Editing using Natural Language Instructions
[Paper](https://www.arxiv.org/abs/2510.22795) | [Sample Page](https://eth-disco.github.io/sao-instruct) | [Code](https://github.com/eth-disco/sao-instruct)
![SAO-Instruct Overview](assets/sao-instruct.png)
SAO-Instruct is a model based on Stable Audio Open that can edit audio clips using free-form natural language instructions. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions.
## Inference
To get started, clone the repository and install the dependencies:
```shell
git clone https://github.com/ETH-DISCO/sao-instruct.git
cd sao-instruct
pip install -r model/requirements.txt && pip install model/stable-audio-tools
```
Use the following script to run inference with the SAO-Instruct weights from 🤗 Hugging Face. When `encode_audio` is set to `True`, the provided audio is encoded into the latent space and used as the starting point for generation. The `encoded_audio_noise` parameter controls the amount of noise added to the encoded audio. Experiment with different values of `cfg_scale` and `encoded_audio_noise` to find what works best for your clips.
```python
import torch
from IPython.display import Audio, display

from model.sao_instruct import SAOInstruct

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained SAO-Instruct weights from Hugging Face
model = SAOInstruct.from_pretrained("disco-eth/sao-instruct").eval().to(device)

audio_path = "path/to/audio.wav"

# Edit the input clip according to the free-form instruction
edited_clips = model.edit_audio(
    instructions=["add a cat meowing"],
    audio_path=audio_path,
    encode_audio=True,       # start generation from the encoded input audio
    cfg_scale=6,
    encoded_audio_noise=4,   # amount of noise added to the encoded audio
)

# Play the original clip, then each edited clip
display(Audio(audio_path))
for clip in edited_clips:
    display(Audio(clip, rate=model.sample_rate, normalize=False))
```
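If you would rather write the edited clips to disk than play them inline, a minimal sketch along these lines should work (assuming each returned clip is an array or tensor of samples at `model.sample_rate`; adjust to the actual return type):
```python
import torch
import torchaudio

for i, clip in enumerate(edited_clips):
    waveform = torch.as_tensor(clip).float().cpu()
    if waveform.dim() == 1:
        # torchaudio.save expects a (channels, samples) tensor
        waveform = waveform.unsqueeze(0)
    torchaudio.save(f"edited_{i}.wav", waveform, model.sample_rate)
```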
## Data Generation
The required files to generate audio editing triplets are in the `dataset/` folder.
### Prompt Generation
The script `generate_prompts.py` can be used for prompt generation. It accepts a `.jsonl` file as input in the following form:
```jsonl
{"caption": "Audio Caption", "metadata": {}}
```
This input `.jsonl` file can be created with the `prepare_captions.py` script for AudioCaps, WavCaps, and AudioSetSL. If you download audio clips from captioning datasets (e.g., to use DDPM inversion for paired sample generation), the `metadata` field can be used to match each clip to its source filename. The script outputs a `.jsonl` file of processed prompts, each containing the input caption, the edit instruction, and the output caption.
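If your captions come from another source, you can also write the input file directly. A minimal sketch (the `filename` key inside `metadata` is an illustrative assumption, not a required schema):
```python
import json

# Each entry needs a "caption"; "metadata" is free-form and may be empty
entries = [
    {"caption": "a dog barking in the distance", "metadata": {"filename": "dog.wav"}},
    {"caption": "rain falling on a tin roof", "metadata": {}},
]

with open("captions.jsonl", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```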
### Paired Sample Generation
#### Prompt-to-Prompt
After generating prompts, you can use Prompt-to-Prompt to generate a synthetic dataset of edited audio pairs.
The Prompt-to-Prompt pipeline consists of two parts:
- Candidate Search: find suitable (CFG scale, seed) pairs for all prompts in the prompt file.
- Sample Generation: generate the edited audio pairs using the candidates found in the previous step.
Use the script `generate_candidates.py` for the candidate search and the script `generate_samples.py` (with mode `p2p`) for Prompt-to-Prompt sample generation.
We have included the source code of [Stable Audio Open](https://github.com/Stability-AI/stable-audio-tools) with the adaptations made for Prompt-to-Prompt in `audio_generation/p2p/stable-audio-tools` (particularly in `audio_generation/p2p/stable-audio-tools/models/transformer.py`).
You can install its requirements using:
```shell
pip install audio_generation/p2p/stable-audio-tools
```
Make sure the `k_diffusion` package uses the same starting noise for every element of the batch, so that both prompts in a Prompt-to-Prompt pair share identical stochastic noise at each sampling step. Update the `if eta:` branch of the `sample_dpmpp_3m_sde` function in `k_diffusion/sampling.py` to:
```python
if eta:
    # Sample noise once (from the first batch element) and repeat it across
    # the batch so every prompt in the pair receives identical noise
    noise = noise_sampler(sigmas[i], sigmas[i + 1])[0].unsqueeze(dim=0)
    noise = noise.repeat(x.shape[0], 1, 1)
    x = x + noise * sigmas[i + 1] * (-2 * h * eta).expm1().neg().sqrt() * s_noise
```
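If you are unsure where the installed copy of `k_diffusion/sampling.py` lives, you can locate it from Python (assuming `k_diffusion` is importable in the active environment):
```python
import inspect

import k_diffusion.sampling

# Print the path of the installed sampling module to patch
print(inspect.getsourcefile(k_diffusion.sampling))
```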
#### DDPM Inversion
The script `generate_samples.py` can be used to create samples using DDPM inversion (use the mode `edit`).
We follow the implementation from the paper [Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion](https://github.com/HilaManor/AudioEditingCode/).
Clone the repository and install its dependencies using:
```shell
cd audio_generation && git clone https://github.com/HilaManor/AudioEditingCode.git
cd AudioEditingCode && pip install -r requirements.txt
```
#### Manual Edits
To generate manual edits, use the script `manual_edits/generate_manual_samples.py`.
## Fine-tuning Stable Audio Open
We provide training and data loading scripts to enable fine-tuning on audio editing triplets:
- `model/stable-audio-tools/train_edit.py` - Modified training script for audio editing tasks
- `model/stable-audio-tools/stable_audio_tools/data/dataset_edit.py` - Custom dataset loader for editing triplets
- `model/stable-audio-tools/stable_audio_tools/configs` - Contains configuration files for both the model and dataset
Beyond these files, follow the official recommendations from [Stable Audio Open](https://github.com/Stability-AI/stable-audio-tools) for fine-tuning the model.
## Attribution and License
This repository builds upon **Stable Audio Open**, a model developed by [Stability AI](https://stability.ai).
It uses checkpoints and components from [`stabilityai/stable-audio-open-1.0`](https://huggingface.co/stabilityai/stable-audio-open-1.0) that are licensed under the **[Stability AI Community License](./LICENSE-StabilityAI)**. Please see the [NOTICE](./NOTICE) file for required attribution.
**Powered by Stability AI**
This repository and its contents are released for **academic research and non-commercial use only**.