---

library_name: "pytorch"
language:
  - en
tags:
  - audio
  - diffusion
  - editing
license: "other"
---


# SAO-Instruct: Free-form Audio Editing using Natural Language Instructions
[Paper](https://www.arxiv.org/abs/2510.22795) | [Sample Page](https://eth-disco.github.io/sao-instruct) | [Code](https://github.com/eth-disco/sao-instruct)

![SAO-Instruct Overview](assets/sao-instruct.png)

SAO-Instruct is a model based on Stable Audio Open capable of editing audio clips using any free-form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions.


## Inference
To get started, clone the repository and install the dependencies:
```shell
git clone https://github.com/ETH-DISCO/sao-instruct.git
# The paths below are relative to the repository root
cd sao-instruct
pip install -r model/requirements.txt && pip install model/stable-audio-tools
```

Use the following script to perform inference with SAO-Instruct weights from 🤗 Hugging Face. When `encode_audio` is set to `True`, the provided audio is encoded into the latent space and used as a starting point for generation. You can control the amount of noise added to the encoded audio using the `encoded_audio_noise` parameter. Experiment with different configurations to achieve optimal results.

```python
import torch
from IPython.display import Audio, display
from model.sao_instruct import SAOInstruct

# Use the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SAOInstruct.from_pretrained("disco-eth/sao-instruct").eval().to(device)

audio_path = "path/to/audio.wav"
edited_clips = model.edit_audio(
    instructions=["add a cat meowing"],
    audio_path=audio_path,
    encode_audio=True,      # start generation from the encoded input audio
    cfg_scale=6,
    encoded_audio_noise=4,  # amount of noise added to the encoded audio
)

# Play the original clip, then each edited clip (in a Jupyter/IPython notebook)
display(Audio(audio_path))
for clip in edited_clips:
    display(Audio(clip, rate=model.sample_rate, normalize=False))
```
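
Because the best `cfg_scale` and `encoded_audio_noise` values may depend on the clip and the instruction, a small sweep can make experimentation easier. The sketch below is a hypothetical helper, not part of the repository: `sweep_edit_configs` and its `edit_fn` argument are illustrative names, and `edit_fn` is assumed to wrap a call to `model.edit_audio` like the one above.

```python
from itertools import product


def sweep_edit_configs(edit_fn, cfg_scales, noise_levels):
    """Call edit_fn once per (cfg_scale, encoded_audio_noise) pair.

    edit_fn is a hypothetical callable that accepts the two keyword
    arguments and returns whatever the underlying edit call returns.
    """
    results = {}
    for cfg_scale, noise in product(cfg_scales, noise_levels):
        results[(cfg_scale, noise)] = edit_fn(
            cfg_scale=cfg_scale, encoded_audio_noise=noise
        )
    return results


# Possible usage with the model loaded above (assumes the same arguments):
# results = sweep_edit_configs(
#     lambda **kw: model.edit_audio(
#         instructions=["add a cat meowing"],
#         audio_path=audio_path,
#         encode_audio=True,
#         **kw,
#     ),
#     cfg_scales=[4, 6, 8],
#     noise_levels=[2, 4, 6],
# )
```

The returned dictionary is keyed by the configuration pair, so the clips for each setting can be listened to and compared side by side.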

## Data Generation
The required files to generate audio editing triplets are in the `dataset/` folder.

### Prompt Generation
The script `generate_prompts.py` can be used for prompt generation. It accepts a `.jsonl` file as input in the following form:
```json
{"caption": "Audio Caption", "metadata": {}}
```
This input `.jsonl` file can be created using the `prepare_captions.py` script for AudioCaps, WavCaps, and AudioSetSL. If you download audio clips from captioning datasets (e.g., if you want to use DDPM inversion for paired sample generation), the `metadata` field can be used to match them to their specific filename. The output of this script is a `.jsonl` file that includes processed prompts, containing the input caption, edit instruction, and output caption.
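
If your captions come from a source that `prepare_captions.py` does not cover, an input file in the format above could be written by hand. The following is a minimal sketch under that assumption: `write_caption_jsonl` and `read_jsonl` are illustrative helpers, not part of the repository, and the `filename` key inside `metadata` is a hypothetical example of how a caption might be matched to its audio file.

```python
import json


def write_caption_jsonl(path, captions, metadata=None):
    """Write one {"caption": ..., "metadata": ...} object per line."""
    metadata = metadata or [{} for _ in captions]
    if len(metadata) != len(captions):
        raise ValueError("captions and metadata must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for caption, meta in zip(captions, metadata):
            f.write(json.dumps({"caption": caption, "metadata": meta}) + "\n")


def read_jsonl(path):
    """Read a .jsonl file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Possible usage (the "filename" metadata key is a hypothetical choice):
# write_caption_jsonl(
#     "captions.jsonl",
#     ["A dog barks in the distance"],
#     [{"filename": "dog_001.wav"}],
# )
```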


### Paired Sample Generation
#### Prompt-to-Prompt
After generating prompts, you can use Prompt-to-Prompt to generate a synthetic dataset of edited audio pairs.
The Prompt-to-Prompt pipeline consists of two parts:
- Candidate Search: Searching for ideal candidates (CFG, seed) for all prompts in the prompt file.
- Sample Generation: Generating the edited audio pairs using the candidates found in the previous step. 
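
Conceptually, the candidate search can be pictured as a grid search over (CFG, seed) pairs. The sketch below illustrates only that idea and is not the logic of `generate_candidates.py`: `search_candidates` is an illustrative name, and `score_fn` is a hypothetical quality measure, since the selection criterion used by the actual script is not described here.

```python
from itertools import product


def search_candidates(score_fn, cfg_scales, seeds, top_k=1):
    """Return the top_k (cfg, seed) pairs ranked by score_fn, highest first.

    score_fn(cfg, seed) is a hypothetical scoring function; higher is better.
    """
    scored = [
        ((cfg, seed), score_fn(cfg, seed))
        for cfg, seed in product(cfg_scales, seeds)
    ]
    # Sort by score in descending order and keep the best pairs
    scored.sort(key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in scored[:top_k]]
```

The selected pairs would then be passed to the sample generation step, which renders the edited audio pairs only for those candidates.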

Use the script `generate_candidates.py` for the candidate search.
The script `generate_samples.py` can be used for Prompt-to-Prompt sample generation (use the mode `p2p`).
We have included the source code of [Stable Audio Open](https://github.com/Stability-AI/stable-audio-tools) with the adaptations made for Prompt-to-Prompt in `audio_generation/p2p/stable-audio-tools` (particularly in `audio_generation/p2p/stable-audio-tools/models/transformer.py`).
You can install its requirements using:
```shell
pip install audio_generation/p2p/stable-audio-tools
```
Make sure that the `k_diffusion` package uses the same starting noise for every item in a batch. In the function `sample_dpmpp_3m_sde` in `k_diffusion/sampling.py`, replace the noise update inside the `if eta:` branch with:
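```python
if eta:
    # Sample noise, keep only the first batch item, and restore the batch dimension
    noise = noise_sampler(sigmas[i], sigmas[i + 1])[0].unsqueeze(dim=0)
    # Repeat that noise so every item in the batch receives the same values
    noise = noise.repeat(x.shape[0], 1, 1)
    x = x + noise * sigmas[i + 1] * (-2 * h * eta).expm1().neg().sqrt() * s_noise
```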
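
The snippet above takes the noise sampled for the first batch item and repeats it across the batch, so all items are perturbed identically at each sampling step. The toy sketch below mirrors that idea with plain Python lists instead of tensors; `share_first_noise` is an illustrative name and not part of `k_diffusion`.

```python
import random


def share_first_noise(noise_batch):
    """Replace every item's noise with a copy of the first item's noise."""
    first = noise_batch[0]
    return [list(first) for _ in noise_batch]


rng = random.Random(0)
# Independent noise per batch item (2 items, 4 values each)
independent = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2)]
# After sharing, both items carry the first item's noise
shared = share_first_noise(independent)
```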

#### DDPM Inversion
The script `generate_samples.py` can be used to create samples using DDPM inversion (use the mode `edit`).
We follow the implementation from the paper [Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion](https://github.com/HilaManor/AudioEditingCode/).
Clone the repository and install its dependencies using:
```shell
cd audio_generation && git clone https://github.com/HilaManor/AudioEditingCode.git
cd AudioEditingCode && pip install -r requirements.txt
```

#### Manual Edits
For generating manual edits, use the script `manual_edits/generate_manual_samples.py`.


## Fine-tuning Stable Audio Open
We provide training and data loading scripts to enable fine-tuning on audio editing triplets:
- `model/stable-audio-tools/train_edit.py` - Modified training script for audio editing tasks
- `model/stable-audio-tools/stable_audio_tools/data/dataset_edit.py` - Custom dataset loader for editing triplets
- `model/stable-audio-tools/stable_audio_tools/configs` - Contains configuration files for both the model and dataset

Otherwise, follow the official recommendations from [Stable Audio Open](https://github.com/Stability-AI/stable-audio-tools) to fine-tune the model.


## Attribution and License
This repository builds upon **Stable Audio Open**, a model developed by [Stability AI](https://stability.ai).  
It uses checkpoints and components from [`stabilityai/stable-audio-open-1.0`](https://huggingface.co/stabilityai/stable-audio-open-1.0) that are licensed under the **[Stability AI Community License](./LICENSE-StabilityAI)**. Please see the [NOTICE](./NOTICE) file for required attribution.  

**Powered by Stability AI**  
This repository and its contents are released for **academic research and non-commercial use only**.