---
library_name: "pytorch"
language:
- en
tags:
- audio
- diffusion
- editing
license: "other"
---
# SAO-Instruct: Free-form Audio Editing using Natural Language Instructions
[Paper](https://www.arxiv.org/abs/2510.22795) | [Sample Page](https://eth-disco.github.io/sao-instruct) | [Code](https://github.com/eth-disco/sao-instruct)
![SAO-Instruct Overview](assets/sao-instruct.png)
SAO-Instruct is a model based on Stable Audio Open that can edit audio clips using free-form natural language instructions. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions.
## Inference
To get started, clone the repository and install the dependencies:
```shell
git clone https://github.com/ETH-DISCO/sao-instruct.git
cd sao-instruct
pip install -r model/requirements.txt && pip install model/stable-audio-tools
```
Use the following script to run inference with the SAO-Instruct weights from 🤗 Hugging Face. When `encode_audio` is set to `True`, the provided audio is encoded into the latent space and used as the starting point for generation. The `encoded_audio_noise` parameter controls the amount of noise added to the encoded audio. Experiment with different values of `cfg_scale` and `encoded_audio_noise` to find what works best for your clips.
```python
import torch
from IPython.display import Audio, display

from model.sao_instruct import SAOInstruct

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained SAO-Instruct weights from Hugging Face
model = SAOInstruct.from_pretrained("disco-eth/sao-instruct").eval().to(device)

audio_path = "path/to/audio.wav"

# Edit the input clip according to the free-form instruction
edited_clips = model.edit_audio(
    instructions=["add a cat meowing"],
    audio_path=audio_path,
    encode_audio=True,       # start generation from the encoded input audio
    cfg_scale=6,
    encoded_audio_noise=4,   # amount of noise added to the encoded audio
)

# Play the original clip, then each edited clip
display(Audio(audio_path))
for clip in edited_clips:
    display(Audio(clip, rate=model.sample_rate, normalize=False))
```
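If you would rather write the edited clips to disk than play them inline, a minimal sketch along these lines should work (assuming each returned clip is an array or tensor of samples at `model.sample_rate`; adjust to the actual return type):
```python
import torch
import torchaudio

for i, clip in enumerate(edited_clips):
    waveform = torch.as_tensor(clip).float().cpu()
    if waveform.dim() == 1:
        # torchaudio.save expects a (channels, samples) tensor
        waveform = waveform.unsqueeze(0)
    torchaudio.save(f"edited_{i}.wav", waveform, model.sample_rate)
```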
## Data Generation
The required files to generate audio editing triplets are in the `dataset/` folder.
### Prompt Generation
The script `generate_prompts.py` can be used for prompt generation. It accepts a `.jsonl` file as input in the following form:
```jsonl
{"caption": "Audio Caption", "metadata": {}}
```
This input `.jsonl` file can be created with the `prepare_captions.py` script for AudioCaps, WavCaps, and AudioSetSL. If you download audio clips from captioning datasets (e.g., to use DDPM inversion for paired sample generation), the `metadata` field can be used to match each clip to its source filename. The script outputs a `.jsonl` file of processed prompts, each containing the input caption, the edit instruction, and the output caption.
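If your captions come from another source, you can also write the input file directly. A minimal sketch (the `filename` key inside `metadata` is an illustrative assumption, not a required schema):
```python
import json

# Each entry needs a "caption"; "metadata" is free-form and may be empty
entries = [
    {"caption": "a dog barking in the distance", "metadata": {"filename": "dog.wav"}},
    {"caption": "rain falling on a tin roof", "metadata": {}},
]

with open("captions.jsonl", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```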
### Paired Sample Generation
#### Prompt-to-Prompt
After generating prompts, you can use Prompt-to-Prompt to generate a synthetic dataset of edited audio pairs.
The Prompt-to-Prompt pipeline consists of two parts:
- Candidate Search: find suitable (CFG scale, seed) pairs for all prompts in the prompt file.
- Sample Generation: generate the edited audio pairs using the candidates found in the previous step.
Use the script `generate_candidates.py` for the candidate search and the script `generate_samples.py` (with mode `p2p`) for Prompt-to-Prompt sample generation.
We have included the source code of [Stable Audio Open](https://github.com/Stability-AI/stable-audio-tools) with the adaptations made for Prompt-to-Prompt in `audio_generation/p2p/stable-audio-tools` (particularly in `audio_generation/p2p/stable-audio-tools/models/transformer.py`).
You can install its requirements using:
```shell
pip install audio_generation/p2p/stable-audio-tools
```
Make sure the `k_diffusion` package uses the same starting noise for every element of the batch, so that both prompts in a Prompt-to-Prompt pair share identical stochastic noise at each sampling step. Update the `if eta:` branch of the `sample_dpmpp_3m_sde` function in `k_diffusion/sampling.py` to:
```python
if eta:
    # Sample noise once (from the first batch element) and repeat it across
    # the batch so every prompt in the pair receives identical noise
    noise = noise_sampler(sigmas[i], sigmas[i + 1])[0].unsqueeze(dim=0)
    noise = noise.repeat(x.shape[0], 1, 1)
    x = x + noise * sigmas[i + 1] * (-2 * h * eta).expm1().neg().sqrt() * s_noise
```
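If you are unsure where the installed copy of `k_diffusion/sampling.py` lives, you can locate it from Python (assuming `k_diffusion` is importable in the active environment):
```python
import inspect

import k_diffusion.sampling

# Print the path of the installed sampling module to patch
print(inspect.getsourcefile(k_diffusion.sampling))
```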
#### DDPM Inversion
The script `generate_samples.py` can be used to create samples using DDPM inversion (use the mode `edit`).
We follow the implementation from the paper [Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion](https://github.com/HilaManor/AudioEditingCode/).
Clone the repository and install its dependencies using:
```shell
cd audio_generation && git clone https://github.com/HilaManor/AudioEditingCode.git
cd AudioEditingCode && pip install -r requirements.txt
```
#### Manual Edits
To generate manual edits, use the script `manual_edits/generate_manual_samples.py`.
## Fine-tuning Stable Audio Open
We provide training and data loading scripts to enable fine-tuning on audio editing triplets:
- `model/stable-audio-tools/train_edit.py` - Modified training script for audio editing tasks
- `model/stable-audio-tools/stable_audio_tools/data/dataset_edit.py` - Custom dataset loader for editing triplets
- `model/stable-audio-tools/stable_audio_tools/configs` - Contains configuration files for both the model and dataset
Beyond these files, follow the official recommendations from [Stable Audio Open](https://github.com/Stability-AI/stable-audio-tools) for fine-tuning the model.
## Attribution and License
This repository builds upon **Stable Audio Open**, a model developed by [Stability AI](https://stability.ai).
It uses checkpoints and components from [`stabilityai/stable-audio-open-1.0`](https://huggingface.co/stabilityai/stable-audio-open-1.0) that are licensed under the **[Stability AI Community License](./LICENSE-StabilityAI)**. Please see the [NOTICE](./NOTICE) file for required attribution.
**Powered by Stability AI**
This repository and its contents are released for **academic research and non-commercial use only**.