---
license: mit
tags:
- audio
- speech-enhancement
- denoising
- generative
pipeline_tag: audio-to-audio
---
# DAC-SE1: High-Fidelity Speech Enhancement via Discrete Audio Tokens
This checkpoint was trained to reflect real-world denoising scenarios. It treats audio restoration as a sequence modeling task: a language model generates clean discrete audio tokens conditioned on the noisy input token sequence, yielding high-fidelity speech enhancement.
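The sequence-modeling framing can be pictured as next-token prediction over a concatenated token stream. The sketch below is purely illustrative: the literal token strings (`<|bos|>`, `<|start_clean|>`, `<|eos|>`) are our stand-ins for the model's special tokens, not names taken from the repository.

```python
# Illustrative layout of a denoising training sequence:
#   [BOS] <noisy tokens> [START_CLEAN] <clean tokens> [EOS]
# At inference, the model is prompted with everything up to and
# including START_CLEAN and generates the clean tokens.
BOS, START_CLEAN, EOS = "<|bos|>", "<|start_clean|>", "<|eos|>"

noisy = ["<|s0_c0|>", "<|s0_c1|>"]   # noisy DAC tokens (prompt)
clean = ["<|s0_c0|>", "<|s0_c1|>"]   # clean DAC tokens (target)

training_sequence = [BOS] + noisy + [START_CLEAN] + clean + [EOS]
print(len(training_sequence))  # 7
```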
## Usage
To use this model, you need the inference tools and tokenizers provided in the official GitHub repository.
### 1. Setup Environment
First, clone the repository to get the necessary helper scripts (`DACTools`, `DACTokenizer`, etc.) and navigate into the folder:
```sh
git clone https://github.com/ETH-DISCO/DAC-SE1.git
cd DAC-SE1
pip install -r requirements.txt
```
### 2. Inference
You can run the following Python script to denoise an audio file.
```python
import re

import soundfile as sf
import torch
from transformers import LlamaForCausalLM, LogitsProcessorList

from inference import DACTools, DACTokenizer, DACConstrainedLogitsProcessor

# Initialize DAC tools for audio encoding/decoding
dac_tools = DACTools()
tokenizer = DACTokenizer(num_layers=9, codebook_size=1024)

# Load the denoiser model
model_path = "disco-eth/DAC-SE1"
model = LlamaForCausalLM.from_pretrained(model_path)
model = model.to('cuda')
model.eval()

# Load noisy audio and convert it to a DAC token string
noisy_tokens = dac_tools.audio_to_tokens('input.wav')

# Prepare the model input: [BOS] + noisy tokens + [START_CLEAN]
token_ids = tokenizer.encode(noisy_tokens, add_special_tokens=False)
input_ids = [tokenizer.bos_token_id] + token_ids + [tokenizer.start_clean_token_id]
input_tensor = torch.tensor([input_ids]).cuda()

# Generate exactly as many clean tokens as there are noisy tokens
num_tokens = len(re.findall(r'<\|s\d+_c\d\|>', noisy_tokens))
logits_processor = LogitsProcessorList([
    DACConstrainedLogitsProcessor(tokenizer=tokenizer, min_tokens=num_tokens)
])
with torch.no_grad():
    outputs = model.generate(
        input_tensor,
        max_new_tokens=num_tokens,
        min_new_tokens=num_tokens,
        logits_processor=logits_processor,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        do_sample=False,
    )

# Extract the generated continuation (everything after the prompt)
generated_ids = outputs[0, len(input_ids):].tolist()
generated_output = tokenizer.decode(generated_ids, skip_special_tokens=True)

# Convert the tokens back to audio, dropping any incomplete 9-layer frame
valid_tokens = re.findall(r'<\|s\d+_c\d\|>', generated_output)
if valid_tokens:
    remainder = len(valid_tokens) % 9
    if remainder != 0:
        valid_tokens = valid_tokens[:len(valid_tokens) - remainder]
    denoised_tokens = "".join(valid_tokens)
    tokens = dac_tools.string_to_tokens(denoised_tokens)
    clean_audio = dac_tools.tokens_to_audio(tokens)

    # Save the denoised audio
    sf.write('output.wav', clean_audio, dac_tools.sample_rate)
```
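The trimming step at the end exists because each audio frame spans one token per codebook layer (9 layers here), so only a whole number of frames can be decoded. The standalone sketch below isolates that logic; the helper name `trim_to_whole_frames` is ours for illustration, not part of the repository.

```python
import re

# Tokens follow the pattern <|s{t}_c{k}|>: timestep t, codebook layer k.
# A decodable string must contain a multiple of num_layers tokens, so we
# drop any trailing partial frame (mirroring the script above).
def trim_to_whole_frames(token_string: str, num_layers: int = 9) -> str:
    tokens = re.findall(r'<\|s\d+_c\d\|>', token_string)
    remainder = len(tokens) % num_layers
    if remainder:
        tokens = tokens[:len(tokens) - remainder]
    return "".join(tokens)

# Example: 10 tokens trim down to 9, i.e. one complete frame.
raw = "".join(f"<|s{i}_c{i % 9}|>" for i in range(10))
trimmed = trim_to_whole_frames(raw)
print(len(re.findall(r'<\|s\d+_c\d\|>', trimmed)))  # 9
```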
## Citation
If you use this model, please cite our paper:
```bibtex
@misc{lanzendörfer2025highfidelityspeechenhancementdiscrete,
title={High-Fidelity Speech Enhancement via Discrete Audio Tokens},
author={Luca A. Lanzendörfer and Frédéric Berdoz and Antonis Asonitis and Roger Wattenhofer},
year={2025},
eprint={2510.02187},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2510.02187},
}
```