---
license: mit
tags:
- audio
- speech-enhancement
- denoising
- generative
pipeline_tag: audio-to-audio
---

# DAC-SE1: High-Fidelity Speech Enhancement via Discrete Audio Tokens

This checkpoint was trained specifically to reflect real-world denoising scenarios. It performs high-fidelity speech enhancement with discrete audio tokens, treating audio restoration as a sequence modeling task: given a noisy input token sequence, the model generates the corresponding clean audio tokens.
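In this formulation, each DAC frame is quantized by several residual codebook layers (the usage code below configures 9), and the codes are serialized into token strings matched by the regex `<\|s\d+_c\d\|>`. The toy sketch below illustrates that flattening; the code values and the layer/code ordering are made up for illustration, not taken from the actual codec:

```python
import re

# Hypothetical DAC frames: 9 codebook entries per frame, made-up code values.
frames = [[17, 301, 5, 998, 42, 7, 256, 64, 900],   # frame 0
          [18, 300, 6, 997, 41, 8, 255, 63, 901]]   # frame 1

# Flatten frames into the serialized <|s{code}_c{layer}|> token string.
serialized = "".join(
    f"<|s{code}_c{layer}|>"
    for frame in frames
    for layer, code in enumerate(frame)
)

# The same regex used at inference time recovers one token per codebook entry.
tokens = re.findall(r"<\|s\d+_c\d\|>", serialized)
print(len(tokens))  # one token per codebook entry: 9 * 2 frames
```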

## Usage

To use this model, you need the inference tools and tokenizers provided in the official GitHub repository.

### 1. Setup Environment

First, clone the repository to get the necessary helper scripts (`DACTools`, `DACTokenizer`, etc.) and navigate into the folder:

```sh
git clone https://github.com/ETH-DISCO/DAC-SE1.git
cd DAC-SE1
pip install -r requirements.txt
```

### 2. Inference

You can run the following Python script to denoise an audio file.
```python
import re

import soundfile as sf
import torch
from transformers import LlamaForCausalLM, LogitsProcessorList

from inference import DACConstrainedLogitsProcessor, DACTokenizer, DACTools

# Initialize DAC tools for audio encoding/decoding
dac_tools = DACTools()
tokenizer = DACTokenizer(num_layers=9, codebook_size=1024)

# Load the denoiser model
model_path = "disco-eth/DAC-SE1"
model = LlamaForCausalLM.from_pretrained(model_path)
model = model.to('cuda')
model.eval()

# Load noisy audio and convert it to a serialized token string
noisy_tokens = dac_tools.audio_to_tokens('input.wav')

# Build the model input: [BOS] + noisy tokens + [START_CLEAN]
token_ids = tokenizer.encode(noisy_tokens, add_special_tokens=False)
input_ids = [tokenizer.bos_token_id] + token_ids + [tokenizer.start_clean_token_id]
input_tensor = torch.tensor([input_ids]).cuda()

# Generate exactly as many clean tokens as there are noisy ones
num_tokens = len(re.findall(r'<\|s\d+_c\d\|>', noisy_tokens))
logits_processor = LogitsProcessorList([
    DACConstrainedLogitsProcessor(tokenizer=tokenizer, min_tokens=num_tokens)
])

with torch.no_grad():
    outputs = model.generate(
        input_tensor,
        max_new_tokens=num_tokens,
        min_new_tokens=num_tokens,
        logits_processor=logits_processor,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        do_sample=False,
    )

# Extract only the newly generated tokens
generated_ids = outputs[0, len(input_ids):].tolist()
generated_output = tokenizer.decode(generated_ids, skip_special_tokens=True)

# Convert tokens back to audio; the tokenizer uses 9 codebook layers per
# frame, so drop any trailing partial frame before decoding
valid_tokens = re.findall(r'<\|s\d+_c\d\|>', generated_output)
if valid_tokens:
    remainder = len(valid_tokens) % 9
    if remainder != 0:
        valid_tokens = valid_tokens[:len(valid_tokens) - remainder]
    denoised_tokens = "".join(valid_tokens)
    tokens = dac_tools.string_to_tokens(denoised_tokens)
    clean_audio = dac_tools.tokens_to_audio(tokens)

    # Save the denoised audio
    sf.write('output.wav', clean_audio, dac_tools.sample_rate)
```
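One detail worth noting in the script above: the generated tokens are truncated to a multiple of 9 before decoding, because the tokenizer uses 9 codebook layers per audio frame and a trailing partial frame cannot be decoded. In isolation, with illustrative token strings, that step looks like this:

```python
import re

# Illustrative generated output: 21 tokens = 2 full frames + 3 leftover tokens.
generated_output = "<|s1_c0|>" * 21

valid_tokens = re.findall(r"<\|s\d+_c\d\|>", generated_output)

# Drop the trailing partial frame so the token count is a whole
# number of 9-layer frames.
remainder = len(valid_tokens) % 9
if remainder != 0:
    valid_tokens = valid_tokens[:len(valid_tokens) - remainder]

print(len(valid_tokens))  # 18 tokens -> 2 complete frames
```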

## Citation

If you use this model, please cite our paper:
```bibtex
@misc{lanzendörfer2025highfidelityspeechenhancementdiscrete,
  title={High-Fidelity Speech Enhancement via Discrete Audio Tokens},
  author={Luca A. Lanzendörfer and Frédéric Berdoz and Antonis Asonitis and Roger Wattenhofer},
  year={2025},
  eprint={2510.02187},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2510.02187},
}
```