---
license: mit
tags:
- audio
- speech-enhancement
- denoising
- generative
pipeline_tag: audio-to-audio
---

# DAC-SE1: High-Fidelity Speech Enhancement via Discrete Audio Tokens

This checkpoint was trained specifically to reflect real-world denoising scenarios. It performs high-fidelity speech enhancement with discrete audio tokens, treating audio restoration as a sequence modeling task: the model generates clean audio tokens from a noisy input token sequence.

## Usage

To use this model, you need the inference tools and tokenizers provided in the official GitHub repository.

### 1. Setup Environment

First, clone the repository to get the necessary helper scripts (`DACTools`, `DACTokenizer`, etc.) and navigate into the folder:

```sh
git clone https://github.com/ETH-DISCO/DAC-SE1.git
cd DAC-SE1
pip install -r requirements.txt
```

### 2. Inference

You can run the following Python script to denoise an audio file.

```python
import re

import torch
from transformers import LlamaForCausalLM, LogitsProcessorList

from inference import DACTools, DACTokenizer, DACConstrainedLogitsProcessor

# Initialize DAC tools for audio encoding/decoding
dac_tools = DACTools()
tokenizer = DACTokenizer(num_layers=9, codebook_size=1024)

# Load denoiser model
model_path = "disco-eth/DAC-SE1"
model = LlamaForCausalLM.from_pretrained(model_path)
model = model.to('cuda')
model.eval()

# Load noisy audio and convert it to a DAC token string
noisy_tokens = dac_tools.audio_to_tokens('input.wav')

# Prepare input for the model: BOS + noisy tokens + start-of-clean marker
token_ids = tokenizer.encode(noisy_tokens, add_special_tokens=False)
input_ids = [tokenizer.bos_token_id] + token_ids + [tokenizer.start_clean_token_id]
input_tensor = torch.tensor([input_ids]).cuda()

# Generate exactly as many clean tokens as there are noisy tokens
num_tokens = len(re.findall(r'<\|s\d+_c\d\|>', noisy_tokens))
logits_processor = LogitsProcessorList([
    DACConstrainedLogitsProcessor(tokenizer=tokenizer, min_tokens=num_tokens)
])

with torch.no_grad():
    outputs = model.generate(
        input_tensor,
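        # Note (explanatory comments, not part of the original script): the
        # arguments below force deterministic, length-matched decoding.
        # DACConstrainedLogitsProcessor restricts decoding to valid DAC tokens,
        # and setting max_new_tokens == min_new_tokens == num_tokens pins the
        # clean output to the same number of codebook tokens as the noisy input
        # (the tokenizer uses 9 codebook layers per audio frame).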
        max_new_tokens=num_tokens,
        min_new_tokens=num_tokens,
        logits_processor=logits_processor,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        do_sample=False,
    )

# Extract the generated token IDs and decode them back to a token string
generated_ids = outputs[0, len(input_ids):].tolist()
generated_output = tokenizer.decode(generated_ids, skip_special_tokens=True)

# Convert tokens back to audio, truncating to a multiple of the 9 codebook layers
valid_tokens = re.findall(r'<\|s\d+_c\d\|>', generated_output)
if valid_tokens:
    remainder = len(valid_tokens) % 9
    if remainder != 0:
        valid_tokens = valid_tokens[:len(valid_tokens) - remainder]

denoised_tokens = "".join(valid_tokens)
tokens = dac_tools.string_to_tokens(denoised_tokens)
clean_audio = dac_tools.tokens_to_audio(tokens)

# Save denoised audio
import soundfile as sf
sf.write('output.wav', clean_audio, dac_tools.sample_rate)
```

## Citation

If you use this model, please cite our paper:

```bibtex
@misc{lanzendörfer2025highfidelityspeechenhancementdiscrete,
      title={High-Fidelity Speech Enhancement via Discrete Audio Tokens},
      author={Luca A. Lanzendörfer and Frédéric Berdoz and Antonis Asonitis and Roger Wattenhofer},
      year={2025},
      eprint={2510.02187},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2510.02187},
}
```