---
license: apache-2.0
pipeline_tag: audio-to-audio
---

# Diffusion Bridge Language Model for Speech Enhancement

**Model Hub:** [Yorch233/DBLM-SE-1B](https://huggingface.co/Yorch233/DBLM-SE-1B)
**Code Repository:** [Yorch233/DBLM-SE](https://github.com/Yorch233/DBLM-SE)

This repository hosts the pretrained checkpoint of DBLM-SE-1B, a 1-billion-parameter Diffusion Bridge Language Model (DBLM) for speech enhancement. The model performs joint denoising and dereverberation by modeling the restoration process as a diffusion bridge in the latent space, using an LLM-style architecture to map corrupted speech to clean acoustic tokens.

> 🔴 **Note:** This is a weights-only release. Full training and inference code is available in the official [GitHub repository](https://github.com/Yorch233/DBLM-SE).
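The checkpoint files themselves can be fetched with `huggingface_hub`. A minimal sketch (the `fetch_checkpoint` helper is hypothetical; actual model loading and inference require the code from the GitHub repository):

```python
from huggingface_hub import snapshot_download


def fetch_checkpoint(repo_id: str = "Yorch233/DBLM-SE-1B") -> str:
    """Download all checkpoint files and return the local cache directory.

    Hypothetical helper: loading and inference live in the official repo.
    """
    return snapshot_download(repo_id)


# local_dir = fetch_checkpoint()  # requires network access
```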

## Model Overview

DBLM-SE-1B enhances noisy and reverberant speech through a hybrid latent-space pipeline:

1. **WavLM Encoder**
   → Extracts continuous latent tokens from the input audio.

2. **Diffusion Bridge Language Model (DBLM)**
   → Iteratively denoises and dereverberates the latents via backward SDE/ODE sampling.
   → Leverages causal attention and diffusion conditioning for temporal coherence.

3. **Latent-to-Discrete Translator**
   → Predicts XCodec2 discrete acoustic token IDs from the clean continuous latents.

4. **XCodec2 Decoder**
   → Reconstructs high-fidelity, clean speech from the discrete codes.

This repository contains the full DBLM model for latent restoration and acoustic token generation.
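As an illustration of how the four stages fit together, here is a shape-level sketch with stand-in functions. Every function body is a placeholder (the real implementations are in the official repository); the 320-sample hop assumes 16 kHz audio with WavLM-Large framing, and the token output is shown as a single stream for simplicity even though the real output is multi-codebook.

```python
import numpy as np

# Stand-in sketch of the four-stage pipeline. Only the tensor shapes
# flowing between stages are meant to be informative.

def wavlm_encode(wav: np.ndarray) -> np.ndarray:
    """Stage 1 stand-in: waveform -> (T, 1024) continuous latents."""
    T = len(wav) // 320  # assumes 16 kHz audio, 320-sample frame hop
    return np.zeros((T, 1024), dtype=np.float32)

def dblm_restore(latents: np.ndarray, steps: int = 10) -> np.ndarray:
    """Stage 2 stand-in: backward SDE/ODE sampling over the bridge."""
    x = latents.copy()
    for _ in range(steps):
        x = x  # placeholder for one reverse-bridge denoising step
    return x

def translate_to_codes(latents: np.ndarray) -> np.ndarray:
    """Stage 3 stand-in: clean latents -> discrete acoustic token IDs
    (single token stream here; the real output is multi-codebook)."""
    return np.zeros(latents.shape[0], dtype=np.int64)

def xcodec2_decode(codes: np.ndarray) -> np.ndarray:
    """Stage 4 stand-in: token IDs -> enhanced waveform."""
    return np.zeros(codes.shape[0] * 320, dtype=np.float32)

noisy = np.random.randn(16000).astype(np.float32)  # 1 s of 16 kHz audio
latents = wavlm_encode(noisy)            # (50, 1024)
clean_latents = dblm_restore(latents)    # same shape, restored
codes = translate_to_codes(clean_latents)
enhanced = xcodec2_decode(codes)         # back to 16000 samples
```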

## Key Capabilities

βœ… Speech Denoising – Removes background noise (e.g., babble, traffic, machinery)
βœ… Dereverberation – Suppresses room echoes for improved clarity and ASR performance
βœ… Latent Diffusion Bridge – Stable, iterative restoration in compressed space
βœ… XCodec2-Compatible Output – Directly generates discrete codes for neural decoding

## Checkpoint Details

  • Model name: DBLM-SE-1B
  • Architecture: LLaMA
  • Parameters: ~1.0B
  • Input: WavLM-Large continuous latents (T Γ— 1024)
  • Output: XCodec2 discrete acoustic token IDs (multi-codebook)

## Training Data

The training data for DBLM-SE-1B is constructed following the Microsoft DNS Challenge methodology, ensuring realistic and diverse noisy-reverberant conditions for robust speech enhancement.

  • Clean Speech: LibriVox, VCTK
  • Noise: AudioSet, Freesound, DEMAND
  • Room Impulse Responses (RIRs): OpenSLR26, OpenSLR28

To improve generalization and prevent overfitting, noisy and reverberant training samples are generated on the fly during training with the following augmentation strategy:

  • 90% probability of adding background noise (SNR: [-5, 20] dB)
  • 50% probability of applying reverberation via RIR convolution

This dynamic mixing pipeline enables the model to learn joint denoising and dereverberation in a realistic and data-efficient manner.
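A minimal sketch of one such dynamic mixing step. The function names, the reverb-before-noise order, and the power-based SNR scaling are assumptions of this sketch, not the repository's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so the mixture hits the target SNR, then add it."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

def augment(clean: np.ndarray, noise: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """One on-the-fly corruption step following the probabilities above."""
    x = clean
    if rng.random() < 0.5:   # 50%: reverberation via RIR convolution
        x = np.convolve(x, rir)[: len(clean)]
    if rng.random() < 0.9:   # 90%: background noise at an SNR in [-5, 20] dB
        snr_db = rng.uniform(-5.0, 20.0)
        x = mix_at_snr(x, noise[: len(x)], snr_db)
    return x
```

Because corruption happens per sample at load time, each epoch sees a fresh noisy-reverberant version of the same clean utterance, which is what makes the mixing data-efficient.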


📦 This is a checkpoint-only release.

🤗 For code, demos, and documentation, visit: https://github.com/Yorch233/DBLM-SE