Diffusion Bridge Language Model for Speech Enhancement
Model Hub: Yorch233/DBLM-SE-1B
Code Repository: GitHub - Yorch233/DBLM-SE
This repository hosts the pretrained checkpoint of DBLM-SE-1B, a 1-billion-parameter Diffusion Bridge Language Model (DBLM) for speech enhancement. The model performs joint denoising and dereverberation by modeling the restoration process as a diffusion bridge in the latent space, using an LLM-style architecture to map corrupted speech to clean acoustic tokens.
Note: This is a model weights-only release. Full training and inference code is available in the official GitHub repository.
Model Overview
DBLM-SE-1B enhances noisy and reverberant speech through a hybrid latent-space pipeline:
1. WavLM Encoder: extracts continuous latent tokens from the input audio.
2. Diffusion Bridge Language Model (DBLM): iteratively denoises and dereverberates the latents via backward SDE/ODE sampling, leveraging causal attention and diffusion conditioning for temporal coherence.
3. Latent-to-Discrete Translator: predicts XCodec2 discrete acoustic token IDs from the clean continuous latents.
4. XCodec2 Decoder: reconstructs high-fidelity, clean speech from the discrete codes.
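The backward-sampling idea behind stage 2 can be illustrated on a toy linear bridge. This is only a sketch: the real DBLM defines its own drift, noise schedule, and a learned denoiser, whereas here `predict_x0` is a stand-in that returns the known clean latent so the loop visibly converges.

```python
import numpy as np

# Toy backward ODE sampling along a linear diffusion bridge pinned at the
# corrupted latent y (t = 1) and the clean latent x0 (t = 0). Illustrative
# only; the actual DBLM drift/schedule lives in the official repo.

rng = np.random.default_rng(0)
x0_true = rng.standard_normal(1024)              # "clean" latent (unknown at inference)
y = x0_true + 0.5 * rng.standard_normal(1024)    # corrupted latent at t = 1

def predict_x0(x_t, t):
    # Stand-in for the DBLM denoiser: we cheat and return the ground truth
    # so the toy loop converges; the real model predicts this from x_t.
    return x0_true

n_steps = 20
x = y.copy()
for i in range(n_steps):
    t = 1.0 - i / n_steps            # current bridge time, from 1 down to 1/n
    dt = 1.0 / n_steps
    x0_hat = predict_x0(x, t)
    # Backward Euler step for the linear bridge: move a fraction dt/t toward x0_hat
    x = x + (dt / t) * (x0_hat - x)

print(float(np.abs(x - x0_true).max()))  # ≈ 0: the bridge ends pinned at the clean latent
```

The last step has `dt / t == 1`, so the trajectory lands exactly on the predicted clean endpoint; with a learned (imperfect) denoiser, more steps trade compute for restoration quality.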
This repository contains the full DBLM model for latent restoration and acoustic token generation.
Key Capabilities
- Speech Denoising: removes background noise (e.g., babble, traffic, machinery)
- Dereverberation: suppresses room echoes for improved clarity and ASR performance
- Latent Diffusion Bridge: stable, iterative restoration in a compressed latent space
- XCodec2-Compatible Output: directly generates discrete codes for neural decoding
Checkpoint Details
- Model name: DBLM-SE-1B
- Architecture: LLaMA
- Parameters: ~1.0B
- Input: WavLM-Large continuous latents (T × 1024)
- Output: XCodec2 discrete acoustic token IDs (multi-codebook)
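The tensor contract above can be sketched with a stub in place of the real network. Everything here is hypothetical except the shapes listed in the card: `stub_dblm` stands in for the ~1B-parameter model, the codebook size and latent frame rate are assumptions, and a single code per frame is emitted for simplicity (the release lists multi-codebook output).

```python
import numpy as np

CODEBOOK_SIZE = 65536  # assumption: the actual XCodec2 vocabulary may differ

def stub_dblm(latents: np.ndarray) -> np.ndarray:
    # Input: WavLM-Large continuous latents, shape (T, 1024)
    # Output: one XCodec2 acoustic token id per frame, shape (T,)
    assert latents.ndim == 2 and latents.shape[1] == 1024
    rng = np.random.default_rng(0)
    return rng.integers(0, CODEBOOK_SIZE, size=latents.shape[0])

T = 250  # e.g. ~5 s of audio at an assumed 50 latent frames per second
tokens = stub_dblm(np.zeros((T, 1024), dtype=np.float32))
print(tokens.shape)  # (250,)
```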
The training data for DBLM-SE-1B is constructed following the DNS Challenge methodology (Microsoft DNS Challenge), ensuring realistic and diverse noisy-reverberant conditions for robust speech enhancement.
- Clean Speech: LibriVox, VCTK
- Noise: AudioSet, Freesound, DEMAND
- Room Impulse Responses (RIRs): OpenSLR26, OpenSLR28
To improve generalization and prevent overfitting, noisy and reverberant training samples are generated on the fly during training with the following augmentation strategy:
- 90% probability of adding background noise (SNR: [-5, 20] dB)
- 50% probability of applying reverberation via RIR convolution
This dynamic mixing pipeline enables the model to learn joint denoising and dereverberation in a realistic and data-efficient manner.
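A minimal sketch of this on-the-fly mixing, under stated assumptions (the repo's actual pipeline may differ in trimming, normalization, and RIR handling): 90% of samples receive additive noise at a random SNR in [-5, 20] dB, and 50% are convolved with a random room impulse response.

```python
import numpy as np

def mix_sample(clean, noise, rir, rng):
    # Dynamic mixing: independently apply reverb (50%) and noise (90%).
    x = clean.copy()
    if rng.random() < 0.5:                      # 50%: reverberation
        x = np.convolve(x, rir)[: len(x)]       # RIR convolution, trimmed to input length
    if rng.random() < 0.9:                      # 90%: background noise
        snr_db = rng.uniform(-5.0, 20.0)
        p_sig = np.mean(x ** 2)
        p_noise = np.mean(noise ** 2)
        # Scale noise so that 10*log10(p_sig / p_scaled_noise) == snr_db
        scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10.0)))
        x = x + scale * noise[: len(x)]
    return x

rng = np.random.default_rng(42)
clean = rng.standard_normal(16000)   # stand-in for 1 s of clean speech at 16 kHz
noise = rng.standard_normal(16000)   # stand-in for a noise clip
rir = np.exp(-np.arange(800) / 100.0) * rng.standard_normal(800)  # synthetic decaying RIR
noisy = mix_sample(clean, noise, rir, rng)
print(noisy.shape)  # (16000,)
```

Because the corruption is re-sampled every step, the model rarely sees the same noisy mixture twice, which is what makes the scheme data-efficient.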
This is a checkpoint-only release.
For code, demos, and documentation, see:
https://github.com/Yorch233/DBLM-SE