---
license: apache-2.0
pipeline_tag: audio-to-audio
---

# Diffusion Bridge Language Model for Speech Enhancement

**Model Hub:** [Yorch233/DBLM-SE-1B](https://huggingface.co/Yorch233/DBLM-SE-1B)
**Code Repository:** [Yorch233/DBLM-SE](https://github.com/Yorch233/DBLM-SE)

This repository hosts the pretrained checkpoint of DBLM-SE-1B, a 1-billion-parameter Diffusion Bridge Language Model (DBLM) for speech enhancement. The model performs joint denoising and dereverberation by modeling the restoration process as a diffusion bridge in the latent space, using an LLM-style architecture to map corrupted speech to clean acoustic tokens.

> 🔴 **Note:** This is a weights-only release. Full training and inference code is available in the official [GitHub repository](https://github.com/Yorch233/DBLM-SE).
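The checkpoint files themselves can be fetched with `huggingface_hub`. A minimal sketch (the `fetch_checkpoint` helper is hypothetical; actual model loading and inference require the code from the GitHub repository):

```python
from huggingface_hub import snapshot_download


def fetch_checkpoint(repo_id: str = "Yorch233/DBLM-SE-1B") -> str:
    """Download all checkpoint files and return the local cache directory.

    Hypothetical helper: loading and inference live in the official repo.
    """
    return snapshot_download(repo_id)


# local_dir = fetch_checkpoint()  # requires network access
```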

## Model Overview

DBLM-SE-1B enhances noisy and reverberant speech through a hybrid latent-space pipeline:

1. **WavLM Encoder**
   → Extracts continuous latent tokens from the input audio.

2. **Diffusion Bridge Language Model (DBLM)**
   → Iteratively denoises and dereverberates the latents via backward SDE/ODE sampling.
   → Leverages causal attention and diffusion conditioning for temporal coherence.

3. **Latent-to-Discrete Translator**
   → Predicts XCodec2 discrete acoustic token IDs from the clean continuous latents.

4. **XCodec2 Decoder**
   → Reconstructs high-fidelity, clean speech from the discrete codes.

This repository contains the full DBLM model for latent restoration and acoustic token generation.
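As an illustration of how the four stages fit together, here is a shape-level sketch with stand-in functions. Every function body is a placeholder (the real implementations are in the official repository); the 320-sample hop assumes 16 kHz audio with WavLM-Large framing, and the token output is shown as a single stream for simplicity even though the real output is multi-codebook.

```python
import numpy as np

# Stand-in sketch of the four-stage pipeline. Only the tensor shapes
# flowing between stages are meant to be informative.

def wavlm_encode(wav: np.ndarray) -> np.ndarray:
    """Stage 1 stand-in: waveform -> (T, 1024) continuous latents."""
    T = len(wav) // 320  # assumes 16 kHz audio, 320-sample frame hop
    return np.zeros((T, 1024), dtype=np.float32)

def dblm_restore(latents: np.ndarray, steps: int = 10) -> np.ndarray:
    """Stage 2 stand-in: backward SDE/ODE sampling over the bridge."""
    x = latents.copy()
    for _ in range(steps):
        x = x  # placeholder for one reverse-bridge denoising step
    return x

def translate_to_codes(latents: np.ndarray) -> np.ndarray:
    """Stage 3 stand-in: clean latents -> discrete acoustic token IDs
    (single token stream here; the real output is multi-codebook)."""
    return np.zeros(latents.shape[0], dtype=np.int64)

def xcodec2_decode(codes: np.ndarray) -> np.ndarray:
    """Stage 4 stand-in: token IDs -> enhanced waveform."""
    return np.zeros(codes.shape[0] * 320, dtype=np.float32)

noisy = np.random.randn(16000).astype(np.float32)  # 1 s of 16 kHz audio
latents = wavlm_encode(noisy)            # (50, 1024)
clean_latents = dblm_restore(latents)    # same shape, restored
codes = translate_to_codes(clean_latents)
enhanced = xcodec2_decode(codes)         # back to 16000 samples
```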

## Key Capabilities

βœ… Speech Denoising – Removes background noise (e.g., babble, traffic, machinery)
βœ… Dereverberation – Suppresses room echoes for improved clarity and ASR performance
βœ… Latent Diffusion Bridge – Stable, iterative restoration in compressed space
βœ… XCodec2-Compatible Output – Directly generates discrete codes for neural decoding

## Checkpoint Details

  • Model name: DBLM-SE-1B
  • Architecture: LLaMA
  • Parameters: ~1.0B
  • Input: WavLM-Large continuous latents (T Γ— 1024)
  • Output: XCodec2 discrete acoustic token IDs (multi-codebook)

## Training Data

The training data for DBLM-SE-1B is constructed following the Microsoft DNS Challenge methodology, ensuring realistic and diverse noisy-reverberant conditions for robust speech enhancement.

  • Clean Speech: LibriVox, VCTK
  • Noise: AudioSet, Freesound, DEMAND
  • Room Impulse Responses (RIRs): OpenSLR26, OpenSLR28

To improve generalization and prevent overfitting, noisy and reverberant training samples are generated on the fly during training with the following augmentation strategy:

  • 90% probability of adding background noise (SNR: [-5, 20] dB)
  • 50% probability of applying reverberation via RIR convolution

This dynamic mixing pipeline enables the model to learn joint denoising and dereverberation in a realistic and data-efficient manner.
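A minimal sketch of one such dynamic mixing step. The function names, the reverb-before-noise order, and the power-based SNR scaling are assumptions of this sketch, not the repository's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so the mixture hits the target SNR, then add it."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

def augment(clean: np.ndarray, noise: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """One on-the-fly corruption step following the probabilities above."""
    x = clean
    if rng.random() < 0.5:   # 50%: reverberation via RIR convolution
        x = np.convolve(x, rir)[: len(clean)]
    if rng.random() < 0.9:   # 90%: background noise at an SNR in [-5, 20] dB
        snr_db = rng.uniform(-5.0, 20.0)
        x = mix_at_snr(x, noise[: len(x)], snr_db)
    return x
```

Because corruption happens per sample at load time, each epoch sees a fresh noisy-reverberant version of the same clean utterance, which is what makes the mixing data-efficient.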


📦 This is a checkpoint-only release.

🤗 For code, demos, and documentation, visit: https://github.com/Yorch233/DBLM-SE