Diffusion Bridge Language Model for Speech Enhancement

Model Hub: Yorch233/DBLM-SE-1B
Code Repository: GitHub - Yorch233/DBLM-SE

This repository hosts the pretrained checkpoint of DBLM-SE-1B, a 1-billion-parameter Diffusion Bridge Language Model (DBLM) for speech enhancement. The model performs joint denoising and dereverberation by modeling the restoration process as a diffusion bridge in the latent space, using an LLM-style architecture to map corrupted speech to clean acoustic tokens.

🔴 Note: This is a weights-only release. Full training and inference code is available in the official GitHub repository.

Model Overview

DBLM-SE-1B enhances noisy and reverberant speech through a hybrid latent-space pipeline:

  1. WavLM Encoder
    → Extracts continuous latent tokens from input audio.

  2. Diffusion Bridge Language Model (DBLM)
    → Iteratively denoises and dereverberates latents via backward SDE/ODE sampling.
    → Leverages causal attention and diffusion conditioning for temporal coherence.

  3. Latent-to-Discrete Translator
    → Predicts XCodec2 discrete acoustic token IDs from clean continuous latents.

  4. XCodec2 Decoder
    → Reconstructs high-fidelity, clean speech from discrete codes.

This repository contains the full DBLM model for latent restoration and acoustic token generation.
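The four stages above can be sketched with stub functions (a minimal illustration, not the repository's actual API: the function names, the 320-sample hop, the codebook size, and the one-token-per-frame simplification are all assumptions; only the T × 1024 latent shape follows the card):

```python
import numpy as np

LATENT_DIM = 1024      # WavLM-Large latent width (per the card)
CODEBOOK_SIZE = 65536  # assumed codebook size, for illustration only

def wavlm_encode(waveform: np.ndarray, hop: int = 320) -> np.ndarray:
    """Stub: map a 16 kHz waveform to (T, 1024) continuous latents."""
    n_frames = len(waveform) // hop
    rng = np.random.default_rng(0)
    return rng.standard_normal((n_frames, LATENT_DIM)).astype(np.float32)

def dblm_restore(latents: np.ndarray, steps: int = 8) -> np.ndarray:
    """Stub: iterative latent restoration; identity here, where the real
    model would apply one backward SDE/ODE step per iteration."""
    x = latents
    for _ in range(steps):
        x = x
    return x

def translate_to_tokens(latents: np.ndarray) -> np.ndarray:
    """Stub: predict one discrete XCodec2 token id per latent frame."""
    return (np.abs(latents).sum(axis=-1) % CODEBOOK_SIZE).astype(np.int64)

def xcodec2_decode(tokens: np.ndarray, hop: int = 320) -> np.ndarray:
    """Stub: reconstruct a waveform from discrete codes."""
    return np.zeros(len(tokens) * hop, dtype=np.float32)

noisy = np.zeros(16000, dtype=np.float32)   # 1 s of 16 kHz audio
latents = wavlm_encode(noisy)               # (50, 1024)
clean_latents = dblm_restore(latents)
tokens = translate_to_tokens(clean_latents) # (50,)
enhanced = xcodec2_decode(tokens)           # (16000,)
```

The point of the sketch is the shape contract between stages: audio → (T, 1024) latents → T token ids → audio.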

Key Capabilities

✅ Speech Denoising – Removes background noise (e.g., babble, traffic, machinery)
✅ Dereverberation – Suppresses room echoes for improved clarity and ASR performance
✅ Latent Diffusion Bridge – Stable, iterative restoration in compressed space
✅ XCodec2-Compatible Output – Directly generates discrete codes for neural decoding
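To build intuition for the iterative bridge restoration, here is a toy 1-D example (a Brownian bridge with an oracle clean-signal predictor, not the model's actual SDE). The bridge is pinned at the clean signal x0 at t=0 and the corrupted observation x1 at t=1; a deterministic, ODE-style backward step follows the bridge mean, x_s = x0_hat + (s/t)·(x_t − x0_hat), so iterating t → 0 contracts the sample onto the clean endpoint:

```python
import numpy as np

rng = np.random.default_rng(42)
x0 = np.sin(np.linspace(0, 2 * np.pi, 64))  # "clean" latent
x1 = x0 + 0.5 * rng.standard_normal(64)     # "corrupted" bridge endpoint

def backward_sample(x1, predict_x0, steps=10):
    """Deterministic backward sweep t = 1 -> 0 along the bridge mean."""
    ts = np.linspace(1.0, 0.0, steps + 1)
    x = x1.copy()
    for t, s in zip(ts[:-1], ts[1:]):
        x0_hat = predict_x0(x, t)           # model's clean estimate
        x = x0_hat + (s / t) * (x - x0_hat)
    return x

# Oracle predictor stands in for the trained DBLM.
enhanced = backward_sample(x1, lambda x, t: x0)
print(float(np.max(np.abs(enhanced - x0))))  # 0.0: clean signal recovered
```

With a learned (imperfect) predictor the trajectory would only approximate the clean endpoint, which is why the real model takes multiple refinement steps.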

Checkpoint Details

  • Model name: DBLM-SE-1B
  • Architecture: LLaMA
  • Parameters: ~1.0B
  • Input: WavLM-Large continuous latents (T × 1024)
  • Output: XCodec2 discrete acoustic token IDs (multi-codebook)
  • Weights: Safetensors, FP32 tensors

Training Data

The training data for DBLM-SE-1B is constructed following the Microsoft DNS Challenge methodology, providing realistic and diverse noisy-reverberant conditions for robust speech enhancement.

  • Clean Speech: LibriVox, VCTK
  • Noise: AudioSet, Freesound, DEMAND
  • Room Impulse Responses (RIRs): OpenSLR26, OpenSLR28

To improve generalization and prevent overfitting, noisy and reverberant training samples are generated on the fly during training with the following augmentation strategy:

  • 90% probability of adding background noise (SNR: [-5, 20] dB)
  • 50% probability of applying reverberation via RIR convolution

This dynamic mixing pipeline enables the model to learn joint denoising and dereverberation in a realistic and data-efficient manner.
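The mixing strategy above can be sketched as follows (a minimal NumPy illustration assuming 16 kHz mono float audio and RMS-based SNR scaling; `dynamic_mix` and `snr_scale` are hypothetical names, but the probabilities and SNR range follow the card):

```python
import numpy as np

rng = np.random.default_rng(0)

def snr_scale(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean/noise power ratio equals `snr_db`."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    target_p = p_clean / (10 ** (snr_db / 10))
    return noise * np.sqrt(target_p / p_noise)

def dynamic_mix(clean: np.ndarray, noise: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """On-the-fly corruption: 50% reverberation, 90% additive noise."""
    x = clean
    if rng.random() < 0.5:                        # 50%: RIR convolution
        x = np.convolve(x, rir)[: len(clean)]
    if rng.random() < 0.9:                        # 90%: noise at random SNR
        snr_db = rng.uniform(-5.0, 20.0)          # SNR in [-5, 20] dB
        x = x + snr_scale(x, noise[: len(x)], snr_db)
    return x

clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
noise = rng.standard_normal(16000)
rir = np.array([1.0, 0.0, 0.0, 0.3, 0.0, 0.1])   # toy impulse response
noisy = dynamic_mix(clean, noise, rir)
```

In a real pipeline the RIR would be drawn from the OpenSLR26/OpenSLR28 sets and the noise from AudioSet, Freesound, or DEMAND, with a fresh draw per training sample.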


📦 This is a checkpoint-only release.
🤗 For code, demos, and documentation, go to:
https://github.com/Yorch233/DBLM-SE

