Hush

The first open-source speech enhancement model built specifically for Voice AI, with real-time background speaker suppression.

8 MB model · Runs fully on CPU in real time · Trained on 10,000+ hours of mixed audio · Under 1 ms processing per 10 ms of audio

🚀 Coming Soon: We are currently fine-tuning a new model optimized for environments with even louder background noise and background speech. Stay tuned for the upcoming release.



Listen to the Model (Use headphones)

Audio samples (raw recording from a noisy environment vs. the denoised Hush output) are embedded on the model page.


Model Overview

Hush is designed from the ground up for Voice AI applications: phone-based voice agents, call centre bots, voice assistants, real-time transcription pipelines, and conversational AI systems. It isolates exactly one speaker from a live audio stream, in real time, under production conditions.

The model is language-agnostic: it operates on the acoustic signal directly and works for any spoken language.

At a Production Glance

| Property | Value |
|---|---|
| Model size | 8 MB |
| Compute | CPU only (no GPU required) |
| Processing latency | < 1 ms per 10 ms of audio |
| Algorithmic latency | ~20 ms (fully causal, zero lookahead) |
| Training data | 10,000+ hours of mixed speech, noise, and competing speakers |
| Sample rate | 16 kHz (telephony-native: G.711, WebRTC, SIP) |
| Language | Any (language-agnostic speech enhancement) |

The Problem It Solves

Every major open-source speech enhancement model (DeepFilterNet3, RNNoise, SEGAN, MetricGAN+, DNS-Challenge entrants) is trained on stationary noise: fans, traffic, keyboard clicks. None treats a competing human voice as a first-class problem.

When the interference is another person speaking, these models either:

  • Leak the competing speaker, which then gets transcribed as part of the conversation and breaks downstream NLP/LLM components
  • Suppress both speakers, which degrades the primary speaker's intelligibility

Hush is the first open-source model to explicitly train for background speaker suppression.


What Makes Hush Different

Built on DeepFilterNet3, extended with one targeted innovation: teaching the encoder to distinguish speakers, not just speech from noise.

  1. Training data reflecting the real problem: 60% of training samples include a competing human speaker at 12–24 dB SIR
  2. Auxiliary separation head: a lightweight Linear(256→32) + Sigmoid head trained with an L1 loss to predict ERB-domain background speaker masks (training only; zero inference overhead)
  3. Joint optimization: the separation loss (weight 0.1) is combined with a multi-resolution spectral loss across 4 FFT scales
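The auxiliary head and joint objective described above can be sketched roughly as follows. This is an illustrative reconstruction from the stated design (Linear(256→32) + Sigmoid, L1 loss, weight 0.1), not the repository's actual code:

```python
import torch
import torch.nn as nn

class SeparationHead(nn.Module):
    """Auxiliary head (training only): predicts an ERB-domain soft mask
    for the background speaker from the 256-dim encoder state."""
    def __init__(self, enc_dim: int = 256, erb_bands: int = 32):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(enc_dim, erb_bands), nn.Sigmoid())

    def forward(self, enc_state: torch.Tensor) -> torch.Tensor:
        # enc_state: [B, T, 256] -> mask in [0, 1], shaped [B, T, 32]
        return self.proj(enc_state)

def joint_loss(spec_loss: torch.Tensor, pred_mask: torch.Tensor,
               target_mask: torch.Tensor, sep_weight: float = 0.1) -> torch.Tensor:
    # L1 separation loss, weighted by 0.1, added to the spectral loss
    sep = nn.functional.l1_loss(pred_mask, target_mask)
    return spec_loss + sep_weight * sep
```

Because the head is only a regularizer on the encoder, it can be dropped at export time with no inference cost.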

Architecture

Input Waveform [B, 1, T]
        |
        v
  STFT (FFT=320, Hop=160)
        |
   _____|_______________
   |                   |
   v                   v
ERB features        DF features
[B, 1, T, 32]      [B, 2, T, 64]
   |                   |
   '-------+------------'
           |
           v
        ENCODER
   (SqueezedGRU, 256-dim)
           |
   ________|____________________________
   |               |                   |
   v               v                   v
ERB DECODER     DF DECODER     SEPARATION HEAD *
(ConvTranspose  (3-layer GRU   (Linear + Sigmoid
 + skip conns)   + DF filter)   ERB-domain mask)
   |               |
   v               v
ERB gain mask   Complex filter
   |               |
   '-------+--------'
           |
           v
    Enhanced Spectrum
           |
           v
         ISTFT
           |
           v
   Enhanced Waveform [B, 1, T]

* The Separation Head is active during training only and is discarded at inference.
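The front-end shapes in the diagram follow from the STFT settings (FFT=320, hop=160 at 16 kHz). A quick sketch of the resulting tensors; the ERB grouping itself is internal to the model, so only the linear-frequency spectrum and the DF-branch slice are shown:

```python
import torch

# Analysis front end implied by the diagram: 16 kHz input,
# FFT=320, hop=160 (values from the specs table below).
sr, n_fft, hop = 16000, 320, 160
wav = torch.randn(1, sr)  # one second of audio, [B=1, T]

spec = torch.stft(
    wav, n_fft=n_fft, hop_length=hop,
    window=torch.hann_window(n_fft), return_complex=True,
)
# spec: [1, 161, T_frames] complex bins, where 161 = n_fft // 2 + 1

# The model groups the 161 linear bins into 32 ERB magnitude bands,
# while the lowest 64 complex bins feed the deep-filtering (DF) branch.
df_bins = spec[:, :64, :]  # [1, 64, T_frames]
```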

Model Specifications

| Parameter | Value |
|---|---|
| Model size | 8 MB |
| Parameters | ~1.8M |
| Sample rate | 16,000 Hz |
| Frame size / hop | 320 / 160 samples (20 ms / 10 ms) |
| ERB bands | 32 |
| DF bins | 64 (order-5 filter) |
| Encoder dim | 256 |
| Lookahead | 0 (fully causal) |
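The frame and hop entries determine the latency figures directly; a quick arithmetic check:

```python
# Latency arithmetic: 320-sample frames with a 160-sample hop at 16 kHz.
sr = 16_000
frame, hop = 320, 160

frame_ms = frame / sr * 1000  # analysis window: 20.0 ms
hop_ms = hop / sr * 1000      # one output frame every 10.0 ms

# With zero lookahead, algorithmic latency is bounded by the analysis
# window (~20 ms), and "< 1 ms of processing per 10 ms of audio"
# corresponds to a real-time factor of about 0.1 on CPU.
rtf = 1 / 10  # processing time / audio time
```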

Quick Start: PyTorch Inference

```python
import torch
import soundfile as sf
from model.dfnet_se import DfNetSE, get_config

# Build the model and load the best checkpoint
config = get_config()
model = DfNetSE(config)
checkpoint = torch.load("model_best.ckpt", map_location="cpu")
model.model.load_state_dict(checkpoint)
model.eval()

# Load audio; the model expects mono 16 kHz input
audio, sr = sf.read("noisy_speech.wav")
assert sr == 16000, "Input must be 16 kHz"
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix multi-channel files to mono

wav = torch.tensor(audio).float().unsqueeze(0).unsqueeze(0)  # [1, 1, T]
with torch.no_grad():
    enhanced = model(wav)  # [1, 1, T]

sf.write("enhanced.wav", enhanced.squeeze().numpy(), 16000)
```
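The quick start assumes 16 kHz mono input. If your source audio is at another rate, a small pre-processing helper can normalize it first. The helper below is ours, not part of the repo, and uses SciPy's polyphase resampler as one reasonable option:

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

def to_16k(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono and resample to the model's 16 kHz rate."""
    if audio.ndim > 1:                 # [T, C] -> mono
        audio = audio.mean(axis=1)
    if sr != 16000:
        g = gcd(sr, 16000)             # reduce the resampling ratio
        audio = resample_poly(audio, 16000 // g, sr // g)
    return audio.astype(np.float32)
```

Feed the result into the `torch.tensor(...)` line of the quick start in place of the raw `audio` array.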

Quick Start: Production (ONNX, No PyTorch)

For production deployment without PyTorch, use the prebuilt Weya NC Standalone library:

```python
import ctypes, platform, numpy as np

lib_name = {"Darwin": "libweya_nc.dylib", "Windows": "weya_nc.dll"}.get(
    platform.system(), "libweya_nc.so"
)
lib = ctypes.CDLL(f"deployment/lib/{lib_name}")

# Handles are opaque pointers; declare restypes so they are not
# truncated to a C int on 64-bit platforms
lib.weya_nc_model_load_from_path.restype = ctypes.c_void_p
lib.weya_nc_session_create.restype = ctypes.c_void_p

model = lib.weya_nc_model_load_from_path(b"onnx/advanced_dfnet16k_model_best_onnx.tar.gz")
session = lib.weya_nc_session_create(model, 16000, ctypes.c_float(100.0))
frame_len = int(lib.weya_nc_get_frame_length(session))

# input_ptr / output_ptr point to float32 buffers of frame_len samples
lib.weya_nc_process_frame(session, input_ptr, output_ptr)
```
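The last call above processes a single frame. A complete frame-by-frame loop over a NumPy buffer might look like the following sketch; the `denoise` helper and its buffer handling are our assumptions, so consult the deployment guide for the authoritative API:

```python
import ctypes
import numpy as np

def denoise(lib, session, frame_len: int, audio: np.ndarray) -> np.ndarray:
    """Run 16 kHz float32 samples through the noise-cancelling session,
    one frame at a time (trailing partial frame is dropped)."""
    n_frames = len(audio) // frame_len
    out = np.zeros(n_frames * frame_len, dtype=np.float32)
    for i in range(n_frames):
        frame = np.ascontiguousarray(
            audio[i * frame_len:(i + 1) * frame_len], dtype=np.float32
        )
        out_frame = np.empty(frame_len, dtype=np.float32)
        # Pass raw float32 pointers into the C library
        lib.weya_nc_process_frame(
            session,
            frame.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
            out_frame.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        )
        out[i * frame_len:(i + 1) * frame_len] = out_frame
    return out
```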

Prebuilt binaries are available for Linux, macOS (Apple Silicon), and Windows. See the deployment guide for full integration instructions.


Training Details

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 5e-4 (cosine decay to 1e-6) |
| LR warmup | 3 epochs (1e-4 → 5e-4) |
| Weight decay | 0.05 |
| Batch size | 16 |
| Max sample length | 5 seconds |
| Epochs | 100 |
| Early stopping | patience = 25 epochs |
| Gradient clip | 1.0 |
| Loss | MultiResSpecLoss (4 scales) + LocalSNRLoss + SeparationLoss (×0.1) |
| Background speaker prob. | 60% of samples |
| Background SIR range | 12–24 dB |
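The learning-rate rows can be read as a warmup-plus-cosine schedule. The function below is our interpretation of the table, not the repo's training code:

```python
import math

def lr_at(epoch: int, total_epochs: int = 100, warmup: int = 3,
          warmup_start: float = 1e-4, peak: float = 5e-4,
          floor: float = 1e-6) -> float:
    """Linear warmup 1e-4 -> 5e-4 over 3 epochs,
    then cosine decay toward 1e-6 (values from the table)."""
    if epoch < warmup:
        return warmup_start + (peak - warmup_start) * epoch / warmup
    progress = (epoch - warmup) / (total_epochs - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```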

Datasets Used

The model was trained on standard publicly available datasets totalling over 10,000 hours of mixed audio:

| Category | Datasets |
|---|---|
| Primary speech | LibriSpeech (train-clean-100/360), VCTK Corpus, Common Voice |
| Background speech | LibriSpeech / VCTK / LibriTTS (speaker-disjoint splits) |
| Noise | DNS Challenge, FreeSound, ESC-50, AudioSet |
| Room impulse responses | MIT IR Survey, OpenAIR, BUT ReverbDB |

Note: Speech enhancement operates on acoustic features, not linguistic content, so Hush works effectively across all languages.

See DATASETS.md for full details with URLs and licensing.


Known Limitations

  • 16 kHz only: trained and evaluated at 16 kHz; other sample rates require resampling
  • Separation head is auxiliary: the background speaker mask is an ERB-domain soft mask used for training regularization, not a standalone source separation output
  • Background speakers at moderate SIR: trained with background speakers at 12–24 dB SIR; very loud competing speakers may not be fully suppressed
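To make the 12–24 dB SIR range concrete, here is a sketch of how a background speaker could be scaled to a target SIR before mixing. The helper is illustrative, not taken from the training pipeline:

```python
import numpy as np

def mix_at_sir(primary: np.ndarray, background: np.ndarray,
               sir_db: float) -> np.ndarray:
    """Scale `background` so the primary-to-background power ratio
    equals `sir_db` dB, then mix the two signals."""
    p_pow = np.mean(primary ** 2)
    b_pow = np.mean(background ** 2)
    # Gain that puts the background exactly sir_db below the primary
    gain = np.sqrt(p_pow / (b_pow * 10 ** (sir_db / 10)))
    return primary + gain * background
```

At 12 dB the interfering voice is clearly audible; at 24 dB it is faint. Interference louder than this range falls outside the training distribution, which is why very loud competing speakers may leak through.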

Repository Structure

weya-ai/hush/  (this Hugging Face repo)
├── README.md                  ← This model card
├── config.json                ← Model configuration metadata
├── model_best.ckpt            ← PyTorch checkpoint
├── onnx/
│   └── advanced_dfnet16k_model_best_onnx.tar.gz  ← ONNX production bundle
└── LICENSE                    ← Apache 2.0

Full source code, training scripts, deployment examples, and documentation are available on GitHub.


Acknowledgements

Built on DeepFilterNet by Hendrik Schröter, Tobias Rosenkranz, Alberto N. Escalante-B., and Andreas Maier. The core architecture, ERB filterbank, SqueezedGRU module, and loss functions closely follow the DF3 design.


Citation

If you use this model or code, please cite the original DeepFilterNet paper:

@inproceedings{schroter2023deepfilternet3,
  title     = {DeepFilterNet: Perceptually Motivated Real-Time Speech Enhancement},
  author    = {Schröter, Hendrik and Rosenkranz, Tobias and Escalante-B., Alberto N and Maier, Andreas},
  booktitle = {INTERSPEECH},
  year      = {2023}
}

License

Apache License 2.0. See LICENSE for details.
