Sidon: Call-Centre / Telephony Speech Restoration
Restore narrowband, codec'd, noisy call-centre / telephony speech (e.g. 8 kHz G.711/GSM phone audio) to clean 48 kHz. Two stages, both trained for the telephony domain (Malaysian/Singaporean + multilingual clean teachers):
input (8-16 kHz telephony) --16k--> [FE: 24-layer w2v-BERT 2.0 + LoRA] --features[T,1024]-->
[DAC decoder, 188M] --> 48 kHz clean waveform
The FE LoRA adapter is merged into the base weights at load time, so inference needs no peft,
just transformers + descript-audio-codec.
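The load-time merge can be illustrated on a toy layer. This is a minimal sketch in plain Python, assuming the standard LoRA update W' = W + (alpha / r) * B @ A; the helpers `matmul`, `matvec` and `merge_lora` and the toy matrices are hypothetical and not part of this repo.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * (B @ A), i.e. the weight with the adapter folded in."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy shapes (assumed for illustration only): out=2, in=3, rank=2.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # base weight [out, in]
A = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]   # lora_A [r, in]
B = [[2.0, 0.0], [0.0, 4.0]]             # lora_B [out, r]
ALPHA, R = 1.0, 2
x = [1.0, 2.0, 3.0]

merged = merge_lora(W, A, B, ALPHA, R)
y_merged = matvec(merged, x)

# Un-merged adapter path: W x + (alpha / r) * B (A x)
low_rank = matvec(B, matvec(A, x))
y_adapter = [wx + (ALPHA / R) * lx for wx, lx in zip(matvec(W, x), low_rank)]

assert y_merged == y_adapter  # merging does not change the layer's output
```

Because the merged weight gives the same output as the base-plus-adapter path, a plain `transformers` model with patched weights is enough at inference time.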
Quick start: infer from the HF checkpoint
pip install torch torchaudio "transformers>=4.56" "descript-audio-codec>=1.0.0" soundfile "huggingface_hub[cli]"
# pull the CLI + the two slim checkpoints from the Hub
hf auth login # private repo: log in first (or export HF_TOKEN=hf_...)
hf download Scicom-intl/sidon-callcentre \
infer_callcentre.py fe_callcentre/fe_adapter_full.pt \
decoder_callcentre/decoder_only.pt --local-dir sidon-callcentre
cd sidon-callcentre && python infer_callcentre.py \
--input your_call.wav --out-dir out \
--fe-adapter fe_callcentre/fe_adapter_full.pt \
--decoder decoder_callcentre/decoder_only.pt \
--chunk 0 --device cuda # --chunk 0 = NO chunking (default single pass); --device cpu if no GPU
# -> out/your_call_restored48k.wav (clean 48 kHz) + out/your_call_orig48k.wav (A/B)
Prefer Python (load weights from the Hub with hf_hub_download)? See Python below.
Status: the decoder is still training (~step 30k of 100k) and these checkpoints are refreshed periodically, so quality keeps improving. It already restores real 8 kHz call-centre audio well.
Files
Use the current-run checkpoints under fe_callcentre/ and decoder_callcentre/:
| path | role | size |
|---|---|---|
| fe_callcentre/fe_adapter_full.pt | FE adapter (inference): 144 tensors (96 LoRA + 48 trained output_dense biases) | ~63 MB |
| decoder_callcentre/decoder_only.pt | decoder (inference): 188M DAC decoder | ~0.75 GB |
| fe_callcentre/last.pt, decoder_callcentre/last.pt | raw checkpoints (resume training) | ~2.5 / 2.8 GB |
| infer_callcentre.py | inference CLI (below) | - |
For inference you only need the two slim files + infer_callcentre.py. (Root-level
fe_adapter_full.pt / decoder_only.pt are from an earlier run and are superseded.)
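As a sanity check on the adapter's tensor count and size, the figures can be reproduced with a few lines of arithmetic. This is a sketch under assumptions not stated in this card: each of the 24 Conformer layers has two feed-forward modules whose output_dense maps 4096 -> 1024, and the tensors are stored as float32.

```python
# Assumed w2v-BERT 2.0 dimensions (not stated in this card; hidden size matches features[T,1024]).
HIDDEN = 1024          # model width
INTERMEDIATE = 4096    # feed-forward inner width (assumption)
LAYERS = 24
FFN_PER_LAYER = 2      # Conformer layers carry two feed-forward modules (assumption)
RANK = 64              # LoRA r from the model details
BYTES_PER_PARAM = 4    # float32 storage (assumption)

modules = LAYERS * FFN_PER_LAYER                 # adapted output_dense layers
lora_tensors = 2 * modules                       # one lora_A and one lora_B each
bias_tensors = modules                           # one trained bias each
lora_params = modules * RANK * (INTERMEDIATE + HIDDEN)
bias_params = modules * HIDDEN
total_params = lora_params + bias_params
size_mb = total_params * BYTES_PER_PARAM / 1e6

print(f"{lora_tensors} LoRA + {bias_tensors} bias tensors = {lora_tensors + bias_tensors}")
print(f"{total_params / 1e6:.1f}M parameters, about {size_mb:.0f} MB")
```

Under these assumptions the result lines up with the 144 tensors, ~16M trainable parameters and ~63 MB quoted in this card.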
End-to-end example (straight from HuggingFace)
pip install torch torchaudio "transformers>=4.56" "descript-audio-codec>=1.0.0" soundfile "huggingface_hub[cli]"
# pull the CLI + the two slim checkpoints, straight from this repo
hf auth login # private repo: log in first (or export HF_TOKEN=hf_...)
hf download Scicom-intl/sidon-callcentre \
infer_callcentre.py \
fe_callcentre/fe_adapter_full.pt \
decoder_callcentre/decoder_only.pt \
--local-dir sidon-callcentre
cd sidon-callcentre
# restore your audio end-to-end
python infer_callcentre.py \
--input your_call.wav \
--out-dir out \
--fe-adapter fe_callcentre/fe_adapter_full.pt \
--decoder decoder_callcentre/decoder_only.pt \
--chunk 0 --device cuda # --chunk 0 = NO chunking (single straight pass, default); --device cpu if no GPU
Outputs:
- out/your_call_restored48k.wav: the restored clean 48 kHz speech.
- out/your_call_orig48k.wav: the input, naively upsampled to 48 kHz (no model), for an A/B listen.
--input accepts a file or a directory (.wav/.flac/.mp3/.ogg/.opus/.m4a). Stereo (e.g.
agent/customer on separate channels) is restored per channel and recombined.
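Per-channel restoration can be sketched as below. This is an illustrative helper, not the code in infer_callcentre.py: `restore_per_channel` and `fake_restore` are hypothetical names, and `restore_mono` stands in for any function that maps one mono channel to its restored waveform. Note that the `restore` function in the Python section downmixes stereo to mono instead.

```python
import numpy as np


def restore_per_channel(x, restore_mono):
    """Restore each channel independently and stack the results.

    x: array shaped [frames] (mono) or [frames, channels], as returned by soundfile.
    restore_mono: callable mapping a 1-D array to a restored 1-D array.
    """
    if x.ndim == 1:
        return restore_mono(x)
    channels = [restore_mono(x[:, c]) for c in range(x.shape[1])]
    # Guard against off-by-a-few-samples length differences between channels.
    n = min(len(ch) for ch in channels)
    return np.stack([ch[:n] for ch in channels], axis=1)


def fake_restore(channel):
    """Stand-in model: 3x upsampling by sample repetition (16 kHz -> 48 kHz)."""
    return np.repeat(channel, 3)


# Two-frame stereo example: agent on channel 0, customer on channel 1.
stereo = np.array([[1.0, 10.0], [2.0, 20.0]])
restored = restore_per_channel(stereo, fake_restore)
print(restored.shape)
```

Keeping the channels separate preserves the agent/customer split, which a mono downmix would lose.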
Inference is a single straight pass (--chunk 0, the default): w2v-BERT 2.0 uses relative/rotary
position embeddings and the DAC decoder is fully convolutional, so a full pass is length-invariant and
cleanest. --chunk <seconds> enables crossfaded windowing purely as a memory fallback for very long
audio (self-attention is O(T^2)); it is spectrally near-identical (log-mel corr ≈ 0.98) but adds seams,
so prefer the default single pass unless you hit OOM.
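The idea behind crossfaded windowing can be sketched as follows. This is a simplified illustration, not the implementation behind --chunk: `split_windows`, `crossfade_join`, `restore_chunked` and `fake_restore` are hypothetical names, a linear fade is assumed, and the stand-in model is assumed to return exactly 3 output samples per input sample (16 kHz in, 48 kHz out).

```python
def split_windows(x, win, overlap):
    """Cut x into windows of `win` samples that share `overlap` samples with their neighbour."""
    hop = win - overlap
    return [x[s:s + win] for s in range(0, max(len(x) - overlap, 1), hop)]


def crossfade_join(chunks, overlap):
    """Join chunks whose last/first `overlap` samples cover the same audio, using a linear fade."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # fade-in weight of the new chunk
            out[-overlap + i] = (1.0 - w) * out[-overlap + i] + w * chunk[i]
        out.extend(chunk[overlap:])
    return out


def restore_chunked(x, restore_fn, win, overlap, ratio=3):
    """Restore x window by window; the overlap grows by `ratio` in the output domain."""
    chunks = [restore_fn(w) for w in split_windows(x, win, overlap)]
    return crossfade_join(chunks, overlap * ratio)


def fake_restore(window):
    """Stand-in for the model: 3x upsampling by sample repetition."""
    return [s for s in window for _ in range(3)]


signal = [float(i) for i in range(11)]
restored = restore_chunked(signal, fake_restore, win=4, overlap=2)
print(len(signal), "->", len(restored))
```

With a real model the two sides of each overlap differ slightly, because each window sees different context; that mismatch is where the seams come from.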
Python (pull weights from the Hub)
import numpy as np, soundfile as sf, torch, torchaudio
from huggingface_hub import hf_hub_download
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel
import dac
REPO, SSL, FE_SR, SR_OUT = "Scicom-intl/sidon-callcentre", "facebook/w2v-bert-2.0", 16000, 48000
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ck = torch.load(hf_hub_download(REPO, "fe_callcentre/fe_adapter_full.pt"), map_location="cpu")
ad, scale = ck["adapter"], ck["lora_alpha"] / ck["r"]
fe = Wav2Vec2BertModel.from_pretrained(SSL, num_hidden_layers=ck.get("layers", 24), layerdrop=0.0)
sd = fe.state_dict() # merge LoRA -> base (no peft needed)
for p in sorted({k[:-len(".lora_A.default.weight")] for k in ad if k.endswith(".lora_A.default.weight")}):
    # W <- W + (alpha / r) * B @ A
    sd[p+".weight"] = sd[p+".weight"].float() + scale * (ad[p+".lora_B.default.weight"].float() @ ad[p+".lora_A.default.weight"].float())
    # bias="lora_only": also load the trained bias of each adapted layer
    if p+".base_layer.bias" in ad: sd[p+".bias"] = ad[p+".base_layer.bias"].to(sd[p+".bias"].dtype)
fe.load_state_dict(sd); fe.to(dev).eval()
dck = torch.load(hf_hub_download(REPO, "decoder_callcentre/decoder_only.pt"), map_location="cpu")
dec = dac.model.dac.Decoder(input_channel=1024, channels=dck.get("dec_channels", 3072), rates=[8,5,4,3,2])
dec.load_state_dict(dck["decoder"]); dec.to(dev).eval()
proc = AutoFeatureExtractor.from_pretrained(SSL)
@torch.no_grad()
def restore(path, out="restored48k.wav"): # single straight pass
    x, sr = sf.read(path, dtype="float32"); x = x.mean(1) if x.ndim > 1 else x  # downmix to mono
    if sr != FE_SR: x = torchaudio.functional.resample(torch.from_numpy(x)[None], sr, FE_SR)[0].numpy()
    x = x / (np.abs(x).max() + 1e-9) * 0.95  # peak-normalise the input
    feats = {k: v.to(dev) for k, v in proc(x, sampling_rate=FE_SR, return_tensors="pt").items()}
    y = dec(fe(**feats).last_hidden_state.transpose(1, 2)).squeeze().float().cpu().numpy()  # [T,1024] -> 48 kHz waveform
    sf.write(out, y / (np.abs(y).max() + 1e-9) * 0.97, SR_OUT); print("wrote", out)
restore("your_call.wav") # <-- your own telephony/call-centre audio
Model details
- FE: full 24-layer facebook/w2v-bert-2.0 + fresh LoRA (r=64, alpha=16, dropout=0.1, bias="lora_only", target_modules=["output_dense"]), trained by MSE distillation of a degraded signal's features toward a frozen teacher on the clean signal (~16M trainable).
- Decoder: dac.model.dac.Decoder(input_channel=1024, channels=3072, rates=[8,5,4,3,2]) (188M, 50 fps x 960 = 48 kHz), trained with DAC multi-resolution mel + GAN (loss = 15*mel + 2*adv + 1*feat).
- Degradation (train-time): telephone HP -> narrowband ceiling (8/11/12/16k) -> GSM / G.711-mu-law -> 16-40 kbps MP3 -> line noise + VoIP dropouts.
- Teachers (clean 48 kHz): EARS + Expresso (studio) + DNSMOS-filtered multilingual HF datasets + DNSMOS-filtered Malaysian/Singaporean podcast & movie (Scicom-intl/sidon-callcentre-podcast).
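To give a feel for one stage of the degradation chain, the sketch below applies 8-bit mu-law companding to samples in [-1, 1]. It is an illustration only: the training pipeline is not shown in this card, the function names are hypothetical, and the code uses the continuous mu-law curve (mu = 255) rather than the bit-exact segmented G.711 encoding.

```python
import math

MU = 255  # 8-bit mu-law, as used by North American / Japanese G.711


def mulaw_encode(x):
    """Compress a sample in [-1, 1] to an integer code in 0..255."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int((y + 1.0) / 2.0 * MU + 0.5)  # round to the nearest code


def mulaw_decode(code):
    """Expand an integer code in 0..255 back to a sample in [-1, 1]."""
    y = 2.0 * code / MU - 1.0
    return math.copysign((math.pow(1.0 + MU, abs(y)) - 1.0) / MU, y)


def mulaw_degrade(samples):
    """Round-trip samples through mu-law to add telephony-style quantisation noise."""
    return [mulaw_decode(mulaw_encode(s)) for s in samples]


clean = [-0.9, -0.3, -0.01, 0.0, 0.01, 0.3, 0.9]
degraded = mulaw_degrade(clean)
for c, d in zip(clean, degraded):
    print(f"{c:+.3f} -> {d:+.5f} (error {abs(c - d):.5f})")
```

The quantisation error grows with amplitude, which is the characteristic behaviour the restoration model has to undo alongside the band limiting.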
License / intended use
cc-by-nc-4.0: research / non-commercial. Built on facebook/w2v-bert-2.0 and Descript Audio Codec.