Zero-Ping

Neural speech codec (16 kHz) with built-in packet-loss repair via a local masked attention transformer. Designed for real-time VoIP / WebRTC applications where packets are lost in transit.

Sample rate:   16 kHz mono
Bitrate:       ~6 kbps (9 RVQ codebooks × 1024 entries)
Frame size:    15 ms (hop = 240 samples)
Latency:       ~30 ms algorithmic (2 future frames)
Parameters:    17.8 M
Best val STOI: 0.94
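
The bitrate, frame size and latency figures follow from one another. The sketch below reproduces the arithmetic, assuming each of the 9 codebooks emits one index per frame and that indices are sent as raw log2(1024) = 10-bit values with no entropy coding or packet headers (an assumption; the card does not describe the packet format). The constant names are illustrative and not part of zpcodec.

```python
import math

SAMPLE_RATE = 16_000      # Hz
HOP = 240                 # samples per frame
N_CODEBOOKS = 9           # RVQ stages
CODEBOOK_SIZE = 1024      # entries per codebook
LOOKAHEAD_FRAMES = 2      # future frames needed before decoding

# Duration of one frame in milliseconds.
frame_ms = 1000 * HOP / SAMPLE_RATE

# One index per codebook per frame, log2(size) bits each (assumed raw coding).
bits_per_frame = N_CODEBOOKS * math.log2(CODEBOOK_SIZE)

# Bits per millisecond is numerically equal to kilobits per second.
bitrate_kbps = bits_per_frame / frame_ms

# Algorithmic latency contributed by the look-ahead frames.
lookahead_ms = LOOKAHEAD_FRAMES * frame_ms

print(f"{frame_ms} ms/frame, {bits_per_frame} bits/frame, "
      f"{bitrate_kbps} kbps, {lookahead_ms} ms look-ahead")
```

Any real transport adds header overhead (RTP/UDP/IP) on top of this payload rate.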

Training data: LibriTTS train-clean-100 + VCTK (~700 h).

Install

git clone https://huggingface.co/Lucabr01/Zero-Ping
cd Zero-Ping
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install vector-quantize-pytorch einops huggingface_hub
pip install -e .

Usage

import torch, torchaudio
from zpcodec import ZPCodec, GilbertElliottConfig, GilbertElliottSimulator

# Load model (downloads weights automatically on first run)
model = ZPCodec.from_pretrained("Lucabr01/Zero-Ping", device="cpu")

# Load audio (must be 16 kHz mono)
wav, sr = torchaudio.load("speech.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = wav.mean(0, keepdim=True).unsqueeze(0)   # [1, 1, T]

with torch.no_grad():
    # Encode → decode (clean, no packet loss)
    z_q, indices = model.encode(wav)
    wav_clean = model.decode(z_q)

    # Simulate bursty packet loss with a Gilbert-Elliott channel, then repair
    cfg  = GilbertElliottConfig(p=0.05, r=0.5, k=0.999, h=0.5)
    sim  = GilbertElliottSimulator(cfg, sample_rate=16000, hop_length=model.hop_length)
    mask = sim.sample_frame_mask(1, z_q.shape[-1])
    wav_repaired = model.decode(z_q, frame_mask=mask)

torchaudio.save("clean.wav",    wav_clean.squeeze(0),    16000)
torchaudio.save("repaired.wav", wav_repaired.squeeze(0), 16000)

Architecture

Three-stage training:

  1. Codec pre-training (GAN + multi-scale mel + waveform + STFT losses)
  2. Repair transformer training (frozen codec, latent L1 on missing frames only)
  3. Joint fine-tuning (all modules, Gilbert-Elliott curriculum from mild to severe loss)
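
As an illustration of the stage 2 objective, here is one possible reading of "latent L1 on missing frames only": the absolute error between predicted and reference latents is averaged over the frames flagged as lost, and received frames contribute nothing. It is written in plain Python so it runs without PyTorch; a training loop would express the same thing as a masked tensor reduction. The function name and the convention that True marks a lost frame are assumptions, not taken from the zpcodec source.

```python
def masked_latent_l1(pred, target, lost):
    """Mean absolute error over latent frames flagged as lost.

    pred, target: lists of frames, each frame a list of floats.
    lost: list of bools, True where the frame was dropped (assumed convention).
    """
    total = 0.0
    count = 0
    for p_frame, t_frame, is_lost in zip(pred, target, lost):
        if not is_lost:
            continue  # received frames are excluded from the loss
        total += sum(abs(a - b) for a, b in zip(p_frame, t_frame))
        count += len(t_frame)
    # With no lost frames there is nothing to repair, so the loss is zero.
    return total / count if count else 0.0


# Three latent frames of dimension 2; only the middle one was lost.
target = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
pred = [[9.0, 9.0], [2.5, 2.0], [4.0, 5.0]]
lost = [False, True, False]
loss = masked_latent_l1(pred, target, lost)
print(loss)
```

The large error on the first frame is ignored because that frame was received, so only the repaired frame drives the gradient.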

The GilbertElliottConfig parameters let you tune the simulated channel:

  • p: probability of entering the Bad state (higher = more frequent bursts)
  • r: probability of leaving the Bad state (higher = shorter bursts)
  • h: P(no loss | Bad state), default 0.5
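
To see how these parameters translate into an average loss rate, the sketch below implements a standalone two-state Gilbert-Elliott chain in plain Python. It is not the zpcodec simulator. It assumes the textbook convention in which k is P(no loss | Good state), mirroring the definition of h above, and the function names are illustrative.

```python
import random


def expected_loss_rate(p, r, k, h):
    """Long-run fraction of lost frames for a Gilbert-Elliott channel."""
    pi_bad = p / (p + r)  # stationary probability of being in the Bad state
    return (1 - pi_bad) * (1 - k) + pi_bad * (1 - h)


def mean_bad_run(r):
    """Average number of consecutive frames spent in the Bad state."""
    return 1.0 / r


def simulate_losses(n_frames, p, r, k, h, seed=0):
    """Return a list of bools, True where the frame is lost."""
    rng = random.Random(seed)
    bad = False  # start in the Good state
    lost = []
    for _ in range(n_frames):
        # Update the channel state, then draw the loss for this frame.
        if bad:
            bad = rng.random() >= r  # leave Bad with probability r
        else:
            bad = rng.random() < p   # enter Bad with probability p
        p_ok = h if bad else k
        lost.append(rng.random() >= p_ok)
    return lost


rate = expected_loss_rate(p=0.05, r=0.5, k=0.999, h=0.5)
n = 200_000
empirical = sum(simulate_losses(n, 0.05, 0.5, 0.999, 0.5)) / n
print(f"analytic {rate:.4f}, simulated {empirical:.4f}, "
      f"mean Bad run {mean_bad_run(0.5):.1f} frames")
```

Under these assumptions, p=0.05, r=0.5, k=0.999, h=0.5 gives an analytic average loss of about 4.6%, with Bad-state runs lasting two frames on average. If the library defines k or h differently, the figure will differ, so measure the mean of the sampled mask when a specific target loss rate matters.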

Citation

If you use Zero-Ping in your work, please cite:

@misc{zeropingcodec2026,
  author = {Lucabr01},
  title  = {Zero-Ping: Neural Speech Codec with Packet-Loss Repair},
  year   = {2026},
  url    = {https://huggingface.co/Lucabr01/Zero-Ping}
}