---
language: en
tags:
- audio
- speech
- codec
- neural-codec
- packet-loss
- pytorch
license: mit
---

# Zero-Ping

Neural speech codec (16 kHz) with built-in **packet-loss repair** via a local masked attention transformer. Designed for real-time VoIP / WebRTC applications where packets are lost in transit.

| | |
|---|---|
| Sample rate | 16 kHz mono |
| Bitrate | ~6 kbps (9 RVQ codebooks × 1024 entries) |
| Frame size | 15 ms (hop = 240 samples) |
| Latency | ~30 ms algorithmic (2 future frames) |
| Parameters | 17.8 M |
| Best val STOI | **0.94** |

Training data: LibriTTS train-clean-100 + VCTK (~700 h).

## Install

```bash
git clone https://huggingface.co/Lucabr01/Zero-Ping
cd Zero-Ping
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install vector-quantize-pytorch einops huggingface_hub
pip install -e .
```

## Usage

```python
import torch, torchaudio
from zpcodec import ZPCodec, GilbertElliottConfig, GilbertElliottSimulator

# Load model (downloads weights automatically on first run)
model = ZPCodec.from_pretrained("Lucabr01/Zero-Ping", device="cpu")

# Load audio (must be 16 kHz mono)
wav, sr = torchaudio.load("speech.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
# Downmix to mono and add a batch dimension
wav = wav.mean(0, keepdim=True).unsqueeze(0)  # [1, 1, T]

with torch.no_grad():
    # Encode → decode (clean, no packet loss)
    z_q, indices = model.encode(wav)
    wav_clean = model.decode(z_q)

    # Simulate bursty packet loss with a Gilbert-Elliott channel and repair
    cfg = GilbertElliottConfig(p=0.05, r=0.5, k=0.999, h=0.5)
    sim = GilbertElliottSimulator(cfg, sample_rate=16000, hop_length=model.hop_length)
    mask = sim.sample_frame_mask(1, z_q.shape[-1])
    wav_repaired = model.decode(z_q, frame_mask=mask)

torchaudio.save("clean.wav", wav_clean.squeeze(0), 16000)
torchaudio.save("repaired.wav", wav_repaired.squeeze(0), 16000)
```

## Architecture

Three-stage training:

1. Codec pre-training (GAN + multi-scale mel + waveform + STFT losses)
2. Repair transformer training (frozen codec, latent L1 on missing frames only)
3. Joint fine-tuning (all modules, Gilbert-Elliott curriculum from mild to severe loss)

The `GilbertElliottConfig` parameters let you tune the simulated channel:

- `p`: probability of entering the Bad state (higher = more frequent bursts)
- `r`: probability of leaving the Bad state (higher = shorter bursts)
- `k`: P(no loss | Good state), by the usual Gilbert-Elliott convention (0.999 in the usage example)
- `h`: P(no loss | Bad state), default 0.5

## Citation

If you use Zero-Ping in your work, please cite:

```
@misc{zeropingcodec2026,
  author = {Lucabr01},
  title = {Zero-Ping: Neural Speech Codec with Packet-Loss Repair},
  year = {2026},
  url = {https://huggingface.co/Lucabr01/Zero-Ping}
}
```
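
## Channel parameters: worked example

To get a feel for how `p`, `r`, `k` and `h` translate into an overall loss rate, the sketch below models a textbook two-state Gilbert-Elliott channel in plain Python. It is not the `zpcodec` implementation: the function names are hypothetical, `k` is assumed to mean P(no loss | Good state), and the mask convention (`True` = frame lost) is an assumption that may differ from what `GilbertElliottSimulator.sample_frame_mask` returns.

```python
import random


def stationary_loss_rate(p, r, k, h):
    """Long-run fraction of lost frames for a two-state Gilbert-Elliott channel.

    Assumes p = P(Good -> Bad), r = P(Bad -> Good),
    k = P(no loss | Good), h = P(no loss | Bad).
    """
    pi_bad = p / (p + r)          # stationary probability of the Bad state
    pi_good = 1.0 - pi_bad
    return pi_good * (1.0 - k) + pi_bad * (1.0 - h)


def mean_bad_run_frames(r):
    """Expected number of consecutive frames spent in the Bad state."""
    return 1.0 / r                # run length is geometric with parameter r


def sample_loss_mask(n_frames, p, r, k, h, seed=0):
    """Draw a per-frame loss mask (True = lost; hypothetical convention)."""
    rng = random.Random(seed)
    bad = False                   # start in the Good state
    mask = []
    for _ in range(n_frames):
        # State transition for this frame
        if bad:
            if rng.random() < r:
                bad = False
        elif rng.random() < p:
            bad = True
        # Loss decision given the current state
        keep_prob = h if bad else k
        mask.append(rng.random() >= keep_prob)
    return mask


if __name__ == "__main__":
    params = dict(p=0.05, r=0.5, k=0.999, h=0.5)
    print(f"expected loss rate: {stationary_loss_rate(**params):.4f}")
    print(f"mean Bad run: {mean_bad_run_frames(params['r']):.1f} frames")
    mask = sample_loss_mask(100_000, **params)
    print(f"empirical loss rate: {sum(mask) / len(mask):.4f}")
```

Under these assumptions, the values from the usage example (`p=0.05, r=0.5, k=0.999, h=0.5`) give a long-run loss rate of roughly 4.6% and Bad-state runs of 2 frames on average, which is 30 ms at the 15 ms frame size. Raising `p` or lowering `h` increases the loss rate; lowering `r` makes the bursts longer.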