+---
+language: en
+tags:
+- audio
+- speech
+- codec
+- neural-codec
+- packet-loss
+- pytorch
+license: mit
+---
+# Zero-Ping
+Neural speech codec (16 kHz) with built-in **packet-loss repair** via a local masked attention transformer. Designed for real-time VoIP / WebRTC applications where packets are lost in transit.
+| | |
+|---|---|
+| Sample rate | 16 kHz mono |
+| Bitrate | ~8.6 kbps (9 RVQ codebooks × 1024 entries) |
+| Frame size | 15 ms (hop = 240 samples) |
+| Latency | ~30 ms algorithmic (2 future frames) |
+| Parameters | 17.8 M |
+| Best val STOI | **0.931** |
+Training data: LibriTTS train-clean-100 + VCTK + CommonVoice v2 (~700 h).
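The frame timing and lookahead figures in the table follow directly from the sample rate and hop size. The sketch below reproduces them and also computes the raw index payload, assuming one index per codebook per frame and no side information or packet headers (an assumption, not something the table states). Under that assumption the payload comes to 6.0 kbps, so the ~8.6 kbps figure in the table presumably counts something this sketch does not model.

```python
import math

SAMPLE_RATE = 16_000      # Hz, from the table
HOP = 240                 # samples per frame, from the table
NUM_CODEBOOKS = 9         # RVQ codebooks, from the table
CODEBOOK_SIZE = 1024      # entries per codebook, from the table
LOOKAHEAD_FRAMES = 2      # future frames, from the table

frame_ms = 1000 * HOP / SAMPLE_RATE
frames_per_second = SAMPLE_RATE / HOP
lookahead_ms = LOOKAHEAD_FRAMES * frame_ms

# Assumption: each frame carries exactly one index per codebook and nothing else.
bits_per_frame = NUM_CODEBOOKS * math.log2(CODEBOOK_SIZE)
raw_payload_bps = bits_per_frame * SAMPLE_RATE / HOP

print(f"frame: {frame_ms:g} ms, lookahead: {lookahead_ms:g} ms")
print(f"raw index payload: {raw_payload_bps / 1000:.1f} kbps")
```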
+## Install
+```bash
+git clone https://huggingface.co/Lucabr01/Zero-Ping
+cd Zero-Ping
+pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
+pip install vector-quantize-pytorch einops huggingface_hub
+pip install -e .
+```
+## Usage
+```python
+import torch, torchaudio
+from zpcodec import ZPCodec, GilbertElliottConfig, GilbertElliottSimulator
+# Load model (downloads weights automatically on first run)
+model = ZPCodec.from_pretrained("Lucabr01/Zero-Ping", device="cpu")
+# Load audio and convert it to 16 kHz mono, which the model expects
+wav, sr = torchaudio.load("speech.wav")
+if sr != 16000:
+    wav = torchaudio.functional.resample(wav, sr, 16000)
+wav = wav.mean(0, keepdim=True).unsqueeze(0)   # [1, 1, T]
+with torch.no_grad():
+    # Encode → decode (clean, no packet loss)
+    z_q, indices = model.encode(wav)
+    wav_clean = model.decode(z_q)
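+    # Simulate bursty packet loss and repair. With these settings the channel
+    # spends ~9% of frames in the Bad state, i.e. roughly 4.6% average loss.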
+    cfg  = GilbertElliottConfig(p=0.05, r=0.5, k=0.999, h=0.5)
+    sim  = GilbertElliottSimulator(cfg, sample_rate=16000, hop_length=model.hop_length)
+    mask = sim.sample_frame_mask(1, z_q.shape[-1])
+    wav_repaired = model.decode(z_q, frame_mask=mask)
+torchaudio.save("clean.wav",    wav_clean.squeeze(0),    16000)
+torchaudio.save("repaired.wav", wav_repaired.squeeze(0), 16000)
+```
+## Architecture
+Three-stage training:
+1. Codec pre-training (GAN + multi-scale mel + waveform + STFT losses)
+2. Repair transformer training (frozen codec, latent L1 on missing frames only)
+3. Joint fine-tuning (all modules, Gilbert-Elliott curriculum from mild to severe loss)
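Stage 2 restricts the latent L1 loss to the frames that were dropped. A minimal, dependency-free sketch of that idea follows. The name `masked_latent_l1` and the list-of-frames layout are hypothetical, chosen only for illustration. An actual training loop would compute the same quantity on tensors.

```python
def masked_latent_l1(pred, target, lost):
    """Mean absolute error over the latent values of lost frames only.

    pred, target: sequences of frames, each frame a sequence of floats.
    lost: sequence of bools, True where the frame was dropped in transit.
    (Hypothetical layout; the real mask convention may differ.)
    """
    total, count = 0.0, 0
    for p_frame, t_frame, is_lost in zip(pred, target, lost):
        if not is_lost:
            continue  # received frames contribute nothing to the loss
        total += sum(abs(p - t) for p, t in zip(p_frame, t_frame))
        count += len(p_frame)
    return total / count if count else 0.0


target = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pred = [[9.0, 9.0], [3.5, 4.5], [5.0, 6.0]]   # frame 0 is far off but was received
lost = [False, True, False]
loss = masked_latent_l1(pred, target, lost)
print(loss)
```

Only frame 1 is marked lost here, so the large error on frame 0 is ignored and the repair module is judged purely on what it had to reconstruct.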
+The `GilbertElliottConfig` parameters let you tune the simulated channel:
+- `p`: probability of entering the Bad state (higher = more frequent bursts)
+- `r`: probability of leaving the Bad state (higher = shorter bursts)
+- `k`: P(no loss | Good state), 0.999 in the usage example
+- `h`: P(no loss | Bad state), default 0.5
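To see how these parameters translate into an average loss rate, the two-state chain can be analysed and simulated in a few lines. This is a standalone sketch, not the `GilbertElliottSimulator` implementation. It assumes one state transition per frame and takes `k` to be P(no loss | Good state) by analogy with `h`. The function names are made up for illustration.

```python
import random


def expected_loss_rate(p, r, k, h):
    """Long-run fraction of lost frames for a Gilbert-Elliott channel."""
    bad = p / (p + r)                  # stationary probability of the Bad state
    return (1 - bad) * (1 - k) + bad * (1 - h)


def sample_lost_frames(p, r, k, h, n_frames, seed=0):
    """Return a list of bools, True where a frame is lost (hypothetical convention)."""
    rng = random.Random(seed)
    in_bad, lost = False, []           # start in the Good state
    for _ in range(n_frames):
        # The frame survives with the current state's success probability.
        keep_prob = h if in_bad else k
        lost.append(rng.random() >= keep_prob)
        # Then the channel moves: Good -> Bad with p, Bad -> Good with r.
        if in_bad:
            in_bad = rng.random() >= r
        else:
            in_bad = rng.random() < p
    return lost


params = dict(p=0.05, r=0.5, k=0.999, h=0.5)   # values from the usage example
analytic = expected_loss_rate(**params)
lost = sample_lost_frames(**params, n_frames=200_000)
empirical = sum(lost) / len(lost)
print(f"analytic {analytic:.4f}, empirical {empirical:.4f}")
print(f"mean Bad-state dwell: {1 / params['r']:.1f} frames")
```

With the usage-example settings the analytic rate is about 4.6%. Raising `p`, or lowering `r` or `h`, pushes it higher.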
+## Citation
+If you use Zero-Ping in your work, please cite:
+```bibtex
+@misc{zeropingcodec2025,
+  author = {Lucabr01},
+  title  = {Zero-Ping: Neural Speech Codec with Packet-Loss Repair},
+  year   = {2025},
+  url    = {https://huggingface.co/Lucabr01/Zero-Ping}
+}
+```