---
language: en
tags:
- audio
- speech
- codec
- neural-codec
- packet-loss
- pytorch
license: mit
---

# Zero-Ping

Neural speech codec (16 kHz) with built-in **packet-loss repair** via a local masked attention transformer. Designed for real-time VoIP / WebRTC applications where packets are lost in transit.

| | |
|---|---|
| Sample rate | 16 kHz mono |
| Bitrate | ~6 kbps (9 RVQ codebooks × 1024 entries) |
| Frame size | 15 ms (hop = 240 samples) |
| Latency | ~30 ms algorithmic (2 future frames) |
| Parameters | 17.8 M |
| Best val STOI | **0.94** |

Training data: LibriTTS train-clean-100 + VCTK (~700 h).
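
The bitrate and latency figures in the table follow from the frame and codebook sizes. A quick check in Python, assuming each of the 9 codebooks contributes one index per frame and no side information is transmitted (the variable names are illustrative and not part of `zpcodec`):

```python
import math

SAMPLE_RATE = 16_000      # Hz, from the table
HOP_LENGTH = 240          # samples per frame, from the table
NUM_CODEBOOKS = 9         # RVQ stages
CODEBOOK_SIZE = 1024      # entries per codebook
LOOKAHEAD_FRAMES = 2      # future frames the decoder waits for

frame_ms = 1000 * HOP_LENGTH / SAMPLE_RATE                  # frame duration in ms
bits_per_frame = NUM_CODEBOOKS * math.log2(CODEBOOK_SIZE)   # 9 codebooks x 10 bits
bitrate_kbps = bits_per_frame * SAMPLE_RATE / HOP_LENGTH / 1000
latency_ms = LOOKAHEAD_FRAMES * frame_ms

print(f"{frame_ms:.0f} ms/frame, {bits_per_frame:.0f} bits/frame, "
      f"{bitrate_kbps:.1f} kbps, {latency_ms:.0f} ms look-ahead")
# prints: 15 ms/frame, 90 bits/frame, 6.0 kbps, 30 ms look-ahead
```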

## Install

```bash
git clone https://huggingface.co/Lucabr01/Zero-Ping
cd Zero-Ping
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install vector-quantize-pytorch einops huggingface_hub
pip install -e .
```

## Usage

```python
import torch
import torchaudio
from zpcodec import ZPCodec, GilbertElliottConfig, GilbertElliottSimulator

# Load model (downloads weights automatically on first run)
model = ZPCodec.from_pretrained("Lucabr01/Zero-Ping", device="cpu")

# Load audio; the model expects 16 kHz mono, so resample and downmix if needed
wav, sr = torchaudio.load("speech.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = wav.mean(0, keepdim=True).unsqueeze(0)  # [1, 1, T]

with torch.no_grad():
    # Encode → decode (clean, no packet loss)
    z_q, indices = model.encode(wav)
    wav_clean = model.decode(z_q)

    # Simulate bursty packet loss on a Gilbert-Elliott channel, then repair
    cfg = GilbertElliottConfig(p=0.05, r=0.5, k=0.999, h=0.5)
    sim = GilbertElliottSimulator(cfg, sample_rate=16000, hop_length=model.hop_length)
    mask = sim.sample_frame_mask(1, z_q.shape[-1])
    wav_repaired = model.decode(z_q, frame_mask=mask)

# Drop the batch dimension before saving: [1, 1, T] -> [1, T]
torchaudio.save("clean.wav", wav_clean.squeeze(0), 16000)
torchaudio.save("repaired.wav", wav_repaired.squeeze(0), 16000)
```

## Architecture

Three-stage training:
1. Codec pre-training (GAN + multi-scale mel + waveform + STFT losses)
2. Repair transformer training (frozen codec, latent L1 on missing frames only)
3. Joint fine-tuning (all modules, Gilbert-Elliott curriculum from mild to severe loss)
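
Stage 2 restricts the latent L1 loss to the frames that were dropped. A minimal PyTorch sketch of such a masked loss is shown here; the function name, the `[B, C, T]` tensor layout, and the convention that `True` in the mask marks a lost frame are assumptions for illustration, not the repository's actual training code.

```python
import torch

def masked_latent_l1(z_pred: torch.Tensor,
                     z_target: torch.Tensor,
                     lost: torch.Tensor) -> torch.Tensor:
    """Mean absolute latent error over lost frames only.

    z_pred, z_target: [B, C, T] latents (repaired vs. clean).
    lost: [B, T] boolean mask, True where the frame was dropped (assumed convention).
    """
    weights = lost.unsqueeze(1).to(z_pred.dtype)       # [B, 1, T], broadcast over channels
    abs_err = (z_pred - z_target).abs() * weights      # zero out received frames
    num_elems = weights.sum() * z_pred.shape[1]        # lost frames x channels
    return abs_err.sum() / num_elems.clamp(min=1.0)    # avoid 0/0 when nothing is lost

# Toy check: errors on received frames do not contribute to the loss.
z_clean = torch.zeros(1, 4, 6)
z_hat = torch.ones(1, 4, 6)                            # every frame is off by 1 ...
z_hat[:, :, 2] = 3.0                                   # ... except frame 2, off by 3
lost = torch.tensor([[False, False, True, False, False, False]])
loss = masked_latent_l1(z_hat, z_clean, lost)          # only frame 2 counts
```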

The `GilbertElliottConfig` parameters let you tune the simulated channel:
- `p`: probability of entering the Bad state (higher = more frequent bursts)
- `r`: probability of leaving the Bad state (higher = shorter bursts)
- `k`: P(no loss | Good state) in the standard Gilbert-Elliott notation; the usage example sets it to 0.999
- `h`: P(no loss | Bad state), default 0.5
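
To see how these parameters translate into loss patterns, here is a minimal, self-contained two-state Gilbert-Elliott simulator in plain Python. It is a sketch of the standard model, not the `GilbertElliottSimulator` shipped with `zpcodec`; the function names are hypothetical, and `k` is taken to be P(no loss | Good state), by analogy with `h`.

```python
import random

def expected_loss_rate(p: float, r: float, k: float, h: float) -> float:
    """Long-run fraction of lost frames for a two-state Gilbert-Elliott channel."""
    pi_bad = p / (p + r)              # stationary probability of the Bad state
    pi_good = 1.0 - pi_bad
    return pi_good * (1.0 - k) + pi_bad * (1.0 - h)

def simulate_losses(n_frames, p, r, k, h, seed=0):
    """Return a list of booleans, True where a frame is lost."""
    rng = random.Random(seed)
    bad = False                       # start in the Good state
    lost = []
    for _ in range(n_frames):
        # State transition: Good -> Bad with prob. p, Bad -> Good with prob. r
        if bad:
            bad = rng.random() >= r
        else:
            bad = rng.random() < p
        keep_prob = h if bad else k   # P(no loss) in the current state
        lost.append(rng.random() >= keep_prob)
    return lost

# Parameter values taken from the usage example
params = dict(p=0.05, r=0.5, k=0.999, h=0.5)
analytic = expected_loss_rate(**params)
empirical = sum(simulate_losses(200_000, **params)) / 200_000
print(f"analytic loss rate:  {analytic:.4f}")
print(f"empirical loss rate: {empirical:.4f}")
```

Under these assumptions the channel spends about 9% of the time in the Bad state, bursts last 1/`r` = 2 frames on average, and the long-run loss rate is about 4.6%. Raising `p` or lowering `r` increases it.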

## Citation

If you use Zero-Ping in your work, please cite:

```bibtex
@misc{zeropingcodec2026,
  author = {Lucabr01},
  title = {Zero-Ping: Neural Speech Codec with Packet-Loss Repair},
  year = {2026},
  url = {https://huggingface.co/Lucabr01/Zero-Ping}
}
```