
	---
	language: en
	tags:
	- audio
	- speech
	- codec
	- neural-codec
	- packet-loss
	- pytorch
	license: mit
	---

	# Zero-Ping

	Neural speech codec (16 kHz) with built-in packet-loss repair via a local masked-attention transformer. Designed for real-time VoIP / WebRTC applications, where packets can be lost in transit.

	| Property | Value |
	|---|---|
	| Sample rate | 16 kHz mono |
	| Bitrate | ~6 kbps (9 RVQ codebooks × 1024 entries) |
	| Frame size | 15 ms (hop = 240 samples) |
	| Latency | ~30 ms algorithmic (2 future frames) |
	| Parameters | 17.8 M |
	| Best val STOI | 0.94 |
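
The frame size, bitrate, and latency figures in the table follow from the hop length and the codebook layout. The arithmetic below is a sketch that assumes each codebook index costs log2(1024) = 10 bits and that no entropy coding or packet overhead is counted; the constant and function names are illustrative and not part of `zpcodec`.

```python
import math

SAMPLE_RATE_HZ = 16_000    # from the table
HOP_SAMPLES = 240          # from the table
NUM_CODEBOOKS = 9          # RVQ codebooks, from the table
CODEBOOK_SIZE = 1024       # entries per codebook, from the table
LOOKAHEAD_FRAMES = 2       # future frames, from the table


def codec_budget():
    """Derive frame duration, bitrate, and lookahead from the table values."""
    frame_ms = 1000 * HOP_SAMPLES / SAMPLE_RATE_HZ
    # Assumption: every index is sent as a fixed-width log2(size)-bit field.
    bits_per_frame = NUM_CODEBOOKS * math.log2(CODEBOOK_SIZE)
    bitrate_kbps = bits_per_frame * SAMPLE_RATE_HZ / HOP_SAMPLES / 1000
    lookahead_ms = LOOKAHEAD_FRAMES * frame_ms
    return {
        "frame_ms": frame_ms,
        "bits_per_frame": bits_per_frame,
        "bitrate_kbps": bitrate_kbps,
        "lookahead_ms": lookahead_ms,
    }


budget = codec_budget()
for name, value in budget.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions each 15 ms frame carries 90 bits, which works out to 6 kbps, and two frames of lookahead account for the 30 ms algorithmic latency.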

	Training data: LibriTTS train-clean-100 + VCTK (~700 h).

	## Install

	```bash
	git clone https://huggingface.co/Lucabr01/Zero-Ping
	cd Zero-Ping
	pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
	pip install vector-quantize-pytorch einops huggingface_hub
	pip install -e .
	```

	## Usage

	```python
	import torch, torchaudio
	from zpcodec import ZPCodec, GilbertElliottConfig, GilbertElliottSimulator

	# Load model (downloads weights automatically on first run)
	model = ZPCodec.from_pretrained("Lucabr01/Zero-Ping", device="cpu")

	# Load audio (must be 16 kHz mono)
	wav, sr = torchaudio.load("speech.wav")
	if sr != 16000:
	    wav = torchaudio.functional.resample(wav, sr, 16000)
	wav = wav.mean(0, keepdim=True).unsqueeze(0) # [1, 1, T]

	with torch.no_grad():
	    # Encode → decode (clean, no packet loss)
	    z_q, indices = model.encode(wav)
	    wav_clean = model.decode(z_q)

	    # Simulate bursty packet loss with a Gilbert-Elliott channel, then repair
	    cfg = GilbertElliottConfig(p=0.05, r=0.5, k=0.999, h=0.5)
	    sim = GilbertElliottSimulator(cfg, sample_rate=16000, hop_length=model.hop_length)
	    mask = sim.sample_frame_mask(1, z_q.shape[-1])  # one mask entry per latent frame
	    wav_repaired = model.decode(z_q, frame_mask=mask)

	torchaudio.save("clean.wav", wav_clean.squeeze(0), 16000)
	torchaudio.save("repaired.wav", wav_repaired.squeeze(0), 16000)
	```

	## Architecture

	Three-stage training:
	1. Codec pre-training (GAN + multi-scale mel + waveform + STFT losses)
	2. Repair transformer training (frozen codec, latent L1 on missing frames only)
	3. Joint fine-tuning (all modules, Gilbert-Elliott curriculum from mild to severe loss)
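
Stage 2's "latent L1 on missing frames only" can be read as an L1 loss averaged over the frames that were dropped, with received frames ignored. The sketch below illustrates that reduction in plain Python so it runs without PyTorch. The function name and the `lost` convention (True = frame missing) are assumptions for illustration and are not taken from the `zpcodec` source.

```python
def masked_latent_l1(pred, target, lost):
    """Mean absolute error over latent frames flagged as lost.

    pred, target: sequences of frames, each frame a sequence of floats.
    lost: sequence of bools, True where the frame was dropped (assumed convention).
    """
    total, count = 0.0, 0
    for pred_frame, target_frame, is_lost in zip(pred, target, lost):
        if not is_lost:
            continue  # received frames contribute nothing to the loss
        total += sum(abs(p - t) for p, t in zip(pred_frame, target_frame))
        count += len(target_frame)
    # With no lost frames there is nothing to repair, so the loss is zero.
    return total / count if count else 0.0


# Toy example: three 2-dimensional latent frames, only the middle one lost.
target = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pred = [[0.0, 0.0], [3.5, 3.0], [5.0, 6.0]]
lost = [False, True, False]
loss = masked_latent_l1(pred, target, lost)
print(loss)
```

Only the middle frame enters the average here, so the large error on the first (received) frame does not affect the result. In a tensor framework the same reduction would typically be a masked mean over the time axis.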

	The `GilbertElliottConfig` parameters let you tune the simulated channel:
	- `p`: probability of entering the Bad state (higher = more frequent bursts)
	- `r`: probability of leaving the Bad state (higher = shorter bursts)
	- `k`: P(no loss | Good state); the usage example sets 0.999
	- `h`: P(no loss | Bad state), default 0.5
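
To get a feel for how these parameters translate into an average loss rate, the sketch below implements a generic two-state Gilbert-Elliott channel with the standard library only. It is not the `GilbertElliottSimulator` shipped with `zpcodec`. It assumes `k` is P(no loss | Good state), mirroring `h`, and that True in the returned list means "frame lost"; both function names are hypothetical.

```python
import random


def expected_loss_rate(p, r, k, h):
    """Long-run fraction of lost frames for a two-state Gilbert-Elliott channel."""
    pi_bad = p / (p + r)      # stationary probability of the Bad state
    pi_good = 1.0 - pi_bad
    return pi_good * (1.0 - k) + pi_bad * (1.0 - h)


def simulate_losses(p, r, k, h, n_frames, seed=0):
    """Return a list of bools, True where a frame is lost (assumed convention)."""
    rng = random.Random(seed)
    bad = False               # start in the Good state
    lost = []
    for _ in range(n_frames):
        # Move between states first, then draw the loss for this frame.
        if bad:
            bad = rng.random() >= r   # stay Bad with probability 1 - r
        else:
            bad = rng.random() < p    # enter Bad with probability p
        keep_prob = h if bad else k
        lost.append(rng.random() >= keep_prob)
    return lost


params = dict(p=0.05, r=0.5, k=0.999, h=0.5)  # values from the usage example
analytic = expected_loss_rate(**params)
trace = simulate_losses(n_frames=200_000, **params)
empirical = sum(trace) / len(trace)
print(f"analytic loss rate:  {analytic:.4f}")
print(f"empirical loss rate: {empirical:.4f}")
```

Under these assumptions the usage-example values give an average loss of roughly 4.6%, and the mean time spent in the Bad state is 1/r = 2 frames (30 ms at a 15 ms hop). Raising `p` or lowering `r` pushes the channel toward longer and more frequent bursts.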

	## Citation

	If you use Zero-Ping in your work, please cite:
	```bibtex
	@misc{zeropingcodec2026,
	  author = {Lucabr01},
	  title  = {Zero-Ping: Neural Speech Codec with Packet-Loss Repair},
	  year   = {2026},
	  url    = {https://huggingface.co/Lucabr01/Zero-Ping}
	}
	```