JEPA-Q2D2: A Low-Bitrate Speech Codec with Emergent Cross-Lingual Structure
A 1.6 kbps neural speech codec that pairs a JEPA (Joint-Embedding Predictive Architecture) encoder with Q2D2, a geometry-aware 2-D rhombic-lattice quantizer, and a HiFi-GAN decoder. The codec objective is trained without adversarial losses; a frozen WavLM perceptual loss is used during decoder training.
This repository accompanies the APSIPA ASC 2026 paper "JEPA-Q2D2: A Low-Bitrate Speech Codec with Emergent Cross-Lingual Structure." Code: https://github.com/anant-004/jepa-q2d2
Models
| Subfolder | Operating point | Quantizer | Bitrate | Reported quality |
|---|---|---|---|---|
| jepa-q2d2-cd64-12.5hz | 12.5 Hz, code-dim 64 (main model) | Q2D2 | 1.6 kbps (100 tok/s) | PESQ 2.53, ESTOI 0.80 |
| jepa-q2d2-sigreg-cd32-25hz | 25 Hz, code-dim 32, SIGReg co-design | Q2D2 | 1.6 kbps (100 tok/s) | ESTOI 0.79 |
| teacher-cd128-fsq-12.5hz | 12.5 Hz, code-dim 128 (teacher) | FSQ | ~2.85 kbps (237.5 tok/s) | PESQ ~2.91 |
All metrics are on the paper's fixed 50-utterance LibriLight protocol. ESTOI is extended STOI (systematically lower than vanilla STOI; comparable only within this paper's identical pipeline).
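The bitrate and token-rate figures in the table can be cross-checked with a little arithmetic. The sketch below derives bits per token and tokens per frame from those figures. The resulting per-token bit widths are implied by the table rather than stated in this card, and the function names are illustrative only.

```python
def bits_per_token(bitrate_bps: float, tokens_per_second: float) -> float:
    """Bits carried by each token at a given bitrate and token rate."""
    return bitrate_bps / tokens_per_second


def tokens_per_frame(tokens_per_second: float, frame_rate_hz: float) -> float:
    """Tokens emitted for each latent frame."""
    return tokens_per_second / frame_rate_hz


# (bitrate in bit/s, tokens per second, frame rate in Hz), copied from the table.
# The teacher bitrate is approximate ("~2.85 kbps") in the source.
models = {
    "jepa-q2d2-cd64-12.5hz": (1600.0, 100.0, 12.5),
    "jepa-q2d2-sigreg-cd32-25hz": (1600.0, 100.0, 25.0),
    "teacher-cd128-fsq-12.5hz": (2850.0, 237.5, 12.5),
}

for name, (bps, tps, hz) in models.items():
    print(f"{name}: {bits_per_token(bps, tps):.1f} bits/token, "
          f"{tokens_per_frame(tps, hz):.1f} tokens/frame")
```

Under these figures both Q2D2 models carry 16 bits per token; the 25 Hz model halves the tokens per frame (4 instead of 8) to stay at 100 tok/s.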
Main codec (cd64, 12.5 Hz)
The headline reconstruction system. At 1.6 kbps it exceeds EnCodec at a comparable operating point (+0.91 PESQ, +0.15 ESTOI vs EnCodec-1.5 kbps) and beats Mimi on PESQ (+0.25) while trailing it on ESTOI. Under a matched internal ablation, Q2D2 improves over finite scalar quantization (FSQ) by +0.37 PESQ at the same bitrate.
SIGReg co-design model (cd32, 25 Hz)
Demonstrates the paper's central finding: at the aggressive 25 Hz / 32-dim operating point the codec collapses (ESTOI -0.004) unless the encoder's latent distribution is Gaussianized with SIGReg (lambda = 0.05), which restores normal training (ESTOI 0.79). The two checkpoints compared are identical except for this term.
Teacher codec (cd128, FSQ, 12.5 Hz)
The original higher-rate codec (code dim 128, FSQ quantizer, 237.5 tok/s, ~2.85 kbps, PESQ ~2.91). Architecturally distinct from the Q2D2 models: it uses finite scalar quantization rather than the Q2D2 lattice. Released because it serves as the distillation teacher for the 1.6 kbps cd64 student and is a useful higher-quality reference point.
Each model folder contains
- pytorch_model.pt: inference checkpoint (ckpt["state_dict"] = full encoder + Q2D2 quantizer + HiFi-GAN decoder; optimizer / discriminator state stripped).
- model.safetensors: the same weights in safetensors format.
- config.json: strides, code dim, frame rate, bitrate, sample rate (24 kHz), training step, and the checkpoint's eval metric.
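Because config.json records the strides, frame rate, and sample rate together, the three can be checked against each other. The sketch below assumes the encoder downsamples only through its strided stages, so that the product of the strides equals the sample rate divided by the frame rate. The stride values shown are hypothetical placeholders, not the released configuration, and the function name is illustrative; substitute the values from the actual config.json.

```python
from math import prod
from typing import Sequence


def frame_rate_hz(sample_rate_hz: int, strides: Sequence[int]) -> float:
    """Latent frame rate implied by the total downsampling factor."""
    return sample_rate_hz / prod(strides)


# Hypothetical strides, chosen only so the total hop matches the stated rates:
# 24000 / 12.5 = 1920 samples per frame, 24000 / 25 = 960 samples per frame.
strides_12_5hz = [8, 8, 5, 6]  # product = 1920
strides_25hz = [8, 8, 5, 3]    # product = 960

print(frame_rate_hz(24_000, strides_12_5hz))
print(frame_rate_hz(24_000, strides_25hz))
```

If the value computed from a real config does not match its recorded frame rate, the assumption about how the encoder downsamples probably does not hold for that model.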
Usage
import torch
# model definition lives in the companion repo:
# git clone https://github.com/anant-004/jepa-q2d2
from koe.fast.benchmark_codecs import build_v2_model # see repo for exact entry point
ckpt = torch.load("jepa-q2d2-cd64-12.5hz/pytorch_model.pt",
                  map_location="cpu", weights_only=False)
model = build_v2_model(ckpt["config"])
model.load_state_dict(ckpt["state_dict"], strict=False)
model.eval()
# wav: (1, T) mono @ 24 kHz
tokens = model.encode(wav) # discrete Q2D2 tokens
recon = model.decode(tokens) # reconstructed 24 kHz waveform
See the GitHub repo for runnable encode/decode scripts and the evaluation harness.
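When sizing buffers or sanity-checking the output of encode, it can help to estimate how many tokens a clip should produce. The following sketch uses the main model's stated figures (24 kHz input, 12.5 Hz frames, 100 tok/s, hence 8 tokens per frame) and assumes a trailing partial frame is padded rather than dropped. The actual padding behavior and token tensor layout are defined by the repo code, and the helper name is illustrative.

```python
import math


def expected_tokens(num_samples: int,
                    sample_rate_hz: int = 24_000,
                    frame_rate_hz: float = 12.5,
                    tokens_per_frame: int = 8) -> int:
    """Estimate the token count for a clip, rounding partial frames up."""
    hop = sample_rate_hz / frame_rate_hz   # samples per latent frame (1920 here)
    frames = math.ceil(num_samples / hop)  # assumption: last partial frame is padded
    return frames * tokens_per_frame


ten_seconds = 24_000 * 10
print(expected_tokens(ten_seconds))  # 125 frames * 8 tokens per frame
```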
Training data
LibriLight (English read speech), 24 kHz. The codec is an English-trained reconstruction system; the cross-lingual structure reported in the paper is an emergent property of the JEPA encoder features, evaluated zero-shot on FLEURS.
Citation
@inproceedings{shukla2026jepaq2d2,
  title = {JEPA-Q2D2: A Low-Bitrate Speech Codec with Emergent Cross-Lingual Structure},
  author = {Shukla, Anant and Anand, Aman and Shakya, Suryansh and Bharti, Vatsal},
  booktitle = {Proc. APSIPA ASC},
  year = {2026},
}
License
CC-BY-4.0. Weights are released for research use.