JEPA-Q2D2: A Low-Bitrate Speech Codec with Emergent Cross-Lingual Structure

A 1.6 kbps neural speech codec that pairs a JEPA (Joint-Embedding Predictive Architecture) encoder with Q2D2, a geometry-aware 2-D rhombic-lattice quantizer, and a HiFi-GAN decoder. The codec objective is trained without adversarial losses; a frozen WavLM perceptual loss is used during decoder training.

This repository accompanies the APSIPA ASC 2026 paper "JEPA-Q2D2: A Low-Bitrate Speech Codec with Emergent Cross-Lingual Structure." Code: https://github.com/anant-004/jepa-q2d2

Models

Subfolder	Operating point	Quantizer	Bitrate	Reported quality
`jepa-q2d2-cd64-12.5hz`	12.5 Hz, code-dim 64 (main model)	Q2D2	1.6 kbps (100 tok/s)	PESQ 2.53, ESTOI 0.80
`jepa-q2d2-sigreg-cd32-25hz`	25 Hz, code-dim 32, SIGReg co-design	Q2D2	1.6 kbps (100 tok/s)	ESTOI 0.79
`teacher-cd128-fsq-12.5hz`	12.5 Hz, code-dim 128 (teacher)	FSQ	~2.85 kbps (237.5 tok/s)	PESQ ~2.91

All metrics are on the paper's fixed 50-utterance LibriLight protocol. ESTOI is extended STOI (systematically lower than vanilla STOI; comparable only within this paper's identical pipeline).
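
The frame rates, token rates, and bitrates in the table imply a tokens-per-frame and bits-per-token budget for each model. The short sketch below only redoes that arithmetic from the table values; it treats the teacher's "~2.85 kbps" as exactly 2850 bits/s, which is an assumption, and it says nothing about how the bits are laid out inside each token.

```python
# Operating points copied from the table above:
# (frame rate in Hz, tokens per second, bitrate in bits per second).
OPERATING_POINTS = {
    "jepa-q2d2-cd64-12.5hz": (12.5, 100.0, 1600.0),
    "jepa-q2d2-sigreg-cd32-25hz": (25.0, 100.0, 1600.0),
    "teacher-cd128-fsq-12.5hz": (12.5, 237.5, 2850.0),  # "~2.85 kbps" taken as exact
}


def derived_rates(frame_rate_hz, tokens_per_s, bitrate_bps):
    """Return (tokens per frame, bits per token) implied by one table row."""
    tokens_per_frame = tokens_per_s / frame_rate_hz
    bits_per_token = bitrate_bps / tokens_per_s
    return tokens_per_frame, bits_per_token


SUMMARY = {name: derived_rates(*row) for name, row in OPERATING_POINTS.items()}

for name, (tokens_per_frame, bits_per_token) in SUMMARY.items():
    print(f"{name}: {tokens_per_frame:g} tokens/frame, {bits_per_token:g} bits/token")
```

Both 1.6 kbps models spend 16 bits per token; they differ in how many tokens they emit per frame (8 at 12.5 Hz versus 4 at 25 Hz).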

Main codec (cd64, 12.5 Hz)

The headline reconstruction system. At 1.6 kbps it exceeds EnCodec at a comparable operating point (+0.91 PESQ and +0.15 ESTOI vs. EnCodec at 1.5 kbps) and beats Mimi on PESQ (+0.25) while trailing it on ESTOI. Under a matched internal ablation, Q2D2 improves over finite scalar quantization (FSQ) by +0.37 PESQ at the same bitrate.
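
To give a rough picture of what quantizing onto a 2-D rhombic lattice means, here is a minimal nearest-point sketch. It is not the Q2D2 implementation: the 60-degree angle, the unit basis length, and the function names `quantize_2d` and `dequantize_2d` are all assumptions made for illustration, and the released quantizer's actual geometry and parameters are defined in the companion repo.

```python
import math

# Assumed lattice: two unit-length basis vectors 60 degrees apart (a rhombic cell).
THETA = math.radians(60.0)
B1 = (1.0, 0.0)
B2 = (math.cos(THETA), math.sin(THETA))


def to_lattice_coords(x, y):
    """Express (x, y) as a*B1 + b*B2 and return the real-valued (a, b)."""
    b = y / B2[1]
    a = x - b * B2[0]
    return a, b


def dequantize_2d(i, j):
    """Map integer lattice indices back to a 2-D point."""
    return (i * B1[0] + j * B2[0], i * B1[1] + j * B2[1])


def quantize_2d(x, y):
    """Return integer indices (i, j) of the lattice point nearest to (x, y)."""
    a, b = to_lattice_coords(x, y)
    i0, j0 = math.floor(a), math.floor(b)
    best, best_d2 = None, float("inf")
    # For this 60-degree rhombus the nearest lattice point is always one of
    # the four corners of the cell that contains (x, y).
    for i in (i0, i0 + 1):
        for j in (j0, j0 + 1):
            px, py = dequantize_2d(i, j)
            d2 = (x - px) ** 2 + (y - py) ** 2
            if d2 < best_d2:
                best, best_d2 = (i, j), d2
    return best


print(quantize_2d(0.5, 0.8))  # snaps to the lattice point B2
```

Unlike per-coordinate rounding, the two coordinates are quantized jointly, so the decision regions are hexagons rather than axis-aligned squares.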

SIGReg co-design model (cd32, 25 Hz)

This model demonstrates the paper's central finding: at the aggressive 25 Hz / 32-dim operating point the codec collapses (ESTOI -0.004) unless the encoder's latent distribution is Gaussianized with SIGReg (lambda = 0.05), which restores normal training (ESTOI 0.79). The two checkpoints compared, with and without SIGReg, are identical except for this term.
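
As a loose illustration of what pushing a latent distribution toward a Gaussian can look like, the sketch below penalizes the mean and variance of random 1-D projections for deviating from a standard normal. This is a simplified moment-matching stand-in, not SIGReg's actual statistic, and the names `projection_moment_penalty` and `regularized_loss`, as well as the additive way the term is combined with a reconstruction loss, are assumptions. Only the weight 0.05 comes from the text above.

```python
import math
import random

LAMBDA = 0.05  # weight quoted above for the SIGReg term


def projection_moment_penalty(latents, num_directions=16, seed=0):
    """Average of mean^2 + (variance - 1)^2 over random unit directions.

    latents: list of equal-length lists of floats, one row per latent frame.
    A batch that looks like N(0, I) along every direction scores near 0.
    """
    rng = random.Random(seed)
    dim = len(latents[0])
    total = 0.0
    for _ in range(num_directions):
        # Draw a random direction and normalize it to unit length.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in direction))
        direction = [v / norm for v in direction]
        # Project every latent onto the direction and take two moments.
        proj = [sum(a * b for a, b in zip(row, direction)) for row in latents]
        mean = sum(proj) / len(proj)
        var = sum((p - mean) ** 2 for p in proj) / len(proj)
        total += mean ** 2 + (var - 1.0) ** 2
    return total / num_directions


def regularized_loss(recon_loss, latents, weight=LAMBDA):
    """Hypothetical combination: reconstruction loss plus weighted penalty."""
    return recon_loss + weight * projection_moment_penalty(latents)


rng = random.Random(1)
gaussian_batch = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2000)]
constant_batch = [[0.0] * 4 for _ in range(10)]  # extreme degenerate case
print(projection_moment_penalty(gaussian_batch))
print(projection_moment_penalty(constant_batch))
```

The constant batch is only an extreme example chosen to make the penalty easy to read; it is not a claim about what the collapsed codec's latents look like.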

Teacher codec (cd128, FSQ, 12.5 Hz)

The original higher-rate codec (code dim 128, FSQ quantizer, 237.5 tok/s, ~2.85 kbps, PESQ ~2.91). It is architecturally distinct from the Q2D2 models: it uses finite scalar quantization rather than the Q2D2 lattice. It is released because it serves as the distillation teacher for the 1.6 kbps cd64 student and is a useful higher-quality reference point.
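
For readers unfamiliar with finite scalar quantization, the general idea can be sketched as follows: each latent coordinate is bounded, scaled to a small number of levels, and rounded independently. The level counts below are illustrative only (the teacher's actual levels are not stated here), the sketch handles odd level counts only, and the straight-through gradient used during training is omitted.

```python
import math


def fsq_quantize(z, levels):
    """Round each coordinate of z to one of levels[k] integer values.

    Assumes every entry of levels is odd, so the codes are symmetric
    integers, e.g. 5 levels -> {-2, -1, 0, 1, 2}.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2
        bounded = math.tanh(value) * half  # squash into (-half, half)
        codes.append(int(round(bounded)))
    return codes


def fsq_index(codes, levels):
    """Pack per-coordinate codes into a single integer token (mixed radix)."""
    index = 0
    for code, n_levels in zip(codes, levels):
        index = index * n_levels + (code + (n_levels - 1) // 2)
    return index


def fsq_bits(levels):
    """Bits carried by one token with the given per-coordinate level counts."""
    return sum(math.log2(n) for n in levels)


LEVELS = [5, 5, 5]  # illustrative, not the teacher's configuration
codes = fsq_quantize([0.0, 10.0, -10.0], LEVELS)
print(codes, fsq_index(codes, LEVELS), round(fsq_bits(LEVELS), 3))
```

Because each coordinate is rounded on its own, the implicit codebook is an axis-aligned grid, which is the contrast the text draws with the Q2D2 lattice.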

Each model folder contains

pytorch_model.pt: inference checkpoint (ckpt["state_dict"] = full encoder + quantizer (Q2D2, or FSQ for the teacher) + HiFi-GAN decoder; optimizer and discriminator state stripped).
model.safetensors: the same weights in safetensors format.
config.json: strides, code dim, frame rate, bitrate, sample rate (24 kHz), training step, and the checkpoint's eval metric.
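
The sample rate and frame rate in config.json fix how many audio samples each latent frame covers, which gives a quick consistency check on the strides. The sketch below assumes the encoder's total downsampling equals the product of its strides and that a partial final frame is rounded up; both are assumptions, and the stride list shown is a made-up example rather than a value read from the released config.

```python
import math

SAMPLE_RATE = 24_000  # Hz, as stated above


def hop_samples(frame_rate_hz, sample_rate=SAMPLE_RATE):
    """Audio samples covered by one latent frame."""
    return sample_rate / frame_rate_hz


def num_frames(num_samples, frame_rate_hz, sample_rate=SAMPLE_RATE):
    """Frames needed to cover a clip, rounding a partial frame up (assumed)."""
    return math.ceil(num_samples / hop_samples(frame_rate_hz, sample_rate))


def strides_match(strides, frame_rate_hz, sample_rate=SAMPLE_RATE):
    """True if the stride product equals the hop size (assumed relationship)."""
    return math.prod(strides) == hop_samples(frame_rate_hz, sample_rate)


print(hop_samples(12.5))                    # samples per frame at 12.5 Hz
print(hop_samples(25.0))                    # samples per frame at 25 Hz
print(num_frames(10 * SAMPLE_RATE, 12.5))   # frames in a 10-second clip
print(strides_match([4, 5, 6, 16], 12.5))   # hypothetical stride list
```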

Usage

import torch
# model definition lives in the companion repo:
#   git clone https://github.com/anant-004/jepa-q2d2
from koe.fast.benchmark_codecs import build_v2_model   # see repo for exact entry point

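# weights_only=False unpickles arbitrary Python objects: only load checkpoints you trust.
ckpt = torch.load("jepa-q2d2-cd64-12.5hz/pytorch_model.pt",
                  map_location="cpu", weights_only=False)
model = build_v2_model(ckpt["config"])
# strict=False ignores missing or unexpected keys instead of raising;
# the call returns them as missing_keys / unexpected_keys if you want to check.
model.load_state_dict(ckpt["state_dict"], strict=False)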
model.eval()

# wav: float tensor of shape (1, T), mono @ 24 kHz (load and resample your audio first)
tokens = model.encode(wav)        # discrete Q2D2 tokens
recon  = model.decode(tokens)     # reconstructed 24 kHz waveform

See the GitHub repo for runnable encode/decode scripts and the evaluation harness.

Training data

LibriLight (English read speech), 24 kHz. The codec is an English-trained reconstruction system; the cross-lingual structure reported in the paper is an emergent property of the JEPA encoder features, evaluated zero-shot on FLEURS.

Citation

@inproceedings{shukla2026jepaq2d2,
  title     = {JEPA-Q2D2: A Low-Bitrate Speech Codec with Emergent Cross-Lingual Structure},
  author    = {Shukla, Anant and Anand, Aman and Shakya, Suryansh and Bharti, Vatsal},
  booktitle = {Proc. APSIPA ASC},
  year      = {2026},
}

License

CC-BY-4.0. Weights are released for research use.
