Model Card for SW2V-120k

Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec

SW2V is a pure Transformer-decoder-based speech representation model, trained via distillation of W2V-Bert-2.0.

Model Details

Model Description

To enhance noise robustness for future applications, we incorporated noise augmentation during SW2V training. Flash-Attention is required to achieve the reported performance.
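Since Flash-Attention is a hard requirement, it can help to verify the package is installed before loading the model. A minimal sketch (the helper name is ours, not part of the model's API):

```python
import importlib.util

def flash_attn_available() -> bool:
    # True if the flash-attn package (pip install flash-attn) is importable.
    # Flash-Attention additionally requires a compatible CUDA GPU at runtime.
    return importlib.util.find_spec("flash_attn") is not None

if not flash_attn_available():
    print("flash-attn not found; SW2V performance is not guaranteed without it")
```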

Uses

JHCodec can be used for research and practical applications that require lossy audio compression. It is particularly well-suited for streaming speech, compressing large audio datasets, and serving as a neural front-end for speech recognition or synthesis pipelines.

Intended Use

  • Real-time low-latency audio codecs for speech-to-speech models
  • Research into neural codecs and generative modeling
  • Preprocessing for downstream speech and audio ML models

Out-of-Scope Use

  • Any malicious, deceptive, or privacy-violating applications

How to Get Started with JHCodec

For programmatic usage, please refer to the GitHub repository for installation, API documentation, and practical examples.

Training Details

Please refer to the GitHub repository README.

Authors

Anonymous. Submitted to Interspeech 2026.
