# Model Card for SW2V-120k
Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec
SW2V is a pure Transformer-decoder-based speech representation model, trained via distillation of W2V-Bert-2.0.
- GitHub Repository: https://github.com/jhcodec843/jhcodec
- Demo: https://jhcodec843.github.io/jhcodec/
- License: MIT
## Model Details
### Model Description
To improve noise robustness for future applications, we apply noise augmentation during SW2V training. Flash-Attention is required to reproduce the reported performance.
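The exact augmentation pipeline is not described here; as a minimal sketch of the general technique (mixing noise into a clean waveform at a target signal-to-noise ratio), one might write something like the following. The function name and SNR-based formulation are illustrative assumptions, not the actual training code:

```python
import math
import random

def add_noise_at_snr(clean, noise, snr_db):
    """Illustrative sketch of noise augmentation: mix `noise` into `clean`
    at a target SNR in dB. Not the actual SW2V training pipeline."""
    # Average power of each signal
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Choose a scale so that p_clean / (scale^2 * p_noise) == 10^(snr_db / 10)
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]
```

During training, the SNR would typically be drawn at random per utterance so the model sees a range of noise conditions.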
## Uses
JHCodec can be used for research and practical applications that require lossy audio compression. It is particularly well-suited for streaming speech, compressing large audio datasets, and serving as a neural front-end for speech recognition or synthesis pipelines.
### Intended Use
- Real-time low-latency audio codecs for speech-to-speech models
- Research into neural codecs and generative modeling
- Preprocessing for downstream speech and audio ML models
### Out-of-Scope Use
- Any malicious, deceptive, or privacy-violating applications
## How to Get Started with JHCodec
For programmatic usage, please refer to the GitHub repository for installation, API documentation, and practical examples.
## Training Details
Please refer to the GitHub repository README.
## Authors
Anonymous. Submitted to Interspeech 2026.