Model Card for SW2V-120k

Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec

SW2V is a pure Transformer-decoder-based speech representation model, trained via distillation of W2V-Bert-2.0.

Model Details

Model Description

To enhance noise robustness for future applications, we incorporated noise augmentation during SW2V training. Flash-Attention is required to achieve the reported performance.
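Since Flash-Attention is a hard requirement, it can help to verify the package is installed before loading the model. A minimal sketch (the helper name is ours, not part of the model's API):

```python
import importlib.util

def flash_attn_available() -> bool:
    # True if the flash-attn package (pip install flash-attn) is importable.
    # Flash-Attention additionally requires a compatible CUDA GPU at runtime.
    return importlib.util.find_spec("flash_attn") is not None

if not flash_attn_available():
    print("flash-attn not found; SW2V performance is not guaranteed without it")
```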

Uses

JHCodec can be used for research and practical applications that require lossy audio compression. It is particularly well-suited for streaming speech, compressing large audio datasets, and serving as a neural front-end for speech recognition or synthesis pipelines.

Intended Use

  • Real-time low-latency audio codecs for speech-to-speech models
  • Research into neural codecs and generative modeling
  • Preprocessing for downstream speech and audio ML models

Out-of-Scope Use

  • Any malicious, deceptive, or privacy-violating applications

How to Get Started with JHCodec

For programmatic usage, please refer to the GitHub repository for installation, API documentation, and practical examples.

Training Details

Please refer to the GitHub repository README.

Authors

Anonymous. Submitted to Interspeech 2026.
