---
license: mit
pipeline_tag: audio-classification
---

# Model Card for SW2V (60k)

SW2V is a pure Transformer decoder-based speech representation model introduced in the paper Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec.

This specific checkpoint (60k) was trained via distillation of W2V-BERT 2.0.

## Model Details

### Model Description

SW2V (Streaming wav2vec) is designed for high-intelligibility and low-latency speech representation. It utilizes Self-Supervised Representation Reconstruction (SSRR) loss, which fundamentally improves codec training by reconstructing distilled self-supervised representations from codec outputs.
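The exact form of the SSRR loss is defined in the paper; as a rough illustrative sketch only, the idea is to penalize the distance between representations reconstructed from codec outputs and distilled teacher representations (e.g. from W2V-BERT 2.0). The L1 distance and the array shapes below are assumptions for illustration, not the paper's actual formulation:

```python
import numpy as np

def ssrr_loss(codec_features: np.ndarray, teacher_features: np.ndarray) -> float:
    """Illustrative SSRR-style objective: mean absolute distance between
    representations reconstructed from the codec and distilled teacher
    representations. The paper's actual distance may differ."""
    assert codec_features.shape == teacher_features.shape
    return float(np.mean(np.abs(codec_features - teacher_features)))

# Toy example with (frames, feature_dim) representations.
rng = np.random.default_rng(0)
teacher = rng.standard_normal((50, 768))       # stand-in for teacher features
codec_out = teacher + 0.1 * rng.standard_normal((50, 768))  # noisy reconstruction
loss = ssrr_loss(codec_out, teacher)
```

Minimizing this distance pushes the codec to preserve the information the self-supervised teacher encodes, rather than raw waveform detail alone.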

Flash-Attention is required to run this model with optimal performance.

## Uses

JHCodec and the SW2V extractor can be used for research and for practical applications that require lossy audio compression or high-quality speech representations.

### Intended Use

  • Real-time low-latency audio codecs for speech-to-speech models
  • Research into neural codecs and generative modeling
  • Preprocessing for downstream speech and audio ML models (e.g., ASR or TTS)

### Sample Usage

The following snippet from the official repository shows how to load data with the `AudioDataset` class:

```python
from jhcodec.dataloader import AudioDataset, collate_fn
from torch.utils.data import DataLoader

dataset = AudioDataset(
    audio_dir='./data',                  # Path to your data
    sample_rate=16000,
    segment_duration=10.24,
    training=True,
    init_dataset=False,                  # True to scan files initially (slow); False to load from cache
    cache_dir='cache_dir/dataloader/v9', # Location of the cache
    use_mel=False,                       # True to additionally return Mel features
)
```
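With the parameters above, each training item is `segment_duration * sample_rate = 10.24 s × 16 000 Hz = 163 840` samples. The following is a minimal sketch of the fixed-length crop/pad behavior such a dataset typically provides; it is not the `jhcodec` implementation itself:

```python
import numpy as np

SAMPLE_RATE = 16000
SEGMENT_DURATION = 10.24
SEGMENT_SAMPLES = int(SEGMENT_DURATION * SAMPLE_RATE)  # 163840 samples

def fixed_length_segment(audio: np.ndarray, length: int = SEGMENT_SAMPLES) -> np.ndarray:
    """Crop over-length clips to `length` samples and zero-pad short ones,
    so every training item has the same duration."""
    if len(audio) >= length:
        return audio[:length]
    return np.pad(audio, (0, length - len(audio)))

short = np.ones(SAMPLE_RATE)               # 1-second clip -> zero-padded
long_ = np.ones(SEGMENT_SAMPLES + 5)       # over-length clip -> cropped
assert fixed_length_segment(short).shape == (SEGMENT_SAMPLES,)
assert fixed_length_segment(long_).shape == (SEGMENT_SAMPLES,)
```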

## Citation

```bibtex
@article{ssrr_codec2026,
  title={Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec},
  author={Anonymous},
  journal={arXiv preprint arXiv:2603.05887},
  year={2026}
}
```