---
license: mit
pipeline_tag: audio-classification
---

# Model Card for SW2V (60k)

SW2V is a pure Transformer decoder-based speech representation model introduced in the paper Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec.

This specific checkpoint (60k) was trained via distillation of W2V-BERT 2.0.

## Model Details

### Model Description

SW2V (Streaming wav2vec) is designed for high-intelligibility and low-latency speech representation. It utilizes Self-Supervised Representation Reconstruction (SSRR) loss, which fundamentally improves codec training by reconstructing distilled self-supervised representations from codec outputs.
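The exact form of the SSRR loss is defined in the paper; as a rough illustrative sketch only, the idea is to penalize the distance between representations reconstructed from codec outputs and distilled teacher representations (e.g. from W2V-BERT 2.0). The L1 distance and the array shapes below are assumptions for illustration, not the paper's actual formulation:

```python
import numpy as np

def ssrr_loss(codec_features: np.ndarray, teacher_features: np.ndarray) -> float:
    """Illustrative SSRR-style objective: mean absolute distance between
    representations reconstructed from the codec and distilled teacher
    representations. The paper's actual distance may differ."""
    assert codec_features.shape == teacher_features.shape
    return float(np.mean(np.abs(codec_features - teacher_features)))

# Toy example with (frames, feature_dim) representations.
rng = np.random.default_rng(0)
teacher = rng.standard_normal((50, 768))       # stand-in for teacher features
codec_out = teacher + 0.1 * rng.standard_normal((50, 768))  # noisy reconstruction
loss = ssrr_loss(codec_out, teacher)
```

Minimizing this distance pushes the codec to preserve the information the self-supervised teacher encodes, rather than raw waveform detail alone.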

Flash-Attention is required to run this model with optimal performance.

## Uses

JHCodec and the SW2V extractor can be used for research and for practical applications that require lossy audio compression or high-quality speech representations.

### Intended Use

  • Real-time low-latency audio codecs for speech-to-speech models
  • Research into neural codecs and generative modeling
  • Preprocessing for downstream speech and audio ML models (e.g., ASR or TTS)

### Sample Usage

The following snippet from the official repository shows how to load data with the `AudioDataset` class:

```python
from jhcodec.dataloader import AudioDataset, collate_fn
from torch.utils.data import DataLoader

dataset = AudioDataset(
    audio_dir='./data',                  # Path to your data
    sample_rate=16000,
    segment_duration=10.24,
    training=True,
    init_dataset=False,                  # True to scan files initially (slow); False to load from cache
    cache_dir='cache_dir/dataloader/v9', # Location of the cache
    use_mel=False,                       # True to additionally return Mel features
)
```
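With the parameters above, each training item is `segment_duration * sample_rate = 10.24 s × 16 000 Hz = 163 840` samples. The following is a minimal sketch of the fixed-length crop/pad behavior such a dataset typically provides; it is not the `jhcodec` implementation itself:

```python
import numpy as np

SAMPLE_RATE = 16000
SEGMENT_DURATION = 10.24
SEGMENT_SAMPLES = int(SEGMENT_DURATION * SAMPLE_RATE)  # 163840 samples

def fixed_length_segment(audio: np.ndarray, length: int = SEGMENT_SAMPLES) -> np.ndarray:
    """Crop over-length clips to `length` samples and zero-pad short ones,
    so every training item has the same duration."""
    if len(audio) >= length:
        return audio[:length]
    return np.pad(audio, (0, length - len(audio)))

short = np.ones(SAMPLE_RATE)               # 1-second clip -> zero-padded
long_ = np.ones(SEGMENT_SAMPLES + 5)       # over-length clip -> cropped
assert fixed_length_segment(short).shape == (SEGMENT_SAMPLES,)
assert fixed_length_segment(long_).shape == (SEGMENT_SAMPLES,)
```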

## Citation

```bibtex
@article{ssrr_codec2026,
  title={Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec},
  author={Anonymous},
  journal={arXiv preprint arXiv:2603.05887},
  year={2026}
}
```