Add pipeline tag and link to paper
#1
by
nielsr HF Staff - opened
README.md CHANGED

@@ -1,46 +1,71 @@
- ---
- license: mit
- ---
-
-
-
-
- SW2V is a pure Transformer decoder
-
-
-
- - **
-
-
-
- ##
-
-
-
-
-
-
-
-
-
-
- - Real-time low-latency audio codecs for speech-to-speech models
-
-
-
- ### Out-of-Scope Use
-
- - Any malicious, deceptive, or privacy-violating applications
-
- ## How to Get Started
-
- For programmatic usage, please refer to the [GitHub repository](https://github.com/jhcodec843/jhcodec) for installation
-
- ##
-
-
-
-
-
-
-
---
license: mit
pipeline_tag: audio-classification
---

# Model Card for SW2V-120k

SW2V (Streaming Speech-to-Vector) is a pure Transformer decoder-based speech representation model. This checkpoint (120k) is trained with noise augmentation to improve robustness in real-world speech applications.

The model was introduced in the paper [Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec](https://huggingface.co/papers/2603.05887).

- **GitHub Repository:** [https://github.com/jhcodec843/jhcodec](https://github.com/jhcodec843/jhcodec)
- **Demo:** [https://jhcodec843.github.io/jhcodec/](https://jhcodec843.github.io/jhcodec/)
- **License:** MIT

## Model Details

### Model Description

SW2V-120k is a streaming speech representation extractor trained by distilling [W2V-Bert-2.0](https://huggingface.co/facebook/w2v-bert-2.0). Its self-supervised representation reconstruction (SSRR) loss improves codec training, preserving intelligibility and content with zero lookahead. This variant adds noise augmentation during training for better performance in noisy environments.
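To make the training objective concrete, here is a minimal, illustrative sketch of an SSRR-style loss in plain Python. This is not the official implementation: the real loss operates on teacher and reconstructed W2V-Bert feature tensors, and the function name and mean-squared-error form below are assumptions for illustration only.

```python
# Conceptual sketch only -- NOT the official SSRR implementation.
# Idea: instead of matching waveforms directly, minimize the distance between
# teacher features (from W2V-Bert-2.0 on the input speech) and features
# re-extracted from the codec's reconstructed audio.
def ssrr_loss(teacher_feats, reconstructed_feats):
    """Mean squared error between two equal-length feature sequences."""
    assert len(teacher_feats) == len(reconstructed_feats)
    diffs = [(t - r) ** 2 for t, r in zip(teacher_feats, reconstructed_feats)]
    return sum(diffs) / len(diffs)

# Identical features give zero loss.
ssrr_loss([0.5, 1.0, 1.5], [0.5, 1.0, 1.5])  # -> 0.0
```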

**Note:** Flash-Attention is required for optimal performance.

## Uses

JHCodec and SW2V can be used for research and practical applications requiring:
- Real-time low-latency audio codecs for speech-to-speech models.
- Neural front-ends for speech recognition or synthesis pipelines.
- Lossy audio compression and speech representation extraction.

### Out-of-Scope Use

- Any malicious, deceptive, or privacy-violating applications.

## How to Get Started

For programmatic usage, please refer to the [GitHub repository](https://github.com/jhcodec843/jhcodec) for installation and environment setup.

### Sample Usage

You can use the `AudioDataset` class from the official implementation to load data for the model:

```python
from jhcodec.dataloader import AudioDataset, collate_fn
from torch.utils.data import DataLoader

dataset = AudioDataset(
    audio_dir='./data',                   # path to your audio data
    sample_rate=16000,
    segment_duration=10.24,
    training=True,
    init_dataset=False,                   # True: scan files initially (slow); False: load from cache
    cache_dir='cache_dir/dataloader/v9',  # location of the cache
    use_mel=False,                        # True: also return Mel features
)
```
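The snippet above imports `collate_fn` but does not use it; it is typically passed to a `torch.utils.data.DataLoader` so that variable-length clips can be batched together. Its exact behavior is defined in the repository; conceptually, a padding collate looks like this self-contained sketch (the function name and zero-padding policy are illustrative assumptions, not the official code):

```python
# Illustrative only: the official collate_fn lives in jhcodec.dataloader.
# A typical audio collate pads every clip in the batch to the longest length
# so the clips can be stacked into one rectangular batch.
def pad_collate(batch):
    """batch: list of 1-D clips (lists of float samples) of varying length."""
    max_len = max(len(clip) for clip in batch)
    return [clip + [0.0] * (max_len - len(clip)) for clip in batch]

padded = pad_collate([[0.1, 0.2, 0.3], [0.4]])
# every row now has length 3; the shorter clip is zero-padded at the end
```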

## Citation

```bibtex
@article{sw2v2026ssrr,
  title={Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec},
  author={Anonymous},
  journal={arXiv preprint arXiv:2603.05887},
  year={2026}
}
```

## Authors

Anonymous. Submitted to Interspeech 2026.