Add pipeline tag and link to paper
#1
by
nielsr HF Staff - opened
README.md CHANGED

@@ -1,46 +1,71 @@
- ---
- license: mit
- ---
-
-
-
-
- SW2V is a pure Transformer decoder
-
-
-
- - **
-
-
-
- ##
-
-
-
-
-
-
-
-
-
-
- - Real-time low-latency audio codecs for speech-to-speech models
-
-
-
- ### Out-of-Scope Use
-
- - Any malicious, deceptive, or privacy-violating applications
-
- ## How to Get Started
-
- For programmatic usage, please refer to the [GitHub repository](https://github.com/jhcodec843/jhcodec) for installation
-
- ##
-
-
-
-
-
-
-
---
license: mit
pipeline_tag: audio-classification
---

# Model Card for SW2V-120k

SW2V (Streaming Speech-to-Vector) is a pure Transformer decoder-based speech representation model. This checkpoint (120k) is trained with noise augmentation to improve robustness in real-world speech applications.

The model was introduced in the paper [Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec](https://huggingface.co/papers/2603.05887).

- **GitHub Repository:** [https://github.com/jhcodec843/jhcodec](https://github.com/jhcodec843/jhcodec)
- **Demo:** [https://jhcodec843.github.io/jhcodec/](https://jhcodec843.github.io/jhcodec/)
- **License:** MIT

## Model Details

### Model Description

SW2V-120k is a streaming speech representation extractor trained by distilling [W2V-Bert-2.0](https://huggingface.co/facebook/w2v-bert-2.0). Its self-supervised representation reconstruction (SSRR) loss improves codec training, preserving intelligibility and content with zero lookahead. This variant adds noise augmentation during training for better performance in noisy environments.
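To make the training objective concrete, here is a minimal, illustrative sketch of an SSRR-style loss in plain Python. This is not the official implementation: the real loss operates on teacher and reconstructed W2V-Bert feature tensors, and the function name and mean-squared-error form below are assumptions for illustration only.

```python
# Conceptual sketch only -- NOT the official SSRR implementation.
# Idea: instead of matching waveforms directly, minimize the distance between
# teacher features (from W2V-Bert-2.0 on the input speech) and features
# re-extracted from the codec's reconstructed audio.
def ssrr_loss(teacher_feats, reconstructed_feats):
    """Mean squared error between two equal-length feature sequences."""
    assert len(teacher_feats) == len(reconstructed_feats)
    diffs = [(t - r) ** 2 for t, r in zip(teacher_feats, reconstructed_feats)]
    return sum(diffs) / len(diffs)

# Identical features give zero loss.
ssrr_loss([0.5, 1.0, 1.5], [0.5, 1.0, 1.5])  # -> 0.0
```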

**Note:** Flash-Attention is required for optimal performance.

## Uses

JHCodec and SW2V can be used for research and practical applications requiring:
- Real-time low-latency audio codecs for speech-to-speech models.
- Neural front-ends for speech recognition or synthesis pipelines.
- Lossy audio compression and speech representation extraction.

### Out-of-Scope Use

- Any malicious, deceptive, or privacy-violating applications.

## How to Get Started

For programmatic usage, please refer to the [GitHub repository](https://github.com/jhcodec843/jhcodec) for installation and environment setup.

### Sample Usage

You can use the `AudioDataset` class from the official implementation to load data for the model:

```python
from jhcodec.dataloader import AudioDataset, collate_fn
from torch.utils.data import DataLoader

dataset = AudioDataset(
    audio_dir='./data',                   # path to your audio data
    sample_rate=16000,
    segment_duration=10.24,
    training=True,
    init_dataset=False,                   # True: scan files initially (slow); False: load from cache
    cache_dir='cache_dir/dataloader/v9',  # location of the cache
    use_mel=False,                        # True: also return Mel features
)
```
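The snippet above imports `collate_fn` but does not use it; it is typically passed to a `torch.utils.data.DataLoader` so that variable-length clips can be batched together. Its exact behavior is defined in the repository; conceptually, a padding collate looks like this self-contained sketch (the function name and zero-padding policy are illustrative assumptions, not the official code):

```python
# Illustrative only: the official collate_fn lives in jhcodec.dataloader.
# A typical audio collate pads every clip in the batch to the longest length
# so the clips can be stacked into one rectangular batch.
def pad_collate(batch):
    """batch: list of 1-D clips (lists of float samples) of varying length."""
    max_len = max(len(clip) for clip in batch)
    return [clip + [0.0] * (max_len - len(clip)) for clip in batch]

padded = pad_collate([[0.1, 0.2, 0.3], [0.4]])
# every row now has length 3; the shorter clip is zero-padded at the end
```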

## Citation

```bibtex
@article{sw2v2026ssrr,
  title={Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec},
  author={Anonymous},
  journal={arXiv preprint arXiv:2603.05887},
  year={2026}
}
```

## Authors

Anonymous. Submitted to Interspeech 2026.