Update README.md
## Model Details

### Model Description

This is a Speech Language Model, introduced in ["_Slamming_: Training a Speech Language Model on One GPU in a Day"](https://arxiv.org/abs/2502.15814), focusing on efficient training.
It was fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500 speech tokens extracted from
the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).
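The tokenisation described above (continuous features quantised into 500 discrete units) can be sketched as a nearest-centroid assignment. The shapes and random arrays below are illustrative stand-ins, not the real mhubert-25hz layer-11 features or the released k-means codebook:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 50 frames of 64-dim features and a 500-entry codebook.
# (Real features are HuBERT layer-11 activations; the dims here are toy values.)
frames = rng.normal(size=(50, 64))
codebook = rng.normal(size=(500, 64))

# Each frame is assigned the id of its nearest centroid,
# turning the audio into a sequence of discrete speech tokens.
dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
speech_tokens = dists.argmin(axis=1)  # shape (50,), values in [0, 500)
```

These discrete ids are what the language model consumes in place of text tokens.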
### Model Sources

- **Repository:** [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
- **Paper:** [https://arxiv.org/abs/2502.15814](https://arxiv.org/abs/2502.15814)
- **Demo:** [Link](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)

## Uses
## Training Details

We highly encourage users to read the full [paper](https://arxiv.org/abs/2502.15814) for full training details; a brief overview is provided below.
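The first training stage is plain next-token prediction over discrete speech tokens. A minimal sketch of that objective with toy logits (illustrative numpy; the real model is a fine-tuned Qwen2.5-0.5B):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a 10-step sequence over a 500-token speech vocabulary.
vocab_size, seq_len = 500, 10
logits = rng.normal(size=(seq_len, vocab_size))    # model outputs per position
tokens = rng.integers(0, vocab_size, size=seq_len + 1)

# Next-token prediction: the output at position t scores token t+1.
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
nll = -log_probs[np.arange(seq_len), tokens[1:]].mean()  # the training loss
```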
### Training Data
### Training Procedure

This model was trained by next-token prediction over several datasets, and then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
Please refer to the [paper](https://arxiv.org/abs/2502.15814) or [code](https://github.com/slp-rl/slamkit) for the full training recipes.
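The DPO stage optimises the standard DPO objective over preferred vs. rejected continuations. A self-contained sketch for a single preference pair (the β value and the log-probabilities below are made-up toy numbers, not values from the paper):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair: chosen (w) vs. rejected (l) continuation.

    logp_*     -- sequence log-probabilities under the policy being trained
    ref_logp_* -- the same quantities under the frozen reference model
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)): shrinks as the policy prefers the chosen sample
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers where the policy already slightly prefers the chosen continuation
loss = dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.0)
```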
#### Preprocessing

Speech tokens are extracted from the audio using [Hubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz), and quantised using the
### Compute Infrastructure

This model was trained as part of ["*Slamming*: Training a Speech Language Model on One GPU in a Day"](https://arxiv.org/abs/2502.15814), focusing on efficient training.

#### Hardware

This model was trained using **only 2 Nvidia A100 GPUs** for **48 hours**.
## Citation

**BibTeX:**
```
@misc{maimon2025slamming,
      title={Slamming: Training a Speech Language Model on One GPU in a Day},
      author={Gallil Maimon and Avishai Elmakies and Yossi Adi},
      year={2025},
      eprint={2502.15814},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.15814},
}
```