Audio-to-Audio · Transformers · Safetensors · speech_language_model

avishai-elmakies committed (verified) · Commit 922bb58 · 1 Parent(s): 47ac107

Update README.md

Files changed (1): README.md (+87 −155)

README.md CHANGED
---
base_model:
- Qwen/Qwen2.5-1.5B
datasets:
- openslr/librispeech_asr
- slprl/SpokenSwag
- slprl/sTinyStories
library_name: transformers
license: mit
pipeline_tag: audio-to-audio
---

# Slamming: Training a Speech Language Model on One GPU in a Day

The model was presented in the paper [Slamming: Training a Speech Language Model on One GPU in a Day](https://arxiv.org/abs/2502.15814).

# Paper abstract

We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data and tweaking all other components. We empirically demonstrate that this training recipe also scales well with more compute, achieving results on par with leading SLMs at a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform the predicted compute-optimal performance, giving an optimistic view of SLM feasibility. See code, data, models and samples at https://pages.cs.huji.ac.il/adiyoss-lab/slamming.

# Model Card
This is a Speech Language Model (SLM) trained to generate speech continuations over discrete [HuBERT tokens](https://huggingface.co/slprl/mhubert-base-25hz).
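At a high level the model sits in a three-stage audio-to-audio pipeline: encode the prompt audio into discrete units, extend the unit sequence with the SLM, then vocode the units back into audio. The sketch below is purely illustrative; every function name and unit id is an invented stand-in, not the SlamKit API:

```python
# Illustrative sketch of the audio-to-audio flow around this SLM.
# All names and unit ids are hypothetical stand-ins, NOT the SlamKit API.

def encode_to_units(audio_path):
    """Stage 1 (assumed): a HuBERT encoder plus k-means quantiser maps the
    prompt waveform to discrete unit ids from a 500-unit vocabulary."""
    return [103, 7, 42, 9]  # stand-in output for illustration

def continue_units(units, n_new=4):
    """Stage 2 (assumed): the SLM autoregressively extends the unit sequence.
    A trivial deterministic rule stands in for the real model call."""
    return units + [(u * 7 + 1) % 500 for u in units[-n_new:]]

prompt_units = encode_to_units("prompt.wav")
full_sequence = continue_units(prompt_units)
# Stage 3 (assumed): a unit-to-speech vocoder would synthesise
# `full_sequence` back into a waveform.
```

See the official [SlamKit repository](https://github.com/slp-rl/slamkit) for the actual tokenisation and generation entry points.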

## Model Details

### Model Description
This Speech Language Model, introduced in ["_Slamming_: Training a Speech Language Model on One GPU in a Day"](https://arxiv.org/abs/2502.15814), focuses on efficient training.
It was fine-tuned from [Qwen/Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) over a vocabulary of 500 speech tokens extracted from the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).

The model was pre-trained using next-token prediction on a subset of LibriSpeech, Libri-Light and the synthetic dataset [sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories). It was subsequently fine-tuned with DPO on [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

- **Developed by:** [SLP-RL](https://huggingface.co/slprl)
- **Model type:** SpeechLM
- **License:** MIT
- **Finetuned from model:** [Qwen/Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B)

### Model Sources

- **Repository:** [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
- **Paper:** [https://arxiv.org/abs/2502.15814](https://arxiv.org/abs/2502.15814)
- **Demo:** [https://pages.cs.huji.ac.il/adiyoss-lab/slamming/](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)

## Uses
This base SpeechLM can be used to generate continuations for speech segments, or as a base for further tuning. See the _SlamKit_ [codebase](https://github.com/slp-rl/slamkit) for more details on usage, and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) for some generation examples.

### Out-of-Scope Use
This model was trained on curated speech datasets which contain mainly audiobooks and stories; as such, its outputs should not be treated as factual in any way.

## How to Get Started with the Model
We refer users to the official repository for full usage explanations: [github](https://github.com/slp-rl/slamkit).

## Training Details
We highly encourage users to read the full [paper](https://arxiv.org/abs/2502.15814) for complete training details; a brief overview is provided below.

### Training Data
This model was pre-trained on a subset of the [LibriSpeech](https://huggingface.co/datasets/openslr/librispeech_asr) train split, [Libri-Light](https://ai.meta.com/tools/libri-light/) and the synthetic dataset [sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories). It was then trained with DPO on the synthetic dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

### Training Procedure
This model was trained with next-token prediction over several datasets, and then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag). Please refer to the [paper](https://arxiv.org/abs/2502.15814) or [code](https://github.com/slp-rl/slamkit) for the full training recipes.
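For intuition, DPO optimises the policy to rank a preferred continuation above a rejected one relative to a frozen reference model. Below is a minimal numeric sketch of the standard DPO loss, with made-up log-probabilities; it is not the authors' training code:

```python
import math

def dpo_loss(chosen_logp, rejected_logp, ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss: -log(sigmoid(beta * margin)), where the margin is the
    policy's log-ratio advantage of the chosen continuation over the rejected
    one, measured against a frozen reference model."""
    margin = (chosen_logp - ref_chosen_logp) - (rejected_logp - ref_rejected_logp)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Made-up sequence log-probabilities, for illustration only:
loss_good = dpo_loss(-50.0, -80.0, -55.0, -75.0)  # policy ranks chosen higher
loss_bad = dpo_loss(-80.0, -50.0, -75.0, -55.0)   # policy ranks rejected higher
# The loss is lower when the policy prefers the chosen continuation.
```

In practice the log-probabilities are summed over the speech-token sequence for each of the two continuations.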

#### Preprocessing
Speech tokens are extracted from the audio using [HuBERT-25hz](https://huggingface.co/slprl/mhubert-base-25hz) and quantised using the official k-means released with the model in [textlesslib](https://github.com/facebookresearch/textlesslib/tree/main). Consecutive repeated units are de-duplicated. We encourage you to explore the official repository for full details: [github](https://github.com/slp-rl/slamkit).
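As a toy illustration of the quantise-then-deduplicate step (the features and centroids below are made up; real units come from mhubert-25hz layer 11 and the officially released k-means in textlesslib):

```python
# Toy sketch of unit extraction: nearest-centroid quantisation of frame-level
# features, then collapsing consecutive duplicates. Numbers are invented;
# this is not the textlesslib implementation.

def quantise(frames, centroids):
    """Assign each feature frame the id of its nearest centroid (k-means inference)."""
    def nearest(frame):
        dists = [sum((f - c) ** 2 for f, c in zip(frame, cent)) for cent in centroids]
        return dists.index(min(dists))
    return [nearest(f) for f in frames]

def deduplicate(units):
    """Collapse runs of identical units before language modelling."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
frames = [(0.1, 0.0), (0.05, 0.1), (0.9, 1.1), (1.0, 0.9), (0.1, 0.95)]
units = quantise(frames, centroids)  # [0, 0, 1, 1, 2]
tokens = deduplicate(units)          # [0, 1, 2]
```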
 
 
 
 
## Evaluation
The paper provides full results; we give some results here and also refer to the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) to listen to some samples.

| Model | GPUs | Params | Num Tokens | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
|-------|------|--------|------------|----------|---------------|---------------|----------|-------------|
| **Speech only pre-training** | | | | | | | | |
| GSLM | 8×V100 | 100M | 1B | 54.2 | 53.3 | 66.6 | — | — |
| SyllableLM | 4×A40 | 300M | 16B | 63.7 | — | 75.4 | — | — |
| TWIST-350M | 8×V100 | 305M | 10.8B | 56.2 | — | — | 137.3 | 3.46 |
| TWIST-1.3B | 32×V100 | 1B | 10.8B | 57.0 | 52.4 | 70.6 | 131.8 | 3.20 |
| TWIST-7B | 32×V100 | 7B | 36B | 59.0 | 55.3 | 74.1 | 93.74 | 3.06 |
| TWIST-13B | 32×V100 | 13B | 36B | 59.2 | 55.4 | 76.4 | — | — |
| Scaled Optimal | — | 823M | 82B | **61.3** | 56.7 | 78.0 | — | — |
| Moshi | ?×H100 | 7B | ? | 58.9 | **58.7** | **81.8** | — | — |
| SpiritLM | 64×A100 | 7B | 100B | 58.0 | 54.8 | 72.9 | — | — |
| **With text / preference optimization** | | | | | | | | |
| Scaling Interleaving | — | 9B | ~1T | — | **62.4** | 82.9 | — | — |
| Moshi | ?×H100 | 7B | ~720B | 58.8 | 60.8 | 83.0 | — | — |
| SpiritLM | 64×A100 | 7B | 100B | 58.3 | 61.0 | 82.9 | — | — |
| AlignSLM-1.3B | 64×A100 | 1B | 10.8B + ~158B | 59.8 | 55.0 | 80.0 | — | — |
| AlignSLM-7B | 64×A100 | 7B | 36B + ~158B | **62.3** | 61.1 | **86.8** | — | — |
| **Ours (_Slam_)** | | | | | | | | |
| _Slam_ (-DPO) | 2×A100 | 358M | 16.7B | 58.53 | 58.15 | 80.71 | 67.3 | 3.25 |
| _Slam_ | 1×A5000 | 358M | 1.4B + 5M | 58.86 | 58.04 | 82.04 | 62.8 | 3.88 |
| _Slam_ (scaled) | 2×A100 | 358M | 16.7B + 9M | 61.11 | 61.30 | 84.18 | 46.6 | 3.75 |
| _Slam_ (large) | 2×A100 | 1.3B | 6.1B + 9M | **61.43** | **61.52** | **85.30** | **41.2** | 3.89 |
### Compute Infrastructure
This model was trained as part of ["*Slamming*: Training a Speech Language Model on One GPU in a Day"](https://arxiv.org/abs/2502.15814), focusing on efficient training.
 
#### Hardware
This model was trained using **only 2 Nvidia A100 GPUs** for **48 hours**.
 
#### Software
The model was trained using the [*SlamKit*](https://github.com/slp-rl/slamkit) codebase, which builds upon 🤗transformers, extending it to support easy and efficient training of Speech Language Models.

## Citation

**BibTeX:**
```bibtex
@misc{maimon2025slamming,
      title={Slamming: Training a Speech Language Model on One GPU in a Day},
      author={Gallil Maimon and Avishai Elmakies and Yossi Adi},
      year={2025},
      eprint={2502.15814},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.15814},
}
```