---
base_model:
- Qwen/Qwen2.5-1.5B
datasets:
- openslr/librispeech_asr
- slprl/SpokenSwag
- slprl/sTinyStories
library_name: transformers
license: mit
pipeline_tag: audio-to-audio
---
|
# Slamming: Training a Speech Language Model on One GPU in a Day

The model was presented in the paper [Slamming: Training a Speech Language Model on One GPU in a Day](https://arxiv.org/abs/2502.15814).

# Paper abstract

We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data, and tweaking all other components. We empirically demonstrate that this training recipe also scales well with more compute, achieving results on par with leading SLMs at a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform the predicted compute-optimal performance, giving an optimistic view of SLM feasibility. See code, data, models and samples at https://pages.cs.huji.ac.il/adiyoss-lab/slamming.

# Model Card

This is a Speech Language Model (SLM) trained for generating speech continuations over discrete [Hubert tokens](https://huggingface.co/slprl/mhubert-base-25hz).

## Model Details

### Model Description

This Speech Language Model, introduced in ["_Slamming_: Training a Speech Language Model on One GPU in a Day"](https://arxiv.org/abs/2502.15814), focuses on efficient training.
It was fine-tuned from [Qwen/Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) over a vocabulary of 500 speech tokens extracted from
the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).

The model was pre-trained using next-token prediction on a subset of LibriSpeech, Libri-Light and the synthetic dataset
[sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories). It was subsequently fine-tuned with DPO on
[SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

- **Developed by:** [SLP-RL](https://huggingface.co/slprl)
- **Model type:** SpeechLM
- **License:** MIT
- **Finetuned from model:** [Qwen/Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B)

### Model Sources

- **Repository:** [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
- **Paper:** [https://arxiv.org/abs/2502.15814](https://arxiv.org/abs/2502.15814)
- **Demo:** [https://pages.cs.huji.ac.il/adiyoss-lab/slamming/](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)

## Uses

This base SpeechLM can be used to generate continuations for speech segments, or as a base for further tuning. See the _SlamKit_
[codebase](https://github.com/slp-rl/slamkit) for more details on usage, and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) for some generation examples.

### Out-of-Scope Use

This model was trained on curated speech datasets which contain mainly audiobooks and stories; as such, the outputs should not be treated as factual in any way.

## How to Get Started with the Model

We refer users to the official repository for full usage explanations - [github](https://github.com/slp-rl/slamkit).

## Training Details

We highly encourage users to read the full [paper](https://arxiv.org/abs/2502.15814) for full training details; a brief overview is provided below.

### Training Data

This model was trained on a subset of the [LibriSpeech](https://huggingface.co/datasets/openslr/librispeech_asr) train split,
[Libri-Light](https://ai.meta.com/tools/libri-light/) and the synthetic dataset
[sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories) for the pre-training phase. It was also trained with DPO on the synthetic
dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

### Training Procedure

This model was trained by next-token prediction over several datasets, and then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
Please refer to the [paper](https://arxiv.org/abs/2502.15814) or [code](https://github.com/slp-rl/slamkit) for the full training recipes.
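
As a rough illustration of the preference-optimisation stage, the per-pair DPO objective can be sketched as below. This is a generic sketch, not the paper's implementation; the scalar log-probabilities and `beta=0.1` are placeholder assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    The scalar log-probabilities and beta=0.1 are illustrative assumptions,
    not the values used in the paper.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# When the policy matches the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # ≈ 0.6931
```

Training pushes the loss below log(2) by widening the policy's chosen-vs-rejected margin relative to the reference model.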

#### Preprocessing

Speech tokens are extracted from the audio using [Hubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz), and quantised using the
official kmeans released with the model in [textlesslib](https://github.com/facebookresearch/textlesslib/tree/main). Units are de-duplicated.
We encourage you to explore the official repository for full details - [github](https://github.com/slp-rl/slamkit).
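
The de-duplication step collapses consecutive repeats of the same discrete unit into a single occurrence; a minimal sketch (the unit values here are made up for illustration):

```python
def deduplicate(units):
    """Collapse consecutive repeats of the same discrete speech unit."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out

# 25Hz HuBERT units often repeat across adjacent frames; example values are made up.
print(deduplicate([12, 12, 12, 407, 407, 12, 499]))  # [12, 407, 12, 499]
```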
## Evaluation

The paper provides full results; we give some results here and also refer to the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) to listen to some samples.

| Model | GPUs | Params | Num Tokens | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
|-------------------------------------------|---------|--------|---------------|-----------|---------------|---------------|----------|-------------|
| **Speech only pre-training** | | | | | | | | |
| GSLM | 8×V100 | 100M | 1B | 54.2 | 53.3 | 66.6 | — | — |
| SyllableLM | 4×A40 | 300M | 16B | 63.7 | — | 75.4 | — | — |
| TWIST-350M | 8×V100 | 305M | 10.8B | 56.2 | — | — | 137.3 | 3.46 |
| TWIST-1.3B | 32×V100 | 1B | 10.8B | 57.0 | 52.4 | 70.6 | 131.8 | 3.20 |
| TWIST-7B | 32×V100 | 7B | 36B | 59.0 | 55.3 | 74.1 | 93.74 | 3.06 |
| TWIST-13B | 32×V100 | 13B | 36B | 59.2 | 55.4 | 76.4 | — | — |
| Scaled Optimal | — | 823M | 82B | **61.3** | 56.7 | 78.0 | — | — |
| Moshi | ?×H100 | 7B | ? | 58.9 | **58.7** | **81.8** | — | — |
| SpiritLM | 64×A100 | 7B | 100B | 58.0 | 54.8 | 72.9 | — | — |
| **With text / preference optimization** | | | | | | | | |
| Scaling Interleaving | — | 9B | ~1T | — | **62.4** | 82.9 | — | — |
| Moshi | ?×H100 | 7B | ~720B | 58.8 | 60.8 | 83.0 | — | — |
| SpiritLM | 64×A100 | 7B | 100B | 58.3 | 61.0 | 82.9 | — | — |
| AlignSLM-1.3B | 64×A100 | 1B | 10.8B + ~158B | 59.8 | 55.0 | 80.0 | — | — |
| AlignSLM-7B | 64×A100 | 7B | 36B + ~158B | **62.3** | 61.1 | **86.8** | — | — |
| **Ours (_Slam_)** | | | | | | | | |
| _Slam_ (-DPO) | 2×A100 | 358M | 16.7B | 58.53 | 58.15 | 80.71 | 67.3 | 3.25 |
| _Slam_ | 1×A5000 | 358M | 1.4B + 5M | 58.86 | 58.04 | 82.04 | 62.8 | 3.88 |
| _Slam_ (scaled) | 2×A100 | 358M | 16.7B + 9M | 61.11 | 61.30 | 84.18 | 46.6 | 3.75 |
| _Slam_ (large) | 2×A100 | 1.3B | 6.1B + 9M | **61.43** | **61.52** | **85.30** | **41.2** | 3.89 |
### Compute Infrastructure

This model was trained as part of ["*Slamming*: Training a Speech Language Model on One GPU in a Day"](https://arxiv.org/abs/2502.15814), focusing on efficient training.

#### Hardware

This model was trained using **only 2 Nvidia A100 GPUs** for **48 hours**.

#### Software

The model was trained using the [*SlamKit*](https://github.com/slp-rl/slamkit) codebase, which builds upon 🤗transformers, extending it to support
easy and efficient training of Speech Language Models.

## Citation

**BibTeX:**

```bibtex
@misc{maimon2025slamming,
      title={Slamming: Training a Speech Language Model on One GPU in a Day},
      author={Gallil Maimon and Avishai Elmakies and Yossi Adi},
      year={2025},
      eprint={2502.15814},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.15814},
}
```