Update README.md
README.md
CHANGED
|
@@ -10,15 +10,15 @@ pipeline_tag: audio-to-audio
|
|
| 10 |
---
|
| 11 |
|
| 12 |
# Model Card for Model ID
|
| 13 |
-
This is a Speech Language Model trained for generating
|
| 14 |
|
| 15 |
|
| 16 |
## Model Details
|
| 17 |
|
| 18 |
### Model Description
|
| 19 |
-
This is a Speech Language Model,
|
| 20 |
-
|
| 21 |
-
|
| 22 |
slightly more compute - 2*A100 for 2 days, see [slam_scaled](https://huggingface.co/slprl/slam_scaled).
|
| 23 |
|
| 24 |
The model was trained by next-token prediction over a subset of LibriSpeech, Libri-Light and a synthetic data
|
|
@@ -37,13 +37,8 @@ The model was trained by next-token prediction over a subset of LibriSpeech, Lib
|
|
| 37 |
- **Demo:** [Link](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)
|
| 38 |
|
| 39 |
## Uses
|
| 40 |
-
This is a base SpeechLM and as such can be used to generate continuations for speech segments, or as a base for further tuning.
|
| 41 |
-
|
| 42 |
-
### Direct Use
|
| 43 |
-
|
| 44 |
-
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
|
| 45 |
-
|
| 46 |
-
[More Information Needed]
|
| 47 |
|
| 48 |
### Out-of-Scope Use
|
| 49 |
This model was trained on curated speech datasets which contain mainly audio-books and stories; as such, the outputs should not be treated as factual in any way.
|
|
@@ -59,13 +54,14 @@ We highly encourage users to read the full [paper](), for full training details,
|
|
| 59 |
|
| 60 |
|
| 61 |
### Training Data
|
| 62 |
-
This model was trained on a subset of [LibriSpeech] train,
|
|
|
|
| 63 |
[sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories) for the pre-training phase. It was also trained with DPO on the synthetic
|
| 64 |
dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
|
| 65 |
|
| 66 |
### Training Procedure
|
| 67 |
-
|
| 68 |
-
|
| 69 |
|
| 70 |
#### Preprocessing
|
| 71 |
Speech tokens are extracted from the audio using [Hubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz), and quantised using the
|
|
@@ -73,21 +69,20 @@ official kmeans released with the model in [textlesslib](https://github.com/face
|
|
| 73 |
We encourage you to explore the official repository for full details - [github](https://github.com/slp-rl/slam).
|
| 74 |
|
| 75 |
|
| 76 |
-
#### Training Hyperparameters
|
| 77 |
-
|
| 78 |
-
- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
|
| 79 |
-
|
| 80 |
-
|
| 81 |
## Evaluation
|
| 82 |
The paper provides full results; we give some results here and also refer to the [demo page]() to listen to some samples.
|
| 83 |
|
| 84 |
-
**ADD Table**
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
### Model Architecture and Objective
|
| 89 |
|
| 90 |
-
[More Information Needed]
|
| 91 |
|
| 92 |
### Compute Infrastructure
|
| 93 |
This model was trained as part of ["*Slamming*: Training a Speech Language Model on One GPU in a Day"], focusing on efficient training.
|
|
|
|
| 10 |
---
|
| 11 |
|
| 12 |
# Model Card for Model ID
|
| 13 |
+
This is a Speech Language Model trained for generating speech continuations over discrete [Hubert tokens](https://huggingface.co/slprl/mhubert-base-25hz).
|
| 14 |
|
| 15 |
|
| 16 |
## Model Details
|
| 17 |
|
| 18 |
### Model Description
|
| 19 |
+
This is a Speech Language Model, introduced in "_Slamming_: Training a Speech Language Model on One GPU in a Day", focusing on efficient training.
|
| 20 |
+
It was fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500 speech tokens extracted from
|
| 21 |
+
the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz). For a stronger version of the model trained with
|
| 22 |
slightly more compute - 2*A100 for 2 days, see [slam_scaled](https://huggingface.co/slprl/slam_scaled).
|
| 23 |
|
| 24 |
The model was trained by next-token prediction over a subset of LibriSpeech, Libri-Light and a synthetic data
|
|
|
|
| 37 |
- **Demo:** [Link](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)
|
| 38 |
|
| 39 |
## Uses
|
| 40 |
+
This is a base SpeechLM and as such can be used to generate continuations for speech segments, or as a base for further tuning. See the _slam_
|
| 41 |
+
[codebase](https://github.com/slp-rl/slam) for more details on usage, and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) for some generation examples.
|
| 42 |
|
| 43 |
### Out-of-Scope Use
|
| 44 |
This model was trained on curated speech datasets which contain mainly audio-books and stories; as such, the outputs should not be treated as factual in any way.
|
|
|
|
| 54 |
|
| 55 |
|
| 56 |
### Training Data
|
| 57 |
+
This model was trained on a subset of [LibriSpeech](https://huggingface.co/datasets/openslr/librispeech_asr) train,
|
| 58 |
+
[Libri-Light](https://ai.meta.com/tools/libri-light/) and the synthetic dataset
|
| 59 |
[sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories) for the pre-training phase. It was also trained with DPO on the synthetic
|
| 60 |
dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
|
| 61 |
|
| 62 |
### Training Procedure
|
| 63 |
+
This model was trained by next-token prediction over several datasets, and then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
|
| 64 |
+
Please refer to the [paper]() or [code](https://github.com/slp-rl/slam) for the full training recipes.
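
As a rough illustration of the DPO stage mentioned above, the sketch below computes the standard DPO loss for a single preference pair from summed sequence log-probabilities. The function name, the `beta` value and the example numbers are assumptions made for illustration; they are not taken from the slam codebase, which remains the reference for the actual recipe.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a continuation under
    either the trained policy or the frozen reference model.
    `beta` is a hypothetical value, not the one used for this model.
    """
    # How much more the policy prefers "chosen" over "rejected",
    # relative to the reference model.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) written as a numerically stable softplus(-x).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Made-up log-probabilities, purely for illustration.
# When policy and reference agree, the loss equals log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# The loss drops once the policy favours the chosen continuation.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The loss only depends on how the policy's preference differs from the reference model's, which is why identical log-probabilities give the same value regardless of their magnitude.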
|
| 65 |
|
| 66 |
#### Preprocessing
|
| 67 |
Speech tokens are extracted from the audio using [Hubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz), and quantised using the
|
|
|
|
| 69 |
We encourage you to explore the official repository for full details - [github](https://github.com/slp-rl/slam).
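
The quantisation step described above can be pictured with a small sketch: each feature frame is mapped to the index of its nearest k-means centroid, giving one discrete unit per frame. The features and centroids here are random stand-ins and `FEATURE_DIM` is a toy value; the real pipeline uses mhubert features and the k-means model released in textlesslib. The 25 tokens-per-second rate is inferred from the tokeniser name, and 500 units matches the vocabulary size stated in the model description.

```python
import random

FRAME_RATE_HZ = 25  # assumed from the "25hz" tokeniser name
NUM_UNITS = 500     # vocabulary size stated in the model description
FEATURE_DIM = 16    # toy dimension for illustration, not the real feature size


def nearest_unit(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((f - c) ** 2 for f, c in zip(frame, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantise(features, centroids):
    """Map a sequence of feature frames to a sequence of discrete units."""
    return [nearest_unit(frame, centroids) for frame in features]


rng = random.Random(0)
# Random stand-ins for the k-means centroids and for 4 seconds of features.
centroids = [[rng.gauss(0, 1) for _ in range(FEATURE_DIM)] for _ in range(NUM_UNITS)]
seconds = 4
features = [[rng.gauss(0, 1) for _ in range(FEATURE_DIM)]
            for _ in range(seconds * FRAME_RATE_HZ)]

units = quantise(features, centroids)
print(len(units))  # one unit per frame: 4 s * 25 Hz = 100
```

Under these assumptions, a clip of `t` seconds yields about `25 * t` units before any further processing, each in the range `0..499`.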
|
| 70 |
|
| 71 |
|
| 72 |
## Evaluation
|
| 73 |
The paper provides full results; we give some results here and also refer to the [demo page]() to listen to some samples.
|
| 74 |
+
| Model | Compute (GPU days) | Parameters | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
|-------|--------------------|------------|----------|---------------|---------------|----------|-------------|
| [TWIST-1.3B](https://pages.cs.huji.ac.il/adiyoss-lab/twist/) | 160xV100 | 1B | 57.00 | 52.4 | 70.6 | 131.8 | 3.20 |
| [TWIST-7B](https://pages.cs.huji.ac.il/adiyoss-lab/twist/) | ? | 7B | 59.00 | 55.3 | 74.1 | 93.7 | 3.06 |
| [TWIST-13B](https://pages.cs.huji.ac.il/adiyoss-lab/twist/) | ? | 13B | 59.20 | 55.4 | 76.4 | - | - |
| [Scaled Optimal](https://arxiv.org/abs/2404.00685) | ? | 823M | **61.3** | **56.7** | **78.0** | - | - |
| [Predicted Optimal](https://arxiv.org/abs/2404.00685) | 1xA5000 | 78M | 56.85 | 54.09 | 70.49 | - | - |
| TWIST-350M (Original recipe) | 1xA5000 | 305M | 51.52 ± .19 | 53.65 ± .57 | 68.80 ± .47 | 259.2 ± 6.7 | 3.26 ± .46 |
| *Slam (-DPO) (ours)* | 1xA5000 | 358M | *56.45* ± .17 | *55.59* ± .30 | *78.01* ± .27 | *88.3* ± 1.0 | 3.47 ± .17 |
| **Slam (ours)** | 1xA5000 | 358M | **58.86** ± .20 | **58.04** ± .51 | **82.04** ± .21 | **62.8** ± 4.1 | 3.88 ± .11 |
|
| 84 |
|
| 85 |
|
|
|
|
| 86 |
|
| 87 |
### Compute Infrastructure
|
| 88 |
This model was trained as part of ["*Slamming*: Training a Speech Language Model on One GPU in a Day"], focusing on efficient training.
|