Audio-to-Audio
Transformers
Safetensors
speech_language_model
gallilmaimon commited on
Commit
d69bd36
·
verified ·
1 Parent(s): 82e1cfd

Update README.md

  1. README.md +20 -25
README.md CHANGED
@@ -10,15 +10,15 @@ pipeline_tag: audio-to-audio
10
  ---
11
 
12
  # Model Card for Model ID
13
- This is a Speech Language Model trained for generating audio continuations over discrete [Hubert tokens](https://huggingface.co/slprl/mhubert-base-25hz).
14
 
15
 
16
  ## Model Details
17
 
18
  ### Model Description
19
- This is a Speech Language Model, fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500
20
- speech tokens extracted from the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz). It was trained as part of
21
- ["*Slamming*: Training a Speech Language Model on One GPU in a Day"], focusing on efficient training. For a stronger model trained with
22
  slightly more compute - 2*A100 for 2 days, see [slam_scaled](https://huggingface.co/slprl/slam_scaled).
23
 
24
  The model was trained by next-token prediction over a subset of LibriSpeech, Libri-Light and a synthetic data
@@ -37,13 +37,8 @@ The model was trained by next-token prediction over a subset of LibriSpeech, Lib
37
  - **Demo:** [Link](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)
38
 
39
  ## Uses
40
- This is a base SpeechLM and as such can be used to generate continuations for speech segments, or as a base for further tuning.
41
-
42
- ### Direct Use
43
-
44
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
45
-
46
- [More Information Needed]
47
 
48
  ### Out-of-Scope Use
49
  This model was trained on curated speech datasets which contain mainly audiobooks and stories; as such, the outputs should not be treated as factual in any way.
@@ -59,13 +54,14 @@ We highly encourage users to read the full [paper](), for full training details,
59
 
60
 
61
  ### Training Data
62
- This model was trained on a subset of [LibriSpeech] train, [Libri-Light]() and the synthetic dataset
 
63
  [sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories) for the pre-training phase. It was also trained with DPO on the synthetic
64
  dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
65
 
66
  ### Training Procedure
67
-
68
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
69
 
70
  #### Preprocessing
71
  Speech tokens are extracted from the audio using [Hubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz), and quantised using the
@@ -73,21 +69,20 @@ official kmeans released with the model in [textlesslib](https://github.com/face
73
  We encourage you to explore the official repository for full details - [github](https://github.com/slp-rl/slam).
74
 
75
 
76
- #### Training Hyperparameters
77
-
78
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
79
-
80
-
81
  ## Evaluation
82
  The paper provides full results; we give some results here and refer to the [demo page]() to listen to some samples.
83
 
84
- **ADD Table**
85
-
86
-
87
-
88
- ### Model Architecture and Objective
89
 
90
- [More Information Needed]
91
 
92
  ### Compute Infrastructure
93
  This model was trained as part of ["*Slamming*: Training a Speech Language Model on One GPU in a Day"], focusing on efficient training.
 
10
  ---
11
 
12
  # Model Card for Model ID
13
+ This is a Speech Language Model trained for generating speech continuations over discrete [Hubert tokens](https://huggingface.co/slprl/mhubert-base-25hz).
14
 
15
 
16
  ## Model Details
17
 
18
  ### Model Description
19
+ This is a Speech Language Model, introduced in "_Slamming_: Training a Speech Language Model on One GPU in a Day", focusing on efficient training.
20
+ It was fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500 speech tokens extracted from
21
+ the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz). For a stronger version of the model trained with
22
  slightly more compute - 2*A100 for 2 days, see [slam_scaled](https://huggingface.co/slprl/slam_scaled).
23
 
24
  The model was trained by next-token prediction over a subset of LibriSpeech, Libri-Light and a synthetic data
 
37
  - **Demo:** [Link](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)
38
 
39
  ## Uses
40
+ This is a base SpeechLM and as such can be used to generate continuations for speech segments, or as a base for further tuning. See the _slam_
41
+ [codebase](https://github.com/slp-rl/slam) for more details on usage, and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) for some generation examples.
42
 
43
  ### Out-of-Scope Use
44
  This model was trained on curated speech datasets which contain mainly audiobooks and stories; as such, the outputs should not be treated as factual in any way.
 
54
 
55
 
56
  ### Training Data
57
+ This model was trained on a subset of [LibriSpeech](https://huggingface.co/datasets/openslr/librispeech_asr) train,
58
+ [Libri-Light](https://ai.meta.com/tools/libri-light/) and the synthetic dataset
59
  [sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories) for the pre-training phase. It was also trained with DPO on the synthetic
60
  dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
61
 
62
  ### Training Procedure
63
+ This model was trained by next-token prediction over several datasets, and then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
64
+ Please refer to the [paper]() or [code](https://github.com/slp-rl/slam) for the full training recipes.
65
 
66
  #### Preprocessing
67
  Speech tokens are extracted from the audio using [Hubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz), and quantised using the
 
69
  We encourage you to explore the official repository for full details - [github](https://github.com/slp-rl/slam).
70
 
71
 
72
  ## Evaluation
73
  The paper provides full results; we give some results here and refer to the [demo page]() to listen to some samples.
74
+ | Model | Compute (GPU days) | Parameters | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
75
+ |------------------------------------------|--------------------|------------|----------|--------------|--------------|---------|------------|
76
+ | [TWIST-1.3B](https://pages.cs.huji.ac.il/adiyoss-lab/twist/) | 160xV100 | 1B | 57.00 | 52.4 | 70.6 | 131.8 | 3.20 |
77
+ | [TWIST-7B](https://pages.cs.huji.ac.il/adiyoss-lab/twist/) | ? | 7B | 59.00 | 55.3 | 74.1 | 93.7 | 3.06 |
78
+ | [TWIST-13B](https://pages.cs.huji.ac.il/adiyoss-lab/twist/) | ? | 13B | 59.20 | 55.4 | 76.4 | - | - |
79
+ | [Scaled Optimal](https://arxiv.org/abs/2404.00685) | ? | 823M | **61.3** | **56.7** | **78.0** | - | - |
80
+ | [Predicted Optimal](https://arxiv.org/abs/2404.00685) | 1xA5000 | 78M | 56.85 | 54.09 | 70.49 | - | - |
81
+ | TWIST-350M (Original recipe) | 1xA5000 | 305M | 51.52 ± .19 | 53.65 ± .57 | 68.80 ± .47 | 259.2 ± 6.7 | 3.26 ± .46 |
82
+ | *Slam (-DPO) (ours)* | 1xA5000 | 358M | *56.45* ± .17 | *55.59* ± .30 | *78.01* ± .27 | *88.3* ± 1.0 | 3.47 ± .17 |
83
+ | **Slam (ours)** | 1xA5000 | 358M | **58.86** ± .20 | **58.04** ± .51 | **82.04** ± .21 | **62.8** ± 4.1 | 3.88 ± .11 |
84
 
85
 
 
86
 
87
  ### Compute Infrastructure
88
  This model was trained as part of ["*Slamming*: Training a Speech Language Model on One GPU in a Day"], focusing on efficient training.
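
As a rough illustration of the token rate mentioned under Preprocessing, the sketch below converts between audio duration and speech-token count. It assumes that "25hz" means 25 tokens per second of audio and that repeated units are not deduplicated, so the counts are upper bounds. The function names and the 1024-token context length are hypothetical and are not taken from the slam codebase.

```python
# Back-of-the-envelope token budget for a 25 Hz speech tokenizer.
# Assumption: "25hz" means 25 discrete tokens per second of audio.
# Assumption: no deduplication of repeated units, so counts are upper bounds.

TOKEN_RATE_HZ = 25  # taken from the tokenizer name (mhubert-base-25hz)
VOCAB_SIZE = 500    # speech-token vocabulary size stated in the model description


def num_speech_tokens(duration_seconds: float) -> int:
    """Approximate number of speech tokens for a clip of the given length."""
    return int(duration_seconds * TOKEN_RATE_HZ)


def seconds_for_context(context_tokens: int) -> float:
    """Approximate audio duration covered by a given number of tokens."""
    return context_tokens / TOKEN_RATE_HZ


if __name__ == "__main__":
    # A 3-second prompt maps to roughly 75 tokens.
    print(num_speech_tokens(3.0))
    # A hypothetical 1024-token context covers roughly 41 seconds of audio.
    print(seconds_for_context(1024))
```

Under these assumptions, a few seconds of prompt audio costs on the order of a hundred tokens, which is useful when sizing prompts and continuations.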