slprl/SIMS-7B

@@ -1,87 +1,99 @@
----
-library_name: transformers
-license: mit
-datasets:
-- slprl/sTinyStories
-language:
-- en
-base_model:
-- Qwen/Qwen2.5-7B
-pipeline_tag: audio-to-audio
----
-# Scaling Analysis of Interleaved Speech-Text Language Models
-The model was presented in the paper [Scaling Analysis of Interleaved Speech-Text Language Models](https://arxiv.org/abs/2504.02398).
-# Paper abstract
-Existing Speech Language Model (SLM) scaling analysis paints a bleak picture. They predict that SLMs require much more compute and data
-compared to text, leading some to question the feasibility of training high-quality SLMs. However, modern SLMs are often initialised from
-pre-trained TextLMs using speech-text interleaving to allow knowledge transfer. This raises the question - _Do interleaved SLMs scale more efficiently than textless-SLMs?_
-In this paper we answer a resounding _yes!_ We conduct scaling analysis of interleaved SLMs by training several dozen and analysing the
-scaling trends. We see that under this setup SLMs scale more efficiently with compute. Additionally, our results indicate that the
-scaling-dynamics are significantly different than textless-SLMs, suggesting one should allocate notably more of the compute budget for
-increasing model size over training tokens. We also study the role of synthetic data and TextLM model families in unlocking this potential.
-Results suggest, that our scaled up model achieves comparable performance with leading models on speech semantic metrics while using less
-compute and data than other approaches.
-# Model Card for Model ID
-This is a Speech Language Model (SLM) trained for generating speech or text continuations over discrete [Hubert tokens](https://huggingface.co/slprl/mhubert-base-25hz) given speech-text prompts.
-## Model Details
-### Model Description
-This Speech Language Model, introduced in ["Scaling Analysis of Interleaved Speech-Text Language Models"](https://arxiv.org/abs/2504.02398), focuses on scaling analysis of interleaved speech-text SLMs.
-It was fine-tuned from [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) by extending its vocabulary with 500 speech tokens extracted from
-the 11-th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).
-- **Developed by:** [SLP-RL](https://huggingface.co/slprl)
-- **Model type:** SpeechLM
-- **License:** MIT
-- **Finetuned from model:** [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B)
-### Model Sources
-- **Repository:** [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
-- **Paper:** [https://arxiv.org/abs/2504.02398](https://arxiv.org/abs/2504.02398)
-- **Demo:** [https://pages.cs.huji.ac.il/adiyoss-lab/sims/](https://pages.cs.huji.ac.il/adiyoss-lab/sims/)
-## Uses
-This base SpeechLM can be used to generate continuations for speech segments, or cross-modal e.g generate a text contiuation to a speech prompt, or as a base for further tuning. See the _SlamKit_
-[codebase](https://github.com/slp-rl/slamkit) for more details on usage, and checkout the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/sims/) for some generation examples
-### Out-of-Scope Use
-This model was trained on diverse speech datasets, as such the outputs should not be treated as factual in any way.
-## How to Get Started with the Model
-We refer users to the official repository for full usage explanations - [github](https://github.com/slp-rl/slamkit).
-## Training Details
-We highly encourage users to read the full [paper](https://arxiv.org/abs/2504.02398), for full training details.
-### Compute Infrastructure
-#### Hardware
-This model was trained using 8 Nvidia H100 GPUs.
-#### Software
-The model was trained using the [*SlamKit*](https://github.com/slp-rl/slamkit) codebase which builds upon 🤗transformers extending it to support
-easy and efficient training of Speech Language Models.
-## Citation
-**BibTeX:**
-```
-@misc{maimon2025scaling,
-      title={Scaling Analysis of Interleaved Speech-Text Language Models},
-      author={Gallil Maimon and Michael Hassid and Amit Roth and Yossi Adi},
-      year={2025},
-      eprint={2504.02398},
-      archivePrefix={arXiv},
-      primaryClass={cs.CL},
-      url={https://arxiv.org/abs/2504.02398},
-}
 ```

+---
+library_name: transformers
+license: mit
+datasets:
+- slprl/sTinyStories
+language:
+- zho
+- eng
+- fra
+- spa
+- por
+- deu
+- ita
+- rus
+- jpn
+- kor
+- vie
+- tha
+- ara
+base_model:
+- Qwen/Qwen2.5-7B
+pipeline_tag: audio-to-audio
+---
+# Scaling Analysis of Interleaved Speech-Text Language Models
+The model was presented in the paper [Scaling Analysis of Interleaved Speech-Text Language Models](https://arxiv.org/abs/2504.02398).
+# Paper abstract
+Existing Speech Language Model (SLM) scaling analysis paints a bleak picture. They predict that SLMs require much more compute and data
+compared to text, leading some to question the feasibility of training high-quality SLMs. However, modern SLMs are often initialised from
+pre-trained TextLMs using speech-text interleaving to allow knowledge transfer. This raises the question - _Do interleaved SLMs scale more efficiently than textless-SLMs?_
+In this paper we answer a resounding _yes!_ We conduct scaling analysis of interleaved SLMs by training several dozen and analysing the
+scaling trends. We see that under this setup SLMs scale more efficiently with compute. Additionally, our results indicate that the
+scaling-dynamics are significantly different than textless-SLMs, suggesting one should allocate notably more of the compute budget for
+increasing model size over training tokens. We also study the role of synthetic data and TextLM model families in unlocking this potential.
+Results suggest, that our scaled up model achieves comparable performance with leading models on speech semantic metrics while using less
+compute and data than other approaches.
+# Model Card for SIMS-7B
+This is a Speech Language Model (SLM) trained for generating speech or text continuations over discrete [HuBERT tokens](https://huggingface.co/slprl/mhubert-base-25hz) given speech-text prompts.
+## Model Details
+### Model Description
+This Speech Language Model, introduced in ["Scaling Analysis of Interleaved Speech-Text Language Models"](https://arxiv.org/abs/2504.02398), focuses on scaling analysis of interleaved speech-text SLMs.
+It was fine-tuned from [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) by extending its vocabulary with 500 speech tokens extracted from
+the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).
+- **Developed by:** [SLP-RL](https://huggingface.co/slprl)
+- **Model type:** SpeechLM
+- **License:** MIT
+- **Finetuned from model:** [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B)
+### Model Sources
+- **Repository:** [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
+- **Paper:** [https://arxiv.org/abs/2504.02398](https://arxiv.org/abs/2504.02398)
+- **Demo:** [https://pages.cs.huji.ac.il/adiyoss-lab/sims/](https://pages.cs.huji.ac.il/adiyoss-lab/sims/)
+## Uses
+This base SpeechLM can be used to generate continuations for speech segments, to generate cross-modal continuations (e.g., a text continuation of a speech prompt), or as a base for further tuning. See the _SlamKit_
+[codebase](https://github.com/slp-rl/slamkit) for more details on usage, and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/sims/) for some generation examples.
+### Out-of-Scope Use
+This model was trained on diverse speech datasets; as such, its outputs should not be treated as factual in any way.
+## How to Get Started with the Model
+We refer users to the official repository for full usage explanations: [GitHub](https://github.com/slp-rl/slamkit).
+## Training Details
+We highly encourage users to read the [paper](https://arxiv.org/abs/2504.02398) for full training details.
+### Compute Infrastructure
+#### Hardware
+This model was trained using 8 NVIDIA H100 GPUs.
+#### Software
+The model was trained using the [*SlamKit*](https://github.com/slp-rl/slamkit) codebase, which builds upon 🤗 transformers and extends it to support
+easy and efficient training of Speech Language Models.
+## Citation
+**BibTeX:**
+```
+@misc{maimon2025scaling,
+      title={Scaling Analysis of Interleaved Speech-Text Language Models},
+      author={Gallil Maimon and Michael Hassid and Amit Roth and Yossi Adi},
+      year={2025},
+      eprint={2504.02398},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2504.02398},
+}
 ```