Create README.md #1
by reach-vb - opened

README.md ADDED
@@ -0,0 +1,62 @@
# lina-speech (beta)

Exploring "linear attention" for text-to-speech.

It predicts audio codec tokens "à la" [MusicGen](https://arxiv.org/abs/2306.05284): delayed residual vector quantizers, so that we do not need multiple models.
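The delayed pattern can be sketched in a few lines (a toy illustration with my own naming, not the repository's actual code): codebook k is shifted right by k steps, so a single causal LM can predict all residual codebook streams in one pass.

```python
def delay_pattern(codes, pad=-1):
    """MusicGen-style delay: shift codebook k right by k steps.

    codes: list of K lists, each of length T (one token stream per codebook).
    Returns K lists of length T + K - 1, padded with `pad`.
    """
    K = len(codes)
    return [[pad] * k + list(stream) + [pad] * (K - 1 - k)
            for k, stream in enumerate(codes)]


def undelay_pattern(delayed, pad=-1):
    """Invert delay_pattern, recovering the time-aligned codebooks."""
    K = len(delayed)
    T = len(delayed[0]) - (K - 1)
    return [stream[k:k + T] for k, stream in enumerate(delayed)]
```

With K = 3 codebooks, stream 0 starts immediately, stream 1 one step later, and so on; at decoding time the pattern is simply undone.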
Featuring [RWKV](https://github.com/BlinkDL/RWKV-LM), [Mamba](https://github.com/state-spaces/mamba), and [Gated Linear Attention](https://github.com/sustcsonglin/flash-linear-attention).

Compared to other LM-based TTS models:

- Easily pretrained and finetuned on midrange GPUs.
- Tiny memory footprint.
- Trained on long contexts (up to 2000 tokens, ~27 s).
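The tiny memory footprint follows from the linear-attention recurrence itself: instead of a T×T attention matrix, a constant-size state is updated at each step. A minimal numpy sketch (my own simplification — RWKV, Mamba, and GLA all add gating/decay on top of this, and use fused kernels):

```python
import numpy as np


def linear_attention(q, k, v):
    """Causal linear attention as a recurrence: the state S accumulates
    outer products k_t v_t^T, so memory stays O(d^2) for any sequence length."""
    T, d = q.shape
    S = np.zeros((d, v.shape[1]))
    out = np.zeros_like(v)
    for t in range(T):
        S = S + np.outer(k[t], v[t])  # rank-1 state update
        out[t] = q[t] @ S             # read the state with the query
    return out


def linear_attention_parallel(q, k, v):
    """Equivalent quadratic form: (Q K^T, causally masked) V."""
    T = q.shape[0]
    A = (q @ k.T) * np.tril(np.ones((T, T)))
    return A @ v
```

Both forms compute the same outputs; training can use the parallel form while inference runs the cheap recurrence.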
### Models

| Model | #Params | Dataset | Checkpoint | Steps | Note |
| :---: | :---: | :---: | :---: | :---: | :---: |
| GLA | 60M, 130M | Librilight-medium | [Download](https://nubo.ircam.fr/index.php/s/wjNYLb54m7L8xf9) | 300k | GPU inference only |
| Mamba | 60M | Librilight-medium | [Download](https://nubo.ircam.fr/index.php/s/wjNYLb54m7L8xf9) | 300k | GPU inference only |
| RWKV v6 | 60M | LibriTTS | [Download](https://nubo.ircam.fr/index.php/s/wjNYLb54m7L8xf9) | 150k | GPU inference only |
### Installation

Depending on the linear-complexity LM you choose, follow the respective installation instructions first:

- For Mamba, check the [official repo](https://github.com/state-spaces/mamba).
- For GLA/RWKV inference, check [flash-linear-attention](https://github.com/sustcsonglin/flash-linear-attention).
- For RWKV training, check [RWKV-LM](https://github.com/BlinkDL/RWKV-LM).
### Inference

Download the configuration and weights above, then check `Inference.ipynb`.
+
### TODO
|
| 33 |
+
|
| 34 |
+
- [x] Fix RWKV6 inference and/or switch to FLA implem.
|
| 35 |
+
- [ ] Provide a Datamodule for training (_lhotse_ might also work well).
|
| 36 |
+
- [ ] Implement CFG.
|
| 37 |
+
- [ ] Scale up.
|
| 38 |
+
|
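CFG here presumably refers to classifier-free guidance; the usual combination rule is a one-liner (a hypothetical sketch of the standard technique — the repo has not implemented it yet, so names and details are mine):

```python
import numpy as np


def cfg_logits(cond, uncond, scale):
    """Classifier-free guidance: push conditional logits away from the
    unconditional ones. scale = 1 recovers the conditional model; larger
    scales strengthen adherence to the conditioning (here, the text)."""
    return uncond + scale * (cond - uncond)
```

At sampling time this requires two forward passes per step, one with and one without the text conditioning.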
### Acknowledgment

- The RWKV authors and the community around them, for carrying out high-level, truly open-source research.
- @SmerkyG for making my life easy when testing cutting-edge language models.
- @lucidrains for his huge codebase.
- @sustcsonglin, who made [GLA and FLA](https://github.com/sustcsonglin/flash-linear-attention).
- @harrisonvanderbyl for fixing RWKV inference.
### Cite

```bib
@software{lemerle2024linaspeech,
  title = {LinaSpeech: Exploring "linear attention" for text-to-speech.},
  author = {Lemerle, Théodor},
  url = {https://github.com/theodorblackbird/lina-speech},
  month = apr,
  year = {2024},
}
```
### IRCAM

This work takes place at IRCAM and is part of the following project:
[ANR Exovoices](https://anr.fr/Projet-ANR-21-CE23-0040)

<img align="left" width="200" height="200" src="logo_ircam.jpeg">