nabeix committed (verified)
Commit 56bd726 · Parent: 896ddec

Upload folder using huggingface_hub

Files changed (4):
  1. README.md +5 -150
  2. config.json +2 -2
  3. generation_config.json +2 -2
  4. model.safetensors +1 -1
README.md CHANGED
@@ -1,150 +1,5 @@
- ---
- library_name: transformers
- tags:
- - text-to-speech
- - annotation
- license: apache-2.0
- base_model:
- - parler-tts/parler-tts-mini-v1
- language:
- - en
- pipeline_tag: text-to-speech
- inference: false
- datasets:
- - parler-tts/mls_eng
- - parler-tts/libritts_r_filtered
- - parler-tts/libritts-r-filtered-speaker-descriptions
- - parler-tts/mls-eng-speaker-descriptions
- ---
-
-
- <img src="https://huggingface.co/datasets/parler-tts/images/resolve/main/thumbnail.png" alt="Parler Logo" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
-
-
- # Parler-TTS Mini v1
-
- <a target="_blank" href="https://huggingface.co/spaces/parler-tts/parler_tts">
- <img src="https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm.svg" alt="Open in HuggingFace"/>
- </a>
-
- **Parler-TTS Mini v1** is a lightweight text-to-speech (TTS) model, trained on 45K hours of audio data, that can generate high-quality, natural-sounding speech with features that can be controlled using a simple text prompt (e.g. gender, background noise, speaking rate, pitch and reverberation).
-
- With [Parler-TTS Large v1](https://huggingface.co/parler-tts/parler-tts-large-v1), this is the second set of models published as part of the [Parler-TTS](https://github.com/huggingface/parler-tts) project, which aims to provide the community with TTS training resources and dataset pre-processing code.
-
- ## 📖 Quick Index
- * [👨‍💻 Installation](#👨‍💻-installation)
- * [🎲 Using a random voice](#🎲-random-voice)
- * [🎯 Using a specific speaker](#🎯-using-a-specific-speaker)
- * [Motivation](#motivation)
- * [Optimizing inference](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md)
-
- ## 🛠️ Usage
-
- ### 👨‍💻 Installation
-
- Using Parler-TTS is as simple as "bonjour". Simply install the library once:
-
- ```sh
- pip install git+https://github.com/huggingface/parler-tts.git
- ```
-
- ### 🎲 Random voice
-
-
- **Parler-TTS** has been trained to generate speech with features that can be controlled with a simple text prompt, for example:
-
- ```py
- import torch
- from parler_tts import ParlerTTSForConditionalGeneration
- from transformers import AutoTokenizer
- import soundfile as sf
-
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
-
- model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
- tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")
-
- prompt = "Hey, how are you doing today?"
- description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."
-
- input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
- prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
-
- generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
- audio_arr = generation.cpu().numpy().squeeze()
- sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
- ```
-
- ### 🎯 Using a specific speaker
-
- To ensure speaker consistency across generations, this checkpoint was also trained on 34 speakers, characterized by name (e.g. Jon, Lea, Gary, Jenna, Mike, Laura).
-
- To take advantage of this, simply adapt your text description to specify which speaker to use: `Jon's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise.`
-
- ```py
- import torch
- from parler_tts import ParlerTTSForConditionalGeneration
- from transformers import AutoTokenizer
- import soundfile as sf
-
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
-
- model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
- tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")
-
- prompt = "Hey, how are you doing today?"
- description = "Jon's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise."
-
- input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
- prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
-
- generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
- audio_arr = generation.cpu().numpy().squeeze()
- sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
- ```
-
- **Tips**:
- * We've set up an [inference guide](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md) to make generation faster. Think SDPA, torch.compile, batching and streaming!
- * Include the term "very clear audio" to generate the highest-quality audio, and "very noisy audio" for high levels of background noise
- * Punctuation can be used to control the prosody of the generations, e.g. use commas to add small breaks in speech
- * The remaining speech features (gender, speaking rate, pitch and reverberation) can be controlled directly through the prompt
-
- ## Motivation
-
- Parler-TTS is a reproduction of work from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com) by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively.
-
- Contrary to other TTS models, Parler-TTS is a **fully open-source** release. All of the datasets, pre-processing, training code and weights are released publicly under a permissive license, enabling the community to build on our work and develop their own powerful TTS models.
- Parler-TTS was released alongside:
- * [The Parler-TTS repository](https://github.com/huggingface/parler-tts) - you can train and fine-tune your own version of the model.
- * [The Data-Speech repository](https://github.com/huggingface/dataspeech) - a suite of utility scripts designed to annotate speech datasets.
- * [The Parler-TTS organization](https://huggingface.co/parler-tts) - where you can find the annotated datasets as well as the future checkpoints.
-
- ## Citation
-
- If you found this repository useful, please consider citing this work and also the original Stability AI paper:
-
- ```
- @misc{lacombe-etal-2024-parler-tts,
- author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
- title = {Parler-TTS},
- year = {2024},
- publisher = {GitHub},
- journal = {GitHub repository},
- howpublished = {\url{https://github.com/huggingface/parler-tts}}
- }
- ```
-
- ```
- @misc{lyth2024natural,
- title={Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
- author={Dan Lyth and Simon King},
- year={2024},
- eprint={2402.01912},
- archivePrefix={arXiv},
- primaryClass={cs.SD}
- }
- ```
-
- ## License
-
- This model is permissively licensed under the Apache 2.0 license.
+ ---
+ license: apache-2.0
+ language:
+ - en
+ ---
config.json CHANGED
@@ -1,5 +1,5 @@
  {
- "_name_or_path": "/fsx/yoach/tmp/artefacts/training-50K-mini-without-accents-3-mononode/",
+ "_name_or_path": "/fsx/yoach/tmp/artefacts/training-PES24-TTS-mini-v1/",
  "architectures": [
  "ParlerTTSForConditionalGeneration"
  ],
@@ -249,6 +249,6 @@
  "vocab_size": 32128
  },
  "torch_dtype": "float32",
- "transformers_version": "4.40.2",
+ "transformers_version": "4.43.3",
  "vocab_size": 32128
  }
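The substantive changes to config.json are the checkpoint path and the `transformers_version` bump from 4.40.2 to 4.43.3. As a sanity check, release-style version strings like these can be compared numerically; a minimal stdlib-only sketch (naive dotted-integer comparison, which suffices for plain release versions but not for pre-release tags):

```python
def version_tuple(v: str) -> tuple:
    """Turn a dotted release version like '4.43.3' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

# Versions taken from the config.json diff above.
old_version = "4.40.2"  # before this commit
new_version = "4.43.3"  # after this commit

# Tuple comparison is element-wise, so (4, 43, 3) > (4, 40, 2).
assert version_tuple(new_version) > version_tuple(old_version)
```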
generation_config.json CHANGED
@@ -2,11 +2,11 @@
  "_from_model_config": true,
  "bos_token_id": 1025,
  "decoder_start_token_id": 1025,
- "min_new_tokens": 10,
  "do_sample": true,
  "eos_token_id": 1024,
  "guidance_scale": 1,
  "max_length": 2580,
+ "min_new_tokens": 10,
  "pad_token_id": 1024,
- "transformers_version": "4.40.2"
+ "transformers_version": "4.43.3"
  }
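Apart from the version bump, the generation_config.json change is cosmetic: `min_new_tokens` keeps its value of 10 and is only moved into alphabetical key order. A minimal sketch that parses the updated file contents (inlined here for illustration; in practice you would read the file from the repository) and confirms the generation defaults survive the reordering:

```python
import json

# Inlined copy of the post-commit generation_config.json for illustration.
raw = """
{
  "_from_model_config": true,
  "bos_token_id": 1025,
  "decoder_start_token_id": 1025,
  "do_sample": true,
  "eos_token_id": 1024,
  "guidance_scale": 1,
  "max_length": 2580,
  "min_new_tokens": 10,
  "pad_token_id": 1024,
  "transformers_version": "4.43.3"
}
"""
cfg = json.loads(raw)

# JSON objects are unordered, so moving a key does not change the parsed value.
assert cfg["min_new_tokens"] == 10
assert cfg["transformers_version"] == "4.43.3"
```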
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:a1a209338c13bdf2eae0a5be82657f04989c74c8d2c803b9ae4a156648a1c9fa
+ oid sha256:509e987ddb1c8371e75ff992c6f2c19f8e0d24925a23d5ba037c8254fa9928b0
  size 3511490640
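model.safetensors is stored under Git LFS, so this diff only updates the pointer file's sha256 oid; the recorded size is identical. A minimal stdlib-only sketch of how such a pointer is parsed and how a downloaded blob could be verified against it (the blob here is a small stand-in, since the real weights are ~3.5 GB; `parse_lfs_pointer` and `verify_blob` are illustrative names, not part of any library):

```python
import hashlib

def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into a dict of its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

def verify_blob(pointer: dict, blob: bytes) -> bool:
    """Check a downloaded blob against the oid and size recorded in its pointer."""
    digest = hashlib.sha256(blob).hexdigest()
    return (pointer["oid"] == f"sha256:{digest}"
            and int(pointer["size"]) == len(blob))

# Toy example: build a pointer for a small stand-in blob so the check
# is self-consistent.
blob = b"example weights"
pointer_text = (
    "version https://git-lfs.github.com/spec/v1\n"
    f"oid sha256:{hashlib.sha256(blob).hexdigest()}\n"
    f"size {len(blob)}\n"
)
pointer = parse_lfs_pointer(pointer_text)
assert verify_blob(pointer, blob)
```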