DmitriiObukhov committed on
Commit 1ef4e69 · 1 Parent(s): 09945bc

Updated configs
README.md DELETED
@@ -1,147 +0,0 @@
- ---
- library_name: transformers
- tags:
- - text-to-speech
- - annotation
- license: apache-2.0
- language:
- - en
- pipeline_tag: text-to-speech
- inference: false
- datasets:
- - parler-tts/mls_eng
- - parler-tts/libritts_r_filtered
- - parler-tts/libritts-r-filtered-speaker-descriptions
- - parler-tts/mls-eng-speaker-descriptions
- ---
-
- <img src="https://huggingface.co/datasets/parler-tts/images/resolve/main/thumbnail.png" alt="Parler Logo" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
-
- # Parler-TTS Mini v1
-
- <a target="_blank" href="https://huggingface.co/spaces/parler-tts/parler_tts">
-   <img src="https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm.svg" alt="Open in HuggingFace"/>
- </a>
-
- **Parler-TTS Mini v1** is a lightweight text-to-speech (TTS) model, trained on 45K hours of audio data, that can generate high-quality, natural-sounding speech with features that can be controlled using a simple text prompt (e.g. gender, background noise, speaking rate, pitch and reverberation).
-
- Together with [Parler-TTS Large v1](https://huggingface.co/parler-tts/parler-tts-large-v1), this is the second set of models published as part of the [Parler-TTS](https://github.com/huggingface/parler-tts) project, which aims to provide the community with TTS training resources and dataset pre-processing code.
-
- ## 📖 Quick Index
- * [👨‍💻 Installation](#👨‍💻-installation)
- * [🎲 Using a random voice](#🎲-random-voice)
- * [🎯 Using a specific speaker](#🎯-using-a-specific-speaker)
- * [Motivation](#motivation)
- * [Optimizing inference](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md)
-
- ## 🛠️ Usage
-
- ### 👨‍💻 Installation
-
- Using Parler-TTS is as simple as "bonjour". Simply install the library once:
-
- ```sh
- pip install git+https://github.com/huggingface/parler-tts.git
- ```
-
- ### 🎲 Random voice
-
- **Parler-TTS** has been trained to generate speech with features that can be controlled with a simple text prompt, for example:
-
- ```py
- import torch
- from parler_tts import ParlerTTSForConditionalGeneration
- from transformers import AutoTokenizer
- import soundfile as sf
-
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
-
- model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
- tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")
-
- prompt = "Hey, how are you doing today?"
- description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."
-
- input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
- prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
-
- generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
- audio_arr = generation.cpu().numpy().squeeze()
- sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
- ```
-
- ### 🎯 Using a specific speaker
-
- To ensure speaker consistency across generations, this checkpoint was also trained on 34 speakers, characterized by name (e.g. Jon, Lea, Gary, Jenna, Mike, Laura).
-
- To take advantage of this, simply adapt your text description to specify which speaker to use: `Jon's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise.`
-
- ```py
- import torch
- from parler_tts import ParlerTTSForConditionalGeneration
- from transformers import AutoTokenizer
- import soundfile as sf
-
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
-
- model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
- tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")
-
- prompt = "Hey, how are you doing today?"
- description = "Jon's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise."
-
- input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
- prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
-
- generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
- audio_arr = generation.cpu().numpy().squeeze()
- sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
- ```
-
- **Tips**:
- * We've set up an [inference guide](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md) to make generation faster. Think SDPA, torch.compile, batching and streaming!
- * Include the term "very clear audio" to generate the highest-quality audio, and "very noisy audio" for high levels of background noise.
- * Punctuation can be used to control the prosody of the generations, e.g. use commas to add small breaks in speech.
- * The remaining speech features (gender, speaking rate, pitch and reverberation) can be controlled directly through the prompt.
-
- ## Motivation
-
- Parler-TTS is a reproduction of work from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com) by Dan Lyth and Simon King, from Stability AI and the University of Edinburgh respectively.
-
- In contrast to other TTS models, Parler-TTS is a **fully open-source** release. All of the datasets, pre-processing, training code and weights are released publicly under a permissive license, enabling the community to build on our work and develop their own powerful TTS models.
- Parler-TTS was released alongside:
- * [The Parler-TTS repository](https://github.com/huggingface/parler-tts) - you can train and fine-tune your own version of the model.
- * [The Data-Speech repository](https://github.com/huggingface/dataspeech) - a suite of utility scripts designed to annotate speech datasets.
- * [The Parler-TTS organization](https://huggingface.co/parler-tts) - where you can find the annotated datasets as well as the future checkpoints.
-
- ## Citation
-
- If you found this repository useful, please consider citing this work and also the original Stability AI paper:
-
- ```bibtex
- @misc{lacombe-etal-2024-parler-tts,
-   author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
-   title = {Parler-TTS},
-   year = {2024},
-   publisher = {GitHub},
-   journal = {GitHub repository},
-   howpublished = {\url{https://github.com/huggingface/parler-tts}}
- }
- ```
-
- ```bibtex
- @misc{lyth2024natural,
-   title = {Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
-   author = {Dan Lyth and Simon King},
-   year = {2024},
-   eprint = {2402.01912},
-   archivePrefix = {arXiv},
-   primaryClass = {cs.SD}
- }
- ```
-
- ## License
-
- This model is permissively licensed under the Apache 2.0 license.
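The prompt-engineering tips in the deleted README compose naturally; as a minimal sketch, a description string can be assembled from the controllable attributes (the helper function and its phrasings below are illustrative, not part of the Parler-TTS API):

```python
def build_description(speaker=None, pace="moderate speed", quality="very clear audio"):
    """Compose a Parler-TTS-style description prompt from controllable attributes."""
    # Named speakers (e.g. Jon, Lea, Gary) keep the voice consistent across generations.
    subject = f"{speaker}'s voice" if speaker else "A female speaker's voice"
    return f"{subject} is delivered at a {pace}, with {quality}."

print(build_description(speaker="Jon", pace="slightly fast pace"))
# → Jon's voice is delivered at a slightly fast pace, with very clear audio.
```

The resulting string is what would be passed as `description` in the README's usage snippets.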
config.json CHANGED
@@ -1,9 +1,10 @@
 {
-  "_name_or_path": "/fsx/yoach/tmp/artefacts/training-50K-mini-without-accents-3-mononode/",
+  "_name_or_path": "speechmaster",
   "architectures": [
     "ParlerTTSForConditionalGeneration"
   ],
   "audio_encoder": {
+    "_attn_implementation_autoset": false,
     "_name_or_path": "parler-tts/dac_44khZ_8kbps",
     "add_cross_attention": false,
     "architectures": [
@@ -41,7 +42,7 @@
     "max_length": 20,
     "min_length": 0,
     "model_bitrate": 8,
-    "model_type": "dac",
+    "model_type": "dac_on_the_hub",
     "no_repeat_ngram_size": 0,
     "num_beam_groups": 1,
     "num_beams": 1,
@@ -75,6 +76,7 @@
     "use_bfloat16": false
   },
   "decoder": {
+    "_attn_implementation_autoset": false,
     "_name_or_path": "/fsx/yoach/tmp/artefacts/parler-tts-mini/decoder",
     "activation_dropout": 0.0,
     "activation_function": "gelu",
@@ -87,6 +89,7 @@
     "begin_suppress_tokens": null,
     "bos_token_id": 1025,
     "chunk_size_feed_forward": 0,
+    "codebook_weights": null,
     "cross_attention_hidden_size": null,
     "cross_attention_implementation_strategy": null,
     "decoder_start_token_id": null,
@@ -116,7 +119,7 @@
     "layerdrop": 0.0,
     "length_penalty": 1.0,
     "max_length": 20,
-    "max_position_embeddings": 4096,
+    "max_position_embeddings": 4311,
     "min_length": 0,
     "model_type": "parler_tts_decoder",
     "no_repeat_ngram_size": 0,
@@ -157,6 +160,7 @@
     "typical_p": 1.0,
     "use_bfloat16": false,
     "use_cache": true,
+    "use_fused_lm_heads": false,
     "vocab_size": 1088
   },
   "decoder_start_token_id": 1025,
@@ -165,6 +169,7 @@
   "pad_token_id": 1024,
   "prompt_cross_attention": false,
   "text_encoder": {
+    "_attn_implementation_autoset": false,
     "_name_or_path": "google/flan-t5-large",
     "add_cross_attention": false,
     "architectures": [
@@ -249,6 +254,6 @@
     "vocab_size": 32128
   },
   "torch_dtype": "float32",
-  "transformers_version": "4.40.2",
+  "transformers_version": "4.46.1",
   "vocab_size": 32128
 }
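A key-level dict diff is a quick way to audit a config change like the one above; a minimal sketch over two illustrative fragments (only a handful of the changed keys are reproduced here):

```python
import json

# Illustrative fragments of the old and new config (not the full files).
old = json.loads('{"model_type": "dac", "max_position_embeddings": 4096, '
                 '"transformers_version": "4.40.2"}')
new = json.loads('{"model_type": "dac_on_the_hub", "max_position_embeddings": 4311, '
                 '"transformers_version": "4.46.1", "use_fused_lm_heads": false}')

# Keys present only in the new config, and keys whose values changed.
added = {k: new[k] for k in new.keys() - old.keys()}
changed = {k: (old[k], new[k]) for k in old.keys() & new.keys() if old[k] != new[k]}

print(added)            # {'use_fused_lm_heads': False}
print(sorted(changed))  # ['max_position_embeddings', 'model_type', 'transformers_version']
```

The same pattern applies to the full `config.json` files on either side of the commit.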
generation_config.json CHANGED
@@ -2,11 +2,11 @@
   "_from_model_config": true,
   "bos_token_id": 1025,
   "decoder_start_token_id": 1025,
-  "min_new_tokens": 10,
   "do_sample": true,
   "eos_token_id": 1024,
   "guidance_scale": 1,
-  "max_length": 2580,
+  "max_length": 4150,
+  "min_new_tokens": 10,
   "pad_token_id": 1024,
-  "transformers_version": "4.40.2"
+  "transformers_version": "4.46.1"
 }
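The `max_length` bump (2580 → 4150) raises the cap on decoder steps, i.e. codec frames, per generation. A rough back-of-the-envelope of what that means in audio duration, assuming the DAC codec at 44.1 kHz produces about 86 frames per second (an assumption based on a hop size of 512; the frame rate is not stated in this diff):

```python
FRAMES_PER_SECOND = 86  # assumed DAC frame rate at 44.1 kHz (44100 / 512 ≈ 86)

for max_length in (2580, 4150):
    seconds = max_length / FRAMES_PER_SECOND
    print(f"max_length={max_length} -> ~{seconds:.0f} s of audio")
```

Under that assumption, the change roughly extends the maximum generated clip from about 30 to about 48 seconds.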
handler.py DELETED
@@ -1,44 +0,0 @@
- from typing import Dict, List, Any
- from parler_tts import ParlerTTSForConditionalGeneration
- from transformers import AutoTokenizer
- import torch
-
- class EndpointHandler:
-     def __init__(self, path=""):
-         # load model and processor from path
-         self.tokenizer = AutoTokenizer.from_pretrained(path)
-         self.model = ParlerTTSForConditionalGeneration.from_pretrained(path, torch_dtype=torch.float16).to("cuda")
-
-     def __call__(self, data: Dict[str, Any]) -> Dict[str, str]:
-         """
-         Args:
-             data (:dict:):
-                 The payload with the text prompt and generation parameters.
-         """
-         # process input
-         inputs = data.pop("inputs", data)
-         voice_description = data.pop("voice_description", "data")
-         parameters = data.pop("parameters", None)
-
-         gen_kwargs = {"min_new_tokens": 10}
-         if parameters is not None:
-             gen_kwargs.update(parameters)
-
-         # preprocess
-         inputs = self.tokenizer(
-             text=[inputs],
-             padding=True,
-             return_tensors="pt",
-         ).to("cuda")
-         voice_description = self.tokenizer(
-             text=[voice_description],
-             padding=True,
-             return_tensors="pt",
-         ).to("cuda")
-
-         # pass inputs with all kwargs in data
-         with torch.autocast("cuda"):
-             outputs = self.model.generate(**voice_description, prompt_input_ids=inputs.input_ids, **gen_kwargs)
-
-         # postprocess the prediction
-         prediction = outputs[0].cpu().numpy().tolist()
-
-         return [{"generated_audio": prediction}]
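For reference, the deleted handler consumed a JSON payload with `inputs`, `voice_description`, and optional `parameters` keys; a hypothetical sketch of building such a request body (the field names mirror the keys the handler pops from `data`, the values are examples):

```python
import json

# Hypothetical request body for the removed inference-endpoint handler.
payload = {
    "inputs": "Hey, how are you doing today?",  # text to speak
    "voice_description": "Jon's voice is monotone yet slightly fast in delivery.",
    "parameters": {"min_new_tokens": 10},       # forwarded to generate() as kwargs
}

body = json.dumps(payload)  # what a client would POST
decoded = json.loads(body)
print(decoded["parameters"]["min_new_tokens"])  # 10
```

With the handler removed, such payloads are no longer served by this repository's custom endpoint.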
preprocessor_config.json DELETED
@@ -1,10 +0,0 @@
- {
-   "chunk_length_s": null,
-   "feature_extractor_type": "EncodecFeatureExtractor",
-   "feature_size": 1,
-   "overlap": null,
-   "padding_side": "right",
-   "padding_value": 0.0,
-   "return_attention_mask": true,
-   "sampling_rate": 44100
- }
requirements.txt DELETED
@@ -1 +0,0 @@
- git+https://github.com/huggingface/parler-tts.git
tokenizer_config.json CHANGED
@@ -927,7 +927,7 @@
     "<extra_id_98>",
     "<extra_id_99>"
   ],
-  "clean_up_tokenization_spaces": true,
+  "clean_up_tokenization_spaces": false,
   "eos_token": "</s>",
   "extra_ids": 100,
   "model_max_length": 512,