BytesofSurajm committed · Commit 7208a45 · verified · 1 Parent(s): 956c531

Upload 2 files

Files changed (2)
  1. requirements.txt +6 -0
  2. sesame_csm_(1b)_tts.py +373 -0
requirements.txt ADDED
@@ -0,0 +1,6 @@
+ gradio
+ torch
+ transformers
+ soundfile
+ librosa
+ unsloth
sesame_csm_(1b)_tts.py ADDED
@@ -0,0 +1,373 @@
+ # -*- coding: utf-8 -*-
+ """Sesame_CSM_(1B)-TTS.ipynb
+
+ Automatically generated by Colab.
+
+ Original file is located at
+     https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Sesame_CSM_(1B)-TTS.ipynb
+
+ To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
+ <div class="align-center">
+ <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+ <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+ <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
+ </div>
+
+ To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).
+
+ You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)
+
+ ### News
+
+ Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).
+
+ [gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!
+
+ Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.
+
+ Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).
+
+ Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).
+
+ ### Installation
+ """
+
+ # Commented out IPython magic to ensure Python compatibility.
+ # %%capture
+ # import os, re
+ # if "COLAB_" not in "".join(os.environ.keys()):
+ #     !pip install unsloth
+ # else:
+ #     # Do this only in Colab notebooks! Otherwise use pip install unsloth
+ #     import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
+ #     xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
+ #     !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
+ #     !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
+ #     !pip install --no-deps unsloth
+ # !pip install transformers==4.52.3
+ # !pip install --no-deps trl==0.22.2
+
+ """### Unsloth
+
+ `FastModel` supports loading nearly any model now! This includes Vision and Text models!
+ """
+
+ from unsloth import FastModel
+ from transformers import CsmForConditionalGeneration
+ import torch
+
+ model, processor = FastModel.from_pretrained(
+     model_name = "unsloth/csm-1b",
+     max_seq_length = 2048, # Choose any for long context!
+     dtype = None, # Leave as None for auto-detection
+     auto_model = CsmForConditionalGeneration,
+     load_in_4bit = False, # Select True for 4bit - reduces memory usage
+ )
+
+ """We now add LoRA adapters so we only need to update 1 to 10% of all parameters!"""
+
+ model = FastModel.get_peft_model(
+     model,
+     r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
+     target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
+                       "gate_proj", "up_proj", "down_proj",],
+     lora_alpha = 32,
+     lora_dropout = 0, # Supports any, but = 0 is optimized
+     bias = "none",    # Supports any, but = "none" is optimized
+     # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
+     use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
+     random_state = 3407,
+     use_rslora = False,  # We support rank stabilized LoRA
+     loftq_config = None, # And LoftQ
+ )
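As a rough check on the "1 to 10%" figure, the weights a LoRA adapter adds to a single linear layer can be counted by hand. The sketch below assumes the standard LoRA factorization (one `r x d_in` matrix and one `d_out x r` matrix per targeted layer). The 2048-wide layer is a hypothetical size chosen for illustration, not a dimension read from `unsloth/csm-1b`, and both helper names are made up for this example.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, r: int) -> float:
    """Adapter weights relative to the frozen (d_out x d_in) base weight."""
    return lora_trainable_params(d_in, d_out, r) / (d_in * d_out)


# Hypothetical square projection layer, used only to illustrate the arithmetic.
d_in = d_out = 2048
added = lora_trainable_params(d_in, d_out, r=32)
fraction = lora_fraction(d_in, d_out, r=32)
print(f"adapter weights: {added}, fraction of base layer: {fraction}")
```

Under these assumptions the adapter adds 131072 weights, or 0.03125 of the layer, which sits inside the range quoted above. Doubling `r` doubles both numbers.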
+
+ """<a name="Data"></a>
+ ### Data Prep
+
+ We will use the `MrDragonFox/Elise` dataset, which is designed for training TTS models. Ensure that your dataset follows the required format: **text, audio** for single-speaker models or **source, text, audio** for multi-speaker models. You can modify this section to accommodate your own dataset, but maintaining the correct structure is essential for optimal training.
+ """
+
+ #@title Dataset Prep functions
+ from datasets import load_dataset, Audio, Dataset
+ import os
+ from transformers import AutoProcessor
+ processor = AutoProcessor.from_pretrained("unsloth/csm-1b")
+
+ raw_ds = load_dataset("MrDragonFox/Elise", split="train")
+
+ # Getting the speaker id is important for multi-speaker models and speaker consistency
+ speaker_key = "source"
+ if "source" not in raw_ds.column_names and "speaker_id" not in raw_ds.column_names:
+     print("Unsloth: No speaker found, adding default \"source\" of 0 for all examples")
+     new_column = ["0"] * len(raw_ds)
+     raw_ds = raw_ds.add_column("source", new_column)
+ elif "source" not in raw_ds.column_names and "speaker_id" in raw_ds.column_names:
+     speaker_key = "speaker_id"
+
+ target_sampling_rate = 24000
+ raw_ds = raw_ds.cast_column("audio", Audio(sampling_rate=target_sampling_rate))
+
+ def preprocess_example(example):
+     conversation = [
+         {
+             "role": str(example[speaker_key]),
+             "content": [
+                 {"type": "text", "text": example["text"]},
+                 {"type": "audio", "path": example["audio"]["array"]},
+             ],
+         }
+     ]
+
+     try:
+         model_inputs = processor.apply_chat_template(
+             conversation,
+             tokenize=True,
+             return_dict=True,
+             output_labels=True,
+             text_kwargs = {
+                 "padding": "max_length", # pad to the max_length
+                 "max_length": 256, # this should be the max length of audio
+                 "pad_to_multiple_of": 8,
+                 "padding_side": "right",
+             },
+             audio_kwargs = {
+                 "sampling_rate": 24_000,
+                 "max_length": 240001, # max input_values length of the whole dataset
+                 "padding": "max_length",
+             },
+             common_kwargs = {"return_tensors": "pt"},
+         )
+     except Exception as e:
+         print(f"Error processing example with text '{example['text'][:50]}...': {e}")
+         return None
+
+     required_keys = ["input_ids", "attention_mask", "labels", "input_values", "input_values_cutoffs"]
+     processed_example = {}
+     # print(model_inputs.keys())
+     for key in required_keys:
+         if key not in model_inputs:
+             print(f"Warning: Required key '{key}' not found in processor output for example.")
+             return None
+
+         value = model_inputs[key][0]
+         processed_example[key] = value
+
+
+     # Final check (optional but good)
+     if not all(isinstance(processed_example[key], torch.Tensor) for key in processed_example):
+         print(f"Error: Not all required keys are tensors in final processed example. Keys: {list(processed_example.keys())}")
+         return None
+
+     return processed_example
+
+ processed_ds = raw_ds.map(
+     preprocess_example,
+     remove_columns=raw_ds.column_names,
+     desc="Preprocessing dataset",
+ )
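The audio `max_length` of 240001 is described in the script as the longest `input_values` in the whole dataset. When swapping in a different dataset, that value would need to be recomputed. The sketch below shows one way to do it from a list of clip lengths; `max_audio_samples` and `samples_to_seconds` are hypothetical helpers, and the example lengths are made up rather than measured from `MrDragonFox/Elise`.

```python
from typing import Iterable


def max_audio_samples(lengths: Iterable[int]) -> int:
    """Longest clip, in samples, to use as the audio padding length."""
    return max(lengths)


def samples_to_seconds(n_samples: int, sampling_rate: int = 24_000) -> float:
    """Convert a sample count to seconds at the given sampling rate."""
    return n_samples / sampling_rate


# Made-up clip lengths in samples; with a real dataset these would come from
# something like: (len(ex["audio"]["array"]) for ex in raw_ds)
clip_lengths = [96_000, 240_001, 48_000]
longest = max_audio_samples(clip_lengths)
print(longest, round(samples_to_seconds(longest), 2))
```

At 24 kHz, 240001 samples is just over 10 seconds, so clips much longer than that would need a larger padding length.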
+
+ """<a name="Train"></a>
+ ### Train the model
+ Now let's use Hugging Face `Trainer`! More docs here: [Transformers docs](https://huggingface.co/docs/transformers/main_classes/trainer). We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`.
+ """
+
+ from transformers import TrainingArguments, Trainer
+ from unsloth import is_bfloat16_supported
+
+ trainer = Trainer(
+     model = model,
+     train_dataset = processed_ds,
+     args = TrainingArguments(
+         per_device_train_batch_size = 2,
+         gradient_accumulation_steps = 4,
+         warmup_steps = 5,
+         max_steps = 60,
+         learning_rate = 2e-4,
+         fp16 = not is_bfloat16_supported(),
+         bf16 = is_bfloat16_supported(),
+         logging_steps = 1,
+         optim = "adamw_8bit",
+         weight_decay = 0.01, # Turn this on if overfitting
+         lr_scheduler_type = "linear",
+         seed = 3407,
+         output_dir = "outputs",
+         report_to = "none", # Use this for WandB etc
+     ),
+ )
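To see how much data the 60-step run actually covers, the batch settings above can be multiplied out. This is a small worked example, assuming a single GPU (the script reads GPU stats from device 0); `effective_batch_size` and `examples_seen` are hypothetical helper names.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Examples contributing to each optimizer step."""
    return per_device * grad_accum * num_devices


def examples_seen(per_device: int, grad_accum: int, max_steps: int, num_devices: int = 1) -> int:
    """Total examples processed after max_steps optimizer steps."""
    return effective_batch_size(per_device, grad_accum, num_devices) * max_steps


# Values taken from the TrainingArguments above; num_devices=1 is an assumption.
batch = effective_batch_size(per_device=2, grad_accum=4)
total = examples_seen(per_device=2, grad_accum=4, max_steps=60)
print(batch, total)
```

With these settings each optimizer step averages gradients over 8 examples, and the 60-step run processes 480 examples in total.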
+
+ # @title Show current memory stats
+ gpu_stats = torch.cuda.get_device_properties(0)
+ start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+ max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+ print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+ print(f"{start_gpu_memory} GB of memory reserved.")
+
+ trainer_stats = trainer.train()
+
+ # @title Show final memory and time stats
+ used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+ used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+ used_percentage = round(used_memory / max_memory * 100, 3)
+ lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+ print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+ print(
+     f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+ )
+ print(f"Peak reserved memory = {used_memory} GB.")
+ print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+ print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+ print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+ """<a name="Inference"></a>
+ ### Inference
+ Let's run the model! You can change the prompts
+ """
+
+ from IPython.display import Audio, display
+ import soundfile as sf
+
+ text = "We just finished fine tuning a text to speech model... and it's pretty good!"
+ speaker_id = 0
+ inputs = processor(f"[{speaker_id}]{text}", add_special_tokens=True).to("cuda")
+ audio_values = model.generate(
+     **inputs,
+     max_new_tokens=125, # 125 tokens is 10 seconds of audio, for longer speech increase this
+     # play with these parameters to tweak results
+     # depth_decoder_top_k=0,
+     # depth_decoder_top_p=0.9,
+     # depth_decoder_do_sample=True,
+     # depth_decoder_temperature=0.9,
+     # top_k=0,
+     # top_p=1.0,
+     # temperature=0.9,
+     # do_sample=True,
+     #########################################################
+     output_audio=True
+ )
+ audio = audio_values[0].to(torch.float32).cpu().numpy()
+ sf.write("example_without_context.wav", audio, 24000)
+ display(Audio(audio, rate=24000))
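The comment on `max_new_tokens` gives a conversion rate: 125 tokens for 10 seconds of audio, or 12.5 tokens per second. A minimal sketch for picking a token budget from a target duration follows; `max_new_tokens_for` is a hypothetical helper, and the rate is taken from that comment rather than from the model config.

```python
import math

# From the script's comment: 125 tokens correspond to 10 seconds of audio.
TOKENS_PER_SECOND = 125 / 10


def max_new_tokens_for(seconds: float) -> int:
    """Token budget for roughly `seconds` of generated audio, rounded up."""
    return math.ceil(seconds * TOKENS_PER_SECOND)


for duration in (1, 10, 20):
    print(duration, max_new_tokens_for(duration))
```

The result can be passed as `max_new_tokens` in the `generate` call. Rounding up keeps the tail of the utterance from being cut when the duration is not a multiple of 0.08 seconds.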
+
+ text = "Sesame is a super cool TTS model which can be fine tuned with Unsloth."
+
+ speaker_id = 0
+ # Another equivalent way to prepare the inputs
+ conversation = [
+     {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
+ ]
+ audio_values = model.generate(
+     **processor.apply_chat_template(
+         conversation,
+         tokenize=True,
+         return_dict=True,
+     ).to("cuda"),
+     max_new_tokens=125, # 125 tokens is 10 seconds of audio, for longer speech increase this
+     # play with these parameters to tweak results
+     # depth_decoder_top_k=0,
+     # depth_decoder_top_p=0.9,
+     # depth_decoder_do_sample=True,
+     # depth_decoder_temperature=0.9,
+     # top_k=0,
+     # top_p=1.0,
+     # temperature=0.9,
+     # do_sample=True,
+     #########################################################
+     output_audio=True
+ )
+ audio = audio_values[0].to(torch.float32).cpu().numpy()
+ # Separate filename so the output of the first example is not overwritten
+ sf.write("example_without_context_chat_template.wav", audio, 24000)
+ display(Audio(audio, rate=24000))
+
+ """#### Voice and style consistency
+
+ Sesame CSM's power comes from providing audio context for each speaker. Let's pass a sample utterance from our dataset to ground speaker identity and style.
+ """
+
+ speaker_id = 0
+
+ utterance = raw_ds[3]["audio"]["array"]
+ utterance_text = raw_ds[3]["text"]
+ text = "Sesame is a super cool TTS model which can be fine tuned with Unsloth."
+
+ # CSM will fill in the audio for the last text.
+ # You can even provide a conversation history back in as you generate new audio
+
+ conversation = [
+     {"role": str(speaker_id), "content": [{"type": "text", "text": utterance_text},{"type": "audio", "path": utterance}]},
+     {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
+ ]
+
+ inputs = processor.apply_chat_template(
+     conversation,
+     tokenize=True,
+     return_dict=True,
+ )
+ audio_values = model.generate(
+     **inputs.to("cuda"),
+     max_new_tokens=125, # 125 tokens is 10 seconds of audio, for longer speech increase this
+     # play with these parameters to tweak results
+     # depth_decoder_top_k=0,
+     # depth_decoder_top_p=0.9,
+     # depth_decoder_do_sample=True,
+     # depth_decoder_temperature=0.9,
+     # top_k=0,
+     # top_p=1.0,
+     # temperature=0.9,
+     # do_sample=True,
+     #########################################################
+     output_audio=True
+ )
+ audio = audio_values[0].to(torch.float32).cpu().numpy()
+ sf.write("example_with_context.wav", audio, 24000)
+ display(Audio(audio, rate=24000))
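The comment about feeding conversation history back in suggests a small helper for assembling the prompt. The sketch below builds the same message structure used above from a list of earlier `(text, audio)` pairs plus the new text to synthesize. `build_conversation` is a hypothetical helper, not part of Unsloth or Transformers, and it assumes a single speaker for all turns.

```python
from typing import Any, Dict, List, Sequence, Tuple


def build_conversation(
    speaker_id: int,
    history: Sequence[Tuple[str, Any]],
    text: str,
) -> List[Dict[str, Any]]:
    """Context turns carry text and audio; the final turn carries text only."""
    conversation: List[Dict[str, Any]] = []
    for past_text, past_audio in history:
        conversation.append({
            "role": str(speaker_id),
            "content": [
                {"type": "text", "text": past_text},
                {"type": "audio", "path": past_audio},
            ],
        })
    # The model is expected to generate audio for this last, text-only turn.
    conversation.append({
        "role": str(speaker_id),
        "content": [{"type": "text", "text": text}],
    })
    return conversation


# Placeholder audio (a short list of floats) stands in for a real waveform array.
demo = build_conversation(0, [("Hello there.", [0.0, 0.1, 0.0])], "How are you?")
print(len(demo), demo[-1]["content"])
```

In the script, `history` would hold pairs such as `(utterance_text, utterance)`, and the returned list would be passed to `processor.apply_chat_template` exactly as `conversation` is above.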
+
+ """<a name="Save"></a>
+ ### Saving, loading finetuned models
+ To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
+
+ **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save merged 16bit or 4bit weights, see the next section!
+ """
+
+ model.save_pretrained("lora_model")  # Local saving
+ processor.save_pretrained("lora_model")
+ # model.push_to_hub("your_name/lora_model", token = "...") # Online saving
+ # processor.push_to_hub("your_name/lora_model", token = "...") # Online saving
+
+ """### Saving to float16
+
+ We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
+ """
+
+ # Merge to 16bit
+ if False: model.save_pretrained_merged("model", processor, save_method = "merged_16bit",)
+ if False: model.push_to_hub_merged("hf/model", processor, save_method = "merged_16bit", token = "")
+
+ # Merge to 4bit
+ if False: model.save_pretrained_merged("model", processor, save_method = "merged_4bit",)
+ if False: model.push_to_hub_merged("hf/model", processor, save_method = "merged_4bit", token = "")
+
+ # Just LoRA adapters
+ if False:
+     model.save_pretrained("model")
+     processor.save_pretrained("model")
+ if False:
+     model.push_to_hub("hf/model", token = "")
+     processor.push_to_hub("hf/model", token = "")
+
+ """And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our Discord!
+
+ Some other links:
+ 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+ 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+ 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+ 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!
+
+ <div class="align-center">
+ <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+ <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+ <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+
+ Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐️
+ </div>
+
+ """