CSM Maya TTS
This is a fine-tuned sesame/csm-1b that sounds like the demo.
Try it out at TinkerSpace HF Space.
Use `speaker_id=4` only.
```python
import torch
from peft import PeftModel
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
model = PeftModel.from_pretrained(model, "shb777/csm-maya-exp2")

conversation = [
    {"role": "4", "content": [{"type": "text", "text": "Hey there, I am Maya."}]},
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

gen_kwargs = {
    "max_new_tokens": 375,
    # "do_sample": True,
    # "temperature": 0.7,
    # "depth_decoder_do_sample": True,
    # "depth_decoder_temperature": 0.7,
    # "depth_decoder_top_k": 20,
    # "depth_decoder_top_p": 0.95,
}

audio = model.generate(**inputs, **gen_kwargs, output_audio=True)
processor.save_audio(audio, "example.wav")
```
Raw data was processed using a 5-step, Emilia-Pipe-like custom pipeline.
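The five steps aren't spelled out here; Emilia-Pipe-style pipelines typically chain standardization, source separation, VAD-based segmentation, ASR, and filtering. A minimal toy sketch of that shape (the step implementations below are stand-ins I wrote for illustration, not the actual pipeline — a real run would use a separation model and Whisper-class ASR):

```python
# Toy sketch of a 5-step Emilia-Pipe-style preprocessing pipeline.
# Each step is a simplified stand-in for the real component.
import numpy as np

SR = 16_000  # assumed sample rate

def standardize(audio: np.ndarray) -> np.ndarray:
    """Step 1: peak-normalize the waveform."""
    peak = np.max(np.abs(audio)) or 1.0
    return audio / peak * 0.95

def separate_vocals(audio: np.ndarray) -> np.ndarray:
    """Step 2: source separation (identity stand-in for a UVR-style model)."""
    return audio

def segment(audio: np.ndarray, frame: int = 400, thresh: float = 0.01) -> list[np.ndarray]:
    """Step 3: crude energy-based VAD -> list of speech clips."""
    clips, cur = [], []
    for i in range(0, len(audio), frame):
        chunk = audio[i:i + frame]
        if np.sqrt(np.mean(chunk ** 2)) > thresh:
            cur.append(chunk)
        elif cur:
            clips.append(np.concatenate(cur))
            cur = []
    if cur:
        clips.append(np.concatenate(cur))
    return clips

def transcribe(clip: np.ndarray) -> str:
    """Step 4: ASR stand-in (a real pipeline would call an ASR model here)."""
    return "<transcript>"

def filter_clips(pairs, min_len=0.5, max_len=30.0):
    """Step 5: keep only clips within the target duration range."""
    return [(a, t) for a, t in pairs if min_len <= len(a) / SR <= max_len]
```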
I have another chunking algorithm that uses stable-whisper for forced alignment and produces a better mixture of small and large (up to 30 s) clips, but it is too slow to run locally. I will leave that, along with the full data, to a future training run on the cloud.
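The alignment-based chunking idea can be sketched as a greedy packer over word timings: given `(start, end, text)` tuples (e.g. from forced alignment), merge consecutive words into a clip until a long pause or the 30 s cap forces a split. The specific rules and thresholds below are my assumptions, not the author's algorithm:

```python
# Greedy packing of word-level alignment timings into clips.
# Splits at pauses >= `pause` seconds or when a clip would exceed `max_len`.
def pack_clips(words, max_len=30.0, pause=0.5):
    clips, cur = [], []
    cur_start = prev_end = None
    for start, end, text in words:
        new_gap = prev_end is not None and start - prev_end >= pause
        too_long = cur_start is not None and end - cur_start > max_len
        if cur and (new_gap or too_long):
            clips.append((cur_start, prev_end, " ".join(cur)))
            cur, cur_start = [], None
        if cur_start is None:
            cur_start = start
        cur.append(text)
        prev_end = end
    if cur:
        clips.append((cur_start, prev_end, " ".join(cur)))
    return clips
```

Because short runs between pauses stay short while long fluent runs grow toward the cap, this naturally yields a mixture of small and large clips.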
Some observations:
- 40 onwards seems bad.
- The characters ( ) " ; ?! [ ] and / cause problems (also seen in the official demo; I guess this is due to the nature of Sesame's preprocessing).

I ran several ablations with about 4 hours of data to find the best parameters and understand more about the model.
Training setup:
- LoRA on attn + mlp modules in the backbone and decoder, excluding the codec
- rank 16, alpha 32
- learning rate 1e-4, 0.1 warmup with a cosine scheduler
- optimizer: adamw_torch_fused
- batch size: 48

References: Sesame's own blog
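Mapped onto the usual peft / HF Trainer argument names, that setup might look like the following. The target module names and the reading of each number are my guesses from the summary above, not confirmed values from the training script:

```python
# Hypothetical mapping of the fine-tuning hyperparameters onto
# peft LoraConfig / transformers TrainingArguments keyword names.
lora_config = {
    "r": 16,                 # LoRA rank
    "lora_alpha": 32,
    "target_modules": [      # attn + mlp; projection names are assumed
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}
training_args = {
    "learning_rate": 1e-4,
    "warmup_ratio": 0.1,
    "lr_scheduler_type": "cosine",
    "optim": "adamw_torch_fused",
}
```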
This is meant for research and personal use only; the license follows from the source of the training data.