CSM Maya TTS

This is a finetune of sesame/csm-1b that sounds like Maya from the official demo.

Try it out at the TinkerSpace HF Space.

Samples

Inference

Use speaker_id = 4 only

import torch

from peft import PeftModel
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base CSM-1B model, then attach the Maya LoRA adapter
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
model = PeftModel.from_pretrained(model, "shb777/csm-maya-exp2")

conversation = [
    {"role": "4", "content": [{"type": "text", "text": "Hey there, I am Maya."}]},
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

gen_kwargs = {
    "max_new_tokens": 375,
    # "do_sample": True,
    # "temperature": 0.7,
    # "depth_decoder_do_sample": True,
    # "depth_decoder_temperature": 0.7,
    # "depth_decoder_top_k": 20,
    # "depth_decoder_top_p": 0.95,
}

# Generate audio tokens, decode them to a waveform, and save as WAV
audio = model.generate(**inputs, **gen_kwargs, output_audio=True)
processor.save_audio(audio, "example.wav")

Training

Raw data was processed using a 5-step custom pipeline modeled on Emilia-Pipe.

  • Parakeet v2 was used for STT
  • The VAD chunking algorithm was tweaked for cleaner cuts, with each clip up to 20s
  • Chunked clips were filtered by UTMOSv2 score, with additional filtering to remove clips with artifacts
  • About 30% of the collected data (around 31 hours) was used for training
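The chunking step above can be sketched as a greedy merge over VAD output. This is a minimal illustration, assuming VAD segments arrive as (start, end) tuples in seconds; the 20s cap comes from the pipeline description, while the gap threshold, function name, and the omission of the actual VAD model, Parakeet STT, and UTMOSv2 scoring are my simplifications.

```python
def merge_vad_segments(segments, max_clip_s=20.0, max_gap_s=0.5):
    """Greedily merge adjacent VAD speech segments into clips of at most
    max_clip_s seconds, starting a new clip when the silence gap is long."""
    clips = []
    cur_start, cur_end = segments[0]
    for start, end in segments[1:]:
        gap = start - cur_end
        if gap <= max_gap_s and end - cur_start <= max_clip_s:
            cur_end = end  # extend the current clip across a short pause
        else:
            clips.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    clips.append((cur_start, cur_end))
    return clips

segments = [(0.0, 4.0), (4.2, 9.0), (9.1, 21.0), (25.0, 28.0)]
print(merge_vad_segments(segments))  # [(0.0, 9.0), (9.1, 21.0), (25.0, 28.0)]
```

Here the third segment is not merged because doing so would exceed the 20s cap, and the last one starts a new clip because of the long silence gap.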

I have another chunking algorithm that uses stable-whisper for forced alignment and produces a better mixture of small and large (up to 30s) clips, but it is too slow to run locally. I will leave that, along with the full data, to a future training run on the cloud.

Some observations:

  • Inconsistent voice with the same speaker ID (expected, as it's a base model)
  • Noise at the end of generated audio (reduces with finetuning, especially with longer clips)
  • Speaker IDs 40 and above seem bad
  • Struggles with characters like ( ) " ; ?! [ ] / (also seen in the official demo; I guess this is due to the nature of Sesame's preprocessing)
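Given the punctuation issues above, one simple workaround (my suggestion, not part of the original pipeline) is to normalize the problem characters before synthesis:

```python
import re

def normalize_for_tts(text):
    """Replace characters CSM tends to mispronounce with safer equivalents."""
    text = text.replace(";", ",").replace("?!", "?")  # soften rare punctuation
    text = re.sub(r'[\[\]()"/]', " ", text)           # drop brackets, quotes, slashes
    text = re.sub(r"\s+", " ", text).strip()          # collapse leftover whitespace
    return re.sub(r" ([,.?!])", r"\1", text)          # no space before punctuation

print(normalize_for_tts('Hey (there) "Maya"; ready?!'))  # Hey there Maya, ready?
```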

I ran several ablations with about 4 hours of data to find the best parameters and understand more about the model.

  • Framework: Unsloth (SFT)
  • LoRA Target: attn + mlp in backbone and decoder excluding codec
  • LoRA Rank: 16
  • LoRA Alpha: 32
  • Learning Rate: 1e-4, 0.1 warmup ratio with cosine scheduler
  • Optimizer: adamw_torch_fused
  • Epochs: 4
  • Batch Size: 8
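For reference, the warmup + cosine schedule above can be written out explicitly. This is a generic sketch of a standard linear-warmup cosine-decay schedule using the stated peak LR (1e-4) and warmup ratio (0.1), not code extracted from the actual training run.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-4, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

total = 1000
print(lr_at(50, total))    # mid-warmup: 5e-05
print(lr_at(100, total))   # peak: 1e-04
print(lr_at(1000, total))  # end of training: 0
```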

Limitations

  • The real strength of the model (and the reason it was designed) is multi-turn conversation with audio context. Since most of the training data was single-turn, it may not generalize to multi-turn settings as well as a model trained on full-duplex conversational data.
  • The model struggles with certain characters.
  • There might be some noise at the end of some generated clips.
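For multi-turn use with audio context, the chat template accepts prior turns alongside the text to synthesize. Below is a sketch of the conversation structure only; the audio entry format follows the transformers CSM docs as I understand them, and the file path is a placeholder, not a real asset.

```python
# Earlier turns (text + audio) give the model conversational context;
# the final text-only turn is the one to synthesize.
conversation = [
    {"role": "4", "content": [
        {"type": "text", "text": "Hey there, I am Maya."},
        {"type": "audio", "path": "previous_turn.wav"},  # placeholder path
    ]},
    {"role": "4", "content": [
        {"type": "text", "text": "Let me tell you more."},
    ]},
]
```

This list would then be passed through processor.apply_chat_template and model.generate exactly as in the inference snippet above.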

Acknowledgements

Sesame's own blog

License

This model is meant for research and personal use only. The license follows from the source of the training data.
