CSM Maya TTS

This is a finetune of sesame/csm-1b that sounds like Maya from the official demo.

Try it out at the TinkerSpace HF Space.

Samples

Inference

Use speaker_id = 4 only

import torch

from peft import PeftModel
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base CSM-1B model, then attach the Maya LoRA adapter
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
model = PeftModel.from_pretrained(model, "shb777/csm-maya-exp2")

conversation = [
    {"role": "4", "content": [{"type": "text", "text": "Hey there, I am Maya."}]},
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

gen_kwargs = {
    "max_new_tokens": 375,
    # "do_sample": True,
    # "temperature": 0.7,
    # "depth_decoder_do_sample": True,
    # "depth_decoder_temperature": 0.7,
    # "depth_decoder_top_k": 20,
    # "depth_decoder_top_p": 0.95,
}

# Generate audio tokens, decode them to a waveform, and save as WAV
audio = model.generate(**inputs, **gen_kwargs, output_audio=True)
processor.save_audio(audio, "example.wav")

Training

Raw data was processed using a 5-step custom pipeline modeled on Emilia-Pipe.

  • Parakeet v2 was used for STT
  • The VAD chunking algorithm was tweaked for cleaner cuts, with each clip up to 20s
  • Chunked clips were filtered by UTMOSv2 score, with additional filtering to remove clips with artifacts
  • About 30% of the collected data (around 31 hours) was used for training
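The chunking step above can be sketched as a greedy merge over VAD output. This is a minimal illustration, assuming VAD segments arrive as (start, end) tuples in seconds; the 20s cap comes from the pipeline description, while the gap threshold, function name, and the omission of the actual VAD model, Parakeet STT, and UTMOSv2 scoring are my simplifications.

```python
def merge_vad_segments(segments, max_clip_s=20.0, max_gap_s=0.5):
    """Greedily merge adjacent VAD speech segments into clips of at most
    max_clip_s seconds, starting a new clip when the silence gap is long."""
    clips = []
    cur_start, cur_end = segments[0]
    for start, end in segments[1:]:
        gap = start - cur_end
        if gap <= max_gap_s and end - cur_start <= max_clip_s:
            cur_end = end  # extend the current clip across a short pause
        else:
            clips.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    clips.append((cur_start, cur_end))
    return clips

segments = [(0.0, 4.0), (4.2, 9.0), (9.1, 21.0), (25.0, 28.0)]
print(merge_vad_segments(segments))  # [(0.0, 9.0), (9.1, 21.0), (25.0, 28.0)]
```

Here the third segment is not merged because doing so would exceed the 20s cap, and the last one starts a new clip because of the long silence gap.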

I have another chunking algorithm that uses stable-whisper for forced alignment and produces a better mixture of small and large (up to 30s) clips, but it is too slow to run locally. I will leave that, along with the full data, to a future training run on the cloud.

Some observations:

  • Inconsistent voice with the same speaker ID (expected, as it's a base model)
  • Noise at the end of generated audio (reduces with finetuning, especially with longer clips)
  • Speaker IDs 40 and above seem bad
  • Struggles with characters like ( ) " ; ?! [ ] / (also seen in the official demo; I guess this is due to the nature of Sesame's preprocessing)
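Given the punctuation issues above, one simple workaround (my suggestion, not part of the original pipeline) is to normalize the problem characters before synthesis:

```python
import re

def normalize_for_tts(text):
    """Replace characters CSM tends to mispronounce with safer equivalents."""
    text = text.replace(";", ",").replace("?!", "?")  # soften rare punctuation
    text = re.sub(r'[\[\]()"/]', " ", text)           # drop brackets, quotes, slashes
    text = re.sub(r"\s+", " ", text).strip()          # collapse leftover whitespace
    return re.sub(r" ([,.?!])", r"\1", text)          # no space before punctuation

print(normalize_for_tts('Hey (there) "Maya"; ready?!'))  # Hey there Maya, ready?
```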

I ran several ablations with about 4 hours of data to find the best parameters and understand more about the model.

  • Framework: Unsloth (SFT)
  • LoRA Target: attn + mlp in backbone and decoder excluding codec
  • LoRA Rank: 16
  • LoRA Alpha: 32
  • Learning Rate: 1e-4, 0.1 warmup ratio with cosine scheduler
  • Optimizer: adamw_torch_fused
  • Epochs: 4
  • Batch Size: 8
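For reference, the warmup + cosine schedule above can be written out explicitly. This is a generic sketch of a standard linear-warmup cosine-decay schedule using the stated peak LR (1e-4) and warmup ratio (0.1), not code extracted from the actual training run.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-4, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

total = 1000
print(lr_at(50, total))    # mid-warmup: 5e-05
print(lr_at(100, total))   # peak: 1e-04
print(lr_at(1000, total))  # end of training: 0
```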

Limitations

  • The real strength of the model (and the reason it was designed) is multi-turn conversation with audio context. Since most of the training data was single-turn, it may not generalize to multi-turn settings as well as a model trained on full-duplex conversational data.
  • The model struggles with certain characters.
  • There might be some noise at the end of some generated clips.
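For multi-turn use with audio context, the chat template accepts prior turns alongside the text to synthesize. Below is a sketch of the conversation structure only; the audio entry format follows the transformers CSM docs as I understand them, and the file path is a placeholder, not a real asset.

```python
# Earlier turns (text + audio) give the model conversational context;
# the final text-only turn is the one to synthesize.
conversation = [
    {"role": "4", "content": [
        {"type": "text", "text": "Hey there, I am Maya."},
        {"type": "audio", "path": "previous_turn.wav"},  # placeholder path
    ]},
    {"role": "4", "content": [
        {"type": "text", "text": "Let me tell you more."},
    ]},
]
```

This list would then be passed through processor.apply_chat_template and model.generate exactly as in the inference snippet above.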

Acknowledgements

Sesame's own blog

License

This model is meant for research and personal use only. The license follows from the source of the training data.
