
Yoruba CFM-DiT: Text-to-Speech for Yoruba

A Conditional Flow Matching (CFM) model with a Diffusion Transformer (DiT) backbone for generating natural Yoruba speech from text. The model operates in the continuous latent space of Meta's EnCodec audio codec, learning to transform Gaussian noise into speech latents conditioned on phoneme sequences.

Quick Start

from transformers import AutoModel
from IPython.display import Audio

model = AutoModel.from_pretrained(
    "FloatinggOnion/yoruba-cfm-dit",
    trust_remote_code=True,
)

output = model.generate("Bawo ni, ẹ kú àárọ̀.")

# Play in a notebook
Audio(output["audio"].squeeze().cpu().numpy(), rate=output["sample_rate"])

Save to WAV

import wave
import numpy as np

def save_wav(path, audio_tensor, sr=24000):
    audio = audio_tensor.detach().cpu()
    if audio.dim() == 3:
        audio = audio[0]
    if audio.dim() == 2:
        audio = audio[0] if audio.shape[0] == 1 else audio.mean(dim=0)
    audio = audio.clamp(-1.0, 1.0).numpy()
    audio_i16 = (audio * 32767.0).astype(np.int16)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sr)
        wf.writeframes(audio_i16.tobytes())

save_wav("output.wav", output["audio"], sr=output["sample_rate"])

How It Works

Architecture

The model has three stages:

  1. Text Encoder -- A 4-layer Transformer encoder that converts Yoruba phoneme sequences (produced by YorubaG2P) into conditioning embeddings.
  2. Diffusion Transformer (DiT) -- 10 DiT blocks with self-attention over the latent sequence and cross-attention to the text conditioning. Sinusoidal timestep embeddings are injected via an MLP.
  3. EnCodec Decoder -- Meta's pretrained EnCodec 24kHz decoder converts the generated continuous latents back into a 24kHz audio waveform.
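
The timestep conditioning in stage 2 can be illustrated with a minimal sketch. It assumes the common sinusoidal formulation (half sines, half cosines, geometrically spaced frequencies with a base of 10000). The function name, the base, and the channel layout are illustrative assumptions and may differ from what modeling_yoruba_cfm.py actually does. In the model, a vector like this would then pass through an MLP before being injected into the DiT blocks.

```python
import math

def sinusoidal_timestep_embedding(t, dim=512, max_period=10000.0):
    """Embed a scalar flow time t as a list of `dim` floats.

    Illustrative only: the first half holds sines, the second half cosines,
    with frequencies spaced geometrically from 1 down towards 1 / max_period.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    sines = [math.sin(t * f) for f in freqs]
    cosines = [math.cos(t * f) for f in freqs]
    return sines + cosines

emb = sinusoidal_timestep_embedding(0.5)
print(len(emb))  # 512, matching the model dimension in this card
```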

Conditional Flow Matching

Instead of the standard diffusion denoising objective, this model uses Conditional Flow Matching (CFM) with a linear interpolation path:

  • Forward process: x_t = (1 - t) * x_0 + t * x_1 where x_0 ~ N(0, I) and x_1 is the target audio latent
  • The model learns to predict the velocity field v = x_1 - x_0
  • At inference, an ODE solver (Euler method, 24 steps) integrates from noise to data
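
The three bullets above can be sketched in NumPy as follows. This is a minimal illustration of the objective, not the training code: `cfm_training_pair` and `cfm_mse` are hypothetical helper names, `t` is drawn uniformly from [0, 1] (an assumption; the card does not state how `t` is sampled), and the loss is assumed to be a plain mean squared error on the velocity.

```python
import numpy as np

def cfm_training_pair(x1, rng):
    """Build one CFM training example from a target latent x1 of shape [T, D]."""
    x0 = rng.standard_normal(x1.shape)   # noise endpoint, x_0 ~ N(0, I)
    t = rng.uniform(0.0, 1.0)            # flow time (uniform sampling is an assumption)
    xt = (1.0 - t) * x0 + t * x1         # point on the linear path
    v_target = x1 - x0                   # velocity the network should predict
    return xt, t, v_target

def cfm_mse(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))

rng = np.random.default_rng(0)
x1 = rng.standard_normal((150, 128))     # stand-in for an EnCodec latent [T, 128]
xt, t, v_target = cfm_training_pair(x1, rng)

# Sanity check of the path: moving from x_t along v for the remaining time reaches x_1.
assert np.allclose(xt + (1.0 - t) * v_target, x1)
```

In training, `v_pred` would be the network output for `(xt, t, phoneme conditioning)`, and `cfm_mse(v_pred, v_target)` would be the quantity to minimize.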

Compared with score-based diffusion, this reduces training to a simple regression on velocities with no noise schedule to tune, and the near-straight sampling paths allow generation in few steps (24 by default).

Generation Pipeline

Yoruba text -> YorubaG2P -> Phoneme IDs -> TextEncoder -> conditioning
                                                          |
                              Gaussian noise -> ODE sampling (24 steps) -> latents [T, 128]
                                                                              |
                                                          EnCodec decoder -> 24kHz audio
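
The ODE sampling step in the diagram can be sketched as a fixed-step Euler loop. `euler_sample` and `velocity_fn` are hypothetical names; in the real model the velocity would come from the DiT given the current latents, the time, and the text conditioning. The toy velocity below is an oracle for one known noise/target pair, used only to show that the loop moves from the start point to the target.

```python
def euler_sample(velocity_fn, x0, num_steps=24):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps.

    x0 may be a float, a NumPy array, or a tensor; only + and * are used.
    """
    x = x0
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy check with scalars: for a single (noise, target) pair the ideal
# velocity on the linear path is the constant target - noise.
noise, target = -1.0, 3.0
result = euler_sample(lambda x, t: target - noise, noise, num_steps=24)
print(abs(result - target) < 1e-9)  # True
```

With a learned velocity field the paths are only approximately straight, so the number of steps trades speed against integration error.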

Model Details

| Parameter | Value |
| --- | --- |
| Model dimension | 512 |
| Attention heads | 8 |
| DiT blocks | 10 |
| Text encoder layers | 4 |
| Latent dimension | 128 (EnCodec) |
| Max latent length | 2048 frames |
| Phoneme vocabulary | 67 tokens |
| Total parameters | ~57M (CFM only, excludes EnCodec) |
| Sample rate | 24,000 Hz |
| Audio codec | facebook/encodec_24khz |
| ODE steps (default) | 24 |

Training

The model was trained on a single GPU using PyTorch Lightning with the following configuration:

| Setting | Value |
| --- | --- |
| Dataset | PlotweaverAI/yoruba-tts-selected-speakers + Hidi-agili/yoruba_male_dataset |
| Training steps | 120,000 |
| Batch size | 8 |
| Optimizer | AdamW (lr=2e-4, betas=(0.9, 0.95), weight_decay=1e-2) |
| Precision | Mixed (fp16) |
| Gradient clipping | 1.0 |
| EMA decay | 0.999 |
| Checkpoints | Every 5,000 steps |
| Platform | Kaggle (single GPU) |

The released weights are the Exponential Moving Average (EMA) of the model parameters, which typically gives more stable and higher-quality outputs than the raw training weights.

Pre-encoded Latents

Audio from the training datasets was pre-encoded into continuous EnCodec latents (one tensor of shape [T, 128] per sample) and stored as .pt files. The latents are available at FloatinggOnion/yoruba-cfm-latents.

Finetuning

You can finetune this model on additional Yoruba speech data:

from transformers import AutoModel
import copy, torch

# Load pretrained
pretrained = AutoModel.from_pretrained("FloatinggOnion/yoruba-cfm-dit", trust_remote_code=True)
cfm_model = pretrained.cfm.to("cuda")

# Set up EMA
ema_model = copy.deepcopy(cfm_model).eval()
for p in ema_model.parameters():
    p.requires_grad = False

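# Train with a lower LR than pretraining (5e-5 instead of 2e-4)
optimizer = torch.optim.AdamW(cfm_model.parameters(), lr=5e-5, betas=(0.9, 0.95))
cfm_model.train()  # from_pretrained() returns the model in eval mode

# `your_dataloader` and `cfm_loss` are placeholders: supply your own DataLoader
# and the same CFM loss used in pretraining (see the training notebook)
for batch in your_dataloader:
    loss = cfm_loss(cfm_model, batch)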
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Update EMA
    with torch.no_grad():
        for ep, mp in zip(ema_model.parameters(), cfm_model.parameters()):
            ep.mul_(0.999).add_(mp, alpha=0.001)

# Save finetuned model
pretrained.cfm.load_state_dict(ema_model.state_dict())
pretrained.save_pretrained("./finetuned-yoruba-cfm")

New data must be encoded with the same EnCodec model (facebook/encodec_24khz) and phonemized with YorubaG2P using the same vocabulary. See the training notebook for the full data preparation and finetuning pipeline.

generate() API

output = model.generate(
    text="Bawo ni",           # Raw Yoruba text (uses YorubaG2P internally)
    # phoneme_ids=tensor,     # Or pass pre-computed phoneme IDs [1, L]
    num_latent_frames=150,    # Target duration in EnCodec frames (default: 150)
    num_ode_steps=24,         # ODE solver steps (default: 24; more steps are slower but may improve quality)
)

output["audio"]        # torch.Tensor -- waveform
output["sample_rate"]  # int -- 24000

Text input requires yoruba-g2p (pip install yoruba-g2p). Pass phoneme_ids directly to skip this dependency.
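
Because `num_latent_frames` fixes the output length, it helps to convert between frames and seconds. The sketch below assumes the 75 frames-per-second latent rate of the 24 kHz EnCodec model (a hop of 320 samples); the helper names are hypothetical and not part of this repository.

```python
import math

ENCODEC_FRAME_RATE_HZ = 75   # assumed: 24,000 Hz / 320-sample hop
MAX_LATENT_FRAMES = 2048     # "Max latent length" from Model Details

def seconds_to_frames(seconds):
    """Latent frames needed to cover `seconds` of audio, clamped to [1, MAX_LATENT_FRAMES]."""
    frames = math.ceil(seconds * ENCODEC_FRAME_RATE_HZ)
    return max(1, min(frames, MAX_LATENT_FRAMES))

def frames_to_seconds(frames):
    """Duration in seconds of `frames` latent frames."""
    return frames / ENCODEC_FRAME_RATE_HZ

print(frames_to_seconds(150))   # 2.0 (the default num_latent_frames)
print(seconds_to_frames(3.5))   # 263
```

For example, `model.generate(text="...", num_latent_frames=seconds_to_frames(3.5))` would request roughly 3.5 seconds of audio. The right duration for a given sentence still has to be estimated by the caller.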

Dependencies

torch>=2.4
transformers>=4.40
safetensors
huggingface_hub
yoruba-g2p    # for text input (optional if passing phoneme_ids)
epitran       # required by yoruba-g2p

Files in This Repository

| File | Description |
| --- | --- |
| config.json | Model configuration (hyperparameters, auto_map) |
| model.safetensors | Pretrained EMA weights (safetensors format) |
| phoneme_vocab.json | Phoneme-to-ID mapping (67 tokens) |
| modeling_yoruba_cfm.py | Model implementation (YorubaCFMForTTS) |
| configuration_yoruba_cfm.py | Config class (YorubaCFMConfig) |
| yoruba_cfm_ema_weights.pt | Legacy EMA weights (raw PyTorch format) |
| yoruba_cfm_last.ckpt | Legacy Lightning checkpoint |

Limitations

  • Trained on a small number of speakers (see Training Data); voice diversity is limited
  • No explicit duration or prosody control
  • Audio quality depends on the EnCodec decoder, which can introduce artifacts at boundaries
  • The model generates a fixed number of latent frames: text that is short relative to the requested frame count may end in silence, and text that is too long for it may be truncated

Acknowledgements and Citations

Training Data

This model was trained on:

@dataset{plotweaverai_yoruba_tts,
  author = {PlotweaverAI},
  title = {Yoruba TTS Selected Speakers},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/PlotweaverAI/yoruba-tts-selected-speakers}
}

@dataset{hidi_agili_yoruba_male,
  author = {Hidi-agili},
  title = {Yoruba Male Dataset},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/Hidi-agili/yoruba_male_dataset}
}

EnCodec

Audio encoding and decoding use Meta's EnCodec neural audio codec:

@article{defossez2022encodec,
  title={High Fidelity Neural Audio Compression},
  author={D{\'e}fossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  journal={arXiv preprint arXiv:2210.13438},
  year={2022}
}

Conditional Flow Matching

The training objective is based on Flow Matching for Generative Modeling:

@article{lipman2023flow,
  title={Flow Matching for Generative Modeling},
  author={Lipman, Yaron and Chen, Ricky T. Q. and Ben-Hamu, Heli and Nickel, Maximilian and Le, Matt},
  journal={arXiv preprint arXiv:2210.02747},
  year={2023}
}

YorubaG2P

Yoruba text is converted to phonemes with the yoruba-g2p library.

License

Apache 2.0
