Yoruba CFM-DiT: Text-to-Speech for Yoruba

A Conditional Flow Matching (CFM) model with a Diffusion Transformer (DiT) backbone for generating natural Yoruba speech from text. The model operates in the continuous latent space of Meta's EnCodec audio codec, learning to transform Gaussian noise into speech latents conditioned on phoneme sequences.

Quick Start

from transformers import AutoModel
from IPython.display import Audio

model = AutoModel.from_pretrained(
    "FloatinggOnion/yoruba-cfm-dit",
    trust_remote_code=True,
)

output = model.generate("Bawo ni, ẹ kú àárọ̀.")

# Play in a notebook
Audio(output["audio"].squeeze().cpu().numpy(), rate=output["sample_rate"])

Save to WAV

import wave
import numpy as np

def save_wav(path, audio_tensor, sr=24000):
    audio = audio_tensor.detach().cpu()
    if audio.dim() == 3:
        audio = audio[0]
    if audio.dim() == 2:
        audio = audio[0] if audio.shape[0] == 1 else audio.mean(dim=0)
    audio = audio.clamp(-1.0, 1.0).numpy()
    audio_i16 = (audio * 32767.0).astype(np.int16)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sr)
        wf.writeframes(audio_i16.tobytes())

save_wav("output.wav", output["audio"], sr=output["sample_rate"])

How It Works

Architecture

The model has three stages:

Text Encoder -- A 4-layer Transformer encoder that converts Yoruba phoneme sequences (produced by YorubaG2P) into conditioning embeddings.
Diffusion Transformer (DiT) -- 10 DiT blocks with self-attention over the latent sequence and cross-attention to the text conditioning. Sinusoidal timestep embeddings are injected via an MLP.
EnCodec Decoder -- Meta's pretrained EnCodec 24kHz decoder converts the generated continuous latents back into a 24kHz audio waveform.
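
The card says only that sinusoidal timestep embeddings are injected via an MLP; it does not give the formula. The sketch below shows the common Transformer-style sinusoidal embedding as one plausible reading. The function name, the `max_period` constant, and the sine-then-cosine layout are assumptions, and the 512 default simply mirrors the model dimension listed under Model Details.

```python
import math

def sinusoidal_embedding(t: float, dim: int = 512, max_period: float = 10000.0) -> list[float]:
    """Map a scalar flow time t to a dim-dimensional sinusoidal vector (assumed formula)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometric frequency ladder starting at 1 and decaying towards 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    # First half holds sines, second half holds cosines.
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = sinusoidal_embedding(0.5)
print(len(emb))  # 512
```

In the model, a vector like this would then pass through the MLP before reaching the DiT blocks.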

Conditional Flow Matching

Instead of the standard diffusion denoising objective, this model uses Conditional Flow Matching (CFM) with a linear interpolation path:

Forward process: x_t = (1 - t) * x_0 + t * x_1 where x_0 ~ N(0, I) and x_1 is the target audio latent
The model learns to predict the velocity field v = x_1 - x_0
At inference, an ODE solver (Euler method, 24 steps) integrates from noise to data

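Training reduces to a plain regression on the velocity, which is simpler and, in our experience, more stable than a score-based diffusion objective. Because the interpolation paths are straight lines, sampling also needs only a few steps (24 Euler steps by default).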
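
The velocity target follows directly from the linear path. The derivation below uses the standard flow-matching regression objective as an assumption about the exact loss; the symbols v_θ (the network), c (the text conditioning), and the expectation over t are notation introduced here, not taken from the training code.

```latex
\begin{aligned}
x_t &= (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I),\quad t \in [0, 1] \\
\frac{\mathrm{d}x_t}{\mathrm{d}t} &= x_1 - x_0 = v \\
\mathcal{L}(\theta) &= \mathbb{E}_{t,\,x_0,\,x_1}\,
  \bigl\lVert v_\theta(x_t, t, c) - (x_1 - x_0) \bigr\rVert^2
\end{aligned}
```

Since the target velocity does not depend on t, each sample pair (x_0, x_1) defines a straight line, which is why a coarse Euler discretization can work at inference.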

Generation Pipeline

Yoruba text -> YorubaG2P -> Phoneme IDs -> TextEncoder -> conditioning
                                                          |
                              Gaussian noise -> ODE sampling (24 steps) -> latents [T, 128]
                                                                              |
                                                          EnCodec decoder -> 24kHz audio
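
The ODE sampling stage of the diagram can be sketched with a fixed-step Euler loop. This is a toy illustration, not the repository's sampler: `euler_sample` and `toy_velocity` are hypothetical names, the state is a 3-element list instead of a [T, 128] latent tensor, and the velocity function is an analytic stand-in for the trained DiT.

```python
import random

def euler_sample(velocity_fn, x0, num_steps=24):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-in for the trained network: when every training target is the same
# point, the exact velocity field of the linear path is (target - x) / (1 - t).
target = [0.3, -1.2, 0.8]

def toy_velocity(x, t):
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(toy_velocity, noise)
print([round(s, 6) for s in sample])  # [0.3, -1.2, 0.8]
```

With a learned, text-conditioned velocity field the paths are only approximately straight, so the step count becomes a speed/accuracy trade-off rather than an exact solve as in this toy case.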

Model Details

Parameter	Value
Model dimension	512
Attention heads	8
DiT blocks	10
Text encoder layers	4
Latent dimension	128 (EnCodec)
Max latent length	2048 frames
Phoneme vocabulary	67 tokens
Total parameters	~57M (CFM only, excludes EnCodec)
Sample rate	24,000 Hz
Audio codec	facebook/encodec_24khz
ODE steps (default)	24

Training

The model was trained on a single GPU using PyTorch Lightning with the following configuration:

Setting	Value
Dataset	PlotweaverAI/yoruba-tts-selected-speakers + Hidi-agili/yoruba_male_dataset
Training steps	120,000
Batch size	8
Optimizer	AdamW (lr=2e-4, betas=(0.9, 0.95), weight_decay=1e-2)
Precision	Mixed (fp16)
Gradient clipping	1.0
EMA decay	0.999
Checkpoints	Every 5,000 steps
Platform	Kaggle (single GPU)

The released weights are the Exponential Moving Average (EMA) of the model parameters, which produces more stable and higher-quality outputs than the raw training weights.
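
As a rough guide to what an EMA decay of 0.999 means, the snippet below computes two common rules of thumb: the effective averaging window 1 / (1 - decay) and the half-life of a single step's contribution. The helper name is made up for this example, and the numbers are properties of the decay value alone, not measurements from the training run.

```python
import math

def ema_horizon(decay: float) -> tuple[float, float]:
    """Return (effective window in steps, half-life in steps) for an EMA decay."""
    window = 1.0 / (1.0 - decay)                  # rule-of-thumb averaging window
    half_life = math.log(0.5) / math.log(decay)   # steps until a step's weight halves
    return window, half_life

window, half_life = ema_horizon(0.999)
print(round(window), round(half_life))  # 1000 693
```

Under these rules of thumb, the released weights average roughly the last thousand optimizer steps out of the 120,000 trained.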

Pre-encoded Latents

Audio from the training dataset was pre-encoded into continuous EnCodec latents (shape [T, 128] per sample) and stored as .pt files. These are available at FloatinggOnion/yoruba-cfm-latents.

Finetuning

You can finetune this model on additional Yoruba speech data:

from transformers import AutoModel
import copy, torch

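# Load pretrained weights. from_pretrained returns the model in eval mode,
# so switch the trainable module back to train mode.
pretrained = AutoModel.from_pretrained("FloatinggOnion/yoruba-cfm-dit", trust_remote_code=True)
cfm_model = pretrained.cfm.to("cuda").train()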

# Set up EMA
ema_model = copy.deepcopy(cfm_model).eval()
for p in ema_model.parameters():
    p.requires_grad = False

# Train with lower LR
optimizer = torch.optim.AdamW(cfm_model.parameters(), lr=5e-5, betas=(0.9, 0.95))

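for batch in your_dataloader:
    # cfm_loss is the CFM loss from the training notebook; it is not defined in this snippet
    loss = cfm_loss(cfm_model, batch)
    loss.backward()
    # Clip gradients to 1.0, matching the original training configuration
    torch.nn.utils.clip_grad_norm_(cfm_model.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad()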

    # Update EMA
    with torch.no_grad():
        for ep, mp in zip(ema_model.parameters(), cfm_model.parameters()):
            ep.mul_(0.999).add_(mp, alpha=0.001)

# Save finetuned model
pretrained.cfm.load_state_dict(ema_model.state_dict())
pretrained.save_pretrained("./finetuned-yoruba-cfm")

New data must be encoded with the same EnCodec model (facebook/encodec_24khz) and phonemized with YorubaG2P using the same vocabulary. See the training notebook for the full data preparation and finetuning pipeline.
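
The finetuning loop calls `cfm_loss`, which this card does not define. The sketch below is one way such a function could look for the linear path described under Conditional Flow Matching. Everything about the interface is an assumption: the batch keys `latents` and `phoneme_ids`, the forward signature `model(x_t, t, phoneme_ids)`, and the `TinyVelocityNet` stand-in, which exists only so the example runs without the real DiT. Padding masks are omitted. The training notebook remains the reference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cfm_loss(model, batch):
    """Flow-matching regression loss for the path x_t = (1 - t) x_0 + t x_1."""
    x1 = batch["latents"]                           # target latents, [B, T, 128]
    phoneme_ids = batch["phoneme_ids"]              # conditioning tokens, [B, L]
    x0 = torch.randn_like(x1)                       # Gaussian noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)   # one flow time per sample
    t_b = t.view(-1, 1, 1)                          # broadcast over [T, 128]
    x_t = (1.0 - t_b) * x0 + t_b * x1               # point on the straight path
    v_target = x1 - x0                              # constant velocity of that path
    v_pred = model(x_t, t, phoneme_ids)             # hypothetical forward signature
    return F.mse_loss(v_pred, v_target)

class TinyVelocityNet(nn.Module):
    """Stand-in network used only to exercise the loss; not the real DiT."""
    def __init__(self, vocab_size=67, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim + 1, dim)

    def forward(self, x_t, t, phoneme_ids):
        cond = self.embed(phoneme_ids).mean(dim=1, keepdim=True)   # [B, 1, dim]
        t_feat = t.view(-1, 1, 1).expand(-1, x_t.shape[1], 1)      # [B, T, 1]
        return self.proj(torch.cat([x_t + cond, t_feat], dim=-1))  # [B, T, dim]

torch.manual_seed(0)
net = TinyVelocityNet()
batch = {
    "latents": torch.randn(2, 150, 128),
    "phoneme_ids": torch.randint(0, 67, (2, 20)),
}
loss = cfm_loss(net, batch)
loss.backward()
print(loss.item() > 0)  # True
```

For real batches with variable-length utterances, the squared error would normally be masked so that padded frames do not contribute to the loss.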

`generate()` API

output = model.generate(
    text="Bawo ni",           # Raw Yoruba text (uses YorubaG2P internally)
    # phoneme_ids=tensor,     # Or pass pre-computed phoneme IDs [1, L]
    num_latent_frames=150,    # Target duration in EnCodec frames (default: 150)
    num_ode_steps=24,         # ODE solver steps (default: 24; more steps lower integration error but run slower)
)

output["audio"]        # torch.Tensor -- waveform
output["sample_rate"]  # int -- 24000

Text input requires yoruba-g2p (pip install yoruba-g2p). Pass phoneme_ids directly to skip this dependency.
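
Because `num_latent_frames` is expressed in EnCodec frames, it helps to convert to and from seconds. The helpers below are a small sketch, not part of the model's API. They assume the latents run at the 24 kHz EnCodec frame rate of 75 frames per second (a hop of 320 samples); the function names are invented for this example, and the 2048-frame cap comes from the Model Details table.

```python
import math

ENCODEC_SAMPLE_RATE = 24_000   # Hz, from the model card
ENCODEC_HOP = 320              # samples per latent frame (75 frames/s), assumed for encodec_24khz

def frames_to_seconds(num_frames: int) -> float:
    """Approximate audio duration produced by a given number of latent frames."""
    return num_frames * ENCODEC_HOP / ENCODEC_SAMPLE_RATE

def seconds_to_frames(seconds: float, max_frames: int = 2048) -> int:
    """Round a target duration up to whole frames, capped at the model's max length."""
    frames = math.ceil(seconds * ENCODEC_SAMPLE_RATE / ENCODEC_HOP)
    return max(1, min(frames, max_frames))

print(frames_to_seconds(150))   # 2.0
print(seconds_to_frames(3.5))   # 263
```

Under this assumption the default of 150 frames yields about 2 seconds of audio, and the 2048-frame maximum corresponds to roughly 27 seconds. A value from `seconds_to_frames` can be passed as `num_latent_frames` to `generate()`.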

Dependencies

torch>=2.4
transformers>=4.40
safetensors
huggingface_hub
yoruba-g2p    # for text input (optional if passing phoneme_ids)
epitran       # required by yoruba-g2p

Files in This Repository

File	Description
`config.json`	Model configuration (hyperparameters, auto_map)
`model.safetensors`	Pretrained EMA weights (safetensors format)
`phoneme_vocab.json`	Phoneme-to-ID mapping (67 tokens)
`modeling_yoruba_cfm.py`	Model implementation (`YorubaCFMForTTS`)
`configuration_yoruba_cfm.py`	Config class (`YorubaCFMConfig`)
`yoruba_cfm_ema_weights.pt`	Legacy EMA weights (raw PyTorch format)
`yoruba_cfm_last.ckpt`	Legacy Lightning checkpoint

Limitations

Trained on a small set of speakers drawn from two datasets; voice diversity is limited
No explicit duration or prosody control beyond setting the total output length with num_latent_frames
Audio quality depends on the EnCodec decoder, which can introduce artifacts at boundaries
The model generates a fixed number of latent frames (num_latent_frames, default 150); text that is too short for that length may be padded with silence, and text that is too long may be truncated

Acknowledgements and Citations

Training Data

This model was trained on:

Yoruba TTS Selected Speakers by PlotweaverAI
Yoruba Male Dataset by Hidi-agili (10,446 samples of male Yoruba speech)

@dataset{plotweaverai_yoruba_tts,
  author = {PlotweaverAI},
  title = {Yoruba TTS Selected Speakers},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/PlotweaverAI/yoruba-tts-selected-speakers}
}

@dataset{hidi_agili_yoruba_male,
  author = {Hidi-agili},
  title = {Yoruba Male Dataset},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/Hidi-agili/yoruba_male_dataset}
}

EnCodec

Audio encoding and decoding uses Meta's EnCodec neural audio codec:

@article{defossez2022encodec,
  title={High Fidelity Neural Audio Compression},
  author={D{\'e}fossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  journal={arXiv preprint arXiv:2210.13438},
  year={2022}
}

Conditional Flow Matching

The training objective is based on Flow Matching for Generative Modeling:

@article{lipman2023flow,
  title={Flow Matching for Generative Modeling},
  author={Lipman, Yaron and Chen, Ricky T. Q. and Ben-Hamu, Heli and Nickel, Maximilian and Le, Matt},
  journal={arXiv preprint arXiv:2210.02747},
  year={2023}
}

YorubaG2P

Text-to-phoneme conversion uses the yoruba-g2p library for Yoruba grapheme-to-phoneme conversion.

License

Apache 2.0

