Yoruba CFM-DiT: Text-to-Speech for Yoruba
A Conditional Flow Matching (CFM) model with a Diffusion Transformer (DiT) backbone for generating natural Yoruba speech from text. The model operates in the continuous latent space of Meta's EnCodec audio codec, learning to transform Gaussian noise into speech latents conditioned on phoneme sequences.
Quick Start
from transformers import AutoModel
from IPython.display import Audio
model = AutoModel.from_pretrained(
    "FloatinggOnion/yoruba-cfm-dit",
    trust_remote_code=True,
)
output = model.generate("Bawo ni, ẹ kú àárọ̀.")
# Play in a notebook
Audio(output["audio"].squeeze().cpu().numpy(), rate=output["sample_rate"])
Save to WAV
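import wave
import numpy as np

def save_wav(path, audio_tensor, sr=24000):
    """Write a waveform tensor to a 16-bit mono PCM WAV file."""
    # Cast to float32 so half-precision outputs convert to NumPy cleanly
    audio = audio_tensor.detach().cpu().float()
    # Drop the batch dimension: [B, C, T] -> [C, T]
    if audio.dim() == 3:
        audio = audio[0]
    # Collapse channels: take the only channel, or average to mono
    if audio.dim() == 2:
        audio = audio[0] if audio.shape[0] == 1 else audio.mean(dim=0)
    # Clip to [-1, 1] and scale to the int16 range
    audio = audio.clamp(-1.0, 1.0).numpy()
    audio_i16 = (audio * 32767.0).astype(np.int16)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)  # mono
        wf.setsampwidth(2)  # 2 bytes = 16-bit samples
        wf.setframerate(sr)
        wf.writeframes(audio_i16.tobytes())

save_wav("output.wav", output["audio"], sr=output["sample_rate"])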
How It Works
Architecture
The model has three stages:
- Text Encoder -- A 4-layer Transformer encoder that converts Yoruba phoneme sequences (produced by YorubaG2P) into conditioning embeddings.
- Diffusion Transformer (DiT) -- 10 DiT blocks with self-attention over the latent sequence and cross-attention to the text conditioning. Sinusoidal timestep embeddings are injected via an MLP.
- EnCodec Decoder -- Meta's pretrained EnCodec 24kHz decoder converts the generated continuous latents back into a 24kHz audio waveform.
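The sinusoidal timestep embedding mentioned above can be sketched as follows. This is a framework-free illustration of the standard sinusoidal scheme; the function name, the `max_period` of 10000, and the sines-then-cosines layout are assumptions, not details confirmed for this model.

```python
import math

def sinusoidal_timestep_embedding(t, dim, max_period=10000.0):
    """Map a scalar timestep t to a `dim`-dimensional sinusoidal vector.

    Assumed layout: first half sines, second half cosines.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies, from 1 down to roughly 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    sines = [math.sin(t * f) for f in freqs]
    cosines = [math.cos(t * f) for f in freqs]
    return sines + cosines

# 512 matches the model dimension listed in the model details table.
emb = sinusoidal_timestep_embedding(0.5, dim=512)
print(len(emb))
```

In the model, a vector like this would then pass through the MLP before conditioning the DiT blocks.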
Conditional Flow Matching
Instead of the standard diffusion denoising objective, this model uses Conditional Flow Matching (CFM) with a linear interpolation path:
- Forward process: x_t = (1 - t) * x_0 + t * x_1, where x_0 ~ N(0, I) and x_1 is the target audio latent
- The model learns to predict the velocity field v = x_1 - x_0
- At inference, an ODE solver (Euler method, 24 steps) integrates from noise to data
This objective is a plain regression on velocities, which tends to be simpler to implement and more stable to train than score-based diffusion, and it allows generation with few sampling steps (24 by default).
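The linear path, velocity target, and Euler sampler described above can be illustrated with a toy, framework-free sketch. All names here are hypothetical, and `oracle_velocity` stands in for the trained DiT: it returns the exact conditional velocity toward a known target, which the real model has to learn from phoneme conditioning.

```python
import random

def interpolate(x0, x1, t):
    """Linear path: x_t = (1 - t) * x_0 + t * x_1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the network: v = x_1 - x_0."""
    return [b - a for a, b in zip(x0, x1)]

def velocity_mse(pred_v, x0, x1):
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(pred_v, target)) / len(target)

def euler_sample(velocity_fn, x0, num_steps=24):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data)."""
    x, dt = list(x0), 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

random.seed(0)
x1 = [0.3, -1.2, 0.8]                      # toy stand-in for an audio latent
x0 = [random.gauss(0.0, 1.0) for _ in x1]  # x_0 ~ N(0, I)

def oracle_velocity(x, t):
    # Exact conditional velocity on the linear path: (x_1 - x_t) / (1 - t),
    # which equals x_1 - x_0 at every point of the path.
    return [(b - xi) / (1.0 - t) for xi, b in zip(x, x1)]

sample = euler_sample(oracle_velocity, x0)
print(max(abs(s - b) for s, b in zip(sample, x1)))  # close to zero
```

Because the oracle velocity is constant along the path, Euler integration recovers the target up to floating-point error. A learned velocity field is only approximately straight, which is why the number of ODE steps matters in practice.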
Generation Pipeline
Yoruba text -> YorubaG2P -> Phoneme IDs -> TextEncoder -> conditioning
|
Gaussian noise -> ODE sampling (24 steps) -> latents [T, 128]
|
EnCodec decoder -> 24kHz audio
Model Details
| Parameter | Value |
|---|---|
| Model dimension | 512 |
| Attention heads | 8 |
| DiT blocks | 10 |
| Text encoder layers | 4 |
| Latent dimension | 128 (EnCodec) |
| Max latent length | 2048 frames |
| Phoneme vocabulary | 67 tokens |
| Total parameters | ~57M (CFM only, excludes EnCodec) |
| Sample rate | 24,000 Hz |
| Audio codec | facebook/encodec_24khz |
| ODE steps (default) | 24 |
Training
The model was trained on a single GPU using PyTorch Lightning with the following configuration:
| Setting | Value |
|---|---|
| Dataset | PlotweaverAI/yoruba-tts-selected-speakers + Hidi-agili/yoruba_male_dataset |
| Training steps | 120,000 |
| Batch size | 8 |
| Optimizer | AdamW (lr=2e-4, betas=(0.9, 0.95), weight_decay=1e-2) |
| Precision | Mixed (fp16) |
| Gradient clipping | 1.0 |
| EMA decay | 0.999 |
| Checkpoints | Every 5,000 steps |
| Platform | Kaggle (single GPU) |
The released weights are the Exponential Moving Average (EMA) of the model parameters, which typically yields more stable and higher-quality outputs than the raw training weights.
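As a rough illustration of what an EMA decay of 0.999 means, the sketch below applies the standard EMA update to a single scalar "parameter". The helper name is hypothetical and this is not the training code; it only shows that the average responds over roughly 1 / (1 - decay) = 1,000 steps.

```python
def ema_update(ema_value, new_value, decay=0.999):
    """Standard EMA step: ema <- decay * ema + (1 - decay) * new."""
    return decay * ema_value + (1.0 - decay) * new_value

# Track a parameter that jumps from 0.0 to 1.0 and then stays there.
ema = 0.0
for step in range(1000):
    ema = ema_update(ema, 1.0)

# After 1,000 steps the EMA has covered about 63% of the jump,
# i.e. 1 - 0.999 ** 1000.
print(round(ema, 3))
```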
Pre-encoded Latents
Audio from the training dataset was pre-encoded into continuous EnCodec latents (shape [T, 128] per sample) and stored as .pt files. These are available at FloatinggOnion/yoruba-cfm-latents.
Finetuning
You can finetune this model on additional Yoruba speech data:
import copy

import torch
from transformers import AutoModel

# Load pretrained
pretrained = AutoModel.from_pretrained("FloatinggOnion/yoruba-cfm-dit", trust_remote_code=True)
cfm_model = pretrained.cfm.to("cuda")

# Set up EMA (frozen copy that tracks the trained weights)
ema_model = copy.deepcopy(cfm_model).eval()
for p in ema_model.parameters():
    p.requires_grad = False

# Train with a lower LR than pretraining (5e-5 vs. 2e-4)
optimizer = torch.optim.AdamW(cfm_model.parameters(), lr=5e-5, betas=(0.9, 0.95))
for batch in your_dataloader:  # placeholder: your own DataLoader of latents and phoneme IDs
    loss = cfm_loss(cfm_model, batch)  # placeholder: same CFM loss function as in training
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # Update EMA after every optimizer step
    with torch.no_grad():
        for ep, mp in zip(ema_model.parameters(), cfm_model.parameters()):
            ep.mul_(0.999).add_(mp, alpha=0.001)

# Save finetuned model (EMA weights)
pretrained.cfm.load_state_dict(ema_model.state_dict())
pretrained.save_pretrained("./finetuned-yoruba-cfm")
New data must be encoded with the same EnCodec model (facebook/encodec_24khz) and phonemized with YorubaG2P using the same vocabulary. See the training notebook for the full data preparation and finetuning pipeline.
generate() API
output = model.generate(
    text="Bawo ni",         # Raw Yoruba text (uses YorubaG2P internally)
    # phoneme_ids=tensor,   # Or pass pre-computed phoneme IDs [1, L]
    num_latent_frames=150,  # Target duration in EnCodec frames (default: 150)
    num_ode_steps=24,       # ODE solver steps (default: 24; more steps are slower but may improve quality)
)
output["audio"]        # torch.Tensor -- waveform
output["sample_rate"]  # int -- 24000
Text input requires yoruba-g2p (pip install yoruba-g2p). Pass phoneme_ids directly to skip this dependency.
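Because `num_latent_frames` fixes the output length, it helps to convert between seconds and frames. The sketch below assumes the 24 kHz EnCodec model's stride of 320 samples per latent frame (75 frames per second), under which the default of 150 frames is about 2 seconds. The helper names are hypothetical and not part of the model's API.

```python
import math

SAMPLE_RATE = 24_000      # Hz, from the model details table
HOP_LENGTH = 320          # samples per latent frame (assumed EnCodec 24 kHz stride)
MAX_LATENT_FRAMES = 2048  # from the model details table

def frames_for_duration(seconds):
    """Number of latent frames needed to cover `seconds` of audio."""
    frames = math.ceil(seconds * SAMPLE_RATE / HOP_LENGTH)
    if not 1 <= frames <= MAX_LATENT_FRAMES:
        raise ValueError(
            f"{seconds} s needs {frames} frames; allowed range is 1..{MAX_LATENT_FRAMES}"
        )
    return frames

def duration_of_frames(num_frames):
    """Audio duration in seconds produced by `num_frames` latent frames."""
    return num_frames * HOP_LENGTH / SAMPLE_RATE

print(frames_for_duration(2.0))
print(duration_of_frames(MAX_LATENT_FRAMES))
```

The result could then be passed to the model, for example `model.generate(text, num_latent_frames=frames_for_duration(3.5))`.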
Dependencies
torch>=2.4
transformers>=4.40
safetensors
huggingface_hub
yoruba-g2p # for text input (optional if passing phoneme_ids)
epitran # required by yoruba-g2p
Files in This Repository
| File | Description |
|---|---|
| config.json | Model configuration (hyperparameters, auto_map) |
| model.safetensors | Pretrained EMA weights (safetensors format) |
| phoneme_vocab.json | Phoneme-to-ID mapping (67 tokens) |
| modeling_yoruba_cfm.py | Model implementation (YorubaCFMForTTS) |
| configuration_yoruba_cfm.py | Config class (YorubaCFMConfig) |
| yoruba_cfm_ema_weights.pt | Legacy EMA weights (raw PyTorch format) |
| yoruba_cfm_last.ckpt | Legacy Lightning checkpoint |
Limitations
- Trained on a single speaker dataset; voice diversity is limited
- No explicit duration or prosody control
- Audio quality depends on the EnCodec decoder, which can introduce artifacts at boundaries
- The model generates a fixed number of latent frames (num_latent_frames), so very short utterances may be padded with silence and very long ones may be truncated
Acknowledgements and Citations
Training Data
This model was trained on:
- Yoruba TTS Selected Speakers by PlotweaverAI
- Yoruba Male Dataset by Hidi-agili (10,446 samples of male Yoruba speech)
@dataset{plotweaverai_yoruba_tts,
author = {PlotweaverAI},
title = {Yoruba TTS Selected Speakers},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/PlotweaverAI/yoruba-tts-selected-speakers}
}
@dataset{hidi_agili_yoruba_male,
author = {Hidi-agili},
title = {Yoruba Male Dataset},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/Hidi-agili/yoruba_male_dataset}
}
EnCodec
Audio encoding and decoding uses Meta's EnCodec neural audio codec:
@article{defossez2022encodec,
title={High Fidelity Neural Audio Compression},
author={D{\'e}fossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
journal={arXiv preprint arXiv:2210.13438},
year={2022}
}
Conditional Flow Matching
The training objective is based on Flow Matching for Generative Modeling:
@article{lipman2023flow,
title={Flow Matching for Generative Modeling},
author={Lipman, Yaron and Chen, Ricky T. Q. and Ben-Hamu, Heli and Nickel, Maximilian and Le, Matt},
journal={arXiv preprint arXiv:2210.02747},
year={2023}
}
YorubaG2P
Text-to-phoneme conversion uses the yoruba-g2p library for Yoruba grapheme-to-phoneme conversion.
License
Apache 2.0