
F5-TTS Open Bible — Assamese

A zero-shot text-to-speech model for Assamese, trained from scratch on the Open Bible corpus using the F5-TTS architecture (diffusion transformer with vocos vocoder, 24 kHz output).

The model takes a short reference audio clip (5–10 seconds) and a target text, and synthesises the target text in the voice of the reference speaker. No fine-tuning per voice is required.

Files

| File | Purpose |
| --- | --- |
| model_last.pt | Trained model weights. |
| vocab.txt | Character vocabulary built from the training transcripts. |
| F5-TTS_OpenBible_Assamese.yaml | Hydra training/inference config (architecture, mel spec settings, tokenizer). |
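
For orientation, here is a minimal sketch of how a character vocabulary like vocab.txt could be built from transcripts. It is not the script used for this model: the function names and sample strings are hypothetical, and placing the space character first (index 0) is an assumption based on the convention in upstream F5-TTS vocab files.

```python
from pathlib import Path


def build_char_vocab(transcripts):
    """Collect the unique characters (code points) seen in the transcripts."""
    chars = set()
    for line in transcripts:
        chars.update(line.strip())
    # Assumption: space is kept as the first entry (index 0).
    chars.discard(" ")
    return [" "] + sorted(chars)


def write_vocab(vocab, path):
    """Write one symbol per line, UTF-8 encoded."""
    Path(path).write_text("\n".join(vocab) + "\n", encoding="utf-8")


# Placeholder strings made of Assamese letters, not real transcripts.
transcripts = ["কখ গ", "খগ ঘ"]
vocab = build_char_vocab(transcripts)
```

Calling `write_vocab(vocab, "vocab.txt")` would then produce a file with one character per line. Combining marks are treated as separate symbols here because the sketch works on individual code points.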

Intended use

  • Zero-shot TTS for Assamese, controlled by a user-supplied reference clip.
  • Research on multilingual TTS, low-resource TTS evaluation, and listening studies on Open Bible-style read speech.

How to use

Install F5-TTS:

pip install git+https://github.com/SWivid/F5-TTS.git

Download the checkpoint and run inference:

from huggingface_hub import hf_hub_download
from f5_tts.api import F5TTS

repo_id = "multilingual-tts/F5-TTS-OpenBible-Assamese"
ckpt   = hf_hub_download(repo_id, "model_last.pt")
vocab  = hf_hub_download(repo_id, "vocab.txt")
config = hf_hub_download(repo_id, "F5-TTS_OpenBible_Assamese.yaml")

model = F5TTS(ckpt_file=ckpt, vocab_file=vocab, model_cfg=config)

# Supply your own clean, single-speaker reference clip (5–10 s) and its exact transcription.
ref_audio = "/path/to/your-assamese-clip.wav"
ref_text  = "Exact transcription of the clip"
gen_text  = "..."   # text to synthesise in Assamese

wav, sr, _ = model.infer(ref_audio=ref_audio, ref_text=ref_text, gen_text=gen_text)
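
Before running inference, it can help to confirm that the reference clip falls within the recommended 5–10 s range. The sketch below uses only the Python standard library and assumes the clip is an uncompressed PCM WAV file. The helper names are hypothetical and not part of F5-TTS, and the demo clip is synthetic silence so the example runs without real audio.

```python
import os
import tempfile
import wave


def clip_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def check_reference_clip(path, min_s=5.0, max_s=10.0):
    """Return (ok, duration); the bounds follow the 5-10 s guidance."""
    duration = clip_duration_seconds(path)
    return min_s <= duration <= max_s, duration


# Demo on a synthetic 6 s silent mono clip at 24 kHz.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit PCM
        wf.setframerate(24000)
        wf.writeframes(b"\x00\x00" * 24000 * 6)
    ok, duration = check_reference_clip(demo_path)
```

For a real clip, `check_reference_clip(ref_audio)` gives the same pair. A clip outside the range may still work, but this card gives no guidance for that case.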

Training data

  • Source: davidguzmanr/open-bible-resources, config Assamese
  • Size: approximately 30,500 utterances
  • Speakers: multispeaker; speaker identity is supplied at inference time via the reference clip, not by a fixed speaker id
  • Sample rate: 24 kHz
  • Maximum utterance duration during training: 15 s
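
One way a duration cap like the 15 s limit above can be applied is to drop longer utterances before batching. The sketch below illustrates that idea only. Whether the actual pipeline drops or truncates long clips is not stated here, the inclusive boundary is an assumption, and the utterance names and sample counts are hypothetical.

```python
SAMPLE_RATE = 24_000   # Hz, as listed above
MAX_SECONDS = 15.0     # training-time cap, as listed above


def keep_utterance(num_samples, sample_rate=SAMPLE_RATE, max_seconds=MAX_SECONDS):
    """True if the utterance is no longer than the cap (boundary inclusive by assumption)."""
    return num_samples / sample_rate <= max_seconds


# Hypothetical (name, number of samples) pairs: 5 s, 15 s, and about 16.7 s.
utterances = [("utt_a", 120_000), ("utt_b", 360_000), ("utt_c", 400_000)]
kept = [name for name, n in utterances if keep_utterance(n)]
```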

Training procedure

  • Base architecture: F5-TTS v1 Base (DiT, 1024 dim, 22 layers, 16 heads, text dim 512, 4 convolutional layers).
  • Tokenizer: custom character-level, built from the training transcripts.
  • Vocoder: vocos.
  • Mel spectrogram: 100 channels, hop 256, win 1024, n_fft 1024.
  • Optimizer: AdamW, learning rate 7.5e-5, 20,000 warmup updates.
  • Training budget: 500,000 optimizer updates on 4 GPUs with mixed precision (bf16), global batch ≈ 112,000 frames.
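
The frame-based batch size is easier to interpret once converted to seconds of audio. The arithmetic below uses only figures from this card (24 kHz audio, hop 256, 15 s maximum utterance, 4 GPUs). The per-GPU figure assumes the global batch is split evenly with no gradient accumulation, which the card does not state.

```python
SAMPLE_RATE = 24_000           # Hz
HOP_LENGTH = 256               # samples per mel frame
GLOBAL_BATCH_FRAMES = 112_000  # approximate, per optimizer update
NUM_GPUS = 4
MAX_UTTERANCE_S = 15

# Mel frames produced per second of audio: 24000 / 256 = 93.75.
frames_per_second = SAMPLE_RATE / HOP_LENGTH

# Longest training utterance in whole mel frames: 15 * 93.75 = 1406.25.
max_frames_per_utt = int(MAX_UTTERANCE_S * frames_per_second)

# Audio covered by one optimizer update, in seconds and minutes.
batch_audio_seconds = GLOBAL_BATCH_FRAMES / frames_per_second
batch_audio_minutes = batch_audio_seconds / 60

# Assumption: even split across GPUs, no gradient accumulation.
frames_per_gpu = GLOBAL_BATCH_FRAMES // NUM_GPUS
```

Under these assumptions each update sees roughly 1,195 s (about 20 minutes) of audio, and each GPU processes about 28,000 frames per step.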

Audio preprocessing, vocab generation, and config sizing are reproducible via the upstream open-bible-models repo.

Evaluation

The model was evaluated alongside other Open Bible TTS systems on two kinds of metric: character and word error rate, computed from transcripts produced by Meta's Omnilingual ASR, and UTMOSv2 naturalness scores. See the open-bible-models repository for the evaluation pipeline and the open-bible-surveys repository for the human-listening survey methodology.
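
For readers unfamiliar with the error-rate metrics, the sketch below shows the standard definitions: edit distance between reference and ASR hypothesis, divided by the reference length, at word or character level. It is not the evaluation pipeline itself, which may normalise text differently. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

In a TTS evaluation, `reference` would be the text given to the model and `hypothesis` the ASR transcript of the synthesised audio, so lower values indicate more intelligible speech.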

