🇱🇰 Sinhala TTS: Natural Sinhala Text-to-Speech with Voice Cloning

Build a natural-sounding Sinhala text-to-speech system with zero-shot voice cloning: speak Sinhala in anyone's voice from a 5-second reference clip.

Status: The LLM-assisted CC pipeline under process/raw-extract/ is now the primary high-quality dataset path. compact_windowed chunking plus selective acquisition has been verified on pilot 006, with the legacy CC-only script kept as a fast baseline. Next: scale selective across more captioned videos, audit the dashboard output, then fine-tune from IndicF5 or an existing Sinhala checkpoint.

📋 Voice Cloning Plan: See VOICE_CLONING_PLAN.md for full architecture research and implementation roadmap.
📊 Iteration Log: See iterations/README.md for what's been tried and what worked.
🧭 End-to-End Technical Overview: See END_TO_END_TECHNICAL_OVERVIEW.md for the full system architecture, training flow, inference path, and operational lessons learned.


Why This Exists

Existing Sinhala TTS (like MMS-TTS) sounds robotic because of 5 compounding failures:

  1. Terrible training data: MMS-TTS was trained on New Testament Bible recordings (~1-3h, monotone prosody, background music)
  2. Sinhala's Brahmic script is hostile to character-level models (multi-codepoint phonemes, conjunct consonants like ක්‍ෂ)
  3. VITS architectural weaknesses get amplified on small data (deterministic duration prediction, MAS alignment gets stuck)
  4. Duration/prosody modeling fails on inconsistent-tempo religious recordings
  5. Domain mismatch: religious text training → modern Sinhala inference
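
To make item 2 concrete, the short sketch below inspects the conjunct ක්‍ෂ with Python's standard library. The variable names are illustrative and do not come from this repo; the point is only that one visible glyph is stored as several code points, including an invisible joiner.

```python
import unicodedata

# The conjunct from item 2, written with escapes so the invisible joiner is explicit:
# ka + al-lakuna (virama) + zero-width joiner + ssa.
conjunct = "\u0d9a\u0dca\u200d\u0dc2"

# Print each code point with its official Unicode name.
for ch in conjunct:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# A character-level model sees this many separate symbols for one glyph.
codepoint_count = len(conjunct)
print(codepoint_count)
```

A tokenizer that splits on raw code points would treat the joiner and the vowel-killer sign as independent symbols, which is one reason a grapheme-aware tokenizer can help on small datasets.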

Our approach: ~100+ hours of clean Sinhala speech from YouTube, processed through a research-grade enhancement pipeline, used to fine-tune F5-TTS, starting from IndicF5 (11 Indic languages, 1,417h).


Architecture

Training Target: F5-TTS (Voice Cloning)

Reference Audio (5s) ──→ Mel Spectrogram ──┐
                                           ├──→ F5-TTS (Flow DiT) ──→ Vocos ──→ Cloned Audio
Sinhala Text ──→ Char Embeddings ──────────┘

Why F5-TTS? After evaluating 10+ models (full analysis):

| Model | Type | Voice Clone | Sinhala-Ready | License |
| --- | --- | --- | --- | --- |
| F5-TTS ✅ | Flow DiT | Mel infilling, 3-10s ref | ✅ Sinhala checkpoint exists | CC-BY-NC |
| Fish Audio S2 | Dual-AR (Qwen3-4B) | Multi-speaker + instruction | ❓ BPE needs validation | Apache 2.0 |
| IndexTTS | GPT-style | Zero-shot | ❌ CN/EN only | Apache 2.0 |
| CosyVoice 3 | LLM + Flow | Token prompting | ❌ Needs BPE retrain | Apache 2.0 |
| Spark-TTS | Single-stream LLM | Global tokens | ❌ Needs codec retrain | Apache 2.0 |

Key discovery: A Sinhala F5-TTS already exists (tharindumihi/tts-si-F5-TTS, 7.7h, 230K steps) and IndicF5 covers 11 Brahmic-script Indic languages on 1,417h. We fine-tune from these instead of training from scratch.


Pipeline Overview

┌─────────────────────────────────────────────────────────────┐
│ STAGE 1: Data Evaluation ✅ DONE                            │
│                                                             │
│   evaluate_channels.py → Found @sunchare (727 videos, 371h) │
│   speaker_analysis.py → 2 speakers, dominant = 88%          │
└─────────────────┬───────────────────────────────────────────┘
                  │
┌─────────────────▼───────────────────────────────────────────┐
│ STAGE 2: Raw Audio Upload ✅ 621/727 videos uploaded        │
│                                                             │
│   Dataset: outlawmold/sinhala-tts-raw-audio (31.1GB)        │
│   Missing: 106 videos (60 local, 46 need re-download)       │
└─────────────────┬───────────────────────────────────────────┘
                  │
┌─────────────────▼───────────────────────────────────────────┐
│ STAGE 3: LLM-Assisted CC Dataset Build ← YOU ARE HERE       │
│                                                             │
│   process/raw-extract/pipeline.py:                          │
│     1. Parse YouTube Sinhala captions                       │
│     2. Ask Gemini/OpenAI-compatible LLM for semantic chunks │
│     3. Run deterministic text/source/language gates         │
│     4. In selective mode, acquire audio only for candidates │
│     5. Cut clips, validate audio/ASR, write dashboards      │
│     6. Export F5-TTS dataset                                │
│   Current best mode: Gemini + compact_windowed + selective  │
│   Legacy baseline: scripts/cc_pipeline.py v5                │
└─────────────────┬───────────────────────────────────────────┘
                  │
┌─────────────────▼───────────────────────────────────────────┐
│ STAGE 4: F5-TTS Fine-tuning                                 │
│                                                             │
│   train_f5tts.py ✅ READY:                                  │
│     Starting checkpoint: IndicF5 or tharindumihi Sinhala    │
│     Vocab: data/sinhala_vocab/vocab.txt (174 chars)         │
│     Hardware: 1×A100, ~3-7 days from IndicF5                │
└─────────────────┬───────────────────────────────────────────┘
                  │
┌─────────────────▼───────────────────────────────────────────┐
│ STAGE 5: Voice Cloning Inference                            │
│                                                             │
│   app.py: Upload 5s reference + type Sinhala text →         │
│   Generates speech in the reference voice                   │
└─────────────────────────────────────────────────────────────┘
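
Steps 3 to 5 of Stage 3 amount to gating on text before touching any audio. The sketch below illustrates that ordering only; `Candidate`, `text_gate`, `run_selective`, and `fake_acquire` are hypothetical names for illustration, and the real gates in process/raw-extract/pipeline.py are richer than the length check used here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    video_id: str
    text: str
    passed_text_gates: bool = False


def text_gate(text: str, min_chars: int = 10) -> bool:
    """Stand-in gate: accept only chunks with enough non-blank text."""
    return len(text.strip()) >= min_chars


def run_selective(candidates, acquire):
    """Run the cheap text gate first; call `acquire` only for candidates that pass."""
    acquired = []
    for cand in candidates:
        cand.passed_text_gates = text_gate(cand.text)
        if cand.passed_text_gates:
            acquired.append(acquire(cand))
    return acquired


# Record which videos would trigger a download, without any network access.
calls = []


def fake_acquire(cand: Candidate) -> str:
    calls.append(cand.video_id)
    return f"{cand.video_id}.wav"


cands = [
    Candidate("vid_a", "short"),
    Candidate("vid_b", "a sufficiently long caption chunk"),
]
clips = run_selective(cands, fake_acquire)
print(clips, calls)
```

The design choice this shows is that download and cutting cost scales with the number of accepted candidates rather than with the number of videos.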

Quick Start

Run LLM-Assisted CC Pipeline

The current recommended path is selective: captions and LLM chunking run first, then audio is downloaded or cut only for text-passed candidates.

.\.venv310\Scripts\python.exe process/raw-extract/pipeline.py `
  --videos "https://www.youtube.com/watch?v=0cROexEjv80" `
  --output-dir pipeline_output/llm_cc/pilot_006_selective `
  --dataset-name sinhala_tts_llm_cc_pilot_006_selective `
  --mode selective `
  --llm-provider gemini `
  --llm-model gemini-3.1-pro-preview `
  --chunk-mode compact_windowed `
  --resume `
  --verbose

Set GEMINI_API_KEY or pass --api-key. --videos accepts a YouTube URL, a raw video ID, or a JSON/JSONL manifest for batches. Missing Sinhala JSON3 captions are downloaded automatically into cc_output before LLM chunking. The run writes run_manifest.json, candidates/, accepted/rejected manifests, a review dashboard, and an F5-compatible dataset when rows pass. ASR is now handled as an optional checkpointed audit with --mode asr-audit; local ASR prefers faster-whisper on CUDA.
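
Since --videos accepts either a URL or a raw video ID, some normalization step has to map both to the same ID. The helper below is a minimal sketch of one way to do that with the standard library; `to_video_id` is a hypothetical name, the handling of youtu.be links is an assumption, and manifest files are not covered. The pipeline's own parsing may differ.

```python
from urllib.parse import parse_qs, urlparse


def to_video_id(value: str) -> str:
    """Return the video ID for a watch URL, a short youtu.be URL, or a raw ID."""
    value = value.strip()
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https"):
        # Short links carry the ID in the path (assumed form: https://youtu.be/<id>).
        if parsed.netloc.lower().endswith("youtu.be"):
            return parsed.path.lstrip("/")
        # Watch URLs carry the ID in the "v" query parameter.
        ids = parse_qs(parsed.query).get("v", [])
        if not ids:
            raise ValueError(f"no video ID found in URL: {value}")
        return ids[0]
    # Anything that is not an http(s) URL is treated as a raw ID.
    return value


print(to_video_id("https://www.youtube.com/watch?v=0cROexEjv80"))
print(to_video_id("0cROexEjv80"))
```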

Legacy CC-Only Baseline

scripts/cc_pipeline.py remains useful for quick non-LLM comparisons:

python scripts/cc_pipeline.py download-cc --video-list tts_channel_eval/unlimited_history_videos.json
python scripts/cc_pipeline.py process --cc-dir cc_output --max-videos 10
python scripts/cc_pipeline.py stats --cc-dir cc_output

Deprecated local pipeline: local_pipeline.py v4 (wav2vec2 CTC ASR) still exists for comparison, but it requires GPU and produced lower-quality transcripts than YouTube CC in earlier tests.

Train F5-TTS

pip install f5-tts

# 1. Generate vocab (already done; see data/sinhala_vocab/vocab.txt)
python scripts/train_f5tts.py vocab

# 2. Prepare dataset
python scripts/train_f5tts.py prepare --dataset_dir pipeline_output/dataset

# 3. Build Arrow dataset
f5-tts_finetune-data data/sinhala_tts/metadata.csv data/sinhala_tts

# 4. Fine-tune from IndicF5
python scripts/train_f5tts.py finetune --pretrain /path/to/indicf5/model.safetensors

# 5. Inference
python scripts/train_f5tts.py infer \
    --checkpoint ckpts/sinhala_tts/model_XXXXX.pt \
    --ref_audio reference.wav \
    --ref_text "මේ කොටුබිල්වල විහාරයේ" \
    --gen_text "බුදුදහම ගැන මේ අටුවා ටික පරිවර්තනය කරන්ද"

Data Sources

Primary: @sunchare "Unlimited History" (YouTube)

| Metric | Value |
| --- | --- |
| Channel | @sunchare |
| Videos | 727 total, 621 uploaded to Hub |
| Total hours | ~371h raw |
| Raw audio | outlawmold/sinhala-tts-raw-audio (31.1GB) |
| Est. yield (legacy v4) | ~10-30h clean after wav2vec2 ASR pipeline |
| Est. yield (legacy CC v5) | Historical projection only; needs re-audit against current quality gates |
| Current dataset path | LLM-assisted CC chunks with selective audio acquisition |

Key Technical Decisions (Iterations)

| # | Decision | Rationale |
| --- | --- | --- |
| 004 | ASR: SpideyDLK wav2vec2 CTC | CTC can't hallucinate. MMS doesn't support Sinhala. Whisper-large-v3 was the worst. Details |
| 005 | TTS: F5-TTS from IndicF5 | Existing Sinhala model proves feasibility. IndicF5 has 11 Brahmic languages. Fish Audio S2 is a future upgrade. Details |
| 006 | Pipeline: YouTube CC + LLM chunking | YouTube Sinhala captions are the source text, while the LLM chooses semantic boundaries. The current scaling path is process/raw-extract/pipeline.py --mode selective, so expensive audio/ASR work happens only after CC/text gates pass. Details |
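
Decision 006 depends on the LLM choosing boundaries without rewriting the caption text. One way to check that is to measure how far the LLM's chunk drifts from the caption range it was cut from. The sketch below uses `difflib.SequenceMatcher` from the standard library; `drift`, `passes_source_gate`, and the 0.1 threshold are assumptions for illustration, not the pipeline's actual gate.

```python
from difflib import SequenceMatcher


def drift(caption_text: str, llm_text: str) -> float:
    """Return 1 - similarity after collapsing whitespace (0.0 means identical)."""
    a = " ".join(caption_text.split())
    b = " ".join(llm_text.split())
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def passes_source_gate(caption_text: str, llm_text: str, max_drift: float = 0.1) -> bool:
    """Accept the chunk only if the LLM text stays close to the caption text."""
    return drift(caption_text, llm_text) <= max_drift


# Whitespace-only differences do not count as drift.
print(drift("abc  def", "abc def"))
# A rewritten chunk is rejected.
print(passes_source_gate("hello world", "completely different text"))
```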

File Structure

outlawmold/sinhala-tts/
├── README.md                          # This file
├── VOICE_CLONING_PLAN.md              # F5-TTS architecture research & plan
├── TTS_QUALITY_ENHANCEMENT_PLAN.md    # Dataset quality improvement roadmap
├── TODO.md                            # Raw upload quality gate
├── yt-to-do.md                        # Missing raw video tracking
├── data/
│   └── sinhala_vocab/vocab.txt        # Sinhala vocab for F5-TTS (174 chars) ✅
├── iterations/
│   ├── 001-2video-test/               # HTDemucs broken, Whisper hallucinated
│   ├── 002-demucs-fixed/              # Demucs fixed, ASR still garbage
│   ├── 003-asr-comparison/            # ASR comparison plan
│   ├── 004-asr-comparison-results/    # ✅ wav2vec2 CTC wins
│   └── 005-tts-model-research/        # ✅ F5-TTS confirmed, IndicF5 found
├── process/
│   └── raw-extract/
│       ├── pipeline.py                # ⭐ Primary LLM-assisted CC pipeline ✅
│       ├── README.md                  # Current runbook and CLI reference
│       └── LLM_SELECTION_ANALYSIS.md  # Gemini/OpenAI mode comparison
├── scripts/
│   ├── cc_pipeline.py                 # Legacy fast CC-only baseline
│   ├── train_f5tts.py                 # F5-TTS fine-tuning ✅
│   ├── finalize_distributed_dataset.py # Merge distributed outputs ✅
│   ├── validate_dataset.py            # Pre-training audio validation ✅
│   ├── download_and_upload.py         # Raw audio download/upload ✅
│   ├── evaluate_channels.py           # Channel quality evaluation ✅
│   ├── local_pipeline.py              # Legacy pipeline v4 (wav2vec2 CTC, deprecated)
│   ├── convert_to_f5.py               # Dataset format converter
│   ├── sinhala_tokenizer.py           # Grapheme tokenizer
│   ├── train_fastpitch.py             # FastPitch fallback (deprecated)
│   └── test_asr_models.py             # ASR comparison test
└── app.py                             # Gradio voice-cloning inference app ✅

Current Status & Next Steps

  • Data source evaluation & channel quality scoring
  • Architecture selection → F5-TTS (confirmed iteration 005)
  • Raw audio upload to Hub (621/727)
  • Legacy CC-based pipeline v5 (YouTube auto-captions)
  • LLM-assisted CC pipeline with Gemini/OpenAI provider routing
  • compact_windowed, onepass, and windowed chunk modes
  • candidate, acquire, and selective modes for text-first scaling
  • Run manifest, token usage logging, process lock, dashboards, and human-audit sample output
  • ASR model comparison β€” 7 models tested (iteration 004)
  • TTS model landscape research β€” 10 models evaluated (iteration 005)
  • Sinhala vocab.txt for F5-TTS (174 characters)
  • F5-TTS fine-tuning script (train_f5tts.py)
  • Distributed processing support with HF-backed leases
  • Video-level train/val split (fixes data leakage)
  • app.py rewritten for F5-TTS voice cloning inference
  • → Request access to IndicF5 and tharindumihi/tts-si-F5-TTS
  • → Scale selective over more captioned videos
  • → Audit accepted/rejected chunks in the review dashboard
  • Upload remaining 106 raw videos
  • Fine-tune F5-TTS on processed Sinhala data
  • Evaluate with MOS/MUSHRA listening tests
  • Deploy as HF Space
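
The video-level train/val split listed above avoids leakage by assigning every clip from one video to the same split. A minimal sketch of that idea, assuming a deterministic hash of the video ID decides the split: `split_for_video` and the 10% validation fraction are hypothetical, and the repo's own split logic may work differently.

```python
import hashlib


def split_for_video(video_id: str, val_fraction: float = 0.1) -> str:
    """Map a video ID to 'train' or 'val' deterministically, independent of clip order."""
    digest = hashlib.sha256(video_id.encode("utf-8")).digest()
    # Turn the first 8 bytes into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "val" if bucket < val_fraction else "train"


# Clips are (video_id, clip_name) pairs; the split depends only on the video ID.
clips = [("vidA", "clip_000.wav"), ("vidA", "clip_001.wav"), ("vidB", "clip_000.wav")]
assignments = {(vid, clip): split_for_video(vid) for vid, clip in clips}
print(assignments)
```

Splitting at clip level instead would let neighbouring sentences from the same recording land in both train and val, which inflates validation scores.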

Hardware Requirements

| Stage | Min Hardware | Recommended | Time |
| --- | --- | --- | --- |
| CC download | Any | Laptop | ~4 sec/video |
| LLM candidate chunking | CPU + API key | Gemini gemini-3.1-pro-preview | Model/API latency |
| Selective audio acquisition | CPU + network | yt-dlp + ffmpeg | Only for text-passed candidates |
| Local ASR validation | GPU optional | RTX 4050 via CUDA | Per accepted clip |
| Legacy CC-only processing | CPU | Any | ~30 sec/video |
| Legacy local pipeline | GPU 6GB+ | RTX 4050 | ~15 min/video |
| F5-TTS fine-tune (from IndicF5) | 1× A100 80GB | 1× A100 | 3-7 days |
| F5-TTS fine-tune (from base) | 1× A100 80GB | 8× A100 | 2 weeks / 3 days |
| Inference | GPU 4GB+ | Any modern GPU | Real-time |
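
As a rough sanity check, the per-video times in the table can be multiplied by the 727 videos on the channel. The sketch below assumes a single machine processing videos one after another with no failures or retries; that is an assumption made here, not a measured result.

```python
# Figures taken from this README: 727 videos, ~4 s, ~30 s, and ~15 min per video.
videos = 727

cc_download_min = videos * 4 / 60        # CC download, in minutes
legacy_cc_hours = videos * 30 / 3600     # legacy CC-only processing, in hours
legacy_local_hours = videos * 15 / 60    # legacy local pipeline, in hours

print(f"CC download:        ~{cc_download_min:.1f} min")
print(f"Legacy CC-only:     ~{legacy_cc_hours:.1f} h")
print(f"Legacy local (GPU): ~{legacy_local_hours:.1f} h")
```

Under these assumptions the legacy local pipeline needs roughly 182 GPU-hours for the full channel, against about 6 CPU-hours for the CC-only baseline.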

Distributed Processing

python scripts/local_pipeline.py --distributed --resume

Multiple machines share work via HF-backed claim leases. After all videos are processed:

python scripts/finalize_distributed_dataset.py --upload

Quality Controls

The current LLM-assisted pipeline records reproducible runs and exposes quality decisions in manifests, reports, and dashboards:

  • run_manifest.json captures command args, config, prompt hashes, code hashes, and git state, with API keys redacted.
  • Text-source gates compare LLM output to the selected caption range and reject excessive drift by default.
  • Language/vocab gates record Sinhala ratio, Latin ratio, Unicode NFC/control checks, and repeated whitespace issues.
  • Extended audio metrics record estimated SNR, speech-frame ratio, loudness, spectral metrics, and text rate.
  • selective mode writes candidates/candidate_manifest.jsonl before downloading/cutting audio.
  • Validated runs write manifest_accepted, manifest_rejected, train/val/test manifests, human_audit_sample.jsonl, and reports/review_dashboard.html.
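
The language/vocab gate described above can be approximated with a few standard-library checks. The sketch below is an illustration under stated assumptions: `language_stats` is a hypothetical name, "Sinhala" means code points in the Unicode Sinhala block (U+0D80 to U+0DFF), ratios are computed over non-whitespace characters, and no thresholds are implied.

```python
import unicodedata


def language_stats(text: str) -> dict:
    """Compute simple script ratios and normalization flags for one text chunk."""
    visible = [ch for ch in text if not ch.isspace()]
    total = len(visible) or 1  # avoid division by zero on empty input
    sinhala = sum(1 for ch in visible if "\u0d80" <= ch <= "\u0dff")
    latin = sum(1 for ch in visible if ch.isascii() and ch.isalpha())
    return {
        "sinhala_ratio": sinhala / total,
        "latin_ratio": latin / total,
        "is_nfc": unicodedata.is_normalized("NFC", text),
        # Category Cc covers control characters; the zero-width joiner is Cf and is allowed.
        "has_control": any(
            unicodedata.category(ch) == "Cc" and ch not in "\n\t" for ch in text
        ),
        "repeated_whitespace": "  " in text,
    }


# Four Sinhala code points followed by three Latin letters.
stats = language_stats("\u0db6\u0dd4\u0daf\u0dd4 TTS")
print(stats)
```

Allowing the zero-width joiner matters for Sinhala because conjuncts rely on it; a gate that rejected all format characters would throw away valid text.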

Older local-pipeline quality options still exist for the deprecated wav2vec2 flow:

python scripts/local_pipeline.py --no-trim-silence
python scripts/local_pipeline.py --no-loudness-normalize
python scripts/local_pipeline.py --target-lufs -23 --silence-top-db 35 --silence-pad-ms 100
python scripts/local_pipeline.py --enable-speaker-filter --speaker-similarity-min 0.45
python scripts/local_pipeline.py --verify-asr-model <wav2vec2-ctc-model> --min-asr-agreement 0.75

See TTS_QUALITY_ENHANCEMENT_PLAN.md for the full roadmap.


A personal project. Not affiliated with any organization.
