🇱🇰 Sinhala TTS: Natural Sinhala Text-to-Speech with Voice Cloning
Build a natural-sounding Sinhala text-to-speech system with zero-shot voice cloning: speak Sinhala in anyone's voice from a 5-second reference clip.
Status: The LLM-assisted CC pipeline under process/raw-extract/ is now the primary high-quality dataset path. compact_windowed chunking plus selective acquisition has been verified on pilot 006, with the legacy CC-only script kept as a fast baseline. Next: scale selective across more captioned videos, audit the dashboard output, then fine-tune from IndicF5 or an existing Sinhala checkpoint.
Voice Cloning Plan: See VOICE_CLONING_PLAN.md for the full architecture research and implementation roadmap.
Iteration Log: See iterations/README.md for what's been tried and what worked.
End-to-End Technical Overview: See END_TO_END_TECHNICAL_OVERVIEW.md for the full system architecture, training flow, inference path, and operational lessons learned.
Why This Exists
Existing Sinhala TTS (like MMS-TTS) sounds robotic because of 5 compounding failures:
- Terrible training data: MMS-TTS was trained on New Testament Bible recordings (~1-3h, monotone prosody, background music)
- Sinhala's Brahmic script is hostile to character-level models (multi-codepoint phonemes, conjunct consonants like ක්‍ෂ)
- VITS architectural weaknesses get amplified on small data (deterministic duration prediction, MAS alignment gets stuck)
- Duration/prosody modeling fails on inconsistent-tempo religious recordings
- Domain mismatch: training on religious text does not match modern Sinhala at inference time
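To make the script point concrete, here is a minimal sketch of grouping Sinhala codepoints into user-perceived characters. It is an illustration only, not the repo's `sinhala_tokenizer.py`: the function name and the simplified joining rule (combining marks attach to the previous cluster, and a consonant after al-lakuna + ZWJ continues a conjunct) are assumptions.

```python
import unicodedata

ZWJ = "\u200d"      # zero-width joiner, used in Sinhala conjuncts
VIRAMA = "\u0dca"   # Sinhala al-lakuna (vowel killer)

def sinhala_clusters(text):
    """Group codepoints into approximate grapheme clusters (simplified rule)."""
    text = unicodedata.normalize("NFC", text)
    clusters = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        cat = unicodedata.category(ch)
        # Vowel signs and al-lakuna are combining marks (Mn/Mc); ZWJ also attaches.
        is_mark = cat in ("Mn", "Mc") or ch == ZWJ
        # A letter right after al-lakuna + ZWJ is the second half of a conjunct.
        joins_conjunct = cat == "Lo" and prev.endswith(VIRAMA + ZWJ)
        if prev and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

if __name__ == "__main__":
    conjunct = "\u0d9a\u0dca\u200d\u0dc2"  # ka + al-lakuna + ZWJ + ssa
    print(len(conjunct), len(sinhala_clusters(conjunct)))
```

A character-level model sees the four codepoints of that conjunct as four separate symbols, while a reader sees one. A production tokenizer would more likely follow the full Unicode grapheme cluster rules rather than this shortcut.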
Our approach: 100+ hours of clean Sinhala speech from YouTube, processed through a research-grade enhancement pipeline and used to fine-tune F5-TTS, starting from IndicF5 (11 Indic languages, 1,417h).
Architecture
Training Target: F5-TTS (Voice Cloning)
Reference Audio (5s) --> Mel Spectrogram --+
                                           +--> F5-TTS (Flow DiT) --> Vocos --> Cloned Audio
Sinhala Text --> Char Embeddings ----------+
Why F5-TTS? After evaluating 10+ models (full analysis):
| Model | Type | Voice Clone | Sinhala-Ready | License |
|---|---|---|---|---|
| F5-TTS (selected) | Flow DiT | Mel infilling, 3-10s ref | Yes: Sinhala checkpoint exists | CC-BY-NC |
| Fish Audio S2 | Dual-AR (Qwen3-4B) | Multi-speaker + instruction | Unverified: BPE needs validation | Apache 2.0 |
| IndexTTS | GPT-style | Zero-shot | No: CN/EN only | Apache 2.0 |
| CosyVoice 3 | LLM + Flow | Token prompting | No: needs BPE retrain | Apache 2.0 |
| Spark-TTS | Single-stream LLM | Global tokens | No: needs codec retrain | Apache 2.0 |
Key discovery: A Sinhala F5-TTS already exists (tharindumihi/tts-si-F5-TTS, 7.7h, 230K steps) and IndicF5 covers 11 Brahmic-script Indic languages on 1,417h. We fine-tune from these instead of training from scratch.
Pipeline Overview
STAGE 1: Data Evaluation [DONE]
    evaluate_channels.py -> Found @sunchare (727 videos, 371h)
    speaker_analysis.py  -> 2 speakers, dominant = 88%
        |
        v
STAGE 2: Raw Audio Upload [621/727 videos uploaded]
    Dataset: outlawmold/sinhala-tts-raw-audio (31.1GB)
    Missing: 106 videos (60 local, 46 need re-download)
        |
        v
STAGE 3: LLM-Assisted CC Dataset Build  <-- YOU ARE HERE
    process/raw-extract/pipeline.py:
      1. Parse YouTube Sinhala captions
      2. Ask Gemini/OpenAI-compatible LLM for semantic chunks
      3. Run deterministic text/source/language gates
      4. In selective mode, acquire audio only for candidates
      5. Cut clips, validate audio/ASR, write dashboards
      6. Export F5-TTS dataset
    Current best mode: Gemini + compact_windowed + selective
    Legacy baseline: scripts/cc_pipeline.py v5
        |
        v
STAGE 4: F5-TTS Fine-tuning
    train_f5tts.py READY:
    Starting checkpoint: IndicF5 or tharindumihi Sinhala
    Vocab: data/sinhala_vocab/vocab.txt (174 chars)
    Hardware: 1×A100, ~3-7 days from IndicF5
        |
        v
STAGE 5: Voice Cloning Inference
    app.py: Upload 5s reference + type Sinhala text ->
            generates speech in the reference voice
Quick Start
Run LLM-Assisted CC Pipeline
The current recommended path is selective: captions and LLM chunking run first, then audio is downloaded or cut only for text-passed candidates.
.\.venv310\Scripts\python.exe process/raw-extract/pipeline.py `
--videos "https://www.youtube.com/watch?v=0cROexEjv80" `
--output-dir pipeline_output/llm_cc/pilot_006_selective `
--dataset-name sinhala_tts_llm_cc_pilot_006_selective `
--mode selective `
--llm-provider gemini `
--llm-model gemini-3.1-pro-preview `
--chunk-mode compact_windowed `
--resume `
--verbose
Set GEMINI_API_KEY or pass --api-key. --videos accepts a YouTube URL, a raw video ID, or a JSON/JSONL manifest for batches. Missing Sinhala JSON3 captions are downloaded automatically into cc_output before LLM chunking. The run writes run_manifest.json, candidates/, accepted/rejected manifests, a review dashboard, and an F5-compatible dataset when rows pass. ASR is now handled as an optional checkpointed audit with --mode asr-audit; local ASR prefers faster-whisper on CUDA.
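The text-first idea behind selective mode can be sketched in a few lines. This is a hedged illustration, not the code in `process/raw-extract/pipeline.py`: `run_selective`, `min_length_gate`, the candidate dictionary keys, and the `reject_reason` field are all hypothetical names chosen for the example.

```python
def run_selective(candidates, text_gate, acquire_audio):
    """Apply a cheap text gate first; fetch audio only for candidates that pass."""
    accepted, rejected = [], []
    for cand in candidates:
        ok, reason = text_gate(cand)
        if not ok:
            # Rejected on text alone: no download or cutting cost is paid.
            rejected.append({**cand, "reject_reason": reason})
            continue
        accepted.append({**cand, "audio_path": acquire_audio(cand)})
    return accepted, rejected

def min_length_gate(cand, min_chars=10):
    """Toy text gate (assumption): reject chunks shorter than min_chars."""
    if len(cand["text"].strip()) < min_chars:
        return False, "too_short"
    return True, ""

if __name__ == "__main__":
    demo = [
        {"id": "a", "text": "short"},
        {"id": "b", "text": "a sufficiently long caption chunk"},
    ]
    acc, rej = run_selective(demo, min_length_gate, lambda c: f"clips/{c['id']}.wav")
    print(len(acc), len(rej))
```

The point of the ordering is cost: caption parsing and text checks are cheap, while downloading and cutting audio is not, so the expensive step runs only on the survivors.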
Legacy CC-Only Baseline
scripts/cc_pipeline.py remains useful for quick non-LLM comparisons:
python scripts/cc_pipeline.py download-cc --video-list tts_channel_eval/unlimited_history_videos.json
python scripts/cc_pipeline.py process --cc-dir cc_output --max-videos 10
python scripts/cc_pipeline.py stats --cc-dir cc_output
Deprecated local pipeline: local_pipeline.py v4 (wav2vec2 CTC ASR) still exists for comparison, but it requires a GPU and produced lower-quality transcripts than YouTube CC in earlier tests.
Train F5-TTS
pip install f5-tts
# 1. Generate vocab (already done; see data/sinhala_vocab/vocab.txt)
python scripts/train_f5tts.py vocab
# 2. Prepare dataset
python scripts/train_f5tts.py prepare --dataset_dir pipeline_output/dataset
# 3. Build Arrow dataset
f5-tts_finetune-data data/sinhala_tts/metadata.csv data/sinhala_tts
# 4. Fine-tune from IndicF5
python scripts/train_f5tts.py finetune --pretrain /path/to/indicf5/model.safetensors
# 5. Inference
python scripts/train_f5tts.py infer \
--checkpoint ckpts/sinhala_tts/model_XXXXX.pt \
--ref_audio reference.wav \
--ref_text "<Sinhala transcript of reference.wav>" \
--gen_text "<Sinhala text to synthesize>"
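For readers wondering what the vocab step produces, the sketch below builds a character-level vocabulary from transcripts. It is not the logic of `scripts/train_f5tts.py vocab`, which is not shown here. The function name is hypothetical, and placing the space character first is an assumption about what F5-TTS expects that should be verified against the version you install.

```python
import unicodedata

def build_char_vocab(transcripts):
    """Collect the unique NFC-normalized characters, with space first."""
    chars = set()
    for line in transcripts:
        chars.update(unicodedata.normalize("NFC", line.strip()))
    chars.discard(" ")
    # Assumption: the tokenizer wants space at index 0, so it is placed first.
    return [" "] + sorted(chars)

def write_vocab(vocab, path):
    """Write one token per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for token in vocab:
            fh.write(token + "\n")

if __name__ == "__main__":
    print(build_char_vocab(["ab a", "ba c"]))
```

Normalizing to NFC before collecting characters keeps visually identical strings from yielding different codepoint sequences, which would otherwise inflate the vocabulary.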
Data Sources
Primary: @sunchare "Unlimited History" (YouTube)
| Metric | Value |
|---|---|
| Channel | @sunchare |
| Videos | 727 total, 621 uploaded to Hub |
| Total hours | ~371h raw |
| Raw audio | outlawmold/sinhala-tts-raw-audio (31.1GB) |
| Est. yield (legacy v4) | ~10-30h clean after wav2vec2 ASR pipeline |
| Est. yield (legacy CC v5) | Historical projection only; needs re-audit against current quality gates |
| Current dataset path | LLM-assisted CC chunks with selective audio acquisition |
Key Technical Decisions (Iterations)
| # | Decision | Rationale |
|---|---|---|
| 004 | ASR: SpideyDLK wav2vec2 CTC | CTC can't hallucinate. MMS doesn't support Sinhala. Whisper-large-v3 was the worst. Details |
| 005 | TTS: F5-TTS from IndicF5 | Existing Sinhala model proves feasibility. IndicF5 has 11 Brahmic languages. Fish Audio S2 is a future upgrade. Details |
| 006 | Pipeline: YouTube CC + LLM chunking | YouTube Sinhala captions are the source text, while the LLM chooses semantic boundaries. The current scaling path is process/raw-extract/pipeline.py --mode selective, so expensive audio/ASR work happens only after CC/text gates pass. Details |
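The "CTC can't hallucinate" rationale in decision 004 follows from how CTC output is decoded. The sketch below shows generic greedy CTC decoding over per-frame argmax IDs; it is not the SpideyDLK model's code, and the toy ID-to-character map is an assumption for illustration.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame labels, then drop blanks."""
    out = []
    prev = None
    for idx in frame_ids:
        # Emit only when the label changes and is not the blank symbol.
        if idx != prev and idx != blank_id:
            out.append(id_to_char[idx])
        prev = idx
    return "".join(out)

if __name__ == "__main__":
    frames = [0, 1, 1, 0, 1, 2, 2, 0]   # per-frame argmax IDs (toy values)
    print(ctc_greedy_decode(frames, {1: "a", 2: "b"}))
```

Each audio frame contributes at most one symbol, so the transcript can never be longer than the number of frames. An autoregressive decoder has no such bound and can keep generating fluent text on silence or music, which is the failure mode the comparison flagged.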
File Structure
outlawmold/sinhala-tts/
|-- README.md  # This file
|-- VOICE_CLONING_PLAN.md  # F5-TTS architecture research & plan
|-- TTS_QUALITY_ENHANCEMENT_PLAN.md  # Dataset quality improvement roadmap
|-- TODO.md  # Raw upload quality gate
|-- yt-to-do.md  # Missing raw video tracking
|-- data/
|   `-- sinhala_vocab/vocab.txt  # Sinhala vocab for F5-TTS (174 chars)
|-- iterations/
|   |-- 001-2video-test/  # HTDemucs broken, Whisper hallucinated
|   |-- 002-demucs-fixed/  # Demucs fixed, ASR still garbage
|   |-- 003-asr-comparison/  # ASR comparison plan
|   |-- 004-asr-comparison-results/  # wav2vec2 CTC wins
|   `-- 005-tts-model-research/  # F5-TTS confirmed, IndicF5 found
|-- process/
|   `-- raw-extract/
|       |-- pipeline.py  # Primary LLM-assisted CC pipeline
|       |-- README.md  # Current runbook and CLI reference
|       `-- LLM_SELECTION_ANALYSIS.md  # Gemini/OpenAI mode comparison
|-- scripts/
|   |-- cc_pipeline.py  # Legacy fast CC-only baseline
|   |-- train_f5tts.py  # F5-TTS fine-tuning
|   |-- finalize_distributed_dataset.py  # Merge distributed outputs
|   |-- validate_dataset.py  # Pre-training audio validation
|   |-- download_and_upload.py  # Raw audio download/upload
|   |-- evaluate_channels.py  # Channel quality evaluation
|   |-- local_pipeline.py  # Legacy pipeline v4 (wav2vec2 CTC, deprecated)
|   |-- convert_to_f5.py  # Dataset format converter
|   |-- sinhala_tokenizer.py  # Grapheme tokenizer
|   |-- train_fastpitch.py  # FastPitch fallback (deprecated)
|   `-- test_asr_models.py  # ASR comparison test
`-- app.py  # Gradio voice-cloning inference app
Current Status & Next Steps
- Data source evaluation & channel quality scoring
- Architecture selection: F5-TTS (confirmed in iteration 005)
- Raw audio upload to Hub (621/727)
- Legacy CC-based pipeline v5 (YouTube auto-captions)
- LLM-assisted CC pipeline with Gemini/OpenAI provider routing
- compact_windowed, onepass, and windowed chunk modes
- candidate, acquire, and selective modes for text-first scaling
- Run manifest, token usage logging, process lock, dashboards, and human-audit sample output
- ASR model comparison: 7 models tested (iteration 004)
- TTS model landscape research: 10 models evaluated (iteration 005)
- Sinhala vocab.txt for F5-TTS (174 characters)
- F5-TTS fine-tuning script (train_f5tts.py)
- Distributed processing support with HF-backed leases
- Video-level train/val split (fixes data leakage)
- app.py rewritten for F5-TTS voice cloning inference
- Request access to IndicF5 and tharindumihi/tts-si-F5-TTS
- Scale selective over more captioned videos
- Audit accepted/rejected chunks in the review dashboard
- Upload remaining 106 raw videos
- Fine-tune F5-TTS on processed Sinhala data
- Evaluate with MOS/MUSHRA listening tests
- Deploy as HF Space
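The video-level train/val split listed above matters because clips cut from one video share the same speaker, microphone, and topic, so a clip-level random split leaks near-duplicates into validation. Below is a minimal sketch of one way to do it, assuming each clip record carries a `video_id` key. The function names and the 5% validation fraction are hypothetical, not the repo's settings.

```python
import hashlib

def split_for_video(video_id, val_fraction=0.05):
    """Deterministically map a whole video to 'train' or 'val' by hashing its ID."""
    digest = hashlib.sha256(video_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return "val" if bucket < val_fraction else "train"

def assign_splits(clips, val_fraction=0.05):
    """Tag every clip with the split of its source video."""
    return [{**c, "split": split_for_video(c["video_id"], val_fraction)} for c in clips]

if __name__ == "__main__":
    clips = [{"video_id": "vid1", "clip": i} for i in range(3)]
    print({c["split"] for c in assign_splits(clips)})
```

Hashing the ID instead of shuffling keeps the assignment stable across reruns and across machines, which helps when the dataset is rebuilt incrementally.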
Hardware Requirements
| Stage | Min Hardware | Recommended | Time |
|---|---|---|---|
| CC download | Any | Laptop | ~4 sec/video |
| LLM candidate chunking | CPU + API key | Gemini (gemini-3.1-pro-preview) | Model/API latency |
| Selective audio acquisition | CPU + network | yt-dlp + ffmpeg | Only for text-passed candidates |
| Local ASR validation | GPU optional | RTX 4050 via CUDA | Per accepted clip |
| Legacy CC-only processing | CPU | Any | ~30 sec/video |
| Legacy local pipeline | GPU 6GB+ | RTX 4050 | ~15 min/video |
| F5-TTS fine-tune (from IndicF5) | 1× A100 80GB | 1× A100 | 3-7 days |
| F5-TTS fine-tune (from base) | 1× A100 80GB | 8× A100 | 2 weeks (1 GPU) / 3 days (8 GPUs) |
| Inference | GPU 4GB+ | Any modern GPU | Real-time |
Distributed Processing
python scripts/local_pipeline.py --distributed --resume
Multiple machines share work via HF-backed claim leases. After all videos are processed, merge the outputs:
python scripts/finalize_distributed_dataset.py --upload
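The claim-lease idea can be illustrated with an in-memory table. This is only a sketch of the general mechanism: how the repo stores leases on the Hub is not described here, and `LeaseTable`, its methods, and the one-hour TTL are hypothetical.

```python
import time

class LeaseTable:
    """Toy lease registry: a worker owns a video until its lease expires."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.leases = {}  # video_id -> (worker, expires_at)

    def claim(self, video_id, worker, now=None):
        """Return True if the worker now holds the lease, False if someone else does."""
        now = time.time() if now is None else now
        holder = self.leases.get(video_id)
        if holder and holder[0] != worker and holder[1] > now:
            return False  # another worker holds a live lease
        # Free, expired, or a renewal by the same worker: (re)issue the lease.
        self.leases[video_id] = (worker, now + self.ttl)
        return True

if __name__ == "__main__":
    table = LeaseTable(ttl_seconds=3600)
    print(table.claim("vid1", "machine-a", now=0))
    print(table.claim("vid1", "machine-b", now=10))
```

The expiry is what makes the scheme robust with `--resume`: if a machine dies mid-video, its lease lapses and another worker can pick the video up instead of it being stuck forever.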
Quality Controls
The current LLM-assisted pipeline records reproducible runs and exposes quality decisions in manifests, reports, and dashboards:
- run_manifest.json captures command args, config, prompt hashes, code hashes, and git state, and redacts API keys.
- Text-source gates compare LLM output to the selected caption range and reject excessive drift by default.
- Language/vocab gates record Sinhala ratio, Latin ratio, Unicode NFC/control checks, and repeated whitespace issues.
- Extended audio metrics record estimated SNR, speech-frame ratio, loudness, spectral metrics, and text rate.
- selective mode writes candidates/candidate_manifest.jsonl before downloading/cutting audio.
- Validated runs write manifest_accepted, manifest_rejected, train/val/test manifests, human_audit_sample.jsonl, and reports/review_dashboard.html.
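The text-source and language gates above are deterministic checks, so they are easy to sketch. The version below is an assumption-laden illustration, not the pipeline's implementation: the function names, the use of `difflib` for drift, and every threshold are hypothetical.

```python
import difflib
import re
import unicodedata

def language_metrics(text):
    """Compute simple script ratios and Unicode hygiene flags for a chunk."""
    visible = [ch for ch in text if not ch.isspace()]
    total = len(visible) or 1
    sinhala = sum(1 for ch in visible if "\u0d80" <= ch <= "\u0dff")  # Sinhala block
    latin = sum(1 for ch in visible if ch.isascii() and ch.isalpha())
    return {
        "sinhala_ratio": sinhala / total,
        "latin_ratio": latin / total,
        "is_nfc": unicodedata.normalize("NFC", text) == text,
        "has_control": any(
            unicodedata.category(ch) == "Cc" and ch not in "\n\t" for ch in text
        ),
        "repeated_whitespace": bool(re.search(r"\s{2,}", text)),
    }

def source_drift(llm_text, caption_text):
    """0.0 means identical to the caption source; 1.0 means nothing in common."""
    return 1.0 - difflib.SequenceMatcher(None, llm_text, caption_text).ratio()

def passes_text_gates(llm_text, caption_text,
                      min_sinhala=0.8, max_latin=0.1, max_drift=0.2):
    """Hypothetical thresholds; tune them against audited samples."""
    m = language_metrics(llm_text)
    return (
        m["sinhala_ratio"] >= min_sinhala
        and m["latin_ratio"] <= max_latin
        and m["is_nfc"]
        and not m["has_control"]
        and source_drift(llm_text, caption_text) <= max_drift
    )
```

The drift check is what keeps the LLM in the role of choosing boundaries only: if its output departs too far from the caption range it was given, the chunk is rejected rather than trusted.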
Older local-pipeline quality options still exist for the deprecated wav2vec2 flow:
python scripts/local_pipeline.py --no-trim-silence
python scripts/local_pipeline.py --no-loudness-normalize
python scripts/local_pipeline.py --target-lufs -23 --silence-top-db 35 --silence-pad-ms 100
python scripts/local_pipeline.py --enable-speaker-filter --speaker-similarity-min 0.45
python scripts/local_pipeline.py --verify-asr-model <wav2vec2-ctc-model> --min-asr-agreement 0.75
See TTS_QUALITY_ENHANCEMENT_PLAN.md for the full roadmap.
A personal project. Not affiliated with any organization.