🇱🇰 Sinhala TTS — Natural Sinhala Text-to-Speech with Voice Cloning

Build a natural-sounding Sinhala text-to-speech system with zero-shot voice cloning — speak Sinhala in anyone's voice from a 5-second reference clip.

Status: The LLM-assisted CC pipeline under process/raw-extract/ is now the primary high-quality dataset path. compact_windowed chunking plus selective acquisition has been verified on pilot 006, with the legacy CC-only script kept as a fast baseline. Next: scale selective across more captioned videos, audit the dashboard output, then fine-tune from IndicF5 or an existing Sinhala checkpoint.

📋 Voice Cloning Plan: See VOICE_CLONING_PLAN.md for full architecture research and implementation roadmap.
📊 Iteration Log: See iterations/README.md for what's been tried and what worked.
🧭 End-to-End Technical Overview: See END_TO_END_TECHNICAL_OVERVIEW.md for the full system architecture, training flow, inference path, and operational lessons learned.

Why This Exists

Existing Sinhala TTS (like MMS-TTS) sounds robotic because of 5 compounding failures:

Terrible training data — MMS-TTS was trained on New Testament Bible recordings (~1-3h, monotone prosody, background music)
Sinhala's Brahmic script is hostile to character-level models (multi-codepoint phonemes, conjunct consonants like ක්‍ෂ)
VITS architectural weaknesses get amplified on small data (deterministic duration prediction, MAS alignment gets stuck)
Duration/prosody modeling fails on inconsistent-tempo religious recordings
Domain mismatch — religious text training → modern Sinhala inference
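To make the script problem concrete, here is a minimal sketch of why per-code-point tokenization splits a single Sinhala conjunct into several symbols. The `sinhala_clusters` helper is hypothetical and simplified (it is not the logic in scripts/sinhala_tokenizer.py, and it is not a full Unicode grapheme segmenter); it only attaches combining marks and virama+ZWJ conjuncts to their base consonant.

```python
import unicodedata

VIRAMA = "\u0dca"  # Sinhala al-lakuna (hal kirima)
ZWJ = "\u200d"     # zero-width joiner, used to form conjuncts


def sinhala_clusters(text):
    """Group code points into rough grapheme clusters (simplified sketch)."""
    clusters = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        # Vowel signs and the virama are combining marks (categories Mn / Mc).
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        # A ZWJ after a virama, and the consonant after that ZWJ,
        # belong to the same conjunct.
        joins_prev = prev.endswith(ZWJ) or (ch == ZWJ and prev.endswith(VIRAMA))
        if clusters and (is_mark or joins_prev):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    conjunct = "\u0d9a\u0dca\u200d\u0dc2"  # KA + virama + ZWJ + SSA
    print(len(conjunct), "code points")              # 4 code points
    print(len(sinhala_clusters(conjunct)), "cluster")  # 1 cluster
```

A character-level model sees four inputs for what a reader perceives as one written unit, which is one reason small-data alignment struggles.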

Our approach: 100+ hours of clean Sinhala speech from YouTube, processed through a research-grade enhancement pipeline and used to fine-tune F5-TTS, starting from IndicF5 (11 Indic languages, 1,417h).

Architecture

Training Target: F5-TTS (Voice Cloning)

Reference Audio (5s) ──→ Mel Spectrogram ──┐
                                           ├──→ F5-TTS (Flow DiT) ──→ Vocos ──→ Cloned Audio
Sinhala Text ──→ Char Embeddings ──────────┘

Why F5-TTS? After evaluating 10+ models (full analysis):

Model	Type	Voice Clone	Sinhala-Ready	Open
F5-TTS ✅	Flow DiT	Mel infilling, 3-10s ref	✅ Sinhala checkpoint exists	CC-BY-NC
Fish Audio S2	Dual-AR (Qwen3-4B)	Multi-speaker + instruction	❓ BPE needs validation	Apache 2.0
IndexTTS	GPT-style	Zero-shot	❌ CN/EN only	Apache 2.0
CosyVoice 3	LLM + Flow	Token prompting	❌ Needs BPE retrain	Apache 2.0
Spark-TTS	Single-stream LLM	Global tokens	❌ Needs codec retrain	Apache 2.0

Key discovery: A Sinhala F5-TTS already exists (tharindumihi/tts-si-F5-TTS, 7.7h, 230K steps) and IndicF5 covers 11 Brahmic-script Indic languages on 1,417h. We fine-tune from these instead of training from scratch.

Pipeline Overview

┌─────────────────────────────────────────────────────────────┐
│ STAGE 1: Data Evaluation ✅ DONE                            │
│                                                             │
│   evaluate_channels.py → Found @sunchare (727 videos, 371h)│
│   speaker_analysis.py → 2 speakers, dominant = 88%         │
└─────────────────┬───────────────────────────────────────────┘
                  │
┌─────────────────▼───────────────────────────────────────────┐
│ STAGE 2: Raw Audio Upload ✅ 621/727 videos uploaded        │
│                                                             │
│   Dataset: outlawmold/sinhala-tts-raw-audio (31.1GB)       │
│   Missing: 106 videos (60 local, 46 need re-download)      │
└─────────────────┬───────────────────────────────────────────┘
                  │
┌─────────────────▼───────────────────────────────────────────┐
│ STAGE 3: LLM-Assisted CC Dataset Build ← YOU ARE HERE       │
│                                                             │
│   process/raw-extract/pipeline.py:                          │
│     1. Parse YouTube Sinhala captions                       │
│     2. Ask Gemini/OpenAI-compatible LLM for semantic chunks │
│     3. Run deterministic text/source/language gates         │
│     4. In selective mode, acquire audio only for candidates │
│     5. Cut clips, validate audio + Gemini/CC, write dashboards│
│     6. Export F5-TTS dataset                               │
│   Current best mode: Gemini + compact_windowed + selective │
│   Legacy baseline: scripts/cc_pipeline.py v5                │
└─────────────────┬───────────────────────────────────────────┘
                  │
┌─────────────────▼───────────────────────────────────────────┐
│ STAGE 4: F5-TTS Fine-tuning                                 │
│                                                             │
│   train_f5tts.py ⚠ REQUIRES TRAINING-READY DATA:            │
│     Starting checkpoint: IndicF5 or tharindumihi Sinhala    │
│     Vocab: data/sinhala_vocab/vocab.txt (174 chars)         │
│     Hardware: 1×A100, ~3-7 days from IndicF5               │
└─────────────────┬───────────────────────────────────────────┘
                  │
┌─────────────────▼───────────────────────────────────────────┐
│ STAGE 5: Voice Cloning Inference                            │
│                                                             │
│   app.py: Upload 5s reference + type Sinhala text →          │
│   Generates speech in the reference voice                    │
└─────────────────────────────────────────────────────────────┘

Quick Start

Run LLM-Assisted CC Pipeline

The current recommended path is selective: captions and LLM chunking run first, then audio is downloaded or cut only for text-passed candidates.

.\.venv310\Scripts\python.exe process/raw-extract/pipeline.py `
  --videos "https://www.youtube.com/watch?v=0cROexEjv80" `
  --output-dir pipeline_output/llm_cc/pilot_006_selective `
  --dataset-name sinhala_tts_llm_cc_pilot_006_selective `
  --mode selective `
  --llm-provider gemini `
  --llm-model gemini-3.1-pro-preview `
  --chunk-mode compact_windowed `
  --semantic-meaningfulness-gate-action reject `
  --run-gemini-audio-verification `
  --resume `
  --verbose

Set GEMINI_API_KEY or pass --api-key. --videos accepts a YouTube URL, a raw video ID, or a JSON/JSONL manifest for batches. Missing Sinhala JSON3 captions are downloaded automatically into cc_output before LLM chunking. Candidate spans can begin/end inside JSON3 cues when word timings exist, and preserve their CC word references.

The run writes a create-once run_manifest.json, candidates, accepted/rejected manifests, a review dashboard, and an F5-compatible dataset when rows pass. --resume verifies material inputs, configuration, model arguments, prompts, code hashes, stage input fingerprints, and every stage-owned artifact; use a new output directory after any material change.

Gemini 3.6 Flash audio + exact-CC verification is the default production evidence; its immutable cache is run-local under verification_cache/gemini_audio/. ASR remains an optional checkpointed audit with --mode asr-audit.

Verified export requires explicit Gemini-audio, human, or ASR evidence, a clean promotion decision, and an identity-aware leakage-safe split. Target-voice promotion should use a versioned global speaker registry and stable enrolled speaker ID; per-video cluster IDs are audit-only by default. Caption-only rows are experimental and cannot promote automatically, even when Gemini verification is intentionally skipped. Use ASR audit for calibration/comparison, then review evidence before --mode promote and --mode export.
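As an illustration of what a leakage-safe split can look like, the sketch below assigns every clip to a split based on a hash of a grouping key, so all clips sharing that key land in the same split. The `video_id` field, the function names, and the 5%/5% ratios are assumptions for the example, not the pipeline's actual schema or settings; an enrolled speaker ID could be used as the key in the same way.

```python
import hashlib


def assign_split(group_id, val_pct=5, test_pct=5):
    """Map a grouping key to train/val/test deterministically."""
    digest = hashlib.sha256(group_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def split_rows(rows, key="video_id"):
    """Partition manifest rows so one group never spans two splits."""
    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        splits[assign_split(row[key])].append(row)
    return splits
```

Because the assignment depends only on the key, re-running the split on a larger manifest keeps existing groups where they were.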

Human Sinhala gate calibration and model selection now have standalone, model-free contracts in calibration_workflow.py and tts_checkpoint_evaluation.py. They enforce calibration/evaluation identity separation, immutable audio/text provenance, independent review/adjudication, and per-speaker/per-prompt regression gates.

External F5 exports are also immutable: each run records an exact file-hash receipt, and --resume verifies the receipt and exported directory instead of rewriting an existing dataset.
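A file-hash receipt of this kind can be sketched with the standard library alone. The function names and the receipt layout (relative path mapped to SHA-256 hex digest) are assumptions for illustration; the pipeline's real receipt format is not shown in this README.

```python
import hashlib
from pathlib import Path


def file_sha256(path):
    """Hash a file in 1 MiB blocks so large audio files are not loaded whole."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def build_receipt(export_dir):
    """Return {relative_path: sha256} for every file under export_dir."""
    root = Path(export_dir)
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_receipt(export_dir, receipt):
    """True only if the directory matches the receipt exactly:
    no added, removed, or modified files."""
    return build_receipt(export_dir) == receipt
```

On resume, a mismatch would be a reason to stop and use a new output directory instead of rewriting files in place.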

Legacy CC-Only Baseline

scripts/cc_pipeline.py remains useful for quick non-LLM comparisons:

python scripts/cc_pipeline.py download-cc --video-list tts_channel_eval/unlimited_history_videos.json
python scripts/cc_pipeline.py process --cc-dir cc_output --max-videos 10
python scripts/cc_pipeline.py stats --cc-dir cc_output

Deprecated local pipeline: local_pipeline.py v4 (wav2vec2 CTC ASR) still exists for comparison, but it requires a GPU and produced lower-quality transcripts than YouTube CC in earlier tests.

Train F5-TTS

pip install f5-tts

# 1. Generate vocab (already done — see data/sinhala_vocab/vocab.txt)
python scripts/train_f5tts.py vocab

# 2. Prepare dataset
python scripts/train_f5tts.py prepare --dataset_dir pipeline_output/dataset

# 3. Build Arrow dataset
f5-tts_finetune-data data/sinhala_tts/metadata.csv data/sinhala_tts

# 4. Fine-tune from IndicF5
python scripts/train_f5tts.py finetune --pretrain /path/to/indicf5/model.safetensors

# 5. Inference
python scripts/train_f5tts.py infer \
    --checkpoint ckpts/sinhala_tts/model_XXXXX.pt \
    --ref_audio reference.wav \
    --ref_text "මේ කොටුබිල්වල විහාරයේ" \
    --gen_text "බුදුදහම ගැන මේ අටුවා ටික පරිවර්තනය කරන්ද"

Data Sources

Primary: @sunchare "Unlimited History" (YouTube)

Metric	Value
Channel	@sunchare
Videos	727 total, 621 uploaded to Hub
Total hours	~371h raw
Raw audio	outlawmold/sinhala-tts-raw-audio (31.1GB)
Est. yield (legacy v4)	~10-30h clean after wav2vec2 ASR pipeline
Est. yield (legacy CC v5)	Historical projection only; needs re-audit against current quality gates
Current dataset path	LLM-assisted CC chunks with selective audio acquisition

Key Technical Decisions (Iterations)

#	Decision	Rationale
004	ASR: SpideyDLK wav2vec2 CTC	CTC can't hallucinate. MMS doesn't support Sinhala. Whisper-large-v3 was the worst. Details
005	TTS: F5-TTS from IndicF5	Existing Sinhala model proves feasibility. IndicF5 has 11 Brahmic languages. Fish Audio S2 is a future upgrade. Details
006	Pipeline: YouTube CC + LLM chunking	YouTube Sinhala captions are the source text, while the LLM chooses semantic boundaries. The current scaling path is `process/raw-extract/pipeline.py --mode selective`, so expensive audio/ASR work happens only after CC/text gates pass. Details
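Since the captions are the source text and the LLM should only pick boundaries, a deterministic check can measure how far a chunk's text has drifted from the caption range it claims to cover. The sketch below uses `difflib` for this; the function names and the 0.05 threshold are hypothetical, and the pipeline's actual text-source gate may use a different metric.

```python
import difflib
import unicodedata


def normalize(text):
    """NFC-normalize and collapse whitespace before comparing."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def drift_ratio(llm_text, caption_text):
    """0.0 means identical after normalization, 1.0 means nothing in common."""
    a, b = normalize(llm_text), normalize(caption_text)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return 1.0 - matcher.ratio()


def passes_source_gate(llm_text, caption_text, max_drift=0.05):
    """Reject chunks whose text was rewritten instead of selected."""
    return drift_ratio(llm_text, caption_text) <= max_drift
```

A chunk that differs from its captions only in spacing passes with zero drift, while a paraphrased or hallucinated chunk is rejected before any audio work is spent on it.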

File Structure

outlawmold/sinhala-tts/
├── README.md                          # This file
├── VOICE_CLONING_PLAN.md              # F5-TTS architecture research & plan
├── TTS_QUALITY_ENHANCEMENT_PLAN.md    # Dataset quality improvement roadmap
├── TODO.md                            # Raw upload quality gate
├── yt-to-do.md                        # Missing raw video tracking
├── data/
│   └── sinhala_vocab/vocab.txt        # Sinhala vocab for F5-TTS (174 chars) ✅
├── iterations/
│   ├── 001-2video-test/               # HTDemucs broken, Whisper hallucinated
│   ├── 002-demucs-fixed/              # Demucs fixed, ASR still garbage
│   ├── 003-asr-comparison/            # ASR comparison plan
│   ├── 004-asr-comparison-results/    # ✅ wav2vec2 CTC wins
│   └── 005-tts-model-research/        # ✅ F5-TTS confirmed, IndicF5 found
├── process/
│   └── raw-extract/
│       ├── pipeline.py                # ⭐ Primary LLM-assisted CC pipeline ✅
│       ├── README.md                  # Current runbook and CLI reference
│       └── LLM_SELECTION_ANALYSIS.md  # Gemini/OpenAI mode comparison
├── scripts/
│   ├── cc_pipeline.py                 # Legacy fast CC-only baseline
│   ├── train_f5tts.py                 # F5-TTS fine-tuning ✅
│   ├── finalize_distributed_dataset.py # Merge distributed outputs ✅
│   ├── validate_dataset.py            # Pre-training audio validation ✅
│   ├── download_and_upload.py         # Raw audio download/upload ✅
│   ├── evaluate_channels.py           # Channel quality evaluation ✅
│   ├── local_pipeline.py              # Legacy pipeline v4 (wav2vec2 CTC — deprecated)
│   ├── convert_to_f5.py               # Dataset format converter
│   ├── sinhala_tokenizer.py           # Grapheme tokenizer
│   ├── train_fastpitch.py             # FastPitch fallback (deprecated)
│   └── test_asr_models.py            # ASR comparison test
└── app.py                             # Gradio voice-cloning inference app ✅

Current Status & Next Steps

Data source evaluation & channel quality scoring
Architecture selection → F5-TTS (confirmed iteration 005)
Raw audio upload to Hub (621/727)
Legacy CC-based pipeline v5 (YouTube auto-captions)
LLM-assisted CC pipeline with Gemini/OpenAI provider routing
compact_windowed, onepass, and windowed chunk modes
candidate, acquire, and selective modes for text-first scaling
Word-level partial-cue spans with source-only transcript reconstruction
Gemini audio + exact-CC verification with resumable immutable caching
Run manifest, token usage logging, process lock, dashboards, and human-audit sample output
ASR model comparison — 7 models tested (iteration 004)
TTS model landscape research — 10 models evaluated (iteration 005)
Sinhala vocab.txt for F5-TTS (174 characters)
F5-TTS fine-tuning script (train_f5tts.py)
Distributed processing support with HF-backed leases
Video-level train/val split (fixes data leakage)
app.py rewritten for F5-TTS voice cloning inference
→ Request access to IndicF5 and tharindumihi/tts-si-F5-TTS
→ Scale selective over more captioned videos
→ Audit accepted/rejected chunks in the review dashboard
Upload remaining 106 raw videos
Fine-tune F5-TTS on processed Sinhala data
Evaluate with MOS/MUSHRA listening tests
Deploy as HF Space

Hardware Requirements

Stage	Min Hardware	Recommended	Time
CC download	Any	Laptop	~4 sec/video
LLM candidate chunking	CPU + API key	Gemini `gemini-3.1-pro-preview`	Model/API latency
Selective audio acquisition	CPU + network	yt-dlp + ffmpeg	Only for text-passed candidates
Optional local ASR audit	GPU optional	RTX 4050 via CUDA	Per audited clip
Legacy CC-only processing	CPU	Any	~30 sec/video
Legacy local pipeline	GPU 6GB+	RTX 4050	~15 min/video
F5-TTS fine-tune (from IndicF5)	1× A100 80GB	1× A100	3-7 days
F5-TTS fine-tune (from base)	1× A100 80GB	8× A100	2 weeks (1× A100) / 3 days (8× A100)
Inference	GPU 4GB+	Any modern GPU	Real-time

Distributed Processing

python scripts/local_pipeline.py --distributed --resume

Multiple machines share work via HF-backed claim leases. After all videos are processed:

python scripts/finalize_distributed_dataset.py --upload

Quality Controls

The current LLM-assisted pipeline records reproducible runs and exposes quality decisions in manifests, reports, and dashboards:

run_manifest.json captures command args, config, prompt hashes, code hashes, and git state, with API keys redacted.
Text-source gates compare LLM output to the selected caption range and reject excessive drift by default.
Language/vocab gates record Sinhala ratio, Latin ratio, Unicode NFC/control checks, and repeated whitespace issues.
Extended audio metrics record estimated SNR, speech-frame ratio, loudness, spectral metrics, and text rate.
selective mode writes candidates/candidate_manifest.jsonl before downloading/cutting audio.
Validated runs write manifest_accepted, manifest_rejected, train/val/test manifests, human_audit_sample.jsonl, and reports/review_dashboard.html.
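The language/vocab measurements listed above can be approximated in a few lines. This is a sketch under assumptions: the metric names, the use of the Sinhala Unicode block (U+0D80 to U+0DFF) as the definition of "Sinhala", and the choice of non-whitespace characters as the denominator are all illustrative, not the pipeline's exact definitions.

```python
import re
import unicodedata


def language_metrics(text):
    """Compute simple script and hygiene metrics for one transcript."""
    visible = [ch for ch in text if not ch.isspace()]
    total = len(visible) or 1  # avoid division by zero on empty text
    sinhala = sum(1 for ch in visible if "\u0d80" <= ch <= "\u0dff")
    latin = sum(1 for ch in visible if ch.isascii() and ch.isalpha())
    return {
        "sinhala_ratio": sinhala / total,
        "latin_ratio": latin / total,
        "is_nfc": unicodedata.is_normalized("NFC", text),
        # Category Cc covers control characters. The zero-width joiner used
        # in Sinhala conjuncts is category Cf, so it is not flagged here.
        "has_control_chars": any(
            unicodedata.category(ch) == "Cc" and ch not in "\n\t" for ch in text
        ),
        "has_repeated_whitespace": re.search(r"\s{2,}", text) is not None,
    }
```

Recording the raw metrics and applying thresholds in a separate step makes it possible to re-tune a gate without recomputing anything.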

Older local-pipeline quality options still exist for the deprecated wav2vec2 flow:

python scripts/local_pipeline.py --no-trim-silence
python scripts/local_pipeline.py --no-loudness-normalize
python scripts/local_pipeline.py --target-lufs -23 --silence-top-db 35 --silence-pad-ms 100
python scripts/local_pipeline.py --enable-speaker-filter --speaker-similarity-min 0.45
python scripts/local_pipeline.py --verify-asr-model <wav2vec2-ctc-model> --min-asr-agreement 0.75

See TTS_QUALITY_ENHANCEMENT_PLAN.md for the full roadmap.

A personal project. Not affiliated with any organization.

Paper for outlawmold/sinhala-tts: F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching (arXiv 2410.06885, published Oct 9, 2024)