Thedal Transcript (TT): Bilingual Tamil-English Whisper (V11)

Thedal (தேடல்) Transcript is a privacy-focused bilingual speech-to-text engine built for on-device search applications.
Version 11 is the product of an 8-stage fine-tuning process and is designed to run on Android ARM CPUs with no external API dependencies or internet connectivity.

Model Architecture and Deployment

This repository provides two distinct versions of the model to support different operational requirements.

1. Master Research Weights

File: model.safetensors (291 MB)
Precision: FP16 / FP32
Context: High-precision weights suitable for desktop inference, server-side processing, or further fine-tuning.
Vocabulary: Features a specialized tokenizer with 500+ custom-mapped Tamil tokens targeted at banking and financial search contexts.

2. Mobile Optimized Engine

Path: android_slim/ (98 MB total)
Format: INT8 Quantized ONNX
Optimization: Single unified decoder without a key-value cache variant (non-cache), which reduces disk footprint and memory overhead
Compatibility: Optimized for inference via ONNX Runtime or Sherpa-ONNX on mobile edge devices

Technical Specifications

Custom Vocabulary: 500+ domain-specific Tamil tokens, integrated to prevent the transcription hallucinations and repetition loops common in base Whisper models.
Bilingual Capability: Native support for code-switching between Tamil and English within a single audio stream.
Latency Profile: Optimized for short-form audio queries typical of search interfaces.
Privacy: 100% local execution; no audio data or transcriptions leave the device.
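
Because a single transcript can mix Tamil and English, downstream search code may want to know which words came out in which script. The following is a minimal sketch of one way to do that, not part of the Thedal engine itself. The function names `script_of` and `tag_words_by_script` are hypothetical; the only fixed fact used is that the Tamil Unicode block spans U+0B80 to U+0BFF.

```python
def script_of(ch):
    """Classify one character as Tamil, Latin, or neither."""
    if "\u0b80" <= ch <= "\u0bff":      # Tamil Unicode block
        return "tamil"
    if ch.isascii() and ch.isalpha():   # basic Latin letters
        return "latin"
    return None                         # digits, punctuation, other scripts


def tag_words_by_script(text):
    """Label each whitespace-separated word as 'ta', 'en', 'mixed', or 'other'."""
    tagged = []
    for word in text.split():
        scripts = {s for s in map(script_of, word) if s is not None}
        if scripts == {"tamil"}:
            label = "ta"
        elif scripts == {"latin"}:
            label = "en"
        elif scripts:
            label = "mixed"
        else:
            label = "other"
        tagged.append((word, label))
    return tagged


# Illustrative code-switched query (not taken from the benchmark set)
example = tag_words_by_script("என் account balance என்ன")
```

Here `example` holds one `(word, label)` pair per word, with the Tamil words tagged `ta` and the English words tagged `en`.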

Performance Validation

Language	Word Error Rate (WER)	Technical Note
English	0.05 (5%)	High fidelity for command-and-control and search intent
Tamil	0.33 (33%)	Strong phonetic preservation; V11 reduces INT8 quantization noise (P/V phoneme flip)

Note: Benchmarks were evaluated on 200 bilingual search query samples with num_beams = 5.
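
For readers who want to reproduce a WER figure on their own samples, the metric can be sketched as word-level edit distance divided by the number of reference words. This is a generic illustration, not the evaluation script used for the table above: the model card does not state the text normalization applied or whether scores were pooled across the corpus or averaged per utterance, so the pooled form below is an assumption, and the sample pairs are made up.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr[j] = min(
                prev[j] + 1,      # delete a reference word
                curr[j - 1] + 1,  # insert a hypothesis word
                substitution,     # substitute, or match at zero cost
            )
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """Pooled WER: total edits divided by total reference words."""
    total_edits = 0
    total_ref_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        total_edits += edit_distance(ref_words, hypothesis.split())
        total_ref_words += len(ref_words)
    if total_ref_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_ref_words


# Illustrative (reference, hypothesis) pairs, not benchmark data
samples = [
    ("check my account balance", "check account balance"),  # one deletion
    ("என் கணக்கு இருப்பு", "என் கணக்கு இருப்பு"),            # exact match
]
wer = corpus_wer(samples)  # 1 edit over 7 reference words
```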

Training Procedure

The V11 engine was developed through a specialized 8-stage fine-tuning protocol.

This phased approach used a strategic combination of datasets and a non-linear learning rate schedule to achieve stable convergence across both languages.

8-Stage Protocol

Initial Domain Alignment (Stage 1): Baseline adaptation using LibriSpeech to solidify English phonetic anchors.
Tamil Foundation (Stage 2): Primary weight initialization for the custom 500+ token Tamil vocabulary using Kathbath.
Bilingual Bridge (Stage 3): High learning rate training on Indic Voice to align cross-lingual phonetic representations.
Iterative Refinement (Stages 4–7): Progressive learning rate decay cycles (1e-4 → 1e-6) with mixed-dataset batches to reduce WER and prevent catastrophic forgetting.
Acoustic Hardening (Stage 8): Final ultra-low learning rate stage to minimize the P/V phoneme flip caused by INT8 quantization artifacts.
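
The model card gives only the endpoints of the decay (1e-4 down to 1e-6) and describes the schedule as non-linear; the per-stage values are not published. As a rough illustration of what a phased decay over Stages 4–7 could look like, the sketch below assumes geometric (log-linear) spacing between the two endpoints. The function name `stage_peak_lr` and the geometric spacing are assumptions, not the author's actual schedule.

```python
def stage_peak_lr(stage, first_stage=4, last_stage=7,
                  lr_start=1e-4, lr_end=1e-6):
    """Peak learning rate for a refinement stage under geometric decay.

    Assumption: rates are spaced evenly on a log scale between lr_start
    (at first_stage) and lr_end (at last_stage).
    """
    if not first_stage <= stage <= last_stage:
        raise ValueError(f"stage must be in [{first_stage}, {last_stage}]")
    fraction = (stage - first_stage) / (last_stage - first_stage)
    return lr_start * (lr_end / lr_start) ** fraction


# Peak learning rate per refinement stage under this assumption
schedule = {stage: stage_peak_lr(stage) for stage in range(4, 8)}
```

Under this assumption each stage starts at roughly one fifth of the previous stage's rate, so the largest absolute drops occur early and the final stages make only small weight updates.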

Training Hyperparameters

Optimizer: AdamW (fused)
Betas: 0.9, 0.999
Epsilon: 1e-8
Learning Rate: 1e-4 → 1e-6 (phased decay)
Global Batch Size: 16 (4 per device × 4 accumulation steps)
Per-Device Batch Size: 4
Gradient Accumulation Steps: 4
Mixed Precision: AMP
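
The values above can be collected into a small configuration object, which also makes the batch-size arithmetic explicit. This is a sketch only: the class name `TrainingConfig` and its field names are hypothetical, and `num_devices = 1` is an inference from 4 × 4 = 16 rather than something the model card states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as listed in the model card (field names are illustrative)."""
    optimizer: str = "adamw_fused"
    betas: tuple = (0.9, 0.999)
    eps: float = 1e-8
    lr_start: float = 1e-4
    lr_end: float = 1e-6
    per_device_batch_size: int = 4
    gradient_accumulation_steps: int = 4
    num_devices: int = 1  # assumption: inferred from 4 x 4 = 16, not stated
    mixed_precision: bool = True

    @property
    def global_batch_size(self):
        # Samples contributing to each optimizer step
        return (self.per_device_batch_size
                * self.gradient_accumulation_steps
                * self.num_devices)


config = TrainingConfig()
```

With the defaults, `config.global_batch_size` evaluates to 16, matching the listed global batch size.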

Datasets Used

Kathbath — Native Tamil speech
Indic Voice — Bilingual / code-switch audio
LibriSpeech — English phonetic base

Implementation Guide

Python Inference (Research)

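import librosa  # assumption: any loader that returns 16 kHz mono float audio works
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import WhisperProcessor

model_path = "Badri0510/thedal-tamil-full-v11"

# Load the INT8 ONNX export from android_slim/ (single decoder, no KV cache)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_path,
    subfolder="android_slim",
    use_cache=False
)

processor = WhisperProcessor.from_pretrained(
    model_path,
    subfolder="android_slim"
)

# "query.wav" is a placeholder path; Whisper expects 16 kHz mono audio
audio, _ = librosa.load("query.wav", sr=16000)
input_features = processor(
    audio, sampling_rate=16000, return_tensors="pt"
).input_features

# Beam search with the same beam width used for the reported benchmarks
predicted_ids = model.generate(
    input_features,
    num_beams=5,
    use_cache=False
)

# Convert token IDs back to text, dropping special tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)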