# Thedal Transcript (TT): Bilingual Tamil-English Whisper (V11)
Thedal (தேடல், Tamil for "search") Transcript is an optimized, privacy-focused bilingual speech-to-text engine developed for on-device search applications.
Version 11 is the result of a rigorous 8-stage fine-tuning process, specifically architected for deployment on Android ARM CPUs without requiring external API dependencies or internet connectivity.
## Model Architecture and Deployment
This repository provides two distinct versions of the model to support different operational requirements.
### 1. Master Research Weights
- File: `model.safetensors` (291 MB)
- Precision: FP16 / FP32
- Context: High-precision weights suitable for desktop inference, server-side processing, or further fine-tuning.
- Vocabulary: Features a specialized tokenizer with 500+ custom-mapped Tamil tokens targeted at banking and financial search contexts.
### 2. Mobile Optimized Engine
- Path: `android_slim/` (98 MB total)
- Format: INT8 Quantized ONNX
- Optimization: Unified single-decoder architecture (non-cache) to reduce disk footprint and memory overhead
- Compatibility: Optimized for inference via ONNX Runtime or Sherpa-ONNX on mobile edge devices
## Technical Specifications
- Custom Vocabulary: Integration of 500+ domain-specific Tamil tokens to prevent the transcription hallucinations and repetition loops common in base Whisper models.
- Bilingual Capability: Native support for code-switching between Tamil and English within a single audio stream.
- Latency Profile: Optimized for short-form audio queries typical of search interfaces.
- Privacy: 100% local execution; no audio data or transcriptions leave the device.
## Performance Validation
| Language | Word Error Rate (WER) | Technical Note |
|---|---|---|
| English | 0.05 (5%) | High fidelity for command-and-control and search intent |
| Tamil | 0.33 (33%) | Strong phonetic preservation; V11 reduces INT8 quantization noise (P/V phoneme flip) |
Note: Benchmarks were evaluated on 200 bilingual search query samples with `num_beams = 5`.
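For reference, WER figures like those in the table are conventionally computed as word-level edit distance divided by reference word count. A minimal sketch of that metric (illustrative only, not the project's evaluation harness):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)

print(word_error_rate("open my savings account", "open my saving account"))  # 0.25
```

A 0.05 English WER therefore corresponds to roughly one word error per twenty reference words.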
## Training Procedure
The V11 engine was developed through a specialized 8-stage fine-tuning protocol.
This phased approach used a strategic combination of datasets and a non-linear learning rate schedule to achieve stable convergence across both languages.
### 8-Stage Protocol
- Stage 1 (Initial Domain Alignment): Baseline adaptation using LibriSpeech to solidify English phonetic anchors.
- Stage 2 (Tamil Foundation): Primary weight initialization for the custom 500-token Tamil vocabulary using Kathbath.
- Stage 3 (Bilingual Bridge): High learning rate training on Indic Voice to align cross-lingual phonetic representations.
- Stages 4–7 (Iterative Refinement): Progressive learning rate decay cycles (1e-4 → 1e-6) with mixed-dataset batches to reduce WER and prevent catastrophic forgetting.
- Stage 8 (Acoustic Hardening): Final ultra-low learning rate stage to minimize the P/V phoneme flip caused by INT8 quantization artifacts.
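The mixed-dataset batching used in Stages 4–7 can be approximated with a simple interleaving sampler that draws each sample from a randomly chosen corpus, so no single dataset dominates any one update. The corpus names and uniform mixing below are illustrative; the actual training mix is not published:

```python
import random

def mixed_batches(datasets: dict, batch_size: int, num_batches: int, seed: int = 0):
    """Yield batches whose samples are drawn across several corpora at once,
    which helps guard against catastrophic forgetting of any one language."""
    rng = random.Random(seed)
    names = list(datasets)
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            name = rng.choice(names)  # uniform mix here; could be weighted
            batch.append((name, rng.choice(datasets[name])))
        yield batch

# Hypothetical utterance IDs standing in for real audio samples.
corpora = {
    "kathbath": ["ta_utt_1", "ta_utt_2"],
    "indicvoice": ["cs_utt_1"],
    "librispeech": ["en_utt_1", "en_utt_2"],
}
for batch in mixed_batches(corpora, batch_size=4, num_batches=2):
    print([name for name, _ in batch])
```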
## Training Hyperparameters
- Optimizer: AdamW (fused)
- Betas: 0.9, 0.999
- Epsilon: 1e-8
- Learning Rate: 1e-4 → 1e-6 (phased decay)
- Global Batch Size: 16 (4 per device × 4 gradient accumulation steps)
- Mixed Precision: AMP
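The phased 1e-4 → 1e-6 decay over the refinement stages could be realized as a geometric step per stage; the per-stage values below are an assumption for illustration, since the exact schedule is not published:

```python
def phased_lr(stage: int, start_lr: float = 1e-4, end_lr: float = 1e-6,
              first_stage: int = 4, last_stage: int = 7) -> float:
    """Geometric decay from start_lr to end_lr across the refinement stages."""
    steps = last_stage - first_stage
    ratio = (end_lr / start_lr) ** (1 / steps)  # constant per-stage multiplier
    k = min(max(stage, first_stage), last_stage) - first_stage
    return start_lr * ratio ** k

for s in range(4, 8):
    print(f"stage {s}: lr = {phased_lr(s):.2e}")
```

A geometric schedule keeps the ratio between consecutive stage learning rates constant, which spreads a two-order-of-magnitude drop evenly across four stages.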
## Datasets Used
- Kathbath — Native Tamil speech
- Indic Voice — Bilingual / code-switch audio
- LibriSpeech — English phonetic base
## Implementation Guide
### Python Inference (Research)
```python
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import WhisperProcessor

model_path = "Badri0510/thedal-tamil-full-v11"

model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_path,
    subfolder="android_slim",
    use_cache=False,  # the mobile engine ships a single, non-cached decoder
)
processor = WhisperProcessor.from_pretrained(
    model_path,
    subfolder="android_slim",
)

# `audio` is a 16 kHz mono waveform (numpy array), e.g. loaded via librosa/soundfile
input_features = processor(
    audio, sampling_rate=16000, return_tensors="pt"
).input_features

predicted_ids = model.generate(
    input_features,
    num_beams=5,
    use_cache=False,
)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```
### Android Deployment
Copy the contents of `android_slim/` into your app's assets directory.
Required files:
- model.onnx
- tokenizer.json
- config.json
The tokenizer is required to map the custom 500-token Tamil vocabulary.
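A small pre-bundle sanity check can confirm the three required files are present before packaging; `missing_assets` is a hypothetical helper, not part of the release:

```python
from pathlib import Path

REQUIRED = ("model.onnx", "tokenizer.json", "config.json")

def missing_assets(asset_dir: str) -> list:
    """Return any required engine files absent from the asset directory."""
    root = Path(asset_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]
```

Running this against your assets folder before building the APK avoids shipping an app that fails at model load; a non-empty result names exactly what is missing.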
## Environment
- Transformers 5.3.0
- PyTorch 2.1.0 + cu128
- Optimum 1.24.0
- Tokenizers 0.22.2
## Project
Developed under Thedal Edge-AI Initiative
Privacy-first on-device Tamil speech intelligence.
## Model Tree
Base model: Badri0510/thedal-tamil-full-v3