Thedal Transcript (TT): Bilingual Tamil-English Whisper (V11)

Thedal (தேடல்) Transcript is a privacy-focused bilingual speech-to-text engine developed for on-device search applications.
Version 11 is the product of an 8-stage fine-tuning process and is designed for deployment on Android ARM CPUs, with no external API dependencies and no internet connectivity required.


Model Architecture and Deployment

This repository provides two distinct versions of the model to support different operational requirements.

1. Master Research Weights

  • File: model.safetensors (291 MB)
  • Precision: FP16 / FP32
  • Context: High-precision weights suitable for desktop inference, server-side processing, or further fine-tuning.
  • Vocabulary: Features a specialized tokenizer with 500+ custom-mapped Tamil tokens targeted at banking and financial search contexts.

2. Mobile Optimized Engine

  • Path: android_slim/ (98 MB total)
  • Format: INT8 Quantized ONNX
  • Optimization: Unified single-decoder architecture (non-cache) to reduce disk footprint and memory overhead
  • Compatibility: Optimized for inference via ONNX Runtime or Sherpa-ONNX on mobile edge devices
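
As a rough sanity check on the two file sizes, the sketch below estimates weight storage from a parameter count and the number of bytes per parameter. The 72.8M parameter count is the figure reported for the safetensors file on this page. The helper name and the assumption that every weight is stored at one precision are illustrative only; real ONNX exports carry graph data and may leave some tensors unquantized.

```python
def estimated_weight_size_mb(num_params: float, bytes_per_param: int) -> float:
    """Approximate weight storage in decimal megabytes (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6


NUM_PARAMS = 72.8e6  # parameter count reported for model.safetensors

fp32_mb = estimated_weight_size_mb(NUM_PARAMS, 4)  # 4 bytes per FP32 weight
int8_mb = estimated_weight_size_mb(NUM_PARAMS, 1)  # 1 byte per INT8 weight

print(f"FP32 estimate: {fp32_mb:.1f} MB")
print(f"INT8 estimate: {int8_mb:.1f} MB")
```

The FP32 estimate lands close to the 291 MB listed for model.safetensors. The 98 MB android_slim/ total is above the pure INT8 estimate, which would be consistent with tokenizer files, graph overhead, or tensors kept at higher precision; that explanation is an inference, not something the source states.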

Technical Specifications

  • Custom Vocabulary:
    Integration of 500+ domain-specific Tamil tokens to prevent common transcription hallucinations and repetition loops found in base Whisper models.

  • Bilingual Capability:
    Native support for code-switching between Tamil and English within a single audio stream.

  • Latency Profile:
    Optimized for short-form audio queries typical of search interfaces.

  • Privacy:
    100% local execution — no audio data or transcriptions leave the device.
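
The source does not describe how the custom Tamil tokens were added. One common way to extend a Whisper vocabulary with Hugging Face Transformers is to call the tokenizer's add_tokens method and then resize the model's embedding matrix with resize_token_embeddings. The sketch below wraps those two calls. The function name extend_vocabulary, the base checkpoint, and the sample tokens are all hypothetical.

```python
from typing import Iterable


def extend_vocabulary(tokenizer, model, new_tokens: Iterable[str]) -> int:
    """Add tokens to a tokenizer and resize the model embeddings to match.

    `tokenizer` and `model` are expected to behave like a Hugging Face
    tokenizer and PreTrainedModel. Returns the number of tokens that the
    tokenizer reports as newly added.
    """
    num_added = tokenizer.add_tokens(list(new_tokens))
    if num_added > 0:
        # The new embedding rows start untrained and need fine-tuning.
        model.resize_token_embeddings(len(tokenizer))
    return num_added


# Usage sketch (not executed here; checkpoint and tokens are assumptions):
# from transformers import WhisperForConditionalGeneration, WhisperTokenizer
# BASE = "openai/whisper-base"  # assumed base checkpoint, not stated in the source
# tokenizer = WhisperTokenizer.from_pretrained(BASE)
# model = WhisperForConditionalGeneration.from_pretrained(BASE)
# extend_vocabulary(tokenizer, model, ["வங்கி", "கணக்கு"])
```

Resizing only allocates the new embedding rows; they carry no meaning until training updates them, which is presumably the role of the Tamil Foundation stage described in the training procedure.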


Performance Validation

Language   Word Error Rate (WER)   Technical Note
English    0.05 (5%)               High fidelity for command-and-control and search intent
Tamil      0.33 (33%)              Strong phonetic preservation; V11 reduces INT8 quantization noise (P/V phoneme flip)

Note:
Benchmarks were evaluated on 200 bilingual search query samples with num_beams = 5.
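
For readers who want to reproduce the metric, WER is the word-level edit distance (substitutions, deletions, insertions) between reference and hypothesis, divided by the number of reference words. The following is a minimal pure-Python sketch. It is not the evaluation script behind the figures above, and it applies no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("check my account balance", "check my account valance"))  # prints 0.25
```

To score a test set, corpus-level WER is usually computed by summing edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance rates.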


Training Procedure

The V11 engine was developed through a specialized 8-stage fine-tuning protocol.

This phased approach used a strategic combination of datasets and a non-linear learning rate schedule to achieve stable convergence across both languages.

8-Stage Protocol

  1. Initial Domain Alignment
    Baseline adaptation using LibriSpeech to solidify English phonetic anchors.

  2. Tamil Foundation
    Primary weight initialization for the custom 500-token Tamil vocabulary using Kathbath.

  3. Bilingual Bridge
    High learning rate training on Indic Voice to align cross-lingual phonetic representations.

  4. Iterative Refinement (Stages 4–7)
    Progressive learning rate decay cycles (1e-4 → 1e-6) on mixed-dataset batches to reduce WER and prevent catastrophic forgetting.

  5. Acoustic Hardening (Stage 8)
    Final ultra-low learning rate stage to minimize the P/V phoneme flip caused by INT8 quantization artifacts.
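
The source gives only the endpoints of the decay (1e-4 → 1e-6) and calls the schedule non-linear; the per-stage values are not stated. As one illustration, the sketch below spaces the learning rates of stages 4 to 7 geometrically between those endpoints. Both the geometric spacing and the function name stage_learning_rates are assumptions, not the author's schedule.

```python
def stage_learning_rates(start_lr: float, end_lr: float, num_stages: int) -> list[float]:
    """Geometrically interpolate learning rates from start_lr to end_lr."""
    if num_stages < 2:
        raise ValueError("need at least two stages to interpolate")
    ratio = (end_lr / start_lr) ** (1 / (num_stages - 1))
    return [start_lr * ratio**k for k in range(num_stages)]


# Hypothetical per-stage rates for stages 4-7 (assumed, not from the source).
for stage, lr in zip(range(4, 8), stage_learning_rates(1e-4, 1e-6, 4)):
    print(f"stage {stage}: lr = {lr:.2e}")
```

Geometric spacing lowers the rate by the same factor at every stage, which is one simple way to get a non-linear decay; cosine or stepwise schedules would fit the description equally well.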


Training Hyperparameters

  • Optimizer: AdamW (fused)
  • Betas: 0.9, 0.999
  • Epsilon: 1e-8
  • Learning Rate: 1e-4 → 1e-6 (phased decay)
  • Global Batch Size: 16
  • Per-Device Batch Size: 4
  • Gradient Accumulation Steps: 4
  • Mixed Precision: AMP
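
The three batch figures are related by the usual gradient accumulation identity: the global batch size equals the per-device batch size times the accumulation steps times the number of devices. With the values listed, the identity holds for a single device. The device count is inferred from the arithmetic and is not stated in the source.

```python
def global_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of samples contributing to each optimizer step."""
    return per_device * accumulation_steps * num_devices


# Values from the hyperparameter list; num_devices=1 is an inference.
print(global_batch_size(per_device=4, accumulation_steps=4, num_devices=1))  # prints 16
```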

Datasets Used

  • Kathbath — Native Tamil speech
  • Indic Voice — Bilingual / code-switch audio
  • LibriSpeech — English phonetic base

Implementation Guide

Python Inference (Research)

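import librosa  # assumed audio loader; any 16 kHz mono float array works
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import WhisperProcessor

model_path = "Badri0510/thedal-tamil-full-v11"

# Load the INT8 ONNX export from the android_slim/ subfolder.
# use_cache=False matches the single-decoder (non-cache) export.
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_path,
    subfolder="android_slim",
    use_cache=False
)

processor = WhisperProcessor.from_pretrained(
    model_path,
    subfolder="android_slim"
)

# "query.wav" is a placeholder path; Whisper expects 16 kHz mono audio.
audio, _ = librosa.load("query.wav", sr=16000)
input_features = processor(
    audio, sampling_rate=16000, return_tensors="pt"
).input_features

# Beam search with the same num_beams used for the benchmarks.
predicted_ids = model.generate(
    input_features,
    num_beams=5,
    use_cache=False
)

# Convert token IDs back to text.
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)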

Android Deployment

Include the contents of android_slim/ in your app's assets directory.

Required files:

  • model.onnx
  • tokenizer.json
  • config.json

The tokenizer is required to map the custom 500-token Tamil vocabulary.
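
Before packaging, it can help to confirm that the three required files are present in the folder you copy into the app. The sketch below is a small Python pre-build check. The function name and the idea of running it as a build step are assumptions; the file names come from the list above.

```python
from pathlib import Path

REQUIRED_FILES = ("model.onnx", "tokenizer.json", "config.json")


def missing_asset_files(asset_dir: str) -> list[str]:
    """Return the required file names that are absent from asset_dir."""
    root = Path(asset_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # "android_slim" is the folder name used in this repository.
    missing = missing_asset_files("android_slim")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required files are present.")
```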


Environment

  • Transformers 5.3.0
  • PyTorch 2.1.0 + cu128
  • Optimum 1.24.0
  • Tokenizers 0.22.2

Project

Developed under the Thedal Edge-AI Initiative.
Privacy-first on-device Tamil speech intelligence.

Model size: 72.8M params (Safetensors, tensor type F32)