Vox Hinglish RNN: Conversational Phonetic Transliteration Model
This repository contains a production-ready, ultra-lightweight Seq2Seq RNN (BiLSTM + Attention) model designed for Vox to dynamically transliterate Hindi (Devanagari) conversational transcripts into SMS-style, natural casual Hinglish (Latin).
The model is exported to ONNX Runtime for high-efficiency, sub-3ms inference on standard CPUs, making it ideal for real-time edge integration.
📂 Repository Structure
The repository follows a clean, segregated, industry-grade layout:
.
├── corpus/
│ ├── cleaned_dataset/ # Tokenized Aksharantar split mapped Arrow files
│ └── unique_word_pairs.json # Curated 1,129 texting spelling corrections lexicon
├── models/
│ ├── encoder.onnx # ONNX graph of Bidirectional LSTM Encoder
│ ├── decoder.onnx # ONNX graph of Attention Decoder
│ ├── encoder.pt # PyTorch state dictionary weights of Encoder
│ ├── decoder.pt # PyTorch state dictionary weights of Decoder
│ ├── input_vocab.json # Source Devanagari character index map
│ └── target_vocab.json # Target Latin character index map
├── scripts/
│ ├── prepare_dataset.py # Dataset cleansing, direct override & oversampling
│ ├── train_rnn.py # Seq2Seq training loop and PyTorch-to-ONNX exporter
│ └── merge_vocabs.py # Script utilized to merge vocabularies
└── testing/
├── eval_transcript.json # Reference spoken-language audio transcripts
├── test_inference_onnx.py # Latency benchmark and phrase transliteration test
└── test_transcripts.py # Standard batch evaluation script over transcript JSON
⚡ Quick Start & Testing
Both testing scripts use dynamic, self-contained relative paths. You can execute them immediately out-of-the-box after cloning:
1. Benchmark Phrase Inference
To benchmark CPU latencies and inspect standard conversational sentence transliterations:
python testing/test_inference_onnx.py
2. Batch Transcript Evaluation
To evaluate transliteration output quality and character accuracy over full spoken conversational transcripts:
python testing/test_transcripts.py
🧠 Model Architecture & Details
- Encoder: 2-Layer Bidirectional LSTM (Embedding Dim: 128, Hidden Dim: 256).
- Decoder: 2-Layer LSTM with Bahdanau attention over encoder outputs.
- Parameters: ~4.1 Million weights.
- Target Latency: 2.9 ms per sentence (on single-threaded CPU cores).
- Memory Footprint:
< 18 MBoverall.
🛠️ Training & Cleansing Pipeline
- Vocabulary Standardization: Built using character-level tokenizers.
- Dataset Cleansing: Parsed 1.2 million rows of Hugging Face's Aksharantar corpus, with target replacements dynamically forced from a vetted custom Hinglish slang lexicon (
unique_word_pairs.json) to enforce conversational spelling standards (e.g.,"achha"instead of"achchha","raha"instead of"rahaa"). - GPU Training: Optimized on NVIDIA RTX 5070 Ti with a massive batch size of 2048 using Cross-Entropy loss over 5 epochs.