Vox Hinglish RNN: Conversational Phonetic Transliteration Model

This repository contains a production-ready, lightweight Seq2Seq RNN (BiLSTM encoder + attention decoder) built for Vox. It transliterates Hindi (Devanagari) conversational transcripts into casual, SMS-style Hinglish (Latin script).

The model is exported to ONNX and served with ONNX Runtime, targeting sub-3 ms inference on standard CPUs, which makes it suitable for real-time edge integration.

📂 Repository Structure

The repository separates data, model artifacts, scripts, and tests:

.
├── corpus/
│   ├── cleaned_dataset/         # Tokenized Aksharantar splits stored as Arrow files
│   └── unique_word_pairs.json   # Curated lexicon of 1,129 texting-style spelling corrections
├── models/
│   ├── encoder.onnx             # ONNX graph of Bidirectional LSTM Encoder
│   ├── decoder.onnx             # ONNX graph of Attention Decoder
│   ├── encoder.pt               # PyTorch state dictionary weights of Encoder
│   ├── decoder.pt               # PyTorch state dictionary weights of Decoder
│   ├── input_vocab.json         # Source Devanagari character index map
│   └── target_vocab.json        # Target Latin character index map
├── scripts/
│   ├── prepare_dataset.py       # Dataset cleaning, lexicon overrides, and oversampling
│   ├── train_rnn.py             # Seq2Seq training loop and PyTorch-to-ONNX exporter
│   └── merge_vocabs.py          # Merges the vocabularies
└── testing/
    ├── eval_transcript.json     # Reference spoken-language audio transcripts
    ├── test_inference_onnx.py   # Latency benchmark and phrase transliteration test
    └── test_transcripts.py      # Standard batch evaluation script over transcript JSON
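
The two vocab files are described above as character index maps. As a minimal sketch, assuming each JSON file is a flat object mapping one character to one integer, a word could be encoded as shown below. The helper name `encode` and the special tokens `<sos>`, `<eos>`, and `<unk>` are assumptions made for illustration, not names confirmed by the repository.

```python
# Hypothetical helper: assumes the vocab maps each character to an integer
# index and contains the special tokens named below (invented for this sketch).
def encode(word, vocab, unk="<unk>", sos="<sos>", eos="<eos>"):
    """Turn a word into a list of indices, one per Unicode code point."""
    ids = [vocab[sos]]
    ids += [vocab.get(ch, vocab[unk]) for ch in word]
    ids.append(vocab[eos])
    return ids


# Tiny stand-in vocab. The real one would presumably be loaded with
# json.load(open("models/input_vocab.json", encoding="utf-8")).
toy_vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3, "र": 4, "ह": 5, "ा": 6}

print(encode("रहा", toy_vocab))
```

Note that Devanagari vowel signs such as "ा" are separate code points, so a character-level tokenizer sees them as their own symbols.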

⚡ Quick Start & Testing

Both testing scripts use self-contained relative paths, so you can run them from the repository root right after cloning:

1. Benchmark Phrase Inference

To benchmark CPU latency and inspect transliterations of sample conversational sentences:

python testing/test_inference_onnx.py
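
For readers who want to reproduce a latency number on their own hardware, the following is a rough sketch of how a per-sentence benchmark is commonly structured: warm up first, then time each call and report the median and 95th percentile. It is not the code in test_inference_onnx.py. The function `benchmark`, its `warmup` and `runs` parameters, and the stand-in `fake_transliterate` are all hypothetical.

```python
import statistics
import time


def benchmark(fn, inputs, warmup=10, runs=100):
    """Return (median_ms, p95_ms) over per-call wall-clock timings of fn."""
    # Warm-up calls are discarded so one-time setup cost does not skew results.
    for _ in range(warmup):
        for x in inputs:
            fn(x)

    samples = []
    for _ in range(runs):
        for x in inputs:
            start = time.perf_counter()
            fn(x)
            samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return statistics.median(samples), p95


# Stand-in for the real ONNX transliteration call, which is not shown here.
def fake_transliterate(sentence):
    return sentence[::-1]


median_ms, p95_ms = benchmark(fake_transliterate, ["नमस्ते", "क्या हाल है"])
print(f"median {median_ms:.4f} ms, p95 {p95_ms:.4f} ms")
```

Reporting a percentile alongside the median helps show whether occasional slow calls would break a real-time budget.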

2. Batch Transcript Evaluation

To evaluate transliteration output quality and character accuracy over full spoken conversational transcripts:

python testing/test_transcripts.py
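
The exact definition of character accuracy used by test_transcripts.py is not stated here. One common choice, shown below purely as an assumption, is one minus the character error rate: the Levenshtein distance between the model output and the reference, divided by the reference length. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def char_accuracy(hypothesis, reference):
    """Assumed metric: 1 - CER, clamped to the range [0, 1]."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - edit_distance(hypothesis, reference) / len(reference))


print(char_accuracy("achchha", "achha"))
print(char_accuracy("raha", "raha"))
```

Under this definition a formal spelling such as "achchha" is penalized against the casual reference "achha", which matches the spelling conventions the model is meant to learn.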

🧠 Model Architecture & Details

Encoder: 2-Layer Bidirectional LSTM (Embedding Dim: 128, Hidden Dim: 256).
Decoder: 2-Layer LSTM with Bahdanau attention over encoder outputs.
Parameters: ~4.1 million.
Target Latency: 2.9 ms per sentence (single CPU thread).
Memory Footprint: < 18 MB total.
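
To illustrate the attention step named above, here is a dependency-free sketch of Bahdanau (additive) attention: each encoder output is scored against the current decoder state with v · tanh(W_q q + W_k k), the scores are normalized with a softmax, and the context vector is the weighted sum of encoder outputs. The toy dimensions, weight values, and the function name `additive_attention` are illustrative assumptions. The actual model would use learned tensors in PyTorch.

```python
import math


def additive_attention(query, keys, W_q, W_k, v):
    """Bahdanau-style attention over a list of encoder outputs.

    query: decoder state (length d_q); keys: encoder outputs (each length d_k)
    W_q: d_a x d_q matrix; W_k: d_a x d_k matrix; v: vector of length d_a
    Returns (weights, context).
    """
    def matvec(matrix, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

    q_proj = matvec(W_q, query)
    scores = []
    for k in keys:
        k_proj = matvec(W_k, k)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))

    # Softmax with max subtraction for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the encoder outputs.
    dim = len(keys[0])
    context = [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
    return weights, context


# Toy values, chosen only to make the sketch runnable.
query = [0.5, -0.2]
keys = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [0.3, 0.3, 0.3]]
W_q = [[0.1, 0.2], [0.3, -0.1]]
W_k = [[0.2, -0.1, 0.0], [0.0, 0.4, 0.1]]
v = [1.0, -1.0]

weights, context = additive_attention(query, keys, W_q, W_k, v)
print(weights, context)
```

With a bidirectional encoder of hidden size 256, each key would have 512 dimensions rather than the 3 used here.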

🛠️ Training & Cleansing Pipeline

Vocabulary Standardization: Source and target vocabularies are built with character-level tokenizers.
Dataset Cleansing: Parsed 1.2 million rows of the Aksharantar corpus (from Hugging Face). Target spellings are overridden from a vetted custom Hinglish slang lexicon (unique_word_pairs.json) to enforce conversational spelling standards (e.g., "achha" instead of "achchha", "raha" instead of "rahaa").
GPU Training: Trained on an NVIDIA RTX 5070 Ti for 5 epochs with a batch size of 2048 and cross-entropy loss.
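
The override and oversampling steps described above could look roughly like the sketch below. It assumes, without confirmation from the repository, that the lexicon maps a Devanagari word to its preferred Hinglish spelling and that training rows are (source, target) pairs. The function `apply_lexicon` and the `oversample` factor are hypothetical. The actual logic lives in scripts/prepare_dataset.py.

```python
def apply_lexicon(pairs, lexicon, oversample=1):
    """Override corpus targets with lexicon spellings, then oversample the lexicon.

    pairs: iterable of (devanagari_word, latin_spelling) rows (assumed layout)
    lexicon: dict of devanagari_word -> preferred latin_spelling (assumed layout)
    oversample: times each lexicon entry is appended (hypothetical parameter)
    """
    # Force the preferred spelling wherever the source word is in the lexicon.
    cleaned = [(src, lexicon.get(src, tgt)) for src, tgt in pairs]
    # Append repeated lexicon entries so preferred spellings are seen more often.
    boosted = [(src, tgt) for src, tgt in lexicon.items() for _ in range(oversample)]
    return cleaned + boosted


# Illustrative entries based on the spelling examples given above.
lexicon = {"अच्छा": "achha", "रहा": "raha"}
rows = [("अच्छा", "achchha"), ("रहा", "rahaa"), ("नाम", "naam")]

result = apply_lexicon(rows, lexicon, oversample=2)
for src, tgt in result:
    print(src, tgt)
```

Rows whose source word is not in the lexicon, such as the third one here, keep their original corpus spelling.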
