# Nano Translate v2: LSTM + Bahdanau Attention for English→Hindi Translation
This model is a custom LSTM-based sequence-to-sequence neural machine translation system for English-to-Hindi translation. It is implemented in PyTorch, wrapped with the Hugging Face `PreTrainedModel` API, and uploaded for inference with `trust_remote_code=True`.
Compared with the earlier baseline, this version improves the architecture substantially by adding:
- a bidirectional LSTM encoder
- Bahdanau attention
- encoder-to-decoder bridge layers
- input feeding in the decoder
- full compatibility with Hugging Face `generate()` and beam search
The model is still a recurrent seq2seq system, but it is stronger than the earlier no-attention baseline.
## Model Architecture
This model is a custom LSTM + attention encoder-decoder architecture for English→Hindi machine translation.
### Encoder
The encoder consists of:
- token embedding layer
- embedding dropout
- 3-layer bidirectional LSTM
#### Encoder settings
- Embedding dimension: 500
- Hidden dimension: 500
- Number of layers: 3
- Dropout: 0.15
- Directionality: bidirectional
The encoder processes tokenized English input and returns:
- token-level encoder outputs
- final hidden states
- final cell states
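The encoder described above can be sketched in PyTorch roughly as follows. Class and variable names are illustrative, not the model's actual remote code; the demo shrinks the vocabulary to keep it lightweight.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch: embedding -> dropout -> 3-layer bidirectional LSTM,
    with the hyperparameters listed in the card."""
    def __init__(self, vocab_size, emb_dim=500, hid_dim=500,
                 num_layers=3, dropout=0.15):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.emb_dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=num_layers,
                            dropout=dropout, bidirectional=True,
                            batch_first=True)

    def forward(self, input_ids):
        x = self.emb_dropout(self.embedding(input_ids))
        # outputs: (batch, src_len, 2*hid_dim)
        # h, c: (2*num_layers, batch, hid_dim) -- forward and backward states
        outputs, (h, c) = self.lstm(x)
        return outputs, h, c

enc = Encoder(vocab_size=1000)         # small demo vocab; the card uses 75000
ids = torch.randint(0, 1000, (2, 7))   # batch of 2 sequences, length 7
outputs, h, c = enc(ids)
```

Note that the bidirectional LSTM doubles the output feature size (2 × 500 = 1000), which is why the decoder side needs bridge layers and attention projections over 1000-dim encoder outputs.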
### Attention
The decoder uses Bahdanau attention over the encoder outputs.
Attention components:
- linear projection of decoder hidden state
- linear projection of encoder outputs
- additive attention scoring
- softmax attention weights
- context vector computed as weighted sum of encoder outputs
This allows the decoder to attend dynamically to different source positions during generation.
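A minimal sketch of the additive scoring described above, assuming a 500-dim decoder state attending over 1000-dim bidirectional encoder outputs. The attention dimension and all names are illustrative assumptions, not the model's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BahdanauAttention(nn.Module):
    """Additive attention: project decoder state and encoder outputs,
    score with tanh + a learned vector, softmax, then weighted sum."""
    def __init__(self, dec_dim=500, enc_dim=1000, attn_dim=500):
        super().__init__()
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_hidden, enc_outputs):
        # dec_hidden: (batch, dec_dim); enc_outputs: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(
            self.W_dec(dec_hidden).unsqueeze(1) + self.W_enc(enc_outputs)
        )).squeeze(-1)                       # (batch, src_len)
        weights = F.softmax(scores, dim=-1)  # attention distribution
        # context vector: weighted sum of encoder outputs
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context, weights

attn = BahdanauAttention()
ctx, w = attn(torch.randn(2, 500), torch.randn(2, 7, 1000))
```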
### Decoder
The decoder consists of:
- token embedding layer
- embedding dropout
- Bahdanau attention
- 3-layer LSTM
- input feeding mechanism
- pre-output projection layer
- output projection to vocabulary logits
At each decoding step, the decoder consumes:
- current token embedding
- previous attention context
- previous attentional output
This makes the decoder richer than a plain LSTM decoder.
#### Decoder settings
- Embedding dimension: 500
- Hidden dimension: 500
- Number of layers: 3
- Dropout: 0.15
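One input-feeding decoder step can be sketched as follows, under the card's dimensions (500-dim embeddings and hidden state, 1000-dim bidirectional encoder outputs). All names are illustrative, and the vocabulary is shrunk to 1000 for the demo.

```python
import torch
import torch.nn as nn

emb_dim, hid, enc_dim, vocab = 500, 500, 1000, 1000
lstm = nn.LSTM(emb_dim + hid, hid, num_layers=3, batch_first=True)
pre_output = nn.Linear(hid + enc_dim, hid)   # pre-output projection layer
out_proj = nn.Linear(hid, vocab)             # projection to vocabulary logits

batch = 2
tok_emb = torch.randn(batch, emb_dim)        # current token embedding
prev_attn_out = torch.randn(batch, hid)      # previous attentional output
context = torch.randn(batch, enc_dim)        # attention context vector

# input feeding: concatenate the token embedding with the previous
# attentional output before feeding the LSTM
rnn_in = torch.cat([tok_emb, prev_attn_out], dim=-1).unsqueeze(1)
rnn_out, state = lstm(rnn_in)

# pre-output layer mixes the LSTM output with the attention context
attn_out = torch.tanh(pre_output(
    torch.cat([rnn_out.squeeze(1), context], dim=-1)))
logits = out_proj(attn_out)                  # (batch, vocab)
```

The `attn_out` tensor is what gets fed back in as `prev_attn_out` at the next step, which is the "input feeding" mechanism the card refers to.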
### Bridge Layers
Because the encoder is bidirectional and the decoder is unidirectional, the model uses learned bridge layers:
- hidden bridge: maps encoder final hidden states to decoder hidden states
- cell bridge: maps encoder final cell states to decoder cell states
These bridges combine forward and backward encoder states before initializing the decoder.
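A sketch of the hidden bridge, assuming PyTorch's layout for bidirectional final states (forward and backward interleaved per layer); the cell bridge would work identically. Names are illustrative.

```python
import torch
import torch.nn as nn

num_layers, batch, hid = 3, 2, 500
h_enc = torch.randn(num_layers * 2, batch, hid)  # bidirectional final hidden

bridge_h = nn.Linear(2 * hid, hid)               # learned hidden bridge

# regroup to (num_layers, directions, batch, hid), then concatenate the
# forward and backward states of each layer on the feature dimension
h = h_enc.view(num_layers, 2, batch, hid)
h_cat = torch.cat([h[:, 0], h[:, 1]], dim=-1)    # (num_layers, batch, 2*hid)

# project down to the unidirectional decoder's hidden size
h_dec = torch.tanh(bridge_h(h_cat))              # (num_layers, batch, hid)
```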
## Vocabulary / Tokenization

Tokenizer used: `kd13/nano-translate-v2`

Vocabulary:
- shared tokenizer
- vocabulary size: 75000

Special tokens: uses the tokenizer-defined `pad_token_id`, `bos_token_id`, and `eos_token_id`.
## Generation Configuration

Default generation setup:

- `num_beams = 10`
- `max_length = 42`
- `length_penalty = 1.0`
- `early_stopping = true`
- `do_sample = false`
## Dataset

The model was trained on:

- Dataset: `cfilt/iitb-english-hindi`

Splits used:
- train
- validation
- test

Source language: English (`en`)

Target language: Hindi (`hi`)
## Training Configuration
Framework:
- PyTorch
- Hugging Face Transformers
- Hugging Face Accelerate
Training setup:
- mixed precision: fp16
- gradient accumulation steps: 4
- batch size per dataloader step: 48
- epochs: 11
- optimizer: AdamW
- learning rate: 0.0008
- weight decay: 0.0002
- betas: (0.9, 0.999)
- scheduler: cosine decay with warmup
- warmup ratio: 8%
- gradient clipping: max_grad_norm = 1.0
- loss function: cross-entropy
- ignore index: -100
- label smoothing: 0.0
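The loss setup above corresponds to standard token-level cross-entropy where padded label positions are marked with `-100` and skipped. A small sketch (shapes and vocabulary size are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 5, 1000)      # (batch, tgt_len, vocab) from the decoder
labels = torch.randint(0, 1000, (2, 5))
labels[:, -1] = -100                  # e.g. padding positions are masked out

# ignore_index=-100 excludes masked positions from the loss;
# label_smoothing=0.0 matches the card's training setup
loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)), # (batch*tgt_len, vocab)
    labels.view(-1),
    ignore_index=-100,
    label_smoothing=0.0,
)
```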
## Training Results
The model trains successfully and supports:
- full end-to-end Hugging Face loading
- custom remote-code inference
- beam-search generation
- attention-based decoding
On manual qualitative testing, the model produces:
### Strengths
- grammatically reasonable Hindi
- correct basic sentence structure
- good simple-sentence fluency
- usable everyday sentence translations
Examples of good behavior:
- “What is your name?” → “आपका नाम क्या है ?”
- “Where is the station?” → “स्टेशन कहाँ है ?”
- “My father bought a new car yesterday” → “मेरे पिता ने कल एक नई कार खरीदी”
### Weaknesses
The model often translates technical or multi-word concepts too literally.
Examples:
- “machine learning” → literal-style translation instead of मशीन लर्निंग
- “deep learning” → literal-style translation instead of डीप लर्निंग
- “artificial intelligence” → weaker rendering than standard कृत्रिम बुद्धिमत्ता
The model also shows weaker handling of:
- negation
- semantic polarity
- role-sensitive meaning
- technical terminology
- some long and complex clauses
## Evaluation Findings
Manual experiments suggest the model is:
- linguistically decent
- semantically shallow on technical phrases
- better at surface fluency than deep relational meaning
Observed behavior from diagnostic tests:
- strong grammatical and lexical pattern learning
- weaker sensitivity to:
  - negation (*like* vs. *don't like*)
  - antonymy (*happy* vs. *unhappy*)
  - subject-object reversal
  - fixed technical expressions
This suggests the encoder representations capture:
- lexical overlap
- sentence template similarity
- broad topical similarity
more strongly than:
- precise compositional semantics
- polarity
- logical meaning changes
## Limitations
This model has several important limitations:
**1. Still a recurrent architecture**
Although attention improves performance, this is still an LSTM-based seq2seq model, which is generally weaker than modern Transformer architectures on large-scale translation tasks.
**2. Weak technical terminology handling**
The model tends to over-translate domain terms compositionally rather than preserving them as established concepts.
Examples:
- machine learning
- deep learning
- artificial intelligence
**3. Weak negation sensitivity**
Diagnostic similarity experiments suggest the model does not strongly separate:
- positive vs negative forms
- antonyms
- role-reversed sentence meanings
**4. Limited long-sentence robustness**
Even with attention, longer and more information-dense sentences can still degrade translation quality.
**5. Inconsistent register and style**
The model sometimes mixes:
- formal Hindi
- informal Hindi
- literal phrasing
## BLEU Evaluation

- BLEU score: 14.36
## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("kd13/nano-translate-v2", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("kd13/nano-translate-v2", trust_remote_code=True)

text = "What is your name?"
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    num_beams=10,
    max_length=42,
    length_penalty=1.0,
    early_stopping=True,
)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```