Nano Translate v2: LSTM + Bahdanau Attention for English→Hindi Translation

This model is a custom LSTM-based sequence-to-sequence neural machine translation system for English-to-Hindi translation. It is implemented in PyTorch, wrapped in the Hugging Face PreTrainedModel API, and published for remote-code inference (loading it requires trust_remote_code=True).

Compared with the earlier baseline, this version improves the architecture substantially by adding:

  • a bidirectional LSTM encoder
  • Bahdanau attention
  • encoder-to-decoder bridge layers
  • input feeding in the decoder
  • full compatibility with Hugging Face generate() and beam search

The model is still a recurrent seq2seq system, but it is stronger than the earlier no-attention baseline.


Model Architecture

This model is a custom LSTM + attention encoder-decoder architecture for English→Hindi machine translation.

Encoder

The encoder consists of:

  • token embedding layer
  • embedding dropout
  • 3-layer bidirectional LSTM

Encoder settings

  • Embedding dimension: 500
  • Hidden dimension: 500
  • Number of layers: 3
  • Dropout: 0.15
  • Directionality: bidirectional

The encoder processes tokenized English input and returns:

  • token-level encoder outputs
  • final hidden states
  • final cell states
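
The encoder described above can be sketched in PyTorch roughly as follows. This is a minimal illustration under the listed settings; the class and argument names are assumptions, not the exact modules used in kd13/nano-translate-v2:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=75000, emb_dim=500, hid_dim=500,
                 num_layers=3, dropout=0.15, pad_id=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_id)
        self.dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True,
                            dropout=dropout)

    def forward(self, input_ids):
        embedded = self.dropout(self.embedding(input_ids))
        # outputs: (batch, src_len, 2 * hid_dim) token-level encoder outputs
        # hidden/cell: (2 * num_layers, batch, hid_dim) final states
        outputs, (hidden, cell) = self.lstm(embedded)
        return outputs, hidden, cell
```

Because the LSTM is bidirectional, the per-token outputs have twice the hidden dimension (forward and backward states concatenated), which is why the decoder-side attention and bridge layers below project from 2 × 500 = 1000 dimensions.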

Attention

The decoder uses Bahdanau attention over the encoder outputs.

Attention components:

  • linear projection of decoder hidden state
  • linear projection of encoder outputs
  • additive attention scoring
  • softmax attention weights
  • context vector computed as weighted sum of encoder outputs

This allows the decoder to attend dynamically to different source positions during generation.
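
The components listed above correspond to standard additive (Bahdanau) attention, which can be sketched as follows. Dimension defaults follow the model settings; names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    def __init__(self, dec_dim=500, enc_dim=1000, attn_dim=500):
        super().__init__()
        self.query_proj = nn.Linear(dec_dim, attn_dim, bias=False)  # decoder hidden state
        self.key_proj = nn.Linear(enc_dim, attn_dim, bias=False)    # encoder outputs
        self.energy = nn.Linear(attn_dim, 1, bias=False)            # additive scoring

    def forward(self, dec_hidden, enc_outputs, mask=None):
        # dec_hidden: (batch, dec_dim); enc_outputs: (batch, src_len, enc_dim)
        scores = self.energy(torch.tanh(
            self.query_proj(dec_hidden).unsqueeze(1) + self.key_proj(enc_outputs)
        )).squeeze(-1)                                  # (batch, src_len)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)         # attention weights
        # context vector: weighted sum of encoder outputs
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context, weights
```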


Decoder

The decoder consists of:

  • token embedding layer
  • embedding dropout
  • Bahdanau attention
  • 3-layer LSTM
  • input feeding mechanism
  • pre-output projection layer
  • output projection to vocabulary logits

At each decoding step, the decoder consumes:

  • current token embedding
  • previous attention context
  • previous attentional output

This makes the decoder richer than a plain LSTM decoder.

Decoder settings

  • Embedding dimension: 500
  • Hidden dimension: 500
  • Number of layers: 3
  • Dropout: 0.15
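
A single decoding step with input feeding can be sketched as below. The attention context is passed in as an argument (computed as described in the Attention section); class names, the exact concatenation order, and the tanh pre-output layer are assumptions for illustration:

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, vocab_size=75000, emb_dim=500, hid_dim=500,
                 enc_dim=1000, num_layers=3, dropout=0.15):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.dropout = nn.Dropout(dropout)
        # input feeding: the LSTM consumes [token embedding ; previous attentional output]
        self.lstm = nn.LSTM(emb_dim + hid_dim, hid_dim, num_layers=num_layers,
                            batch_first=True, dropout=dropout)
        # pre-output layer combines the LSTM output with the attention context
        self.pre_output = nn.Linear(hid_dim + enc_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, token, prev_attn_out, context, state):
        # token: (batch, 1); prev_attn_out: (batch, hid); context: (batch, enc_dim)
        emb = self.dropout(self.embedding(token))                  # (batch, 1, emb)
        lstm_in = torch.cat([emb, prev_attn_out.unsqueeze(1)], dim=-1)
        lstm_out, state = self.lstm(lstm_in, state)                # (batch, 1, hid)
        attn_out = torch.tanh(self.pre_output(
            torch.cat([lstm_out.squeeze(1), context], dim=-1)))   # attentional output
        logits = self.out(attn_out)                                # vocabulary logits
        return logits, attn_out, state
```

The attentional output attn_out is carried over to the next step, which is what makes this "input feeding" rather than a plain LSTM decoder.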

Bridge Layers

Because the encoder is bidirectional and the decoder is unidirectional, the model uses learned bridge layers:

  • hidden bridge: maps encoder final hidden states to decoder hidden states
  • cell bridge: maps encoder final cell states to decoder cell states

These bridges combine forward and backward encoder states before initializing the decoder.
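
A sketch of such a bridge: the forward and backward final states are concatenated per layer and projected down to the decoder's size. The names and the tanh nonlinearity are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Bridge(nn.Module):
    def __init__(self, hid_dim=500):
        super().__init__()
        self.hidden_bridge = nn.Linear(2 * hid_dim, hid_dim)
        self.cell_bridge = nn.Linear(2 * hid_dim, hid_dim)

    @staticmethod
    def _combine(state, proj):
        # state: (num_layers * 2, batch, hid) from a bidirectional LSTM,
        # ordered [layer0_fwd, layer0_bwd, layer1_fwd, ...]
        layers, batch, hid = state.size(0) // 2, state.size(1), state.size(2)
        state = state.view(layers, 2, batch, hid)
        merged = torch.cat([state[:, 0], state[:, 1]], dim=-1)  # (layers, batch, 2*hid)
        return torch.tanh(proj(merged))

    def forward(self, hidden, cell):
        # returns initial (hidden, cell) states for the unidirectional decoder
        return (self._combine(hidden, self.hidden_bridge),
                self._combine(cell, self.cell_bridge))
```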


Vocabulary / Tokenization

Tokenizer used:

  • kd13/nano-translate-v2

Vocabulary:

  • shared tokenizer
  • vocabulary size: 75000

Special tokens:

  • uses tokenizer-defined:
    • pad_token_id
    • bos_token_id
    • eos_token_id

Generation Configuration

Default generation setup:

  • num_beams = 10
  • max_length = 42
  • length_penalty = 1.0
  • early_stopping = true
  • do_sample = false
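
For illustration, these defaults can be expressed as a transformers GenerationConfig (a sketch only; the repository may ship its own generation_config.json with the same values):

```python
from transformers import GenerationConfig

gen_config = GenerationConfig(
    num_beams=10,        # beam search width
    max_length=42,       # maximum output length in tokens
    length_penalty=1.0,  # neutral length penalty
    early_stopping=True, # stop beams once finished hypotheses dominate
    do_sample=False,     # deterministic beam search, no sampling
)
```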

Dataset

The model was trained on:

  • Dataset: cfilt/iitb-english-hindi

Splits used:

  • train
  • validation
  • test

Source language:

  • English (en)

Target language:

  • Hindi (hi)

Training Configuration

Framework:

  • PyTorch
  • Hugging Face Transformers
  • Hugging Face Accelerate

Training setup:

  • mixed precision: fp16
  • gradient accumulation steps: 4
  • batch size per dataloader step: 48
  • epochs: 11
  • optimizer: AdamW
  • learning rate: 0.0008
  • weight decay: 0.0002
  • betas: (0.9, 0.999)
  • scheduler: cosine decay with warmup
  • warmup ratio: 8%
  • gradient clipping: max_grad_norm = 1.0
  • loss function: cross-entropy
  • ignore index: -100
  • label smoothing: 0.0
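
The optimization setup above can be sketched as a minimal loop. The tiny linear model and random data are stand-ins for the real seq2seq model and dataloader, and the hand-rolled cosine-with-warmup schedule approximates what Hugging Face scheduler utilities provide; everything else follows the listed hyperparameters:

```python
import math
import torch
from torch.optim import AdamW

torch.manual_seed(0)

model = torch.nn.Linear(16, 10)  # stand-in for the seq2seq model
optimizer = AdamW(model.parameters(), lr=8e-4,
                  weight_decay=2e-4, betas=(0.9, 0.999))

total_steps, warmup = 100, 8     # warmup = 8% of total steps
def lr_lambda(step):
    if step < warmup:
        return (step + 1) / warmup                      # linear warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

criterion = torch.nn.CrossEntropyLoss(ignore_index=-100, label_smoothing=0.0)
accum_steps = 4

for step in range(total_steps):
    x = torch.randn(48, 16)                   # batch size 48 per dataloader step
    y = torch.randint(0, 10, (48,))
    loss = criterion(model(x), y) / accum_steps  # scale for accumulation
    loss.backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()
    scheduler.step()
```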

Training Results

The model trains successfully and supports:

  • full end-to-end Hugging Face loading
  • custom remote-code inference
  • beam-search generation
  • attention-based decoding

On manual qualitative testing, the model produces:

Strengths

  • grammatically reasonable Hindi
  • correct basic sentence structure
  • good simple-sentence fluency
  • usable everyday sentence translations

Examples of good behavior:

  • “What is your name?” → “आपका नाम क्या है ?”
  • “Where is the station?” → “स्टेशन कहाँ है ?”
  • “My father bought a new car yesterday” → “मेरे पिता ने कल एक नई कार खरीदी”

Weaknesses

The model often translates technical or multi-word concepts too literally.

Examples:

  • “machine learning” → literal-style translation instead of मशीन लर्निंग
  • “deep learning” → literal-style translation instead of डीप लर्निंग
  • “artificial intelligence” → weaker rendering than standard कृत्रिम बुद्धिमत्ता

The model also shows weaker handling of:

  • negation
  • semantic polarity
  • role-sensitive meaning
  • technical terminology
  • some long and complex clauses

Evaluation Findings

Manual experiments suggest the model is:

  • linguistically decent
  • semantically shallow on technical phrases
  • better at surface fluency than deep relational meaning

Observed behavior from diagnostic tests:

  • strong grammatical and lexical pattern learning
  • weaker sensitivity to:
    • negation (like vs don't like)
    • antonymy (happy vs unhappy)
    • subject-object reversal
    • fixed technical expressions

This suggests the encoder representations capture:

  • lexical overlap
  • sentence template similarity
  • broad topical similarity

more strongly than:

  • precise compositional semantics
  • polarity
  • logical meaning changes

Limitations

This model has several important limitations:

1. Still a recurrent architecture

Although attention improves performance, this is still an LSTM-based seq2seq model, which is generally weaker than modern Transformer architectures on large-scale translation tasks.

2. Weak technical terminology handling

The model tends to over-translate domain terms compositionally rather than preserving them as established concepts.

Examples:

  • machine learning
  • deep learning
  • artificial intelligence

3. Weak negation sensitivity

Diagnostic similarity experiments suggest the model does not strongly separate:

  • positive vs negative forms
  • antonyms
  • role-reversed sentence meanings

4. Limited long-sentence robustness

Even with attention, longer and more information-dense sentences can still degrade translation quality.

5. Inconsistent register and style

The model sometimes mixes:

  • formal Hindi
  • informal Hindi
  • literal phrasing


BLEU Evaluation

  • BLEU score: 14.36


Example Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("kd13/nano-translate-v2", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("kd13/nano-translate-v2", trust_remote_code=True)

text = "What is your name?"
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    num_beams=10,
    max_length=42,
    length_penalty=1.0,
    early_stopping=True
)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

Model Size

  • ~0.1B parameters
  • Tensor type: F32 (safetensors)