Nano Translate v2: LSTM + Bahdanau Attention for English→Hindi Translation

This model is a custom LSTM-based sequence-to-sequence neural machine translation system for English-to-Hindi translation. It is implemented in PyTorch, wrapped in the Hugging Face PreTrainedModel API, and published for remote-code inference (loading it requires trust_remote_code=True).

Compared with the earlier baseline, this version improves the architecture substantially by adding:

  • a bidirectional LSTM encoder
  • Bahdanau attention
  • encoder-to-decoder bridge layers
  • input feeding in the decoder
  • full compatibility with Hugging Face generate() and beam search

The model is still a recurrent seq2seq system, but it is stronger than the earlier no-attention baseline.


Model Architecture

This model is a custom LSTM + attention encoder-decoder architecture for English→Hindi machine translation.

Encoder

The encoder consists of:

  • token embedding layer
  • embedding dropout
  • 3-layer bidirectional LSTM

Encoder settings

  • Embedding dimension: 500
  • Hidden dimension: 500
  • Number of layers: 3
  • Dropout: 0.15
  • Directionality: bidirectional

The encoder processes tokenized English input and returns:

  • token-level encoder outputs
  • final hidden states
  • final cell states
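
The encoder described above can be sketched in PyTorch roughly as follows. This is a minimal illustration under the listed settings; the class and argument names are assumptions, not the exact modules used in kd13/nano-translate-v2:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=75000, emb_dim=500, hid_dim=500,
                 num_layers=3, dropout=0.15, pad_id=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_id)
        self.dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True,
                            dropout=dropout)

    def forward(self, input_ids):
        embedded = self.dropout(self.embedding(input_ids))
        # outputs: (batch, src_len, 2 * hid_dim) token-level encoder outputs
        # hidden/cell: (2 * num_layers, batch, hid_dim) final states
        outputs, (hidden, cell) = self.lstm(embedded)
        return outputs, hidden, cell
```

Because the LSTM is bidirectional, the per-token outputs have twice the hidden dimension (forward and backward states concatenated), which is why the decoder-side attention and bridge layers below project from 2 × 500 = 1000 dimensions.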

Attention

The decoder uses Bahdanau attention over the encoder outputs.

Attention components:

  • linear projection of decoder hidden state
  • linear projection of encoder outputs
  • additive attention scoring
  • softmax attention weights
  • context vector computed as weighted sum of encoder outputs

This allows the decoder to attend dynamically to different source positions during generation.
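
The components listed above correspond to standard additive (Bahdanau) attention, which can be sketched as follows. Dimension defaults follow the model settings; names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    def __init__(self, dec_dim=500, enc_dim=1000, attn_dim=500):
        super().__init__()
        self.query_proj = nn.Linear(dec_dim, attn_dim, bias=False)  # decoder hidden state
        self.key_proj = nn.Linear(enc_dim, attn_dim, bias=False)    # encoder outputs
        self.energy = nn.Linear(attn_dim, 1, bias=False)            # additive scoring

    def forward(self, dec_hidden, enc_outputs, mask=None):
        # dec_hidden: (batch, dec_dim); enc_outputs: (batch, src_len, enc_dim)
        scores = self.energy(torch.tanh(
            self.query_proj(dec_hidden).unsqueeze(1) + self.key_proj(enc_outputs)
        )).squeeze(-1)                                  # (batch, src_len)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)         # attention weights
        # context vector: weighted sum of encoder outputs
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context, weights
```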


Decoder

The decoder consists of:

  • token embedding layer
  • embedding dropout
  • Bahdanau attention
  • 3-layer LSTM
  • input feeding mechanism
  • pre-output projection layer
  • output projection to vocabulary logits

At each decoding step, the decoder consumes:

  • current token embedding
  • previous attention context
  • previous attentional output

This makes the decoder richer than a plain LSTM decoder.

Decoder settings

  • Embedding dimension: 500
  • Hidden dimension: 500
  • Number of layers: 3
  • Dropout: 0.15
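
A single decoding step with input feeding can be sketched as below. The attention context is passed in as an argument (computed as described in the Attention section); class names, the exact concatenation order, and the tanh pre-output layer are assumptions for illustration:

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, vocab_size=75000, emb_dim=500, hid_dim=500,
                 enc_dim=1000, num_layers=3, dropout=0.15):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.dropout = nn.Dropout(dropout)
        # input feeding: the LSTM consumes [token embedding ; previous attentional output]
        self.lstm = nn.LSTM(emb_dim + hid_dim, hid_dim, num_layers=num_layers,
                            batch_first=True, dropout=dropout)
        # pre-output layer combines the LSTM output with the attention context
        self.pre_output = nn.Linear(hid_dim + enc_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, token, prev_attn_out, context, state):
        # token: (batch, 1); prev_attn_out: (batch, hid); context: (batch, enc_dim)
        emb = self.dropout(self.embedding(token))                  # (batch, 1, emb)
        lstm_in = torch.cat([emb, prev_attn_out.unsqueeze(1)], dim=-1)
        lstm_out, state = self.lstm(lstm_in, state)                # (batch, 1, hid)
        attn_out = torch.tanh(self.pre_output(
            torch.cat([lstm_out.squeeze(1), context], dim=-1)))   # attentional output
        logits = self.out(attn_out)                                # vocabulary logits
        return logits, attn_out, state
```

The attentional output attn_out is carried over to the next step, which is what makes this "input feeding" rather than a plain LSTM decoder.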

Bridge Layers

Because the encoder is bidirectional and the decoder is unidirectional, the model uses learned bridge layers:

  • hidden bridge: maps encoder final hidden states to decoder hidden states
  • cell bridge: maps encoder final cell states to decoder cell states

These bridges combine forward and backward encoder states before initializing the decoder.
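
A sketch of such a bridge: the forward and backward final states are concatenated per layer and projected down to the decoder's size. The names and the tanh nonlinearity are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Bridge(nn.Module):
    def __init__(self, hid_dim=500):
        super().__init__()
        self.hidden_bridge = nn.Linear(2 * hid_dim, hid_dim)
        self.cell_bridge = nn.Linear(2 * hid_dim, hid_dim)

    @staticmethod
    def _combine(state, proj):
        # state: (num_layers * 2, batch, hid) from a bidirectional LSTM,
        # ordered [layer0_fwd, layer0_bwd, layer1_fwd, ...]
        layers, batch, hid = state.size(0) // 2, state.size(1), state.size(2)
        state = state.view(layers, 2, batch, hid)
        merged = torch.cat([state[:, 0], state[:, 1]], dim=-1)  # (layers, batch, 2*hid)
        return torch.tanh(proj(merged))

    def forward(self, hidden, cell):
        # returns initial (hidden, cell) states for the unidirectional decoder
        return (self._combine(hidden, self.hidden_bridge),
                self._combine(cell, self.cell_bridge))
```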


Vocabulary / Tokenization

Tokenizer used:

  • kd13/nano-translate-v2

Vocabulary:

  • shared tokenizer
  • vocabulary size: 75000

Special tokens:

  • uses tokenizer-defined:
    • pad_token_id
    • bos_token_id
    • eos_token_id

Generation Configuration

Default generation setup:

  • num_beams = 10
  • max_length = 42
  • length_penalty = 1.0
  • early_stopping = true
  • do_sample = false
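
For illustration, these defaults can be expressed as a transformers GenerationConfig (a sketch only; the repository may ship its own generation_config.json with the same values):

```python
from transformers import GenerationConfig

gen_config = GenerationConfig(
    num_beams=10,        # beam search width
    max_length=42,       # maximum output length in tokens
    length_penalty=1.0,  # neutral length penalty
    early_stopping=True, # stop beams once finished hypotheses dominate
    do_sample=False,     # deterministic beam search, no sampling
)
```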

Dataset

The model was trained on:

  • Dataset: cfilt/iitb-english-hindi

Splits used:

  • train
  • validation
  • test

Source language:

  • English (en)

Target language:

  • Hindi (hi)

Training Configuration

Framework:

  • PyTorch
  • Hugging Face Transformers
  • Hugging Face Accelerate

Training setup:

  • mixed precision: fp16
  • gradient accumulation steps: 4
  • batch size per dataloader step: 48
  • epochs: 11
  • optimizer: AdamW
  • learning rate: 0.0008
  • weight decay: 0.0002
  • betas: (0.9, 0.999)
  • scheduler: cosine decay with warmup
  • warmup ratio: 8%
  • gradient clipping: max_grad_norm = 1.0
  • loss function: cross-entropy
  • ignore index: -100
  • label smoothing: 0.0
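
The optimization setup above can be sketched as a minimal loop. The tiny linear model and random data are stand-ins for the real seq2seq model and dataloader, and the hand-rolled cosine-with-warmup schedule approximates what Hugging Face scheduler utilities provide; everything else follows the listed hyperparameters:

```python
import math
import torch
from torch.optim import AdamW

torch.manual_seed(0)

model = torch.nn.Linear(16, 10)  # stand-in for the seq2seq model
optimizer = AdamW(model.parameters(), lr=8e-4,
                  weight_decay=2e-4, betas=(0.9, 0.999))

total_steps, warmup = 100, 8     # warmup = 8% of total steps
def lr_lambda(step):
    if step < warmup:
        return (step + 1) / warmup                      # linear warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

criterion = torch.nn.CrossEntropyLoss(ignore_index=-100, label_smoothing=0.0)
accum_steps = 4

for step in range(total_steps):
    x = torch.randn(48, 16)                   # batch size 48 per dataloader step
    y = torch.randint(0, 10, (48,))
    loss = criterion(model(x), y) / accum_steps  # scale for accumulation
    loss.backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()
    scheduler.step()
```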

Training Results

The model trains successfully and supports:

  • full end-to-end Hugging Face loading
  • custom remote-code inference
  • beam-search generation
  • attention-based decoding

On manual qualitative testing, the model produces:

Strengths

  • grammatically reasonable Hindi
  • correct basic sentence structure
  • good simple-sentence fluency
  • usable everyday sentence translations

Examples of good behavior:

  • “What is your name?” → “आपका नाम क्या है ?”
  • “Where is the station?” → “स्टेशन कहाँ है ?”
  • “My father bought a new car yesterday” → “मेरे पिता ने कल एक नई कार खरीदी”

Weaknesses

The model often translates technical or multi-word concepts too literally.

Examples:

  • “machine learning” → literal-style translation instead of मशीन लर्निंग
  • “deep learning” → literal-style translation instead of डीप लर्निंग
  • “artificial intelligence” → weaker rendering than standard कृत्रिम बुद्धिमत्ता

The model also shows weaker handling of:

  • negation
  • semantic polarity
  • role-sensitive meaning
  • technical terminology
  • some long and complex clauses

Evaluation Findings

Manual experiments suggest the model is:

  • linguistically decent
  • semantically shallow on technical phrases
  • better at surface fluency than deep relational meaning

Observed behavior from diagnostic tests:

  • strong grammatical and lexical pattern learning
  • weaker sensitivity to:
    • negation (like vs don't like)
    • antonymy (happy vs unhappy)
    • subject-object reversal
    • fixed technical expressions

This suggests the encoder representations capture:

  • lexical overlap
  • sentence template similarity
  • broad topical similarity

more strongly than:

  • precise compositional semantics
  • polarity
  • logical meaning changes

Limitations

This model has several important limitations:

1. Still a recurrent architecture

Although attention improves performance, this is still an LSTM-based seq2seq model, which is generally weaker than modern Transformer architectures on large-scale translation tasks.

2. Weak technical terminology handling

The model tends to over-translate domain terms compositionally rather than preserving them as established concepts.

Examples:

  • machine learning
  • deep learning
  • artificial intelligence

3. Weak negation sensitivity

Diagnostic similarity experiments suggest the model does not strongly separate:

  • positive vs negative forms
  • antonyms
  • role-reversed sentence meanings

4. Limited long-sentence robustness

Even with attention, longer and more information-dense sentences can still degrade translation quality.

5. Inconsistent register and style

The model sometimes mixes:

  • formal Hindi
  • informal Hindi
  • literal phrasing


BLEU Evaluation

  • BLEU score: 14.36


Example Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("kd13/nano-translate-v2", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("kd13/nano-translate-v2", trust_remote_code=True)

text = "What is your name?"
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    num_beams=10,
    max_length=42,
    length_penalty=1.0,
    early_stopping=True
)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

Model Size

  • ~0.1B parameters
  • Tensor type: F32 (safetensors)