This model is inspired by the 2014 sequence-to-sequence architecture for neural machine translation (Sutskever et al., "Sequence to Sequence Learning with Neural Networks", 2014), including that paper's strategy of reversing the source sequence during preprocessing.

Model Architecture

This model is a custom LSTM-based sequence-to-sequence (seq2seq) model for English-to-Hindi machine translation, implemented in PyTorch and wrapped in the Hugging Face PreTrainedModel API.

The architecture closely follows that paper, especially the idea of reversing the source sequence before encoding. During preprocessing, the English input tokens are reversed while BOS/EOS token positions are preserved when present; the target Hindi sequence is kept in normal order.
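The reversal step can be sketched as a small helper. The function and argument names here are illustrative, not part of the released code:

```python
def reverse_source(ids, bos_id=None, eos_id=None):
    """Reverse a source token-id list while keeping BOS at the
    front and EOS at the end when they are present."""
    start = 1 if (bos_id is not None and ids[:1] == [bos_id]) else 0
    end = len(ids) - 1 if (eos_id is not None and ids[-1:] == [eos_id]) else len(ids)
    return ids[:start] + ids[start:end][::-1] + ids[end:]

# [BOS, the, cat, sat, EOS] -> [BOS, sat, cat, the, EOS]
```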

Main components

  • Encoder:

    • Token embedding layer
    • 4-layer LSTM
    • Embedding dimension: 1000
    • Hidden dimension: 1000
    • Dropout: 0.15
  • Decoder:

    • Token embedding layer
    • 4-layer LSTM
    • Linear output projection from hidden state to vocabulary logits
    • Embedding dimension: 1000
    • Hidden dimension: 1000
    • Dropout: 0.15
  • Vocabulary / Tokenization:

    • Shared tokenizer loaded from: kd13/nano-translate-v1
    • Shared vocabulary size: 52000
  • Special tokens:

    • Uses tokenizer-defined pad_token_id, bos_token_id, and eos_token_id
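A minimal PyTorch sketch of the components above. Class names and constructor arguments are assumptions; the defaults match the dimensions listed:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab=52000, emb=1000, hid=1000, layers=4, dropout=0.15):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, layers, dropout=dropout, batch_first=True)

    def forward(self, src):                     # src: (batch, src_len), reversed English ids
        _, (h, c) = self.lstm(self.embed(src))  # only the final states are passed on
        return h, c

class Decoder(nn.Module):
    def __init__(self, vocab=52000, emb=1000, hid=1000, layers=4, dropout=0.15):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, layers, dropout=dropout, batch_first=True)
        self.proj = nn.Linear(hid, vocab)       # hidden state -> vocabulary logits

    def forward(self, tgt, h, c):               # tgt: (batch, tgt_len), Hindi ids
        out, (h, c) = self.lstm(self.embed(tgt), (h, c))
        return self.proj(out), h, c
```

The decoder is initialized with the encoder's final hidden and cell states, which is the only channel through which source information reaches the decoder.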

Forward pass behavior

During training:

  • The encoder processes reversed English source tokens.
  • The decoder receives shifted-right Hindi target tokens.
  • Cross-entropy loss is computed over decoder logits, with padding ignored.
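The shift-right and masked-loss steps can be sketched as follows; both helper names are assumptions, and the `-100` ignore index matches the loss configuration described later in this card:

```python
import torch
import torch.nn.functional as F

def shift_right(labels, bos_id, pad_id):
    """Build decoder inputs from targets: prepend BOS, drop the last token."""
    inp = labels.new_full(labels.shape, pad_id)
    inp[:, 0] = bos_id
    inp[:, 1:] = labels[:, :-1]
    return inp

def seq2seq_loss(logits, labels, pad_id):
    """Cross-entropy over decoder logits; padded positions are ignored."""
    labels = labels.masked_fill(labels == pad_id, -100)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1), ignore_index=-100)
```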

This model does not include:

  • Attention
  • Transformer blocks
  • Explicit coverage or alignment modeling

It is a pure recurrent encoder-decoder baseline in the spirit of early neural machine translation systems.


Limitations

This model has several important limitations:

  1. No attention mechanism
    The decoder relies only on the final encoder hidden and cell states. This can make translation harder for long or information-dense sentences.

  2. Performance may degrade on long sequences
    Although LSTMs are stronger than vanilla RNNs, fixed-size context transfer from encoder to decoder can still bottleneck long-sentence translation quality.

  3. No explicit attention/alignment interpretability
    Because the model has no attention layer, it does not provide token-to-token alignment information.


Training Details & Results

Dataset

The model was trained on:

  • Dataset: cfilt/iitb-english-hindi
  • Splits used:

    • train
    • validation
    • test

The source language is English (en) and the target language is Hindi (hi).

Preprocessing

  • English source sentences are tokenized and reversed at the token level
  • Hindi target sentences are tokenized normally
  • Maximum source length: 100
  • Maximum target length: 100
  • Dynamic padding is applied using DataCollatorForSeq2Seq
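A hedged sketch of this preprocessing, written against the Hugging Face tokenizer call convention. The function name and the `en`/`hi` column names are assumptions:

```python
MAX_SRC_LEN = MAX_TGT_LEN = 100

def preprocess(batch, tokenizer):
    """Tokenize, truncate to the length caps above, and reverse the
    English token ids while keeping BOS/EOS in place."""
    src = tokenizer(batch["en"], max_length=MAX_SRC_LEN, truncation=True)
    tgt = tokenizer(batch["hi"], max_length=MAX_TGT_LEN, truncation=True)
    bos, eos = tokenizer.bos_token_id, tokenizer.eos_token_id
    reversed_ids = []
    for ids in src["input_ids"]:
        start = 1 if (bos is not None and ids[:1] == [bos]) else 0
        end = len(ids) - 1 if (eos is not None and ids[-1:] == [eos]) else len(ids)
        reversed_ids.append(ids[:start] + ids[start:end][::-1] + ids[end:])
    src["input_ids"] = reversed_ids
    src["labels"] = tgt["input_ids"]
    return src
```

The variable-length examples produced here are then padded per batch by DataCollatorForSeq2Seq.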

Training configuration

  • Framework: PyTorch + Hugging Face Transformers + Accelerate
  • Mixed precision: fp16
  • Number of processes: 2
  • Gradient accumulation steps: 4
  • Batch size per dataloader step: 32
  • Epochs: 11
  • Optimizer: AdamW
  • Learning rate: 0.005
  • Weight decay: 1e-5
  • Betas: (0.9, 0.98)
  • Scheduler: cosine decay with warmup
  • Warmup ratio: 8% of total training steps
  • Gradient clipping: max_grad_norm = 3.0
  • Loss function: cross-entropy with padding ignored (ignore_index = -100)
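The optimizer and schedule above can be assembled roughly as follows. `build_optimizer` is an illustrative helper, and the cosine-with-warmup schedule is expressed via `LambdaLR`; the actual training script may use transformers' `get_cosine_schedule_with_warmup` instead:

```python
import math
import torch

def build_optimizer(model, total_steps, lr=5e-3, weight_decay=1e-5,
                    betas=(0.9, 0.98), warmup_ratio=0.08):
    opt = torch.optim.AdamW(model.parameters(), lr=lr,
                            betas=betas, weight_decay=weight_decay)
    warmup = max(1, int(warmup_ratio * total_steps))

    def lr_lambda(step):
        if step < warmup:                        # linear warmup
            return step / warmup
        progress = (step - warmup) / max(1, total_steps - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```

Gradients are clipped each step with `torch.nn.utils.clip_grad_norm_(model.parameters(), 3.0)` before the optimizer update.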

Model selection

The best checkpoint is selected based on lowest validation loss across training epochs.
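In code terms, the selection rule is simply (helper name assumed):

```python
def best_epoch(val_losses):
    """Return the index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)

# best_epoch([3.1, 2.4, 2.6, 2.5]) -> 1
```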

Training results

The training script records:

  • train_loss for each epoch
  • val_loss for each epoch
  • best_epoch
  • best_val_loss



Evaluation

  • BLEU score: 2.79

Translations


Experiments (Encoder)


Model size: 0.2B parameters · Tensor type: F32 · Format: Safetensors