This model is inspired by the 2014 sequence-to-sequence (seq2seq) architecture of Sutskever et al. for neural machine translation, including the reversed-source-sequence preprocessing strategy proposed in that work.
Model Architecture
This model is a custom LSTM-based sequence-to-sequence (seq2seq) model for English-to-Hindi machine translation, implemented in PyTorch and wrapped in the Hugging Face PreTrainedModel API.
The architecture is strongly inspired by the 2014 seq2seq paper, especially the idea of reversing the source sequence before encoding. In preprocessing, the English input tokens are reversed while preserving BOS/EOS token positions when present. The target Hindi sequence is kept in normal order.
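As a concrete illustration of the reversal step, the helper below reverses a source token-id sequence while leaving BOS/EOS tokens in place. This is a minimal sketch of the preprocessing described above; the function name and the example ids are illustrative, and the real ids come from the tokenizer.

```python
def reverse_source(ids, bos_id=None, eos_id=None):
    """Reverse a source token-id list, keeping BOS/EOS positions fixed.

    Hypothetical helper illustrating the reversed-source preprocessing;
    token ids here are made up for the example.
    """
    start = 1 if (bos_id is not None and ids and ids[0] == bos_id) else 0
    end = len(ids) - 1 if (eos_id is not None and ids and ids[-1] == eos_id) else len(ids)
    return ids[:start] + ids[start:end][::-1] + ids[end:]

# Example with BOS=1, EOS=2: interior tokens are reversed, markers stay put.
print(reverse_source([1, 10, 11, 12, 2], bos_id=1, eos_id=2))  # [1, 12, 11, 10, 2]
```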
Main components
Encoder:
- Token embedding layer
- 4-layer LSTM
- Embedding dimension: 1000
- Hidden dimension: 1000
- Dropout: 0.15
Decoder:
- Token embedding layer
- 4-layer LSTM
- Linear output projection from hidden state to vocabulary logits
- Embedding dimension: 1000
- Hidden dimension: 1000
- Dropout: 0.15
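The components above can be sketched in PyTorch as follows. This is a simplified sketch, not the released implementation: the class names are illustrative, and only the listed hyperparameters (4 LSTM layers, 1000-dim embeddings and hidden states, 0.15 dropout, 52000 vocabulary) are taken from the model card.

```python
import torch
import torch.nn as nn

# Default hyperparameters mirror the values listed above.
VOCAB, EMB, HID, LAYERS, DROPOUT = 52000, 1000, 1000, 4, 0.15

class Encoder(nn.Module):
    def __init__(self, vocab=VOCAB, emb=EMB, hid=HID, layers=LAYERS, dropout=DROPOUT):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, num_layers=layers,
                            dropout=dropout, batch_first=True)

    def forward(self, src_ids):
        # Only the final (hidden, cell) states are passed to the decoder.
        _, (h, c) = self.lstm(self.embed(src_ids))
        return h, c

class Decoder(nn.Module):
    def __init__(self, vocab=VOCAB, emb=EMB, hid=HID, layers=LAYERS, dropout=DROPOUT):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, num_layers=layers,
                            dropout=dropout, batch_first=True)
        self.proj = nn.Linear(hid, vocab)  # hidden state -> vocabulary logits

    def forward(self, tgt_ids, state):
        out, state = self.lstm(self.embed(tgt_ids), state)
        return self.proj(out), state
```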
Vocabulary / Tokenization:
- Shared tokenizer loaded from: kd13/nano-translate-v1
- Shared vocabulary size: 52000
Special tokens:
- Uses tokenizer-defined pad_token_id, bos_token_id, and eos_token_id
Forward pass behavior
During training:
- The encoder processes reversed English source tokens.
- The decoder receives shifted-right Hindi target tokens.
- Cross-entropy loss is computed over decoder logits, with padding ignored.
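The shifted-right decoder input and padding-ignoring loss can be sketched as below. The BOS/PAD ids, vocabulary size, and token values are illustrative, and random logits stand in for the decoder output; only the shift-right construction and `ignore_index=-100` follow the setup described above.

```python
import torch
import torch.nn.functional as F

BOS, PAD = 1, 0  # illustrative special-token ids

tgt = torch.tensor([[7, 8, 9, 2],      # a target sequence ending in EOS=2
                    [7, 8, 2, PAD]])   # a shorter sequence, padded

# Decoder input: target shifted right, with BOS prepended.
decoder_input = torch.cat([torch.full((tgt.size(0), 1), BOS), tgt[:, :-1]], dim=1)

# Labels: padded positions replaced by -100 so the loss ignores them.
labels = tgt.masked_fill(tgt == PAD, -100)

vocab = 16
logits = torch.randn(tgt.size(0), tgt.size(1), vocab)  # stand-in decoder output
loss = F.cross_entropy(logits.reshape(-1, vocab), labels.reshape(-1),
                       ignore_index=-100)
print(loss.item())
```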
This model does not include:
- Attention
- Transformer blocks
- Explicit coverage or alignment modeling
It is a pure recurrent encoder-decoder baseline in the spirit of early neural machine translation systems.
Limitations
This model has several important limitations:
No attention mechanism
The decoder relies only on the final encoder hidden and cell states. This can make translation harder for long or information-dense sentences.
Performance may degrade on long sequences
Although LSTMs are stronger than vanilla RNNs, the fixed-size context transfer from encoder to decoder can still bottleneck long-sentence translation quality.
No explicit attention/alignment interpretability
Because the model has no attention layer, it does not provide token-to-token alignment information.
Training Details & Training Results
Dataset
The model was trained on:
- Dataset: cfilt/iitb-english-hindi
- Splits used: train, validation, test
The source language is English (en) and the target language is Hindi (hi).
Preprocessing
- English source sentences are tokenized and reversed at the token level
- Hindi target sentences are tokenized normally
- Maximum source length: 100
- Maximum target length: 100
- Dynamic padding is applied using DataCollatorForSeq2Seq
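In the spirit of the dynamic padding described above, a pure-PyTorch collate function can pad each batch only to its longest member and mark padded label positions with -100 so they are ignored by the loss. This is a hedged sketch of what a seq2seq data collator does, not the card's actual collator; the field names and PAD id are illustrative.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD, MAX_LEN = 0, 100  # PAD id is illustrative; 100 matches the listed max lengths

def collate(batch):
    # Truncate to the maximum length, then pad dynamically per batch.
    src = [torch.tensor(ex["src_ids"][:MAX_LEN]) for ex in batch]
    tgt = [torch.tensor(ex["tgt_ids"][:MAX_LEN]) for ex in batch]
    return {
        "input_ids": pad_sequence(src, batch_first=True, padding_value=PAD),
        # Labels are padded with -100 so the loss skips padded positions.
        "labels": pad_sequence(tgt, batch_first=True, padding_value=-100),
    }

batch = collate([{"src_ids": [5, 6, 7], "tgt_ids": [8, 9]},
                 {"src_ids": [5], "tgt_ids": [8, 9, 10, 2]}])
print(batch["input_ids"].shape, batch["labels"].shape)
```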
Training configuration
- Framework: PyTorch + Hugging Face Transformers + Accelerate
- Mixed precision: fp16
- Number of processes: 2
- Gradient accumulation steps: 4
- Batch size per dataloader step: 32
- Epochs: 11
- Optimizer: AdamW
- Learning rate: 0.005
- Weight decay: 1e-5
- Betas: (0.9, 0.98)
- Scheduler: cosine decay with warmup
- Warmup ratio: 8% of total training steps
- Gradient clipping: max_grad_norm = 3.0
- Loss function: cross-entropy with padding ignored (ignore_index = -100)
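The optimizer and scheduler settings above can be wired up as follows. This is a minimal sketch: a tiny linear layer stands in for the seq2seq model, the total step count is made up, and the cosine-with-warmup schedule is hand-rolled with LambdaLR rather than taken from the original training script.

```python
import math
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the seq2seq model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-3,
                              betas=(0.9, 0.98), weight_decay=1e-5)

total_steps = 1000                         # illustrative value
warmup_steps = int(0.08 * total_steps)     # 8% warmup, as listed

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                     # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))          # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Gradient clipping as configured (called once per step in a real loop):
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=3.0)
```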
Model selection
The best checkpoint is selected based on lowest validation loss across training epochs.
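The selection rule amounts to tracking the epoch with the lowest validation loss, as sketched below with made-up loss values:

```python
# Illustrative per-epoch validation losses; the real values are recorded
# by the training script.
val_losses = [4.2, 3.7, 3.9, 3.5, 3.6]

best_epoch, best_val_loss = None, float("inf")
for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_val_loss:
        best_epoch, best_val_loss = epoch, loss
        # torch.save(model.state_dict(), "best.pt")  # checkpoint here

print(best_epoch, best_val_loss)  # 4 3.5
```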
Training results
The training script records:
- train_loss for each epoch
- val_loss for each epoch
- best_epoch
- best_val_loss
Evaluation
- BLEU score: 2.79