This model is inspired by the 2014 sequence-to-sequence (seq2seq) architecture of Sutskever et al. for neural machine translation, including the reversed-source-sequence preprocessing strategy proposed in that work.
Model Architecture
This model is a custom LSTM-based sequence-to-sequence (seq2seq) model for English-to-Hindi machine translation, implemented in PyTorch and wrapped in the Hugging Face PreTrainedModel API.
The architecture is strongly inspired by the 2014 seq2seq paper, especially the idea of reversing the source sequence before encoding. In preprocessing, the English input tokens are reversed while preserving BOS/EOS token positions when present. The target Hindi sequence is kept in normal order.
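As a concrete illustration of the reversal step, the helper below reverses a source token-id sequence while leaving BOS/EOS tokens in place. This is a minimal sketch of the preprocessing described above; the function name and the example ids are illustrative, and the real ids come from the tokenizer.

```python
def reverse_source(ids, bos_id=None, eos_id=None):
    """Reverse a source token-id list, keeping BOS/EOS positions fixed.

    Hypothetical helper illustrating the reversed-source preprocessing;
    token ids here are made up for the example.
    """
    start = 1 if (bos_id is not None and ids and ids[0] == bos_id) else 0
    end = len(ids) - 1 if (eos_id is not None and ids and ids[-1] == eos_id) else len(ids)
    return ids[:start] + ids[start:end][::-1] + ids[end:]

# Example with BOS=1, EOS=2: interior tokens are reversed, markers stay put.
print(reverse_source([1, 10, 11, 12, 2], bos_id=1, eos_id=2))  # [1, 12, 11, 10, 2]
```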
Main components
Encoder:
- Token embedding layer
- 4-layer LSTM
- Embedding dimension: 1000
- Hidden dimension: 1000
- Dropout: 0.15
Decoder:
- Token embedding layer
- 4-layer LSTM
- Linear output projection from hidden state to vocabulary logits
- Embedding dimension: 1000
- Hidden dimension: 1000
- Dropout: 0.15
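The components above can be sketched in PyTorch as follows. This is a simplified sketch, not the released implementation: the class names are illustrative, and only the listed hyperparameters (4 LSTM layers, 1000-dim embeddings and hidden states, 0.15 dropout, 52000 vocabulary) are taken from the model card.

```python
import torch
import torch.nn as nn

# Default hyperparameters mirror the values listed above.
VOCAB, EMB, HID, LAYERS, DROPOUT = 52000, 1000, 1000, 4, 0.15

class Encoder(nn.Module):
    def __init__(self, vocab=VOCAB, emb=EMB, hid=HID, layers=LAYERS, dropout=DROPOUT):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, num_layers=layers,
                            dropout=dropout, batch_first=True)

    def forward(self, src_ids):
        # Only the final (hidden, cell) states are passed to the decoder.
        _, (h, c) = self.lstm(self.embed(src_ids))
        return h, c

class Decoder(nn.Module):
    def __init__(self, vocab=VOCAB, emb=EMB, hid=HID, layers=LAYERS, dropout=DROPOUT):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, num_layers=layers,
                            dropout=dropout, batch_first=True)
        self.proj = nn.Linear(hid, vocab)  # hidden state -> vocabulary logits

    def forward(self, tgt_ids, state):
        out, state = self.lstm(self.embed(tgt_ids), state)
        return self.proj(out), state
```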
Vocabulary / Tokenization:
- Shared tokenizer loaded from: kd13/nano-translate-v1
- Shared vocabulary size: 52000
Special tokens:
- Uses tokenizer-defined pad_token_id, bos_token_id, and eos_token_id
Forward pass behavior
During training:
- The encoder processes reversed English source tokens.
- The decoder receives shifted-right Hindi target tokens.
- Cross-entropy loss is computed over decoder logits, with padding ignored.
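The shifted-right decoder input and padding-ignoring loss can be sketched as below. The BOS/PAD ids, vocabulary size, and token values are illustrative, and random logits stand in for the decoder output; only the shift-right construction and `ignore_index=-100` follow the setup described above.

```python
import torch
import torch.nn.functional as F

BOS, PAD = 1, 0  # illustrative special-token ids

tgt = torch.tensor([[7, 8, 9, 2],      # a target sequence ending in EOS=2
                    [7, 8, 2, PAD]])   # a shorter sequence, padded

# Decoder input: target shifted right, with BOS prepended.
decoder_input = torch.cat([torch.full((tgt.size(0), 1), BOS), tgt[:, :-1]], dim=1)

# Labels: padded positions replaced by -100 so the loss ignores them.
labels = tgt.masked_fill(tgt == PAD, -100)

vocab = 16
logits = torch.randn(tgt.size(0), tgt.size(1), vocab)  # stand-in decoder output
loss = F.cross_entropy(logits.reshape(-1, vocab), labels.reshape(-1),
                       ignore_index=-100)
print(loss.item())
```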
This model does not include:
- Attention
- Transformer blocks
- Explicit coverage or alignment modeling
It is a pure recurrent encoder-decoder baseline in the spirit of early neural machine translation systems.
Limitations
This model has several important limitations:
No attention mechanism
The decoder relies only on the final encoder hidden and cell states. This can make translation harder for long or information-dense sentences.
Performance may degrade on long sequences
Although LSTMs are stronger than vanilla RNNs, the fixed-size context transfer from encoder to decoder can still bottleneck long-sentence translation quality.
No explicit attention/alignment interpretability
Because the model has no attention layer, it does not provide token-to-token alignment information.
Training Details & Training Results
Dataset
The model was trained on:
- Dataset: cfilt/iitb-english-hindi
- Splits used: train, validation, test
The source language is English (en) and the target language is Hindi (hi).
Preprocessing
- English source sentences are tokenized and reversed at the token level
- Hindi target sentences are tokenized normally
- Maximum source length: 100
- Maximum target length: 100
- Dynamic padding is applied using DataCollatorForSeq2Seq
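In the spirit of the dynamic padding described above, a pure-PyTorch collate function can pad each batch only to its longest member and mark padded label positions with -100 so they are ignored by the loss. This is a hedged sketch of what a seq2seq data collator does, not the card's actual collator; the field names and PAD id are illustrative.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD, MAX_LEN = 0, 100  # PAD id is illustrative; 100 matches the listed max lengths

def collate(batch):
    # Truncate to the maximum length, then pad dynamically per batch.
    src = [torch.tensor(ex["src_ids"][:MAX_LEN]) for ex in batch]
    tgt = [torch.tensor(ex["tgt_ids"][:MAX_LEN]) for ex in batch]
    return {
        "input_ids": pad_sequence(src, batch_first=True, padding_value=PAD),
        # Labels are padded with -100 so the loss skips padded positions.
        "labels": pad_sequence(tgt, batch_first=True, padding_value=-100),
    }

batch = collate([{"src_ids": [5, 6, 7], "tgt_ids": [8, 9]},
                 {"src_ids": [5], "tgt_ids": [8, 9, 10, 2]}])
print(batch["input_ids"].shape, batch["labels"].shape)
```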
Training configuration
- Framework: PyTorch + Hugging Face Transformers + Accelerate
- Mixed precision: fp16
- Number of processes: 2
- Gradient accumulation steps: 4
- Batch size per dataloader step: 32
- Epochs: 11
- Optimizer: AdamW
- Learning rate: 0.005
- Weight decay: 1e-5
- Betas: (0.9, 0.98)
- Scheduler: cosine decay with warmup
- Warmup ratio: 8% of total training steps
- Gradient clipping: max_grad_norm = 3.0
- Loss function: cross-entropy with padding ignored (ignore_index = -100)
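The optimizer and scheduler settings above can be wired up as follows. This is a minimal sketch: a tiny linear layer stands in for the seq2seq model, the total step count is made up, and the cosine-with-warmup schedule is hand-rolled with LambdaLR rather than taken from the original training script.

```python
import math
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the seq2seq model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-3,
                              betas=(0.9, 0.98), weight_decay=1e-5)

total_steps = 1000                         # illustrative value
warmup_steps = int(0.08 * total_steps)     # 8% warmup, as listed

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                     # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))          # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Gradient clipping as configured (called once per step in a real loop):
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=3.0)
```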
Model selection
The best checkpoint is selected based on lowest validation loss across training epochs.
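The selection rule amounts to tracking the epoch with the lowest validation loss, as sketched below with made-up loss values:

```python
# Illustrative per-epoch validation losses; the real values are recorded
# by the training script.
val_losses = [4.2, 3.7, 3.9, 3.5, 3.6]

best_epoch, best_val_loss = None, float("inf")
for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_val_loss:
        best_epoch, best_val_loss = epoch, loss
        # torch.save(model.state_dict(), "best.pt")  # checkpoint here

print(best_epoch, best_val_loss)  # 4 3.5
```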
Training results
The training script records:
- train_loss for each epoch
- val_loss for each epoch
- best_epoch
- best_val_loss
Evaluation
- BLEU score: 2.79