---
license: mit
datasets:
- wmt/wmt19
language:
- en
- de
pipeline_tag: translation
---

# Seq2Seq German-English Translation Model

A sequence-to-sequence neural machine translation model that translates German text to English, built in PyTorch with an LSTM encoder-decoder architecture.

## Model Description

This model implements the classic seq2seq architecture from [Sutskever et al. (2014)](https://arxiv.org/abs/1409.3215) for German-English translation:

- **Encoder**: 2-layer LSTM that processes German input sequences
- **Decoder**: 2-layer LSTM that generates English output sequences  
- **Training Strategy**: Teacher forcing during training, autoregressive generation during inference
- **Vocabulary**: 30k German words, 25k English words
- **Dataset**: Trained on 2M sentence pairs from WMT19 (subset of full 35M dataset)
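At inference time the decoder has no gold target to condition on, so generation is autoregressive: each predicted token is fed back as the next input. A minimal greedy-decoding sketch (the token IDs and `decoder_step` are hypothetical stand-ins, not this repository's API):

```python
# Greedy autoregressive decoding sketch; `decoder_step` is a hypothetical
# stand-in for one LSTM decoder step returning (argmax token, new state).
START, END = 2, 3  # assumed special-token IDs

def greedy_decode(decoder_step, state, max_len=50):
    """Feed each predicted token back in until <END> or max_len tokens."""
    token, output = START, []
    for _ in range(max_len):
        token, state = decoder_step(token, state)
        if token == END:
            break
        output.append(token)
    return output

# Toy decoder that emits tokens 5, 6, 7 and then <END>, ignoring state.
script = iter([5, 6, 7, END])
print(greedy_decode(lambda tok, st: (next(script), st), state=None))  # [5, 6, 7]
```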

## Model Architecture

```
German Input → Embedding → LSTM Encoder → Context Vector → LSTM Decoder → Linear Projection → English Output
```

**Hyperparameters:**
- Embedding size: 256
- Hidden size: 512
- LSTM layers: 2 (both encoder/decoder)
- Dropout: 0.3
- Batch size: 64
- Learning rate: 0.0003
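Under these hyperparameters, the encoder-decoder can be sketched in PyTorch as follows (class and layer names here are assumptions for illustration, not the repository's exact code):

```python
import torch
import torch.nn as nn

# Sketch of the 2-layer LSTM encoder-decoder with the hyperparameters above.
EMB, HID, LAYERS, DROPOUT = 256, 512, 2, 0.3

class Encoder(nn.Module):
    def __init__(self, vocab=30_000):  # 30k German words
        super().__init__()
        self.embed = nn.Embedding(vocab, EMB)
        self.lstm = nn.LSTM(EMB, HID, LAYERS, dropout=DROPOUT, batch_first=True)

    def forward(self, src):
        _, (h, c) = self.lstm(self.embed(src))
        return h, c  # final hidden/cell state acts as the context vector

class Decoder(nn.Module):
    def __init__(self, vocab=25_000):  # 25k English words
        super().__init__()
        self.embed = nn.Embedding(vocab, EMB)
        self.lstm = nn.LSTM(EMB, HID, LAYERS, dropout=DROPOUT, batch_first=True)
        self.proj = nn.Linear(HID, vocab)  # logits over the English vocabulary

    def forward(self, tgt, state):
        out, state = self.lstm(self.embed(tgt), state)
        return self.proj(out), state

enc, dec = Encoder(), Decoder()
src = torch.randint(0, 30_000, (64, 20))  # batch of German token IDs
tgt = torch.randint(0, 25_000, (64, 22))  # batch of English token IDs
logits, _ = dec(tgt, enc(src))
print(logits.shape)  # torch.Size([64, 22, 25000])
```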

## Training Data

- **Dataset**: WMT19 German-English Translation Task
- **Size**: 2M sentence pairs (filtered subset)
- **Preprocessing**: Sentences filtered by length (5-50 tokens)
- **Tokenization**: Custom word-level tokenizer with special tokens (`<PAD>`, `<UNK>`, `<START>`, `<END>`)
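A simplified stand-in for such a word-level tokenizer with the four special tokens (lowercasing and whitespace splitting are assumptions; the real pickled tokenizers are built from the 2M-pair corpus):

```python
from collections import Counter

# Simplified word-level tokenizer sketch with the four special tokens.
PAD, UNK, START, END = "<PAD>", "<UNK>", "<START>", "<END>"

class WordTokenizer:
    def __init__(self, corpus, max_vocab=30_000):
        counts = Counter(w for sent in corpus for w in sent.lower().split())
        keep = [w for w, _ in counts.most_common(max_vocab - 4)]
        self.itos = [PAD, UNK, START, END] + keep
        self.stoi = {w: i for i, w in enumerate(self.itos)}

    def encode(self, sentence):
        """Map words to IDs, falling back to <UNK> for out-of-vocabulary words."""
        ids = [self.stoi.get(w, self.stoi[UNK]) for w in sentence.lower().split()]
        return [self.stoi[START]] + ids + [self.stoi[END]]

tok = WordTokenizer(["wo ist der bahnhof", "das ist gut"])
print(tok.encode("wo ist das"))  # [2, 5, 4, 8, 3]
```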

## Performance

**Training Results (5 epochs):**
- Training Loss: 4.0949 → 3.1843 (≈22% relative reduction)
- Validation Loss: 4.1918 → 3.8537 (≈8% relative reduction)
- Training Device: Apple Silicon (MPS)
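Since the loss is per-token cross-entropy, it reads more naturally as perplexity (the exponential of the loss): training perplexity falls from roughly 60 to 24, validation from roughly 66 to 47.

```python
import math

# Per-token cross-entropy loss converted to perplexity (exp of the loss).
losses = {"train start": 4.0949, "train end": 3.1843,
          "val start": 4.1918, "val end": 3.8537}
for name, loss in losses.items():
    print(f"{name}: perplexity ~ {math.exp(loss):.1f}")
```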

## Usage

### Quick Start

```python
# This is a custom PyTorch model, not a Transformers model
# Download the files and use with the provided inference script

import requests
from pathlib import Path

# Download model files
base_url = "https://huggingface.co/sumitdotml/seq2seq-de-en/resolve/main"
files = ["best_model.pt", "german_tokenizer.pkl", "english_tokenizer.pkl"]

for file in files:
    response = requests.get(f"{base_url}/{file}")
    response.raise_for_status()  # fail fast on a bad download
    Path(file).write_bytes(response.content)
    print(f"Downloaded {file}")
```

### Translation Examples

```bash
# Interactive mode
python inference.py --interactive

# Single translation
python inference.py --sentence "Hallo, wie geht es dir?" --verbose

# Demo mode
python inference.py
```

**Example Translations:**
- `"Das ist ein gutes Buch."` → `"this is a good idea."`
- `"Wo ist der Bahnhof?"` → `"where is the <UNK>"`
- `"Ich liebe Deutschland."` → `"i share."`

## Files Included

- `best_model.pt`: PyTorch model checkpoint (trained weights + architecture)
- `german_tokenizer.pkl`: German vocabulary and tokenization logic
- `english_tokenizer.pkl`: English vocabulary and tokenization logic

## Installation & Setup

1. **Clone the repository:**
   ```bash
   git clone https://github.com/sumitdotml/seq2seq
   cd seq2seq
   ```

2. **Set up environment:**
   ```bash
   uv venv && source .venv/bin/activate  # or python -m venv .venv
   uv pip install torch requests tqdm    # or pip install torch requests tqdm
   ```

3. **Download model:**
   ```bash
   python scripts/download_pretrained.py
   ```

4. **Start translating:**
   ```bash
   python scripts/inference.py --interactive
   ```

## Model Architecture Details

The model uses a custom implementation with these components:

- **Encoder** (`src/models/encoder.py`): LSTM-based encoder with embedding layer
- **Decoder** (`src/models/decoder.py`): LSTM-based decoder with attention-free architecture  
- **Seq2Seq** (`src/models/seq2seq.py`): Main model combining encoder-decoder with generation logic

## Limitations

- **Vocabulary constraints**: Limited to 30k German / 25k English words
- **Training data**: Only 2M sentence pairs (vs 35M in full WMT19)
- **No attention mechanism**: Basic encoder-decoder without attention
- **Simple tokenization**: Word-level tokenization without subword units
- **Translation quality**: Suitable for basic phrases, struggles with complex sentences

## Training Details

**Environment:**
- Framework: PyTorch 2.0+
- Device: Apple Silicon (MPS acceleration)  
- Training schedule: 5 epochs
- Validation strategy: Hold-out validation set

**Optimization:**
- Optimizer: Adam (lr=0.0003)
- Loss function: CrossEntropyLoss (ignoring padding)
- Gradient clipping: 1.0
- Scheduler: StepLR (step_size=3, gamma=0.5)
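Putting those settings together, one training step with teacher forcing could be sketched as below. Here `model` is a toy stand-in that ignores the source sentence; the real model consumes both the German source and the shifted English target.

```python
import torch
import torch.nn as nn

# One training step matching the settings above: teacher forcing,
# CrossEntropyLoss ignoring padding, gradient clipping at 1.0, Adam + StepLR.
PAD_ID = 0  # assumed padding index
model = nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 100))  # toy stand-in
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)  # skip padding positions

def train_step(src, tgt):
    # Teacher forcing: feed the gold target shifted right, predict it shifted left.
    logits = model(tgt[:, :-1])  # the real model also consumes src
    loss = criterion(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at 1.0
    optimizer.step()
    return loss.item()

tgt = torch.randint(1, 100, (4, 10))  # toy batch of target token IDs
loss = train_step(None, tgt)
scheduler.step()  # halves the learning rate every 3 epochs
```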

## Reproduce Training

```bash
# Full training pipeline
python scripts/data_preparation.py      # Download WMT19 data
python src/data/tokenization.py        # Build vocabularies  
python scripts/train.py                # Train model

# For full dataset training, modify data_preparation.py:
# use_full_dataset = True  # Line 133-134
```

## Citation

If you use this model, please cite:

```bibtex
@misc{seq2seq-de-en,
  author = {sumitdotml},
  title = {German-English Seq2Seq Translation Model},
  year = {2025},
  url = {https://huggingface.co/sumitdotml/seq2seq-de-en},
  note = {PyTorch implementation of sequence-to-sequence translation}
}
```

## References

- Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. NeurIPS.
- WMT19 Translation Task: https://huggingface.co/datasets/wmt/wmt19

## License

MIT License - See repository for full license text.

## Contact

For questions about this model or training code, please open an issue in the [GitHub repository](https://github.com/sumitdotml/seq2seq).