---
license: mit
datasets:
- wmt/wmt19
language:
- en
- de
pipeline_tag: translation
---
# Seq2Seq German-English Translation Model
A sequence-to-sequence neural machine translation model that translates German text to English, built in PyTorch with an LSTM encoder-decoder architecture.
## Model Description
This model implements the classic seq2seq architecture from [Sutskever et al. (2014)](https://arxiv.org/abs/1409.3215) for German-English translation:
- **Encoder**: 2-layer LSTM that processes German input sequences
- **Decoder**: 2-layer LSTM that generates English output sequences
- **Training Strategy**: Teacher forcing during training, autoregressive generation during inference
- **Vocabulary**: 30k German words, 25k English words
- **Dataset**: Trained on 2M sentence pairs from WMT19 (subset of full 35M dataset)
## Model Architecture
```
German Input → Embedding → LSTM Encoder → Context Vector → LSTM Decoder → Linear Projection → English Output
```
**Hyperparameters:**
- Embedding size: 256
- Hidden size: 512
- LSTM layers: 2 (both encoder/decoder)
- Dropout: 0.3
- Batch size: 64
- Learning rate: 0.0003
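Read together, these hyperparameters imply roughly the following PyTorch skeleton (a sketch only; class names and defaults are illustrative, not the repository's actual code in `src/models/`):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """2-layer LSTM encoder over German token ids."""
    def __init__(self, vocab_size=30_000, emb_size=256, hidden_size=512,
                 num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers,
                            dropout=dropout, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) German token ids
        _, (hidden, cell) = self.lstm(self.embedding(src))
        return hidden, cell  # the fixed-size "context vector"

class Decoder(nn.Module):
    """2-layer LSTM decoder producing English vocabulary logits."""
    def __init__(self, vocab_size=25_000, emb_size=256, hidden_size=512,
                 num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers,
                            dropout=dropout, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, hidden, cell):
        # tgt: (batch, tgt_len) English token ids (teacher forcing)
        output, (hidden, cell) = self.lstm(self.embedding(tgt), (hidden, cell))
        return self.out(output), hidden, cell
```

With teacher forcing, the decoder is fed the gold English tokens shifted by one position; at inference time it instead consumes its own previous prediction.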
## Training Data
- **Dataset**: WMT19 German-English Translation Task
- **Size**: 2M sentence pairs (filtered subset)
- **Preprocessing**: Sentences filtered by length (5-50 tokens)
- **Tokenization**: Custom word-level tokenizer with special tokens (`<PAD>`, `<UNK>`, `<START>`, `<END>`)
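A word-level tokenizer of this kind can be sketched as follows (illustrative only; the pickled tokenizers in this repository may order the vocabulary or handle casing differently):

```python
from collections import Counter

class WordTokenizer:
    """Minimal word-level tokenizer with the special tokens above."""
    SPECIALS = ["<PAD>", "<UNK>", "<START>", "<END>"]

    def __init__(self, texts, max_vocab=30_000):
        # Keep the most frequent words, reserving slots for specials.
        counts = Counter(w for t in texts for w in t.lower().split())
        words = [w for w, _ in counts.most_common(max_vocab - len(self.SPECIALS))]
        self.itos = self.SPECIALS + words
        self.stoi = {w: i for i, w in enumerate(self.itos)}

    def encode(self, text):
        unk = self.stoi["<UNK>"]
        ids = [self.stoi.get(w, unk) for w in text.lower().split()]
        return [self.stoi["<START>"]] + ids + [self.stoi["<END>"]]

    def decode(self, ids):
        return " ".join(self.itos[i] for i in ids
                        if self.itos[i] not in self.SPECIALS)
```

Out-of-vocabulary words map to `<UNK>`, which is why rarer inputs surface as `<UNK>` in the example translations further down.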
## Performance
**Training Results (5 epochs):**
- Training loss: 4.0949 → 3.1843 (~22% relative reduction)
- Validation loss: 4.1918 → 3.8537 (~8% relative reduction)
- Training Device: Apple Silicon (MPS)
## Usage
### Quick Start
```python
# This is a custom PyTorch model, not a Transformers model
# Download the files and use with the provided inference script
import requests
from pathlib import Path
# Download model files
base_url = "https://huggingface.co/sumitdotml/seq2seq-de-en/resolve/main"
files = ["best_model.pt", "german_tokenizer.pkl", "english_tokenizer.pkl"]
for file in files:
    response = requests.get(f"{base_url}/{file}")
    response.raise_for_status()
    Path(file).write_bytes(response.content)
    print(f"Downloaded {file}")
```
### Translation Examples
```bash
# Interactive mode
python inference.py --interactive
# Single translation
python inference.py --sentence "Hallo, wie geht es dir?" --verbose
# Demo mode
python inference.py
```
**Example Translations:**
- `"Das ist ein gutes Buch."` → `"this is a good idea."`
- `"Wo ist der Bahnhof?"` → `"where is the <UNK>"`
- `"Ich liebe Deutschland."` → `"i share."`
## Files Included
- `best_model.pt`: PyTorch model checkpoint (trained weights + architecture)
- `german_tokenizer.pkl`: German vocabulary and tokenization logic
- `english_tokenizer.pkl`: English vocabulary and tokenization logic
## Installation & Setup
1. **Clone the repository:**
```bash
git clone https://github.com/sumitdotml/seq2seq
cd seq2seq
```
2. **Set up environment:**
```bash
uv venv && source .venv/bin/activate # or python -m venv .venv
uv pip install torch requests tqdm # or pip install torch requests tqdm
```
3. **Download model:**
```bash
python scripts/download_pretrained.py
```
4. **Start translating:**
```bash
python scripts/inference.py --interactive
```
## Model Architecture Details
The model uses a custom implementation with these components:
- **Encoder** (`src/models/encoder.py`): LSTM-based encoder with embedding layer
- **Decoder** (`src/models/decoder.py`): LSTM-based decoder with attention-free architecture
- **Seq2Seq** (`src/models/seq2seq.py`): Main model combining encoder-decoder with generation logic
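Without attention, generation reduces to encoding the source once and then feeding the decoder its own previous prediction one step at a time. A minimal greedy-decoding sketch, assuming an encoder that returns `(hidden, cell)` and a decoder that maps `(tokens, hidden, cell)` to `(logits, hidden, cell)` (names are illustrative, not the repository's API):

```python
import torch

def greedy_decode(encoder, decoder, src_ids, start_id, end_id, max_len=50):
    """Greedily translate one source sentence; src_ids: (1, src_len)."""
    encoder.eval()
    decoder.eval()
    with torch.no_grad():
        hidden, cell = encoder(src_ids)        # fixed-size context vector
        token = torch.tensor([[start_id]])     # generation begins at <START>
        out_ids = []
        for _ in range(max_len):
            logits, hidden, cell = decoder(token, hidden, cell)
            # Pick the most probable next token (greedy, no beam search).
            token = logits[:, -1].argmax(dim=-1, keepdim=True)
            if token.item() == end_id:
                break
            out_ids.append(token.item())
    return out_ids
```

Beam search would likely improve output quality, but greedy decoding matches the simplicity of the attention-free architecture.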
## Limitations
- **Vocabulary constraints**: Limited to 30k German / 25k English words
- **Training data**: Only 2M sentence pairs (vs 35M in full WMT19)
- **No attention mechanism**: Basic encoder-decoder without attention
- **Simple tokenization**: Word-level tokenization without subword units
- **Translation quality**: Suitable for basic phrases, struggles with complex sentences
## Training Details
**Environment:**
- Framework: PyTorch 2.0+
- Device: Apple Silicon (MPS acceleration)
- Training duration: 5 epochs
- Validation strategy: Hold-out validation set
**Optimization:**
- Optimizer: Adam (lr=0.0003)
- Loss function: CrossEntropyLoss (ignoring padding)
- Gradient clipping: 1.0
- Scheduler: StepLR (step_size=3, gamma=0.5)
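Together, these choices correspond to a training step along the following lines (a sketch with illustrative names; `PAD_ID` stands in for the actual index of `<PAD>` in the English vocabulary):

```python
import torch
import torch.nn as nn

PAD_ID = 0  # assumed index of <PAD>; check the English tokenizer

def make_optimization(model):
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)  # skip padding positions
    return optimizer, scheduler, criterion

def train_step(model, optimizer, criterion, src, tgt):
    """One teacher-forced step: decoder sees tgt[:, :-1], predicts tgt[:, 1:]."""
    model.train()
    optimizer.zero_grad()
    logits = model(src, tgt[:, :-1])           # (batch, tgt_len - 1, vocab)
    loss = criterion(logits.reshape(-1, logits.size(-1)),
                     tgt[:, 1:].reshape(-1))
    loss.backward()
    # Clip gradient norm at 1.0 to stabilize LSTM training.
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return loss.item()
```

`scheduler.step()` would be called once per epoch, halving the learning rate every 3 epochs.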
## Reproduce Training
```bash
# Full training pipeline
python scripts/data_preparation.py # Download WMT19 data
python src/data/tokenization.py # Build vocabularies
python scripts/train.py # Train model
# For full dataset training, modify data_preparation.py:
# use_full_dataset = True # Line 133-134
```
## Citation
If you use this model, please cite:
```bibtex
@misc{seq2seq-de-en,
  author = {sumitdotml},
  title  = {German-English Seq2Seq Translation Model},
  year   = {2025},
  url    = {https://huggingface.co/sumitdotml/seq2seq-de-en},
  note   = {PyTorch implementation of sequence-to-sequence translation}
}
```
## References
- Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. NeurIPS.
- WMT19 Translation Task: https://huggingface.co/datasets/wmt/wmt19
## License
MIT License - See repository for full license text.
## Contact
For questions about this model or training code, please open an issue in the [GitHub repository](https://github.com/sumitdotml/seq2seq). |