---
license: mit
datasets:
- wmt/wmt19
language:
- en
- de
pipeline_tag: translation
---

# Seq2Seq German-English Translation Model

A sequence-to-sequence neural machine translation model that translates German text to English, built in PyTorch with an LSTM encoder-decoder architecture.

## Model Description

This model implements the classic seq2seq architecture from [Sutskever et al. (2014)](https://arxiv.org/abs/1409.3215) for German-English translation:

- **Encoder**: 2-layer LSTM that processes German input sequences
- **Decoder**: 2-layer LSTM that generates English output sequences  
- **Training Strategy**: Teacher forcing during training, autoregressive generation during inference
- **Vocabulary**: 30k German words, 25k English words
- **Dataset**: Trained on 2M sentence pairs from WMT19 (subset of full 35M dataset)
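At inference time the decoder has no gold target to condition on, so generation is autoregressive: each predicted token is fed back as the next input. A minimal greedy-decoding sketch (the token IDs and `decoder_step` are hypothetical stand-ins, not this repository's API):

```python
# Greedy autoregressive decoding sketch; `decoder_step` is a hypothetical
# stand-in for one LSTM decoder step returning (argmax token, new state).
START, END = 2, 3  # assumed special-token IDs

def greedy_decode(decoder_step, state, max_len=50):
    """Feed each predicted token back in until <END> or max_len tokens."""
    token, output = START, []
    for _ in range(max_len):
        token, state = decoder_step(token, state)
        if token == END:
            break
        output.append(token)
    return output

# Toy decoder that emits tokens 5, 6, 7 and then <END>, ignoring state.
script = iter([5, 6, 7, END])
print(greedy_decode(lambda tok, st: (next(script), st), state=None))  # [5, 6, 7]
```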

## Model Architecture

```
German Input → Embedding → LSTM Encoder → Context Vector → LSTM Decoder → Linear Projection → English Output
```

**Hyperparameters:**
- Embedding size: 256
- Hidden size: 512
- LSTM layers: 2 (both encoder/decoder)
- Dropout: 0.3
- Batch size: 64
- Learning rate: 0.0003
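Under these hyperparameters, the encoder-decoder can be sketched in PyTorch as follows (class and layer names here are assumptions for illustration, not the repository's exact code):

```python
import torch
import torch.nn as nn

# Sketch of the 2-layer LSTM encoder-decoder with the hyperparameters above.
EMB, HID, LAYERS, DROPOUT = 256, 512, 2, 0.3

class Encoder(nn.Module):
    def __init__(self, vocab=30_000):  # 30k German words
        super().__init__()
        self.embed = nn.Embedding(vocab, EMB)
        self.lstm = nn.LSTM(EMB, HID, LAYERS, dropout=DROPOUT, batch_first=True)

    def forward(self, src):
        _, (h, c) = self.lstm(self.embed(src))
        return h, c  # final hidden/cell state acts as the context vector

class Decoder(nn.Module):
    def __init__(self, vocab=25_000):  # 25k English words
        super().__init__()
        self.embed = nn.Embedding(vocab, EMB)
        self.lstm = nn.LSTM(EMB, HID, LAYERS, dropout=DROPOUT, batch_first=True)
        self.proj = nn.Linear(HID, vocab)  # logits over the English vocabulary

    def forward(self, tgt, state):
        out, state = self.lstm(self.embed(tgt), state)
        return self.proj(out), state

enc, dec = Encoder(), Decoder()
src = torch.randint(0, 30_000, (64, 20))  # batch of German token IDs
tgt = torch.randint(0, 25_000, (64, 22))  # batch of English token IDs
logits, _ = dec(tgt, enc(src))
print(logits.shape)  # torch.Size([64, 22, 25000])
```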

## Training Data

- **Dataset**: WMT19 German-English Translation Task
- **Size**: 2M sentence pairs (filtered subset)
- **Preprocessing**: Sentences filtered by length (5-50 tokens)
- **Tokenization**: Custom word-level tokenizer with special tokens (`<PAD>`, `<UNK>`, `<START>`, `<END>`)
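A simplified stand-in for such a word-level tokenizer with the four special tokens (lowercasing and whitespace splitting are assumptions; the real pickled tokenizers are built from the 2M-pair corpus):

```python
from collections import Counter

# Simplified word-level tokenizer sketch with the four special tokens.
PAD, UNK, START, END = "<PAD>", "<UNK>", "<START>", "<END>"

class WordTokenizer:
    def __init__(self, corpus, max_vocab=30_000):
        counts = Counter(w for sent in corpus for w in sent.lower().split())
        keep = [w for w, _ in counts.most_common(max_vocab - 4)]
        self.itos = [PAD, UNK, START, END] + keep
        self.stoi = {w: i for i, w in enumerate(self.itos)}

    def encode(self, sentence):
        """Map words to IDs, falling back to <UNK> for out-of-vocabulary words."""
        ids = [self.stoi.get(w, self.stoi[UNK]) for w in sentence.lower().split()]
        return [self.stoi[START]] + ids + [self.stoi[END]]

tok = WordTokenizer(["wo ist der bahnhof", "das ist gut"])
print(tok.encode("wo ist das"))  # [2, 5, 4, 8, 3]
```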

## Performance

**Training Results (5 epochs):**
- Training Loss: 4.0949 → 3.1843 (≈22% relative reduction)
- Validation Loss: 4.1918 → 3.8537 (≈8% relative reduction)
- Training Device: Apple Silicon (MPS)
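Since the loss is per-token cross-entropy, it reads more naturally as perplexity (the exponential of the loss): training perplexity falls from roughly 60 to 24, validation from roughly 66 to 47.

```python
import math

# Per-token cross-entropy loss converted to perplexity (exp of the loss).
losses = {"train start": 4.0949, "train end": 3.1843,
          "val start": 4.1918, "val end": 3.8537}
for name, loss in losses.items():
    print(f"{name}: perplexity ~ {math.exp(loss):.1f}")
```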

## Usage

### Quick Start

```python
# This is a custom PyTorch model, not a Transformers model
# Download the files and use with the provided inference script

import requests
from pathlib import Path

# Download model files
base_url = "https://huggingface.co/sumitdotml/seq2seq-de-en/resolve/main"
files = ["best_model.pt", "german_tokenizer.pkl", "english_tokenizer.pkl"]

for file in files:
    response = requests.get(f"{base_url}/{file}")
    response.raise_for_status()  # fail fast on a bad download
    Path(file).write_bytes(response.content)
    print(f"Downloaded {file}")
```

### Translation Examples

```bash
# Interactive mode
python inference.py --interactive

# Single translation
python inference.py --sentence "Hallo, wie geht es dir?" --verbose

# Demo mode
python inference.py
```

**Example Translations:**
- `"Das ist ein gutes Buch."` → `"this is a good idea."`
- `"Wo ist der Bahnhof?"` → `"where is the <UNK>"`
- `"Ich liebe Deutschland."` → `"i share."`

## Files Included

- `best_model.pt`: PyTorch model checkpoint (trained weights + architecture)
- `german_tokenizer.pkl`: German vocabulary and tokenization logic
- `english_tokenizer.pkl`: English vocabulary and tokenization logic

## Installation & Setup

1. **Clone the repository:**
   ```bash
   git clone https://github.com/sumitdotml/seq2seq
   cd seq2seq
   ```

2. **Set up environment:**
   ```bash
   uv venv && source .venv/bin/activate  # or python -m venv .venv
   uv pip install torch requests tqdm    # or pip install torch requests tqdm
   ```

3. **Download model:**
   ```bash
   python scripts/download_pretrained.py
   ```

4. **Start translating:**
   ```bash
   python scripts/inference.py --interactive
   ```

## Model Architecture Details

The model uses a custom implementation with these components:

- **Encoder** (`src/models/encoder.py`): LSTM-based encoder with embedding layer
- **Decoder** (`src/models/decoder.py`): LSTM-based decoder with attention-free architecture  
- **Seq2Seq** (`src/models/seq2seq.py`): Main model combining encoder-decoder with generation logic

## Limitations

- **Vocabulary constraints**: Limited to 30k German / 25k English words
- **Training data**: Only 2M sentence pairs (vs 35M in full WMT19)
- **No attention mechanism**: Basic encoder-decoder without attention
- **Simple tokenization**: Word-level tokenization without subword units
- **Translation quality**: Suitable for basic phrases, struggles with complex sentences

## Training Details

**Environment:**
- Framework: PyTorch 2.0+
- Device: Apple Silicon (MPS acceleration)  
- Training schedule: 5 epochs
- Validation strategy: Hold-out validation set

**Optimization:**
- Optimizer: Adam (lr=0.0003)
- Loss function: CrossEntropyLoss (ignoring padding)
- Gradient clipping: 1.0
- Scheduler: StepLR (step_size=3, gamma=0.5)
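Putting those settings together, one training step with teacher forcing could be sketched as below. Here `model` is a toy stand-in that ignores the source sentence; the real model consumes both the German source and the shifted English target.

```python
import torch
import torch.nn as nn

# One training step matching the settings above: teacher forcing,
# CrossEntropyLoss ignoring padding, gradient clipping at 1.0, Adam + StepLR.
PAD_ID = 0  # assumed padding index
model = nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 100))  # toy stand-in
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)  # skip padding positions

def train_step(src, tgt):
    # Teacher forcing: feed the gold target shifted right, predict it shifted left.
    logits = model(tgt[:, :-1])  # the real model also consumes src
    loss = criterion(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at 1.0
    optimizer.step()
    return loss.item()

tgt = torch.randint(1, 100, (4, 10))  # toy batch of target token IDs
loss = train_step(None, tgt)
scheduler.step()  # halves the learning rate every 3 epochs
```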

## Reproduce Training

```bash
# Full training pipeline
python scripts/data_preparation.py      # Download WMT19 data
python src/data/tokenization.py        # Build vocabularies  
python scripts/train.py                # Train model

# For full dataset training, modify data_preparation.py:
# use_full_dataset = True  # Line 133-134
```

## Citation

If you use this model, please cite:

```bibtex
@misc{seq2seq-de-en,
  author = {sumitdotml},
  title = {German-English Seq2Seq Translation Model},
  year = {2025},
  url = {https://huggingface.co/sumitdotml/seq2seq-de-en},
  note = {PyTorch implementation of sequence-to-sequence translation}
}
```

## References

- Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. NeurIPS.
- WMT19 Translation Task: https://huggingface.co/datasets/wmt/wmt19

## License

MIT License - See repository for full license text.

## Contact

For questions about this model or training code, please open an issue in the [GitHub repository](https://github.com/sumitdotml/seq2seq).