 - en
 - de
 pipeline_tag: translation
+---
+# Seq2Seq German-English Translation Model
+A sequence-to-sequence neural machine translation model that translates German text to English, built in PyTorch with an LSTM encoder-decoder architecture.
+## Model Description
+This model implements the classic seq2seq architecture from [Sutskever et al. (2014)](https://arxiv.org/abs/1409.3215) for German-English translation:
+- **Encoder**: 2-layer LSTM that processes German input sequences
+- **Decoder**: 2-layer LSTM that generates English output sequences
+- **Training Strategy**: Teacher forcing during training, autoregressive generation during inference
+- **Vocabulary**: 30k German words, 25k English words
+- **Dataset**: Trained on 2M sentence pairs from WMT19 (subset of full 35M dataset)
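
The two decoding regimes named above can be illustrated without a neural network. The following is a minimal sketch, not the repository's code: `teacher_forcing_pairs`, `greedy_decode`, and the lookup-table `toy_step` are hypothetical names, and `toy_step` stands in for "run one decoder LSTM step and take the argmax over the vocabulary". A real decoder would also carry its hidden state from step to step, which this toy omits.

```python
START, END = "<START>", "<END>"


def teacher_forcing_pairs(target_tokens):
    """Build decoder inputs and labels for one training example.

    With teacher forcing the decoder is fed the gold previous token at
    every step, so the inputs are the target shifted right by one position.
    """
    decoder_inputs = [START] + list(target_tokens)
    labels = list(target_tokens) + [END]
    return decoder_inputs, labels


def greedy_decode(step_fn, max_len=50):
    """Generate autoregressively: each prediction becomes the next input."""
    output, prev = [], START
    for _ in range(max_len):
        prev = step_fn(prev)
        if prev == END:
            break
        output.append(prev)
    return output


# Hypothetical stand-in for one decoder step followed by an argmax.
NEXT_TOKEN = {
    START: "where",
    "where": "is",
    "is": "the",
    "the": "station",
    "station": END,
}


def toy_step(prev_token):
    return NEXT_TOKEN.get(prev_token, END)


inputs, labels = teacher_forcing_pairs(["where", "is", "the", "station"])
print(inputs)  # what the decoder reads during training
print(labels)  # what the loss is computed against
print(greedy_decode(toy_step))
```

During training an early mistake cannot propagate, because the next input is always the gold token. At inference time every prediction feeds the next step, so one wrong word can derail the rest of the sentence.
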
+## Model Architecture
+```
+German Input → Embedding → LSTM Encoder → Context Vector → LSTM Decoder → Output Projection → English Output
+```
+**Hyperparameters:**
+- Embedding size: 256
+- Hidden size: 512
+- LSTM layers: 2 each for the encoder and the decoder
+- Dropout: 0.3
+- Batch size: 64
+- Learning rate: 0.0003
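
These hyperparameters give a rough idea of model size. The estimate below is a back-of-the-envelope sketch under several assumptions that the model card does not confirm: vocabularies of exactly 30,000 and 25,000 entries, unidirectional LSTMs with the parameter layout of PyTorch's `nn.LSTM`, and a single linear output layer with no weight tying. The helper `lstm_params` is a hypothetical name.

```python
def lstm_params(input_size, hidden_size, num_layers):
    """Parameter count of a unidirectional LSTM stack.

    Assumes the PyTorch nn.LSTM layout: four gates per layer, each with an
    input weight matrix, a recurrent weight matrix, and two bias vectors.
    """
    total = 0
    for layer in range(num_layers):
        in_size = input_size if layer == 0 else hidden_size
        total += 4 * (in_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    return total


EMBED, HIDDEN, LAYERS = 256, 512, 2
SRC_VOCAB, TGT_VOCAB = 30_000, 25_000  # assumed exact vocabulary sizes

encoder = SRC_VOCAB * EMBED + lstm_params(EMBED, HIDDEN, LAYERS)
decoder = (
    TGT_VOCAB * EMBED
    + lstm_params(EMBED, HIDDEN, LAYERS)
    + HIDDEN * TGT_VOCAB + TGT_VOCAB  # assumed linear projection to the vocabulary
)

print(f"encoder: {encoder:,}")
print(f"decoder: {decoder:,}")
print(f"total:   {encoder + decoder:,}")
```

Under these assumptions the embedding tables and the output projection hold most of the parameters, so vocabulary size, not LSTM depth, dominates the model size.
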
+## Training Data
+- **Dataset**: WMT19 German-English Translation Task
+- **Size**: 2M sentence pairs (filtered subset)
+- **Preprocessing**: Sentences filtered by length (5-50 tokens)
+- **Tokenization**: Custom word-level tokenizer with special tokens (`<PAD>`, `<UNK>`, `<START>`, `<END>`)
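
The preprocessing described above can be sketched in a few lines. This is an illustrative approximation, not the repository's tokenizer: the function names, the regex, the lowercasing, the ID order of the special tokens, and the choice to apply the 5-50 token filter to both sides of a pair are all assumptions.

```python
import re
from collections import Counter

SPECIAL_TOKENS = ["<PAD>", "<UNK>", "<START>", "<END>"]


def tokenize(text):
    """Lowercase and split into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(sentences, max_words):
    """Map the most frequent words to IDs, placed after the special tokens."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    words = [w for w, _ in counts.most_common(max_words)]
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + words)}


def encode(text, vocab):
    """Convert text to IDs, wrapped in <START>/<END>, with <UNK> for unseen words."""
    ids = [vocab.get(tok, vocab["<UNK>"]) for tok in tokenize(text)]
    return [vocab["<START>"]] + ids + [vocab["<END>"]]


def keep_pair(src, tgt, min_len=5, max_len=50):
    """Keep a sentence pair only if both sides have min_len..max_len tokens."""
    return all(min_len <= len(tokenize(s)) <= max_len for s in (src, tgt))


vocab = build_vocab(["Wo ist der Bahnhof?", "Das ist ein gutes Buch."], max_words=30_000)
print(encode("Wo ist das Hotel?", vocab))  # "hotel" was never seen, so it maps to <UNK>
print(keep_pair("Wo ist der Bahnhof?", "Where is the train station?"))
```

Any word outside the vocabulary collapses to a single `<UNK>` ID, which is how an `<UNK>` can end up in a translation.
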
+## Performance
+**Training Results (5 epochs):**
+- Initial Training Loss: 4.0949 → Final: 3.1843 (a decrease of 0.91, about 22%)
+- Initial Validation Loss: 4.1918 → Final: 3.8537 (a decrease of 0.34, about 8%)
+- Training Device: Apple Silicon (MPS)
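
The relative change in loss can be checked directly from the reported values. The short script below also converts the final losses to per-token perplexity, on the assumption that the reported loss is mean cross-entropy in nats (the default behaviour of PyTorch's `CrossEntropyLoss`). `relative_drop` is a hypothetical helper name.

```python
import math


def relative_drop(initial, final):
    """Fractional decrease from the initial to the final loss."""
    return (initial - final) / initial


losses = {
    "train": (4.0949, 3.1843),
    "validation": (4.1918, 3.8537),
}

for split, (initial, final) in losses.items():
    drop = relative_drop(initial, final)
    # exp(loss) is per-token perplexity if the loss is mean cross-entropy in nats.
    print(f"{split}: -{initial - final:.4f} ({drop:.1%}), perplexity {math.exp(final):.1f}")
```
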
+## Usage
+### Quick Start
+```python
+# This is a custom PyTorch model, not a Transformers model.
+# Download the files, then use them with the provided inference script.
+from pathlib import Path
+import requests
+# Download model files into the current directory
+base_url = "https://huggingface.co/sumitdotml/seq2seq-de-en/resolve/main"
+files = ["best_model.pt", "german_tokenizer.pkl", "english_tokenizer.pkl"]
+for file in files:
+    response = requests.get(f"{base_url}/{file}", timeout=60)
+    response.raise_for_status()  # fail loudly instead of saving an error page
+    Path(file).write_bytes(response.content)
+    print(f"Downloaded {file}")
+```
+### Translation Examples
+```bash
+# Run these from the repository root
+# Interactive mode
+python scripts/inference.py --interactive
+# Single translation
+python scripts/inference.py --sentence "Hallo, wie geht es dir?" --verbose
+# Demo mode
+python scripts/inference.py
+```
+**Example Translations** (unedited model output, including its errors):
+- `"Das ist ein gutes Buch."` → `"this is a good idea."`
+- `"Wo ist der Bahnhof?"` → `"where is the <UNK>"`
+- `"Ich liebe Deutschland."` → `"i share."`
+## Files Included
+- `best_model.pt`: PyTorch model checkpoint (trained weights + architecture)
+- `german_tokenizer.pkl`: German vocabulary and tokenization logic
+- `english_tokenizer.pkl`: English vocabulary and tokenization logic
+## Installation & Setup
+1. **Clone the repository:**
+   ```bash
+   git clone https://github.com/sumitdotml/seq2seq
+   cd seq2seq
+   ```
+2. **Set up environment:**
+   ```bash
+   uv venv && source .venv/bin/activate  # or python -m venv .venv
+   uv pip install torch requests tqdm    # or pip install torch requests tqdm
+   ```
+3. **Download model:**
+   ```bash
+   python scripts/download_pretrained.py
+   ```
+4. **Start translating:**
+   ```bash
+   python scripts/inference.py --interactive
+   ```
+## Model Architecture Details
+The model uses a custom implementation with these components:
+- **Encoder** (`src/models/encoder.py`): LSTM-based encoder with embedding layer
+- **Decoder** (`src/models/decoder.py`): LSTM-based decoder with an embedding layer and no attention mechanism
+- **Seq2Seq** (`src/models/seq2seq.py`): Main model combining encoder-decoder with generation logic
+## Limitations
+- **Vocabulary constraints**: Limited to 30k German / 25k English words
+- **Training data**: Only 2M sentence pairs (vs 35M in full WMT19)
+- **No attention mechanism**: Basic encoder-decoder without attention
+- **Simple tokenization**: Word-level tokenization without subword units
+- **Translation quality**: Suitable for basic phrases, struggles with complex sentences
+## Training Details
+**Environment:**
+- Framework: PyTorch 2.0+
+- Device: Apple Silicon (MPS acceleration)
+- Training length: 5 epochs
+- Validation strategy: Hold-out validation set
+**Optimization:**
+- Optimizer: Adam (lr=0.0003)
+- Loss function: CrossEntropyLoss (ignoring padding)
+- Gradient clipping: 1.0
+- Scheduler: StepLR (step_size=3, gamma=0.5)
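
The effect of these optimization settings can be reproduced in plain Python. This is a sketch of the underlying arithmetic rather than the training script: it assumes the scheduler is stepped once per epoch, that "gradient clipping: 1.0" means clipping by global L2 norm, and that the padding ID is 0. All three function names are hypothetical.

```python
import math


def step_lr(base_lr, epoch, step_size=3, gamma=0.5):
    """Learning rate under StepLR, assuming one scheduler step per epoch."""
    return base_lr * gamma ** (epoch // step_size)


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so their combined L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads]


def masked_mean_loss(token_losses, target_ids, pad_id=0):
    """Average per-token losses, skipping positions whose target is padding.

    Assumes at least one non-padding target is present.
    """
    kept = [loss for loss, tid in zip(token_losses, target_ids) if tid != pad_id]
    return sum(kept) / len(kept)


for epoch in range(5):
    print(f"epoch {epoch}: lr = {step_lr(3e-4, epoch):.6f}")

print(clip_by_global_norm([3.0, 4.0]))  # norm 5.0 is scaled down to about 1.0
print(masked_mean_loss([2.0, 4.0, 9.0], [7, 8, 0]))  # the padded position is ignored
```

With `step_size=3` and five epochs, the learning rate is halved exactly once under these assumptions, so the last two epochs run at 0.00015.
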
+## Reproduce Training
+```bash
+# Full training pipeline
+python scripts/data_preparation.py      # Download WMT19 data
+python src/data/tokenization.py        # Build vocabularies
+python scripts/train.py                # Train model
+# To train on the full dataset, edit scripts/data_preparation.py (lines 133-134):
+# use_full_dataset = True
+```
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{seq2seq-de-en,
+  author = {sumitdotml},
+  title = {German-English Seq2Seq Translation Model},
+  year = {2024},
+  url = {https://huggingface.co/sumitdotml/seq2seq-de-en},
+  note = {PyTorch implementation of sequence-to-sequence translation}
+}
+```
+## References
+- Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. NeurIPS.
+- WMT19 Translation Task: https://huggingface.co/datasets/wmt/wmt19
+## License
+MIT License - See repository for full license text.
+## Contact
+For questions about this model or training code, please open an issue in the [GitHub repository](https://github.com/sumitdotml/seq2seq).