sumitdotml committed
Commit c3a5fb5 · verified
1 Parent(s): d9c02a4

updated model card

  1. README.md +181 -1
README.md CHANGED
@@ -6,4 +6,184 @@ language:
6
  - en
7
  - de
8
  pipeline_tag: translation
9
- ---
6
  - en
7
  - de
8
  pipeline_tag: translation
9
+ ---
10
+
11
+ # Seq2Seq German-English Translation Model
12
+
13
+ A sequence-to-sequence neural machine translation model that translates German text to English, built in PyTorch with an LSTM encoder-decoder architecture.
14
+
15
+ ## Model Description
16
+
17
+ This model implements the classic seq2seq architecture from [Sutskever et al. (2014)](https://arxiv.org/abs/1409.3215) for German-English translation:
18
+
19
+ - **Encoder**: 2-layer LSTM that processes German input sequences
20
+ - **Decoder**: 2-layer LSTM that generates English output sequences
21
+ - **Training Strategy**: Teacher forcing during training, autoregressive generation during inference
22
+ - **Vocabulary**: 30k German words, 25k English words
23
+ - **Dataset**: Trained on 2M sentence pairs from WMT19 (subset of full 35M dataset)
24
+
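The training strategy listed above separates teacher forcing (training) from autoregressive generation (inference). The inference side can be sketched roughly as follows, assuming a hypothetical `step` callable that takes the previous token id and the decoder state and returns per-token scores plus the new state. The token ids, `greedy_decode`, and `toy_step` are illustrative names and are not taken from the repository.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical ids for the special tokens; the real tokenizer may differ.
START_ID, END_ID = 2, 3

StepFn = Callable[[int, Optional[int]], Tuple[List[float], int]]


def greedy_decode(step: StepFn, max_len: int = 50) -> List[int]:
    """Feed each predicted token back in until END_ID or max_len is reached."""
    tokens = [START_ID]
    state = None  # stands in for the LSTM (h, c) state
    for _ in range(max_len):
        scores, state = step(tokens[-1], state)
        # Greedy choice: index of the highest score.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == END_ID:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop START_ID


def toy_step(prev_id: int, state: Optional[int]) -> Tuple[List[float], int]:
    """Stand-in decoder that emits ids 4, 5 and then END_ID."""
    position = 0 if state is None else state
    target = [4, 5, END_ID][min(position, 2)]
    scores = [0.0] * 6
    scores[target] = 1.0
    return scores, position + 1


decoded = greedy_decode(toy_step)
```

Under teacher forcing, the loop would instead pass the reference token at each position to `step`, regardless of what the model predicted at the previous position.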
25
+ ## Model Architecture
26
+
27
+ ```
28
+ German Input → Embedding → LSTM Encoder → Context Vector → LSTM Decoder → Embedding → English Output
29
+ ```
30
+
31
+ **Hyperparameters:**
32
+ - Embedding size: 256
33
+ - Hidden size: 512
34
+ - LSTM layers: 2 (both encoder/decoder)
35
+ - Dropout: 0.3
36
+ - Batch size: 64
37
+ - Learning rate: 0.0003
38
+
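To get a feel for the model size these hyperparameters imply, the parameter count can be estimated by hand. This is a rough sketch under several assumptions that the model card does not confirm: vocabularies of exactly 30,000 and 25,000 entries (special tokens included), the `torch.nn.LSTM` parameter layout with two bias vectors per layer, no weight tying, and a single linear output layer from the hidden state to the English vocabulary.

```python
def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    """Parameters of one LSTM layer: 4 gates, input and recurrent weights, 2 biases."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)


def stacked_lstm_params(input_size: int, hidden_size: int, layers: int) -> int:
    """First layer reads embeddings; later layers read the previous hidden state."""
    first = lstm_layer_params(input_size, hidden_size)
    rest = (layers - 1) * lstm_layer_params(hidden_size, hidden_size)
    return first + rest


EMBED, HIDDEN, LAYERS = 256, 512, 2
SRC_VOCAB, TGT_VOCAB = 30_000, 25_000  # assumed exact vocabulary sizes

lstm_stack = stacked_lstm_params(EMBED, HIDDEN, LAYERS)
encoder_params = SRC_VOCAB * EMBED + lstm_stack
# Assumed output projection: Linear(HIDDEN, TGT_VOCAB) with bias.
decoder_params = TGT_VOCAB * EMBED + lstm_stack + HIDDEN * TGT_VOCAB + TGT_VOCAB
total_params = encoder_params + decoder_params

print(f"encoder={encoder_params:,} decoder={decoder_params:,} total={total_params:,}")
```

Under these assumptions the total comes to roughly 34M parameters, most of which sit in the two embedding tables and the output projection rather than in the LSTM layers.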
39
+ ## Training Data
40
+
41
+ - **Dataset**: WMT19 German-English Translation Task
42
+ - **Size**: 2M sentence pairs (filtered subset)
43
+ - **Preprocessing**: Sentences filtered by length (5-50 tokens)
44
+ - **Tokenization**: Custom word-level tokenizer with special tokens (`<PAD>`, `<UNK>`, `<START>`, `<END>`)
45
+
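The preprocessing and tokenization steps above can be illustrated with a small word-level tokenizer. This is a minimal sketch, not the repository's tokenizer: whitespace splitting, lowercasing, applying the length filter to both sides of a pair, and the function names `build_vocab`, `encode`, and `keep_pair` are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List

SPECIALS = ["<PAD>", "<UNK>", "<START>", "<END>"]


def build_vocab(sentences: Iterable[str], max_size: int) -> Dict[str, int]:
    """Keep the most frequent words, reserving the first ids for special tokens."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    kept = [word for word, _ in counts.most_common(max_size - len(SPECIALS))]
    return {token: idx for idx, token in enumerate(SPECIALS + kept)}


def encode(sentence: str, vocab: Dict[str, int]) -> List[int]:
    """Map words to ids, using <UNK> for out-of-vocabulary words."""
    unk = vocab["<UNK>"]
    ids = [vocab.get(word, unk) for word in sentence.lower().split()]
    return [vocab["<START>"]] + ids + [vocab["<END>"]]


def keep_pair(src: str, tgt: str, min_len: int = 5, max_len: int = 50) -> bool:
    """Length filter (5-50 tokens), assumed here to apply to both sentences."""
    return all(min_len <= len(s.split()) <= max_len for s in (src, tgt))


vocab = build_vocab(["wo ist der bahnhof", "das ist ein gutes buch"], max_size=12)
ids = encode("Wo ist der Flughafen", vocab)  # "flughafen" is not in the vocabulary
```

Any word outside the kept vocabulary collapses to the same `<UNK>` id, which is why rare nouns can surface as `<UNK>` in the output of a word-level model.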
46
+ ## Performance
47
+
48
+ **Training Results (5 epochs):**
49
+ - Initial Training Loss: 4.0949 → Final: 3.1843 (22% reduction)
50
+ - Initial Validation Loss: 4.1918 → Final: 3.8537 (8% reduction)
51
+ - Training Device: Apple Silicon (MPS)
52
+
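The relative change in loss can be checked directly from the reported initial and final values. The sketch below also converts the final validation loss to perplexity, which assumes the reported loss is a mean per-token cross-entropy in nats; the model card does not state this explicitly.

```python
import math


def relative_reduction(initial: float, final: float) -> float:
    """Percentage drop from the initial value to the final value."""
    return 100.0 * (initial - final) / initial


train_drop = relative_reduction(4.0949, 3.1843)
val_drop = relative_reduction(4.1918, 3.8537)

# Assumption: loss is mean per-token cross-entropy in nats.
val_perplexity = math.exp(3.8537)

print(f"train: {train_drop:.1f}%  val: {val_drop:.1f}%  val ppl: {val_perplexity:.1f}")
```

Relative to the starting values, the training loss falls by about 22% and the validation loss by about 8%. The gap between the two is worth watching for overfitting if training continues.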
53
+ ## Usage
54
+
55
+ ### Quick Start
56
+
57
+ ```python
58
+ # This is a custom PyTorch model, not a Transformers model
59
+ # Download the files and use with the provided inference script
60
+
61
+ import requests
62
+ from pathlib import Path
63
+
64
+ # Download model files
65
+ base_url = "https://huggingface.co/sumitdotml/seq2seq-de-en/resolve/main"
66
+ files = ["best_model.pt", "german_tokenizer.pkl", "english_tokenizer.pkl"]
67
+
68
+ for file in files:
69
+     response = requests.get(f"{base_url}/{file}")
70
+     Path(file).write_bytes(response.content)
71
+     print(f"Downloaded {file}")
72
+ ```
73
+
74
+ ### Translation Examples
75
+
76
+ ```bash
77
+ # Interactive mode
78
+ python scripts/inference.py --interactive
79
+
80
+ # Single translation
81
+ python scripts/inference.py --sentence "Hallo, wie geht es dir?" --verbose
82
+
83
+ # Demo mode
84
+ python scripts/inference.py
85
+ ```
86
+
87
+ **Example Translations:**
88
+ - `"Das ist ein gutes Buch."` → `"this is a good idea."`
89
+ - `"Wo ist der Bahnhof?"` → `"where is the <UNK>"`
90
+ - `"Ich liebe Deutschland."` → `"i share."`
91
+
92
+ ## Files Included
93
+
94
+ - `best_model.pt`: PyTorch model checkpoint (trained weights + architecture)
95
+ - `german_tokenizer.pkl`: German vocabulary and tokenization logic
96
+ - `english_tokenizer.pkl`: English vocabulary and tokenization logic
97
+
98
+ ## Installation & Setup
99
+
100
+ 1. **Clone the repository:**
101
+ ```bash
102
+ git clone https://github.com/sumitdotml/seq2seq
103
+ cd seq2seq
104
+ ```
105
+
106
+ 2. **Set up environment:**
107
+ ```bash
108
+ uv venv && source .venv/bin/activate # or python -m venv .venv
109
+ uv pip install torch requests tqdm # or pip install torch requests tqdm
110
+ ```
111
+
112
+ 3. **Download model:**
113
+ ```bash
114
+ python scripts/download_pretrained.py
115
+ ```
116
+
117
+ 4. **Start translating:**
118
+ ```bash
119
+ python scripts/inference.py --interactive
120
+ ```
121
+
122
+ ## Model Architecture Details
123
+
124
+ The model uses a custom implementation with these components:
125
+
126
+ - **Encoder** (`src/models/encoder.py`): LSTM-based encoder with embedding layer
127
+ - **Decoder** (`src/models/decoder.py`): LSTM-based decoder without attention
128
+ - **Seq2Seq** (`src/models/seq2seq.py`): Main model combining encoder-decoder with generation logic
129
+
130
+ ## Limitations
131
+
132
+ - **Vocabulary constraints**: Limited to 30k German / 25k English words
133
+ - **Training data**: Only 2M sentence pairs (vs 35M in full WMT19)
134
+ - **No attention mechanism**: Basic encoder-decoder without attention
135
+ - **Simple tokenization**: Word-level tokenization without subword units
136
+ - **Translation quality**: Suitable for basic phrases, struggles with complex sentences
137
+
138
+ ## Training Details
139
+
140
+ **Environment:**
141
+ - Framework: PyTorch 2.0+
142
+ - Device: Apple Silicon (MPS acceleration)
143
+ - Training duration: 5 epochs
144
+ - Validation strategy: Hold-out validation set
145
+
146
+ **Optimization:**
147
+ - Optimizer: Adam (lr=0.0003)
148
+ - Loss function: CrossEntropyLoss (ignoring padding)
149
+ - Gradient clipping: 1.0
150
+ - Scheduler: StepLR (step_size=3, gamma=0.5)
151
+
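The effect of the scheduler settings above over a 5-epoch run can be worked out with the closed-form step decay rule. This sketch assumes the scheduler is stepped once per epoch with epochs counted from 0; the helper name `step_lr` is illustrative and not part of the training script.

```python
def step_lr(base_lr: float, epoch: int, step_size: int = 3, gamma: float = 0.5) -> float:
    """Step decay: multiply the base rate by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate used in each of the 5 epochs (0-indexed).
schedule = [step_lr(0.0003, epoch) for epoch in range(5)]

for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr={lr:.6f}")
```

Under that assumption the rate stays at 0.0003 for the first three epochs and is halved once, to 0.00015, for the last two.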
152
+ ## Reproduce Training
153
+
154
+ ```bash
155
+ # Full training pipeline
156
+ python scripts/data_preparation.py # Download WMT19 data
157
+ python src/data/tokenization.py # Build vocabularies
158
+ python scripts/train.py # Train model
159
+
160
+ # For full dataset training, modify data_preparation.py:
161
+ # use_full_dataset = True # Line 133-134
162
+ ```
163
+
164
+ ## Citation
165
+
166
+ If you use this model, please cite:
167
+
168
+ ```bibtex
169
+ @misc{seq2seq-de-en,
170
+ author = {sumitdotml},
171
+ title = {German-English Seq2Seq Translation Model},
172
+ year = {2024},
173
+ url = {https://huggingface.co/sumitdotml/seq2seq-de-en},
174
+ note = {PyTorch implementation of sequence-to-sequence translation}
175
+ }
176
+ ```
177
+
178
+ ## References
179
+
180
+ - Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. NeurIPS.
181
+ - WMT19 Translation Task: https://huggingface.co/datasets/wmt/wmt19
182
+
183
+ ## License
184
+
185
+ MIT License - See repository for full license text.
186
+
187
+ ## Contact
188
+
189
+ For questions about this model or training code, please open an issue in the [GitHub repository](https://github.com/sumitdotml/seq2seq).