# RNN-based Neural Machine Translation (NMT)
A PyTorch implementation of an RNN-based Neural Machine Translation system for Chinese-to-English translation, featuring an LSTM encoder-decoder architecture with attention mechanisms.
## Introduction
This repository implements an RNN-based Neural Machine Translation system with the following key components:
- **Model**: An LSTM encoder-decoder model in which both the encoder and decoder consist of unidirectional layers.
- **Attention mechanism**: An attention module supporting different alignment functions (dot-product, multiplicative, and additive), so their impact on model performance can be compared.
- **Training policy**: A comparison of the Teacher Forcing and Free Running training strategies.
- **Decoding policy**: A comparison of the greedy and beam-search decoding strategies.
### Key Features
- **Encoder**: Unidirectional LSTM encoder for source language (Chinese)
- **Decoder**: Unidirectional LSTM decoder with attention mechanism for target language (English)
- **Attention Types** (sketched after this list):
  - Dot-product attention
  - Multiplicative attention
  - Additive attention (Bahdanau-style)
- **Tokenization**:
  - Chinese: Jieba word segmentation
  - English: SentencePiece subword tokenization
- **Training Strategies**:
  - Teacher Forcing (configurable ratio)
  - Free Running
- **Decoding Strategies**:
  - Greedy decoding
  - Beam search decoding (configurable beam size)
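The three alignment functions differ only in how the unnormalized score between the current decoder hidden state and each encoder hidden state is computed. A minimal PyTorch sketch of the three variants (class names and tensor shapes are illustrative, not the repository's actual modules):
```python
import torch
import torch.nn as nn

def dot_product_score(dec_h, enc_h):
    # dec_h: (batch, hidden), enc_h: (batch, src_len, hidden) -> scores: (batch, src_len)
    return torch.bmm(enc_h, dec_h.unsqueeze(2)).squeeze(2)

class MultiplicativeScore(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.W = nn.Linear(hidden, hidden, bias=False)   # learned bilinear map

    def forward(self, dec_h, enc_h):
        return torch.bmm(enc_h, self.W(dec_h).unsqueeze(2)).squeeze(2)

class AdditiveScore(nn.Module):                          # Bahdanau-style
    def __init__(self, hidden):
        super().__init__()
        self.W_enc = nn.Linear(hidden, hidden, bias=False)
        self.W_dec = nn.Linear(hidden, hidden, bias=False)
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, dec_h, enc_h):
        # broadcast the decoder state over source positions, then reduce to one score per position
        return self.v(torch.tanh(self.W_enc(enc_h) + self.W_dec(dec_h).unsqueeze(1))).squeeze(2)

# In all three cases: attn = softmax(scores, dim=1); context = bmm(attn.unsqueeze(1), enc_h).squeeze(1)
```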
## Data Preparation
The compressed package contains four JSONL files, corresponding respectively to the small training set, large training set, validation set, and test set, with 10k, 100k, 500, and 200 samples. Each line in a JSONL file contains one parallel sentence pair. Final model performance is evaluated on the test set.
### Data Format
Each line in the JSONL files follows this format:
```json
{"chinese": "δΈζε₯ε", "english": "English sentence"}
```
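Each file can therefore be read one JSON object per line; a minimal loader sketch (the field names match the example above):
```python
import json

def load_parallel_pairs(path):
    """Read one {"chinese": ..., "english": ...} pair per line."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            pairs.append((record["chinese"], record["english"]))
    return pairs

# e.g. pairs = load_parallel_pairs("translation_dataset_zh_en/dev.jsonl")
```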
### Data Directory Structure
```
translation_dataset_zh_en/
├── train_small.jsonl    # 10k samples
├── train_large.jsonl    # 100k samples
├── dev.jsonl            # 500 samples
└── test.jsonl           # 200 samples
```
### Preprocessing
The data preprocessing pipeline includes:
1. Chinese text segmentation using Jieba
2. English text tokenization using SentencePiece
3. Vocabulary construction with frequency cutoff
4. Sentence padding and batching
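Under assumptions about vocabulary size, special tokens, and file names (none of which are taken from the repository), the steps above might look roughly like this:
```python
from collections import Counter

import jieba
import sentencepiece as spm

# 1. Chinese segmentation with Jieba
zh_tokens = jieba.lcut("今天天气很好")   # returns a list of word strings

# 2. English subword tokenization with SentencePiece
#    (training corpus path, model prefix, and vocab_size are illustrative)
spm.SentencePieceTrainer.train(
    input="english_sentences.txt", model_prefix="en_bpe", vocab_size=8000
)
sp = spm.SentencePieceProcessor(model_file="en_bpe.model")
en_pieces = sp.encode("The weather is nice today.", out_type=str)

# 3. Vocabulary construction with a frequency cutoff
counts = Counter(tok for sent in [zh_tokens] for tok in sent)
vocab = ["<pad>", "<s>", "</s>", "<unk>"] + [w for w, c in counts.items() if c >= 2]

# 4. Pad a batch of token-id sequences to a common length
def pad_batch(seqs, pad_id=0):
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]
```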
## Environment
### Requirements
- **Python**: Python 3.9.25
- **PyTorch**: torch 2.0.1+cu118 (or compatible version)
- **CUDA**: CUDA 11.8 (optional, for GPU acceleration)
### Installation
1. Clone the repository:
```bash
git clone <repository-url>
cd RNN_NMT
```
2. Install dependencies:
```bash
pip install -r requirement.txt
```
3. Download NLTK data (required for BLEU score calculation):
```python
import nltk
nltk.download('punkt')
```
### Dependencies
Key dependencies include:
- `torch>=1.12.0` - Deep learning framework
- `numpy>=1.21.0` - Numerical computing
- `hydra-core>=1.3.0` - Configuration management
- `omegaconf>=2.2.0` - Configuration objects
- `sentencepiece>=0.1.96` - English subword tokenization
- `jieba>=0.42.1` - Chinese word segmentation
- `nltk>=3.7` - BLEU score evaluation
- `tqdm>=4.62.0` - Progress bars
## Training and Evaluation
### Training
Train the model using the default configuration:
```bash
python train.py
```
The training script uses Hydra for configuration management. You can override configuration parameters via command line:
```bash
python train.py attention_type=additive teacher_forcing_ratio=0.7 decoding_strategy=beam-search beam_size=5
```
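Such overrides work because Hydra composes `configs/train.yaml` and passes the merged config into the script. A minimal sketch of a Hydra entry point of this shape (the function body is illustrative, not the repository's actual `train.py`):
```python
import hydra
from omegaconf import DictConfig

@hydra.main(version_base=None, config_path="configs", config_name="train")
def main(cfg: DictConfig) -> None:
    # command-line overrides such as attention_type=additive are already merged into cfg
    print(cfg.attention_type, cfg.teacher_forcing_ratio, cfg.decoding_strategy)
    # ... build vocab, model, and run the training loop ...

if __name__ == "__main__":
    main()
```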
### Configuration
Main training parameters can be configured in `configs/train.yaml`:
- `attention_type`: "dot-product", "multiplicative", or "additive"
- `teacher_forcing_ratio`: Ratio for teacher forcing (0.0-1.0); see the sketch after this list
- `decoding_strategy`: "greedy" or "beam-search"
- `beam_size`: Beam size for beam search (default: 5)
- `learning_rate`: Initial learning rate (default: 5e-5)
- `batch_size`: Batch size (default: 128)
- `max_epochs`: Maximum training epochs (default: 50)
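One common way to apply `teacher_forcing_ratio` is per decoding step: with that probability the reference token is fed to the next step, otherwise the model's own prediction is. A hedged sketch (names and shapes are illustrative, not the repository's code):
```python
import random

def next_decoder_input(gold_targets, step_logits, step, teacher_forcing_ratio):
    """Pick the token fed into the next decoder step.
    gold_targets: (batch, tgt_len) reference ids; step_logits: (batch, vocab) current-step scores."""
    if random.random() < teacher_forcing_ratio:
        return gold_targets[:, step]           # teacher forcing: feed the reference token
    return step_logits.argmax(dim=-1)          # free running: feed the model's own prediction
```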
### Evaluation
Evaluate a trained model on the test set:
```bash
python eval.py
```
Or with custom parameters:
```bash
python eval.py --model_path <path_to_model> --data_path <path_to_data> --decoding_strategy beam-search --beam_size 5
```
Alternatively, you can use `inference.py` directly (same functionality):
```bash
python inference.py --model_path <path_to_model> --data_path <path_to_data> --decoding_strategy beam-search --beam_size 5
```
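The `--decoding_strategy` and `--beam_size` options select between greedy and beam-search decoding. As a rough illustration of the difference (not the repository's implementation), greedy decoding is beam search with a beam of one:
```python
def beam_search(step_fn, bos_id, eos_id, beam_size=5, max_len=50):
    """Generic beam-search sketch; greedy decoding is the special case beam_size=1.
    `step_fn(prefix)` is assumed to return a 1-D tensor of log-probabilities
    over the target vocabulary for the next token."""
    beams = [([bos_id], 0.0)]                              # (token prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos_id:                       # keep finished hypotheses as-is
                candidates.append((prefix, score))
                continue
            log_probs = step_fn(prefix)
            top_lp, top_ids = log_probs.topk(beam_size)
            for lp, tok in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((prefix + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(p[-1] == eos_id for p, _ in beams):
            break
    return beams[0][0]                                     # highest-scoring token sequence
```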
The evaluation script will output:
- Perplexity (PPL) on test set
- BLEU-1, BLEU-2, BLEU-3, BLEU-4 scores
- Detailed translation examples
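The BLEU-1 through BLEU-4 numbers can be computed with NLTK; a minimal sketch using `corpus_bleu` (the smoothing choice is an assumption):
```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_report(references, hypotheses):
    """references / hypotheses: lists of token lists (one reference per sentence)."""
    refs = [[r] for r in references]            # corpus_bleu expects a list of reference *lists*
    smooth = SmoothingFunction().method1
    weights = {
        "BLEU-1": (1.0, 0, 0, 0),
        "BLEU-2": (0.5, 0.5, 0, 0),
        "BLEU-3": (1 / 3, 1 / 3, 1 / 3, 0),
        "BLEU-4": (0.25, 0.25, 0.25, 0.25),
    }
    return {name: corpus_bleu(refs, hypotheses, weights=w, smoothing_function=smooth)
            for name, w in weights.items()}
```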
### Model Checkpoints
During training, the model saves:
- **Best model**: `save_dir/model_rnn_best.pt` (best validation perplexity)
- **Last model**: `save_dir/model_rnn_last.pt` (most recent checkpoint)
- **Optimizer state**: Saved alongside model files (`.optim` extension)
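One common way to produce such paired files (the exact layout and naming are assumptions, not necessarily what `train.py` does):
```python
import torch

def save_checkpoint(model, optimizer, path):
    # model weights and optimizer state go to separate files, mirroring the
    # .pt / .pt.optim pairing described above (exact naming is an assumption)
    torch.save(model.state_dict(), path)
    torch.save(optimizer.state_dict(), path + ".optim")
```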
### Resuming Training
To resume training from a checkpoint:
```yaml
# In configs/train.yaml
resume_from_model: "save_dir/model_rnn_last.pt"
```
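Loading such a checkpoint before the training loop starts might look like this (again a sketch under the assumed `.optim` naming above):
```python
import os
import torch

def maybe_resume(model, optimizer, resume_from_model):
    # load weights (and optimizer state, if present) before training resumes
    if resume_from_model and os.path.exists(resume_from_model):
        model.load_state_dict(torch.load(resume_from_model, map_location="cpu"))
        optim_path = resume_from_model + ".optim"
        if os.path.exists(optim_path):
            optimizer.load_state_dict(torch.load(optim_path, map_location="cpu"))
```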
## Project Structure
```
RNN_NMT/
├── configs/
│   └── train.yaml            # Training configuration
├── dataset/
│   └── vocab.py              # Vocabulary management
├── models/
│   ├── rnn_nmt.py            # Main NMT model
│   ├── model_embeddings.py   # Embedding layers
│   └── char_decoder.py       # Character-level decoder
├── utils/
│   ├── utils.py              # Utility functions (BLEU, batching, etc.)
│   └── preprocess_data.py    # Data preprocessing
├── train.py                  # Training script
├── inference.py              # Evaluation script
├── eval.py                   # Evaluation script (alias for inference.py)
├── requirement.txt           # Python dependencies
└── README.md                 # This file
```
## Experimental Results
The model performance is evaluated using:
- **Perplexity (PPL)**: Lower is better
- **BLEU Score**: Higher is better (BLEU-4 as primary metric)
Training metrics are automatically saved to `training_metrics.json` for visualization and analysis.
## Acknowledgement
Thanks to the following repositories:
1. **Jieba** (Chinese word segmentation tool): [https://github.com/fxsjy/jieba](https://github.com/fxsjy/jieba)
2. **SentencePiece** (English and multilingual subword tokenization tool): [https://github.com/google/sentencepiece](https://github.com/google/sentencepiece)
3. **RNN Machine Translation**: [https://github.com/pi-tau/machine-translation](https://github.com/pi-tau/machine-translation)
## License
[Add your license information here]
## Contact
[Add your contact information here]
|