# RNN-based Neural Machine Translation (NMT)

A PyTorch implementation of an RNN-based Neural Machine Translation system for Chinese-to-English translation, featuring an LSTM encoder-decoder architecture with attention mechanisms.

## Introduction

This repository implements an RNN-based Neural Machine Translation system with the following key components:

**Model**: An LSTM encoder-decoder model in which both the encoder and the decoder consist of unidirectional layers.

**Attention mechanism**: An attention mechanism supporting several alignment functions (dot-product, multiplicative, and additive), used to investigate their impact on model performance.
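
For orientation, the three alignment functions can be sketched in PyTorch roughly as follows. This is an illustrative sketch, not the repository's exact code; tensor names and shapes are assumptions (decoder state `s` of shape `(batch, hidden)`, encoder outputs `H` of shape `(batch, src_len, hidden)`).

```python
import torch
import torch.nn as nn

def dot_product_score(s, H):
    # score_i = s . h_i
    return torch.bmm(H, s.unsqueeze(2)).squeeze(2)            # (batch, src_len)

class MultiplicativeScore(nn.Module):
    # score_i = s^T W h_i
    def __init__(self, hidden):
        super().__init__()
        self.W = nn.Linear(hidden, hidden, bias=False)

    def forward(self, s, H):
        return torch.bmm(self.W(H), s.unsqueeze(2)).squeeze(2)

class AdditiveScore(nn.Module):
    # Bahdanau-style: score_i = v^T tanh(W_s s + W_h h_i)
    def __init__(self, hidden):
        super().__init__()
        self.W_s = nn.Linear(hidden, hidden, bias=False)
        self.W_h = nn.Linear(hidden, hidden, bias=False)
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, s, H):
        return self.v(torch.tanh(self.W_s(s).unsqueeze(1) + self.W_h(H))).squeeze(2)

# In all three cases the attention weights and context vector are computed as
# alpha = softmax(scores, dim=1) and context = (alpha.unsqueeze(1) @ H).squeeze(1).
```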

**Training policy**: A comparison of the Teacher Forcing and Free Running training strategies.
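
The difference between the two strategies is easiest to see inside the decoder loop. The sketch below is illustrative only; `decoder_step`, `targets`, and `bos_id` are hypothetical names, not the repository's API.

```python
import random
import torch

def run_decoder(decoder_step, targets, bos_id, teacher_forcing_ratio=1.0):
    """Illustrative training-time decoder loop.

    With probability `teacher_forcing_ratio` the gold token is fed at the next
    step (Teacher Forcing); otherwise the model's own prediction is fed back
    (Free Running). A ratio of 1.0 is pure Teacher Forcing, 0.0 is pure Free Running.
    """
    batch_size, tgt_len = targets.shape
    inp = torch.full((batch_size,), bos_id, dtype=torch.long)
    step_logits = []
    for t in range(tgt_len):
        logits = decoder_step(inp)                 # (batch, vocab); hypothetical one-step decoder
        step_logits.append(logits)
        if random.random() < teacher_forcing_ratio:
            inp = targets[:, t]                    # feed the gold token
        else:
            inp = logits.argmax(dim=-1)            # feed the model's own prediction
    return torch.stack(step_logits, dim=1)         # (batch, tgt_len, vocab)
```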

**Decoding policy**: A comparison of the greedy and beam-search decoding strategies.
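
Greedy decoding picks the single most probable token at every step, while beam search keeps the `beam_size` best partial hypotheses ranked by accumulated log-probability. A minimal greedy sketch (the one-step `decoder_step` function and the special-token ids are placeholders):

```python
import torch

def greedy_decode(decoder_step, bos_id, eos_id, max_len=100):
    """Illustrative greedy decoding for a single source sentence."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = decoder_step(torch.tensor(tokens))   # hypothetical: logits for the next token
        next_id = int(logits.argmax())
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop <bos>

# Beam search replaces the single running hypothesis with `beam_size` hypotheses:
# at each step every hypothesis is expanded, all candidates are ranked by total
# log-probability, and only the top `beam_size` are kept until they end in <eos>.
```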

### Key Features

- **Encoder**: Unidirectional LSTM encoder for source language (Chinese)
- **Decoder**: Unidirectional LSTM decoder with attention mechanism for target language (English)
- **Attention Types**: 
  - Dot-product attention
  - Multiplicative attention
  - Additive attention (Bahdanau-style)
- **Tokenization**:
  - Chinese: Jieba word segmentation
  - English: SentencePiece subword tokenization
- **Training Strategies**:
  - Teacher Forcing (configurable ratio)
  - Free Running
- **Decoding Strategies**:
  - Greedy decoding
  - Beam search decoding (configurable beam size)

## Data Preparation

The compressed package contains four JSONL files, corresponding respectively to the small training set, large training set, validation set, and test set, with 10k, 100k, 500, and 200 samples. Each line of a JSONL file contains one parallel sentence pair. Final model performance is evaluated on the test set.

### Data Format

Each line in the JSONL files follows this format:
```json
{"chinese": "δΈ­ζ–‡ε₯子", "english": "English sentence"}
```
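
For reference, a file in this format can be loaded line by line with the standard `json` module; the snippet below is a small sketch using the validation split:

```python
import json

pairs = []
with open("translation_dataset_zh_en/dev.jsonl", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)                       # one parallel pair per line
        pairs.append((example["chinese"], example["english"]))

print(len(pairs), pairs[0])
```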

### Data Directory Structure

```
translation_dataset_zh_en/
├── train_small.jsonl      # 10k samples
├── train_large.jsonl      # 100k samples
├── dev.jsonl              # 500 samples
└── test.jsonl             # 200 samples
```

### Preprocessing

The data preprocessing pipeline includes:
1. Chinese text segmentation using Jieba
2. English text tokenization using SentencePiece
3. Vocabulary construction with frequency cutoff
4. Sentence padding and batching
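
A rough sketch of steps 1 and 2 is shown below; the SentencePiece model file name is hypothetical, and the repository's actual pipeline lives in `utils/preprocess_data.py`.

```python
import jieba
import sentencepiece as spm

# Step 1: Chinese word segmentation with Jieba (returns a list of words).
zh_tokens = jieba.lcut("我喜欢机器翻译")

# Step 2: English subword tokenization with a trained SentencePiece model
# ("en_bpe.model" is a hypothetical path to the trained model file).
sp = spm.SentencePieceProcessor(model_file="en_bpe.model")
en_pieces = sp.encode("I like machine translation", out_type=str)
```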

## Environment

### Requirements

- **Python**: Python 3.9.25
- **PyTorch**: torch 2.0.1+cu118 (or compatible version)
- **CUDA**: CUDA 11.8 (optional, for GPU acceleration)

### Installation

1. Clone the repository:
```bash
git clone <repository-url>
cd RNN_NMT
```

2. Install dependencies:
```bash
pip install -r requirement.txt
```

3. Download NLTK data (required for BLEU score calculation):
```python
import nltk
nltk.download('punkt')
```

### Dependencies

Key dependencies include:
- `torch>=1.12.0` - Deep learning framework
- `numpy>=1.21.0` - Numerical computing
- `hydra-core>=1.3.0` - Configuration management
- `omegaconf>=2.2.0` - Configuration objects
- `sentencepiece>=0.1.96` - English subword tokenization
- `jieba>=0.42.1` - Chinese word segmentation
- `nltk>=3.7` - BLEU score evaluation
- `tqdm>=4.62.0` - Progress bars

## Training and Evaluation

### Training

Train the model using the default configuration:

```bash
python train.py
```

The training script uses Hydra for configuration management. You can override configuration parameters via command line:

```bash
python train.py attention_type=additive teacher_forcing_ratio=0.7 decoding_strategy=beam-search beam_size=5
```

### Configuration

Main training parameters can be configured in `configs/train.yaml`:

- `attention_type`: "dot-product", "multiplicative", or "additive"
- `teacher_forcing_ratio`: Ratio for teacher forcing (0.0-1.0)
- `decoding_strategy`: "greedy" or "beam-search"
- `beam_size`: Beam size for beam search (default: 5)
- `learning_rate`: Initial learning rate (default: 5e-5)
- `batch_size`: Batch size (default: 128)
- `max_epochs`: Maximum training epochs (default: 50)
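
For illustration, a `configs/train.yaml` assembling these parameters might look like the snippet below; the first three values are arbitrary choices here, and the exact keys and defaults should be checked against the file shipped with the repository.

```yaml
attention_type: dot-product        # dot-product | multiplicative | additive
teacher_forcing_ratio: 1.0         # 1.0 = always feed gold tokens
decoding_strategy: greedy          # greedy | beam-search
beam_size: 5
learning_rate: 5e-5
batch_size: 128
max_epochs: 50
```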

### Evaluation

Evaluate a trained model on the test set:

```bash
python eval.py
```

Or with custom parameters:

```bash
python eval.py --model_path <path_to_model> --data_path <path_to_data> --decoding_strategy beam-search --beam_size 5
```

Alternatively, you can use `inference.py` directly (same functionality):

```bash
python inference.py --model_path <path_to_model> --data_path <path_to_data> --decoding_strategy beam-search --beam_size 5
```

The evaluation script will output:
- Perplexity (PPL) on test set
- BLEU-1, BLEU-2, BLEU-3, BLEU-4 scores
- Detailed translation examples
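
BLEU-1 through BLEU-4 can be reproduced with NLTK by varying the n-gram weights; the sketch below is illustrative (whether corpus-level or sentence-level BLEU is used is defined in `utils/utils.py`).

```python
from nltk.translate.bleu_score import corpus_bleu

# One list of reference token lists per hypothesis (toy example).
references = [[["the", "cat", "sat", "on", "the", "mat"]]]
hypotheses = [["the", "cat", "sat", "on", "a", "mat"]]

bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0.0, 0.0, 0.0))
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(f"BLEU-1: {bleu1:.3f}, BLEU-4: {bleu4:.3f}")
```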

### Model Checkpoints

During training, the script saves:
- **Best model**: `save_dir/model_rnn_best.pt` (best validation perplexity)
- **Last model**: `save_dir/model_rnn_last.pt` (most recent checkpoint)
- **Optimizer state**: Saved alongside model files (`.optim` extension)
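
A minimal sketch of how these files are typically written with `torch.save` (the exact objects stored by `train.py` may differ):

```python
import torch

def save_checkpoint(model, optimizer, path="save_dir/model_rnn_best.pt"):
    # Model weights, e.g. save_dir/model_rnn_best.pt or model_rnn_last.pt.
    torch.save(model.state_dict(), path)
    # Optimizer state goes to a sibling file with the .optim extension.
    torch.save(optimizer.state_dict(), path + ".optim")
```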

### Resuming Training

To resume training from a checkpoint:

```yaml
# In configs/train.yaml
resume_from_model: "save_dir/model_rnn_last.pt"
```

## Project Structure

```
RNN_NMT/
├── configs/
│   └── train.yaml          # Training configuration
├── dataset/
│   └── vocab.py            # Vocabulary management
├── models/
│   ├── rnn_nmt.py          # Main NMT model
│   ├── model_embeddings.py # Embedding layers
│   └── char_decoder.py     # Character-level decoder
├── utils/
│   ├── utils.py            # Utility functions (BLEU, batching, etc.)
│   └── preprocess_data.py  # Data preprocessing
├── train.py                # Training script
├── inference.py            # Evaluation script
├── eval.py                 # Evaluation script (alias for inference.py)
├── requirement.txt         # Python dependencies
└── README.md              # This file
```

## Experimental Results

Model performance is evaluated using:
- **Perplexity (PPL)**: Lower is better
- **BLEU Score**: Higher is better (BLEU-4 as primary metric)

Training metrics are automatically saved to `training_metrics.json` for visualization and analysis.

## Acknowledgement

Thanks to the following repositories:

1. **Jieba** (Chinese word segmentation tool): [https://github.com/fxsjy/jieba](https://github.com/fxsjy/jieba)

2. **SentencePiece** (English and multilingual subword tokenization tool): [https://github.com/google/sentencepiece](https://github.com/google/sentencepiece)

3. **RNN Machine Translation**: [https://github.com/pi-tau/machine-translation](https://github.com/pi-tau/machine-translation)

## License

[Add your license information here]

## Contact

[Add your contact information here]