# Transformer-based Neural Machine Translation (NMT)
A Chinese-English neural machine translation system based on the Transformer architecture, supporting both training from scratch and fine-tuning of pre-trained models.
**Note**: The model checkpoint is too large to host on GitHub, so it has been uploaded to Hugging Face: https://huggingface.co/DarcyCheng/Transformer-based-NMT
## Introduction
This project implements a complete Transformer-based NMT system, with core tasks including:
### 1. From Scratch Training
Construct and train a Chinese-English translation model based on the Transformer architecture, including:
- Adoption of the Encoder-Decoder structure
- Model training from scratch
- Support for distributed training (multi-GPU)
### 2. Architectural Ablation
Implement and compare different architectural variants:
- **Positional Encoding Schemes**: Absolute Positional Encoding vs Relative Positional Encoding
- **Normalization Methods**: LayerNorm vs RMSNorm
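As a rough illustration of the normalization ablation, a minimal RMSNorm module might look like the sketch below (the project's actual implementation lives in `src/model.py` and may differ in details):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer normalization: rescales activations by their RMS
    with a learned gain, but without the mean-centering and bias of nn.LayerNorm."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```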
### 3. Hyperparameter Sensitivity
Evaluate the impact of hyperparameter adjustments on translation performance:
- Batch Size
- Learning Rate
- Model Scale
### 4. From Pretrained Language Model
Support fine-tuning based on pre-trained language models (e.g., T5) and compare performance with models trained from scratch.
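A rough sketch of how such fine-tuning could be wired up with the Hugging Face `transformers` library is shown below. Note that `transformers` is not listed in `requirements.txt`, and the prompt format and training loop here are assumptions for illustration only; the project's actual fine-tuning code may differ.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Hypothetical single fine-tuning step on one sentence pair.
tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-base")
model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

src = tokenizer("translate Chinese to English: 你好，世界。", return_tensors="pt")
tgt = tokenizer("Hello, world.", return_tensors="pt")

# T5 computes the cross-entropy loss internally when labels are provided.
loss = model(input_ids=src.input_ids,
             attention_mask=src.attention_mask,
             labels=tgt.input_ids).loss
loss.backward()
optimizer.step()
```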
## Data Preparation
The dataset consists of four JSONL files, corresponding to:
- **Small Training Set**: 10k samples
- **Large Training Set**: 100k samples
- **Validation Set**: 500 samples
- **Test Set**: 200 samples
Each line of a JSONL file contains one parallel Chinese-English sentence pair. Final model performance is evaluated on the test set.
Data Format Example:
```json
{"zh": "δ½ ε₯½οΌδΈηγ", "en": "Hello, world."}
```
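For reference, each file can be read line by line with the standard library; a minimal sketch (the project's actual loading logic lives in `src/dataset.py`):

```python
import json

def load_parallel_pairs(path: str):
    """Read one {"zh": ..., "en": ...} pair per line from a JSONL file."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            pairs.append((record["zh"], record["en"]))
    return pairs
```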
## Environment
### System Requirements
- **Python**: 3.9.25
- **PyTorch**: 2.0.1+cu118
- **CUDA**: 11.8 (recommended)
### Install Dependencies
```bash
pip install -r requirements.txt
```
Key dependencies include:
- `torch>=2.0.1`
- `torchvision`
- `numpy`
- `matplotlib`
- `tqdm`
- `hydra-core`
- `omegaconf`
- `sentencepiece`
- `nltk`
## Project Structure
```
Transformer_NMT/
├── src/                        # Core source code
│   ├── model.py                # Transformer model definition
│   ├── dataset.py              # Data processing and dataset classes
│   ├── utils.py                # Utility functions
│   └── visualize_training.py   # Training visualization
├── configs/                    # Configuration files
│   ├── train.yaml              # Training configuration
│   └── inference.yaml          # Inference configuration
├── checkpoints/                # Model checkpoint directory
│   └── exp_*/                  # Experiment-specific checkpoint directories
├── logs/                       # Training logs
├── outputs/                    # Output results
├── train.py                    # Training script
├── evaluation.py               # Evaluation script
└── inference.py                # Inference script
```
## Training, Evaluation, and Inference
### Train the Model
#### Single-GPU Training
```bash
python train.py
```
#### Multi-GPU Distributed Training
```bash
torchrun --nproc_per_node=<num_gpus> train.py
```
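Under the hood, `torchrun` sets the environment variables that the training script uses to initialize distributed training. A minimal sketch of the usual DDP setup pattern (the actual logic is in `train.py` and may differ):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> DDP:
    """Initialize the process group from torchrun's environment variables
    and wrap the model for distributed data-parallel training."""
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])
```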
#### Configure Training Parameters
Edit the `configs/train.yaml` file to adjust training parameters, including:
- Model architecture parameters (`D_MODEL`, `NHEAD`, `NUM_ENCODER_LAYERS`, etc.)
- Training hyperparameters (`BATCH_SIZE`, `LEARNING_RATE`, `NUM_EPOCHS`, etc.)
- Ablation experiment configurations (`POS_ENCODING_TYPE`, `NORM_TYPE`, etc.)
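Since `hydra-core` is a dependency and `evaluation.py` already accepts `model_path=...` overrides, training parameters can likely also be overridden on the command line instead of editing the file (exact key names follow `configs/train.yaml`), for example:

```bash
python train.py BATCH_SIZE=32 LEARNING_RATE=1e-4
```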
#### Ablation Experiment Configuration Examples
**Absolute Positional Encoding + LayerNorm**:
```yaml
POS_ENCODING_TYPE: absolute
NORM_TYPE: layernorm
```
**Relative Positional Encoding + RMSNorm**:
```yaml
POS_ENCODING_TYPE: relative
NORM_TYPE: rmsnorm
```
During training, checkpoints are saved automatically to the `checkpoints/exp_<experiment_name>/` directory, where the experiment name is derived from the configuration.
### Evaluate the Model
Evaluate the model's performance on the test set, outputting BLEU-1, BLEU-2, BLEU-3, BLEU-4, and Perplexity scores:
```bash
python evaluation.py model_path=<checkpoint_path>
```
Example:
```bash
python evaluation.py model_path=checkpoints/exp_abs_pos_ln_bs32_lre4/checkpoint_best.pth
```
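For reference, the reported metrics can be computed roughly as follows with `nltk` (the exact tokenization and smoothing used by `evaluation.py` may differ; the helper names here are illustrative):

```python
import math
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_n(references, hypotheses, n: int) -> float:
    """Corpus BLEU-n with uniform weights over 1..n-gram precisions.
    references: list of lists of reference token lists, e.g. [[["hello", "world"]]]
    hypotheses: list of hypothesis token lists, e.g. [["hello", "world"]]"""
    weights = tuple(1.0 / n for _ in range(n))
    smoothing = SmoothingFunction().method1
    return corpus_bleu(references, hypotheses, weights=weights,
                       smoothing_function=smoothing)

def perplexity(total_nll: float, num_tokens: int) -> float:
    """Perplexity is the exponential of the average per-token
    negative log-likelihood (cross-entropy) on the test set."""
    return math.exp(total_nll / num_tokens)
```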
### Inference for Translation
Use the trained model for single-sentence or batch translation:
```bash
python inference.py
```
Or specify the model path:
```bash
python inference.py model_path=<checkpoint_path>
```
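Internally, translation is typically performed by encoding the source sentence and then decoding target tokens one at a time. A minimal greedy-decoding sketch is shown below; identifiers such as `model.encode`, `model.decode`, `bos_id`, and `eos_id` are hypothetical, and the real decoding logic lives in `inference.py`.

```python
import torch

@torch.no_grad()
def greedy_translate(model, src_ids: torch.Tensor,
                     bos_id: int, eos_id: int, max_len: int = 128) -> list[int]:
    """Greedily decode a single source sentence (src_ids shape: [1, src_len])."""
    memory = model.encode(src_ids)              # hypothetical encoder call
    out_ids = torch.tensor([[bos_id]], device=src_ids.device)
    for _ in range(max_len - 1):
        logits = model.decode(out_ids, memory)  # hypothetical decoder call
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        out_ids = torch.cat([out_ids, next_id], dim=1)
        if next_id.item() == eos_id:            # stop once end-of-sentence is produced
            break
    return out_ids.squeeze(0).tolist()
```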
## Configuration
### Training Configuration (`configs/train.yaml`)
Key configuration items:
- **Model Parameters**:
- `D_MODEL`: Model dimension (default: 256)
- `NHEAD`: Number of attention heads (default: 8)
- `NUM_ENCODER_LAYERS`: Number of encoder layers (default: 4)
- `NUM_DECODER_LAYERS`: Number of decoder layers (default: 4)
- `DIM_FEEDFORWARD`: Feed-forward network dimension (default: 1024)
- `DROPOUT`: Dropout rate (default: 0.1)
- `MAX_LEN`: Maximum sequence length (default: 128)
- **Training Parameters**:
- `BATCH_SIZE`: Batch size (default: 64)
- `LEARNING_RATE`: Learning rate (default: 1e-5)
- `NUM_EPOCHS`: Number of training epochs (default: 30)
- `CLIP_GRAD`: Gradient clipping threshold (default: 5.0)
- `LABEL_SMOOTHING`: Label smoothing coefficient (default: 0.1)
- **Ablation Experiment Parameters**:
- `POS_ENCODING_TYPE`: Positional encoding type (`absolute` or `relative`)
- `NORM_TYPE`: Normalization type (`layernorm` or `rmsnorm`)
## Features
- ✅ Complete Transformer architecture implementation
- ✅ Support for absolute and relative positional encoding
- ✅ Support for LayerNorm and RMSNorm
- ✅ Distributed training support (DDP)
- ✅ Mixed precision training (AMP)
- ✅ Automatic experiment management and checkpoint saving
- ✅ Comprehensive evaluation metrics (BLEU-1/2/3/4, Perplexity)
- ✅ Training process visualization
- ✅ Numerical stability optimization (NaN detection and handling)
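As an illustration of the AMP and NaN-handling features above, a typical mixed-precision training step with gradient clipping and a NaN-loss guard might look like the following. This is a sketch of the general pattern under assumed batch keys (`src`, `tgt_in`, `tgt_out`), not the project's exact code.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

def train_step(model, batch, criterion, optimizer, clip_grad: float = 5.0):
    optimizer.zero_grad(set_to_none=True)
    with autocast():                        # forward pass in mixed precision
        logits = model(batch["src"], batch["tgt_in"])
        loss = criterion(logits.flatten(0, 1), batch["tgt_out"].flatten())
    if not torch.isfinite(loss):            # skip the update if the loss is NaN/Inf
        return None
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)              # unscale gradients before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```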
## Acknowledgement
Thanks to the following repositories and projects:
- [BERT-pytorch](https://github.com/codertimo/BERT-pytorch)
- [Neural-Machine-Translation-Based-on-Transformer](https://github.com/piaoranyc/Neural-Machine-Translation-Based-on-Transformer)
- [Transformer-NMT-Translation](https://github.com/Kwen-Chen/Transformer-NMT-Translation)
- [T5-base](https://huggingface.co/google-t5/t5-base/tree/main)
- [Text-to-Text Transfer Transformer](https://github.com/google-research/text-to-text-transfer-transformer)
## License
This project is for educational and research purposes only.