---
language:
- is
---

# Icelandic-English Translation with GPT-2
## Overview

This project implements a language translation model using GPT-2, capable of translating between Icelandic and English. The pipeline includes data preprocessing, model training, evaluation, and an interactive user interface for translations.
## Features

- **Text Preprocessing:** Tokenization and padding for uniform input size.
- **Model Training:** Fine-tuning a GPT-2 model on paired Icelandic-English sentences.
- **Evaluation:** Perplexity-based validation of model performance.
- **Interactive Interface:** An easy-to-use widget for real-time translations.
## Installation

### Prerequisites

Ensure you have the following installed:

- Python (>= 3.8)
- PyTorch
- Transformers library by Hugging Face
- ipywidgets (for the translation interface)
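If you are assembling `requirements.txt` yourself, the prerequisites above correspond to a minimal file along these lines (the version pins are illustrative assumptions, not taken from the project):

```text
torch>=1.13
transformers>=4.25
ipywidgets>=8.0
```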
### Steps

1. Clone the repository:

   ```bash
   git clone <repository_url>
   cd <repository_name>
   ```

2. Install the required libraries:

   ```bash
   pip install -r requirements.txt
   ```

3. Ensure GPU availability for faster training (optional but recommended).
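Step 3 can be verified with a quick check; training simply runs (more slowly) on CPU when no GPU is visible:

```python
import torch

# True when a CUDA-capable GPU and a matching driver are visible to PyTorch.
print(torch.cuda.is_available())

# Typical device-selection line a training script would use.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
```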
## Usage

### Training the Model

1. Prepare your dataset with English-Icelandic sentence pairs.

2. Run the script to preprocess the data and train the model:

   ```bash
   python train_model.py
   ```

3. The trained model and tokenizer will be saved in the `./trained_gpt2` directory.
### Evaluating the Model

Evaluate the trained model using validation data:

```bash
python evaluate_model.py
```

The script computes perplexity to measure model performance.
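Perplexity is the exponential of the mean cross-entropy loss, so the metric itself is a one-liner; in `evaluate_model.py` the mean loss would come from something like `model(input_ids, labels=input_ids).loss` averaged over the validation set (an assumption about the script, not quoted from it):

```python
import math

def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity is exp(mean negative log-likelihood), in nats."""
    return math.exp(mean_cross_entropy)

# Lower is better: a loss of 0 means the model is certain of every token.
print(perplexity(0.0))               # 1.0
print(round(perplexity(2.0), 3))     # 7.389
```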
### Running the Interactive Interface

1. Launch Jupyter Notebook or JupyterLab.

2. Open the file `interactive_translation.ipynb`.

3. Enter a sentence in English or Icelandic and view the translation in real time.
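A minimal sketch of such a widget, assuming ipywidgets inside the notebook; `translate` is a hypothetical stand-in for whatever translation function the notebook actually defines:

```python
import ipywidgets as widgets
from IPython.display import display

def translate(text: str) -> str:
    # Placeholder: the notebook would call the trained model here.
    return f"(translation of {text!r})"

# continuous_update=False -> value updates when the user presses Enter or
# leaves the field, not on every keystroke.
box = widgets.Text(description="Sentence:", continuous_update=False)
output = widgets.Output()

def handle(change):
    with output:
        output.clear_output()
        print(translate(change["new"]))

box.observe(handle, names="value")
display(box, output)
```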
## File Structure

- `train_model.py`: Data preprocessing, model training, and saving.
- `evaluate_model.py`: Evaluates model performance using perplexity.
- `interactive_translation.ipynb`: Interactive interface for testing translations.
- `requirements.txt`: List of required Python packages.
- `trained_gpt2/`: Directory for the trained model and tokenizer.
## Key Parameters

- **Max Length:** Maximum token length for inputs (default: 128).
- **Learning Rate:** .
- **Batch Size:** 4 (both training and validation).
- **Epochs:** 10.
- **Beam Search:** Used for generating translations, with a beam size of 5.
## Future Improvements

- Expand the dataset to include additional language pairs.
- Optimize the model for faster inference.
- Integrate the application into a web-based interface.

## Acknowledgements

- Hugging Face for providing the GPT-2 model and libraries.
- PyTorch for enabling seamless implementation and training.

## License

This project is licensed under the MIT License. See the LICENSE file for details.