---
language:
- is
---
README
Overview
This project implements a language translation model using GPT-2, capable of translating between Icelandic and English. The pipeline includes data preprocessing, model training, evaluation, and an interactive user interface for translations.
Features
Text Preprocessing: Tokenization and padding for uniform input size.
Model Training: Training a GPT-2 model on paired Icelandic-English sentences.
Evaluation: Perplexity-based validation of model performance.
Interactive Interface: An easy-to-use widget for real-time translations.
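The padding step in preprocessing can be illustrated with a short sketch. The helper name and `pad_id` value below are illustrative, not taken from the project code; in practice the GPT-2 tokenizer's own padding utilities would be used. The length of 128 matches the Max Length default listed under Key Parameters.

```python
# Minimal sketch of padding token-id sequences to a uniform length.
# pad_id and the function name are illustrative; the real project
# presumably uses the tokenizer's built-in padding.

MAX_LENGTH = 128  # matches the project's default Max Length

def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=0):
    """Pad with pad_id up to max_length, or truncate longer sequences."""
    if len(token_ids) >= max_length:
        return token_ids[:max_length]
    return token_ids + [pad_id] * (max_length - len(token_ids))

batch = [[5, 6, 7], list(range(200))]
padded = [pad_or_truncate(seq) for seq in batch]
print([len(seq) for seq in padded])  # → [128, 128]
```

Uniform lengths let the sentence pairs be stacked into fixed-size tensors for batched training.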
Installation
Prerequisites
Ensure you have the following installed:
Python (>= 3.8)
PyTorch
Transformers library by Hugging Face
ipywidgets (for the translation interface)
Steps
Clone the repository:
git clone <repository_url>
cd <repository_name>
Install the required libraries:
pip install -r requirements.txt
Ensure GPU availability for faster training (optional but recommended).
Usage
Training the Model
Prepare your dataset with English-Icelandic sentence pairs.
Run the script to preprocess the data and train the model:
python train_model.py
The trained model and tokenizer will be saved in the ./trained_gpt2 directory.
Evaluating the Model
Evaluate the trained model using validation data:
python evaluate_model.py
The script computes perplexity to measure model performance.
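Perplexity is the exponential of the average per-token cross-entropy loss, so lower is better. A minimal sketch of that relationship (the loss values here are made up for illustration):

```python
import math

def perplexity(token_losses):
    """Perplexity = exp(mean per-token cross-entropy loss, in nats)."""
    return math.exp(sum(token_losses) / len(token_losses))

losses = [2.1, 1.8, 2.4, 2.0]  # hypothetical per-token losses
print(round(perplexity(losses), 2))  # → 7.96
```

In practice the losses come from the model's forward pass over the validation set rather than a hand-written list.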
Running the Interactive Interface
Launch a Jupyter Notebook or Jupyter Lab.
Open the file interactive_translation.ipynb.
Enter a sentence in English or Icelandic, and view the translation in real-time.
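Because GPT-2 is a decoder-only model with no encoder-decoder split, source and target text typically share one sequence separated by markers. The prompt format below is an assumption for illustration; the actual notebook's format may differ.

```python
# Hypothetical prompt builder for a GPT-2 translation model.
# The "English:" / "Icelandic:" markers are assumed, not confirmed
# by the project code.

def build_prompt(sentence, direction="en-is"):
    """Build a generation prompt for the given translation direction."""
    if direction == "en-is":
        src, tgt = "English", "Icelandic"
    else:
        src, tgt = "Icelandic", "English"
    return f"{src}: {sentence}\n{tgt}:"

print(build_prompt("Good morning"))  # → English: Good morning\nIcelandic:
```

The model then generates a continuation after the target-language marker, which the interface displays as the translation.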
File Structure
train_model.py: Contains code for data preprocessing, model training, and saving.
evaluate_model.py: Evaluates model performance using perplexity.
interactive_translation.ipynb: Interactive interface for testing translations.
requirements.txt: List of required Python packages.
trained_gpt2/: Directory to save trained model and tokenizer.
Key Parameters
Max Length: Maximum token length for inputs (default: 128).
Learning Rate: .
Batch Size: 4 (used for both training and validation).
Epochs: 10.
Beam Search: Used for generating translations, with a beam size of 5.
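Beam search with beam size 5 keeps the five highest-scoring partial hypotheses at each decoding step instead of greedily taking the single best token. The toy sketch below shows the mechanics on a made-up log-probability table; the real project presumably obtains scores from the model (e.g. via the Transformers `generate` method with `num_beams=5`).

```python
import math

# Toy next-token log-probabilities; in the real pipeline these come
# from the GPT-2 model at each step.
LOG_PROBS = {"a": math.log(0.6), "b": math.log(0.3), "c": math.log(0.1)}

def beam_search_step(beams, beam_size=5):
    """Expand every beam by every candidate token, keep the beam_size best."""
    candidates = []
    for tokens, score in beams:
        for tok, lp in LOG_PROBS.items():
            candidates.append((tokens + [tok], score + lp))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]

beams = [([], 0.0)]            # start with one empty hypothesis
for _ in range(2):             # two decoding steps
    beams = beam_search_step(beams)
print(beams[0][0])  # → ['a', 'a'] (the highest-probability sequence)
```

A larger beam explores more hypotheses at the cost of slower decoding, which is why the default of 5 is a common middle ground.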
Future Improvements
Expand dataset to include additional language pairs.
Optimize the model for faster inference.
Integrate the application into a web-based interface.
Acknowledgements
Hugging Face for providing the GPT-2 model and libraries.
PyTorch for enabling seamless implementation and training.
License
This project is licensed under the MIT License. See the LICENSE file for details.