---
language:
- is
---
README

Overview

This project implements a language translation model using GPT-2, capable of translating between Icelandic and English. The pipeline includes data preprocessing, model training, evaluation, and an interactive user interface for translations.

Features

Text Preprocessing: Tokenization and padding for uniform input size.

Model Training: Training a GPT-2 model on paired Icelandic-English sentences.

Evaluation: Perplexity-based validation of model performance.

Interactive Interface: An easy-to-use widget for real-time translations.
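The preprocessing feature above (tokenization plus padding to a uniform input size) can be sketched as follows. This is an illustrative helper, not the project's actual code: the `pad_or_truncate` name and the pad id of 0 are assumptions (GPT-2's tokenizer has no pad token by default), while the 128-token maximum matches the Key Parameters section below.

```python
# Sketch of padding token-id sequences to a fixed length
# (assumed helper, not the project's actual preprocessing code).

def pad_or_truncate(ids, max_length=128, pad_id=0):
    """Pad a token-id list with pad_id, or truncate it, to exactly max_length."""
    if len(ids) >= max_length:
        return ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

# With the Hugging Face tokenizer, the same effect is typically achieved via:
#   tokenizer(text, padding="max_length", truncation=True, max_length=128)
```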

Installation

Prerequisites

Ensure you have the following installed:

Python (>= 3.8)

PyTorch

Transformers library by Hugging Face

ipywidgets (for the translation interface)

Steps

Clone the repository:

git clone <repository_url>
cd <repository_name>

Install the required libraries:

pip install -r requirements.txt
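A minimal requirements.txt consistent with the prerequisites listed above might look like this (the exact contents and any version pins are assumptions, not taken from the project):

```
torch
transformers
ipywidgets
```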

Ensure GPU availability for faster training (optional but recommended).
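Whether PyTorch can actually see a GPU is easy to verify with the standard `torch.cuda` API; training code typically selects the device this way:

```python
import torch

# Select CUDA when available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")
```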

Usage

Training the Model

Prepare your dataset with English-Icelandic sentence pairs.

Run the script to preprocess the data and train the model:

python train_model.py

The trained model and tokenizer are saved to the ./trained_gpt2 directory.


Evaluating the Model

Evaluate the trained model using validation data:

python evaluate_model.py

The script computes perplexity to measure model performance.
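Perplexity is the exponential of the average per-token negative log-likelihood (the cross-entropy loss). A minimal sketch of the computation, independent of whatever evaluate_model.py does internally:

```python
import math

def perplexity(token_nlls):
    """Perplexity from a list of per-token negative log-likelihoods (in nats).

    Lower is better; a model that assigns every token probability 1/k
    has perplexity exactly k.
    """
    return math.exp(sum(token_nlls) / len(token_nlls))
```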

Running the Interactive Interface

Launch Jupyter Notebook or JupyterLab.

Open the file interactive_translation.ipynb.

Enter a sentence in English or Icelandic, and view the translation in real-time.
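The notebook's interface can be approximated with a small ipywidgets sketch. The `build_prompt` format and the callback below are illustrative assumptions; the real prompt format and model call live in interactive_translation.ipynb.

```python
def build_prompt(sentence: str) -> str:
    """Illustrative prompt format; the notebook's actual format may differ."""
    return f"Translate: {sentence}\nTranslation:"

try:
    import ipywidgets as widgets
    from IPython.display import display

    box = widgets.Text(placeholder="Enter an English or Icelandic sentence")
    out = widgets.Output()

    def on_change(change):
        with out:
            out.clear_output()
            # The real notebook would call the model here; we only show the prompt.
            print(build_prompt(change["new"]))

    box.observe(on_change, names="value")
    display(box, out)
except ImportError:
    pass  # ipywidgets is only needed when running inside Jupyter
```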

File Structure

train_model.py: Contains code for data preprocessing, model training, and saving.

evaluate_model.py: Evaluates model performance using perplexity.

interactive_translation.ipynb: Interactive interface for testing translations.

requirements.txt: List of required Python packages.

trained_gpt2/: Directory to save trained model and tokenizer.

Key Parameters

Max Length: Maximum token length for inputs (default: 128).

Learning Rate: .

Batch Size: 4 (for both training and validation).

Epochs: 10.

Beam Search: Used for generating translations, with a beam size of 5.
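The beam-search decoding described above maps directly onto Hugging Face's `generate` API. The helper below is a sketch under the assumption that the model and tokenizer are loaded from ./trained_gpt2; it is defined but deliberately not run here, since loading the model requires the trained checkpoint.

```python
import torch

def translate(model, tokenizer, sentence, max_length=128, num_beams=5):
    """Generate a translation with beam search (beam size 5, per Key Parameters)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_length=max_length,
            num_beams=num_beams,
            early_stopping=True,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example usage (not executed here; paths assume the Usage section above):
#   from transformers import GPT2LMHeadModel, GPT2Tokenizer
#   tokenizer = GPT2Tokenizer.from_pretrained("./trained_gpt2")
#   model = GPT2LMHeadModel.from_pretrained("./trained_gpt2")
#   print(translate(model, tokenizer, "Góðan daginn"))
```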

Future Improvements

Expand dataset to include additional language pairs.

Optimize the model for faster inference.

Integrate the application into a web-based interface.

Acknowledgements

Hugging Face for providing the GPT-2 model and libraries.

PyTorch for enabling seamless implementation and training.

License

This project is licensed under the MIT License. See the LICENSE file for details.