# NLP Model Training with Simple RNN

## Overview

This repository contains an implementation of a simple Recurrent Neural Network (RNN) language model, trained on a cleaned dataset combined from multiple sources. Given an input sequence, the model generates a text continuation.

## Features

- **RNN Architecture**: A straightforward RNN model for text generation.
- **Tokenizer**: Uses the GPT-2 tokenizer to encode and decode text.
- **Data Processing**: Combines and cleans datasets from multiple sources into a single training corpus.
- **Checkpointing**: Saves model checkpoints regularly during training so that runs can be recovered.
- **Performance Evaluation**: Tracks training and validation losses and computes perplexity scores.

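To make the architecture concrete, here is a minimal sketch of an RNN language model in PyTorch. The class name `SimpleRNNLM`, the layer sizes, and the toy vocabulary are assumptions for illustration, not the repository's actual code.

```python
import torch
import torch.nn as nn


class SimpleRNNLM(nn.Module):
    """Hypothetical minimal RNN language model (not the repository's class)."""

    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        # tokens: (batch, seq_len) integer token ids
        x = self.embed(tokens)             # (batch, seq_len, embed_dim)
        out, hidden = self.rnn(x, hidden)  # out: (batch, seq_len, hidden_dim)
        logits = self.head(out)            # (batch, seq_len, vocab_size)
        return logits, hidden


# Toy vocabulary of 100 ids; the GPT-2 tokenizer has 50257 tokens.
model = SimpleRNNLM(vocab_size=100)
tokens = torch.randint(0, 100, (2, 8))
logits, hidden = model(tokens)

# Next-token loss: the logits at position t are scored against token t + 1.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 100), tokens[:, 1:].reshape(-1)
)
```

With the real tokenizer, `vocab_size` would be set to the tokenizer's vocabulary size rather than the toy value used here.
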
## Getting Started

### Installation

Install the required libraries:

```bash
pip install torch torchvision transformers pandas matplotlib huggingface-hub
```

### Data Preparation

The datasets used in this project are:

- **English Dataset**: Alpaca Cleaned Dataset
- **Aymara Dataset**: Aymara text data in JSON format, stored in Google Drive.

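Combining and cleaning the two sources might look roughly like the sketch below. It uses inline stand-in data instead of the real files, and the helper names `clean_text` and `build_corpus` as well as the `"text"` key of the JSON records are assumptions, since the actual record layout is not described here.

```python
import json
import re


def clean_text(text: str) -> str:
    """Collapse runs of whitespace and strip both ends."""
    return re.sub(r"\s+", " ", text).strip()


def build_corpus(*sources):
    """Merge lists of raw strings, dropping empty strings and exact duplicates."""
    seen, corpus = set(), []
    for source in sources:
        for raw in source:
            text = clean_text(raw)
            if text and text not in seen:
                seen.add(text)
                corpus.append(text)
    return corpus


# Inline stand-ins for the real datasets; the "text" key is an assumption.
english = ["Give three tips  for staying healthy.\n", "Give three tips for staying healthy."]
aymara = [rec["text"] for rec in json.loads('[{"text": " Kamisaraki "}, {"text": ""}]')]

corpus = build_corpus(english, aymara)
print(corpus)  # ['Give three tips for staying healthy.', 'Kamisaraki']
```

Deduplicating after whitespace normalization catches entries that differ only in spacing, as the two English samples above do.
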
### Training the Model

To train the model, run the main training script. Hyperparameters can be changed by editing the script.

```bash
# Start training
python train.py  # Adjust to your script's name
```

## Generated Text

After training, you can generate text with the trained model. Here is an example that continues a given context:

```python
context = "Once upon a time"
generated_text = model.generate(context, max_new_tokens=200)
print(generated_text)
```

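What a `generate` method does internally is typically an autoregressive sampling loop. The sketch below is one possible version that works on token ids rather than strings, so tokenization is left out. The function name `generate_ids`, the `temperature` parameter, and the stand-in model are assumptions, not the repository's implementation.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def generate_ids(model, context_ids, max_new_tokens, temperature=1.0):
    """Sample max_new_tokens ids, one at a time, after a 1-D tensor of context ids."""
    ids = context_ids.unsqueeze(0)                # (1, seq_len)
    for _ in range(max_new_tokens):
        logits = model(ids)                       # (1, seq_len, vocab_size)
        probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # (1, 1)
        ids = torch.cat([ids, next_id], dim=1)    # append the sampled token
    return ids.squeeze(0)


# Stand-in model so the sketch runs on its own: an embedding plus a linear layer.
vocab_size = 50
toy_model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))
context = torch.tensor([1, 2, 3])
out = generate_ids(toy_model, context, max_new_tokens=5)
```

For simplicity the loop feeds the whole sequence back in at every step. An RNN could instead carry its hidden state forward and process only the newest token.
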
## Performance Metrics

- **Training Loss**: Logged every 300 iterations.
- **Validation Loss**: Also logged every 300 iterations.
- **Perplexity**: Calculated from the validation loss to assess model performance.

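Perplexity is commonly computed as the exponential of the mean cross-entropy loss. The sketch below assumes the validation loss is a per-token average in nats, which is what PyTorch's cross-entropy loss returns by default. The helper name `perplexity` is illustrative.

```python
import math


def perplexity(mean_loss: float) -> float:
    """Perplexity for a mean cross-entropy loss measured in nats per token."""
    return math.exp(mean_loss)


print(perplexity(0.0))  # 1.0: every token predicted with full confidence

# A uniform guess over the 50257 tokens of the GPT-2 vocabulary
val_loss = math.log(50257)
print(round(perplexity(val_loss)))  # 50257
```

A lower perplexity means the model spreads its probability over fewer candidate tokens on average.
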
## Results Visualization

Training and validation losses are plotted to show how training progresses.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Read the loss data. The file name is a placeholder (assumption): point it at
# your own loss log, which must contain the three columns used below.
loss_data = pd.read_csv("loss_log.csv")

# Plot both losses against the training step on a logarithmic x-axis
plt.plot(loss_data['epoch/step'], loss_data['training_loss'], label='Training Loss')
plt.plot(loss_data['epoch/step'], loss_data['val_loss'], label='Validation Loss')
plt.xscale('log')
plt.title('Training and Validation Loss Over Iterations')
plt.xlabel('Epoch/Step')
plt.ylabel('Loss')
plt.legend()
plt.grid()
plt.show()
```

## Checkpoints

Model checkpoints are saved periodically during training. You can load them to resume training or to run inference.

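Saving and restoring a checkpoint could be sketched as follows with `torch.save` and `torch.load`. The file name, the dictionary keys, and the stand-in model are assumptions, since the repository's checkpoint layout is not described here.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the RNN model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Hypothetical file name and dictionary keys, chosen for this sketch.
path = os.path.join(tempfile.mkdtemp(), "checkpoint_step_300.pt")
torch.save(
    {"step": 300, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
    path,
)

# Restore into freshly constructed objects to resume training or run inference.
restored = nn.Linear(4, 2)
restored_opt = torch.optim.AdamW(restored.parameters(), lr=1e-3)
checkpoint = torch.load(path, map_location="cpu")
restored.load_state_dict(checkpoint["model"])
restored_opt.load_state_dict(checkpoint["optimizer"])
start_step = checkpoint["step"]
```

Storing the optimizer state and the step counter next to the weights lets training continue from where it stopped. For inference, the model weights alone are enough.
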