ArshadBadfar / NLP_1

# NLP Model Training with Simple RNN

## Overview

This repository contains a simple Recurrent Neural Network (RNN) for natural language processing, trained on a cleaned corpus combined from multiple sources. The model is trained as a language model and generates text from an input sequence.

## Features

- RNN Architecture: A straightforward RNN model for text generation.
- Tokenizer: Uses the GPT-2 tokenizer for encoding and decoding text.
- Data Processing: Combines and cleans datasets from multiple sources to create the training corpus.
- Checkpointing: Regularly saves model checkpoints during training for easy recovery.
- Performance Evaluation: Tracks training and validation losses and calculates perplexity scores.
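
As a rough illustration of the architecture listed above, a simple RNN language model in PyTorch might look like the sketch below. The class name `SimpleRNNLM`, the helper `next_token_loss`, the layer sizes, and the choice of `nn.RNN` are assumptions made for illustration; they are not taken from this repository's training script. The default vocabulary size of 50257 matches the GPT-2 tokenizer.

```python
import torch
import torch.nn as nn


class SimpleRNNLM(nn.Module):
    """Hypothetical minimal RNN language model (not the repository's actual class)."""

    def __init__(self, vocab_size=50257, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, idx, hidden=None):
        # idx: (batch, seq_len) tensor of token ids
        x = self.embed(idx)                # (batch, seq_len, embed_dim)
        out, hidden = self.rnn(x, hidden)  # (batch, seq_len, hidden_dim)
        logits = self.head(out)            # (batch, seq_len, vocab_size)
        return logits, hidden


def next_token_loss(logits, targets):
    """Cross-entropy between each position's logits and the token that follows it."""
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
```

For a batch `tokens` of shape `(batch, seq_len)`, the inputs would be `tokens[:, :-1]` and the targets `tokens[:, 1:]`, so that each position is trained to predict the next token.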

## Getting Started

### Installation

Make sure to install the required libraries by running:

```bash
pip install torch torchvision transformers pandas matplotlib huggingface-hub
```

### Data Preparation

The datasets used in this project are:

- English Dataset: Alpaca Cleaned Dataset
- Aymara Dataset: Aymara text data in JSON format, stored in Google Drive.
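
One way the two sources could be merged into a single training corpus is sketched below. The Alpaca field names (`instruction`, `input`, `output`) follow the published Alpaca format; the Aymara record layout (a list of objects with a `text` key) and all function names are assumptions, since the JSON schema is not documented here.

```python
import json
import re


def clean_text(text):
    """Collapse runs of whitespace and strip leading/trailing spaces."""
    return re.sub(r"\s+", " ", text).strip()


def alpaca_to_text(record):
    """Join the Alpaca fields into one training string, skipping empty ones."""
    parts = [record.get("instruction", ""), record.get("input", ""), record.get("output", "")]
    return clean_text(" ".join(p for p in parts if p))


def build_corpus(alpaca_records, aymara_records):
    """Combine both sources, dropping empty and duplicate entries."""
    texts = [alpaca_to_text(r) for r in alpaca_records]
    # Assumed schema: each Aymara record looks like {"text": "..."}
    texts += [clean_text(r.get("text", "")) for r in aymara_records]
    seen, corpus = set(), []
    for t in texts:
        if t and t not in seen:
            seen.add(t)
            corpus.append(t)
    return corpus


if __name__ == "__main__":
    alpaca = json.loads('[{"instruction": "Say hi.", "input": "", "output": "Hi!"}]')
    aymara = json.loads('[{"text": "  Kamisaraki  "}, {"text": ""}]')
    print(build_corpus(alpaca, aymara))
```

Deduplicating on the cleaned string keeps the original order of the corpus, which makes it easier to trace a training example back to its source file.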
### Training the Model

To train the model, run the main training script. The model's hyperparameters can be modified in the script as needed.

```bash
# Start training
python train.py  # Adjust to your script's name
```

### Generated Text

After training, you can generate text using the trained model. Here’s an example of generating text from a context:

```python
context = "Once upon a time"
generated_text = model.generate(context, max_new_tokens=200)
print(generated_text)
```

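
The role of `max_new_tokens` can be illustrated with a plain-Python sampling loop. This is only a sketch of autoregressive generation in general: `toy_logits` is a hypothetical stand-in for the trained RNN, and softmax sampling is an assumed decoding strategy, not necessarily what `model.generate` does in this repository.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(next_logits, context_ids, max_new_tokens, rng):
    """Append one sampled token id at a time until max_new_tokens have been added."""
    ids = list(context_ids)
    for _ in range(max_new_tokens):
        probs = softmax(next_logits(ids))
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        ids.append(next_id)
    return ids


def toy_logits(ids, vocab_size=5):
    """Hypothetical model stand-in: favours the id that follows the last token."""
    favoured = (ids[-1] + 1) % vocab_size
    return [2.0 if i == favoured else 0.0 for i in range(vocab_size)]


if __name__ == "__main__":
    print(generate(toy_logits, [0, 1], max_new_tokens=8, rng=random.Random(0)))
```

The returned list always contains the context followed by exactly `max_new_tokens` new ids; with a real model, the ids would then be decoded back to text by the tokenizer.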
## Performance Metrics

- Training Loss: Logged every 300 iterations.
- Validation Loss: Also logged every 300 iterations.
- Perplexity: Calculated from the validation loss to assess model performance.
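
Perplexity is the exponential of the mean cross-entropy loss, so it can be derived directly from the logged validation loss. The sketch below assumes the loss is an average cross-entropy in nats per token, which is what PyTorch's `cross_entropy` returns by default; the function names and the `eval_interval` parameter are illustrative.

```python
import math


def perplexity(mean_loss):
    """Perplexity from a mean cross-entropy loss measured in nats per token."""
    return math.exp(mean_loss)


def should_log(step, eval_interval=300):
    """True on the iterations where losses are logged (every 300 by default)."""
    return step % eval_interval == 0


if __name__ == "__main__":
    # Sanity check: uniform guessing over the 50257-token GPT-2 vocabulary
    # has loss ln(50257), so its perplexity equals the vocabulary size.
    uniform_loss = math.log(50257)
    print(f"loss={uniform_loss:.3f}, perplexity={perplexity(uniform_loss):.0f}")
```

A perplexity well below the vocabulary size indicates the model has learned something beyond uniform guessing; a loss of 0 would correspond to a perplexity of 1.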
## Results Visualization

Training and validation losses are plotted to make the training process easier to follow.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Read the logged losses. The CSV file name is a placeholder (an assumption, not part of
# this repository); the plot expects the columns 'epoch/step', 'training_loss' and 'val_loss'.
loss_data = pd.read_csv('loss_data.csv')

# Plot both losses on a logarithmic x-axis
plt.plot(loss_data['epoch/step'], loss_data['training_loss'], label='Training Loss')
plt.plot(loss_data['epoch/step'], loss_data['val_loss'], label='Validation Loss')
plt.xscale('log')
plt.title('Training and Validation Loss Over Iterations')
plt.xlabel('Epoch/Step')
plt.ylabel('Loss')
plt.legend()
plt.grid()
plt.show()
```

## Checkpoints

Model checkpoints are saved periodically during training. You can load them to resume training or to run inference.