Neural Network-Based Language Model for Next Token Prediction
Project Overview
This project focuses on the development of a Neural Network-Based Language Model for predicting the next token in a given text sequence. The model is trained on two languages: English and Azerbaijani. It uses traditional neural network architectures such as Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks. This project adheres to the guidelines of not using transformer-based or encoder-decoder architectures.
The goal of this project is to build a bilingual text generation model capable of performing next token prediction in both English and Azerbaijani, with training checkpoints, validation, and model evaluation through perplexity scores.
Key Features
- Bilingual Text Prediction: The model is trained on datasets in both English and Azerbaijani to predict the next token in a given sequence.
- Recurrent Neural Network (RNN) and LSTM Models: The project explores the use of RNN and LSTM architectures for sequence modeling and text generation.
- Tokenizer Flexibility: A tokenizer is implemented to process multilingual text, allowing for the creation of input-output pairs for next-token prediction.
- Checkpoint Training: The model saves checkpoints during training to demonstrate its progress over time.
- Text Generation: Generates coherent sequences of text in both languages after training.
- Perplexity Evaluation: The model's performance is measured using perplexity, which evaluates how well it predicts the next token.
Datasets
The model is trained on two datasets:
- Azerbaijani Dataset: A dataset containing text in the Azerbaijani language.
- Alpaca Data Cleaned (English): A cleaned English language dataset for text generation and next-token prediction.
These datasets are preprocessed to be used for the task of next-token prediction in the respective languages.
Model Architecture
- Recurrent Neural Network (RNN): RNNs are used for sequential learning, where the output at each time step depends on the previous time steps, making it ideal for next-token prediction.
- Long Short-Term Memory (LSTM): LSTM networks are designed to remember long-term dependencies, which is crucial for handling long sequences in text prediction tasks.
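The LSTM variant described above can be sketched in PyTorch as an embedding layer feeding an LSTM, with a linear head projecting each hidden state onto the vocabulary. The class name and layer sizes below are illustrative, not the project's actual code:

```python
import torch
import torch.nn as nn

class NextTokenLSTM(nn.Module):
    """Embedding -> LSTM -> linear projection over the vocabulary."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        emb = self.embedding(x)               # (batch, seq, embed_dim)
        out, hidden = self.lstm(emb, hidden)  # (batch, seq, hidden_dim)
        logits = self.fc(out)                 # (batch, seq, vocab_size)
        return logits, hidden

model = NextTokenLSTM(vocab_size=10000)
x = torch.randint(0, 10000, (64, 50))  # dummy batch: 64 sequences of 50 tokens
logits, _ = model(x)
```

The logits at each position are scores over the whole vocabulary, so the same module serves both training (cross-entropy against the next token) and generation (pick a token from the last position's scores).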
Why No Transformer or Encoder-Decoder Models?
This project strictly adheres to the guideline of not using transformers or encoder-decoder models. Instead, it explores classical sequence-based neural architectures like RNN and LSTM, which have been foundational in natural language processing.
Tokenization
To preprocess and tokenize the text data, we use a custom tokenizer that handles both Azerbaijani and English text. The tokenizer splits sentences into token sequences, which are then passed to the neural network for training.
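One way to build such a multilingual tokenizer with the Hugging Face Tokenizers library (listed in the prerequisites) is to train a BPE model on the combined corpus. The two-sentence corpus and vocabulary size here are placeholders, not the project's actual data:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# a BPE tokenizer trained jointly on English and Azerbaijani text
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=10000, special_tokens=["[UNK]", "[PAD]"])

corpus = ["The weather today is sunny.", "Bugün hava çox gözəldir."]
tokenizer.train_from_iterator(corpus, trainer)

ids = tokenizer.encode("Bugün hava").ids  # token ids fed to the network
```

Training a single tokenizer on both languages gives the model one shared vocabulary, so the same embedding table serves English and Azerbaijani input.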
Input-Output Pairs
The project generates input-output pairs from the tokenized text, where:
- Input: A sequence of tokens up to a certain length.
- Output: The next token in the sequence that the model has to predict.
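The pairing scheme above amounts to sliding a fixed-length window over the token ids; a minimal sketch (the helper name is hypothetical):

```python
def make_pairs(token_ids, seq_len):
    """Slide a window over the token ids: each input is seq_len tokens,
    and the target is the single token immediately after the window."""
    pairs = []
    for i in range(len(token_ids) - seq_len):
        pairs.append((token_ids[i:i + seq_len], token_ids[i + seq_len]))
    return pairs

ids = list(range(10))
pairs = make_pairs(ids, seq_len=4)
# first pair: input [0, 1, 2, 3], target 4
```

A corpus of N tokens yields N − seq_len such pairs, which are then batched for training.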
Training Process
- Data Preprocessing: Tokenization of both Azerbaijani and English texts is performed.
- Model Training: The model is trained using an RNN or LSTM to predict the next token in a sequence.
- Checkpointing: Training checkpoints are implemented to save the model at different stages of the training process.
- Perplexity Evaluation: During the training process, the perplexity score is calculated to evaluate the performance of the model at predicting the next token.
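The steps above can be sketched as a single loop: one optimizer step per epoch here for brevity, with perplexity derived from the cross-entropy loss and a checkpoint saved each epoch. The toy model and random data stand in for the real LSTM and tokenized corpus; file names are illustrative:

```python
import math
import torch
import torch.nn as nn

# toy stand-in for the LSTM: embed, flatten the window, score the vocabulary
model = nn.Sequential(nn.Embedding(100, 32), nn.Flatten(), nn.Linear(32 * 5, 100))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

inputs = torch.randint(0, 100, (64, 5))   # (batch, seq_len) token ids
targets = torch.randint(0, 100, (64,))    # next token for each sequence

for epoch in range(3):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    perplexity = math.exp(loss.item())    # exp of mean cross-entropy
    torch.save({"epoch": epoch,
                "model_state": model.state_dict(),
                "optimizer_state": optimizer.state_dict(),
                "loss": loss.item()},
               f"checkpoint_epoch{epoch}.pt")
```

Saving the optimizer state alongside the model weights is what makes it possible to resume training later without resetting the Adam moment estimates.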
Hyperparameters
Some of the key hyperparameters include:
- Batch Size: 64
- Sequence Length: 50 tokens
- Learning Rate: 0.001
- Number of Epochs: 20 (adjusted per experiment)
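The hyperparameters above can be gathered into a single configuration object so they are set in one place; a minimal sketch (the class name is illustrative):

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    batch_size: int = 64
    seq_len: int = 50
    learning_rate: float = 0.001
    num_epochs: int = 20

cfg = TrainConfig()  # override fields per experiment, e.g. TrainConfig(num_epochs=30)
```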
Model Evaluation
The model's performance is evaluated based on:
- Training and Validation Loss: The loss curves are visualized to observe how well the model learns during training.
- Perplexity Score: The perplexity score is used to measure the model’s ability to predict the next token. A lower perplexity indicates a better-performing model.
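Concretely, perplexity is just the exponential of the mean per-token cross-entropy loss, which is why a lower loss directly means a lower (better) perplexity:

```python
import math

def perplexity(mean_cross_entropy_loss):
    """Perplexity = exp(mean per-token cross-entropy)."""
    return math.exp(mean_cross_entropy_loss)

perplexity(0.0)  # perfect prediction gives perplexity 1.0
```

Intuitively, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens: a loss of ln(50) corresponds to perplexity 50.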
Checkpoints and Model Progress
The model is saved at various checkpoints during the training process. These checkpoints allow for:
- Demonstrating the model’s text generation capabilities at different stages of training.
- Resuming training from a specific checkpoint if needed.
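Resuming works by loading a saved checkpoint dictionary back into the model and optimizer; a self-contained sketch with a toy model (the checkpoint keys and file name are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# save once so the load below has something to read
torch.save({"epoch": 4,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict()}, "ckpt.pt")

# restore model weights and optimizer state, then continue training
ckpt = torch.load("ckpt.pt")
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])
start_epoch = ckpt["epoch"] + 1  # resume from the next epoch
```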
Text Generation
Once the model is trained, it can generate text in both English and Azerbaijani. Text generation examples are provided in the final output, showcasing the quality of the predicted tokens in both languages.
Examples:
English Text Generation:
- Input: "The weather today is"
- Output: "The weather today is sunny and warm."
Azerbaijani Text Generation:
- Input: "Bugün hava" ("Today the weather")
- Output: "Bugün hava çox gözəldir." ("The weather is very beautiful today.")
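Generation from a seed like the examples above can be done with a simple greedy decoding loop: feed the last seq_len tokens, append the highest-scoring next token, and repeat. The untrained toy model here stands in for the trained LSTM, so only the mechanics (not the output quality) are meaningful:

```python
import torch
import torch.nn as nn

vocab_size, seq_len = 50, 5
# toy stand-in for the trained model: fixed-window scorer over the vocabulary
model = nn.Sequential(nn.Embedding(vocab_size, 16),
                      nn.Flatten(),
                      nn.Linear(16 * seq_len, vocab_size))

def generate(seed_ids, n_new):
    """Greedy decoding: repeatedly score the last seq_len tokens
    and append the argmax token."""
    ids = list(seed_ids)
    for _ in range(n_new):
        window = torch.tensor([ids[-seq_len:]])
        with torch.no_grad():
            logits = model(window)
        ids.append(int(logits.argmax(dim=-1)))
    return ids

out = generate([1, 2, 3, 4, 5], n_new=3)
```

In practice, sampling from the softmax distribution (instead of always taking the argmax) tends to produce more varied text.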
Installation
Prerequisites
You will need the following dependencies:
- Python 3.x
- PyTorch
- Hugging Face Tokenizers
- scikit-learn
- pandas
- numpy
Install the required Python packages by running:
pip install torch tokenizers scikit-learn pandas numpy
Running the Project
- Clone the repository:
git clone https://github.com/your-repo-url.git
- Navigate to the project directory:
cd your-repo-name
- Run the Jupyter notebook to train the model:
jupyter notebook Venkateswarlu.ipynb
Usage
- Training the Model: Run the Jupyter notebook to preprocess the data and start training the model. Checkpoint files will be saved at different stages.
- Generating Text: After training, the model can generate text in both English and Azerbaijani by providing seed sentences.
- Resuming from Checkpoint: Use saved checkpoints to resume training or generate text from a specific stage.
HuggingFace Repository
The trained model has been uploaded to HuggingFace and can be accessed here: Link to HuggingFace Model.
Demonstration Video
A video demonstration of the project can be found here: YouTube Video Link.