Neural Network-Based Language Model for Next Token Prediction
Project Overview
This project focuses on the development of a Neural Network-Based Language Model for predicting the next token in a given text sequence. The model is trained on two languages: English and Azerbaijani. It uses traditional neural network architectures such as Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks. This project adheres to the guidelines of not using transformer-based or encoder-decoder architectures.
The goal of this project is to build a bilingual text generation model capable of performing next token prediction in both English and Azerbaijani, with training checkpoints, validation, and model evaluation through perplexity scores.
Key Features
- Bilingual Text Prediction: The model is trained on datasets in both English and Azerbaijani to predict the next token in a given sequence.
- Recurrent Neural Network (RNN) and LSTM Models: The project explores the use of RNN and LSTM architectures for sequence modeling and text generation.
- Tokenizer Flexibility: A tokenizer is implemented to process multilingual text, allowing for the creation of input-output pairs for next-token prediction.
- Checkpoint Training: The model saves checkpoints during training to demonstrate its progress over time.
- Text Generation: Generates coherent sequences of text in both languages after training.
- Perplexity Evaluation: The model's performance is measured using perplexity, which evaluates how well it predicts the next token.
Datasets
The model is trained on two datasets:
- Azerbaijani Dataset: A dataset containing text in the Azerbaijani language.
- Alpaca Data Cleaned (English): A cleaned English language dataset for text generation and next-token prediction.
These datasets are preprocessed to be used for the task of next-token prediction in the respective languages.
Model Architecture
- Recurrent Neural Network (RNN): RNNs are used for sequential learning, where the output at each time step depends on the previous time steps, making it ideal for next-token prediction.
- Long Short-Term Memory (LSTM): LSTM networks are designed to remember long-term dependencies, which is crucial for handling long sequences in text prediction tasks.
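The LSTM variant described above can be sketched in PyTorch as an embedding layer feeding an LSTM, with a linear head projecting each hidden state onto the vocabulary. The class name and layer sizes below are illustrative, not the project's actual code:

```python
import torch
import torch.nn as nn

class NextTokenLSTM(nn.Module):
    """Embedding -> LSTM -> linear projection over the vocabulary."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        emb = self.embedding(x)               # (batch, seq, embed_dim)
        out, hidden = self.lstm(emb, hidden)  # (batch, seq, hidden_dim)
        logits = self.fc(out)                 # (batch, seq, vocab_size)
        return logits, hidden

model = NextTokenLSTM(vocab_size=10000)
x = torch.randint(0, 10000, (64, 50))  # dummy batch: 64 sequences of 50 tokens
logits, _ = model(x)
```

The logits at each position are scores over the whole vocabulary, so the same module serves both training (cross-entropy against the next token) and generation (pick a token from the last position's scores).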
Why No Transformer or Encoder-Decoder Models?
This project strictly adheres to the guideline of not using transformers or encoder-decoder models. Instead, it explores classical sequence-based neural architectures like RNN and LSTM, which have been foundational in natural language processing.
Tokenization
To preprocess and tokenize the text data, we use a custom tokenizer that handles both Azerbaijani and English text. The tokenizer splits sentences into token sequences, which are then passed to the neural network for training.
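One way to build such a multilingual tokenizer with the Hugging Face Tokenizers library (listed in the prerequisites) is to train a BPE model on the combined corpus. The two-sentence corpus and vocabulary size here are placeholders, not the project's actual data:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# a BPE tokenizer trained jointly on English and Azerbaijani text
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=10000, special_tokens=["[UNK]", "[PAD]"])

corpus = ["The weather today is sunny.", "Bugün hava çox gözəldir."]
tokenizer.train_from_iterator(corpus, trainer)

ids = tokenizer.encode("Bugün hava").ids  # token ids fed to the network
```

Training a single tokenizer on both languages gives the model one shared vocabulary, so the same embedding table serves English and Azerbaijani input.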
Input-Output Pairs
The project generates input-output pairs from the tokenized text, where:
- Input: A sequence of tokens up to a certain length.
- Output: The next token in the sequence that the model has to predict.
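The pairing scheme above amounts to sliding a fixed-length window over the token ids; a minimal sketch (the helper name is hypothetical):

```python
def make_pairs(token_ids, seq_len):
    """Slide a window over the token ids: each input is seq_len tokens,
    and the target is the single token immediately after the window."""
    pairs = []
    for i in range(len(token_ids) - seq_len):
        pairs.append((token_ids[i:i + seq_len], token_ids[i + seq_len]))
    return pairs

ids = list(range(10))
pairs = make_pairs(ids, seq_len=4)
# first pair: input [0, 1, 2, 3], target 4
```

A corpus of N tokens yields N − seq_len such pairs, which are then batched for training.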
Training Process
- Data Preprocessing: Tokenization of both Azerbaijani and English texts is performed.
- Model Training: The model is trained using an RNN or LSTM to predict the next token in a sequence.
- Checkpointing: Training checkpoints are implemented to save the model at different stages of the training process.
- Perplexity Evaluation: During the training process, the perplexity score is calculated to evaluate the performance of the model at predicting the next token.
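The steps above can be sketched as a single loop: one optimizer step per epoch here for brevity, with perplexity derived from the cross-entropy loss and a checkpoint saved each epoch. The toy model and random data stand in for the real LSTM and tokenized corpus; file names are illustrative:

```python
import math
import torch
import torch.nn as nn

# toy stand-in for the LSTM: embed, flatten the window, score the vocabulary
model = nn.Sequential(nn.Embedding(100, 32), nn.Flatten(), nn.Linear(32 * 5, 100))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

inputs = torch.randint(0, 100, (64, 5))   # (batch, seq_len) token ids
targets = torch.randint(0, 100, (64,))    # next token for each sequence

for epoch in range(3):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    perplexity = math.exp(loss.item())    # exp of mean cross-entropy
    torch.save({"epoch": epoch,
                "model_state": model.state_dict(),
                "optimizer_state": optimizer.state_dict(),
                "loss": loss.item()},
               f"checkpoint_epoch{epoch}.pt")
```

Saving the optimizer state alongside the model weights is what makes it possible to resume training later without resetting the Adam moment estimates.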
Hyperparameters
Some of the key hyperparameters include:
- Batch Size: 64
- Sequence Length: 50 tokens
- Learning Rate: 0.001
- Number of Epochs: 20 (adjusted per experiment)
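The hyperparameters above can be gathered into a single configuration object so they are set in one place; a minimal sketch (the class name is illustrative):

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    batch_size: int = 64
    seq_len: int = 50
    learning_rate: float = 0.001
    num_epochs: int = 20

cfg = TrainConfig()  # override fields per experiment, e.g. TrainConfig(num_epochs=30)
```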
Model Evaluation
The model's performance is evaluated based on:
- Training and Validation Loss: The loss curves are visualized to observe how well the model learns during training.
- Perplexity Score: The perplexity score is used to measure the model’s ability to predict the next token. A lower perplexity indicates a better-performing model.
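Concretely, perplexity is just the exponential of the mean per-token cross-entropy loss, which is why a lower loss directly means a lower (better) perplexity:

```python
import math

def perplexity(mean_cross_entropy_loss):
    """Perplexity = exp(mean per-token cross-entropy)."""
    return math.exp(mean_cross_entropy_loss)

perplexity(0.0)  # perfect prediction gives perplexity 1.0
```

Intuitively, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens: a loss of ln(50) corresponds to perplexity 50.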
Checkpoints and Model Progress
The model is saved at various checkpoints during the training process. These checkpoints allow for:
- Demonstrating the model’s text generation capabilities at different stages of training.
- Resuming training from a specific checkpoint if needed.
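Resuming works by loading a saved checkpoint dictionary back into the model and optimizer; a self-contained sketch with a toy model (the checkpoint keys and file name are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# save once so the load below has something to read
torch.save({"epoch": 4,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict()}, "ckpt.pt")

# restore model weights and optimizer state, then continue training
ckpt = torch.load("ckpt.pt")
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])
start_epoch = ckpt["epoch"] + 1  # resume from the next epoch
```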
Text Generation
Once the model is trained, it can generate text in both English and Azerbaijani. Text generation examples are provided in the final output, showcasing the quality of the predicted tokens in both languages.
Examples:
English Text Generation:
- Input: "The weather today is"
- Output: "The weather today is sunny and warm."
Azerbaijani Text Generation:
- Input: "Bugün hava" ("Today the weather")
- Output: "Bugün hava çox gözəldir." ("The weather is very beautiful today.")
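Generation from a seed like the examples above can be done with a simple greedy decoding loop: feed the last seq_len tokens, append the highest-scoring next token, and repeat. The untrained toy model here stands in for the trained LSTM, so only the mechanics (not the output quality) are meaningful:

```python
import torch
import torch.nn as nn

vocab_size, seq_len = 50, 5
# toy stand-in for the trained model: fixed-window scorer over the vocabulary
model = nn.Sequential(nn.Embedding(vocab_size, 16),
                      nn.Flatten(),
                      nn.Linear(16 * seq_len, vocab_size))

def generate(seed_ids, n_new):
    """Greedy decoding: repeatedly score the last seq_len tokens
    and append the argmax token."""
    ids = list(seed_ids)
    for _ in range(n_new):
        window = torch.tensor([ids[-seq_len:]])
        with torch.no_grad():
            logits = model(window)
        ids.append(int(logits.argmax(dim=-1)))
    return ids

out = generate([1, 2, 3, 4, 5], n_new=3)
```

In practice, sampling from the softmax distribution (instead of always taking the argmax) tends to produce more varied text.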
Installation
Prerequisites
You will need the following dependencies:
- Python 3.x
- PyTorch
- Hugging Face Tokenizers
- scikit-learn
- pandas
- numpy
Install the required Python packages by running:
pip install torch tokenizers scikit-learn pandas numpy
Running the Project
- Clone the repository:
git clone https://github.com/your-repo-url.git
- Navigate to the project directory:
cd your-repo-name
- Run the Jupyter notebook to train the model:
jupyter notebook Venkateswarlu.ipynb
Usage
- Training the Model: Run the Jupyter notebook to preprocess the data and start training the model. Checkpoint files will be saved at different stages.
- Generating Text: After training, the model can generate text in both English and Azerbaijani by providing seed sentences.
- Resuming from Checkpoint: Use saved checkpoints to resume training or generate text from a specific stage.
HuggingFace Repository
The trained model has been uploaded to HuggingFace and can be accessed here: Link to HuggingFace Model.
Demonstration Video
A video demonstration of the project can be found here: YouTube Video Link.