# Neural Network-Based Language Model for Next Token Prediction

## Project Overview

This project develops a **neural network-based language model** for predicting the next token in a given text sequence. The model is trained on two languages, **English** and **Azerbaijani**, and uses classical recurrent architectures: **Recurrent Neural Networks (RNN)** and **Long Short-Term Memory (LSTM)** networks. In line with the project guidelines, no transformer-based or encoder-decoder architectures are used.

The goal is a bilingual text generation model capable of **next-token prediction** in both English and Azerbaijani, with training checkpoints, validation, and model evaluation via perplexity scores.

## Key Features

- **Bilingual Text Prediction**: The model is trained on English and Azerbaijani datasets to predict the next token in a given sequence.
- **RNN and LSTM Models**: The project explores RNN and LSTM architectures for sequence modeling and text generation.
- **Tokenizer Flexibility**: A tokenizer processes multilingual text and produces the input-output pairs used for next-token prediction.
- **Checkpoint Training**: Checkpoints are saved during training to demonstrate progress over time.
- **Text Generation**: The model generates coherent text in both languages after training.
- **Perplexity Evaluation**: Performance is measured with perplexity, which quantifies how well the model predicts the next token.

## Datasets

The model is trained on two datasets:

1. **Azerbaijani Dataset**: A dataset of Azerbaijani-language text.
2. **Alpaca Data Cleaned (English)**: A cleaned English-language dataset for text generation and next-token prediction.
These datasets are preprocessed for the task of next-token prediction in the respective languages.

## Model Architecture

- **Recurrent Neural Network (RNN)**: RNNs process sequences step by step, with the output at each time step depending on the previous time steps, which suits next-token prediction.
- **Long Short-Term Memory (LSTM)**: LSTM networks are designed to retain long-term dependencies, which is crucial for handling long sequences in text prediction tasks.

### Why No Transformer or Encoder-Decoder Models?

This project strictly adheres to the guideline of not using transformers or encoder-decoder models. Instead, it explores classical sequence-based neural architectures like RNN and LSTM, which have been foundational in natural language processing.

## Tokenization

A custom tokenizer preprocesses and tokenizes the text data, handling both Azerbaijani and English. The tokenizer splits sentences into token sequences, which are then passed to the neural network for training.

### Input-Output Pairs

Input-output pairs are generated from the tokenized text, where:

- **Input**: A sequence of tokens up to a fixed length.
- **Output**: The next token in the sequence, which the model has to predict.

## Training Process

1. **Data Preprocessing**: Both Azerbaijani and English texts are tokenized.
2. **Model Training**: An RNN or LSTM is trained to predict the next token in a sequence.
3. **Checkpointing**: Checkpoints save the model at different stages of the training process.
4. **Perplexity Evaluation**: The perplexity score is computed during training to evaluate how well the model predicts the next token.
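The input-output pairing and the LSTM architecture described above can be sketched as follows. This is a minimal illustration in PyTorch, not the project's actual code: the names `make_pairs` and `NextTokenLSTM` are hypothetical, and a short sequence length is used in place of the README's 50 tokens.

```python
import torch
import torch.nn as nn

SEQ_LEN = 5  # the project uses 50; shortened here for illustration

def make_pairs(token_ids, seq_len=SEQ_LEN):
    """Slide a window over the token IDs: each input is `seq_len` tokens,
    each target is the token that immediately follows that window."""
    inputs, targets = [], []
    for i in range(len(token_ids) - seq_len):
        inputs.append(token_ids[i:i + seq_len])
        targets.append(token_ids[i + seq_len])
    return torch.tensor(inputs), torch.tensor(targets)

class NextTokenLSTM(nn.Module):
    """Embedding -> LSTM -> linear projection to vocabulary logits."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        emb = self.embed(x)            # (batch, seq, embed_dim)
        out, _ = self.lstm(emb)        # (batch, seq, hidden_dim)
        return self.fc(out[:, -1, :])  # logits for the next token

# Usage: 10 dummy token IDs and a vocabulary of 20
ids = list(range(10))
x, y = make_pairs(ids)                 # x: (5, 5) windows, y: (5,) targets
model = NextTokenLSTM(vocab_size=20)
logits = model(x)                      # (5, 20): one logit row per window
```

Training would then minimize cross-entropy between `logits` and `y`; perplexity is simply the exponential of that cross-entropy loss.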
### Hyperparameters

Key hyperparameters include:

- **Batch Size**: 64
- **Sequence Length**: 50 tokens
- **Learning Rate**: 0.001
- **Number of Epochs**: 20 (or as determined by experimentation)

## Model Evaluation

The model's performance is evaluated on:

1. **Training and Validation Loss**: The loss curves are visualized to observe how well the model learns during training.
2. **Perplexity Score**: Perplexity measures the model's ability to predict the next token; a lower perplexity indicates a better-performing model.

## Checkpoints and Model Progress

The model is saved at various checkpoints during training. These checkpoints allow for:

- Demonstrating the model's text generation capabilities at different stages of training.
- Resuming training from a specific checkpoint if needed.

## Text Generation

Once trained, the model can generate text in both English and Azerbaijani. Text generation examples are provided in the final output, showcasing the quality of the predicted tokens in both languages.

### Examples

1. **English Text Generation**:
   - Input: "The weather today is"
   - Output: "The weather today is sunny and warm."
2. **Azerbaijani Text Generation**:
   - Input: "Bugün hava" ("Today the weather")
   - Output: "Bugün hava çox gözəldir." ("The weather is very nice today.")

## Installation

### Prerequisites

You will need the following dependencies:

- Python 3.x
- PyTorch
- Hugging Face Tokenizers
- scikit-learn
- pandas
- numpy

Install the required Python packages:

```bash
pip install torch tokenizers scikit-learn pandas numpy
```

### Running the Project

1. Clone the repository:
   ```bash
   git clone https://github.com/your-repo-url.git
   ```
2. Navigate to the project directory:
   ```bash
   cd your-repo-name
   ```
3. Run the Jupyter notebook to train the model:
   ```bash
   jupyter notebook Venkateswarlu.ipynb
   ```

## Usage

1. **Training the Model**: Run the Jupyter notebook to preprocess the data and train the model. Checkpoint files are saved at different stages.
2. **Generating Text**: After training, the model can generate text in both English and Azerbaijani from seed sentences.
3. **Resuming from Checkpoint**: Use saved checkpoints to resume training or to generate text from a specific stage.

## HuggingFace Repository

The trained model has been uploaded to HuggingFace and can be accessed here: [Link to HuggingFace Model](https://huggingface.co/Venkateswarlu15).

## Demonstration Video

A video demonstration of the project can be found here: [YouTube Video Link](https://www.youtube.com/@VenkateswarluBondalapati-dk4nk).
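The seed-based generation described under Usage can be sketched as a greedy decoding loop. This is an illustrative PyTorch sketch, not the project's code: a tiny untrained model and toy vocabulary stand in for a model and tokenizer restored from a real checkpoint, and all names are hypothetical.

```python
import torch
import torch.nn as nn

# Toy stand-ins: a real run would load the trained model and tokenizer
# from a saved checkpoint instead of building these here.
VOCAB = ["<unk>", "Bugün", "hava", "çox", "gözəldir", "."]
stoi = {w: i for i, w in enumerate(VOCAB)}

class TinyLSTM(nn.Module):
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.fc = nn.Linear(dim, vocab_size)

    def forward(self, x):
        out, _ = self.lstm(self.embed(x))
        return self.fc(out[:, -1, :])  # logits for the next token only

@torch.no_grad()
def generate(model, seed_tokens, max_new_tokens=3):
    """Greedy decoding: repeatedly feed the growing sequence back into
    the model and append the highest-probability next token."""
    ids = [stoi.get(t, 0) for t in seed_tokens]
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids]))   # batch of one sequence
        ids.append(int(logits.argmax(dim=-1)))
    return [VOCAB[i] for i in ids]

model = TinyLSTM(len(VOCAB)).eval()
out = generate(model, ["Bugün", "hava"])  # seed + 3 generated tokens
```

Since this toy model is untrained, the generated tokens are arbitrary; with the trained checkpoint loaded, the same loop would produce continuations like those in the Examples section.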