# Neural Network-Based Language Model for Next Token Prediction

## Overview

This project implements a neural network-based language model designed for next-token prediction in two languages: English and Icelandic. The model is built without transformer or encoder-decoder architectures, focusing instead on traditional neural network techniques.
## Table of Contents

- Installation
- Usage
- Model Architecture
- Training
- Text Generation
- Results
- License
## Installation

To run this project, you need Python installed along with the following libraries:

```bash
pip install torch numpy pandas huggingface_hub
```
## Usage

1. Upload or open the notebook in Google Colab.
2. Run all cells sequentially to load the models, configure the text generation process, and view the outputs.
3. Modify the seed text to generate different text sequences; you can provide your own input to see how the model responds. A minimal example is shown below.
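For example, assuming the notebook exposes the seed as a plain variable (the name `seed_text` is illustrative, not necessarily the one used in the notebook):

```python
# Change the seed before re-running the generation cell.
# The variable name is illustrative; use whatever the notebook defines.
seed_text = "Today is a good day"   # English seed
# seed_text = "þetta mun auka"      # Icelandic seed
```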
## Model Architecture

The model in this notebook is based on a recurrent neural network, using Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) layers, which are commonly used for sequence prediction tasks such as text generation. The architecture consists of:

- **Embedding Layer**: Converts input words into dense vectors of fixed size.
- **LSTM/GRU Layers**: Handle sequential data and maintain long-range dependencies between words.
- **Dense Output Layer**: Produces a prediction for the next word in the sequence.

This architecture lets the model learn from the preceding words and predict the next one in the sequence.
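The notebook defines its own model, but a minimal PyTorch sketch of the architecture described above could look like the following (the class name and hyperparameter values are illustrative assumptions, not taken from the notebook):

```python
import torch
import torch.nn as nn

class NextTokenLSTM(nn.Module):
    """Embedding -> LSTM -> Linear, mirroring the architecture described above."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        # Embedding layer: maps token ids to dense vectors of fixed size.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # LSTM layers: carry sequential state across the input tokens.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        # Dense output layer: one logit per vocabulary entry for the next token.
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        emb = self.embedding(x)               # (batch, seq_len, embed_dim)
        out, hidden = self.lstm(emb, hidden)  # (batch, seq_len, hidden_dim)
        logits = self.fc(out)                 # (batch, seq_len, vocab_size)
        return logits, hidden
```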
## Training

The model used in this notebook is pre-trained, meaning it has already been trained on a large dataset for both English and Icelandic text generation.

However, if you wish to re-train the model or fine-tune it on your own data, you can do so by adding a training loop to the notebook. Ensure you have a dataset and adjust the training parameters (such as batch size, number of epochs, and learning rate).
Here is a basic outline of how training could be set up (a sketch of such a loop follows the list):

1. Preprocess your text data into sequences.
2. Split the data into training and validation sets.
3. Train the model on the sequences, optimizing the loss function.
4. Save the model after training for future use.
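A minimal sketch of such a loop, assuming the `NextTokenLSTM` class from the sketch above and a `train_loader` that yields (input, target) batches of token ids (both names are illustrative, not taken from the notebook):

```python
import torch
import torch.nn as nn

model = NextTokenLSTM(vocab_size=10_000)          # vocab size is a placeholder
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    model.train()
    for inputs, targets in train_loader:          # (batch, seq_len) tensors of token ids
        optimizer.zero_grad()
        logits, _ = model(inputs)
        # Flatten so every position's prediction is scored against its target token.
        loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last batch loss {loss.item():.4f}")

# Save the trained weights for future use (step 4 above).
torch.save(model.state_dict(), "next_token_model.pt")
```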
## Text Generation

In this notebook, the model is used for text generation. It takes an initial seed text (a starting sequence) and repeatedly predicts the next word to build a longer sequence.

Steps for text generation:

1. Provide a seed text in English or Icelandic.
2. Run the code cell to generate text based on the provided input.
3. The output is displayed as a continuation of the seed text.
Example:

English seed text: "Today is a good day"
Generated output: "Today is a good day to explore the new opportunities available."

Icelandic seed text: "þetta mun auka" ("this will increase")
Generated output: "þetta mun auka áberandi í utan eins og vieigandi..."
## Results

The training curves for both training loss and validation loss are provided in the submission. The model's performance is evaluated on the quality of the generated text and on the perplexity measured during training.
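Perplexity can be recovered directly from the average per-token cross-entropy loss (the `val_loss` value below is a placeholder, not a reported result):

```python
import math

# Perplexity is the exponential of the mean per-token cross-entropy loss.
val_loss = 4.2                      # placeholder: average validation loss in nats
perplexity = math.exp(val_loss)
print(f"perplexity: {perplexity:.1f}")
```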
## License

This notebook is available for educational purposes. Feel free to modify and use it as needed for your own experiments or projects. However, the pre-trained models and certain dependencies may have their own licenses, so make sure you comply with their terms of use.