---
language:
- en
- he
---
# Bilingual Language Model for Next Token Prediction
## Overview
This project builds a neural-network language model for next-token prediction in two languages: **English** and **Hebrew**. The model uses an LSTM (Long Short-Term Memory) architecture, a type of Recurrent Neural Network (RNN), to predict the next word in a sequence based on the training data, and is evaluated with the **perplexity** metric to measure prediction quality.
The final model and checkpoints are provided, along with training history including perplexity and loss values.
## Model Architecture
- **Embedding Layer**: Converts tokenized words into dense vector representations.
- **LSTM Layer**: Consists of 128 units to capture long-term dependencies in the sequence data.
- **Dense Output Layer**: Outputs a probability distribution over the vocabulary to predict the next word.
- **Total Vocabulary Size**: `[total_words]` unique tokens, drawn from the combined English and Hebrew datasets (see the sketch below).
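
A minimal Keras sketch of this architecture, assuming `total_words` comes from the tokenization step described in the Dataset section; `embedding_dim` and the `build_model` helper are illustrative choices, not taken from this repo:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

def build_model(total_words, embedding_dim=100):
    # embedding_dim is an illustrative choice; the repo's value may differ.
    model = Sequential([
        # Embedding layer: token ids -> dense vectors
        Embedding(total_words, embedding_dim),
        # LSTM layer: 128 units to capture long-term dependencies
        LSTM(128),
        # Dense output layer: probability distribution over the vocabulary
        Dense(total_words, activation="softmax"),
    ])
    return model
```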
## Dataset
The model is trained using a combination of English and Hebrew text datasets. The input sequences are tokenized and padded to ensure consistent input length for training the model.
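
A sketch of the standard Keras preprocessing pipeline this description implies; `corpus` is a hypothetical list of text lines, and the exact n-gram construction used in this repo may differ:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# corpus is a hypothetical list of English and Hebrew text lines.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1  # +1 for the padding index

# Turn each line into n-gram prefixes so every word becomes a target.
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(2, len(token_list) + 1):
        input_sequences.append(token_list[:i])

# Left-pad to a consistent length, then split inputs from targets.
max_sequence_len = max(len(seq) for seq in input_sequences)
padded = pad_sequences(input_sequences, maxlen=max_sequence_len, padding="pre")
X, y = padded[:, :-1], padded[:, -1]
```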
## Training
The model was trained with the following parameters (a compile-and-fit sketch follows the list):
- **Optimizer**: Adam
- **Loss Function**: Categorical Crossentropy
- **Batch Size**: 64
- **Epochs**: 20
- **Validation Split**: 20%
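
A sketch of training with these parameters, reusing `build_model` and the `X`/`y` arrays from the sketches above:

```python
from tensorflow.keras.utils import to_categorical

# One-hot encode targets to match categorical crossentropy.
y_onehot = to_categorical(y, num_classes=total_words)

model = build_model(total_words)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(
    X, y_onehot,
    batch_size=64,
    epochs=20,
    validation_split=0.2,
)
```

With a large bilingual vocabulary, `sparse_categorical_crossentropy` on integer targets would avoid the memory-heavy one-hot step; the sketch follows the loss function named above.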
## Evaluation Metric: Perplexity
Perplexity is used to measure the model's performance; lower perplexity indicates better generalization to unseen data (a short derivation sketch follows the scores). The final perplexity scores are:
- **Final Training Perplexity**: `[Final Training Perplexity]`
- **Final Validation Perplexity**: `[Final Validation Perplexity]`
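
Perplexity is the exponential of the average cross-entropy loss, so these scores can be derived directly from the training history; a sketch assuming the `history` object from the training example above:

```python
import math

# Perplexity is exp(average cross-entropy loss); history comes from
# the model.fit call in the training sketch above.
final_train_perplexity = math.exp(history.history["loss"][-1])
final_val_perplexity = math.exp(history.history["val_loss"][-1])

print(f"Final training perplexity:   {final_train_perplexity:.2f}")
print(f"Final validation perplexity: {final_val_perplexity:.2f}")
```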
## Checkpoints
A checkpoint mechanism is used to save the model at its best-performing stage based on validation loss. The best model checkpoint (`best_model.keras`) is included, which can be loaded for inference.
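
A sketch of this checkpoint-and-reload pattern using the Keras `ModelCheckpoint` callback; the filename matches the repo, everything else is illustrative:

```python
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model

# Keep only the weights that achieve the lowest validation loss.
checkpoint = ModelCheckpoint(
    "best_model.keras",
    monitor="val_loss",
    save_best_only=True,
)
# During training: model.fit(..., callbacks=[checkpoint])

# At inference time, reload the best checkpoint.
best_model = load_model("best_model.keras")
```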
## Results
The model predicts next tokens for both English and Hebrew text; its performance is summarized by the final training and validation perplexity scores reported above.
## How to Use
To use this model, follow these steps:
1. **Clone the repository**:
```bash
git clone https://huggingface.co/username/model-name
cd model-name
```