---
language:
- en
- he
---

# Bilingual Language Model for Next Token Prediction

## Overview

This project builds a neural language model for next-token prediction in two languages: **English** and **Hebrew**. The model uses an LSTM (Long Short-Term Memory) architecture, a recurrent neural network (RNN) variant, to predict the next word in a sequence from the provided training data. Prediction quality is evaluated with the **perplexity** metric.

The final model and checkpoints are provided, along with the training history, including per-epoch loss and perplexity values.

## Model Architecture

The network stacks the following layers (a minimal code sketch follows the list):

- **Embedding Layer**: Converts token indices into dense vector representations.
- **LSTM Layer**: 128 units, capturing long-range dependencies in the sequence data.
- **Dense Output Layer**: A softmax layer that outputs a probability distribution over the vocabulary for the next word.
- **Total Vocabulary Size**: `[total_words]` tokens, combining the English and Hebrew datasets.

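As a reference, here is a minimal Keras sketch of that stack. The embedding dimension (100) is an illustrative assumption, not a published hyperparameter of this model; `total_words` comes from the preprocessing step sketched in the Dataset section below:

```python
import tensorflow as tf

def build_model(total_words: int) -> tf.keras.Model:
    """Embedding -> LSTM(128) -> softmax over the vocabulary."""
    return tf.keras.Sequential([
        # Map each token index to a dense 100-dimensional vector
        tf.keras.layers.Embedding(total_words, 100),
        # A 128-unit LSTM captures long-range dependencies in the sequence
        tf.keras.layers.LSTM(128),
        # Probability distribution over the vocabulary for the next word
        tf.keras.layers.Dense(total_words, activation="softmax"),
    ])
```
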
## Dataset

The model is trained on a combination of English and Hebrew text datasets. Input sequences are tokenized and padded to a fixed length so that every training example has a consistent shape.

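That preprocessing can be sketched with the standard Keras text utilities; the toy corpus and the n-gram construction below are illustrative assumptions, not the actual training data or tokenizer settings:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Toy bilingual corpus for illustration only
corpus = ["the cat sat on the mat", "שלום עולם טוב"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1  # +1 for the reserved padding index 0

# Turn each line into n-gram prefixes: [w1, w2], [w1, w2, w3], ...
sequences = []
for line in corpus:
    tokens = tokenizer.texts_to_sequences([line])[0]
    for i in range(2, len(tokens) + 1):
        sequences.append(tokens[:i])

# Left-pad to a uniform length; the last token of each row is the target
max_seq_len = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_seq_len, padding="pre")
X, y = padded[:, :-1], padded[:, -1]
```
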
## Training

The model was trained with the following parameters; a matching Keras call is sketched after the list:

- **Optimizer**: Adam
- **Loss Function**: Categorical cross-entropy
- **Batch Size**: 64
- **Epochs**: 20
- **Validation Split**: 20%

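These settings map onto a Keras compile-and-fit call roughly as follows, reusing the `build_model`, `X`, `y`, and `total_words` placeholders from the sketches above (all assumptions, not the project's actual training script):

```python
import tensorflow as tf

model = build_model(total_words)  # from the architecture sketch above

# Categorical cross-entropy expects one-hot targets
y_onehot = tf.keras.utils.to_categorical(y, num_classes=total_words)

model.compile(optimizer="adam", loss="categorical_crossentropy")

history = model.fit(
    X, y_onehot,
    batch_size=64,
    epochs=20,
    validation_split=0.2,  # hold out 20% of the examples for validation
)
```
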
## Evaluation Metric: Perplexity

Perplexity measures how well the model predicts unseen data, with lower values indicating better generalization. It is the exponential of the average cross-entropy loss (see the helper after the list). The final perplexity scores are:

- **Final Training Perplexity**: `[Final Training Perplexity]`
- **Final Validation Perplexity**: `[Final Validation Perplexity]`

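That relationship gives a one-line conversion from the loss values Keras reports; the helper below assumes the loss is an average categorical cross-entropy in nats and reuses the `history` object from the training sketch:

```python
import numpy as np

def perplexity(cross_entropy: float) -> float:
    """Perplexity is the exponential of the average cross-entropy loss."""
    return float(np.exp(cross_entropy))

# Read the final-epoch losses from the history returned by model.fit
train_ppl = perplexity(history.history["loss"][-1])
val_ppl = perplexity(history.history["val_loss"][-1])
print(f"train perplexity: {train_ppl:.2f}, val perplexity: {val_ppl:.2f}")
```
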
## Checkpoints

A checkpoint callback saves the model at its best-performing epoch, as measured by validation loss. The best checkpoint (`best_model.keras`) is included and can be loaded directly for inference.

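With Keras this is typically done via the `ModelCheckpoint` callback; the exact configuration used for this model is an assumption:

```python
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model

# Overwrite the file only when validation loss improves
checkpoint = ModelCheckpoint(
    "best_model.keras",
    monitor="val_loss",
    save_best_only=True,
)
# Passed to training as: model.fit(..., callbacks=[checkpoint])

# Later, restore the best checkpoint for inference
best_model = load_model("best_model.keras")
```
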
## Results

The model performs competitively at next-token prediction in both English and Hebrew, reaching the training and validation perplexity scores reported above.

## How to Use

To use this model, follow these steps:

1. **Clone the repository**:

```bash
git clone https://huggingface.co/username/model-name
cd model-name
```

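2. **Load the model and predict the next token**. The sketch below assumes a `tokenizer` fitted with the same settings as in training and the matching `max_seq_len`; neither is documented in this card, so treat them as placeholders:

```python
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

model = load_model("best_model.keras")

def predict_next_word(seed_text: str) -> str:
    # Encode and pad the seed exactly as was done during training
    tokens = tokenizer.texts_to_sequences([seed_text])[0]
    padded = pad_sequences([tokens], maxlen=max_seq_len - 1, padding="pre")
    # Take the highest-probability entry of the softmax output
    probs = model.predict(padded, verbose=0)[0]
    next_id = int(np.argmax(probs))
    # Map the index back to its word ("" if the index is unknown)
    return tokenizer.index_word.get(next_id, "")

print(predict_next_word("the cat sat on the"))
```
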