---
language:
- en
- he
---

# Bilingual Language Model for Next Token Prediction

## Overview
This project builds a neural network-based language model for next token prediction in two languages: **English** and **Hebrew**. The model uses an LSTM (Long Short-Term Memory) architecture, a type of recurrent neural network (RNN), to predict the next word in a sequence based on the training data provided. Model quality is evaluated with the **perplexity** metric.

The final model and checkpoints are provided, along with the training history, including perplexity and loss values.

## Model Architecture
- **Embedding Layer**: Converts tokenized words into dense vector representations.
- **LSTM Layer**: 128 units that capture long-term dependencies in the sequence data.
- **Dense Output Layer**: Outputs a probability distribution over the vocabulary to predict the next word.
- **Total Vocabulary Size**: `[total_words]` tokens, combining both the English and Hebrew datasets.
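
The three layers above can be sketched in Keras. Here `total_words` and the 100-dimensional embedding are illustrative placeholder values, not the project's actual settings (a minimal sketch, not the exact training script):

```python
import tensorflow as tf

# Placeholder values; the real ones come from the tokenized corpus.
total_words = 5000  # combined English + Hebrew vocabulary size

model = tf.keras.Sequential([
    # Dense vector representation for each token in the vocabulary.
    tf.keras.layers.Embedding(total_words, 100),
    # 128 LSTM units capture long-term dependencies in the sequence.
    tf.keras.layers.LSTM(128),
    # Probability distribution over the vocabulary for the next token.
    tf.keras.layers.Dense(total_words, activation="softmax"),
])
```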

## Dataset
The model is trained using a combination of English and Hebrew text datasets. The input sequences are tokenized and padded to ensure a consistent input length for training the model.
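
The preprocessing idea (in Keras typically done with `Tokenizer` and `pad_sequences`) can be illustrated in plain Python: each tokenized sentence is expanded into n-gram prefixes, which are then pre-padded with zeros to a common length. This is a simplified sketch of the concept, not the project's actual pipeline:

```python
def make_padded_sequences(token_ids, max_len):
    """Build n-gram prefix sequences from one tokenized sentence
    and left-pad each with zeros to max_len."""
    sequences = []
    for i in range(2, len(token_ids) + 1):
        prefix = token_ids[:i]                           # n-gram prefix
        padded = [0] * (max_len - len(prefix)) + prefix  # pre-padding
        sequences.append(padded)
    return sequences

# Example: token ids for a 4-word sentence, padded to length 5.
# Each row is a padded context; its last element is the prediction target.
seqs = make_padded_sequences([7, 3, 9, 2], max_len=5)
```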

## Training
The model was trained with the following parameters:
- **Optimizer**: Adam
- **Loss Function**: Categorical Crossentropy
- **Batch Size**: 64
- **Epochs**: 20
- **Validation Split**: 20%
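
In Keras, these settings correspond to a compile/fit call along the following lines. The tiny random `X`/`y` stand in for the real padded sequences and one-hot next-word targets just to keep the snippet self-contained, and only one epoch is run here (a sketch, not the exact training script):

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in data; the real X/y come from the tokenized, padded corpus.
total_words = 50
X = np.random.randint(1, total_words, size=(32, 5))
y = tf.keras.utils.to_categorical(
    np.random.randint(0, total_words, size=(32,)), num_classes=total_words)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(total_words, 16),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(total_words, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# Batch size 64, 20% held out for validation; epochs reduced from 20
# to 1 so the sketch runs quickly.
history = model.fit(X, y, batch_size=64, epochs=1, validation_split=0.2)
```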

## Evaluation Metric: Perplexity
Perplexity is used to measure the model's performance, with lower perplexity indicating better generalization to unseen data. The final perplexity scores are:
- **Final Training Perplexity**: `[Final Training Perplexity]`
- **Final Validation Perplexity**: `[Final Validation Perplexity]`
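
With categorical cross-entropy as the loss, perplexity is simply the exponential of the mean per-token loss, so it can be derived directly from the training history (a small helper, assuming natural-log cross-entropy as Keras reports it):

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is e raised to the mean per-token cross-entropy."""
    return math.exp(cross_entropy_loss)

# A mean loss of ln(2) corresponds to a perplexity of 2: the model is,
# on average, as uncertain as a fair two-way choice.
ppl = perplexity(math.log(2))
```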

## Checkpoints
A checkpoint mechanism saves the model at its best-performing stage based on validation loss. The best model checkpoint (`best_model.keras`) is included and can be loaded for inference.
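
In Keras this is typically done with the `ModelCheckpoint` callback, monitoring validation loss and keeping only the best weights (a sketch of the usual pattern, not necessarily the project's exact configuration):

```python
import tensorflow as tf

# Save the model only when validation loss improves.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_model.keras",
    monitor="val_loss",
    save_best_only=True,
)

# Passed to fit via callbacks=[checkpoint]; afterwards the best model
# can be restored with tf.keras.models.load_model("best_model.keras").
```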

## Results
The model demonstrates competitive performance in predicting next tokens for both English and Hebrew, achieving satisfactory perplexity scores on both the training and validation datasets.

## How to Use
To use this model, follow these steps:

1. **Clone the repository**:
```bash
git clone https://huggingface.co/username/model-name
cd model-name