---
language:
- en
- he
---

# Bilingual Language Model for Next Token Prediction

## Overview

This project builds a neural-network language model for next-token prediction in two languages: **English** and **Hebrew**. The model is a recurrent neural network (RNN) with an LSTM (Long Short-Term Memory) architecture, trained to predict the next word in a sequence from the provided training data, and is evaluated with the **perplexity** metric. The final model and checkpoints are provided, along with the training history (perplexity and loss values).

## Model Architecture

- **Embedding Layer**: Converts tokenized words into dense vector representations.
- **LSTM Layer**: 128 units to capture long-term dependencies in the sequence data.
- **Dense Output Layer**: Outputs a probability distribution over the vocabulary to predict the next word.
- **Total Vocabulary Size**: The model is trained with a vocabulary of size `[total_words]` (combining the English and Hebrew datasets).

## Dataset

The model is trained on a combination of English and Hebrew text datasets. Input sequences are tokenized and padded to a consistent length for training.

## Training

The model was trained with the following parameters:

- **Optimizer**: Adam
- **Loss Function**: Categorical Crossentropy
- **Batch Size**: 64
- **Epochs**: 20
- **Validation Split**: 20%

## Evaluation Metric: Perplexity

Perplexity measures the model's performance; lower perplexity indicates better generalization to unseen data. The final perplexity scores are:

- **Final Training Perplexity**: `[Final Training Perplexity]`
- **Final Validation Perplexity**: `[Final Validation Perplexity]`

## Checkpoints

A checkpoint mechanism saves the model at its best-performing stage based on validation loss.
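A minimal Keras sketch of this setup, tying together the architecture, training parameters, and checkpoint callback described above. The embedding dimension, `total_words`, `max_sequence_len`, and the arrays `X`/`y` are illustrative placeholders, not values from this repository:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.callbacks import ModelCheckpoint

# Placeholder sizes; the real values come from the tokenized corpus.
total_words = 500        # vocabulary size (English + Hebrew)
max_sequence_len = 10    # padded input length

# Architecture described above: Embedding -> LSTM(128) -> softmax over vocab.
model = Sequential([
    Embedding(total_words, 64),
    LSTM(128),
    Dense(total_words, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# Save only the best model (lowest validation loss) to best_model.keras.
checkpoint = ModelCheckpoint(
    "best_model.keras",
    monitor="val_loss",
    save_best_only=True,
)

# Dummy data just to show the training call; the actual model used the
# bilingual corpus with batch_size=64, epochs=20, validation_split=0.2.
X = np.random.randint(0, total_words, size=(64, max_sequence_len - 1))
y = np.eye(total_words)[np.random.randint(0, total_words, size=64)]
model.fit(X, y, batch_size=64, epochs=1, validation_split=0.2,
          callbacks=[checkpoint], verbose=0)
```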
The best model checkpoint (`best_model.keras`) is included and can be loaded for inference.

## Results

The model demonstrates competitive performance in predicting next tokens for both English and Hebrew, achieving satisfactory perplexity scores on the training and validation datasets.

## How to Use

To use this model, follow these steps:

1. **Clone the repository**:

   ```bash
   git clone https://huggingface.co/username/model-name
   cd model-name
   ```
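Once the repository is cloned, the checkpoint can be loaded for next-word prediction. A hedged sketch follows; the helper name `predict_next_word`, the `word_index` mapping (as produced by a Keras `Tokenizer`), and `max_sequence_len` are assumptions, since the repository's exact preprocessing code is not shown here:

```python
import os
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

def predict_next_word(model, word_index, seed_text, max_sequence_len):
    """Return the most likely next word for seed_text.

    word_index maps words to integer ids; padding mirrors the
    fixed-length inputs the model was trained on.
    """
    token_list = [word_index[w] for w in seed_text.split() if w in word_index]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1)
    probs = model.predict(token_list, verbose=0)[0]
    predicted_id = int(np.argmax(probs))
    # Invert the word index to map the predicted id back to a word.
    id_to_word = {i: w for w, i in word_index.items()}
    return id_to_word.get(predicted_id, "")

# Load the best checkpoint saved during training (guarded so the
# sketch also runs where the file is absent).
if os.path.exists("best_model.keras"):
    model = load_model("best_model.keras")
```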