# Translation Model Evaluation with BLEU and ChrF Scores

## Project Overview

This project focuses on training and evaluating a sequence-to-sequence (Seq2Seq) model and an LSTM model for translation tasks. We compare the performance of these models using two standard evaluation metrics: the BLEU score and the ChrF score. These scores measure the quality of machine-generated translations against human reference translations.
## Key Highlights

- **Training Models:** We trained two models (LSTM and Seq2Seq) on a translation dataset for 20 epochs.
- **Evaluation Metrics:** We calculated BLEU and ChrF scores at each epoch to monitor model performance.
- **Plotting and Comparison:** After training, we plotted the BLEU and ChrF scores to compare the performance of both models across epochs.
- **Dataset:** The models were trained on a custom translation dataset with parallel text in English and another language (e.g., Hindi or Spanish).

## Models Used
### Seq2Seq Model

A sequence-to-sequence model with an encoder-decoder architecture. It uses LSTM layers for both the encoder and the decoder, and was trained on the translation dataset for 20 epochs. A minimal sketch of this kind of architecture follows.
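The card does not include the model-definition code, so the snippet below is only an illustrative Keras sketch of an LSTM encoder-decoder of this kind; the vocabulary sizes, embedding dimension, and hidden-unit count are assumptions, not the project's actual hyperparameters.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative hyperparameters -- the real values depend on the dataset.
src_vocab, tgt_vocab, embed_dim, units = 8000, 8000, 256, 512

# Encoder: embed the source tokens and keep only the final LSTM states.
enc_in = layers.Input(shape=(None,), name="source_tokens")
enc_emb = layers.Embedding(src_vocab, embed_dim, mask_zero=True)(enc_in)
_, state_h, state_c = layers.LSTM(units, return_state=True)(enc_emb)

# Decoder: initialised with the encoder states, predicts the target sequence.
dec_in = layers.Input(shape=(None,), name="target_tokens")
dec_emb = layers.Embedding(tgt_vocab, embed_dim, mask_zero=True)(dec_in)
dec_out, _, _ = layers.LSTM(units, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c]
)
probs = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = tf.keras.Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```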
### LSTM Model

A simplified LSTM model used for sequence modeling in translation tasks, trained on the same dataset for comparison with the Seq2Seq model.

## Evaluation Metrics

### BLEU Score

The BLEU (Bilingual Evaluation Understudy) score is a metric for evaluating the quality of machine-translated text. It compares the n-grams in the predicted translation with those in the reference translations.
### ChrF Score

The ChrF score is based on character n-grams and evaluates translation quality at a finer granularity than word-based metrics. It has been shown to work well for languages with rich morphology. A small worked comparison of the two metrics follows.
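As a concrete illustration (not part of the project code), here is a toy sentence-level comparison using NLTK; the example sentences are made up.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.chrf_score import sentence_chrf

reference = "the cats sat on the mat".split()
hypothesis = "the cat sits on the mat".split()

# BLEU matches whole-word n-grams, so "cats"/"cat" and "sat"/"sits" get no credit.
bleu = sentence_bleu([reference], hypothesis,
                     smoothing_function=SmoothingFunction().method1)

# ChrF matches character n-grams, so those near-miss word forms earn partial credit.
chrf = sentence_chrf(reference, hypothesis)

print(f"BLEU: {bleu:.3f}  ChrF: {chrf:.3f}")
```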
## Steps Involved in the Project
### 1. Data Preprocessing

- **Tokenization:** The dataset was tokenized to convert the raw text into sequences of tokens.
- **Padding:** Sequences were padded to a fixed length for input into the models.
- **Encoding:** Both source and target sequences were encoded using integer indices.

A sketch of this pipeline is shown below.
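The preprocessing code itself is not included in the card; the following is a minimal sketch using Keras utilities, with a hypothetical two-sentence corpus standing in for the real dataset files.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical parallel corpus; the real project uses its own dataset.
src_texts = ["hello world", "how are you"]
tgt_texts = ["hola mundo", "como estas"]

# Tokenization + integer encoding for the source side.
src_tok = Tokenizer(filters="", lower=True)
src_tok.fit_on_texts(src_texts)
src_seqs = src_tok.texts_to_sequences(src_texts)

# Same for the target side.
tgt_tok = Tokenizer(filters="", lower=True)
tgt_tok.fit_on_texts(tgt_texts)
tgt_seqs = tgt_tok.texts_to_sequences(tgt_texts)

# Padding to a fixed length so batches have a uniform shape.
max_len = 20
encoder_input = pad_sequences(src_seqs, maxlen=max_len, padding="post")
decoder_input = pad_sequences(tgt_seqs, maxlen=max_len, padding="post")
```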
### 2. Model Training

- **Seq2Seq Model:** Trained for 20 epochs using the training data.
- **LSTM Model:** Also trained on the same data for comparison with the Seq2Seq model.

With the Keras model sketched earlier, a training call might look like the example below.
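This assumes the `model`, `encoder_input`, and `decoder_input` from the earlier sketches, and that the target sequences begin with a start-of-sequence token; batch size and validation split are illustrative, not the project's settings.

```python
import numpy as np

# Teacher forcing: the decoder sees the target shifted right and
# predicts the same sequence shifted left by one position.
decoder_target = np.zeros_like(decoder_input)
decoder_target[:, :-1] = decoder_input[:, 1:]

history = model.fit(
    [encoder_input, decoder_input],
    decoder_target,
    batch_size=64,
    epochs=20,
    validation_split=0.1,
)
```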
### 3. Evaluation

After each epoch, predictions were generated for the validation data using both models, and BLEU and ChrF scores were calculated to track each model's translation quality. One common way to hook this into Keras training is a custom callback, sketched below.
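The card does not show how the per-epoch scores were collected; this is a hypothetical callback pattern, where `decode_fn` stands in for whatever decoding helper the project actually uses to turn model outputs into token lists.

```python
import tensorflow as tf
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.chrf_score import corpus_chrf

class TranslationScores(tf.keras.callbacks.Callback):
    """Decode the validation set after every epoch and record BLEU/ChrF."""

    def __init__(self, decode_fn, val_sources, val_references):
        super().__init__()
        self.decode_fn = decode_fn            # hypothetical helper: (model, source) -> token list
        self.val_sources = val_sources
        self.val_references = val_references  # list of reference token lists
        self.history = []

    def on_epoch_end(self, epoch, logs=None):
        hypotheses = [self.decode_fn(self.model, src) for src in self.val_sources]
        bleu = corpus_bleu([[ref] for ref in self.val_references], hypotheses)
        chrf = corpus_chrf(self.val_references, hypotheses)
        self.history.append((bleu, chrf))
        print(f"epoch {epoch + 1}: BLEU={bleu:.3f} ChrF={chrf:.3f}")
```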
### 4. Saving and Plotting Results

Both models were saved after each epoch, and the BLEU and ChrF scores were plotted for visual comparison (see the plotting sketch below).
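A minimal plotting sketch, assuming matplotlib (which is not in the dependency list below) and using the ChrF values from the results tables that follow:

```python
import matplotlib.pyplot as plt

epochs = [1, 2, 3, 4, 5]
chrf_a = [0.20, 0.35, 0.43, 0.54, 0.55]  # Model A (Seq2Seq)
chrf_b = [0.26, 0.33, 0.43, 0.52, 0.57]  # Model B (LSTM)

plt.plot(epochs, chrf_a, marker="o", label="Model A (Seq2Seq)")
plt.plot(epochs, chrf_b, marker="s", label="Model B (LSTM)")
plt.xlabel("Epoch")
plt.ylabel("ChrF score")
plt.title("ChrF score per epoch")
plt.legend()
plt.show()
```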
## Results

Here are the ChrF and BLEU scores for both models after training for 5 epochs:

### ChrF Scores

| Epoch | Model A (Seq2Seq) | Model B (LSTM) |
|-------|-------------------|----------------|
| 1     | 0.20              | 0.26           |
| 2     | 0.35              | 0.33           |
| 3     | 0.43              | 0.43           |
| 4     | 0.54              | 0.52           |
| 5     | 0.55              | 0.57           |

### BLEU Scores

| Epoch | Model A (Seq2Seq) | Model B (LSTM) |
|-------|-------------------|----------------|
| 1     | 0.28              | 0.33           |
| 2     | 0.36              | 0.38           |
| 3     | 0.41              | 0.43           |
| 4     | 0.47              | 0.50           |
| 5     | 0.52              | 0.55           |

## Final Observations

- Model B (LSTM) showed slightly better ChrF and BLEU scores, especially by the 5th epoch.
- Both models improved over the epochs, indicating that training for more epochs helps translation quality.

## How to Use

To use the trained models, follow these steps:
1. **Download the Model:** Use the Hugging Face interface to download the trained models.
2. **Load the Model in Code:**
   ```python
   from tensorflow.keras.models import load_model

   model = load_model('path_to_model')  # replace 'path_to_model' with the downloaded model path
   ```

3. **Generate Predictions:**
   ```python
   predictions = model.predict([input_data])
   ```

4. **Calculate Scores:** You can calculate BLEU and ChrF scores using the provided functions:
   ```python
   from nltk.translate.bleu_score import corpus_bleu
   from nltk.translate.chrf_score import corpus_chrf

   # corpus_bleu expects a list of reference lists per hypothesis;
   # corpus_chrf pairs each hypothesis with a single reference.
   bleu_score = corpus_bleu([[ref] for ref in references], hypotheses)
   chrf_score = corpus_chrf(references, hypotheses)
   ```

## Future Work

- **Model Optimization:** Experiment with advanced architectures like Transformer or BERT for better performance.
- **Evaluation on Real Data:** Test the models on a larger, real-world dataset to further assess their robustness.
- **Fine-Tuning:** Fine-tune the models with pre-trained embeddings or pre-trained models for domain-specific translation tasks.

## Dependencies

- `tensorflow`: model training and evaluation.
- `nltk`: BLEU and ChrF score calculation.
- `huggingface_hub`: uploading models and datasets to Hugging Face.
- `numpy`: numerical operations.