---
language: en
license: cc-by-4.0
tags:
- text-classification
repo: https://github.com/AAP9002/COMP34812-NLU-NLI
---

# Model Card for z72819ap-e91802zc-NLI

This is a binary classification model that was trained to detect whether a premise and a hypothesis entail each other.

## Model Details

### Model Description

This model is based on an ensemble of RoBERTa models that was fine-tuned using over 24K premise-hypothesis pairs from the shared task dataset for Natural Language Inference (NLI).

- **Developed by:** Alan Prophett and Zac Curtis
- **Language(s):** English
- **Model type:** Supervised
- **Model architecture:** Transformers
- **Finetuned from model:** roberta-base

### Model Resources

- **Repository:** https://huggingface.co/FacebookAI/roberta-base
- **Paper or documentation:** https://arxiv.org/abs/1907.11692

## Training Details

### Training Data

24K+ premise-hypothesis pairs from the shared task dataset provided for Natural Language Inference (NLI).
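The ensemble described above combines the outputs of fine-tuned RoBERTa models through a trained meta-model. The card does not reproduce that code, but the stacking idea can be sketched on toy data; the variable names and the use of a logistic-regression meta-classifier are illustrative assumptions, not the actual implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical illustration: entailment probabilities assigned by two
# base models (e.g. an NLI model and a semantic-similarity model) to the
# same premise-hypothesis pairs. In the real system these would come
# from the fine-tuned RoBERTa models.
rng = np.random.default_rng(42)
labels = rng.integers(0, 2, size=200)  # gold 0/1 entailment labels
p_nli = np.clip(labels + rng.normal(0, 0.3, 200), 0, 1)
p_sts = np.clip(labels + rng.normal(0, 0.4, 200), 0, 1)

# Stack the base-model scores as features for a meta-classifier.
X = np.column_stack([p_nli, p_sts])
meta = LogisticRegression().fit(X, labels)

preds = meta.predict(X)
accuracy = (preds == labels).mean()
print(f"meta-model accuracy on the toy data: {accuracy:.2f}")
```

The meta-model learns how much weight to give each base model's score, which is the motivation for stacking rather than simple probability averaging.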
### Training Procedure

#### Training Hyperparameters

All models and datasets:

- seed: 42

RoBERTa-large NLI binary classification model:

- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- num_epochs: 5

Semantic Textual Similarity binary classification model:

- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- num_epochs: 5

Ensemble meta-model:

- learning_rate: 2e-05
- train_batch_size: 128
- eval_batch_size: 16
- num_epochs: 3

#### Speeds, Sizes, Times

- overall training time: 309 minutes 30 seconds

RoBERTa-large NLI binary classification model:

- duration per training epoch: 11 minutes
- model size: 1.42 GB

Semantic Textual Similarity binary classification model:

- duration per training epoch: 4 minutes 30 seconds
- model size: 501 MB

Ensemble meta-model:

- duration per training epoch: 4 minutes
- model size: 1.92 GB

## Evaluation

### Testing Data & Metrics

#### Testing Data

A subset of the provided development set: 5.3K+ pairs for validation and 1.3K+ pairs for testing.

#### Metrics

- Precision
- Recall
- F1-score
- Accuracy

### Results

The ensemble model obtained an F1-score of 91% and an accuracy of 91%.

Validation set:

- Macro precision: 91.0%
- Macro recall: 91.0%
- Macro F1-score: 91.0%
- Weighted precision: 91.0%
- Weighted recall: 91.0%
- Weighted F1-score: 91.0%
- Accuracy: 91.0%
- Support: 5389

Test set:

- Macro precision: 91.0%
- Macro recall: 91.0%
- Macro F1-score: 91.0%
- Weighted precision: 91.0%
- Weighted recall: 91.0%
- Weighted F1-score: 91.0%
- Accuracy: 91.0%
- Support: 1347

## Technical Specifications

### Hardware

- RAM: at least 10 GB
- Storage: at least 4 GB
- GPU: NVIDIA A100 (40 GB)

### Software

- TensorFlow 2.18.0+cu12.4
- Transformers 4.50.3
- Pandas 2.2.2
- NumPy 2.0.2
- Seaborn 0.13.2
- huggingface_hub 0.30.1
- Matplotlib 3.10.0
- scikit-learn 1.6.1

## Bias, Risks, and Limitations

Any input (the concatenation of a premise-hypothesis pair) longer than 512 subword tokens is truncated by the model.
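The effect of the 512-subword limit can be sketched in plain Python. The stand-in token-id lists and the `truncate_pair` helper below are illustrative only; the real tokenizer also reserves slots for special tokens and offers smarter truncation strategies such as `longest_first`:

```python
MAX_LEN = 512  # RoBERTa's maximum input length in subword tokens

def truncate_pair(premise_ids, hypothesis_ids, max_len=MAX_LEN):
    """Concatenate two token-id sequences and cut the result to max_len.

    A simplified stand-in for what the tokenizer does when a
    premise-hypothesis pair exceeds the model's context window.
    """
    combined = premise_ids + hypothesis_ids
    return combined[:max_len]

premise = list(range(300))     # pretend these are subword ids
hypothesis = list(range(400))
ids = truncate_pair(premise, hypothesis)
print(len(ids))  # 512: the tail of the hypothesis is silently dropped
```

The practical consequence is that for very long pairs the model never sees the end of the hypothesis, which can affect predictions on such inputs.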
## Additional Information

The hyperparameters were selected by experimenting with a range of candidate values and keeping the best-performing configuration.
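That kind of experimentation can be sketched as a small grid search. The toy task and the use of scikit-learn's `GridSearchCV` are assumptions for illustration; the actual search was over fine-tuning settings such as learning rate, batch size, and epoch count:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy stand-in task: a linearly separable binary problem.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Candidate values to try, analogous to sweeping learning rates.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Each candidate is scored by cross-validation and the best-scoring configuration is retained, mirroring the trial-and-error process described above.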