---
language: en
license: cc-by-4.0
tags:
- text-classification
repo: https://github.com/AAP9002/COMP34812-NLU-NLI

---

# Model Card for z72819ap-e91802zc-NLI

<!-- Provide a quick summary of what the model is/does. -->

This is a binary classification model trained to predict whether a given premise entails a given hypothesis.


## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This model is an ensemble of RoBERTa-based models that were fine-tuned on over 24K premise-hypothesis pairs from the shared task dataset for Natural Language Inference (NLI).

- **Developed by:** Alan Prophett and Zac Curtis
- **Language(s):** English
- **Model type:** Supervised
- **Model architecture:** Transformers
- **Finetuned from model:** roberta-base

### Model Resources

<!-- Provide links where applicable. -->

- **Repository:** https://huggingface.co/FacebookAI/roberta-base
- **Paper or documentation:** https://arxiv.org/abs/1907.11692

## Training Details

### Training Data

<!-- This is a short stub of information on the training data that was used, and documentation related to data pre-processing or additional filtering (if applicable). -->

24K+ premise-hypothesis pairs from the shared task dataset provided for Natural Language Inference (NLI).
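The file layout of the shared-task data is not documented here; as an illustration only (the column names are assumptions, not the dataset's actual schema), each training example can be thought of as a premise-hypothesis pair with a binary label:

```python
import pandas as pd

# Hypothetical layout of the shared-task NLI data: each row pairs a premise
# with a hypothesis and a binary label (1 = entailment, 0 = no entailment).
# Column names here are illustrative assumptions.
train = pd.DataFrame({
    "premise": ["A man is playing a guitar.", "The cat sleeps on the sofa."],
    "hypothesis": ["A person is making music.", "The cat is running outside."],
    "label": [1, 0],
})

print(train["label"].value_counts().to_dict())
```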

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Training Hyperparameters

<!-- This is a summary of the values of hyperparameters used in training the model. -->


    All Models and datasets
      - seed: 42

    RoBERTa Large NLI Binary Classification Model
      - learning_rate: 2e-05
      - train_batch_size: 16
      - eval_batch_size: 16
      - num_epochs: 5

    Semantic Textual Similarity Binary Classification Model
      - learning_rate: 2e-05
      - train_batch_size: 16
      - eval_batch_size: 16
      - num_epochs: 5

    Ensemble Meta Model
      - learning_rate: 2e-05
      - train_batch_size: 128
      - eval_batch_size: 16
      - num_epochs: 3
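The card does not specify how the meta model combines the two base classifiers. A common choice is stacking, where a small classifier is trained on the base models' predicted probabilities; the sketch below illustrates the idea with scikit-learn and synthetic stand-in probabilities, and is not the project's actual implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)  # seed 42, matching the card

# Synthetic stand-ins for the two base models' predicted probability of
# "entailment" on a held-out split (in practice these would come from the
# RoBERTa NLI model and the semantic-similarity model).
n = 200
labels = rng.integers(0, 2, size=n)
p_nli = np.clip(labels + rng.normal(0, 0.3, n), 0, 1)  # noisy but informative
p_sts = np.clip(labels + rng.normal(0, 0.4, n), 0, 1)

# Stack the base predictions as input features for the meta model.
meta_features = np.column_stack([p_nli, p_sts])
meta = LogisticRegression().fit(meta_features, labels)

print(f"meta-model accuracy: {meta.score(meta_features, labels):.2f}")
```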
      

#### Speeds, Sizes, Times

<!-- This section provides information about how roughly how long it takes to train the model and the size of the resulting model. -->


      - overall training time: 309 minutes 30 seconds

    RoBERTa Large NLI Binary Classification Model
      - duration per training epoch: 11 minutes
      - model size: 1.42 GB

    Semantic Textual Similarity Binary Classification Model
      - duration per training epoch: 4 minutes 30 seconds
      - model size: 501 MB

    Ensemble Meta Model
      - duration per training epoch: 4 minutes
      - model size: 1.92 GB

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data & Metrics

#### Testing Data

<!-- This should describe any evaluation data used (e.g., the development/validation set provided). -->

A subset of the development set provided, amounting to 5.3k+ pairs for validation and 1.3k+ for testing.

#### Metrics

<!-- These are the evaluation metrics being used. -->


      - Precision
      - Recall
      - F1-score
      - Accuracy
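Macro averaging weights both classes equally, while weighted averaging weights each class by its support; on a balanced binary task the two coincide. The illustrative scikit-learn snippet below (made-up labels and predictions, not the model's outputs) shows both:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative gold labels and predictions for the binary entailment task.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

for avg in ("macro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=avg)
    print(f"{avg}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy: {accuracy_score(y_true, y_pred):.2f}")
```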

### Results


      The Ensemble Model obtained an F1-score of 91% and an accuracy of 91%.

      Validation set
      - Macro Precision: 91.0%
      - Macro Recall: 91.0%
      - Macro F1-score: 91.0%
      - Weighted Precision: 91.0%
      - Weighted Recall: 91.0%
      - Weighted F1-score: 91.0%
      - Accuracy: 91.0%
      - Support: 5389

      Test set
      - Macro Precision: 91.0%
      - Macro Recall: 91.0%
      - Macro F1-score: 91.0%
      - Weighted Precision: 91.0%
      - Weighted Recall: 91.0%
      - Weighted F1-score: 91.0%
      - Accuracy: 91.0%
      - Support: 1347
      

## Technical Specifications

### Hardware


      - RAM: at least 10 GB
      - Storage: at least 4 GB
      - GPU: NVIDIA A100 (40 GB)

### Software


      - TensorFlow 2.18.0+cu12.4
      - Transformers 4.50.3
      - Pandas 2.2.2
      - NumPy 2.0.2
      - Seaborn 0.13.2
      - Huggingface_hub 0.30.1
      - Matplotlib 3.10.0
      - Scikit-learn 1.6.1

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

Any input (the concatenation of the two sequences) longer than 512 subwords will be truncated by the model.
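The effect of this limit can be illustrated with a simplified sketch, in which plain token lists stand in for the real subword tokenizer and special tokens are ignored:

```python
MAX_LEN = 512  # RoBERTa's maximum input length in subword tokens

def truncate_pair(premise_tokens, hypothesis_tokens, max_len=MAX_LEN):
    """Concatenate a tokenized premise and hypothesis, keeping only the
    first max_len tokens; anything beyond the limit is discarded."""
    combined = premise_tokens + hypothesis_tokens
    return combined[:max_len]

# A 500-token premise plus a 100-token hypothesis exceeds the limit,
# so the tail of the hypothesis is silently dropped.
kept = truncate_pair(["tok"] * 500, ["tok"] * 100)
print(len(kept))
```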

## Additional Information

<!-- Any other information that would be useful for other people to know. -->

The hyperparameters were determined by experimenting with different values.