This is a text classification model, fully fine-tuned from `allenai/scibert_scivocab_uncased`. It reuses the main BERT encoder and fits an ordinal regression head on the `[CLS]` token. The model is fine-tuned on the certainty labels collected in [Wührl et al. (2024): _Understanding Fine-Grained Distortions in Reports of Scientific Findings_](https://aclanthology.org/2024.findings-acl.369/). The authors originally collected certainty annotations from humans on a 4-point Likert scale ranging from (1) Uncertain to (4) Certain. Because the resulting dataset suffers from severe class imbalance, we merge classes (1) Uncertain and (2) Somewhat Uncertain into a single class.
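To make the "ordinal regression head" concrete, here is an illustrative sketch of one common formulation (a CORAL-style head: a shared score from the `[CLS]` embedding plus K-1 learned thresholds). This is an assumption for illustration only; the model's actual head ships in its remote code and may differ.

```python
import numpy as np

def ordinal_head_predict(cls_embedding, w, thresholds):
    """CORAL-style ordinal prediction (illustrative sketch, not the shipped head).

    A single scalar score is computed from the [CLS] embedding, then compared
    against K-1 thresholds; the predicted rank is the number of thresholds
    the score clears (via sigmoid > 0.5).
    """
    score = cls_embedding @ w                    # scalar logit
    cum_logits = score + thresholds              # K-1 cumulative logits
    probs = 1.0 / (1.0 + np.exp(-cum_logits))    # P(y > k) for each threshold k
    return int((probs > 0.5).sum())              # predicted rank in {0, ..., K-1}

# Toy example with a stand-in 768-dim [CLS] embedding and K=3 classes (2 thresholds)
rng = np.random.default_rng(0)
emb = rng.normal(size=768)
w = rng.normal(size=768) * 0.01
pred = ordinal_head_predict(emb, w, np.array([1.0, -1.0]))
```

Note that predictions from such a head are monotone by construction: a higher score can only move the prediction toward the "Certain" end of the scale.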
### Dataset Statistics
There are 1330 examples in the training set and 334 in the test set.
Each example is a single sentence.
Examples are filtered from the [copenlu/spiced](https://huggingface.co/datasets/copenlu/spiced) dataset to those with a final score greater than or equal to 4.
The original base rates are as follows:
| Class | Base Rate in Training Set (%) | Base Rate in Test Set (%) |
| ----- | ----------------------------- | ------------------------- |
| 0 - Uncertain | 5.5970 | 7.1856 |
| 1 - Somewhat Uncertain | 15.2985 | 17.6647 |
| 2 - Somewhat Certain | 32.3881 | 33.2335 |
| 3 - Certain | 46.7164 | 41.9162 |
After combining classes 0 and 1, we obtain the base rates below. Note that this mimics the procedure adopted in the original paper.
| Class | Base Rate in Training Set (%) | Base Rate in Test Set (%) |
| ----- | ----------------------------- | ------------------------- |
| 0 - Uncertain | 20.8955 | 24.8503 |
| 1 - Somewhat Certain | 32.3881 | 33.2335 |
| 2 - Certain | 46.7164 | 41.9162 |
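The class merging above amounts to a simple relabeling of the 0-indexed 4-point scale. A minimal sketch (label values assumed to be 0-indexed as in the tables above):

```python
# Merge the four 0-indexed Likert labels into three ordinal classes:
# 0 (Uncertain) and 1 (Somewhat Uncertain) collapse into class 0.
MERGE = {0: 0, 1: 0, 2: 1, 3: 2}

def merge_label(likert_label: int) -> int:
    return MERGE[likert_label]

labels = [0, 1, 2, 3, 3, 1]
merged = [merge_label(y) for y in labels]  # -> [0, 0, 1, 2, 2, 0]
```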
### Hyperparameter Optimization
The published model is one of 29 different configurations. The selected model maximizes Quadratic Weighted Kappa (implemented using [cohen_kappa_score with quadratic weights](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html)), which is better suited to ordinal problems such as Likert scales. Under this metric, a random model would score 0. We adopt this metric, as opposed to accuracy or macro F1, to address class imbalance.
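The metric can be computed directly with the scikit-learn function linked above. The toy labels below (my own, for illustration) show why quadratic weighting suits ordinal scales: predictions that miss by two classes are penalized more heavily than off-by-one errors.

```python
from sklearn.metrics import cohen_kappa_score

y_true = [0, 0, 1, 1, 2, 2]
y_near = [0, 1, 1, 1, 2, 2]  # one off-by-one error
y_far = [2, 0, 1, 1, 2, 0]   # two errors, each two classes away

qwk_near = cohen_kappa_score(y_true, y_near, weights="quadratic")
qwk_far = cohen_kappa_score(y_true, y_far, weights="quadratic")
# Off-by-one mistakes hurt QWK far less than distant mistakes.
```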
Here is the classification report and test set metrics:
```
17:44:36 INFO test loss=0.9565 acc=0.578 QWK=0.5004
17:44:36 INFO
              precision    recall  f1-score   support

           0       0.58      0.51      0.54        83
           1       0.47      0.46      0.46       111
           2       0.65      0.71      0.68       140

    accuracy                           0.58       334
   macro avg       0.57      0.56      0.56       334
weighted avg       0.57      0.58      0.57       334
```
We conduct a hyperparameter sweep over the following hyperparameters:
- Encoder: frozen / unfrozen
- Learning rate: 1e-6 through 1e-3
- Batch size: 16, 32
- Hidden size dimensions: 256, 128
- Warmup ratio: 0.05, 0.1, 0.2, 0.3
- Epochs: 30 (with early-stopping patience)
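For reference, the sweep space above can be enumerated as a grid. The specific learning-rate values below are illustrative samples from the stated range, and since only 29 configurations were trained, the actual sweep was clearly a subset of the full grid rather than exhaustive:

```python
from itertools import product

# Illustrative grid over the sweep dimensions listed above.
# The four learning rates are example points spanning 1e-6 through 1e-3.
freeze = [True, False]
learning_rates = [1e-6, 1e-5, 1e-4, 1e-3]
batch_sizes = [16, 32]
hidden_sizes = [256, 128]
warmup_ratios = [0.05, 0.1, 0.2, 0.3]

configs = [
    dict(freeze=f, lr=lr, batch_size=bs, hidden_size=h, warmup_ratio=w)
    for f, lr, bs, h, w in product(
        freeze, learning_rates, batch_sizes, hidden_sizes, warmup_ratios
    )
]
# Full grid size: 2 * 4 * 2 * 2 * 4 = 128 configurations;
# only 29 of these were actually trained.
```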
## Usage
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("cbelem/scibert-certainty-ordinal", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("cbelem/scibert-certainty-ordinal", trust_remote_code=True)
```
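A minimal inference sketch follows. Two assumptions are made here: that the remote-code head exposes standard `logits`, and that the label order matches the merged 3-class scheme above; verify both against `model.config.id2label` before relying on this.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "cbelem/scibert-certainty-ordinal", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "cbelem/scibert-certainty-ordinal", trust_remote_code=True
)

# Assumed label order (check model.config.id2label): merged 3-class scheme.
ID2LABEL = {0: "Uncertain", 1: "Somewhat Certain", 2: "Certain"}

sentence = "These results suggest a possible link between the two variables."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(ID2LABEL[logits.argmax(dim=-1).item()])
```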