This is a text classification model, fully fine-tuned from [`allenai/scibert_scivocab_uncased`](https://huggingface.co/allenai/scibert_scivocab_uncased). It reuses the main BERT encoder and fits an ordinal regression head on the `[CLS]` token. The model is fine-tuned on the certainty labels collected in [Wührl et al. (2024): _Understanding Fine-Grained Distortions in Reports of Scientific Findings_](https://aclanthology.org/2024.findings-acl.369/). The authors originally collected certainty annotations from humans on a 4-point Likert scale ranging from (1) Uncertain to (4) Certain. Because the resulting dataset suffers from severe class imbalance, we merge the classes (1) Uncertain and (2) Somewhat Uncertain.
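The head architecture is not specified here beyond being an ordinal regression head on the `[CLS]` embedding. A common cumulative-link (CORAL-style) formulation, sketched below in NumPy purely for illustration (the released model's actual head may differ), computes one shared score per example and K−1 ordered cutpoints:

```python
import numpy as np

def ordinal_head(cls_emb, w, b, cutpoints):
    """Cumulative-link ordinal head: P(y > k) for each class boundary k.

    cls_emb:   (batch, hidden) [CLS] embeddings
    w, b:      weights of a single shared linear score
    cutpoints: (K-1,) ordered thresholds between adjacent classes
    """
    score = cls_emb @ w + b                                          # (batch,)
    # Sigmoid of (score - cutpoint) gives P(y > k); monotone in the score.
    return 1.0 / (1.0 + np.exp(-(score[:, None] - cutpoints[None, :])))

rng = np.random.default_rng(0)
probs = ordinal_head(rng.normal(size=(2, 768)),
                     rng.normal(size=768), 0.0,
                     np.array([-1.0, 1.0]))
print(probs.shape)  # (2, 2): one P(y > k) per boundary
```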
|
|
|
|
|
|
### Dataset Statistics
|
|
There are 1,330 examples in the training set and 334 in the test set.
Each example is one sentence long.
Examples are filtered from the [copenlu/spiced](https://huggingface.co/datasets/copenlu/spiced) dataset to those with a final score greater than or equal to 4.
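The filtering step can be sketched in plain Python; the records and the `final_score` field name below are hypothetical, as the actual spiced schema may use different keys:

```python
# Hypothetical records mimicking the annotated-sentence format.
examples = [
    {"sentence": "The drug reduced symptoms in all trials.", "final_score": 4.5},
    {"sentence": "Results might generalize to other domains.", "final_score": 3.0},
    {"sentence": "We find a strong positive correlation.", "final_score": 4.0},
]

# Keep only examples whose final score is >= 4.
kept = [ex for ex in examples if ex["final_score"] >= 4]
print(len(kept))  # 2
```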
|
|
The original base rates are as follows:
|
|
| Class | Base Rate in Training Set (%) | Base Rate in Test Set (%) |
| ----- | ----------------------------- | ------------------------- |
| 0 - Uncertain | 5.5970 | 7.1856 |
| 1 - Somewhat Uncertain | 15.2985 | 17.6647 |
| 2 - Somewhat Certain | 32.3881 | 33.2335 |
| 3 - Certain | 46.7164 | 41.9162 |
|
|
After combining classes 0 and 1, we obtain the base rates below. Note that this mimics the procedure adopted in the original paper.
|
|
| Class | Base Rate in Training Set (%) | Base Rate in Test Set (%) |
| ----- | ----------------------------- | ------------------------- |
| 0 - Uncertain | 20.8955 | 24.8503 |
| 1 - Somewhat Certain | 32.3881 | 33.2335 |
| 2 - Certain | 46.7164 | 41.9162 |
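The merge amounts to a simple label remapping (a sketch; the released preprocessing code may differ):

```python
from collections import Counter

# Original 4-point labels: 0 Uncertain, 1 Somewhat Uncertain,
# 2 Somewhat Certain, 3 Certain. Classes 0 and 1 collapse into one.
MERGE = {0: 0, 1: 0, 2: 1, 3: 2}

def merge_labels(labels):
    return [MERGE[y] for y in labels]

# Toy labels, not the real dataset distribution.
labels = [0, 1, 1, 2, 2, 3, 3, 3, 3, 3]
merged = merge_labels(labels)
rates = {k: 100 * v / len(merged) for k, v in sorted(Counter(merged).items())}
print(rates)  # {0: 30.0, 1: 20.0, 2: 50.0}
```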
|
|
|
|
|
|
|
|
### Hyperparameter Optimization
|
|
The published model is one of 29 different configurations evaluated during hyperparameter search. The selected configuration maximizes Quadratic Weighted Kappa (QWK, implemented via [`cohen_kappa_score` with quadratic weights](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html)), which is better suited to ordinal scales because it penalizes predictions by their distance from the true class. Under this metric, a random model would score 0. We adopt QWK instead of accuracy or macro F1 to address class imbalance.
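For reference, the metric can be computed directly with scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Quadratic weights penalize errors by the squared distance between classes,
# so predicting 0 for a true 2 costs more than predicting 1.
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(round(qwk, 4))  # 0.2857

# Perfect agreement scores 1; chance-level agreement scores 0.
perfect = cohen_kappa_score(y_true, y_true, weights="quadratic")
print(perfect)  # 1.0
```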
|
|
Here are the test set metrics and classification report:
|
|
```
17:44:36 INFO test loss=0.9565 acc=0.578 QWK=0.5004
17:44:36 INFO
              precision    recall  f1-score   support

           0       0.58      0.51      0.54        83
           1       0.47      0.46      0.46       111
           2       0.65      0.71      0.68       140

    accuracy                           0.58       334
   macro avg       0.57      0.56      0.56       334
weighted avg       0.57      0.58      0.57       334
```
|
|
|
|
We conduct a hyperparameter sweep over the following hyperparameters:
|
|
- Freeze / unfreeze the encoder
- Learning rate: 1e-6 through 1e-3
- Batch size: 16, 32
- Hidden size dimensions: 128, 256
- Warmup ratio: 0.05, 0.1, 0.2, 0.3
- Epochs: 30 (with early-stopping patience)
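A full grid over these choices would contain 128 configurations, so the 29 evaluated runs were presumably a subset (e.g. via random sampling). The grid itself can be enumerated as follows; the specific learning-rate points are illustrative samples of the stated 1e-6 to 1e-3 range:

```python
from itertools import product

grid = {
    "freeze_encoder": [True, False],
    "lr": [1e-6, 1e-5, 1e-4, 1e-3],   # illustrative points in the range
    "batch_size": [16, 32],
    "hidden_size": [128, 256],
    "warmup_ratio": [0.05, 0.1, 0.2, 0.3],
}

# Cartesian product of all hyperparameter values, one dict per config.
configs = [dict(zip(grid, combo)) for combo in product(*grid.values())]
print(len(configs))  # 128
```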
|
|
|
|
|
|
## Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("cbelem/scibert-certainty-ordinal", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("cbelem/scibert-certainty-ordinal", trust_remote_code=True)
```
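Assuming the head exposes one logit per merged class in label order (0 Uncertain, 1 Somewhat Certain, 2 Certain) — an assumption, since the `trust_remote_code` head may return a different output format — predictions could be decoded with a small helper like this:

```python
import numpy as np

LABELS = ["Uncertain", "Somewhat Certain", "Certain"]

def decode(logits):
    """Map one row of class logits to its certainty label (assumed order)."""
    return LABELS[int(np.argmax(logits))]

# e.g. logits taken from model(**tokenizer(sentence, return_tensors="pt")).logits
print(decode(np.array([0.2, 1.4, -0.3])))  # Somewhat Certain
```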
|
|
|
|
|
|