---
license: mit
datasets:
- NIHRDataInsights/HRCSData
tags:
- text-classification
- biology
- medical
---

# HRCS Research Activity Code Classifier

## Overview
This model, developed by the National Institute for Health and Care Research (NIHR), assigns HRCS Research Activity Codes to research awards using the award title and abstract (micro F1 = 0.60). When predictions are aggregated to Research Activity Groups (RAGs), performance increases to a micro F1 of 0.71. The model is a multi-label transformer classifier built on BiomedBERT-large, domain-adapted (DAPT) on healthcare grant titles and abstracts and then fine-tuned on cross-funder labelled HRCS data. It is intended to support portfolio analysis, automated tagging, and reproducible classification of biomedical research funding.

## Model details
* **Base model:** `microsoft/BiomedNLP-BiomedBERT-large-uncased-abstract`
* **Architecture:** Transformer encoder + multi-label classification head
* **Task:** Multi-label text classification
* **Input:** Award title + abstract
* **Output:** Probability per Research Activity Code

## Training approach
The model was trained in two stages on a 24 GB GPU.

### Domain-adaptive pretraining (DAPT)
We continued masked language modelling on grant titles and abstracts to adapt the encoder to research-funding language rather than publication language. The data used was a healthcare-funder-specific subset of Gomez Magenti, J. (2025) ‘Harmonised datasets of research project grants from UK and European funders’. Zenodo. doi:10.5281/zenodo.15479412.

**Settings:**
* Max sequence length: 512
* Mask probability: 0.15
* Epochs: 1
* Learning rate: 5e-5
* Warmup ratio: 0.01
* Weight decay: 0.01
* Effective batch size: 64
* Mixed precision: bf16/fp16
* Gradient checkpointing enabled

The adapted checkpoint was then used for supervised training.
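
The 0.15 mask probability means that, on average, 15 in 100 tokens are hidden from the model during masked language modelling. A toy sketch of the basic masking step (pure Python, illustrative only; the actual pipeline would use a collator such as `transformers`' `DataCollatorForLanguageModeling`, which also randomly replaces or keeps some selected tokens):

```python
import random

MASK_PROB = 0.15  # matches the DAPT setting above

def mask_tokens(tokens, mask_prob=MASK_PROB, mask_token="[MASK]", seed=0):
    """Replace each token with [MASK] with probability mask_prob.
    Positions with a non-None label are the ones the model must predict."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)   # model is trained to recover the original token
        else:
            masked.append(tok)
            labels.append(None)  # no loss on unmasked positions
    return masked, labels

tokens = "effect of exercise on cardiovascular outcomes in older adults".split()
masked, labels = mask_tokens(tokens)
print(masked)
```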

### Supervised fine-tuning
The adapted model was fine-tuned for multi-label classification using sigmoid outputs and binary cross-entropy loss.

**Input format:**
`AwardTitle` + newline + `AwardAbstract`

**Tokenisation:**
* Max length: 512 tokens
* Truncation enabled
* Fixed-length padding during training

**Handling class imbalance:**
A per-label weighting vector (`pos_weight`) is applied in the loss to reduce bias toward common categories.
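
The exact weighting formula is not stated here; a common choice for `pos_weight` with binary cross-entropy (e.g. PyTorch's `BCEWithLogitsLoss`) is the negative-to-positive ratio per label, sketched in plain Python:

```python
def pos_weights(label_matrix):
    """label_matrix: list of binary label vectors, one per award.
    Returns one weight per label: (#negatives / #positives), so rare
    labels contribute more to the loss. Assumes every label occurs at
    least once in the training set."""
    n = len(label_matrix)
    n_labels = len(label_matrix[0])
    weights = []
    for j in range(n_labels):
        pos = sum(row[j] for row in label_matrix)
        weights.append((n - pos) / pos)
    return weights

# Toy example: 4 awards, 2 categories; category 1 is rarer
# and therefore receives the larger weight.
labels = [[1, 0], [1, 1], [1, 0], [0, 0]]
print(pos_weights(labels))
```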

**Training configuration:**
* Learning rate: 3e-5
* Weight decay: 0.01
* Epochs: up to 20
* Batch size: 14 per device
* Gradient accumulation: 2
* Mixed precision: fp16
* Early stopping patience: 4
* Best checkpoint selected by micro-F1

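A sketch of how these settings map onto Hugging Face `TrainingArguments` (illustrative only; the output path, evaluation cadence, and metric key are assumptions, not taken from the actual training script):

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Hypothetical output directory; the real script's paths are not published.
args = TrainingArguments(
    output_dir="hrcs-rac-classifier",
    learning_rate=3e-5,
    weight_decay=0.01,
    num_train_epochs=20,               # upper bound; early stopping may end sooner
    per_device_train_batch_size=14,
    gradient_accumulation_steps=2,     # effective batch size 28 per device
    fp16=True,
    eval_strategy="epoch",             # assumed cadence
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_micro",  # assumed metric key
)
callbacks = [EarlyStoppingCallback(early_stopping_patience=4)]
```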
## Evaluation protocol
Data was split into three disjoint sets:
* **Training set** – used for optimisation
* **Validation set** – used for early stopping and threshold tuning
* **Held-out test set** – used only once for final evaluation

The test set was not used during training, checkpoint selection, or threshold tuning. The dataset used is listed at the top of the model card. Predictions are converted to labels using per-category probability thresholds tuned on the validation set. These thresholds are included in `metadata.json`.
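
Per-category threshold tuning can be sketched as a simple sweep that maximises F1 on the validation set for each label independently (plain-Python illustration; the repository's actual tuning code may differ):

```python
def f1(preds, gold):
    """Binary F1 for one category."""
    tp = sum(1 for p, g in zip(preds, gold) if p and g)
    fp = sum(1 for p, g in zip(preds, gold) if p and not g)
    fn = sum(1 for p, g in zip(preds, gold) if g and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def tune_threshold(probs, gold, grid=None):
    """Pick the probability cut-off that maximises F1 for one category."""
    grid = grid or [i / 100 for i in range(5, 96, 5)]
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        score = f1([p >= t for p in probs], gold)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t

# Toy validation data for a single category:
probs = [0.9, 0.7, 0.4, 0.2, 0.1]
gold  = [1,   1,   1,   0,   0]
print(tune_threshold(probs, gold))
```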

### Full evaluation results
Overall RAC metrics:
* **F1 micro** – 0.60
* **F1 macro** – 0.51
* **Precision micro** – 0.56
* **Recall micro** – 0.63

Overall RAG metrics:
* **F1 micro** – 0.71
* **F1 macro** – 0.68
* **Precision micro** – 0.70
* **Recall micro** – 0.73

For a comprehensive breakdown of the model's performance, including overall metrics, per-category metrics on both the validation and test sets, and per-funder metrics on the validation set, see the detailed evaluation spreadsheet included in this repository.

**[Download/view the evaluation results](https://huggingface.co/NIHRDataInsights/HRCSHealthCategories/resolve/main/evaluation/health_category_rac_evaluation_results.xlsx)** *(located in the `Files and versions` tab of this repository)*.

## Intended use
This model is intended for:
* Portfolio analysis
* Large-scale tagging of funding datasets
* Exploratory research landscape mapping
* Automation support for HRCS coding workflows

**It is not intended to completely replace expert review.**

## Limitations
* **Performance depends on similarity to the training corpus.**
* **Rare categories remain harder to detect despite class weighting.**
* **Abstract length:** Long or poorly structured abstracts may be truncated at 512 tokens.
* **Threshold calibration:** Thresholds are tuned for this dataset and may need recalibration for new domains.
* **Temporal bias:** The model was trained on data up to 2022; to avoid inflated metrics, evaluate only on awards starting in 2023 or later.
* **Annotation ambiguity and niche categories:** The model's performance reflects the historical consistency of human coding in the training data. Categories that are historically difficult for human coders to classify consistently under HRCS guidelines (such as 7.1, 8.1 and 8.3) are correspondingly harder for the model.

## Inference / How to use
A companion script is provided to run this model (and the companion health category model) on new award data.

**The script:**
1. Loads the trained model and tokenizer
2. Applies a sigmoid to obtain per-category probabilities
3. Converts probabilities to labels using the per-category thresholds stored in `metadata.json`
4. Outputs a CSV containing the predicted codes and confidence indicators

**Expected input format:**
The script expects a CSV containing at minimum: `AwardTitle`, `AwardAbstract`. Optional columns such as `ID` or `FunderAcronym` will be preserved in the output.

See the inference script in this repository for full usage details.
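
The scoring and thresholding steps can be sketched as follows. This is a minimal illustration, not the companion script: the category codes and the key layout of `metadata.json` are assumptions, and `predict` assumes the standard `transformers` API (it downloads the model, so it is defined but not run here):

```python
def apply_thresholds(probs, thresholds):
    """Map per-category probabilities to label decisions using
    per-category cut-offs such as those stored in metadata.json."""
    return {cat: probs[cat] >= t for cat, t in thresholds.items()}

def predict(texts, model_id="NIHRDataInsights/HRCSHealthCategories"):
    """Sketch of the scoring step. Each text should be built as
    AwardTitle + "\n" + AwardAbstract, matching the training input format."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    enc = tok(texts, truncation=True, max_length=512,
              padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.sigmoid(logits)  # one probability per category

# Threshold application on toy values (category codes are illustrative):
probs = {"2.1": 0.62, "4.4": 0.31}
thresholds = {"2.1": 0.55, "4.4": 0.40}
print(apply_thresholds(probs, thresholds))
```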

## Selective automation and human-in-the-loop use
In addition to predicted labels, the inference script reports how close each prediction is to the model’s decision boundary in logit space. This is computed as the smallest absolute difference between any category’s logit and its corresponding decision threshold.

Records with logits close to the threshold represent borderline cases where the model is uncertain. These can be prioritised for human review, while higher-confidence predictions can be automated.
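
The boundary distance described above can be sketched as the minimum absolute gap between a record's logits and the per-category thresholds expressed in logit space (illustrative; the script's exact formula may differ):

```python
import math

def logit(p):
    """Inverse sigmoid: probability threshold -> logit-space threshold."""
    return math.log(p / (1 - p))

def boundary_margin(logits, prob_thresholds):
    """Smallest |logit - threshold| across categories: a small value means
    the record sits near a decision boundary and is a review candidate."""
    return min(abs(l - logit(t)) for l, t in zip(logits, prob_thresholds))

# Toy record: confident on categories 1 and 3, borderline on category 2,
# so the margin is driven by category 2.
margin = boundary_margin([2.0, 0.1, -3.0], [0.5, 0.5, 0.5])
print(round(margin, 3))
```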

When records whose predictions lie closest to the decision boundary are progressively excluded, the remaining high-confidence subset shows increasing micro-F1:

| % of records excluded for human review | RAG micro-F1 on remaining subset |
| :--- | :--- |
| 0% | 0.71 |
| 10% | 0.72 |
| 20% | 0.72 |
| 30% | 0.74 |
| 40% | 0.74 |
| 50% | 0.76 |
| 60% | 0.79 |
| 70% | 0.83 |
| 80% | 0.85 |
| 90% | 0.90 |

This demonstrates that the model supports hybrid workflows in which uncertain cases are reviewed by experts while confident predictions are automated.

## Citation
NIHR, 2026. HRCS Health Category Classifier (BiomedBERT, DAPT). [Model]. Developed by Banks, A., Baghurst, D., Wang, K. and Downes, N. Available from: https://huggingface.co/NIHRDataInsights/HRCSHealthCategories