---
license: mit
datasets:
- NIHRDataInsights/HRCSData
tags:
- text-classification
- biology
- medical
---

# HRCS Research Activity Code Classifier

## Overview
This model, developed by the National Institute for Health and Care Research (NIHR), assigns HRCS Research Activity Codes to research awards using the award title and abstract (micro F1 = 0.60). When predictions are aggregated to Research Activity Groups (RAGs), performance increases to a micro F1 of 0.71. The model is a multi-label transformer classifier built on BiomedBERT-large, domain-adapted (DAPT) on healthcare grant titles and abstracts and then fine-tuned on cross-funder labelled HRCS data. It is intended to support portfolio analysis, automated tagging, and reproducible classification of biomedical research funding.

## Model details
* **Base model:** `microsoft/BiomedNLP-BiomedBERT-large-uncased-abstract`
* **Architecture:** Transformer encoder + multi-label classification head
* **Task:** Multi-label text classification
* **Input:** Award title + abstract
* **Output:** Probability per Research Activity Code

## Training approach
The model was trained in two stages on a 24 GB GPU.

### Domain-adaptive pretraining (DAPT)
We continued masked language modelling on grant titles and abstracts to adapt the encoder to research-funding language rather than publication language. The data used was a healthcare-funder-specific subset of Gomez Magenti, J. (2025) ‘Harmonised datasets of research project grants from UK and European funders’. Zenodo. doi:10.5281/zenodo.15479412.

**Settings:**
* Max sequence length: 512
* Mask probability: 0.15
* Epochs: 1
* Learning rate: 5e-5
* Warmup ratio: 0.01
* Weight decay: 0.01
* Effective batch size: 64
* Mixed precision: bf16/fp16
* Gradient checkpointing enabled

The adapted checkpoint was then used for supervised training.
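
The 0.15 mask probability means that, on average, 15 in 100 tokens are hidden from the model during masked language modelling. A toy sketch of the basic masking step (pure Python, illustrative only; the actual pipeline would use a collator such as `transformers`' `DataCollatorForLanguageModeling`, which also randomly replaces or keeps some selected tokens):

```python
import random

MASK_PROB = 0.15  # matches the DAPT setting above

def mask_tokens(tokens, mask_prob=MASK_PROB, mask_token="[MASK]", seed=0):
    """Replace each token with [MASK] with probability mask_prob.
    Positions with a non-None label are the ones the model must predict."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)   # model is trained to recover the original token
        else:
            masked.append(tok)
            labels.append(None)  # no loss on unmasked positions
    return masked, labels

tokens = "effect of exercise on cardiovascular outcomes in older adults".split()
masked, labels = mask_tokens(tokens)
print(masked)
```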

### Supervised fine-tuning
The adapted model was fine-tuned for multi-label classification using sigmoid outputs and binary cross-entropy loss.

**Input format:**
`AwardTitle` + newline + `AwardAbstract`

**Tokenisation:**
* Max length: 512 tokens
* Truncation enabled
* Fixed-length padding during training

**Handling class imbalance:**
A per-label weighting vector (`pos_weight`) is applied in the loss to reduce bias toward common categories.
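
The exact weighting formula is not stated here; a common choice for `pos_weight` with binary cross-entropy (e.g. PyTorch's `BCEWithLogitsLoss`) is the negative-to-positive ratio per label, sketched in plain Python:

```python
def pos_weights(label_matrix):
    """label_matrix: list of binary label vectors, one per award.
    Returns one weight per label: (#negatives / #positives), so rare
    labels contribute more to the loss. Assumes every label occurs at
    least once in the training set."""
    n = len(label_matrix)
    n_labels = len(label_matrix[0])
    weights = []
    for j in range(n_labels):
        pos = sum(row[j] for row in label_matrix)
        weights.append((n - pos) / pos)
    return weights

# Toy example: 4 awards, 2 categories; category 1 is rarer
# and therefore receives the larger weight.
labels = [[1, 0], [1, 1], [1, 0], [0, 0]]
print(pos_weights(labels))
```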

**Training configuration:**
* Learning rate: 3e-5
* Weight decay: 0.01
* Epochs: up to 20
* Batch size: 14 per device
* Gradient accumulation: 2
* Mixed precision: fp16
* Early stopping patience: 4
* Best checkpoint selected by micro-F1

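A sketch of how these settings map onto Hugging Face `TrainingArguments` (illustrative only; the output path, evaluation cadence, and metric key are assumptions, not taken from the actual training script):

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Hypothetical output directory; the real script's paths are not published.
args = TrainingArguments(
    output_dir="hrcs-rac-classifier",
    learning_rate=3e-5,
    weight_decay=0.01,
    num_train_epochs=20,               # upper bound; early stopping may end sooner
    per_device_train_batch_size=14,
    gradient_accumulation_steps=2,     # effective batch size 28 per device
    fp16=True,
    eval_strategy="epoch",             # assumed cadence
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_micro",  # assumed metric key
)
callbacks = [EarlyStoppingCallback(early_stopping_patience=4)]
```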
## Evaluation protocol
Data was split into three disjoint sets:
* **Training set** – used for optimisation
* **Validation set** – used for early stopping and threshold tuning
* **Held-out test set** – used only once for final evaluation

The test set was not used during training, checkpoint selection, or threshold tuning. The dataset used is listed at the top of the model card. Predictions are converted to labels using per-category probability thresholds tuned on the validation set. These thresholds are included in `metadata.json`.
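
Per-category threshold tuning can be sketched as a simple sweep that maximises F1 on the validation set for each label independently (plain-Python illustration; the repository's actual tuning code may differ):

```python
def f1(preds, gold):
    """Binary F1 for one category."""
    tp = sum(1 for p, g in zip(preds, gold) if p and g)
    fp = sum(1 for p, g in zip(preds, gold) if p and not g)
    fn = sum(1 for p, g in zip(preds, gold) if g and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def tune_threshold(probs, gold, grid=None):
    """Pick the probability cut-off that maximises F1 for one category."""
    grid = grid or [i / 100 for i in range(5, 96, 5)]
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        score = f1([p >= t for p in probs], gold)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t

# Toy validation data for a single category:
probs = [0.9, 0.7, 0.4, 0.2, 0.1]
gold  = [1,   1,   1,   0,   0]
print(tune_threshold(probs, gold))
```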

### Full evaluation results
Overall RAC metrics:
* **F1 micro** – 0.60
* **F1 macro** – 0.51
* **Precision micro** – 0.56
* **Recall micro** – 0.63

Overall RAG metrics:
* **F1 micro** – 0.71
* **F1 macro** – 0.68
* **Precision micro** – 0.70
* **Recall micro** – 0.73

For a comprehensive breakdown of the model's performance, including overall metrics, per-category metrics on both the validation and test sets, and per-funder metrics on the validation set, see the detailed evaluation spreadsheet included in this repository.

**[Download/view the evaluation results](https://huggingface.co/NIHRDataInsights/HRCSHealthCategories/resolve/main/evaluation/health_category_rac_evaluation_results.xlsx)** *(located in the `Files and versions` tab of this repository)*.

## Intended use
This model is intended for:
* Portfolio analysis
* Large-scale tagging of funding datasets
* Exploratory research landscape mapping
* Automation support for HRCS coding workflows

**It is not intended to completely replace expert review.**

## Limitations
* **Performance depends on similarity to the training corpus.**
* **Rare categories remain harder to detect despite class weighting.**
* **Abstract length:** Long or poorly structured abstracts may be truncated at 512 tokens.
* **Threshold calibration:** Thresholds are tuned for this dataset and may need recalibration for new domains.
* **Temporal bias:** The model was trained on data up to 2022; to avoid inflated metrics, evaluate only on awards starting in 2023 or later.
* **Annotation ambiguity and niche categories:** The model's performance reflects the historical consistency of human coding in the training data. Categories that are historically difficult for human coders to classify consistently under HRCS guidelines (such as 7.1, 8.1 and 8.3) are correspondingly harder for the model.

## Inference / How to use
A companion script is provided to run this model (and the companion health category model) on new award data.

**The script:**
1. Loads the trained model and tokenizer
2. Applies a sigmoid to obtain per-category probabilities
3. Converts probabilities to labels using the per-category thresholds stored in `metadata.json`
4. Outputs a CSV containing the predicted codes and confidence indicators

**Expected input format:**
The script expects a CSV containing at minimum: `AwardTitle`, `AwardAbstract`. Optional columns such as `ID` or `FunderAcronym` will be preserved in the output.

See the inference script in this repository for full usage details.
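
The scoring and thresholding steps can be sketched as follows. This is a minimal illustration, not the companion script: the category codes and the key layout of `metadata.json` are assumptions, and `predict` assumes the standard `transformers` API (it downloads the model, so it is defined but not run here):

```python
def apply_thresholds(probs, thresholds):
    """Map per-category probabilities to label decisions using
    per-category cut-offs such as those stored in metadata.json."""
    return {cat: probs[cat] >= t for cat, t in thresholds.items()}

def predict(texts, model_id="NIHRDataInsights/HRCSHealthCategories"):
    """Sketch of the scoring step. Each text should be built as
    AwardTitle + "\n" + AwardAbstract, matching the training input format."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    enc = tok(texts, truncation=True, max_length=512,
              padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.sigmoid(logits)  # one probability per category

# Threshold application on toy values (category codes are illustrative):
probs = {"2.1": 0.62, "4.4": 0.31}
thresholds = {"2.1": 0.55, "4.4": 0.40}
print(apply_thresholds(probs, thresholds))
```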

## Selective automation and human-in-the-loop use
In addition to predicted labels, the inference script reports how close each prediction is to the model’s decision boundary in logit space. This is computed as the smallest absolute difference between any category’s logit and its corresponding decision threshold.

Records with logits close to the threshold represent borderline cases where the model is uncertain. These can be prioritised for human review, while higher-confidence predictions can be automated.
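
The boundary distance described above can be sketched as the minimum absolute gap between a record's logits and the per-category thresholds expressed in logit space (illustrative; the script's exact formula may differ):

```python
import math

def logit(p):
    """Inverse sigmoid: probability threshold -> logit-space threshold."""
    return math.log(p / (1 - p))

def boundary_margin(logits, prob_thresholds):
    """Smallest |logit - threshold| across categories: a small value means
    the record sits near a decision boundary and is a review candidate."""
    return min(abs(l - logit(t)) for l, t in zip(logits, prob_thresholds))

# Toy record: confident on categories 1 and 3, borderline on category 2,
# so the margin is driven by category 2.
margin = boundary_margin([2.0, 0.1, -3.0], [0.5, 0.5, 0.5])
print(round(margin, 3))
```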

When records whose predictions lie closest to the decision boundary are progressively excluded, the remaining high-confidence subset shows increasing micro-F1:

| % of records excluded for human review | RAG micro-F1 on remaining subset |
| :--- | :--- |
| 0% | 0.71 |
| 10% | 0.72 |
| 20% | 0.72 |
| 30% | 0.74 |
| 40% | 0.74 |
| 50% | 0.76 |
| 60% | 0.79 |
| 70% | 0.83 |
| 80% | 0.85 |
| 90% | 0.90 |

This demonstrates that the model supports hybrid workflows in which uncertain cases are reviewed by experts while confident predictions are automated.

## Citation
NIHR, 2026. HRCS Health Category Classifier (BiomedBERT, DAPT). [Model]. Developed by Banks, A., Baghurst, D., Wang, K. and Downes, N. Available from: https://huggingface.co/NIHRDataInsights/HRCSHealthCategories