NCAS Hospital Indication Classifier
A BioClinicalBERT-based multilabel classifier for categorising antimicrobial prescription indication text from hospital electronic medical records (EMR). Developed as part of a research project at RMIT University / The Royal Melbourne Hospital (RMH) investigating automated antimicrobial stewardship support.
Model description
| Attribute | Value |
|---|---|
| Base encoder | emilyalsentzer/Bio_ClinicalBERT |
| Pooling | Mean pooling over token embeddings |
| Classification head | Linear + Sigmoid |
| Task | Multilabel classification (8 categories) |
| Training data | ~2,000 manually annotated hospital prescription records (RMH 2021) |
| Held-out evaluation | 600 records from RMH 2022, 2023, 2024 |
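The pooling and head rows above can be illustrated with a small dependency-free sketch. It assumes padding tokens are excluded through an attention mask (the card does not say so) and uses toy dimensions rather than BERT's hidden size. `mean_pool` and `classify` are hypothetical names, not functions from the repository, which presumably does this with PyTorch tensors.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    n_real = sum(attention_mask)
    if n_real == 0:
        raise ValueError("attention mask selects no tokens")
    pooled = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                pooled[i] += value
    return [total / n_real for total in pooled]


def classify(pooled, weights, biases):
    """Linear layer followed by an element-wise sigmoid, one output per label."""
    probabilities = []
    for row, bias in zip(weights, biases):
        logit = sum(w * x for w, x in zip(row, pooled)) + bias
        probabilities.append(1.0 / (1.0 + math.exp(-logit)))
    return probabilities


# Toy example: 3 tokens (the last is padding), hidden size 2, 2 labels.
embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(embeddings, mask)
probs = classify(pooled, [[1.0, -1.0], [0.0, 0.0]], [1.0, 0.0])
```

Because the head uses a sigmoid rather than a softmax, each label's probability is independent of the others, which is what allows several labels to fire on the same text.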
Label schema (8catb)
| Label | Description |
|---|---|
| respiratory - ioi | Respiratory infection of indication |
| skin and soft tissue - ioi | Skin/soft-tissue infection of indication |
| urinary tract - ioi | Urinary tract infection of indication |
| other | Other or unspecified indication |
| sepsis | Sepsis or bacteraemia |
| undifferentiated infection | Infection without identified source |
| organism only | Organism identified but no clinical syndrome specified |
| no indication documented | No clinical indication present in the text |
A sample can receive one or more labels simultaneously (multilabel).
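In practice, multilabel decoding compares each label's probability with that label's own threshold. The sketch below assumes a probability at or above the threshold counts as positive (the card does not state the comparison). The label order follows the table above, whereas the real order comes from the checkpoint's `label_columns`. The probabilities and thresholds are made up for illustration, and `decode_labels` is a hypothetical helper.

```python
LABELS = [
    "respiratory - ioi",
    "skin and soft tissue - ioi",
    "urinary tract - ioi",
    "other",
    "sepsis",
    "undifferentiated infection",
    "organism only",
    "no indication documented",
]


def decode_labels(probabilities, thresholds, labels=LABELS):
    """Return every label whose probability reaches its own threshold."""
    if not (len(probabilities) == len(thresholds) == len(labels)):
        raise ValueError("probabilities, thresholds and labels must align")
    return [
        label
        for label, p, t in zip(labels, probabilities, thresholds)
        if p >= t
    ]


# Illustrative values only; real thresholds are stored in the checkpoint.
example_probs = [0.91, 0.05, 0.10, 0.02, 0.64, 0.08, 0.01, 0.03]
example_thresholds = [0.5, 0.4, 0.5, 0.3, 0.6, 0.5, 0.4, 0.5]
predicted = decode_labels(example_probs, example_thresholds)
```

With these illustrative numbers the sample receives two labels, `respiratory - ioi` and `sepsis`.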
Post-processing rule
After model prediction, sepsis is suppressed from any sample that also receives
respiratory - ioi OR skin and soft tissue - ioi. If suppression would leave zero
labels, the removal is reverted (fallback guarantee).
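The rule above can be sketched as a small function over the list of predicted label names. This is an illustration of the described behaviour, not the repository's code, and `suppress_sepsis` is a hypothetical name.

```python
SEPSIS = "sepsis"
SUPPRESSING_LABELS = {"respiratory - ioi", "skin and soft tissue - ioi"}


def suppress_sepsis(predicted_labels):
    """Drop 'sepsis' when a respiratory or skin/soft-tissue label co-occurs."""
    labels = list(predicted_labels)
    if SEPSIS in labels and SUPPRESSING_LABELS.intersection(labels):
        remaining = [label for label in labels if label != SEPSIS]
        # Fallback guarantee: keep the original prediction if nothing would remain.
        if remaining:
            return remaining
    return labels


cleaned = suppress_sepsis(["respiratory - ioi", "sepsis"])
untouched = suppress_sepsis(["urinary tract - ioi", "sepsis"])
```

Under the rule as stated, the triggering label always survives suppression, so in this sketch the fallback branch acts purely as a safeguard.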
Usage
Quick start
from huggingface_hub import hf_hub_download
from ncas_indication.model import ClinicalBERTClassifier
from transformers import AutoTokenizer
# Download checkpoint
model_path = hf_hub_download(
    repo_id="jibmaird/NCAS-hospital-indication-classifier",
    filename="indication_classifier_model.pt",
)
# Load model (label names and thresholds are embedded in the checkpoint)
model, label_columns, thresholds = ClinicalBERTClassifier.from_checkpoint(model_path)
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
Alternatively, use the inference script from the GitHub repository:
# Single text
python inference/predict.py --text "UTI prophylaxis post-renal transplant"
# CSV file
python inference/predict.py --input your_file.csv --output predictions.csv
Desktop application
A cross-platform desktop GUI is available in the app/ folder of the repository.
See app/README.md.
Training
Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 1e-5 |
| Batch size | 8 |
| Epochs | 20 |
| Optimizer | AdamW |
| Loss function | Weighted BCE (inverse-frequency weights) |
| Validation split | 20% of training data |
| Threshold selection | Per-label F1 maximisation on validation set |
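The card does not give the exact weighting formula, so the following is only one plausible reading of "inverse-frequency weights". It assumes each label's weight is `n_samples / (n_labels * positive_count)` and that the weight scales that label's whole BCE term. Both function names are hypothetical, and the actual training code may differ, for example by using a positive-class weight instead.

```python
import math


def inverse_frequency_weights(label_matrix):
    """One weight per label: n_samples / (n_labels * positive_count)."""
    n_samples = len(label_matrix)
    n_labels = len(label_matrix[0])
    weights = []
    for j in range(n_labels):
        positives = sum(row[j] for row in label_matrix)
        # max(..., 1) avoids division by zero for labels with no positives.
        weights.append(n_samples / (n_labels * max(positives, 1)))
    return weights


def weighted_bce(probabilities, targets, weights, eps=1e-7):
    """Mean over samples and labels of the per-label weighted BCE."""
    total = 0.0
    count = 0
    for probs, target in zip(probabilities, targets):
        for p, y, w in zip(probs, target, weights):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count


# Toy data: label 0 is common (4 of 4), label 1 is rare (1 of 4).
toy_targets = [[1, 0], [1, 0], [1, 1], [1, 0]]
toy_weights = inverse_frequency_weights(toy_targets)
toy_loss = weighted_bce([[0.5, 0.5]] * 4, toy_targets, toy_weights)
```

In the toy data the rare label receives four times the weight of the common one, so errors on it contribute more to the loss.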
Training procedure
- The combined dataset of ~2,000 labelled records was split 80/20 for training and validation.
- Inverse-frequency class weights were applied to the BCE loss to address label imbalance.
- Per-label decision thresholds were optimised on the validation set by grid search over [0.1, 0.2, …, 0.8] to maximise label-specific F1.
- The model with the best weighted-macro F1 across epochs was retained.
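The per-label threshold search described above can be sketched as follows. The grid matches the one stated in the list. The tie-breaking rule (the lowest threshold wins) and the `>=` comparison are assumptions, and all function names are hypothetical.

```python
GRID = [round(0.1 * k, 1) for k in range(1, 9)]  # 0.1, 0.2, ..., 0.8


def f1_score(y_true, y_pred):
    """Binary F1; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(y_true, probabilities, grid=GRID):
    """Return (threshold, f1) maximising F1 for a single label."""
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        preds = [1 if p >= t else 0 for p in probabilities]
        score = f1_score(y_true, preds)
        if score > best_f1:  # strict '>' keeps the lowest threshold on ties
            best_t, best_f1 = t, score
    return best_t, best_f1


def select_thresholds(val_targets, val_probs):
    """Run the search independently for every label column."""
    n_labels = len(val_targets[0])
    return [
        best_threshold(
            [row[j] for row in val_targets],
            [row[j] for row in val_probs],
        )[0]
        for j in range(n_labels)
    ]


# Toy validation data for one label.
threshold, score = best_threshold([0, 0, 1, 1], [0.15, 0.35, 0.45, 0.9])
```

Searching each label separately lets rare labels settle on lower thresholds than frequent ones, instead of sharing a single global cut-off.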
Checkpoint format
The .pt file is a standard PyTorch checkpoint dict with keys:
{
    "model_state_dict": ...,      # nn.Module weights
    "label_columns": [...],       # ordered label names
    "optimal_thresholds": [...],  # per-label decision thresholds
    "n_labels": 8,
    "base_model": "emilyalsentzer/Bio_ClinicalBERT",
}
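When reading the checkpoint directly rather than through `from_checkpoint`, a few consistency checks on the keys listed above can catch a mismatched file early. The helper below is a hypothetical sketch. It works on a plain dict so it runs without PyTorch, and the stand-in values are invented.

```python
REQUIRED_KEYS = {
    "model_state_dict",
    "label_columns",
    "optimal_thresholds",
    "n_labels",
    "base_model",
}


def read_checkpoint_metadata(checkpoint):
    """Return (labels, thresholds) after basic consistency checks."""
    missing = REQUIRED_KEYS.difference(checkpoint)
    if missing:
        raise KeyError(f"checkpoint is missing keys: {sorted(missing)}")
    labels = list(checkpoint["label_columns"])
    thresholds = [float(t) for t in checkpoint["optimal_thresholds"]]
    if not (len(labels) == len(thresholds) == checkpoint["n_labels"]):
        raise ValueError("label, threshold and n_labels counts disagree")
    return labels, thresholds


# In practice the dict would presumably come from something like:
#   checkpoint = torch.load(model_path, map_location="cpu")
# A two-label stand-in is used here purely for illustration.
demo_checkpoint = {
    "model_state_dict": {},
    "label_columns": ["sepsis", "other"],
    "optimal_thresholds": [0.4, 0.3],
    "n_labels": 2,
    "base_model": "emilyalsentzer/Bio_ClinicalBERT",
}
demo_labels, demo_thresholds = read_checkpoint_metadata(demo_checkpoint)
```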
Limitations and intended use
- The model was trained and evaluated on de-identified records from a single Australian tertiary hospital (RMH). Performance may differ on records from other hospitals, health systems, or clinical workflows.
- This model is intended for research purposes and is not a validated clinical decision support tool. Clinical decisions must remain with qualified healthcare professionals.
- The training data cannot be shared due to privacy restrictions; the annotation schema and data format are documented in the companion GitHub repository.
Citation
If you use this model in your research, please cite:
@article{ncas_indication_classifier_2025,
  title   = {Automated Classification of Antimicrobial Prescription Indications
             Using BioClinicalBERT},
  author  = {...},
  journal = {...},
  year    = {2025},
  note    = {Under review}
}
Repository
Source code, training scripts, and the desktop application are available at:
https://github.com/jibmaird/NCAS-hospital-indication-classifier
License
Apache 2.0 — see LICENSE.