---
language:
  - en
license: apache-2.0
base_model: emilyalsentzer/Bio_ClinicalBERT
tags:
  - medical
  - clinical
  - ssi
  - classification
  - surveillance
metrics:
  - accuracy
  - f1
  - precision
  - recall
model-index:
  - name: clinicalSSIBERT
    results:
      - task:
          type: text-classification
          name: SSI Detection
        dataset:
          name: Synthetic UK NHS Clinical Notes
          type: synthetic
          split: test
        metrics:
          - name: Accuracy
            type: accuracy
            value: 1.0
          - name: F1
            type: f1
            value: 1.0
---

# Model Card for Ch3DS/clinicalSSIBERT

## Model Details

### Model Description

This model is a fine-tuned version of [Bio_ClinicalBERT](https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT) designed for the surveillance of **Surgical Site Infections (SSI)** in postoperative clinical notes. It is specifically tailored to **UK NHS terminology**, covering specialties such as Orthopaedics, General Surgery (GI), and Obstetrics (C-sections).

- **Developed by:** Daryn Sutton
- **Model type:** Text Classification (BERT)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** [emilyalsentzer/Bio_ClinicalBERT](https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT)
- **Repository:** [https://huggingface.co/Ch3DS/clinicalSSIBERT](https://huggingface.co/Ch3DS/clinicalSSIBERT)

### Uses

#### Direct Use

This model is intended for use in clinical natural language processing (NLP) pipelines to automatically flag postoperative notes that indicate a potential Surgical Site Infection. It classifies notes into:

- **0 (Routine)**: Normal healing, no signs of infection.
- **1 (Infection)**: Signs of SSI (e.g., purulent discharge, erythema, antibiotic escalation).

It is particularly effective for notes containing UK-specific medical abbreviations and terminology (e.g., "Lap. Chole.", "THR", "Co-amoxiclav", "SHO review").

#### Out-of-Scope Use

- **Diagnosis**: This model is a surveillance tool and should **not** be used to make clinical diagnoses without human verification.
- **Non-UK Contexts**: Performance may vary on clinical notes from other healthcare systems with different terminology or documentation styles.

### Bias, Risks, and Limitations

- **Synthetic Data**: The model was trained on a large synthetic dataset. While designed to be realistic, it may not capture the full "messiness" or ambiguity of real-world clinical data.
- **False Negatives**: There is a risk of missing subtle infections that do not use standard keywords.
- **Bias**: The synthetic data generation process may have introduced biases based on the templates used.

### Recommendations

Users should validate the model on their own local clinical data before deploying it for active surveillance. It is recommended to use this model as a "first pass" filter to prioritize cases for manual review by Infection Prevention and Control (IPC) teams.
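When used as a first-pass filter, the class probability is often more useful than the hard label, because the review threshold can be tuned toward recall. A minimal sketch of this triage pattern (the `triage` helper and the 0.2 threshold are illustrative choices, not part of the model):

```python
import torch

def triage(logits: torch.Tensor, threshold: float = 0.2) -> str:
    """Flag a note for manual IPC review if P(infection) exceeds the threshold.

    A low threshold trades precision for recall, which is usually the
    right trade-off for a surveillance "first pass" filter.
    """
    probs = torch.softmax(logits, dim=-1)
    p_infection = probs[..., 1].item()  # index 1 = "Infection" class
    return "review" if p_infection >= threshold else "routine"

# Dummy logits for illustration; in practice pass model(**inputs).logits.
print(triage(torch.tensor([[0.1, 2.3]])))   # high infection logit
print(triage(torch.tensor([[4.0, -1.0]])))  # high routine logit
```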

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "Ch3DS/clinicalSSIBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Day 5 post THR. Wound red and oozing pus. Patient pyrexial. Plan: Start Flucloxacillin."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
    predicted_class_id = logits.argmax().item()

labels = ["Routine", "Infection"]
print(f"Prediction: {labels[predicted_class_id]}")
```

## Training Details

### Training Data

The model was trained on **5 million synthetic clinical notes** generated to mimic UK NHS postoperative records. The data covers:

- **Procedures**: Total Hip/Knee Replacement, C-Section, Cholecystectomy, Hernia Repair, etc.
- **Terminology**: UK-specific staff titles (Reg, SHO, FY1), antibiotics (Co-amoxiclav, Teicoplanin), and wound descriptions.
- **Balance**: Approximately 5% infection rate.
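The generation code is not published; as a purely hypothetical illustration of template-based synthesis with a ~5% positive rate (every template, procedure list, and helper name below is invented for this sketch):

```python
import random

PROCEDURES = ["THR", "TKR", "LSCS", "Lap. Chole.", "Inguinal hernia repair"]
ROUTINE = "Day {d} post {proc}. Wound clean and dry. Obs stable. Routine review."
INFECTED = ("Day {d} post {proc}. Wound erythematous with purulent discharge. "
            "Pyrexial overnight. Plan: start Co-amoxiclav, SHO review.")

def make_note(infection_rate: float = 0.05) -> tuple[str, int]:
    """Return one (note, label) pair; label 1 with probability infection_rate."""
    label = int(random.random() < infection_rate)
    template = INFECTED if label else ROUTINE
    note = template.format(d=random.randint(1, 14), proc=random.choice(PROCEDURES))
    return note, label

note, label = make_note()
print(label, note)
```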

### Training Procedure

#### Training Hyperparameters

- **Epochs**: 3
- **Batch Size**: 64 per device, with gradient accumulation of 4 (effective batch size 256)
- **Learning Rate**: 2e-5
- **Precision**: Mixed Precision (FP16)
- **Optimizer**: AdamW
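The hyperparameters above correspond roughly to the following Hugging Face `TrainingArguments`; this is a reconstruction for reference, not the actual training script, and the `output_dir` is a placeholder:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="clinicalSSIBERT",     # placeholder
    num_train_epochs=3,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=4,    # effective batch size 256
    learning_rate=2e-5,
    fp16=True,                        # mixed precision
    optim="adamw_torch",
)
```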

#### Hardware

- **GPU**: NVIDIA GeForce RTX 5070 Ti

## Evaluation

### Testing Data, Factors & Metrics

The model was evaluated on a held-out test set of 100,000 synthetic records.

### Results

| Metric        | Value |
| :------------ | :---- |
| **Accuracy**  | 1.0   |
| **Precision** | 1.0   |
| **Recall**    | 1.0   |
| **F1-Score**  | 1.0   |

_Note: The perfect scores reflect the synthetic nature of the test data, which follows the same distribution as the training data. Real-world performance is expected to be lower and requires further validation._
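To run the same evaluation on your own held-out notes, a sketch using scikit-learn (the `y_true`/`y_pred` lists are toy stand-ins for gold labels from manual IPC review and the model's predictions):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 0, 1, 1, 0]  # gold labels from manual review (toy example)
y_pred = [0, 0, 1, 0, 0]  # model predictions (toy example)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", zero_division=0
)
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}")
```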

## Environmental Impact

- **Hardware Type**: NVIDIA GeForce RTX 5070 Ti
- **Hours used**: ~2
- **Carbon Emitted**: Negligible (local training)

## Model Card Contact

**Daryn Sutton**  
Email: darynsutton@hotmail.com  
GitHub: [Ch3w3y](https://github.com/Ch3w3y)