# OHS Severity Classifier
DistilBERT fine-tuned for workplace incident severity classification. Developed as part of MSc Computer Science research at Northumbria University (2024).
Research paper: Predicting Workplace Incident Severity Using BERT-Based NLP: Probabilistic Classification and SHAP Analysis for UK OHS Decision Support
arXiv: submission in progress; link will be added on publication
Code repository: github.com/stuartclark-ml/incident-severity-bert
## Model Description
This model classifies workplace incident descriptions into five severity categories based on time-loss incapacitation, aligned with UK RIDDOR reporting thresholds and HSE guidance frameworks.
The model takes a free-text incident narrative as input and outputs a probability distribution across five severity classes. It is designed as a decision-support tool in a human-in-the-loop workflow, not as a standalone classifier. The probability distribution and SHAP token attribution together allow a safety professional to assess confidence and inspect the model's reasoning before acting on a prediction.
### Severity Classification Framework
| Class | Label | Condition | UK OHS Alignment |
|---|---|---|---|
| 0 | None | 0 days lost | Below RIDDOR recording threshold |
| 1 | Minor | 1-2 days lost | Below mandatory internal recording threshold |
| 2 | Moderate | 3-7 days lost | Mandatory internal recording under RIDDOR |
| 3 | Severe | 8-28 days lost | Statutory reporting to HSE required |
| 4 | Major | 29+ days lost | Long-term incapacitation; triggers separate management and insurance obligations |
Time-loss days are used as the severity proxy. This metric is compatible with both US OSHA and UK RIDDOR reporting frameworks, which is the primary justification for using OSHA data to develop a UK-applicable tool.
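The thresholds in the table can be written as a small lookup. This is a minimal sketch of the labelling rule the table implies, not code from the project repository: the function name `severity_class` is hypothetical, and it assumes days lost is recorded as a non-negative integer.

```python
LABELS = ["None", "Minor", "Moderate", "Severe", "Major"]


def severity_class(days_lost: int) -> int:
    """Map time-loss days to a class index (0-4) using the table's thresholds."""
    if days_lost < 0:
        raise ValueError("days_lost must be non-negative")
    if days_lost == 0:
        return 0  # None
    if days_lost <= 2:
        return 1  # Minor
    if days_lost <= 7:
        return 2  # Moderate
    if days_lost <= 28:
        return 3  # Severe
    return 4      # Major


print(LABELS[severity_class(5)])  # Moderate
```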
## Intended Uses
Appropriate uses:
- Triage support for high-volume incident report queues in health and social care settings
- Research into NLP-based severity prediction for occupational safety
- Demonstration of BERT-based classification with SHAP explainability in safety-critical domains
- Educational exploration of model bias in imbalanced safety datasets
Not appropriate for:
- Standalone severity determination without human review
- UK RIDDOR reportability decisions (the model was not trained on RIDDOR data)
- Industries outside health and social care (the model was trained on H&C sector data only)
- Definitive clinical or legal severity assessments
## Training Data
Dataset: US OSHA Incident Reports (2024 release), Health and Care sector subset
Size: 181,586 records after preprocessing
- Full OSHA dataset: 751,770 records
- H&C sector subset: 193,512 records
- COVID-19 records removed: 10,016 (to prevent rare global event bias)
- Fatality records removed: 5 (statistically insignificant, severe class imbalance)
Features used:
- `new_nar_what_happened`: what happened during the incident
- `new_nar_before_incident`: what the employee was doing beforehand
- `new_nar_object_substance`: the object or substance causing harm
- `size`: organisational size (1 = <20, 2 = 20-249, 3 = 250+ employees)
Class distribution: Heavily imbalanced and dominated by the None class, reflecting real-world incident distribution (Heinrich's Triangle). A weighted loss function was applied during training (up to 4.05x penalty for the Minor class).
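One common way to derive inverse-frequency class weights is sketched below. The class counts are hypothetical placeholders, since this card does not publish per-class counts. The normalisation `total / (num_classes * class_count)` is also an assumption, so the numbers will not reproduce the 4.05x figure above.

```python
import math


def inverse_frequency_weights(counts):
    """Return one weight per class: total / (num_classes * class_count)."""
    total = sum(counts)
    num_classes = len(counts)
    return [total / (num_classes * n) for n in counts]


def weighted_nll(probs, true_class, weights):
    """Weighted negative log-likelihood for a single example."""
    return -weights[true_class] * math.log(probs[true_class])


# Hypothetical counts, ordered None, Minor, Moderate, Severe, Major.
counts = [6000, 800, 1000, 900, 1300]
weights = inverse_frequency_weights(counts)

# The same predicted probability costs more when the true class is rare.
probs = [0.2, 0.2, 0.2, 0.2, 0.2]
loss_none = weighted_nll(probs, 0, weights)
loss_minor = weighted_nll(probs, 1, weights)
print([round(w, 2) for w in weights])
print(round(loss_none, 3), round(loss_minor, 3))
```

In PyTorch, weights of this kind are typically passed to `torch.nn.CrossEntropyLoss(weight=...)`; whether the original training code did so is not stated here.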
## Performance
### Overall Metrics
| Metric | Random Forest (baseline) | DistilBERT | ModernBERT |
|---|---|---|---|
| Weighted Avg F1 | 0.55 | 0.58 | 0.58 |
| Macro ROC AUC | 0.77 | 0.77 | 0.78 |
| Overall Accuracy | 0.52 | 0.55 | 0.54 |
| Log Loss | 1.29 | 1.07 | 1.10 |
DistilBERT was selected as the deployment model: it matches ModernBERT on weighted F1 and has the highest accuracy, the lowest log loss, and a smaller model size.
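To put the log loss figures in context, a classifier that assigns equal probability to all five classes scores ln 5 ≈ 1.609, so every model in the table improves on uniform guessing. The sketch below shows the multiclass log loss calculation. The example probabilities are made up for illustration and are not model outputs.

```python
import math


def log_loss(prob_rows, true_classes):
    """Mean negative log of the probability assigned to each true class."""
    eps = 1e-15  # guard against log(0)
    losses = [-math.log(max(row[y], eps))
              for row, y in zip(prob_rows, true_classes)]
    return sum(losses) / len(losses)


# Uniform guessing over five classes, three examples.
uniform = [[0.2] * 5 for _ in range(3)]
print(round(log_loss(uniform, [0, 2, 4]), 3))  # 1.609
```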
### Per-Class Metrics: DistilBERT
| Class | Precision | Recall | F1 |
|---|---|---|---|
| None | 0.94 | 0.71 | 0.81 |
| Minor | 0.16 | 0.14 | 0.15 |
| Moderate | 0.19 | 0.25 | 0.22 |
| Severe | 0.21 | 0.20 | 0.20 |
| Major | 0.37 | 0.60 | 0.46 |
Key finding: 0.60 recall on Major (highest-severity) incidents makes the model useful for prioritisation. It catches 60% of the most serious incidents for urgent human review, despite poor performance on the middle classes.
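The F1 column can be checked against the precision and recall columns, since F1 is their harmonic mean. The sketch below recomputes it from the table values. The names `per_class` and `f1_score` are illustrative.

```python
# (precision, recall) per class, copied from the DistilBERT table above.
per_class = {
    "None": (0.94, 0.71),
    "Minor": (0.16, 0.14),
    "Moderate": (0.19, 0.25),
    "Severe": (0.21, 0.20),
    "Major": (0.37, 0.60),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for label, (p, r) in per_class.items():
    print(f"{label}: {f1_score(p, r):.2f}")
```

The recomputed values agree with the table to two decimal places.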
## Critical Findings from SHAP Analysis
SHAP token attribution was applied to DistilBERT to inspect the model's learned logic. This revealed two significant issues that any user of this model must understand:
### 1. Needlestick Blind Spot
99.25% of needlestick injuries in the training dataset are labelled None.
The model learned this spurious correlation and systematically classifies needlestick injuries as None severity, regardless of their actual consequences. In clinical practice, needlestick injuries can result in bloodborne pathogen exposure with serious long-term health consequences.
Mitigation: Any incident narrative containing needle-related terms (needle, needlestick, lancet, sharps, puncture) should be flagged for mandatory human review, and the model output for that narrative should not be relied upon.
### 2. Organisational Size Bias
The size feature introduced a spurious correlation: small organisations (size=1, <20 employees) are associated with the None class at 50.8%.
Mitigation: Model predictions for incidents from small organisations should be treated with additional caution. Future implementations should consider removing the size feature unless a more balanced dataset is available.
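The two mitigations above can be combined into a pre-screening step that runs alongside the model. This is a minimal sketch: the function name `review_flags` and the flag strings are hypothetical, and matching terms as word prefixes (so that "needles" or "punctured" also match) is an assumption about what counts as a needle-related term.

```python
import re
from typing import List, Optional

NEEDLE_TERMS = ("needle", "needlestick", "lancet", "sharps", "puncture")
# Match each term at the start of a word, case-insensitively.
NEEDLE_PATTERN = re.compile(
    r"\b(?:" + "|".join(NEEDLE_TERMS) + r")", re.IGNORECASE
)


def review_flags(narrative: str, size: Optional[int] = None) -> List[str]:
    """Return the known-bias flags that apply to an incident."""
    flags = []
    if NEEDLE_PATTERN.search(narrative):
        flags.append("needle_terms: mandatory human review")
    if size == 1:  # fewer than 20 employees
        flags.append("small_organisation: treat prediction with caution")
    return flags


print(review_flags("Nurse punctured finger while recapping a needle.", size=1))
```

An empty list means neither known bias applies. It does not mean the prediction can be used without human review.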
### Positive Finding: No Significant US Linguistic Bias
SHAP global feature importance analysis found no significant US-centric linguistic bias in the top predictive features. The model relied on universal medical and safety terminology rather than US-specific language, supporting its tentative use as a UK proxy tool. However, regulatory and operational differences between US and UK health and social care systems mean direct comparison with UK standards still requires caution.
## Limitations
- Jurisdiction: Trained on US OSHA data. UK RIDDOR performance not characterised. Time-loss metric provides a compatibility bridge but does not eliminate cross-jurisdictional differences.
- Industry: Health and social care only. Performance on other sectors is unknown and likely poor.
- Minority classes: Minor, Moderate, and Severe classes perform poorly (F1 0.15-0.22). Narratives for these classes are textually indistinct, a fundamental data limitation not resolvable by model architecture alone.
- Correlation not causation: SHAP values reflect learned correlations in the training data. They provide interpretations, not causal explanations.
- Static training data: Model accuracy will degrade over time as incident patterns and safety language evolve. Continuous retraining would be required for production deployment.
- Fatalities excluded: Insufficient fatality cases in the dataset (5 records) for reliable modelling.
## How to Use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "stuSterfc/ohs-severity-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # make sure dropout is disabled for inference

labels = ["None", "Minor", "Moderate", "Severe", "Major"]


def predict_severity(narrative: str) -> dict:
    """Return class probabilities, the top class, and its probability."""
    inputs = tokenizer(
        narrative,
        return_tensors="pt",
        truncation=True,
        max_length=512,
        padding=True,
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # Softmax over the class dimension; squeeze(0) drops the batch dimension
    probs = torch.softmax(outputs.logits, dim=1).squeeze(0).tolist()
    return {
        "probabilities": {label: round(prob, 4)
                          for label, prob in zip(labels, probs)},
        "predicted_class": labels[probs.index(max(probs))],
        "confidence": round(max(probs), 4),
    }


# Example
narrative = (
    "Employee was assisting patient transfer from bed to wheelchair. "
    "Felt sharp pain in lower back. Unable to complete shift."
)
result = predict_severity(narrative)
print(result)

# IMPORTANT: Check for needlestick blind spot
needle_terms = ["needle", "needlestick", "lancet", "sharps", "puncture"]
if any(term in narrative.lower() for term in needle_terms):
    print("WARNING: Needle-related incident detected. "
          "Model has known blind spot; refer for human review.")
```
---
## Input Format
The model was trained on concatenated narratives using `[SEP]` tokens
as separators between the three OSHA narrative fields, with an
organisational size prefix:
Organisation size: {size}. [SEP] {what_happened} [SEP] {before_incident} [SEP] {object_substance}
However, testing found that neither the `[SEP]` separators nor the
organisational size prefix had a significant impact on output
probabilities. A plain concatenated narrative performs equivalently:
{what_happened} {before_incident} {object_substance}
For practical use, pass the incident narrative as a single string
without special formatting. Include as much detail as available:
the model performs better with richer narratives (average word count
for well-classified Major incidents was 58.6 words vs 17.8 for
poorly-classified incidents).
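Both input layouts described above can be produced by one small helper. This is a sketch only: the function `build_input` and its parameter names are hypothetical and do not come from the project repository.

```python
from typing import Optional


def build_input(
    what_happened: str,
    before_incident: str = "",
    object_substance: str = "",
    size: Optional[int] = None,
    use_training_format: bool = False,
) -> str:
    """Join the narrative fields into a single model input string."""
    fields = (what_happened, before_incident, object_substance)
    parts = [f.strip() for f in fields if f and f.strip()]
    if use_training_format:
        # Layout used in training: size prefix plus [SEP]-separated fields.
        prefix = [f"Organisation size: {size}."] if size is not None else []
        return " [SEP] ".join(prefix + parts)
    # Plain concatenation, reported above to perform equivalently.
    return " ".join(parts)


text = build_input(
    "Employee felt sharp pain in lower back.",
    "Assisting patient transfer from bed to wheelchair.",
    "Patient",
)
print(text)
```

Empty fields are skipped, so a report with only a "what happened" narrative still yields a clean string.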
## Training Details
- **Base model:** `distilbert-base-cased`
- **Framework:** HuggingFace Transformers
- **Fine-tuning:** HuggingFace Trainer API
- **Hyperparameter optimisation:** Optuna
- **Class imbalance handling:** Weighted loss function (inverse class frequency)
- **Training data split:** Standard train/validation/test split
---
## Citation
*Citation will be added when arXiv preprint is published.*
---
## About the Author
Stuart Clark, MSc Computer Science (Distinction), Northumbria University 2024.
NEBOSH National Diploma. 14 years occupational health and safety consulting.
GitHub: github.com/stuartclark-ml