OHS Severity Classifier

DistilBERT fine-tuned for workplace incident severity classification. Developed as part of MSc Computer Science research at Northumbria University (2024).

Research paper: Predicting Workplace Incident Severity Using BERT-Based NLP — Probabilistic Classification and SHAP Analysis for UK OHS Decision Support

arXiv: submission in progress; link will be added on publication

Code repository: github.com/stuartclark-ml/incident-severity-bert

Model Description

This model classifies workplace incident descriptions into five severity categories based on time-loss incapacitation, aligned with UK RIDDOR reporting thresholds and HSE guidance frameworks.

The model takes a free-text incident narrative as input and outputs a probability distribution across five severity classes. It is designed as a decision-support tool in a human-in-the-loop workflow — not a standalone classifier. The probability distribution and SHAP token attribution together allow a safety professional to assess confidence and inspect the model's reasoning before acting on a prediction.
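As a rough illustration of how a reviewer might summarise the probability distribution before acting on it, the sketch below computes two simple confidence signals. The helper name `confidence_summary`, the choice of top-two margin and normalised entropy, and the example probabilities are all assumptions made for illustration; none of them come from the model or the paper.

```python
import math

def confidence_summary(probabilities: dict) -> dict:
    """Summarise a class-probability distribution (hypothetical helper)."""
    # Rank classes from most to least probable.
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    (top_label, top_p), (_, second_p) = ranked[0], ranked[1]
    # Shannon entropy, normalised to [0, 1] by its maximum value log(K).
    entropy = -sum(p * math.log(p) for p in probabilities.values() if p > 0)
    return {
        "predicted_class": top_label,
        "margin": round(top_p - second_p, 4),
        "normalised_entropy": round(entropy / math.log(len(probabilities)), 4),
    }

# Made-up probabilities, not real model output.
example = {"None": 0.10, "Minor": 0.10, "Moderate": 0.15, "Severe": 0.20, "Major": 0.45}
summary = confidence_summary(example)
print(summary)
```

A small margin or a normalised entropy close to 1 suggests the model is spreading probability across several classes, which may be a reason to give the report closer human attention.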

Severity Classification Framework

Class	Label	Condition	UK OHS Alignment
0	None	0 days lost	Below RIDDOR recording threshold
1	Minor	1-2 days lost	Below mandatory internal recording threshold
2	Moderate	3-7 days lost	Mandatory internal recording under RIDDOR
3	Severe	8-28 days lost	Statutory reporting to HSE required
4	Major	29+ days lost	Long-term incapacitation — triggers separate management and insurance obligations

Time-loss days are used as the severity proxy. This metric is compatible with both US OSHA and UK RIDDOR reporting frameworks, which is the primary justification for using OSHA data to develop a UK-applicable tool.
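The thresholds in the table above can be written as a small lookup. This is a minimal sketch: the function name `severity_class` is illustrative, and how days lost were derived from the raw OSHA fields is not stated here, so the input is simply assumed to be a non-negative count of days.

```python
def severity_class(days_lost: int) -> tuple:
    """Map time-loss days to (class index, label) using the table thresholds."""
    if days_lost < 0:
        raise ValueError("days_lost must be non-negative")
    if days_lost == 0:
        return 0, "None"
    if days_lost <= 2:
        return 1, "Minor"
    if days_lost <= 7:
        return 2, "Moderate"
    if days_lost <= 28:
        return 3, "Severe"
    return 4, "Major"

for days in (0, 2, 3, 28, 29):
    print(days, severity_class(days))
```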

Intended Uses

Appropriate uses:

Triage support for high-volume incident report queues in health and social care settings
Research into NLP-based severity prediction for occupational safety
Demonstration of BERT-based classification with SHAP explainability in safety-critical domains
Educational exploration of model bias in imbalanced safety datasets

Not appropriate for:

Standalone severity determination without human review
UK RIDDOR reportability decisions — the model was not trained on RIDDOR data
Industries outside health and social care — trained on H&C sector data only
Definitive clinical or legal severity assessments

Training Data

Dataset: US OSHA Incident Reports (2024 release) — Health and Care sector subset

Size: 181,586 records after preprocessing

Full OSHA dataset: 751,770 records
H&C sector subset: 193,512 records
COVID-19 records removed: 10,016 (to prevent rare global event bias)
Fatality records removed: 5 (statistically insignificant, severe class imbalance)

Features used:

new_nar_what_happened — what happened during the incident
new_nar_before_incident — what the employee was doing beforehand
new_nar_object_substance — the object or substance causing harm
size — organisational size (1=<20, 2=20-249, 3=250+ employees)

Class distribution: Heavily imbalanced — dominated by None class, reflecting real-world incident distribution (Heinrich's Triangle). Weighted loss function applied during training (up to 4.05x penalty for Minor class).
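A minimal sketch of inverse-class-frequency weighting follows. The normalisation used (total / (number of classes × class count)) is one common convention and is an assumption, as are the function name and the class counts, which are invented for illustration and are not the real distribution.

```python
def inverse_frequency_weights(class_counts: dict) -> dict:
    """Weight each class by total / (num_classes * class_count)."""
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {label: total / (num_classes * n) for label, n in class_counts.items()}

# Made-up counts for illustration only (not the real class distribution).
counts = {"None": 6000, "Minor": 500, "Moderate": 1000, "Severe": 1500, "Major": 1000}
weights = inverse_frequency_weights(counts)
for label, w in weights.items():
    print(f"{label}: {w:.2f}")
```

Weights of this kind are typically passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, so that errors on rare classes contribute more to the loss than errors on the dominant class.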

Performance

Overall Metrics

Metric	Random Forest (baseline)	DistilBERT	ModernBERT
Weighted Avg F1	0.55	0.58	0.58
Macro ROC AUC	0.77	0.77	0.78
Overall Accuracy	0.52	0.55	0.54
Log Loss	1.29	1.07	1.10

DistilBERT was selected as the deployment model: it matches ModernBERT on weighted F1 and has the highest accuracy, the lowest log loss, and a smaller model size.

Per-Class Metrics — DistilBERT

Class	Precision	Recall	F1
None	0.94	0.71	0.81
Minor	0.16	0.14	0.15
Moderate	0.19	0.25	0.22
Severe	0.21	0.20	0.20
Major	0.37	0.60	0.46

Key finding: 0.60 recall on Major (highest-severity) incidents makes the model useful for prioritisation — catching 60% of the most serious incidents for urgent human review — despite poor performance on the middle classes.
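The F1 column in the per-class table can be checked against the precision and recall columns using the standard harmonic-mean formula. The snippet below is only an arithmetic check on the published figures; the variable names are illustrative.

```python
# Precision and recall as reported in the per-class table.
per_class = {
    "None": (0.94, 0.71),
    "Minor": (0.16, 0.14),
    "Moderate": (0.19, 0.25),
    "Severe": (0.21, 0.20),
    "Major": (0.37, 0.60),
}

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1_scores = {label: round(f1(p, r), 2) for label, (p, r) in per_class.items()}
print(f1_scores)
```

The unweighted mean of these five F1 values is roughly 0.37, well below the weighted average of 0.58, which shows how strongly the dominant None class drives the headline figure.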

Critical Findings from SHAP Analysis

SHAP token attribution was applied to DistilBERT to inspect the model's learned logic. This revealed two significant issues that any user of this model must understand:

1. Needlestick Blind Spot

99.25% of needlestick injuries in the training dataset are labelled None.

The model learned this spurious correlation and systematically classifies needlestick injuries as None severity — regardless of their actual consequences. In clinical practice needlestick injuries can result in bloodborne pathogen exposure with serious long-term health consequences.

Mitigation: Any incident narrative containing needle-related terms (needle, needlestick, lancet, sharps, puncture) should be flagged for mandatory human review, and the model output for that narrative should not be relied upon.

2. Organisational Size Bias

The size feature introduced a spurious correlation: small organisations (size=1, <20 employees) are associated with None class at 50.8%.

Mitigation: Model predictions for incidents from small organisations should be treated with additional caution. Future implementations should consider removing the size feature unless a more balanced dataset is available.
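The two mitigations above could be combined into a simple pre-check that runs alongside the model. This is a sketch under assumptions: the function name `review_flags` and the flag strings are hypothetical, and the term list is the one given in the needlestick mitigation.

```python
from typing import List, Optional

NEEDLE_TERMS = ("needle", "needlestick", "lancet", "sharps", "puncture")

def review_flags(narrative: str, size: Optional[int] = None) -> List[str]:
    """Return caution flags for a narrative (hypothetical helper)."""
    flags = []
    text = narrative.lower()
    # Substring match, so "punctured" and "needles" are also caught.
    if any(term in text for term in NEEDLE_TERMS):
        flags.append("needle_terms: mandatory human review")
    # size=1 denotes organisations with fewer than 20 employees.
    if size == 1:
        flags.append("small_organisation: treat prediction with extra caution")
    return flags

print(review_flags("Nurse sustained a needlestick injury while recapping.", size=1))
print(review_flags("Employee slipped on a wet floor in the corridor.", size=3))
```

Substring matching is deliberately broad here: a false flag only costs a human review, whereas a missed needle-related incident falls into the known blind spot.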

Positive Finding — No Significant US Linguistic Bias

SHAP global feature importance analysis found no significant US-centric linguistic bias in the top predictive features. The model relied on universal medical and safety terminology rather than US-specific language, supporting its tentative use as a UK proxy tool. However, regulatory and operational differences between US and UK health and social care systems mean direct comparison with UK standards still requires caution.

Limitations

Jurisdiction: Trained on US OSHA data. UK RIDDOR performance not characterised. Time-loss metric provides a compatibility bridge but does not eliminate cross-jurisdictional differences.
Industry: Health and social care only. Performance on other sectors is unknown and likely poor.
Minority classes: Minor, Moderate, and Severe classes perform poorly (F1 0.15-0.22). Narratives for these classes are textually indistinct — a fundamental data limitation not resolvable by model architecture alone.
Correlation not causation: SHAP values reflect learned correlations in the training data. They provide interpretations, not causal explanations.
Static training data: Model accuracy will degrade over time as incident patterns and safety language evolve. Continuous retraining would be required for production deployment.
Fatalities excluded: Insufficient fatality cases in the dataset (5 records) for reliable modelling.

How to Use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "stuSterfc/ohs-severity-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

labels = ["None", "Minor", "Moderate", "Severe", "Major"]

def predict_severity(narrative: str) -> dict:
    inputs = tokenizer(
        narrative,
        return_tensors="pt",
        truncation=True,
        max_length=512,
        padding=True
    )
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    probs = torch.softmax(outputs.logits, dim=1).squeeze().tolist()
    
    return {
        "probabilities": {label: round(prob, 4) 
                         for label, prob in zip(labels, probs)},
        "predicted_class": labels[probs.index(max(probs))],
        "confidence": round(max(probs), 4)
    }

# Example
narrative = (
    "Employee was assisting patient transfer from bed to wheelchair. "
    "Felt sharp pain in lower back. Unable to complete shift."
)
result = predict_severity(narrative)
print(result)

# IMPORTANT: Check for needlestick blind spot
needle_terms = ["needle", "needlestick", "lancet", "sharps", "puncture"]
if any(term in narrative.lower() for term in needle_terms):
    print("WARNING: Needle-related incident detected. "
          "Model has known blind spot; refer for human review.")
---

## Input Format

The model was trained on concatenated narratives using `[SEP]` tokens 
as separators between the three OSHA narrative fields, with an 
organisational size prefix:

Organisation size: {size}. [SEP] {what_happened} [SEP] {before_incident} [SEP] {object_substance}


However, testing found that neither the `[SEP]` separators nor the 
organisational size prefix had a significant impact on output 
probabilities. A plain concatenated narrative performs equivalently:

{what_happened} {before_incident} {object_substance}


For practical use, pass the incident narrative as a single string 
without special formatting. Include as much detail as available — 
the model performs better with richer narratives (average word count 
for well-classified Major incidents was 58.6 words vs 17.8 for 
poorly-classified incidents).
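The two input formats described above can be assembled with a small helper. This is a sketch only: the function name `build_input` and its parameters are illustrative, and dropping empty fields is an assumption, since the handling of missing narrative fields during training is not described here.

```python
from typing import Optional

def build_input(
    what_happened: str,
    before_incident: str = "",
    object_substance: str = "",
    size: Optional[int] = None,
    use_separators: bool = False,
) -> str:
    """Join the OSHA narrative fields into one model input string."""
    fields = (what_happened, before_incident, object_substance)
    # Assumption: empty or whitespace-only fields are dropped.
    parts = [f.strip() for f in fields if f and f.strip()]
    if use_separators:
        # Training-style format with size prefix and [SEP] separators.
        prefix = [f"Organisation size: {size}."] if size is not None else []
        return " [SEP] ".join(prefix + parts)
    # Plain concatenation, reported above to perform equivalently.
    return " ".join(parts)

plain = build_input("Felt sharp pain in lower back.", "Assisting patient transfer.", "Wheelchair")
training_style = build_input("a", "b", "c", size=2, use_separators=True)
print(plain)
print(training_style)
```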

## Training Details

- **Base model:** `distilbert-base-cased`
- **Framework:** HuggingFace Transformers
- **Fine-tuning:** HuggingFace Trainer API
- **Hyperparameter optimisation:** Optuna
- **Class imbalance handling:** Weighted loss function (inverse class frequency)
- **Training data split:** Standard train/validation/test split

---

## Citation

*Citation will be added when arXiv preprint is published.*

---

## About the Author

Stuart Clark — MSc Computer Science (Distinction), Northumbria University 2024.
NEBOSH National Diploma. 14 years occupational health and safety consulting.

GitHub: github.com/stuartclark-ml


Model size: 65.8M params (Safetensors, tensor type F32)