OHS Severity Classifier

DistilBERT fine-tuned for workplace incident severity classification. Developed as part of MSc Computer Science research at Northumbria University (2024).

Research paper: Predicting Workplace Incident Severity Using BERT-Based NLP: Probabilistic Classification and SHAP Analysis for UK OHS Decision Support

arXiv: submission in progress; link will be added on publication

Code repository: github.com/stuartclark-ml/incident-severity-bert


Model Description

This model classifies workplace incident descriptions into five severity categories based on time-loss incapacitation, aligned with UK RIDDOR reporting thresholds and HSE guidance frameworks.

The model takes a free-text incident narrative as input and outputs a probability distribution across five severity classes. It is designed as a decision-support tool in a human-in-the-loop workflow, not a standalone classifier. The probability distribution and SHAP token attribution together allow a safety professional to assess confidence and inspect the model's reasoning before acting on a prediction.


Severity Classification Framework

| Class | Label | Condition | UK OHS Alignment |
|---|---|---|---|
| 0 | None | 0 days lost | Below RIDDOR recording threshold |
| 1 | Minor | 1-2 days lost | Below mandatory internal recording threshold |
| 2 | Moderate | 3-7 days lost | Mandatory internal recording under RIDDOR |
| 3 | Severe | 8-28 days lost | Statutory reporting to HSE required |
| 4 | Major | 29+ days lost | Long-term incapacitation; triggers separate management and insurance obligations |

Time-loss days are used as the severity proxy. This metric is compatible with both US OSHA and UK RIDDOR reporting frameworks, which is the primary justification for using OSHA data to develop a UK-applicable tool.
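
The class boundaries in the table above can be written as a small mapping function. This is a minimal sketch for illustration; the name `days_lost_to_class` is hypothetical and is not taken from the project repository.

```python
LABELS = ["None", "Minor", "Moderate", "Severe", "Major"]

def days_lost_to_class(days_lost: int) -> int:
    """Map time-loss days to a severity class index (0-4) using the table's bands."""
    if days_lost < 0:
        raise ValueError("days_lost must be non-negative")
    if days_lost == 0:
        return 0  # None
    if days_lost <= 2:
        return 1  # Minor: 1-2 days
    if days_lost <= 7:
        return 2  # Moderate: 3-7 days
    if days_lost <= 28:
        return 3  # Severe: 8-28 days
    return 4      # Major: 29+ days

# Example: look up the label for a 10-day absence
label = LABELS[days_lost_to_class(10)]
```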


Intended Uses

Appropriate uses:

  • Triage support for high-volume incident report queues in health and social care settings
  • Research into NLP-based severity prediction for occupational safety
  • Demonstration of BERT-based classification with SHAP explainability in safety-critical domains
  • Educational exploration of model bias in imbalanced safety datasets

Not appropriate for:

  • Standalone severity determination without human review
  • UK RIDDOR reportability decisions: the model was not trained on RIDDOR data
  • Industries outside health and social care: trained on H&C sector data only
  • Definitive clinical or legal severity assessments

Training Data

Dataset: US OSHA Incident Reports (2024 release), Health and Care sector subset

Size: 181,586 records after preprocessing

  • Full OSHA dataset: 751,770 records
  • H&C sector subset: 193,512 records
  • COVID-19 records removed: 10,016 (to prevent rare global event bias)
  • Fatality records removed: 5 (statistically insignificant, severe class imbalance)

Features used:

  • new_nar_what_happened: what happened during the incident
  • new_nar_before_incident: what the employee was doing beforehand
  • new_nar_object_substance: the object or substance causing harm
  • size β€” organisational size (1=<20, 2=20-249, 3=250+ employees)

Class distribution: Heavily imbalanced and dominated by the None class, reflecting the real-world incident distribution (Heinrich's Triangle). A weighted loss function was applied during training (up to 4.05x penalty for the Minor class).
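
One common way to derive inverse-frequency class weights is sketched below. The class counts are hypothetical placeholders, not the real distribution of this dataset, and the normalisation used (total / (n_classes * count)) is an assumption; the exact scheme behind the 4.05x figure is not stated in this card.

```python
def inverse_frequency_weights(class_counts: list[int]) -> list[float]:
    """Return one weight per class, larger for rarer classes."""
    total = sum(class_counts)
    n_classes = len(class_counts)
    return [total / (n_classes * count) for count in class_counts]

# Hypothetical counts for None, Minor, Moderate, Severe, Major (illustration only)
counts = [700, 50, 80, 90, 80]
weights = inverse_frequency_weights(counts)

for name, w in zip(["None", "Minor", "Moderate", "Severe", "Major"], weights):
    print(f"{name:<9} weight={w:.2f}")
```

Weights of this kind are typically passed to the loss as a tensor, for example through the `weight` argument of `torch.nn.CrossEntropyLoss`.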


Performance

Overall Metrics

| Metric | Random Forest (baseline) | DistilBERT | ModernBERT |
|---|---|---|---|
| Weighted Avg F1 | 0.55 | 0.58 | 0.58 |
| Macro ROC AUC | 0.77 | 0.77 | 0.78 |
| Overall Accuracy | 0.52 | 0.55 | 0.54 |
| Log Loss | 1.29 | 1.07 | 1.10 |

DistilBERT selected as the deployment model: equal F1 to ModernBERT, highest accuracy, lowest log loss, and smaller model size.
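
Log loss rewards well-calibrated probabilities, which matters for a model whose output is meant to be read as a distribution. The sketch below shows how multi-class log loss is computed; the probabilities are made up for illustration and are not model outputs.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1.0)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

# A uniform guess over five classes scores ln(5)
uniform_loss = multiclass_log_loss([4], [[0.2, 0.2, 0.2, 0.2, 0.2]])

# A confident, correct prediction scores much lower
confident_loss = multiclass_log_loss([0], [[0.9, 0.025, 0.025, 0.025, 0.025]])

print(round(uniform_loss, 3), round(confident_loss, 3))
```

For reference, a uniform guess over five classes gives ln(5), about 1.61, so lower values indicate that the model is placing more probability on the true class than chance would.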

Per-Class Metrics: DistilBERT

| Class | Precision | Recall | F1 |
|---|---|---|---|
| None | 0.94 | 0.71 | 0.81 |
| Minor | 0.16 | 0.14 | 0.15 |
| Moderate | 0.19 | 0.25 | 0.22 |
| Severe | 0.21 | 0.20 | 0.20 |
| Major | 0.37 | 0.60 | 0.46 |

Key finding: 0.60 recall on Major (highest-severity) incidents makes the model useful for prioritisation, catching 60% of the most serious incidents for urgent human review, despite poor performance on the middle classes.
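
The F1 column follows from precision and recall as their harmonic mean, F1 = 2PR / (P + R). The short check below recomputes it from the per-class values reported above; small differences are expected because the published precision and recall are already rounded to two decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) from the per-class DistilBERT metrics
per_class = {
    "None": (0.94, 0.71, 0.81),
    "Minor": (0.16, 0.14, 0.15),
    "Moderate": (0.19, 0.25, 0.22),
    "Severe": (0.21, 0.20, 0.20),
    "Major": (0.37, 0.60, 0.46),
}

for name, (p, r, reported) in per_class.items():
    print(f"{name:<9} recomputed={f1_score(p, r):.3f} reported={reported:.2f}")
```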


Critical Findings from SHAP Analysis

SHAP token attribution was applied to DistilBERT to inspect the model's learned logic. This revealed two significant issues that any user of this model must understand:

1. Needlestick Blind Spot

99.25% of needlestick injuries in the training dataset are labelled None.

The model learned this spurious correlation and systematically classifies needlestick injuries as None severity β€” regardless of their actual consequences. In clinical practice needlestick injuries can result in bloodborne pathogen exposure with serious long-term health consequences.

Mitigation: Any incident narrative containing needle-related terms (needle, needlestick, lancet, sharps, puncture) should be flagged for mandatory human review, and the model output for it should not be relied upon.

2. Organisational Size Bias

The size feature introduced a spurious correlation: small organisations (size=1, <20 employees) are associated with None class at 50.8%.

Mitigation: Model predictions for incidents from small organisations should be treated with additional caution. Future implementations should consider removing the size feature unless a more balanced dataset is available.

Positive Finding: No Significant US Linguistic Bias

SHAP global feature importance analysis found no significant US-centric linguistic bias in the top predictive features. The model relied on universal medical and safety terminology rather than US-specific language, supporting its tentative use as a UK proxy tool. However, regulatory and operational differences between US and UK health and social care systems mean direct comparison with UK standards still requires caution.


Limitations

  • Jurisdiction: Trained on US OSHA data. UK RIDDOR performance not characterised. Time-loss metric provides a compatibility bridge but does not eliminate cross-jurisdictional differences.
  • Industry: Health and social care only. Performance on other sectors is unknown and likely poor.
  • Minority classes: Minor, Moderate, and Severe classes perform poorly (F1 0.15-0.22). Narratives for these classes are textually indistinct, a fundamental data limitation not resolvable by model architecture alone.
  • Correlation not causation: SHAP values reflect learned correlations in the training data. They provide interpretations, not causal explanations.
  • Static training data: Model accuracy will degrade over time as incident patterns and safety language evolve. Continuous retraining would be required for production deployment.
  • Fatalities excluded: Insufficient fatality cases in the dataset (5 records) for reliable modelling.

How to Use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "stuSterfc/ohs-severity-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

labels = ["None", "Minor", "Moderate", "Severe", "Major"]

def predict_severity(narrative: str) -> dict:
    inputs = tokenizer(
        narrative,
        return_tensors="pt",
        truncation=True,
        max_length=512,
        padding=True
    )
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    probs = torch.softmax(outputs.logits, dim=1).squeeze().tolist()
    
    return {
        "probabilities": {label: round(prob, 4) 
                         for label, prob in zip(labels, probs)},
        "predicted_class": labels[probs.index(max(probs))],
        "confidence": round(max(probs), 4)
    }

# Example: keep the narrative in a variable so it can be reused below
narrative = (
    "Employee was assisting patient transfer from bed to wheelchair. "
    "Felt sharp pain in lower back. Unable to complete shift."
)
result = predict_severity(narrative)
print(result)

# IMPORTANT: Check for needlestick blind spot
needle_terms = ["needle", "needlestick", "lancet", "sharps", "puncture"]
if any(term in narrative.lower() for term in needle_terms):
    print("WARNING: Needle-related incident detected. "
          "Model has known blind spot - refer for human review.")
---

## Input Format

The model was trained on concatenated narratives using `[SEP]` tokens 
as separators between the three OSHA narrative fields, with an 
organisational size prefix:

Organisation size: {size}. [SEP] {what_happened} [SEP] {before_incident} [SEP] {object_substance}


However, testing found that neither the `[SEP]` separators nor the 
organisational size prefix had a significant impact on output 
probabilities. A plain concatenated narrative performs equivalently:

{what_happened} {before_incident} {object_substance}


For practical use, pass the incident narrative as a single string 
without special formatting. Include as much detail as available: 
the model performs better with richer narratives (average word count 
for well-classified Major incidents was 58.6 words vs 17.8 for 
poorly-classified incidents).
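
A minimal sketch of assembling either input format from the three narrative fields is shown here. The helper name `build_input` and its parameters are illustrative assumptions, as is the choice to skip empty fields; the field order follows the templates above.

```python
from typing import Optional

def build_input(
    what_happened: str,
    before_incident: str = "",
    object_substance: str = "",
    size: Optional[int] = None,
    use_separators: bool = False,
) -> str:
    """Join the narrative fields into a single model input string."""
    fields = (what_happened, before_incident, object_substance)
    # Assumption: empty or whitespace-only fields are dropped
    parts = [f.strip() for f in fields if f and f.strip()]
    if use_separators:
        # Training-style format with size prefix and [SEP] separators
        prefix = [f"Organisation size: {size}."] if size is not None else []
        return " [SEP] ".join(prefix + parts)
    # Plain concatenation, the format recommended for practical use
    return " ".join(parts)

text = build_input(
    "Felt sharp pain in lower back.",
    "Assisting patient transfer from bed to wheelchair.",
    "Patient",
)
```

The resulting string can then be passed as the narrative to the prediction function shown under How to Use.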

## Training Details

- **Base model:** `distilbert-base-cased`
- **Framework:** HuggingFace Transformers
- **Fine-tuning:** HuggingFace Trainer API
- **Hyperparameter optimisation:** Optuna
- **Class imbalance handling:** Weighted loss function (inverse class frequency)
- **Training data split:** Standard train/validation/test split

---

## Citation

*Citation will be added when arXiv preprint is published.*

---

## About the Author

Stuart Clark, MSc Computer Science (Distinction), Northumbria University 2024.
NEBOSH National Diploma. 14 years occupational health and safety consulting.

GitHub: github.com/stuartclark-ml
