Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,113 @@
---
language: en
license: apache-2.0
library_name: pytorch
tags:
- distilbert
- text-classification
- multi-task
- incident-response
- sre
- sentinelops
pipeline_tag: text-classification
datasets:
- ayushgupta07xx/sentinelops-corpus
base_model: distilbert-base-uncased
---

# SentinelOps Incident Classifier

Multi-task DistilBERT classifier that predicts **severity** (P0/P1/P2/P3) and **category** (networking, database, deploy, capacity, auth, other) for SRE incident postmortems. It is the non-generative baseline for the [SentinelOps](https://github.com/ayushgupta07xx/sentinelops) flagship project, benchmarked against a fine-tuned Mistral-7B generation model.

## Intended use

Given an incident summary, the model returns `{severity, category}` predictions for routing, prioritization, or retrieval filtering inside the SentinelOps agent.

**Not intended for**: standalone production incident triage, compliance decisions, or any use case where a miscategorization has safety or financial impact.

## Training data

- **Corpus**: 2,000+ public postmortems scraped from `danluu/post-mortems`, the Cloudflare blog, GitHub status (Atom feed), and AWS post-event summaries.
- **Labeled subset**: ~270 examples manually labeled for severity (4 classes) and category (6 classes). The remaining examples are weakly labeled via regex + keyword rules.
- **Preprocessing**: HTML→text conversion; MinHash dedup (threshold 0.85, 5-word shingles, 128 permutations); PII scrub (regex for emails/IPs/tokens + Presidio for names/phones).

Dataset: https://huggingface.co/datasets/ayushgupta07xx/sentinelops-corpus

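
The MinHash dedup step listed under preprocessing can be sketched roughly as follows. This is a brute-force illustration that reuses only the stated parameters (threshold 0.85, 5-word shingles, 128 permutations). The function names, the seeded BLAKE2 hashing, the lowercased whitespace tokenization, and the rule "drop a document when its estimated Jaccard similarity to a kept one reaches the threshold" are all assumptions, not the project's actual pipeline, which may well use a dedicated library and LSH instead.

```python
import hashlib

NUM_PERM = 128     # number of hash "permutations" (from the card)
SHINGLE_SIZE = 5   # words per shingle (from the card)
THRESHOLD = 0.85   # similarity cutoff (from the card)


def shingles(text, k=SHINGLE_SIZE):
    """Set of k-word shingles; texts of k words or fewer become one shingle."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(item, seed):
    """64-bit hash of item under one seeded hash function (assumed scheme)."""
    salt = seed.to_bytes(4, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_perm=NUM_PERM):
    """For each seed, keep the minimum hash over all shingles."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(docs, threshold=THRESHOLD):
    """Keep a document unless it is a near-duplicate of one already kept."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

Comparing every document against every kept one is quadratic, which is tolerable for a corpus of a few thousand postmortems but would need banding/LSH at larger scale.
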
## Training

- **Base**: `distilbert-base-uncased`
- **Architecture**: Shared DistilBERT encoder with two classification heads (severity + category), trained with a joint loss.
- **Hardware**: Kaggle T4 GPU (PyTorch).
- **Optimizer**: AdamW with cosine LR schedule.
- **Tracking**: Weights & Biases (project `sentinelops`, job_type `classifier_pt`).

Training code: https://github.com/ayushgupta07xx/sentinelops/blob/main/training/classifier/train_pt.py

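
To make the "joint loss" in the architecture bullet concrete, here is a small framework-free sketch for a single example. It assumes the joint loss is a sum of the two heads' cross-entropies. The `cat_weight` parameter and the function names are hypothetical, and the linked training code may weight or combine the heads differently.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example, using a numerically stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def joint_loss(sev_logits, sev_target, cat_logits, cat_target, cat_weight=1.0):
    """Severity loss plus (optionally weighted) category loss.

    cat_weight is a hypothetical balancing knob, not a documented setting.
    """
    severity_loss = cross_entropy(sev_logits, sev_target)   # 4-way head
    category_loss = cross_entropy(cat_logits, cat_target)   # 6-way head
    return severity_loss + cat_weight * category_loss
```

Because both heads read the same encoder output, the gradient of this single scalar updates the shared encoder with signal from both tasks at once.
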
## Evaluation

### ⚠️ Test set is 27 examples

This is a **small held-out set**: per-class F1 numbers have high variance, and a single misclassification moves the F1 of a 9-example class by ~0.1. Treat these numbers as directional indicators, not population estimates. The limit reflects the labeled-data budget (~270 manually labeled examples, standard 80/10/10 split), which is the realistic ceiling for a solo 5-week project.

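
The sensitivity claim above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes a class with 9 test examples and no false positives (`fp=0`), and the helper names are made up for this example.

```python
def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), taken as 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_drop(tp, support, fp=0):
    """Change in F1 when one more of the class's `support` examples is missed."""
    fn = support - tp
    return f1(tp, fp, fn) - f1(tp - 1, fp, fn + 1)


for tp in (9, 6, 3):
    print(f"{tp}/9 correct -> F1 drops by {f1_drop(tp, support=9):.3f}")
```

Under these assumptions the drop ranges from about 0.06 (starting from a perfect score) to about 0.14 (starting from 3 of 9 correct), which is consistent with the ~0.1 figure.
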
### Headline metrics

| Head     | Accuracy | Macro-F1 | Weighted-F1 |
|----------|----------|----------|-------------|
| Severity | 0.48     | 0.36     | 0.52        |
| Category | 0.30     | 0.36     | 0.24        |

Full per-class precision/recall/F1 is in `eval_report.json`. Confusion matrices: `assets/confusion_matrix_severity.png`, `assets/confusion_matrix_category.png`.

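
For readers comparing the Macro-F1 and Weighted-F1 columns, the following sketch shows how the two averages differ. It uses the standard definitions on a made-up toy example. The function names and labels are illustrative, and this is not the evaluation script that produced `eval_report.json`.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: (f1, support)} for every label seen in y_true or y_pred."""
    support = Counter(y_true)
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        scores[label] = (2 * tp / (2 * tp + fp + fn), support[label])
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(f for f, _ in scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's number of true examples."""
    scores = per_class_f1(y_true, y_pred)
    return sum(f * n for f, n in scores.values()) / len(y_true)


# Toy data: four P1 incidents and one P0 incident.
y_true = ["P1", "P1", "P1", "P1", "P0"]
y_pred = ["P1", "P1", "P1", "P0", "P0"]
print(round(macro_f1(y_true, y_pred), 3), round(weighted_f1(y_true, y_pred), 3))
```

Macro-F1 lets a rare class pull the average as much as a common one, while Weighted-F1 tracks the common classes, so the two can diverge noticeably on a 27-example test set.
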
## Limitations

1. **Tiny test set (27)**: confidence intervals on per-class F1 are wide. Do not use these numbers to support state-of-the-art claims.
2. **Class imbalance**: some classes (e.g., `auth`) had fewer than 10 training examples, and the model underperforms on them.
3. **Weak labels**: the majority of training data is labeled by regex/keyword rules, so the model partially learns those rules rather than semantic features. Performance on incidents whose vocabulary doesn't match the rules is likely worse.
4. **Domain drift**: trained on public postmortems from hyperscalers and open-source infra. Generalization to internal enterprise incidents without further fine-tuning is unverified.
5. **English only**: all training data is English.

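
The weak-label limitation is easier to see with a toy rule set. The sketch below is purely illustrative: the keywords, the `CATEGORY_RULES` table, and the first-match-wins policy are hypothetical and are not the rules used to label the corpus.

```python
import re

# Hypothetical keyword rules, NOT the project's actual rule set.
CATEGORY_RULES = {
    "networking": re.compile(r"\b(dns|bgp|packet loss|load balancer)\b", re.I),
    "database": re.compile(r"\b(replica|postgres|mysql|deadlock)\b", re.I),
    "auth": re.compile(r"\b(oauth|login|certificate|token)\b", re.I),
}


def weak_category(text):
    """Return the first category whose rule matches, else 'other'."""
    for category, pattern in CATEGORY_RULES.items():
        if pattern.search(text):
            return category
    return "other"


print(weak_category("A BGP route leak caused packet loss"))      # rule vocabulary present
print(weak_category("Name resolution failed across the fleet"))  # same topic, no keyword hit
```

The second incident is arguably a networking problem, yet it falls through to `other` because none of the keywords appear. A model trained mostly on labels like these tends to inherit the same blind spots.
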
## Bias considerations

- Overrepresentation of hyperscaler incidents (AWS, GitHub, Cloudflare) vs. small-org or on-prem incidents.
- "Severity" labels reflect the reporting org's conventions, which differ across companies, so the model learns an average that may not match your org's definitions.
- Weak labels encode my prior about which keywords indicate each category, which may bias category boundaries.

## Files

- `model.pt`: PyTorch state dict (~253 MB)
- `tokenizer/`: HF tokenizer files
- `config.json`: model hyperparameters
- `label_mappings.json`: `severity` and `category` label lists
- `eval_report.json`: full classification reports
- `assets/confusion_matrix_{severity,category}.png`: confusion matrices

## Loading

```python
import json

import torch
from transformers import DistilBertTokenizerFast
from huggingface_hub import snapshot_download

# Download the repo files (weights, tokenizer, label mappings) to the local cache.
local = snapshot_download("ayushgupta07xx/sentinelops-classifier")

# Requires the DistilBertMultiTask class from:
# https://github.com/ayushgupta07xx/sentinelops/blob/main/training/classifier/model_pt.py
from model_pt import DistilBertMultiTask, Config

with open(f"{local}/label_mappings.json") as f:
    lm = json.load(f)

# Size each classification head from the label lists.
cfg = Config(num_severity=len(lm["severity"]), num_category=len(lm["category"]))
model = DistilBertMultiTask(cfg)
model.load_state_dict(torch.load(f"{local}/model.pt", map_location="cpu"))
model.eval()  # inference mode: disables dropout

tokenizer = DistilBertTokenizerFast.from_pretrained(f"{local}/tokenizer")
```

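
Once the model has produced scores, they still need to be mapped back to label names. The helper below is a hedged sketch of that last step. It assumes you already have one flat sequence of logits per head (for example from calling `.tolist()` on a 1-D tensor) and that position `i` in each list of `label_mappings.json` corresponds to class id `i`. The model's actual forward signature and output format are not documented here, so `decode` and the example label order are hypothetical.

```python
def decode(sev_logits, cat_logits, label_mappings):
    """Map each head's highest logit to its label name."""
    def best(logits, labels):
        scores = list(logits)
        if len(scores) != len(labels):
            raise ValueError("expected one logit per label")
        return labels[scores.index(max(scores))]

    return {
        "severity": best(sev_logits, label_mappings["severity"]),
        "category": best(cat_logits, label_mappings["category"]),
    }


# Hypothetical label order; the real one comes from label_mappings.json.
example_mappings = {
    "severity": ["P0", "P1", "P2", "P3"],
    "category": ["networking", "database", "deploy", "capacity", "auth", "other"],
}
print(decode([0.1, 2.0, -1.0, 0.3], [0.0, 0.0, 0.0, 0.0, 3.0, 0.0], example_mappings))
# {'severity': 'P1', 'category': 'auth'}
```

The result has the same `{severity, category}` shape described under intended use.
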
## Citation

Built as part of [SentinelOps](https://github.com/ayushgupta07xx/sentinelops). No paper: this is engineering, not research.

## Changelog

- **2026-04 v0.1**: Initial release; 27-example test set.