---
tags:
- ml-intern
---
# HR Conversations Multi-Label Classifier
SetFit-style classifier for **20 HR topic labels** on employee–agent conversations, trained on **5,000 synthetic + 100 real samples** and evaluated with **5-fold stratified cross-validation** (no data leakage).
## Results
| Metric | Score |
|--------|-------|
| **F1-micro (5-fold CV)** | **0.7962 ± 0.0098** |
| **F1-macro (5-fold CV)** | **0.7721** |
| Fold 1 (F1-micro) | 0.7851 |
| Fold 2 (F1-micro) | 0.7989 |
| Fold 3 (F1-micro) | 0.8031 |
| Fold 4 (F1-micro) | 0.7846 |
| Fold 5 (F1-micro) | **0.8091** |
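
The headline F1-micro figure can be rechecked from the per-fold rows. The short sketch below averages the five fold scores from the table; the reported spread appears to be a population standard deviation (dividing by n rather than n − 1), which is an inference from the arithmetic rather than something the card states.

```python
from statistics import mean, pstdev

# Per-fold scores copied from the results table.
fold_scores = [0.7851, 0.7989, 0.8031, 0.7846, 0.8091]

cv_mean = mean(fold_scores)
cv_std = pstdev(fold_scores)  # population standard deviation (divides by n)

print(f"{cv_mean:.4f} +/- {cv_std:.4f}")  # 0.7962 +/- 0.0098
```

Because the fold values average to the F1-micro row, the per-fold numbers are F1-micro scores; per-fold F1-macro values are not listed in this card.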
## Model Details
| Attribute | Value |
|-----------|-------|
| Encoder | `sentence-transformers/all-MiniLM-L6-v2` (384-dim) |
| Classifier | Multi-output Logistic Regression (scikit-learn) |
| Training samples | 5,100 (5,000 synthetic + 100 real) |
| Labels | 20 HR topics |
| Validation | 5-fold stratified cross-validation |
| Framework | Sentence-Transformers + scikit-learn |
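
The Usage snippet indexes the `predict_proba` output as `p[0][1]`, which matches the output shape of scikit-learn's `MultiOutputClassifier` wrapped around `LogisticRegression`. The sketch below assumes that wrapper (the exact class stored in the pickle is not stated in this card) and uses random 384-dimensional vectors and random targets in place of real MiniLM embeddings and annotations, purely to show the shapes involved.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
n_samples, dim, n_labels = 200, 384, 20

# Stand-ins for encoder output and multi-hot label annotations (assumption).
X = rng.normal(size=(n_samples, dim))
Y = (rng.random((n_samples, n_labels)) < 0.2).astype(int)

# One independent binary logistic regression per label.
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

# predict_proba returns a list with one (n_samples, 2) array per label.
proba = clf.predict_proba(X[:1])
print(len(proba), proba[0].shape)
```

With this layout, `proba[i][0][1]` is the positive-class probability of label `i` for the first sample.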
## 20 HR Topic Labels
1. Benefits
2. Career Development
3. Compliance & Legal
4. Contracts
5. Diversity, Equity & Inclusion
6. Expense Management
7. Harassment
8. Health
9. IT & Equipment
10. Leave & Absence
11. Mobility
12. Offboarding
13. Onboarding
14. Payroll
15. Performance Management
16. Recruitment
17. Safety
18. Timetracking
19. Training
20. Work Arrangements
## Usage
```python
import json
import pickle

from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer

REPO_ID = "AurelPx/hr-conversations-classifier"

# Download artifacts
classifier_path = hf_hub_download(REPO_ID, "setfit_classifier.pkl")
label_path = hf_hub_download(REPO_ID, "setfit_label_config.json")

# Load the encoder, the classifier, and the label names.
# Note: pickle can execute arbitrary code, so only load files you trust.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
with open(classifier_path, "rb") as f:
    classifier = pickle.load(f)
with open(label_path) as f:
    config = json.load(f)
LABELS = config["label_names"]

# Classify
sample = (
    "USER: I haven't received my payslip for March yet. Could you please check what's going on?\n"
    "AGENT: Good morning. I've checked the payroll system and it appears your March payslip "
    "was generated on the 28th but there was a distribution delay. I've resent it to your "
    "registered email. You should receive it within the next hour."
)
emb = encoder.encode([sample])
# proba holds one entry per label; p[0][1] is the positive-class
# probability of that label for the first (and only) sample.
proba = classifier.predict_proba(emb)
preds = [LABELS[i] for i, p in enumerate(proba) if p[0][1] >= 0.5]
print(preds)  # ['Payroll']
```
## Interactive Demo
Try it live: [**AurelPx/hr-classifier-demo**](https://huggingface.co/spaces/AurelPx/hr-classifier-demo)
Paste any HR conversation, adjust the threshold, and see predicted labels with probabilities instantly.
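
The same threshold adjustment can be tried locally. The helper below is a minimal sketch, not the demo's actual code: `labels_above_threshold` is a hypothetical name, and it assumes `proba` has the layout used in the Usage snippet (one entry per label, indexable as `entry[0][1]`). The probabilities in the example are made up for illustration.

```python
def labels_above_threshold(proba, label_names, threshold=0.5):
    """Return {label: probability} for labels at or above the threshold,
    ordered from most to least likely (first sample only)."""
    scores = {name: float(p[0][1]) for name, p in zip(label_names, proba)}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return {name: score for name, score in ranked if score >= threshold}


# Made-up probabilities for three labels, shaped like predict_proba output.
demo_labels = ["Payroll", "Benefits", "Leave & Absence"]
demo_proba = [[[0.08, 0.92]], [[0.55, 0.45]], [[0.90, 0.10]]]

strict = labels_above_threshold(demo_proba, demo_labels, threshold=0.5)
lenient = labels_above_threshold(demo_proba, demo_labels, threshold=0.4)
print(strict)
print(lenient)
```

Lowering the threshold trades precision for recall: more labels pass, including weaker ones such as the second label in this toy input.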
## Training Approach
1. **Data augmentation**: 5,000 synthetic HR conversations generated from real conversation templates (no LLM, no external API, no data leakage).
2. **Stratified 5-fold CV**: folds are split by each sample's primary label, preserving the label distribution in every fold.
3. **SetFit-style pipeline**: MiniLM embeddings + Logistic Regression, which trains quickly and works well on small datasets.
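
Step 2 can be illustrated with a small, dependency-free sketch. It is not the code from `training_script.py`; `stratified_folds` is a hypothetical helper, and it assumes each sample has a single "primary" label to stratify on. In practice, scikit-learn's `StratifiedKFold` is the usual tool for this kind of split.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(primary_labels, n_splits=5, seed=42):
    """Assign every sample index to a fold so that each primary label
    is spread as evenly as possible across the folds."""
    by_label = defaultdict(list)
    for idx, label in enumerate(primary_labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    fold_of = [0] * len(primary_labels)
    next_fold = 0
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal this label's samples round-robin, continuing where the
        # previous label stopped so fold sizes stay balanced.
        for idx in indices:
            fold_of[idx] = next_fold
            next_fold = (next_fold + 1) % n_splits
    return fold_of


# Toy primary labels, for illustration only.
toy_labels = ["Payroll"] * 10 + ["Benefits"] * 5 + ["Health"] * 5
folds = stratified_folds(toy_labels)

fold_sizes = Counter(folds)
payroll_per_fold = Counter(f for f, lab in zip(folds, toy_labels) if lab == "Payroll")
print(sorted(fold_sizes.items()))
print(sorted(payroll_per_fold.items()))
```

Each fold then serves once as the held-out set while the other four are used for training.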
## Files in this Repo
| File | Description |
|------|-------------|
| `setfit_classifier.pkl` | Trained Logistic Regression classifier |
| `setfit_encoder.pkl` | SentenceTransformer MiniLM encoder (optional, for offline use) |
| `setfit_cv_results.json` | Cross-validation scores per fold |
| `setfit_label_config.json` | Label names and classification threshold |
| `training_script.py` | Full training pipeline (augmentation + CV + inference) |
| `inference.py` | Standalone inference script (DistilBERT legacy — not recommended) |
| `model.safetensors` | Legacy DistilBERT checkpoint (kept for compatibility) |
## Dataset
- [AurelPx/ml-intern-a2d69eee-datasets](https://huggingface.co/datasets/AurelPx/ml-intern-a2d69eee-datasets)
- 100 English HR conversations with multi-label annotations
## License
Apache 2.0