---
tags:
- ml-intern
---
# HR Conversations Multi-Label Classifier

SetFit-style classifier that assigns one or more of **20 HR topic labels** to employee–agent conversations. It was trained on **5,100 samples (5,000 synthetic + 100 real)** and evaluated with **5-fold stratified cross-validation** (no data leakage).

## Results

| Metric | Score |
|--------|-------|
| **F1-micro (5-fold CV)** | **0.7962 ± 0.0098** |
| **F1-macro (5-fold CV)** | **0.7721** |
| Fold 1 (F1-micro) | 0.7851 |
| Fold 2 (F1-micro) | 0.7989 |
| Fold 3 (F1-micro) | 0.8031 |
| Fold 4 (F1-micro) | 0.7846 |
| Fold 5 (F1-micro, best) | **0.8091** |
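
The headline F1-micro figure can be recomputed from the per-fold rows. The sketch below assumes the fold scores are F1-micro values and that the `±` term is the population standard deviation across folds (`ddof=0`); neither is stated explicitly in this card, but both assumptions reproduce the reported number.

```python
from statistics import fmean, pstdev

# Per-fold scores copied from the Results table (assumed to be F1-micro).
fold_f1_micro = [0.7851, 0.7989, 0.8031, 0.7846, 0.8091]

# Mean across the five folds.
mean_f1 = fmean(fold_f1_micro)

# Population standard deviation (ddof=0). This choice is an assumption;
# the sample standard deviation (ddof=1) would give a slightly larger value.
std_f1 = pstdev(fold_f1_micro)

print(f"F1-micro (5-fold CV): {mean_f1:.4f} ± {std_f1:.4f}")
# F1-micro (5-fold CV): 0.7962 ± 0.0098
```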

## Model Details

| Attribute | Value |
|-----------|-------|
| Encoder | `sentence-transformers/all-MiniLM-L6-v2` (384-dim) |
| Classifier | Multi-output Logistic Regression (scikit-learn) |
| Training samples | 5,100 (5,000 synthetic + 100 real) |
| Labels | 20 HR topics |
| Validation | 5-fold stratified cross-validation |
| Framework | Sentence-Transformers + scikit-learn |

## 20 HR Topic Labels

1. Benefits  
2. Career Development  
3. Compliance & Legal  
4. Contracts  
5. Diversity, Equity & Inclusion  
6. Expense Management  
7. Harassment  
8. Health  
9. IT & Equipment  
10. Leave & Absence  
11. Mobility  
12. Offboarding  
13. Onboarding  
14. Payroll  
15. Performance Management  
16. Recruitment  
17. Safety  
18. Timetracking  
19. Training  
20. Work Arrangements
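
Because a conversation can carry several topics at once, a multi-output classifier is typically trained on one binary column per label. The following is a minimal sketch of that multi-hot encoding. The helper names `to_multi_hot` and `from_multi_hot` are hypothetical, and the label order is taken from the list above; the order stored in `setfit_label_config.json` may differ, so treat the indices as illustrative.

```python
# The 20 topic labels, in the order listed in this card (assumed order).
LABELS = [
    "Benefits", "Career Development", "Compliance & Legal", "Contracts",
    "Diversity, Equity & Inclusion", "Expense Management", "Harassment",
    "Health", "IT & Equipment", "Leave & Absence", "Mobility", "Offboarding",
    "Onboarding", "Payroll", "Performance Management", "Recruitment",
    "Safety", "Timetracking", "Training", "Work Arrangements",
]
LABEL_TO_INDEX = {name: i for i, name in enumerate(LABELS)}


def to_multi_hot(names):
    """Turn a collection of label names into a 20-element 0/1 vector."""
    vec = [0] * len(LABELS)
    for name in names:
        vec[LABEL_TO_INDEX[name]] = 1
    return vec


def from_multi_hot(vec):
    """Turn a 0/1 vector back into label names, in LABELS order."""
    return [LABELS[i] for i, flag in enumerate(vec) if flag]


# A conversation about a payslip that also touches on unpaid leave.
row = to_multi_hot(["Payroll", "Leave & Absence"])
print(from_multi_hot(row))  # ['Leave & Absence', 'Payroll']
```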

## Usage

```python
from sentence_transformers import SentenceTransformer
import pickle, json
from huggingface_hub import hf_hub_download

# Download artifacts
classifier_path = hf_hub_download("AurelPx/hr-conversations-classifier", "setfit_classifier.pkl")
label_path      = hf_hub_download("AurelPx/hr-conversations-classifier", "setfit_label_config.json")

# Load
encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
with open(classifier_path, 'rb') as f:
    classifier = pickle.load(f)
with open(label_path) as f:
    config = json.load(f)

LABELS = config['label_names']

# Classify
sample = (
    "USER: I haven't received my payslip for March yet. Could you please check what's going on?\n"
    "AGENT: Good morning. I've checked the payroll system and it appears your March payslip "
    "was generated on the 28th but there was a distribution delay. I've resent it to your "
    "registered email. You should receive it within the next hour."
)

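# Embed the conversation, then get one (n_samples, 2) probability array per label
emb   = encoder.encode([sample])
proba = classifier.predict_proba(emb)
# Keep each label whose positive-class probability for the first sample is >= 0.5
preds = [LABELS[i] for i, p in enumerate(proba) if p[0][1] >= 0.5]
print(preds)  # ['Payroll']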
```

## Interactive Demo

Try it live: [**AurelPx/hr-classifier-demo**](https://huggingface.co/spaces/AurelPx/hr-classifier-demo)

Paste any HR conversation, adjust the threshold, and see predicted labels with probabilities instantly.
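
The effect of moving the threshold can be reproduced offline. This is a minimal sketch, not the demo's actual code: `decode_predictions` is a hypothetical helper, and the probabilities are made-up values laid out the way the Usage snippet indexes them (one `(n_samples, 2)` array-like per label, as scikit-learn's `MultiOutputClassifier.predict_proba` returns).

```python
def decode_predictions(proba, labels, threshold=0.5, sample_index=0):
    """Return (label, probability) pairs at or above the threshold.

    proba: one (n_samples, 2) array-like per label; column 1 is assumed
    to hold the positive-class probability.
    """
    scored = [
        (label, float(per_label[sample_index][1]))
        for label, per_label in zip(labels, proba)
    ]
    selected = [(label, score) for label, score in scored if score >= threshold]
    # Most confident labels first.
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Illustrative (made-up) probabilities for a single conversation.
toy_labels = ["Payroll", "Benefits", "Leave & Absence"]
toy_proba = [
    [[0.08, 0.92]],  # Payroll
    [[0.65, 0.35]],  # Benefits
    [[0.45, 0.55]],  # Leave & Absence
]

print(decode_predictions(toy_proba, toy_labels, threshold=0.5))
# [('Payroll', 0.92), ('Leave & Absence', 0.55)]
print(decode_predictions(toy_proba, toy_labels, threshold=0.3))
# [('Payroll', 0.92), ('Leave & Absence', 0.55), ('Benefits', 0.35)]
```

A lower threshold trades precision for recall: more labels are returned, including weaker ones.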

## Training Approach

1. **Data augmentation**: 5,000 synthetic HR conversations generated from real conversation templates (no LLM, no external API, no data leakage).
2. **Stratified 5-fold CV**: folds are split by each conversation's primary label, so every fold preserves the overall label distribution.
3. **SetFit-style pipeline**: MiniLM sentence embeddings feed a multi-output Logistic Regression classifier, which trains quickly on a dataset of this size.
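
Stratifying by primary label can be sketched without any dependencies. The function below is an illustration, not the code in `training_script.py`: `stratified_folds` is a hypothetical name, and how the primary label of a multi-label conversation is chosen (for example, the first annotated label) is an assumption. In practice, scikit-learn's `StratifiedKFold` provides the same kind of split.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(primary_labels, n_splits=5, seed=0):
    """Assign sample indices to folds so each fold mirrors the label mix."""
    rng = random.Random(seed)

    # Group sample indices by their primary label.
    by_label = defaultdict(list)
    for idx, label in enumerate(primary_labels):
        by_label[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    offset = 0  # rotates the starting fold so leftovers spread evenly
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal this label's samples across the folds round-robin.
        for pos, idx in enumerate(indices):
            folds[(offset + pos) % n_splits].append(idx)
        offset += len(indices)
    return folds


# Toy data: 20 conversations with an uneven primary-label distribution.
primary = ["Payroll"] * 10 + ["Benefits"] * 5 + ["Training"] * 5
folds = stratified_folds(primary, n_splits=5)

for i, test_idx in enumerate(folds, start=1):
    counts = Counter(primary[j] for j in test_idx)
    print(f"Fold {i}: {dict(sorted(counts.items()))}")
```

Each fold serves once as the held-out set while the remaining four are used for training, so every sample is evaluated exactly once.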

## Files in this Repo

| File | Description |
|------|-------------|
| `setfit_classifier.pkl` | Trained Logistic Regression classifier |
| `setfit_encoder.pkl` | SentenceTransformer MiniLM encoder (optional, for offline use) |
| `setfit_cv_results.json` | Cross-validation scores per fold |
| `setfit_label_config.json` | Label names and classification threshold |
| `training_script.py` | Full training pipeline (augmentation + CV + inference) |
| `inference.py` | Standalone inference script (DistilBERT legacy — not recommended) |
| `model.safetensors` | Legacy DistilBERT checkpoint (kept for compatibility) |

## Dataset

- [AurelPx/ml-intern-a2d69eee-datasets](https://huggingface.co/datasets/AurelPx/ml-intern-a2d69eee-datasets)
- 100 English HR conversations with multi-label annotations

## License

Apache 2.0