# HR Conversations Multi-Label Classifier
A fine-tuned **DistilBERT-base-uncased** (66M parameters) for multi-label classification of HR support conversations.
## Model Details
| Attribute | Value |
|-----------|-------|
| Base Model | `distilbert/distilbert-base-uncased` |
| Task | Multi-label text classification |
| Labels | 20 HR topics |
| Training Data | 100 synthetic HR conversations |
| Framework | Hugging Face Transformers |
## 20 HR Topic Labels
1. Benefits
2. Career Development
3. Compliance & Legal
4. Contracts
5. Diversity, Equity & Inclusion
6. Expense Management
7. Harassment
8. Health
9. IT & Equipment
10. Leave & Absence
11. Mobility
12. Offboarding
13. Onboarding
14. Payroll
15. Performance Management
16. Recruitment
17. Safety
18. Timetracking
19. Training
20. Work Arrangements
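
A conversation can touch any subset of these 20 topics, so training targets are 20-dimensional multi-hot vectors (one binary indicator per label). A minimal encoding sketch, assuming the label order above matches the model's output dimensions:

```python
# Multi-hot encoding sketch: one binary indicator per HR topic.
# Assumes this ordering matches the model's id2label mapping.
LABELS = [
    "Benefits", "Career Development", "Compliance & Legal", "Contracts",
    "Diversity, Equity & Inclusion", "Expense Management", "Harassment", "Health",
    "IT & Equipment", "Leave & Absence", "Mobility", "Offboarding",
    "Onboarding", "Payroll", "Performance Management", "Recruitment",
    "Safety", "Timetracking", "Training", "Work Arrangements",
]

def encode(topics):
    """Map a set of topic names to a 20-dim binary target vector."""
    return [1 if label in topics else 0 for label in LABELS]

vec = encode({"Payroll", "Leave & Absence"})
# -> 20-element list with 1s at the "Leave & Absence" and "Payroll" positions
```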
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_id = "AurelPx/hr-conversations-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
LABELS = [
    "Benefits", "Career Development", "Compliance & Legal", "Contracts",
    "Diversity, Equity & Inclusion", "Expense Management", "Harassment", "Health",
    "IT & Equipment", "Leave & Absence", "Mobility", "Offboarding",
    "Onboarding", "Payroll", "Performance Management", "Recruitment",
    "Safety", "Timetracking", "Training", "Work Arrangements"
]

def classify(text, threshold=0.3):
    """Return every label whose sigmoid probability clears the threshold."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.sigmoid(logits).numpy()[0]
    return [LABELS[i] for i, p in enumerate(probs) if p >= threshold]
# Example
conversation = "USER: I haven't received my payslip for March yet..."
print(classify(conversation)) # ['Payroll']
```
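
Because this is a multi-label head, the logits pass through an element-wise sigmoid rather than a softmax: each label gets an independent probability, and the 20 values need not sum to 1. A minimal numeric sketch of that decision step (pure NumPy, no model download; the logit values and label subset are made up for illustration):

```python
import numpy as np

# Illustrative subset of labels and hypothetical model logits.
LABELS_DEMO = ["Payroll", "Leave & Absence", "Benefits"]
logits = np.array([2.2, -1.5, 0.1])

# Element-wise sigmoid: each label is scored independently.
probs = 1 / (1 + np.exp(-logits))
# sigmoid(2.2) ~ 0.90, sigmoid(-1.5) ~ 0.18, sigmoid(0.1) ~ 0.52

predicted = [l for l, p in zip(LABELS_DEMO, probs) if p >= 0.3]
# -> ["Payroll", "Benefits"]
```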
## Training Notes
- **Dataset size**: 100 conversations (small dataset)
- **Split**: 80 train / 20 validation
- **Epochs**: 4-8 with early stopping
- **Limitations**: With only 100 samples across 20 classes, the model is in a very low-data regime. For production use, collect >500 samples per label or apply data augmentation.
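
The 0.3 decision threshold in the usage example is itself a tunable hyperparameter; with a validation set this small, it is worth sweeping it against micro-F1 rather than keeping a fixed default. A hedged sketch in pure NumPy (`y_true` and `y_prob` stand in for your validation label matrix and sigmoid outputs):

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over a binary (n_samples, n_labels) matrix."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, y_prob, grid=np.arange(0.1, 0.9, 0.05)):
    """Return the (threshold, micro_f1) pair maximizing micro-F1."""
    scores = [(float(t), micro_f1(y_true, (y_prob >= t).astype(int)))
              for t in grid]
    return max(scores, key=lambda s: s[1])
```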
## Links
- Dataset: [AurelPx/ml-intern-a2d69eee-datasets](https://huggingface.co/datasets/AurelPx/ml-intern-a2d69eee-datasets)