# HR Conversations Multi-Label Classifier

A fine-tuned **DistilBERT-base-uncased** (66M parameters) for multi-label classification of HR support conversations.

## Model Details

| Attribute | Value |
|-----------|-------|
| Base Model | `distilbert/distilbert-base-uncased` |
| Task | Multi-label text classification |
| Labels | 20 HR topics |
| Training Data | 100 synthetic HR conversations |
| Framework | Hugging Face Transformers |

## 20 HR Topic Labels

1. Benefits
2. Career Development
3. Compliance & Legal
4. Contracts
5. Diversity, Equity & Inclusion
6. Expense Management
7. Harassment
8. Health
9. IT & Equipment
10. Leave & Absence
11. Mobility
12. Offboarding
13. Onboarding
14. Payroll
15. Performance Management
16. Recruitment
17. Safety
18. Timetracking
19. Training
20. Work Arrangements

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "AurelPx/hr-conversations-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # inference mode (from_pretrained already sets this; kept explicit)

# Label names, listed in the same order as the topic list above.
LABELS = [
    "Benefits", "Career Development", "Compliance & Legal", "Contracts",
    "Diversity, Equity & Inclusion", "Expense Management", "Harassment",
    "Health", "IT & Equipment", "Leave & Absence", "Mobility",
    "Offboarding", "Onboarding", "Payroll", "Performance Management",
    "Recruitment", "Safety", "Timetracking", "Training", "Work Arrangements"
]

def classify(text, threshold=0.3):
    """Return every label whose sigmoid probability is at least `threshold`."""
    # Inputs longer than 512 tokens are truncated.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Multi-label: an independent sigmoid per label, not a softmax over labels.
    probs = torch.sigmoid(logits).numpy()[0]
    return [LABELS[i] for i, p in enumerate(probs) if p >= threshold]

# Example
conversation = "USER: I haven't received my payslip for March yet..."
print(classify(conversation))  # e.g. ['Payroll']
```

## Training Notes

- **Dataset size**: 100 conversations
- **Split**: 80 train / 20 validation
- **Epochs**: 4-8 with early stopping
- **Limitations**: With only 100 samples across 20 labels, the model is trained in a very low-data regime. For production use, collect more than 500 samples per label or apply data augmentation.

## Links

- Dataset: [AurelPx/ml-intern-a2d69eee-datasets](https://huggingface.co/datasets/AurelPx/ml-intern-a2d69eee-datasets)
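
## Example: Thresholding Scores

The `classify` function in the Usage section turns each logit into an independent probability and keeps the labels at or above a threshold. The sketch below isolates that step in plain Python so it can be tried without downloading the model. The helper name `select_labels`, the `fallback_top1` option, and the logit values are illustrative assumptions, not part of the released model or its training code. With a model trained on so few samples, it is plausible that no label reaches the threshold for some inputs, so the sketch optionally falls back to the single highest-scoring label.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def select_labels(logits, labels, threshold=0.3, fallback_top1=True):
    """Return (label, probability) pairs at or above `threshold`, best first.

    Hypothetical helper: if nothing passes the threshold and
    `fallback_top1` is set, the single best-scoring label is returned.
    """
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    # Score every label independently, then sort by descending probability.
    scored = sorted(
        ((label, sigmoid(z)) for label, z in zip(labels, logits)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    selected = [(label, p) for label, p in scored if p >= threshold]
    if not selected and fallback_top1 and scored:
        selected = scored[:1]
    return selected


# Made-up logits for three of the 20 labels, for illustration only.
demo_labels = ["Payroll", "Benefits", "Leave & Absence"]
for label, prob in select_labels([2.0, -0.5, -3.0], demo_labels):
    print(f"{label}: {prob:.2f}")
```

Returning the probabilities alongside the label names makes it easier to see how close a prediction is to the cutoff, which helps when choosing a `threshold` on the validation split.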