AurelPx/hr-conversations-classifier

AurelPx committed 2 days ago

Commit 5e5c515 (verified) · 1 parent: 9e26b78

Upload README.md

Files changed (1)

README.md +69 -85

@@ -1,93 +1,77 @@
----
-library_name: transformers
-license: apache-2.0
-base_model: distilbert/distilbert-base-uncased
-tags:
-- generated_from_trainer
-- ml-intern
-metrics:
-- precision
-- recall
-- accuracy
-model-index:
-- name: hr-conversations-classifier
-  results: []
----
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# hr-conversations-classifier
-This model is a fine-tuned version of [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) on the None dataset.
-It achieves the following results on the evaluation set:
-- Loss: 0.6809
-- F1 Micro: 0.1111
-- F1 Macro: 0.0470
-- Precision: 0.0714
-- Recall: 0.25
-- Accuracy: 0.0
-- Hamming: 0.32
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 1.0000000000000002e-06
-- train_batch_size: 8
-- eval_batch_size: 8
-- seed: 42
-- gradient_accumulation_steps: 2
-- total_train_batch_size: 16
-- optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
-- lr_scheduler_type: linear
-- lr_scheduler_warmup_steps: 5
-- num_epochs: 2
-### Training results
-| Training Loss | Epoch | Step | Validation Loss | F1 Micro | F1 Macro | Precision | Recall | Accuracy | Hamming |
-|:-------------:|:-----:|:----:|:---------------:|:--------:|:--------:|:---------:|:------:|:--------:|:-------:|
-| 1.3633        | 1.0   | 5    | 0.6809          | 0.1111   | 0.0470   | 0.0714    | 0.25   | 0.0      | 0.32    |
-| 1.3502        | 2.0   | 10   | 0.6771          | 0.1      | 0.0450   | 0.0648    | 0.2188 | 0.0      | 0.315   |
-### Framework versions
-- Transformers 5.8.0
-- Pytorch 2.11.0+cu130
-- Datasets 4.8.5
-- Tokenizers 0.22.2
-<!-- ml-intern-provenance -->
-## Generated by ML Intern
-This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
-- Try ML Intern: https://smolagents-ml-intern.hf.space
-- Source code: https://github.com/huggingface/ml-intern
 ## Usage
 ```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-model_id = 'AurelPx/hr-conversations-classifier'
 tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id)
 ```
-For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.

+# HR Conversations Multi-Label Classifier
+A fine-tuned **DistilBERT-base-uncased** (66M parameters) for multi-label classification of HR support conversations.
+## Model Details
+| Attribute | Value |
+|-----------|-------|
+| Base Model | `distilbert/distilbert-base-uncased` |
+| Task | Multi-label text classification |
+| Labels | 20 HR topics |
+| Training Data | 100 synthetic HR conversations |
+| Framework | Hugging Face Transformers |
+## 20 HR Topic Labels
+1. Benefits
+2. Career Development
+3. Compliance & Legal
+4. Contracts
+5. Diversity, Equity & Inclusion
+6. Expense Management
+7. Harassment
+8. Health
+9. IT & Equipment
+10. Leave & Absence
+11. Mobility
+12. Offboarding
+13. Onboarding
+14. Payroll
+15. Performance Management
+16. Recruitment
+17. Safety
+18. Timetracking
+19. Training
+20. Work Arrangements
 ## Usage
 ```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+model_id = "AurelPx/hr-conversations-classifier"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+LABELS = [
+    "Benefits", "Career Development", "Compliance & Legal", "Contracts",
+    "Diversity, Equity & Inclusion", "Expense Management", "Harassment", "Health",
+    "IT & Equipment", "Leave & Absence", "Mobility", "Offboarding",
+    "Onboarding", "Payroll", "Performance Management", "Recruitment",
+    "Safety", "Timetracking", "Training", "Work Arrangements"
+]
+def classify(text, threshold=0.3):
+    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
+    with torch.no_grad():
+        logits = model(**inputs).logits
+    probs = torch.sigmoid(logits).numpy()[0]
+    return [LABELS[i] for i, p in enumerate(probs) if p >= threshold]
+# Example
+conversation = "USER: I haven't received my payslip for March yet..."
+print(classify(conversation))  # ['Payroll']
 ```
+## Training Notes
+- **Dataset size**: 100 conversations (small dataset)
+- **Split**: 80 train / 20 validation
+- **Epochs**: 4-8 with early stopping
+- **Limitations**: With only 100 samples across 20 classes, the model is in a very low-data regime. For production use, collect >500 samples per label or apply data augmentation.
+## Links
+- Dataset: [AurelPx/ml-intern-a2d69eee-datasets](https://huggingface.co/datasets/AurelPx/ml-intern-a2d69eee-datasets)
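
Reviewer note: the `classify` helper added in this commit treats each label independently, applying a sigmoid to every logit and keeping the labels whose probability reaches the threshold. The decoding step can be sketched in isolation, without loading the model, as below. The four-label subset, the logit values, and the `decode` helper are made up for illustration and are not part of the committed README; only the default threshold of 0.3 is taken from the diff.

```python
import math

# Hypothetical 4-label subset of the 20 topics, for illustration only.
LABELS = ["Benefits", "Leave & Absence", "Payroll", "Training"]


def sigmoid(x: float) -> float:
    """Map a raw logit to an independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-x))


def decode(logits, labels, threshold=0.3):
    """Return (label, probability) pairs whose sigmoid score meets the threshold.

    Unlike softmax, the scores are not normalized across labels, so zero,
    one, or several labels can be returned for the same input.
    """
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    scored = [(label, sigmoid(z)) for label, z in zip(labels, logits)]
    return [(label, round(p, 3)) for label, p in scored if p >= threshold]


# Made-up logits standing in for model(**inputs).logits[0]; not real output.
example_logits = [-2.0, 0.0, 1.5, -0.9]
print(decode(example_logits, LABELS))
```

With these made-up logits, two labels pass the threshold and two do not, which shows why the result is a list rather than a single class. Raising the threshold trades recall for precision, which may matter given the small training set noted in the new card.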