AurelPx committed
Commit 5e5c515 · verified · 1 Parent(s): 9e26b78

Upload README.md

Files changed (1): README.md (+69 -85)
@@ -1,93 +1,77 @@
- ---
- library_name: transformers
- license: apache-2.0
- base_model: distilbert/distilbert-base-uncased
- tags:
- - generated_from_trainer
- - ml-intern
- metrics:
- - precision
- - recall
- - accuracy
- model-index:
- - name: hr-conversations-classifier
-   results: []
- ---
-
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # hr-conversations-classifier
-
- This model is a fine-tuned version of [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) on the None dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.6809
- - F1 Micro: 0.1111
- - F1 Macro: 0.0470
- - Precision: 0.0714
- - Recall: 0.25
- - Accuracy: 0.0
- - Hamming: 0.32
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 1.0000000000000002e-06
- - train_batch_size: 8
- - eval_batch_size: 8
- - seed: 42
- - gradient_accumulation_steps: 2
- - total_train_batch_size: 16
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_steps: 5
- - num_epochs: 2
-
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss | F1 Micro | F1 Macro | Precision | Recall | Accuracy | Hamming |
- |:-------------:|:-----:|:----:|:---------------:|:--------:|:--------:|:---------:|:------:|:--------:|:-------:|
- | 1.3633 | 1.0 | 5 | 0.6809 | 0.1111 | 0.0470 | 0.0714 | 0.25 | 0.0 | 0.32 |
- | 1.3502 | 2.0 | 10 | 0.6771 | 0.1 | 0.0450 | 0.0648 | 0.2188 | 0.0 | 0.315 |
-
-
- ### Framework versions
-
- - Transformers 5.8.0
- - Pytorch 2.11.0+cu130
- - Datasets 4.8.5
- - Tokenizers 0.22.2
-
- <!-- ml-intern-provenance -->
- ## Generated by ML Intern
-
- This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
-
- - Try ML Intern: https://smolagents-ml-intern.hf.space
- - Source code: https://github.com/huggingface/ml-intern
 
  ## Usage
 
  ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
 
- model_id = 'AurelPx/hr-conversations-classifier'
  tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForCausalLM.from_pretrained(model_id)
  ```
 
- For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
+ # HR Conversations Multi-Label Classifier
+
+ A fine-tuned **DistilBERT-base-uncased** (66M parameters) for multi-label classification of HR support conversations.
+
+ ## Model Details
+
+ | Attribute | Value |
+ |-----------|-------|
+ | Base Model | `distilbert/distilbert-base-uncased` |
+ | Task | Multi-label text classification |
+ | Labels | 20 HR topics |
+ | Training Data | 100 synthetic HR conversations |
+ | Framework | Hugging Face Transformers |
+
+ ## 20 HR Topic Labels
+
+ 1. Benefits
+ 2. Career Development
+ 3. Compliance & Legal
+ 4. Contracts
+ 5. Diversity, Equity & Inclusion
+ 6. Expense Management
+ 7. Harassment
+ 8. Health
+ 9. IT & Equipment
+ 10. Leave & Absence
+ 11. Mobility
+ 12. Offboarding
+ 13. Onboarding
+ 14. Payroll
+ 15. Performance Management
+ 16. Recruitment
+ 17. Safety
+ 18. Timetracking
+ 19. Training
+ 20. Work Arrangements
 
  ## Usage
 
  ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
 
+ model_id = "AurelPx/hr-conversations-classifier"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForSequenceClassification.from_pretrained(model_id)
+
+ LABELS = [
+     "Benefits", "Career Development", "Compliance & Legal", "Contracts",
+     "Diversity, Equity & Inclusion", "Expense Management", "Harassment", "Health",
+     "IT & Equipment", "Leave & Absence", "Mobility", "Offboarding",
+     "Onboarding", "Payroll", "Performance Management", "Recruitment",
+     "Safety", "Timetracking", "Training", "Work Arrangements"
+ ]
+
+ def classify(text, threshold=0.3):
+     inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
+     with torch.no_grad():
+         logits = model(**inputs).logits
+     probs = torch.sigmoid(logits).numpy()[0]
+     return [LABELS[i] for i, p in enumerate(probs) if p >= threshold]
+
+ # Example
+ conversation = "USER: I haven't received my payslip for March yet..."
+ print(classify(conversation)) # ['Payroll']
  ```
 
+ ## Training Notes
+
+ - **Dataset size**: 100 conversations (small dataset)
+ - **Split**: 80 train / 20 validation
+ - **Epochs**: 4-8 with early stopping
+ - **Limitations**: With only 100 samples across 20 classes, the model is in a very low-data regime. For production use, collect >500 samples per label or apply data augmentation.
+
+ ## Links
+
+ - Dataset: [AurelPx/ml-intern-a2d69eee-datasets](https://huggingface.co/datasets/AurelPx/ml-intern-a2d69eee-datasets)
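
The sketch below is not part of this commit. It illustrates, on made-up logits, the threshold-based decoding that `classify` performs and how multi-label metrics like those in the removed card (F1 Micro, Hamming, Accuracy) can be computed. It assumes that "Accuracy" there means exact-match (subset) accuracy, which the card does not state; all function names and sample values are hypothetical.

```python
import math


def decode(logits, threshold=0.3):
    """Apply a sigmoid to each logit and switch the label on if p >= threshold."""
    return [1 if 1.0 / (1.0 + math.exp(-z)) >= threshold else 0 for z in logits]


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over every (sample, label) cell."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def hamming_loss(y_true, y_pred):
    """Fraction of (sample, label) cells that are predicted wrongly."""
    cells = sum(len(row) for row in y_true)
    wrong = sum(t != p for t_row, p_row in zip(y_true, y_pred)
                for t, p in zip(t_row, p_row))
    return wrong / cells


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector matches exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up example with 4 labels and 2 samples (illustrative values only).
y_true = [[1, 0, 0, 1], [0, 1, 0, 0]]
logits = [[2.0, -3.0, 0.0, -2.0], [-2.0, 1.0, -0.5, -3.0]]
y_pred = [decode(row) for row in logits]

print(y_pred)
print(micro_f1(y_true, y_pred), hamming_loss(y_true, y_pred), subset_accuracy(y_true, y_pred))
```

Under that assumption, a subset accuracy of 0.0 alongside a Hamming value near 0.32 is not contradictory: exact match requires all 20 labels of a sample to be right at once, while Hamming counts errors per label.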