AurelPx/hr-conversations-classifier

AurelPx committed 8 days ago

Commit fd4c2cf (verified) · 1 parent: f903d42

Upload README.md

Files changed (1)

README.md +80 -112


@@ -1,142 +1,110 @@
----
-tags:
-- ml-intern
----
 # HR Conversations Multi-Label Classifier
-Fine-tuned **DistilBERT-base-uncased** (66M parameters) for multi-label classification of HR support conversations.
 ## Model Details
 | Attribute | Value |
 |-----------|-------|
-| Base Model | `distilbert/distilbert-base-uncased` |
-| Task | Multi-label text classification |
-| Labels | 20 HR topics (see below) |
-| Training Data | 100 synthetic HR conversations (first version) |
-| Framework | Hugging Face Transformers |
 ## 20 HR Topic Labels
-1. Benefits
-2. Career Development
-3. Compliance & Legal
-4. Contracts
-5. Diversity, Equity & Inclusion
-6. Expense Management
-7. Harassment
-8. Health
-9. IT & Equipment
-10. Leave & Absence
-11. Mobility
-12. Offboarding
-13. Onboarding
-14. Payroll
-15. Performance Management
-16. Recruitment
-17. Safety
-18. Timetracking
-19. Training
 20. Work Arrangements
-## Quick Usage
-```python
-from transformers import AutoTokenizer, AutoModelForSequenceClassification
-import torch
-model_id = "AurelPx/hr-conversations-classifier"
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForSequenceClassification.from_pretrained(model_id)
-LABELS = [
-    "Benefits", "Career Development", "Compliance & Legal", "Contracts",
-    "Diversity, Equity & Inclusion", "Expense Management", "Harassment", "Health",
-    "IT & Equipment", "Leave & Absence", "Mobility", "Offboarding",
-    "Onboarding", "Payroll", "Performance Management", "Recruitment",
-    "Safety", "Timetracking", "Training", "Work Arrangements",
-]
-def classify(text: str, threshold: float = 0.5):
-    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
-    with torch.no_grad():
-        logits = model(**inputs).logits
-    probs = torch.sigmoid(logits).numpy()[0]
-    return [LABELS[i] for i, p in enumerate(probs) if p >= threshold]
-# Example
-conversation = "USER: I haven't received my payslip for March yet. Could you please check what's going on?"
-print(classify(conversation))  # ['Payroll']
-```
-## Improve the Model (Colab / Local)
-The current model was trained on only **100 real conversations** — this is too small for reliable classification. We provide a complete training script that:
-1. **Generates 5,000 synthetic HR conversations** from templates (no LLM needed)
-2. **Runs 5-fold stratified cross-validation** (no data leakage)
-3. **Fine-tunes DistilBERT** and pushes the best model to Hub
-### Run in Google Colab (Free T4 GPU)
 ```python
-# Step 1: Install dependencies
-!pip install -q transformers datasets accelerate pandas scikit-learn huggingface_hub
-# Step 2: Login to Hugging Face
-from huggingface_hub import notebook_login
-notebook_login()
-# Step 3: Download and run the training script
-!curl -L -o train.py https://huggingface.co/AurelPx/hr-conversations-classifier/raw/main/training_script.py
-!python train.py
 ```
-The script will:
-- Load your dataset from `AurelPx/ml-intern-a2d69eee-datasets`
-- Generate 5,000 synthetic conversations preserving the real distribution
-- Run 5-fold cross-validation, reporting mean ± std F1-micro
-- Push the best model to your Hub
-### Expected Runtime
-- CPU (8 cores): ~6-8 hours
-- Google Colab T4 GPU: ~30-45 minutes
 ## Files in this Repo
 | File | Description |
 |------|-------------|
-| `model.safetensors` | Fine-tuned model weights |
-| `training_script.py` | Full training pipeline (augmentation + 5-fold CV + fine-tuning) |
-| `inference.py` | Standalone inference script |
-| `label_config.json` | Label names and classification threshold |
 ## Dataset
 - [AurelPx/ml-intern-a2d69eee-datasets](https://huggingface.co/datasets/AurelPx/ml-intern-a2d69eee-datasets)
 - 100 English HR conversations with multi-label annotations
-- Conversations include Payroll, Benefits, Leave, Contracts, Training, etc.
 ## License
-Apache 2.0 (same as DistilBERT)
-<!-- ml-intern-provenance -->
-## Generated by ML Intern
-This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
-- Try ML Intern: https://smolagents-ml-intern.hf.space
-- Source code: https://github.com/huggingface/ml-intern
-## Usage
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-model_id = 'AurelPx/hr-conversations-classifier'
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id)
-```
-For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.

 # HR Conversations Multi-Label Classifier
+SetFit-style classifier for **20 HR topic labels** on employee–agent conversations, trained on **5,000 synthetic + 100 real samples** and evaluated with **5-fold stratified cross-validation** (no data leakage).
+## Results
+| Metric | Score |
+|--------|-------|
+| **F1-micro (5-fold CV, mean ± std)** | **0.7962 ± 0.0098** |
+| **F1-macro (5-fold CV)** | **0.7721** |
+| Fold 1 (F1-micro) | 0.7851 |
+| Fold 2 (F1-micro) | 0.7989 |
+| Fold 3 (F1-micro) | 0.8031 |
+| Fold 4 (F1-micro) | 0.7846 |
+| Fold 5 (F1-micro) | **0.8091** |
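
The headline F1-micro figure can be reproduced from the five per-fold scores in the table. The sketch below is a minimal check and not part of the repository's training script. It assumes the reported spread is a population standard deviation (the default of `numpy.std`), which the README does not state but which matches the published value.

```python
from statistics import mean, pstdev

# Per-fold scores copied from the Results table above.
fold_f1_micro = [0.7851, 0.7989, 0.8031, 0.7846, 0.8091]

cv_mean = mean(fold_f1_micro)
# Population standard deviation (divides by n, not n - 1).
cv_std = pstdev(fold_f1_micro)

print(f"F1-micro: {cv_mean:.4f} ± {cv_std:.4f}")  # F1-micro: 0.7962 ± 0.0098
```

Using the sample standard deviation (`statistics.stdev`) instead would give roughly 0.0109, so the choice of estimator is worth recording next to the score.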
 ## Model Details
 | Attribute | Value |
 |-----------|-------|
+| Encoder | `sentence-transformers/all-MiniLM-L6-v2` (384-dim) |
+| Classifier | Multi-output Logistic Regression (scikit-learn) |
+| Training samples | 5,100 (5,000 synthetic + 100 real) |
+| Labels | 20 HR topics |
+| Validation | 5-fold stratified cross-validation |
+| Framework | Sentence-Transformers + scikit-learn |
 ## 20 HR Topic Labels
+1. Benefits
+2. Career Development
+3. Compliance & Legal
+4. Contracts
+5. Diversity, Equity & Inclusion
+6. Expense Management
+7. Harassment
+8. Health
+9. IT & Equipment
+10. Leave & Absence
+11. Mobility
+12. Offboarding
+13. Onboarding
+14. Payroll
+15. Performance Management
+16. Recruitment
+17. Safety
+18. Timetracking
+19. Training
 20. Work Arrangements
+## Usage
 ```python
+from sentence_transformers import SentenceTransformer
+import pickle, json
+from huggingface_hub import hf_hub_download
+# Download artifacts
+classifier_path = hf_hub_download("AurelPx/hr-conversations-classifier", "setfit_classifier.pkl")
+label_path      = hf_hub_download("AurelPx/hr-conversations-classifier", "setfit_label_config.json")
+# Load
+encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
+with open(classifier_path, 'rb') as f:
+    classifier = pickle.load(f)
+with open(label_path) as f:
+    config = json.load(f)
+LABELS = config['label_names']
+# Classify
+sample = (
+    "USER: I haven't received my payslip for March yet. Could you please check what's going on?\n"
+    "AGENT: Good morning. I've checked the payroll system and it appears your March payslip "
+    "was generated on the 28th but there was a distribution delay. I've resent it to your "
+    "registered email. You should receive it within the next hour."
+)
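+emb   = encoder.encode([sample])
+# predict_proba is indexed here as one (n_samples, 2) array per label,
+# so p[0][1] is the probability that the label applies to the single sample.
+proba = classifier.predict_proba(emb)
+preds = [LABELS[i] for i, p in enumerate(proba) if p[0][1] >= 0.5]
+print(preds)  # ['Payroll']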
 ```
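
The last step of the usage snippet, turning per-label probabilities into label names, can be isolated into a small helper. This is a sketch under one assumption: the pickled classifier behaves like scikit-learn's `MultiOutputClassifier`, whose `predict_proba` returns one `(n_samples, 2)` array per label. The README's indexing (`p[0][1]`) suggests this but does not say so. The helper name `decode_multioutput` and the toy inputs are hypothetical.

```python
from typing import List, Sequence


def decode_multioutput(
    proba: Sequence,
    labels: List[str],
    threshold: float = 0.5,
    row: int = 0,
) -> List[str]:
    """Return the labels whose positive-class probability meets the threshold.

    `proba` is assumed to hold one entry per label, each shaped
    (n_samples, 2) with columns [P(label absent), P(label present)].
    """
    if len(proba) != len(labels):
        raise ValueError("expected one probability array per label")
    return [label for label, p in zip(labels, proba) if p[row][1] >= threshold]


# Toy probabilities for three labels and a single sample (illustrative values only).
toy_labels = ["Payroll", "Benefits", "Training"]
toy_proba = [[[0.1, 0.9]], [[0.7, 0.3]], [[0.5, 0.5]]]

print(decode_multioutput(toy_proba, toy_labels))  # ['Payroll', 'Training']
```

Because the function only indexes its input, it accepts nested lists as well as NumPy arrays. Raising `threshold` trades recall for precision on each label.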
+## Training Approach
+1. **Data augmentation**: 5,000 synthetic HR conversations generated from real conversation templates (no LLM, no external API).
+2. **Stratified 5-fold CV**: folds are split by each conversation's primary label, so every fold preserves the label distribution.
+3. **SetFit-style pipeline**: MiniLM embeddings + Logistic Regression, which is quick to train and scored 0.7962 F1-micro in cross-validation.
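
The sketch below illustrates stratifying a multi-label dataset by a single primary label. It is a simplified, deterministic round-robin assignment in plain Python, not scikit-learn's `StratifiedKFold` and not the code in `training_script.py`. The function name `stratified_folds` and the toy labels are hypothetical, and how the primary label is chosen for each conversation is left open because the README does not specify it.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_folds(primary_labels: Sequence[str], n_folds: int = 5) -> List[List[int]]:
    """Assign sample indices to folds so each fold mirrors the label mix."""
    # Group sample indices by their primary label.
    by_label: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(primary_labels):
        by_label[label].append(idx)

    # Deal each label's samples across the folds in turn.
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for indices in by_label.values():
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Toy data: 10 conversations tagged Payroll, 5 tagged Benefits.
toy_primary = ["Payroll"] * 10 + ["Benefits"] * 5
folds = stratified_folds(toy_primary)

for i, fold in enumerate(folds, start=1):
    counts = {label: sum(toy_primary[j] == label for j in fold) for label in ("Payroll", "Benefits")}
    print(f"Fold {i}: {counts}")
```

Each fold serves once as the validation set while the other four are used for training. A real pipeline would normally shuffle within each label group before dealing, and labels with fewer samples than folds cannot appear in every fold.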
 ## Files in this Repo
 | File | Description |
 |------|-------------|
+| `setfit_classifier.pkl` | Trained Logistic Regression classifier |
+| `setfit_encoder.pkl` | SentenceTransformer MiniLM encoder (optional, for offline use) |
+| `setfit_cv_results.json` | Cross-validation scores per fold |
+| `setfit_label_config.json` | Label names and threshold |
+| `training_script.py` | Full training pipeline (augmentation + CV + inference) |
+| `inference.py` | Standalone inference script (DistilBERT legacy — not recommended) |
+| `model.safetensors` | Legacy DistilBERT checkpoint (kept for compatibility) |
 ## Dataset
 - [AurelPx/ml-intern-a2d69eee-datasets](https://huggingface.co/datasets/AurelPx/ml-intern-a2d69eee-datasets)
 - 100 English HR conversations with multi-label annotations
 ## License
+Apache 2.0