---
library_name: transformers
license: apache-2.0
base_model: distilbert-base-uncased
tags:
- generated_from_trainer
model-index:
- name: schedulebot-nlu-engine
  results: []
datasets:
- andreaceto/hasd
language:
- en
---
# Schedulebot-nlu-engine
## Model Description
This model is a multi-task Natural Language Understanding (NLU) engine designed specifically for an appointment scheduling chatbot. It is fine-tuned from a **`distilbert-base-uncased`** backbone and is capable of performing two tasks simultaneously:
- **Intent Classification**: Identifying the user's primary goal (e.g., `schedule`, `cancel`).
- **Named Entity Recognition (NER)**: Extracting custom, domain-specific entities (e.g., `appointment_type`).
What sets this model apart is its custom classification heads: each one is a small MLP rather than a single linear layer, which is intended to improve performance on nuanced tasks.
## Model Architecture
The model uses a standard `distilbert-base-uncased` model as its core feature extractor. Two custom classification "heads" are placed on top of this base to perform the downstream tasks.
- **Base Model**: `distilbert-base-uncased`
- **Classifier Heads**: each head is a Multi-Layer Perceptron (MLP) with the following structure to allow for more complex feature interpretation:
  1. A Linear layer projecting the transformer's output dimension (768) to an intermediate size (384).
  2. A GELU activation function.
  3. A Dropout layer with a rate of 0.3 for regularization.
  4. A final Linear layer projecting the intermediate size to the number of output labels for the specific task (intent or NER).
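The head structure listed above can be sketched in PyTorch as follows. This is a minimal illustration, not the model's actual source: the class name `MLPHead`, the use of the first token's hidden state for intent classification, and the random input tensor are all assumptions. The label counts (9 intents, 7 BIO tags) are taken from the evaluation tables in this card.

```python
import torch
from torch import nn


class MLPHead(nn.Module):
    """Hypothetical MLP head: Linear(768 -> 384) -> GELU -> Dropout(0.3) -> Linear(384 -> num_labels)."""

    def __init__(self, num_labels, hidden_size=768, intermediate_size=384, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(intermediate_size, num_labels),
        )

    def forward(self, hidden_states):
        return self.net(hidden_states)


# One head per task.
intent_head = MLPHead(num_labels=9)
ner_head = MLPHead(num_labels=7)

# Stand-in for the DistilBERT output: (batch, sequence length, hidden size).
hidden_states = torch.randn(2, 16, 768)

# Assumption: the intent head reads the first token's representation,
# while the NER head is applied to every token.
intent_logits = intent_head(hidden_states[:, 0])
ner_logits = ner_head(hidden_states)
```

With this layout, `intent_logits` has one row of scores per utterance and `ner_logits` has one row of scores per token.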
## Intended Use
This model is intended to be the core NLU component of a conversational AI system for managing appointments.
For usage instructions, see the [dedicated file](./how_to_use.md).
## Training Data
The model was trained on the **HASD (Hybrid Appointment Scheduling Dataset)**, a custom dataset built specifically for this task.
- **Source**: The dataset is a hybrid of real-world conversational examples from `clinc/clinc_oos` (for simple intents) and synthetically generated, template-based examples for complex scheduling intents.
- **Balancing**: To combat class imbalance, intents sourced from `clinc/clinc_oos` were **down-sampled** to a maximum of **150 examples** each.
- **Augmentation**: To increase data diversity for complex intents (`schedule`, `reschedule`, etc.), **Contextual Word Replacement** was used. A `distilbert-base-uncased` model augmented the templates by replacing non-placeholder words with contextually relevant synonyms.
The dataset is available [here](https://huggingface.co/datasets/andreaceto/hasd).
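The balancing step described above can be sketched as a per-intent cap. This is only an illustration of the idea, not the script used to build HASD: the function name `downsample_per_intent`, the `text` and `intent` field names, the seed, and the toy data are assumptions.

```python
import random
from collections import Counter, defaultdict


def downsample_per_intent(examples, max_per_intent=150, seed=42):
    """Keep at most `max_per_intent` randomly chosen examples for each intent."""
    by_intent = defaultdict(list)
    for example in examples:
        by_intent[example["intent"]].append(example)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    balanced = []
    for items in by_intent.values():
        if len(items) > max_per_intent:
            items = rng.sample(items, max_per_intent)
        balanced.extend(items)
    return balanced


# Toy data: one over-represented intent and one that is already under the cap.
examples = [{"text": f"hello {i}", "intent": "greeting"} for i in range(400)]
examples += [{"text": f"bye {i}", "intent": "bye"} for i in range(100)]

balanced = downsample_per_intent(examples)
counts = Counter(example["intent"] for example in balanced)
```

Intents that already have fewer examples than the cap are left untouched, so only the over-represented classes shrink.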
### Intents
The model is trained to recognize the following intents:
`schedule`, `reschedule`, `cancel`, `query_avail`, `greeting`, `positive_reply`, `negative_reply`, `bye`, `oos` (out-of-scope).
### Entities
The model is trained to recognize the following custom named entities:
`practitioner_name`, `appointment_type`, `appointment_id`.
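The NER evaluation table in this card reports `B-` and `I-` tags for each entity plus `O`, which suggests a BIO tagging scheme. A minimal sketch of how such a label set could be derived from the three entity names is shown below; the helper name `build_bio_labels` and the ordering of the labels are assumptions and may not match the model's actual `id2label` mapping.

```python
ENTITY_TYPES = ["practitioner_name", "appointment_type", "appointment_id"]


def build_bio_labels(entity_types):
    """Return the BIO tag list: 'O' plus a B- and I- tag for each entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")  # first token of an entity span
        labels.append(f"I-{entity}")  # any following token of the same span
    return labels


labels = build_bio_labels(ENTITY_TYPES)
label2id = {label: index for index, label in enumerate(labels)}
id2label = {index: label for label, index in label2id.items()}
```

Three entity types therefore give seven token-level classes, matching the seven rows of the NER table.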
## Training Procedure
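The model was trained using a two-stage fine-tuning strategy to ensure stability and performance: the new heads are trained first on a frozen backbone, then the whole model is fine-tuned at a much lower learning rate.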
### Stage 1: Training the Classifier Heads
- The `distilbert-base-uncased` base model was entirely **frozen**.
- Only the randomly initialized MLP heads for intent and NER classification were trained.
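Freezing the backbone amounts to turning off gradients for its parameters. The sketch below shows the idea on a tiny stand-in module; the helper `set_backbone_trainable`, the class `TinyMultiTaskModel`, and the attribute names `distilbert`, `intent_head`, and `ner_head` are hypothetical and may differ from the real model class.

```python
from torch import nn


def set_backbone_trainable(model, trainable, backbone_attr="distilbert"):
    """Enable or disable gradient updates for every backbone parameter."""
    backbone = getattr(model, backbone_attr)
    for param in backbone.parameters():
        param.requires_grad = trainable


class TinyMultiTaskModel(nn.Module):
    """Stand-in with the same three parts: a backbone and two task heads."""

    def __init__(self):
        super().__init__()
        self.distilbert = nn.Linear(8, 8)  # placeholder for the real backbone
        self.intent_head = nn.Linear(8, 9)
        self.ner_head = nn.Linear(8, 7)


model = TinyMultiTaskModel()

# Stage 1: freeze the backbone so only the heads receive gradient updates.
set_backbone_trainable(model, trainable=False)
stage1_trainable = [name for name, p in model.named_parameters() if p.requires_grad]

# Stage 2: unfreeze the backbone again before low-learning-rate fine-tuning.
set_backbone_trainable(model, trainable=True)
stage2_trainable = [name for name, p in model.named_parameters() if p.requires_grad]
```

After the first call only the head weights and biases remain trainable; after the second call every parameter is trainable again.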
**Setup**:
```python
from transformers import (
    DataCollatorForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# `model`, `tokenizer`, `processed_datasets`, `compute_metrics`, `hub_model_id`
# and `hf_token` are assumed to be defined earlier in the training script.

# Define a data collator to handle padding for token classification
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Define Training Arguments
training_args = TrainingArguments(
    output_dir="path/to/output_dir",
    overwrite_output_dir=True,
    num_train_epochs=200,               # Upper bound; early stopping usually ends training sooner
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1e-4,                 # Learning rate
    weight_decay=1e-5,                  # AdamW weight decay
    logging_dir="path/to/logging_dir",
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # Focus on validation loss as the key metric
    # --- Hub Arguments ---
    push_to_hub=True,
    hub_model_id=hub_model_id,
    hub_strategy="end",
    hub_token=hf_token,
    report_to="tensorboard",            # TensorBoard to monitor training
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_datasets["train"],
    eval_dataset=processed_datasets["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,    # Custom function (check how_to_use.md)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],
)
```
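With `metric_for_best_model="eval_loss"` and `EarlyStoppingCallback(early_stopping_patience=10)`, training halts once the validation loss has failed to improve for 10 consecutive evaluations, so the 200-epoch setting acts as an upper bound. The patience logic can be sketched in plain Python as below. This is a simplified illustration, not the library's implementation (which also supports an improvement threshold); the function name `epochs_until_stop` and the loss values are made up.

```python
def epochs_until_stop(eval_losses, patience):
    """Return the epoch at which patience-based early stopping would halt training."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(eval_losses)  # patience never exhausted: all epochs run


# Illustrative validation losses: improvement stops after epoch 3.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63]

stop_with_patience_3 = epochs_until_stop(losses, patience=3)
stop_with_patience_10 = epochs_until_stop(losses, patience=10)
```

A larger patience tolerates longer plateaus, which suits Stage 1, where the randomly initialized heads may need many epochs to converge.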
### Stage 2: Fine-Tuning
- The DistilBERT backbone was entirely **unfrozen**.
- A very low learning rate (`1e-6`) lets the whole model adapt further to the new data while preserving the general-purpose knowledge learned during pretraining.
**Setup**:
```python
# Reuses the imports, `data_collator`, `model`, `tokenizer` and datasets from Stage 1.

# Define Training Arguments
training_args = TrainingArguments(
    output_dir="path/to/output_dir",
    overwrite_output_dir=True,
    num_train_epochs=50,                # Fine-tuning epochs
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1e-6,                 # Learning rate (100x lower than Stage 1)
    weight_decay=1e-3,                  # AdamW weight decay
    logging_dir="path/to/logging_dir",
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # Focus on validation loss as the key metric
    # --- Hub Arguments ---
    push_to_hub=True,
    hub_model_id=hub_model_id,
    hub_strategy="end",
    hub_token=hf_token,
    report_to="tensorboard",            # TensorBoard to monitor training
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_datasets["train"],
    eval_dataset=processed_datasets["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,    # Custom function (check how_to_use.md)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
```
## Evaluation
The model was evaluated on a held-out test set, and its performance was measured for both tasks.
### Intent Classification Performance
| Intent | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| bye | 0.9500 | 0.8261 | 0.8837 | 23 |
| cancel | 0.9211 | 0.8434 | 0.8805 | 83 |
| greeting | 0.9545 | 0.9545 | 0.9545 | 22 |
| negative_reply | 0.9091 | 0.9091 | 0.9091 | 22 |
| oos | 1.0000 | 0.8696 | 0.9302 | 23 |
| positive_reply | 0.7407 | 0.9091 | 0.8163 | 22 |
| query_avail | 0.9620 | 0.9383 | 0.9500 | 81 |
| reschedule | 0.8506 | 0.8916 | 0.8706 | 83 |
| schedule | 0.8488 | 0.9125 | 0.8795 | 80 |
| --- | --- | --- | --- | ---- |
| **Accuracy** | | | **0.8952** | 439 |
| **Macro Avg** | **0.9041** | **0.8949** | **0.8972** | 439 |
| **Weighted Avg** | **0.8998** | **0.8952** | **0.8960** | 439 |
### NER (Token Classification) Performance
| Entity | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| B-appointment_id | 1.0000 | 1.0000 | 1.0000 | 61 |
| B-appointment_type | 0.8646 | 0.7477 | 0.8019 | 111 |
| B-practitioner_name | 0.9161 | 0.9467 | 0.9311 | 150 |
| I-appointment_id | 0.9667 | 0.9667 | 0.9667 | 210 |
| I-appointment_type | 0.8182 | 0.7368 | 0.7754 | 171 |
| I-practitioner_name | 0.9540 | 0.8941 | 0.9231 | 255 |
| O | 0.9782 | 0.9892 | 0.9837 | 3813 |
| --- | --- | --- | --- | ---- |
| **Accuracy** | | | 0.9673 | 4771 |
| **Macro Avg** | 0.9283 | 0.8973 | 0.9117 | 4771 |
| **Weighted Avg** | 0.9664 | 0.9673 | 0.9666 | 4771 |
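On this test set, the model reaches 96.7% token-level accuracy (macro F1 0.91) on the NER task and 89.5% accuracy (macro F1 0.90) on the intent classification task. The weakest entity is `appointment_type` (F1 between 0.78 and 0.80), and the weakest intent is `positive_reply` (F1 0.82).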
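The summary rows of the intent table can be reproduced from its per-class rows. The sketch below recomputes one F1 score from precision and recall, then the macro and weighted F1 averages; the numbers are copied from the table above, and the variable and function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (F1, support) per intent, copied from the intent classification table.
intent_rows = {
    "bye": (0.8837, 23),
    "cancel": (0.8805, 83),
    "greeting": (0.9545, 22),
    "negative_reply": (0.9091, 22),
    "oos": (0.9302, 23),
    "positive_reply": (0.8163, 22),
    "query_avail": (0.9500, 81),
    "reschedule": (0.8706, 83),
    "schedule": (0.8795, 80),
}

# F1 for `bye` from its precision (0.9500) and recall (0.8261).
bye_f1 = round(f1_score(0.9500, 0.8261), 4)

# Macro average: every intent counts equally.
macro_f1 = round(sum(f1 for f1, _ in intent_rows.values()) / len(intent_rows), 4)

# Weighted average: each intent counts in proportion to its support.
total_support = sum(support for _, support in intent_rows.values())
weighted_f1 = round(
    sum(f1 * support for f1, support in intent_rows.values()) / total_support, 4
)
```

The macro average treats small classes such as `positive_reply` the same as large ones, which is why it is the more sensitive indicator of weak intents.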
## Limitations and Bias
- The model's performance is highly dependent on the quality and scope of the **HASD dataset**. It may not generalize well to phrasing or appointment types significantly different from what it was trained on.
- The dataset was primarily generated from templates, which may not capture the full diversity of real human language.
- The model inherits any biases present in the `distilbert-base-uncased` model and the `clinc/clinc_oos` dataset.