---
library_name: transformers
license: apache-2.0
base_model: distilbert-base-uncased
tags:
- generated_from_trainer
model-index:
- name: schedulebot-nlu-engine
  results: []
datasets:
- andreaceto/hasd
language:
- en
---

# Schedulebot-nlu-engine

## Model Description

This model is a multi-task Natural Language Understanding (NLU) engine designed specifically for an appointment scheduling chatbot. It is fine-tuned from a **`distilbert-base-uncased`** backbone and is capable of performing two tasks simultaneously:

- **Intent Classification**: Identifying the user's primary goal (e.g., `schedule`, `cancel`).
- **Named Entity Recognition (NER)**: Extracting custom, domain-specific entities (e.g., `appointment_type`).

A distinguishing feature of this model is its custom classification heads, which use a deeper MLP architecture (described in the next section) rather than a single linear layer, to improve performance on nuanced tasks.

## Model Architecture

The model uses a standard `distilbert-base-uncased` model as its core feature extractor. Two custom classification "heads" are placed on top of this base to perform the downstream tasks.

- **Base Model**: `distilbert-base-uncased`
- **Classifier Heads**: Each head is a Multi-Layer Perceptron (MLP) with the following structure, allowing for richer feature interpretation:
    1. A Linear layer projecting the transformer's output dimension (768) to an intermediate size (384).
    2. A GELU activation function.
    3. A Dropout layer with a rate of 0.3 for regularization.
    4. A final Linear layer projecting the intermediate size to the number of output labels for the specific task (intent or NER).
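The head structure above can be sketched in PyTorch as follows. This is an illustrative reconstruction from the description, not the model's actual module code; the function and variable names are hypothetical, and the label counts simply match the 9 intents and 7 BIO tags (3 entities × B/I, plus `O`) listed later in this card.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of one classification head as described above:
# Linear (768 -> 384) -> GELU -> Dropout(0.3) -> Linear (384 -> num_labels).
def build_head(hidden_size: int = 768, intermediate_size: int = 384,
               num_labels: int = 9, dropout: float = 0.3) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(hidden_size, intermediate_size),  # project 768 -> 384
        nn.GELU(),                                  # non-linearity
        nn.Dropout(dropout),                        # regularization
        nn.Linear(intermediate_size, num_labels),   # 384 -> label logits
    )

intent_head = build_head(num_labels=9)  # 9 intents
ner_head = build_head(num_labels=7)     # 7 BIO tags (incl. O)

# A batch of 2 pooled sentence vectors yields one logit per intent.
logits = intent_head(torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 9])
```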

## Intended Use

This model is intended to be the core NLU component of a conversational AI system for managing appointments. 

For instructions on how to use the model, see the [dedicated file](./how_to_use.md).

## Training Data

The model was trained on the **HASD (Hybrid Appointment Scheduling Dataset)**, a custom dataset built specifically for this task.

- **Source**: The dataset is a hybrid of real-world conversational examples from `clinc/clinc_oos` (for simple intents) and synthetically generated, template-based examples for complex scheduling intents.
- **Balancing**: To combat class imbalance, intents sourced from `clinc/clinc_oos` were **down-sampled** to a maximum of **150 examples** each.
- **Augmentation**: To increase data diversity for complex intents (`schedule`, `reschedule`, etc.), **Contextual Word Replacement** was used. A `distilbert-base-uncased` model augmented the templates by replacing non-placeholder words with contextually relevant synonyms.

The dataset is available [here](https://huggingface.co/datasets/andreaceto/hasd).

### Intents

The model is trained to recognize the following intents:
`schedule`, `reschedule`, `cancel`, `query_avail`, `greeting`, `positive_reply`, `negative_reply`, `bye`, `oos` (out-of-scope).

### Entities

The model is trained to recognize the following custom named entities:
`practitioner_name`, `appointment_type`, `appointment_id`.
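The NER head predicts these entities as token-level BIO tags. As a hedged illustration (the tokenization and exact labels below are hypothetical, and the model operates on its own wordpiece tokens), an utterance can be decoded into entity spans like this:

```python
# Illustrative BIO tagging for a sample utterance; the span-grouping
# helper is a generic sketch, not the model's actual decoding code.
tokens = ["book", "a", "checkup", "with", "dr", "smith"]
tags = ["O", "O", "B-appointment_type", "O",
        "B-practitioner_name", "I-practitioner_name"]

def extract_entities(tokens, tags):
    """Group contiguous B-/I- tags into entity spans."""
    entities, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = {"type": tag[2:], "text": [tok]}
        elif tag.startswith("I-") and current and tag[2:] == current["type"]:
            current["text"].append(tok)
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [{"type": e["type"], "text": " ".join(e["text"])} for e in entities]

print(extract_entities(tokens, tags))
# [{'type': 'appointment_type', 'text': 'checkup'},
#  {'type': 'practitioner_name', 'text': 'dr smith'}]
```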

## Training Procedure

The model was trained using a two-stage fine-tuning strategy to ensure stability and performance.

### Stage 1: Training the Classifier Heads

- The `distilbert-base-uncased` base model was entirely **frozen**.
- Only the randomly initialized MLP heads for intent and NER classification were trained.
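The freezing step can be sketched as below. A tiny stand-in module is used so the pattern is runnable without downloading weights; in the real training code, `model.distilbert` would be the loaded DistilBERT backbone (the actual attribute name depends on how the multi-task model wraps it), and the heads would be the MLPs described above.

```python
import torch.nn as nn

# Tiny stand-in for the multi-task model (illustrative only).
class MultiTaskModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.distilbert = nn.Linear(768, 768)  # stands in for the backbone
        self.intent_head = nn.Linear(768, 9)
        self.ner_head = nn.Linear(768, 7)

model = MultiTaskModel()

# Stage 1: freeze every backbone parameter so only the heads train.
for param in model.distilbert.parameters():
    param.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only intent_head / ner_head parameters remain
```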

**Setup**:

```python
from transformers import (
    DataCollatorForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Define a data collator to handle padding for token classification
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
# Define Training Arguments
training_args = TrainingArguments(
    output_dir="path/to/output_dir",
    overwrite_output_dir=True,
    num_train_epochs=200,               # Training epochs
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1e-4,                 # Learning Rate
    weight_decay=1e-5,                  # AdamW weight decay
    logging_dir="path/to/logging_dir",
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",     # Focus on validation loss as the key metric
    # --- Hub Arguments ---
    push_to_hub=True,
    hub_model_id=hub_model_id,
    hub_strategy="end",
    hub_token=hf_token,
    report_to="tensorboard"             # Tensorboard to monitor training
)
# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_datasets["train"],
    eval_dataset=processed_datasets["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,  # Custom function (check how_to_use.md)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)]
)
```
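The `compute_metrics` function passed to the `Trainer` is defined in `how_to_use.md`. As a hedged illustration of its general shape (not the actual function, which computes richer per-task metrics), a minimal token-accuracy version that skips the `-100` positions marking padding and sub-word labels could look like:

```python
import numpy as np

# Minimal illustrative compute_metrics: receives an EvalPrediction-style
# (logits, labels) pair from the Trainer and returns a metrics dict.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    mask = labels != -100  # -100 marks padding / sub-word tokens
    accuracy = (preds[mask] == labels[mask]).mean()
    return {"token_accuracy": float(accuracy)}

# Toy check: 3 of the 4 valid positions are predicted correctly.
logits = np.array([[[0.1, 0.9], [0.8, 0.2], [0.3, 0.7],
                    [0.6, 0.4], [0.5, 0.5]]])
labels = np.array([[1, 0, 0, 0, -100]])
print(compute_metrics((logits, labels)))  # {'token_accuracy': 0.75}
```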

### Stage 2: Fine-Tuning

- The DistilBERT backbone was entirely **unfrozen**.
- A very low learning rate lets the whole network adapt to the new data while preserving the backbone's general-purpose language knowledge.

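The unfreezing step mirrors Stage 1's freeze; again a stand-in module keeps the pattern runnable (the real code would operate on the trained model carried over from Stage 1):

```python
import torch.nn as nn

# Stand-in backbone, frozen as at the end of Stage 1 (illustrative only).
backbone = nn.Linear(768, 768)
for param in backbone.parameters():
    param.requires_grad = False

# Stage 2: unfreeze everything so the entire network trains at a low LR.
for param in backbone.parameters():
    param.requires_grad = True

print(all(p.requires_grad for p in backbone.parameters()))  # True
```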
**Setup**:

```python
# Define Training Arguments
training_args = TrainingArguments(
    output_dir="path/to/output_dir",
    overwrite_output_dir=True,
    num_train_epochs=50,               # Fine-tuning epochs
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1e-6,                 # Learning Rate
    weight_decay=1e-3,                  # AdamW weight decay
    logging_dir="path/to/logging_dir",
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",     # Focus on validation loss as the key metric
    # --- Hub Arguments ---
    push_to_hub=True,
    hub_model_id=hub_model_id,
    hub_strategy="end",
    hub_token=hf_token,
    report_to="tensorboard"             # Tensorboard to monitor training
)
# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_datasets["train"],
    eval_dataset=processed_datasets["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,  # Custom function (check how_to_use.md)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)]
)
```

## Evaluation

The model was evaluated on a held-out test set, and its performance was measured for both tasks.

### Intent Classification Performance

| Intent        | Precision | Recall | F1-Score |  Support |
| ---           | ---       | ---    | ---      | ---      |
|           bye | 0.9500    | 0.8261 | 0.8837   | 23       |
|        cancel | 0.9211    | 0.8434 | 0.8805   | 83       |
|      greeting | 0.9545    | 0.9545 | 0.9545   | 22       |
|negative_reply | 0.9091    | 0.9091 | 0.9091   | 22       |
|           oos | 1.0000    | 0.8696 | 0.9302   | 23       |
|positive_reply | 0.7407    | 0.9091 | 0.8163   | 22       |
|   query_avail | 0.9620    | 0.9383 | 0.9500   | 81       |
|    reschedule | 0.8506    | 0.8916 | 0.8706   | 83       |
|      schedule | 0.8488    | 0.9125 | 0.8795   | 80       |
| ---           | ---       | ---    | ---      | ----     |
| **Accuracy**     |               |            | **0.8952**   | 439      |
| **Macro Avg**    |    **0.9041** | **0.8949** | **0.8972**   | 439      |
| **Weighted Avg** |    **0.8998** | **0.8952** | **0.8960**   | 439      |

### NER (Token Classification) Performance

| Entity              | Precision | Recall | F1-Score |  Support |
| ---                 | ---       | ---    | ---      | ---      |
| B-appointment_id    | 1.0000    | 1.0000 | 1.0000   | 61       |
| B-appointment_type  | 0.8646    | 0.7477 | 0.8019   | 111      |
| B-practitioner_name | 0.9161    | 0.9467 | 0.9311   | 150      |
| I-appointment_id    | 0.9667    | 0.9667 | 0.9667   | 210      |
| I-appointment_type  | 0.8182    | 0.7368 | 0.7754   | 171      |
| I-practitioner_name | 0.9540    | 0.8941 | 0.9231   | 255      |
| O                   | 0.9782    | 0.9892 | 0.9837   | 3813     |
| ---                 | ---       | ---    | ---      | ----     |
| **Accuracy**        |           |        | **0.9673**   | 4771     |
| **Macro Avg**       | **0.9283**    | **0.8973** | **0.9117**   | 4771     |
| **Weighted Avg**    | **0.9664**    | **0.9673** | **0.9666**   | 4771     |

The model achieves strong results on both tasks for this dataset: roughly 97% token-level accuracy on NER and about 90% accuracy on intent classification.

## Limitations and Bias

- The model's performance is highly dependent on the quality and scope of the **HASD dataset**. It may not generalize well to phrasing or appointment types significantly different from what it was trained on.
- The dataset was primarily generated from templates, which may not capture the full diversity of real human language.
- The model inherits any biases present in the `distilbert-base-uncased` model and the `clinc/clinc_oos` dataset.