andreaceto/schedulebot-nlu-engine

@@ -7,104 +7,192 @@ tags:
 model-index:
 - name: schedulebot-nlu-engine
   results: []
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
-# schedulebot-nlu-engine
-This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on an unknown dataset.
-It achieves the following results on the evaluation set:
-- Loss: 0.3390
-- Intent Accuracy: 0.9178
-- Intent F1: 0.9178
-- Ner F1: 0.9240
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 1e-06
-- train_batch_size: 32
-- eval_batch_size: 32
-- seed: 42
-- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
-- lr_scheduler_type: linear
-- num_epochs: 50
-### Training results
-| Training Loss | Epoch | Step | Validation Loss | Intent Accuracy | Intent F1 | Ner F1 |
-|:-------------:|:-----:|:----:|:---------------:|:---------------:|:---------:|:------:|
-| No log        | 1.0   | 64   | 0.7274          | 0.7785          | 0.7785    | 0.9136 |
-| No log        | 2.0   | 128  | 0.6946          | 0.7991          | 0.8005    | 0.9162 |
-| No log        | 3.0   | 192  | 0.6461          | 0.8196          | 0.8178    | 0.9158 |
-| No log        | 4.0   | 256  | 0.6226          | 0.8265          | 0.8261    | 0.9152 |
-| No log        | 5.0   | 320  | 0.5986          | 0.8516          | 0.8518    | 0.9141 |
-| No log        | 6.0   | 384  | 0.5705          | 0.8356          | 0.8359    | 0.9153 |
-| No log        | 7.0   | 448  | 0.5506          | 0.8584          | 0.8568    | 0.9153 |
-| 0.901         | 8.0   | 512  | 0.5459          | 0.8379          | 0.8378    | 0.9147 |
-| 0.901         | 9.0   | 576  | 0.5220          | 0.8539          | 0.8546    | 0.9158 |
-| 0.901         | 10.0  | 640  | 0.5129          | 0.8676          | 0.8667    | 0.9157 |
-| 0.901         | 11.0  | 704  | 0.4974          | 0.8653          | 0.8648    | 0.9146 |
-| 0.901         | 12.0  | 768  | 0.4870          | 0.8744          | 0.8739    | 0.9180 |
-| 0.901         | 13.0  | 832  | 0.4892          | 0.8676          | 0.8682    | 0.9180 |
-| 0.901         | 14.0  | 896  | 0.4652          | 0.8767          | 0.8770    | 0.9174 |
-| 0.901         | 15.0  | 960  | 0.4523          | 0.8790          | 0.8789    | 0.9174 |
-| 0.6791        | 16.0  | 1024 | 0.4412          | 0.8881          | 0.8884    | 0.9197 |
-| 0.6791        | 17.0  | 1088 | 0.4441          | 0.8790          | 0.8785    | 0.9208 |
-| 0.6791        | 18.0  | 1152 | 0.4231          | 0.8950          | 0.8948    | 0.9190 |
-| 0.6791        | 19.0  | 1216 | 0.4202          | 0.8858          | 0.8855    | 0.9202 |
-| 0.6791        | 20.0  | 1280 | 0.4099          | 0.8950          | 0.8951    | 0.9208 |
-| 0.6791        | 21.0  | 1344 | 0.4054          | 0.8973          | 0.8970    | 0.9219 |
-| 0.6791        | 22.0  | 1408 | 0.4018          | 0.8950          | 0.8954    | 0.9212 |
-| 0.6791        | 23.0  | 1472 | 0.3953          | 0.8973          | 0.8974    | 0.9201 |
-| 0.5609        | 24.0  | 1536 | 0.3883          | 0.9041          | 0.9037    | 0.9220 |
-| 0.5609        | 25.0  | 1600 | 0.3874          | 0.8995          | 0.8994    | 0.9224 |
-| 0.5609        | 26.0  | 1664 | 0.3827          | 0.9041          | 0.9039    | 0.9224 |
-| 0.5609        | 27.0  | 1728 | 0.3796          | 0.9041          | 0.9045    | 0.9230 |
-| 0.5609        | 28.0  | 1792 | 0.3793          | 0.9018          | 0.9018    | 0.9230 |
-| 0.5609        | 29.0  | 1856 | 0.3703          | 0.9110          | 0.9111    | 0.9219 |
-| 0.5609        | 30.0  | 1920 | 0.3732          | 0.9018          | 0.9018    | 0.9207 |
-| 0.5609        | 31.0  | 1984 | 0.3639          | 0.9132          | 0.9134    | 0.9219 |
-| 0.4928        | 32.0  | 2048 | 0.3623          | 0.9064          | 0.9066    | 0.9225 |
-| 0.4928        | 33.0  | 2112 | 0.3599          | 0.9132          | 0.9133    | 0.9230 |
-| 0.4928        | 34.0  | 2176 | 0.3546          | 0.9110          | 0.9110    | 0.9219 |
-| 0.4928        | 35.0  | 2240 | 0.3515          | 0.9178          | 0.9178    | 0.9230 |
-| 0.4928        | 36.0  | 2304 | 0.3504          | 0.9155          | 0.9156    | 0.9235 |
-| 0.4928        | 37.0  | 2368 | 0.3501          | 0.9178          | 0.9179    | 0.9235 |
-| 0.4928        | 38.0  | 2432 | 0.3495          | 0.9132          | 0.9132    | 0.9230 |
-| 0.4928        | 39.0  | 2496 | 0.3452          | 0.9132          | 0.9132    | 0.9235 |
-| 0.447         | 40.0  | 2560 | 0.3430          | 0.9224          | 0.9224    | 0.9230 |
-| 0.447         | 41.0  | 2624 | 0.3441          | 0.9132          | 0.9134    | 0.9240 |
-| 0.447         | 42.0  | 2688 | 0.3408          | 0.9178          | 0.9178    | 0.9235 |
-| 0.447         | 43.0  | 2752 | 0.3427          | 0.9155          | 0.9156    | 0.9236 |
-| 0.447         | 44.0  | 2816 | 0.3420          | 0.9155          | 0.9157    | 0.9235 |
-| 0.447         | 45.0  | 2880 | 0.3407          | 0.9201          | 0.9201    | 0.9235 |
-| 0.447         | 46.0  | 2944 | 0.3396          | 0.9178          | 0.9178    | 0.9235 |
-| 0.4209        | 47.0  | 3008 | 0.3401          | 0.9178          | 0.9178    | 0.9235 |
-| 0.4209        | 48.0  | 3072 | 0.3389          | 0.9178          | 0.9178    | 0.9240 |
-| 0.4209        | 49.0  | 3136 | 0.3392          | 0.9178          | 0.9178    | 0.9240 |
-| 0.4209        | 50.0  | 3200 | 0.3390          | 0.9178          | 0.9178    | 0.9240 |
-### Framework versions
-- Transformers 4.53.2
-- Pytorch 2.6.0+cu124
-- Datasets 4.0.0
-- Tokenizers 0.21.2

 model-index:
 - name: schedulebot-nlu-engine
   results: []
+datasets:
+- andreaceto/hasd
+language:
+- en
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
+# Schedulebot-nlu-engine
+## Model Description
+This model is a multi-task Natural Language Understanding (NLU) engine designed specifically for an appointment scheduling chatbot. It is fine-tuned from a **`distilbert-base-uncased`** backbone and is capable of performing two tasks simultaneously:
+- **Intent Classification**: Identifying the user's primary goal (e.g., `schedule`, `cancel`).
+- **Named Entity Recognition (NER)**: Extracting custom, domain-specific entities (e.g., `appointment_type`).
+Its distinguishing feature is a pair of custom MLP classification heads (see Model Architecture), intended to improve performance on nuanced inputs.
+## Model Architecture
+The model uses a standard `distilbert-base-uncased` model as its core feature extractor. Two custom classification "heads" are placed on top of this base to perform the downstream tasks.
+- **Base Model**: `distilbert-base-uncased`
+- **Classifier Heads**: Each head is a Multi-Layer Perceptron (MLP) with the following structure, intended to allow for more complex feature interpretation:
+    1. A Linear layer projecting the transformer's output dimension (768) to an intermediate size (384).
+    2. A GELU activation function.
+    3. A Dropout layer with a rate of 0.3 for regularization.
+    4. A final Linear layer projecting the intermediate size to the number of output labels for the specific task (intent or NER).
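The four-step head described above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's actual code: the class name `MLPHead`, the use of the first token's hidden state for intent classification, and the label count of 4 for NER (taken from the tags shown in the evaluation table) are all assumptions.

```python
import torch
from torch import nn


class MLPHead(nn.Module):
    """Hypothetical head: Linear(768 -> 384) -> GELU -> Dropout(0.3) -> Linear(384 -> num_labels)."""

    def __init__(self, num_labels: int, hidden_size: int = 768,
                 intermediate_size: int = 384, dropout: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),  # step 1: project 768 -> 384
            nn.GELU(),                                  # step 2: activation
            nn.Dropout(dropout),                        # step 3: regularization
            nn.Linear(intermediate_size, num_labels),   # step 4: project to task labels
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)


# 9 intents as listed in this card; 4 NER tags is an assumption based on the evaluation table.
intent_head = MLPHead(num_labels=9)
ner_head = MLPHead(num_labels=4)

# Random stand-in for the DistilBERT output, shaped (batch, seq_len, hidden).
hidden_states = torch.randn(2, 16, 768)
intent_logits = intent_head(hidden_states[:, 0])  # first-token pooling is an assumption
ner_logits = ner_head(hidden_states)              # one prediction per token
```

With this layout the intent head yields one logit vector per sequence, while the NER head yields one per token.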
+## Intended Use
+This model is intended to be the core NLU component of a conversational AI system for managing appointments.
+For usage instructions, see the [dedicated file](./how_to_use.md).
+## Training Data
+The model was trained on the **HASD (Hybrid Appointment Scheduling Dataset)**, a custom dataset built specifically for this task.
+- **Source**: The dataset is a hybrid of real-world conversational examples from `clinc/clinc_oos` (for simple intents) and synthetically generated, template-based examples for complex scheduling intents.
+- **Balancing**: To combat class imbalance, intents sourced from `clinc/clinc_oos` were **down-sampled** to a maximum of **150 examples** each.
+- **Augmentation**: To increase data diversity for complex intents (`schedule`, `reschedule`, etc.), **Contextual Word Replacement** was used. A `distilbert-base-uncased` model augmented the templates by replacing non-placeholder words with contextually relevant synonyms.
+The dataset is available [here](https://huggingface.co/datasets/andreaceto/hasd).
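The down-sampling step described above could look roughly like the sketch below. It is an illustration only: the function name `downsample_per_intent`, the `intent` field name, and the fixed seed are assumptions, and the actual dataset-building script may differ.

```python
import random
from collections import defaultdict


def downsample_per_intent(examples, max_per_intent=150, seed=42):
    """Keep at most `max_per_intent` randomly chosen examples for each intent."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the subset reproducible
    by_intent = defaultdict(list)
    for example in examples:
        by_intent[example["intent"]].append(example)

    balanced = []
    for items in by_intent.values():
        if len(items) > max_per_intent:
            items = rng.sample(items, max_per_intent)  # sample without replacement
        balanced.extend(items)
    return balanced


# Toy data: 400 "greeting" examples and 90 "bye" examples.
toy = [{"text": f"hello {i}", "intent": "greeting"} for i in range(400)]
toy += [{"text": f"bye {i}", "intent": "bye"} for i in range(90)]
balanced = downsample_per_intent(toy)
```

Intents that already have 150 examples or fewer are left untouched, so only the over-represented classes shrink.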
+### Intents
+The model is trained to recognize the following intents:
+`schedule`, `reschedule`, `cancel`, `query_avail`, `greeting`, `positive_reply`, `negative_reply`, `bye`, `oos` (out-of-scope).
+### Entities
+The model is trained to recognize the following custom named entities:
+`practitioner_name`, `appointment_type`, `appointment_id`.
+## Training Procedure
+The model was trained using a two-stage fine-tuning strategy to ensure stability and performance.
+### Stage 1: Training the Classifier Heads
+- The `distilbert-base-uncased` base model was entirely **frozen**.
+- Only the randomly initialized MLP heads for intent and NER classification were trained.
+**Setup**:
+```python
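+from transformers import (
+    DataCollatorForTokenClassification,
+    EarlyStoppingCallback,
+    Trainer,
+    TrainingArguments,
+)
+
+# `model`, `tokenizer`, `processed_datasets`, `compute_metrics`, `hub_model_id`
+# and `hf_token` are assumed to be defined earlier in the training script.
+# Define a data collator to handle padding for token classification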
+data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
+# Define Training Arguments
+training_args = TrainingArguments(
+    output_dir="path/to/output_dir",
+    overwrite_output_dir=True,
+    num_train_epochs=200,               # Training epochs
+    per_device_train_batch_size=32,
+    per_device_eval_batch_size=32,
+    learning_rate=1e-4,                 # Learning Rate
+    weight_decay=1e-5,                  # AdamW weight decay
+    logging_dir="path/to/logging_dir",
+    logging_strategy="epoch",
+    eval_strategy="epoch",
+    save_strategy="epoch",
+    load_best_model_at_end=True,
+    metric_for_best_model="eval_loss",     # Focus on validation loss as the key metric
+    # --- Hub Arguments ---
+    push_to_hub=True,
+    hub_model_id=hub_model_id,
+    hub_strategy="end",
+    hub_token=hf_token,
+    report_to="tensorboard"             # Tensorboard to monitor training
+)
+# Create the Trainer
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=processed_datasets["train"],
+    eval_dataset=processed_datasets["validation"],
+    processing_class=tokenizer,
+    data_collator=data_collator,
+    compute_metrics=compute_metrics,  # Custom function (check how_to_use.md)
+    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)]
+)
+```
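The freezing used in Stage 1 (and the unfreezing used in Stage 2) comes down to toggling `requires_grad` on the backbone parameters. The sketch below uses a tiny stand-in model so that it runs without downloading weights; the class `TinyMultiTaskModel`, its attribute names, and the helper `set_backbone_trainable` are hypothetical and not taken from the repository.

```python
from torch import nn


class TinyMultiTaskModel(nn.Module):
    """Stand-in for the real model; attribute names are hypothetical."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(768, 768)   # placeholder for the DistilBERT backbone
        self.intent_head = nn.Linear(768, 9)  # placeholder for the intent MLP head
        self.ner_head = nn.Linear(768, 4)     # placeholder for the NER MLP head


def set_backbone_trainable(model: nn.Module, trainable: bool) -> None:
    """Freeze (False) or unfreeze (True) every backbone parameter."""
    for param in model.backbone.parameters():
        param.requires_grad = trainable


toy_model = TinyMultiTaskModel()

# Stage 1: backbone frozen, only the heads receive gradient updates.
set_backbone_trainable(toy_model, False)
stage1_trainable = [n for n, p in toy_model.named_parameters() if p.requires_grad]

# Stage 2: backbone unfrozen, everything is trainable again.
set_backbone_trainable(toy_model, True)
stage2_trainable = [n for n, p in toy_model.named_parameters() if p.requires_grad]
```

Parameters with `requires_grad=False` receive no gradients, so the optimizer leaves them unchanged during Stage 1.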
+### Stage 2: Selective Fine-Tuning
+- The DistilBERT backbone was entirely **unfrozen**.
+- A very low learning rate (1e-6) was used so that the whole model could adapt further to the new data while preserving the general-purpose knowledge of the pre-trained backbone.
+**Setup**:
+```python
+# Reuses `model`, `tokenizer`, `data_collator` and `compute_metrics` from Stage 1
+# Define Training Arguments
+training_args = TrainingArguments(
+    output_dir="path/to/output_dir",
+    overwrite_output_dir=True,
+    num_train_epochs=50,               # Fine-tuning epochs
+    per_device_train_batch_size=32,
+    per_device_eval_batch_size=32,
+    learning_rate=1e-6,                 # Learning Rate
+    weight_decay=1e-3,                  # AdamW weight decay
+    logging_dir="path/to/logging_dir",
+    logging_strategy="epoch",
+    eval_strategy="epoch",
+    save_strategy="epoch",
+    load_best_model_at_end=True,
+    metric_for_best_model="eval_loss",     # Focus on validation loss as the key metric
+    # --- Hub Arguments ---
+    push_to_hub=True,
+    hub_model_id=hub_model_id,
+    hub_strategy="end",
+    hub_token=hf_token,
+    report_to="tensorboard"             # Tensorboard to monitor training
+)
+# Create the Trainer
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=processed_datasets["train"],
+    eval_dataset=processed_datasets["validation"],
+    processing_class=tokenizer,
+    data_collator=data_collator,
+    compute_metrics=compute_metrics,  # Custom function (check how_to_use.md)
+    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)]
+)
+```
+## Evaluation
+The model was evaluated on a held-out test set, and its performance was measured for both tasks.
+### Intent Classification Performance
+| Intent        | Precision | Recall | F1-Score |  Support |
+| ---           | ---       | ---    | ---      | ---      |
+|           bye | 0.8636    | 0.8261 | 0.8444   | 23       |
+|        cancel | 0.8902    | 0.8795 | 0.8848   | 83       |
+|      greeting | 0.8636    | 0.8636 | 0.8636   | 22       |
+|negative_reply | 0.9048    | 0.8636 | 0.8837   | 22       |
+|           oos | 0.9524    | 0.8696 | 0.9091   | 23       |
+|positive_reply | 0.7308    | 0.8636 | 0.7917   | 22       |
+|   query_avail | 0.9268    | 0.9383 | 0.9325   | 81       |
+|    reschedule | 0.8974    | 0.8434 | 0.8696   | 83       |
+|      schedule | 0.8824    | 0.9375 | 0.9091   | 80       |
+| ---           | ---       | ---    | ---      | ----     |
+| **Accuracy**     |               |            | **0.8884**   | 439      |
+| **Macro Avg**    |    **0.8791** | **0.8761** | **0.8765**   | 439      |
+| **Weighted Avg** |    **0.8902** | **0.8884** | **0.8885**   | 439      |
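As a sanity check on the summary rows, the macro and weighted F1 averages can be recomputed from the per-intent rows of the table above. The snippet is only a worked arithmetic example using the published (rounded) values; it is not the evaluation script used to produce them.

```python
# (F1-score, support) per intent, copied from the table above.
per_intent = {
    "bye": (0.8444, 23),
    "cancel": (0.8848, 83),
    "greeting": (0.8636, 22),
    "negative_reply": (0.8837, 22),
    "oos": (0.9091, 23),
    "positive_reply": (0.7917, 22),
    "query_avail": (0.9325, 81),
    "reschedule": (0.8696, 83),
    "schedule": (0.9091, 80),
}

total_support = sum(support for _, support in per_intent.values())

# Macro average: every intent counts equally, regardless of its size.
macro_f1 = sum(f1 for f1, _ in per_intent.values()) / len(per_intent)

# Weighted average: every intent counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in per_intent.values()) / total_support

print(total_support, round(macro_f1, 4), round(weighted_f1, 4))
```

The recomputed values agree with the table (support 439, macro F1 0.8765, weighted F1 0.8885). The macro average is lower because small classes such as `positive_reply` weigh as much as the large scheduling intents.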
+### NER (Token Classification) Performance
+| Entity              | Precision | Recall | F1-Score |  Support |
+| ---                 | ---       | ---    | ---      | ---      |
+| B-appointment_id    | 0.9813    | 0.9705 | 0.9759   | 271      |
+| B-appointment_type  | 0.8517    | 0.7943 | 0.8220   | 282      |
+| B-practitioner_name | 0.9540    | 0.9210 | 0.9372   | 405      |
+| O                   | 0.9782    | 0.9874 | 0.9828   | 3813     |
+| ---                 | ---       | ---    | ---      | ----     |
+| **Accuracy**        |           |        | 0.9694   | 4771     |
+| **Macro Avg**       | 0.9413    | 0.9183 | 0.9295   | 4771     |
+| **Weighted Avg**    | 0.9688    | 0.9694 | 0.9690   | 4771     |
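+On this test set, the model reaches 96.9% token-level accuracy on NER (macro F1 0.9295) and 88.8% accuracy on intent classification (macro F1 0.8765). The weakest categories are the `appointment_type` entity (F1 0.8220) and the `positive_reply` intent (F1 0.7917). Note that NER accuracy is dominated by the majority `O` class (3813 of 4771 tokens).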
+## Limitations and Bias
+- The model's performance is highly dependent on the quality and scope of the **HASD dataset**. It may not generalize well to phrasing or appointment types significantly different from what it was trained on.
+- The dataset was primarily generated from templates, which may not capture the full diversity of real human language.
+- The model inherits any biases present in the `distilbert-base-uncased` model and the `clinc/clinc_oos` dataset.