andreaceto/schedulebot-nlu-engine

@@ -7,104 +7,192 @@ tags:
 model-index:
 - name: schedulebot-nlu-engine
   results: []
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
-# schedulebot-nlu-engine
-This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on an unknown dataset.
-It achieves the following results on the evaluation set:
-- Loss: 0.3194
-- Intent Accuracy: 0.9224
-- Intent F1: 0.9216
-- Ner F1: 0.9320
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 1e-06
-- train_batch_size: 32
-- eval_batch_size: 32
-- seed: 42
-- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
-- lr_scheduler_type: linear
-- num_epochs: 50
-### Training results
-| Training Loss | Epoch | Step | Validation Loss | Intent Accuracy | Intent F1 | Ner F1 |
-|:-------------:|:-----:|:----:|:---------------:|:---------------:|:---------:|:------:|
-| No log        | 1.0   | 64   | 0.6763          | 0.8196          | 0.8178    | 0.9239 |
-| No log        | 2.0   | 128  | 0.6300          | 0.8470          | 0.8460    | 0.9227 |
-| No log        | 3.0   | 192  | 0.6008          | 0.8356          | 0.8347    | 0.9239 |
-| No log        | 4.0   | 256  | 0.5762          | 0.8539          | 0.8541    | 0.9240 |
-| No log        | 5.0   | 320  | 0.5599          | 0.8470          | 0.8468    | 0.9246 |
-| No log        | 6.0   | 384  | 0.5391          | 0.8493          | 0.8483    | 0.9263 |
-| No log        | 7.0   | 448  | 0.5222          | 0.8676          | 0.8670    | 0.9256 |
-| 0.8885        | 8.0   | 512  | 0.5053          | 0.8607          | 0.8603    | 0.9269 |
-| 0.8885        | 9.0   | 576  | 0.4875          | 0.8607          | 0.8597    | 0.9279 |
-| 0.8885        | 10.0  | 640  | 0.4723          | 0.8721          | 0.8708    | 0.9274 |
-| 0.8885        | 11.0  | 704  | 0.4599          | 0.8858          | 0.8854    | 0.9297 |
-| 0.8885        | 12.0  | 768  | 0.4536          | 0.8973          | 0.8966    | 0.9291 |
-| 0.8885        | 13.0  | 832  | 0.4432          | 0.8790          | 0.8783    | 0.9279 |
-| 0.8885        | 14.0  | 896  | 0.4334          | 0.8881          | 0.8873    | 0.9290 |
-| 0.8885        | 15.0  | 960  | 0.4268          | 0.8813          | 0.8806    | 0.9295 |
-| 0.6688        | 16.0  | 1024 | 0.4180          | 0.8881          | 0.8872    | 0.9295 |
-| 0.6688        | 17.0  | 1088 | 0.4119          | 0.8995          | 0.8991    | 0.9296 |
-| 0.6688        | 18.0  | 1152 | 0.4061          | 0.8973          | 0.8964    | 0.9290 |
-| 0.6688        | 19.0  | 1216 | 0.3949          | 0.8950          | 0.8940    | 0.9285 |
-| 0.6688        | 20.0  | 1280 | 0.3899          | 0.9018          | 0.9012    | 0.9296 |
-| 0.6688        | 21.0  | 1344 | 0.3855          | 0.9087          | 0.9083    | 0.9302 |
-| 0.6688        | 22.0  | 1408 | 0.3768          | 0.8950          | 0.8942    | 0.9296 |
-| 0.6688        | 23.0  | 1472 | 0.3756          | 0.8950          | 0.8948    | 0.9308 |
-| 0.5511        | 24.0  | 1536 | 0.3693          | 0.9110          | 0.9100    | 0.9308 |
-| 0.5511        | 25.0  | 1600 | 0.3658          | 0.9064          | 0.9057    | 0.9308 |
-| 0.5511        | 26.0  | 1664 | 0.3598          | 0.9110          | 0.9101    | 0.9320 |
-| 0.5511        | 27.0  | 1728 | 0.3647          | 0.9041          | 0.9035    | 0.9309 |
-| 0.5511        | 28.0  | 1792 | 0.3500          | 0.9201          | 0.9190    | 0.9314 |
-| 0.5511        | 29.0  | 1856 | 0.3466          | 0.9155          | 0.9145    | 0.9314 |
-| 0.5511        | 30.0  | 1920 | 0.3481          | 0.9155          | 0.9149    | 0.9314 |
-| 0.5511        | 31.0  | 1984 | 0.3431          | 0.9155          | 0.9150    | 0.9314 |
-| 0.4859        | 32.0  | 2048 | 0.3409          | 0.9110          | 0.9104    | 0.9314 |
-| 0.4859        | 33.0  | 2112 | 0.3404          | 0.9201          | 0.9195    | 0.9308 |
-| 0.4859        | 34.0  | 2176 | 0.3346          | 0.9132          | 0.9127    | 0.9309 |
-| 0.4859        | 35.0  | 2240 | 0.3324          | 0.9201          | 0.9192    | 0.9309 |
-| 0.4859        | 36.0  | 2304 | 0.3306          | 0.9178          | 0.9170    | 0.9309 |
-| 0.4859        | 37.0  | 2368 | 0.3309          | 0.9178          | 0.9173    | 0.9314 |
-| 0.4859        | 38.0  | 2432 | 0.3289          | 0.9178          | 0.9173    | 0.9314 |
-| 0.4859        | 39.0  | 2496 | 0.3272          | 0.9201          | 0.9195    | 0.9314 |
-| 0.4434        | 40.0  | 2560 | 0.3259          | 0.9178          | 0.9173    | 0.9314 |
-| 0.4434        | 41.0  | 2624 | 0.3240          | 0.9201          | 0.9193    | 0.9314 |
-| 0.4434        | 42.0  | 2688 | 0.3228          | 0.9224          | 0.9216    | 0.9326 |
-| 0.4434        | 43.0  | 2752 | 0.3243          | 0.9178          | 0.9173    | 0.9320 |
-| 0.4434        | 44.0  | 2816 | 0.3248          | 0.9201          | 0.9195    | 0.9314 |
-| 0.4434        | 45.0  | 2880 | 0.3218          | 0.9224          | 0.9216    | 0.9320 |
-| 0.4434        | 46.0  | 2944 | 0.3213          | 0.9224          | 0.9216    | 0.9320 |
-| 0.4221        | 47.0  | 3008 | 0.3205          | 0.9224          | 0.9216    | 0.9320 |
-| 0.4221        | 48.0  | 3072 | 0.3195          | 0.9224          | 0.9216    | 0.9320 |
-| 0.4221        | 49.0  | 3136 | 0.3196          | 0.9224          | 0.9216    | 0.9320 |
-| 0.4221        | 50.0  | 3200 | 0.3194          | 0.9224          | 0.9216    | 0.9320 |
-### Framework versions
-- Transformers 4.53.2
-- Pytorch 2.6.0+cu124
-- Datasets 4.0.0
-- Tokenizers 0.21.2

 model-index:
 - name: schedulebot-nlu-engine
   results: []
+datasets:
+- andreaceto/hasd
+language:
+- en
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
+# Schedulebot-nlu-engine
+## Model Description
+This model is a multi-task Natural Language Understanding (NLU) engine designed specifically for an appointment scheduling chatbot. It is fine-tuned from a **`distilbert-base-uncased`** backbone and is capable of performing two tasks simultaneously:
+- **Intent Classification**: Identifying the user's primary goal (e.g., `schedule`, `cancel`).
+- **Named Entity Recognition (NER)**: Extracting custom, domain-specific entities (e.g., `appointment_type`).
+Each task has its own custom classification head: a small multi-layer perceptron rather than a single linear layer, intended to improve performance on nuanced inputs.
+## Model Architecture
+The model uses a standard `distilbert-base-uncased` model as its core feature extractor. Two custom classification "heads" are placed on top of this base to perform the downstream tasks.
+- **Base Model**: `distilbert-base-uncased`
+- **Classifier Heads**: each head is a Multi-Layer Perceptron (MLP) with the following structure to allow for more complex feature interpretation:
+    1. A Linear layer projecting the transformer's output dimension (768) to an intermediate size (384).
+    2. A GELU activation function.
+    3. A Dropout layer with a rate of 0.3 for regularization.
+    4. A final Linear layer projecting the intermediate size to the number of output labels for the specific task (intent or NER).
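As a rough size check on the head structure listed above, the sketch below counts the trainable parameters of one such MLP head. The function name `mlp_head_param_count` is hypothetical, and the figure assumes both Linear layers carry bias terms (the PyTorch default), which this card does not state. GELU and Dropout add no parameters.

```python
def mlp_head_param_count(hidden_size: int, intermediate_size: int, num_labels: int) -> int:
    """Trainable parameters of Linear -> GELU -> Dropout -> Linear (biases assumed)."""
    # First projection: weight matrix plus one bias per intermediate unit.
    first_linear = hidden_size * intermediate_size + intermediate_size
    # Output projection: weight matrix plus one bias per label.
    second_linear = intermediate_size * num_labels + num_labels
    return first_linear + second_linear


# Intent head: 768 -> 384 -> 9 labels (the nine intents this card lists).
intent_head_params = mlp_head_param_count(768, 384, 9)
print(intent_head_params)  # 298761
```

Under these assumptions each head stays well under a million parameters, which is small next to the roughly 66 million parameters of the DistilBERT backbone.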
+## Intended Use
+This model is intended to be the core NLU component of a conversational AI system for managing appointments.
+For usage instructions, see the [dedicated file](./how_to_use.md).
+## Training Data
+The model was trained on the **HASD (Hybrid Appointment Scheduling Dataset)**, a custom dataset built specifically for this task.
+- **Source**: The dataset is a hybrid of real-world conversational examples from `clinc/clinc_oos` (for simple intents) and synthetically generated, template-based examples for complex scheduling intents.
+- **Balancing**: To combat class imbalance, intents sourced from `clinc/clinc_oos` were **down-sampled** to a maximum of **150 examples** each.
+- **Augmentation**: To increase data diversity for complex intents (`schedule`, `reschedule`, etc.), **Contextual Word Replacement** was used. A `distilbert-base-uncased` model augmented the templates by replacing non-placeholder words with contextually relevant synonyms.
+The dataset is available [here](https://huggingface.co/datasets/andreaceto/hasd).
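The balancing step described above (capping each `clinc/clinc_oos` intent at 150 examples) could look roughly like the following. This is a minimal sketch, not the dataset's actual build script: the function name, the `"text"`/`"intent"` field names, and the fixed seed are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def downsample_per_intent(examples, max_per_intent=150, seed=42):
    """Keep at most `max_per_intent` randomly chosen examples per intent."""
    by_intent = defaultdict(list)
    for example in examples:
        by_intent[example["intent"]].append(example)

    rng = random.Random(seed)  # fixed seed (assumed) keeps the subset reproducible
    balanced = []
    for group in by_intent.values():
        if len(group) > max_per_intent:
            group = rng.sample(group, max_per_intent)
        balanced.extend(group)
    return balanced


# Toy input (illustrative only): one oversized intent and one small one.
toy = [{"text": f"hello {i}", "intent": "greeting"} for i in range(400)]
toy += [{"text": f"bye {i}", "intent": "bye"} for i in range(90)]

counts = Counter(example["intent"] for example in downsample_per_intent(toy))
print(counts["greeting"], counts["bye"])  # 150 90
```

Intents that already have 150 or fewer examples pass through unchanged, so only the over-represented classes shrink.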
+### Intents
+The model is trained to recognize the following intents:
+`schedule`, `reschedule`, `cancel`, `query_avail`, `greeting`, `positive_reply`, `negative_reply`, `bye`, `oos` (out-of-scope).
+### Entities
+The model is trained to recognize the following custom named entities:
+`practitioner_name`, `appointment_type`, `appointment_id`.
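The intent and entity names above can be turned into the `id2label`/`label2id` mappings that Hugging Face model configs conventionally carry. This is a sketch under assumptions: the index order here is arbitrary, and the NER tag set (`O` plus one `B-` tag per entity) mirrors the tags reported in the Evaluation section. The mappings actually shipped with the model may differ.

```python
INTENTS = [
    "schedule", "reschedule", "cancel", "query_avail", "greeting",
    "positive_reply", "negative_reply", "bye", "oos",
]
ENTITIES = ["practitioner_name", "appointment_type", "appointment_id"]

# Tag set assumed from the evaluation report: "O" plus a "B-" tag per entity.
NER_TAGS = ["O"] + [f"B-{entity}" for entity in ENTITIES]

# Index order is an assumption; only the label names come from the card.
intent_id2label = dict(enumerate(INTENTS))
intent_label2id = {label: idx for idx, label in intent_id2label.items()}
ner_id2label = dict(enumerate(NER_TAGS))
ner_label2id = {label: idx for idx, label in ner_id2label.items()}

print(len(intent_id2label), len(ner_id2label))  # 9 4
```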
+## Training Procedure
+The model was trained using a two-stage fine-tuning strategy to ensure stability and performance.
+### Stage 1: Training the Classifier Heads
+- The `distilbert-base-uncased` base model was entirely **frozen**.
+- Only the randomly initialized MLP heads for intent and NER classification were trained.
+**Setup**:
+```python
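+from transformers import (
+    DataCollatorForTokenClassification,
+    EarlyStoppingCallback,
+    Trainer,
+    TrainingArguments,
+)
+# `model`, `tokenizer`, `processed_datasets`, `compute_metrics`, `hub_model_id`, and `hf_token` are defined elsewhere
+# Define a data collator to handle padding for token classification
+data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)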
+# Define Training Arguments
+training_args = TrainingArguments(
+    output_dir="path/to/output_dir",
+    overwrite_output_dir=True,
+    num_train_epochs=200,               # Training epochs
+    per_device_train_batch_size=32,
+    per_device_eval_batch_size=32,
+    learning_rate=1e-4,                 # Learning Rate
+    weight_decay=1e-5,                  # AdamW weight decay
+    logging_dir="path/to/logging_dir",
+    logging_strategy="epoch",
+    eval_strategy="epoch",
+    save_strategy="epoch",
+    load_best_model_at_end=True,
+    metric_for_best_model="eval_loss",     # Focus on validation loss as the key metric
+    # --- Hub Arguments ---
+    push_to_hub=True,
+    hub_model_id=hub_model_id,
+    hub_strategy="end",
+    hub_token=hf_token,
+    report_to="tensorboard"             # Tensorboard to monitor training
+)
+# Create the Trainer
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=processed_datasets["train"],
+    eval_dataset=processed_datasets["validation"],
+    processing_class=tokenizer,
+    data_collator=data_collator,
+    compute_metrics=compute_metrics,  # Custom function (check how_to_use.md)
+    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)]
+)
+```
+### Stage 2: Selective Fine-Tuning
+- The DistilBERT backbone was entirely **unfrozen**.
+- A very low learning rate (1e-6) lets the whole model adapt to the new data while preserving the general-purpose knowledge of the pretrained backbone.
+**Setup**:
+```python
+# Define Training Arguments (`model`, `tokenizer`, `data_collator`, etc. carry over from the Stage 1 setup)
+training_args = TrainingArguments(
+    output_dir="path/to/output_dir",
+    overwrite_output_dir=True,
+    num_train_epochs=50,               # Fine-tuning epochs
+    per_device_train_batch_size=32,
+    per_device_eval_batch_size=32,
+    learning_rate=1e-6,                 # Learning Rate
+    weight_decay=1e-3,                  # AdamW weight decay
+    logging_dir="path/to/logging_dir",
+    logging_strategy="epoch",
+    eval_strategy="epoch",
+    save_strategy="epoch",
+    load_best_model_at_end=True,
+    metric_for_best_model="eval_loss",     # Select the best checkpoint by validation loss
+    # --- Hub Arguments ---
+    push_to_hub=True,
+    hub_model_id=hub_model_id,
+    hub_strategy="end",
+    hub_token=hf_token,
+    report_to="tensorboard"             # Tensorboard to monitor training
+)
+# Create the Trainer
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=processed_datasets["train"],
+    eval_dataset=processed_datasets["validation"],
+    processing_class=tokenizer,
+    data_collator=data_collator,
+    compute_metrics=compute_metrics,  # Custom function (check how_to_use.md)
+    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)]
+)
+```
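Both stages attach an `EarlyStoppingCallback` (patience 10 in Stage 1, 5 in Stage 2) and select checkpoints by `eval_loss`. The following is a simplified, library-free illustration of how a patience counter behaves. It is not the callback's actual implementation, the function name is hypothetical, and the loss values are made up for the example.

```python
def epoch_of_early_stop(eval_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best_loss = float("inf")
    evals_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best_loss:
            # A strictly lower validation loss resets the counter.
            best_loss = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return epoch
    return None


# Illustrative validation losses, one per epoch (not taken from the real runs).
toy_losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.45, 0.48, 0.46]

print(epoch_of_early_stop(toy_losses, patience=5))   # 8
print(epoch_of_early_stop(toy_losses, patience=10))  # None
```

In the toy run the best loss is reached at epoch 3, so a patience of 5 stops at epoch 8, while a patience of 10 lets training continue. With `load_best_model_at_end=True`, the checkpoint restored afterwards is the one with the best monitored metric, not the last one.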
+## Evaluation
+The model was evaluated on a held-out test set, and its performance was measured for both tasks.
+### Intent Classification Performance
+| Intent        | Precision | Recall | F1-Score |  Support |
+| ---           | ---       | ---    | ---      | ---      |
+|           bye | 0.9048    | 0.8261 | 0.8636   | 23       |
+|        cancel | 0.9103    | 0.8554 | 0.8820   | 83       |
+|      greeting | 1.0000    | 0.8636 | 0.9268   | 22       |
+|negative_reply | 0.8750    | 0.9545 | 0.9130   | 22       |
+|           oos | 1.0000    | 0.8261 | 0.9048   | 23       |
+|positive_reply | 0.7692    | 0.9091 | 0.8333   | 22       |
+|   query_avail | 0.9259    | 0.9259 | 0.9259   | 81       |
+|    reschedule | 0.8571    | 0.8675 | 0.8623   | 83       |
+|      schedule | 0.8506    | 0.9250 | 0.8862   | 80       |
+| ---           | ---       | ---    | ---      | ----     |
+| **Accuracy**     |               |            | **0.8884**   | 439      |
+| **Macro Avg**    |    **0.8992** | **0.8837** | **0.8887**   | 439      |
+| **Weighted Avg** |    **0.8923** | **0.8884** | **0.8887**   | 439      |
+### NER (Token Classification) Performance
+| Entity              | Precision | Recall | F1-Score |  Support |
+| ---                 | ---       | ---    | ---      | ---      |
+| B-appointment_id    | 0.9925    | 0.9705 | 0.9813   | 271      |
+| B-appointment_type  | 0.8760    | 0.7766 | 0.8233   | 282      |
+| B-practitioner_name | 0.9540    | 0.9210 | 0.9372   | 405      |
+| O                   | 0.9775    | 0.9908 | 0.9841   | 3813     |
+| ---                 | ---       | ---    | ---      | ----     |
+| **Accuracy**        |           |        | 0.9711   | 4771     |
+| **Macro Avg**       | 0.9500    | 0.9147 | 0.9315   | 4771     |
+| **Weighted Avg**    | 0.9703    | 0.9711 | 0.9705   | 4771     |
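+On this test set the model reaches 0.9711 token-level accuracy (macro F1 0.9315) on the NER task and 0.8884 accuracy (macro F1 0.8887) on intent classification. NER accuracy is dominated by the `O` class (3813 of 4771 tokens); `B-appointment_type` is the weakest entity, with an F1 of 0.8233.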
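The summary rows in the two tables above follow from the per-class rows: the macro average is the plain mean of the per-class F1 scores, and the weighted average weights each class by its support. The short check below recomputes both from the reported (rounded) values. The helper name is hypothetical, and small differences in the last digit would be expected in general because the inputs are already rounded.

```python
def macro_and_weighted_f1(rows):
    """rows: (f1, support) pairs, one per class. Returns (macro, weighted, total support)."""
    total_support = sum(support for _, support in rows)
    macro = sum(f1 for f1, _ in rows) / len(rows)
    weighted = sum(f1 * support for f1, support in rows) / total_support
    return macro, weighted, total_support


# (F1-Score, Support) copied from the intent table, in row order.
intent_rows = [
    (0.8636, 23), (0.8820, 83), (0.9268, 22), (0.9130, 22), (0.9048, 23),
    (0.8333, 22), (0.9259, 81), (0.8623, 83), (0.8862, 80),
]
# (F1-Score, Support) copied from the NER table, in row order.
ner_rows = [(0.9813, 271), (0.8233, 282), (0.9372, 405), (0.9841, 3813)]

intent_macro, intent_weighted, intent_total = macro_and_weighted_f1(intent_rows)
ner_macro, ner_weighted, ner_total = macro_and_weighted_f1(ner_rows)

print(f"{intent_macro:.4f} {intent_weighted:.4f} {intent_total}")  # 0.8887 0.8887 439
print(f"{ner_macro:.4f} {ner_weighted:.4f} {ner_total}")           # 0.9315 0.9705 4771
```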
+## Limitations and Bias
+- The model's performance is highly dependent on the quality and scope of the **HASD dataset**. It may not generalize well to phrasing or appointment types significantly different from what it was trained on.
+- The dataset was primarily generated from templates, which may not capture the full diversity of real human language.
+- The model inherits any biases present in the `distilbert-base-uncased` model and the `clinc/clinc_oos` dataset.