model-index:
- name: schedulebot-nlu-engine
  results: []
datasets:
- andreaceto/hasd
language:
- en
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# Schedulebot-nlu-engine

## Model Description

This model is a multi-task Natural Language Understanding (NLU) engine designed specifically for an appointment scheduling chatbot. It is fine-tuned from a **`distilbert-base-uncased`** backbone and is capable of performing two tasks simultaneously:

- **Intent Classification**: Identifying the user's primary goal (e.g., `schedule`, `cancel`).
- **Named Entity Recognition (NER)**: Extracting custom, domain-specific entities (e.g., `appointment_type`).

This model stands out due to its custom classification heads, which use a more complex architecture to improve performance on nuanced tasks.

## Model Architecture

The model uses a standard `distilbert-base-uncased` model as its core feature extractor. Two custom classification "heads" are placed on top of this base to perform the downstream tasks.

- **Base Model**: `distilbert-base-uncased`
- **Classifier Heads**: Each head is a Multi-Layer Perceptron (MLP) with the following structure, allowing for more complex feature interpretation (see the sketch below):
  1. A Linear layer projecting the transformer's output dimension (768) to an intermediate size (384).
  2. A GELU activation function.
  3. A Dropout layer with a rate of 0.3 for regularization.
  4. A final Linear layer projecting the intermediate size to the number of output labels for the specific task (intent or NER).
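
A minimal PyTorch sketch of this head-plus-backbone design (an illustration of the description above, not the released training code; the class name `MultiTaskNLUModel` and the first-token pooling for the intent head are assumptions):

```python
import torch.nn as nn
from transformers import AutoModel

def make_head(hidden_size: int, num_labels: int) -> nn.Sequential:
    """MLP head: Linear(768 -> 384) -> GELU -> Dropout(0.3) -> Linear(384 -> num_labels)."""
    return nn.Sequential(
        nn.Linear(hidden_size, hidden_size // 2),
        nn.GELU(),
        nn.Dropout(0.3),
        nn.Linear(hidden_size // 2, num_labels),
    )

class MultiTaskNLUModel(nn.Module):
    """Hypothetical wrapper: DistilBERT backbone + one MLP head per task."""

    def __init__(self, num_intents: int, num_ner_labels: int):
        super().__init__()
        self.backbone = AutoModel.from_pretrained("distilbert-base-uncased")
        hidden = self.backbone.config.dim  # 768 for DistilBERT
        self.intent_head = make_head(hidden, num_intents)
        self.ner_head = make_head(hidden, num_ner_labels)

    def forward(self, input_ids, attention_mask=None):
        hidden_states = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                          # (batch, seq_len, 768)
        # Pooling the first-token representation for the intent is an assumption.
        intent_logits = self.intent_head(hidden_states[:, 0])
        ner_logits = self.ner_head(hidden_states)    # per-token logits
        return intent_logits, ner_logits
```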

## Intended Use

This model is intended to be the core NLU component of a conversational AI system for managing appointments.

For instructions on how to use the model, check the [dedicated file](./how_to_use.md); a rough sketch of the inference flow is shown below.
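
Since the classification heads are custom, loading the trained weights requires the model class from `how_to_use.md` rather than a stock `AutoModelFor…` class. Purely as an illustration, reusing the hypothetical `MultiTaskNLUModel` from the architecture sketch (randomly initialized here):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# 9 intents and 7 BIO tags per the label lists below; weights are random here.
# In practice, load the fine-tuned weights as described in how_to_use.md.
model = MultiTaskNLUModel(num_intents=9, num_ner_labels=7).eval()

inputs = tokenizer("Can I book a check-up with Dr. Smith?", return_tensors="pt")
with torch.no_grad():
    intent_logits, ner_logits = model(**inputs)

intent_id = intent_logits.argmax(dim=-1).item()  # index into the intent label map
tag_ids = ner_logits.argmax(dim=-1)[0].tolist()  # one BIO tag id per token
print(intent_id, tag_ids)
```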

## Training Data

The model was trained on the **HASD (Hybrid Appointment Scheduling Dataset)**, a custom dataset built specifically for this task.

- **Source**: The dataset is a hybrid of real-world conversational examples from `clinc/clinc_oos` (for simple intents) and synthetically generated, template-based examples for complex scheduling intents.
- **Balancing**: To combat class imbalance, intents sourced from `clinc/clinc_oos` were **down-sampled** to a maximum of **150 examples** each.
- **Augmentation**: To increase data diversity for complex intents (`schedule`, `reschedule`, etc.), **Contextual Word Replacement** was used. A `distilbert-base-uncased` model augmented the templates by replacing non-placeholder words with contextually relevant synonyms (see the sketch after this list).

The dataset is available [here](https://huggingface.co/datasets/andreaceto/hasd).
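
One way such contextual word replacement can be implemented (a sketch using the `nlpaug` library; the exact pipeline used to build HASD may differ, and the placeholder names are hypothetical):

```python
import nlpaug.augmenter.word as naw

# Contextual word replacement: a masked LM proposes in-context substitutes.
# `stopwords` shields the template placeholders from being replaced.
aug = naw.ContextualWordEmbsAug(
    model_path="distilbert-base-uncased",
    action="substitute",
    stopwords=["{practitioner_name}", "{appointment_type}"],  # hypothetical placeholders
)

template = "I need to book a {appointment_type} with {practitioner_name} next week"
print(aug.augment(template))  # returns a list of augmented strings in recent nlpaug versions
```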

### Intents

The model is trained to recognize the following intents:
`schedule`, `reschedule`, `cancel`, `query_avail`, `greeting`, `positive_reply`, `negative_reply`, `bye`, `oos` (out-of-scope).

### Entities

The model is trained to recognize the following custom named entities:
`practitioner_name`, `appointment_type`, `appointment_id`.
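
Together these give 9 intent labels and 7 BIO token labels (`O` plus a B-/I- pair per entity). An illustrative label mapping (the ordering is an assumption):

```python
intent_labels = [
    "schedule", "reschedule", "cancel", "query_avail", "greeting",
    "positive_reply", "negative_reply", "bye", "oos",
]
ner_labels = [
    "O",
    "B-practitioner_name", "I-practitioner_name",
    "B-appointment_type", "I-appointment_type",
    "B-appointment_id", "I-appointment_id",
]
id2intent = dict(enumerate(intent_labels))  # e.g. {0: "schedule", ...}
id2tag = dict(enumerate(ner_labels))
```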

## Training Procedure

The model was trained using a two-stage fine-tuning strategy to ensure stability and performance.

### Stage 1: Training the Classifier Heads

- The `distilbert-base-uncased` base model was entirely **frozen**.
- Only the randomly initialized MLP heads for intent and NER classification were trained (see the sketch below).
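
In code, freezing amounts to disabling gradients on the backbone parameters (a sketch, reusing the hypothetical model layout from the architecture section):

```python
# Stage 1: only the heads receive gradient updates; the transformer stays fixed.
for param in model.backbone.parameters():
    param.requires_grad = False
```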

**Setup**:

```python
from transformers import (
    DataCollatorForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Define a data collator to handle padding for token classification
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Define Training Arguments
training_args = TrainingArguments(
    output_dir="path/to/output_dir",
    overwrite_output_dir=True,
    num_train_epochs=200,  # Training epochs
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1e-4,  # Learning Rate
    weight_decay=1e-5,  # AdamW weight decay
    logging_dir="path/to/logging_dir",
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # Focus on validation loss as the key metric
    # --- Hub Arguments ---
    push_to_hub=True,
    hub_model_id=hub_model_id,
    hub_strategy="end",
    hub_token=hf_token,
    report_to="tensorboard",  # TensorBoard to monitor training
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_datasets["train"],
    eval_dataset=processed_datasets["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,  # Custom function (check how_to_use.md)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],
)
```
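
The `compute_metrics` function itself is documented in `how_to_use.md`. Purely as an illustration, a multi-task metric function might look like this (a sketch that assumes predictions and labels arrive as (intent, NER) pairs; the actual function may differ):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    # Assumed layout: both predictions and label_ids are (intent, ner) tuples.
    intent_logits, ner_logits = eval_pred.predictions
    intent_labels, ner_labels = eval_pred.label_ids

    intent_preds = np.argmax(intent_logits, axis=-1)
    ner_preds = np.argmax(ner_logits, axis=-1)

    mask = ner_labels != -100  # skip special-token / padding positions
    return {
        "intent_accuracy": accuracy_score(intent_labels, intent_preds),
        "ner_token_accuracy": float((ner_preds[mask] == ner_labels[mask]).mean()),
    }
```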

### Stage 2: Fine-Tuning

- The DistilBERT backbone was entirely **unfrozen** (see the sketch below).
- Using a very low learning rate lets the whole model adapt to the new data while preserving the backbone's general-purpose language knowledge.
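
Unfreezing mirrors the Stage 1 snippet:

```python
# Stage 2: the whole network is trainable again for end-to-end fine-tuning.
for param in model.backbone.parameters():
    param.requires_grad = True
```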

**Setup**:

```python
# Define Training Arguments (imports as in Stage 1)
training_args = TrainingArguments(
    output_dir="path/to/output_dir",
    overwrite_output_dir=True,
    num_train_epochs=50,  # Fine-tuning epochs
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1e-6,  # Learning Rate
    weight_decay=1e-3,  # AdamW weight decay
    logging_dir="path/to/logging_dir",
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # Focus on validation loss as the key metric
    # --- Hub Arguments ---
    push_to_hub=True,
    hub_model_id=hub_model_id,
    hub_strategy="end",
    hub_token=hf_token,
    report_to="tensorboard",  # TensorBoard to monitor training
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_datasets["train"],
    eval_dataset=processed_datasets["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,  # Custom function (check how_to_use.md)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
```

## Evaluation

The model was evaluated on a held-out test set, and its performance was measured for both tasks.

### Intent Classification Performance

| Intent | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| bye | 0.9500 | 0.8261 | 0.8837 | 23 |
| cancel | 0.9211 | 0.8434 | 0.8805 | 83 |
| greeting | 0.9545 | 0.9545 | 0.9545 | 22 |
| negative_reply | 0.9091 | 0.9091 | 0.9091 | 22 |
| oos | 1.0000 | 0.8696 | 0.9302 | 23 |
| positive_reply | 0.7407 | 0.9091 | 0.8163 | 22 |
| query_avail | 0.9620 | 0.9383 | 0.9500 | 81 |
| reschedule | 0.8506 | 0.8916 | 0.8706 | 83 |
| schedule | 0.8488 | 0.9125 | 0.8795 | 80 |
| **Accuracy** | | | **0.8952** | 439 |
| **Macro Avg** | **0.9041** | **0.8949** | **0.8972** | 439 |
| **Weighted Avg** | **0.8998** | **0.8952** | **0.8960** | 439 |

### NER (Token Classification) Performance

| Entity | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| B-appointment_id | 1.0000 | 1.0000 | 1.0000 | 61 |
| B-appointment_type | 0.8646 | 0.7477 | 0.8019 | 111 |
| B-practitioner_name | 0.9161 | 0.9467 | 0.9311 | 150 |
| I-appointment_id | 0.9667 | 0.9667 | 0.9667 | 210 |
| I-appointment_type | 0.8182 | 0.7368 | 0.7754 | 171 |
| I-practitioner_name | 0.9540 | 0.8941 | 0.9231 | 255 |
| O | 0.9782 | 0.9892 | 0.9837 | 3813 |
| **Accuracy** | | | **0.9673** | 4771 |
| **Macro Avg** | **0.9283** | **0.8973** | **0.9117** | 4771 |
| **Weighted Avg** | **0.9664** | **0.9673** | **0.9666** | 4771 |

The model achieves strong results on the NER task (0.9673 token-level accuracy, with most tokens in the `O` class) and solid results on intent classification (0.8952 accuracy) for this specific dataset.

## Limitations and Bias

- The model's performance is highly dependent on the quality and scope of the **HASD dataset**. It may not generalize well to phrasing or appointment types significantly different from what it was trained on.
- The dataset was primarily generated from templates, which may not capture the full diversity of real human language.
- The model inherits any biases present in the `distilbert-base-uncased` model and the `clinc/clinc_oos` dataset.