andreaceto/schedulebot-nlu-engine

@@ -40,31 +40,29 @@ The model uses a standard `distilbert-base-uncased` model as its core feature ex
 ## Intended Use
-This model is intended to be the core NLU component of a conversational AI system for managing appointments. It takes raw user text as input and outputs a structured JSON object containing the predicted intent and a list of extracted entities.
-```python
-from transformers import AutoTokenizer
-```
 ## Training Data
 The model was trained on the **HASD (Hybrid Appointment Scheduling Dataset)**, a custom dataset built specifically for this task.
 - **Source**: The dataset is a hybrid of real-world conversational examples from `clinc/clinc_oos` (for simple intents) and synthetically generated, template-based examples for complex scheduling intents.
-- **Balancing**: To combat class imbalance, intents sourced from `clinc/clinc_oos` were **down-sampled** to a maximum of **150 examples** each. [cite: multitask_model.ipynb]
-- **Augmentation**: To increase data diversity for complex intents (`schedule`, `reschedule`, etc.), **Contextual Word Replacement** was used. A `distilbert-base-uncased` model augmented the templates by replacing non-placeholder words with contextually relevant synonyms. [cite: multitask_model.ipynb]
 ### Intents
 The model is trained to recognize the following intents:
-`schedule`, `reschedule`, `cancel`, `query_avail`, `greeting`, `positive_reply`, `negative_reply`, `bye`, `oos` (out-of-scope). [cite: multitask_model.ipynb]
 ### Entities
 The model is trained to recognize the following custom named entities:
-`practitioner_name`, `appointment_type`, `appointment_id`. [cite: multitask_model.ipynb]
 ## Training Procedure
@@ -72,16 +70,94 @@ The model was trained using a two-stage fine-tuning strategy to ensure stability
 ### Stage 1: Training the Classifier Heads
-- The `distilbert-base-uncased` base model was **frozen**. [cite: multitask_model.ipynb]
 - Only the randomly initialized MLP heads for intent and NER classification were trained.
-- This was done for **5 epochs** with a higher learning rate (`5e-4`), allowing the new layers to learn the task basics without disrupting the pre-trained backbone. [cite: multitask_model.ipynb]
 ### Stage 2: Selective Fine-Tuning
-- The classification heads were kept trainable, and the **top two layers** of the DistilBERT backbone were **unfrozen**. [cite: multitask_model.ipynb]
-- The entire model was then fine-tuned for **3 epochs** with a much lower learning rate (`2e-5`). [cite: multitask_model.ipynb]
-- This gradual unfreezing approach allows the model to adapt its most task-specific layers to the new data while preserving the powerful, general-purpose knowledge in the lower layers.
 ## Evaluation
 The model was evaluated on a held-out test set, and its performance was measured for both tasks.
@@ -103,8 +179,6 @@ The model was evaluated on a held-out test set, and its performance was measured
 | **Macro Avg** | **0.98** | **0.98** | **0.98** | **197** |
 | **Weighted Avg** | **0.98** | **0.98** | **0.98** | **197** |
-*[Based on the classification report in the provided `multitask_model.ipynb` notebook.]* [cite: multitask_model.ipynb]
 ### NER (Token Classification) Performance
 | Entity | Precision | Recall | F1-Score | Support |
@@ -117,8 +191,6 @@ The model was evaluated on a held-out test set, and its performance was measured
 | **Macro Avg** | **1.00** | **1.00** | **1.00** | 1444 |
 | **Weighted Avg** | **1.00** | **1.00** | **1.00** | 1444 |
-*[Based on the classification report in the provided `multitask_model.ipynb` notebook.]* [cite: multitask_model.ipynb]
 The model achieves near-perfect results on the NER task and excellent results on the intent classification task for this specific dataset.
 ## Limitations and Bias

 ## Intended Use
+This model is intended to be the core NLU component of a conversational AI system for managing appointments.
+For instructions on how to use the model, see the [dedicated file](./how_to_use.md).
 ## Training Data
 The model was trained on the **HASD (Hybrid Appointment Scheduling Dataset)**, a custom dataset built specifically for this task.
 - **Source**: The dataset is a hybrid of real-world conversational examples from `clinc/clinc_oos` (for simple intents) and synthetically generated, template-based examples for complex scheduling intents.
+- **Balancing**: To combat class imbalance, intents sourced from `clinc/clinc_oos` were **down-sampled** to a maximum of **150 examples** each.
+- **Augmentation**: To increase data diversity for complex intents (`schedule`, `reschedule`, etc.), **Contextual Word Replacement** was used. A `distilbert-base-uncased` model augmented the templates by replacing non-placeholder words with contextually relevant synonyms.
+The dataset is available [here](https://huggingface.co/datasets/andreaceto/hasd).
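The balancing step above can be illustrated with a minimal sketch. It assumes each example is a dict with hypothetical `text` and `intent` fields, and that down-sampling is done by seeded random sampling; the card only states the cap of 150 examples per intent, so the rest is an assumption rather than the notebook's actual code.

```python
import random
from collections import defaultdict

MAX_PER_INTENT = 150  # cap stated in the model card


def downsample_by_intent(examples, max_per_intent=MAX_PER_INTENT, seed=42):
    """Keep at most `max_per_intent` randomly chosen examples per intent."""
    rng = random.Random(seed)  # seeded for reproducibility (assumption)
    by_intent = defaultdict(list)
    for example in examples:
        by_intent[example["intent"]].append(example)

    balanced = []
    for group in by_intent.values():
        if len(group) > max_per_intent:
            group = rng.sample(group, max_per_intent)
        balanced.extend(group)
    return balanced


# Toy data (hypothetical): 400 "greeting" rows and 90 "bye" rows.
toy = [{"text": f"hi {i}", "intent": "greeting"} for i in range(400)]
toy += [{"text": f"bye {i}", "intent": "bye"} for i in range(90)]

balanced = downsample_by_intent(toy)
counts = {}
for ex in balanced:
    counts[ex["intent"]] = counts.get(ex["intent"], 0) + 1
print(counts)  # {'greeting': 150, 'bye': 90}
```

Intents already below the cap are left untouched, so only over-represented classes shrink.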
 ### Intents
 The model is trained to recognize the following intents:
+`schedule`, `reschedule`, `cancel`, `query_avail`, `greeting`, `positive_reply`, `negative_reply`, `bye`, `oos` (out-of-scope).
 ### Entities
 The model is trained to recognize the following custom named entities:
+`practitioner_name`, `appointment_type`, `appointment_id`.
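For token classification, entity names like these are usually expanded into per-token tags. The sketch below assumes a BIO tagging scheme, which the card does not state explicitly; `build_bio_labels`, `label2id` and `id2label` are illustrative names, not identifiers taken from the repository.

```python
# Entity types listed in the model card.
ENTITY_TYPES = ["practitioner_name", "appointment_type", "appointment_id"]


def build_bio_labels(entity_types):
    """Expand entity types into BIO tags: O, B-<type>, I-<type> (assumed scheme)."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")
        labels.append(f"I-{entity}")
    return labels


ner_labels = build_bio_labels(ENTITY_TYPES)
label2id = {label: idx for idx, label in enumerate(ner_labels)}
id2label = {idx: label for label, idx in label2id.items()}

print(len(ner_labels))  # 7
```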
 ## Training Procedure
 ### Stage 1: Training the Classifier Heads
+- The `distilbert-base-uncased` base model was entirely **frozen**.
 - Only the randomly initialized MLP heads for intent and NER classification were trained.
+**Setup**:
+```python
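+from transformers import (
+    DataCollatorForTokenClassification,
+    EarlyStoppingCallback,
+    Trainer,
+    TrainingArguments,
+)
+
+# `tokenizer`, `model`, `processed_datasets`, `compute_metrics`,
+# `hub_model_id` and `hf_token` are assumed to be defined beforehand.
+# Define a data collator to handle padding for token classification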
+data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
+# Define Training Arguments
+training_args = TrainingArguments(
+    output_dir="path/to/output_dir",
+    num_train_epochs=200,               # Training epochs
+    per_device_train_batch_size=32,
+    per_device_eval_batch_size=32,
+    learning_rate=1e-4,                 # Learning Rate
+    weight_decay=1e-5,                  # AdamW weight decay
+    logging_dir="path/to/logging_dir",
+    logging_strategy="steps",
+    logging_steps=10,
+    eval_strategy="epoch",
+    save_strategy="epoch",
+    load_best_model_at_end=True,
+    metric_for_best_model="ner_f1",     # Focus on NER F1 as the key metric
+    # --- Hub Arguments ---
+    push_to_hub=True,
+    hub_model_id=hub_model_id,
+    hub_strategy="end",
+    hub_token=hf_token,
+    report_to="tensorboard"             # Tensorboard to monitor training
+)
+# Create the Trainer
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=processed_datasets["train"],
+    eval_dataset=processed_datasets["validation"],
+    processing_class=tokenizer,
+    data_collator=data_collator,
+    compute_metrics=compute_metrics,  # Custom function (check how_to_use.md)
+    callbacks=[EarlyStoppingCallback(early_stopping_patience=20)]
+)
+```
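The setup above does not show the freezing step itself. A minimal PyTorch sketch of the idea follows; `TinyMultiTaskModel` is a hypothetical stand-in (a linear layer in place of DistilBERT, single linear layers in place of the MLP heads), and the attribute names `backbone`, `intent_head` and `ner_head` are assumptions, not the repository's actual names.

```python
from torch import nn


class TinyMultiTaskModel(nn.Module):
    """Hypothetical stand-in: a 'backbone' plus two task heads."""

    def __init__(self, hidden=8, num_intents=9, num_ner_labels=7):
        super().__init__()
        self.backbone = nn.Linear(hidden, hidden)  # placeholder for DistilBERT
        self.intent_head = nn.Linear(hidden, num_intents)
        self.ner_head = nn.Linear(hidden, num_ner_labels)


def set_backbone_trainable(model, trainable):
    """Enable or disable gradient updates for every backbone parameter."""
    for param in model.backbone.parameters():
        param.requires_grad = trainable


toy_model = TinyMultiTaskModel()
set_backbone_trainable(toy_model, False)  # Stage 1: heads only

trainable_names = [
    name for name, param in toy_model.named_parameters() if param.requires_grad
]
print(trainable_names)
# ['intent_head.weight', 'intent_head.bias', 'ner_head.weight', 'ner_head.bias']
```

Calling the same helper with `True` would make the backbone trainable again.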
 ### Stage 2: Selective Fine-Tuning
+- The DistilBERT backbone was entirely **unfrozen**.
+- A very low learning rate lets the model adapt further to the new data while preserving the general-purpose knowledge acquired during pre-training.
+**Setup**:
+```python
+# Define Training Arguments (`data_collator` is reused from Stage 1)
+training_args = TrainingArguments(
+    output_dir="path/to/output_dir",
+    num_train_epochs=50,                # Fine-tuning epochs
+    per_device_train_batch_size=32,
+    per_device_eval_batch_size=32,
+    learning_rate=1e-6,                 # Learning Rate
+    weight_decay=1e-3,                  # AdamW weight decay
+    logging_dir="path/to/logging_dir",
+    logging_strategy="steps",
+    logging_steps=10,
+    eval_strategy="epoch",
+    save_strategy="epoch",
+    load_best_model_at_end=True,
+    metric_for_best_model="ner_f1",     # Focus on NER F1 as the key metric
+    # --- Hub Arguments ---
+    push_to_hub=True,
+    hub_model_id=hub_model_id,
+    hub_strategy="end",
+    hub_token=hf_token,
+    report_to="tensorboard"             # Tensorboard to monitor training
+)
+# Create the Trainer
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=processed_datasets["train"],
+    eval_dataset=processed_datasets["validation"],
+    processing_class=tokenizer,
+    data_collator=data_collator,
+    compute_metrics=compute_metrics,  # Custom function (check how_to_use.md)
+    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)]
+)
+```
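Both stages pair a large epoch budget with `EarlyStoppingCallback`, so the patience value largely decides when training ends. The sketch below is a simplified pure-Python illustration of patience-based stopping, not the library's implementation; the function name and the per-epoch scores are hypothetical.

```python
def epochs_until_stop(scores, patience):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the monitored score has failed to improve on the
    best value seen so far for `patience` consecutive evaluations.
    """
    best = None
    bad_evals = 0
    for epoch, score in enumerate(scores, start=1):
        if best is None or score > best:
            best = score
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return epoch
    return None


# Hypothetical validation NER F1 per epoch (not real training logs).
ner_f1_per_epoch = [0.90, 0.95, 0.97, 0.97, 0.96, 0.97, 0.96, 0.97]

print(epochs_until_stop(ner_f1_per_epoch, patience=5))   # 8
print(epochs_until_stop(ner_f1_per_epoch, patience=20))  # None
```

With `load_best_model_at_end=True`, the checkpoint restored after stopping is the best-scoring one, not the last one.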
 ## Evaluation
 The model was evaluated on a held-out test set, and its performance was measured for both tasks.
 | **Macro Avg** | **0.98** | **0.98** | **0.98** | **197** |
 | **Weighted Avg** | **0.98** | **0.98** | **0.98** | **197** |
 ### NER (Token Classification) Performance
 | Entity | Precision | Recall | F1-Score | Support |
 | **Macro Avg** | **1.00** | **1.00** | **1.00** | 1444 |
 | **Weighted Avg** | **1.00** | **1.00** | **1.00** | 1444 |
 The model achieves near-perfect results on the NER task and excellent results on the intent classification task for this specific dataset.
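The tables report both macro and weighted averages. As a reminder of how the two differ, here is a small sketch using made-up per-class values; the numbers are hypothetical and are not taken from the evaluation above.

```python
def macro_and_weighted(per_class):
    """per_class maps label -> (f1, support); returns (macro, weighted) F1."""
    total_support = sum(support for _, support in per_class.values())
    # Macro: every class counts equally.
    macro = sum(f1 for f1, _ in per_class.values()) / len(per_class)
    # Weighted: each class counts in proportion to its support.
    weighted = sum(f1 * support for f1, support in per_class.values()) / total_support
    return macro, weighted


# Hypothetical per-class scores for illustration only.
toy_report = {"schedule": (1.00, 30), "cancel": (0.90, 10)}

macro, weighted = macro_and_weighted(toy_report)
print(round(macro, 3), round(weighted, 3))  # 0.95 0.975
```

When the two averages match, as in the tables above, performance is roughly uniform across classes regardless of their size.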
 ## Limitations and Bias