	---
	library_name: transformers
	license: apache-2.0
	base_model: distilbert-base-uncased
	tags:
	- generated_from_trainer
	model-index:
	- name: schedulebot-nlu-engine
	  results: []
	datasets:
	- andreaceto/hasd
	language:
	- en
	---

	# Schedulebot-nlu-engine

	## Model Description

	This model is a multi-task Natural Language Understanding (NLU) engine designed specifically for an appointment scheduling chatbot. It is fine-tuned from a `distilbert-base-uncased` backbone and is capable of performing two tasks simultaneously:

	- Intent Classification: Identifying the user's primary goal (e.g., `schedule`, `cancel`).
	- Named Entity Recognition (NER): Extracting custom, domain-specific entities (e.g., `appointment_type`).

	Each task has its own custom classification head: a small MLP rather than a single linear layer, intended to improve performance on nuanced inputs.

	## Model Architecture

	The model uses a standard `distilbert-base-uncased` model as its core feature extractor. Two custom classification "heads" are placed on top of this base to perform the downstream tasks.

	- Base Model: `distilbert-base-uncased`
	- Classifier Heads: each head is a Multi-Layer Perceptron (MLP) with the following structure to allow for more complex feature interpretation:
	1. A Linear layer projecting the transformer's output dimension (768) to an intermediate size (384).
	2. A GELU activation function.
	3. A Dropout layer with a rate of 0.3 for regularization.
	4. A final Linear layer projecting the intermediate size to the number of output labels for the specific task (intent or NER).
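The head structure listed above can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the class name `MLPHead` is hypothetical, random tensors stand in for the DistilBERT output, and feeding the first (`[CLS]`) token to the intent head is an assumption. The label counts (9 intents, 7 BIO tags) follow the intent and entity lists in this card.

```python
import torch
from torch import nn


class MLPHead(nn.Module):
    """Linear(768 -> 384) -> GELU -> Dropout(0.3) -> Linear(384 -> num_labels)."""

    def __init__(self, num_labels: int, hidden_size: int = 768,
                 intermediate_size: int = 384, dropout: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(intermediate_size, num_labels),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)


# 9 intents; 7 NER tags (B-/I- for three entity types, plus O).
intent_head = MLPHead(num_labels=9)
ner_head = MLPHead(num_labels=7)

# Stand-in for the backbone's last hidden state: (batch, seq_len, hidden_size).
hidden_states = torch.randn(2, 16, 768)

# Assumption: the intent head reads the first ([CLS]) token,
# while the NER head is applied to every token position.
intent_logits = intent_head(hidden_states[:, 0])  # shape (2, 9)
ner_logits = ner_head(hidden_states)              # shape (2, 16, 7)
```

Sharing one backbone between both heads means a single forward pass yields both the utterance-level intent and the per-token entity tags.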

	## Intended Use

	This model is intended to be the core NLU component of a conversational AI system for managing appointments.

	For usage instructions, see the [dedicated file](./how_to_use.md).

	## Training Data

	The model was trained on the HASD (Hybrid Appointment Scheduling Dataset), a custom dataset built specifically for this task.

	- Source: The dataset is a hybrid of real-world conversational examples from `clinc/clinc_oos` (for simple intents) and synthetically generated, template-based examples for complex scheduling intents.
	- Balancing: To combat class imbalance, intents sourced from `clinc/clinc_oos` were down-sampled to a maximum of 150 examples each.
	- Augmentation: To increase data diversity for complex intents (`schedule`, `reschedule`, etc.), Contextual Word Replacement was used. A `distilbert-base-uncased` model augmented the templates by replacing non-placeholder words with contextually relevant synonyms.
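The balancing and augmentation steps above might look roughly like the sketch below. It is illustrative only: the record fields (`text`, `intent`), the `{placeholder}` template syntax, and the `propose` callable are assumptions, not the actual HASD build script. In practice `propose` would query a masked language model such as `distilbert-base-uncased`; here a stub is used so the example runs offline.

```python
import random
from collections import defaultdict


def downsample_per_intent(examples, max_per_intent=150, seed=0):
    """Keep at most `max_per_intent` randomly chosen examples for each intent."""
    rng = random.Random(seed)
    by_intent = defaultdict(list)
    for example in examples:
        by_intent[example["intent"]].append(example)
    kept = []
    for group in by_intent.values():
        if len(group) > max_per_intent:
            group = rng.sample(group, max_per_intent)
        kept.extend(group)
    return kept


def augment_template(template, propose, rng):
    """Replace one non-placeholder word with a proposed alternative.

    `propose(words, index)` is a hypothetical callable that returns a
    replacement for words[index]. Words containing "{" are treated as
    placeholders and are never modified.
    """
    words = template.split()
    candidates = [i for i, word in enumerate(words) if "{" not in word]
    if not candidates:
        return template
    index = rng.choice(candidates)
    words[index] = propose(words, index)
    return " ".join(words)


# Toy data: 400 "greeting" examples are capped at 150; 40 "bye" examples are kept.
data = [{"text": f"hi {i}", "intent": "greeting"} for i in range(400)]
data += [{"text": f"bye {i}", "intent": "bye"} for i in range(40)]
balanced = downsample_per_intent(data)

# Stub proposer for illustration; a real one would ask a masked language model.
template = "please book a {appointment_type} with {practitioner_name}"
augmented = augment_template(
    template,
    propose=lambda words, i: words[i].upper(),
    rng=random.Random(0),
)
```

Keeping placeholders untouched matters because they are later filled with entity values that the NER labels depend on.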

	The dataset is available [here](https://huggingface.co/datasets/andreaceto/hasd).

	### Intents

	The model is trained to recognize the following intents:
	`schedule`, `reschedule`, `cancel`, `query_avail`, `greeting`, `positive_reply`, `negative_reply`, `bye`, `oos` (out-of-scope).

	### Entities

	The model is trained to recognize the following custom named entities:
	`practitioner_name`, `appointment_type`, `appointment_id`.
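The NER head predicts one tag per token in a B-/I-/O scheme (the `B-` and `I-` labels appear in the evaluation table). Downstream code typically groups those tags into entity spans. The helper below is a minimal sketch of that grouping; the function name `decode_bio` is hypothetical, and it assumes tokens are already whole words, so WordPiece sub-tokens would need to be merged first.

```python
def decode_bio(tokens, tags):
    """Group per-token BIO tags into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        # A span starts on B-, or on an I- that does not continue the open span.
        starts_new = prefix == "B" or (prefix == "I" and entity_type != current_type)
        if current_type is not None and (prefix == "O" or starts_new):
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix == "O":
            continue
        if starts_new:
            current_type, current_tokens = entity_type, [token]
        else:
            current_tokens.append(token)
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["cancel", "my", "dental", "cleaning", "with", "dr", "smith"]
tags = ["O", "O", "B-appointment_type", "I-appointment_type",
        "O", "B-practitioner_name", "I-practitioner_name"]
entities = decode_bio(tokens, tags)
# entities == [("appointment_type", "dental cleaning"),
#              ("practitioner_name", "dr smith")]
```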

	## Training Procedure

	The model was trained with a two-stage fine-tuning strategy: the classifier heads are trained first on a frozen backbone, then the whole model is fine-tuned at a much lower learning rate.

	### Stage 1: Training the Classifier Heads

	- The `distilbert-base-uncased` base model was entirely frozen.
	- Only the randomly initialized MLP heads for intent and NER classification were trained.
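Freezing the backbone comes down to toggling `requires_grad` on its parameters. The sketch below uses a toy module so it runs without downloading weights; the class and attribute names (`TinyMultiTaskModel`, `backbone`, `intent_head`, `ner_head`) are hypothetical stand-ins, not the names used in the released model.

```python
from torch import nn


class TinyMultiTaskModel(nn.Module):
    """Toy stand-in: a 'backbone' layer plus two task heads."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(768, 768)  # placeholder for DistilBERT
        self.intent_head = nn.Linear(768, 9)
        self.ner_head = nn.Linear(768, 7)


def set_backbone_trainable(model: nn.Module, trainable: bool) -> None:
    """Freeze (False) or unfreeze (True) every backbone parameter."""
    for param in model.backbone.parameters():
        param.requires_grad = trainable


def count_trainable(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


model = TinyMultiTaskModel()

set_backbone_trainable(model, False)      # Stage 1: only the heads receive gradients
stage1_trainable = count_trainable(model)

set_backbone_trainable(model, True)       # Stage 2: the whole model is trainable
stage2_trainable = count_trainable(model)
```

Training the randomly initialized heads first keeps their large early gradients from disturbing the pretrained backbone weights.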

	Setup:

	```python
	from transformers import (
	    DataCollatorForTokenClassification,
	    EarlyStoppingCallback,
	    Trainer,
	    TrainingArguments,
	)

	# `model`, `tokenizer`, `processed_datasets`, `compute_metrics`, `hub_model_id`,
	# and `hf_token` are assumed to be defined earlier in the training script.

	# Define a data collator to handle padding for token classification
	data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

	# Define Training Arguments
	training_args = TrainingArguments(
	    output_dir="path/to/output_dir",
	    overwrite_output_dir=True,
	    num_train_epochs=200,  # Upper bound; early stopping usually ends training sooner
	    per_device_train_batch_size=32,
	    per_device_eval_batch_size=32,
	    learning_rate=1e-4,  # Learning rate for the heads
	    weight_decay=1e-5,  # AdamW weight decay
	    logging_dir="path/to/logging_dir",
	    logging_strategy="epoch",
	    eval_strategy="epoch",
	    save_strategy="epoch",
	    load_best_model_at_end=True,
	    metric_for_best_model="eval_loss",  # Select the best checkpoint by validation loss
	    # --- Hub Arguments ---
	    push_to_hub=True,
	    hub_model_id=hub_model_id,
	    hub_strategy="end",
	    hub_token=hf_token,
	    report_to="tensorboard",  # Monitor training with TensorBoard
	)

	# Create the Trainer
	trainer = Trainer(
	    model=model,
	    args=training_args,
	    train_dataset=processed_datasets["train"],
	    eval_dataset=processed_datasets["validation"],
	    processing_class=tokenizer,
	    data_collator=data_collator,
	    compute_metrics=compute_metrics,  # Custom function (see how_to_use.md)
	    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],
	)
	```

	### Stage 2: Fine-Tuning

	- The DistilBERT backbone was fully unfrozen.
	- The whole model was then trained with a very low learning rate (1e-6), so the backbone can adapt to the new data while preserving its general-purpose pretrained knowledge.

	Setup:

	```python
	# Reuses the imports, `data_collator`, and other objects from Stage 1.

	# Define Training Arguments
	training_args = TrainingArguments(
	    output_dir="path/to/output_dir",
	    overwrite_output_dir=True,
	    num_train_epochs=50,  # Upper bound on fine-tuning epochs
	    per_device_train_batch_size=32,
	    per_device_eval_batch_size=32,
	    learning_rate=1e-6,  # Very low learning rate for full fine-tuning
	    weight_decay=1e-3,  # AdamW weight decay
	    logging_dir="path/to/logging_dir",
	    logging_strategy="epoch",
	    eval_strategy="epoch",
	    save_strategy="epoch",
	    load_best_model_at_end=True,
	    metric_for_best_model="eval_loss",  # Select the best checkpoint by validation loss
	    # --- Hub Arguments ---
	    push_to_hub=True,
	    hub_model_id=hub_model_id,
	    hub_strategy="end",
	    hub_token=hf_token,
	    report_to="tensorboard",  # Monitor training with TensorBoard
	)

	# Create the Trainer
	trainer = Trainer(
	    model=model,
	    args=training_args,
	    train_dataset=processed_datasets["train"],
	    eval_dataset=processed_datasets["validation"],
	    processing_class=tokenizer,
	    data_collator=data_collator,
	    compute_metrics=compute_metrics,  # Custom function (see how_to_use.md)
	    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
	)
	```

	## Evaluation

	The model was evaluated on a held-out test set, and its performance was measured for both tasks.

	### Intent Classification Performance

	| Intent | Precision | Recall | F1-Score | Support |
	| --- | --- | --- | --- | --- |
	| bye | 0.9500 | 0.8261 | 0.8837 | 23 |
	| cancel | 0.9211 | 0.8434 | 0.8805 | 83 |
	| greeting | 0.9545 | 0.9545 | 0.9545 | 22 |
	| negative_reply | 0.9091 | 0.9091 | 0.9091 | 22 |
	| oos | 1.0000 | 0.8696 | 0.9302 | 23 |
	| positive_reply | 0.7407 | 0.9091 | 0.8163 | 22 |
	| query_avail | 0.9620 | 0.9383 | 0.9500 | 81 |
	| reschedule | 0.8506 | 0.8916 | 0.8706 | 83 |
	| schedule | 0.8488 | 0.9125 | 0.8795 | 80 |
	| --- | --- | --- | --- | --- |
	| Accuracy | | | 0.8952 | 439 |
	| Macro Avg | 0.9041 | 0.8949 | 0.8972 | 439 |
	| Weighted Avg | 0.8998 | 0.8952 | 0.8960 | 439 |
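For readers unfamiliar with the summary rows, the short script below recomputes the macro and weighted averages from the per-intent rows of the table above. The per-class numbers are copied from the table; the helper names are illustrative. The macro average weights every intent equally, while the weighted average weights each intent by its support.

```python
# (precision, recall, f1, support) per intent, copied from the table above.
rows = {
    "bye":            (0.9500, 0.8261, 0.8837, 23),
    "cancel":         (0.9211, 0.8434, 0.8805, 83),
    "greeting":       (0.9545, 0.9545, 0.9545, 22),
    "negative_reply": (0.9091, 0.9091, 0.9091, 22),
    "oos":            (1.0000, 0.8696, 0.9302, 23),
    "positive_reply": (0.7407, 0.9091, 0.8163, 22),
    "query_avail":    (0.9620, 0.9383, 0.9500, 81),
    "reschedule":     (0.8506, 0.8916, 0.8706, 83),
    "schedule":       (0.8488, 0.9125, 0.8795, 80),
}
F1, SUPPORT = 2, 3  # column indices


def macro_avg(rows, column):
    """Unweighted mean over classes."""
    return sum(r[column] for r in rows.values()) / len(rows)


def weighted_avg(rows, column):
    """Mean over classes, weighted by each class's support."""
    total = sum(r[SUPPORT] for r in rows.values())
    return sum(r[column] * r[SUPPORT] for r in rows.values()) / total


total_support = sum(r[SUPPORT] for r in rows.values())
macro_f1 = macro_avg(rows, F1)
weighted_f1 = weighted_avg(rows, F1)

print(f"support={total_support} macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
```

Because the intents are fairly balanced after down-sampling, the macro and weighted F1 scores differ only slightly.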

	### NER (Token Classification) Performance

	| Entity | Precision | Recall | F1-Score | Support |
	| --- | --- | --- | --- | --- |
	| B-appointment_id | 1.0000 | 1.0000 | 1.0000 | 61 |
	| B-appointment_type | 0.8646 | 0.7477 | 0.8019 | 111 |
	| B-practitioner_name | 0.9161 | 0.9467 | 0.9311 | 150 |
	| I-appointment_id | 0.9667 | 0.9667 | 0.9667 | 210 |
	| I-appointment_type | 0.8182 | 0.7368 | 0.7754 | 171 |
	| I-practitioner_name | 0.9540 | 0.8941 | 0.9231 | 255 |
	| O | 0.9782 | 0.9892 | 0.9837 | 3813 |
	| --- | --- | --- | --- | --- |
	| Accuracy | | | 0.9673 | 4771 |
	| Macro Avg | 0.9283 | 0.8973 | 0.9117 | 4771 |
	| Weighted Avg | 0.9664 | 0.9673 | 0.9666 | 4771 |

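	On this test set, the model reaches 96.7% token-level accuracy on NER (macro F1 0.91) and 89.5% accuracy on intent classification (macro F1 0.90). The NER accuracy is dominated by the `O` class, which covers about 80% of tokens; among entities, `appointment_id` is recognized almost perfectly, while `appointment_type` is the weakest (F1 of roughly 0.78 to 0.80).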

	## Limitations and Bias

	- The model's performance is highly dependent on the quality and scope of the HASD dataset. It may not generalize well to phrasing or appointment types significantly different from what it was trained on.
	- The dataset was primarily generated from templates, which may not capture the full diversity of real human language.
	- The model inherits any biases present in the `distilbert-base-uncased` model and the `clinc/clinc_oos` dataset.