
	---
	language: en
	tags:
	- text-classification
	- distilbert
	- ticket-classification
	- pytorch
	license: mit # Adjust as needed (e.g., apache-2.0, unspecified)
	datasets:
	- Defect_ticket_v2 # Custom/private dataset name
	model_name: DistilBERT Ticket Classifier
	metrics:
	- accuracy
	---

	# DistilBERT Ticket Classifier (Distil_Bert_V3)

	## Model Overview
	This is a fine-tuned DistilBERT model (`distilbert-base-cased`) that classifies defect tickets and assigns each one to the appropriate team based on its text content. The ticket data from `Defect_ticket_v2.csv` was cleaned by filling missing values in the `Description`, `Comment`, and `Summary` fields. The model predicts one of 5 team labels, each linked to a team email for automated routing.

	- Model Type: DistilBERT for Sequence Classification
	- Framework: PyTorch
	- Repository: [ZAM-ITI-110/Distil_Bert_V3](https://huggingface.co/ZAM-ITI-110/Distil_Bert_V3)
	- License: MIT (see YAML metadata above)
	- Created: February 2025
	- Creator: AUNGHLAINGTUN/Student ID6319250G NYP

	## Intended Use
	This model is intended for:
	- Automating ticket assignment in IT support or defect tracking systems.
	- Reducing manual triage time by predicting the responsible team based on ticket text.

	### Use Case
	- Input: A ticket with fields `Description`, `Comment`, and `Summary` (e.g., "Urgent server crash reported in production").
	- Output: A team label (0-4) mapped to a team email (e.g., `team1@example.com`).
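The label-to-email step in the use case above can be sketched as a plain lookup. This is a minimal illustration, not the repository's actual routing code: the `TEAM_EMAILS` table, the `route_ticket` helper, and the placeholder addresses are all assumptions, since the card does not list the real team emails.

```python
NUM_LABELS = 5  # the card states 5 team labels, encoded as 0-4

# Hypothetical placeholder addresses (label i -> team{i+1}@example.com).
# The real mapping is not given in this card.
TEAM_EMAILS = {label: f"team{label + 1}@example.com" for label in range(NUM_LABELS)}


def route_ticket(label: int) -> str:
    """Return the email address for a predicted team label."""
    if label not in TEAM_EMAILS:
        raise ValueError(f"Unknown team label: {label}")
    return TEAM_EMAILS[label]
```

Raising on an unknown label keeps a misconfigured model (for example, one with a different number of classes) from silently routing tickets to the wrong team.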

	### Out of Scope
	- Not designed for multi-label classification or sentiment analysis.
	- May not generalize well to tickets outside the training domain (e.g., non-technical support tickets).

	## Training Data
	- Dataset: `Defect_ticket_v2.csv` (private dataset)
	- Size: Approximately 5,000 samples (70% train: ~3,504, 15% validation: ~750, 15% test: ~750).
	- Features: Combined text from `Description`, `Comment`, and `Summary` columns.
	- Labels: 5 unique team labels (encoded as 0-4), derived from the `Assigned Team` column.
	- Preprocessing: Missing values filled with empty strings; text truncated/padded to 512 tokens.

	Note: The dataset is not publicly available due to privacy constraints.
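The preprocessing and split described above might look roughly like the following sketch. It uses only the standard library and represents each ticket as a dictionary; the helper names, the space separator used to join fields, and the shuffle seed are assumptions, and tokenization to 512 tokens is not shown.

```python
import random

TEXT_FIELDS = ("Description", "Comment", "Summary")


def combine_fields(ticket: dict) -> str:
    """Join the three text fields, treating missing or non-string values as empty strings."""
    parts = []
    for field in TEXT_FIELDS:
        value = ticket.get(field)
        parts.append(value.strip() if isinstance(value, str) else "")
    # Joining with a single space is an assumption; empty fields are skipped.
    return " ".join(part for part in parts if part)


def split_dataset(rows: list, seed: int = 42) -> tuple:
    """Shuffle and split rows into 70% train, 15% validation, 15% test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 15 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

A plain random split like this does not stratify by team label; if the labels are imbalanced, a stratified split may be a better choice.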

	## Training Procedure
	- Base Model: `distilbert-base-cased`
	- Fine-Tuning:
	  - Epochs: 5
	  - Batch Size: 8
	  - Optimizer: AdamW (learning rate: 3e-5, weight decay: 0.01)
	  - Scheduler: Linear with 10% warmup steps
	- Hardware: Trained on Google Colab with a T4 GPU (~31 seconds/epoch).
	- Mixed Precision: Enabled via PyTorch AMP for efficiency.
	- Loss Function: CrossEntropyLoss
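To make the "linear with 10% warmup" schedule concrete, here is a small sketch of the learning rate at each optimizer step. It follows the usual shape of a linear warmup followed by linear decay; the function name is hypothetical, and the step count of 2,190 is an estimate (about 3,504 training samples at batch size 8 for 5 epochs, assuming no gradient accumulation), not a figure reported by the card.

```python
def linear_warmup_lr(step: int, total_steps: int,
                     base_lr: float = 3e-5,
                     warmup_fraction: float = 0.10) -> float:
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    warmup_steps = round(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Ramp from 0 up to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr down to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Estimated schedule length: 438 steps per epoch * 5 epochs (an assumption).
TOTAL_STEPS = (3504 // 8) * 5
```

Under these assumptions the rate peaks at 3e-5 around step 219 and reaches zero at the final step.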

	### Training Metrics
	| Epoch | Train Loss | Validation Loss | Validation Accuracy |
	|-------|------------|-----------------|---------------------|
	| 1 | 0.4021 | 0.0038 | 100% |
	| 2 | 0.0031 | 0.0011 | 100% |
	| 3 | 0.0013 | 0.0006 | 100% |
	| 4 | 0.0008 | 0.0004 | 100% |
	| 5 | 0.0007 | 0.0004 | 100% |

	- Test Accuracy: 100% (on ~750 test samples).

	## Evaluation
	- Performance: Achieved 100% accuracy on both validation and test sets, indicating excellent fit to the provided data.
	- Caveats:
	  - Perfect accuracy may suggest an easy classification task, limited dataset diversity, or potential data leakage (e.g., duplicates across splits).
	  - Real-world performance on new, unseen tickets should be validated.
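One way to investigate the leakage caveat is to look for tickets whose text appears in both the training split and an evaluation split. The sketch below is an illustrative check, not part of the original training code; the helper names are hypothetical, and it only catches duplicates that match after lowercasing and whitespace normalization.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def find_overlap(train_texts: list, eval_texts: list) -> list:
    """Return normalized texts that occur in both the training and evaluation splits."""
    train_seen = {normalize(text) for text in train_texts}
    eval_seen = {normalize(text) for text in eval_texts}
    return sorted(train_seen & eval_seen)
```

A non-empty result means some evaluation tickets were effectively seen during training, so the reported accuracy would overstate performance on new tickets. Near-duplicates, such as templated tickets that differ by an ID, would need a fuzzier comparison.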

	## How to Use
	- The interface predicts the appropriate team and email for up to 6 ticket descriptions.
	- Click 'Predict' for each ticket, or click 'Send Tickets' to process all tickets at once.

	### Installation
	```bash
	pip install transformers torch