---
license: mit
datasets:
- thesofakillers/jigsaw-toxic-comment-classification-challenge
language:
- en
metrics:
- accuracy
- f1
base_model:
- distilbert/distilbert-base-uncased
pipeline_tag: text-classification
library_name: transformers
tags:
- social
---
# **Toxic Comment Classification with Transformer Optimization**
This project demonstrates a high-performance pipeline for classifying toxic comments as a **binary classification** task. Models were trained and evaluated on the **Jigsaw Toxic Comment Classification** dataset, with the domain-specific **Toxic-BERT** model as the primary architecture.
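A minimal inference sketch using the `transformers` pipeline; the repository id below is a placeholder, not this model's actual Hub id:

```python
from transformers import pipeline

# Hypothetical repo id -- substitute the actual model repository on the Hub.
classifier = pipeline(
    "text-classification",
    model="<your-username>/toxic-comment-classifier",
)

print(classifier("You are a wonderful person!"))
# e.g. [{'label': 'NON_TOXIC', 'score': 0.99}] -- label names depend on the model config
```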
## **Project Overview**
* **Objective**: To build an efficient **binary toxicity classifier** using state-of-the-art NLP models.
* **Model Type**: Binary classification (Toxic vs. Non-Toxic).
* **Dataset**: Jigsaw Toxic Comment Classification Challenge.
* **Scope**: Includes data visualization, model benchmarking, and size reduction for deployment.
## **Technical Workflow**
### **1. Data Preprocessing & EDA**
* **Labeling**: Multi-label categories (toxic, severe_toxic, obscene, threat, insult, identity_hate) were condensed into a single binary **'is_toxic'** label.
* **Balancing**: The dataset was downsampled to 16,000 toxic and 16,000 non-toxic comments, yielding a balanced 32,000-sample training set.
* **Cleaning**: Newline characters were removed to standardize the text input for transformer tokenizers.
* **Visualization**: Word clouds were generated for both classes to surface the most frequent terms in toxic and non-toxic speech. (The labeling, cleaning, and balancing steps are sketched below.)
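A rough reproduction of these steps, assuming the standard column names from the Jigsaw challenge's `train.csv` (the sampling seed is illustrative):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # Jigsaw Toxic Comment Classification Challenge data

# Collapse the six multi-label columns into one binary 'is_toxic' target.
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
df["is_toxic"] = (df[label_cols].sum(axis=1) > 0).astype(int)

# Remove newlines so each comment is a single line of tokenizer input.
df["comment_text"] = df["comment_text"].str.replace("\n", " ", regex=False)

# Balance the classes: 16,000 comments per class -> 32,000 samples total.
toxic = df[df["is_toxic"] == 1].sample(16_000, random_state=42)
clean = df[df["is_toxic"] == 0].sample(16_000, random_state=42)
balanced = pd.concat([toxic, clean]).sample(frac=1, random_state=42)  # shuffle
```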
### **2. Embedding Benchmarking**
The project evaluated 15 different embedding sets across two categories (the extraction step is sketched after the list):
* **Light Models**: DistilBERT, MiniLM, ALBERT, and ELECTRA-Small.
* **Heavy Models**: BERT, RoBERTa, DeBERTa, XLNet, and domain-specific models such as **Toxic-BERT** and HateBERT.
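A minimal sketch of extracting fixed-size comment embeddings from a pretrained encoder; mean pooling over the last hidden state is an assumption (the project may instead use CLS pooling or `sentence-transformers`):

```python
import torch
from transformers import AutoModel, AutoTokenizer

def embed(texts, model_name="distilbert-base-uncased", batch_size=32, device="cpu"):
    """Return one mean-pooled embedding vector per input text."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()
    vecs = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tok(texts[i:i + batch_size], padding=True, truncation=True,
                        max_length=256, return_tensors="pt").to(device)
            hidden = model(**batch).last_hidden_state         # (B, T, H)
            mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
            vecs.append((hidden * mask).sum(1) / mask.sum(1)) # masked mean over tokens
    return torch.cat(vecs).cpu().numpy()
```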
### **3. Model Performance Results**
Each embedding set was evaluated by training Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF) classifiers on top of it. Selected results are shown below; a benchmarking sketch follows the table.
| Embedding | LR (AUC) | Linear SVM (Accuracy) | RBF SVM (AUC) | RF (Accuracy) |
| :--- | :--- | :--- | :--- | :--- |
| **Toxic-BERT** | 0.997022 | 0.979531 | 0.991532 | 0.979375 |
| **HateBERT** | 0.967701 | 0.901875 | 0.965530 | 0.852344 |
| **DistilBERT** | 0.967614 | 0.898906 | 0.967362 | 0.878125 |
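A sketch of the benchmarking loop on top of the extracted embeddings, reusing `embed` and `balanced` from the sketches above; the classifiers use near-default scikit-learn hyperparameters, which may differ from the project's actual settings:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X = embed(balanced["comment_text"].tolist())  # embeddings from the sketch above
y = balanced["is_toxic"].values

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

for name, clf in [("LR", LogisticRegression(max_iter=1000)),
                  ("RBF SVM", SVC(kernel="rbf", probability=True)),
                  ("RF", RandomForestClassifier(n_estimators=300))]:
    clf.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name}: ACC={acc:.4f}  AUC={auc:.4f}")
```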
## **Optimization Techniques**
### **4. Dynamic Quantization**
To optimize the teacher model (Toxic-BERT) for CPU inference, dynamic quantization was applied, converting the weights from FP32 to INT8 (sketched after the list).
* **Size Reduction**: The model size decreased from **438.01 MB** to **181.49 MB**.
* **Accuracy Retention**: The quantized model maintained a high **test AUC of 0.9966**, showing negligible performance loss despite the 58.6% reduction in size.
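In PyTorch this is essentially a one-liner over the model's linear layers; the sketch below assumes the public `unitary/toxic-bert` checkpoint stands in for the fine-tuned teacher:

```python
import os
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("unitary/toxic-bert")

# Convert nn.Linear weights from FP32 to INT8; activations are quantized
# on the fly at inference time, so no calibration data is needed.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def size_mb(m, path="tmp_weights.pt"):
    """Measure the on-disk size of a model's state dict in MB."""
    torch.save(m.state_dict(), path)
    mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return mb

print(f"FP32: {size_mb(model):.2f} MB -> INT8: {size_mb(quantized):.2f} MB")
```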
### **5. Knowledge Distillation**
A smaller student model (**DistilBERT**) was trained to mimic the behavior of the **Toxic-BERT** teacher.
* **Loss Function**: A custom **binary knowledge distillation** loss was used, combining Kullback-Leibler (KL) divergence against the teacher's soft probabilities with cross-entropy against the hard labels (sketched below).
* **Student Performance**: Reached a **validation AUC of 0.9866** after 5 training epochs.
* **Final Footprint**: At **267.86 MB**, the student model is significantly more portable than the **438.01 MB** teacher.
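A sketch of that loss, assuming a two-logit (softmax) formulation of the binary task; the temperature `T` and mixing weight `alpha` are illustrative hyperparameters:

```python
import torch
import torch.nn.functional as F

def binary_kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL divergence with hard-label cross-entropy.

    student_logits, teacher_logits: (batch, 2) raw logits
    labels: (batch,) long tensor of 0/1 hard labels
    """
    # Soft targets: KL divergence between temperature-scaled distributions;
    # the T*T factor keeps gradient magnitudes comparable across temperatures.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Hard targets: standard cross-entropy on the binary labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```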
## **Requirements**
* `torch`
* `transformers`
* `sentence-transformers`
* `pandas`, `numpy`
* `matplotlib`, `wordcloud`
* `scikit-learn`