kohils/Twitter-Cyberbullying-Classification

	---
	library_name: sklearn
	tags:
	- text-classification
	- cyberbullying
	- nlp
	- social-impact
	- ensemble-learning
	dataset_info:
	  name: Cyberbullying Classification
	  source: Kaggle (andrewmvd/cyberbullying-classification)
	metrics:
	- accuracy
	model_file: voting_classifier_model.pkl
	---

	# Cyberbullying Classification Model (Scikit-Learn)

	This is a traditional machine learning model that classifies tweets into five categories: four types of cyberbullying plus a non-bullying class. It is an ensemble `VotingClassifier` combining Logistic Regression and Random Forest, and it reaches approximately 91% accuracy on the test set.

	## Model Details

	- Developed by: Kohil Sharma
	- Model Type: Voting Classifier (Logistic Regression + Random Forest)
	- Feature Extraction: TF-IDF (Term Frequency-Inverse Document Frequency)
	- Library: Scikit-Learn
	- Language: English

	## Intended Use
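	This model is designed to detect specific types of cyberbullying in English tweets. Because it relies on TF-IDF features and classical Scikit-Learn estimators, it is lightweight and typically faster at inference than transformer-based models, which makes it suitable for low-resource environments.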

	### Classification Labels
	The model classifies text into 5 categories (mapped as follows):
	- `0`: Not Cyberbullying
	- `1`: Gender (Sexist)
	- `2`: Religion
	- `3`: Age
	- `4`: Ethnicity (Racist)

	(Note: The 'Other' category was removed during preprocessing to improve accuracy.)

	## Training Data
	- Dataset: [Cyberbullying Classification Tweets](https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification)
	- Original Size: ~47,000 tweets
	- Processed Size: ~38,000 tweets (after cleaning duplicates and removing the 'Other' class)
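
The filtering described above can be sketched roughly as follows. This is an illustration only, not the author's script: the column names `tweet_text` and `cyberbullying_type`, the label strings, and the sample rows are assumptions about the CSV layout, and the order of the two steps is a guess.

```python
# Assumed label strings mapped to the integer ids used in this card.
LABEL_IDS = {
    "not_cyberbullying": 0,
    "gender": 1,
    "religion": 2,
    "age": 3,
    "ethnicity": 4,
}

def filter_rows(rows):
    """Drop rows outside the five kept classes, then drop duplicate tweets."""
    seen = set()
    kept = []
    for row in rows:
        label = row["cyberbullying_type"]
        text = row["tweet_text"]
        if label not in LABEL_IDS:  # e.g. the removed 'Other' class
            continue
        if text in seen:  # keep only the first copy of a tweet
            continue
        seen.add(text)
        kept.append((text, LABEL_IDS[label]))
    return kept

# Hypothetical rows, made up for this example.
rows = [
    {"tweet_text": "nice day", "cyberbullying_type": "not_cyberbullying"},
    {"tweet_text": "nice day", "cyberbullying_type": "not_cyberbullying"},
    {"tweet_text": "some insult", "cyberbullying_type": "other_cyberbullying"},
    {"tweet_text": "too old to post", "cyberbullying_type": "age"},
]
filtered = filter_rows(rows)
```

Here `filtered` keeps two of the four rows: one copy of the duplicated tweet and the age-related tweet.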

	## Training Procedure

	### 1. Preprocessing
	The text was cleaned with the `tweet-preprocessor` library and custom functions:
	- Removal of Usernames (@), Hashtags (#), and Links (http).
	- Removal of punctuation and special characters.
	- Conversion to lowercase.
	- Lemmatization using NLTK's `WordNetLemmatizer`.
	- Stopword removal (including Twitter-specific stopwords like "rt", "mkr").
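
As a rough, standard-library-only approximation of the steps listed above (the actual pipeline used `tweet-preprocessor` and NLTK), the cleaning could look like the sketch below. The regular expressions and the tiny stopword set are assumptions for illustration, and lemmatization is left out because it needs NLTK's WordNet data.

```python
import re
import string

# Assumption: a tiny illustrative stopword set. The card only names "rt" and
# "mkr" as Twitter-specific additions and does not give the full list.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "and", "you", "rt", "mkr"}

def clean_tweet(text):
    """Approximate the listed cleaning steps without third-party packages."""
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)  # usernames and hashtags
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

cleaned = clean_tweet("RT @user: You are SO dumb!!! #mkr http://t.co/abc")
print(cleaned)  # so dumb
```

Whatever cleaning is applied at training time has to be repeated at inference time, otherwise the vectorizer sees tokens it was never fitted on.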

	### 2. Feature Engineering
	- TF-IDF Vectorizer was used to convert text into numerical vectors.
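
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF using the smoothed IDF and L2 normalization that Scikit-Learn documents as its defaults. The card does not state which vectorizer settings were used, and the whitespace tokenization and sample sentences here are simplifications chosen for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # Guard against division by zero for empty documents.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = ["go back to school", "school is fun", "go home"]
vecs = tfidf(docs)
```

In the first document, "back" appears in only one document and therefore gets a higher weight than "school", which appears in two.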

	### 3. Model Architecture
	- Base Models:
	  1. `LogisticRegression` (C=100, penalty='l2')
	  2. `RandomForestClassifier` (n_estimators=100)
	- Ensemble: `VotingClassifier` (Hard Voting) combining the above two.
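
Hard voting picks the label predicted by the most base models. The sketch below illustrates the rule in plain Python; it is not Scikit-Learn's code, and `hard_vote` is a hypothetical helper. It assumes the tie-breaking described in Scikit-Learn's documentation, where ties go to the class that comes first in ascending order.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over one label per model; ties go to the smallest label."""
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)

# Both base models agree, so the shared label wins.
agree = hard_vote([2, 2])
# The two models disagree, which is a tie; the smaller label is chosen.
disagree = hard_vote([4, 1])
```

With only two base models, every disagreement is a tie, so under this rule the ensemble falls back to the lower class id whenever Logistic Regression and Random Forest differ.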

	## Evaluation Results
	- Accuracy: ~91% on the test set.
	- Strengths: High precision in distinguishing Ethnicity, Religion, and Age-based bullying.
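
For reference, accuracy and per-class precision can be computed as in the sketch below. The label lists are made up for illustration and are not the model's actual predictions or results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision_per_class(y_true, y_pred):
    """For each predicted label, the share of those predictions that were right."""
    result = {}
    for label in sorted(set(y_pred)):
        truths = [t for t, p in zip(y_true, y_pred) if p == label]
        result[label] = sum(t == label for t in truths) / len(truths)
    return result

# Hypothetical labels using the card's class ids (0-4).
y_true = [0, 1, 2, 3, 4, 0, 1, 4]
y_pred = [0, 1, 2, 3, 4, 1, 1, 0]

acc = accuracy(y_true, y_pred)
prec = precision_per_class(y_true, y_pred)
```

On this toy data the accuracy is 0.75, and class `0` has a precision of 0.5 because one of its two predictions was actually class `4`.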

	## How to Use

	To use this model in Python, you need to load both the vectorizer and the model using `joblib`.

	```python
	import joblib
	import preprocessor as p  # pip install tweet-preprocessor
	import string

	# 1. Load the saved files.
	# File names are as given in this card; the metadata above lists
	# `voting_classifier_model.pkl`, so adjust both paths to the files in the repository.
	model = joblib.load('model.pickle')
	vectorizer = joblib.load('tfidf.pickle')

	# 2. Define the cleaning function (must match training!)
	# Note: training also applied lemmatization and stopword removal (see Preprocessing).
	def clean_text(text):
	    text = p.clean(text)  # with default options, removes URLs, mentions, hashtags, etc.
	    text = text.lower()
	    text = "".join(char for char in text if char not in string.punctuation)
	    return text

	# 3. Make a prediction
	text = "You are dumb and you should go back to school."
	clean_input = clean_text(text)

	# Vectorize the text (transform expects an iterable of documents)
	vectorized_input = vectorizer.transform([clean_input])

	# Predict
	prediction = model.predict(vectorized_input)
	classes = {0: 'Not Cyberbullying', 1: 'Gender', 2: 'Religion', 3: 'Age', 4: 'Ethnicity'}

	print(f"Prediction: {classes[int(prediction[0])]}")