Cyberbullying Classification Model (Scikit-Learn)

This is a classical machine learning model that classifies tweets into five categories: four types of cyberbullying plus a non-bullying class. It is a hard-voting ensemble of Logistic Regression and Random Forest over TF-IDF features, and it reaches approximately 91% accuracy on the test set.

Model Details

Developed by: Kohil Sharma
Model Type: Voting Classifier (Logistic Regression + Random Forest)
Feature Extraction: TF-IDF (Term Frequency-Inverse Document Frequency)
Library: Scikit-Learn
Language: English

Intended Use

This model is designed to detect specific types of cyberbullying in text. It is lightweight and faster than transformer models, making it suitable for low-resource environments.

Classification Labels

The model classifies text into 5 categories (mapped as follows):

0: Not Cyberbullying
1: Gender (Sexist)
2: Religion
3: Age
4: Ethnicity (Racist)

(Note: The 'Other' category was removed during preprocessing to improve accuracy.)

Training Data

Dataset: Cyberbullying Classification Tweets
Original Size: ~47,000 tweets
Processed Size: ~38,000 tweets (after removing duplicates and dropping the 'Other' class)
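
The filtering described above can be sketched as follows. This is a minimal illustration, not the author's script: the column names (`tweet_text`, `cyberbullying_type`) and the label strings are assumptions about the dataset file, and the tiny in-memory frame stands in for the real CSV.

```python
import pandas as pd

# Assumed column names and label strings; adjust to the actual dataset file.
TEXT_COL = "tweet_text"
LABEL_COL = "cyberbullying_type"

# Numeric IDs follow the Classification Labels section of this card.
LABEL_TO_ID = {
    "not_cyberbullying": 0,
    "gender": 1,
    "religion": 2,
    "age": 3,
    "ethnicity": 4,
}

def filter_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate tweets and rows whose label is not one of the five kept classes."""
    df = df.drop_duplicates(subset=TEXT_COL)
    df = df[df[LABEL_COL].isin(list(LABEL_TO_ID))].copy()
    df["label"] = df[LABEL_COL].map(LABEL_TO_ID)
    return df.reset_index(drop=True)

# Toy stand-in for the real data (hypothetical rows).
toy = pd.DataFrame({
    TEXT_COL: ["hello there", "hello there", "some other insult", "old people joke"],
    LABEL_COL: ["not_cyberbullying", "not_cyberbullying", "other_cyberbullying", "age"],
})
filtered = filter_dataset(toy)
print(filtered[[TEXT_COL, "label"]])
```

On the toy frame, the duplicate row and the row with the unlisted label are dropped, leaving two rows.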

Training Procedure

1. Preprocessing

The text was cleaned using the tweet-preprocessor library and custom functions:

Removal of usernames (@), hashtags (#), and links (http).
Removal of punctuation and special characters.
Conversion to lowercase.
Lemmatization using NLTK's WordNetLemmatizer.
Stopword removal (including Twitter-specific stopwords like "rt", "mkr").
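
The steps listed above can be sketched with the standard library alone. This is an approximation under stated assumptions, not the training code: the regular expressions stand in for tweet-preprocessor, the stopword set is a short illustrative sample (the card only confirms that "rt" and "mkr" were included), and lemmatization is an optional hook so the sketch runs without NLTK data.

```python
import re
import string

# Small illustrative stopword list; the real list used in training is not given.
STOPWORDS = {"rt", "mkr", "a", "an", "the", "is", "are", "to", "and", "you"}

# Regex stand-ins for tweet-preprocessor (assumption: whole tokens are removed).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text, lemmatize=None):
    """Apply the cleaning steps to one tweet and return a space-joined string.

    `lemmatize` is an optional callable (word -> lemma), for example
    nltk.stem.WordNetLemmatizer().lemmatize.
    """
    text = URL_RE.sub(" ", text)         # links
    text = MENTION_RE.sub(" ", text)     # usernames
    text = HASHTAG_RE.sub(" ", text)     # hashtags
    text = text.translate(PUNCT_TABLE)   # punctuation and special characters
    text = text.lower()
    tokens = text.split()
    if lemmatize is not None:
        tokens = [lemmatize(tok) for tok in tokens]
    tokens = [tok for tok in tokens if tok not in STOPWORDS]
    return " ".join(tokens)

print(preprocess("RT @user: You are SO dumb!!! #mkr http://t.co/abc"))
```

Passing `WordNetLemmatizer().lemmatize` as `lemmatize` would add the lemmatization step, provided the WordNet corpus has been downloaded for NLTK.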

2. Feature Engineering

TF-IDF Vectorizer was used to convert text into numerical vectors.

3. Model Architecture

Base Models:
1. LogisticRegression (C=100, penalty='l2')
2. RandomForestClassifier (n_estimators=100)
Ensemble: VotingClassifier (Hard Voting) combining the above two.
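
The architecture above might be assembled as in the following sketch. The hyperparameters `C=100` and `n_estimators=100` come from this card; `max_iter` and `random_state` are assumptions added for convergence and reproducibility, and the toy texts are hypothetical. The L2 penalty is scikit-learn's default for `LogisticRegression`, so it is not passed explicitly.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data; labels follow the mapping in Classification Labels.
texts = [
    "have a great day", "see you at lunch",
    "lovely weather today", "thanks for the help",
    "girls cannot drive", "women belong in the kitchen",
    "girls are bad at math", "women cannot lead",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# C and n_estimators as listed above; max_iter and random_state are assumptions.
log_reg = LogisticRegression(C=100, max_iter=1000)   # L2 penalty is the default
forest = RandomForestClassifier(n_estimators=100, random_state=0)

ensemble = VotingClassifier(
    estimators=[("lr", log_reg), ("rf", forest)],
    voting="hard",   # majority vote on predicted class labels
)
ensemble.fit(X, labels)

pred = ensemble.predict(vectorizer.transform(["girls cannot lead"]))
print(pred)
```

Two consequences of hard voting are worth keeping in mind. With only two estimators, any disagreement is a tie, and scikit-learn resolves ties by choosing the class that comes first in ascending label order. A hard-voting ensemble also does not expose `predict_proba`, so confidence scores are not available from it.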

Evaluation Results

Accuracy: ~91% on the test set.
Strengths: High precision in distinguishing Ethnicity, Religion, and Age-based bullying.
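
Accuracy and per-class precision can be computed as in the sketch below. The label lists are made up for illustration and are not the model's actual test results.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true and predicted label IDs for a held-out test set.
y_true = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
y_pred = [0, 1, 2, 3, 4, 1, 1, 2, 3, 4]

names = ["Not Cyberbullying", "Gender", "Religion", "Age", "Ethnicity"]

accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Per-class precision, recall, and F1 show which categories are separated well.
print(classification_report(y_true, y_pred, target_names=names))
```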

How to Use

To use this model in Python, you need to load both the vectorizer and the model using joblib.

import joblib
import preprocessor as p # pip install tweet-preprocessor
import string

# 1. Load the saved files
model = joblib.load('model.pickle')
vectorizer = joblib.load('tfidf.pickle')

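# 2. Define the cleaning function. It must match the preprocessing used in training.
# Note: this version omits the lemmatization and stopword-removal steps listed
# under Preprocessing; add them if the saved vectorizer was fitted on such text.
def clean_text(text):
    # Strip URLs, mentions, and hashtags, among other tweet artifacts
    text = p.clean(text)
    text = text.lower()
    # Drop ASCII punctuation characters
    text = "".join(char for char in text if char not in string.punctuation)
    return text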

# 3. Make a prediction
text = "You are dumb and you should go back to school."
clean_input = clean_text(text)

# Vectorize the text
vectorized_input = vectorizer.transform([clean_input])

# Predict
prediction = model.predict(vectorized_input)
classes = {0: 'Not Cyberbullying', 1: 'Gender', 2: 'Religion', 3: 'Age', 4: 'Ethnicity'}

print(f"Prediction: {classes[prediction[0]]}")
