Cyberbullying Classification Model (Scikit-Learn)

This is a traditional machine learning model that classifies tweets into categories of cyberbullying. It is an ensemble Voting Classifier combining Logistic Regression and Random Forest, and it reaches approximately 91% accuracy on the test set.

Model Details

  • Developed by: Kohil Sharma
  • Model Type: Voting Classifier (Logistic Regression + Random Forest)
  • Feature Extraction: TF-IDF (Term Frequency-Inverse Document Frequency)
  • Library: Scikit-Learn
  • Language: English

Intended Use

This model is designed to detect specific types of cyberbullying in text. It is lightweight and faster than transformer models, making it suitable for low-resource environments.

Classification Labels

The model classifies text into 5 categories (mapped as follows):

  • 0: Not Cyberbullying
  • 1: Gender (Sexist)
  • 2: Religion
  • 3: Age
  • 4: Ethnicity (Racist)

(Note: The 'Other' category was removed during preprocessing to improve accuracy.)

Training Data

Training Procedure

1. Preprocessing

The text was cleaned using the tweet-preprocessor library and custom functions:

  • Removal of Usernames (@), Hashtags (#), and Links (http).
  • Removal of punctuation and special characters.
  • Conversion to lowercase.
  • Lemmatization using NLTK's WordNetLemmatizer.
  • Stopword removal (including Twitter-specific stopwords like "rt", "mkr").
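
The cleaning steps above can be approximated as follows. This is a minimal sketch, not the original training code: it uses standard-library regular expressions as a stand-in for tweet-preprocessor, the `STOPWORDS` set is a hypothetical placeholder (only "rt" and "mkr" come from this card), and the function name `clean_tweet` and its optional `lemmatize` argument are assumptions made for illustration.

```python
import re
import string

# Hypothetical stopword list for illustration; only "rt" and "mkr" are
# named in this card, the rest are placeholder English stopwords.
STOPWORDS = {"a", "an", "the", "is", "are", "and", "to", "rt", "mkr"}

# Translation table that turns every ASCII punctuation character into a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_tweet(text, lemmatize=None):
    """Approximate the listed cleaning steps with regular expressions."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)                # usernames and hashtags
    text = text.translate(PUNCT_TO_SPACE)               # punctuation
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # remaining special characters
    tokens = text.split()
    if lemmatize is not None:
        # Optional callable mapping one token to its lemma
        tokens = [lemmatize(tok) for tok in tokens]
    tokens = [tok for tok in tokens if tok not in STOPWORDS]
    return " ".join(tokens)


print(clean_tweet("RT @user: The kids are SO annoying!! #mkr http://t.co/xyz"))
```

To add lemmatization with NLTK, `WordNetLemmatizer().lemmatize` could be passed as the `lemmatize` argument, provided the WordNet corpus has been downloaded.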

2. Feature Engineering

  • TF-IDF Vectorizer was used to convert text into numerical vectors.
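
As a rough illustration of this step, the sketch below fits scikit-learn's `TfidfVectorizer` on a tiny hypothetical corpus. The vectorizer settings used in training are not documented here, so defaults are assumed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-cleaned example texts (not from the training data)
corpus = [
    "kids so annoying",
    "great game last night",
    "annoying game",
]

# Default settings are an assumption; the original configuration is not stated
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # learn the vocabulary and IDF weights

print(X.shape)  # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))

# New text must be transformed with the already-fitted vectorizer
new_X = vectorizer.transform(["annoying kids"])
print(new_X.shape)
```

The same fitted vectorizer has to be reused at inference time so that column indices line up with what the classifier saw during training.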

3. Model Architecture

  • Base Models:
    1. LogisticRegression (C=100, penalty='l2')
    2. RandomForestClassifier (n_estimators=100)
  • Ensemble: VotingClassifier (Hard Voting) combining the above two.
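
The architecture above could be assembled roughly as shown below. This is a sketch on hypothetical toy data, not the original training script: `max_iter=1000` and `random_state=0` are assumptions added for convergence and reproducibility, and the estimator names `"lr"` and `"rf"` are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data using two of the card's labels (0 and 3)
texts = [
    "had a great day at the park",
    "lovely weather this morning",
    "you old people should stay offline",
    "stupid kids at school are so dumb",
]
labels = [0, 0, 3, 3]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

ensemble = VotingClassifier(
    estimators=[
        # L2 is scikit-learn's default penalty, matching penalty='l2'
        ("lr", LogisticRegression(C=100, max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="hard",  # majority vote over predicted class labels
)
ensemble.fit(X, labels)

pred = ensemble.predict(vectorizer.transform(["nice day", "dumb kids"]))
print(pred)
```

With only two base models, hard voting can tie; scikit-learn then picks the class that comes first in ascending label order.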

Evaluation Results

  • Accuracy: ~91% on the test set.
  • Strengths: High precision in distinguishing Ethnicity, Religion, and Age-based bullying.
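
For readers who want to reproduce this kind of evaluation, the following sketch computes accuracy and per-class precision with scikit-learn. The `y_true` and `y_pred` values are made-up placeholders, not the model's actual test results.

```python
from sklearn.metrics import accuracy_score, precision_score

# Made-up labels for illustration only (0-4 as in the label mapping)
y_true = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
y_pred = [0, 1, 2, 3, 4, 1, 1, 2, 3, 4]

accuracy = accuracy_score(y_true, y_pred)

# average=None returns one precision value per class, in the order of `labels`
per_class_precision = precision_score(
    y_true, y_pred, labels=[0, 1, 2, 3, 4], average=None
)

print(f"Accuracy: {accuracy:.2f}")
for label, value in zip([0, 1, 2, 3, 4], per_class_precision):
    print(f"Precision for class {label}: {value:.2f}")
```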

How to Use

To use this model in Python, you need to load both the vectorizer and the model using joblib.

import joblib
import preprocessor as p # pip install tweet-preprocessor
import string

# 1. Load the saved files
model = joblib.load('model.pickle')
vectorizer = joblib.load('tfidf.pickle')

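# 2. Define the cleaning function (must match training!)
# Note: this covers only tweet cleaning, lowercasing, and punctuation removal;
# the lemmatization and stopword removal listed under Preprocessing are not shown here.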
def clean_text(text):
    text = p.clean(text)
    text = text.lower()
    text = "".join([char for char in text if char not in string.punctuation])
    return text

# 3. Make a prediction
text = "You are dumb and you should go back to school."
clean_input = clean_text(text)

# Vectorize the text
vectorized_input = vectorizer.transform([clean_input])

# Predict
prediction = model.predict(vectorized_input)
classes = {0: 'Not Cyberbullying', 1: 'Gender', 2: 'Religion', 3: 'Age', 4: 'Ethnicity'}

print(f"Prediction: {classes[prediction[0]]}")