Cyberbullying Classification Model (Scikit-Learn)
This is a classical machine learning model that classifies tweets into categories of cyberbullying. It is an ensemble Voting Classifier that combines Logistic Regression and Random Forest, and it reaches approximately 91% accuracy on the test set.
Model Details
- Developed by: Kohil Sharma
- Model Type: Voting Classifier (Logistic Regression + Random Forest)
- Feature Extraction: TF-IDF (Term Frequency-Inverse Document Frequency)
- Library: Scikit-Learn
- Language: English
Intended Use
This model is designed to detect specific types of cyberbullying in text. It is lightweight and typically faster at inference than transformer-based models, which makes it suitable for low-resource environments.
Classification Labels
The model classifies text into 5 categories (mapped as follows):
- 0: Not Cyberbullying
- 1: Gender (Sexist)
- 2: Religion
- 3: Age
- 4: Ethnicity (Racist)
(Note: The 'Other' category was removed during preprocessing to improve accuracy.)
Training Data
- Dataset: Cyberbullying Classification Tweets
- Original Size: ~47,000 tweets
- Processed Size: ~38,000 tweets (after cleaning duplicates and removing the 'Other' class)
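The filtering described above can be sketched in a few lines. This is a minimal illustration on toy rows, not the author's preprocessing script: the label strings (including "other") and the `filter_rows` helper are hypothetical, and deduplication is assumed to mean dropping exact repeated texts.

```python
# Toy rows standing in for (tweet_text, label) pairs; label strings are hypothetical.
rows = [
    ("you are so old lol", "age"),
    ("you are so old lol", "age"),            # exact duplicate
    ("nice weather today", "not_cyberbullying"),
    ("generic insult here", "other"),         # class removed before training
]

def filter_rows(rows, dropped_label="other"):
    """Remove exact duplicate tweets, then drop one label entirely."""
    seen = set()
    kept = []
    for text, label in rows:
        if label == dropped_label or text in seen:
            continue
        seen.add(text)
        kept.append((text, label))
    return kept

processed = filter_rows(rows)
print(len(rows), "->", len(processed))
```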
Training Procedure
1. Preprocessing
The text underwent rigorous cleaning using the tweet-preprocessor library and custom functions:
- Removal of Usernames (@), Hashtags (#), and Links (http).
- Removal of punctuation and special characters.
- Conversion to lowercase.
- Lemmatization using NLTK's WordNetLemmatizer.
- Stopword removal (including Twitter-specific stopwords like "rt", "mkr").
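The cleaning steps listed above could look roughly like the sketch below. It uses only the standard library rather than tweet-preprocessor, so the regular expressions, the `clean_tweet` name, and the short `STOPWORDS` set are assumptions for illustration. Lemmatization is left as a pluggable function that defaults to a no-op; NLTK's `WordNetLemmatizer().lemmatize` can be passed in if NLTK and its WordNet data are installed.

```python
import re
import string

# Hypothetical stopword list: the card only names "rt" and "mkr" explicitly.
STOPWORDS = {"the", "a", "an", "is", "are", "and", "to", "rt", "mkr"}

def clean_tweet(text, lemmatize=lambda token: token):
    """Apply the listed cleaning steps in order; `lemmatize` maps one token to its lemma."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # links
    text = re.sub(r"[@#]\w+", " ", text)                 # usernames and hashtags
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = text.lower()
    tokens = [lemmatize(tok) for tok in text.split()]
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)

cleaned = clean_tweet("RT @user: The KIDS are mean!! #mkr https://t.co/abc")
print(cleaned)  # kids mean
```

Whatever cleaning is used at inference time should reproduce the training-time steps, since the TF-IDF vocabulary was built from cleaned text.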
2. Feature Engineering
- TF-IDF Vectorizer was used to convert text into numerical vectors.
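To make the weighting concrete, here is a hand-rolled TF-IDF sketch. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalization, which are scikit-learn's `TfidfVectorizer` defaults. The card does not state which vectorizer settings were used, and the whitespace tokenization here is a simplification.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

vectors = tfidf(["kids mean", "kids school", "school boring day"])
print(vectors[0])
```

In the first document, "mean" receives a higher weight than "kids" because "kids" also appears in the second document and is therefore less distinctive.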
3. Model Architecture
- Base Models: LogisticRegression(C=100, penalty='l2') and RandomForestClassifier(n_estimators=100)
- Ensemble: VotingClassifier (hard voting) combining the two base models.
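The ensemble described above could be assembled along these lines. Only `C=100`, `penalty='l2'`, `n_estimators=100`, and hard voting come from this card; `max_iter`, `random_state`, the default `TfidfVectorizer()` settings, and the toy training data are assumptions added so the sketch runs on its own.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hyperparameters named in the card; max_iter and random_state are assumptions.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(C=100, penalty="l2", max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="hard",  # each base model casts one vote for a class label
)

# Toy training data for illustration only (labels follow the card's 0-4 mapping).
texts = [
    "have a nice day",
    "see you at lunch",
    "you are too old for this",
    "old people should log off",
]
labels = [0, 0, 3, 3]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
ensemble.fit(X, labels)

pred = ensemble.predict(vectorizer.transform(["nice day for lunch"]))
print(pred)
```

One design point worth knowing: with only two base models and hard voting, any disagreement is a 1-1 tie. According to the scikit-learn user guide, ties are resolved in favor of the class that comes first in ascending label order.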
Evaluation Results
- Accuracy: ~91% on the test set.
- Strengths: High precision in distinguishing Ethnicity, Religion, and Age-based bullying.
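For readers who want to reproduce these metrics on their own held-out data, the sketch below shows how accuracy and per-class precision are computed. The labels are made up for illustration and are not the model's results; the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_per_class(y_true, y_pred):
    """Precision for class c = correct predictions of c / all predictions of c."""
    result = {}
    for c in sorted(set(y_pred)):
        true_for_predicted_c = [t for t, p in zip(y_true, y_pred) if p == c]
        result[c] = sum(t == c for t in true_for_predicted_c) / len(true_for_predicted_c)
    return result

# Made-up labels for illustration; these are not the model's results.
y_true = [0, 1, 1, 2, 3, 4, 4, 0]
y_pred = [0, 1, 0, 2, 3, 4, 4, 1]

print(accuracy(y_true, y_pred))             # 0.75
print(precision_per_class(y_true, y_pred))  # {0: 0.5, 1: 0.5, 2: 1.0, 3: 1.0, 4: 1.0}
```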
How to Use
To use this model in Python, you need to load both the vectorizer and the model using joblib.
import joblib
import preprocessor as p  # pip install tweet-preprocessor
import string

# 1. Load the saved files
model = joblib.load('model.pickle')
vectorizer = joblib.load('tfidf.pickle')

# 2. Define the cleaning function (must match training!)
# Note: the training pipeline above also lists lemmatization and stopword
# removal; add those steps here if they were applied before vectorization.
def clean_text(text):
    text = p.clean(text)  # strips usernames, hashtags, and links
    text = text.lower()
    text = "".join([char for char in text if char not in string.punctuation])
    return text

# 3. Make a prediction
text = "You are dumb and you should go back to school."
clean_input = clean_text(text)

# Vectorize the text (transform expects an iterable of documents)
vectorized_input = vectorizer.transform([clean_input])

# Predict and map the numeric label back to its category name
prediction = model.predict(vectorized_input)
classes = {0: 'Not Cyberbullying', 1: 'Gender', 2: 'Religion', 3: 'Age', 4: 'Ethnicity'}
print(f"Prediction: {classes[prediction[0]]}")