---
library_name: sklearn
tags:
- text-classification
- cyberbullying
- nlp
- social-impact
- ensemble-learning
dataset_info:
  name: Cyberbullying Classification
  source: Kaggle (andrewmvd/cyberbullying-classification)
metrics:
- accuracy
model_file: voting_classifier_model.pkl
---

# Cyberbullying Classification Model (Scikit-Learn)

This is a traditional machine learning model that classifies tweets into categories of cyberbullying. It is an ensemble **Voting Classifier** combining **Logistic Regression** and **Random Forest**, and it reaches approximately **91% accuracy** on the test set.

## Model Details

- **Developed by:** Kohil Sharma
- **Model Type:** Voting Classifier (Logistic Regression + Random Forest)
- **Feature Extraction:** TF-IDF (Term Frequency-Inverse Document Frequency)
- **Library:** Scikit-Learn
- **Language:** English

## Intended Use
This model detects specific types of cyberbullying in English text, primarily tweets. It is lightweight and typically faster at inference than transformer models, which makes it suitable for low-resource environments.

### Classification Labels
The model classifies text into **5 categories**, encoded as the following integer labels:
- `0`: **Not Cyberbullying**
- `1`: **Gender** (Sexist)
- `2`: **Religion**
- `3`: **Age**
- `4`: **Ethnicity** (Racist)

*(Note: The 'Other' category was removed during preprocessing to improve accuracy.)*

## Training Data
- **Dataset:** [Cyberbullying Classification Tweets](https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification)
- **Original Size:** ~47,000 tweets
- **Processed Size:** ~38,000 tweets (after removing duplicates and the 'Other' class)
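
The filtering step can be pictured with a minimal sketch. The toy rows, the label strings (for example `other_cyberbullying`), and the helper name `filter_rows` are assumptions made for illustration; only the integer ids come from the label table in this card, and the actual training script may differ.

```python
# Toy rows standing in for the Kaggle CSV (text, label); the label strings are assumed.
rows = [
    ("you are all the same", "ethnicity"),
    ("you are all the same", "ethnicity"),        # exact duplicate, dropped
    ("nice weather today", "not_cyberbullying"),
    ("whatever, loser", "other_cyberbullying"),   # 'Other' class, dropped
    ("old people should log off", "age"),
]

# Integer ids follow the label table in this card.
LABEL_TO_ID = {
    "not_cyberbullying": 0,
    "gender": 1,
    "religion": 2,
    "age": 3,
    "ethnicity": 4,
}

def filter_rows(rows):
    """Drop duplicate texts and any row whose label is not in LABEL_TO_ID."""
    seen = set()
    kept = []
    for text, label in rows:
        if label not in LABEL_TO_ID or text in seen:
            continue
        seen.add(text)
        kept.append((text, LABEL_TO_ID[label]))
    return kept

processed = filter_rows(rows)
print(processed)
```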

## Training Procedure

### 1. Preprocessing
The text was cleaned using the `tweet-preprocessor` library and custom functions:
- Removal of usernames (@), hashtags (#), and links (http).
- Removal of punctuation and special characters.
- Conversion to lowercase.
- **Lemmatization** using NLTK's `WordNetLemmatizer`.
- Stopword removal (including Twitter-specific stopwords like "rt", "mkr").
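
The steps above can be sketched with the standard library alone. This is an approximation, not the training code: the regular expressions stand in for `tweet-preprocessor`, the stopword set is a tiny illustrative sample, and the `lemmatize` argument is a hypothetical hook that defaults to doing nothing (with NLTK installed, `WordNetLemmatizer().lemmatize` could be passed instead).

```python
import re
import string

# Tiny illustrative stopword set; the training run used a fuller list
# plus Twitter-specific tokens such as "rt" and "mkr".
STOPWORDS = {"a", "an", "the", "is", "are", "to", "and", "you", "rt", "mkr"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_OR_HASHTAG_RE = re.compile(r"[@#]\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text, lemmatize=lambda token: token):
    """Approximate the cleaning steps listed in this card, in the same order."""
    text = URL_RE.sub(" ", text)                  # links
    text = MENTION_OR_HASHTAG_RE.sub(" ", text)   # usernames and hashtags
    text = text.translate(PUNCT_TABLE).lower()    # punctuation, then lowercase
    tokens = [lemmatize(tok) for tok in text.split()]
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)

cleaned = clean_tweet("RT @user: You are SO annoying!!! #mkr https://t.co/abc")
print(cleaned)  # → so annoying
```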

### 2. Feature Engineering
- **TF-IDF Vectorizer** was used to convert text into numerical vectors.
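
To make the weighting concrete, here is a small worked sketch of TF-IDF in plain Python. It assumes scikit-learn's default behaviour (raw term counts, smoothed IDF of the form `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation per document); the vectorizer settings used for this model are not stated in this card, so treat the numbers as illustrative only. Tokenisation here is a plain whitespace split on toy, already-cleaned texts.

```python
import math
from collections import Counter

docs = ["you are dumb", "you are kind", "go back to school"]  # toy, already-cleaned texts

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

In the first document, "dumb" appears in only one text and so receives a larger weight than "you" or "are", which appear in two.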

### 3. Model Architecture
- **Base Models:**
    1. `LogisticRegression` (C=100, penalty='l2')
    2. `RandomForestClassifier` (n_estimators=100)
- **Ensemble:** `VotingClassifier` (Hard Voting) combining the above two.
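
Hard voting with only two base models has a consequence worth spelling out. The sketch below is a plain-Python illustration, not the scikit-learn implementation, and it assumes equal weights for both models (the card does not state any). It follows scikit-learn's documented tie rule, under which a tied vote goes to the class that comes first in ascending sort order. The prediction lists are hypothetical.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over one sample's predicted labels; ties go to the smallest label."""
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)

# Hypothetical per-sample outputs of the two base models (label ids from this card).
logreg_preds = [4, 3, 0, 2]
forest_preds = [4, 0, 1, 2]

ensemble = [hard_vote(pair) for pair in zip(logreg_preds, forest_preds)]
print(ensemble)  # → [4, 0, 0, 2]
```

Under these assumptions, every disagreement between the two models is a tie, so the ensemble returns the lower label id. In the second sample, one model says `3` (Age) and the other says `0` (Not Cyberbullying), and the vote resolves to `0`.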

## Evaluation Results
- **Accuracy:** ~91% on the test set.
- **Strengths:** High precision in distinguishing Ethnicity, Religion, and Age-based bullying.
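
For reference, accuracy and per-class precision can be computed as in the following sketch. The label lists are made up for illustration and are not this model's predictions or results; the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision_per_class(y_true, y_pred):
    """For each predicted label, the share of those predictions that were correct."""
    precision = {}
    for label in sorted(set(y_pred)):
        true_for_label = [t for t, p in zip(y_true, y_pred) if p == label]
        precision[label] = sum(t == label for t in true_for_label) / len(true_for_label)
    return precision

# Made-up labels for illustration only; these are not the model's results.
y_true = [0, 1, 2, 3, 4, 0, 1, 4]
y_pred = [0, 1, 2, 3, 4, 1, 0, 4]

print(accuracy(y_true, y_pred))
print(precision_per_class(y_true, y_pred))
```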

## How to Use

To use this model in Python, load both the TF-IDF vectorizer and the model with `joblib`, then apply the same text cleaning that was used in training before vectorizing.

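```python
import string

import joblib
import preprocessor as p  # pip install tweet-preprocessor

# 1. Load the saved files.
# File names are as given in this example; the card metadata lists
# voting_classifier_model.pkl, so adjust the paths to the files you downloaded.
model = joblib.load('model.pickle')
vectorizer = joblib.load('tfidf.pickle')

# 2. Define the cleaning function (must match training!).
# Note: training also applied lemmatization and stopword removal
# (see "Preprocessing"); add those steps here if your inputs need them.
def clean_text(text):
    text = p.clean(text)  # strips URLs, mentions, hashtags, etc.
    text = text.lower()
    text = "".join(char for char in text if char not in string.punctuation)
    return text

# 3. Make a prediction
text = "You are dumb and you should go back to school."
clean_input = clean_text(text)

# Vectorize the text (transform expects an iterable of documents)
vectorized_input = vectorizer.transform([clean_input])

# Predict and map the integer label to its class name
prediction = model.predict(vectorized_input)
classes = {0: 'Not Cyberbullying', 1: 'Gender', 2: 'Religion', 3: 'Age', 4: 'Ethnicity'}

print(f"Prediction: {classes[int(prediction[0])]}")
```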