---
library_name: sklearn
tags:
- text-classification
- cyberbullying
- nlp
- social-impact
- ensemble-learning
dataset_info:
  name: Cyberbullying Classification
  source: Kaggle (andrewmvd/cyberbullying-classification)
metrics:
- accuracy
model_file: voting_classifier_model.pkl
---

# Cyberbullying Classification Model (Scikit-Learn)

This is a traditional machine learning model that classifies tweets into different categories of cyberbullying. It is an ensemble **Voting Classifier** combining **Logistic Regression** and **Random Forest**, achieving approximately **91% accuracy**.

## Model Details

- **Developed by:** Kohil Sharma
- **Model Type:** Voting Classifier (Logistic Regression + Random Forest)
- **Feature Extraction:** TF-IDF (Term Frequency-Inverse Document Frequency)
- **Library:** Scikit-Learn
- **Language:** English

## Intended Use

This model is designed to detect specific types of cyberbullying in text. It is lightweight and faster than transformer models, making it suitable for low-resource environments.

### Classification Labels

The model classifies text into **5 categories** (mapped as follows):

- `0`: **Not Cyberbullying**
- `1`: **Gender** (Sexist)
- `2`: **Religion**
- `3`: **Age**
- `4`: **Ethnicity** (Racist)

*(Note: The 'Other' category was removed during preprocessing to improve accuracy.)*

## Training Data

- **Dataset:** [Cyberbullying Classification Tweets](https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification)
- **Original Size:** ~47,000 tweets
- **Processed Size:** ~38,000 tweets (after cleaning duplicates and removing the 'Other' class)

## Training Procedure

### 1. Preprocessing

The text underwent rigorous cleaning using the `tweet-preprocessor` library and custom functions:

- Removal of usernames (@), hashtags (#), and links (http).
- Removal of punctuation and special characters.
- Conversion to lowercase.
- **Lemmatization** using NLTK's `WordNetLemmatizer`.
- Stopword removal (including Twitter-specific stopwords like "rt", "mkr").

### 2. Feature Engineering

- **TF-IDF Vectorizer** was used to convert text into numerical vectors.

### 3. Model Architecture

- **Base Models:**
  1. `LogisticRegression` (C=100, penalty='l2')
  2. `RandomForestClassifier` (n_estimators=100)
- **Ensemble:** `VotingClassifier` (Hard Voting) combining the above two.

## Evaluation Results

- **Accuracy:** ~91% on the test set.
- **Strengths:** High precision in distinguishing Ethnicity, Religion, and Age-based bullying.

## How to Use

To use this model in Python, load both the fitted vectorizer and the model with `joblib`, then apply the same cleaning steps that were used during training before vectorizing.

```python
import string

import joblib
import preprocessor as p  # pip install tweet-preprocessor

# 1. Load the saved files (adjust the paths to the files you downloaded)
model = joblib.load('model.pickle')
vectorizer = joblib.load('tfidf.pickle')

# 2. Define the cleaning function (must match training!)
# Note: training also applied lemmatization and stopword removal;
# this simplified function covers only the steps shown here.
def clean_text(text):
    text = p.clean(text)  # strips usernames, hashtags, and links
    text = text.lower()
    text = "".join(char for char in text if char not in string.punctuation)
    return text

# 3. Make a prediction
text = "You are dumb and you should go back to school."
clean_input = clean_text(text)

# Vectorize the text (transform expects an iterable of documents)
vectorized_input = vectorizer.transform([clean_input])

# Predict and map the numeric label back to its category name
prediction = model.predict(vectorized_input)
classes = {0: 'Not Cyberbullying', 1: 'Gender', 2: 'Religion', 3: 'Age', 4: 'Ethnicity'}
print(f"Prediction: {classes[int(prediction[0])]}")
```
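
For readers who want to see how the pieces listed under Model Architecture fit together, the following is a minimal sketch of a TF-IDF plus hard-voting ensemble in scikit-learn. It is not the original training script: the toy texts and labels are made up so the example runs, `max_iter` and `random_state` are assumptions added for convergence and reproducibility, and the TF-IDF settings are left at their defaults because the card does not state them.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up, already-cleaned toy examples (NOT from the Kaggle dataset).
# Labels follow the card's mapping: 0 = Not Cyberbullying, 3 = Age.
texts = [
    "lovely weather this morning",
    "enjoy dinner with friend tonight",
    "great game last weekend",
    "girl bully me in high school",
    "kid in middle school bully him",
    "high school bully pick on kid",
]
labels = [0, 0, 0, 3, 3, 3]

# Base models with the hyperparameters listed in the card.
# The L2 penalty is scikit-learn's default, so it is not passed explicitly.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(C=100, max_iter=1000)),  # max_iter is an assumption
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),  # seed is an assumption
    ],
    voting="hard",  # majority vote over predicted class labels
)

# Chain vectorization and classification so raw (cleaned) text goes in directly.
pipeline = make_pipeline(TfidfVectorizer(), ensemble)
pipeline.fit(texts, labels)

pred = pipeline.predict(["bully at my old high school"])
print(pred)
```

One design point worth keeping in mind: with only two voters under hard voting, any disagreement between the base models is a tie, and scikit-learn resolves ties by picking the class that comes first in ascending label order.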