# Toxic-Predict
Toxic-Predict is a machine learning project developed as part of the Cellula Internship, focused on safe and responsible multi-modal toxic content moderation. It classifies text queries and image descriptions into nine toxicity categories, including "Safe", "Violent Crimes", "Non-Violent Crimes", and "Unsafe". The project leverages deep learning (Keras/TensorFlow), NLP preprocessing, and benchmarking against modern transformer models to build and evaluate a robust multi-class toxic content classifier.
## Project Context
This project is part of the Cellula Internship proposal:
"Safe and Responsible Multi-Modal Toxic Content Moderation"
The goal is to build a dual-stage moderation pipeline for both text and images, combining hard guardrails (Llama Guard) and soft classification (DistilBERT/Deep Learning) for nuanced, policy-compliant moderation.
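The sketch below illustrates this dual-stage flow in miniature; `hard_filter` and `soft_classify` are hypothetical stand-ins for the Llama Guard guardrail and the soft classifier, not code from this repo.

```python
# Minimal sketch of the dual-stage moderation flow described above.
# hard_filter / soft_classify are illustrative placeholders only.

def hard_filter(text: str) -> bool:
    """Placeholder guardrail: block content matching hard policy rules."""
    blocked_terms = ("how to build a bomb",)  # illustrative only
    return any(term in text.lower() for term in blocked_terms)

def soft_classify(text: str) -> tuple[str, float]:
    """Placeholder soft classifier (DistilBERT / CNN / LSTM in the real pipeline)."""
    return "Safe", 0.97  # dummy prediction for illustration

def moderate(text: str, image_desc: str = "") -> dict:
    combined = f"{text} {image_desc}".strip()
    if hard_filter(combined):                 # Stage 1: hard guardrail
        return {"label": "Unsafe", "stage": "hard"}
    label, score = soft_classify(combined)    # Stage 2: nuanced classification
    return {"label": label, "score": score, "stage": "soft"}

print(moderate("Nice sunset photo", "a beach at dusk"))
```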
## Project Structure

```
.
├── app.py
├── run.py
├── test.py
├── requirements.txt
├── README.md
├── data/
│   ├── cellula-toxic.csv
│   ├── cleaned.csv
│   ├── eval.csv
│   ├── test.csv
│   ├── tokenizer.pkl
│   └── train.csv
├── models/
│   ├── model.py
│   ├── toxic_classifier.h5
│   └── toxic_classifier.keras
├── notebooks/
│   ├── Preprocessing.ipynb
│   └── tokenization.ipynb
└── src/
    ├── preprocess.py
    └── tokenize_and_split.py
```
## Features
- Dual-stage moderation: hard filter (Llama Guard) + soft classifier (DistilBERT/CNN/LSTM)
- Data cleaning, preprocessing, and label encoding
- Tokenization and sequence padding for text data (see the sketch after this list)
- Deep learning and transformer-based models for multi-class toxicity classification
- Evaluation metrics: classification report and confusion matrix
- Jupyter notebooks for data exploration and model development
- Streamlit web app for demo and deployment
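As referenced above, here is a minimal sketch of the tokenization and padding step using the standard Keras utilities; the vocabulary cap and sequence length are illustrative assumptions, and the saved file mirrors `data/tokenizer.pkl`.

```python
import pickle

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["this post looks safe", "violent threat example"]  # illustrative data

# Fit a word-level tokenizer; num_words=20000 is an assumed vocabulary cap.
tokenizer = Tokenizer(num_words=20000, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)

# Convert to integer sequences and pad to a fixed length (maxlen=100 assumed).
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=100, padding="post")

# Persist the fitted tokenizer, mirroring data/tokenizer.pkl in this repo.
with open("tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer, f)
```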
## Setup

1. **Clone the repository**

   ```bash
   git clone https://github.com/yourusername/toxic-classification.git
   cd toxic-predict
   ```

2. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

3. **Prepare data**
   - Place your data files in the `data/` directory if not already present.

4. **Train the model**
   - Use the scripts in `src/` or the Jupyter notebooks in `notebooks/` to preprocess data and train the model.

5. **Run predictions**
   - Use `app.py` or `run.py` to run inference on new data.
## Usage

- **Preprocessing and Tokenization:**
  See `notebooks/Preprocessing.ipynb` and `notebooks/tokenization.ipynb` for step-by-step data cleaning, splitting, and tokenization.
- **Model Training:**
  Model architecture and training code are in `models/model.py`.
- **Inference:**
  Load the trained model (`models/toxic_classifier.h5` or `.keras`) and tokenizer (`data/tokenizer.pkl`) to predict toxicity categories for new samples, as in the sketch below.
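A minimal local-inference sketch, assuming the file layout above; the `maxlen` value and the single-text input format are assumptions, not confirmed repo settings.

```python
import pickle

import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the trained classifier and the fitted tokenizer shipped in the repo.
model = load_model("models/toxic_classifier.keras")
with open("data/tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)

def predict_category(text: str, maxlen: int = 100) -> int:
    """Return the predicted class index; maxlen=100 is an assumed value."""
    seq = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(seq, maxlen=maxlen, padding="post")
    probs = model.predict(padded)
    return int(np.argmax(probs, axis=-1)[0])

print(predict_category("This is a dangerous post"))
```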
## Data

- CSV files with columns: `query`, `image descriptions`, `Toxic Category`, and `Toxic Category Encoded`.
- Data splits: `train.csv`, `eval.csv`, and `test.csv`, plus `cleaned.csv` for processed data (a loading sketch follows this list).
- 9 categories: Safe, Violent Crimes, Elections, Sex-Related Crimes, Unsafe, Non-Violent Crimes, Child Sexual Exploitation, Unknown S-Type, Suicide & Self-Harm.
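A brief sketch of loading a split and pairing the text fields with the encoded label; concatenating `query` and `image descriptions` into one input string is an assumption about how the two modalities' text is combined.

```python
import pandas as pd

# Load the training split with the columns listed above.
train = pd.read_csv("data/train.csv")

# Combine the two text fields into a single model input (an assumed scheme).
train["input_text"] = (
    train["query"].fillna("") + " " + train["image descriptions"].fillna("")
).str.strip()

X = train["input_text"].tolist()
y = train["Toxic Category Encoded"].tolist()
print(train[["input_text", "Toxic Category"]].head())
```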
## Model

- Deep learning model built with Keras (TensorFlow backend).
- Multi-class classification with label encoding for toxicity categories.
- Benchmarking with PEFT-LoRA DistilBERT and baseline CNN/LSTM models (an illustrative baseline sketch follows this list).
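A minimal sketch of what a CNN/LSTM-style baseline for nine classes can look like in Keras; the layer sizes, vocabulary cap, and sequence length are illustrative assumptions, not the actual architecture in `models/model.py`.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # assumed vocabulary cap
MAXLEN = 100         # assumed padded sequence length
NUM_CLASSES = 9      # the nine toxicity categories

model = models.Sequential([
    layers.Input(shape=(MAXLEN,)),
    layers.Embedding(VOCAB_SIZE, 128),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```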
## Evaluation

- A classification report and confusion matrix are generated for model evaluation (see the sketch below).
- See the evaluation steps in `notebooks/Preprocessing.ipynb`.
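A minimal sketch of generating both metrics with scikit-learn; `y_true` and `y_pred` stand in for the encoded test labels and the model's argmax predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Stand-in labels; in practice these come from test.csv and model.predict().
y_true = [0, 1, 2, 1, 0]
y_pred = [0, 1, 1, 1, 0]

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```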
## Hugging Face Inference
This model is available on the Hugging Face Hub: [`NightPrince/Toxic_Classification`](https://huggingface.co/NightPrince/Toxic_Classification)
### Inference API Usage

You can use the Hugging Face Inference API or widget with two fields:

- `text`: the main query or post text
- `image_desc`: the image description (if any)
Example (Python). The generic `InferenceClient.text_classification` helper accepts a single string, so the two-field payload is posted as raw JSON instead:

```python
import requests

# Standard Inference API endpoint for a Hub model.
API_URL = "https://api-inference.huggingface.co/models/NightPrince/Toxic_Classification"
headers = {"Authorization": "Bearer <your_hf_token>"}

# The custom pipeline expects both fields inside the `inputs` payload.
payload = {"inputs": {"text": "This is a dangerous post",
                      "image_desc": "Knife shown in the image"}}

result = requests.post(API_URL, headers=headers, json=payload).json()
print(result)  # e.g. {'label': 'toxic', 'score': 0.98}
```
### Custom Pipeline Details

- The model uses a custom `pipeline.py` for multi-input inference (a sketch of the idea follows the file list below).
- The output is a dictionary with the predicted `label` (class name) and `score` (confidence).
- Class names are mapped using `label_map.json`.

Files in the repo:

- `pipeline.py` (custom inference logic)
- `tokenizer.json` (Keras tokenizer)
- `label_map.json` (class code to name mapping)
- TensorFlow SavedModel files (`saved_model.pb`, `variables/`)
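For orientation, a minimal sketch of the shape such a multi-input pipeline can take; this is an assumed reconstruction for illustration, not the actual `pipeline.py`, and the preprocessing details (input concatenation, `maxlen`, label-map keys) are guesses.

```python
import json

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import tokenizer_from_json

class ToxicPipeline:
    """Assumed reconstruction of the multi-input pipeline, for illustration only."""

    def __init__(self, model_dir: str = "."):
        # Load the exported model (TF 2.x Keras can read SavedModel directories).
        self.model = tf.keras.models.load_model(model_dir)
        with open(f"{model_dir}/tokenizer.json") as f:
            self.tokenizer = tokenizer_from_json(f.read())
        with open(f"{model_dir}/label_map.json") as f:
            self.label_map = json.load(f)  # assumed: stringified code -> class name

    def __call__(self, inputs: dict) -> dict:
        # Combine both text fields into one input string (an assumed scheme).
        text = f"{inputs.get('text', '')} {inputs.get('image_desc', '')}".strip()
        seq = pad_sequences(self.tokenizer.texts_to_sequences([text]), maxlen=100)
        probs = self.model.predict(seq)[0]
        idx = int(np.argmax(probs))
        return {"label": self.label_map[str(idx)], "score": float(probs[idx])}
```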
Requirements:

- `tensorflow`
- `keras`
- `numpy`
## Resources
- Cellula Internship Project Proposal
- BLIP: Bootstrapped Language-Image Pre-training
- Llama Guard
- DistilBERT
- Streamlit
## License

MIT License

**Author:** Yahya Muhammad Alnwsany
**Contact:** yahyaalnwsany39@gmail.com
**Portfolio:** https://nightprincey.github.io/Portfolio/