---
language: en
tags:
- toxic-content
- text-classification
- keras
- tensorflow
- deep-learning
- safety
- multiclass
license: mit
datasets:
- custom
metrics:
- accuracy
- f1
pipeline_tag: text-classification
model-index:
- name: Toxic_Classification
results: []
---
# Toxic-Predict
Toxic-Predict is a machine learning project developed as part of the Cellula Internship, focused on safe and responsible multi-modal toxic content moderation. It classifies text queries and image descriptions into nine toxicity categories such as "Safe", "Violent Crimes", "Non-Violent Crimes", "Unsafe", and others. The project leverages deep learning (Keras/TensorFlow), NLP preprocessing, and benchmarking with modern transformer models to build and evaluate a robust multi-class toxic content classifier.
---
## 🚩 Project Context
This project is part of the **Cellula Internship** proposal:
**"Safe and Responsible Multi-Modal Toxic Content Moderation"**
The goal is to build a dual-stage moderation pipeline for both text and images, combining hard guardrails (Llama Guard) and soft classification (DistilBERT/Deep Learning) for nuanced, policy-compliant moderation.
---
## Features
- Dual-stage moderation: hard filter (Llama Guard) + soft classifier (DistilBERT/CNN/LSTM)
- Data cleaning, preprocessing, and label encoding
- Tokenization and sequence padding for text data
- Deep learning and transformer-based models for multi-class toxicity classification
- Evaluation metrics: classification report and confusion matrix
- Jupyter notebooks for data exploration and model development
- Streamlit web app for demo and deployment
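The tokenization and sequence-padding step listed above can be sketched with Keras as follows. `VOCAB_SIZE` and `MAX_LEN` here are hypothetical placeholders; the actual values are set in `notebooks/tokenization.ipynb`.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical hyperparameters; the real values live in the notebooks.
VOCAB_SIZE = 10_000
MAX_LEN = 100

texts = [
    "This post describes a violent crime",
    "A harmless photo of a sunset",
]

# Fit a word-level tokenizer with an out-of-vocabulary token.
tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)

# Convert text to integer sequences, then pad/truncate to a fixed length.
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=MAX_LEN, padding="post")
print(padded.shape)  # (2, 100)
```

The same fitted tokenizer must be reused at inference time, which is why it is serialized to `data/tokenizer.pkl`.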
---
## Usage
- **Preprocessing and Tokenization:**
See `notebooks/Preprocessing.ipynb` and `notebooks/tokenization.ipynb` for step-by-step data cleaning, splitting, and tokenization.
- **Model Training:**
Model architecture and training code are in `models/model.py`.
- **Inference:**
Load the trained model (`models/toxic_classifier.h5` or `.keras`) and tokenizer (`data/tokenizer.pkl`) to predict toxicity categories for new samples.
---
## Data
- CSV files with columns: `query`, `image descriptions`, `Toxic Category`, and `Toxic Category Encoded`.
- Data splits: `train.csv`, `eval.csv`, `test.csv`, and `cleaned.csv` for processed data.
- 9 categories: Safe, Violent Crimes, Elections, Sex-Related Crimes, Unsafe, Non-Violent Crimes, Child Sexual Exploitation, Unknown S-Type, Suicide & Self-Harm.
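A minimal sketch of how the `Toxic Category Encoded` column can be derived from `Toxic Category`. The integer codes below are illustrative; the project's actual encoding is produced during preprocessing.

```python
# The nine categories from the dataset; the integer codes are illustrative.
CATEGORIES = [
    "Safe", "Violent Crimes", "Elections", "Sex-Related Crimes", "Unsafe",
    "Non-Violent Crimes", "Child Sexual Exploitation", "Unknown S-Type",
    "Suicide & Self-Harm",
]

# Map each category name to a stable integer code, and back.
label_to_code = {name: i for i, name in enumerate(CATEGORIES)}
code_to_label = {i: name for name, i in label_to_code.items()}

rows = ["Safe", "Unsafe", "Elections"]
encoded = [label_to_code[r] for r in rows]
print(encoded)           # [0, 4, 2]
print(code_to_label[4])  # Unsafe
```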
---
## Model
- Deep learning model built with Keras (TensorFlow backend).
- Multi-class classification with label encoding for toxicity categories.
- Benchmarking with PEFT-LoRA DistilBERT and baseline CNN/LSTM.
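A minimal baseline along the lines described above (Embedding → BiLSTM → softmax). Layer sizes here are hypothetical; the actual architecture is defined in `models/model.py`.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 9      # the nine toxicity categories
VOCAB_SIZE = 10_000  # hypothetical; must match the tokenizer
MAX_LEN = 100

# Embedding -> BiLSTM -> dense softmax baseline for multi-class classification.
model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 128),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
print(model.output_shape)  # (None, 9)
```

`sparse_categorical_crossentropy` fits the integer-encoded labels described in the Data section, so no one-hot conversion is needed.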
---
## Evaluation
- Classification report and confusion matrix are generated for model evaluation.
- See the evaluation steps in `notebooks/Preprocessing.ipynb`.
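The evaluation step amounts to comparing predicted and true class codes. A self-contained sketch with toy labels (the real evaluation runs on the held-out `test.csv` split):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy ground-truth and predicted class codes (e.g. 0 = Safe, 4 = Unsafe).
y_true = [0, 0, 4, 2, 4, 0]
y_pred = [0, 4, 4, 2, 4, 0]

# Per-class precision/recall/F1, then the raw confusion matrix.
print(classification_report(y_true, y_pred, zero_division=0))
print(confusion_matrix(y_true, y_pred))
```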
---
## 🤗 Hugging Face Inference
This model is available on the Hugging Face Hub: [NightPrince/Toxic_Classification](https://huggingface.co/NightPrince/Toxic_Classification)
### Inference API Usage
You can use the Hugging Face Inference API or widget with two fields:
- `text`: The main query or post text
- `image_desc`: The image description (if any)
**Example (Python):**
Because this repo uses a custom multi-input pipeline, call the Inference API with a JSON payload rather than the single-string `text_classification` helper (replace `YOUR_HF_TOKEN` with a real token):
```python
import requests

API_URL = "https://api-inference.huggingface.co/models/NightPrince/Toxic_Classification"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

payload = {
    "inputs": {
        "text": "This is a dangerous post",
        "image_desc": "Knife shown in the image",
    }
}
response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())  # e.g. {'label': 'Violent Crimes', 'score': 0.98}
```
### Custom Pipeline Details
- The model uses a custom `pipeline.py` for multi-input inference.
- The output is a dictionary with the predicted `label` (class name) and `score` (confidence).
- Class names are mapped using `label_map.json`.
**Files in the repo:**
- `pipeline.py` (custom inference logic)
- `tokenizer.json` (Keras tokenizer)
- `label_map.json` (class code to name mapping)
- TensorFlow SavedModel files (`saved_model.pb`, `variables/`)
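The `label_map.json` lookup inside the custom pipeline can be sketched as follows. The JSON content and scores below are illustrative stand-ins; the repo's actual `label_map.json` covers all nine class codes.

```python
import json

# Illustrative stand-in for label_map.json (real file maps all nine codes).
label_map_json = '{"0": "Safe", "1": "Violent Crimes", "2": "Elections"}'
label_map = json.loads(label_map_json)

# A softmax output from the model (hypothetical scores).
probs = [0.05, 0.90, 0.05]
best = max(range(len(probs)), key=probs.__getitem__)

# Map the argmax index to its class name, as pipeline.py does.
result = {"label": label_map[str(best)], "score": probs[best]}
print(result)  # {'label': 'Violent Crimes', 'score': 0.9}
```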
**Requirements:**
```
tensorflow
keras
numpy
```
---
## 📚 Resources
- [Cellula Internship Project Proposal](#)
- [BLIP: Bootstrapped Language-Image Pre-training](https://github.com/salesforce/BLIP)
- [Llama Guard](https://llama.meta.com/llama-guard/)
- [DistilBERT](https://huggingface.co/distilbert-base-uncased)
- [Streamlit](https://streamlit.io/)
---
## License
MIT License
---
**Author:** Yahya Muhammad Alnwsany
**Contact:** yahyaalnwsany39@gmail.com
**Portfolio:** https://nightprincey.github.io/Portfolio/