NightPrince committed · verified
Commit 9b57c88 · 1 Parent(s): 3a6bc73

Upload README.md

Files changed (1)
README.md +181 -119
README.md CHANGED
@@ -1,119 +1,181 @@
- ---
- language: en
- tags:
- - toxic-content
- - text-classification
- - keras
- - tensorflow
- - deep-learning
- - safety
- - multiclass
- license: mit
- datasets:
- - custom
- metrics:
- - accuracy
- - f1
- pipeline_tag: text-classification
- model-index:
- - name: Toxic_Classification
-   results: []
- ---
-
- # Toxic_Classification (Keras / TensorFlow Model)
-
- This is a **multi-class text classification model** for toxic content detection.
- It was trained as part of the **Cellula Internship - Safe and Responsible Multi-Modal Toxic Content Moderation** project.
-
- ---
-
- ## 🚩 Task: Multi-class Toxic Content Detection
-
- The model classifies text (query + image description) into **9 categories:**
-
- | Label ID | Category                  |
- |----------|---------------------------|
- | 0        | Child Sexual Exploitation |
- | 1        | Elections                 |
- | 2        | Non-Violent Crimes        |
- | 3        | Safe                      |
- | 4        | Sex-Related Crimes        |
- | 5        | Suicide & Self-Harm       |
- | 6        | Unknown S-Type            |
- | 7        | Violent Crimes            |
- | 8        | Unsafe                    |
-
- ---
-
- ## ✅ Model Details
-
- - **Framework:** TensorFlow 2.19.0 + Keras 3.7.0
- - **Input:** Text + image description (concatenated string)
- - **Tokenizer:** JSON tokenizer (`tokenizer.json`) with OOV handling and a vocab size of 10,000
- - **Max Sequence Length:** 150 tokens
- - **Output:** Softmax probabilities over 9 classes
-
- ---
-
- ## ✅ Files Included in this Repository:
-
- | File                     | Description                                                  |
- |--------------------------|--------------------------------------------------------------|
- | `toxic_classifier.keras` | Saved Keras v3 model file                                    |
- | `tokenizer.json`         | Keras tokenizer for preprocessing                            |
- | `config.json`            | Model configuration (architecture, vocab size, labels, etc.) |
- | `requirements.txt`       | Python dependencies                                          |
- | `README.md`              | This model card                                              |
-
- ---
-
- ## ✅ Example Usage (Python):
-
- ```python
- from keras.saving import load_model
- from tensorflow.keras.preprocessing.text import tokenizer_from_json
- from tensorflow.keras.preprocessing.sequence import pad_sequences
- import numpy as np
-
- # Load tokenizer
- with open("tokenizer.json", "r", encoding="utf-8") as f:
-     tokenizer = tokenizer_from_json(f.read())
-
- # Load model
- model = load_model("toxic_classifier.keras")
-
- # Example inference: the query and image description are concatenated
- query = "Example user query"
- image_desc = "Image describes a dangerous situation"
- text = query + " " + image_desc
-
- sequence = tokenizer.texts_to_sequences([text])
- padded = pad_sequences(sequence, maxlen=150, padding='post', truncating='post')
-
- prediction = model.predict(padded)
- predicted_label = np.argmax(prediction, axis=1)[0]
- print(f"Predicted Label ID: {predicted_label}")
- ```
-
- ---
-
- ## 📚 Resources
-
- - [Cellula Internship Project Proposal](#)
- - [BLIP: Bootstrapped Language-Image Pre-training](https://github.com/salesforce/BLIP)
- - [Llama Guard](https://llama.meta.com/llama-guard/)
- - [DistilBERT](https://huggingface.co/distilbert-base-uncased)
- - [Streamlit](https://streamlit.io/)
-
- ---
-
- ## License
-
- MIT License
-
- ---
-
- **Author:** Yahya Muhammad Alnwsany
- **Contact:** yahyaalnwsany39@gmail.com
- **Portfolio:** https://nightprincey.github.io/Portfolio/
+ # Toxic-Predict
+
+ Toxic-Predict is a machine learning project developed as part of the Cellula Internship, focused on safe and responsible multi-modal toxic content moderation. It classifies text queries and image descriptions into nine toxicity categories, such as "Safe", "Violent Crimes", "Non-Violent Crimes", and "Unsafe". The project combines deep learning (Keras/TensorFlow), NLP preprocessing, and benchmarking against modern transformer models to build and evaluate a robust multi-class toxic content classifier.
+
+ ---
+
+ ## 🚩 Project Context
+
+ This project is part of the **Cellula Internship** proposal:
+ **"Safe and Responsible Multi-Modal Toxic Content Moderation."**
+ The goal is to build a dual-stage moderation pipeline for both text and images, combining hard guardrails (Llama Guard) and soft classification (DistilBERT/deep learning) for nuanced, policy-compliant moderation.
+
+ ---
+
+ ## Project Structure
+
+ ```
+ .
+ ├── app.py
+ ├── run.py
+ ├── test.py
+ ├── requirements.txt
+ ├── README.md
+ ├── data/
+ │   ├── cellula-toxic.csv
+ │   ├── cleaned.csv
+ │   ├── eval.csv
+ │   ├── test.csv
+ │   ├── tokenizer.pkl
+ │   └── train.csv
+ ├── models/
+ │   ├── model.py
+ │   ├── toxic_classifier.h5
+ │   └── toxic_classifier.keras
+ ├── notebooks/
+ │   ├── Preprocessing.ipynb
+ │   └── tokenization.ipynb
+ └── src/
+     ├── preprocess.py
+     └── tokenize_and_split.py
+ ```
+
+ ---
+
+ ## Features
+
+ - Dual-stage moderation: hard filter (Llama Guard) + soft classifier (DistilBERT/CNN/LSTM)
+ - Data cleaning, preprocessing, and label encoding
+ - Tokenization and sequence padding for text data
+ - Deep learning and transformer-based models for multi-class toxicity classification
+ - Evaluation metrics: classification report and confusion matrix
+ - Jupyter notebooks for data exploration and model development
+ - Streamlit web app for demo and deployment
+
+ ---
+
+ ## Setup
+
+ 1. **Clone the repository**
+    ```sh
+    git clone https://github.com/yourusername/toxic-predict.git
+    cd toxic-predict
+    ```
+
+ 2. **Install dependencies**
+    ```sh
+    pip install -r requirements.txt
+    ```
+
+ 3. **Prepare data**
+    - Place your data files in the `data/` directory if they are not already present.
+
+ 4. **Train the model**
+    - Use the scripts in `src/` or the Jupyter notebooks in `notebooks/` to preprocess the data and train the model.
+
+ 5. **Run predictions**
+    - Use `app.py` or `run.py` to run inference on new data; a launch example follows.
+
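+ Assuming `app.py` is the Streamlit entry point mentioned under Features (an assumption about this repo's layout), the demo can be launched with:
+
+ ```sh
+ # Start the Streamlit demo locally (serves on http://localhost:8501 by default)
+ streamlit run app.py
+ ```
+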
+ ---
+
+ ## Usage
+
+ - **Preprocessing and Tokenization:**
+   See `notebooks/Preprocessing.ipynb` and `notebooks/tokenization.ipynb` for step-by-step data cleaning, splitting, and tokenization.
+ - **Model Training:**
+   Model architecture and training code are in `models/model.py`.
+ - **Inference:**
+   Load the trained model (`models/toxic_classifier.h5` or `.keras`) and tokenizer (`data/tokenizer.pkl`) to predict toxicity categories for new samples, as in the sketch below.
+
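+ A minimal inference sketch. It assumes `data/tokenizer.pkl` holds a fitted Keras `Tokenizer` and that inputs were padded to 150 tokens, as documented in the model card; adjust to your actual artifacts:
+
+ ```python
+ import pickle
+
+ import numpy as np
+ from keras.saving import load_model
+ from tensorflow.keras.preprocessing.sequence import pad_sequences
+
+ # Load the trained classifier and the fitted tokenizer.
+ model = load_model("models/toxic_classifier.keras")
+ with open("data/tokenizer.pkl", "rb") as f:
+     tokenizer = pickle.load(f)  # assumed to be a Keras Tokenizer
+
+ # The classifier takes the query and the image description as one string.
+ text = "Example user query" + " " + "Image describes a dangerous situation"
+ seq = tokenizer.texts_to_sequences([text])
+ padded = pad_sequences(seq, maxlen=150, padding="post", truncating="post")
+
+ probs = model.predict(padded)            # softmax over the 9 classes
+ print(int(np.argmax(probs, axis=1)[0]))  # encoded Toxic Category
+ ```
+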
+ ---
+
+ ## Data
+
+ - CSV files with the columns `query`, `image descriptions`, `Toxic Category`, and `Toxic Category Encoded` (loaded in the sketch below).
+ - Data splits: `train.csv`, `eval.csv`, `test.csv`, and `cleaned.csv` for processed data.
+ - 9 categories: Safe, Violent Crimes, Elections, Sex-Related Crimes, Unsafe, Non-Violent Crimes, Child Sexual Exploitation, Unknown S-Type, Suicide & Self-Harm.
+
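+ A short loading sketch based on the column names above (the exact header spelling is assumed to match the CSVs):
+
+ ```python
+ import pandas as pd
+
+ # Load one of the documented splits.
+ train = pd.read_csv("data/train.csv")
+
+ # Build the single text input the classifier consumes:
+ # the query followed by the image description.
+ train["text"] = train["query"] + " " + train["image descriptions"]
+
+ X = train["text"]
+ y = train["Toxic Category Encoded"]  # integer labels for the 9 categories
+ print(train["Toxic Category"].value_counts())
+ ```
+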
+ ---
+
+ ## Model
+
+ - Deep learning model built with Keras (TensorFlow backend); a minimal baseline is sketched below.
+ - Multi-class classification with label encoding for the toxicity categories.
+ - Benchmarking with PEFT-LoRA DistilBERT and baseline CNN/LSTM models.
+
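+ The actual architecture lives in `models/model.py`; the following is only an illustrative baseline (a bidirectional LSTM over an embedding, with assumed layer sizes), consistent with the documented vocab size of 10,000 and max length of 150:
+
+ ```python
+ from tensorflow import keras
+ from tensorflow.keras import layers
+
+ VOCAB_SIZE = 10_000  # tokenizer vocab size from the model card
+ MAX_LEN = 150        # documented max sequence length
+ NUM_CLASSES = 9      # one per toxicity category
+
+ model = keras.Sequential([
+     keras.Input(shape=(MAX_LEN,)),
+     layers.Embedding(VOCAB_SIZE, 128),      # embedding dim is an assumption
+     layers.Bidirectional(layers.LSTM(64)),  # baseline recurrent encoder
+     layers.Dropout(0.3),
+     layers.Dense(NUM_CLASSES, activation="softmax"),
+ ])
+ model.compile(
+     optimizer="adam",
+     loss="sparse_categorical_crossentropy",  # integer-encoded labels
+     metrics=["accuracy"],
+ )
+ ```
+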
+ ---
+
+ ## Evaluation
+
+ - A classification report and a confusion matrix are generated for model evaluation (see the sketch below).
+ - The evaluation steps are in `notebooks/Preprocessing.ipynb`.
+
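+ A minimal evaluation sketch; `y_true` and `y_prob` stand in for the encoded test labels and the model's predicted probabilities (dummy values shown so the snippet runs):
+
+ ```python
+ import numpy as np
+ from sklearn.metrics import classification_report, confusion_matrix
+
+ # Placeholders: in practice, y_true comes from test.csv and
+ # y_prob from model.predict(padded_test).
+ y_true = np.array([3, 7, 3, 8])   # e.g. Safe, Violent Crimes, Safe, Unsafe
+ y_prob = np.eye(9)[[3, 7, 3, 2]]  # fake softmax outputs for the sketch
+ y_pred = np.argmax(y_prob, axis=1)
+
+ print(classification_report(y_true, y_pred))  # per-class precision/recall/F1
+ print(confusion_matrix(y_true, y_pred))       # rows = true class, cols = predicted
+ ```
+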
+ ---
+
+ ## 🤗 Hugging Face Inference
+
+ This model is available on the Hugging Face Hub: [NightPrince/Toxic_Classification](https://huggingface.co/NightPrince/Toxic_Classification)
+
+ ### Inference API Usage
+
+ You can use the Hugging Face Inference API or widget with two fields:
+
+ - `text`: the main query or post text
+ - `image_desc`: the image description (if any)
+
+ **Example (Python):**
+
+ ```python
+ from huggingface_hub import InferenceClient
+
+ client = InferenceClient("NightPrince/Toxic_Classification")
+ result = client.text_classification({
+     "text": "This is a dangerous post",
+     "image_desc": "Knife shown in the image",
+ })
+ print(result)  # e.g. {'label': 'Violent Crimes', 'score': 0.98}
+ ```
+
+ ### Custom Pipeline Details
+
+ - The model uses a custom `pipeline.py` for multi-input inference (a local equivalent is sketched after this section).
+ - The output is a dictionary with the predicted `label` (class name) and `score` (confidence).
+ - Class names are mapped using `label_map.json`.
+
+ **Files in the repo:**
+ - `pipeline.py` (custom inference logic)
+ - `tokenizer.json` (Keras tokenizer)
+ - `label_map.json` (class code to name mapping)
+ - TensorFlow SavedModel files (`saved_model.pb`, `variables/`)
+
+ **Requirements:**
+ ```
+ tensorflow
+ keras
+ numpy
+ ```
+
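+ A local-inference sketch using these files. The serving-signature handling and the `label_map.json` layout (class index as string → class name) are assumptions; adapt to what `pipeline.py` actually does:
+
+ ```python
+ import json
+
+ import numpy as np
+ import tensorflow as tf
+ from tensorflow.keras.preprocessing.sequence import pad_sequences
+ from tensorflow.keras.preprocessing.text import tokenizer_from_json
+
+ # Load the repo artifacts (run from the repo root).
+ model = tf.saved_model.load(".")  # directory holding saved_model.pb + variables/
+ with open("tokenizer.json", encoding="utf-8") as f:
+     tokenizer = tokenizer_from_json(f.read())
+ with open("label_map.json", encoding="utf-8") as f:
+     label_map = json.load(f)  # assumed: {"0": "Child Sexual Exploitation", ...}
+
+ # Concatenate the two inputs, as in the model card's example.
+ text = "This is a dangerous post" + " " + "Knife shown in the image"
+ seq = pad_sequences(tokenizer.texts_to_sequences([text]),
+                     maxlen=150, padding="post", truncating="post")
+
+ # Feed the padded sequence to the default serving signature,
+ # taking the input name and dtype from the signature itself.
+ infer = model.signatures["serving_default"]
+ name, spec = next(iter(infer.structured_input_signature[1].items()))
+ probs = next(iter(infer(**{name: tf.constant(seq, dtype=spec.dtype)}).values()))
+ probs = probs.numpy()[0]
+ print({"label": label_map[str(int(np.argmax(probs)))], "score": float(probs.max())})
+ ```
+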
+ ---
+
+ ## 📚 Resources
+
+ - [Cellula Internship Project Proposal](#)
+ - [BLIP: Bootstrapped Language-Image Pre-training](https://github.com/salesforce/BLIP)
+ - [Llama Guard](https://llama.meta.com/llama-guard/)
+ - [DistilBERT](https://huggingface.co/distilbert-base-uncased)
+ - [Streamlit](https://streamlit.io/)
+
+ ---
+
+ ## License
+
+ MIT License
+
+ ---
+
+ **Author:** Yahya Muhammad Alnwsany
+ **Contact:** yahyaalnwsany39@gmail.com
+ **Portfolio:** https://nightprincey.github.io/Portfolio/