NightPrince committed · verified
Commit 9b57c88 · 1 Parent(s): 3a6bc73

Upload README.md

Files changed (1)
README.md +181 -119
README.md CHANGED
@@ -1,119 +1,181 @@
- ---
- language: en
- tags:
- - toxic-content
- - text-classification
- - keras
- - tensorflow
- - deep-learning
- - safety
- - multiclass
- license: mit
- datasets:
- - custom
- metrics:
- - accuracy
- - f1
- pipeline_tag: text-classification
- model-index:
- - name: Toxic_Classification
-   results: []
- ---
-
- # Toxic_Classification (Keras / TensorFlow Model)
-
- This is a **multi-class text classification model** for toxic content detection.
- It was trained as part of the **Cellula Internship - Safe and Responsible Multi-Modal Toxic Content Moderation** project.
-
- ---
-
- ## 🚩 Task: Multi-class Toxic Content Detection
-
- The model classifies text (query + image description) into **9 categories:**
-
- | Label ID | Category                  |
- |----------|---------------------------|
- | 0        | Child Sexual Exploitation |
- | 1        | Elections                 |
- | 2        | Non-Violent Crimes        |
- | 3        | Safe                      |
- | 4        | Sex-Related Crimes        |
- | 5        | Suicide & Self-Harm       |
- | 6        | Unknown S-Type            |
- | 7        | Violent Crimes            |
- | 8        | Unsafe                    |
-
- ---
-
- ## ✅ Model Details
-
- - **Framework:** TensorFlow 2.19.0 + Keras 3.7.0
- - **Input:** Text + image description (concatenated string)
- - **Tokenizer:** JSON tokenizer (`tokenizer.json`) with OOV handling and a vocab size of 10,000
- - **Max Sequence Length:** 150 tokens
- - **Output:** Softmax probabilities over 9 classes
-
- ---
-
- ## ✅ Files Included in this Repository:
-
- | File                     | Description                                                  |
- |--------------------------|--------------------------------------------------------------|
- | `toxic_classifier.keras` | Saved Keras v3 model file                                    |
- | `tokenizer.json`         | Keras tokenizer for preprocessing                            |
- | `config.json`            | Model configuration (architecture, vocab size, labels, etc.) |
- | `requirements.txt`       | Python dependencies                                          |
- | `README.md`              | This model card                                              |
-
- ---
-
- ## ✅ Example Usage (Python):
-
- ```python
- from keras.saving import load_model
- from tensorflow.keras.preprocessing.text import tokenizer_from_json
- from tensorflow.keras.preprocessing.sequence import pad_sequences
- import numpy as np
-
- # Load tokenizer
- with open("tokenizer.json", "r", encoding="utf-8") as f:
-     tokenizer = tokenizer_from_json(f.read())
-
- # Load model
- model = load_model("toxic_classifier.keras")
-
- # Example inference: the query and image description are concatenated
- query = "Example user query"
- image_desc = "Image describes a dangerous situation"
- text = query + " " + image_desc
-
- sequence = tokenizer.texts_to_sequences([text])
- padded = pad_sequences(sequence, maxlen=150, padding='post', truncating='post')
-
- prediction = model.predict(padded)
- predicted_label = np.argmax(prediction, axis=1)[0]
- print(f"Predicted Label ID: {predicted_label}")
- ```
-
- ---
-
- ## 📚 Resources
-
- - [Cellula Internship Project Proposal](#)
- - [BLIP: Bootstrapped Language-Image Pre-training](https://github.com/salesforce/BLIP)
- - [Llama Guard](https://llama.meta.com/llama-guard/)
- - [DistilBERT](https://huggingface.co/distilbert-base-uncased)
- - [Streamlit](https://streamlit.io/)
-
- ---
-
- ## License
-
- MIT License
-
- ---
-
- **Author:** Yahya Muhammad Alnwsany
- **Contact:** yahyaalnwsany39@gmail.com
- **Portfolio:** https://nightprincey.github.io/Portfolio/
+ # Toxic-Predict
+
+ Toxic-Predict is a machine learning project developed as part of the Cellula Internship, focused on safe and responsible multi-modal toxic content moderation. It classifies text queries and image descriptions into nine toxicity categories, such as "Safe", "Violent Crimes", "Non-Violent Crimes", and "Unsafe". The project combines deep learning (Keras/TensorFlow), NLP preprocessing, and benchmarking against modern transformer models to build and evaluate a robust multi-class toxic content classifier.
+
+ ---
+
+ ## 🚩 Project Context
+
+ This project is part of the **Cellula Internship** proposal:
+ **"Safe and Responsible Multi-Modal Toxic Content Moderation."**
+ The goal is to build a dual-stage moderation pipeline for both text and images, combining hard guardrails (Llama Guard) and soft classification (DistilBERT/deep learning) for nuanced, policy-compliant moderation.
+
+ ---
+
+ ## Project Structure
+
+ ```
+ .
+ ├── app.py
+ ├── run.py
+ ├── test.py
+ ├── requirements.txt
+ ├── README.md
+ ├── data/
+ │   ├── cellula-toxic.csv
+ │   ├── cleaned.csv
+ │   ├── eval.csv
+ │   ├── test.csv
+ │   ├── tokenizer.pkl
+ │   └── train.csv
+ ├── models/
+ │   ├── model.py
+ │   ├── toxic_classifier.h5
+ │   └── toxic_classifier.keras
+ ├── notebooks/
+ │   ├── Preprocessing.ipynb
+ │   └── tokenization.ipynb
+ └── src/
+     ├── preprocess.py
+     └── tokenize_and_split.py
+ ```
+
+ ---
+
+ ## Features
+
+ - Dual-stage moderation: hard filter (Llama Guard) + soft classifier (DistilBERT/CNN/LSTM)
+ - Data cleaning, preprocessing, and label encoding
+ - Tokenization and sequence padding for text data
+ - Deep learning and transformer-based models for multi-class toxicity classification
+ - Evaluation metrics: classification report and confusion matrix
+ - Jupyter notebooks for data exploration and model development
+ - Streamlit web app for demo and deployment
+
+ ---
+
+ ## Setup
+
+ 1. **Clone the repository**
+    ```sh
+    git clone https://github.com/yourusername/toxic-predict.git
+    cd toxic-predict
+    ```
+
+ 2. **Install dependencies**
+    ```sh
+    pip install -r requirements.txt
+    ```
+
+ 3. **Prepare data**
+    - Place your data files in the `data/` directory if they are not already present.
+
+ 4. **Train the model**
+    - Use the scripts in `src/` or the Jupyter notebooks in `notebooks/` to preprocess the data and train the model.
+
+ 5. **Run predictions**
+    - Use `app.py` or `run.py` to run inference on new data; a launch example follows.
+
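+ Assuming `app.py` is the Streamlit entry point mentioned under Features (an assumption about this repo's layout), the demo can be launched with:
+
+ ```sh
+ # Start the Streamlit demo locally (serves on http://localhost:8501 by default)
+ streamlit run app.py
+ ```
+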
+ ---
+
+ ## Usage
+
+ - **Preprocessing and Tokenization:**
+   See `notebooks/Preprocessing.ipynb` and `notebooks/tokenization.ipynb` for step-by-step data cleaning, splitting, and tokenization.
+ - **Model Training:**
+   Model architecture and training code are in `models/model.py`.
+ - **Inference:**
+   Load the trained model (`models/toxic_classifier.h5` or `.keras`) and tokenizer (`data/tokenizer.pkl`) to predict toxicity categories for new samples, as in the sketch below.
+
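+ A minimal inference sketch. It assumes `data/tokenizer.pkl` holds a fitted Keras `Tokenizer` and that inputs were padded to 150 tokens, as documented in the model card; adjust to your actual artifacts:
+
+ ```python
+ import pickle
+
+ import numpy as np
+ from keras.saving import load_model
+ from tensorflow.keras.preprocessing.sequence import pad_sequences
+
+ # Load the trained classifier and the fitted tokenizer.
+ model = load_model("models/toxic_classifier.keras")
+ with open("data/tokenizer.pkl", "rb") as f:
+     tokenizer = pickle.load(f)  # assumed to be a Keras Tokenizer
+
+ # The classifier takes the query and the image description as one string.
+ text = "Example user query" + " " + "Image describes a dangerous situation"
+ seq = tokenizer.texts_to_sequences([text])
+ padded = pad_sequences(seq, maxlen=150, padding="post", truncating="post")
+
+ probs = model.predict(padded)            # softmax over the 9 classes
+ print(int(np.argmax(probs, axis=1)[0]))  # encoded Toxic Category
+ ```
+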
+ ---
+
+ ## Data
+
+ - CSV files with the columns `query`, `image descriptions`, `Toxic Category`, and `Toxic Category Encoded` (loaded in the sketch below).
+ - Data splits: `train.csv`, `eval.csv`, `test.csv`, and `cleaned.csv` for processed data.
+ - 9 categories: Safe, Violent Crimes, Elections, Sex-Related Crimes, Unsafe, Non-Violent Crimes, Child Sexual Exploitation, Unknown S-Type, Suicide & Self-Harm.
+
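+ A short loading sketch based on the column names above (the exact header spelling is assumed to match the CSVs):
+
+ ```python
+ import pandas as pd
+
+ # Load one of the documented splits.
+ train = pd.read_csv("data/train.csv")
+
+ # Build the single text input the classifier consumes:
+ # the query followed by the image description.
+ train["text"] = train["query"] + " " + train["image descriptions"]
+
+ X = train["text"]
+ y = train["Toxic Category Encoded"]  # integer labels for the 9 categories
+ print(train["Toxic Category"].value_counts())
+ ```
+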
+ ---
+
+ ## Model
+
+ - Deep learning model built with Keras (TensorFlow backend); a minimal baseline is sketched below.
+ - Multi-class classification with label encoding for the toxicity categories.
+ - Benchmarking with PEFT-LoRA DistilBERT and baseline CNN/LSTM models.
+
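+ The actual architecture lives in `models/model.py`; the following is only an illustrative baseline (a bidirectional LSTM over an embedding, with assumed layer sizes), consistent with the documented vocab size of 10,000 and max length of 150:
+
+ ```python
+ from tensorflow import keras
+ from tensorflow.keras import layers
+
+ VOCAB_SIZE = 10_000  # tokenizer vocab size from the model card
+ MAX_LEN = 150        # documented max sequence length
+ NUM_CLASSES = 9      # one per toxicity category
+
+ model = keras.Sequential([
+     keras.Input(shape=(MAX_LEN,)),
+     layers.Embedding(VOCAB_SIZE, 128),      # embedding dim is an assumption
+     layers.Bidirectional(layers.LSTM(64)),  # baseline recurrent encoder
+     layers.Dropout(0.3),
+     layers.Dense(NUM_CLASSES, activation="softmax"),
+ ])
+ model.compile(
+     optimizer="adam",
+     loss="sparse_categorical_crossentropy",  # integer-encoded labels
+     metrics=["accuracy"],
+ )
+ ```
+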
+ ---
+
+ ## Evaluation
+
+ - A classification report and a confusion matrix are generated for model evaluation (see the sketch below).
+ - The evaluation steps are in `notebooks/Preprocessing.ipynb`.
+
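+ A minimal evaluation sketch; `y_true` and `y_prob` stand in for the encoded test labels and the model's predicted probabilities (dummy values shown so the snippet runs):
+
+ ```python
+ import numpy as np
+ from sklearn.metrics import classification_report, confusion_matrix
+
+ # Placeholders: in practice, y_true comes from test.csv and
+ # y_prob from model.predict(padded_test).
+ y_true = np.array([3, 7, 3, 8])   # e.g. Safe, Violent Crimes, Safe, Unsafe
+ y_prob = np.eye(9)[[3, 7, 3, 2]]  # fake softmax outputs for the sketch
+ y_pred = np.argmax(y_prob, axis=1)
+
+ print(classification_report(y_true, y_pred))  # per-class precision/recall/F1
+ print(confusion_matrix(y_true, y_pred))       # rows = true class, cols = predicted
+ ```
+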
+ ---
+
+ ## 🤗 Hugging Face Inference
+
+ This model is available on the Hugging Face Hub: [NightPrince/Toxic_Classification](https://huggingface.co/NightPrince/Toxic_Classification)
+
+ ### Inference API Usage
+
+ You can use the Hugging Face Inference API or widget with two fields:
+
+ - `text`: the main query or post text
+ - `image_desc`: the image description (if any)
+
+ **Example (Python):**
+
+ ```python
+ from huggingface_hub import InferenceClient
+
+ client = InferenceClient("NightPrince/Toxic_Classification")
+ result = client.text_classification({
+     "text": "This is a dangerous post",
+     "image_desc": "Knife shown in the image",
+ })
+ print(result)  # e.g. {'label': 'Violent Crimes', 'score': 0.98}
+ ```
+
+ ### Custom Pipeline Details
+
+ - The model uses a custom `pipeline.py` for multi-input inference (a local equivalent is sketched after this section).
+ - The output is a dictionary with the predicted `label` (class name) and `score` (confidence).
+ - Class names are mapped using `label_map.json`.
+
+ **Files in the repo:**
+ - `pipeline.py` (custom inference logic)
+ - `tokenizer.json` (Keras tokenizer)
+ - `label_map.json` (class code to name mapping)
+ - TensorFlow SavedModel files (`saved_model.pb`, `variables/`)
+
+ **Requirements:**
+ ```
+ tensorflow
+ keras
+ numpy
+ ```
+
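+ A local-inference sketch using these files. The serving-signature handling and the `label_map.json` layout (class index as string → class name) are assumptions; adapt to what `pipeline.py` actually does:
+
+ ```python
+ import json
+
+ import numpy as np
+ import tensorflow as tf
+ from tensorflow.keras.preprocessing.sequence import pad_sequences
+ from tensorflow.keras.preprocessing.text import tokenizer_from_json
+
+ # Load the repo artifacts (run from the repo root).
+ model = tf.saved_model.load(".")  # directory holding saved_model.pb + variables/
+ with open("tokenizer.json", encoding="utf-8") as f:
+     tokenizer = tokenizer_from_json(f.read())
+ with open("label_map.json", encoding="utf-8") as f:
+     label_map = json.load(f)  # assumed: {"0": "Child Sexual Exploitation", ...}
+
+ # Concatenate the two inputs, as in the model card's example.
+ text = "This is a dangerous post" + " " + "Knife shown in the image"
+ seq = pad_sequences(tokenizer.texts_to_sequences([text]),
+                     maxlen=150, padding="post", truncating="post")
+
+ # Feed the padded sequence to the default serving signature,
+ # taking the input name and dtype from the signature itself.
+ infer = model.signatures["serving_default"]
+ name, spec = next(iter(infer.structured_input_signature[1].items()))
+ probs = next(iter(infer(**{name: tf.constant(seq, dtype=spec.dtype)}).values()))
+ probs = probs.numpy()[0]
+ print({"label": label_map[str(int(np.argmax(probs)))], "score": float(probs.max())})
+ ```
+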
+ ---
+
+ ## 📚 Resources
+
+ - [Cellula Internship Project Proposal](#)
+ - [BLIP: Bootstrapped Language-Image Pre-training](https://github.com/salesforce/BLIP)
+ - [Llama Guard](https://llama.meta.com/llama-guard/)
+ - [DistilBERT](https://huggingface.co/distilbert-base-uncased)
+ - [Streamlit](https://streamlit.io/)
+
+ ---
+
+ ## License
+
+ MIT License
+
+ ---
+
+ **Author:** Yahya Muhammad Alnwsany
+ **Contact:** yahyaalnwsany39@gmail.com
+ **Portfolio:** https://nightprincey.github.io/Portfolio/