Prudhvinath07 committed on
Commit dec266f · 1 Parent(s): 145a122

added all files

.DS_Store ADDED
Binary file (6.15 kB).
 
.dockerignore ADDED
@@ -0,0 +1,60 @@
+ # Git
+ .git
+ .gitignore
+
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ env/
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+
+ # Virtual Environment
+ venv/
+ ENV/
+
+ # IDE specific files
+ .idea/
+ .vscode/
+ *.swp
+ *.swo
+
+ # OS specific files
+ .DS_Store
+ .DS_Store?
+ ._*
+ .Spotlight-V100
+ .Trashes
+ ehthumbs.db
+ Thumbs.db
+
+ # Docker and deployment files
+ Dockerfile
+ .dockerignore
+ build_docker.sh
+ DEPLOY_TO_HUGGINGFACE.md
+ .space
+ deploy_to_huggingface.sh
+
+ # Test files that aren't needed for deployment
+ test_*.py
+ CLI_interactive_test.py
+
+ # Training scripts not needed for inference
+ train.py
+ src/train.py
.space ADDED
@@ -0,0 +1,7 @@
+ title: Toxic Comment Classifier
+ emoji: 🔍
+ colorFrom: blue
+ colorTo: indigo
+ sdk: docker
+ pinned: false
+ license: mit
Dockerfile ADDED
@@ -0,0 +1,28 @@
+ FROM python:3.9-slim
+
+ WORKDIR /app
+
+ # Copy requirements first for better caching
+ COPY requirements.txt .
+
+ # Install dependencies
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy the rest of the application
+ COPY . .
+
+ # Download NLTK data
+ RUN python -c "import nltk; nltk.download('punkt')"
+
+ # Make port 7860 available for Hugging Face Spaces
+ EXPOSE 7860
+
+ # Set environment variables for Streamlit
+ ENV PYTHONUNBUFFERED=1 \
+     PYTHONDONTWRITEBYTECODE=1 \
+     STREAMLIT_SERVER_PORT=7860 \
+     STREAMLIT_SERVER_HEADLESS=true \
+     STREAMLIT_SERVER_ENABLE_CORS=false
+
+ # Command to run the application
+ CMD ["streamlit", "run", "app.py", "--server.address=0.0.0.0"]
README.md CHANGED
@@ -1,10 +1,159 @@
- ---
- title: Toxic Comment Classification Using Bert
- emoji: 🏃
- colorFrom: pink
- colorTo: purple
- sdk: docker
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Toxic Comment Classification using BERT
+
+ A machine learning project that uses BERT (Bidirectional Encoder Representations from Transformers) to classify toxic comments. It provides both a web interface and CLI tools for detecting several types of toxicity.
+
+ ## 🌟 Features
+
+ - Real-time toxic comment classification
+ - Interactive web interface using Streamlit
+ - Command-line interface for batch processing
+ - Support for multiple toxicity categories
+ - Visualization of toxicity scores using Plotly
+ - GPU acceleration support (when available)
+
+ ## 🛠️ Prerequisites
+
+ - Python 3.7+
+ - CUDA-compatible GPU (optional, for faster processing)
+ - Git
+
+ ## 📦 Installation
+
+ 1. Clone the repository:
+ ```bash
+ git clone https://github.com/yourusername/commentclassification_using_bert_model.git
+ cd commentclassification_using_bert_model
+ ```
+
+ 2. Create and activate a virtual environment:
+ ```bash
+ python -m venv venv
+ source venv/bin/activate  # On Windows, use: venv\Scripts\activate
+ ```
+
+ 3. Install the required packages:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ## 🚀 Usage
+
+ ### Web Interface
+
+ 1. Start the Streamlit application:
+ ```bash
+ streamlit run app.py
+ ```
+ 2. Open your browser and navigate to the displayed URL (typically http://localhost:8501)
+ 3. Enter text in the input field to get toxicity predictions
+ 4. View the visualization of toxicity scores through an interactive chart
+
+ ### Docker Container
+
+ 1. Build the Docker image:
+ ```bash
+ docker build -t toxic-comment-classifier .
+ ```
+ 2. Run the Docker container:
+ ```bash
+ docker run -p 7860:7860 toxic-comment-classifier
+ ```
+ 3. Open your browser and navigate to http://localhost:7860
+
+ ### Hugging Face Spaces Deployment
+
+ This project can be deployed to Hugging Face Spaces using Docker:
+
+ 1. Create a new Space on Hugging Face with the Docker SDK
+ 2. Push this repository to the Space
+ 3. Hugging Face will automatically build and deploy the Docker container
+
+ For detailed deployment instructions, see [DEPLOY_TO_HUGGINGFACE.md](DEPLOY_TO_HUGGINGFACE.md).
+
+ ### Command Line Interface
+
+ For interactive testing:
+ ```bash
+ python CLI_interactive_test.py
+ ```
+
+ For model training:
+ ```bash
+ python train.py
+ ```
+
+ For running tests:
+ ```bash
+ python test_model.py
+ ```
+
+ ## 🏗️ Project Structure
+
+ ```
+ ├── app.py                      # Streamlit web application
+ ├── CLI_interactive_test.py     # Command line interface
+ ├── train.py                    # Model training script
+ ├── test_model.py               # Model testing utilities
+ ├── cuda.py                     # CUDA availability check
+ ├── requirements.txt            # Project dependencies
+ ├── setup.py                    # Package setup configuration
+ ├── Dockerfile                  # Docker configuration for containerization
+ ├── .dockerignore               # Files to exclude from the Docker image
+ ├── .space                      # Hugging Face Spaces configuration
+ ├── DEPLOY_TO_HUGGINGFACE.md    # Deployment instructions for Hugging Face
+ ├── deploy_to_huggingface.sh    # Script to help with Hugging Face deployment
+ ├── src/                        # Source code directory
+ ├── models/                     # Saved model checkpoints
+ └── data/                       # Training and test datasets
+ ```
+
+ ## 🔧 Model Architecture
+
+ The project uses a fine-tuned BERT model (bert-base-uncased) with an additional classification head to detect different types of toxicity in text. The model is implemented using PyTorch and the Transformers library.
+
+ Key components:
+ - BERT base model for text encoding
+ - Custom classification head for toxicity detection
+ - Multi-label classification support
+ - Real-time inference capabilities
+
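+ As a quick illustration, inference follows the same path as `app.py`: tokenize, run a forward pass, then apply a sigmoid to the six logits. A minimal sketch, assuming the checkpoint lives at `models/saved/best_model.pt` as in the project structure above:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer
+ from src.models.toxic_classifier import ToxicClassifier
+
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+ model = ToxicClassifier().to(device)
+
+ # Checkpoints may store the weights under 'model_state_dict' or as a bare state dict
+ checkpoint = torch.load("models/saved/best_model.pt", map_location=device, weights_only=True)
+ model.load_state_dict(checkpoint.get("model_state_dict", checkpoint), strict=False)
+ model.eval()
+
+ encoding = tokenizer("your comment here", max_length=128, padding="max_length",
+                      truncation=True, return_tensors="pt")
+ with torch.no_grad():
+     logits = model(encoding["input_ids"].to(device), encoding["attention_mask"].to(device))
+
+ categories = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
+ scores = torch.sigmoid(logits)[0].tolist()
+ print(dict(zip(categories, scores)))
+ ```
+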
+ ## 📊 Performance
+
+ The model is trained to classify text into multiple toxicity categories. It runs in real time and provides a confidence score for each category of toxicity:
+ - Toxic
+ - Severe Toxic
+ - Obscene
+ - Threat
+ - Insult
+ - Identity Hate
+
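+ For example, `ToxicPredictor.predict` in `app.py` returns one probability per category as a dictionary (the numbers below are made up for illustration):
+
+ ```python
+ {
+     "toxic": 0.91,
+     "severe_toxic": 0.12,
+     "obscene": 0.78,
+     "threat": 0.03,
+     "insult": 0.66,
+     "identity_hate": 0.02,
+ }
+ ```
+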
+ ## 💻 Dependencies
+
+ Key dependencies include:
+ - transformers >= 4.35.0
+ - torch >= 1.9.0
+ - streamlit >= 1.24.0
+ - fastapi >= 0.68.0
+ - plotly >= 5.13.0
+ - pandas >= 1.3.0
+ - numpy >= 1.19.0
+
+ ## 🤝 Contributing
+
+ Contributions are welcome! Please feel free to submit a Pull Request. Here's how you can contribute:
+ 1. Fork the repository
+ 2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
+ 3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
+ 4. Push to the branch (`git push origin feature/AmazingFeature`)
+ 5. Open a Pull Request
+
+ ## 📝 License
+
+ This project is licensed under the MIT License - see the LICENSE file for details.
+
+ ## 🙏 Acknowledgments
+
+ - Hugging Face for the Transformers library
+ - The BERT team at Google Research
+ - The Streamlit team for the excellent web framework
+ - The PyTorch team for the deep learning framework
__init__.py ADDED
@@ -0,0 +1 @@
+ # Empty file to make src a package
__pycache__/__init__.cpython-312.pyc ADDED
Binary file (173 Bytes).
 
api/__init__.py ADDED
@@ -0,0 +1 @@
+ # Empty file to make api a package
api/main.py ADDED
@@ -0,0 +1,61 @@
+ import os
+
+ import torch
+ from fastapi import FastAPI, HTTPException
+ from pydantic import BaseModel
+ from transformers import BertTokenizer
+
+ from src.preprocessing.text_processor import TextPreprocessor
+ from src.models.toxic_classifier import ToxicClassifier
+
+ app = FastAPI()
+
+ # Load the tokenizer, preprocessor and model once at startup rather than per request
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+ tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+ preprocessor = TextPreprocessor()
+ model = ToxicClassifier().to(device)
+
+ # Load trained weights if a checkpoint is available (same path app.py uses)
+ MODEL_PATH = os.path.join('models', 'saved', 'best_model.pt')
+ if os.path.exists(MODEL_PATH):
+     checkpoint = torch.load(MODEL_PATH, map_location=device, weights_only=True)
+     model.load_state_dict(checkpoint.get('model_state_dict', checkpoint), strict=False)
+ model.eval()
+
+ class CommentRequest(BaseModel):
+     text: str
+
+ class ToxicityResponse(BaseModel):
+     toxic: float
+     severe_toxic: float
+     obscene: float
+     threat: float
+     insult: float
+     identity_hate: float
+     confidence: float
+
+ @app.post("/predict", response_model=ToxicityResponse)
+ async def predict_toxicity(comment: CommentRequest):
+     try:
+         # Preprocess text (process() returns a token list, so join it back into a string)
+         processed_text = ' '.join(preprocessor.process(comment.text))
+
+         # Tokenize for BERT
+         encoded = tokenizer(
+             processed_text,
+             padding=True,
+             truncation=True,
+             max_length=128,
+             return_tensors='pt'
+         )
+
+         # Get model prediction (raw logits -> sigmoid probabilities)
+         with torch.no_grad():
+             outputs = model(
+                 encoded['input_ids'].to(device),
+                 encoded['attention_mask'].to(device)
+             )
+
+         probabilities = torch.sigmoid(outputs)[0].cpu().numpy()
+         confidence = float(probabilities.max())
+
+         return ToxicityResponse(
+             toxic=float(probabilities[0]),
+             severe_toxic=float(probabilities[1]),
+             obscene=float(probabilities[2]),
+             threat=float(probabilities[3]),
+             insult=float(probabilities[4]),
+             identity_hate=float(probabilities[5]),
+             confidence=confidence
+         )
+
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=str(e))
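A brief note on using this endpoint: the Dockerfile runs the Streamlit app, so the API has to be started separately, for example with `uvicorn api.main:app --port 8000` from the repository root (uvicorn is listed in requirements.txt). A minimal client sketch, assuming that local address and using only the standard library:

```python
import json
from urllib.request import Request, urlopen

# POST a comment to the /predict endpoint and print the per-category scores
request = Request(
    "http://localhost:8000/predict",
    data=json.dumps({"text": "example comment to score"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urlopen(request) as response:
    print(json.load(response))  # toxic, severe_toxic, ..., plus overall confidence
```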
app.py ADDED
@@ -0,0 +1,208 @@
+ import streamlit as st
+ import torch
+ from transformers import AutoTokenizer
+ from src.models.toxic_classifier import ToxicClassifier
+ import os
+ import numpy as np
+ import plotly.graph_objects as go
+ from typing import Dict
+
+ class ToxicPredictor:
+     def __init__(self, model_path: str):
+         self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+
+         # Load tokenizer and model
+         self.tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
+         self.model = ToxicClassifier().to(self.device)
+
+         try:
+             # Load trained weights with weights_only=True for security
+             checkpoint = torch.load(model_path, map_location=self.device, weights_only=True)
+
+             # Handle both old and new model state dict formats
+             if 'model_state_dict' in checkpoint:
+                 state_dict = checkpoint['model_state_dict']
+             else:
+                 state_dict = checkpoint
+
+             # Load state dict and handle any missing/unexpected keys
+             missing_keys, unexpected_keys = self.model.load_state_dict(state_dict, strict=False)
+             if missing_keys:
+                 st.warning(f"Missing keys in state dict: {missing_keys}")
+             if unexpected_keys:
+                 st.warning(f"Unexpected keys in state dict: {unexpected_keys}")
+
+             self.model.eval()
+
+         except Exception as e:
+             st.error(f"Error loading model: {str(e)}")
+             raise
+
+         # Category names
+         self.categories = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
+
+     def predict(self, text: str) -> Dict[str, float]:
+         """Predict toxicity scores for a single text"""
+         try:
+             # Tokenize
+             encoding = self.tokenizer(
+                 text,
+                 add_special_tokens=True,
+                 max_length=128,
+                 padding='max_length',
+                 truncation=True,
+                 return_tensors='pt'
+             )
+
+             # Move to device
+             input_ids = encoding['input_ids'].to(self.device)
+             attention_mask = encoding['attention_mask'].to(self.device)
+
+             # Get predictions
+             with torch.no_grad():
+                 outputs = self.model(input_ids, attention_mask)
+                 probabilities = torch.sigmoid(outputs).cpu().numpy()[0]
+
+             # Create results dictionary
+             results = {
+                 category: float(prob)
+                 for category, prob in zip(self.categories, probabilities)
+             }
+
+             return results
+         except Exception as e:
+             st.error(f"Error during prediction: {str(e)}")
+             raise
+
+ def create_gauge_chart(value: float, title: str) -> go.Figure:
+     """Create a gauge chart for toxicity scores"""
+     fig = go.Figure(go.Indicator(
+         mode="gauge+number",
+         value=value * 100,  # Convert to percentage
+         domain={'x': [0, 1], 'y': [0, 1]},
+         title={'text': title},
+         gauge={
+             'axis': {'range': [0, 100]},
+             'bar': {'color': "darkblue"},
+             'steps': [
+                 {'range': [0, 33], 'color': "lightgreen"},
+                 {'range': [33, 66], 'color': "yellow"},
+                 {'range': [66, 100], 'color': "red"}
+             ],
+             'threshold': {
+                 'line': {'color': "red", 'width': 4},
+                 'thickness': 0.75,
+                 'value': 50
+             }
+         }
+     ))
+
+     fig.update_layout(height=200)
+     return fig
+
+ def main():
+     st.set_page_config(
+         page_title="Toxic Comment Classifier",
+         page_icon="🔍",
+         layout="wide"
+     )
+
+     # Title and description
+     st.title("💬 Toxic Comment Classifier")
+     st.markdown("""
+     This app uses a BERT-based model to detect toxic comments.
+     Enter your text below to analyze it for different types of toxicity.
+     """)
+
+     # Load model
+     model_path = os.path.join("models", "saved", "best_model.pt")
+
+     if not os.path.exists(model_path):
+         st.error("Model file not found! Please train the model first.")
+         return
+
+     try:
+         # Initialize predictor
+         @st.cache_resource(show_spinner=False)
+         def load_predictor():
+             with st.spinner("Loading model..."):
+                 return ToxicPredictor(model_path)
+
+         predictor = load_predictor()
+
+         # Text input
+         text = st.text_area(
+             "Enter text to analyze:",
+             height=100,
+             placeholder="Type or paste your text here..."
+         )
+
+         if st.button("Analyze", type="primary"):
+             if not text:
+                 st.warning("Please enter some text to analyze.")
+                 return
+
+             with st.spinner("Analyzing text..."):
+                 try:
+                     # Get predictions
+                     predictions = predictor.predict(text)
+
+                     # Display results
+                     st.markdown("### Analysis Results")
+
+                     # Create columns for the gauge charts
+                     col1, col2, col3 = st.columns(3)
+
+                     # Display gauge charts in columns
+                     with col1:
+                         st.plotly_chart(create_gauge_chart(predictions['toxic'], "Toxic"), use_container_width=True)
+                         st.plotly_chart(create_gauge_chart(predictions['obscene'], "Obscene"), use_container_width=True)
+
+                     with col2:
+                         st.plotly_chart(create_gauge_chart(predictions['severe_toxic'], "Severe Toxic"), use_container_width=True)
+                         st.plotly_chart(create_gauge_chart(predictions['threat'], "Threat"), use_container_width=True)
+
+                     with col3:
+                         st.plotly_chart(create_gauge_chart(predictions['insult'], "Insult"), use_container_width=True)
+                         st.plotly_chart(create_gauge_chart(predictions['identity_hate'], "Identity Hate"), use_container_width=True)
+
+                     # Overall assessment
+                     st.markdown("### Overall Assessment")
+                     max_toxicity = max(predictions.values())
+                     max_category = max(predictions.items(), key=lambda x: x[1])[0]
+
+                     if max_toxicity > 0.5:
+                         st.error(f"⚠️ This text may be toxic (highest score: {max_toxicity:.2%} for {max_category})")
+                     else:
+                         st.success(f"✅ This text appears to be non-toxic (highest score: {max_toxicity:.2%})")
+
+                 except Exception as e:
+                     st.error(f"Error analyzing text: {str(e)}")
+
+         # Add information about the categories
+         with st.expander("ℹ️ About the Toxicity Categories"):
+             st.markdown("""
+             The model analyzes text for six types of toxicity:
+
+             * **Toxic**: General category for unpleasant content
+             * **Severe Toxic**: Extreme cases of toxicity
+             * **Obscene**: Explicit or vulgar content
+             * **Threat**: Expressions of intent to harm
+             * **Insult**: Disrespectful or demeaning language
+             * **Identity Hate**: Prejudiced language against protected characteristics
+
+             Scores range from 0% to 100%, where higher scores indicate stronger presence of that category.
+             """)
+
+         # Footer
+         st.markdown("---")
+         st.markdown(
+             "Built with ❤️ using Streamlit and BERT. "
+             "Model trained on the Toxic Comment Classification Dataset."
+         )
+
+     except Exception as e:
+         st.error(f"Application error: {str(e)}")
+
+ if __name__ == "__main__":
+     main()
data/__init__.py ADDED
@@ -0,0 +1 @@
+ # Empty file to make data a package
data/data_loader.py ADDED
@@ -0,0 +1,106 @@
+ import pandas as pd
+ import torch
+ from torch.utils.data import Dataset, DataLoader
+ from transformers import BertTokenizer
+ from typing import Dict, List, Tuple
+ import numpy as np
+ import os
+
+ class ToxicCommentDataset(Dataset):
+     def __init__(self, texts: List[str], labels: np.ndarray, tokenizer: BertTokenizer, max_length: int = 128):
+         # Convert texts to list if it's a pandas Series
+         self.texts = texts.tolist() if isinstance(texts, pd.Series) else texts
+         self.labels = labels
+         self.tokenizer = tokenizer
+         self.max_length = max_length
+
+     def __len__(self):
+         return len(self.texts)
+
+     def __getitem__(self, idx) -> Dict[str, torch.Tensor]:
+         text = str(self.texts[idx])
+
+         # Handle unusual line terminators
+         text = text.replace('\u2028', ' ').replace('\u2029', ' ')  # Remove line/paragraph separators
+         text = ' '.join(text.splitlines())  # Normalize all newlines
+
+         label = self.labels[idx]
+
+         encoding = self.tokenizer(
+             text,
+             add_special_tokens=True,
+             max_length=self.max_length,
+             padding='max_length',
+             truncation=True,
+             return_tensors='pt'
+         )
+
+         return {
+             'input_ids': encoding['input_ids'].flatten(),
+             'attention_mask': encoding['attention_mask'].flatten(),
+             'labels': torch.FloatTensor(label)
+         }
+
+ def load_toxic_data(data_path: str) -> Tuple[List[str], np.ndarray]:
+     """Load and prepare the toxic comment dataset"""
+     try:
+         # Use encoding='utf-8-sig' to handle BOM if present
+         df = pd.read_csv(data_path, encoding='utf-8-sig', on_bad_lines='skip')
+
+         # List of toxicity categories
+         toxic_categories = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
+
+         # Convert text column to list and labels to numpy array
+         texts = df['comment_text'].tolist()
+         labels = df[toxic_categories].values
+
+         return texts, labels
+     except Exception as e:
+         raise RuntimeError(f"Error loading data from {data_path}: {str(e)}")
+
+ def create_data_loaders(
+     texts: List[str],
+     labels: np.ndarray,
+     tokenizer: BertTokenizer,
+     train_ratio: float = 0.8,
+     batch_size: int = 32,
+     num_workers: int = 4  # Adjusted for Windows
+ ) -> Tuple[DataLoader, DataLoader]:
+     """Create train and validation data loaders"""
+     try:
+         # Calculate split index
+         dataset_size = len(texts)
+         train_size = int(dataset_size * train_ratio)
+
+         # Split data
+         train_texts = texts[:train_size]
+         train_labels = labels[:train_size]
+         val_texts = texts[train_size:]
+         val_labels = labels[train_size:]
+
+         # Create datasets
+         train_dataset = ToxicCommentDataset(train_texts, train_labels, tokenizer)
+         val_dataset = ToxicCommentDataset(val_texts, val_labels, tokenizer)
+
+         # Create data loaders with Windows-optimized settings
+         train_loader = DataLoader(
+             train_dataset,
+             batch_size=batch_size,
+             shuffle=True,
+             num_workers=num_workers,
+             pin_memory=True,  # Helps with CUDA performance
+             persistent_workers=True  # Keeps workers alive between epochs
+         )
+
+         val_loader = DataLoader(
+             val_dataset,
+             batch_size=batch_size,
+             shuffle=False,
+             num_workers=num_workers,
+             pin_memory=True,
+             persistent_workers=True
+         )
+
+         return train_loader, val_loader
+     except Exception as e:
+         raise RuntimeError(f"Error creating data loaders: {str(e)}")
models/__init__.py ADDED
@@ -0,0 +1 @@
+ # Empty file to make models a package
models/__pycache__/__init__.cpython-312.pyc ADDED
Binary file (180 Bytes).
 
models/__pycache__/toxic_classifier.cpython-312.pyc ADDED
Binary file (2.27 kB).
 
models/toxic_classifier.py ADDED
@@ -0,0 +1,34 @@
+ import torch
+ import torch.nn as nn
+ from transformers import AutoModel
+ from typing import Dict, Tuple
+
+ class ToxicClassifier(nn.Module):
+     def __init__(self, num_classes: int = 6, dropout: float = 0.3):
+         super(ToxicClassifier, self).__init__()
+
+         # BERT base model - freeze some layers to prevent overfitting
+         self.bert = AutoModel.from_pretrained('bert-base-uncased')
+
+         # Freeze all but the last 8 parameter tensors of BERT
+         # (the tail of the final encoder layer and the pooler) to reduce overfitting
+         for param in list(self.bert.parameters())[:-8]:
+             param.requires_grad = False
+
+         # Simplified architecture focusing on BERT's power
+         self.dropout = nn.Dropout(dropout)
+         self.classifier = nn.Linear(768, num_classes)  # 768 is BERT's hidden size
+
+         # Initialize the classifier weights properly
+         torch.nn.init.xavier_uniform_(self.classifier.weight)
+         self.classifier.bias.data.fill_(0.0)
+
+     def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
+         # Get BERT embeddings
+         outputs = self.bert(input_ids, attention_mask=attention_mask)
+         pooled_output = outputs.pooler_output  # [batch_size, 768]
+
+         # Apply dropout and classification
+         pooled_output = self.dropout(pooled_output)
+         logits = self.classifier(pooled_output)
+
+         return logits  # Return logits directly; BCEWithLogitsLoss will handle the sigmoid
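As the final comment notes, the model returns raw logits; a minimal sketch of how they are consumed, using dummy tensors and the import path that app.py and train.py use:

```python
import torch
from src.models.toxic_classifier import ToxicClassifier

model = ToxicClassifier()
criterion = torch.nn.BCEWithLogitsLoss()  # applies the sigmoid internally during training

# Dummy batch: 2 sequences of 128 token ids with attention masks and 6 binary labels each
input_ids = torch.randint(0, 30522, (2, 128))       # 30522 = bert-base-uncased vocab size
attention_mask = torch.ones(2, 128, dtype=torch.long)
labels = torch.zeros(2, 6)

logits = model(input_ids, attention_mask)            # shape [2, 6], raw logits
loss = criterion(logits, labels)                      # training loss
probabilities = torch.sigmoid(logits)                 # probabilities for inference
```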
models/trainer.py ADDED
@@ -0,0 +1,86 @@
+ import torch
+ from torch.utils.data import DataLoader
+ from typing import Dict, List
+ from tqdm import tqdm
+ from torch.amp import autocast, GradScaler
+
+ class ModelTrainer:
+     def __init__(self, model, optimizer, criterion, device, scaler: GradScaler = None, scheduler=None):
+         self.model = model
+         self.optimizer = optimizer
+         self.criterion = criterion
+         self.device = device
+         self.scaler = scaler or GradScaler('cuda')
+         self.use_amp = device.type == 'cuda'
+         self.scheduler = scheduler
+
+     def train_epoch(self, dataloader: DataLoader) -> Dict[str, float]:
+         self.model.train()
+         total_loss = 0
+
+         for batch in tqdm(dataloader, desc="Training"):
+             input_ids = batch['input_ids'].to(self.device)
+             attention_mask = batch['attention_mask'].to(self.device)
+             labels = batch['labels'].to(self.device)
+
+             self.optimizer.zero_grad()
+
+             if self.use_amp:
+                 with autocast('cuda'):
+                     outputs = self.model(input_ids, attention_mask)
+                     loss = self.criterion(outputs, labels)
+
+                 self.scaler.scale(loss).backward()
+
+                 # Clip gradients
+                 self.scaler.unscale_(self.optimizer)
+                 torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
+
+                 self.scaler.step(self.optimizer)
+                 self.scaler.update()
+             else:
+                 outputs = self.model(input_ids, attention_mask)
+                 loss = self.criterion(outputs, labels)
+                 loss.backward()
+                 torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
+                 self.optimizer.step()
+
+             if self.scheduler is not None:
+                 self.scheduler.step()
+
+             total_loss += loss.item()
+
+         return {'loss': total_loss / len(dataloader)}
+
+     def evaluate(self, dataloader: DataLoader) -> Dict[str, float]:
+         self.model.eval()
+         total_loss = 0
+         predictions = []
+         true_labels = []
+
+         with torch.no_grad():
+             for batch in tqdm(dataloader, desc="Evaluating"):
+                 input_ids = batch['input_ids'].to(self.device)
+                 attention_mask = batch['attention_mask'].to(self.device)
+                 labels = batch['labels'].to(self.device)
+
+                 if self.use_amp:
+                     with autocast('cuda'):
+                         outputs = self.model(input_ids, attention_mask)
+                         loss = self.criterion(outputs, labels)
+                 else:
+                     outputs = self.model(input_ids, attention_mask)
+                     loss = self.criterion(outputs, labels)
+
+                 # Apply sigmoid to get probabilities for predictions
+                 probs = torch.sigmoid(outputs)
+
+                 total_loss += loss.item()
+                 predictions.extend(probs.cpu().numpy())
+                 true_labels.extend(labels.cpu().numpy())
+
+         return {
+             'loss': total_loss / len(dataloader),
+             'predictions': predictions,
+             'true_labels': true_labels
+         }
preprocessing/__init__.py ADDED
@@ -0,0 +1 @@
+ # Empty file to make preprocessing a package
preprocessing/text_processor.py ADDED
@@ -0,0 +1,47 @@
+ import re
+ import nltk
+ from nltk.tokenize import word_tokenize
+ from nltk.corpus import stopwords
+ from nltk.stem import WordNetLemmatizer
+ from typing import List, Optional
+
+ class TextPreprocessor:
+     def __init__(self):
+         nltk.download('punkt')
+         nltk.download('stopwords')
+         nltk.download('wordnet')
+         self.stop_words = set(stopwords.words('english'))
+         self.lemmatizer = WordNetLemmatizer()
+
+     def clean_text(self, text: str) -> str:
+         """Clean and normalize text"""
+         # Convert to lowercase
+         text = text.lower()
+
+         # Remove special characters and numbers
+         text = re.sub(r'[^a-zA-Z\s]', '', text)
+
+         # Remove extra whitespace
+         text = re.sub(r'\s+', ' ', text).strip()
+
+         return text
+
+     def tokenize(self, text: str) -> List[str]:
+         """Tokenize text into words"""
+         return word_tokenize(text)
+
+     def remove_stopwords(self, tokens: List[str]) -> List[str]:
+         """Remove stop words from token list"""
+         return [token for token in tokens if token not in self.stop_words]
+
+     def lemmatize(self, tokens: List[str]) -> List[str]:
+         """Lemmatize tokens"""
+         return [self.lemmatizer.lemmatize(token) for token in tokens]
+
+     def process(self, text: str) -> List[str]:
+         """Complete preprocessing pipeline"""
+         cleaned_text = self.clean_text(text)
+         tokens = self.tokenize(cleaned_text)
+         tokens = self.remove_stopwords(tokens)
+         tokens = self.lemmatize(tokens)
+         return tokens
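Note that `process` returns a list of tokens rather than a string; a small usage sketch (the exact output depends on the installed NLTK data):

```python
from src.preprocessing.text_processor import TextPreprocessor

preprocessor = TextPreprocessor()  # downloads punkt/stopwords/wordnet on first use
tokens = preprocessor.process("You ARE being really rude!!!")
print(tokens)  # e.g. ['really', 'rude']

# Callers that need plain text again (e.g. before BERT tokenization) should re-join:
text = " ".join(tokens)
```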
requirements.txt ADDED
@@ -0,0 +1,14 @@
+ # Core dependencies
+ transformers>=4.35.0
+ nltk>=3.6.0
+ fastapi>=0.68.0
+ uvicorn>=0.15.0
+ scikit-learn>=0.24.0
+ tqdm>=4.62.0
+ pydantic>=1.8.0
+ streamlit>=1.24.0
+ plotly>=5.13.0
+ torch>=1.9.0
+ numpy>=1.19.0
+ pandas>=1.3.0
saved/best_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9ac08d9bdca185a464f8a71e88cd2e15ce2fb6b18ebb51dc3d459e00e0f9c159
+ size 480592037
train.py ADDED
@@ -0,0 +1,89 @@
+ import torch
+ from transformers import BertTokenizer
+ from torch.optim import AdamW  # transformers' AdamW is deprecated; use the PyTorch implementation
+ from src.models.toxic_classifier import ToxicClassifier
+ from src.models.trainer import ModelTrainer
+ from src.data.data_loader import load_toxic_data, create_data_loaders
+ import logging
+ import os
+ from torch.amp import GradScaler  # For mixed precision training
+
+ # Setup logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ def train_model(
+     data_path: str,
+     model_save_path: str,
+     num_epochs: int = 5,
+     batch_size: int = 64,  # Increased for RTX 3060
+     learning_rate: float = 2e-5,
+     max_grad_norm: float = 1.0
+ ):
+     # Set device and enable CUDA optimizations
+     device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+     if device.type == 'cuda':
+         torch.backends.cudnn.benchmark = True
+     logger.info(f"Using device: {device}")
+
+     # Load tokenizer
+     tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+
+     # Load data
+     logger.info("Loading dataset...")
+     texts, labels = load_toxic_data(data_path)
+     train_loader, val_loader = create_data_loaders(
+         texts,
+         labels,
+         tokenizer,
+         batch_size=batch_size
+     )
+
+     # Initialize model
+     logger.info("Initializing model...")
+     model = ToxicClassifier().to(device)
+
+     # Initialize optimizer with weight decay
+     optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=0.01)
+
+     # Initialize gradient scaler for mixed precision training
+     scaler = GradScaler('cuda')
+
+     # Initialize trainer with mixed precision support.
+     # The model returns raw logits, so use BCEWithLogitsLoss (which applies the sigmoid itself).
+     trainer = ModelTrainer(model, optimizer, criterion=torch.nn.BCEWithLogitsLoss(), device=device, scaler=scaler)
+
+     # Training loop
+     logger.info("Starting training...")
+     best_val_loss = float('inf')
+
+     for epoch in range(num_epochs):
+         # Train
+         train_metrics = trainer.train_epoch(train_loader)
+         logger.info(f"Epoch {epoch+1}/{num_epochs}")
+         logger.info(f"Training Loss: {train_metrics['loss']:.4f}")
+
+         # Evaluate
+         val_metrics = trainer.evaluate(val_loader)
+         val_loss = val_metrics['loss']
+         logger.info(f"Validation Loss: {val_loss:.4f}")
+
+         # Save best model
+         if val_loss < best_val_loss:
+             best_val_loss = val_loss
+             torch.save({
+                 'epoch': epoch,
+                 'model_state_dict': model.state_dict(),
+                 'optimizer_state_dict': optimizer.state_dict(),
+                 'loss': best_val_loss,
+             }, os.path.join(model_save_path, 'best_model.pt'))
+             logger.info("Saved best model checkpoint")
+
+     logger.info("Training completed!")
+
+ if __name__ == "__main__":
+     DATA_PATH = os.path.join("data", "raw", "train.csv")
+     MODEL_SAVE_PATH = os.path.join("models", "saved")
+
+     # Create model save directory if it doesn't exist
+     os.makedirs(MODEL_SAVE_PATH, exist_ok=True)
+
+     train_model(DATA_PATH, MODEL_SAVE_PATH)