Space: binuser007/Toxic_comment_classification_using_Bert (status: Build error)

Prudhvinath07 committed on Apr 8, 2025

Commit

dec266f

1 Parent(s): 145a122

added all files

Files changed (22)

.DS_Store +0 -0
.dockerignore +60 -0
.space +7 -0
Dockerfile +28 -0
README.md +159 -10
__init__.py +1 -0
__pycache__/__init__.cpython-312.pyc +0 -0
api/__init__.py +1 -0
api/main.py +61 -0
app.py +208 -0
data/__init__.py +1 -0
data/data_loader.py +106 -0
models/__init__.py +1 -0
models/__pycache__/__init__.cpython-312.pyc +0 -0
models/__pycache__/toxic_classifier.cpython-312.pyc +0 -0
models/toxic_classifier.py +34 -0
models/trainer.py +86 -0
preprocessing/__init__.py +1 -0
preprocessing/text_processor.py +47 -0
requirements.txt +14 -0
saved/best_model.pt +3 -0
train.py +89 -0

.DS_Store ADDED Viewed

Binary file (6.15 kB). View file

.dockerignore ADDED Viewed

	@@ -0,0 +1,60 @@

+# Git
+.git
+.gitignore
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+env/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+*.egg-info/
+.installed.cfg
+*.egg
+# Virtual Environment
+venv/
+ENV/
+# IDE specific files
+.idea/
+.vscode/
+*.swp
+*.swo
+# OS specific files
+.DS_Store
+.DS_Store?
+._*
+.Spotlight-V100
+.Trashes
+ehthumbs.db
+Thumbs.db
+# Docker and deployment files
+Dockerfile
+.dockerignore
+build_docker.sh
+DEPLOY_TO_HUGGINGFACE.md
+.space
+deploy_to_huggingface.sh
+# Test files that aren't needed for deployment
+test_*.py
+CLI_interactive_test.py
+# Training scripts not needed for inference
+train.py
+src/train.py

.space ADDED Viewed

	@@ -0,0 +1,7 @@

+title: Toxic Comment Classifier
+emoji: 🔍
+colorFrom: blue
+colorTo: indigo
+sdk: docker
+pinned: false
+license: mit

Dockerfile ADDED Viewed

	@@ -0,0 +1,28 @@

+FROM python:3.9-slim
+WORKDIR /app
+# Copy requirements first for better caching
+COPY requirements.txt .
+# Install dependencies
+RUN pip install --no-cache-dir -r requirements.txt
+# Copy the rest of the application
+COPY . .
+# Download NLTK data
+RUN python -c "import nltk; nltk.download('punkt')"
+# Make port 7860 available for Hugging Face Spaces
+EXPOSE 7860
+# Set environment variables for Streamlit
+ENV PYTHONUNBUFFERED=1 \
+    PYTHONDONTWRITEBYTECODE=1 \
+    STREAMLIT_SERVER_PORT=7860 \
+    STREAMLIT_SERVER_HEADLESS=true \
+    STREAMLIT_SERVER_ENABLE_CORS=false
+# Command to run the application
+CMD ["streamlit", "run", "app.py", "--server.address=0.0.0.0"]

README.md CHANGED Viewed

@@ -1,10 +1,159 @@
----
-title: Toxic Comment Classification Using Bert
-emoji: 🏃
-colorFrom: pink
-colorTo: purple
-sdk: docker
-pinned: false
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# Toxic Comment Classification using BERT
+A machine learning project that uses BERT (Bidirectional Encoder Representations from Transformers) to classify toxic comments. It provides both a web interface and CLI tools for detecting six types of toxicity.
+## 🌟 Features
+- Real-time toxic comment classification
+- Interactive web interface using Streamlit
+- Command-line interface for batch processing
+- Support for multiple toxicity categories
+- Visualization of toxicity scores using Plotly
+- GPU acceleration support (when available)
+## 🛠️ Prerequisites
+- Python 3.8+ (the Dockerfile uses Python 3.9)
+- CUDA-compatible GPU (optional, for faster processing)
+- Git
+## 📦 Installation
+1. Clone the repository:
+   ```bash
+   git clone https://github.com/yourusername/commentclassification_using_bert_model.git
+   cd commentclassification_using_bert_model
+   ```
+2. Create and activate a virtual environment:
+   ```bash
+   python -m venv venv
+   source venv/bin/activate  # On Windows, use: venv\Scripts\activate
+   ```
+3. Install required packages:
+   ```bash
+   pip install -r requirements.txt
+   ```
+## 🚀 Usage
+### Web Interface
+1. Start the Streamlit application:
+   ```bash
+   streamlit run app.py
+   ```
+2. Open your browser and navigate to the displayed URL (typically http://localhost:8501)
+3. Enter text in the input field to get toxicity predictions
+4. View the visualization of toxicity scores through an interactive chart
+### Docker Container
+1. Build the Docker image:
+   ```bash
+   docker build -t toxic-comment-classifier .
+   ```
+2. Run the Docker container:
+   ```bash
+   docker run -p 7860:7860 toxic-comment-classifier
+   ```
+3. Open your browser and navigate to http://localhost:7860
+### Hugging Face Spaces Deployment
+This project can be deployed to Hugging Face Spaces using Docker:
+1. Create a new Space on Hugging Face with Docker SDK
+2. Push this repository to the Space
+3. Hugging Face will automatically build and deploy the Docker container
+For detailed deployment instructions, see [DEPLOY_TO_HUGGINGFACE.md](DEPLOY_TO_HUGGINGFACE.md)
+### Command Line Interface
+For interactive testing:
+```bash
+python CLI_interactive_test.py
+```
+For model training:
+```bash
+python train.py
+```
+For running tests:
+```bash
+python test_model.py
+```
+## 🏗️ Project Structure
+```
+├── app.py                  # Streamlit web application
+├── CLI_interactive_test.py # Command line interface
+├── train.py               # Model training script
+├── test_model.py          # Model testing utilities
+├── cuda.py               # CUDA availability check
+├── requirements.txt       # Project dependencies
+├── setup.py              # Package setup configuration
+├── Dockerfile            # Docker configuration for containerization
+├── .dockerignore         # Files to exclude from Docker image
+├── .space                # Hugging Face Spaces configuration
+├── DEPLOY_TO_HUGGINGFACE.md # Deployment instructions for Hugging Face
+├── deploy_to_huggingface.sh # Script to help with Hugging Face deployment
+├── src/                  # Source code directory
+├── models/               # Saved model checkpoints
+└── data/                 # Training and test datasets
+```
+## 🔧 Model Architecture
+The project fine-tunes a BERT model (bert-base-uncased) and adds a dropout layer plus a single linear classification layer on the pooled output to detect different types of toxicity in text. The model is implemented using PyTorch and the Transformers library.
+Key components:
+- BERT base model for text encoding
+- Custom classification head for toxicity detection
+- Multi-label classification support
+- Real-time inference capabilities
+## 📊 Performance
+The model is trained to classify text into multiple toxicity categories. It processes one comment at a time and provides a confidence score for each category:
+- Toxic
+- Severe Toxic
+- Obscene
+- Threat
+- Insult
+- Identity Hate
+## 💻 Dependencies
+Key dependencies include:
+- transformers >= 4.35.0
+- torch >= 1.9.0
+- streamlit >= 1.24.0
+- fastapi >= 0.68.0
+- plotly >= 5.13.0
+- pandas >= 1.3.0
+- numpy >= 1.19.0
+## 🤝 Contributing
+Contributions are welcome! Please feel free to submit a Pull Request. Here's how you can contribute:
+1. Fork the repository
+2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
+3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
+4. Push to the branch (`git push origin feature/AmazingFeature`)
+5. Open a Pull Request
+## 📝 License
+This project is licensed under the MIT License - see the LICENSE file for details.
+## 🙏 Acknowledgments
+- Hugging Face for the Transformers library
+- The BERT team at Google Research
+- The Streamlit team for the excellent web framework
+- The PyTorch team for the deep learning framework

__init__.py ADDED Viewed

	@@ -0,0 +1 @@


+# Empty file to make src a package

__pycache__/__init__.cpython-312.pyc ADDED Viewed

Binary file (173 Bytes). View file

api/__init__.py ADDED Viewed

	@@ -0,0 +1 @@


+# Empty file to make api a package

api/main.py ADDED Viewed

	@@ -0,0 +1,61 @@

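+import os
+from fastapi import FastAPI, HTTPException
+from pydantic import BaseModel
+import torch
+from transformers import BertTokenizer
+from src.preprocessing.text_processor import TextPreprocessor
+from src.models.toxic_classifier import ToxicClassifier
+app = FastAPI()
+# Load the tokenizer, preprocessor, and model once at startup instead of on every request
+device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+preprocessor = TextPreprocessor()
+model = ToxicClassifier().to(device)
+# Same checkpoint path that app.py and train.py use
+MODEL_PATH = os.path.join("models", "saved", "best_model.pt")
+checkpoint = torch.load(MODEL_PATH, map_location=device, weights_only=True)
+# train.py stores the weights under 'model_state_dict'; fall back to a bare state dict
+model.load_state_dict(checkpoint.get('model_state_dict', checkpoint))
+model.eval()
+class CommentRequest(BaseModel):
+    text: str
+class ToxicityResponse(BaseModel):
+    toxic: float
+    severe_toxic: float
+    obscene: float
+    threat: float
+    insult: float
+    identity_hate: float
+    confidence: float
+@app.post("/predict", response_model=ToxicityResponse)
+async def predict_toxicity(comment: CommentRequest):
+    try:
+        # Preprocess text; process() returns a list of tokens, so join it back into one string
+        processed_text = ' '.join(preprocessor.process(comment.text))
+        # Tokenize for BERT
+        encoded = tokenizer(
+            processed_text,
+            padding=True,
+            truncation=True,
+            max_length=128,
+            return_tensors='pt'
+        )
+        # Get model prediction
+        with torch.no_grad():
+            logits = model(
+                encoded['input_ids'].to(device),
+                encoded['attention_mask'].to(device)
+            )
+        # The model returns raw logits; apply sigmoid to get per-category probabilities
+        predictions = torch.sigmoid(logits)[0].cpu().numpy()
+        # Report the highest category probability as the confidence
+        confidence = float(predictions.max())
+        return ToxicityResponse(
+            toxic=float(predictions[0]),
+            severe_toxic=float(predictions[1]),
+            obscene=float(predictions[2]),
+            threat=float(predictions[3]),
+            insult=float(predictions[4]),
+            identity_hate=float(predictions[5]),
+            confidence=confidence
+        )
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=str(e))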
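
Review note: the `/predict` endpoint in api/main.py accepts a JSON body with a single `text` field. The sketch below shows how a client might build such a request using only the standard library. The base URL is an assumption (the diff does not say where the API is served), and `build_predict_request` is a hypothetical helper that is not part of this commit.

```python
import json
import urllib.request

# Assumed address; the commit does not specify where the API runs.
BASE_URL = "http://localhost:8000"


def build_predict_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request shaped like CommentRequest."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request("you are great")
print(request.get_method(), request.full_url)

# Sending the request needs a running server, for example:
# with urllib.request.urlopen(request) as response:
#     scores = json.loads(response.read())
```

If the server is up, the decoded response should carry the seven fields declared in `ToxicityResponse`.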

app.py ADDED Viewed

	@@ -0,0 +1,208 @@

+import streamlit as st
+import torch
+from transformers import AutoTokenizer
+from src.models.toxic_classifier import ToxicClassifier
+import os
+import numpy as np
+import plotly.graph_objects as go
+from typing import Dict
+class ToxicPredictor:
+    def __init__(self, model_path: str):
+        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+        # Load tokenizer and model
+        self.tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
+        self.model = ToxicClassifier().to(self.device)
+        try:
+            # Load trained weights with weights_only=True for security
+            checkpoint = torch.load(model_path, map_location=self.device, weights_only=True)
+            # Handle both old and new model state dict formats
+            if 'model_state_dict' in checkpoint:
+                state_dict = checkpoint['model_state_dict']
+            else:
+                state_dict = checkpoint
+            # Load state dict and handle any missing/unexpected keys
+            missing_keys, unexpected_keys = self.model.load_state_dict(state_dict, strict=False)
+            if missing_keys:
+                st.warning(f"Missing keys in state dict: {missing_keys}")
+            if unexpected_keys:
+                st.warning(f"Unexpected keys in state dict: {unexpected_keys}")
+            self.model.eval()
+        except Exception as e:
+            st.error(f"Error loading model: {str(e)}")
+            raise
+        # Category names
+        self.categories = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
+    def predict(self, text: str) -> Dict[str, float]:
+        """Predict toxicity scores for a single text"""
+        try:
+            # Tokenize
+            encoding = self.tokenizer(
+                text,
+                add_special_tokens=True,
+                max_length=128,
+                padding='max_length',
+                truncation=True,
+                return_tensors='pt'
+            )
+            # Move to device
+            input_ids = encoding['input_ids'].to(self.device)
+            attention_mask = encoding['attention_mask'].to(self.device)
+            # Get predictions
+            with torch.no_grad():
+                outputs = self.model(input_ids, attention_mask)
+                probabilities = torch.sigmoid(outputs).cpu().numpy()[0]
+            # Create results dictionary
+            results = {
+                category: float(prob)
+                for category, prob in zip(self.categories, probabilities)
+            }
+            return results
+        except Exception as e:
+            st.error(f"Error during prediction: {str(e)}")
+            raise
+def create_gauge_chart(value: float, title: str) -> go.Figure:
+    """Create a gauge chart for toxicity scores"""
+    fig = go.Figure(go.Indicator(
+        mode="gauge+number",
+        value=value * 100,  # Convert to percentage
+        domain={'x': [0, 1], 'y': [0, 1]},
+        title={'text': title},
+        gauge={
+            'axis': {'range': [0, 100]},
+            'bar': {'color': "darkblue"},
+            'steps': [
+                {'range': [0, 33], 'color': "lightgreen"},
+                {'range': [33, 66], 'color': "yellow"},
+                {'range': [66, 100], 'color': "red"}
+            ],
+            'threshold': {
+                'line': {'color': "red", 'width': 4},
+                'thickness': 0.75,
+                'value': 50
+            }
+        }
+    ))
+    fig.update_layout(height=200)
+    return fig
+def main():
+    st.set_page_config(
+        page_title="Toxic Comment Classifier",
+        page_icon="🔍",
+        layout="wide"
+    )
+    # Title and description
+    st.title("💬 Toxic Comment Classifier")
+    st.markdown("""
+    This app uses a BERT-based model to detect toxic comments.
+    Enter your text below to analyze it for different types of toxicity.
+    """)
+    # Load model
+    model_path = os.path.join("models", "saved", "best_model.pt")
+    if not os.path.exists(model_path):
+        st.error("Model file not found! Please train the model first.")
+        return
+    try:
+        # Initialize predictor
+        @st.cache_resource(show_spinner=False)
+        def load_predictor():
+            with st.spinner("Loading model..."):
+                return ToxicPredictor(model_path)
+        predictor = load_predictor()
+        # Text input
+        text = st.text_area(
+            "Enter text to analyze:",
+            height=100,
+            placeholder="Type or paste your text here..."
+        )
+        if st.button("Analyze", type="primary"):
+            if not text:
+                st.warning("Please enter some text to analyze.")
+                return
+            with st.spinner("Analyzing text..."):
+                try:
+                    # Get predictions
+                    predictions = predictor.predict(text)
+                    # Display results
+                    st.markdown("### Analysis Results")
+                    # Create columns for the gauge charts
+                    col1, col2, col3 = st.columns(3)
+                    # Display gauge charts in columns
+                    with col1:
+                        st.plotly_chart(create_gauge_chart(predictions['toxic'], "Toxic"), use_container_width=True)
+                        st.plotly_chart(create_gauge_chart(predictions['obscene'], "Obscene"), use_container_width=True)
+                    with col2:
+                        st.plotly_chart(create_gauge_chart(predictions['severe_toxic'], "Severe Toxic"), use_container_width=True)
+                        st.plotly_chart(create_gauge_chart(predictions['threat'], "Threat"), use_container_width=True)
+                    with col3:
+                        st.plotly_chart(create_gauge_chart(predictions['insult'], "Insult"), use_container_width=True)
+                        st.plotly_chart(create_gauge_chart(predictions['identity_hate'], "Identity Hate"), use_container_width=True)
+                    # Overall assessment
+                    st.markdown("### Overall Assessment")
+                    max_toxicity = max(predictions.values())
+                    max_category = max(predictions.items(), key=lambda x: x[1])[0]
+                    if max_toxicity > 0.5:
+                        st.error(f"⚠️ This text may be toxic (highest score: {max_toxicity:.2%} for {max_category})")
+                    else:
+                        st.success(f"✅ This text appears to be non-toxic (highest score: {max_toxicity:.2%})")
+                except Exception as e:
+                    st.error(f"Error analyzing text: {str(e)}")
+        # Add information about the categories
+        with st.expander("ℹ️ About the Toxicity Categories"):
+            st.markdown("""
+            The model analyzes text for six types of toxicity:
+            * **Toxic**: General category for unpleasant content
+            * **Severe Toxic**: Extreme cases of toxicity
+            * **Obscene**: Explicit or vulgar content
+            * **Threat**: Expressions of intent to harm
+            * **Insult**: Disrespectful or demeaning language
+            * **Identity Hate**: Prejudiced language against protected characteristics
+            Scores range from 0% to 100%, where higher scores indicate stronger presence of that category.
+            """)
+        # Footer
+        st.markdown("---")
+        st.markdown(
+            "Built with ❤️ using Streamlit and BERT. "
+            "Model trained on the Toxic Comment Classification Dataset."
+        )
+    except Exception as e:
+        st.error(f"Application error: {str(e)}")
+if __name__ == "__main__":
+    main()
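
Review note: the "Overall Assessment" step in app.py picks the highest-scoring category and flags the text when that score exceeds 0.5. A minimal sketch of that decision, separated from Streamlit so it can be tested on its own, might look like this. The `assess` helper and the example scores are illustrative assumptions, not model output.

```python
from typing import Dict, Tuple


def assess(predictions: Dict[str, float], threshold: float = 0.5) -> Tuple[bool, str, float]:
    """Return (is_flagged, top_category, top_score) for one set of scores."""
    top_category, top_score = max(predictions.items(), key=lambda item: item[1])
    return top_score > threshold, top_category, top_score


# Illustrative scores only; these are not produced by the model.
example = {
    "toxic": 0.82,
    "severe_toxic": 0.10,
    "obscene": 0.40,
    "threat": 0.02,
    "insult": 0.55,
    "identity_hate": 0.01,
}
flagged, category, score = assess(example)
print(flagged, category, f"{score:.2%}")  # True toxic 82.00%
```

Because the comparison is strict, a top score of exactly 0.5 is reported as non-toxic, which matches the `max_toxicity > 0.5` check in app.py.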

data/__init__.py ADDED Viewed

	@@ -0,0 +1 @@


+# Empty file to make data a package

data/data_loader.py ADDED Viewed

	@@ -0,0 +1,106 @@

+import pandas as pd
+import torch
+from torch.utils.data import Dataset, DataLoader
+from transformers import BertTokenizer
+from typing import Dict, List, Tuple
+import numpy as np
+import os
+class ToxicCommentDataset(Dataset):
+    def __init__(self, texts: List[str], labels: np.ndarray, tokenizer: BertTokenizer, max_length: int = 128):
+        # Convert texts to list if it's a pandas Series
+        self.texts = texts.tolist() if isinstance(texts, pd.Series) else texts
+        self.labels = labels
+        self.tokenizer = tokenizer
+        self.max_length = max_length
+    def __len__(self):
+        return len(self.texts)
+    def __getitem__(self, idx) -> Dict[str, torch.Tensor]:
+        text = str(self.texts[idx])
+        # Handle unusual line terminators
+        text = text.replace('\u2028', ' ').replace('\u2029', ' ')  # Remove line/paragraph separators
+        text = ' '.join(text.splitlines())  # Normalize all newlines
+        label = self.labels[idx]
+        encoding = self.tokenizer(
+            text,
+            add_special_tokens=True,
+            max_length=self.max_length,
+            padding='max_length',
+            truncation=True,
+            return_tensors='pt'
+        )
+        return {
+            'input_ids': encoding['input_ids'].flatten(),
+            'attention_mask': encoding['attention_mask'].flatten(),
+            'labels': torch.tensor(label, dtype=torch.float32)  # Float targets for the BCE-style loss
+        }
+def load_toxic_data(data_path: str) -> Tuple[List[str], np.ndarray]:
+    """Load and prepare the toxic comment dataset"""
+    try:
+        # Use encoding='utf-8-sig' to handle BOM if present
+        df = pd.read_csv(data_path, encoding='utf-8-sig', on_bad_lines='skip')
+        # List of toxicity categories
+        toxic_categories = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
+        # Convert text column to list and labels to numpy array
+        texts = df['comment_text'].tolist()
+        labels = df[toxic_categories].values
+        return texts, labels
+    except Exception as e:
+        raise RuntimeError(f"Error loading data from {data_path}: {str(e)}")
+def create_data_loaders(
+    texts: List[str],
+    labels: np.ndarray,
+    tokenizer: BertTokenizer,
+    train_ratio: float = 0.8,
+    batch_size: int = 32,
+    num_workers: int = 4  # Adjusted for Windows
+) -> Tuple[DataLoader, DataLoader]:
+    """Create train and validation data loaders"""
+    try:
+        # Calculate split index
+        dataset_size = len(texts)
+        train_size = int(dataset_size * train_ratio)
+        # Split data
+        train_texts = texts[:train_size]
+        train_labels = labels[:train_size]
+        val_texts = texts[train_size:]
+        val_labels = labels[train_size:]
+        # Create datasets
+        train_dataset = ToxicCommentDataset(train_texts, train_labels, tokenizer)
+        val_dataset = ToxicCommentDataset(val_texts, val_labels, tokenizer)
+        # Create data loaders with Windows-optimized settings
+        train_loader = DataLoader(
+            train_dataset,
+            batch_size=batch_size,
+            shuffle=True,
+            num_workers=num_workers,
+            pin_memory=True,  # Helps with CUDA performance
+            persistent_workers=num_workers > 0  # Keeps workers alive between epochs; only valid with num_workers > 0
+        )
+        val_loader = DataLoader(
+            val_dataset,
+            batch_size=batch_size,
+            shuffle=False,
+            num_workers=num_workers,
+            pin_memory=True,
+            persistent_workers=num_workers > 0
+        )
+        return train_loader, val_loader
+    except Exception as e:
+        raise RuntimeError(f"Error creating data loaders: {str(e)}")
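
Review note: `create_data_loaders` splits by position, so the first 80% of rows train and the last 20% validate. If the CSV is ordered in some meaningful way, the two sets may differ in distribution. The sketch below contrasts the positional split with a seeded shuffle. `shuffled_split` is a hypothetical alternative and is not part of this commit.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sequential_split(items: Sequence[T], train_ratio: float = 0.8) -> Tuple[List[T], List[T]]:
    """Split by position, as create_data_loaders does: first part trains, rest validates."""
    train_size = int(len(items) * train_ratio)
    return list(items[:train_size]), list(items[train_size:])


def shuffled_split(
    items: Sequence[T], train_ratio: float = 0.8, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Hypothetical alternative: shuffle indices with a fixed seed before splitting."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    train_size = int(len(items) * train_ratio)
    train = [items[i] for i in indices[:train_size]]
    val = [items[i] for i in indices[train_size:]]
    return train, val


rows = list(range(10))
train_rows, val_rows = sequential_split(rows)
print(train_rows, val_rows)

shuffled_train, shuffled_val = shuffled_split(rows)
print(len(shuffled_train), len(shuffled_val))
```

With real data the same index permutation would have to be applied to both `texts` and `labels` so that each comment keeps its label.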

models/__init__.py ADDED Viewed

	@@ -0,0 +1 @@


+# Empty file to make models a package

models/__pycache__/__init__.cpython-312.pyc ADDED Viewed

Binary file (180 Bytes). View file

models/__pycache__/toxic_classifier.cpython-312.pyc ADDED Viewed

Binary file (2.27 kB). View file

models/toxic_classifier.py ADDED Viewed

	@@ -0,0 +1,34 @@

+import torch
+import torch.nn as nn
+from transformers import AutoModel
+from typing import Dict, Tuple
+class ToxicClassifier(nn.Module):
+    def __init__(self, num_classes: int = 6, dropout: float = 0.3):
+        super(ToxicClassifier, self).__init__()
+        # BERT base model - freeze some layers to prevent overfitting
+        self.bert = AutoModel.from_pretrained('bert-base-uncased')
+        # Freeze every BERT parameter tensor except the last 8
+        # (this slices the flat parameter list, not the list of encoder layers)
+        for param in list(self.bert.parameters())[:-8]:
+            param.requires_grad = False
+        # Simplified architecture focusing on BERT's power
+        self.dropout = nn.Dropout(dropout)
+        self.classifier = nn.Linear(768, num_classes)  # 768 is BERT's hidden size
+        # Initialize the classifier weights properly
+        torch.nn.init.xavier_uniform_(self.classifier.weight)
+        self.classifier.bias.data.fill_(0.0)
+    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
+        # Get BERT embeddings
+        outputs = self.bert(input_ids, attention_mask=attention_mask)
+        pooled_output = outputs.pooler_output  # [batch_size, 768]
+        # Apply dropout and classification
+        pooled_output = self.dropout(pooled_output)
+        logits = self.classifier(pooled_output)
+        return logits  # Return logits directly, BCEWithLogitsLoss will handle the sigmoid
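
Review note: `ToxicClassifier.forward` returns raw logits and leaves the sigmoid to the loss function. The pure-Python sketch below illustrates why that pairing works: the numerically stable form `max(x, 0) - x*y + log(1 + exp(-|x|))` gives the same value as applying a sigmoid followed by binary cross-entropy. The function names are illustrative assumptions, and the logits are made-up values.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bce_with_logits(logits: Sequence[float], targets: Sequence[float]) -> float:
    """Mean binary cross-entropy computed directly from logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def bce_from_probabilities(probs: Sequence[float], targets: Sequence[float]) -> float:
    """Mean binary cross-entropy computed from probabilities (the naive form)."""
    total = 0.0
    for p, y in zip(probs, targets):
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)


# Made-up logits and labels for three categories.
logits = [2.0, -1.0, 0.5]
targets = [1.0, 0.0, 1.0]

stable = bce_with_logits(logits, targets)
naive = bce_from_probabilities([sigmoid(x) for x in logits], targets)
print(stable, naive)
```

The two values agree up to floating-point error, which is why the criterion paired with this model should expect logits rather than probabilities.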

models/trainer.py ADDED Viewed

	@@ -0,0 +1,86 @@

+import torch
+from torch.utils.data import DataLoader
+from typing import Dict, List
+from tqdm import tqdm
+from torch.amp import autocast, GradScaler
+class ModelTrainer:
+    def __init__(self, model, optimizer, criterion, device, scaler: GradScaler = None, scheduler=None):
+        self.model = model
+        self.optimizer = optimizer
+        self.criterion = criterion
+        self.device = device
+        self.scaler = scaler or GradScaler('cuda')
+        self.use_amp = device.type == 'cuda'
+        self.scheduler = scheduler
+    def train_epoch(self, dataloader: DataLoader) -> Dict[str, float]:
+        self.model.train()
+        total_loss = 0
+        for batch in tqdm(dataloader, desc="Training"):
+            input_ids = batch['input_ids'].to(self.device)
+            attention_mask = batch['attention_mask'].to(self.device)
+            labels = batch['labels'].to(self.device)
+            self.optimizer.zero_grad()
+            if self.use_amp:
+                with autocast('cuda'):
+                    outputs = self.model(input_ids, attention_mask)
+                    loss = self.criterion(outputs, labels)
+                self.scaler.scale(loss).backward()
+                # Clip gradients
+                self.scaler.unscale_(self.optimizer)
+                torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
+                self.scaler.step(self.optimizer)
+                self.scaler.update()
+            else:
+                outputs = self.model(input_ids, attention_mask)
+                loss = self.criterion(outputs, labels)
+                loss.backward()
+                torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
+                self.optimizer.step()
+            if self.scheduler is not None:
+                self.scheduler.step()
+            total_loss += loss.item()
+        return {'loss': total_loss / len(dataloader)}
+    def evaluate(self, dataloader: DataLoader) -> Dict[str, float]:
+        self.model.eval()
+        total_loss = 0
+        predictions = []
+        true_labels = []
+        with torch.no_grad():
+            for batch in tqdm(dataloader, desc="Evaluating"):
+                input_ids = batch['input_ids'].to(self.device)
+                attention_mask = batch['attention_mask'].to(self.device)
+                labels = batch['labels'].to(self.device)
+                if self.use_amp:
+                    with autocast('cuda'):
+                        outputs = self.model(input_ids, attention_mask)
+                        loss = self.criterion(outputs, labels)
+                else:
+                    outputs = self.model(input_ids, attention_mask)
+                    loss = self.criterion(outputs, labels)
+                # Apply sigmoid to get probabilities for predictions
+                probs = torch.sigmoid(outputs)
+                total_loss += loss.item()
+                predictions.extend(probs.cpu().numpy())
+                true_labels.extend(labels.cpu().numpy())
+        return {
+            'loss': total_loss / len(dataloader),
+            'predictions': predictions,
+            'true_labels': true_labels
+        }
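
Review note: both training branches clip gradients with `max_norm=1.0`. As a rough illustration of what clipping by global norm does, the sketch below rescales plain Python lists in place. It follows my understanding of the default 2-norm behaviour (scale by `max_norm / (total_norm + eps)`, capped at 1); the helper name and `eps` value are assumptions, not code from this commit.

```python
import math
from typing import List


def clip_by_global_norm(grads: List[List[float]], max_norm: float = 1.0, eps: float = 1e-6) -> float:
    """Scale all gradients in place so their joint L2 norm is at most max_norm.

    Returns the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    clip_coef = min(max_norm / (total_norm + eps), 1.0)
    for tensor in grads:
        for i in range(len(tensor)):
            tensor[i] *= clip_coef
    return total_norm


# Two made-up "parameter gradients" whose joint norm is 5.
grads = [[3.0, 0.0], [0.0, 4.0]]
norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for tensor in grads for g in tensor))
print(norm_before, norm_after)

# Gradients already within the limit are left unchanged.
small = [[0.3, 0.4]]
clip_by_global_norm(small, max_norm=1.0)
print(small)
```

The direction of the gradient is preserved; only its length is reduced when it exceeds the limit.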

preprocessing/__init__.py ADDED Viewed

	@@ -0,0 +1 @@


+# Empty file to make preprocessing a package

preprocessing/text_processor.py ADDED Viewed

	@@ -0,0 +1,47 @@

+import re
+import nltk
+from nltk.tokenize import word_tokenize
+from nltk.corpus import stopwords
+from nltk.stem import WordNetLemmatizer
+from typing import List, Optional
+class TextPreprocessor:
+    def __init__(self):
+        nltk.download('punkt')
+        nltk.download('stopwords')
+        nltk.download('wordnet')
+        self.stop_words = set(stopwords.words('english'))
+        self.lemmatizer = WordNetLemmatizer()
+    def clean_text(self, text: str) -> str:
+        """Clean and normalize text"""
+        # Convert to lowercase
+        text = text.lower()
+        # Remove special characters and numbers
+        text = re.sub(r'[^a-zA-Z\s]', '', text)
+        # Remove extra whitespace
+        text = re.sub(r'\s+', ' ', text).strip()
+        return text
+    def tokenize(self, text: str) -> List[str]:
+        """Tokenize text into words"""
+        return word_tokenize(text)
+    def remove_stopwords(self, tokens: List[str]) -> List[str]:
+        """Remove stop words from token list"""
+        return [token for token in tokens if token not in self.stop_words]
+    def lemmatize(self, tokens: List[str]) -> List[str]:
+        """Lemmatize tokens"""
+        return [self.lemmatizer.lemmatize(token) for token in tokens]
+    def process(self, text: str) -> List[str]:
+        """Complete preprocessing pipeline"""
+        cleaned_text = self.clean_text(text)
+        tokens = self.tokenize(cleaned_text)
+        tokens = self.remove_stopwords(tokens)
+        tokens = self.lemmatize(tokens)
+        return tokens
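
Review note: `clean_text` lowercases the input, drops every character that is not a letter or whitespace, and collapses runs of whitespace. The sketch below reproduces only that regex stage with the standard library, so the effect on punctuation, apostrophes, and digits is easy to see. The NLTK steps (tokenizing, stop-word removal, lemmatizing) are deliberately left out, and the sample sentence is made up.

```python
import re


def clean_text(text: str) -> str:
    """Lowercase, keep only letters and whitespace, then collapse whitespace."""
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text


print(clean_text("You're SO dumb!!!  Call 911"))  # youre so dumb call
```

Note that the apostrophe is removed rather than replaced with a space, so "You're" becomes "youre", and numbers disappear entirely.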

requirements.txt ADDED Viewed

	@@ -0,0 +1,14 @@

+# Core dependencies
+nltk>=3.6.0
+fastapi>=0.68.0
+uvicorn>=0.15.0
+scikit-learn>=0.24.0
+tqdm>=4.62.0
+pydantic>=1.8.0
+streamlit>=1.24.0
+plotly>=5.13.0
+torch>=2.3.0  # models/trainer.py imports GradScaler from torch.amp
+transformers>=4.35.0
+numpy>=1.19.0
+pandas>=1.3.0

saved/best_model.pt ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9ac08d9bdca185a464f8a71e88cd2e15ce2fb6b18ebb51dc3d459e00e0f9c159
+size 480592037
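
Review note: saved/best_model.pt is committed as a Git LFS pointer, a small text file of `key value` lines, not the weights themselves. The sketch below parses the pointer shown above and converts the recorded size to MiB. `parse_lfs_pointer` is a hypothetical helper written for illustration.

```python
from typing import Dict

# Pointer contents copied from the diff above.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:9ac08d9bdca185a464f8a71e88cd2e15ce2fb6b18ebb51dc3d459e00e0f9c159
size 480592037
"""


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each non-empty 'key value' line at the first space."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = parse_lfs_pointer(POINTER_TEXT)
size_mib = int(pointer["size"]) / (1024 ** 2)
print(pointer["oid"])
print(f"{size_mib:.1f} MiB")
```

If a checkout has not fetched LFS objects, code that tries to load the file would see this short text instead of a checkpoint, so checking the file size before loading can be a useful guard.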

train.py ADDED Viewed

	@@ -0,0 +1,89 @@

+import torch
+from transformers import BertTokenizer
+from torch.optim import AdamW  # The AdamW re-exported by transformers is deprecated
+from src.models.toxic_classifier import ToxicClassifier
+from src.models.trainer import ModelTrainer
+from src.data.data_loader import load_toxic_data, create_data_loaders
+import logging
+import os
+from torch.cuda.amp import GradScaler, autocast  # For mixed precision training
+# Setup logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+def train_model(
+    data_path: str,
+    model_save_path: str,
+    num_epochs: int = 5,
+    batch_size: int = 64,  # Increased for RTX 3060
+    learning_rate: float = 2e-5,
+    max_grad_norm: float = 1.0
+):
+    # Set device and enable CUDA optimizations
+    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+    if device.type == 'cuda':
+        torch.backends.cudnn.benchmark = True
+    logger.info(f"Using device: {device}")
+    # Load tokenizer
+    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+    # Load data
+    logger.info("Loading dataset...")
+    texts, labels = load_toxic_data(data_path)
+    train_loader, val_loader = create_data_loaders(
+        texts,
+        labels,
+        tokenizer,
+        batch_size=batch_size
+    )
+    # Initialize model
+    logger.info("Initializing model...")
+    model = ToxicClassifier().to(device)
+    # Initialize optimizer with weight decay
+    optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=0.01)
+    # Initialize gradient scaler for mixed precision training
+    scaler = GradScaler()
+    # Initialize trainer with mixed precision support
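+    # The model returns raw logits, so use BCEWithLogitsLoss (it applies the sigmoid internally)
+    trainer = ModelTrainer(model, optimizer, criterion=torch.nn.BCEWithLogitsLoss(), device=device, scaler=scaler)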
+    # Training loop
+    logger.info("Starting training...")
+    best_val_loss = float('inf')
+    for epoch in range(num_epochs):
+        # Train
+        train_metrics = trainer.train_epoch(train_loader)
+        logger.info(f"Epoch {epoch+1}/{num_epochs}")
+        logger.info(f"Training Loss: {train_metrics['loss']:.4f}")
+        # Evaluate
+        val_metrics = trainer.evaluate(val_loader)
+        val_loss = val_metrics['loss']
+        logger.info(f"Validation Loss: {val_loss:.4f}")
+        # Save best model
+        if val_loss < best_val_loss:
+            best_val_loss = val_loss
+            torch.save({
+                'epoch': epoch,
+                'model_state_dict': model.state_dict(),
+                'optimizer_state_dict': optimizer.state_dict(),
+                'loss': best_val_loss,
+            }, os.path.join(model_save_path, 'best_model.pt'))
+            logger.info("Saved best model checkpoint")
+    logger.info("Training completed!")
+if __name__ == "__main__":
+    DATA_PATH = os.path.join("data", "raw", "train.csv")
+    MODEL_SAVE_PATH = os.path.join("models", "saved")
+    # Create model save directory if it doesn't exist
+    os.makedirs(MODEL_SAVE_PATH, exist_ok=True)
+    train_model(DATA_PATH, MODEL_SAVE_PATH)
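
Review note: the training loop overwrites best_model.pt only when the validation loss improves on the best value seen so far. The sketch below isolates that rule and reports which epochs would trigger a save. The helper name and the loss values are hypothetical and are not results from this project.

```python
from typing import List, Sequence


def epochs_that_save(val_losses: Sequence[float]) -> List[int]:
    """Return the zero-based epochs at which a new best checkpoint would be written."""
    best_val_loss = float("inf")
    saved: List[int] = []
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            saved.append(epoch)
    return saved


# Made-up validation losses for five epochs.
losses = [0.09, 0.07, 0.08, 0.06, 0.065]
print(epochs_that_save(losses))  # [0, 1, 3]
```

Since each save overwrites the same file, only the checkpoint from the last listed epoch survives after training.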