anpha@DESKTOP-IT4F327 committed on
Commit 2989a5c · 1 Parent(s): e965645
README.md CHANGED
@@ -1,157 +1,174 @@
- ---
- title: My Hugging Face Space
- emoji: 🚀
- colorFrom: blue
- colorTo: purple
- sdk: streamlit
- sdk_version: "1.25.0"
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at [Hugging Face Spaces Config](https://huggingface.co/docs/hub/spaces-config-reference).
-
- # Social Media Toxicity Detector
-
- A browser extension that detects toxic, offensive, hate speech, and spam content on social media platforms using a machine learning model.
-
- ## Features
-
- - Detection of toxic content on Facebook, Twitter, and YouTube
- - Classification into 4 categories: Clean (0), Offensive (1), Hate Speech (2), and Spam (3)
- - Real-time content scanning on social media platforms
- - Manual text analysis
- - Admin dashboard for content monitoring and analytics
- - User role-based access control
- - Comment log and history tracking
-
- ## Project Structure
-
- The project is organized into two main components:
-
- 1. **Backend API**: FastAPI-based REST API for model inference, user management, and data storage
- 2. **Browser Extension**: Chrome extension for content detection and user interface
-
- ## Backend Setup
-
- ### Prerequisites
-
- - Python 3.9+
- - PostgreSQL with pgvector extension
- - Virtual environment (recommended)
-
- ### Installation
-
- 1. Clone the repository:
- ```bash
- git clone https://github.com/yourusername/social-media-toxicity-detector.git
- cd social-media-toxicity-detector
- ```
-
- 2. Create and activate a virtual environment:
- ```bash
- python -m venv venv
- source venv/bin/activate  # On Windows: venv\Scripts\activate
- ```
-
- 3. Install dependencies:
- ```bash
- pip install -r requirements.txt
- ```
-
- 4. Set up environment variables by creating a `.env` file:
- ```
- # API Configuration
- SECRET_KEY=your-secret-key-here
- ACCESS_TOKEN_EXPIRE_MINUTES=30
-
- # Database Configuration
- POSTGRES_SERVER=localhost
- POSTGRES_USER=postgres
- POSTGRES_PASSWORD=postgres
- POSTGRES_DB=toxicity_detector
- POSTGRES_PORT=5432
-
- # ML Model Configuration
- MODEL_PATH=model/toxicity_detector.h5
- HUGGINGFACE_API_URL=https://api-inference.huggingface.co/models/your-model-endpoint
- HUGGINGFACE_API_TOKEN=your-huggingface-token
-
- # Social Media APIs
- FACEBOOK_API_KEY=your-facebook-api-key
- TWITTER_API_KEY=your-twitter-api-key
- YOUTUBE_API_KEY=your-youtube-api-key
- ```
-
- 5. Initialize the database:
- ```bash
- alembic revision --autogenerate -m "Initial migration"
- alembic upgrade head
- ```
-
- 6. Start the API server:
- ```bash
- uvicorn backend.main:app --reload
- ```
-
- ### API Documentation
-
- Once the server is running, you can access the API documentation at:
- - Swagger UI: http://localhost:8000/docs
- - ReDoc: http://localhost:8000/redoc
-
- ## Extension Setup
-
- 1. Navigate to the extension directory:
- ```bash
- cd extension
- ```
-
- 2. Configure the API endpoint in `background.js`:
- ```javascript
- const API_BASE_URL = 'http://localhost:8000/api'; // Change to your actual API endpoint
  ```
-
- 3. Install the extension in Chrome:
-    - Open Chrome and navigate to `chrome://extensions/`
-    - Enable "Developer mode"
-    - Click "Load unpacked" and select the `extension` directory
-
- ## Usage
-
- 1. After installing the extension, click on the extension icon in the toolbar
- 2. Log in with your credentials
- 3. Visit Facebook, Twitter, or YouTube to activate content scanning
- 4. Use the extension popup to scan pages manually or analyze specific text
- 5. Access the admin dashboard at `http://localhost:8000/admin` (requires admin login)
-
  ## Model Training
-
- The toxicity detection model was trained using a dataset with 4 labels:
- - 0: Clean content
- - 1: Offensive content
- - 2: Hate speech
- - 3: Spam
-
- The model file (.h5) should be placed in the `model` directory or served via Hugging Face API.
-
- ## Database Schema
-
- The system uses PostgreSQL with pgvector extension for vector similarity search:
-
- - **Users**: User accounts with role-based permissions
- - **Roles**: User roles (admin, moderator, user)
- - **Comments**: Detected comments with classification results and vector embeddings
- - **Logs**: System activity logs
-
- ## Security Features
-
- - JWT authentication
- - Role-based access control
  - Password hashing with bcrypt
- - Request logging
- - Input validation and sanitization
-
  ## License
-
- [MIT License](LICENSE)
+ # Toxic Language Detector
+
+ A comprehensive system for detecting toxic language on social media platforms (Facebook, YouTube, Twitter), implemented as a browser extension with a FastAPI backend.
+
+ ## Project Overview
+
+ This project detects and analyzes toxic language in social media comments using a machine learning model trained on a large dataset. The system classifies comments into four categories:
+
+ - 0: Clean (non-toxic)
+ - 1: Offensive
+ - 2: Hate speech
+ - 3: Spam
+
+ The project consists of two main components:
+
+ 1. **Backend API**: A FastAPI application that handles ML model inference and data storage, and provides endpoints for both the extension and admin users.
+ 2. **Browser Extension**: A Chrome extension that scans comments on supported social media platforms and highlights toxic content.
+
+ ## Backend Architecture
+
+ ### Core Components
+
+ - **FastAPI Application**: The main web framework serving the API endpoints
+ - **Machine Learning Model**: LSTM-based model for toxic language classification
+ - **Database**: SQLAlchemy ORM with SQLite/PostgreSQL for data storage
+ - **Authentication**: JWT-based token authentication for API access
+
+ ### Directory Structure
+
+ ```
+ TOXIC-LANGUAGE-DETECTORV1/
+ │── backend/
+ │   ├── api/
+ │   │   ├── models/      # Pydantic models for API requests/responses
+ │   │   ├── routes/      # API endpoints
+ │   ├── config/          # Configuration settings
+ │   ├── core/            # Core functionality (auth, dependencies)
+ │   ├── db/              # Database models and connection
+ │   │   ├── models/      # SQLAlchemy models
+ │   ├── services/        # Service layer (ML model, social media APIs)
+ │   ├── utils/           # Utility functions
+ │── model/               # ML model files
+ │── app.py               # Main entry point
+ │── requirements.txt     # Dependencies
+ │── Dockerfile           # Container configuration
+ ```
+
+ ### Database Schema
+
+ The database consists of the following main tables:
+
+ 1. **User**: Stores user information and authentication data
+ 2. **Role**: Defines user roles (admin, user)
+ 3. **Comment**: Stores analyzed comments with their predictions and vector representations
+ 4. **Log**: Records API access and system events
+
+ ### API Endpoints
+
+ The backend provides two main sets of endpoints:
+
+ 1. **Extension Endpoints**:
+    - `/extension/detect`: Analyzes comment text sent from the browser extension
+
+ 2. **API Endpoints**:
+    - Authentication: `/auth/register`, `/auth/token`
+    - Admin: `/admin/users`, `/admin/comments`, `/admin/logs`
+    - Prediction: `/predict/single`, `/predict/batch`
+    - Analysis: `/detect/similar`, `/detect/statistics`
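The `/extension/detect` endpoint is the one the extension actually calls. A minimal client sketch using only the standard library; the base URL and API key below are placeholders (the `X-API-Key` header and the `test-api-key` default come from `app.py`):

```python
import json
import urllib.request

API_BASE = "http://localhost:7860"  # placeholder; use your Space URL in practice
API_KEY = "test-api-key"            # default from app.py; set API_KEY env var in production

def build_detect_request(text, platform="facebook"):
    """Build a POST request for /extension/detect with the required headers."""
    payload = json.dumps({"text": text, "platform": platform}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/extension/detect",
        data=payload,
        headers={"Content-Type": "application/json", "X-API-Key": API_KEY},
        method="POST",
    )

req = build_detect_request("bình luận cần kiểm tra")
# urllib.request.urlopen(req) would then return a JSON body with the fields
# text, prediction, confidence, and prediction_text.
```

The actual `urlopen` call is omitted since it needs a running server; the helper only assembles the request.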
 
70
+ ## Browser Extension
71
 
72
+ ### Features
 
 
73
 
74
+ - Real-time comment analysis on Facebook, YouTube, and Twitter
75
+ - Visual indicators for toxic comments with different colors based on toxicity type
76
+ - Option to blur highly toxic content with a reveal button
77
+ - Configurable settings through a popup interface
78
+ - Statistics tracking for scanned comments
79
 
80
+ ### Components
81
+
82
+ - **Background Script**: Handles API communication and manages extension state
83
+ - **Content Script**: Analyzes comments on supported websites
84
+ - **Popup Interface**: User-friendly settings panel
85
 
86
+ ### Directory Structure
87
+
88
+ ```
89
+ EXTENSION/
90
+ │── icons/ # Extension icons
91
+ │── popup/ # Popup interface files
92
+ │ ├── popup.css
93
+ │ ├── popup.html
94
+ │ ├── popup.js
95
+ │── background.js # Background script
96
+ │── content.js # Content script for analyzing comments
97
+ │── manifest.json # Extension configuration
98
+ │── styles.css # CSS for content modifications
99
  ```
100
 
101
+ ## Setup and Deployment
102
+
103
+ ### Backend Setup
104
+
105
+ 1. Clone the repository
106
+ 2. Install dependencies: `pip install -r requirements.txt`
107
+ 3. Set up environment variables:
108
+ ```
109
+ export SECRET_KEY="your-secret-key"
110
+ export DATABASE_URL="sqlite:///./toxic_detector.db"
111
+ export EXTENSION_API_KEY="your-extension-api-key"
112
+ ```
113
+ 4. Run the application: `uvicorn app:app --reload`
114
+
115
+ ### Hugging Face Space Deployment
116
+
117
+ 1. Create a new Space on Hugging Face
118
+ 2. Upload the project files
119
+ 3. Configure the environment variables
120
+ 4. Set the Space to use FastAPI template
121
 
122
+ ### Extension Setup
123
 
124
+ 1. Open Chrome and navigate to `chrome://extensions/`
125
+ 2. Enable Developer Mode
126
+ 3. Click "Load unpacked" and select the EXTENSION directory
127
+ 4. Configure the extension API endpoint in the popup settings
 
128
 
129
  ## Model Training
130
 
131
+ The toxic language detection model was trained on a large dataset with four classification labels. The model architecture is based on LSTM (Long Short-Term Memory) networks, which are effective for sequence classification tasks like text analysis.
 
 
 
 
132
 
133
+ ### Model Architecture
134
 
135
+ - Embedding layer
136
+ - LSTM layer
137
+ - Dense output layer with softmax activation
138
+ - Trained with categorical cross-entropy loss
139
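Downstream of that softmax layer, the model's output vector is reduced to a class, a confidence score, and a label. A framework-free sketch of that post-processing (the `interpret` helper and English `LABELS` names are illustrative; they mirror the argmax/confidence handling in `app.py`):

```python
# Illustrative label names; app.py uses Vietnamese equivalents for the same classes.
LABELS = {0: "clean", 1: "offensive", 2: "hate speech", 3: "spam"}

def interpret(probs):
    """probs: a 4-way softmax output, one float per class, summing to ~1."""
    predicted = max(range(len(probs)), key=lambda i: probs[i])  # argmax
    return predicted, probs[predicted], LABELS[predicted]

cls, conf, label = interpret([0.05, 0.10, 0.80, 0.05])
# cls == 2, conf == 0.80, label == "hate speech"
```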
 
+ ## Data Flow
+
+ 1. User visits a social media platform
+ 2. The extension scans comments on the page
+ 3. Comments are sent to the backend API
+ 4. The API processes comments using the ML model
+ 5. Results are returned to the extension
+ 6. The extension highlights toxic comments
+ 7. Comment data is stored in the database for analysis
+
+ ## Security Considerations
+
+ - JWT token authentication for API endpoints
+ - API key authentication for the extension
  - Password hashing with bcrypt
+ - CORS protection
+ - Request logging for monitoring
+
+ ## Future Improvements
+
+ - Add more social media platforms
+ - Implement a user feedback mechanism to improve the model
+ - Add multi-language support
+ - Develop a dashboard for analytics
+ - Implement more advanced NLP techniques
+
  ## License
+
+ This project is for research purposes only.
+
+ ## Acknowledgements
+
+ - TensorFlow team for the ML framework
+ - FastAPI for the backend framework
+ - Chrome Extensions API
app.py CHANGED
@@ -1,40 +1,269 @@
- # app.py - Main entry point for the FastAPI application
-
- from fastapi import FastAPI, Depends
  from fastapi.middleware.cors import CORSMiddleware
- from api.routes import admin, auth, extension, prediction, toxic_detection
- from core.middleware import LogMiddleware
- from db.models.base import Base
- from db.models.user import engine
- import uvicorn
-
  app = FastAPI(
      title="Toxic Language Detector API",
      description="API for detecting toxic language in social media comments",
-     version="1.0.0"
  )
-
- # Configure CORS
  app.add_middleware(
      CORSMiddleware,
-     allow_origins=["*"],  # Update this with specific origins in production
      allow_credentials=True,
      allow_methods=["*"],
      allow_headers=["*"],
  )
-
- # Add custom middleware
- app.add_middleware(LogMiddleware)
-
- # Include routers
- app.include_router(auth.router, prefix="/auth", tags=["Authentication"])
- app.include_router(admin.router, prefix="/admin", tags=["Admin"])
- app.include_router(extension.router, prefix="/extension", tags=["Extension"])
- app.include_router(prediction.router, prefix="/predict", tags=["Prediction"])
- app.include_router(toxic_detection.router, prefix="/detect", tags=["Toxic Detection"])
-
- # Create database tables
- Base.metadata.create_all(bind=engine)
-
  if __name__ == "__main__":
-     uvicorn.run("app:app", host="0.0.0.0", port=8000, reload=True)
+ # app.py - Hugging Face Space Entry Point
+ import os
+ import sys
+ import gradio as gr
+ from fastapi import FastAPI, HTTPException, Depends, status, Request
  from fastapi.middleware.cors import CORSMiddleware
+ from fastapi.responses import HTMLResponse, JSONResponse
+ from fastapi.staticfiles import StaticFiles
+ from pydantic import BaseModel
+ from typing import List, Dict, Any, Optional
+ import tensorflow as tf
+ import numpy as np
+ from sklearn.feature_extraction.text import TfidfVectorizer
+ import re
+
+ # Define the FastAPI application
  app = FastAPI(
      title="Toxic Language Detector API",
      description="API for detecting toxic language in social media comments",
+     version="1.0.0",
  )
+
+ # CORS configuration
  app.add_middleware(
      CORSMiddleware,
+     allow_origins=["*"],
      allow_credentials=True,
      allow_methods=["*"],
      allow_headers=["*"],
  )
+
+ # API models
+ class PredictionRequest(BaseModel):
+     text: str
+     platform: Optional[str] = "unknown"
+     platform_id: Optional[str] = None
+     metadata: Optional[Dict[str, Any]] = None
+
+ class PredictionResponse(BaseModel):
+     text: str
+     prediction: int
+     confidence: float
+     prediction_text: str
+
+ # Load ML model
+ class ToxicDetectionModel:
+     def __init__(self):
+         # Load or create model trained on Vietnamese social media data
+         try:
+             self.model = tf.keras.models.load_model("model/best_model_LSTM.h5")
+             print("Vietnamese toxicity model loaded successfully")
+         except Exception as e:
+             print(f"Error loading model: {e}")
+             print("Creating a dummy model for demonstration")
+             self.model = self._create_dummy_model()
+
+         # Initialize vectorizer for Vietnamese text
+         # Vietnamese doesn't use the same stop words as English
+         self.vectorizer = TfidfVectorizer(
+             max_features=10000,
+             stop_words=None,     # Don't use English stop words
+             ngram_range=(1, 3),  # Use 1-3 grams for better Vietnamese phrase capture
+         )
+
+         # Map predictions to text labels (in Vietnamese)
+         self.label_mapping = {
+             0: "bình thường",  # clean
+             1: "xúc phạm",     # offensive
+             2: "thù ghét",     # hate
+             3: "spam"          # spam
+         }
+
+         # Load Vietnamese tokenizer if available
+         try:
+             # Try to load underthesea for Vietnamese NLP
+             import importlib.util
+             if importlib.util.find_spec("underthesea"):
+                 from underthesea import word_tokenize
+                 self.has_vietnamese_nlp = True
+                 print("Vietnamese NLP library loaded successfully")
+             else:
+                 self.has_vietnamese_nlp = False
+                 print("Vietnamese NLP library not found, using basic tokenization")
+         except Exception:
+             self.has_vietnamese_nlp = False
+
+     def _create_dummy_model(self):
+         # Create a simple model for demonstration
+         inputs = tf.keras.Input(shape=(10000,))
+         x = tf.keras.layers.Dense(128, activation='relu')(inputs)
+         x = tf.keras.layers.Dropout(0.3)(x)
+         outputs = tf.keras.layers.Dense(4, activation='softmax')(x)
+         model = tf.keras.Model(inputs=inputs, outputs=outputs)
+         model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
+         return model
+
+     def preprocess_text(self, text):
+         # Clean text while preserving Vietnamese diacritical marks
+         text = text.lower()
+         text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs
+         text = re.sub(r'<.*?>', '', text)  # Remove HTML tags
+
+         # For Vietnamese, preserve diacritical marks and only remove punctuation
+         text = re.sub(r'[.,;:!?()"\'\[\]/\\]', ' ', text)
+         text = re.sub(r'\s+', ' ', text).strip()  # Remove extra whitespace
+
+         # Use Vietnamese tokenization if available
+         if self.has_vietnamese_nlp:
+             try:
+                 from underthesea import word_tokenize
+                 text = word_tokenize(text, format="text")
+             except Exception as e:
+                 print(f"Error in Vietnamese tokenization: {e}")
+
+         # Vectorize
+         if not hasattr(self.vectorizer, 'vocabulary_'):
+             self.vectorizer.fit([text])
+
+         features = self.vectorizer.transform([text]).toarray()
+         return features
+
+     def predict(self, text):
+         # Preprocess text
+         features = self.preprocess_text(text)
+
+         # Make prediction
+         predictions = self.model.predict(features)[0]
+
+         # Get most likely class and confidence
+         predicted_class = np.argmax(predictions)
+         confidence = float(predictions[predicted_class])
+
+         return int(predicted_class), confidence, self.label_mapping[int(predicted_class)]
+
+ # Initialize model
+ model = ToxicDetectionModel()
+
+ # API Key validation
+ API_KEY = os.environ.get("API_KEY", "test-api-key")
+
+ def verify_api_key(request: Request):
+     api_key = request.headers.get("X-API-Key")
+     if not api_key or api_key != API_KEY:
+         raise HTTPException(
+             status_code=status.HTTP_401_UNAUTHORIZED,
+             detail="Invalid API Key",
+         )
+     return api_key
+
+ # API routes
+ @app.post("/extension/detect", response_model=PredictionResponse)
+ async def detect_toxic_language(
+     request: PredictionRequest,
+     api_key: str = Depends(verify_api_key)
+ ):
+     try:
+         # Make prediction
+         prediction_class, confidence, prediction_text = model.predict(request.text)
+
+         # Return response
+         return {
+             "text": request.text,
+             "prediction": prediction_class,
+             "confidence": confidence,
+             "prediction_text": prediction_text
+         }
+     except Exception as e:
+         raise HTTPException(
+             status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+             detail=f"Error processing request: {str(e)}"
+         )
+
+ @app.get("/", response_class=HTMLResponse)
+ async def root():
+     return """
+     <html>
+     <head>
+         <title>Toxic Language Detector API</title>
+         <style>
+             body {
+                 font-family: Arial, sans-serif;
+                 max-width: 800px;
+                 margin: 0 auto;
+                 padding: 20px;
+             }
+             h1 {
+                 color: #333;
+             }
+             .endpoint {
+                 margin-bottom: 20px;
+                 padding: 10px;
+                 border: 1px solid #ddd;
+                 border-radius: 5px;
+             }
+             .method {
+                 display: inline-block;
+                 padding: 3px 6px;
+                 background-color: #2196F3;
+                 color: white;
+                 border-radius: 3px;
+                 font-size: 14px;
+             }
+             pre {
+                 background-color: #f5f5f5;
+                 padding: 10px;
+                 border-radius: 5px;
+                 overflow-x: auto;
+             }
+         </style>
+     </head>
+     <body>
+         <h1>Toxic Language Detector API</h1>
+         <p>This API provides endpoints for detecting toxic language in text.</p>
+
+         <div class="endpoint">
+             <span class="method">POST</span> <strong>/extension/detect</strong>
+             <p>Analyzes text for toxic language and returns the prediction.</p>
+             <h4>Request</h4>
+             <pre>
+ {
+     "text": "Your text to analyze",
+     "platform": "facebook",
+     "platform_id": "optional-id",
+     "metadata": {}
+ }
+             </pre>
+             <h4>Response</h4>
+             <pre>
+ {
+     "text": "Your text to analyze",
+     "prediction": 0,
+     "confidence": 0.95,
+     "prediction_text": "clean"
+ }
+             </pre>
+             <p>Prediction values: 0 (clean), 1 (offensive), 2 (hate), 3 (spam)</p>
+         </div>
+
+         <p>For more information, check the <a href="/docs">API documentation</a>.</p>
+     </body>
+     </html>
+     """
+
+ # Gradio interface
+ def predict_toxic(text):
+     prediction_class, confidence, prediction_text = model.predict(text)
+
+     # Format response
+     result = f"Prediction: {prediction_text.capitalize()} (Class {prediction_class})\n"
+     result += f"Confidence: {confidence:.2f}"
+
+     return result
+
+ # Create Gradio interface
+ interface = gr.Interface(
+     fn=predict_toxic,
+     inputs=gr.Textbox(lines=5, placeholder="Enter text to analyze for toxic language..."),
+     outputs="text",
+     title="Toxic Language Detector",
+     description="Detects whether text contains toxic language. Classes: 0 (clean), 1 (offensive), 2 (hate), 3 (spam)."
+ )
+
+ # Mount Gradio app
+ app = gr.mount_gradio_app(app, interface, path="/gradio")
+
+ # For direct Hugging Face Space usage
  if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=7860)
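The cleaning steps in the new `preprocess_text` can be exercised on their own. A standalone sketch of the same regex pipeline (the function name `clean_vietnamese` is mine; the regexes are the ones from the diff):

```python
import re

def clean_vietnamese(text):
    """Lowercase, strip URLs and HTML tags, replace punctuation with spaces,
    and collapse whitespace, preserving Vietnamese diacritics throughout."""
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'<.*?>', '', text)                  # remove HTML tags
    text = re.sub(r'[.,;:!?()"\'\[\]/\\]', ' ', text)  # punctuation -> space
    return re.sub(r'\s+', ' ', text).strip()           # collapse whitespace

print(clean_vietnamese('Xem ngay: https://example.com <b>Tuyệt vời!</b>'))
# → xem ngay tuyệt vời
```

Note that the diacritics (`ệ`, `ờ`) survive intact, which is the point of swapping out the ASCII-oriented `[^\w\s]` cleanup used elsewhere.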
backend/services/ml_model.py CHANGED
@@ -7,7 +7,7 @@ import re
  import os

  class MLModel:
-     def __init__(self, model_path="model/best_model_LSTM.h5", max_length=100, max_words=10000):
          self.model_path = model_path
          self.max_length = max_length
          self.max_words = max_words
@@ -16,32 +16,56 @@ class MLModel:
          self.load_model()

      def load_model(self):
-         """Load the pretrained model"""
          if os.path.exists(self.model_path):
              self.model = tf.keras.models.load_model(self.model_path)
-             print(f"Model loaded from {self.model_path}")
          else:
              print(f"Model not found at {self.model_path}. Using dummy model.")
              # Create a dummy model for testing
              self.model = self._create_dummy_model()

-         # Initialize tokenizer - in production, this should be loaded from a saved tokenizer
-         self.tokenizer = Tokenizer(num_words=self.max_words)

      def _create_dummy_model(self):
          """Create a dummy model for testing purposes"""
          inputs = tf.keras.Input(shape=(self.max_length,))
-         x = tf.keras.layers.Embedding(self.max_words, 64, input_length=self.max_length)(inputs)
-         x = tf.keras.layers.LSTM(64)(x)
          outputs = tf.keras.layers.Dense(4, activation='softmax')(x)
          model = tf.keras.Model(inputs=inputs, outputs=outputs)
          model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
          return model

      def preprocess_text(self, text):
-         """Preprocess text for prediction"""
-         # Clean text
-         text = re.sub(r'[^\w\s]', '', text.lower())

          # Tokenize and pad
          sequences = self.tokenizer.texts_to_sequences([text])
@@ -50,7 +74,7 @@ class MLModel:
          return padded_sequences

      def predict(self, text):
-         """Predict the class of the text"""
          # Preprocess text
          preprocessed_text = self.preprocess_text(text)

@@ -61,4 +85,6 @@ class MLModel:
          predicted_class = np.argmax(prediction)
          confidence = float(prediction[predicted_class])

          return int(predicted_class), confidence

  import os

  class MLModel:
+     def __init__(self, model_path="model/best_model_LSTM.h5", max_length=100, max_words=20000):
          self.model_path = model_path
          self.max_length = max_length
          self.max_words = max_words

          self.load_model()

      def load_model(self):
+         """Load the pretrained model trained on Vietnamese social media data"""
          if os.path.exists(self.model_path):
              self.model = tf.keras.models.load_model(self.model_path)
+             print(f"Vietnamese toxicity model loaded from {self.model_path}")
          else:
              print(f"Model not found at {self.model_path}. Using dummy model.")
              # Create a dummy model for testing
              self.model = self._create_dummy_model()

+         # In production, this should be loaded from a saved tokenizer trained on Vietnamese data
+         # For Vietnamese text, we need a specialized tokenizer or a pre-tokenized approach
+         try:
+             tokenizer_path = "model/vietnamese_tokenizer.pkl"
+             if os.path.exists(tokenizer_path):
+                 import pickle
+                 with open(tokenizer_path, 'rb') as handle:
+                     self.tokenizer = pickle.load(handle)
+                 print(f"Vietnamese tokenizer loaded from {tokenizer_path}")
+             else:
+                 print("Tokenizer not found, initializing new one (for development only)")
+                 self.tokenizer = Tokenizer(num_words=self.max_words, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
+         except Exception as e:
+             print(f"Error loading tokenizer: {e}")
+             self.tokenizer = Tokenizer(num_words=self.max_words, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')

      def _create_dummy_model(self):
          """Create a dummy model for testing purposes"""
          inputs = tf.keras.Input(shape=(self.max_length,))
+         x = tf.keras.layers.Embedding(self.max_words, 128, input_length=self.max_length)(inputs)
+         x = tf.keras.layers.LSTM(128)(x)
          outputs = tf.keras.layers.Dense(4, activation='softmax')(x)
          model = tf.keras.Model(inputs=inputs, outputs=outputs)
          model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
          return model

      def preprocess_text(self, text):
+         """Preprocess Vietnamese text for prediction"""
+         # For Vietnamese, we need to maintain special characters and diacritical marks
+         # Only remove punctuation and normalize whitespace
+         text = re.sub(r'[.,;:!?()"\'\[\]/\\]', ' ', text.lower())
+         text = re.sub(r'\s+', ' ', text).strip()
+
+         # Use underthesea for Vietnamese tokenization if available
+         try:
+             from underthesea import word_tokenize
+             tokenized_text = word_tokenize(text, format="text")
+             text = tokenized_text
+         except ImportError:
+             # Fallback if underthesea is not available
+             pass

          # Tokenize and pad
          sequences = self.tokenizer.texts_to_sequences([text])

          return padded_sequences

      def predict(self, text):
+         """Predict the class of the Vietnamese text"""
          # Preprocess text
          preprocessed_text = self.preprocess_text(text)

          predicted_class = np.argmax(prediction)
          confidence = float(prediction[predicted_class])

+         # Map prediction to labels appropriate for Vietnamese content
+         # 0: clean, 1: offensive, 2: hate, 3: spam
          return int(predicted_class), confidence
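The tokenizer persistence pattern added to `load_model()` can be sketched without TensorFlow; here a plain word-index dict stands in for the Keras `Tokenizer` object, and the helper names are mine. The path `model/vietnamese_tokenizer.pkl` is the one the diff expects:

```python
import os
import pickle
import tempfile

def save_tokenizer(tokenizer, path):
    """Serialize a fitted tokenizer (any picklable object) to disk."""
    with open(path, "wb") as handle:
        pickle.dump(tokenizer, handle)

def load_tokenizer(path, fallback):
    """Load a saved tokenizer, or return the fallback (development only)."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    return fallback  # mirrors the diff's freshly-initialized Tokenizer fallback

# Round-trip demo with a stand-in vocabulary
vocab = {"bình": 1, "thường": 2, "spam": 3}
path = os.path.join(tempfile.mkdtemp(), "vietnamese_tokenizer.pkl")
save_tokenizer(vocab, path)
assert load_tokenizer(path, None) == vocab
```

Pickling the fitted `Tokenizer` after training is what guarantees that inference uses the same word index as training; re-initializing one at load time, as the development fallback does, produces empty sequences.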
backend/services/social_media.py CHANGED
@@ -231,4 +231,4 @@ class YouTubeAPI:
              videos.append(video)
          return videos

-         return []

              videos.append(video)
          return videos

+         return []
backend/utils/vector_utils.py CHANGED
@@ -22,15 +22,15 @@ def _get_vectorizer():
  def preprocess_text(text):
      """
-     Preprocess text for vectorization

      Args:
-         text (str): Raw text

      Returns:
          str: Preprocessed text
      """
-     # Convert to lowercase
      text = text.lower()

      # Remove URLs
@@ -39,13 +39,22 @@ def preprocess_text(text):
      # Remove HTML tags
      text = re.sub(r'<.*?>', '', text)

-     # Remove special characters and numbers
-     text = re.sub(r'[^\w\s]', '', text)
      text = re.sub(r'\d+', '', text)

      # Remove extra whitespace
      text = re.sub(r'\s+', ' ', text).strip()

      return text

  def extract_features(text):

  def preprocess_text(text):
      """
+     Preprocess Vietnamese text for vectorization

      Args:
+         text (str): Raw Vietnamese text

      Returns:
          str: Preprocessed text
      """
+     # Convert to lowercase (preserving Vietnamese diacritical marks)
      text = text.lower()

      # Remove URLs

      # Remove HTML tags
      text = re.sub(r'<.*?>', '', text)

+     # For Vietnamese text, we need to preserve diacritical marks
+     # Only remove punctuation that doesn't affect meaning
+     text = re.sub(r'[.,;:!?()"\'\[\]/\\]', ' ', text)
      text = re.sub(r'\d+', '', text)

      # Remove extra whitespace
      text = re.sub(r'\s+', ' ', text).strip()

+     # Use Vietnamese-specific tokenization if available
+     try:
+         from underthesea import word_tokenize
+         text = word_tokenize(text, format="text")
+     except ImportError:
+         # Fallback if underthesea is not available
+         pass
+
      return text

  def extract_features(text):
requirements.txt CHANGED
@@ -1,4 +1,3 @@
- # requirements.txt
  fastapi==0.104.0
  uvicorn==0.23.2
  sqlalchemy==2.0.22
@@ -16,4 +15,8 @@ tensorflow==2.14.0
  python-dotenv==1.0.0
  httpx==0.25.0
  gunicorn==21.2.0
- pytest==7.4.2

  fastapi==0.104.0
  uvicorn==0.23.2
  sqlalchemy==2.0.22

  python-dotenv==1.0.0
  httpx==0.25.0
  gunicorn==21.2.0
+ pytest==7.4.2
+ underthesea==6.7.0    # For Vietnamese word tokenization
+ langdetect==1.0.9     # For language detection
+ transformers==4.35.0  # For multilingual models (optional)
+ pyvi==0.1.1           # Vietnamese language processing
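`underthesea` is imported defensively throughout the diff (the code must still run when it is absent). A minimal sketch of that optional-dependency pattern; the function name is mine:

```python
def tokenize_vi(text):
    """Tokenize Vietnamese text with underthesea when installed,
    otherwise fall back to the raw text, as the diff does."""
    try:
        from underthesea import word_tokenize  # optional dependency
        return word_tokenize(text, format="text")
    except ImportError:
        return text  # basic fallback: leave the text untokenized

result = tokenize_vi("xin chào các bạn")
```

Either branch returns a plain string, so downstream vectorization code does not need to know whether the library was available.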