anpha@DESKTOP-IT4F327 committed on
Commit 2989a5c · 1 Parent(s): e965645
README.md CHANGED
@@ -1,157 +1,174 @@
- ---
- title: My Hugging Face Space
- emoji: 🚀
- colorFrom: blue
- colorTo: purple
- sdk: streamlit
- sdk_version: "1.25.0"
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at [Hugging Face Spaces Config](https://huggingface.co/docs/hub/spaces-config-reference).
-
- # Social Media Toxicity Detector
-
- A browser extension that detects toxic, offensive, hate speech, and spam content on social media platforms using a machine learning model.
-
- ## Features
-
- - Detection of toxic content on Facebook, Twitter, and YouTube
- - Classification into 4 categories: Clean (0), Offensive (1), Hate Speech (2), and Spam (3)
- - Real-time content scanning on social media platforms
- - Manual text analysis
- - Admin dashboard for content monitoring and analytics
- - User role-based access control
- - Comment log and history tracking
-
- ## Project Structure
-
- The project is organized into two main components:
-
- 1. **Backend API**: FastAPI-based REST API for model inference, user management, and data storage
- 2. **Browser Extension**: Chrome extension for content detection and user interface
-
- ## Backend Setup
-
- ### Prerequisites
-
- - Python 3.9+
- - PostgreSQL with pgvector extension
- - Virtual environment (recommended)
-
- ### Installation
-
- 1. Clone the repository:
- ```bash
- git clone https://github.com/yourusername/social-media-toxicity-detector.git
- cd social-media-toxicity-detector
- ```
-
- 2. Create and activate a virtual environment:
- ```bash
- python -m venv venv
- source venv/bin/activate  # On Windows: venv\Scripts\activate
- ```
-
- 3. Install dependencies:
- ```bash
- pip install -r requirements.txt
- ```
-
- 4. Set up environment variables by creating a `.env` file:
- ```
- # API Configuration
- SECRET_KEY=your-secret-key-here
- ACCESS_TOKEN_EXPIRE_MINUTES=30
-
- # Database Configuration
- POSTGRES_SERVER=localhost
- POSTGRES_USER=postgres
- POSTGRES_PASSWORD=postgres
- POSTGRES_DB=toxicity_detector
- POSTGRES_PORT=5432
-
- # ML Model Configuration
- MODEL_PATH=model/toxicity_detector.h5
- HUGGINGFACE_API_URL=https://api-inference.huggingface.co/models/your-model-endpoint
- HUGGINGFACE_API_TOKEN=your-huggingface-token
-
- # Social Media APIs
- FACEBOOK_API_KEY=your-facebook-api-key
- TWITTER_API_KEY=your-twitter-api-key
- YOUTUBE_API_KEY=your-youtube-api-key
- ```
-
- 5. Initialize the database:
- ```bash
- alembic revision --autogenerate -m "Initial migration"
- alembic upgrade head
- ```
-
- 6. Start the API server:
- ```bash
- uvicorn backend.main:app --reload
- ```
-
- ### API Documentation
-
- Once the server is running, you can access the API documentation at:
- - Swagger UI: http://localhost:8000/docs
- - ReDoc: http://localhost:8000/redoc
-
- ## Extension Setup
-
- 1. Navigate to the extension directory:
- ```bash
- cd extension
- ```
-
- 2. Configure the API endpoint in `background.js`:
- ```javascript
- const API_BASE_URL = 'http://localhost:8000/api'; // Change to your actual API endpoint
  ```
-
- 3. Install the extension in Chrome:
-    - Open Chrome and navigate to `chrome://extensions/`
-    - Enable "Developer mode"
-    - Click "Load unpacked" and select the `extension` directory
-
- ## Usage
-
- 1. After installing the extension, click on the extension icon in the toolbar
- 2. Log in with your credentials
- 3. Visit Facebook, Twitter, or YouTube to activate content scanning
- 4. Use the extension popup to scan pages manually or analyze specific text
- 5. Access the admin dashboard at `http://localhost:8000/admin` (requires admin login)
-
  ## Model Training
-
- The toxicity detection model was trained using a dataset with 4 labels:
- - 0: Clean content
- - 1: Offensive content
- - 2: Hate speech
- - 3: Spam
-
- The model file (.h5) should be placed in the `model` directory or served via Hugging Face API.
-
- ## Database Schema
-
- The system uses PostgreSQL with pgvector extension for vector similarity search:
-
- - **Users**: User accounts with role-based permissions
- - **Roles**: User roles (admin, moderator, user)
- - **Comments**: Detected comments with classification results and vector embeddings
- - **Logs**: System activity logs
-
- ## Security Features
-
- - JWT authentication
- - Role-based access control
  - Password hashing with bcrypt
- - Request logging
- - Input validation and sanitization
-
  ## License
-
- [MIT License](LICENSE)
+ # Toxic Language Detector
+
+ A comprehensive system for detecting toxic language on social media platforms (Facebook, YouTube, Twitter), implemented as a browser extension with a FastAPI backend.
+
+ ## Project Overview
+
+ This project detects and analyzes toxic language in social media comments using a machine learning model trained on a large dataset. The system classifies comments into four categories:
+
+ - 0: Clean (non-toxic)
+ - 1: Offensive
+ - 2: Hate speech
+ - 3: Spam
+
+ The project consists of two main components:
+
+ 1. **Backend API**: A FastAPI application that handles ML model inference and data storage, and provides endpoints for both the extension and admin users.
+ 2. **Browser Extension**: A Chrome extension that scans comments on supported social media platforms and highlights toxic content.
+
+ ## Backend Architecture
+
+ ### Core Components
+
+ - **FastAPI Application**: The main web framework serving the API endpoints
+ - **Machine Learning Model**: LSTM-based model for toxic language classification
+ - **Database**: SQLAlchemy ORM with SQLite/PostgreSQL for data storage
+ - **Authentication**: JWT-based token authentication for API access
+
+ ### Directory Structure
+
+ ```
+ TOXIC-LANGUAGE-DETECTORV1/
+ │── backend/
+ │   ├── api/
+ │   │   ├── models/      # Pydantic models for API requests/responses
+ │   │   ├── routes/      # API endpoints
+ │   ├── config/          # Configuration settings
+ │   ├── core/            # Core functionality (auth, dependencies)
+ │   ├── db/              # Database models and connection
+ │   │   ├── models/      # SQLAlchemy models
+ │   ├── services/        # Service layer (ML model, social media APIs)
+ │   ├── utils/           # Utility functions
+ │── model/               # ML model files
+ │── app.py               # Main entry point
+ │── requirements.txt     # Dependencies
+ │── Dockerfile           # Container configuration
+ ```
+
+ ### Database Schema
+
+ The database consists of the following main tables:
+
+ 1. **User**: Stores user information and authentication data
+ 2. **Role**: Defines user roles (admin, user)
+ 3. **Comment**: Stores analyzed comments with their predictions and vector representations
+ 4. **Log**: Records API access and system events
+
+ ### API Endpoints
+
+ The backend provides two main sets of endpoints:
+
+ 1. **Extension Endpoints**:
+    - `/extension/detect`: Analyzes comment text sent from the browser extension
+
+ 2. **API Endpoints**:
+    - Authentication: `/auth/register`, `/auth/token`
+    - Admin: `/admin/users`, `/admin/comments`, `/admin/logs`
+    - Prediction: `/predict/single`, `/predict/batch`
+    - Analysis: `/detect/similar`, `/detect/statistics`
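The `/extension/detect` endpoint is the one the extension actually calls. A minimal client sketch using only the standard library; the base URL and API key below are placeholders (the `X-API-Key` header and the `test-api-key` default come from `app.py`):

```python
import json
import urllib.request

API_BASE = "http://localhost:7860"  # placeholder; use your Space URL in practice
API_KEY = "test-api-key"            # default from app.py; set API_KEY env var in production

def build_detect_request(text, platform="facebook"):
    """Build a POST request for /extension/detect with the required headers."""
    payload = json.dumps({"text": text, "platform": platform}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/extension/detect",
        data=payload,
        headers={"Content-Type": "application/json", "X-API-Key": API_KEY},
        method="POST",
    )

req = build_detect_request("bình luận cần kiểm tra")
# urllib.request.urlopen(req) would then return a JSON body with the fields
# text, prediction, confidence, and prediction_text.
```

The actual `urlopen` call is omitted since it needs a running server; the helper only assembles the request.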
 
70
+ ## Browser Extension
71
 
72
+ ### Features
 
 
73
 
74
+ - Real-time comment analysis on Facebook, YouTube, and Twitter
75
+ - Visual indicators for toxic comments with different colors based on toxicity type
76
+ - Option to blur highly toxic content with a reveal button
77
+ - Configurable settings through a popup interface
78
+ - Statistics tracking for scanned comments
79
 
80
+ ### Components
81
+
82
+ - **Background Script**: Handles API communication and manages extension state
83
+ - **Content Script**: Analyzes comments on supported websites
84
+ - **Popup Interface**: User-friendly settings panel
85
 
86
+ ### Directory Structure
87
+
88
+ ```
89
+ EXTENSION/
90
+ │── icons/ # Extension icons
91
+ │── popup/ # Popup interface files
92
+ │ ├── popup.css
93
+ │ ├── popup.html
94
+ │ ├── popup.js
95
+ │── background.js # Background script
96
+ │── content.js # Content script for analyzing comments
97
+ │── manifest.json # Extension configuration
98
+ │── styles.css # CSS for content modifications
99
  ```
100
 
101
+ ## Setup and Deployment
102
+
103
+ ### Backend Setup
104
+
105
+ 1. Clone the repository
106
+ 2. Install dependencies: `pip install -r requirements.txt`
107
+ 3. Set up environment variables:
108
+ ```
109
+ export SECRET_KEY="your-secret-key"
110
+ export DATABASE_URL="sqlite:///./toxic_detector.db"
111
+ export EXTENSION_API_KEY="your-extension-api-key"
112
+ ```
113
+ 4. Run the application: `uvicorn app:app --reload`
114
+
115
+ ### Hugging Face Space Deployment
116
+
117
+ 1. Create a new Space on Hugging Face
118
+ 2. Upload the project files
119
+ 3. Configure the environment variables
120
+ 4. Set the Space to use FastAPI template
121
 
122
+ ### Extension Setup
123
 
124
+ 1. Open Chrome and navigate to `chrome://extensions/`
125
+ 2. Enable Developer Mode
126
+ 3. Click "Load unpacked" and select the EXTENSION directory
127
+ 4. Configure the extension API endpoint in the popup settings
 
128
 
129
  ## Model Training
130
 
131
+ The toxic language detection model was trained on a large dataset with four classification labels. The model architecture is based on LSTM (Long Short-Term Memory) networks, which are effective for sequence classification tasks like text analysis.
 
 
 
 
132
 
133
+ ### Model Architecture
134
 
135
+ - Embedding layer
136
+ - LSTM layer
137
+ - Dense output layer with softmax activation
138
+ - Trained with categorical cross-entropy loss
139
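Downstream of that softmax layer, the model's output vector is reduced to a class, a confidence score, and a label. A framework-free sketch of that post-processing (the `interpret` helper and English `LABELS` names are illustrative; they mirror the argmax/confidence handling in `app.py`):

```python
# Illustrative label names; app.py uses Vietnamese equivalents for the same classes.
LABELS = {0: "clean", 1: "offensive", 2: "hate speech", 3: "spam"}

def interpret(probs):
    """probs: a 4-way softmax output, one float per class, summing to ~1."""
    predicted = max(range(len(probs)), key=lambda i: probs[i])  # argmax
    return predicted, probs[predicted], LABELS[predicted]

cls, conf, label = interpret([0.05, 0.10, 0.80, 0.05])
# cls == 2, conf == 0.80, label == "hate speech"
```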
 
+ ## Data Flow
+
+ 1. User visits a social media platform
+ 2. The extension scans comments on the page
+ 3. Comments are sent to the backend API
+ 4. The API processes comments using the ML model
+ 5. Results are returned to the extension
+ 6. The extension highlights toxic comments
+ 7. Comment data is stored in the database for analysis
+
+ ## Security Considerations
+
+ - JWT token authentication for API endpoints
+ - API key authentication for the extension
  - Password hashing with bcrypt
+ - CORS protection
+ - Request logging for monitoring
+
+ ## Future Improvements
+
+ - Add more social media platforms
+ - Implement a user feedback mechanism to improve the model
+ - Add multi-language support
+ - Develop a dashboard for analytics
+ - Implement more advanced NLP techniques
+
  ## License
+
+ This project is for research purposes only.
+
+ ## Acknowledgements
+
+ - TensorFlow team for the ML framework
+ - FastAPI for the backend framework
+ - Chrome Extensions API
app.py CHANGED
@@ -1,40 +1,269 @@
- # app.py - Main entry point for the FastAPI application
-
- from fastapi import FastAPI, Depends
  from fastapi.middleware.cors import CORSMiddleware
- from api.routes import admin, auth, extension, prediction, toxic_detection
- from core.middleware import LogMiddleware
- from db.models.base import Base
- from db.models.user import engine
- import uvicorn
-
  app = FastAPI(
      title="Toxic Language Detector API",
      description="API for detecting toxic language in social media comments",
-     version="1.0.0"
  )
-
- # Configure CORS
  app.add_middleware(
      CORSMiddleware,
-     allow_origins=["*"],  # Update this with specific origins in production
      allow_credentials=True,
      allow_methods=["*"],
      allow_headers=["*"],
  )
-
- # Add custom middleware
- app.add_middleware(LogMiddleware)
-
- # Include routers
- app.include_router(auth.router, prefix="/auth", tags=["Authentication"])
- app.include_router(admin.router, prefix="/admin", tags=["Admin"])
- app.include_router(extension.router, prefix="/extension", tags=["Extension"])
- app.include_router(prediction.router, prefix="/predict", tags=["Prediction"])
- app.include_router(toxic_detection.router, prefix="/detect", tags=["Toxic Detection"])
-
- # Create database tables
- Base.metadata.create_all(bind=engine)
-
  if __name__ == "__main__":
-     uvicorn.run("app:app", host="0.0.0.0", port=8000, reload=True)
+ # app.py - Hugging Face Space Entry Point
+ import os
+ import sys
+ import gradio as gr
+ from fastapi import FastAPI, HTTPException, Depends, status, Request
  from fastapi.middleware.cors import CORSMiddleware
+ from fastapi.responses import HTMLResponse, JSONResponse
+ from fastapi.staticfiles import StaticFiles
+ from pydantic import BaseModel
+ from typing import List, Dict, Any, Optional
+ import tensorflow as tf
+ import numpy as np
+ from sklearn.feature_extraction.text import TfidfVectorizer
+ import re
+
+ # Define the FastAPI application
  app = FastAPI(
      title="Toxic Language Detector API",
      description="API for detecting toxic language in social media comments",
+     version="1.0.0",
  )
+
+ # CORS configuration
  app.add_middleware(
      CORSMiddleware,
+     allow_origins=["*"],
      allow_credentials=True,
      allow_methods=["*"],
      allow_headers=["*"],
  )
+
+ # API models
+ class PredictionRequest(BaseModel):
+     text: str
+     platform: Optional[str] = "unknown"
+     platform_id: Optional[str] = None
+     metadata: Optional[Dict[str, Any]] = None
+
+ class PredictionResponse(BaseModel):
+     text: str
+     prediction: int
+     confidence: float
+     prediction_text: str
+
+ # Load ML model
+ class ToxicDetectionModel:
+     def __init__(self):
+         # Load or create model trained on Vietnamese social media data
+         try:
+             self.model = tf.keras.models.load_model("model/best_model_LSTM.h5")
+             print("Vietnamese toxicity model loaded successfully")
+         except Exception as e:
+             print(f"Error loading model: {e}")
+             print("Creating a dummy model for demonstration")
+             self.model = self._create_dummy_model()
+
+         # Initialize vectorizer for Vietnamese text
+         # Vietnamese doesn't use the same stop words as English
+         self.vectorizer = TfidfVectorizer(
+             max_features=10000,
+             stop_words=None,     # Don't use English stop words
+             ngram_range=(1, 3),  # Use 1-3 grams for better Vietnamese phrase capture
+         )
+
+         # Map predictions to text labels (in Vietnamese)
+         self.label_mapping = {
+             0: "bình thường",  # clean
+             1: "xúc phạm",     # offensive
+             2: "thù ghét",     # hate
+             3: "spam"          # spam
+         }
+
+         # Load Vietnamese tokenizer if available
+         try:
+             # Try to load underthesea for Vietnamese NLP
+             import importlib.util
+             if importlib.util.find_spec("underthesea"):
+                 from underthesea import word_tokenize
+                 self.has_vietnamese_nlp = True
+                 print("Vietnamese NLP library loaded successfully")
+             else:
+                 self.has_vietnamese_nlp = False
+                 print("Vietnamese NLP library not found, using basic tokenization")
+         except Exception:
+             self.has_vietnamese_nlp = False
+
+     def _create_dummy_model(self):
+         # Create a simple model for demonstration
+         inputs = tf.keras.Input(shape=(10000,))
+         x = tf.keras.layers.Dense(128, activation='relu')(inputs)
+         x = tf.keras.layers.Dropout(0.3)(x)
+         outputs = tf.keras.layers.Dense(4, activation='softmax')(x)
+         model = tf.keras.Model(inputs=inputs, outputs=outputs)
+         model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
+         return model
+
+     def preprocess_text(self, text):
+         # Clean text while preserving Vietnamese diacritical marks
+         text = text.lower()
+         text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs
+         text = re.sub(r'<.*?>', '', text)  # Remove HTML tags
+
+         # For Vietnamese, preserve diacritical marks and only remove punctuation
+         text = re.sub(r'[.,;:!?()"\'\[\]/\\]', ' ', text)
+         text = re.sub(r'\s+', ' ', text).strip()  # Remove extra whitespace
+
+         # Use Vietnamese tokenization if available
+         if self.has_vietnamese_nlp:
+             try:
+                 from underthesea import word_tokenize
+                 text = word_tokenize(text, format="text")
+             except Exception as e:
+                 print(f"Error in Vietnamese tokenization: {e}")
+
+         # Vectorize
+         if not hasattr(self.vectorizer, 'vocabulary_'):
+             self.vectorizer.fit([text])
+
+         features = self.vectorizer.transform([text]).toarray()
+         return features
+
+     def predict(self, text):
+         # Preprocess text
+         features = self.preprocess_text(text)
+
+         # Make prediction
+         predictions = self.model.predict(features)[0]
+
+         # Get most likely class and confidence
+         predicted_class = np.argmax(predictions)
+         confidence = float(predictions[predicted_class])
+
+         return int(predicted_class), confidence, self.label_mapping[int(predicted_class)]
+
+ # Initialize model
+ model = ToxicDetectionModel()
+
+ # API Key validation
+ API_KEY = os.environ.get("API_KEY", "test-api-key")
+
+ def verify_api_key(request: Request):
+     api_key = request.headers.get("X-API-Key")
+     if not api_key or api_key != API_KEY:
+         raise HTTPException(
+             status_code=status.HTTP_401_UNAUTHORIZED,
+             detail="Invalid API Key",
+         )
+     return api_key
+
+ # API routes
+ @app.post("/extension/detect", response_model=PredictionResponse)
+ async def detect_toxic_language(
+     request: PredictionRequest,
+     api_key: str = Depends(verify_api_key)
+ ):
+     try:
+         # Make prediction
+         prediction_class, confidence, prediction_text = model.predict(request.text)
+
+         # Return response
+         return {
+             "text": request.text,
+             "prediction": prediction_class,
+             "confidence": confidence,
+             "prediction_text": prediction_text
+         }
+     except Exception as e:
+         raise HTTPException(
+             status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+             detail=f"Error processing request: {str(e)}"
+         )
+
+ @app.get("/", response_class=HTMLResponse)
+ async def root():
+     return """
+     <html>
+     <head>
+         <title>Toxic Language Detector API</title>
+         <style>
+             body {
+                 font-family: Arial, sans-serif;
+                 max-width: 800px;
+                 margin: 0 auto;
+                 padding: 20px;
+             }
+             h1 {
+                 color: #333;
+             }
+             .endpoint {
+                 margin-bottom: 20px;
+                 padding: 10px;
+                 border: 1px solid #ddd;
+                 border-radius: 5px;
+             }
+             .method {
+                 display: inline-block;
+                 padding: 3px 6px;
+                 background-color: #2196F3;
+                 color: white;
+                 border-radius: 3px;
+                 font-size: 14px;
+             }
+             pre {
+                 background-color: #f5f5f5;
+                 padding: 10px;
+                 border-radius: 5px;
+                 overflow-x: auto;
+             }
+         </style>
+     </head>
+     <body>
+         <h1>Toxic Language Detector API</h1>
+         <p>This API provides endpoints for detecting toxic language in text.</p>
+
+         <div class="endpoint">
+             <span class="method">POST</span> <strong>/extension/detect</strong>
+             <p>Analyzes text for toxic language and returns the prediction.</p>
+             <h4>Request</h4>
+             <pre>
+ {
+     "text": "Your text to analyze",
+     "platform": "facebook",
+     "platform_id": "optional-id",
+     "metadata": {}
+ }
+             </pre>
+             <h4>Response</h4>
+             <pre>
+ {
+     "text": "Your text to analyze",
+     "prediction": 0,
+     "confidence": 0.95,
+     "prediction_text": "clean"
+ }
+             </pre>
+             <p>Prediction values: 0 (clean), 1 (offensive), 2 (hate), 3 (spam)</p>
+         </div>
+
+         <p>For more information, check the <a href="/docs">API documentation</a>.</p>
+     </body>
+     </html>
+     """
+
+ # Gradio interface
+ def predict_toxic(text):
+     prediction_class, confidence, prediction_text = model.predict(text)
+
+     # Format response
+     result = f"Prediction: {prediction_text.capitalize()} (Class {prediction_class})\n"
+     result += f"Confidence: {confidence:.2f}"
+
+     return result
+
+ # Create Gradio interface
+ interface = gr.Interface(
+     fn=predict_toxic,
+     inputs=gr.Textbox(lines=5, placeholder="Enter text to analyze for toxic language..."),
+     outputs="text",
+     title="Toxic Language Detector",
+     description="Detects whether text contains toxic language. Classes: 0 (clean), 1 (offensive), 2 (hate), 3 (spam)."
+ )
+
+ # Mount Gradio app
+ app = gr.mount_gradio_app(app, interface, path="/gradio")
+
+ # For direct Hugging Face Space usage
  if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=7860)
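The cleaning steps in the new `preprocess_text` can be exercised on their own. A standalone sketch of the same regex pipeline (the function name `clean_vietnamese` is mine; the regexes are the ones from the diff):

```python
import re

def clean_vietnamese(text):
    """Lowercase, strip URLs and HTML tags, replace punctuation with spaces,
    and collapse whitespace, preserving Vietnamese diacritics throughout."""
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'<.*?>', '', text)                  # remove HTML tags
    text = re.sub(r'[.,;:!?()"\'\[\]/\\]', ' ', text)  # punctuation -> space
    return re.sub(r'\s+', ' ', text).strip()           # collapse whitespace

print(clean_vietnamese('Xem ngay: https://example.com <b>Tuyệt vời!</b>'))
# → xem ngay tuyệt vời
```

Note that the diacritics (`ệ`, `ờ`) survive intact, which is the point of swapping out the ASCII-oriented `[^\w\s]` cleanup used elsewhere.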
backend/services/ml_model.py CHANGED
@@ -7,7 +7,7 @@ import re
  import os

  class MLModel:
-     def __init__(self, model_path="model/best_model_LSTM.h5", max_length=100, max_words=10000):
          self.model_path = model_path
          self.max_length = max_length
          self.max_words = max_words
@@ -16,32 +16,56 @@ class MLModel:
          self.load_model()

      def load_model(self):
-         """Load the pretrained model"""
          if os.path.exists(self.model_path):
              self.model = tf.keras.models.load_model(self.model_path)
-             print(f"Model loaded from {self.model_path}")
          else:
              print(f"Model not found at {self.model_path}. Using dummy model.")
              # Create a dummy model for testing
              self.model = self._create_dummy_model()

-         # Initialize tokenizer - in production, this should be loaded from a saved tokenizer
-         self.tokenizer = Tokenizer(num_words=self.max_words)

      def _create_dummy_model(self):
          """Create a dummy model for testing purposes"""
          inputs = tf.keras.Input(shape=(self.max_length,))
-         x = tf.keras.layers.Embedding(self.max_words, 64, input_length=self.max_length)(inputs)
-         x = tf.keras.layers.LSTM(64)(x)
          outputs = tf.keras.layers.Dense(4, activation='softmax')(x)
          model = tf.keras.Model(inputs=inputs, outputs=outputs)
          model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
          return model

      def preprocess_text(self, text):
-         """Preprocess text for prediction"""
-         # Clean text
-         text = re.sub(r'[^\w\s]', '', text.lower())

          # Tokenize and pad
          sequences = self.tokenizer.texts_to_sequences([text])
@@ -50,7 +74,7 @@ class MLModel:
          return padded_sequences

      def predict(self, text):
-         """Predict the class of the text"""
          # Preprocess text
          preprocessed_text = self.preprocess_text(text)

@@ -61,4 +85,6 @@ class MLModel:
          predicted_class = np.argmax(prediction)
          confidence = float(prediction[predicted_class])

          return int(predicted_class), confidence

  import os

  class MLModel:
+     def __init__(self, model_path="model/best_model_LSTM.h5", max_length=100, max_words=20000):
          self.model_path = model_path
          self.max_length = max_length
          self.max_words = max_words

          self.load_model()

      def load_model(self):
+         """Load the pretrained model trained on Vietnamese social media data"""
          if os.path.exists(self.model_path):
              self.model = tf.keras.models.load_model(self.model_path)
+             print(f"Vietnamese toxicity model loaded from {self.model_path}")
          else:
              print(f"Model not found at {self.model_path}. Using dummy model.")
              # Create a dummy model for testing
              self.model = self._create_dummy_model()

+         # In production, this should be loaded from a saved tokenizer trained on Vietnamese data
+         # For Vietnamese text, we need a specialized tokenizer or a pre-tokenized approach
+         try:
+             tokenizer_path = "model/vietnamese_tokenizer.pkl"
+             if os.path.exists(tokenizer_path):
+                 import pickle
+                 with open(tokenizer_path, 'rb') as handle:
+                     self.tokenizer = pickle.load(handle)
+                 print(f"Vietnamese tokenizer loaded from {tokenizer_path}")
+             else:
+                 print("Tokenizer not found, initializing new one (for development only)")
+                 self.tokenizer = Tokenizer(num_words=self.max_words, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
+         except Exception as e:
+             print(f"Error loading tokenizer: {e}")
+             self.tokenizer = Tokenizer(num_words=self.max_words, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')

      def _create_dummy_model(self):
          """Create a dummy model for testing purposes"""
          inputs = tf.keras.Input(shape=(self.max_length,))
+         x = tf.keras.layers.Embedding(self.max_words, 128, input_length=self.max_length)(inputs)
+         x = tf.keras.layers.LSTM(128)(x)
          outputs = tf.keras.layers.Dense(4, activation='softmax')(x)
          model = tf.keras.Model(inputs=inputs, outputs=outputs)
          model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
          return model

      def preprocess_text(self, text):
+         """Preprocess Vietnamese text for prediction"""
+         # For Vietnamese, we need to maintain special characters and diacritical marks
+         # Only remove punctuation and normalize whitespace
+         text = re.sub(r'[.,;:!?()"\'\[\]/\\]', ' ', text.lower())
+         text = re.sub(r'\s+', ' ', text).strip()
+
+         # Use underthesea for Vietnamese tokenization if available
+         try:
+             from underthesea import word_tokenize
+             tokenized_text = word_tokenize(text, format="text")
+             text = tokenized_text
+         except ImportError:
+             # Fallback if underthesea is not available
+             pass

          # Tokenize and pad
          sequences = self.tokenizer.texts_to_sequences([text])

          return padded_sequences

      def predict(self, text):
+         """Predict the class of the Vietnamese text"""
          # Preprocess text
          preprocessed_text = self.preprocess_text(text)

          predicted_class = np.argmax(prediction)
          confidence = float(prediction[predicted_class])

+         # Map prediction to labels appropriate for Vietnamese content
+         # 0: clean, 1: offensive, 2: hate, 3: spam
          return int(predicted_class), confidence
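The tokenizer persistence pattern added to `load_model()` can be sketched without TensorFlow; here a plain word-index dict stands in for the Keras `Tokenizer` object, and the helper names are mine. The path `model/vietnamese_tokenizer.pkl` is the one the diff expects:

```python
import os
import pickle
import tempfile

def save_tokenizer(tokenizer, path):
    """Serialize a fitted tokenizer (any picklable object) to disk."""
    with open(path, "wb") as handle:
        pickle.dump(tokenizer, handle)

def load_tokenizer(path, fallback):
    """Load a saved tokenizer, or return the fallback (development only)."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    return fallback  # mirrors the diff's freshly-initialized Tokenizer fallback

# Round-trip demo with a stand-in vocabulary
vocab = {"bình": 1, "thường": 2, "spam": 3}
path = os.path.join(tempfile.mkdtemp(), "vietnamese_tokenizer.pkl")
save_tokenizer(vocab, path)
assert load_tokenizer(path, None) == vocab
```

Pickling the fitted `Tokenizer` after training is what guarantees that inference uses the same word index as training; re-initializing one at load time, as the development fallback does, produces empty sequences.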
backend/services/social_media.py CHANGED
@@ -231,4 +231,4 @@ class YouTubeAPI:
              videos.append(video)
          return videos

-         return []

              videos.append(video)
          return videos

+         return []
backend/utils/vector_utils.py CHANGED
@@ -22,15 +22,15 @@ def _get_vectorizer():
  def preprocess_text(text):
      """
-     Preprocess text for vectorization

      Args:
-         text (str): Raw text

      Returns:
          str: Preprocessed text
      """
-     # Convert to lowercase
      text = text.lower()

      # Remove URLs
@@ -39,13 +39,22 @@ def preprocess_text(text):
      # Remove HTML tags
      text = re.sub(r'<.*?>', '', text)

-     # Remove special characters and numbers
-     text = re.sub(r'[^\w\s]', '', text)
      text = re.sub(r'\d+', '', text)

      # Remove extra whitespace
      text = re.sub(r'\s+', ' ', text).strip()

      return text

  def extract_features(text):

  def preprocess_text(text):
      """
+     Preprocess Vietnamese text for vectorization

      Args:
+         text (str): Raw Vietnamese text

      Returns:
          str: Preprocessed text
      """
+     # Convert to lowercase (preserving Vietnamese diacritical marks)
      text = text.lower()

      # Remove URLs

      # Remove HTML tags
      text = re.sub(r'<.*?>', '', text)

+     # For Vietnamese text, we need to preserve diacritical marks
+     # Only remove punctuation that doesn't affect meaning
+     text = re.sub(r'[.,;:!?()"\'\[\]/\\]', ' ', text)
      text = re.sub(r'\d+', '', text)

      # Remove extra whitespace
      text = re.sub(r'\s+', ' ', text).strip()

+     # Use Vietnamese-specific tokenization if available
+     try:
+         from underthesea import word_tokenize
+         text = word_tokenize(text, format="text")
+     except ImportError:
+         # Fallback if underthesea is not available
+         pass
+
      return text

  def extract_features(text):
requirements.txt CHANGED
@@ -1,4 +1,3 @@
- # requirements.txt
  fastapi==0.104.0
  uvicorn==0.23.2
  sqlalchemy==2.0.22
@@ -16,4 +15,8 @@ tensorflow==2.14.0
  python-dotenv==1.0.0
  httpx==0.25.0
  gunicorn==21.2.0
- pytest==7.4.2

  fastapi==0.104.0
  uvicorn==0.23.2
  sqlalchemy==2.0.22

  python-dotenv==1.0.0
  httpx==0.25.0
  gunicorn==21.2.0
+ pytest==7.4.2
+ underthesea==6.7.0    # For Vietnamese word tokenization
+ langdetect==1.0.9     # For language detection
+ transformers==4.35.0  # For multilingual models (optional)
+ pyvi==0.1.1           # Vietnamese language processing
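`underthesea` is imported defensively throughout the diff (the code must still run when it is absent). A minimal sketch of that optional-dependency pattern; the function name is mine:

```python
def tokenize_vi(text):
    """Tokenize Vietnamese text with underthesea when installed,
    otherwise fall back to the raw text, as the diff does."""
    try:
        from underthesea import word_tokenize  # optional dependency
        return word_tokenize(text, format="text")
    except ImportError:
        return text  # basic fallback: leave the text untokenized

result = tokenize_vi("xin chào các bạn")
```

Either branch returns a plain string, so downstream vectorization code does not need to know whether the library was available.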