<div align="center">
<img src="docs/assets/signalmod_logo.png" alt="SignalMod" width="520" />
### Intelligent moderation for YouTube comments
**English** · [Español](README.es.md)










</div>
---
## Project description
**SignalMod** is an intelligent moderation assistant for YouTube comments. It automatically classifies each comment as **Safe** or **Toxic**, returns a probability between 0 and 1, and tags toxicity categories (insult, threat, identity hate, obscene content).
It is built around the team's **hybrid meta-feature stacking** model (frozen Toxic-BERT embeddings combined with metadata features and a regularised logistic regression), which reaches **F1 = 0.805** with a train-test gap of **2.54 pp** on the project's 200-sample test split.
The product ships as a FastAPI REST service plus a React SPA that mimics the YouTube Watch experience. Pick a video, and the API fetches the latest 50 comments via the YouTube Data API, scores them, and persists every prediction in Supabase so that any visitor can see the full history.
---
## Tools and languages
### Languages
- **Python 3.12**: backend, ML pipelines, evaluation.
- **TypeScript + React 18**: frontend SPA.
- **SQL (PostgreSQL via Supabase)**: predictions persistence.
### Backend
- **FastAPI 0.136**: REST API, Pydantic schemas, lifespan model loading.
- **Uvicorn**: ASGI server with hot reload.
- **scikit-learn 1.8**: TF-IDF baseline + meta-learner Logistic Regression.
- **Optuna**: hyperparameter search for the TF-IDF baseline.
- **PyTorch 2.x + Transformers 5.9**: frozen `unitary/toxic-bert` for CLS embeddings.
- **spaCy + NLTK**: lemmatisation, stopwords, regex-based cleanup.
- **MLflow**: experiment tracking.
- **Supabase Python SDK**: predictions persistence with anonymous RLS policies.
- **google-api-python-client**: YouTube Data API v3 integration.
### Frontend
- **React 18 + Vite 5 + TypeScript**: SPA with hot module reload.
- **CSS modules**: YouTube-like dark theme.
### Tooling and ops
- **uv**: Python package and venv manager (`pyproject.toml` + `uv.lock`).
- **pnpm**: frontend package manager.
- **Docker + Docker Compose**: single-container deploy serving API + built SPA.
- **GNU Make**: `make dev`, `make install`, `make build`, `make docker`.
- **Render**: free-tier deploy via `render.yaml` blueprint.
- **Pytest**: unit tests for API contracts and preprocessing.
---
## Project architecture
```
Project_9_Equipo3/
├── configs/                          # YAML configs for pipelines and inference catalog
│   ├── pipeline.yaml                 # Training data paths, target columns, CV folds
│   ├── features.yaml                 # Preprocessing and TF-IDF settings
│   ├── model_catalog.yaml            # Inference catalog (3 swappable models)
│   ├── best_params.yaml              # Optuna winner for the LR baseline
│   ├── suggested_videos.yaml         # YouTube IDs shown in the Up-next rail
│   └── *_training.yaml               # Training profiles (golden baseline, expert, hybrid, …)
├── data/                             # Raw and processed datasets (git-ignored)
├── docs/                             # API.md, PIPELINE.md, ARCHITECTURE.md, DEPLOY.md
│   └── assets/signalmod_logo.png     # Brand assets
├── frontend/                         # React + Vite SPA
│   ├── public/signalmod_logo.png     # Logo served as static asset
│   └── src/
│       ├── api/                      # Typed HTTP client
│       ├── components/               # Layout, CommentRow, SuggestedRail, ModelBanner
│       ├── context/                  # Global app state (active model, threshold)
│       ├── hooks/                    # useDebouncedPredict
│       ├── pages/                    # WatchPage, HubPage, SettingsPage
│       └── utils/                    # toxicityColor, randomUsername, relativeTime
├── models/
│   ├── baseline/lr_tfidf.joblib      # Optuna-tuned LR baseline
│   └── production_final/             # meta_stack_final.joblib (production artifact)
├── notebooks/
│   ├── 01-04                         # EDA, preprocessing, TF-IDF, baseline LR
│   ├── 12                            # Golden baseline (frozen Toxic-BERT)
│   ├── 14                            # Final meta-stacking (production artifact)
│   └── archive_attempts/             # Earlier experiments preserved for reproducibility
├── reports/                          # Metrics, plots, EDA figures, summary.csv
├── src/
│   ├── api/                          # FastAPI app
│   │   ├── main.py                   # Lifespan, CORS, static SPA mount
│   │   ├── routes/                   # health, models, predict (+ /predictions), videos
│   │   ├── schemas.py                # Pydantic request/response models
│   │   ├── services.py               # predict_single, to_predict_response
│   │   ├── state.py                  # Shared app state
│   │   └── youtube.py                # YouTube Data API fetch + suggested metadata
│   ├── data/                         # Loader, dual loader for hybrid pipelines
│   ├── db/                           # Supabase client + save_prediction helpers
│   ├── evaluation/                   # Evaluator, threshold tuning, stable CV
│   ├── experiments/                  # Notebook 13 / 14 script versions
│   ├── features/                     # text_preprocessor, vectorizer, metadata, augmentation
│   ├── models/                       # baseline (LR/RF/XGBoost), hybrid_ensemble, metadata_lr
│   ├── pipeline/                     # run_pipeline + per-strategy variants
│   ├── service/                      # ModelService, meta_stack_predictor, model_catalog
│   └── utils/                        # Logger
├── supabase/predictions_setup.sql    # SQL to create the predictions table + RLS policies
├── tests/                            # Pytest suite
├── Dockerfile                        # Multi-stage build (frontend + uv backend)
├── docker-compose.yml                # One-container deploy serving API + SPA
├── render.yaml                       # Render blueprint (web service + static site)
├── Procfile                          # Render process declaration
├── Makefile                          # make dev / install / build / docker / test
├── pyproject.toml + uv.lock          # Python dependencies pinned with uv
└── README.md / README.es.md          # English / Spanish documentation
```
### Data flow
```
┌────────────────────────────────────────────────┐
│  React SPA (Vite)         http://localhost:5173│
│  Layout · Watch · Hub · Settings               │
└──────────────────┬─────────────────────────────┘
                   │  HTTP JSON (Vite proxy → :8000)
┌──────────────────▼─────────────────────────────┐
│  FastAPI                  http://localhost:8000│
│  /predict  /predict-batch  /predict-video      │
│  /predictions  (GET → Supabase history)        │
│  /models  /models/select  /model-info          │
│  /videos/suggested  /health                    │
└──────┬─────────────────────────────┬───────────┘
       │                             │
┌──────▼─────────────────────┐ ┌─────▼──────────────────────┐
│  ModelService              │ │  YouTube Data API v3       │
│  · local joblib            │ │  · video metadata          │
│  · hf_remote               │ │  · 50 newest comments      │
│  · meta_stack (production) │ │                            │
└──────┬─────────────────────┘ └────────────────────────────┘
       │
┌──────▼──────────────────────────────────────────────────┐
│  Supabase (PostgreSQL)                                  │
│  table: predictions(id, created_at, text, video_id,     │
│                     probability, is_toxic, labels, …)   │
│  RLS: anon insert + anon select                         │
└─────────────────────────────────────────────────────────┘
```
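
To make the YouTube leg of the diagram more concrete, here is a minimal sketch of how a `commentThreads.list` response from the YouTube Data API v3 could be flattened into plain comment records before scoring. The function name `extract_comments`, the output field names, and the sample payload are assumptions for illustration; the project's own `src/api/youtube.py` may be organised differently.

```python
from typing import Any

MAX_COMMENTS = 50  # the API is described as fetching the latest 50 comments


def extract_comments(response: dict[str, Any]) -> list[dict[str, str]]:
    """Flatten a commentThreads.list response into simple comment records.

    Hypothetical helper: the output keys are chosen for illustration only.
    """
    comments = []
    for item in response.get("items", [])[:MAX_COMMENTS]:
        top_level = item["snippet"]["topLevelComment"]
        snippet = top_level["snippet"]
        comments.append(
            {
                "comment_id": top_level["id"],
                "author": snippet.get("authorDisplayName", ""),
                "text": snippet.get("textDisplay", ""),
                "published_at": snippet.get("publishedAt", ""),
            }
        )
    return comments


# Made-up payload that follows the commentThreads resource layout.
sample_response = {
    "items": [
        {
            "snippet": {
                "topLevelComment": {
                    "id": "abc123",
                    "snippet": {
                        "authorDisplayName": "viewer_one",
                        "textDisplay": "Great video!",
                        "publishedAt": "2026-01-01T00:00:00Z",
                    },
                }
            }
        }
    ]
}

for comment in extract_comments(sample_response):
    print(comment["author"], "->", comment["text"])
```

Keeping the parsing step separate from the network call makes it easy to unit-test with canned payloads, without an API key.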
### Model catalog (swappable from the UI)
| Model | Type | F1 (test) | Train-test gap | Threshold | Latency | Default |
| -------------------------------- | ----------- | --------- | -------------- | --------- | ------- | ------- |
| **Meta-Feature Stacking** | Hybrid | **0.805** | **2.54 pp** | **0.381** | ~400 ms | **Yes** |
| Frozen Toxic-BERT | Transformer | 0.790 | 0.16 pp | 0.120 | ~400 ms | No |
| LR + TF-IDF (Optuna) | sklearn | 0.758 | 4.76 pp | 0.500 | < 50 ms | No |
The production model concatenates the frozen `[CLS]` embedding from `unitary/toxic-bert` (768-d) with hand-crafted metadata features (length, uppercase ratio, emoji density…), scales them with `StandardScaler`, and feeds them into a `LogisticRegression(C=0.001)` meta-learner.
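
The stacking step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's training code: random vectors stand in for the real `[CLS]` embeddings, the `metadata_features` helper and its three features are hypothetical, and the emoji count is a crude proxy. Only the 768-d embedding size, `StandardScaler`, `LogisticRegression(C=0.001)`, and the 0.381 threshold come from this README.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

EMBED_DIM = 768    # size of the [CLS] vector from unitary/toxic-bert
THRESHOLD = 0.381  # decision threshold listed in the model catalog


def metadata_features(text: str) -> list[float]:
    """Hypothetical hand-crafted features: length, uppercase ratio, emoji density."""
    n = max(len(text), 1)
    uppercase_ratio = sum(ch.isupper() for ch in text) / n
    # Crude emoji proxy (assumption): characters outside the Basic Multilingual Plane.
    emoji_density = sum(ord(ch) > 0xFFFF for ch in text) / n
    return [float(len(text)), uppercase_ratio, emoji_density]


def build_matrix(embeddings: np.ndarray, texts: list[str]) -> np.ndarray:
    """Concatenate each embedding with its metadata features, row by row."""
    meta = np.array([metadata_features(t) for t in texts], dtype=float)
    return np.hstack([embeddings, meta])


texts = ["thanks for the video", "YOU ARE AN IDIOT", "great tutorial 😀", "SHUT UP LOSER"]
labels = np.array([0, 1, 0, 1])  # toy labels for illustration only

# Placeholder embeddings: random vectors stand in for the frozen [CLS] outputs.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(texts), EMBED_DIM))

X = build_matrix(embeddings, texts)
meta_learner = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.001, max_iter=1000),
)
meta_learner.fit(X, labels)

# Probability of the toxic class, then the Safe/Toxic decision at the threshold.
proba = meta_learner.predict_proba(X)[:, 1]
is_toxic = proba >= THRESHOLD
```

In a real run, the placeholder matrix would be replaced by embeddings computed once from the frozen transformer, so only the small logistic regression is trained.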
---
## Setup & run
### 1. Prerequisites
| Tool | macOS / Linux | Windows |
| ----------- | ----------------------------------- | --------------------------------------------------------- |
| **Python 3.12** | `brew install python@3.12` | [python.org/downloads](https://www.python.org/downloads/) (check *Add Python to PATH*) |
| **uv** | `curl -LsSf https://astral.sh/uv/install.sh \| sh` | `powershell -c "irm https://astral.sh/uv/install.ps1 \| iex"` |
| **Node.js 18+** | `brew install node` | [nodejs.org](https://nodejs.org/) (LTS) |
| **pnpm** | `npm i -g pnpm` | `npm i -g pnpm` |
| **Make** *(optional)* | already installed | `winget install GnuWin32.Make` (or use WSL) |
### 2. Clone & configure
```bash
git clone https://github.com/Bootcamp-IA-P6/Project_9_Equipo3.git
cd Project_9_Equipo3
cp .env.example .env
# Fill: YOUTUBE_API_KEY, SUPABASE_URL, SUPABASE_KEY
```
> **Windows PowerShell**: use `Copy-Item .env.example .env` instead of the `cp` command.
Before the first run, paste the contents of `supabase/predictions_setup.sql` into the Supabase SQL editor and run it; this creates the `predictions` table and its RLS policies.
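
Before starting the API, it can help to confirm that the three settings from the clone step are actually filled in. The sketch below parses `.env`-style text and reports empty or missing keys; the helper names and the placeholder values are assumptions for illustration, and the parser only handles simple `KEY=VALUE` lines.

```python
REQUIRED_SETTINGS = ("YOUTUBE_API_KEY", "SUPABASE_URL", "SUPABASE_KEY")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_settings(values: dict[str, str]) -> list[str]:
    """Return the required settings that are absent or empty."""
    return [name for name in REQUIRED_SETTINGS if not values.get(name)]


# Placeholder content for illustration only; real keys belong in .env.
sample = """
# local development settings
YOUTUBE_API_KEY=your-youtube-key
SUPABASE_URL=https://example.supabase.co
SUPABASE_KEY=
"""

print(missing_settings(parse_env_text(sample)))  # prints ['SUPABASE_KEY']
```

To check the real file, pass `pathlib.Path(".env").read_text()` to `parse_env_text` instead of the sample string.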
### 3. Run: three ways
#### Option A: With Makefile (recommended on macOS / Linux / WSL)
```bash
make install # uv sync + pnpm install
make dev # FastAPI :8000 + Vite :5173
```
| Command | What it does |
| ------------- | --------------------------------------------- |
| `make install`| Install Python + frontend deps |
| `make dev` | Start API and UI in parallel (Ctrl+C stops both) |
| `make api` | API only |
| `make ui` | UI only |
| `make build` | Build the SPA into `frontend/dist` |
| `make test` | Run Pytest |
| `make docker` | `docker compose up --build` |
| `make stop` | Kill anything on ports 8000 / 5173 |
| `make clean` | Remove `.venv`, `node_modules`, `dist` |
#### Option B: Manual (macOS / Linux)
Two terminals.
**Terminal 1: API**
```bash
uv sync
uv run uvicorn src.api.main:app --reload --port 8000
```
**Terminal 2: Frontend**
```bash
cd frontend
pnpm install
pnpm dev
```
#### Option C: Manual (Windows PowerShell)
Two terminals.
**Terminal 1: API**
```powershell
uv sync
uv run uvicorn src.api.main:app --reload --port 8000
```
**Terminal 2: Frontend**
```powershell
cd frontend
pnpm install
pnpm dev
```
> If `uv` is not recognised after install, close and reopen PowerShell so the new `PATH` is picked up.
### 4. Open the app
| URL | What you'll see |
| ------------------------------ | ---------------------------------------- |
| http://localhost:5173 | React SPA: Watch / Hub / Settings |
| http://localhost:8000/docs | FastAPI Swagger UI |
| http://localhost:8000/health | Health check |
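
When scripting a local start-up, a small poller can wait for the health URL in the table above before opening the UI. This is a sketch that assumes the endpoint answers with HTTP 200 once the API is ready; the response body is not inspected, and the function names are illustrative.

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8000/health"  # health check URL from the table above


def is_healthy(url: str = HEALTH_URL, timeout: float = 2.0,
               opener=urllib.request.urlopen) -> bool:
    """Return True when the endpoint answers with HTTP 200."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all count as "not ready".
        return False


def wait_until_healthy(attempts: int = 10, delay: float = 1.0,
                       check=is_healthy) -> bool:
    """Poll the health check until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_until_healthy()` after `make dev` or `docker compose up --build` returns `True` once the API responds, or `False` after roughly ten seconds with the defaults.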
### 5. Docker (one container: API + built SPA)
Same commands on **macOS / Linux / Windows**:
```bash
# Normal: keeps images and volumes for fast rebuilds
docker compose up --build
# → http://localhost:8000 · Ctrl+C to stop · docker compose down
# Ephemeral demo: Ctrl+C tears down container + image + volumes
make docker-demo
# Manual full cleanup
make docker-clean
# (equivalent to: docker compose down --rmi local --volumes --remove-orphans)
```
---
More: see [docs/PIPELINE.md](docs/PIPELINE.md) for training, [docs/API.md](docs/API.md) for endpoints, [docs/DEPLOY.md](docs/DEPLOY.md) for Render deployment.
---
## Contributors
<table>
<tr>
<td align="center" width="25%">
<b>Andrés Torrez</b><br/>
<sub>Backend Developer</sub>
</td>
<td align="center" width="25%">
<b>Mirae Kang</b><br/>
<sub>Scrum Master</sub>
</td>
<td align="center" width="25%">
<b>Jonathan Brasales</b><br/>
<sub>AI Developer</sub>
</td>
<td align="center" width="25%">
<b>Roberto Molero</b><br/>
<sub>Product Owner</sub>
</td>
</tr>
</table>
---
<div align="center">
**SignalMod** · Bootcamp IA P6 · Team 3 · 2026
</div>