Ruperth committed on
Commit
ea0e222
·
1 Parent(s): 5465983

docs: rewrite readmes with logo description architecture and language toggle

Files changed (2)
  1. README.es.md +237 -222
  2. README.md +238 -205
README.es.md CHANGED
@@ -1,297 +1,312 @@
1
- # Detector de comentarios tóxicos en YouTube (youtube_hate_detector)
2
 
3
- [Python](https://www.python.org/downloads/)
4
- [FastAPI](https://fastapi.tiangolo.com/)
5
- [React](https://react.dev/)
6
- [Docker](https://docs.docker.com/compose/)
7
 
8
- **English:** [README.md](README.md)
9
 
10
- Soporte de moderación **Seguro vs Tóxico** para comentarios estilo YouTube. La pila es **FastAPI** (inferencia REST) más una SPA **React** que imita una página de reproducción: escribe o carga comentarios, consulta puntuaciones de toxicidad y cambia de modelo en Ajustes.
11
 
12
- **Producción por defecto:** **Hybrid Meta-Feature Stacking** — `models/production_final/meta_stack_final.joblib` (F1 en test **0,805**, brecha train–test **2,54 %**, por debajo de la regla del equipo **< 5 %** de sobreajuste).
13
-
14
- ---
15
-
16
- ## Qué hace este proyecto
17
-
18
-
19
- | Aspecto | Detalle |
20
- | -------------------------- | ------------------------------------------------------------------------------------------------- |
21
- | **Tarea** | Clasificación binaria sobre `IsToxic` → **Seguro (0)** / **Tóxico (1)** |
22
- | **Datos** | `data/raw/youtoxic_english_1000.csv` (~1k comentarios en inglés; columnas multietiqueta para EDA) |
23
- | **Métrica principal** | F1 ponderado (clase tóxica desbalanceada) |
24
- | **Control de sobreajuste** | |F1 train − F1 test| < 5 puntos porcentuales |
25
- | **Texto en la UI** | **tóxico** |
26
 
27
-
28
- Los moderadores reciben una puntuación y etiqueta prácticas por comentario. La demo no sustituye la revisión humana; prioriza un rendimiento **útil** en un corpus pequeño y de dominio concreto.
29
 
30
  ---
31
 
32
- ## Modelos: baseline → producción
33
 
34
- Tres opciones de inferencia están en `[configs/model_catalog.yaml](configs/model_catalog.yaml)` y en la UI. Las métricas siguientes corresponden al split de test estratificado del proyecto, salvo que se indique lo contrario.
35
 
 
36
 
37
- | Modelo | Tipo | F1 test (ponderado) | Brecha train–test | Artefacto / pesos | Umbral en UI |
38
- | -------------------------------------- | ----------------------- | ------------------- | ----------------- | ------------------------------------------------------------------------------ | ------------ |
39
- | **LR + TF-IDF (Baseline)** | sklearn + TF-IDF | 0,758 | 4,76 pp | `models/baseline/lr_tfidf.joblib` | 0,50 |
40
- | **Frozen Toxic-BERT (Baseline)** | Transformer (congelado) | 0,790 | 0,16 pp | Hugging Face `[unitary/toxic-bert](https://huggingface.co/unitary/toxic-bert)` | 0,12 |
41
- | **Meta-Feature Stacking (Production)** | Stack híbrido | **0,805** | **2,54 pp** | `models/production_final/meta_stack_final.joblib` | **0,381** |
42
 
 
43
 
44
- Números canónicos de baselines: `[models/baseline/manifest.json](models/baseline/manifest.json)`. Ejecución de producción: `[reports/notebook_14/final_result.json](reports/notebook_14/final_result.json)`. Guion de presentación: `[reports/HANDOVER_REPORT.md](reports/HANDOVER_REPORT.md)`.
45
 
46
- ### Aportación del equipo — Hybrid Meta-Feature Stacking
47
 
48
- Producción combina señales que sklearn no captura solo, sin afinar un transformer grande sobre ~1k filas:
49
 
50
- ```text
51
- Texto del comentario
52
- ├─► Frozen Toxic-BERT → embedding [CLS] (768-d)
52
- └─► Metadatos (longitud, ratio mayúsculas, densidad de emojis, …)
53
- └─► concat → StandardScaler → LogisticRegression (C=0,001)
54
- └─► P(tóxico) → umbral 0,381
56
  ```
57
 
58
- - **BERT congelado** aporta señal semántica; los pesos no se entrenan (mismo checkpoint Hub que el baseline congelado).
59
- - **Metadatos** conservan estructura interpretable (puntuación, longitud, etc.).
60
- - **Regularización fuerte** y búsqueda de umbral en test mantienen la brecha por debajo del 5 % y cumplen el objetivo **F1 ≥ 0,80**.
61
-
62
- Implementación: [Notebook 14](notebooks/14_final_meta_stacking.ipynb) · `uv run python -m src.experiments.notebook_14_final_stack`
63
 
64
- ### Hilo de notebooks
65
 
 
66
 
67
- | Notebooks | Rol |
68
- | ------------------- | ---------------------------------------------------------------------- |
69
- | `01`–`04` | EDA, preprocesado, TF-IDF → baseline LR |
70
- | `12` | Estrategia golden baseline (métricas Toxic-BERT congelado) |
71
- | `14` | Meta-stacking final → artefacto de producción |
72
- | `archive_attempts/` | Experimentos anteriores (05–11, 13); conservados para reproducibilidad |
73
 
 
74
 
75
  ---
76
 
77
- ## Requisitos previos
78
 
79
- - **Python 3.12** (ver `.python-version`)
80
- - **[uv](https://docs.astral.sh/uv/)** para instalación y comandos
81
- - **Node.js 18+** para desarrollo local del frontend
82
- - **Opcional:** `YOUTUBE_API_KEY` para comentarios en vivo y miniaturas de vídeos sugeridos ([Google Cloud Console](https://console.cloud.google.com/apis/credentials))
83
-
84
- Los baselines con transformer y producción necesitan dependencias de Hugging Face:
85
-
86
- ```bash
87
- uv sync --extra hf
88
- uv run python -c "import transformers; print('ok')"
89
- ```
90
 
91
- ---
 
 
 
 
 
 
92
 
93
- ## Instalación
94
 
95
  ```bash
96
- git clone <url-de-tu-repo>
97
- cd youtube_hate_detector
98
 
99
  cp .env.example .env
100
- # Edita .env: YOUTUBE_API_KEY, MODEL_NAME (opcional)
101
-
102
- uv sync --extra hf
103
  ```
104
 
105
- Coloca `youtoxic_english_1000.csv` en `data/raw/` si vas a reentrenar (el archivo está en `.gitignore`).
106
 
107
- ---
108
 
109
- ## Ejecución local (desarrollo)
110
 
111
- ### 1. API
112
 
113
  ```bash
114
- uv run uvicorn src.api.main:app --reload --port 8000
 
115
  ```
116
 
 
 
 
 
 
 
 
 
 
 
 
117
 
118
- | Recurso | URL |
119
- | ------- | ------------------------------------------------------------ |
120
- | Swagger | [http://localhost:8000/docs](http://localhost:8000/docs) |
121
- | Health | [http://localhost:8000/health](http://localhost:8000/health) |
122
- | OpenAPI | [http://localhost:8000/redoc](http://localhost:8000/redoc) |
123
-
124
 
125
- Al arrancar, `ModelService` carga el modelo de `MODEL_NAME` (por defecto: **Meta-Feature Stacking (Production)**). La primera carga de un transformer puede descargar pesos de Hugging Face (~1 minuto sin caché).
126
-
127
- ### 2. UI React
128
 
 
129
  ```bash
130
- cd frontend
131
- npm install
132
- npm run dev
133
  ```
134
 
135
- Abre [http://localhost:5173](http://localhost:5173). Vite hace proxy de las rutas API (`/predict`, `/models/status`, etc.) al puerto 8000.
136
-
137
- **Página Watch:** vídeos sugeridos, puntuación de comentarios, análisis en vivo del borrador.
138
- **Ajustes:** cambio entre los tres modelos del catálogo; slider de umbral (se actualiza al cambiar de modelo).
139
- **Moderator Hub:** historial de comentarios puntuados en la sesión.
140
-
141
- Banner de producción (desde `/model-info`): p. ej. *Meta-Feature Stacking Model (F1: 0.805, Gap: 2.54%)*.
142
-
143
- ---
144
-
145
- ## Docker (API + UI compilada)
146
-
147
  ```bash
148
- export YOUTUBE_API_KEY=tu_clave # opcional pero recomendado para comentarios reales
149
- docker compose up --build
 
150
  ```
151
 
 
152
 
153
- | URL | Servicio |
154
- | -------------------------------------------------------- | ---------------------------------------------- |
155
- | [http://localhost:8000](http://localhost:8000) | FastAPI + `frontend/dist` (un solo contenedor) |
156
- | [http://localhost:8000/docs](http://localhost:8000/docs) | Swagger |
157
-
158
 
159
- La imagen copia `models/baseline/` y `models/production_final/`. `INSTALL_HF=1` es el valor por defecto en `docker-compose.yml` para producción y el baseline BERT congelado. Para una imagen solo sklearn (baseline LR):
160
-
161
- ```bash
162
- INSTALL_HF=0 docker compose build --build-arg INSTALL_HF=0
163
  ```
164
 
165
- ---
166
-
167
- ## Resumen de la API
 
 
 
168
 
169
- Referencia completa: [docs/API.es.md](docs/API.es.md) · [docs/API.md](docs/API.md)
170
 
 
171
 
172
- | Método | Ruta | Descripción |
173
- | ------ | ------------------- | --------------------------------------------------------------------- |
174
- | `POST` | `/predict` | Puntúa un comentario `{ "text", "threshold" }` |
175
- | `POST` | `/predict-batch` | Hasta 100 textos |
176
- | `POST` | `/predict-video` | Obtiene comentarios de YouTube y los puntúa (API key o fallback demo) |
177
- | `GET` | `/videos/suggested` | Metadatos del carril derecho (`configs/suggested_videos.yaml`) |
178
- | `GET` | `/models/status` | Catálogo + disponibilidad (joblib / deps HF) |
179
- | `POST` | `/models/select` | Cambia de modelo `{ "model_name": "..." }` |
180
- | `GET` | `/model-info` | Metadatos del modelo activo (banner, umbral recomendado) |
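Because `/predict-batch` accepts up to 100 texts per call, a client with a longer comment list has to split it first. The helper below is a minimal sketch of that splitting step only. The function name `chunk_texts` is hypothetical, and the request body shape for `/predict-batch` is not shown here because the table above does not specify it.

```python
from typing import List


def chunk_texts(texts: List[str], max_batch: int = 100) -> List[List[str]]:
    """Split a list of comments into consecutive batches of at most max_batch items.

    The default of 100 mirrors the documented limit of /predict-batch.
    """
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    return [texts[i:i + max_batch] for i in range(0, len(texts), max_batch)]


# 250 placeholder comments become three batches: 100, 100 and 50 items.
batches = chunk_texts([f"comment {i}" for i in range(250)])
print([len(batch) for batch in batches])
```

Each batch can then be sent in its own request, so no single call exceeds the documented limit.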
181
 
 
182
 
183
- **Ejemplo**
184
 
185
  ```bash
186
- curl -s -X POST http://localhost:8000/predict \
187
- -H "Content-Type: application/json" \
188
- -d '{"text": "Thanks for the great tutorial!", "threshold": 0.381}'
189
- ```
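The same `/predict` call can be made from Python with only the standard library. This is a sketch under assumptions: the API is running locally on port 8000 as in the curl example, and the response is JSON whose fields are not documented in this README, so the code returns the decoded body without interpreting it. The function names `build_predict_request` and `predict` are hypothetical.

```python
import json
import urllib.request


def build_predict_request(text: str, threshold: float,
                          base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request for /predict with the documented JSON body."""
    body = json.dumps({"text": text, "threshold": threshold}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, threshold: float = 0.381) -> dict:
    """Send the request and return the decoded JSON response as-is.

    Assumption: the API answers with a JSON object; its exact fields
    are not specified here, so nothing is read out of it.
    """
    request = build_predict_request(text, threshold)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the API to be running locally.
    print(predict("Thanks for the great tutorial!"))
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server.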
190
 
191
- Cambiar al baseline LR:
 
192
 
193
- ```bash
194
- curl -s -X POST http://localhost:8000/models/select \
195
- -H "Content-Type: application/json" \
196
- -d '{"model_name": "LR + TF-IDF (Baseline)"}'
197
  ```
198
 
199
  ---
200
 
201
- ## Estructura del proyecto
202
-
203
- ```
204
- youtube_hate_detector/
205
- ├── configs/
206
- │ ├── model_catalog.yaml # Modelos de demo (baselines + producción)
207
- │ ├── pipeline.yaml # Rutas de entrenamiento
208
- │ ├── features.yaml
209
- │ └── suggested_videos.yaml
210
- ├── data/
211
- │ ├── raw/ # CSV fuente (git-ignored)
212
- │ └── processed/ # Exportaciones preprocesadas
213
- ├── frontend/ # React + Vite
214
- ├── models/
215
- │ ├── baseline/ # lr_tfidf.joblib, manifest.json
216
- │ ├── production_final/ # meta_stack_final.joblib
217
- │ └── README.md
218
- ├── notebooks/
219
- │ ├── 01–03, 12, 14 # Hilo principal
220
- │ └── archive_attempts/ # 04–11, 13
221
- ├── reports/
222
- │ ├── HANDOVER_REPORT.md
223
- │ ├── notebook_14/
224
- │ ├── golden_baseline/
225
- │ └── v2/ # Figuras EDA del equipo
226
- ├── src/
227
- │ ├── api/ # Rutas FastAPI
228
- │ ├── service/ # ModelService, predictor meta-stack
229
- │ ├── pipeline/ # Pipelines de entrenamiento
230
- │ ├── features/
231
- │ └── evaluation/
232
- ├── tests/
233
- ├── Dockerfile
234
- ├── docker-compose.yml
235
- ├── pyproject.toml
236
- └── uv.lock
237
- ```
238
 
239
  ---
240
 
241
- ## Entrenamiento y reproducción de métricas
242
-
243
-
244
- | Objetivo | Comando |
245
- | -------------------------------- | ------------------------------------------------------------ |
246
- | Baseline LR + TF-IDF | `uv run python -m src.pipeline.run_pipeline --model lr` |
247
- | Informes baseline BERT congelado | `uv run python -m src.pipeline.run_golden_baseline_pipeline` |
248
- | Meta-stack de producción | `uv run python -m src.experiments.notebook_14_final_stack` |
249
-
250
-
251
- Detalle del pipeline: [docs/PIPELINE.es.md](docs/PIPELINE.es.md) · Resultados agregados: [docs/RESULTS.es.md](docs/RESULTS.es.md) · Ejecuciones históricas: `[reports/summary.csv](reports/summary.csv)`
 
 
 
 
 
 
 
 
 
 
 
252
 
253
  ---
254
 
255
- ## Configuración
256
-
257
-
258
- | Archivo | Uso |
259
- | ------------------------------- | ----------------------------------------------------------------------- |
260
- | `.env` | `YOUTUBE_API_KEY`, `MODEL_NAME`, `ENV` |
261
- | `configs/model_catalog.yaml` | Catálogo de inferencia (editar y reiniciar la API para añadir entradas) |
262
- | `configs/suggested_videos.yaml` | IDs de vídeo del carril sugerido |
263
- | `configs/best_params.yaml` | Referencia Optuna LR para el baseline |
264
-
265
-
266
- No hagas commit de `.env`. Haz commit de `uv.lock` cuando cambien las dependencias.
267
-
268
- ---
269
-
270
- ## Tests
271
-
272
- ```bash
273
- uv sync --extra dev --extra hf
274
- uv run pytest
275
- ```
276
-
277
- Cubre contratos de la API, preprocesado y cableado del catálogo para los tres modelos de demo.
278
-
279
- ---
280
-
281
- ## Índice de documentación
282
-
283
-
284
- | English | Español |
285
- | -------------------------------------------------------- | -------------------------------------------------- |
286
- | [docs/API.md](docs/API.md) | [docs/API.es.md](docs/API.es.md) |
287
- | [docs/PIPELINE.md](docs/PIPELINE.md) | [docs/PIPELINE.es.md](docs/PIPELINE.es.md) |
288
- | [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) | [docs/ARCHITECTURE.es.md](docs/ARCHITECTURE.es.md) |
289
- | [docs/RESULTS.md](docs/RESULTS.md) | [docs/RESULTS.es.md](docs/RESULTS.es.md) |
290
- | [reports/HANDOVER_REPORT.md](reports/HANDOVER_REPORT.md) | |
291
-
292
-
293
- ---
294
 
295
- ## Licencia y datos
296
 
297
- Usa el dataset del proyecto y las claves de API según las normas de tu curso u organización. El uso de YouTube Data API debe cumplir las [condiciones de Google](https://developers.google.com/youtube/terms/api-services-terms-of-service).
 
1
+ <div align="center">
2
 
3
+ <img src="docs/assets/signalmod_logo.png" alt="SignalMod" width="520" />
 
 
 
4
 
5
+ ### Moderación inteligente para comentarios de YouTube
6
 
7
+ 🌐 [English](README.md) · **Español**
8
 
9
+ ![Python](https://img.shields.io/badge/Python-3.12-3776AB?logo=python&logoColor=white)
10
+ ![FastAPI](https://img.shields.io/badge/FastAPI-0.136-009688?logo=fastapi&logoColor=white)
11
+ ![React](https://img.shields.io/badge/React-18-61DAFB?logo=react&logoColor=black)
12
+ ![Vite](https://img.shields.io/badge/Vite-5-646CFF?logo=vite&logoColor=white)
13
+ ![PyTorch](https://img.shields.io/badge/PyTorch-2.x-EE4C2C?logo=pytorch&logoColor=white)
14
+ ![Transformers](https://img.shields.io/badge/Transformers-5.9-FFD21E?logo=huggingface&logoColor=black)
15
+ ![scikit-learn](https://img.shields.io/badge/scikit--learn-1.8-F7931E?logo=scikitlearn&logoColor=white)
16
+ ![Supabase](https://img.shields.io/badge/Supabase-DB-3ECF8E?logo=supabase&logoColor=white)
17
+ ![Docker](https://img.shields.io/badge/Docker-compose-2496ED?logo=docker&logoColor=white)
18
+ ![Render](https://img.shields.io/badge/Deploy-Render-46E3B7?logo=render&logoColor=white)
 
 
 
 
19
 
20
+ </div>
 
21
 
22
  ---
23
 
24
+ ## Descripción del proyecto
25
 
26
+ **SignalMod** es un asistente de moderación inteligente para comentarios de YouTube. Clasifica automáticamente cada comentario como **Seguro** o **Tóxico**, devuelve una probabilidad entre 0 y 1 y etiqueta categorías de toxicidad (insulto, amenaza, odio identitario, contenido obsceno).
27
 
28
+ Está construido alrededor del modelo **hybrid meta-feature stacking** del equipo — embeddings de Toxic-BERT congelado combinados con metadatos y una regresión logística regularizada — que alcanza **F1 = 0,805** con una brecha train–test de **2,54 pp** sobre el split de 200 muestras del proyecto.
29
 
30
+ El producto se entrega como una API REST con FastAPI y una SPA React que imita la experiencia de YouTube Watch: eliges un vídeo, la API descarga los 50 comentarios más recientes vía la YouTube Data API, los puntúa y persiste cada predicción en Supabase para que cualquier visitante pueda ver el histórico completo.
 
 
 
 
31
 
32
+ ---
33
 
34
+ ## Herramientas y lenguajes
35
+
36
+ ### Lenguajes
37
+ - **Python 3.12** — backend, pipelines de ML, evaluación.
38
+ - **TypeScript + React 18** — SPA del frontend.
39
+ - **SQL (PostgreSQL vía Supabase)** — persistencia de predicciones.
40
+
41
+ ### Backend
42
+ - **FastAPI 0.136** — API REST, esquemas Pydantic, carga del modelo en lifespan.
43
+ - **Uvicorn** — servidor ASGI con hot reload.
44
+ - **scikit-learn 1.8** — baseline TF-IDF + meta-learner LogisticRegression.
45
+ - **Optuna** — búsqueda de hiperparámetros del baseline TF-IDF.
46
+ - **PyTorch 2.x + Transformers 5.9** — `unitary/toxic-bert` congelado para embeddings CLS.
47
+ - **spaCy + NLTK** — lematización, stopwords, limpieza basada en regex.
48
+ - **MLflow** — tracking de experimentos.
49
+ - **Supabase Python SDK** — persistencia de predicciones con políticas RLS anónimas.
50
+ - **google-api-python-client** — integración con YouTube Data API v3.
51
+
52
+ ### Frontend
53
+ - **React 18 + Vite 5 + TypeScript** — SPA con hot module reload.
54
+ - **CSS modules** — tema oscuro estilo YouTube.
55
+
56
+ ### Tooling y operaciones
57
+ - **uv** — gestor de paquetes y entorno virtual de Python (`pyproject.toml` + `uv.lock`).
58
+ - **pnpm** — gestor de paquetes del frontend.
59
+ - **Docker + Docker Compose** — despliegue en un único contenedor sirviendo API + SPA construida.
60
+ - **GNU Make** — `make dev`, `make install`, `make build`, `make docker`.
61
+ - **Render** — despliegue gratuito vía blueprint `render.yaml`.
62
+ - **Pytest** — tests unitarios de contratos de API y preprocesado.
63
 
64
+ ---
65
 
66
+ ## Arquitectura del proyecto
67
 
68
+ ```
69
+ Project_9_Equipo3/
70
+ ├── configs/ # Configs YAML para pipelines y catálogo de inferencia
70
+ │ ├── pipeline.yaml # Rutas de datos, target, folds de CV
71
+ │ ├── features.yaml # Preprocesado y ajustes de TF-IDF
72
+ │ ├── model_catalog.yaml # Catálogo de inferencia (3 modelos intercambiables)
74
+ │ ├── best_params.yaml # Ganador de Optuna para el baseline LR
75
+ │ ├── suggested_videos.yaml # IDs de YouTube del rail "Up next"
76
+ │ └── *_training.yaml # Perfiles de entrenamiento (golden, expert, hybrid, …)
77
+ ├── data/ # Datasets crudos y procesados (git-ignored)
78
+ ├── docs/ # API.md, PIPELINE.md, ARCHITECTURE.md, DEPLOY.md
79
+ │ └── assets/signalmod_logo.png # Activos de marca
80
+ ├── frontend/ # SPA React + Vite
81
+ │ ├── public/signalmod_logo.png # Logo servido como activo estático
82
+ │ └── src/
83
+ │ ├── api/ # Cliente HTTP tipado
84
+ │ ├── components/ # Layout, CommentRow, SuggestedRail, ModelBanner
85
+ │ ├── context/ # Estado global (modelo activo, umbral)
86
+ │ ├── hooks/ # useDebouncedPredict
87
+ │ ├── pages/ # WatchPage, HubPage, SettingsPage
88
+ │ └── utils/ # toxicityColor, randomUsername, relativeTime
89
+ ├── models/
90
+ │ ├── baseline/lr_tfidf.joblib # Baseline LR ajustado con Optuna
91
+ │ └── production_final/ # meta_stack_final.joblib — artefacto de producción
92
+ ├── notebooks/
93
+ │ ├── 01–04 # EDA, preprocesado, TF-IDF, baseline LR
94
+ │ ├── 12 # Golden baseline (Toxic-BERT congelado)
95
+ │ ├── 14 # Meta-stacking final — artefacto de producción
96
+ │ └── archive_attempts/ # Experimentos anteriores conservados para reproducibilidad
97
+ ├── reports/ # Métricas, gráficos, figuras EDA, summary.csv
98
+ ├── src/
99
+ │ ├── api/ # App FastAPI
100
+ │ │ ├── main.py # Lifespan, CORS, montaje del SPA estático
101
+ │ │ ├── routes/ # health, models, predict (+ /predictions), videos
102
+ │ │ ├── schemas.py # Modelos Pydantic request/response
103
+ │ │ ├── services.py # predict_single, to_predict_response
104
+ │ │ ├── state.py # Estado compartido de la app
105
+ │ │ └── youtube.py # Fetch a YouTube Data API + metadatos sugeridos
106
+ │ ├── data/ # Loader, dual loader para pipelines híbridos
107
+ │ ├── db/ # Cliente Supabase + helpers save_prediction
108
+ │ ├── evaluation/ # Evaluator, threshold tuning, CV estable
109
+ │ ├── experiments/ # Versiones script de los notebooks 13 / 14
110
+ │ ├── features/ # text_preprocessor, vectorizer, metadata, augmentation
111
+ │ ├── models/ # baseline (LR/RF/XGBoost), hybrid_ensemble, metadata_lr
112
+ │ ├── pipeline/ # run_pipeline + variantes por estrategia
113
+ │ ├── service/ # ModelService, meta_stack_predictor, model_catalog
114
+ │ └── utils/ # Logger
115
+ ├── supabase/predictions_setup.sql # SQL para crear la tabla predictions + políticas RLS
116
+ ├── tests/ # Suite Pytest
117
+ ├── Dockerfile # Build multi-stage (frontend + backend con uv)
118
+ ├── docker-compose.yml # Despliegue de un contenedor (API + SPA)
119
+ ├── render.yaml # Blueprint de Render (web service + static site)
120
+ ├── Procfile # Declaración de proceso para Render
121
+ ├── Makefile # make dev / install / build / docker / test
122
+ ├── pyproject.toml + uv.lock # Dependencias Python fijadas con uv
123
+ └── README.md / README.es.md # Documentación en inglés / español
124
  ```
125
 
126
+ ### Flujo de datos
 
 
 
 
127
 
128
+ ```
129
+ ┌────────────────────────────────────────────────┐
130
+ │ SPA React (Vite) http://localhost:5173│
131
+ │ Layout · Watch · Hub · Settings │
132
+ └──────────────────┬─────────────────────────────┘
133
+ │ HTTP JSON (proxy Vite → :8000)
134
+ ┌──────────────────▼─────────────────────────────┐
135
+ │ FastAPI http://localhost:8000│
136
+ │ /predict /predict-batch /predict-video │
137
+ │ /predictions (GET — histórico de Supabase) │
138
+ │ /models /models/select /model-info │
139
+ │ /videos/suggested /health │
140
+ └──────┬─────────────────────────────┬───────────┘
141
+ │ │
142
+ ┌──────────────▼─────────────┐ ┌─────────────▼──────────────┐
143
+ │ ModelService │ │ YouTube Data API v3 │
144
+ │ · local joblib │ │ · metadatos de vídeo │
145
+ │ · hf_remote │ │ · 50 comentarios + nuevos │
146
+ │ · meta_stack (producción) │ │ │
147
+ └──────┬─────────────────────┘ └────────────────────────────┘
148
+
149
+ ┌──────▼──────────────────────────────────────────────────┐
150
+ │ Supabase (PostgreSQL) │
151
+ │ tabla: predictions(id, created_at, text, video_id, │
152
+ │ probability, is_toxic, labels, …) │
153
+ │ RLS: insert anónimo + select anónimo │
154
+ └─────────────────────────────────────────────────────────┘
155
+ ```
156
 
157
+ ### Catálogo de modelos (intercambiable desde la UI)
158
 
159
+ | Modelo | Tipo | F1 (test) | Brecha train–test | Umbral | Latencia | Default |
160
+ | -------------------------------- | ----------- | --------- | ----------------- | --------- | -------- | ------- |
161
+ | **Meta-Feature Stacking** | Híbrido | **0,805** | **2,54 pp** | **0,381** | ~400 ms | **Sí** |
162
+ | Frozen Toxic-BERT | Transformer | 0,790 | 0,16 pp | 0,120 | ~400 ms | No |
163
+ | LR + TF-IDF (Optuna) | sklearn | 0,758 | 4,76 pp | 0,500 | < 50 ms | No |
 
164
 
165
+ El modelo de producción concatena el embedding `[CLS]` congelado de `unitary/toxic-bert` (768-d) con metadatos hechos a mano (longitud, ratio de mayúsculas, densidad de emojis…), los escala con `StandardScaler` y los pasa por un meta-learner `LogisticRegression(C=0,001)`.
166
 
167
  ---
168
 
169
+ ## Instalación y ejecución
170
 
171
+ ### 1. Requisitos previos
 
 
 
 
 
 
 
 
 
 
172
 
173
+ | Herramienta | macOS / Linux | Windows |
174
+ | ------------ | ----------------------------------- | -------------------------------------------------------- |
175
+ | **Python 3.12** | `brew install python@3.12` | [python.org/downloads](https://www.python.org/downloads/) (marca *Add Python to PATH*) |
176
+ | **uv** | `curl -LsSf https://astral.sh/uv/install.sh \| sh` | `powershell -c "irm https://astral.sh/uv/install.ps1 \| iex"` |
177
+ | **Node.js 18+** | `brew install node` | [nodejs.org](https://nodejs.org/) (LTS) |
178
+ | **pnpm** | `npm i -g pnpm` | `npm i -g pnpm` |
179
+ | **Make** *(opcional)* | ya instalado | `winget install GnuWin32.Make` (o usa WSL) |
180
 
181
+ ### 2. Clonar y configurar
182
 
183
  ```bash
184
+ git clone https://github.com/Bootcamp-IA-P6/Project_9_Equipo3.git
185
+ cd Project_9_Equipo3
186
 
187
  cp .env.example .env
188
+ # Rellena: YOUTUBE_API_KEY, SUPABASE_URL, SUPABASE_KEY
 
 
189
  ```
190
 
191
+ > **PowerShell de Windows**: sustituye `cp` por `Copy-Item .env.example .env`.
192
 
193
+ Pega `supabase/predictions_setup.sql` en el editor SQL de Supabase antes del primer arranque (crea la tabla `predictions` + políticas RLS).
194
 
195
+ ### 3. Arranque — tres opciones
196
 
197
+ #### Opción A — Con Makefile (recomendada en macOS / Linux / WSL)
198
 
199
  ```bash
200
+ make install # uv sync + pnpm install
201
+ make dev # FastAPI :8000 + Vite :5173
202
  ```
203
 
204
+ | Comando | Qué hace |
205
+ | ------------- | ---------------------------------------------- |
206
+ | `make install`| Instala deps de Python + frontend |
207
+ | `make dev` | Arranca API y UI en paralelo (Ctrl+C los para) |
208
+ | `make api` | Solo la API |
209
+ | `make ui` | Solo la UI |
210
+ | `make build` | Compila el SPA a `frontend/dist` |
211
+ | `make test` | Ejecuta Pytest |
212
+ | `make docker` | `docker compose up --build` |
213
+ | `make stop` | Mata procesos en los puertos 8000 / 5173 |
214
+ | `make clean` | Borra `.venv`, `node_modules`, `dist` |
215
 
216
+ #### Opción B — Manual (macOS / Linux)
 
 
 
 
 
217
 
218
+ Dos terminales.
 
 
219
 
220
+ **Terminal 1 — API**
221
  ```bash
222
+ uv sync
223
+ uv run uvicorn src.api.main:app --reload --port 8000
 
224
  ```
225
 
226
+ **Terminal 2 — Frontend**
 
 
 
 
 
 
 
 
 
 
 
227
  ```bash
228
+ cd frontend
229
+ pnpm install
230
+ pnpm dev
231
  ```
232
 
233
+ #### Opción C — Manual (PowerShell de Windows)
234
 
235
+ Dos terminales.
 
 
 
 
236
 
237
+ **Terminal 1 — API**
238
+ ```powershell
239
+ uv sync
240
+ uv run uvicorn src.api.main:app --reload --port 8000
241
  ```
242
 
243
+ **Terminal 2 — Frontend**
244
+ ```powershell
245
+ cd frontend
246
+ pnpm install
247
+ pnpm dev
248
+ ```
249
 
250
+ > Si `uv` no se reconoce tras instalarlo, cierra y vuelve a abrir PowerShell para que se recargue el `PATH`.
251
 
252
+ ### 4. Abrir la aplicación
253
 
254
+ | URL | Qué verás |
255
+ | ------------------------------ | ---------------------------------------- |
256
+ | http://localhost:5173 | SPA React Watch / Hub / Settings |
257
+ | http://localhost:8000/docs | Swagger de FastAPI |
258
+ | http://localhost:8000/health | Health check |
 
 
 
 
259
 
260
+ ### 5. Docker (un solo contenedor — API + SPA compilada)
261
 
262
+ Mismos comandos en **macOS / Linux / Windows**:
263
 
264
  ```bash
265
+ # Normal — deja imágenes y volúmenes para builds rápidos
266
+ docker compose up --build
267
+ # → http://localhost:8000 · Ctrl+C para parar · docker compose down
 
268
 
269
+ # Demo efímera — Ctrl+C borra contenedor + imagen + volúmenes
270
+ make docker-demo
271
 
272
+ # Limpieza manual completa
273
+ make docker-clean
274
+ # (equivale a: docker compose down --rmi local --volumes --remove-orphans)
 
275
  ```
276
 
277
  ---
278
 
279
+ Más detalle: [docs/PIPELINE.es.md](docs/PIPELINE.es.md) para entrenamiento, [docs/API.es.md](docs/API.es.md) para endpoints, [docs/DEPLOY.md](docs/DEPLOY.md) para despliegue en Render.
280
 
281
  ---
282
 
283
+ ## Colaboradores
284
+
285
+ <table>
286
+ <tr>
287
+ <td align="center" width="25%">
288
+ <b>Andrés Torrez</b><br/>
289
+ <sub>Backend Developer</sub>
290
+ </td>
291
+ <td align="center" width="25%">
292
+ <b>Mirae Kang</b><br/>
293
+ <sub>Scrum Master</sub>
294
+ </td>
295
+ <td align="center" width="25%">
296
+ <b>Jonathan Brasales</b><br/>
297
+ <sub>AI Developer</sub>
298
+ </td>
299
+ <td align="center" width="25%">
300
+ <b>Roberto Molero</b><br/>
301
+ <sub>Product Owner</sub>
302
+ </td>
303
+ </tr>
304
+ </table>
305
 
306
  ---
307
 
308
+ <div align="center">
309
 
310
+ **SignalMod** Bootcamp IA P6 · Equipo 3 · 2026
311
 
312
+ </div>
README.md CHANGED
@@ -1,279 +1,312 @@
1
- # YouTube Toxic Comment Detector (youtube_hate_detector)
2
 
3
- [![Python](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/downloads/)
4
- [![FastAPI](https://img.shields.io/badge/FastAPI-0.136-009688.svg)](https://fastapi.tiangolo.com/)
5
- [![React](https://img.shields.io/badge/React-UI-61DAFB.svg)](https://react.dev/)
6
- [![Docker](https://img.shields.io/badge/docker-compose-2496ED.svg)](https://docs.docker.com/compose/)
7
 
8
- **Español:** [README.es.md](README.es.md)
9
 
10
- Automated **Safe vs Toxic** moderation support for YouTube-style comments. The stack is **FastAPI** (REST inference) plus a **React** SPA that mimics a Watch page: type or load comments, see toxicity scores, and switch models in Settings.
11
 
12
- **Production default:** **Hybrid Meta-Feature Stacking** — `models/production_final/meta_stack_final.joblib` (held-out test F1 **0.805**, train–test gap **2.54%**, under the team’s **&lt; 5%** overfitting rule).
 
 
 
 
 
 
 
 
 
 
 
13
 
14
  ---
15
 
16
- ## What this project does
17
 
18
- | Aspect | Detail |
19
- |--------|--------|
20
- | **Task** | Binary classification on `IsToxic` → **Safe (0)** / **Toxic (1)** |
21
- | **Data** | `data/raw/youtoxic_english_1000.csv` (~1k English comments; multilabel columns available for EDA) |
22
- | **Primary metric** | F1 weighted (imbalanced toxic class) |
23
- | **Overfitting guardrail** | \|F1 train − F1 test\| &lt; 5 percentage points |
24
- | **User-facing wording** | **toxic** |
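The overfitting guardrail in the table reduces to a one-line check. The sketch below assumes F1 scores are expressed as fractions between 0 and 1 and converts the difference to percentage points. The function names and the train scores in the example are hypothetical, since only test F1 and the resulting gaps are reported.

```python
def train_test_gap_pp(f1_train: float, f1_test: float) -> float:
    """Absolute train-test F1 gap, in percentage points."""
    return abs(f1_train - f1_test) * 100.0


def passes_guardrail(f1_train: float, f1_test: float, max_gap_pp: float = 5.0) -> bool:
    """True when |F1 train - F1 test| is strictly below the allowed gap."""
    return train_test_gap_pp(f1_train, f1_test) < max_gap_pp


# Hypothetical train scores paired with the reported production test F1 of 0.805.
print(passes_guardrail(0.830, 0.805))  # gap of about 2.5 pp
print(passes_guardrail(0.900, 0.805))  # gap of about 9.5 pp
```

Taking the absolute value means a model that scores better on test than on train is held to the same 5 pp limit.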
25
 
26
- Moderators get a practical score and label per comment. The demo does not replace human review; it prioritizes **usable** performance on a small domain-specific corpus.
27
 
28
- ---
29
 
30
- ## Models: baseline → production
31
 
32
- Three inference options are registered in [`configs/model_catalog.yaml`](configs/model_catalog.yaml) and exposed in the UI. Metrics below are on the project’s stratified hold-out test split unless noted.
33
 
34
- | Model | Type | Test F1 (weighted) | Train–test gap | Artifact / weights | UI threshold |
35
- |-------|------|-------------------|----------------|---------------------|--------------|
36
- | **LR + TF-IDF (Baseline)** | sklearn + TF-IDF | 0.758 | 4.76 pp | `models/baseline/lr_tfidf.joblib` | 0.50 |
37
- | **Frozen Toxic-BERT (Baseline)** | Transformer (frozen) | 0.790 | 0.16 pp | Hugging Face [`unitary/toxic-bert`](https://huggingface.co/unitary/toxic-bert) | 0.12 |
38
- | **Meta-Feature Stacking (Production)** | Hybrid stack | **0.805** | **2.54 pp** | `models/production_final/meta_stack_final.joblib` | **0.381** |
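The UI threshold column matters because the same probability can produce different labels depending on the active model. The sketch below copies the three thresholds from the table into a lookup. The dictionary, the function name, and the choice to treat a probability equal to the threshold as Toxic are all assumptions made for illustration, not the project's actual code.

```python
# UI thresholds copied from the model table above.
UI_THRESHOLDS = {
    "LR + TF-IDF (Baseline)": 0.50,
    "Frozen Toxic-BERT (Baseline)": 0.12,
    "Meta-Feature Stacking (Production)": 0.381,
}


def label_comment(probability: float, model_name: str) -> str:
    """Map a toxicity probability to Safe or Toxic using the model's threshold."""
    threshold = UI_THRESHOLDS[model_name]
    # Assumption: a probability equal to the threshold counts as Toxic.
    return "Toxic" if probability >= threshold else "Safe"


# One probability, three outcomes depending on the selected model.
for name in UI_THRESHOLDS:
    print(name, "->", label_comment(0.30, name))
```

A score of 0.30 is flagged only under the frozen Toxic-BERT threshold of 0.12, which is why the slider default changes when the model does.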
39
 
40
- Canonical baseline numbers: [`models/baseline/manifest.json`](models/baseline/manifest.json). Production run: [`reports/notebook_14/final_result.json`](reports/notebook_14/final_result.json). Presentation script: [`reports/HANDOVER_REPORT.md`](reports/HANDOVER_REPORT.md).
41
 
42
- ### Team contribution — Hybrid Meta-Feature Stacking
43
 
44
- Production combines signals that sklearn alone misses, without fine-tuning a large transformer on ~1k rows:
45
 
46
- ```text
47
- Comment text
48
- ├─► Frozen Toxic-BERT → [CLS] embedding (768-d)
49
- └─► Metadata features (length, caps ratio, emoji density, …)
50
- └─► concat → StandardScaler → LogisticRegression (C=0.001)
51
- └─► P(toxic) → threshold 0.381
52
  ```
53
 
54
- - **Frozen BERT** supplies semantic signal; weights stay fixed (same Hub checkpoint as the frozen baseline path).
55
- - **Metadata** keeps interpretable structure (punctuation, length, etc.).
56
- - **Strong regularization** and test-set threshold search keep the train–test gap under 5% while passing the **F1 ≥ 0.80** target.
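To make the metadata branch concrete, here is a minimal sketch of three of the features named in the diagram (length, caps ratio, emoji density). The exact definitions used by the project are not given in this README, so the formulas below are assumptions: caps ratio is measured over alphabetic characters only, and emoji are approximated by the Unicode category `So`, which also matches some non-emoji symbols.

```python
import unicodedata
from typing import Dict


def is_emoji(ch: str) -> bool:
    """Rough heuristic: Unicode category 'So' (Symbol, other) covers most emoji."""
    return unicodedata.category(ch) == "So"


def metadata_features(text: str) -> Dict[str, float]:
    """Compute simple, interpretable features for one comment."""
    letters = [c for c in text if c.isalpha()]
    n_chars = len(text)
    return {
        "length": float(n_chars),
        # Share of alphabetic characters that are uppercase.
        "caps_ratio": sum(c.isupper() for c in letters) / len(letters) if letters else 0.0,
        # Share of all characters that look like emoji.
        "emoji_density": sum(is_emoji(c) for c in text) / n_chars if n_chars else 0.0,
    }


# "\U0001F525" is the fire emoji, written as an escape to keep the source ASCII.
print(metadata_features("WOW \U0001F525"))
print(metadata_features("thanks for the tutorial"))
```

In the stacked model, values like these would be concatenated with the 768-d embedding before scaling. That step is omitted here so the example runs with the standard library alone.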
57
-
58
- Implementation: [Notebook 14](notebooks/14_final_meta_stacking.ipynb) · `uv run python -m src.experiments.notebook_14_final_stack`
59
 
60
- ### Notebook narrative
 
 
 
 
61
 
62
- | Notebooks | Role |
63
- |-----------|------|
64
- | `01`–`03` | EDA, preprocessing, TF-IDF → LR baseline |
65
- | `12` | Golden baseline strategy (frozen Toxic-BERT metrics) |
66
- | `14` | Final meta-stacking → production artifact |
67
- | `archive_attempts/` | Earlier experiments (04–11, 13); kept for reproducibility |
68
 
69
  ---
70
 
71
- ## Prerequisites
72
-
73
- - **Python 3.12** (see `.python-version`)
74
- - **[uv](https://docs.astral.sh/uv/)** for installs and commands
75
- - **Node.js 18+** for local frontend dev
76
- - **Optional:** `YOUTUBE_API_KEY` for live comments and suggested-video thumbnails ([Google Cloud Console](https://console.cloud.google.com/apis/credentials))
77
 
78
- Transformer baselines and production need Hugging Face dependencies:
79
-
80
- ```bash
81
- uv sync --extra hf
82
- uv run python -c "import transformers; print('ok')"
83
- ```
84
 
85
- ---
 
 
 
 
 
 
86
 
87
- ## Installation
88
 
89
  ```bash
90
- git clone <your-repo-url>
91
- cd youtube_hate_detector
92
 
93
  cp .env.example .env
94
- # Edit .env: YOUTUBE_API_KEY, MODEL_NAME (optional)
95
-
96
- uv sync --extra hf
97
  ```
98
 
99
- Place `youtoxic_english_1000.csv` in `data/raw/` if you plan to retrain (file is git-ignored).
100
 
101
- ---
102
 
103
- ## Run locally (development)
104
 
105
- ### 1. API
106
 
107
  ```bash
108
- uv run uvicorn src.api.main:app --reload --port 8000
 
109
  ```
110
 
111
- | Resource | URL |
112
- |----------|-----|
113
- | Swagger | http://localhost:8000/docs |
114
- | Health | http://localhost:8000/health |
115
- | OpenAPI | http://localhost:8000/redoc |
 
 
 
 
 
 
116
 
117
- On startup, `ModelService` loads the model from `MODEL_NAME` (default: **Meta-Feature Stacking (Production)**). First load of a transformer model may download weights from Hugging Face (~1 minute on a cold cache).
118
 
119
- ### 2. React UI
120
 
 
121
  ```bash
122
- cd frontend
123
- npm install
124
- npm run dev
125
  ```
126
 
127
- Open http://localhost:5173. Vite proxies API routes (`/predict`, `/models/status`, etc.) to port 8000.
128
-
129
- **Watch page:** suggested videos, comment list scoring, live draft analysis.
130
- **Settings:** switch among the three catalog models; threshold slider (defaults update when you change model).
131
- **Moderator Hub:** session history of scored comments.
132
-
133
- Production banner (from `/model-info`): e.g. *Meta-Feature Stacking Model (F1: 0.805, Gap: 2.54%)*.
134
-
135
- ---
136
-
137
- ## Docker (API + built UI)
138
-
139
  ```bash
140
- export YOUTUBE_API_KEY=your_key # optional but recommended for real comments
141
- docker compose up --build
 
142
  ```
143
 
144
- | URL | Service |
145
- |-----|---------|
146
- | http://localhost:8000 | FastAPI + `frontend/dist` (single container) |
147
- | http://localhost:8000/docs | Swagger |
148
 
149
- The image copies `models/baseline/` and `models/production_final/`. `INSTALL_HF=1` is the default in `docker-compose.yml` so production and frozen BERT baselines work. For a sklearn-only image (LR baseline only):
150
 
151
- ```bash
152
- INSTALL_HF=0 docker compose build --build-arg INSTALL_HF=0
 
 
153
  ```
154
 
155
- ---
156
-
157
- ## API overview
 
 
 
158
 
159
- Full reference: [docs/API.md](docs/API.md)
160
 
161
- | Method | Path | Description |
162
- |--------|------|-------------|
163
- | `POST` | `/predict` | Score one comment `{ "text", "threshold" }` |
164
- | `POST` | `/predict-batch` | Up to 100 texts |
165
- | `POST` | `/predict-video` | Fetch YouTube comments and score (API key or demo fallback) |
166
- | `GET` | `/videos/suggested` | Right-rail video metadata (`configs/suggested_videos.yaml`) |
167
- | `GET` | `/models/status` | Catalog + availability (joblib / HF deps) |
168
- | `POST` | `/models/select` | Switch model `{ "model_name": "..." }` |
169
- | `GET` | `/model-info` | Active model metadata (banner text, recommended threshold) |
170
 
171
- **Example**
 
 
 
 
172
 
173
- ```bash
174
- curl -s -X POST http://localhost:8000/predict \
175
- -H "Content-Type: application/json" \
176
- -d '{"text": "Thanks for the great tutorial!", "threshold": 0.381}'
177
- ```
178
 
179
- Switch to the LR baseline:
180
 
181
  ```bash
182
- curl -s -X POST http://localhost:8000/models/select \
183
- -H "Content-Type: application/json" \
184
- -d '{"model_name": "LR + TF-IDF (Baseline)"}'
185
- ```
186
-
187
- ---
188
 
189
- ## Project structure
 
190
 
191
- ```
192
- youtube_hate_detector/
193
- ├── configs/
194
- │ ├── model_catalog.yaml # Demo models (baselines + production)
195
- │ ├── pipeline.yaml # Training paths
196
- │ ├── features.yaml
197
- │ └── suggested_videos.yaml
198
- ├── data/
199
- │ ├── raw/ # Source CSV (git-ignored)
200
- │ └── processed/ # Preprocessed exports
201
- ├── frontend/ # React + Vite
202
- ├── models/
203
- │ ├── baseline/ # lr_tfidf.joblib, manifest.json
204
- │ ├── production_final/ # meta_stack_final.joblib
205
- │ └── README.md
206
- ├── notebooks/
207
- │ ├── 01–03, 12, 14 # Main story
208
- │ └── archive_attempts/ # 04–11, 13
209
- ├── reports/
210
- │ ├── HANDOVER_REPORT.md
211
- │ ├── notebook_14/
212
- │ ├── golden_baseline/
213
- │ └── v2/ # Teammate EDA figures
214
- ├── src/
215
- │ ├── api/ # FastAPI routes
216
- │ ├── service/ # ModelService, meta-stack predictor
217
- │ ├── pipeline/ # Training pipelines
218
- │ ├── features/
219
- │ └── evaluation/
220
- ├── tests/
221
- ├── Dockerfile
222
- ├── docker-compose.yml
223
- ├── pyproject.toml
224
- └── uv.lock
225
  ```
226
 
227
  ---
228
 
229
- ## Training and reproducing metrics
230
-
231
- | Goal | Command |
232
- |------|---------|
233
- | LR + TF-IDF baseline | `uv run python -m src.pipeline.run_pipeline --model lr` |
234
- | Frozen BERT baseline reports | `uv run python -m src.pipeline.run_golden_baseline_pipeline` |
235
- | Production meta-stack | `uv run python -m src.experiments.notebook_14_final_stack` |
236
-
237
- Pipeline details: [docs/PIPELINE.md](docs/PIPELINE.md) · Aggregated results: [docs/RESULTS.md](docs/RESULTS.md) · Historical runs: [`reports/summary.csv`](reports/summary.csv)
238
-
239
- ---
240
-
241
- ## Configuration
242
-
243
- | File | Purpose |
244
- |------|---------|
245
- | `.env` | `YOUTUBE_API_KEY`, `MODEL_NAME`, `ENV` |
246
- | `configs/model_catalog.yaml` | Inference catalog (edit + restart API to add entries) |
247
- | `configs/suggested_videos.yaml` | Video IDs for the suggested rail |
248
- | `configs/best_params.yaml` | Optuna LR reference for baseline |
249
-
250
- Never commit `.env`. Commit `uv.lock` when dependencies change.
251
 
252
  ---
253
 
254
- ## Tests
255
-
256
- ```bash
257
- uv sync --extra dev --extra hf
258
- uv run pytest
259
- ```
260
-
261
- Covers API contracts, preprocessing, and catalog wiring for the three demo models.
262
 
263
  ---
264
 
265
- ## Documentation index
266
-
267
- | English | Español |
268
- |---------|---------|
269
- | [docs/API.md](docs/API.md) | [docs/API.es.md](docs/API.es.md) |
270
- | [docs/PIPELINE.md](docs/PIPELINE.md) | [docs/PIPELINE.es.md](docs/PIPELINE.es.md) |
271
- | [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) | [docs/ARCHITECTURE.es.md](docs/ARCHITECTURE.es.md) |
272
- | [docs/RESULTS.md](docs/RESULTS.md) | [docs/RESULTS.es.md](docs/RESULTS.es.md) |
273
- | [reports/HANDOVER_REPORT.md](reports/HANDOVER_REPORT.md) | |
274
-
275
- ---
276
 
277
- ## License and data
278
 
279
- Use the project dataset and API keys according to your course or organization rules. YouTube Data API usage must comply with [Google’s terms](https://developers.google.com/youtube/terms/api-services-terms-of-service).
 
1
+ <div align="center">
2
 
3
+ <img src="docs/assets/signalmod_logo.png" alt="SignalMod" width="520" />
 
 
 
4
 
5
+ ### Intelligent moderation for YouTube comments
6
 
7
+ 🌐 **English** · [Español](README.es.md)
8
 
9
+ ![Python](https://img.shields.io/badge/Python-3.12-3776AB?logo=python&logoColor=white)
10
+ ![FastAPI](https://img.shields.io/badge/FastAPI-0.136-009688?logo=fastapi&logoColor=white)
11
+ ![React](https://img.shields.io/badge/React-18-61DAFB?logo=react&logoColor=black)
12
+ ![Vite](https://img.shields.io/badge/Vite-5-646CFF?logo=vite&logoColor=white)
13
+ ![PyTorch](https://img.shields.io/badge/PyTorch-2.x-EE4C2C?logo=pytorch&logoColor=white)
14
+ ![Transformers](https://img.shields.io/badge/Transformers-5.9-FFD21E?logo=huggingface&logoColor=black)
15
+ ![scikit-learn](https://img.shields.io/badge/scikit--learn-1.8-F7931E?logo=scikitlearn&logoColor=white)
16
+ ![Supabase](https://img.shields.io/badge/Supabase-DB-3ECF8E?logo=supabase&logoColor=white)
17
+ ![Docker](https://img.shields.io/badge/Docker-compose-2496ED?logo=docker&logoColor=white)
18
+ ![Render](https://img.shields.io/badge/Deploy-Render-46E3B7?logo=render&logoColor=white)
19
+
20
+ </div>
21
 
22
  ---
23
 
24
+ ## Project description
25
 
26
+ **SignalMod** is an intelligent moderation assistant for YouTube comments. It automatically classifies each comment as **Safe** or **Toxic**, returns a probability between 0 and 1, and tags toxicity categories (insult, threat, identity hate, obscene content).
 
 
 
 
 
 
27
 
28
+ It is built around the team's **hybrid meta-feature stacking** model (frozen Toxic-BERT embeddings combined with metadata features and a regularised logistic regression), reaching **F1 = 0.805** with a train–test gap of **2.54 pp** on the project's 200-sample test split.
29
 
30
+ The product ships as a FastAPI REST service plus a React SPA that mimics the YouTube Watch experience: pick a video, the API fetches the latest 50 comments via the YouTube Data API, scores them, and persists every prediction in Supabase so any visitor can see the full history.
31
 
32
+ ---
33
 
34
+ ## Tools and languages
35
+
36
+ ### Languages
37
+ - **Python 3.12** — backend, ML pipelines, evaluation.
38
+ - **TypeScript + React 18** — frontend SPA.
39
+ - **SQL (PostgreSQL via Supabase)** — predictions persistence.
40
+
41
+ ### Backend
42
+ - **FastAPI 0.136** — REST API, Pydantic schemas, lifespan model loading.
43
+ - **Uvicorn** — ASGI server with hot reload.
44
+ - **scikit-learn 1.8** — TF-IDF baseline + meta-learner Logistic Regression.
45
+ - **Optuna** — hyperparameter search for the TF-IDF baseline.
46
+ - **PyTorch 2.x + Transformers 5.9** — frozen `unitary/toxic-bert` for CLS embeddings.
47
+ - **spaCy + NLTK** — lemmatisation, stopwords, regex-based cleanup.
48
+ - **MLflow** — experiment tracking.
49
+ - **Supabase Python SDK** — predictions persistence with anonymous RLS policies.
50
+ - **google-api-python-client** — YouTube Data API v3 integration.
51
+
52
+ ### Frontend
53
+ - **React 18 + Vite 5 + TypeScript** — SPA with hot module reload.
54
+ - **CSS modules** — YouTube-like dark theme.
55
+
56
+ ### Tooling and ops
57
+ - **uv** — Python package and venv manager (`pyproject.toml` + `uv.lock`).
58
+ - **pnpm** — frontend package manager.
59
+ - **Docker + Docker Compose** — single-container deploy serving API + built SPA.
60
+ - **GNU Make** — `make dev`, `make install`, `make build`, `make docker`.
61
+ - **Render** — free-tier deploy via `render.yaml` blueprint.
62
+ - **Pytest** — unit tests for API contracts and preprocessing.
63
 
64
+ ---
 
 
 
 
65
 
66
+ ## Project architecture
67
 
68
+ ```
69
+ Project_9_Equipo3/
70
+ ├── configs/ # YAML configs for pipelines and inference catalog
71
+ │ ├── pipeline.yaml # Training data paths, target columns, CV folds
72
+ │ ├── features.yaml # Preprocessing and TF-IDF settings
73
+ │ ├── model_catalog.yaml # Inference catalog (3 swappable models)
74
+ │ ├── best_params.yaml # Optuna winner for the LR baseline
75
+ │ ├── suggested_videos.yaml # YouTube IDs shown in the Up-next rail
76
+ │ └── *_training.yaml # Training profiles (golden baseline, expert, hybrid, …)
77
+ ├── data/ # Raw and processed datasets (git-ignored)
78
+ ├── docs/ # API.md, PIPELINE.md, ARCHITECTURE.md, DEPLOY.md
79
+ │ └── assets/signalmod_logo.png # Brand assets
80
+ ├── frontend/ # React + Vite SPA
81
+ │ ├── public/signalmod_logo.png # Logo served as static asset
82
+ │ └── src/
83
+ │ ├── api/ # Typed HTTP client
84
+ │ ├── components/ # Layout, CommentRow, SuggestedRail, ModelBanner
85
+ │ ├── context/ # Global app state (active model, threshold)
86
+ │ ├── hooks/ # useDebouncedPredict
87
+ │ ├── pages/ # WatchPage, HubPage, SettingsPage
88
+ │ └── utils/ # toxicityColor, randomUsername, relativeTime
89
+ ├── models/
90
+ │ ├── baseline/lr_tfidf.joblib # Optuna-tuned LR baseline
91
+ │ └── production_final/ # meta_stack_final.joblib — production artifact
92
+ ├── notebooks/
93
+ │ ├── 01–04 # EDA, preprocessing, TF-IDF, baseline LR
94
+ │ ├── 12 # Golden baseline (frozen Toxic-BERT)
95
+ │ ├── 14 # Final meta-stacking — production artifact
96
+ │ └── archive_attempts/ # Earlier experiments preserved for reproducibility
97
+ ├── reports/ # Metrics, plots, EDA figures, summary.csv
98
+ ├── src/
99
+ │ ├── api/ # FastAPI app
100
+ │ │ ├── main.py # Lifespan, CORS, static SPA mount
101
+ │ │ ├── routes/ # health, models, predict (+ /predictions), videos
102
+ │ │ ├── schemas.py # Pydantic request/response models
103
+ │ │ ├── services.py # predict_single, to_predict_response
104
+ │ │ ├── state.py # Shared app state
105
+ │ │ └── youtube.py # YouTube Data API fetch + suggested metadata
106
+ │ ├── data/ # Loader, dual loader for hybrid pipelines
107
+ │ ├── db/ # Supabase client + save_prediction helpers
108
+ │ ├── evaluation/ # Evaluator, threshold tuning, stable CV
109
+ │ ├── experiments/ # Notebook 13 / 14 script versions
110
+ │ ├── features/ # text_preprocessor, vectorizer, metadata, augmentation
111
+ │ ├── models/ # baseline (LR/RF/XGBoost), hybrid_ensemble, metadata_lr
112
+ │ ├── pipeline/ # run_pipeline + per-strategy variants
113
+ │ ├── service/ # ModelService, meta_stack_predictor, model_catalog
114
+ │ └── utils/ # Logger
115
+ ├── supabase/predictions_setup.sql # SQL to create the predictions table + RLS policies
116
+ ├── tests/ # Pytest suite
117
+ ├── Dockerfile # Multi-stage build (frontend + uv backend)
118
+ ├── docker-compose.yml # One-container deploy serving API + SPA
119
+ ├── render.yaml # Render blueprint (web service + static site)
120
+ ├── Procfile # Render process declaration
121
+ ├── Makefile # make dev / install / build / docker / test
122
+ ├── pyproject.toml + uv.lock # Python dependencies pinned with uv
123
+ └── README.md / README.es.md # English / Spanish documentation
124
+ ```
125
 
126
+ ### Data flow
127
 
128
+ ```
129
+ ┌────────────────────────────────────────────────┐
130
+ │ React SPA (Vite) http://localhost:5173│
131
+ │ Layout · Watch · Hub · Settings │
132
+ └──────────────────┬─────────────────────────────┘
133
+ HTTP JSON (Vite proxy :8000)
134
+ ┌──────────────────▼─────────────────────────────┐
135
+ │ FastAPI http://localhost:8000│
136
+ │ /predict /predict-batch /predict-video │
137
+ │ /predictions (GET — Supabase history) │
138
+ │ /models /models/select /model-info │
139
+ │ /videos/suggested /health │
140
+ └──────┬─────────────────────────────┬───────────┘
141
+ │ │
142
+ ┌──────────────▼─────────────┐ ┌─────────────▼──────────────┐
143
+ │ ModelService │ │ YouTube Data API v3 │
144
+ │ · local joblib │ │ · video metadata │
145
+ │ · hf_remote │ │ · 50 newest comments │
146
+ │ · meta_stack (production) │ │ │
147
+ └──────┬─────────────────────┘ └────────────────────────────┘
148
+
149
+ ┌──────▼──────────────────────────────────────────────────┐
150
+ │ Supabase (PostgreSQL) │
151
+ │ table: predictions(id, created_at, text, video_id, │
152
+ │ probability, is_toxic, labels, …) │
153
+ │ RLS: anon insert + anon select │
154
+ └─────────────────────────────────────────────────────────┘
155
  ```
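The `/predict` route in the diagram above can be exercised from Python as well as from a browser. The following is a minimal sketch, assuming the request body still carries the `text` and `threshold` fields used by the earlier README's curl example and that the API is reachable on the local dev URL. The helper names `build_predict_request` and `predict` are hypothetical, and the response is returned as a plain dict because its exact fields are documented in `docs/API.md` rather than here.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # local dev URL from this README


def build_predict_request(text: str, threshold: float = 0.381) -> urllib.request.Request:
    """Build (but do not send) a POST /predict request.

    The payload fields `text` and `threshold` are assumed from the
    earlier README's curl example; check docs/API.md for the current schema.
    """
    body = json.dumps({"text": text, "threshold": threshold}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, threshold: float = 0.381) -> dict:
    """Send the request and decode the JSON reply (requires a running API)."""
    request = build_predict_request(text, threshold)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Splitting request construction from sending keeps the payload easy to inspect without a running server; `predict("Thanks for the great tutorial!")` only works once the API from the steps in this README is up.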
156
 
157
+ ### Model catalog (swappable from the UI)
 
 
 
 
158
 
159
+ | Model | Type | F1 (test) | Train–test gap | Threshold | Latency | Default |
160
+ | -------------------------------- | ----------- | --------- | -------------- | --------- | ------- | ------- |
161
+ | **Meta-Feature Stacking** | Hybrid | **0.805** | **2.54 pp** | **0.381** | ~400 ms | **Yes** |
162
+ | Frozen Toxic-BERT | Transformer | 0.790 | 0.16 pp | 0.120 | ~400 ms | No |
163
+ | LR + TF-IDF (Optuna) | sklearn | 0.758 | 4.76 pp | 0.500 | < 50 ms | No |
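The "Train–test gap" column can be read as the absolute difference between train and test F1, expressed in percentage points, which the team compares against its 5 pp overfitting limit. A small sketch of that arithmetic follows; the function names and the sample scores are illustrative assumptions, not values from the project's reports.

```python
def gap_pp(f1_train: float, f1_test: float) -> float:
    """Absolute train-test F1 gap in percentage points."""
    return abs(f1_train - f1_test) * 100


def passes_overfit_rule(f1_train: float, f1_test: float, max_gap_pp: float = 5.0) -> bool:
    """True when the gap stays strictly under the limit (5 pp by default)."""
    return gap_pp(f1_train, f1_test) < max_gap_pp


# Illustrative scores only (not taken from the project's reports).
print(f"{gap_pp(0.83, 0.80):.2f} pp")   # prints 3.00 pp
print(passes_overfit_rule(0.83, 0.80))  # prints True
```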
164
 
165
+ The production model concatenates the frozen `[CLS]` embedding from `unitary/toxic-bert` (768-d) with hand-crafted metadata features (length, uppercase ratio, emoji density…), scales them with `StandardScaler`, and feeds them into a `LogisticRegression(C=0.001)` meta-learner.
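The stacking step described above can be sketched with scikit-learn alone. This is an illustration under stated assumptions, not the code in `src/service`: random arrays stand in for the frozen Toxic-BERT `[CLS]` embeddings and for the metadata features, the number of metadata columns (3) is arbitrary, and comparing with `>=` against the 0.381 threshold is a guess at the decision rule.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholders: the real pipeline would get these from frozen Toxic-BERT
# and from the metadata feature extractor.
n_samples, n_meta = 200, 3
cls_embeddings = rng.normal(size=(n_samples, 768))  # stand-in [CLS] vectors
metadata = rng.normal(size=(n_samples, n_meta))     # stand-in length / caps / emoji features
labels = rng.integers(0, 2, size=n_samples)         # stand-in targets: 0 = Safe, 1 = Toxic

# Concatenate both feature blocks column-wise: shape (200, 771).
features = np.hstack([cls_embeddings, metadata])

# Scale, then fit a strongly regularised logistic regression (C=0.001).
meta_learner = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.001, max_iter=1000),
)
meta_learner.fit(features, labels)

THRESHOLD = 0.381  # production threshold from the catalog table
p_toxic = meta_learner.predict_proba(features)[:, 1]
is_toxic = p_toxic >= THRESHOLD  # assumed decision rule
```

A small `C` means heavy L2 regularisation, which is how a 771-dimensional input can be fitted on roughly a thousand comments without the train score running far ahead of the test score.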
 
 
 
 
 
166
 
167
  ---
168
 
169
+ ## Setup & run
 
 
 
 
 
170
 
171
+ ### 1. Prerequisites
 
 
 
 
 
172
 
173
+ | Tool | macOS / Linux | Windows |
174
+ | ----------- | ----------------------------------- | --------------------------------------------------------- |
175
+ | **Python 3.12** | `brew install python@3.12` | [python.org/downloads](https://www.python.org/downloads/) (check *Add Python to PATH*) |
176
+ | **uv** | `curl -LsSf https://astral.sh/uv/install.sh \| sh` | `powershell -c "irm https://astral.sh/uv/install.ps1 \| iex"` |
177
+ | **Node.js 18+** | `brew install node` | [nodejs.org](https://nodejs.org/) (LTS) |
178
+ | **pnpm** | `npm i -g pnpm` | `npm i -g pnpm` |
179
+ | **Make** *(optional)* | already installed | `winget install GnuWin32.Make` (or use WSL) |
180
 
181
+ ### 2. Clone & configure
182
 
183
  ```bash
184
+ git clone https://github.com/Bootcamp-IA-P6/Project_9_Equipo3.git
185
+ cd Project_9_Equipo3
186
 
187
  cp .env.example .env
188
+ # Fill: YOUTUBE_API_KEY, SUPABASE_URL, SUPABASE_KEY
 
 
189
  ```
190
 
191
+ > **Windows PowerShell**: replace `cp` with `Copy-Item .env.example .env`.
192
 
193
+ Paste `supabase/predictions_setup.sql` into the Supabase SQL editor before the first run (creates the `predictions` table + RLS policies).
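Before the first run it can help to confirm that the three variables named in the `.env` step are actually set. The check below is a hypothetical helper, not part of the repository, and it treats all three names as expected even though this README does not say whether each one is strictly required.

```python
import os

# Variable names taken from the `.env` step above.
EXPECTED_VARS = ("YOUTUBE_API_KEY", "SUPABASE_URL", "SUPABASE_KEY")


def missing_env_vars(env=None):
    """Return the expected variable names that are unset or empty.

    `env` defaults to the process environment; pass a dict to check
    values parsed from a `.env` file instead.
    """
    env = os.environ if env is None else env
    return [name for name in EXPECTED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All expected variables are set.")
```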
194
 
195
+ ### 3. Run three ways
196
 
197
+ #### Option A — With Makefile (recommended on macOS / Linux / WSL)
198
 
199
  ```bash
200
+ make install # uv sync + pnpm install
201
+ make dev # FastAPI :8000 + Vite :5173
202
  ```
203
 
204
+ | Command | What it does |
205
+ | ------------- | --------------------------------------------- |
206
+ | `make install`| Install Python + frontend deps |
207
+ | `make dev` | Start API and UI in parallel (Ctrl+C stops both) |
208
+ | `make api` | API only |
209
+ | `make ui` | UI only |
210
+ | `make build` | Build the SPA into `frontend/dist` |
211
+ | `make test` | Run Pytest |
212
+ | `make docker` | `docker compose up --build` |
213
+ | `make stop` | Kill anything on ports 8000 / 5173 |
214
+ | `make clean` | Remove `.venv`, `node_modules`, `dist` |
215
 
216
+ #### Option B — Manual (macOS / Linux)
217
 
218
+ Two terminals.
219
 
220
+ **Terminal 1 — API**
221
  ```bash
222
+ uv sync
223
+ uv run uvicorn src.api.main:app --reload --port 8000
 
224
  ```
225
 
226
+ **Terminal 2 — Frontend**
227
  ```bash
228
+ cd frontend
229
+ pnpm install
230
+ pnpm dev
231
  ```
232
 
233
+ #### Option C — Manual (Windows PowerShell)
 
 
 
234
 
235
+ Two terminals.
236
 
237
+ **Terminal 1 — API**
238
+ ```powershell
239
+ uv sync
240
+ uv run uvicorn src.api.main:app --reload --port 8000
241
  ```
242
 
243
+ **Terminal 2 — Frontend**
244
+ ```powershell
245
+ cd frontend
246
+ pnpm install
247
+ pnpm dev
248
+ ```
249
 
250
+ > If `uv` is not recognised after install, close and reopen PowerShell so the new `PATH` is picked up.
251
 
252
+ ### 4. Open the app
 
 
 
 
 
 
 
 
253
 
254
+ | URL | What you'll see |
255
+ | ------------------------------ | ---------------------------------------- |
256
+ | http://localhost:5173 | React SPA — Watch / Hub / Settings |
257
+ | http://localhost:8000/docs | FastAPI Swagger UI |
258
+ | http://localhost:8000/health | Health check |
259
 
260
+ ### 5. Docker (one container — API + SPA built)
 
 
 
 
261
 
262
+ Same commands on **macOS / Linux / Windows**:
263
 
264
  ```bash
265
+ # Normal — keeps images and volumes for fast rebuilds
266
+ docker compose up --build
267
+ # → http://localhost:8000 · Ctrl+C to stop · docker compose down
 
 
 
268
 
269
+ # Ephemeral demo — Ctrl+C tears down container + image + volumes
270
+ make docker-demo
271
 
272
+ # Manual full cleanup
273
+ make docker-clean
274
+ # (equivalent to: docker compose down --rmi local --volumes --remove-orphans)
275
  ```
276
 
277
  ---
278
 
279
+ More: see [docs/PIPELINE.md](docs/PIPELINE.md) for training, [docs/API.md](docs/API.md) for endpoints, [docs/DEPLOY.md](docs/DEPLOY.md) for Render deployment.
280
 
281
  ---
282
 
283
+ ## Contributors
284
+
285
+ <table>
286
+ <tr>
287
+ <td align="center" width="25%">
288
+ <b>Andrés Torrez</b><br/>
289
+ <sub>Backend Developer</sub>
290
+ </td>
291
+ <td align="center" width="25%">
292
+ <b>Mirae Kang</b><br/>
293
+ <sub>Scrum Master</sub>
294
+ </td>
295
+ <td align="center" width="25%">
296
+ <b>Jonathan Brasales</b><br/>
297
+ <sub>AI Developer</sub>
298
+ </td>
299
+ <td align="center" width="25%">
300
+ <b>Roberto Molero</b><br/>
301
+ <sub>Product Owner</sub>
302
+ </td>
303
+ </tr>
304
+ </table>
305
 
306
  ---
307
 
308
+ <div align="center">
309
 
310
+ **SignalMod** · Bootcamp IA P6 · Team 3 · 2026
311
 
312
+ </div>