Mirae Kang committed on
Commit
52b0ede
·
1 Parent(s): 975d796

docs: documentation, #15

.env.example CHANGED
@@ -1,4 +1,4 @@
1
- # Copy to .env for local development: cp env.example .env
2
  # Docker Compose reads these via environment (optional).
3
 
4
  # YouTube Data API v3 (optional — /predict-video and scraping)
 
1
+ # Copy to .env for local development: cp .env.example .env
2
  # Docker Compose reads these via environment (optional).
3
 
4
  # YouTube Data API v3 (optional — /predict-video and scraping)
README.es.md ADDED
@@ -0,0 +1,171 @@
1
+ # Detector de comentarios tóxicos en YouTube (SignalMod)
2
+
3
+ [![Python](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/downloads/)
4
+ [![FastAPI](https://img.shields.io/badge/FastAPI-0.136-009688.svg)](https://fastapi.tiangolo.com/)
5
+ [![Streamlit](https://img.shields.io/badge/Streamlit-UI-FF4B4B.svg)](https://streamlit.io/)
6
+ [![Docker](https://img.shields.io/badge/docker-compose-2496ED.svg)](https://docs.docker.com/compose/)
7
+
8
+ **English:** [README.md](README.md)
9
+
10
+ Clasificación binaria **Seguro vs Tóxico** para comentarios estilo YouTube. Stack de producción: **FastAPI** (API REST) y **Streamlit** (interfaz tipo página de vídeo). Modelo por defecto: **Regresión logística + TF-IDF** (`models/final_model.joblib`).
11
+
12
+ ---
13
+
14
+ ## Descripción del proyecto
15
+
16
+ | Elemento | Detalle |
17
+ |----------|---------|
18
+ | **Objetivo** | Apoyar a moderadores detectando comentarios tóxicos |
19
+ | **Dataset** | `data/raw/youtoxic_english_1000.csv` (~1000 comentarios en inglés) |
20
+ | **Etiqueta** | `IsToxic` → **Seguro (0)** / **Tóxico (1)** |
21
+ | **Métrica principal** | F1 ponderado y ROC-AUC |
22
+ | **Control de sobreajuste** | \|F1 CV − F1 test\| < 5 puntos porcentuales |
23
+
24
+ ---
25
+
26
+ ## Arquitectura
27
+
28
+ ```
29
+ youtube_hate_detector/
30
+ ├── configs/ # YAML: pipeline, features, models, best_params
31
+ ├── data/raw/ # CSV fuente
32
+ ├── models/ # final_model.joblib, experimentos/
33
+ ├── reports/ # summary.csv, gráficos, artefactos del pipeline
34
+ ├── src/
35
+ │ ├── api/ # FastAPI
36
+ │ ├── app/ # Streamlit (src/app/app.py)
37
+ │ ├── evaluation/ # Evaluator
38
+ │ ├── features/ # Preprocesado y vectorización
39
+ │ ├── models/ # LR, RF, XGBoost
40
+ │ ├── pipeline/ # Entrenamiento end-to-end
41
+ │ └── service/ # ModelService
42
+ ├── tests/
43
+ ├── Dockerfile
44
+ └── docker-compose.yml
45
+ ```
46
+
47
+ **Flujo:** entrenamiento (`run_pipeline`) → inferencia API o Streamlit vía `ModelService`.
48
+
49
+ Más detalle: [docs/ARCHITECTURE.es.md](docs/ARCHITECTURE.es.md)
50
+
51
+ ---
52
+
53
+ ## Instalación
54
+
55
+ ```bash
56
+ git clone https://github.com/Bootcamp-IA-P6/Project_9_Equipo3.git
57
+ cd Project_9_Equipo3
58
+
59
+ python -m venv .venv
60
+ source .venv/bin/activate
61
+
62
+ pip install -r requirements.txt
63
+ python -m spacy download en_core_web_sm
64
+ ```
65
+
66
+ Coloca `youtoxic_english_1000.csv` en `data/raw/`.
67
+
68
+ ```bash
69
+ cp .env.example .env
70
+ # Opcional: YOUTUBE_API_KEY, MODEL_NAME
71
+ ```
72
+
73
+ ---
74
+
75
+ ## Pipeline de entrenamiento
76
+
77
+ ```bash
78
+ python -m src.pipeline.run_pipeline --model lr
79
+ # lr | rf | xgboost
80
+ ```
81
+
82
+ Actualiza [`reports/summary.csv`](reports/summary.csv) y guarda gráficos en `reports/pipeline/{model}/`.
83
+
84
+ Documentación: [docs/PIPELINE.es.md](docs/PIPELINE.es.md)
85
+
86
+ ---
87
+
88
+ ## Docker
89
+
90
+ ```bash
91
+ docker compose up --build
92
+ ```
93
+
94
+ | Servicio | URL |
95
+ |----------|-----|
96
+ | Streamlit | http://localhost:8501 |
97
+ | FastAPI | http://localhost:8000 |
98
+ | Swagger | http://localhost:8000/docs |
99
+
100
+ ```bash
101
+ docker compose down
102
+ ```
103
+
104
+ ---
105
+
106
+ ## Ejecución local
107
+
108
+ ```bash
109
+ uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
110
+ streamlit run src/app/app.py --server.port 8501
111
+ ```
112
+
113
+ ---
114
+
115
+ ## Ejemplos de API
116
+
117
+ Ver [docs/API.es.md](docs/API.es.md)
118
+
119
+ ```bash
120
+ curl -s -X POST http://localhost:8000/predict \
121
+ -H "Content-Type: application/json" \
122
+ -d '{"text": "Great video!", "threshold": 0.5}'
123
+ ```
124
+
125
+ ---
126
+
127
+ ## Resultados
128
+
129
+ Mejor modelo **sklearn** en test (`configs/best_params.yaml`):
130
+
131
+ | Métrica | Valor |
132
+ |---------|-------|
133
+ | F1 (ponderado, test) | **0.7579** |
134
+ | ROC-AUC | **0.81** |
135
+ | Falsos positivos | 18 |
136
+ | Falsos negativos | 30 |
137
+ | Brecha CV–test | **4.76 pp** |
138
+
139
+ Gráficos EDA: `reports/v2/`.
140
+
141
+ ---
142
+
143
+ ## Comparativa de modelos
144
+
145
+ Tabla canónica: [`reports/summary.csv`](reports/summary.csv)
146
+ Resumen: [docs/RESULTS.es.md](docs/RESULTS.es.md)
147
+
148
+ | Modelo | Familia | F1 (test) | ROC-AUC | Por defecto |
149
+ |--------|---------|-----------|---------|-------------|
150
+ | LR + TF-IDF (ajustado) | sklearn | 0.7579 | 0.81 | Sí |
151
+ | RF / XGBoost | sklearn | — | — | Ejecutar pipeline |
152
+ | DistilBERT / toxic-bert / RoBERTa | Hugging Face | — | — | Opcional en API/UI |
153
+
154
+ ---
155
+
156
+ ## Tests
157
+
158
+ ```bash
159
+ pytest tests/ -v
160
+ ```
161
+
162
+ ---
163
+
164
+ ## Índice de documentación
165
+
166
+ | Español | English |
167
+ |---------|---------|
168
+ | [docs/API.es.md](docs/API.es.md) | [docs/API.md](docs/API.md) |
169
+ | [docs/PIPELINE.es.md](docs/PIPELINE.es.md) | [docs/PIPELINE.md](docs/PIPELINE.md) |
170
+ | [docs/ARCHITECTURE.es.md](docs/ARCHITECTURE.es.md) | [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) |
171
+ | [docs/RESULTS.es.md](docs/RESULTS.es.md) | [docs/RESULTS.md](docs/RESULTS.md) |
README.md CHANGED
@@ -1,66 +1,134 @@
1
  # YouTube Toxic Comment Detector (SignalMod)
2
 
3
- Binary **Safe vs Toxic** comment moderation assistant with a **FastAPI** backend and a **Streamlit** UI.
4
 
5
- ## Quick start (Docker)
6
 
7
- No manual setup beyond Docker. The image bundles the default model (`models/final_model.joblib`), configs, and NLP assets (spaCy + NLTK).
8
 
9
- ```bash
10
- docker compose up --build
11
  ```
12
 
13
- | Service | URL |
14
- |-----------|-----|
15
- | Streamlit UI | http://localhost:8501 |
16
- | FastAPI | http://localhost:8000 |
17
- | API docs | http://localhost:8000/docs |
18
 
19
- Optional: set `YOUTUBE_API_KEY` for live comment scraping on `/predict-video`:
 
 
20
 
21
  ```bash
22
- export YOUTUBE_API_KEY=your_key_here
23
- docker compose up --build
24
  ```
25
 
26
- Stop containers:
 
 
27
 
28
  ```bash
29
- docker compose down
 
 
30
  ```
31
 
32
- Docker image and containers use the project name `youtube_hate_detector` (e.g. `youtube_hate_detector-api`). If you previously built `ai-nlp-app:latest`, remove it once: `docker rmi ai-nlp-app:latest`.
33
 
34
- ## Architecture
35
 
36
  ```
37
- youtube_hate_detector/
38
- ├── configs/ # YAML hyperparameters (non-secret)
39
- ├── data/
40
- │ ├── raw/ # Original dataset (gitignored)
41
- │ └── processed/
42
- ├── models/ # Serialized models (e.g. final_model.joblib)
43
- ├── src/
44
- │ ├── api/ # FastAPI (REST)
45
- │ ├── app/ # Streamlit UI
46
- │ ├── data/
47
- │ ├── features/
48
- │ ├── models/
49
- │ ├── pipeline/
50
- │ ├── service/ # ModelService (inference)
51
- │ └── utils/
52
- ├── tests/
53
- ├── Dockerfile
54
- └── docker-compose.yml
 
 
55
  ```
56
 
57
- ## Local development (without Docker)
58
 
59
  ```bash
60
- python -m venv .venv && source .venv/bin/activate
61
- pip install -r requirements.txt
62
- python -m spacy download en_core_web_sm
 
 
63
 
 
 
 
 
 
64
  # Terminal 1 — API
65
  uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
66
 
@@ -68,10 +136,104 @@ uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
68
  streamlit run src/app/app.py --server.port 8501
69
  ```
70
 
71
- Copy `env.example` to `.env` if you need a YouTube API key or custom `MODEL_NAME`.
72
 
73
  ## Tests
74
 
75
  ```bash
76
  pytest tests/ -v
77
  ```
1
  # YouTube Toxic Comment Detector (SignalMod)
2
 
3
+ [![Python](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/downloads/)
4
+ [![FastAPI](https://img.shields.io/badge/FastAPI-0.136-009688.svg)](https://fastapi.tiangolo.com/)
5
+ [![Streamlit](https://img.shields.io/badge/Streamlit-UI-FF4B4B.svg)](https://streamlit.io/)
6
+ [![Docker](https://img.shields.io/badge/docker-compose-2496ED.svg)](https://docs.docker.com/compose/)
7
+ **Español:** [README.es.md](README.es.md)
8
 
9
+ Automated **Safe vs Toxic** classification for YouTube-style comments. The production stack is **FastAPI** (REST inference) plus **Streamlit** (watch-page style UI). The default model is **Logistic Regression + TF-IDF** (`models/final_model.joblib`).
10
 
11
+ ---
12
+
13
+ ## Project description
14
+
15
+ | Item | Detail |
16
+ |------|--------|
17
+ | **Goal** | Help moderation teams flag toxic comments quickly |
18
+ | **Dataset** | `data/raw/youtoxic_english_1000.csv` (~1k English comments) |
19
+ | **Target** | `IsToxic` → **Safe (0)** / **Toxic (1)** |
20
+ | **Primary metric** | Weighted F1 and ROC-AUC (imbalanced classes) |
21
+ | **Overfitting check** | \|CV F1 − test F1\| < 5 percentage points (project rubric) |
22
+
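The overfitting check in the table above reduces to a single subtraction. A minimal sketch, assuming F1 scores are stored as fractions in [0, 1]; the function names and the example scores are illustrative and are not taken from `src/evaluation/evaluator.py`:

```python
def gap_pp(score_a: float, score_b: float) -> float:
    """Absolute difference between two scores in [0, 1], in percentage points."""
    return abs(score_a - score_b) * 100.0


def passes_overfitting_check(cv_f1: float, test_f1: float, max_gap_pp: float = 5.0) -> bool:
    """True when the CV-test gap stays strictly below the rubric's limit."""
    return gap_pp(cv_f1, test_f1) < max_gap_pp


# Hypothetical scores, chosen only to show both outcomes.
print(passes_overfitting_check(cv_f1=0.80, test_f1=0.76))  # gap of about 4 pp
print(passes_overfitting_check(cv_f1=0.85, test_f1=0.76))  # gap of about 9 pp
```

Working in percentage points keeps the threshold readable (`5.0` rather than `0.05`) and matches the units used in the results tables.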
23
+ ---
24
+
25
+ ## Architecture
26
 
 
 
27
  ```
28
+ youtube_hate_detector/
29
+ ├── configs/ # YAML: pipeline, features, models, best_params
30
+ ├── data/raw/ # Source CSV (not committed if gitignored)
31
+ ├── models/ # final_model.joblib, experiments/
32
+ ├── reports/ # summary.csv, plots, pipeline artifacts
33
+ ├── src/
34
+ │ ├── api/ # FastAPI — /predict, /predict-batch, …
35
+ │ ├── app/ # Streamlit UI (src/app/app.py)
36
+ │ ├── data/ # load_raw_data, scraping helpers
37
+ │ ├── evaluation/ # Evaluator — metrics, ROC, confusion matrix
38
+ │ ├── features/ # TextPreprocessor, Vectorizer
39
+ │ ├── models/ # LR, RF, XGBoost baselines
40
+ │ ├── pipeline/ # run_pipeline.py — train end-to-end
41
+ │ └── service/ # ModelService — shared inference layer
42
+ ├── tests/
43
+ ├── Dockerfile
44
+ └── docker-compose.yml
45
+ ```
46
+
47
+ **Runtime flow**
48
+
49
+ 1. **Training:** `load_raw_data` → `TextPreprocessor` → `build_model().fit()` → `Evaluator` → `reports/summary.csv`
50
+ 2. **API:** `uvicorn` loads `ModelService` → `POST /predict`
51
+ 3. **Streamlit:** `ModelService.predict()` in-process (same models as API catalog)
52
+
53
+ See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for more detail.
54
 
55
+ ---
56
 
57
+ ## Installation
58
+
59
+ **Requirements:** Python 3.12+, ~2 GB disk for dependencies (optional PyTorch if using Hugging Face models in the UI).
60
 
61
  ```bash
62
+ git clone https://github.com/Bootcamp-IA-P6/Project_9_Equipo3.git
63
+ cd Project_9_Equipo3 # or your local folder name
64
+
65
+ python -m venv .venv
66
+ source .venv/bin/activate # Windows: .venv\Scripts\activate
67
+
68
+ pip install -r requirements.txt
69
+ python -m spacy download en_core_web_sm
70
  ```
71
 
72
+ **Data:** place `youtoxic_english_1000.csv` under `data/raw/` (path in `configs/pipeline.yaml`).
73
+
74
+ **Environment:**
75
 
76
  ```bash
77
+ cp .env.example .env
78
+ # Optional: YOUTUBE_API_KEY for /predict-video
79
+ # MODEL_NAME must match a key in ModelService (default: LR + TF-IDF (local))
80
  ```
81
 
82
+ ---
83
 
84
+ ## Training pipeline
85
 
86
+ End-to-end training and evaluation:
87
+
88
+ ```bash
89
+ python -m src.pipeline.run_pipeline --model lr
90
+ # Options: lr | rf | xgboost
91
  ```
92
+
93
+ **Phases:** load data → stratified split → spaCy/NLTK preprocessing → train → 5-fold CV → test metrics → save `models/experiments/{model}/` → MLflow → update [`reports/summary.csv`](reports/summary.csv) and plots under `reports/pipeline/{model}/`.
94
+
95
+ Config files:
96
+
97
+ | File | Purpose |
98
+ |------|---------|
99
+ | `configs/pipeline.yaml` | Paths, `IsToxic`, test_size, CV folds |
100
+ | `configs/features.yaml` | Preprocessing + TF-IDF |
101
+ | `configs/models.yaml` | Classifier hyperparameters |
102
+ | `configs/best_params.yaml` | Optuna winner (LR) |
103
+
104
+ Details: [docs/PIPELINE.md](docs/PIPELINE.md)
105
+
106
+ ---
107
+
108
+ ## Run with Docker
109
+
110
+ ```bash
111
+ docker compose up --build
112
  ```
113
 
114
+ | Service | URL |
115
+ |---------|-----|
116
+ | Streamlit | http://localhost:8501 |
117
+ | FastAPI | http://localhost:8000 |
118
+ | Swagger | http://localhost:8000/docs |
119
 
120
  ```bash
121
+ export YOUTUBE_API_KEY=your_key # optional; set before `docker compose up`
122
+ docker compose down # stop
123
+ ```
124
+
125
+ Containers: `youtube_hate_detector-api`, `youtube_hate_detector-streamlit`.
126
 
127
+ ---
128
+
129
+ ## Local run (without Docker)
130
+
131
+ ```bash
132
  # Terminal 1 — API
133
  uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
134
 
 
136
  streamlit run src/app/app.py --server.port 8501
137
  ```
138
 
139
+ ---
140
+
141
+ ## API examples
142
+
143
+ Full reference: [docs/API.md](docs/API.md)
144
+
145
+ **Health check**
146
+
147
+ ```bash
148
+ curl -s http://localhost:8000/ | python -m json.tool
149
+ ```
150
+
151
+ **Single prediction**
152
+
153
+ ```bash
154
+ curl -s -X POST http://localhost:8000/predict \
155
+ -H "Content-Type: application/json" \
156
+ -d '{"text": "This video is amazing, thanks for sharing!", "threshold": 0.5}'
157
+ ```
158
+
159
+ Example response:
160
+
161
+ ```json
162
+ {
163
+ "text": "This video is amazing, thanks for sharing!",
164
+ "is_toxic": false,
165
+ "probability": 0.08,
166
+ "labels": [],
167
+ "model_used": "LR + TF-IDF (local)",
168
+ "latency_ms": 12.5
169
+ }
170
+ ```
171
+
172
+ **Batch**
173
+
174
+ ```bash
175
+ curl -s -X POST http://localhost:8000/predict-batch \
176
+ -H "Content-Type: application/json" \
177
+ -d '{"texts": ["Great content!", "You are an idiot"], "threshold": 0.5}'
178
+ ```
179
+
180
+ **List / switch models**
181
+
182
+ ```bash
183
+ curl -s http://localhost:8000/models
184
+ curl -s -X PUT http://localhost:8000/model/DistilBERT%20Toxicity
185
+ ```
186
+
187
+ ---
188
+
189
+ ## Results
190
+
191
+ Best **sklearn** model on the project test split (from `configs/best_params.yaml`):
192
+
193
+ | Metric | Value |
194
+ |--------|-------|
195
+ | F1 (weighted, test) | **0.7579** |
196
+ | ROC-AUC | **0.81** |
197
+ | False positives | 18 |
198
+ | False negatives | 30 |
199
+ | CV–test gap | **4.76 pp** (within 5 pp target) |
200
+ | Train–test gap | 14.07 pp |
201
+
202
+ Plots and EDA: `reports/v2/`. Per-run artifacts: `reports/pipeline/{lr,rf,xgboost}/`.
203
+
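The FP/FN counts and the weighted F1 above are tied together by the confusion matrix. A minimal pure-Python sketch, assuming labels are encoded as Safe = 0 and Toxic = 1 as in the project description; the helper names are illustrative, and the project itself may compute these values through scikit-learn instead:

```python
from typing import Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> dict:
    """Count TP, FP, FN, TN with Toxic (1) as the positive class."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1  # Safe comment flagged as Toxic
        elif truth == 1 and pred == 0:
            counts["fn"] += 1  # Toxic comment that was missed
        else:
            counts["tn"] += 1
    return counts


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 for one class; 0.0 when the class has no hits, false alarms, or misses."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Per-class F1 averaged with each class weighted by its number of true examples."""
    c = confusion_counts(y_true, y_pred)
    f1_toxic = f1(c["tp"], c["fp"], c["fn"])
    # For the Safe class the roles swap: TN are its hits, FN its false alarms, FP its misses.
    f1_safe = f1(c["tn"], c["fn"], c["fp"])
    n_toxic = c["tp"] + c["fn"]
    n_safe = c["tn"] + c["fp"]
    total = n_toxic + n_safe
    return (n_toxic * f1_toxic + n_safe * f1_safe) / total if total else 0.0
```

Weighting by class support is what makes the metric usable on imbalanced data: the larger class contributes proportionally more, while a badly served minority class still lowers the score.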
204
+ ---
205
+
206
+ ## Model comparison
207
+
208
+ Canonical table: [`reports/summary.csv`](reports/summary.csv)
209
+ Human-readable: [docs/RESULTS.md](docs/RESULTS.md)
210
+
211
+ | Model | Family | F1 (test) | ROC-AUC | FP | FN | Production default |
212
+ |-------|--------|-----------|---------|----|----|--------------------|
213
+ | LR + TF-IDF (tuned) | sklearn | 0.7579 | 0.81 | 18 | 30 | Yes |
214
+ | LR + TF-IDF (local) | sklearn | 0.7579 | 0.81 | 18 | 30 | Yes (`final_model.joblib`) |
215
+ | RF / XGBoost | sklearn | — | — | — | — | Run pipeline to fill |
216
+ | DistilBERT / toxic-bert / RoBERTa | Hugging Face | — | — | — | — | Optional via API/UI |
217
+
218
+ Re-run `python -m src.pipeline.run_pipeline --model rf` to append RF metrics to `summary.csv`.
219
+
220
+ ---
221
 
222
  ## Tests
223
 
224
  ```bash
225
  pytest tests/ -v
226
  ```
227
+
228
+ Covers preprocessor, vectorizer, model binary output, and `/predict` response shape.
229
+
230
+ ---
231
+
232
+ ## Documentation index
233
+
234
+ | English | Español |
235
+ |---------|---------|
236
+ | [docs/API.md](docs/API.md) | [docs/API.es.md](docs/API.es.md) |
237
+ | [docs/PIPELINE.md](docs/PIPELINE.md) | [docs/PIPELINE.es.md](docs/PIPELINE.es.md) |
238
+ | [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) | [docs/ARCHITECTURE.es.md](docs/ARCHITECTURE.es.md) |
239
+ | [docs/RESULTS.md](docs/RESULTS.md) | [docs/RESULTS.es.md](docs/RESULTS.es.md) |
docs/API.es.md ADDED
@@ -0,0 +1,92 @@
1
+ # Referencia API (FastAPI)
2
+
3
+ URL base (local): `http://localhost:8000`
4
+ Documentación interactiva: `/docs`, `/redoc`
5
+
6
+ Implementación: [`src/api/main.py`](../src/api/main.py)
7
+
8
+ ---
9
+
10
+ ## Endpoints
11
+
12
+ | Método | Ruta | Descripción |
13
+ |--------|------|-------------|
14
+ | `GET` | `/` | Estado del servicio y modelo activo |
15
+ | `GET` | `/model-info` | Metadatos del modelo cargado |
16
+ | `GET` | `/models` | Modelos disponibles y activo |
17
+ | `PUT` | `/model/{model_name}` | Cambiar modelo activo |
18
+ | `POST` | `/predict` | Clasificar un comentario |
19
+ | `POST` | `/predict-batch` | Hasta 100 comentarios |
20
+ | `POST` | `/predict-video` | Comentarios de un vídeo de YouTube |
21
+
22
+ ---
23
+
24
+ ## `POST /predict`
25
+
26
+ **Cuerpo**
27
+
28
+ ```json
29
+ {
30
+ "text": "Texto del comentario",
31
+ "threshold": 0.5
32
+ }
33
+ ```
34
+
35
+ **Respuesta**
36
+
37
+ ```json
38
+ {
39
+ "text": "Texto del comentario",
40
+ "is_toxic": false,
41
+ "probability": 0.08,
42
+ "labels": [],
43
+ "model_used": "LR + TF-IDF (local)",
44
+ "latency_ms": 15.2
45
+ }
46
+ ```
47
+
48
+ - `is_toxic`: `true` = **Tóxico**, `false` = **Seguro**
49
+ - `probability`: probabilidad de clase tóxica (0–1)
50
+
51
+ **curl**
52
+
53
+ ```bash
54
+ curl -s -X POST http://localhost:8000/predict \
55
+ -H "Content-Type: application/json" \
56
+ -d '{"text": "¡Gran vídeo, gracias!", "threshold": 0.5}'
57
+ ```
58
+
59
+ ---
60
+
61
+ ## `POST /predict-batch`
62
+
63
+ ```bash
64
+ curl -s -X POST http://localhost:8000/predict-batch \
65
+ -H "Content-Type: application/json" \
66
+ -d '{"texts": ["Comentario seguro", "Eres un idiota"], "threshold": 0.5}'
67
+ ```
68
+
69
+ ---
70
+
71
+ ## `POST /predict-video`
72
+
73
+ Requiere `YOUTUBE_API_KEY` en `.env` para comentarios reales.
74
+
75
+ ```json
76
+ {
77
+ "url": "https://www.youtube.com/watch?v=VIDEO_ID",
78
+ "max_comments": 50,
79
+ "threshold": 0.5
80
+ }
81
+ ```
82
+
83
+ ---
84
+
85
+ ## Variables de entorno
86
+
87
+ | Variable | Descripción |
88
+ |----------|-------------|
89
+ | `MODEL_NAME` | Modelo al arrancar la API |
90
+ | `YOUTUBE_API_KEY` | API de YouTube para `/predict-video` |
91
+
92
+ Ver [`.env.example`](../.env.example).
docs/API.md ADDED
@@ -0,0 +1,147 @@
1
+ # API reference (FastAPI)
2
+
3
+ Base URL (local): `http://localhost:8000`
4
+ Interactive docs: `/docs` (Swagger), `/redoc` (ReDoc)
5
+
6
+ Implementation: [`src/api/main.py`](../src/api/main.py)
7
+ Inference: [`src/service/model_service.py`](../src/service/model_service.py)
8
+
9
+ ---
10
+
11
+ ## Endpoints
12
+
13
+ | Method | Path | Description |
14
+ |--------|------|-------------|
15
+ | `GET` | `/` | Health check and active model name |
16
+ | `GET` | `/model-info` | Metadata for the loaded model |
17
+ | `GET` | `/models` | List available models and active one |
18
+ | `PUT` | `/model/{model_name}` | Switch active model (lazy load on next predict) |
19
+ | `POST` | `/predict` | Classify one comment |
20
+ | `POST` | `/predict-batch` | Classify up to 100 comments |
21
+ | `POST` | `/predict-video` | Fetch YouTube comments and classify (needs API key or demo fallback) |
22
+
23
+ ---
24
+
25
+ ## `POST /predict`
26
+
27
+ **Request body**
28
+
29
+ ```json
30
+ {
31
+ "text": "Comment text here",
32
+ "threshold": 0.5
33
+ }
34
+ ```
35
+
36
+ | Field | Type | Required | Description |
37
+ |-------|------|----------|-------------|
38
+ | `text` | string | yes | 1–5000 characters, non-empty after trim |
39
+ | `threshold` | float | no | Toxic if `probability >= threshold` (default `0.5`) |
40
+
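A minimal sketch of the two rules in the field table above, assuming the length limit is applied to the trimmed text (the table does not say whether it is checked before or after trimming). `validate_text` and `label_for` are hypothetical helpers, not the API's actual request schema:

```python
MAX_TEXT_LENGTH = 5000  # upper bound from the field table


def validate_text(text: str) -> str:
    """Return the trimmed text, or raise ValueError if it breaks the documented limits."""
    trimmed = text.strip()
    if not trimmed:
        raise ValueError("text must be non-empty after trim")
    if len(trimmed) > MAX_TEXT_LENGTH:
        raise ValueError(f"text must be at most {MAX_TEXT_LENGTH} characters")
    return trimmed


def label_for(probability: float, threshold: float = 0.5) -> str:
    """Toxic when probability >= threshold, otherwise Safe."""
    return "Toxic" if probability >= threshold else "Safe"
```

Because the comparison is `>=`, a probability exactly equal to the threshold is labelled Toxic; raising the threshold trades false positives for false negatives.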
41
+ **Response**
42
+
43
+ ```json
44
+ {
45
+ "text": "Comment text here",
46
+ "is_toxic": false,
47
+ "probability": 0.0821,
48
+ "labels": [],
49
+ "model_used": "LR + TF-IDF (local)",
50
+ "latency_ms": 15.2
51
+ }
52
+ ```
53
+
54
+ | Field | Description |
55
+ |-------|-------------|
56
+ | `is_toxic` | `true` = **Toxic**, `false` = **Safe** |
57
+ | `probability` | P(toxic), 0.0–1.0 |
58
+ | `labels` | Optional category hints when toxic (keyword/heuristic or HF labels) |
59
+ | `model_used` | Active model id from `ModelService` |
60
+
61
+ **curl**
62
+
63
+ ```bash
64
+ curl -s -X POST http://localhost:8000/predict \
65
+ -H "Content-Type: application/json" \
66
+ -d '{"text": "Thanks for the tutorial!", "threshold": 0.5}'
67
+ ```
68
+
69
+ **Toxic example**
70
+
71
+ ```bash
72
+ curl -s -X POST http://localhost:8000/predict \
73
+ -H "Content-Type: application/json" \
74
+ -d '{"text": "You are worthless garbage", "threshold": 0.5}'
75
+ ```
76
+
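The same call can be made from Python with only the standard library. This is a sketch rather than an official client: `build_predict_request` and `predict` are hypothetical names, and the base URL assumes the local setup described above.

```python
import json
import urllib.request


def build_predict_request(text: str, threshold: float = 0.5,
                          base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST /predict request with a JSON body."""
    body = json.dumps({"text": text, "threshold": threshold}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, threshold: float = 0.5) -> dict:
    """Send the request to a running API and decode the JSON response."""
    request = build_predict_request(text, threshold)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting request construction from sending makes the payload easy to inspect or unit-test without a running server; `predict("Thanks for the tutorial!")` only works once the API is up.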
77
+ ---
78
+
79
+ ## `POST /predict-batch`
80
+
81
+ ```json
82
+ {
83
+ "texts": ["Safe comment", "Another line"],
84
+ "threshold": 0.5
85
+ }
86
+ ```
87
+
88
+ Response includes `results` (list of predict objects), `total`, `toxic_count`, `latency_ms`.
89
+
90
+ ```bash
91
+ curl -s -X POST http://localhost:8000/predict-batch \
92
+ -H "Content-Type: application/json" \
93
+ -d '{"texts": ["Nice video", "I hate you"], "threshold": 0.5}'
94
+ ```
95
+
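The endpoint table lists a limit of 100 comments per call, so longer lists need to be split on the client side. A minimal sketch, assuming the caller simply sends one request per chunk; `chunk_texts` and `batch_payloads` are hypothetical helper names:

```python
from typing import Iterator, List

MAX_BATCH_SIZE = 100  # limit stated in the endpoint table


def chunk_texts(texts: List[str], size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` comments."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


def batch_payloads(texts: List[str], threshold: float = 0.5) -> List[dict]:
    """Build one /predict-batch request body per chunk."""
    return [{"texts": chunk, "threshold": threshold} for chunk in chunk_texts(texts)]
```

Each payload has the same shape as the request body shown above, so the per-chunk `results`, `total`, and `toxic_count` fields can be merged by the caller afterwards.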
96
+ ---
97
+
98
+ ## `POST /predict-video`
99
+
100
+ ```json
101
+ {
102
+ "url": "https://www.youtube.com/watch?v=VIDEO_ID",
103
+ "max_comments": 50,
104
+ "threshold": 0.5
105
+ }
106
+ ```
107
+
108
+ Set `YOUTUBE_API_KEY` in `.env` for live comment fetch. Without a key, the API may use a limited fallback scraper or demo data (see implementation in `main.py`).
109
+
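The request body carries a full watch URL, so the video ID has to be extracted somewhere before comments can be fetched. How `main.py` does this is not shown here; the following is only a sketch of one way to do it, covering the `youtube.com/watch?v=...` form from the example and nothing else (short `youtu.be` links are deliberately out of scope). `extract_video_id` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the `v` query parameter of a youtube.com/watch URL, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host not in {"youtube.com", "www.youtube.com", "m.youtube.com"}:
        return None
    if parsed.path != "/watch":
        return None
    values = parse_qs(parsed.query).get("v")
    return values[0] if values else None
```

Returning `None` instead of raising lets the caller decide whether an unrecognised URL should become a validation error or trigger a fallback.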
110
+ ---
111
+
112
+ ## `GET /models` and model switch
113
+
114
+ ```bash
115
+ curl -s http://localhost:8000/models
116
+
117
+ curl -s -X PUT "http://localhost:8000/model/LR%20%2B%20TF-IDF%20(local)"
118
+ ```
119
+
120
+ Available names match keys in `AVAILABLE_MODELS` inside `model_service.py`, for example:
121
+
122
+ - `LR + TF-IDF (local)` — default, `models/final_model.joblib`
123
+ - `DistilBERT Toxicity` — Hugging Face remote (requires `transformers`, `torch`)
124
+ - `toxic-bert (multilabel)`
125
+ - `RoBERTa Toxicity`
126
+
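Model names contain spaces, `+`, and parentheses, so they must be percent-encoded before being placed in the URL path. A minimal sketch using the standard library that reproduces the encoding in the curl example above; `model_switch_url` is a hypothetical helper and the base URL assumes a local run:

```python
from urllib.parse import quote


def model_switch_url(model_name: str, base_url: str = "http://localhost:8000") -> str:
    """Percent-encode a model name for the path segment of PUT /model/{model_name}."""
    # safe="()" keeps parentheses literal, as in the curl example; "/" is encoded too.
    return f"{base_url}/model/{quote(model_name, safe='()')}"


print(model_switch_url("LR + TF-IDF (local)"))
```

The `+` has to become `%2B`; a literal `+` in the path would be passed through unchanged by most servers, but some clients and proxies treat it as a space.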
127
+ ---
128
+
129
+ ## Environment variables
130
+
131
+ | Variable | Used by | Description |
132
+ |----------|---------|-------------|
133
+ | `MODEL_NAME` | API startup | Initial model from `AVAILABLE_MODELS` |
134
+ | `YOUTUBE_API_KEY` | `/predict-video` | YouTube Data API v3 |
135
+ | `ENV` | logging / behavior | `development` or `production` |
136
+
137
+ Copy from [`.env.example`](../.env.example).
138
+
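A minimal sketch of reading these variables with defaults. The `load_settings` helper and the returned dictionary keys are hypothetical, and the `development` default for `ENV` is an assumption; only the default model name comes from the docs.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL_NAME = "LR + TF-IDF (local)"  # default model named in the docs


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the documented variables from `env` (defaults to os.environ)."""
    source = os.environ if env is None else env
    api_key = source.get("YOUTUBE_API_KEY", "").strip()
    return {
        "model_name": source.get("MODEL_NAME", DEFAULT_MODEL_NAME),
        "youtube_api_key": api_key or None,  # None means no live key is configured
        "env": source.get("ENV", "development"),  # assumed default, not confirmed
    }
```

Passing a plain dictionary instead of relying on `os.environ` keeps the function easy to test and avoids leaking settings between test cases.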
139
+ ---
140
+
141
+ ## Errors
142
+
143
+ | Status | When |
144
+ |--------|------|
145
+ | `422` | Invalid body (e.g. empty `text`) |
146
+ | `503` | Model not loaded yet |
147
+ | `500` | Prediction failure |
docs/ARCHITECTURE.es.md ADDED
@@ -0,0 +1,52 @@
1
+ # Arquitectura del sistema
2
+
3
+ ## Componentes
4
+
5
+ ```mermaid
6
+ flowchart TB
7
+ subgraph datos [Capa de datos]
8
+ CSV[data/raw/youtoxic_english_1000.csv]
9
+ CFG[configs/*.yaml]
10
+ end
11
+
12
+ subgraph entrenamiento [Entrenamiento]
13
+ PIPE[run_pipeline.py]
14
+ PRE[TextPreprocessor]
15
+ BL[build_model]
16
+ EV[Evaluator]
17
+ CSV --> PIPE
18
+ CFG --> PIPE
19
+ PIPE --> PRE --> BL --> EV
20
+ EV --> SUM[reports/summary.csv]
21
+ end
22
+
23
+ subgraph inferencia [Inferencia]
24
+ MS[ModelService]
25
+ API[FastAPI]
26
+ UI[Streamlit]
27
+ MS --> API
28
+ MS --> UI
29
+ end
30
+ ```
31
+
32
+ ## Módulos
33
+
34
+ | Módulo | Función |
35
+ |--------|---------|
36
+ | `src/data/loader.py` | Carga del dataset |
37
+ | `src/features/text_preprocessor.py` | Limpieza y lematización |
38
+ | `src/models/baseline.py` | Modelos sklearn + TF-IDF |
39
+ | `src/evaluation/evaluator.py` | Métricas y comparativa |
40
+ | `src/pipeline/run_pipeline.py` | Pipeline completo |
41
+ | `src/service/model_service.py` | Predicción unificada |
42
+ | `src/api/main.py` | API REST |
43
+ | `src/app/app.py` | Interfaz Streamlit |
44
+
45
+ ## Etiquetas
46
+
47
+ - Binario: `IsToxic` → Seguro (0) / Tóxico (1)
48
+ - API: `is_toxic`, `probability`
49
+
50
+ ## Docker
51
+
52
+ Dos servicios: API (8000) y Streamlit (8501), imagen `youtube_hate_detector:latest`.
docs/ARCHITECTURE.md ADDED
@@ -0,0 +1,66 @@
1
+ # System architecture
2
+
3
+ ## Components
4
+
5
+ ```mermaid
6
+ flowchart TB
7
+ subgraph data [Data layer]
8
+ CSV[data/raw/youtoxic_english_1000.csv]
9
+ CFG[configs/*.yaml]
10
+ end
11
+
12
+ subgraph training [Training]
13
+ PIPE[run_pipeline.py]
14
+ PRE[TextPreprocessor]
15
+ BL[build_model LR RF XGB]
16
+ EV[Evaluator]
17
+ CSV --> PIPE
18
+ CFG --> PIPE
19
+ PIPE --> PRE --> BL --> EV
20
+ EV --> SUM[reports/summary.csv]
21
+ BL --> JOB[models/experiments/]
22
+ end
23
+
24
+ subgraph inference [Inference]
25
+ MS[ModelService]
26
+ JOB2[models/final_model.joblib]
27
+ JOB2 --> MS
28
+ API[FastAPI src/api/main.py]
29
+ UI[Streamlit src/app/app.py]
30
+ MS --> API
31
+ MS --> UI
32
+ end
33
+ ```
34
+
35
+ ## Module map
36
+
37
+ | Module | Responsibility |
38
+ |--------|----------------|
39
+ | `src/data/loader.py` | Load raw CSV, optional processed paths |
40
+ | `src/features/text_preprocessor.py` | Clean and lemmatize text |
41
+ | `src/features/vectorizer.py` | Standalone TF-IDF (notebooks); baselines embed TF-IDF in sklearn `Pipeline` |
42
+ | `src/models/baseline.py` | `LRModel`, `RFModel`, `XGBModel`, `build_model()` |
43
+ | `src/evaluation/evaluator.py` | Metrics, ROC, confusion matrix, error analysis, `summary.csv` |
44
+ | `src/pipeline/run_pipeline.py` | Orchestrates training + evaluation |
45
+ | `src/service/model_service.py` | Loads joblib or Hugging Face models; `predict(text)` |
46
+ | `src/api/main.py` | REST endpoints, lifespan model load |
47
+ | `src/app/app.py` | Streamlit UI; calls `ModelService` directly |
48
+
49
+ ## Label strategy
50
+
51
+ - **Binary default:** column `IsToxic` → Safe `0`, Toxic `1`
52
+ - User-facing strings: **Safe** / **Toxic** (not “hate” or “harmful” in the UI copy)
53
+ - API returns `is_toxic` and `probability` (P(toxic))
54
+
55
+ ## Docker
56
+
57
+ [`docker-compose.yml`](../docker-compose.yml) runs two containers from one image:
58
+
59
+ - `youtube_hate_detector-api` — uvicorn port 8000
60
+ - `youtube_hate_detector-streamlit` — port 8501
61
+
62
+ Both include `final_model.joblib`, configs, spaCy, and NLTK data baked into the image.
63
+
64
+ ## Tests
65
+
66
+ [`tests/`](../tests/) — preprocessor, vectorizer, model binary outputs, `/predict` schema (mocked service).
docs/PIPELINE.es.md ADDED
@@ -0,0 +1,49 @@
1
+ # Pipeline de entrenamiento
2
+
3
+ Punto de entrada: [`src/pipeline/run_pipeline.py`](../src/pipeline/run_pipeline.py)
4
+
5
+ ## Comando
6
+
7
+ ```bash
8
+ python -m src.pipeline.run_pipeline --model lr
9
+ ```
10
+
11
+ | Flag | Valores | Por defecto |
12
+ |------|---------|-------------|
13
+ | `--model` | `lr`, `rf`, `xgboost` | `lr` |
14
+
15
+ Ejecutar desde la raíz del repositorio.
16
+
17
+ ## Fases
18
+
19
+ 1. **Carga** — CSV en `data/raw/youtoxic_english_1000.csv`
20
+ 2. **Split** — train/test estratificado
21
+ 3. **Preprocesado** — `TextPreprocessor` (spaCy + NLTK)
22
+ 4. **Entrenamiento** — `build_model()`
23
+ 5. **Validación cruzada** — 5 folds
24
+ 6. **Evaluación** — `Evaluator.evaluate_and_report()` en test
25
+ 7. **Guardado** — `models/experiments/{model}/`
26
+ 8. **MLflow** — `mlruns/`
27
+ 9. **Informes** — `reports/summary.csv` y `reports/pipeline/{model}/`
28
+
29
+ ## Configuración
30
+
31
+ | Archivo | Uso |
32
+ |---------|-----|
33
+ | `configs/pipeline.yaml` | Rutas, `IsToxic`, split, CV |
34
+ | `configs/features.yaml` | TF-IDF y preprocesado |
35
+ | `configs/models.yaml` | Hiperparámetros de clasificadores |
36
+ | `configs/best_params.yaml` | Ganador Optuna (LR) |
37
+
38
+ ## Salidas
39
+
40
+ | Ruta | Contenido |
41
+ |------|-----------|
42
+ | `reports/summary.csv` | Tabla comparativa de modelos |
43
+ | `reports/pipeline/lr/cm_lr.png` | Matriz de confusión |
44
+ | `reports/pipeline/lr/roc_lr.png` | Curva ROC |
45
+ | `reports/pipeline/lr/errors_lr.csv` | FP / FN |
46
+
47
+ ## Modelo en producción
48
+
49
+ La API y Streamlit cargan `models/final_model.joblib` vía `ModelService`.
docs/PIPELINE.md ADDED
@@ -0,0 +1,68 @@
1
+ # Training pipeline
2
+
3
+ Entry point: [`src/pipeline/run_pipeline.py`](../src/pipeline/run_pipeline.py)
4
+
5
+ ## Command
6
+
7
+ ```bash
8
+ python -m src.pipeline.run_pipeline --model lr
9
+ ```
10
+
11
+ | Flag | Choices | Default |
12
+ |------|---------|---------|
13
+ | `--model` | `lr`, `rf`, `xgboost` | `lr` |
14
+
15
+ Run from the repository root so `configs/` and `data/raw/` resolve correctly.
16
+
17
+ ## Phases
18
+
19
+ 1. **Load data** — `load_raw_data()` reads `configs/pipeline.yaml` → `data/raw/youtoxic_english_1000.csv`
20
+ 2. **Split** — stratified train/test (`test_size`, `random_state` in YAML)
21
+ 3. **Preprocess** — `TextPreprocessor` (lowercase, regex cleanup, spaCy lemmas, NLTK stopwords)
22
+ 4. **Train** — `build_model(model_type)` fits TF-IDF + classifier pipeline
23
+ 5. **Cross-validation** — 5-fold stratified CV, F1 weighted + ROC-AUC
24
+ 6. **Evaluate** — `Evaluator.evaluate_and_report()` on test set
25
+ 7. **Save** — `models/experiments/{model}/{model}_pipeline_{timestamp}.joblib`
26
+ 8. **MLflow** — metrics and sklearn pipeline under `mlruns/`
27
+ 9. **Reports** — append row to `reports/summary.csv`; PNGs in `reports/pipeline/{model}/`
28
+
29
+ ## Configuration
30
+
31
+ | File | Keys (examples) |
32
+ |------|-----------------|
33
+ | `configs/pipeline.yaml` | `target_binary: IsToxic`, `test_size: 0.2`, `cv_folds: 5` |
34
+ | `configs/features.yaml` | TF-IDF `max_features`, `ngram_range`, preprocessing flags |
35
+ | `configs/models.yaml` | LR `C`, RF `n_estimators`, etc. |
36
+ | `configs/best_params.yaml` | Optuna winner for LR (overrides defaults when training LR) |
37
+
38
+ ## Outputs
39
+
40
+ | Path | Content |
41
+ |------|---------|
42
+ | `reports/summary.csv` | All runs — model comparison table |
43
+ | `reports/pipeline/lr/cm_lr.png` | Confusion matrix |
44
+ | `reports/pipeline/lr/roc_lr.png` | ROC curve |
45
+ | `reports/pipeline/lr/errors_lr.csv` | False positives / negatives |
46
+ | `reports/pipeline/lr/exp_*.json` | Full metrics per run |
47
+ | `models/experiments/lr/*.joblib` | Serialized pipeline |
48
+
49
+ ## Evaluator API
50
+
51
+ [`src/evaluation/evaluator.py`](../src/evaluation/evaluator.py):
52
+
53
+ ```python
54
+ from src.evaluation.evaluator import Evaluator
55
+
56
+ evaluator = Evaluator(output_dir="reports/pipeline/lr")
57
+ metrics = evaluator.evaluate_and_report(
58
+ model, X_test, y_test, model_name="LR",
59
+ X_train=X_train, y_train=y_train, cv_results=cv_results,
60
+ summary_path="reports/summary.csv",
61
+ )
62
+ ```
63
+
64
+ Metrics include: `f1_weighted`, `f1_toxic`, `roc_auc`, `fp`, `fn`, `cv_test_gap_pp`, `train_test_gap_pp`, plus paths to plots.
65
+
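The two `*_gap_pp` metrics are reported in percentage points. A plausible reading, stated here as an assumption rather than the evaluator's confirmed formula, is the difference between a reference F1 (train or cross-validation) and the test F1, multiplied by 100. `gap_pp` and `meets_gap_target` are hypothetical names.

```python
def gap_pp(reference_score: float, test_score: float) -> float:
    """Gap between two scores in percentage points (assumed definition)."""
    return round((reference_score - test_score) * 100, 2)


def meets_gap_target(gap: float, max_gap_pp: float = 5.0) -> bool:
    """True when the gap is strictly below the target (assumed threshold)."""
    return gap < max_gap_pp


if __name__ == "__main__":
    gap = gap_pp(0.90, 0.75)  # illustrative scores, not project results
    print(gap, meets_gap_target(gap))
```

Expressing the gap in percentage points rather than as a ratio keeps it comparable across models whose absolute scores differ.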
66
+ ## Production model
67
+
68
+ Inference uses `models/final_model.joblib` (loaded by `ModelService`). To update production after a successful pipeline run, copy or export the best experiment artifact to `models/final_model.joblib`.
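Promoting an experiment artifact to production can be a plain file copy. The sketch below assumes you have already chosen which artifact to promote. `promote_to_production` is a hypothetical helper, not part of the repository.

```python
import shutil
from pathlib import Path


def promote_to_production(
    artifact: Path,
    production: Path = Path("models/final_model.joblib"),
) -> Path:
    """Copy a chosen experiment artifact over the production model file."""
    if not artifact.is_file():
        raise FileNotFoundError(f"No such artifact: {artifact}")
    # Create models/ if it does not exist yet.
    production.parent.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves file metadata such as the modification time.
    shutil.copy2(artifact, production)
    return production
```

Call it with the path of the artifact you selected under `models/experiments/`. Whether a running service picks up the new file without a restart depends on how `ModelService` loads the model, which this sketch does not cover.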
docs/RESULTS.es.md ADDED
@@ -0,0 +1,48 @@
1
+ # Resultados y comparativa de modelos
2
+
3
+ Datos: [`reports/summary.csv`](../reports/summary.csv)
4
+ Hiperparámetros: [`configs/best_params.yaml`](../configs/best_params.yaml)
5
+
6
+ ## Mejor modelo sklearn (producción)
7
+
8
+ **Ganador:** Regresión logística + TF-IDF (Optuna), archivo `models/final_model.joblib`.
9
+
10
+ | Métrica | Valor en test | Notas |
11
+ |---------|---------------|-------|
12
+ | F1 (ponderado) | **0.7579** | Métrica principal |
13
+ | ROC-AUC | **0.81** | |
14
+ | Falsos positivos | **18** | Seguros marcados como tóxicos |
15
+ | Falsos negativos | **30** | Tóxicos no detectados |
16
+ | F1 (train) | 0.8987 | |
17
+ | Brecha train–test | 14.07 pp | |
18
+ | Brecha CV–test | **4.76 pp** | Objetivo < 5 pp |
19
+
20
+ ## Tabla comparativa
21
+
22
+ | Modelo | Familia | F1 (test) | ROC-AUC | FP | FN | Por defecto |
23
+ |--------|---------|-----------|---------|----|----|-------------|
24
+ | LR + TF-IDF (ajustado) | sklearn | 0.7579 | 0.81 | 18 | 30 | Sí |
25
+ | LR + TF-IDF (local) | sklearn | 0.7579 | 0.81 | 18 | 30 | Sí |
26
+ | Random Forest | sklearn | — | — | — | — | Ejecutar `--model rf` |
27
+ | XGBoost | sklearn | — | — | — | — | Ejecutar `--model xgboost` |
28
+ | DistilBERT Toxicity | Hugging Face | — | — | — | — | Opcional en API |
29
+ | toxic-bert | Hugging Face | — | — | — | — | Opcional |
30
+ | RoBERTa Toxicity | Hugging Face | — | — | — | — | Opcional |
31
+
32
+ ## Actualizar métricas
33
+
34
+ ```bash
35
+ python -m src.pipeline.run_pipeline --model lr
36
+ python -m src.pipeline.run_pipeline --model rf
37
+ python -m src.pipeline.run_pipeline --model xgboost
38
+ ```
39
+
40
+ Salidas: `reports/summary.csv`, gráficos en `reports/pipeline/{model}/`.
41
+
42
+ ## EDA
43
+
44
+ Figuras adicionales en `reports/v2/`.
45
+
46
+ ## Análisis de errores
47
+
48
+ Términos frecuentes en FP/FN y ejemplos en `reports/pipeline/*/errors_*.csv`.
docs/RESULTS.md ADDED
@@ -0,0 +1,62 @@
1
+ # Model results and comparison
2
+
3
+ Canonical data: [`reports/summary.csv`](../reports/summary.csv)
4
+ Tuned hyperparameters: [`configs/best_params.yaml`](../configs/best_params.yaml)
5
+
6
+ ## Best sklearn model (production)
7
+
8
+ **Winner:** Logistic Regression + TF-IDF (Optuna-tuned), exported as `models/final_model.joblib`.
9
+
10
+ | Metric | Test value | Notes |
11
+ |--------|------------|-------|
12
+ | F1 (weighted) | **0.7579** | Primary project metric |
13
+ | ROC-AUC | **0.81** | Ranking quality |
14
+ | False positives | **18** | Safe comments marked toxic |
15
+ | False negatives | **30** | Toxic comments missed |
16
+ | F1 (train) | 0.8987 | In-sample |
17
+ | Train–test gap | 14.07 pp | High; use the CV–test gap to judge generalization |
18
+ | CV–test gap | **4.76 pp** | Meets < 5 pp rubric |
19
+ | Test size | ~20% stratified | See `configs/pipeline.yaml` |
20
+
21
+ **Optuna hyperparameters (LR):** `C≈0.32`, `max_features=4045`, bigrams `(1,2)`, `min_df=2`.
22
+
23
+ ## Comparison table
24
+
25
+ | Model | Family | F1 (test) | ROC-AUC | FP | FN | Default in API/UI |
26
+ |-------|--------|-----------|---------|----|----|-------------------|
27
+ | LR + TF-IDF (tuned) | sklearn | 0.7579 | 0.81 | 18 | 30 | Yes |
28
+ | LR + TF-IDF (local) | sklearn | 0.7579 | 0.81 | 18 | 30 | Yes (`final_model.joblib`) |
29
+ | Random Forest | sklearn | — | — | — | — | Run pipeline `--model rf` |
30
+ | XGBoost | sklearn | — | — | — | — | Run pipeline `--model xgboost` |
31
+ | DistilBERT Toxicity | Hugging Face | — | — | — | — | Optional (`PUT /model/...`) |
32
+ | toxic-bert (multilabel) | Hugging Face | — | — | — | — | Optional |
33
+ | RoBERTa Toxicity | Hugging Face | — | — | — | — | Optional |
34
+
35
+ Rows with empty metrics are placeholders until you run the pipeline or evaluate HF models on the same test split.
36
+
37
+ ## How to refresh metrics
38
+
39
+ ```bash
40
+ python -m src.pipeline.run_pipeline --model lr
41
+ python -m src.pipeline.run_pipeline --model rf
42
+ python -m src.pipeline.run_pipeline --model xgboost
43
+ ```
44
+
45
+ Each run appends a row to [`reports/summary.csv`](../reports/summary.csv) and writes:
46
+
47
+ - `reports/pipeline/{model}/cm_{model}.png`
48
+ - `reports/pipeline/{model}/roc_{model}.png`
49
+ - `reports/pipeline/{model}/errors_{model}.csv`
50
+
51
+ ## EDA and experiments
52
+
53
+ Additional figures generated by the notebooks are in `reports/v2/`: label distribution, TF-IDF features, ensemble charts, and transformer confusion matrices (`nb08_*`).
54
+
55
+ ## Error analysis
56
+
57
+ The evaluator prints and saves:
58
+
59
+ - **Most common terms** in false positives and false negatives
60
+ - Example comments with highest/lowest toxic probability among errors
61
+
62
+ See `reports/pipeline/*/errors_*.csv` after a pipeline run.
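The "most common terms" analysis can be approximated with a simple token counter. This is a rough sketch under stated assumptions: `top_terms` is a hypothetical helper, the regex tokenizer is a stand-in for whatever the evaluator actually does, and no stopword removal is applied.

```python
import re
from collections import Counter


def top_terms(comments: list[str], n: int = 10) -> list[tuple[str, int]]:
    """Return the n most frequent lowercase word tokens across comments."""
    counts: Counter[str] = Counter()
    for comment in comments:
        # Naive tokenization: runs of letters and apostrophes only.
        counts.update(re.findall(r"[a-z']+", comment.lower()))
    return counts.most_common(n)


if __name__ == "__main__":
    false_positives = ["you are bad", "bad bad take"]  # made-up examples
    print(top_terms(false_positives, n=3))
```

Running it separately on the false-positive and false-negative comments from an `errors_*.csv` file shows which words the model over-weights and which it misses.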
env.example DELETED
@@ -1,9 +0,0 @@
1
- # Copia este archivo como .env y rellena los valores
2
- # cp .env.example .env
3
-
4
- # YouTube Data API v3
5
- # Obtener en: https://console.cloud.google.com/apis/credentials
6
- YOUTUBE_API_KEY=your_youtube_api_key_here
7
-
8
- # Entorno
9
- ENV=development # development | production
requirements.txt CHANGED
@@ -12,3 +12,6 @@ joblib==1.5.3
12
  pydantic==2.13.4
13
  transformers==5.9.0
14
  httpx==0.28.1
12
  pydantic==2.13.4
13
  transformers==5.9.0
14
  httpx==0.28.1
15
+ matplotlib>=3.8.0
16
+ seaborn>=0.13.0
17
+ mlflow>=2.0.0