Spaces: devrup404/SignalMod

Mirae Kang committed 5 days ago

Commit 52b0ede (1 parent: 975d796)

docs: documentation, #15

Files changed (13)

.env.example +1 -1
README.es.md +171 -0
README.md +202 -40
docs/API.es.md +92 -0
docs/API.md +147 -0
docs/ARCHITECTURE.es.md +52 -0
docs/ARCHITECTURE.md +66 -0
docs/PIPELINE.es.md +49 -0
docs/PIPELINE.md +68 -0
docs/RESULTS.es.md +48 -0
docs/RESULTS.md +62 -0
env.example +0 -9
requirements.txt +3 -0

.env.example CHANGED

@@ -1,4 +1,4 @@
-# Copy to .env for local development:  cp env.example .env
+# Copy to .env for local development:  cp .env.example .env
 # Docker Compose reads these via environment (optional).
 # YouTube Data API v3 (optional — /predict-video and scraping)
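
The variables touched by this file (`YOUTUBE_API_KEY`, `MODEL_NAME`, `ENV`) are all optional, so a reader could model their handling roughly as below. This is only a sketch: the `Settings` dataclass and `load_settings` helper are hypothetical names, not code from the repository, and the `development` fallback for `ENV` is an assumption. The default model name is the one the README documents.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Default model name as documented in README.md; the rest is illustrative.
DEFAULT_MODEL_NAME = "LR + TF-IDF (local)"


@dataclass(frozen=True)
class Settings:
    model_name: str
    youtube_api_key: Optional[str]  # None: no live key for /predict-video
    env: str


def load_settings(environ: Mapping[str, str] = os.environ) -> Settings:
    """Read the optional variables, falling back to assumed defaults."""
    key = environ.get("YOUTUBE_API_KEY", "").strip()
    return Settings(
        model_name=environ.get("MODEL_NAME", DEFAULT_MODEL_NAME),
        youtube_api_key=key or None,  # treat blank values as "not set"
        env=environ.get("ENV", "development"),  # assumed fallback
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise in tests.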

README.es.md ADDED

	@@ -0,0 +1,171 @@

+# Detector de comentarios tóxicos en YouTube (SignalMod)
+[![Python](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/downloads/)
+[![FastAPI](https://img.shields.io/badge/FastAPI-0.136-009688.svg)](https://fastapi.tiangolo.com/)
+[![Streamlit](https://img.shields.io/badge/Streamlit-UI-FF4B4B.svg)](https://streamlit.io/)
+[![Docker](https://img.shields.io/badge/docker-compose-2496ED.svg)](https://docs.docker.com/compose/)
+**English:** [README.md](README.md)
+Clasificación binaria **Seguro vs Tóxico** para comentarios estilo YouTube. Stack de producción: **FastAPI** (API REST) y **Streamlit** (interfaz tipo página de vídeo). Modelo por defecto: **Regresión logística + TF-IDF** (`models/final_model.joblib`).
+---
+## Descripción del proyecto
+| Elemento | Detalle |
+|----------|---------|
+| **Objetivo** | Apoyar a moderadores detectando comentarios tóxicos |
+| **Dataset** | `data/raw/youtoxic_english_1000.csv` (~1000 comentarios en inglés) |
+| **Etiqueta** | `IsToxic` → **Seguro (0)** / **Tóxico (1)** |
+| **Métrica principal** | F1 ponderado y ROC-AUC |
+| **Control de sobreajuste** | \|F1 CV − F1 test\| &lt; 5 puntos porcentuales |
+---
+## Arquitectura
+```
+youtube_hate_detector/
+├── configs/              # YAML: pipeline, features, models, best_params
+├── data/raw/             # CSV fuente
+├── models/               # final_model.joblib, experimentos/
+├── reports/              # summary.csv, gráficos, artefactos del pipeline
+├── src/
+│   ├── api/              # FastAPI
+│   ├── app/              # Streamlit (src/app/app.py)
+│   ├── evaluation/       # Evaluator
+│   ├── features/         # Preprocesado y vectorización
+│   ├── models/           # LR, RF, XGBoost
+│   ├── pipeline/         # Entrenamiento end-to-end
+│   └── service/          # ModelService
+├── tests/
+├── Dockerfile
+└── docker-compose.yml
+```
+**Flujo:** entrenamiento (`run_pipeline`) → inferencia API o Streamlit vía `ModelService`.
+Más detalle: [docs/ARCHITECTURE.es.md](docs/ARCHITECTURE.es.md)
+---
+## Instalación
+```bash
+git clone https://github.com/Bootcamp-IA-P6/Project_9_Equipo3.git
+cd Project_9_Equipo3
+python -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+python -m spacy download en_core_web_sm
+```
+Coloca `youtoxic_english_1000.csv` en `data/raw/`.
+```bash
+cp .env.example .env
+# Opcional: YOUTUBE_API_KEY, MODEL_NAME
+```
+---
+## Pipeline de entrenamiento
+```bash
+python -m src.pipeline.run_pipeline --model lr
+# lr | rf | xgboost
+```
+Actualiza [`reports/summary.csv`](reports/summary.csv) y guarda gráficos en `reports/pipeline/{model}/`.
+Documentación: [docs/PIPELINE.es.md](docs/PIPELINE.es.md)
+---
+## Docker
+```bash
+docker compose up --build
+```
+| Servicio | URL |
+|----------|-----|
+| Streamlit | http://localhost:8501 |
+| FastAPI | http://localhost:8000 |
+| Swagger | http://localhost:8000/docs |
+```bash
+docker compose down
+```
+---
+## Ejecución local
+```bash
+uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
+streamlit run src/app/app.py --server.port 8501
+```
+---
+## Ejemplos de API
+Ver [docs/API.es.md](docs/API.es.md)
+```bash
+curl -s -X POST http://localhost:8000/predict \
+  -H "Content-Type: application/json" \
+  -d '{"text": "Great video!", "threshold": 0.5}'
+```
+---
+## Resultados
+Mejor modelo **sklearn** en test (`configs/best_params.yaml`):
+| Métrica | Valor |
+|---------|-------|
+| F1 (ponderado, test) | **0.7579** |
+| ROC-AUC | **0.81** |
+| Falsos positivos | 18 |
+| Falsos negativos | 30 |
+| Brecha CV–test | **4.76 pp** |
+Gráficos EDA: `reports/v2/`.
+---
+## Comparativa de modelos
+Tabla canónica: [`reports/summary.csv`](reports/summary.csv)
+Resumen: [docs/RESULTS.es.md](docs/RESULTS.es.md)
+| Modelo | Familia | F1 (test) | ROC-AUC | Por defecto |
+|--------|---------|-----------|---------|-------------|
+| LR + TF-IDF (ajustado) | sklearn | 0.7579 | 0.81 | Sí |
+| RF / XGBoost | sklearn | — | — | Ejecutar pipeline |
+| DistilBERT / toxic-bert / RoBERTa | Hugging Face | — | — | Opcional en API/UI |
+---
+## Tests
+```bash
+pytest tests/ -v
+```
+---
+## Índice de documentación
+| Español | English |
+|---------|---------|
+| [docs/API.es.md](docs/API.es.md) | [docs/API.md](docs/API.md) |
+| [docs/PIPELINE.es.md](docs/PIPELINE.es.md) | [docs/PIPELINE.md](docs/PIPELINE.md) |
+| [docs/ARCHITECTURE.es.md](docs/ARCHITECTURE.es.md) | [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) |
+| [docs/RESULTS.es.md](docs/RESULTS.es.md) | [docs/RESULTS.md](docs/RESULTS.md) |

README.md CHANGED

@@ -1,66 +1,134 @@
 # YouTube Toxic Comment Detector (SignalMod)
-Binary **Safe vs Toxic** comment moderation assistant with a **FastAPI** backend and a **Streamlit** UI.
-## Quick start (Docker)
-No manual setup beyond Docker. The image bundles the default model (`models/final_model.joblib`), configs, and NLP assets (spaCy + NLTK).
-```bash
-docker compose up --build
 ```
-| Service   | URL |
-|-----------|-----|
-| Streamlit UI | http://localhost:8501 |
-| FastAPI      | http://localhost:8000 |
-| API docs     | http://localhost:8000/docs |
-Optional: set `YOUTUBE_API_KEY` for live comment scraping on `/predict-video`:
 ```bash
-export YOUTUBE_API_KEY=your_key_here
-docker compose up --build
 ```
-Stop containers:
 ```bash
-docker compose down
 ```
-Docker image and containers use the project name `youtube_hate_detector` (e.g. `youtube_hate_detector-api`). If you previously built `ai-nlp-app:latest`, remove it once: `docker rmi ai-nlp-app:latest`.
-## Architecture
 ```
-youtube_hate_detector/
-├── configs/           # YAML hyperparameters (non-secret)
-├── data/
-│   ├── raw/           # Original dataset (gitignored)
-│   └── processed/
-├── models/            # Serialized models (e.g. final_model.joblib)
-├── src/
-│   ├── api/           # FastAPI (REST)
-│   ├── app/           # Streamlit UI
-│   ├── data/
-│   ├── features/
-│   ├── models/
-│   ├── pipeline/
-│   ├── service/       # ModelService (inference)
-│   └── utils/
-├── tests/
-├── Dockerfile
-└── docker-compose.yml
 ```
-## Local development (without Docker)
 ```bash
-python -m venv .venv && source .venv/bin/activate
-pip install -r requirements.txt
-python -m spacy download en_core_web_sm
 # Terminal 1 — API
 uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
@@ -68,10 +136,104 @@ uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
 streamlit run src/app/app.py --server.port 8501
 ```
-Copy `env.example` to `.env` if you need a YouTube API key or custom `MODEL_NAME`.
 ## Tests
 ```bash
 pytest tests/ -v
 ```

 # YouTube Toxic Comment Detector (SignalMod)
+[![Python](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/downloads/)
+[![FastAPI](https://img.shields.io/badge/FastAPI-0.136-009688.svg)](https://fastapi.tiangolo.com/)
+[![Streamlit](https://img.shields.io/badge/Streamlit-UI-FF4B4B.svg)](https://streamlit.io/)
+[![Docker](https://img.shields.io/badge/docker-compose-2496ED.svg)](https://docs.docker.com/compose/)
+**Español:** [README.es.md](README.es.md)
+Automated **Safe vs Toxic** classification for YouTube-style comments. The production stack is **FastAPI** (REST inference) plus **Streamlit** (watch-page style UI). The default model is **Logistic Regression + TF-IDF** (`models/final_model.joblib`).
+---
+## Project description
+| Item | Detail |
+|------|--------|
+| **Goal** | Help moderation teams flag toxic comments quickly |
+| **Dataset** | `data/raw/youtoxic_english_1000.csv` (~1k English comments) |
+| **Target** | `IsToxic` → **Safe (0)** / **Toxic (1)** |
+| **Primary metric** | Weighted F1 and ROC-AUC (imbalanced classes) |
+| **Overfitting check** | \|CV F1 − test F1\| &lt; 5 percentage points (project rubric) |
+---
+## Architecture
 ```
+youtube_hate_detector/
+├── configs/              # YAML: pipeline, features, models, best_params
+├── data/raw/             # Source CSV (not committed if gitignored)
+├── models/               # final_model.joblib, experiments/
+├── reports/              # summary.csv, plots, pipeline artifacts
+├── src/
+│   ├── api/              # FastAPI — /predict, /predict-batch, …
+│   ├── app/              # Streamlit UI (src/app/app.py)
+│   ├── data/             # load_raw_data, scraping helpers
+│   ├── evaluation/       # Evaluator — metrics, ROC, confusion matrix
+│   ├── features/         # TextPreprocessor, Vectorizer
+│   ├── models/           # LR, RF, XGBoost baselines
+│   ├── pipeline/         # run_pipeline.py — train end-to-end
+│   └── service/          # ModelService — shared inference layer
+├── tests/
+├── Dockerfile
+└── docker-compose.yml
+```
+**Runtime flow**
+1. **Training:** `load_raw_data` → `TextPreprocessor` → `build_model().fit()` → `Evaluator` → `reports/summary.csv`
+2. **API:** `uvicorn` loads `ModelService` → `POST /predict`
+3. **Streamlit:** `ModelService.predict()` in-process (same models as API catalog)
+See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for more detail.
+---
+## Installation
+**Requirements:** Python 3.12+, ~2 GB disk for dependencies (optional PyTorch if using Hugging Face models in the UI).
 ```bash
+git clone https://github.com/Bootcamp-IA-P6/Project_9_Equipo3.git
+cd Project_9_Equipo3   # or your local folder name
+python -m venv .venv
+source .venv/bin/activate   # Windows: .venv\Scripts\activate
+pip install -r requirements.txt
+python -m spacy download en_core_web_sm
 ```
+**Data:** place `youtoxic_english_1000.csv` under `data/raw/` (path in `configs/pipeline.yaml`).
+**Environment:**
 ```bash
+cp .env.example .env
+# Optional: YOUTUBE_API_KEY for /predict-video
+# MODEL_NAME must match a key in ModelService (default: LR + TF-IDF (local))
 ```
+---
+## Training pipeline
+End-to-end training and evaluation:
+```bash
+python -m src.pipeline.run_pipeline --model lr
+# Options: lr | rf | xgboost
 ```
+**Phases:** load data → stratified split → spaCy/NLTK preprocessing → train → 5-fold CV → test metrics → save `models/experiments/{model}/` → MLflow → update [`reports/summary.csv`](reports/summary.csv) and plots under `reports/pipeline/{model}/`.
+Config files:
+| File | Purpose |
+|------|---------|
+| `configs/pipeline.yaml` | Paths, `IsToxic`, test_size, CV folds |
+| `configs/features.yaml` | Preprocessing + TF-IDF |
+| `configs/models.yaml` | Classifier hyperparameters |
+| `configs/best_params.yaml` | Optuna winner (LR) |
+Details: [docs/PIPELINE.md](docs/PIPELINE.md)
+---
+## Run with Docker
+```bash
+docker compose up --build
 ```
+| Service | URL |
+|---------|-----|
+| Streamlit | http://localhost:8501 |
+| FastAPI | http://localhost:8000 |
+| Swagger | http://localhost:8000/docs |
 ```bash
+export YOUTUBE_API_KEY=your_key   # optional
+docker compose down               # stop
+```
+Containers: `youtube_hate_detector-api`, `youtube_hate_detector-streamlit`.
+---
+## Local run (without Docker)
+```bash
 # Terminal 1 — API
 uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
 streamlit run src/app/app.py --server.port 8501
 ```
+---
+## API examples
+Full reference: [docs/API.md](docs/API.md)
+**Health check**
+```bash
+curl -s http://localhost:8000/ | python -m json.tool
+```
+**Single prediction**
+```bash
+curl -s -X POST http://localhost:8000/predict \
+  -H "Content-Type: application/json" \
+  -d '{"text": "This video is amazing, thanks for sharing!", "threshold": 0.5}'
+```
+Example response:
+```json
+{
+  "text": "This video is amazing, thanks for sharing!",
+  "is_toxic": false,
+  "probability": 0.08,
+  "labels": [],
+  "model_used": "LR + TF-IDF (local)",
+  "latency_ms": 12.5
+}
+```
+**Batch**
+```bash
+curl -s -X POST http://localhost:8000/predict-batch \
+  -H "Content-Type: application/json" \
+  -d '{"texts": ["Great content!", "You are an idiot"], "threshold": 0.5}'
+```
+**List / switch models**
+```bash
+curl -s http://localhost:8000/models
+curl -s -X PUT http://localhost:8000/model/DistilBERT%20Toxicity
+```
+---
+## Results
+Best **sklearn** model on the project test split (from `configs/best_params.yaml`):
+| Metric | Value |
+|--------|-------|
+| F1 (weighted, test) | **0.7579** |
+| ROC-AUC | **0.81** |
+| False positives | 18 |
+| False negatives | 30 |
+| CV–test gap | **4.76 pp** (within 5 pp target) |
+| Train–test gap | 14.07 pp |
+Plots and EDA: `reports/v2/`. Per-run artifacts: `reports/pipeline/{lr,rf,xgboost}/`.
+---
+## Model comparison
+Canonical table: [`reports/summary.csv`](reports/summary.csv)
+Human-readable: [docs/RESULTS.md](docs/RESULTS.md)
+| Model | Family | F1 (test) | ROC-AUC | FP | FN | Production default |
+|-------|--------|-----------|---------|----|----|--------------------|
+| LR + TF-IDF (tuned) | sklearn | 0.7579 | 0.81 | 18 | 30 | Yes |
+| LR + TF-IDF (local) | sklearn | 0.7579 | 0.81 | 18 | 30 | Yes (`final_model.joblib`) |
+| RF / XGBoost | sklearn | — | — | — | — | Run pipeline to fill |
+| DistilBERT / toxic-bert / RoBERTa | Hugging Face | — | — | — | — | Optional via API/UI |
+Re-run `python -m src.pipeline.run_pipeline --model rf` to append RF metrics to `summary.csv`.
+---
 ## Tests
 ```bash
 pytest tests/ -v
 ```
+Covers preprocessor, vectorizer, model binary output, and `/predict` response shape.
+---
+## Documentation index
+| English | Español |
+|---------|---------|
+| [docs/API.md](docs/API.md) | [docs/API.es.md](docs/API.es.md) |
+| [docs/PIPELINE.md](docs/PIPELINE.md) | [docs/PIPELINE.es.md](docs/PIPELINE.es.md) |
+| [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) | [docs/ARCHITECTURE.es.md](docs/ARCHITECTURE.es.md) |
+| [docs/RESULTS.md](docs/RESULTS.md) | [docs/RESULTS.es.md](docs/RESULTS.es.md) |
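
The "List / switch models" example in the new README percent-encodes the model name in the URL path (`DistilBERT%20Toxicity`). A small sketch of producing that encoding from Python follows. The `model_switch_url` helper is a hypothetical name, and it only builds the URL; it does not call the API.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:8000"  # local default used throughout the README


def model_switch_url(model_name: str, base_url: str = BASE_URL) -> str:
    """Build the PUT /model/{model_name} URL with the name percent-encoded."""
    # safe="" also encodes "/" so the name stays a single path segment.
    return f"{base_url}/model/{quote(model_name, safe='')}"


if __name__ == "__main__":
    print(model_switch_url("DistilBERT Toxicity"))
    print(model_switch_url("LR + TF-IDF (local)"))
```

`quote` also encodes the parentheses (`%28`, `%29`), whereas docs/API.md leaves them literal; both forms decode to the same model name.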

docs/API.es.md ADDED

	@@ -0,0 +1,92 @@

+# Referencia API (FastAPI)
+URL base (local): `http://localhost:8000`
+Documentación interactiva: `/docs`, `/redoc`
+Implementación: [`src/api/main.py`](../src/api/main.py)
+---
+## Endpoints
+| Método | Ruta | Descripción |
+|--------|------|-------------|
+| `GET` | `/` | Estado del servicio y modelo activo |
+| `GET` | `/model-info` | Metadatos del modelo cargado |
+| `GET` | `/models` | Modelos disponibles y activo |
+| `PUT` | `/model/{model_name}` | Cambiar modelo activo |
+| `POST` | `/predict` | Clasificar un comentario |
+| `POST` | `/predict-batch` | Hasta 100 comentarios |
+| `POST` | `/predict-video` | Comentarios de un vídeo de YouTube |
+---
+## `POST /predict`
+**Cuerpo**
+```json
+{
+  "text": "Texto del comentario",
+  "threshold": 0.5
+}
+```
+**Respuesta**
+```json
+{
+  "text": "Texto del comentario",
+  "is_toxic": false,
+  "probability": 0.08,
+  "labels": [],
+  "model_used": "LR + TF-IDF (local)",
+  "latency_ms": 15.2
+}
+```
+- `is_toxic`: `true` = **Tóxico**, `false` = **Seguro**
+- `probability`: probabilidad de clase tóxica (0–1)
+**curl**
+```bash
+curl -s -X POST http://localhost:8000/predict \
+  -H "Content-Type: application/json" \
+  -d '{"text": "¡Gran vídeo, gracias!", "threshold": 0.5}'
+```
+---
+## `POST /predict-batch`
+```bash
+curl -s -X POST http://localhost:8000/predict-batch \
+  -H "Content-Type: application/json" \
+  -d '{"texts": ["Comentario seguro", "Eres un idiota"], "threshold": 0.5}'
+```
+---
+## `POST /predict-video`
+Requiere `YOUTUBE_API_KEY` en `.env` para comentarios reales.
+```json
+{
+  "url": "https://www.youtube.com/watch?v=VIDEO_ID",
+  "max_comments": 50,
+  "threshold": 0.5
+}
+```
+---
+## Variables de entorno
+| Variable | Descripción |
+|----------|-------------|
+| `MODEL_NAME` | Modelo al arrancar la API |
+| `YOUTUBE_API_KEY` | API de YouTube para `/predict-video` |
+Ver [`.env.example`](../.env.example).

docs/API.md ADDED

	@@ -0,0 +1,147 @@

+# API reference (FastAPI)
+Base URL (local): `http://localhost:8000`
+Interactive docs: `/docs` (Swagger), `/redoc` (ReDoc)
+Implementation: [`src/api/main.py`](../src/api/main.py)
+Inference: [`src/service/model_service.py`](../src/service/model_service.py)
+---
+## Endpoints
+| Method | Path | Description |
+|--------|------|-------------|
+| `GET` | `/` | Health check and active model name |
+| `GET` | `/model-info` | Metadata for the loaded model |
+| `GET` | `/models` | List available models and active one |
+| `PUT` | `/model/{model_name}` | Switch active model (lazy load on next predict) |
+| `POST` | `/predict` | Classify one comment |
+| `POST` | `/predict-batch` | Classify up to 100 comments |
+| `POST` | `/predict-video` | Fetch YouTube comments and classify (needs API key or demo fallback) |
+---
+## `POST /predict`
+**Request body**
+```json
+{
+  "text": "Comment text here",
+  "threshold": 0.5
+}
+```
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `text` | string | yes | 1–5000 characters, non-empty after trim |
+| `threshold` | float | no | Toxic if `probability >= threshold` (default `0.5`) |
+**Response**
+```json
+{
+  "text": "Comment text here",
+  "is_toxic": false,
+  "probability": 0.0821,
+  "labels": [],
+  "model_used": "LR + TF-IDF (local)",
+  "latency_ms": 15.2
+}
+```
+| Field | Description |
+|-------|-------------|
+| `is_toxic` | `true` = **Toxic**, `false` = **Safe** |
+| `probability` | P(toxic), 0.0–1.0 |
+| `labels` | Optional category hints when toxic (keyword/heuristic or HF labels) |
+| `model_used` | Active model id from `ModelService` |
+**curl**
+```bash
+curl -s -X POST http://localhost:8000/predict \
+  -H "Content-Type: application/json" \
+  -d '{"text": "Thanks for the tutorial!", "threshold": 0.5}'
+```
+**Toxic example**
+```bash
+curl -s -X POST http://localhost:8000/predict \
+  -H "Content-Type: application/json" \
+  -d '{"text": "You are worthless garbage", "threshold": 0.5}'
+```
+---
+## `POST /predict-batch`
+```json
+{
+  "texts": ["Safe comment", "Another line"],
+  "threshold": 0.5
+}
+```
+Response includes `results` (list of predict objects), `total`, `toxic_count`, `latency_ms`.
+```bash
+curl -s -X POST http://localhost:8000/predict-batch \
+  -H "Content-Type: application/json" \
+  -d '{"texts": ["Nice video", "I hate you"], "threshold": 0.5}'
+```
+---
+## `POST /predict-video`
+```json
+{
+  "url": "https://www.youtube.com/watch?v=VIDEO_ID",
+  "max_comments": 50,
+  "threshold": 0.5
+}
+```
+Set `YOUTUBE_API_KEY` in `.env` for live comment fetch. Without a key, the API may use a limited fallback scraper or demo data (see implementation in `main.py`).
+---
+## `GET /models` and model switch
+```bash
+curl -s http://localhost:8000/models
+curl -s -X PUT "http://localhost:8000/model/LR%20%2B%20TF-IDF%20(local)"
+```
+Available names match keys in `AVAILABLE_MODELS` inside `model_service.py`, for example:
+- `LR + TF-IDF (local)` — default, `models/final_model.joblib`
+- `DistilBERT Toxicity` — Hugging Face remote (requires `transformers`, `torch`)
+- `toxic-bert (multilabel)`
+- `RoBERTa Toxicity`
+---
+## Environment variables
+| Variable | Used by | Description |
+|----------|---------|-------------|
+| `MODEL_NAME` | API startup | Initial model from `AVAILABLE_MODELS` |
+| `YOUTUBE_API_KEY` | `/predict-video` | YouTube Data API v3 |
+| `ENV` | logging / behavior | `development` or `production` |
+Copy from [`.env.example`](../.env.example).
+---
+## Errors
+| Status | When |
+|--------|------|
+| `422` | Invalid body (e.g. empty `text`) |
+| `503` | Model not loaded yet |
+| `500` | Prediction failure |
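
The `/predict` contract in docs/API.md (text of 1 to 5000 characters, non-empty after trim, toxic when `probability >= threshold`) can be mirrored on the client side. The sketch below uses only the standard library. The helper names are hypothetical, and exactly how the server applies the length and trim checks is an assumption, so a `422` is still possible.

```python
import json
from urllib.request import Request

PREDICT_URL = "http://localhost:8000/predict"
MAX_TEXT_LENGTH = 5000  # documented upper bound for `text`


def build_predict_request(text: str, threshold: float = 0.5) -> Request:
    """Validate input client-side and build (but do not send) the POST."""
    if not text.strip():
        raise ValueError("text must be non-empty after trimming")
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError("text must be at most 5000 characters")
    body = json.dumps({"text": text, "threshold": threshold}).encode("utf-8")
    return Request(
        PREDICT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_toxic(probability: float, threshold: float = 0.5) -> bool:
    """Documented decision rule: toxic when probability >= threshold."""
    return probability >= threshold
```

With the API running locally, the request can be sent with `urllib.request.urlopen(build_predict_request("Thanks for the tutorial!"))` and the body parsed with `json.load`.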

docs/ARCHITECTURE.es.md ADDED

	@@ -0,0 +1,52 @@

+# Arquitectura del sistema
+## Componentes
+```mermaid
+flowchart TB
+  subgraph datos [Capa de datos]
+    CSV[data/raw/youtoxic_english_1000.csv]
+    CFG[configs/*.yaml]
+  end
+  subgraph entrenamiento [Entrenamiento]
+    PIPE[run_pipeline.py]
+    PRE[TextPreprocessor]
+    BL[build_model]
+    EV[Evaluator]
+    CSV --> PIPE
+    CFG --> PIPE
+    PIPE --> PRE --> BL --> EV
+    EV --> SUM[reports/summary.csv]
+  end
+  subgraph inferencia [Inferencia]
+    MS[ModelService]
+    API[FastAPI]
+    UI[Streamlit]
+    MS --> API
+    MS --> UI
+  end
+```
+## Módulos
+| Módulo | Función |
+|--------|---------|
+| `src/data/loader.py` | Carga del dataset |
+| `src/features/text_preprocessor.py` | Limpieza y lematización |
+| `src/models/baseline.py` | Modelos sklearn + TF-IDF |
+| `src/evaluation/evaluator.py` | Métricas y comparativa |
+| `src/pipeline/run_pipeline.py` | Pipeline completo |
+| `src/service/model_service.py` | Predicción unificada |
+| `src/api/main.py` | API REST |
+| `src/app/app.py` | Interfaz Streamlit |
+## Etiquetas
+- Binario: `IsToxic` → Seguro (0) / Tóxico (1)
+- API: `is_toxic`, `probability`
+## Docker
+Dos servicios: API (8000) y Streamlit (8501), imagen `youtube_hate_detector:latest`.

docs/ARCHITECTURE.md ADDED

	@@ -0,0 +1,66 @@

+# System architecture
+## Components
+```mermaid
+flowchart TB
+  subgraph data [Data layer]
+    CSV[data/raw/youtoxic_english_1000.csv]
+    CFG[configs/*.yaml]
+  end
+  subgraph training [Training]
+    PIPE[run_pipeline.py]
+    PRE[TextPreprocessor]
+    BL[build_model LR RF XGB]
+    EV[Evaluator]
+    CSV --> PIPE
+    CFG --> PIPE
+    PIPE --> PRE --> BL --> EV
+    EV --> SUM[reports/summary.csv]
+    BL --> JOB[models/experiments/]
+  end
+  subgraph inference [Inference]
+    MS[ModelService]
+    JOB2[models/final_model.joblib]
+    JOB2 --> MS
+    API[FastAPI src/api/main.py]
+    UI[Streamlit src/app/app.py]
+    MS --> API
+    MS --> UI
+  end
+```
+## Module map
+| Module | Responsibility |
+|--------|----------------|
+| `src/data/loader.py` | Load raw CSV, optional processed paths |
+| `src/features/text_preprocessor.py` | Clean and lemmatize text |
+| `src/features/vectorizer.py` | Standalone TF-IDF (notebooks); baselines embed TF-IDF in sklearn `Pipeline` |
+| `src/models/baseline.py` | `LRModel`, `RFModel`, `XGBModel`, `build_model()` |
+| `src/evaluation/evaluator.py` | Metrics, ROC, confusion matrix, error analysis, `summary.csv` |
+| `src/pipeline/run_pipeline.py` | Orchestrates training + evaluation |
+| `src/service/model_service.py` | Loads joblib or Hugging Face models; `predict(text)` |
+| `src/api/main.py` | REST endpoints, lifespan model load |
+| `src/app/app.py` | Streamlit UI; calls `ModelService` directly |
+## Label strategy
+- **Binary default:** column `IsToxic` → Safe `0`, Toxic `1`
+- User-facing strings: **Safe** / **Toxic** (not “hate” or “harmful” in the UI copy)
+- API returns `is_toxic` and `probability` (P(toxic))
+## Docker
+[`docker-compose.yml`](../docker-compose.yml) runs two containers from one image:
+- `youtube_hate_detector-api` — uvicorn port 8000
+- `youtube_hate_detector-streamlit` — port 8501
+Both include `final_model.joblib`, configs, spaCy, and NLTK data baked into the image.
+## Tests
+[`tests/`](../tests/) — preprocessor, vectorizer, model binary outputs, `/predict` schema (mocked service).
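
The architecture doc describes one inference layer shared by the API and the UI, and docs/API.md notes that switching models loads lazily on the next predict. The following is a toy illustration of that pattern, not the repository's `ModelService`. `LazyModelRegistry` and its loaders are hypothetical, and the "models" are plain callables returning P(toxic).

```python
from typing import Callable, Dict

Predictor = Callable[[str], float]  # maps comment text to P(toxic)


class LazyModelRegistry:
    """Hypothetical stand-in for a shared inference layer with lazy loading."""

    def __init__(self, loaders: Dict[str, Callable[[], Predictor]], default: str):
        if default not in loaders:
            raise KeyError(default)
        self._loaders = loaders
        self._active = default
        self._cache: Dict[str, Predictor] = {}

    @property
    def active(self) -> str:
        return self._active

    def switch(self, name: str) -> None:
        """Select another model; nothing is loaded until the next predict()."""
        if name not in self._loaders:
            raise KeyError(name)
        self._active = name

    def predict(self, text: str, threshold: float = 0.5) -> dict:
        # Load the active model on first use, then reuse the cached instance.
        if self._active not in self._cache:
            self._cache[self._active] = self._loaders[self._active]()
        probability = self._cache[self._active](text)
        return {
            "text": text,
            "is_toxic": probability >= threshold,
            "probability": probability,
            "model_used": self._active,
        }
```

Deferring the load keeps a model switch cheap, at the cost of a slower first prediction after each switch.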

docs/PIPELINE.es.md ADDED

	@@ -0,0 +1,49 @@

+# Pipeline de entrenamiento
+Punto de entrada: [`src/pipeline/run_pipeline.py`](../src/pipeline/run_pipeline.py)
+## Comando
+```bash
+python -m src.pipeline.run_pipeline --model lr
+```
+| Flag | Valores | Por defecto |
+|------|---------|-------------|
+| `--model` | `lr`, `rf`, `xgboost` | `lr` |
+Ejecutar desde la raíz del repositorio.
+## Fases
+1. **Carga** — CSV en `data/raw/youtoxic_english_1000.csv`
+2. **Split** — train/test estratificado
+3. **Preprocesado** — `TextPreprocessor` (spaCy + NLTK)
+4. **Entrenamiento** — `build_model()`
+5. **Validación cruzada** — 5 folds
+6. **Evaluación** — `Evaluator.evaluate_and_report()` en test
+7. **Guardado** — `models/experiments/{model}/`
+8. **MLflow** — `mlruns/`
+9. **Informes** — `reports/summary.csv` y `reports/pipeline/{model}/`
+## Configuración
+| Archivo | Uso |
+|---------|-----|
+| `configs/pipeline.yaml` | Rutas, `IsToxic`, split, CV |
+| `configs/features.yaml` | TF-IDF y preprocesado |
+| `configs/models.yaml` | Hiperparámetros de clasificadores |
+| `configs/best_params.yaml` | Ganador Optuna (LR) |
+## Salidas
+| Ruta | Contenido |
+|------|-----------|
+| `reports/summary.csv` | Tabla comparativa de modelos |
+| `reports/pipeline/lr/cm_lr.png` | Matriz de confusión |
+| `reports/pipeline/lr/roc_lr.png` | Curva ROC |
+| `reports/pipeline/lr/errors_lr.csv` | FP / FN |
+## Modelo en producción
+La API y Streamlit cargan `models/final_model.joblib` vía `ModelService`.

docs/PIPELINE.md ADDED

	@@ -0,0 +1,68 @@

+# Training pipeline
+Entry point: [`src/pipeline/run_pipeline.py`](../src/pipeline/run_pipeline.py)
+## Command
+```bash
+python -m src.pipeline.run_pipeline --model lr
+```
+| Flag | Choices | Default |
+|------|---------|---------|
+| `--model` | `lr`, `rf`, `xgboost` | `lr` |
+Run from the repository root so `configs/` and `data/raw/` resolve correctly.
+## Phases
+1. **Load data** — `load_raw_data()` reads `configs/pipeline.yaml` → `data/raw/youtoxic_english_1000.csv`
+2. **Split** — stratified train/test (`test_size`, `random_state` in YAML)
+3. **Preprocess** — `TextPreprocessor` (lowercase, regex cleanup, spaCy lemmas, NLTK stopwords)
+4. **Train** — `build_model(model_type)` fits TF-IDF + classifier pipeline
+5. **Cross-validation** — 5-fold stratified CV, F1 weighted + ROC-AUC
+6. **Evaluate** — `Evaluator.evaluate_and_report()` on test set
+7. **Save** — `models/experiments/{model}/{model}_pipeline_{timestamp}.joblib`
+8. **MLflow** — metrics and sklearn pipeline under `mlruns/`
+9. **Reports** — append row to `reports/summary.csv`; PNGs in `reports/pipeline/{model}/`
+## Configuration
+| File | Keys (examples) |
+|------|-----------------|
+| `configs/pipeline.yaml` | `target_binary: IsToxic`, `test_size: 0.2`, `cv_folds: 5` |
+| `configs/features.yaml` | TF-IDF `max_features`, `ngram_range`, preprocessing flags |
+| `configs/models.yaml` | LR `C`, RF `n_estimators`, etc. |
+| `configs/best_params.yaml` | Optuna winner for LR (overrides defaults when training LR) |
+## Outputs
+| Path | Content |
+|------|---------|
+| `reports/summary.csv` | All runs — model comparison table |
+| `reports/pipeline/lr/cm_lr.png` | Confusion matrix |
+| `reports/pipeline/lr/roc_lr.png` | ROC curve |
+| `reports/pipeline/lr/errors_lr.csv` | False positives / negatives |
+| `reports/pipeline/lr/exp_*.json` | Full metrics per run |
+| `models/experiments/lr/*.joblib` | Serialized pipeline |
+## Evaluator API
+[`src/evaluation/evaluator.py`](../src/evaluation/evaluator.py):
+```python
+from src.evaluation.evaluator import Evaluator
+evaluator = Evaluator(output_dir="reports/pipeline/lr")
+metrics = evaluator.evaluate_and_report(
+    model, X_test, y_test, model_name="LR",
+    X_train=X_train, y_train=y_train, cv_results=cv_results,
+    summary_path="reports/summary.csv",
+)
+```
+Metrics include: `f1_weighted`, `f1_toxic`, `roc_auc`, `fp`, `fn`, `cv_test_gap_pp`, `train_test_gap_pp`, plus paths to plots.
+## Production model
+Inference uses `models/final_model.joblib` (loaded by `ModelService`). After a successful pipeline run, copy or export the best experiment artifact to `final_model.joblib` if you want to update production.
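
The final step in docs/PIPELINE.md, copying an experiment artifact to `models/final_model.joblib`, is left manual. One way to script it is sketched below. `promote_latest` is a hypothetical helper, and picking the most recently modified file is an assumption standing in for "best"; check `reports/summary.csv` before promoting.

```python
import shutil
from pathlib import Path
from typing import Optional


def promote_latest(experiments_dir: Path, final_path: Path) -> Optional[Path]:
    """Copy the most recently modified *.joblib file to final_path.

    Returns the source path, or None when the directory holds no artifacts.
    """
    candidates = sorted(
        experiments_dir.glob("*.joblib"),
        key=lambda p: p.stat().st_mtime,  # oldest first, newest last
    )
    if not candidates:
        return None
    latest = candidates[-1]
    final_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(latest, final_path)
    return latest


if __name__ == "__main__":
    source = promote_latest(
        Path("models/experiments/lr"),
        Path("models/final_model.joblib"),
    )
    print(f"promoted: {source}")
```

Run it from the repository root so the relative paths resolve, as with the pipeline command itself.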

docs/RESULTS.es.md ADDED

	@@ -0,0 +1,48 @@

+# Resultados y comparativa de modelos
+Datos: [`reports/summary.csv`](../reports/summary.csv)
+Hiperparámetros: [`configs/best_params.yaml`](../configs/best_params.yaml)
+## Mejor modelo sklearn (producción)
+**Ganador:** Regresión logística + TF-IDF (Optuna), archivo `models/final_model.joblib`.
+| Métrica | Valor en test | Notas |
+|---------|---------------|-------|
+| F1 (ponderado) | **0.7579** | Métrica principal |
+| ROC-AUC | **0.81** | |
+| Falsos positivos | **18** | Seguros marcados como tóxicos |
+| Falsos negativos | **30** | Tóxicos no detectados |
+| F1 (train) | 0.8987 | |
+| Brecha train–test | 14.07 pp | |
+| Brecha CV–test | **4.76 pp** | Objetivo &lt; 5 pp |
+## Tabla comparativa
+| Modelo | Familia | F1 (test) | ROC-AUC | FP | FN | Por defecto |
+|--------|---------|-----------|---------|----|----|-------------|
+| LR + TF-IDF (ajustado) | sklearn | 0.7579 | 0.81 | 18 | 30 | Sí |
+| LR + TF-IDF (local) | sklearn | 0.7579 | 0.81 | 18 | 30 | Sí |
+| Random Forest | sklearn | — | — | — | — | Ejecutar `--model rf` |
+| XGBoost | sklearn | — | — | — | — | Ejecutar `--model xgboost` |
+| DistilBERT Toxicity | Hugging Face | — | — | — | — | Opcional en API |
+| toxic-bert | Hugging Face | — | — | — | — | Opcional |
+| RoBERTa Toxicity | Hugging Face | — | — | — | — | Opcional |
+## Actualizar métricas
+```bash
+python -m src.pipeline.run_pipeline --model lr
+python -m src.pipeline.run_pipeline --model rf
+python -m src.pipeline.run_pipeline --model xgboost
+```
+Salidas: `reports/summary.csv`, gráficos en `reports/pipeline/{model}/`.
+## EDA
+Figuras adicionales en `reports/v2/`.
+## Análisis de errores
+Términos frecuentes en FP/FN y ejemplos en `reports/pipeline/*/errors_*.csv`.

docs/RESULTS.md ADDED

	@@ -0,0 +1,62 @@

+# Model results and comparison
+Canonical data: [`reports/summary.csv`](../reports/summary.csv)
+Tuned hyperparameters: [`configs/best_params.yaml`](../configs/best_params.yaml)
+## Best sklearn model (production)
+**Winner:** Logistic Regression + TF-IDF (Optuna-tuned), exported as `models/final_model.joblib`.
+| Metric | Test value | Notes |
+|--------|------------|-------|
+| F1 (weighted) | **0.7579** | Primary project metric |
+| ROC-AUC | **0.81** | Ranking quality |
+| False positives | **18** | Safe comments marked toxic |
+| False negatives | **30** | Toxic comments missed |
+| F1 (train) | 0.8987 | In-sample |
+| Train–test gap | 14.07 pp | High; prefer CV gap for generalization |
+| CV–test gap | **4.76 pp** | Meets &lt; 5 pp rubric |
+| Test size | ~20% stratified | See `configs/pipeline.yaml` |
+**Optuna hyperparameters (LR):** `C≈0.32`, `max_features=4045`, bigrams `(1,2)`, `min_df=2`.
+## Comparison table
+| Model | Family | F1 (test) | ROC-AUC | FP | FN | Default in API/UI |
+|-------|--------|-----------|---------|----|----|-------------------|
+| LR + TF-IDF (tuned) | sklearn | 0.7579 | 0.81 | 18 | 30 | Yes |
+| LR + TF-IDF (local) | sklearn | 0.7579 | 0.81 | 18 | 30 | Yes (`final_model.joblib`) |
+| Random Forest | sklearn | — | — | — | — | Run pipeline `--model rf` |
+| XGBoost | sklearn | — | — | — | — | Run pipeline `--model xgboost` |
+| DistilBERT Toxicity | Hugging Face | — | — | — | — | Optional (`PUT /model/...`) |
+| toxic-bert (multilabel) | Hugging Face | — | — | — | — | Optional |
+| RoBERTa Toxicity | Hugging Face | — | — | — | — | Optional |
+Rows with empty metrics are placeholders until you run the pipeline or evaluate HF models on the same test split.
+## How to refresh metrics
+```bash
+python -m src.pipeline.run_pipeline --model lr
+python -m src.pipeline.run_pipeline --model rf
+python -m src.pipeline.run_pipeline --model xgboost
+```
+Each run appends/updates [`reports/summary.csv`](../reports/summary.csv) and writes:
+- `reports/pipeline/{model}/cm_{model}.png`
+- `reports/pipeline/{model}/roc_{model}.png`
+- `reports/pipeline/{model}/errors_{model}.csv`
+## EDA and experiments
+Additional figures (notebooks): `reports/v2/` — label distribution, TF-IDF features, ensemble charts, transformer confusion matrices (`nb08_*`).
+## Error analysis
+The evaluator prints and saves:
+- **Most common terms** in false positives and false negatives
+- Example comments with highest/lowest toxic probability among errors
+See `reports/pipeline/*/errors_*.csv` after a pipeline run.
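
The gap figures in docs/RESULTS.md are differences between F1 scores expressed in percentage points, checked against a 5 pp limit. A minimal sketch of that arithmetic follows. The function names are hypothetical, and the CV F1 itself is not listed in the docs, so only the train-test gap can be recomputed from the table.

```python
def gap_pp(score_a: float, score_b: float) -> float:
    """Absolute difference between two scores, in percentage points."""
    return abs(score_a - score_b) * 100.0


def meets_rubric(cv_f1: float, test_f1: float, limit_pp: float = 5.0) -> bool:
    """True when the CV-test gap is strictly below the limit."""
    return gap_pp(cv_f1, test_f1) < limit_pp


if __name__ == "__main__":
    # Rounded table values: F1 (train) 0.8987, F1 (test) 0.7579.
    print(f"train-test gap: {gap_pp(0.8987, 0.7579):.2f} pp")
```

The rounded table values give about 14.08 pp against the reported 14.07 pp; the small difference is consistent with the report having been computed from unrounded scores.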

env.example DELETED

@@ -1,9 +0,0 @@
-# Copia este archivo como .env y rellena los valores
-# cp .env.example .env
-# YouTube Data API v3
-# Obtener en: https://console.cloud.google.com/apis/credentials
-YOUTUBE_API_KEY=your_youtube_api_key_here
-# Entorno
-ENV=development  # development | production

requirements.txt CHANGED

@@ -12,3 +12,6 @@ joblib==1.5.3
 pydantic==2.13.4
 transformers==5.9.0
 httpx==0.28.1

+matplotlib>=3.8.0
+seaborn>=0.13.0
+mlflow>=2.0.0