Mirae Kang committed on
Commit ·
52b0ede
1
Parent(s): 975d796
docs: documentation, #15
- .env.example +1 -1
- README.es.md +171 -0
- README.md +202 -40
- docs/API.es.md +92 -0
- docs/API.md +147 -0
- docs/ARCHITECTURE.es.md +52 -0
- docs/ARCHITECTURE.md +66 -0
- docs/PIPELINE.es.md +49 -0
- docs/PIPELINE.md +68 -0
- docs/RESULTS.es.md +48 -0
- docs/RESULTS.md +62 -0
- env.example +0 -9
- requirements.txt +3 -0
.env.example
CHANGED
@@ -1,4 +1,4 @@
-# Copy to .env for local development: cp env.example .env
+# Copy to .env for local development: cp .env.example .env
 # Docker Compose reads these via environment (optional).
 
 # YouTube Data API v3 (optional — /predict-video and scraping)
README.es.md
ADDED
|
@@ -0,0 +1,171 @@
|
| 1 |
+
# Detector de comentarios tóxicos en YouTube (SignalMod)
|
| 2 |
+
|
| 3 |
+
[Python](https://www.python.org/downloads/)
|
| 4 |
+
[FastAPI](https://fastapi.tiangolo.com/)
|
| 5 |
+
[Streamlit](https://streamlit.io/)
|
| 6 |
+
[Docker Compose](https://docs.docker.com/compose/)
|
| 7 |
+
|
| 8 |
+
**English:** [README.md](README.md)
|
| 9 |
+
|
| 10 |
+
Clasificación binaria **Seguro vs Tóxico** para comentarios estilo YouTube. Stack de producción: **FastAPI** (API REST) y **Streamlit** (interfaz tipo página de vídeo). Modelo por defecto: **Regresión logística + TF-IDF** (`models/final_model.joblib`).
|
| 11 |
+
|
| 12 |
+
---
|
| 13 |
+
|
| 14 |
+
## Descripción del proyecto
|
| 15 |
+
|
| 16 |
+
| Elemento | Detalle |
|
| 17 |
+
|----------|---------|
|
| 18 |
+
| **Objetivo** | Apoyar a moderadores detectando comentarios tóxicos |
|
| 19 |
+
| **Dataset** | `data/raw/youtoxic_english_1000.csv` (~1000 comentarios en inglés) |
|
| 20 |
+
| **Etiqueta** | `IsToxic` → **Seguro (0)** / **Tóxico (1)** |
|
| 21 |
+
| **Métrica principal** | F1 ponderado y ROC-AUC |
|
| 22 |
+
| **Control de sobreajuste** | \|F1 CV − F1 test\| < 5 puntos porcentuales |
|
| 23 |
+
|
| 24 |
+
---
|
| 25 |
+
|
| 26 |
+
## Arquitectura
|
| 27 |
+
|
| 28 |
+
```
|
| 29 |
+
youtube_hate_detector/
|
| 30 |
+
├── configs/ # YAML: pipeline, features, models, best_params
|
| 31 |
+
├── data/raw/ # CSV fuente
|
| 32 |
+
├── models/ # final_model.joblib, experimentos/
|
| 33 |
+
├── reports/ # summary.csv, gráficos, artefactos del pipeline
|
| 34 |
+
├── src/
|
| 35 |
+
│ ├── api/ # FastAPI
|
| 36 |
+
│ ├── app/ # Streamlit (src/app/app.py)
|
| 37 |
+
│ ├── evaluation/ # Evaluator
|
| 38 |
+
│ ├── features/ # Preprocesado y vectorización
|
| 39 |
+
│ ├── models/ # LR, RF, XGBoost
|
| 40 |
+
│ ├── pipeline/ # Entrenamiento end-to-end
|
| 41 |
+
│ └── service/ # ModelService
|
| 42 |
+
├── tests/
|
| 43 |
+
├── Dockerfile
|
| 44 |
+
└── docker-compose.yml
|
| 45 |
+
```
|
| 46 |
+
|
| 47 |
+
**Flujo:** entrenamiento (`run_pipeline`) → inferencia API o Streamlit vía `ModelService`.
|
| 48 |
+
|
| 49 |
+
Más detalle: [docs/ARCHITECTURE.es.md](docs/ARCHITECTURE.es.md)
|
| 50 |
+
|
| 51 |
+
---
|
| 52 |
+
|
| 53 |
+
## Instalación
|
| 54 |
+
|
| 55 |
+
```bash
|
| 56 |
+
git clone https://github.com/Bootcamp-IA-P6/Project_9_Equipo3.git
|
| 57 |
+
cd Project_9_Equipo3
|
| 58 |
+
|
| 59 |
+
python -m venv .venv
|
| 60 |
+
source .venv/bin/activate
|
| 61 |
+
|
| 62 |
+
pip install -r requirements.txt
|
| 63 |
+
python -m spacy download en_core_web_sm
|
| 64 |
+
```
|
| 65 |
+
|
| 66 |
+
Coloca `youtoxic_english_1000.csv` en `data/raw/`.
|
| 67 |
+
|
| 68 |
+
```bash
|
| 69 |
+
cp .env.example .env
|
| 70 |
+
# Opcional: YOUTUBE_API_KEY, MODEL_NAME
|
| 71 |
+
```
|
| 72 |
+
|
| 73 |
+
---
|
| 74 |
+
|
| 75 |
+
## Pipeline de entrenamiento
|
| 76 |
+
|
| 77 |
+
```bash
|
| 78 |
+
python -m src.pipeline.run_pipeline --model lr
|
| 79 |
+
# lr | rf | xgboost
|
| 80 |
+
```
|
| 81 |
+
|
| 82 |
+
Actualiza [`reports/summary.csv`](reports/summary.csv) y guarda gráficos en `reports/pipeline/{model}/`.
|
| 83 |
+
|
| 84 |
+
Documentación: [docs/PIPELINE.es.md](docs/PIPELINE.es.md)
|
| 85 |
+
|
| 86 |
+
---
|
| 87 |
+
|
| 88 |
+
## Docker
|
| 89 |
+
|
| 90 |
+
```bash
|
| 91 |
+
docker compose up --build
|
| 92 |
+
```
|
| 93 |
+
|
| 94 |
+
| Servicio | URL |
|
| 95 |
+
|----------|-----|
|
| 96 |
+
| Streamlit | http://localhost:8501 |
|
| 97 |
+
| FastAPI | http://localhost:8000 |
|
| 98 |
+
| Swagger | http://localhost:8000/docs |
|
| 99 |
+
|
| 100 |
+
```bash
|
| 101 |
+
docker compose down
|
| 102 |
+
```
|
| 103 |
+
|
| 104 |
+
---
|
| 105 |
+
|
| 106 |
+
## Ejecución local
|
| 107 |
+
|
| 108 |
+
```bash
|
| 109 |
+
uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
|
| 110 |
+
streamlit run src/app/app.py --server.port 8501
|
| 111 |
+
```
|
| 112 |
+
|
| 113 |
+
---
|
| 114 |
+
|
| 115 |
+
## Ejemplos de API
|
| 116 |
+
|
| 117 |
+
Ver [docs/API.es.md](docs/API.es.md)
|
| 118 |
+
|
| 119 |
+
```bash
|
| 120 |
+
curl -s -X POST http://localhost:8000/predict \
|
| 121 |
+
-H "Content-Type: application/json" \
|
| 122 |
+
-d '{"text": "Great video!", "threshold": 0.5}'
|
| 123 |
+
```
|
| 124 |
+
|
| 125 |
+
---
|
| 126 |
+
|
| 127 |
+
## Resultados
|
| 128 |
+
|
| 129 |
+
Mejor modelo **sklearn** en test (`configs/best_params.yaml`):
|
| 130 |
+
|
| 131 |
+
| Métrica | Valor |
|
| 132 |
+
|---------|-------|
|
| 133 |
+
| F1 (ponderado, test) | **0.7579** |
|
| 134 |
+
| ROC-AUC | **0.81** |
|
| 135 |
+
| Falsos positivos | 18 |
|
| 136 |
+
| Falsos negativos | 30 |
|
| 137 |
+
| Brecha CV–test | **4.76 pp** |
|
| 138 |
+
|
| 139 |
+
Gráficos EDA: `reports/v2/`.
|
| 140 |
+
|
| 141 |
+
---
|
| 142 |
+
|
| 143 |
+
## Comparativa de modelos
|
| 144 |
+
|
| 145 |
+
Tabla canónica: [`reports/summary.csv`](reports/summary.csv)
|
| 146 |
+
Resumen: [docs/RESULTS.es.md](docs/RESULTS.es.md)
|
| 147 |
+
|
| 148 |
+
| Modelo | Familia | F1 (test) | ROC-AUC | Por defecto |
|
| 149 |
+
|--------|---------|-----------|---------|-------------|
|
| 150 |
+
| LR + TF-IDF (ajustado) | sklearn | 0.7579 | 0.81 | Sí |
|
| 151 |
+
| RF / XGBoost | sklearn | — | — | Ejecutar pipeline |
|
| 152 |
+
| DistilBERT / toxic-bert / RoBERTa | Hugging Face | — | — | Opcional en API/UI |
|
| 153 |
+
|
| 154 |
+
---
|
| 155 |
+
|
| 156 |
+
## Tests
|
| 157 |
+
|
| 158 |
+
```bash
|
| 159 |
+
pytest tests/ -v
|
| 160 |
+
```
|
| 161 |
+
|
| 162 |
+
---
|
| 163 |
+
|
| 164 |
+
## Índice de documentación
|
| 165 |
+
|
| 166 |
+
| Español | English |
|
| 167 |
+
|---------|---------|
|
| 168 |
+
| [docs/API.es.md](docs/API.es.md) | [docs/API.md](docs/API.md) |
|
| 169 |
+
| [docs/PIPELINE.es.md](docs/PIPELINE.es.md) | [docs/PIPELINE.md](docs/PIPELINE.md) |
|
| 170 |
+
| [docs/ARCHITECTURE.es.md](docs/ARCHITECTURE.es.md) | [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) |
|
| 171 |
+
| [docs/RESULTS.es.md](docs/RESULTS.es.md) | [docs/RESULTS.md](docs/RESULTS.md) |
|
README.md
CHANGED
|
@@ -1,66 +1,134 @@
|
|
| 1 |
# YouTube Toxic Comment Detector (SignalMod)
|
| 2 |
|
| 3 |
-
|
| 4 |
|
| 5 |
-
|
| 6 |
|
| 7 |
-
|
| 8 |
|
| 9 |
-
```bash
|
| 10 |
-
docker compose up --build
|
| 11 |
```
|
| 12 |
|
| 13 |
-
|
| 14 |
-
|-----------|-----|
|
| 15 |
-
| Streamlit UI | http://localhost:8501 |
|
| 16 |
-
| FastAPI | http://localhost:8000 |
|
| 17 |
-
| API docs | http://localhost:8000/docs |
|
| 18 |
|
| 19 |
-
|
|
|
|
|
|
|
| 20 |
|
| 21 |
```bash
|
| 22 |
-
|
| 23 |
-
|
| 24 |
```
|
| 25 |
|
| 26 |
-
|
|
|
|
|
|
|
| 27 |
|
| 28 |
```bash
|
| 29 |
-
|
|
|
|
|
|
|
| 30 |
```
|
| 31 |
|
| 32 |
-
|
| 33 |
|
| 34 |
-
##
|
| 35 |
|
| 36 |
```
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
|
|
|
|
|
|
| 55 |
```
|
| 56 |
|
| 57 |
-
|
| 58 |
|
| 59 |
```bash
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
|
|
|
|
|
|
| 63 |
|
| 64 |
# Terminal 1 — API
|
| 65 |
uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
|
| 66 |
|
|
@@ -68,10 +136,104 @@ uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
|
|
| 68 |
streamlit run src/app/app.py --server.port 8501
|
| 69 |
```
|
| 70 |
|
| 71 |
-
|
| 72 |
|
| 73 |
## Tests
|
| 74 |
|
| 75 |
```bash
|
| 76 |
pytest tests/ -v
|
| 77 |
```
|
| 1 |
# YouTube Toxic Comment Detector (SignalMod)
|
| 2 |
|
| 3 |
+
[Python](https://www.python.org/downloads/)
|
| 4 |
+
[FastAPI](https://fastapi.tiangolo.com/)
|
| 5 |
+
[Streamlit](https://streamlit.io/)
|
| 6 |
+
[Docker Compose](https://docs.docker.com/compose/)
|
| 7 |
+
**Español:** [README.es.md](README.es.md)
|
| 8 |
|
| 9 |
+
Automated **Safe vs Toxic** classification for YouTube-style comments. The production stack is **FastAPI** (REST inference) plus **Streamlit** (watch-page style UI). The default model is **Logistic Regression + TF-IDF** (`models/final_model.joblib`).
|
| 10 |
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
+
## Project description
|
| 14 |
+
|
| 15 |
+
| Item | Detail |
|
| 16 |
+
|------|--------|
|
| 17 |
+
| **Goal** | Help moderation teams flag toxic comments quickly |
|
| 18 |
+
| **Dataset** | `data/raw/youtoxic_english_1000.csv` (~1k English comments) |
|
| 19 |
+
| **Target** | `IsToxic` → **Safe (0)** / **Toxic (1)** |
|
| 20 |
+
| **Primary metric** | Weighted F1 and ROC-AUC (imbalanced classes) |
|
| 21 |
+
| **Overfitting check** | \|CV F1 − test F1\| < 5 percentage points (project rubric) |
|
| 22 |
+
|
| 23 |
+
---
|
| 24 |
+
|
| 25 |
+
## Architecture
|
| 26 |
|
|
|
|
|
|
|
| 27 |
```
|
| 28 |
+
youtube_hate_detector/
|
| 29 |
+
├── configs/ # YAML: pipeline, features, models, best_params
|
| 30 |
+
├── data/raw/ # Source CSV (not committed if gitignored)
|
| 31 |
+
├── models/ # final_model.joblib, experiments/
|
| 32 |
+
├── reports/ # summary.csv, plots, pipeline artifacts
|
| 33 |
+
├── src/
|
| 34 |
+
│ ├── api/ # FastAPI — /predict, /predict-batch, …
|
| 35 |
+
│ ├── app/ # Streamlit UI (src/app/app.py)
|
| 36 |
+
│ ├── data/ # load_raw_data, scraping helpers
|
| 37 |
+
│ ├── evaluation/ # Evaluator — metrics, ROC, confusion matrix
|
| 38 |
+
│ ├── features/ # TextPreprocessor, Vectorizer
|
| 39 |
+
│ ├── models/ # LR, RF, XGBoost baselines
|
| 40 |
+
│ ├── pipeline/ # run_pipeline.py — train end-to-end
|
| 41 |
+
│ └── service/ # ModelService — shared inference layer
|
| 42 |
+
├── tests/
|
| 43 |
+
├── Dockerfile
|
| 44 |
+
└── docker-compose.yml
|
| 45 |
+
```
|
| 46 |
+
|
| 47 |
+
**Runtime flow**
|
| 48 |
+
|
| 49 |
+
1. **Training:** `load_raw_data` → `TextPreprocessor` → `build_model().fit()` → `Evaluator` → `reports/summary.csv`
|
| 50 |
+
2. **API:** `uvicorn` loads `ModelService` → `POST /predict`
|
| 51 |
+
3. **Streamlit:** calls `ModelService.predict()` in-process (same model catalog as the API)
|
| 52 |
+
|
| 53 |
+
See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for more detail.
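
The shared inference layer in the runtime flow can be sketched roughly as follows. This is a minimal illustration and not the project's actual `ModelService`: the class name `LazyModelService`, the `loaders` mapping, and the toy keyword scorer are all hypothetical. Only the idea of one service object feeding both the API and Streamlit comes from the text above.

```python
from typing import Callable, Dict

# A "model" here is just a function mapping text to P(toxic).
Scorer = Callable[[str], float]


class LazyModelService:
    """Hypothetical stand-in for a shared inference layer (not the real ModelService)."""

    def __init__(self, loaders: Dict[str, Callable[[], Scorer]], default: str) -> None:
        if default not in loaders:
            raise KeyError(f"unknown model: {default}")
        self._loaders = loaders
        self._cache: Dict[str, Scorer] = {}
        self.active = default
        self.load_count = 0  # counts real loads, to show that caching works

    def set_model(self, name: str) -> None:
        # Switching only records the name; loading is deferred to predict().
        if name not in self._loaders:
            raise KeyError(f"unknown model: {name}")
        self.active = name

    def predict(self, text: str, threshold: float = 0.5) -> dict:
        if self.active not in self._cache:
            self._cache[self.active] = self._loaders[self.active]()
            self.load_count += 1
        probability = self._cache[self.active](text)
        return {
            "text": text,
            "is_toxic": probability >= threshold,
            "probability": probability,
            "model_used": self.active,
        }


def _load_keyword_model() -> Scorer:
    # Toy scorer standing in for loading a trained pipeline from disk.
    return lambda text: 0.9 if "idiot" in text.lower() else 0.1


service = LazyModelService({"toy-keyword": _load_keyword_model}, default="toy-keyword")
print(service.predict("Great content!"))
print(service.predict("You are an idiot"))
```

Both callers would hold the same kind of object, so a model is loaded once per process and reused across requests.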
|
| 54 |
|
| 55 |
+
---
|
| 56 |
|
| 57 |
+
## Installation
|
| 58 |
+
|
| 59 |
+
**Requirements:** Python 3.12+ and about 2 GB of disk space for dependencies (PyTorch is optional and only needed when using Hugging Face models in the UI).
|
| 60 |
|
| 61 |
```bash
|
| 62 |
+
git clone https://github.com/Bootcamp-IA-P6/Project_9_Equipo3.git
|
| 63 |
+
cd Project_9_Equipo3 # or your local folder name
|
| 64 |
+
|
| 65 |
+
python -m venv .venv
|
| 66 |
+
source .venv/bin/activate # Windows: .venv\Scripts\activate
|
| 67 |
+
|
| 68 |
+
pip install -r requirements.txt
|
| 69 |
+
python -m spacy download en_core_web_sm
|
| 70 |
```
|
| 71 |
|
| 72 |
+
**Data:** place `youtoxic_english_1000.csv` under `data/raw/` (the path is set in `configs/pipeline.yaml`).
|
| 73 |
+
|
| 74 |
+
**Environment:**
|
| 75 |
|
| 76 |
```bash
|
| 77 |
+
cp .env.example .env
|
| 78 |
+
# Optional: YOUTUBE_API_KEY for /predict-video
|
| 79 |
+
# MODEL_NAME must match a key in ModelService (default: LR + TF-IDF (local))
|
| 80 |
```
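
A sketch of how these variables might be read at startup, using only the standard library. The `load_settings` helper is hypothetical. The default value mirrors the comment above, and how the project actually parses `.env` is not shown in this diff.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL_NAME = "LR + TF-IDF (local)"  # default named in the comment above


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: collect the two variables mentioned in .env.example."""
    env = os.environ if env is None else env
    api_key = env.get("YOUTUBE_API_KEY", "").strip()
    return {
        "model_name": env.get("MODEL_NAME", DEFAULT_MODEL_NAME),
        # An empty key is treated as "not configured" (assumption).
        "youtube_api_key": api_key or None,
    }


print(load_settings({"MODEL_NAME": "DistilBERT Toxicity"}))
print(load_settings({}))
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test.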
|
| 81 |
|
| 82 |
+
---
|
| 83 |
|
| 84 |
+
## Training pipeline
|
| 85 |
|
| 86 |
+
End-to-end training and evaluation:
|
| 87 |
+
|
| 88 |
+
```bash
|
| 89 |
+
python -m src.pipeline.run_pipeline --model lr
|
| 90 |
+
# Options: lr | rf | xgboost
|
| 91 |
```
|
| 92 |
+
|
| 93 |
+
**Phases:** load data → stratified split → spaCy/NLTK preprocessing → train → 5-fold CV → test metrics → save `models/experiments/{model}/` → MLflow → update [`reports/summary.csv`](reports/summary.csv) and plots under `reports/pipeline/{model}/`.
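
To illustrate the stratified split phase, here is a dependency-free sketch. It is an assumption that the pipeline delegates this step to a library routine (for example scikit-learn's `train_test_split` with `stratify`); the function below, its name, and its defaults are illustrative only.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], test_size: float = 0.2, seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), keeping class ratios roughly equal."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels: 60 Safe (0) and 40 Toxic (1); not the real dataset.
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels, test_size=0.2)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "toxic comments in the test split")
```

Splitting per class keeps the Toxic share of the test set close to that of the full data, which matters when the classes are imbalanced.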
|
| 94 |
+
|
| 95 |
+
Config files:
|
| 96 |
+
|
| 97 |
+
| File | Purpose |
|
| 98 |
+
|------|---------|
|
| 99 |
+
| `configs/pipeline.yaml` | Paths, `IsToxic`, test_size, CV folds |
|
| 100 |
+
| `configs/features.yaml` | Preprocessing + TF-IDF |
|
| 101 |
+
| `configs/models.yaml` | Classifier hyperparameters |
|
| 102 |
+
| `configs/best_params.yaml` | Optuna winner (LR) |
|
| 103 |
+
|
| 104 |
+
Details: [docs/PIPELINE.md](docs/PIPELINE.md)
|
| 105 |
+
|
| 106 |
+
---
|
| 107 |
+
|
| 108 |
+
## Run with Docker
|
| 109 |
+
|
| 110 |
+
```bash
|
| 111 |
+
docker compose up --build
|
| 112 |
```
|
| 113 |
|
| 114 |
+
| Service | URL |
|
| 115 |
+
|---------|-----|
|
| 116 |
+
| Streamlit | http://localhost:8501 |
|
| 117 |
+
| FastAPI | http://localhost:8000 |
|
| 118 |
+
| Swagger | http://localhost:8000/docs |
|
| 119 |
|
| 120 |
```bash
|
| 121 |
+
export YOUTUBE_API_KEY=your_key # optional
|
| 122 |
+
docker compose down # stop
|
| 123 |
+
```
|
| 124 |
+
|
| 125 |
+
Containers: `youtube_hate_detector-api`, `youtube_hate_detector-streamlit`.
|
| 126 |
|
| 127 |
+
---
|
| 128 |
+
|
| 129 |
+
## Local run (without Docker)
|
| 130 |
+
|
| 131 |
+
```bash
|
| 132 |
# Terminal 1 — API
|
| 133 |
uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
|
| 134 |
|
|
|
|
| 136 |
streamlit run src/app/app.py --server.port 8501
|
| 137 |
```
|
| 138 |
|
| 139 |
+
---
|
| 140 |
+
|
| 141 |
+
## API examples
|
| 142 |
+
|
| 143 |
+
Full reference: [docs/API.md](docs/API.md)
|
| 144 |
+
|
| 145 |
+
**Health check**
|
| 146 |
+
|
| 147 |
+
```bash
|
| 148 |
+
curl -s http://localhost:8000/ | python -m json.tool
|
| 149 |
+
```
|
| 150 |
+
|
| 151 |
+
**Single prediction**
|
| 152 |
+
|
| 153 |
+
```bash
|
| 154 |
+
curl -s -X POST http://localhost:8000/predict \
|
| 155 |
+
-H "Content-Type: application/json" \
|
| 156 |
+
-d '{"text": "This video is amazing, thanks for sharing!", "threshold": 0.5}'
|
| 157 |
+
```
|
| 158 |
+
|
| 159 |
+
Example response:
|
| 160 |
+
|
| 161 |
+
```json
|
| 162 |
+
{
|
| 163 |
+
"text": "This video is amazing, thanks for sharing!",
|
| 164 |
+
"is_toxic": false,
|
| 165 |
+
"probability": 0.08,
|
| 166 |
+
"labels": [],
|
| 167 |
+
"model_used": "LR + TF-IDF (local)",
|
| 168 |
+
"latency_ms": 12.5
|
| 169 |
+
}
|
| 170 |
+
```
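
The same call can be made from Python. The sketch below uses only the standard library and assumes the request and response fields shown above. `build_predict_request`, `predict`, and `summarize` are hypothetical helper names, and nothing is sent over the network unless `predict` is called against a running server.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/predict"  # local URL used in this README


def build_predict_request(text: str, threshold: float = 0.5) -> urllib.request.Request:
    """Build (but do not send) a POST /predict request."""
    body = json.dumps({"text": text, "threshold": threshold}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, threshold: float = 0.5, timeout: float = 5.0) -> dict:
    """Send the request; requires the API to be running locally."""
    request = build_predict_request(text, threshold)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.load(reply)


def summarize(response: dict) -> str:
    """Turn a /predict response into a one-line summary."""
    verdict = "Toxic" if response["is_toxic"] else "Safe"
    return f"{verdict} (p={response['probability']:.2f}, model={response['model_used']})"


# Offline demonstration using the example response shown above.
example = {
    "text": "This video is amazing, thanks for sharing!",
    "is_toxic": False,
    "probability": 0.08,
    "labels": [],
    "model_used": "LR + TF-IDF (local)",
    "latency_ms": 12.5,
}
print(summarize(example))
```

With the API running, `summarize(predict("Great content!"))` would produce the same kind of line for a live prediction.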
|
| 171 |
+
|
| 172 |
+
**Batch**
|
| 173 |
+
|
| 174 |
+
```bash
|
| 175 |
+
curl -s -X POST http://localhost:8000/predict-batch \
|
| 176 |
+
-H "Content-Type: application/json" \
|
| 177 |
+
-d '{"texts": ["Great content!", "You are an idiot"], "threshold": 0.5}'
|
| 178 |
+
```
|
| 179 |
+
|
| 180 |
+
**List / switch models**
|
| 181 |
+
|
| 182 |
+
```bash
|
| 183 |
+
curl -s http://localhost:8000/models
|
| 184 |
+
curl -s -X PUT http://localhost:8000/model/DistilBERT%20Toxicity
|
| 185 |
+
```
|
| 186 |
+
|
| 187 |
+
---
|
| 188 |
+
|
| 189 |
+
## Results
|
| 190 |
+
|
| 191 |
+
Best **sklearn** model on the project test split (from `configs/best_params.yaml`):
|
| 192 |
+
|
| 193 |
+
| Metric | Value |
|
| 194 |
+
|--------|-------|
|
| 195 |
+
| F1 (weighted, test) | **0.7579** |
|
| 196 |
+
| ROC-AUC | **0.81** |
|
| 197 |
+
| False positives | 18 |
|
| 198 |
+
| False negatives | 30 |
|
| 199 |
+
| CV–test gap | **4.76 pp** (within 5 pp target) |
|
| 200 |
+
| Train–test gap | 14.07 pp |
|
| 201 |
+
|
| 202 |
+
Plots and EDA: `reports/v2/`. Per-run artifacts: `reports/pipeline/{lr,rf,xgboost}/`.
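
The overfitting criterion reduces to a single comparison. In this sketch the test F1 is the value from the table, while the CV F1 is back-computed from the reported 4.76 pp gap under the assumption that the CV score is the higher of the two. The function names are illustrative.

```python
def f1_gap_pp(score_a: float, score_b: float) -> float:
    """Absolute difference between two F1 scores, in percentage points."""
    return abs(score_a - score_b) * 100


def within_target(gap_pp: float, target_pp: float = 5.0) -> bool:
    """Project rubric: the gap must stay strictly below the target."""
    return gap_pp < target_pp


test_f1 = 0.7579        # from the table above
implied_cv_f1 = 0.8055  # assumption: test F1 + 4.76 pp, i.e. CV scored higher

gap = f1_gap_pp(implied_cv_f1, test_f1)
print(f"CV-test gap: {gap:.2f} pp, within target: {within_target(gap)}")
print(f"Train-test gap: 14.07 pp, within target: {within_target(14.07)}")
```

The train-test gap fails the same 5 pp check, but the rubric quoted in this README applies it to the CV-test gap only.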
|
| 203 |
+
|
| 204 |
+
---
|
| 205 |
+
|
| 206 |
+
## Model comparison
|
| 207 |
+
|
| 208 |
+
Canonical table: [`reports/summary.csv`](reports/summary.csv)
|
| 209 |
+
Human-readable: [docs/RESULTS.md](docs/RESULTS.md)
|
| 210 |
+
|
| 211 |
+
| Model | Family | F1 (test) | ROC-AUC | FP | FN | Production default |
|
| 212 |
+
|-------|--------|-----------|---------|----|----|--------------------|
|
| 213 |
+
| LR + TF-IDF (tuned) | sklearn | 0.7579 | 0.81 | 18 | 30 | Yes |
|
| 214 |
+
| LR + TF-IDF (local) | sklearn | 0.7579 | 0.81 | 18 | 30 | Yes (`final_model.joblib`) |
|
| 215 |
+
| RF / XGBoost | sklearn | — | — | — | — | Run pipeline to fill |
|
| 216 |
+
| DistilBERT / toxic-bert / RoBERTa | Hugging Face | — | — | — | — | Optional via API/UI |
|
| 217 |
+
|
| 218 |
+
Re-run `python -m src.pipeline.run_pipeline --model rf` to append RF metrics to `summary.csv`.
|
| 219 |
+
|
| 220 |
+
---
|
| 221 |
|
| 222 |
## Tests
|
| 223 |
|
| 224 |
```bash
|
| 225 |
pytest tests/ -v
|
| 226 |
```
|
| 227 |
+
|
| 228 |
+
The suite covers the preprocessor, the vectorizer, binary model output, and the `/predict` response shape.
|
| 229 |
+
|
| 230 |
+
---
|
| 231 |
+
|
| 232 |
+
## Documentation index
|
| 233 |
+
|
| 234 |
+
| English | Español |
|
| 235 |
+
|---------|---------|
|
| 236 |
+
| [docs/API.md](docs/API.md) | [docs/API.es.md](docs/API.es.md) |
|
| 237 |
+
| [docs/PIPELINE.md](docs/PIPELINE.md) | [docs/PIPELINE.es.md](docs/PIPELINE.es.md) |
|
| 238 |
+
| [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) | [docs/ARCHITECTURE.es.md](docs/ARCHITECTURE.es.md) |
|
| 239 |
+
| [docs/RESULTS.md](docs/RESULTS.md) | [docs/RESULTS.es.md](docs/RESULTS.es.md) |
|
docs/API.es.md
ADDED
|
@@ -0,0 +1,92 @@
|
| 1 |
+
# Referencia API (FastAPI)
|
| 2 |
+
|
| 3 |
+
URL base (local): `http://localhost:8000`
|
| 4 |
+
Documentación interactiva: `/docs`, `/redoc`
|
| 5 |
+
|
| 6 |
+
Implementación: [`src/api/main.py`](../src/api/main.py)
|
| 7 |
+
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
## Endpoints
|
| 11 |
+
|
| 12 |
+
| Método | Ruta | Descripción |
|
| 13 |
+
|--------|------|-------------|
|
| 14 |
+
| `GET` | `/` | Estado del servicio y modelo activo |
|
| 15 |
+
| `GET` | `/model-info` | Metadatos del modelo cargado |
|
| 16 |
+
| `GET` | `/models` | Modelos disponibles y activo |
|
| 17 |
+
| `PUT` | `/model/{model_name}` | Cambiar modelo activo |
|
| 18 |
+
| `POST` | `/predict` | Clasificar un comentario |
|
| 19 |
+
| `POST` | `/predict-batch` | Hasta 100 comentarios |
|
| 20 |
+
| `POST` | `/predict-video` | Comentarios de un vídeo de YouTube |
|
| 21 |
+
|
| 22 |
+
---
|
| 23 |
+
|
| 24 |
+
## `POST /predict`
|
| 25 |
+
|
| 26 |
+
**Cuerpo**
|
| 27 |
+
|
| 28 |
+
```json
|
| 29 |
+
{
|
| 30 |
+
"text": "Texto del comentario",
|
| 31 |
+
"threshold": 0.5
|
| 32 |
+
}
|
| 33 |
+
```
|
| 34 |
+
|
| 35 |
+
**Respuesta**
|
| 36 |
+
|
| 37 |
+
```json
|
| 38 |
+
{
|
| 39 |
+
"text": "Texto del comentario",
|
| 40 |
+
"is_toxic": false,
|
| 41 |
+
"probability": 0.08,
|
| 42 |
+
"labels": [],
|
| 43 |
+
"model_used": "LR + TF-IDF (local)",
|
| 44 |
+
"latency_ms": 15.2
|
| 45 |
+
}
|
| 46 |
+
```
|
| 47 |
+
|
| 48 |
+
- `is_toxic`: `true` = **Tóxico**, `false` = **Seguro**
|
| 49 |
+
- `probability`: probabilidad de clase tóxica (0–1)
|
| 50 |
+
|
| 51 |
+
**curl**
|
| 52 |
+
|
| 53 |
+
```bash
|
| 54 |
+
curl -s -X POST http://localhost:8000/predict \
|
| 55 |
+
-H "Content-Type: application/json" \
|
| 56 |
+
-d '{"text": "¡Gran vídeo, gracias!", "threshold": 0.5}'
|
| 57 |
+
```
|
| 58 |
+
|
| 59 |
+
---
|
| 60 |
+
|
| 61 |
+
## `POST /predict-batch`
|
| 62 |
+
|
| 63 |
+
```bash
|
| 64 |
+
curl -s -X POST http://localhost:8000/predict-batch \
|
| 65 |
+
-H "Content-Type: application/json" \
|
| 66 |
+
-d '{"texts": ["Comentario seguro", "Eres un idiota"], "threshold": 0.5}'
|
| 67 |
+
```
|
| 68 |
+
|
| 69 |
+
---
|
| 70 |
+
|
| 71 |
+
## `POST /predict-video`
|
| 72 |
+
|
| 73 |
+
Requiere `YOUTUBE_API_KEY` en `.env` para comentarios reales.
|
| 74 |
+
|
| 75 |
+
```json
|
| 76 |
+
{
|
| 77 |
+
"url": "https://www.youtube.com/watch?v=VIDEO_ID",
|
| 78 |
+
"max_comments": 50,
|
| 79 |
+
"threshold": 0.5
|
| 80 |
+
}
|
| 81 |
+
```
|
| 82 |
+
|
| 83 |
+
---
|
| 84 |
+
|
| 85 |
+
## Variables de entorno
|
| 86 |
+
|
| 87 |
+
| Variable | Descripción |
|
| 88 |
+
|----------|-------------|
|
| 89 |
+
| `MODEL_NAME` | Modelo al arrancar la API |
|
| 90 |
+
| `YOUTUBE_API_KEY` | API de YouTube para `/predict-video` |
|
| 91 |
+
|
| 92 |
+
Ver [`.env.example`](../.env.example).
|
docs/API.md
ADDED
|
@@ -0,0 +1,147 @@
|
| 1 |
+
# API reference (FastAPI)
|
| 2 |
+
|
| 3 |
+
Base URL (local): `http://localhost:8000`
|
| 4 |
+
Interactive docs: `/docs` (Swagger), `/redoc` (ReDoc)
|
| 5 |
+
|
| 6 |
+
Implementation: [`src/api/main.py`](../src/api/main.py)
|
| 7 |
+
Inference: [`src/service/model_service.py`](../src/service/model_service.py)
|
| 8 |
+
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
## Endpoints
|
| 12 |
+
|
| 13 |
+
| Method | Path | Description |
|
| 14 |
+
|--------|------|-------------|
|
| 15 |
+
| `GET` | `/` | Health check and active model name |
|
| 16 |
+
| `GET` | `/model-info` | Metadata for the loaded model |
|
| 17 |
+
| `GET` | `/models` | List available models and active one |
|
| 18 |
+
| `PUT` | `/model/{model_name}` | Switch active model (lazy load on next predict) |
|
| 19 |
+
| `POST` | `/predict` | Classify one comment |
|
| 20 |
+
| `POST` | `/predict-batch` | Classify up to 100 comments |
|
| 21 |
+
| `POST` | `/predict-video` | Fetch YouTube comments and classify (needs API key or demo fallback) |
|
| 22 |
+
|
| 23 |
+
---
|
| 24 |
+
|
| 25 |
+
## `POST /predict`
|
| 26 |
+
|
| 27 |
+
**Request body**
|
| 28 |
+
|
| 29 |
+
```json
|
| 30 |
+
{
|
| 31 |
+
"text": "Comment text here",
|
| 32 |
+
"threshold": 0.5
|
| 33 |
+
}
|
| 34 |
+
```
|
| 35 |
+
|
| 36 |
+
| Field | Type | Required | Description |
|
| 37 |
+
|-------|------|----------|-------------|
|
| 38 |
+
| `text` | string | yes | 1–5000 characters, non-empty after trim |
|
| 39 |
+
| `threshold` | float | no | Toxic if `probability >= threshold` (default `0.5`) |
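
The field rules in the table can be expressed as a small validation-and-decision sketch. This is not the API's actual schema code, which this diff does not show. `validate_text` and `decide` are hypothetical names, and applying the length limit to the trimmed text is an assumption.

```python
MAX_TEXT_LENGTH = 5000  # upper bound from the table above


def validate_text(text: str) -> str:
    """Return the trimmed text, or raise ValueError if it breaks the documented rules."""
    trimmed = text.strip()
    if not trimmed:
        raise ValueError("text must be non-empty after trim")
    if len(trimmed) > MAX_TEXT_LENGTH:  # assumption: limit applies to trimmed text
        raise ValueError(f"text must be at most {MAX_TEXT_LENGTH} characters")
    return trimmed


def decide(probability: float, threshold: float = 0.5) -> str:
    """Apply the documented rule: Toxic if probability >= threshold."""
    return "Toxic" if probability >= threshold else "Safe"


print(decide(0.0821))    # below the default threshold
print(decide(0.5))       # the boundary is inclusive
print(decide(0.6, 0.8))  # a stricter threshold flags fewer comments
```

Raising the threshold trades false positives for false negatives, which is why it is exposed per request.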
|
| 40 |
+
|
| 41 |
+
**Response**
|
| 42 |
+
|
| 43 |
+
```json
|
| 44 |
+
{
|
| 45 |
+
"text": "Comment text here",
|
| 46 |
+
"is_toxic": false,
|
| 47 |
+
"probability": 0.0821,
|
| 48 |
+
"labels": [],
|
| 49 |
+
"model_used": "LR + TF-IDF (local)",
|
| 50 |
+
"latency_ms": 15.2
|
| 51 |
+
}
|
| 52 |
+
```
|
| 53 |
+
|
| 54 |
+
| Field | Description |
|
| 55 |
+
|-------|-------------|
|
| 56 |
+
| `is_toxic` | `true` = **Toxic**, `false` = **Safe** |
|
| 57 |
+
| `probability` | P(toxic), 0.0–1.0 |
|
| 58 |
+
| `labels` | Optional category hints when toxic (keyword/heuristic or HF labels) |
|
| 59 |
+
| `model_used` | Active model id from `ModelService` |
|
| 60 |
+
|
| 61 |
+
**curl**
|
| 62 |
+
|
| 63 |
+
```bash
|
| 64 |
+
curl -s -X POST http://localhost:8000/predict \
|
| 65 |
+
-H "Content-Type: application/json" \
|
| 66 |
+
-d '{"text": "Thanks for the tutorial!", "threshold": 0.5}'
|
| 67 |
+
```
|
| 68 |
+
|
| 69 |
+
**Toxic example**
|
| 70 |
+
|
| 71 |
+
```bash
|
| 72 |
+
curl -s -X POST http://localhost:8000/predict \
|
| 73 |
+
-H "Content-Type: application/json" \
|
| 74 |
+
-d '{"text": "You are worthless garbage", "threshold": 0.5}'
|
| 75 |
+
```
|
| 76 |
+
|
| 77 |
+
---
|
| 78 |
+
|
| 79 |
+
## `POST /predict-batch`
|
| 80 |
+
|
| 81 |
+
```json
|
| 82 |
+
{
|
| 83 |
+
"texts": ["Safe comment", "Another line"],
|
| 84 |
+
"threshold": 0.5
|
| 85 |
+
}
|
| 86 |
+
```
|
| 87 |
+
|
| 88 |
+
Response includes `results` (list of predict objects), `total`, `toxic_count`, `latency_ms`.
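
As an illustration of this response shape, the aggregate fields could be derived from per-comment results as below. The sketch assumes each entry in `results` carries the `/predict` fields. `summarize_batch` and the sample probabilities are made up, the 100-comment limit is taken from the endpoint table, and `latency_ms` is left out because it depends on server timing.

```python
from typing import Dict, List

MAX_BATCH_SIZE = 100  # limit stated in the endpoint table


def summarize_batch(results: List[Dict]) -> Dict:
    """Wrap per-comment results with the aggregate fields named above."""
    if len(results) > MAX_BATCH_SIZE:
        raise ValueError(f"at most {MAX_BATCH_SIZE} comments per request")
    return {
        "results": results,
        "total": len(results),
        "toxic_count": sum(1 for item in results if item["is_toxic"]),
    }


# Made-up probabilities, for illustration only.
sample = [
    {"text": "Nice video", "is_toxic": False, "probability": 0.05},
    {"text": "I hate you", "is_toxic": True, "probability": 0.91},
]
summary = summarize_batch(sample)
print(summary["total"], summary["toxic_count"])
```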
|
| 89 |
+
|
| 90 |
+
```bash
|
| 91 |
+
curl -s -X POST http://localhost:8000/predict-batch \
|
| 92 |
+
-H "Content-Type: application/json" \
|
| 93 |
+
-d '{"texts": ["Nice video", "I hate you"], "threshold": 0.5}'
|
| 94 |
+
```
|
| 95 |
+
|
| 96 |
+
---
|
| 97 |
+
|
| 98 |
+
## `POST /predict-video`
|
| 99 |
+
|
| 100 |
+
```json
|
| 101 |
+
{
|
| 102 |
+
"url": "https://www.youtube.com/watch?v=VIDEO_ID",
|
| 103 |
+
"max_comments": 50,
|
| 104 |
+
"threshold": 0.5
|
| 105 |
+
}
|
| 106 |
+
```
|
| 107 |
+
|
| 108 |
+
Set `YOUTUBE_API_KEY` in `.env` for live comment fetch. Without a key, the API may use a limited fallback scraper or demo data (see implementation in `main.py`).
|
| 109 |
+
|
| 110 |
+
---
|
| 111 |
+
|
| 112 |
+
## `GET /models` and model switch
|
| 113 |
+
|
| 114 |
+
```bash
|
| 115 |
+
curl -s http://localhost:8000/models
|
| 116 |
+
|
| 117 |
+
curl -s -X PUT "http://localhost:8000/model/LR%20%2B%20TF-IDF%20(local)"
|
| 118 |
+
```
|
| 119 |
+
|
| 120 |
+
Available names match keys in `AVAILABLE_MODELS` inside `model_service.py`, for example:
|
| 121 |
+
|
| 122 |
+
- `LR + TF-IDF (local)` — default, `models/final_model.joblib`
|
| 123 |
+
- `DistilBERT Toxicity` — Hugging Face remote (requires `transformers`, `torch`)
|
| 124 |
+
- `toxic-bert (multilabel)`
|
| 125 |
+
- `RoBERTa Toxicity`
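
Because model names contain spaces and a `+`, they need percent-encoding when placed in the URL path. A minimal sketch with the standard library follows; `model_switch_url` is a hypothetical helper, and keeping the parentheses unencoded is a choice made only to match the curl example above.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:8000"


def model_switch_url(model_name: str) -> str:
    """Build the PUT /model/{model_name} URL with the name percent-encoded."""
    # safe="()" keeps parentheses literal, matching the curl example above.
    return f"{BASE_URL}/model/{quote(model_name, safe='()')}"


url = model_switch_url("LR + TF-IDF (local)")
print(url)
print(model_switch_url("DistilBERT Toxicity"))
```

Encoding the `+` as `%2B` avoids any chance of it being read as a space by a client or proxy.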
|
| 126 |
+
|
| 127 |
+
---
|
| 128 |
+
|
| 129 |
+
## Environment variables
|
| 130 |
+
|
| 131 |
+
| Variable | Used by | Description |
|
| 132 |
+
|----------|---------|-------------|
|
| 133 |
+
| `MODEL_NAME` | API startup | Initial model from `AVAILABLE_MODELS` |
|
| 134 |
+
| `YOUTUBE_API_KEY` | `/predict-video` | YouTube Data API v3 |
|
| 135 |
+
| `ENV` | logging / behavior | `development` or `production` |
|
| 136 |
+
|
| 137 |
+
Copy from [`.env.example`](../.env.example).
|
| 138 |
+
|
| 139 |
+
---
|
| 140 |
+
|
| 141 |
+
## Errors
|
| 142 |
+
|
| 143 |
+
| Status | When |
|
| 144 |
+
|--------|------|
|
| 145 |
+
| `422` | Invalid body (e.g. empty `text`) |
|
| 146 |
+
| `503` | Model not loaded yet |
|
| 147 |
+
| `500` | Prediction failure |
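
A client might map these statuses to messages and a retry decision along these lines. This is only a sketch: `describe_status` is a hypothetical helper, and treating `503` as the one retryable status is an assumption based on its "not loaded yet" description.

```python
from typing import Tuple

RETRYABLE = {503}  # model not loaded yet: worth retrying after a pause (assumption)

MESSAGES = {
    422: "Invalid request body (for example, empty text)",
    503: "Model not loaded yet",
    500: "Prediction failure",
}


def describe_status(status: int) -> Tuple[str, bool]:
    """Return (message, should_retry) for a status code from the table above."""
    message = MESSAGES.get(status, f"Unexpected status {status}")
    return message, status in RETRYABLE


for code in (422, 503, 500):
    print(code, describe_status(code))
```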
|
docs/ARCHITECTURE.es.md
ADDED
|
@@ -0,0 +1,52 @@
|
| 1 |
+
# Arquitectura del sistema
|
| 2 |
+
|
| 3 |
+
## Componentes
|
| 4 |
+
|
| 5 |
+
```mermaid
|
| 6 |
+
flowchart TB
|
| 7 |
+
subgraph datos [Capa de datos]
|
| 8 |
+
CSV[data/raw/youtoxic_english_1000.csv]
|
| 9 |
+
CFG[configs/*.yaml]
|
| 10 |
+
end
|
| 11 |
+
|
| 12 |
+
subgraph entrenamiento [Entrenamiento]
|
| 13 |
+
PIPE[run_pipeline.py]
|
| 14 |
+
PRE[TextPreprocessor]
|
| 15 |
+
BL[build_model]
|
| 16 |
+
EV[Evaluator]
|
| 17 |
+
CSV --> PIPE
|
| 18 |
+
CFG --> PIPE
|
| 19 |
+
PIPE --> PRE --> BL --> EV
|
| 20 |
+
EV --> SUM[reports/summary.csv]
|
| 21 |
+
end
|
| 22 |
+
|
| 23 |
+
subgraph inferencia [Inferencia]
|
| 24 |
+
MS[ModelService]
|
| 25 |
+
API[FastAPI]
|
| 26 |
+
UI[Streamlit]
|
| 27 |
+
MS --> API
|
| 28 |
+
MS --> UI
|
| 29 |
+
end
|
| 30 |
+
```
|
| 31 |
+
|
| 32 |
+
## Módulos
|
| 33 |
+
|
| 34 |
+
| Módulo | Función |
|
| 35 |
+
|--------|---------|
|
| 36 |
+
| `src/data/loader.py` | Carga del dataset |
|
| 37 |
+
| `src/features/text_preprocessor.py` | Limpieza y lematización |
|
| 38 |
+
| `src/models/baseline.py` | Modelos sklearn + TF-IDF |
|
| 39 |
+
| `src/evaluation/evaluator.py` | Métricas y comparativa |
|
| 40 |
+
| `src/pipeline/run_pipeline.py` | Pipeline completo |
|
| 41 |
+
| `src/service/model_service.py` | Predicción unificada |
|
| 42 |
+
| `src/api/main.py` | API REST |
|
| 43 |
+
| `src/app/app.py` | Interfaz Streamlit |
|
| 44 |
+
|
| 45 |
+
## Etiquetas
|
| 46 |
+
|
| 47 |
+
- Binario: `IsToxic` → Seguro (0) / Tóxico (1)
|
| 48 |
+
- API: `is_toxic`, `probability`
|
| 49 |
+
|
| 50 |
+
## Docker
|
| 51 |
+
|
| 52 |
+
Dos servicios: API (8000) y Streamlit (8501), imagen `youtube_hate_detector:latest`.
|
docs/ARCHITECTURE.md
ADDED
|
@@ -0,0 +1,66 @@
|
| 1 |
+
# System architecture
|
| 2 |
+
|
| 3 |
+
## Components
|
| 4 |
+
|
| 5 |
+
```mermaid
|
| 6 |
+
flowchart TB
|
| 7 |
+
subgraph data [Data layer]
|
| 8 |
+
CSV[data/raw/youtoxic_english_1000.csv]
|
| 9 |
+
CFG[configs/*.yaml]
|
| 10 |
+
end
|
| 11 |
+
|
| 12 |
+
subgraph training [Training]
|
| 13 |
+
PIPE[run_pipeline.py]
|
| 14 |
+
PRE[TextPreprocessor]
|
| 15 |
+
BL[build_model LR RF XGB]
|
| 16 |
+
EV[Evaluator]
|
| 17 |
+
CSV --> PIPE
|
| 18 |
+
CFG --> PIPE
|
| 19 |
+
PIPE --> PRE --> BL --> EV
|
| 20 |
+
EV --> SUM[reports/summary.csv]
|
| 21 |
+
BL --> JOB[models/experiments/]
|
| 22 |
+
end
|
| 23 |
+
|
| 24 |
+
subgraph inference [Inference]
|
| 25 |
+
MS[ModelService]
|
| 26 |
+
JOB2[models/final_model.joblib]
|
| 27 |
+
JOB2 --> MS
|
| 28 |
+
API[FastAPI src/api/main.py]
|
| 29 |
+
UI[Streamlit src/app/app.py]
|
| 30 |
+
MS --> API
|
| 31 |
+
MS --> UI
|
| 32 |
+
end
|
| 33 |
+
```
|
| 34 |
+
|
| 35 |
+
## Module map
|
| 36 |
+
|
| 37 |
+
| Module | Responsibility |
|
| 38 |
+
|--------|----------------|
|
| 39 |
+
| `src/data/loader.py` | Load raw CSV, optional processed paths |
|
| 40 |
+
| `src/features/text_preprocessor.py` | Clean and lemmatize text |
|
| 41 |
+
| `src/features/vectorizer.py` | Standalone TF-IDF (notebooks); baselines embed TF-IDF in sklearn `Pipeline` |
|
| 42 |
+
| `src/models/baseline.py` | `LRModel`, `RFModel`, `XGBModel`, `build_model()` |
|
| 43 |
+
| `src/evaluation/evaluator.py` | Metrics, ROC, confusion matrix, error analysis, `summary.csv` |
|
| 44 |
+
| `src/pipeline/run_pipeline.py` | Orchestrates training + evaluation |
|
| 45 |
+
| `src/service/model_service.py` | Loads joblib or Hugging Face models; `predict(text)` |
|
| 46 |
+
| `src/api/main.py` | REST endpoints, lifespan model load |
|
| 47 |
+
| `src/app/app.py` | Streamlit UI; calls `ModelService` directly |
|
| 48 |
+
|
| 49 |
+
## Label strategy
|
| 50 |
+
|
| 51 |
+
- **Binary default:** column `IsToxic` → Safe `0`, Toxic `1`
|
| 52 |
+
- User-facing strings: **Safe** / **Toxic** (not “hate” or “harmful” in the UI copy)
|
| 53 |
+
- API returns `is_toxic` and `probability` (P(toxic))
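
The mapping from a probability to the response fields could look roughly like the sketch below. Only `is_toxic`, `probability`, and the Safe/Toxic strings come from the text above; the `to_response` helper, the 0.5 threshold, and the extra `label` field are assumptions for illustration and are not taken from `src/api/main.py`.

```python
def to_response(probability: float, threshold: float = 0.5) -> dict:
    """Hypothetical helper: turn P(toxic) into an API-style payload."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    # The 0.5 cut-off is an assumption; the real service may use another value.
    is_toxic = probability >= threshold
    return {
        "is_toxic": is_toxic,
        "probability": probability,
        # Illustrative field showing the user-facing strings Safe / Toxic.
        "label": "Toxic" if is_toxic else "Safe",
    }


print(to_response(0.83))
```

Keeping the threshold as a parameter makes it easy to trade false positives against false negatives without retraining.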
|
| 54 |
+
|
| 55 |
+
## Docker
|
| 56 |
+
|
| 57 |
+
[`docker-compose.yml`](../docker-compose.yml) runs two containers from one image:
|
| 58 |
+
|
| 59 |
+
- `youtube_hate_detector-api` — uvicorn port 8000
|
| 60 |
+
- `youtube_hate_detector-streamlit` — port 8501
|
| 61 |
+
|
| 62 |
+
Both include `final_model.joblib`, configs, spaCy, and NLTK data baked into the image.
|
| 63 |
+
|
| 64 |
+
## Tests
|
| 65 |
+
|
| 66 |
+
[`tests/`](../tests/) — preprocessor, vectorizer, model binary outputs, `/predict` schema (mocked service).
|
docs/PIPELINE.es.md
ADDED
|
@@ -0,0 +1,49 @@
|
| 1 |
+
# Training pipeline
|
| 2 |
+
|
| 3 |
+
Entry point: [`src/pipeline/run_pipeline.py`](../src/pipeline/run_pipeline.py)
|
| 4 |
+
|
| 5 |
+
## Command
|
| 6 |
+
|
| 7 |
+
```bash
|
| 8 |
+
python -m src.pipeline.run_pipeline --model lr
|
| 9 |
+
```
|
| 10 |
+
|
| 11 |
+
| Flag | Values | Default |
|
| 12 |
+
|------|---------|-------------|
|
| 13 |
+
| `--model` | `lr`, `rf`, `xgboost` | `lr` |
|
| 14 |
+
|
| 15 |
+
Run from the repository root.
|
| 16 |
+
|
| 17 |
+
## Phases
|
| 18 |
+
|
| 19 |
+
1. **Load:** CSV at `data/raw/youtoxic_english_1000.csv`
|
| 20 |
+
2. **Split:** stratified train/test
|
| 21 |
+
3. **Preprocessing:** `TextPreprocessor` (spaCy + NLTK)
|
| 22 |
+
4. **Training:** `build_model()`
|
| 23 |
+
5. **Cross-validation:** 5 folds
|
| 24 |
+
6. **Evaluation:** `Evaluator.evaluate_and_report()` on the test set
|
| 25 |
+
7. **Saving:** `models/experiments/{model}/`
|
| 26 |
+
8. **MLflow:** `mlruns/`
|
| 27 |
+
9. **Reports:** `reports/summary.csv` and `reports/pipeline/{model}/`
|
| 28 |
+
|
| 29 |
+
## Configuration
|
| 30 |
+
|
| 31 |
+
| File | Purpose |
|
| 32 |
+
|---------|-----|
|
| 33 |
+
| `configs/pipeline.yaml` | Paths, `IsToxic`, split, CV |
|
| 34 |
+
| `configs/features.yaml` | TF-IDF and preprocessing |
|
| 35 |
+
| `configs/models.yaml` | Classifier hyperparameters |
|
| 36 |
+
| `configs/best_params.yaml` | Optuna winner (LR) |
|
| 37 |
+
|
| 38 |
+
## Outputs
|
| 39 |
+
|
| 40 |
+
| Path | Content |
|
| 41 |
+
|------|-----------|
|
| 42 |
+
| `reports/summary.csv` | Model comparison table |
|
| 43 |
+
| `reports/pipeline/lr/cm_lr.png` | Confusion matrix |
|
| 44 |
+
| `reports/pipeline/lr/roc_lr.png` | ROC curve |
|
| 45 |
+
| `reports/pipeline/lr/errors_lr.csv` | FP / FN |
|
| 46 |
+
|
| 47 |
+
## Production model
|
| 48 |
+
|
| 49 |
+
The API and Streamlit load `models/final_model.joblib` via `ModelService`.
|
docs/PIPELINE.md
ADDED
|
@@ -0,0 +1,68 @@
|
| 1 |
+
# Training pipeline
|
| 2 |
+
|
| 3 |
+
Entry point: [`src/pipeline/run_pipeline.py`](../src/pipeline/run_pipeline.py)
|
| 4 |
+
|
| 5 |
+
## Command
|
| 6 |
+
|
| 7 |
+
```bash
|
| 8 |
+
python -m src.pipeline.run_pipeline --model lr
|
| 9 |
+
```
|
| 10 |
+
|
| 11 |
+
| Flag | Choices | Default |
|
| 12 |
+
|------|---------|---------|
|
| 13 |
+
| `--model` | `lr`, `rf`, `xgboost` | `lr` |
|
| 14 |
+
|
| 15 |
+
Run from the repository root so `configs/` and `data/raw/` resolve correctly.
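
The flag table above corresponds to a command-line interface that could be declared as in this sketch. It assumes the script uses the standard-library `argparse`; the `build_parser` name and the help text are hypothetical, and the actual parser in `run_pipeline.py` may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented --model flag."""
    parser = argparse.ArgumentParser(prog="run_pipeline")
    parser.add_argument(
        "--model",
        choices=["lr", "rf", "xgboost"],  # choices listed in the table above
        default="lr",                     # documented default
        help="Classifier to train (illustrative help text).",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--model", "rf"])
print(args.model)
```

With `choices` set, any unsupported model name is rejected before training starts.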
|
| 16 |
+
|
| 17 |
+
## Phases
|
| 18 |
+
|
| 19 |
+
1. **Load data** — `load_raw_data()` reads `configs/pipeline.yaml` → `data/raw/youtoxic_english_1000.csv`
|
| 20 |
+
2. **Split** — stratified train/test (`test_size`, `random_state` in YAML)
|
| 21 |
+
3. **Preprocess** — `TextPreprocessor` (lowercase, regex cleanup, spaCy lemmas, NLTK stopwords)
|
| 22 |
+
4. **Train** — `build_model(model_type)` fits TF-IDF + classifier pipeline
|
| 23 |
+
5. **Cross-validation** — 5-fold stratified CV, F1 weighted + ROC-AUC
|
| 24 |
+
6. **Evaluate** — `Evaluator.evaluate_and_report()` on test set
|
| 25 |
+
7. **Save** — `models/experiments/{model}/{model}_pipeline_{timestamp}.joblib`
|
| 26 |
+
8. **MLflow** — metrics and sklearn pipeline under `mlruns/`
|
| 27 |
+
9. **Reports** — append row to `reports/summary.csv`; PNGs in `reports/pipeline/{model}/`
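
To make phase 2 concrete, the sketch below shows what a stratified split does, in pure Python. It is an illustration of the idea only: the pipeline presumably relies on a library routine, and the `stratified_split` helper and the `random_state=42` default are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, random_state=42):
    """Return (train_idx, test_idx) with class ratios preserved per class."""
    rng = random.Random(random_state)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 60 + [1] * 40  # toy data: 60 Safe, 40 Toxic
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because each class is sampled separately, the toxic share in the test set matches the share in the full dataset.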
|
| 28 |
+
|
| 29 |
+
## Configuration
|
| 30 |
+
|
| 31 |
+
| File | Keys (examples) |
|
| 32 |
+
|------|-----------------|
|
| 33 |
+
| `configs/pipeline.yaml` | `target_binary: IsToxic`, `test_size: 0.2`, `cv_folds: 5` |
|
| 34 |
+
| `configs/features.yaml` | TF-IDF `max_features`, `ngram_range`, preprocessing flags |
|
| 35 |
+
| `configs/models.yaml` | LR `C`, RF `n_estimators`, etc. |
|
| 36 |
+
| `configs/best_params.yaml` | Optuna winner for LR (overrides defaults when training LR) |
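
The override behaviour noted for `configs/best_params.yaml` can be pictured as a dictionary merge. This is a sketch under assumptions: `resolve_lr_params` is a hypothetical helper, the default values are placeholders, and the real pipeline may combine the YAML files differently.

```python
from typing import Optional


def resolve_lr_params(defaults: dict, best_params: Optional[dict]) -> dict:
    """Hypothetical merge: tuned values win over defaults when present."""
    merged = dict(defaults)  # copy so the defaults stay untouched
    if best_params:
        merged.update(best_params)
    return merged


defaults = {"C": 1.0, "max_iter": 1000}  # placeholder defaults, not from the repo
best = {"C": 0.32}                       # approximate tuned value for illustration
print(resolve_lr_params(defaults, best))
```

Keys missing from the tuned file fall back to the defaults, so a partial `best_params.yaml` remains valid.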
|
| 37 |
+
|
| 38 |
+
## Outputs
|
| 39 |
+
|
| 40 |
+
| Path | Content |
|
| 41 |
+
|------|---------|
|
| 42 |
+
| `reports/summary.csv` | All runs — model comparison table |
|
| 43 |
+
| `reports/pipeline/lr/cm_lr.png` | Confusion matrix |
|
| 44 |
+
| `reports/pipeline/lr/roc_lr.png` | ROC curve |
|
| 45 |
+
| `reports/pipeline/lr/errors_lr.csv` | False positives / negatives |
|
| 46 |
+
| `reports/pipeline/lr/exp_*.json` | Full metrics per run |
|
| 47 |
+
| `models/experiments/lr/*.joblib` | Serialized pipeline |
|
| 48 |
+
|
| 49 |
+
## Evaluator API
|
| 50 |
+
|
| 51 |
+
[`src/evaluation/evaluator.py`](../src/evaluation/evaluator.py):
|
| 52 |
+
|
| 53 |
+
```python
|
| 54 |
+
from src.evaluation.evaluator import Evaluator
|
| 55 |
+
|
| 56 |
+
evaluator = Evaluator(output_dir="reports/pipeline/lr")
|
| 57 |
+
metrics = evaluator.evaluate_and_report(
|
| 58 |
+
model, X_test, y_test, model_name="LR",
|
| 59 |
+
X_train=X_train, y_train=y_train, cv_results=cv_results,
|
| 60 |
+
summary_path="reports/summary.csv",
|
| 61 |
+
)
|
| 62 |
+
```
|
| 63 |
+
|
| 64 |
+
Metrics include: `f1_weighted`, `f1_toxic`, `roc_auc`, `fp`, `fn`, `cv_test_gap_pp`, `train_test_gap_pp`, plus paths to plots.
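
As a rough illustration of how the `fp`, `fn`, and `f1_toxic` values relate to predictions, the sketch below counts outcomes by hand. The `confusion_counts` and `f1_for_toxic` helpers are hypothetical and are not the `Evaluator` implementation.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes for binary labels where 1 = Toxic and 0 = Safe."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts


def f1_for_toxic(counts):
    """F1 of the toxic class: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * counts["tp"] + counts["fp"] + counts["fn"]
    return 2 * counts["tp"] / denom if denom else 0.0


counts = confusion_counts([1, 0, 1, 0, 1], [1, 1, 0, 0, 1])
print(counts, round(f1_for_toxic(counts), 4))
```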
|
| 65 |
+
|
| 66 |
+
## Production model
|
| 67 |
+
|
| 68 |
+
Inference uses `models/final_model.joblib` (loaded by `ModelService`). After a successful pipeline run, copy or export the best experiment artifact to `final_model.joblib` if you want to update production.
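
The copy step could be scripted along the lines of the sketch below. `promote_artifact` is a hypothetical helper, and the demonstration runs in a temporary directory with a placeholder file name rather than touching a real `models/` folder.

```python
import shutil
import tempfile
from pathlib import Path


def promote_artifact(artifact, final_path="models/final_model.joblib"):
    """Hypothetical helper: copy a chosen experiment artifact to production."""
    artifact = Path(artifact)
    if not artifact.is_file():
        raise FileNotFoundError(artifact)
    final_path = Path(final_path)
    final_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(artifact, final_path)  # copy2 also preserves file metadata
    return final_path


# Demonstration in a throwaway directory; the file name is a stand-in.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "experiments" / "lr" / "lr_pipeline_example.joblib"
    src.parent.mkdir(parents=True)
    src.write_bytes(b"placeholder")
    dst = promote_artifact(src, Path(tmp) / "final_model.joblib")
    copied = dst.read_bytes()
print(copied)
```

Passing the artifact path explicitly keeps the choice of "best" run a deliberate decision rather than something inferred from timestamps.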
|
docs/RESULTS.es.md
ADDED
|
@@ -0,0 +1,48 @@
|
| 1 |
+
# Model results and comparison
|
| 2 |
+
|
| 3 |
+
Data: [`reports/summary.csv`](../reports/summary.csv)
|
| 4 |
+
Hyperparameters: [`configs/best_params.yaml`](../configs/best_params.yaml)
|
| 5 |
+
|
| 6 |
+
## Best sklearn model (production)
|
| 7 |
+
|
| 8 |
+
**Winner:** Logistic Regression + TF-IDF (Optuna), file `models/final_model.joblib`.
|
| 9 |
+
|
| 10 |
+
| Metric | Test value | Notes |
|
| 11 |
+
|---------|---------------|-------|
|
| 12 |
+
| F1 (weighted) | **0.7579** | Primary metric |
|
| 13 |
+
| ROC-AUC | **0.81** | |
|
| 14 |
+
| False positives | **18** | Safe comments marked as toxic |
|
| 15 |
+
| False negatives | **30** | Toxic comments not detected |
|
| 16 |
+
| F1 (train) | 0.8987 | |
|
| 17 |
+
| Train–test gap | 14.07 pp | |
|
| 18 |
+
| CV–test gap | **4.76 pp** | Target < 5 pp |
|
| 19 |
+
|
| 20 |
+
## Comparison table
|
| 21 |
+
|
| 22 |
+
| Model | Family | F1 (test) | ROC-AUC | FP | FN | Default |
|
| 23 |
+
|--------|---------|-----------|---------|----|----|-------------|
|
| 24 |
+
| LR + TF-IDF (tuned) | sklearn | 0.7579 | 0.81 | 18 | 30 | Yes |
|
| 25 |
+
| LR + TF-IDF (local) | sklearn | 0.7579 | 0.81 | 18 | 30 | Yes |
|
| 26 |
+
| Random Forest | sklearn | — | — | — | — | Run `--model rf` |
|
| 27 |
+
| XGBoost | sklearn | — | — | — | — | Run `--model xgboost` |
|
| 28 |
+
| DistilBERT Toxicity | Hugging Face | — | — | — | — | Optional in API |
|
| 29 |
+
| toxic-bert | Hugging Face | — | — | — | — | Optional |
|
| 30 |
+
| RoBERTa Toxicity | Hugging Face | — | — | — | — | Optional |
|
| 31 |
+
|
| 32 |
+
## Refresh metrics
|
| 33 |
+
|
| 34 |
+
```bash
|
| 35 |
+
python -m src.pipeline.run_pipeline --model lr
|
| 36 |
+
python -m src.pipeline.run_pipeline --model rf
|
| 37 |
+
python -m src.pipeline.run_pipeline --model xgboost
|
| 38 |
+
```
|
| 39 |
+
|
| 40 |
+
Outputs: `reports/summary.csv`, plots in `reports/pipeline/{model}/`.
|
| 41 |
+
|
| 42 |
+
## EDA
|
| 43 |
+
|
| 44 |
+
Additional figures in `reports/v2/`.
|
| 45 |
+
|
| 46 |
+
## Error analysis
|
| 47 |
+
|
| 48 |
+
Frequent terms in FP/FN and examples in `reports/pipeline/*/errors_*.csv`.
|
docs/RESULTS.md
ADDED
|
@@ -0,0 +1,62 @@
|
| 1 |
+
# Model results and comparison
|
| 2 |
+
|
| 3 |
+
Canonical data: [`reports/summary.csv`](../reports/summary.csv)
|
| 4 |
+
Tuned hyperparameters: [`configs/best_params.yaml`](../configs/best_params.yaml)
|
| 5 |
+
|
| 6 |
+
## Best sklearn model (production)
|
| 7 |
+
|
| 8 |
+
**Winner:** Logistic Regression + TF-IDF (Optuna-tuned), exported as `models/final_model.joblib`.
|
| 9 |
+
|
| 10 |
+
| Metric | Test value | Notes |
|
| 11 |
+
|--------|------------|-------|
|
| 12 |
+
| F1 (weighted) | **0.7579** | Primary project metric |
|
| 13 |
+
| ROC-AUC | **0.81** | Ranking quality |
|
| 14 |
+
| False positives | **18** | Safe comments marked toxic |
|
| 15 |
+
| False negatives | **30** | Toxic comments missed |
|
| 16 |
+
| F1 (train) | 0.8987 | In-sample |
|
| 17 |
+
| Train–test gap | 14.07 pp | High; the CV–test gap is the better indicator of generalization |
|
| 18 |
+
| CV–test gap | **4.76 pp** | Meets < 5 pp rubric |
|
| 19 |
+
| Test size | ~20% stratified | See `configs/pipeline.yaml` |
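
The gap rows are differences between F1 scores expressed in percentage points. The sketch below reproduces the arithmetic; `gap_pp` and `meets_rubric` are hypothetical helper names. Using the rounded table values gives about 14.08 pp, so the reported 14.07 pp presumably comes from unrounded scores.

```python
def gap_pp(score_a: float, score_b: float) -> float:
    """Difference between two scores in percentage points."""
    return (score_a - score_b) * 100


def meets_rubric(gap: float, limit_pp: float = 5.0) -> bool:
    """Check a gap against the < 5 pp target mentioned in the table."""
    return gap < limit_pp


train_test_gap = gap_pp(0.8987, 0.7579)  # rounded values from the table
print(round(train_test_gap, 2), meets_rubric(4.76))
```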
|
| 20 |
+
|
| 21 |
+
**Optuna hyperparameters (LR):** `C≈0.32`, `max_features=4045`, bigrams `(1,2)`, `min_df=2`.
|
| 22 |
+
|
| 23 |
+
## Comparison table
|
| 24 |
+
|
| 25 |
+
| Model | Family | F1 (test) | ROC-AUC | FP | FN | Default in API/UI |
|
| 26 |
+
|-------|--------|-----------|---------|----|----|-------------------|
|
| 27 |
+
| LR + TF-IDF (tuned) | sklearn | 0.7579 | 0.81 | 18 | 30 | Yes |
|
| 28 |
+
| LR + TF-IDF (local) | sklearn | 0.7579 | 0.81 | 18 | 30 | Yes (`final_model.joblib`) |
|
| 29 |
+
| Random Forest | sklearn | — | — | — | — | Run pipeline `--model rf` |
|
| 30 |
+
| XGBoost | sklearn | — | — | — | — | Run pipeline `--model xgboost` |
|
| 31 |
+
| DistilBERT Toxicity | Hugging Face | — | — | — | — | Optional (`PUT /model/...`) |
|
| 32 |
+
| toxic-bert (multilabel) | Hugging Face | — | — | — | — | Optional |
|
| 33 |
+
| RoBERTa Toxicity | Hugging Face | — | — | — | — | Optional |
|
| 34 |
+
|
| 35 |
+
Rows with empty metrics are placeholders until you run the pipeline or evaluate HF models on the same test split.
|
| 36 |
+
|
| 37 |
+
## How to refresh metrics
|
| 38 |
+
|
| 39 |
+
```bash
|
| 40 |
+
python -m src.pipeline.run_pipeline --model lr
|
| 41 |
+
python -m src.pipeline.run_pipeline --model rf
|
| 42 |
+
python -m src.pipeline.run_pipeline --model xgboost
|
| 43 |
+
```
|
| 44 |
+
|
| 45 |
+
Each run appends/updates [`reports/summary.csv`](../reports/summary.csv) and writes:
|
| 46 |
+
|
| 47 |
+
- `reports/pipeline/{model}/cm_{model}.png`
|
| 48 |
+
- `reports/pipeline/{model}/roc_{model}.png`
|
| 49 |
+
- `reports/pipeline/{model}/errors_{model}.csv`
|
| 50 |
+
|
| 51 |
+
## EDA and experiments
|
| 52 |
+
|
| 53 |
+
Additional figures (notebooks): `reports/v2/` — label distribution, TF-IDF features, ensemble charts, transformer confusion matrices (`nb08_*`).
|
| 54 |
+
|
| 55 |
+
## Error analysis
|
| 56 |
+
|
| 57 |
+
The evaluator prints and saves:
|
| 58 |
+
|
| 59 |
+
- **Most common terms** in false positives and false negatives
|
| 60 |
+
- Example comments with highest/lowest toxic probability among errors
|
| 61 |
+
|
| 62 |
+
See `reports/pipeline/*/errors_*.csv` after a pipeline run.
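
A frequency count like the one described above can be sketched with the standard library. The `top_terms` helper, the simple regex tokenizer, and the sample comments are assumptions for illustration; the evaluator's own tokenization and the column layout of `errors_*.csv` are not shown in this document.

```python
import re
from collections import Counter


def top_terms(comments, n=10):
    """Return the n most frequent lowercase word tokens across comments."""
    counts = Counter()
    for text in comments:
        # Very simple tokenizer: runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)


# Made-up examples standing in for misclassified comments.
false_positives = ["this is so dumb", "so dumb and so loud"]
print(top_terms(false_positives, n=2))
```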
|
env.example
DELETED
|
@@ -1,9 +0,0 @@
|
|
| 1 |
-
# Copy this file to .env and fill in the values
|
| 2 |
-
# cp .env.example .env
|
| 3 |
-
|
| 4 |
-
# YouTube Data API v3
|
| 5 |
-
# Get one at: https://console.cloud.google.com/apis/credentials
|
| 6 |
-
YOUTUBE_API_KEY=your_youtube_api_key_here
|
| 7 |
-
|
| 8 |
-
# Environment
|
| 9 |
-
ENV=development # development | production
|
requirements.txt
CHANGED
|
@@ -12,3 +12,6 @@ joblib==1.5.3
|
|
| 12 |
pydantic==2.13.4
|
| 13 |
transformers==5.9.0
|
| 14 |
httpx==0.28.1
|
| 12 |
pydantic==2.13.4
|
| 13 |
transformers==5.9.0
|
| 14 |
httpx==0.28.1
|
| 15 |
+
matplotlib>=3.8.0
|
| 16 |
+
seaborn>=0.13.0
|
| 17 |
+
mlflow>=2.0.0
|