Mirae Kang committed
Commit df28c90 · 1 Parent(s): 9e29187

feat: dockerize application, #14

Files changed (6)
  1. .dockerignore +31 -0
  2. .env.example +15 -0
  3. Dockerfile +40 -0
  4. README.md +71 -20
  5. docker-compose.yml +57 -0
  6. requirements.txt +14 -0
.dockerignore ADDED
@@ -0,0 +1,31 @@
+ .git
+ .gitignore
+ .venv
+ venv
+ env
+ __pycache__
+ *.pyc
+ .pytest_cache
+ .mypy_cache
+ .ruff_cache
+ .matplotlib_cache
+ notebooks
+ mlruns
+ data/raw
+ data/processed
+ logs
+ docs
+ tests
+ *.md
+ !README.md
+ .env
+ .env.*
+ frontend/dist
+ models/checkpoints
+ models/**/checkpoints
+ models/experiments
+ models/roberta_hate_results
+ models/distilbert_results
+ models/best_distilbert
+ models/nb08_*
+ models/*_frozen
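The ignore rules above combine plain names, wildcards, `**`, and one `!` exception. As a rough illustration of how such a list resolves, here is a simplified sketch in Python. It assumes the commonly documented behaviour (patterns are matched segment by segment from the context root, a matched directory covers its contents, and the last matching rule wins). The helper names `_match` and `is_ignored` are hypothetical and this is not Docker's actual matcher.

```python
from fnmatch import fnmatchcase
from typing import Sequence


def _match(pat_parts: Sequence[str], path_parts: Sequence[str]) -> bool:
    """True if the pattern segments match a prefix of the path segments."""
    if not pat_parts:
        # Pattern fully consumed: the path is the match or lies beneath it.
        return True
    head, rest = pat_parts[0], pat_parts[1:]
    if head == "**":
        # '**' may absorb zero or more path segments.
        return any(_match(rest, path_parts[i:]) for i in range(len(path_parts) + 1))
    if not path_parts:
        return False
    return fnmatchcase(path_parts[0], head) and _match(rest, path_parts[1:])


def is_ignored(path: str, patterns: Sequence[str]) -> bool:
    """Apply rules in order; the last matching rule decides (assumed semantics)."""
    ignored = False
    parts = path.strip("/").split("/")
    for raw in patterns:
        negate = raw.startswith("!")
        pattern = raw[1:] if negate else raw
        if _match(pattern.strip("/").split("/"), parts):
            ignored = not negate
    return ignored
```

Under these assumptions, `README.md` survives `*.md` because the later `!README.md` rule re-includes it, while `models/**/checkpoints` covers checkpoint directories at any depth below `models/`.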
.env.example ADDED
@@ -0,0 +1,15 @@
+ # Copy to .env for local development: cp env.example .env
+ # Docker Compose reads these via environment (optional).
+
+ # YouTube Data API v3 (optional - /predict-video and scraping)
+ # https://console.cloud.google.com/apis/credentials
+ YOUTUBE_API_KEY=
+
+ # Active model (must match a key in ModelService.AVAILABLE_MODELS)
+ MODEL_NAME=LR + TF-IDF (local)
+
+ # development | production
+ ENV=production
+
+ # Used by Streamlit when calling the API from another host (Docker sets this automatically)
+ API_URL=http://localhost:8000
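To make the role of these four variables concrete, the sketch below shows one way a service could read them with the template's values as fallbacks. The `Settings` dataclass and `load_settings` function are hypothetical names introduced here for illustration. The repository's real configuration code is not shown in this commit and may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in the env template."""
    youtube_api_key: Optional[str]
    model_name: str
    env: str
    api_url: str


def load_settings(environ: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping, falling back to the template defaults."""
    # An empty YOUTUBE_API_KEY (the template default) is treated as "not set".
    key = environ.get("YOUTUBE_API_KEY", "").strip() or None
    return Settings(
        youtube_api_key=key,
        model_name=environ.get("MODEL_NAME", "LR + TF-IDF (local)"),
        env=environ.get("ENV", "production"),
        api_url=environ.get("API_URL", "http://localhost:8000"),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to test, for example `load_settings({"API_URL": "http://api:8000"})`.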
Dockerfile ADDED
@@ -0,0 +1,40 @@
+ # youtube_hate_detector - shared image for FastAPI + Streamlit services
+ FROM python:3.12-slim-bookworm
+
+ ENV PYTHONDONTWRITEBYTECODE=1 \
+     PYTHONUNBUFFERED=1 \
+     PYTHONPATH=/app \
+     NLTK_DATA=/app/nltk_data \
+     MODEL_NAME="LR + TF-IDF (local)" \
+     ENV=production
+
+ WORKDIR /app
+
+ # System deps for spaCy / sklearn wheels
+ RUN apt-get update \
+     && apt-get install -y --no-install-recommends build-essential curl \
+     && rm -rf /var/lib/apt/lists/*
+
+ COPY requirements.txt .
+
+ # CPU-only PyTorch keeps the image smaller; sufficient for the default local LR model
+ RUN pip install --no-cache-dir --upgrade pip \
+     && pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu \
+     && pip install --no-cache-dir -r requirements.txt \
+     && python -m spacy download en_core_web_sm
+
+ # NLTK corpora used by TextPreprocessor
+ RUN python - <<'PY'
+ import nltk
+ for pkg in ("stopwords", "punkt"):
+     nltk.download(pkg, download_dir="/app/nltk_data")
+ PY
+
+ COPY configs/ configs/
+ COPY src/ src/
+ COPY models/final_model.joblib models/final_model.joblib
+
+ # Default env template (overridden by docker-compose)
+ COPY env.example .env.example
+
+ EXPOSE 8000 8501
README.md CHANGED
@@ -1,26 +1,77 @@
- # Project_9_Equipo3

- ## 🏗️ Architecture

  ```
  youtube_hate_detector/
  ├── data/
- │   ├── raw/ # Original dataset (not committed to the repo)
- │   └── processed/ # Preprocessed data
- ├── notebooks/ # EDA and experiments
- ├── models/ # Trained models (.joblib)
- ├── reports/ # Metrics and results per experiment
  ├── src/
- │   ├── data/ # Data loading and scraping
- │   ├── features/ # Preprocessing and vectorization
- │   ├── models/ # Baseline, ensemble, deep learning
- │   ├── evaluation/ # Metrics and error analysis
- │   ├── pipeline/ # End-to-end pipeline
- │   ├── api/ # FastAPI
- │   ├── app/ # Streamlit
- │   └── utils/ # Logger, config loader
- ├── configs/ # YAML configuration
- ├── tests/ # Unit tests
- └── logs/ # Execution logs
-
- ```
+ # YouTube Toxic Comment Detector (SignalMod)

+ Binary **Safe vs Toxic** comment moderation assistant with a **FastAPI** backend and a **Streamlit** UI.
+
+ ## Quick start (Docker)
+
+ No manual setup beyond Docker. The image bundles the default model (`models/final_model.joblib`), configs, and NLP assets (spaCy + NLTK).
+
+ ```bash
+ docker compose up --build
+ ```
+
+ | Service | URL |
+ |-----------|-----|
+ | Streamlit UI | http://localhost:8501 |
+ | FastAPI | http://localhost:8000 |
+ | API docs | http://localhost:8000/docs |
+
+ Optional: set `YOUTUBE_API_KEY` for live comment scraping on `/predict-video`:
+
+ ```bash
+ export YOUTUBE_API_KEY=your_key_here
+ docker compose up --build
+ ```
+
+ Stop containers:
+
+ ```bash
+ docker compose down
+ ```
+
+ Docker image and containers use the project name `youtube_hate_detector` (e.g. `youtube_hate_detector-api`). If you previously built `ai-nlp-app:latest`, remove it once: `docker rmi ai-nlp-app:latest`.
+
+ ## Architecture

  ```
  youtube_hate_detector/
+ ├── configs/ # YAML hyperparameters (non-secret)
  ├── data/
+ │   ├── raw/ # Original dataset (gitignored)
+ │   └── processed/
+ ├── models/ # Serialized models (e.g. final_model.joblib)
  ├── src/
+ │   ├── api/ # FastAPI (REST)
+ │   ├── app/ # Streamlit UI
+ │   ├── data/
+ │   ├── features/
+ │   ├── models/
+ │   ├── pipeline/
+ │   ├── service/ # ModelService (inference)
+ │   └── utils/
+ ├── tests/
+ ├── Dockerfile
+ └── docker-compose.yml
+ ```
+
+ ## Local development (without Docker)
+
+ ```bash
+ python -m venv .venv && source .venv/bin/activate
+ pip install -r requirements.txt
+ python -m spacy download en_core_web_sm
+
+ # Terminal 1 - API
+ uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
+
+ # Terminal 2 - Streamlit
+ streamlit run src/app/app.py --server.port 8501
+ ```
+
+ Copy `env.example` to `.env` if you need a YouTube API key or custom `MODEL_NAME`.
+
+ ## Tests
+
+ ```bash
+ pytest tests/ -v
+ ```
docker-compose.yml ADDED
@@ -0,0 +1,57 @@
+ # youtube_hate_detector - API + Streamlit UI
+ # Start everything: docker compose up --build
+ # Stop: docker compose down
+
+ name: youtube_hate_detector
+
+ services:
+   api:
+     build: .
+     image: youtube_hate_detector:latest
+     container_name: youtube_hate_detector-api
+     command:
+       - uvicorn
+       - src.api.main:app
+       - --host
+       - "0.0.0.0"
+       - --port
+       - "8000"
+     ports:
+       - "8000:8000"
+     environment:
+       MODEL_NAME: "LR + TF-IDF (local)"
+       ENV: production
+       YOUTUBE_API_KEY: ${YOUTUBE_API_KEY:-}
+       NLTK_DATA: /app/nltk_data
+     healthcheck:
+       test: ["CMD", "curl", "-f", "http://localhost:8000/"]
+       interval: 10s
+       timeout: 5s
+       retries: 12
+       start_period: 40s
+     restart: unless-stopped
+
+   streamlit:
+     # Reuses the image built by `api` - do not add `build:` here (parallel builds race on the same tag)
+     image: youtube_hate_detector:latest
+     container_name: youtube_hate_detector-streamlit
+     command:
+       - streamlit
+       - run
+       - src/app/app.py
+       - --server.port=8501
+       - --server.address=0.0.0.0
+       - --server.headless=true
+       - --browser.gatherUsageStats=false
+     ports:
+       - "8501:8501"
+     environment:
+       MODEL_NAME: "LR + TF-IDF (local)"
+       ENV: production
+       API_URL: http://api:8000
+       YOUTUBE_API_KEY: ${YOUTUBE_API_KEY:-}
+       NLTK_DATA: /app/nltk_data
+     depends_on:
+       api:
+         condition: service_healthy
+     restart: unless-stopped
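The `streamlit` service waits on the `api` healthcheck, which probes `http://localhost:8000/` with `curl -f` every 10 s, up to 12 retries. For scripts that run outside Compose (for example a smoke test after `docker compose up`), a rough client-side analogue might look like the sketch below. The function names `is_healthy` and `wait_until_healthy` are hypothetical, and the loop only approximates the settings above. It does not model Docker's `start_period` or its internal health state machine.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on url succeeds with a status below 400 (similar to `curl -f`)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, OSError):
        # Refused connections, timeouts and HTTP 4xx/5xx all count as unhealthy.
        return False


def wait_until_healthy(
    url: str = "http://localhost:8000/",
    interval: float = 10.0,
    retries: int = 12,
    probe: Callable[[str], bool] = is_healthy,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe up to `retries` times, sleeping `interval` seconds between attempts."""
    for attempt in range(retries):
        if probe(url):
            return True
        if attempt < retries - 1:
            sleep(interval)
    return False
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running container or real delays.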
requirements.txt ADDED
@@ -0,0 +1,14 @@
+ # Runtime dependencies for API + Streamlit (Docker and local installs)
+ fastapi==0.136.1
+ uvicorn[standard]==0.47.0
+ streamlit>=1.41.0,<2
+ scikit-learn==1.8.0
+ spacy==3.8.14
+ nltk==3.9.4
+ pandas==3.0.2
+ PyYAML==6.0.3
+ python-dotenv==1.2.2
+ joblib==1.5.3
+ pydantic==2.13.4
+ transformers==5.9.0
+ httpx==0.28.1