SignalMod / docs /PIPELINE.md

Mirae Kang

feat: implement new models and improve UI, #23

46cc63a 5 days ago

	# Training pipeline

	Entry point: [`src/pipeline/run_pipeline.py`](../src/pipeline/run_pipeline.py)

	## Command

	```bash
	python -m src.pipeline.run_pipeline --model lr
	```

	| Flag | Choices | Default |
	|------|---------|---------|
	| `--model` | `lr`, `rf`, `xgboost` | `lr` |

	Run from the repository root so `configs/` and `data/raw/` resolve correctly.
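
The flag table above maps naturally onto `argparse`. The following is a minimal sketch of such a parser, not the actual contents of `run_pipeline.py`; the helper name `build_parser` and the help text are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented --model flag."""
    parser = argparse.ArgumentParser(prog="run_pipeline")
    parser.add_argument(
        "--model",
        choices=["lr", "rf", "xgboost"],  # choices taken from the table above
        default="lr",                     # documented default
        help="classifier to train (default: lr)",
    )
    return parser


# Passing an explicit argument list keeps the example independent of sys.argv.
args = build_parser().parse_args(["--model", "rf"])
print(f"Training model: {args.model}")
```

With `choices` set, `argparse` rejects any other value before the pipeline starts, so a typo in the model name fails fast.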

	## Phases

	1. Load data — `load_raw_data()` reads `configs/pipeline.yaml` → `data/raw/youtoxic_english_1000.csv`
	2. Split — stratified train/test (`test_size`, `random_state` in YAML)
	3. Preprocess — `TextPreprocessor` (lowercase, regex cleanup, spaCy lemmas, NLTK stopwords)
	4. Train — `build_model(model_type)` fits TF-IDF + classifier pipeline
	5. Cross-validation — 5-fold stratified CV, F1 weighted + ROC-AUC
	6. Evaluate — `Evaluator.evaluate_and_report()` on test set
	7. Save — `models/experiments/{model}/{model}_pipeline_{timestamp}.joblib`
	8. MLflow — metrics and sklearn pipeline under `mlruns/`
	9. Reports — append row to `reports/summary.csv`; PNGs in `reports/pipeline/{model}/`
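
Phases 7 and 9 can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the pipeline's own code: the helper names `artifact_path` and `append_summary_row`, the timestamp format, and the column names are hypothetical, and the scores are placeholders.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def artifact_path(model: str, root: Path, now: Optional[datetime] = None) -> Path:
    """Build models/experiments/{model}/{model}_pipeline_{timestamp}.joblib."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")  # assumed format
    return root / "models" / "experiments" / model / f"{model}_pipeline_{stamp}.joblib"


def append_summary_row(summary_csv: Path, row: dict) -> None:
    """Append one run to the summary table, writing the header on first use."""
    summary_csv.parent.mkdir(parents=True, exist_ok=True)
    write_header = not summary_csv.exists()
    with summary_csv.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)


# Demo in a throwaway directory so nothing in the repository is touched.
with tempfile.TemporaryDirectory() as tmp:
    summary = Path(tmp) / "reports" / "summary.csv"
    append_summary_row(summary, {"model": "lr", "f1_weighted": 0.71})  # placeholder score
    append_summary_row(summary, {"model": "rf", "f1_weighted": 0.69})  # placeholder score
    rows = summary.read_text(encoding="utf-8").splitlines()
print(rows)
```

Appending rather than overwriting is what lets `reports/summary.csv` act as a comparison table across runs.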

	## Configuration

	| File | Keys (examples) |
	|------|-----------------|
	| `configs/pipeline.yaml` | `target_binary: IsToxic`, `test_size: 0.2`, `cv_folds: 5` |
	| `configs/features.yaml` | TF-IDF `max_features`, `ngram_range`, preprocessing flags |
	| `configs/models.yaml` | LR `C`, RF `n_estimators`, etc. |
	| `configs/best_params.yaml` | Optuna winner for LR (overrides defaults when training LR) |
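
The override rule in the last row can be pictured as a dictionary merge. The sketch below uses plain dicts in place of the parsed YAML files; the function name `resolve_params`, the dict layout, and all values are assumptions for illustration, not the real file contents.

```python
def resolve_params(model_type: str, defaults: dict, best_params: dict) -> dict:
    """Start from the per-model defaults and, for LR only, apply tuned values on top."""
    params = dict(defaults.get(model_type, {}))  # copy so the defaults stay untouched
    if model_type == "lr":
        params.update(best_params)  # tuned values win over defaults
    return params


# Stand-ins for configs/models.yaml and configs/best_params.yaml (illustrative values).
defaults = {"lr": {"C": 1.0, "max_iter": 1000}, "rf": {"n_estimators": 200}}
best_params = {"C": 0.5}

print(resolve_params("lr", defaults, best_params))
print(resolve_params("rf", defaults, best_params))
```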

	## Outputs

	| Path | Content |
	|------|---------|
	| `reports/summary.csv` | All runs — model comparison table |
	| `reports/pipeline/lr/cm_lr.png` | Confusion matrix |
	| `reports/pipeline/lr/roc_lr.png` | ROC curve |
	| `reports/pipeline/lr/errors_lr.csv` | False positives / negatives |
	| `reports/pipeline/lr/exp_*.json` | Full metrics per run |
	| `models/experiments/lr/*.joblib` | Serialized pipeline |

	## Evaluator API

	[`src/evaluation/evaluator.py`](../src/evaluation/evaluator.py):

	```python
	from src.evaluation.evaluator import Evaluator

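	# model, X_train/X_test, y_train/y_test and cv_results come from the
	# earlier phases (train, split, cross-validation).
	evaluator = Evaluator(output_dir="reports/pipeline/lr")
	metrics = evaluator.evaluate_and_report(
	    model, X_test, y_test, model_name="LR",
	    X_train=X_train, y_train=y_train, cv_results=cv_results,
	    summary_path="reports/summary.csv",
	)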
	```

	Metrics include: `f1_weighted`, `f1_toxic`, `roc_auc`, `fp`, `fn`, `cv_test_gap_pp`, `train_test_gap_pp`, plus paths to plots.
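
The `*_gap_pp` metrics read as differences in percentage points, and `fp` / `fn` as error counts. A minimal sketch of how such values could be derived follows; which score the evaluator actually compares is not stated here, so the formula and the helper names `gap_pp` and `count_errors` are assumptions.

```python
def gap_pp(reference_score: float, test_score: float) -> float:
    """Difference between two scores in [0, 1], expressed in percentage points."""
    return round((reference_score - test_score) * 100, 2)


def count_errors(y_true, y_pred):
    """Return (false positives, false negatives) for binary labels, 1 = toxic."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp, fn


# Illustrative numbers only: a CV score of 0.82 against a test score of 0.78.
print(gap_pp(0.82, 0.78))
print(count_errors([1, 0, 1, 0], [1, 1, 0, 0]))
```

A large positive train/test gap is the usual sign of overfitting, which is why the later pipelines put explicit limits on it.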

	## Stable training (DistilBERT + LR ensemble)

	Entry point: [`src/pipeline/run_stable_pipeline.py`](../src/pipeline/run_stable_pipeline.py)

	Implements partial DistilBERT freezing, toxic-only back-translation with cosine dedup, gap-aware early stopping, regularized head (dropout 0.5, label smoothing 0.1), and soft-voting with TF-IDF LR (`C=0.01`).

	```bash
	uv sync --extra hf --extra train
	uv run python -m src.pipeline.run_stable_pipeline
	uv run python -m src.pipeline.run_stable_pipeline --skip-augmentation # skip back-translation (no network calls)
	uv run python -m src.pipeline.run_stable_pipeline --bert-only # DistilBERT only
	```

	Config: `configs/stable_training.yaml`. Outputs under `models/stable_distilbert/`, `models/stable_lr_tfidf.joblib`, `reports/stable/`.
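
Soft voting averages the toxic-class probabilities of the two models before thresholding. The sketch below assumes equal weights and a 0.5 decision threshold, neither of which is confirmed by this document; `soft_vote` and `to_labels` are hypothetical helpers.

```python
def soft_vote(p_bert, p_lr, w_bert=0.5):
    """Weighted average of two lists of toxic-class probabilities."""
    return [w_bert * a + (1.0 - w_bert) * b for a, b in zip(p_bert, p_lr)]


def to_labels(probs, threshold=0.5):
    """Turn probabilities into 0/1 predictions (1 = toxic)."""
    return [int(p >= threshold) for p in probs]


# Illustrative probabilities for two comments.
blended = soft_vote([0.9, 0.2], [0.5, 0.4])
print(blended, to_labels(blended))
```

Averaging probabilities instead of hard labels lets a confident model outvote an uncertain one.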

	## Phase 5: Expert adaptation (Toxic-BERT + hybrid)

	Entry point: [`src/pipeline/run_expert_pipeline.py`](../src/pipeline/run_expert_pipeline.py)

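	Fine-tunes only the classification head of `unitary/toxic-bert`, trains a TF-IDF LR limited to 250 features, tunes the decision threshold on validation F1-toxic, combines the two models with a 0.7 / 0.3 hybrid weighting, and augments the data with EN→DE→EN back-translation. Notebook: `notebooks/11_expert_phase5_toxicbert.ipynb`.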

	```bash
	uv sync --extra hf --extra train
	uv run python -m src.pipeline.run_expert_pipeline
	```

	Config: `configs/expert_training.yaml`. Outputs under `models/expert_toxic_bert/`, `models/expert_lr_tfidf.joblib`, `reports/expert/`.
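
Threshold tuning on F1-toxic amounts to a grid search over cut-offs on the validation set. The following is a sketch of that idea, not the pipeline's code: the grid step, the tie-breaking rule (first best threshold wins), and the sample data are assumptions.

```python
def f1_toxic(y_true, y_pred):
    """F1 score of the positive (toxic = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(y_true, probs, grid):
    """Return (threshold, F1) for the first grid value with the highest F1."""
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        score = f1_toxic(y_true, [int(p >= t) for p in probs])
        if score > best_f1:  # strict comparison keeps the earliest best threshold
            best_t, best_f1 = t, score
    return best_t, best_f1


# Illustrative validation labels and toxic-class probabilities.
y_val = [0, 0, 1, 1, 1, 0]
p_val = [0.1, 0.4, 0.35, 0.8, 0.6, 0.2]
grid = [round(0.05 * i, 2) for i in range(1, 20)]  # 0.05 .. 0.95, assumed step
best_t, best_f1 = tune_threshold(y_val, p_val, grid)
print(best_t, round(best_f1, 3))
```

Tuning on a validation split rather than the test split keeps the reported test score an honest estimate.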

	## Clean-Signal Dual-Input Hybrid

	Entry point: [`src/pipeline/run_hybrid_clean_pipeline.py`](../src/pipeline/run_hybrid_clean_pipeline.py)

	- Toxic-BERT: raw `Text` (reuses `models/expert_toxic_bert`, threshold 0.33)
	- LR: `clean_text` from `data/processed/v2/comments_preprocessed.csv` (generated via spaCy if missing) + metadata from `comments_with_stats.csv`
	- Weights: validation F1–based (clamped LR share 0.15–0.45)

	```bash
	uv run python -m src.pipeline.run_hybrid_clean_pipeline
	uv run python -m src.pipeline.run_hybrid_clean_pipeline --skip-augmentation
	```

	Config: `configs/hybrid_clean_training.yaml`. Reports: `reports/hybrid_clean/`.
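
One way to read "validation F1-based weights with a clamped LR share" is sketched below. The proportional formula is an assumption (the exact rule is not given here); only the 0.15 to 0.45 clamp comes from the text above, and `ensemble_weights` is a hypothetical helper.

```python
def ensemble_weights(f1_bert, f1_lr, lr_min=0.15, lr_max=0.45):
    """Return (bert_weight, lr_weight) from validation F1 scores.

    The LR share is proportional to its F1 (assumed rule), then clamped
    to [lr_min, lr_max]; the BERT weight takes the remainder.
    """
    total = f1_bert + f1_lr
    raw_share = f1_lr / total if total > 0 else lr_min
    lr_share = min(max(raw_share, lr_min), lr_max)
    return 1.0 - lr_share, lr_share


# Illustrative validation scores.
print(ensemble_weights(0.80, 0.70))  # raw LR share above 0.45, so it is clamped down
print(ensemble_weights(0.90, 0.10))  # raw LR share below 0.15, so it is clamped up
```

The clamp guarantees that the LR model always contributes something but never dominates Toxic-BERT.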

	## Performance Push (Final Squeeze)

	Entry point: [`src/pipeline/run_performance_push_pipeline.py`](../src/pipeline/run_performance_push_pipeline.py)

	Full Toxic-BERT unfreeze (lr=5e-6, 20 epochs, early stop patience 4 on `val_f1_weighted`), test-time augmentation (original + back-translated average), LR anchor 300 features / 0.05 ensemble weight, threshold grid 0.30–0.70, gap defense 4.8 pp.

	```bash
	uv run python -m src.pipeline.run_performance_push_pipeline
	```

	Config: `configs/performance_push_training.yaml`. Reports: `reports/performance_push/`.
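
Test-time augmentation, as described above, scores both the original comment and its back-translated version and averages the two probabilities. In this sketch `predict_proba` and `back_translate` are stand-in callables supplied by the caller, not functions from this repository.

```python
from typing import Callable


def tta_predict(
    text: str,
    predict_proba: Callable[[str], float],
    back_translate: Callable[[str], str],
) -> float:
    """Average the toxic probability over the original and back-translated text."""
    views = [text, back_translate(text)]
    return sum(predict_proba(v) for v in views) / len(views)


# Toy stand-ins so the example runs without a model or a translation service.
def toy_model(s: str) -> float:
    return 0.8 if "idiot" in s else 0.2


def toy_back_translate(s: str) -> str:
    return s.replace("idiot", "fool")


print(tta_predict("you idiot", toy_model, toy_back_translate))
```

Averaging over paraphrases smooths out predictions that depend on one specific wording.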

	## Stealth Learning (0.80 push)

	Entry point: [`src/pipeline/run_stealth_learning_pipeline.py`](../src/pipeline/run_stealth_learning_pipeline.py)

	Last 2 Toxic-BERT layers (`lr=7e-6`) + head (`2e-5`), training gap limit 5.5%, patience 5, SWA over last 5 epochs, threshold step 0.005, LR anchor 250 features / 0.05 weight, TTA on test.

	```bash
	uv run python -m src.pipeline.run_stealth_learning_pipeline
	```

	Config: `configs/stealth_learning_training.yaml`. Reports: `reports/stealth_learning/`.
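
The gap limit and patience settings can be combined into a single stopping rule. The sketch below is one plausible reading, not the pipeline's implementation: an epoch counts as an improvement only if the validation score rises and the train/validation gap stays within the limit. The class name and that rule are assumptions; the defaults (patience 5, limit 5.5) follow the text above.

```python
from dataclasses import dataclass


@dataclass
class GapAwareEarlyStopping:
    patience: int = 5          # epochs without an accepted improvement before stopping
    max_gap_pp: float = 5.5    # allowed train/validation gap in percentage points
    best_score: float = float("-inf")
    bad_epochs: int = 0

    def step(self, train_score: float, val_score: float) -> bool:
        """Record one epoch; return True when training should stop."""
        gap_pp = (train_score - val_score) * 100
        improved = val_score > self.best_score and gap_pp <= self.max_gap_pp
        if improved:
            self.best_score = val_score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative (train, validation) scores per epoch, with a short patience for the demo.
stopper = GapAwareEarlyStopping(patience=2)
history = [(0.70, 0.68), (0.80, 0.72), (0.85, 0.73)]
decisions = [stopper.step(tr, va) for tr, va in history]
print(decisions, stopper.best_score)
```

In the demo the validation score keeps rising, but the gap exceeds the limit from the second epoch on, so those epochs are not accepted as new bests.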

	## Golden Baseline Strategy (Briefing gap + F1 0.80)

	Entry point: [`src/pipeline/run_golden_baseline_pipeline.py`](../src/pipeline/run_golden_baseline_pipeline.py) · Notebook: [`notebooks/12_golden_baseline_strategy.ipynb`](../notebooks/12_golden_baseline_strategy.ipynb)

	1. Golden Baseline — frozen pretrained Toxic-BERT (no training; gap <1%)
	2. Performance Squeeze — last 2 layers + R-Drop, lr=5e-6, 15 epochs, gap ≤4.9%
	3. Hybrid Safety Net — BERT + LR (C=0.001, 200 features)

	```bash
	uv run python -m src.pipeline.run_golden_baseline_pipeline
	```

	Config: `configs/golden_baseline_training.yaml`. Reports: `reports/golden_baseline/`.
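
R-Drop runs each batch through the model twice with different dropout masks and penalizes disagreement between the two predicted distributions with a symmetric KL term, added to the usual cross-entropy. The sketch below shows only that penalty in plain Python; how the pipeline weights it is not stated here, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def rdrop_penalty(p1, p2):
    """Symmetric KL between the outputs of two stochastic forward passes."""
    return 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))


# Illustrative class probabilities (non-toxic, toxic) from two dropout passes.
pass_a = [0.30, 0.70]
pass_b = [0.40, 0.60]
print(rdrop_penalty(pass_a, pass_b))
```

The penalty is zero when both passes agree and grows as they diverge, which pushes the model toward predictions that are stable under dropout.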

	## Hyper-Optimization Sprints (Notebook 13)

	Entry point: [`src/experiments/notebook_13_sprints.py`](../src/experiments/notebook_13_sprints.py) · Notebook: [`notebooks/13_hyper_optimization_sprints.ipynb`](../notebooks/13_hyper_optimization_sprints.ipynb)

	Four CV sprints (multi-pivot aug, TTA, meta stacking, ultra-fine threshold) on Golden Baseline foundation. Artifacts: `models/notebook_13/`, reports: `reports/notebook_13/`.

	```bash
	uv run python -m src.experiments.notebook_13_sprints
	```

	## Final Meta Stacking (Notebook 14)

	Entry point: [`src/experiments/notebook_14_final_stack.py`](../src/experiments/notebook_14_final_stack.py) · Notebook: [`notebooks/14_final_meta_stacking.ipynb`](../notebooks/14_final_meta_stacking.ipynb)

	Single 80/20 split, Exp3 meta stacking, C=0.001, test threshold grid (step 0.001). Report: `reports/notebook_14/final_result.json`.

	```bash
	uv run python -m src.experiments.notebook_14_final_stack
	```

	## Production model (inference)

	Demo inference (API / UI):

	| Model | Path / weights |
	|-------|----------------|
	| Meta-Feature Stacking (Production) | `models/production_final/meta_stack_final.joblib` |
	| LR + TF-IDF (Baseline) | `models/baseline/lr_tfidf.joblib` |
	| Frozen Toxic-BERT (Baseline) | Hub `unitary/toxic-bert` (metrics in `models/baseline/manifest.json`) |

	Catalog: [`configs/model_catalog.yaml`](../configs/model_catalog.yaml).

	The other pipelines described above (stable, expert, etc.) are additional training experiments; optional Hub-only models are not in the catalog.

	Handover report: [`reports/HANDOVER_REPORT.md`](../reports/HANDOVER_REPORT.md).