Model results and comparison
Demo catalog: configs/model_catalog.yaml · Baseline metrics: models/baseline/manifest.json
| Model | F1 (test, weighted) | Train–test gap | Default in UI |
|---|---|---|---|
| LR + TF-IDF (Baseline) | 0.758 | 4.76 pp | No |
| Frozen Toxic-BERT (Baseline) | 0.790 | 0.16 pp | No |
| Meta-Feature Stacking (Production) | 0.805 | 2.54 pp | Yes |
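
The comparison can also be checked as data. The sketch below copies the three rows from the table, picks the model with the highest test F1, and reports its margin over each baseline in percentage points. The `gap_pp` helper assumes the train–test gap is train F1 minus test F1, scaled to percentage points; the table does not state the definition, so treat that as an assumption. All names in the code are illustrative and not part of the repository.

```python
# Values copied from the results table; dictionary and function names are illustrative.
RESULTS = {
    "LR + TF-IDF (Baseline)": {"f1_test": 0.758, "gap_pp": 4.76},
    "Frozen Toxic-BERT (Baseline)": {"f1_test": 0.790, "gap_pp": 0.16},
    "Meta-Feature Stacking (Production)": {"f1_test": 0.805, "gap_pp": 2.54},
}


def gap_pp(f1_train: float, f1_test: float) -> float:
    """Train-test gap in percentage points (assumed definition: train minus test)."""
    return round((f1_train - f1_test) * 100, 2)


def best_model(results: dict) -> str:
    """Return the model name with the highest test F1."""
    return max(results, key=lambda name: results[name]["f1_test"])


def margin_over(results: dict, model: str, baseline: str) -> float:
    """Test-F1 margin of `model` over `baseline`, in percentage points."""
    diff = results[model]["f1_test"] - results[baseline]["f1_test"]
    return round(diff * 100, 2)


top = best_model(RESULTS)
print("Best test F1:", top)
for name in RESULTS:
    if name != top:
        print(f"  margin over {name}: {margin_over(RESULTS, top, name)} pp")
```

Note that the model with the best test F1 is not the one with the smallest train–test gap; the frozen baseline has the smallest gap in the table.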
Handover: reports/HANDOVER_REPORT.md · Production JSON: reports/notebook_14/final_result.json · Golden baseline: reports/golden_baseline/
Baselines
LR + TF-IDF: Notebooks 01–03. Artifact: models/baseline/lr_tfidf.joblib. Tuned hyperparameters: configs/best_params.yaml.
Frozen Toxic-BERT: Notebook 12. Runs unitary/toxic-bert in inference-only mode. Metrics are in the golden baseline reports (reports/golden_baseline/) and under the frozen_toxic_bert key of models/baseline/manifest.json.
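
As a minimal sketch of looking up a baseline's entry in the manifest, the helper below reads a JSON file and returns whatever is stored under a given key. It assumes the manifest is a JSON object keyed by model name; the actual schema of models/baseline/manifest.json is not documented here, so the function name and the error handling are hypothetical.

```python
import json
from pathlib import Path
from typing import Any


def load_manifest_entry(manifest_path: str, key: str) -> Any:
    """Return the value stored under `key` in a JSON manifest.

    Assumption: the manifest's top level is a JSON object keyed by model
    name. The real file may be structured differently.
    """
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    if key not in manifest:
        raise KeyError(f"{key!r} not found; available keys: {sorted(manifest)}")
    return manifest[key]


# Example call, run from the repository root:
#   entry = load_manifest_entry("models/baseline/manifest.json", "frozen_toxic_bert")
#   print(entry)
```

The function makes no assumption about the fields inside the entry, so it returns the stored value unchanged.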
Production
uv run python -m src.experiments.notebook_14_final_stack
Requires the hf extra; install it first with uv sync --extra hf.
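
Before launching the production run, it can help to confirm that the optional dependencies are importable. The sketch below is a hypothetical preflight check: the module names in `ASSUMED_HF_MODULES` are a guess at what the hf extra installs and should be replaced with the packages actually listed in the project's dependency configuration.

```python
import importlib.util

# Assumption: the `hf` extra provides these top-level packages.
# This list is not taken from the repository; adjust it to match the project.
ASSUMED_HF_MODULES = ("transformers", "torch")


def missing_modules(names) -> list:
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def preflight_message(names=ASSUMED_HF_MODULES) -> str:
    """Describe whether the assumed optional dependencies are available."""
    missing = missing_modules(names)
    if missing:
        return f"Missing {missing}; run: uv sync --extra hf"
    return "Optional dependencies found."


# Example: print(preflight_message())
```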
Other experiments
The historical results table is in reports/summary.csv. The RF/XGBoost pipelines and the figures under reports/v2/ are teammate or archived work and are not part of the demo model catalog.