multilabel-news-classifier / docs /PROJECT_NOTES.md
Solareva Taisia
chore(release): initial public snapshot
198ccb0
# Project Notes (Consolidated from legacy docs)
This file preserves the **useful** information from the historical “*_COMPLETE.md / *_PLAN.md” docs that were produced while building the project, without keeping dozens of duplicative markdown files in the repo root.
---
## What’s implemented (high signal)
- **API**: FastAPI model serving + analytics + sentiment + monitoring endpoints (see `api/`).
- **Dashboards**: Streamlit dashboards for Evaluation, Analytics, Model Comparison, Sentiment (see `dashboards/`).
- **Training/Eval**:
- Training script supports modern transformer fine-tuning + LoRA (see `scripts/train_model.py`).
- Evaluation supports metrics + threshold optimization and emits artifacts for dashboards (see `scripts/evaluate.py` + `experiments/results/*.json`).
- Fair comparison protocol exists: `experiments/model_zoo/protocol_10k_1k`.
- **CI/CD**: GitHub Actions workflows exist in `.github/workflows/` (CI, lint, security, release, CD, model-deploy).
---
## Repo conventions that matter (so reviewers don’t get lost)
### Model zoo artifacts (canonical)
- **Protocol**: `experiments/model_zoo/protocol_10k_1k` (splits + `tag_to_idx.json`).
- **Predictions**: `experiments/predictions/<model_id>_val_preds.csv`.
- **Metrics**: `experiments/results/<model_id>.json` (includes optimized threshold).
- **Served model + threshold**: `config/thresholds.json`.
### Dashboards
- Dashboards can run from **uploaded CSV/JSON artifacts** (no training required).
- Dashboards can optionally call the API (“Use API”) — useful for analytics/sentiment and for remote deployments.
---
## GitHub publishing checklist (the practical bits)
### Hygiene before first push
- Confirm `.gitignore` prevents committing:
- `venv/`, `wandb/`, `logs/`, `.env`, datasets, and large model checkpoints.
- Decide model distribution strategy:
- **Recommended**: publish checkpoints via **GitHub Releases** (or W&B Artifacts), and provide a small download script.
### CI/CD notes
- CI should be “always green” on forks:
- CD deploy steps should be **conditional** on secrets like `STAGING_API_URL` / `PRODUCTION_API_URL`.
- Optional integrations:
- Codecov is optional; CI should still pass if it’s not configured.
---
## Known “portfolio polish” gaps (things to do next)
- **Publish to GitHub** (currently local-only).
- **Hosted demo**:
- API on Render/Fly/Railway
- Streamlit dashboards on Streamlit Cloud / HuggingFace Spaces
- **Model artifacts**: make downloadable by reviewers (release assets or W&B artifacts).
- **Screenshots/GIFs**: add to `docs/screenshots/` and link in README.
- **License + data provenance**: add a real `LICENSE` and clarify dataset source/constraints.
---
## Future ideas backlog (optional)
- Add a strict **title-only Distil** variant for clean ablation vs `use_snippet=True`.
- Add a small `models/REGISTRY.md` (or `config/models.yaml`) that maps model_id → checkpoint URL → threshold → W&B run.