multilabel-news-classifier / docs /PROJECT_NOTES.md

Solareva Taisia

chore(release): initial public snapshot

198ccb0 2 months ago

	# Project Notes (Consolidated from legacy docs)

	This file preserves the useful information from the historical “_COMPLETE.md / _PLAN.md” docs written while the project was being built, so the repo root does not have to carry dozens of duplicative markdown files.

	---

	## What’s implemented (high signal)

	- API: FastAPI model serving + analytics + sentiment + monitoring endpoints (see `api/`).
	- Dashboards: Streamlit dashboards for Evaluation, Analytics, Model Comparison, Sentiment (see `dashboards/`).
	- Training/Eval:
	  - Training script supports modern transformer fine-tuning + LoRA (see `scripts/train_model.py`).
	  - Evaluation supports metrics + threshold optimization and emits artifacts for dashboards (see `scripts/evaluate.py` + `experiments/results/*.json`).
	  - Fair comparison protocol exists: `experiments/model_zoo/protocol_10k_1k`.
	- CI/CD: GitHub Actions workflows exist in `.github/workflows/` (CI, lint, security, release, CD, model-deploy).
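The threshold optimization mentioned above can be pictured with a small sketch. It assumes a single global threshold chosen by a grid search that maximizes micro-F1 on validation predictions; the notes do not say which metric or search `scripts/evaluate.py` actually uses, and the function names here are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def micro_f1(
    y_true: Sequence[Sequence[int]],
    y_prob: Sequence[Sequence[float]],
    threshold: float,
) -> float:
    """Micro-averaged F1 over all (sample, label) pairs at one threshold."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            predicted = p >= threshold
            if predicted and t:
                tp += 1
            elif predicted:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_global_threshold(
    y_true: Sequence[Sequence[int]],
    y_prob: Sequence[Sequence[float]],
    grid: Optional[List[float]] = None,
) -> Tuple[float, float]:
    """Return (threshold, micro_f1) for the best candidate in the grid.

    Assumption: one threshold shared by all labels. On ties, the lowest
    threshold in the grid wins because max() keeps the first maximum.
    """
    if grid is None:
        grid = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
    scored = [(micro_f1(y_true, y_prob, th), th) for th in grid]
    best_f1, best_th = max(scored, key=lambda pair: pair[0])
    return best_th, best_f1
```

The chosen value is the kind of number that would end up in the per-model metrics JSON; per-label thresholds are a common alternative if a single cut-off proves too coarse.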

	---

	## Repo conventions that matter (so reviewers don’t get lost)

	### Model zoo artifacts (canonical)
	- Protocol: `experiments/model_zoo/protocol_10k_1k` (splits + `tag_to_idx.json`).
	- Predictions: `experiments/predictions/<model_id>_val_preds.csv`.
	- Metrics: `experiments/results/<model_id>.json` (includes optimized threshold).
	- Served model + threshold: `config/thresholds.json`.
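As a quick orientation aid, the layout above can be expressed as a path resolver. This is a sketch only: the helper names are hypothetical, and placing `tag_to_idx.json` directly inside the protocol directory is an assumption based on the "splits + `tag_to_idx.json`" note.

```python
from pathlib import Path
from typing import Dict, List

PROTOCOL_DIR = "experiments/model_zoo/protocol_10k_1k"


def artifact_paths(model_id: str, root: Path = Path(".")) -> Dict[str, Path]:
    """Map artifact names to their conventional locations for one model_id."""
    return {
        "protocol_dir": root / PROTOCOL_DIR,
        # Assumed to live inside the protocol directory.
        "tag_to_idx": root / PROTOCOL_DIR / "tag_to_idx.json",
        "predictions": root / "experiments" / "predictions" / f"{model_id}_val_preds.csv",
        "metrics": root / "experiments" / "results" / f"{model_id}.json",
        "served_thresholds": root / "config" / "thresholds.json",
    }


def missing_artifacts(model_id: str, root: Path = Path(".")) -> List[str]:
    """Names of artifacts that do not exist on disk yet, in a stable order."""
    return sorted(
        name for name, path in artifact_paths(model_id, root).items()
        if not path.exists()
    )
```

Calling `missing_artifacts("<model_id>")` from the repo root gives a reviewer a fast check that a model's predictions and metrics were actually emitted before opening the dashboards.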

	### Dashboards
	- Dashboards can run from uploaded CSV/JSON artifacts (no training required).
	- Dashboards can optionally call the API (“Use API”), which is useful for analytics/sentiment and for remote deployments.

	---

	## GitHub publishing checklist (the practical bits)

	### Hygiene before first push
	- Confirm `.gitignore` prevents committing:
	  - `venv/`, `wandb/`, `logs/`, `.env`, datasets, and large model checkpoints.
	- Decide model distribution strategy:
	  - Recommended: publish checkpoints via GitHub Releases (or W&B Artifacts), and provide a small download script.
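The "small download script" could look roughly like the sketch below, which fetches a checkpoint and verifies it against a known SHA-256 digest before moving it into place. The function names, the `.part` temp-file convention, and the idea of publishing a checksum alongside each release asset are assumptions, not something the repo is known to do.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large checkpoints fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_checkpoint(url: str, dest: Path, expected_sha256: str) -> Path:
    """Download url to dest, keeping the file only if its checksum matches."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")  # avoid leaving a half-written dest
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    actual = sha256_of(tmp)
    if actual != expected_sha256:
        tmp.unlink()
        raise ValueError(f"checksum mismatch for {url}: got {actual}")
    tmp.replace(dest)
    return dest
```

A reviewer would call `download_checkpoint(url, Path("models/<model_id>.pt"), checksum)` with the release asset URL and digest; both values are placeholders here and would come from wherever the checkpoints end up being published.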

	### CI/CD notes
	- CI should be “always green” on forks:
	  - CD deploy steps should be conditional on secrets like `STAGING_API_URL` / `PRODUCTION_API_URL`.
	- Optional integrations:
	  - Codecov is optional; CI should still pass if it’s not configured.
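One way to keep deploy steps fork-safe is a guard that a deploy job runs first, sketched below. It assumes the workflow exposes the secrets as environment variables of the same name; on forks GitHub Actions does not pass repository secrets, so the variable arrives empty and the step should exit successfully without deploying. The helper names and the `staging`/`production` keys are hypothetical.

```python
import os
from typing import Mapping, Optional

# Secret names come from the notes; the environment keys are illustrative.
SECRET_BY_ENVIRONMENT = {
    "staging": "STAGING_API_URL",
    "production": "PRODUCTION_API_URL",
}


def resolve_deploy_url(
    environment: str, env: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Return the deploy URL, or None when the secret is unset or blank."""
    env = os.environ if env is None else env
    value = env.get(SECRET_BY_ENVIRONMENT[environment], "").strip()
    return value or None


def run_deploy_guard(environment: str, env: Optional[Mapping[str, str]] = None) -> int:
    """Exit code for the CD step: 0 both when deploying and when skipping."""
    url = resolve_deploy_url(environment, env)
    if url is None:
        secret = SECRET_BY_ENVIRONMENT[environment]
        print(f"{secret} is not configured; skipping {environment} deploy.")
        return 0
    print(f"Deploying to {environment}.")  # the real deploy call would go here
    return 0
```

Returning 0 on the skip path is the design choice that keeps the pipeline green on forks: a missing secret is treated as "nothing to deploy" rather than as a failure.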

	---

	## Known “portfolio polish” gaps (things to do next)

	- Publish to GitHub (currently local-only).
	- Hosted demo:
	  - API on Render/Fly/Railway
	  - Streamlit dashboards on Streamlit Cloud / HuggingFace Spaces
	- Model artifacts: make downloadable by reviewers (release assets or W&B artifacts).
	- Screenshots/GIFs: add to `docs/screenshots/` and link in README.
	- License + data provenance: add a real `LICENSE` and clarify dataset source/constraints.

	---

	## Future ideas backlog (optional)

	- Add a strict title-only Distil variant for clean ablation vs `use_snippet=True`.
	- Add a small `models/REGISTRY.md` (or `config/models.yaml`) that maps model_id → checkpoint URL → threshold → W&B run.
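To make the registry idea concrete, here is a minimal sketch of the mapping as a typed record. It reads JSON rather than YAML only to stay within the standard library; the field names, the `load_registry` helper, and the range check on the threshold are all assumptions about a file that does not exist yet.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ModelEntry:
    """One registry row: model_id -> checkpoint URL -> threshold -> W&B run."""
    model_id: str
    checkpoint_url: str
    threshold: float
    wandb_run: Optional[str] = None  # optional: not every model needs a tracked run


def load_registry(text: str) -> Dict[str, ModelEntry]:
    """Parse a JSON object keyed by model_id into ModelEntry records."""
    registry: Dict[str, ModelEntry] = {}
    for model_id, fields in json.loads(text).items():
        threshold = float(fields["threshold"])
        if not 0.0 <= threshold <= 1.0:
            raise ValueError(f"{model_id}: threshold {threshold} outside [0, 1]")
        registry[model_id] = ModelEntry(
            model_id=model_id,
            checkpoint_url=fields["checkpoint_url"],
            threshold=threshold,
            wandb_run=fields.get("wandb_run"),
        )
    return registry
```

Keeping the threshold next to the checkpoint URL in one record would let the API and the dashboards read the same source of truth instead of each tracking thresholds separately.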