Spaces:

solarevat
/

multilabel-news-classifier

Sleeping

App Files Files Community

multilabel-news-classifier / docs /PROJECT_NOTES.md

Solareva Taisia

chore(release): initial public snapshot

198ccb0 2 months ago

preview code

raw

history blame contribute delete

2.99 kB

Project Notes (Consolidated from legacy docs)

This file preserves the useful information from the historical “*_COMPLETE.md / *_PLAN.md” docs that were produced while building the project, without keeping dozens of duplicative markdown files in the repo root.

What’s implemented (high signal)

API: FastAPI model serving + analytics + sentiment + monitoring endpoints (see api/).
Dashboards: Streamlit dashboards for Evaluation, Analytics, Model Comparison, Sentiment (see dashboards/).
Training/Eval:
- Training script supports modern transformer fine-tuning + LoRA (see scripts/train_model.py).
- Evaluation supports metrics + threshold optimization and emits artifacts for dashboards (see scripts/evaluate.py + experiments/results/*.json).
- Fair comparison protocol exists: experiments/model_zoo/protocol_10k_1k.
CI/CD: GitHub Actions workflows exist in .github/workflows/ (CI, lint, security, release, CD, model-deploy).

Repo conventions that matter (so reviewers don’t get lost)

Model zoo artifacts (canonical)

Protocol: experiments/model_zoo/protocol_10k_1k (splits + tag_to_idx.json).
Predictions: experiments/predictions/<model_id>_val_preds.csv.
Metrics: experiments/results/<model_id>.json (includes optimized threshold).
Served model + threshold: config/thresholds.json.

Dashboards

Dashboards can run from uploaded CSV/JSON artifacts (no training required).
Dashboards can optionally call the API (“Use API”) — useful for analytics/sentiment and for remote deployments.

GitHub publishing checklist (the practical bits)

Hygiene before first push

Confirm .gitignore prevents committing:
- venv/, wandb/, logs/, .env, datasets, and large model checkpoints.
Decide model distribution strategy:
- Recommended: publish checkpoints via GitHub Releases (or W&B Artifacts), and provide a small download script.

CI/CD notes

CI should be “always green” on forks:
- CD deploy steps should be conditional on secrets like STAGING_API_URL / PRODUCTION_API_URL.
Optional integrations:
- Codecov is optional; CI should still pass if it’s not configured.

Known “portfolio polish” gaps (things to do next)

Publish to GitHub (currently local-only).
Hosted demo:
- API on Render/Fly/Railway
- Streamlit dashboards on Streamlit Cloud / HuggingFace Spaces
Model artifacts: make downloadable by reviewers (release assets or W&B artifacts).
Screenshots/GIFs: add to docs/screenshots/ and link in README.
License + data provenance: add a real LICENSE and clarify dataset source/constraints.

Future ideas backlog (optional)

Add a strict title-only Distil variant for clean ablation vs use_snippet=True.
Add a small models/REGISTRY.md (or config/models.yaml) that maps model_id → checkpoint URL → threshold → W&B run.