# Project Notes (Consolidated from legacy docs)
This file preserves the useful information from the historical `*_COMPLETE.md` / `*_PLAN.md` docs that were produced while building the project, without keeping dozens of duplicative markdown files in the repo root.
## What’s implemented (high signal)

- API: FastAPI model serving + analytics + sentiment + monitoring endpoints (see `api/`).
- Dashboards: Streamlit dashboards for Evaluation, Analytics, Model Comparison, Sentiment (see `dashboards/`).
- Training/Eval:
  - Training script supports modern transformer fine-tuning + LoRA (see `scripts/train_model.py`).
  - Evaluation supports metrics + threshold optimization and emits artifacts for dashboards (see `scripts/evaluate.py` + `experiments/results/*.json`).
  - Fair comparison protocol exists: `experiments/model_zoo/protocol_10k_1k`.
- CI/CD: GitHub Actions workflows exist in `.github/workflows/` (CI, lint, security, release, CD, model-deploy).
## Repo conventions that matter (so reviewers don’t get lost)
### Model zoo artifacts (canonical)

- Protocol: `experiments/model_zoo/protocol_10k_1k` (splits + `tag_to_idx.json`).
- Predictions: `experiments/predictions/<model_id>_val_preds.csv`.
- Metrics: `experiments/results/<model_id>.json` (includes optimized threshold).
- Served model + threshold: `config/thresholds.json`.
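For reviewers poking at these artifacts, loading them is plain JSON. A minimal sketch, assuming the result and threshold files are JSON objects (the exact schema emitted by `scripts/evaluate.py` is not shown here, so treat the field layout as an assumption):

```python
import json
from pathlib import Path


def load_model_result(model_id, results_dir="experiments/results"):
    """Load one model's metrics artifact (schema assumed, not verified)."""
    return json.loads((Path(results_dir) / f"{model_id}.json").read_text())


def load_served_thresholds(path="config/thresholds.json"):
    """Load the threshold config consumed by the serving API."""
    return json.loads(Path(path).read_text())
```

Because dashboards and the API both read the same files, helpers like these keep the artifact contract in one place.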
### Dashboards
- Dashboards can run from uploaded CSV/JSON artifacts (no training required).
- Dashboards can optionally call the API (“Use API”) — useful for analytics/sentiment and for remote deployments.
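The two data paths above (uploaded artifact vs. “Use API”) can be resolved with a small source-selection helper. This is a hypothetical sketch, not code from the repo: the function and parameter names are illustrative, assuming predictions arrive as a CSV upload or from a caller-supplied API fetcher:

```python
import csv
import io


def load_predictions(uploaded_file=None, api_fetch=None):
    """Resolve a dashboard's data source.

    Hypothetical helper: prefer an uploaded CSV artifact, else fall
    back to a callable that queries the API ("Use API" mode), else
    fail loudly so the dashboard can show a clear message.
    """
    if uploaded_file is not None:
        # File-upload widgets hand back a binary buffer; wrap it for csv.
        text = io.TextIOWrapper(uploaded_file, encoding="utf-8")
        return list(csv.DictReader(text))
    if api_fetch is not None:
        return api_fetch()
    raise ValueError("upload an artifact or enable 'Use API'")
```

Keeping this choice in one function means the same dashboard page works offline (artifacts only) and against a remote deployment.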
## GitHub publishing checklist (the practical bits)

### Hygiene before first push
- Confirm `.gitignore` prevents committing: `venv/`, `wandb/`, `logs/`, `.env`, datasets, and large model checkpoints.
- Decide model distribution strategy:
  - Recommended: publish checkpoints via GitHub Releases (or W&B Artifacts), and provide a small download script.
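The download script can stay tiny. A sketch using only the standard library; the repo slug, release tag, and asset name are placeholders to fill in once checkpoints are actually published:

```python
"""Sketch of the recommended checkpoint download script.

All specifics are placeholders: substitute the real <owner>/<repo>,
release tag, and asset names after the first GitHub Release exists.
"""
import urllib.request
from pathlib import Path

RELEASE_URL = "https://github.com/<owner>/<repo>/releases/download/{tag}/{asset}"


def download_checkpoint(tag, asset, dest_dir="models"):
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / asset
    if target.exists():
        # Already downloaded; keep the script idempotent for CI and reviewers.
        return target
    urllib.request.urlretrieve(RELEASE_URL.format(tag=tag, asset=asset), target)
    return target
```

Idempotency matters here: reviewers and CI can call it unconditionally without re-fetching multi-hundred-MB checkpoints.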
### CI/CD notes
- CI should be “always green” on forks:
  - CD deploy steps should be conditional on secrets like `STAGING_API_URL` / `PRODUCTION_API_URL`.
- Optional integrations:
  - Codecov is optional; CI should still pass if it’s not configured.
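One common way to make a deploy step fork-safe is to map the secret into an environment variable and gate the step on it being non-empty; forks have no secrets, so the step skips cleanly instead of failing. A hedged sketch (job name and deploy script path are placeholders, not from this repo’s workflows):

```yaml
# Hypothetical job in a CD workflow
deploy-staging:
  runs-on: ubuntu-latest
  env:
    STAGING_API_URL: ${{ secrets.STAGING_API_URL }}
  steps:
    - uses: actions/checkout@v4
    - name: Deploy to staging
      # On forks the secret is empty, so this step is skipped and the job stays green.
      if: env.STAGING_API_URL != ''
      run: ./scripts/deploy.sh "$STAGING_API_URL"
```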
## Known “portfolio polish” gaps (things to do next)
- Publish to GitHub (currently local-only).
- Hosted demo:
- API on Render/Fly/Railway
- Streamlit dashboards on Streamlit Cloud / HuggingFace Spaces
- Model artifacts: make downloadable by reviewers (release assets or W&B artifacts).
- Screenshots/GIFs: add to `docs/screenshots/` and link in README.
- License + data provenance: add a real `LICENSE` and clarify dataset source/constraints.
## Future ideas backlog (optional)
- Add a strict title-only Distil variant for clean ablation vs `use_snippet=True`.
- Add a small `models/REGISTRY.md` (or `config/models.yaml`) that maps model_id → checkpoint URL → threshold → W&B run.
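A possible shape for that registry file. All values are placeholders and the field names are suggestions, not an existing schema:

```yaml
# config/models.yaml -- hypothetical layout
distil_title_only:
  checkpoint_url: https://github.com/<owner>/<repo>/releases/download/<tag>/<asset>
  threshold: <optimized threshold copied from experiments/results/<model_id>.json>
  wandb_run: <entity>/<project>/<run_id>
```

Keeping one file as the source of truth makes the download script, the serving config, and the dashboards agree on which checkpoint corresponds to which metrics.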