# `tools/`: Operational scripts

Loose collection of CLI scripts: corpus operations, data uploads, probes, KB regeneration, scheduled-job runners. Nothing under `tools/` is imported by the live server; `backend/` and `rag/` are the runtime surface.

Scheduling for the long-running ones is wired via macOS LaunchAgents. See `CRON_README.md` in this folder for cadence + script paths, and [ADR-029](../70-docs/60-decisions/ADR-029-hnsw-bloat-tripwire.md) for the disk-safety LaunchAgents.
## Corpus + extraction batch ops

| Script | Purpose |
| --- | --- |
| `extract_all_corpus.py`, `extract_batch_5.py`, `extract_failed.py`, `extract_pdf_range.py`, `reextract_all.py` | Batch re-extractions over `rag/corpus/`. Useful when the schema or extraction prompt changes. |
| `extract_pdf_text.py`, `extract_policy_text.py`, `extract_policy_text_batch2.py` | Raw text dumps for manual inspection / regex curation. |
| `curate_batch2.py`, `curate_remaining.py`, `clear_batch2.py` | Verbatim-quote curation passes that produced `40-data/policy_facts/`. See [`40-data/policy_facts/_curation_report.md`](../40-data/policy_facts/_curation_report.md). |
| `generate_policy_facts.py` | Convert extraction outputs to the `40-data/policy_facts/<id>.json` shape with `{value, unit, source_pdf_path, source_quote}` provenance. |
| `pydantic_validate_batch_5.py`, `validate_batch_5.py`, `validate_json.py`, `validate_schema.py` | Schema validators for the 62-field `HealthPolicy`. |
| `count_fields.py` | Per-policy completeness scorer that feeds the `kb/INDEX.md` completeness % column. |
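
The provenance shape named in the table (`{value, unit, source_pdf_path, source_quote}`) can be checked mechanically. The following is a minimal sketch, assuming each fact is a plain dict; `missing_provenance` and the sample record are hypothetical and are not taken from `generate_policy_facts.py`.

```python
# Hypothetical helper: checks one fact record for the four provenance keys
# named in the table above. Not the actual generate_policy_facts.py logic.
REQUIRED_KEYS = ("value", "unit", "source_pdf_path", "source_quote")


def missing_provenance(fact: dict) -> list[str]:
    """Return the required keys that are absent or set to None in `fact`."""
    return [key for key in REQUIRED_KEYS if fact.get(key) is None]


# Made-up sample record, for illustration only.
sample = {
    "value": 30,
    "unit": "days",
    "source_pdf_path": "rag/corpus/example.pdf",
    "source_quote": "a waiting period of 30 days",
}
```

An empty list from `missing_provenance(sample)` means the record carries all four keys; any returned names point at the provenance fields still to be filled in.
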
## Source-map + verification

| Script | Purpose |
| --- | --- |
| `info_source_map.py` | Builds `eval/info_source_map.json` + `40-data/information_source_map.md`: claim → URL → verdict (✅ / ⚠️ / ❌ / ⏳). The canonical KPI for source-grounding quality. |
| `verify_urls.py` | HEAD-checks every URL in the corpus / facts; writes `eval/verified_urls.json`. |
| `verify_review_urls.py`, `verify_new_corpus.py` | Sub-verifiers for the reviews dataset and freshly-added corpus URLs. |
| `browser_verify.py` | Playwright-backed verifier for URLs that block HEAD requests. Output: `tools/browser_verified.json`. |
| `check_link_rot.py`, `check_pdf_etags.py` | LaunchAgent-driven freshness checks: corpus URL rot + PDF ETag drift. |
| `refresh_premiums.py` | LaunchAgent-driven refresh of `40-data/premiums/illustrative_premiums.json`. |
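
A HEAD check of the kind `verify_urls.py` is described as doing might look like the sketch below. It uses only the standard library; `head_status`, `needs_browser_check`, and the choice of 403/405 as the "escalate to a browser" signal are assumptions made for illustration, not the script's actual rules.

```python
import urllib.error
import urllib.request


def head_status(url: str, timeout: float = 10.0, opener=urllib.request.urlopen):
    """Send a HEAD request; return the HTTP status code, or None on network failure."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code
    except OSError:
        return None  # DNS failure, refused connection, timeout, ...


def needs_browser_check(status) -> bool:
    """Assumed rule: hand off to a browser-based verifier when HEAD is rejected."""
    return status in (403, 405)
```

The `opener` parameter exists so the function can be exercised without network access by passing a stub in place of `urllib.request.urlopen`.
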
## KB + dataset builders

| Script | Purpose |
| --- | --- |
| `build_kb_mirror.py` | Regenerates the entire `kb/policies/<id>.md` tree from `40-data/policy_facts/`. Idempotent. |
| `ingest_kb_summaries.py` | Ingests `kb/policies/*.md` summaries into Chroma so policy meta is retrievable. Carries the HNSW bloat tripwire. |
| `ingest_reviews.py` | Ingests `40-data/reviews/<insurer>.json` into Chroma. Carries the HNSW bloat tripwire. |
| `build_readme_pdf.py` | Renders the master `README.md` to PDF for offline review. |
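
One common way to make a regeneration step idempotent is to render deterministically and write only when the content changes. The sketch below illustrates that idea; `render_policy_md`, `write_if_changed`, and the assumed layout of the facts dict (field name mapped to a `{value, unit}` record) are hypothetical and may differ from what `build_kb_mirror.py` does.

```python
from pathlib import Path


def render_policy_md(policy_id: str, facts: dict) -> str:
    """Render one policy's facts as Markdown; sorted keys keep the output stable."""
    lines = [f"# {policy_id}", ""]
    for name in sorted(facts):
        fact = facts[name]
        lines.append(f"- **{name}**: {fact['value']} {fact['unit']}".rstrip())
    return "\n".join(lines) + "\n"


def write_if_changed(path: Path, text: str) -> bool:
    """Write `text` only if it differs from what is on disk; return True if written."""
    if path.exists() and path.read_text(encoding="utf-8") == text:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return True
```

Running the pair twice over the same input leaves file contents and modification times untouched on the second pass, which keeps diffs and downstream ingests quiet.
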
## HF Hub uploads (data-side mirror)

| Script | Target |
| --- | --- |
| `upload_to_hf.py` | Code-side push to the HF Space repo (`huggingface.co/spaces/rohitsar567/InsuranceBot`). |
| `upload_corpus_to_dataset.py`, `upload_extracted_to_dataset.py`, `upload_vectors_to_dataset.py`, `upload_all_to_dataset.py` | Push specific slices of `rag/` to the companion HF Dataset `rohitsar567/insurance-bot-data`. See [ADR-020](../70-docs/60-decisions/ADR-020-code-data-split-hf-dataset.md) and [ADR-024](../70-docs/60-decisions/ADR-024-triple-mirror-code-and-data.md). |
| `set_hf_secrets.py` | One-shot helper that pushes the runtime secrets into the HF Space (idempotent). Current secret set: `GOOGLE_API_KEY` (Google AI Studio, per [ADR-040](../70-docs/60-decisions/ADR-040-google-gemini-primary.md)), `NVIDIA_NIM_API_KEY`, `OPENROUTER_API_KEY`, `SARVAM_API_KEY`, plus admin password / IP allowlist. |
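
Pushing secrets to a Space can be sketched with `huggingface_hub`'s `HfApi.add_space_secret`, which adds or updates a secret and so is safe to re-run. This is an illustration under assumptions, not the contents of `set_hf_secrets.py`: `collect_secrets` and `push_secrets` are hypothetical names, and the admin password / IP allowlist variables are omitted because their names are not stated here.

```python
import os

# Names taken from the table above; the admin password / IP allowlist
# variable names are not given there, so this sketch leaves them out.
SECRET_NAMES = (
    "GOOGLE_API_KEY",
    "NVIDIA_NIM_API_KEY",
    "OPENROUTER_API_KEY",
    "SARVAM_API_KEY",
)


def collect_secrets(env=os.environ) -> dict:
    """Return the known secrets that are set and non-empty in `env`."""
    return {name: env[name] for name in SECRET_NAMES if env.get(name)}


def push_secrets(repo_id: str, secrets: dict) -> None:
    """Add or update each secret on the Space; re-running overwrites in place."""
    # Imported lazily so collect_secrets() works without the dependency installed.
    from huggingface_hub import HfApi

    api = HfApi()
    for key, value in secrets.items():
        api.add_space_secret(repo_id=repo_id, key=key, value=value)
```

A call such as `push_secrets("rohitsar567/InsuranceBot", collect_secrets())` would then need a Hugging Face token with write access to that Space.
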
## Probes + diagnostics

| Script | Provider it pokes |
| --- | --- |
| `sarvam_probe.py`, `sarvam_nothink_probe.py` | Sarvam-M / Saarika / Bulbul connectivity + latency. |
| `groq_probe.py`, `groq_long_probe.py` | Historical Groq Llama free-tier probe. Groq is no longer in any production chain (removed in [ADR-038](../70-docs/60-decisions/ADR-038-nim-only-chains.md), not re-added in [ADR-040](../70-docs/60-decisions/ADR-040-google-gemini-primary.md)). Kept for benchmarking. |
| `openrouter_probe.py`, `or_models.py` | OpenRouter routing + model-list inspection. Used by KI-178 to audit which `:free` models expose `response_format`. |
| `pdf_probe.py` | pdfplumber parse on a single PDF; the first stop when extraction silently produces empty text. |
| `heavy_smoke_test.py` | End-to-end smoke against the live HF Space (every provider in one call). |
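
A connectivity + latency probe reduces to timing a call a few times and counting failures. The sketch below is provider-agnostic and hypothetical: `probe_latency` takes any zero-argument callable, so the actual request to Sarvam, OpenRouter, or another provider is left to the caller.

```python
import statistics
import time


def probe_latency(call, attempts: int = 3):
    """Invoke `call` several times; return (successes, median_seconds or None)."""
    durations = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            call()
        except Exception:
            continue  # a failed attempt counts against connectivity, not latency
        durations.append(time.perf_counter() - start)
    median = statistics.median(durations) if durations else None
    return len(durations), median
```

Reporting the median rather than the mean keeps one slow cold-start request from dominating the number.
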
## Chunk-size & retrieval sweeps

| Script | Purpose |
| --- | --- |
| `chunk_sweep.py`, `chunk_sweep_diagnostic.py` | Grid-search over chunk size / overlap. Output: `eval/chunk_sweep_results.json`. See [ADR-018](../70-docs/60-decisions/ADR-018-chunk-size-sweep-deferred.md). |
| `sweep_retrieval.py` | Retrieval-strategy A/B (filter vs no-filter, top-k variants). |
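
A grid search over chunk size and overlap can be sketched as below. The character-based chunker, the function names, and the `score` callback are assumptions for illustration; the real sweep presumably scores each configuration with a retrieval metric rather than a toy function.

```python
from itertools import product


def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    """Split `text` into `size`-character windows that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def sweep(text: str, sizes, overlaps, score):
    """Score every valid (size, overlap) pair; return results sorted best-first."""
    results = []
    for size, overlap in product(sizes, overlaps):
        if overlap >= size:
            continue  # skip combinations that would never advance
        chunks = chunk_text(text, size, overlap)
        results.append({"size": size, "overlap": overlap, "score": score(chunks)})
    return sorted(results, key=lambda r: r["score"], reverse=True)
```

For example, `chunk_text("abcdefghij", 4, 2)` yields `["abcd", "cdef", "efgh", "ghij"]`, and the list returned by `sweep` is plain dicts, so it can be dumped to JSON as-is.
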
## Scheduled jobs / shell wrappers

| Path | Purpose |
| --- | --- |
| `install_crons.sh`, `CRON_README.md` | Install the LaunchAgents; the README is the canonical cadence + path reference. |
| `install_git_hooks.sh`, `git-hooks/` | Pre-commit hooks (decimal grep, secret scan, schema validation). |
| `full_pipeline.sh`, `pipeline_finish_all.sh`, `post_extract_deploy.sh`, `reextract_then_deploy.sh`, `quarterly_rebuild.sh` | Multi-step orchestrations (download → extract → ingest → push → smoke). |
| `reconcile_manifest.py` | Drift check between `rag/corpus/_manifest.json` and what's actually on disk. |
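
A manifest drift check compares the files a manifest lists with the files actually present. The sketch below assumes a hypothetical manifest schema of `{"files": ["a.pdf", ...]}` with paths relative to the corpus directory; the real `_manifest.json` layout is not described here and may differ.

```python
import json
from pathlib import Path


def manifest_drift(corpus_dir: Path, manifest_name: str = "_manifest.json") -> dict:
    """Compare manifest entries with files on disk and report both kinds of drift."""
    manifest = json.loads((corpus_dir / manifest_name).read_text(encoding="utf-8"))
    listed = set(manifest["files"])  # assumed schema: {"files": ["a.pdf", ...]}
    on_disk = {
        p.relative_to(corpus_dir).as_posix()
        for p in corpus_dir.rglob("*")
        if p.is_file() and p.name != manifest_name
    }
    return {
        "missing_on_disk": sorted(listed - on_disk),  # listed but not present
        "untracked": sorted(on_disk - listed),        # present but not listed
    }
```

Both lists being empty means the manifest and the directory agree; a scheduled job could exit non-zero otherwise.
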
## Subdirectory

`audit/`: multi-persona conversational audit framework. See `tools/audit/README.md`.

## Related

- `CRON_README.md` (this folder): LaunchAgent cadence reference
- [ADR-020](../70-docs/60-decisions/ADR-020-code-data-split-hf-dataset.md), [ADR-024](../70-docs/60-decisions/ADR-024-triple-mirror-code-and-data.md), [ADR-029](../70-docs/60-decisions/ADR-029-hnsw-bloat-tripwire.md)
- `80-audit/ENTERPRISE_AUDIT.md`: defect register, including silent-LaunchAgent regressions (D-002)