InsuranceBot / tools /README.md
rohitsar567's picture
feat(llm+docs): KI-177 + KI-179 + KI-183 + ADR-040 docs cascade
132a829
|
Raw
History Blame Contribute Delete
6.21 kB
# `tools/` β€” Operational scripts
Loose collection of CLI scripts: corpus operations, data uploads, probes, KB regeneration, scheduled-job runners. Nothing under `tools/` is imported by the live server β€” `backend/` and `rag/` are the runtime surface.
Scheduling for the long-running ones is wired via macOS LaunchAgents β€” see `CRON_README.md` in this folder for cadence + script paths, and [ADR-029](../70-docs/60-decisions/ADR-029-hnsw-bloat-tripwire.md) for the disk-safety LaunchAgents.
## Corpus + extraction batch ops
| Script | Purpose |
| --- | --- |
| `extract_all_corpus.py`, `extract_batch_5.py`, `extract_failed.py`, `extract_pdf_range.py`, `reextract_all.py` | Batch re-extractions over `rag/corpus/`. Useful when the schema or extraction prompt changes. |
| `extract_pdf_text.py`, `extract_policy_text.py`, `extract_policy_text_batch2.py` | Raw text dumps for manual inspection / regex curation. |
| `curate_batch2.py`, `curate_remaining.py`, `clear_batch2.py` | Verbatim-quote curation passes that produced `40-data/policy_facts/`. See [`40-data/policy_facts/_curation_report.md`](../40-data/policy_facts/_curation_report.md). |
| `generate_policy_facts.py` | Convert extraction outputs to the `40-data/policy_facts/<id>.json` shape with `{value, unit, source_pdf_path, source_quote}` provenance. |
| `pydantic_validate_batch_5.py`, `validate_batch_5.py`, `validate_json.py`, `validate_schema.py` | Schema validators for the 62-field `HealthPolicy`. |
| `count_fields.py` | Per-policy completeness scorer that feeds the `kb/INDEX.md` completeness % column. |
## Source-map + verification
| Script | Purpose |
| --- | --- |
| `info_source_map.py` | Builds `eval/info_source_map.json` + `40-data/information_source_map.md` β€” claim β†’ URL β†’ verdict (βœ… / ⚠️ / ❌ / ⏳). The canonical KPI for source-grounding quality. |
| `verify_urls.py` | HEAD-checks every URL in the corpus / facts; writes `eval/verified_urls.json`. |
| `verify_review_urls.py`, `verify_new_corpus.py` | Sub-verifiers for the reviews dataset and freshly-added corpus URLs. |
| `browser_verify.py` | Playwright-backed verifier for URLs that block HEAD requests. Output: `tools/browser_verified.json`. |
| `check_link_rot.py`, `check_pdf_etags.py` | LaunchAgent-driven freshness checks β€” corpus URL rot + PDF eTag drift. |
| `refresh_premiums.py` | LaunchAgent-driven refresh of `40-data/premiums/illustrative_premiums.json`. |
## KB + dataset builders
| Script | Purpose |
| --- | --- |
| `build_kb_mirror.py` | Regenerates the entire `kb/policies/<id>.md` tree from `40-data/policy_facts/`. Idempotent. |
| `ingest_kb_summaries.py` | Ingests `kb/policies/*.md` summaries into Chroma so policy meta is retrievable. Carries the HNSW bloat tripwire. |
| `ingest_reviews.py` | Ingests `40-data/reviews/<insurer>.json` into Chroma. Carries the HNSW bloat tripwire. |
| `build_readme_pdf.py` | Renders the master `README.md` to PDF for offline review. |
## HF Hub uploads (data-side mirror)
| Script | Target |
| --- | --- |
| `upload_to_hf.py` | Code-side push to the HF Space repo (`huggingface.co/spaces/rohitsar567/InsuranceBot`). |
| `upload_corpus_to_dataset.py`, `upload_extracted_to_dataset.py`, `upload_vectors_to_dataset.py`, `upload_all_to_dataset.py` | Push specific slices of `rag/` to the companion HF Dataset `rohitsar567/insurance-bot-data`. See [ADR-020](../70-docs/60-decisions/ADR-020-code-data-split-hf-dataset.md) and [ADR-024](../70-docs/60-decisions/ADR-024-triple-mirror-code-and-data.md). |
| `set_hf_secrets.py` | One-shot helper that pushes the runtime secrets into the HF Space (idempotent). Current secret set: `GOOGLE_API_KEY` (Google AI Studio, per [ADR-040](../70-docs/60-decisions/ADR-040-google-gemini-primary.md)), `NVIDIA_NIM_API_KEY`, `OPENROUTER_API_KEY`, `SARVAM_API_KEY`, plus admin password / IP allowlist. |
## Probes + diagnostics
| Script | Provider it pokes |
| --- | --- |
| `sarvam_probe.py`, `sarvam_nothink_probe.py` | Sarvam-M / Saarika / Bulbul connectivity + latency. |
| `groq_probe.py`, `groq_long_probe.py` | Historical Groq Llama free-tier probe β€” Groq is no longer in any production chain (removed in [ADR-038](../70-docs/60-decisions/ADR-038-nim-only-chains.md), not re-added in [ADR-040](../70-docs/60-decisions/ADR-040-google-gemini-primary.md)). Kept for benchmarking. |
| `openrouter_probe.py`, `or_models.py` | OpenRouter routing + model-list inspection. Used by KI-178 to audit which `:free` models expose `response_format`. |
| `pdf_probe.py` | pdfplumber parse on a single PDF β€” first stop when extraction silently produces empty text. |
| `heavy_smoke_test.py` | End-to-end smoke against the live HF Space (every provider in one call). |
## Chunk-size & retrieval sweeps
| Script | Purpose |
| --- | --- |
| `chunk_sweep.py`, `chunk_sweep_diagnostic.py` | Grid-search over chunk size / overlap. Output: `eval/chunk_sweep_results.json`. See [ADR-018](../70-docs/60-decisions/ADR-018-chunk-size-sweep-deferred.md). |
| `sweep_retrieval.py` | Retrieval-strategy A/B (filter vs no-filter, top-k variants). |
## Scheduled jobs / shell wrappers
| Path | Purpose |
| --- | --- |
| `install_crons.sh`, `CRON_README.md` | Install the LaunchAgents; the README is the canonical cadence + path reference. |
| `install_git_hooks.sh`, `git-hooks/` | Pre-commit hooks (decimal grep, secret scan, schema validation). |
| `full_pipeline.sh`, `pipeline_finish_all.sh`, `post_extract_deploy.sh`, `reextract_then_deploy.sh`, `quarterly_rebuild.sh` | Multi-step orchestrations (download β†’ extract β†’ ingest β†’ push β†’ smoke). |
| `reconcile_manifest.py` | Drift check between `rag/corpus/_manifest.json` and what's actually on disk. |
## Subdirectory
`audit/` β€” multi-persona conversational audit framework. See `tools/audit/README.md`.
## Related
- `CRON_README.md` (this folder) β€” LaunchAgent cadence reference
- [ADR-020](../70-docs/60-decisions/ADR-020-code-data-split-hf-dataset.md), [ADR-024](../70-docs/60-decisions/ADR-024-triple-mirror-code-and-data.md), [ADR-029](../70-docs/60-decisions/ADR-029-hnsw-bloat-tripwire.md)
- `80-audit/ENTERPRISE_AUDIT.md` β€” defect register, including silent-LaunchAgent regressions (D-002)