# `tools/audit/`: Multi-persona conversational audit

End-to-end conversational stress test: walks 100 distinct personas through 30-turn flows against the live API, captures each turn's reply, latency, faithfulness verdict, and blocked status, and rolls the results up into a defect-counting report.

This is the framework that surfaced the headline KI-018 (QA→fact-find misrouting) and KI-021 (latency p95 blow-out) defects in the readiness audit.
## Files

| File | Role |
| --- | --- |
| `run_audit.py` | Entry point. Walks each persona × 30 turns against the live `/api/chat`. Resumable (per-persona transcripts land as they finish). Concurrent (`--workers W`) but rate-aware: the global NIM cap of 40 req/min is enforced via a per-request sleep. Retries 5xx responses with exponential backoff. |
| `personas.py` | Generator: 10 archetypes × 10 demographic profiles × 1 deterministic style = **100 unique personas**. A stable generation order gives stable persona IDs across runs, so diffs reflect regressions, not shuffle noise. Run as a script to (re)generate `personas.json`. |
| `personas.json` | Materialised 100-persona list. Stable input to `run_audit.py`. |
| `flows.py` | Generator: for each persona, produces a 30-turn user-text sequence in 5 phases: opening (1) · fact-find answers (9) · free-form Qs (10) · edge-case probes (5) · adversarial + close (5). |
| `flows.json` | Materialised flows. `dict[persona_id, list[str]]` of the 30 turns each persona sends. |
| `analyze.py` | Post-run aggregator: reads `80-audit/<run_id>/transcripts/*.json` and computes per-archetype / per-language / per-style breakdowns of faithfulness, blocked rate, and p95 latency. Emits `report.md` + `summary.json` into the run dir. |
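To make the "stable order = stable IDs" property and the 5-phase turn budget concrete, here is a minimal sketch. It is not the code in `personas.py` or `flows.py`: the list contents, the `STYLES` pool, the dictionary keys, and the helper names `build_personas` and `phase_of` are all hypothetical, and only the counts (10 × 10 × 1, and 1 + 9 + 10 + 5 + 5 turns) come from the table above.

```python
from itertools import product

# Placeholder lists: the real archetype and demographic lists live in personas.py.
ARCHETYPES = [f"archetype_{i}" for i in range(10)]
DEMOGRAPHICS = [f"profile_{i}" for i in range(10)]
STYLES = ["neutral", "terse", "chatty"]  # hypothetical style pool

def build_personas():
    """Enumerate archetype x profile in a fixed order so IDs never shuffle."""
    personas = []
    pairs = product(ARCHETYPES, DEMOGRAPHICS)
    for index, (archetype, profile) in enumerate(pairs, start=1):
        personas.append({
            "id": f"P{index:03d}",
            "archetype": archetype,
            "profile": profile,
            # One style per persona, derived from the index (assumption), not random.
            "style": STYLES[index % len(STYLES)],
        })
    return personas

# Phase lengths as listed for flows.py; the phase names are illustrative.
PHASES = [
    ("opening", 1),
    ("fact_find", 9),
    ("free_form", 10),
    ("edge_case", 5),
    ("adversarial_close", 5),
]

def phase_of(turn):
    """Return the phase name for a 1-based turn number."""
    if turn < 1:
        raise ValueError(f"turn {turn} is outside the 30-turn flow")
    upper = 0
    for name, length in PHASES:
        upper += length
        if turn <= upper:
            return name
    raise ValueError(f"turn {turn} is outside the 30-turn flow")
```

Because the IDs are derived purely from enumeration order, inserting or reordering an entry in either list renumbers every persona after it, which is why such changes need to be recorded in the run notes.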
## Output layout

```
80-audit/<run_id>/
├── transcripts/
│   ├── P001.json           (complete persona)
│   ├── P002.json
│   ├── P003.partial.json   (in-flight or interrupted)
│   └── …
├── report.md               (analyze.py output: defect breakdown)
└── summary.json            (machine-readable rollup)
```

The `<run_id>` convention is `full_YYYYMMDD_HHMMSS` for the full 100-persona pass and `postfix_YYYYMMDD_HHMMSS` for a post-fix re-run targeting a specific defect.
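The naming convention and the complete/partial split can be sketched as follows. The helper names `make_run_id` and `split_transcripts` are hypothetical, and the use of local time for the timestamp is an assumption; the source only fixes the `full_` / `postfix_` prefixes, the timestamp shape, and the `.partial.json` suffix.

```python
from datetime import datetime
from pathlib import Path

def make_run_id(kind, now=None):
    """Build a run ID such as full_20260514_145243 (local time is an assumption)."""
    if kind not in ("full", "postfix"):
        raise ValueError(f"unknown run kind: {kind!r}")
    now = now or datetime.now()
    return f"{kind}_{now:%Y%m%d_%H%M%S}"

def split_transcripts(run_dir):
    """Return (complete_ids, partial_ids) found under <run_dir>/transcripts/."""
    suffix = ".partial.json"
    complete, partial = [], []
    for path in sorted((Path(run_dir) / "transcripts").glob("*.json")):
        if path.name.endswith(suffix):
            # In-flight or interrupted persona.
            partial.append(path.name[: -len(suffix)])
        else:
            complete.append(path.stem)
    return complete, partial
```

A resumable runner could use a split like this to skip personas that already have a complete transcript and to restart the ones left in a partial state.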
## Typical run

```bash
# Full audit against the live HF Space
python tools/audit/run_audit.py --workers 4

# Smoke (5 personas) for a config change
python tools/audit/run_audit.py --max-personas 5 --base http://localhost:8000

# Aggregate after the run finishes
python tools/audit/analyze.py 80-audit/full_20260514_145243/
```
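For readers who want a feel for what the aggregation step computes, here is a rough sketch of a per-archetype rollup with a nearest-rank p95. It is not the code in `analyze.py`: the transcript schema is not documented here, so the per-turn keys `archetype`, `blocked`, and `latency_ms` and the function names `p95` and `rollup` are assumptions, and the real script may use a different percentile definition.

```python
from collections import defaultdict

def p95(values):
    """Nearest-rank 95th percentile; None when there are no values."""
    if not values:
        return None
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]

def rollup(turns):
    """Group per-turn records by archetype and summarise each group."""
    groups = defaultdict(list)
    for turn in turns:
        groups[turn["archetype"]].append(turn)  # hypothetical key
    summary = {}
    for archetype, rows in groups.items():
        summary[archetype] = {
            "turns": len(rows),
            "blocked_rate": sum(1 for r in rows if r["blocked"]) / len(rows),
            "p95_latency_ms": p95([r["latency_ms"] for r in rows]),
        }
    return summary
```

The same grouping could be keyed on language or style to produce the other breakdowns the report lists.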
## Watch-outs

- **An HF Space rebuild takes 5-8 min.** Don't start an audit until the desired image is stably deployed; otherwise transcripts span multiple builds and become useless for A/B comparison.
- **The 40 req/min NIM cap is global.** Bumping `--workers` past 4 will not help: the per-request sleep clamps the dispatch rate.
- **Personas are stable by index.** If you change the `ARCHETYPES` or demographic lists, P037 is no longer the same person; call it out in the run notes.
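The reason extra workers do not raise throughput can be seen in a small sketch of a shared limiter. This is one plausible way to enforce a global cap with a per-request sleep, not the code in `run_audit.py`; the class name `GlobalRateLimiter`, its injectable `clock` and `sleep` parameters, and the helper `min_runtime_minutes` are all hypothetical.

```python
import threading
import time

class GlobalRateLimiter:
    """Hands out dispatch slots spaced 60 / max_per_minute seconds apart."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = None

    def acquire(self):
        """Block until this caller's slot arrives; return the seconds waited."""
        with self._lock:
            now = self._clock()
            slot = now if self._next_slot is None else max(now, self._next_slot)
            # Reserve the following slot for whichever worker asks next.
            self._next_slot = slot + self.min_interval
        wait = slot - now
        if wait > 0:
            self._sleep(wait)  # sleep outside the lock so others can queue
        return wait

def min_runtime_minutes(personas, turns, max_per_minute):
    """Lower bound on wall-clock time imposed by the cap alone."""
    return personas * turns / max_per_minute
```

Every worker shares one limiter, so slots are 1.5 s apart at 40 req/min regardless of how many workers are waiting. Under this model a full pass of 100 personas × 30 turns is 3000 requests, which cannot finish in under 75 minutes however high `--workers` is set.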
## Related

- `80-audit/ENTERPRISE_AUDIT.md`: defect register fed by audit output
- `80-audit/README.md`: output-folder layout reference
- Root `CLAUDE.md` § Routing invariants: the KI-018 / KI-023 regressions the audit catches