# Healthcare Marketing Compliance - RAG Prototype
## Document version
**v2: incorporates Becki's domain-expert feedback** (collaborator review of v1 source-acquisition recommendations).
### v1 → v2 changes
- **Audience narrowed and clarified**: from generic "clinic/pharmacy/healthcare-product owner-operators" to **complementary/alternative practitioners + supplement sellers** (chiros, osteos, physios, Chinese medicine, naturopaths, supplement retailers)
- **Council standards swap**: drop Pharmacy/Dental Council standards from v1; add Chiropractic Board, Osteopathic Council, Physiotherapy Board, Chinese Medicine Council; keep Medical Council as scope-tagged benchmark only
- **Five new mandatory additions**: HPCA Act 2003, Dietary Supplements Regulations 1985, HDC Code of Rights, ASA TAC Dec 2025 (alongside current), ACC provider rules
- **Section-level metadata flagging**: Medicines Act s58 (testimonial bans) and Fair Trading Act s12A (substantiation) called out explicitly in build scripts
- **Transition window metadata**: ASA TAC effective dates (1 April 2026 / 1 July 2026) encoded as section metadata so corpus answers stay accurate across the transition
- **Domain restructure**: 5 → 6 domains, regrouped to match how marketers actually think about the problem
- **Acquisition list grows** from 11 to ~17 documents; cost still ~$10–$15 with Opus
- Commercial / expansion strategy notes added (held in local-only STRATEGY.md)
## Context
The ECE Compliance RAG prototype proved that PageIndex tree retrieval + LLM reasoning + multi-regulator corpus indexing produces a working compliance assistant. The architecture, pipeline and UI are reusable. The next prototype applies the same approach to **NZ healthcare marketing regulation**, scoped to **complementary/alternative practitioners and supplement sellers**: an audience with high-stakes compliance pain (advertising claims, testimonials, title use, supplement classification, registration declarations) that spans multiple regulators with overlapping but non-identical requirements.
Goal: validate that the architecture transfers cleanly to a different legislative domain, and that targeting a coherent practitioner segment produces actionable answers. This is a second proof of concept, not a productisation step.
## Decisions locked in
- **Same tech stack**: Streamlit + uv + litellm + PageIndex
- **Keep local model switching**: MLX (Qwen, Gemma) options preserved alongside Claude API
- **No te reo Māori in prototype**: language detection code stays but is bypassed
- **Source acquisition via Python scripts**: same pattern as ECE. Per-domain `build_*_compilation.py` scripts read raw HTML/PDF from `sources/raw/` and emit markdown with `Source:` URLs in place
- **Deployment flexible**: local-only for prototype; cloud later if value is validated
- **Audience**: complementary/alternative practitioners (chiros, osteos, physios, Chinese medicine, naturopaths, acupuncturists) + supplement sellers
- **Indexing budget**: Opus is fine (~$10–$15 expected for the v2 corpus, slightly larger than v1)
## Repo strategy β€” staged
The long-term goal is a reusable compliance-RAG framework that benefits multiple domain projects. The path to get there is staged, not upfront.
### Stage 1 (now) - sibling fork
- Create new repo `health-marketing-compliance-rag` as a sibling to `ece-compliance-rag`
- Clean-copy the ECE codebase, strip ECE-specific content, refit for healthcare marketing
- Repos diverge freely; bug fixes copy-paste between them
- Deliberate discipline: keep `convert_legislation_html.py`, `propagate_source_urls.py`, `build_indexes.py` byte-identical between both repos to make later extraction painless
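The byte-identical discipline can be checked mechanically rather than by eye. The following is a minimal sketch, not something that exists in either repo: `drifted_files` is a hypothetical helper, and the two repo paths passed to it are assumptions about where the checkouts live. It compares SHA-256 hashes of the three shared scripts and reports any that differ or are missing.

```python
import hashlib
from pathlib import Path

# The three scripts the plan says must stay identical across both repos.
SHARED_SCRIPTS = [
    "scripts/convert_legislation_html.py",
    "scripts/propagate_source_urls.py",
    "scripts/build_indexes.py",
]


def drifted_files(repo_a: Path, repo_b: Path, relative_paths=SHARED_SCRIPTS) -> list[str]:
    """Return the shared scripts that are missing or differ between two checkouts."""
    drifted = []
    for rel in relative_paths:
        a, b = repo_a / rel, repo_b / rel
        if not (a.is_file() and b.is_file()):
            drifted.append(rel)  # a missing file counts as drift
            continue
        digest_a = hashlib.sha256(a.read_bytes()).hexdigest()
        digest_b = hashlib.sha256(b.read_bytes()).hexdigest()
        if digest_a != digest_b:
            drifted.append(rel)
    return drifted
```

Running this before each commit (for example from a `make` target, which would be an addition to the existing Makefile) would surface drift early, while it is still a one-line copy to fix.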
### Stage 2 (trigger: scoping a third domain) - template repo
- Create `compliance-rag-template` containing only what's genuinely shared across both ECE and healthcare-marketing experience
- New domain projects start with `gh repo create --template compliance-rag-template`
- Existing prototypes can optionally rebase onto template or stay as-is; this is a per-project decision
- This is where the "rule of three" pays off: by then we'll have hindsight on what was shared vs domain-coloured
### Stage 3 (trigger: this becomes a product, not just prototypes) - installable package
- Publish framework as a versioned Python package (`compliance-rag-core`) on private PyPI or GitHub Packages
- Domain repos depend on a pinned version: `compliance-rag-core==0.3.1`
- Framework releases propagate to all consumers explicitly, on next bump
- This is where "shared improvements" actually scales
### Why not extract framework now
Rule of three. Frameworks built from one prototype over-fit to that prototype. ECE alone produced the *current* shape, but healthcare marketing will reveal which parts of that shape were ECE-specific in disguise. Premature abstraction bakes in wrong assumptions. Forking is cheap; refactoring a wrong abstraction is expensive.
### Anti-patterns explicitly avoided
- **Git submodules**: drift management nightmare; skip directly to package-based sharing at Stage 3
- **Monorepo tooling** (Bazel, Nx, Turborepo): overhead too high for solo prototype work
- **Speculative abstraction**: no shared base classes "just in case"; only extract what's already shared verbatim across two repos
## Reuse vs. replace
### Reused as-is
- `src/pipeline.py`, `src/retriever.py`, `src/router.py`, `src/generator.py`: pipeline core
- `src/config.py` framework (DOCUMENT_REGISTRY pattern, model presets)
- `src/usage.py`, `src/language.py` (kept dormant; no language switching active)
- Multi-model switching (litellm + MLX server orchestration in Makefile)
- `scripts/build_indexes.py`: PageIndex tree builder
- `scripts/propagate_source_urls.py`: URL inheritance through tree
- `app.py`: Streamlit shell (welcome + chat + sidebar)
- `Makefile`: build pipeline, per-domain index targets, propagate hooks
- `test_pipeline.py`, `benchmark/run_benchmark.py`: eval framework
### Replaced (domain-specific, written fresh)
- `corpus/*.md`: produced by new per-domain build scripts
- `indexes/*.json`: rebuilt from new corpus
- `sources/raw/*`: raw HTML/PDF acquired from healthcare regulators
- `DOCUMENT_REGISTRY` in `src/config.py`
- Router few-shot examples in `src/router.py`
- Generator `SYSTEM_PROMPT` in `src/generator.py`
- Welcome message + starter questions in `app.py`
- `benchmark/questions.json`
### Adapted from ECE templates (rewritten with healthcare URLs/parsers)
- `scripts/build_medicines_and_supplements_compilation.py`: Medicines Act 1981 (with **s58 flagged**), Medicines Regs 1984, **Dietary Supplements Regs 1985**, Medsafe advertising guidance
- `scripts/build_advertising_standards_compilation.py`: ASA TAC current + **ASA TAC Dec 2025 (with effective-date metadata)** + General ASA Advertising Standards Code
- `scripts/build_consumer_protection_compilation.py`: Fair Trading Act 1986 (with **s12A flagged**) + ComCom Health & Wellness Claims guidance
- `scripts/build_marketing_comms_compilation.py`: Privacy Act 2020, Health Information Privacy Code 2020, **UEMA 2007** (the "can I email this list?" cluster)
- `scripts/build_practitioner_regulation_compilation.py`: **HPCA Act 2003**, **HDC Code of Rights**, **ACC provider rules**
- `scripts/build_professional_codes_compilation.py`: **Chiropractic Board, Osteopathic Council, Physiotherapy Board, Chinese Medicine Council** standards + Medical Council (scope-tagged as benchmark only)
- `scripts/convert_legislation_html.py`: copied from ECE, works unchanged for Medicines Act, Medicines Regs, Dietary Supplements Regs, Fair Trading Act, Privacy Act, HPCA Act, UEMA (all on legislation.govt.nz with same structure)
- `scripts/download_sources.sh` (or equivalent): source-fetching commands
### Removed
- ECE-specific build scripts (`scripts/build_ero_compilation.py`, `build_reform_compilation.py`)
- ECE corpus, indexes, source raw files
- ECE benchmark questions
## Domains (v2)
Six domains, restructured to match how the audience thinks about compliance ("can I say X?", "can I email this list?", "can I call myself Y?"):
| Domain key | Coverage | Acquisition source |
|---|---|---|
| `medicines_and_supplements` | Medicines Act 1981 (Parts 4 & 5; **s58 testimonial ban flagged**), Medicines Regs 1984, **Dietary Supplements Regs 1985** (the "marketed therapeutically → reclassified as medicine" trapdoor), Medsafe advertising guidance | legislation.govt.nz (HTML) + medsafe.govt.nz (HTML/PDF) |
| `advertising_standards` | ASA Therapeutic and Health Advertising Code (current) + **ASA TAC Dec 2025** (applies 1 Apr 2026 / 1 Jul 2026; both kept with effective-date metadata) + General ASA Advertising Standards Code | asa.co.nz (PDFs) |
| `consumer_protection` | Fair Trading Act 1986 (Part 1; **s12A substantiation flagged**) + ComCom Health & Wellness Claims guidance | legislation.govt.nz (HTML) + comcom.govt.nz (HTML/PDF) |
| `marketing_comms` | Privacy Act 2020 (IPP 10, IPP 11), Health Information Privacy Code 2020, **Unsolicited Electronic Messages Act 2007** | legislation.govt.nz (HTML) + privacy.org.nz (PDF) |
| `practitioner_regulation` | **HPCA Act 2003** (titles, scopes of practice, restricted activities), **HDC Code of Rights** (Rights 6 & 7: information and informed consent), **ACC provider rules** | legislation.govt.nz + hdc.org.nz + acc.co.nz |
| `professional_codes` | **Chiropractic Board**, **Osteopathic Council**, **Physiotherapy Board**, **Chinese Medicine Council** standards on advertising; Medical Council Statement on Advertising (**scope-tagged as benchmark only; does NOT bind non-MD practitioners**) | chiropracticboard.org.nz, osteopathiccouncil.org.nz, physioboard.org.nz, chinesemedicinecouncil.org.nz, mcnz.org.nz |
Six domains, ~17 documents, comparable to ECE's complexity. The professional_codes domain requires per-document scope metadata (which practitioners each standard binds); see "Section-level metadata flags" below.
## Corpus format produced by build scripts
Each `build_<domain>_compilation.py` script must emit markdown matching the conventions the existing pipeline reads.
**File layout (v2):**
```
corpus/
medicines-and-supplements.md
advertising-standards.md
consumer-protection.md
marketing-comms.md
practitioner-regulation.md
professional-codes.md
```
**Per-file structure:**
```markdown
# Domain Title
Source: https://canonical-hub-url
One-paragraph orientation describing what this corpus covers and who issues it.
## Section Title (H2)
Source: https://specific-page-or-section-url
Section content in plain markdown...
### Subsection (H3)
Subsection content...
```
**Conventions enforced by the pipeline:**
- Each H2 has a `Source: https://...` line directly under it. The propagation script inherits this URL down through all descendants, so build scripts do NOT need to repeat URLs at H3/H4.
- Heading hierarchy must not skip levels (a jump from H2 to H4 confuses the tree builder).
- Plain markdown only: no HTML, no front matter.
- File slug becomes domain key (kebab-case slug → snake_case key in `DOCUMENT_REGISTRY`, e.g. `marketing-comms.md` → `marketing_comms`).
- One Act / one regulator / one code per file when possible. Cross-references between files are fine in body text.
For legislation-sourced content, `convert_legislation_html.py` (copied from ECE) already produces this format with per-section `LMS#####`/`DLM#####` source URLs. For PDF-sourced content (ASA TAC, council standards), the build script extracts text via `markitdown` or `pymupdf`, structures into H2/H3 sections, and inserts the canonical hub URL under each H2.
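These conventions are mechanical enough to lint before indexing. The sketch below is an illustration only, assuming the conventions exactly as listed above; `check_corpus` and `slug_to_domain_key` are hypothetical names, not functions in the existing pipeline. It flags H2 headings that lack a `Source:` line and headings that skip a level.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+\S")
SOURCE_RE = re.compile(r"^Source:\s+https?://\S+")


def slug_to_domain_key(slug: str) -> str:
    """Map a corpus file slug to its registry key (kebab-case to snake_case)."""
    return slug.replace("-", "_")


def check_corpus(text: str) -> list[str]:
    """Return a list of convention violations found in one corpus file."""
    problems = []
    lines = text.splitlines()
    prev_level = 0
    for i, line in enumerate(lines):
        match = HEADING_RE.match(line)
        if not match:
            continue
        level = len(match.group(1))
        # A heading may go at most one level deeper than the previous heading.
        if prev_level and level > prev_level + 1:
            problems.append(f"line {i + 1}: heading skips from H{prev_level} to H{level}")
        prev_level = level
        # Every H2 needs a Source line as the next non-blank line.
        if level == 2:
            following = next((l for l in lines[i + 1:] if l.strip()), "")
            if not SOURCE_RE.match(following):
                problems.append(f"line {i + 1}: H2 has no Source line directly under it")
    return problems
```

A check like this could run as part of the B11 sanity step, so that a malformed file is caught before any indexing spend.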
### HTML extraction fallback ladder
Most NZ govt targets are SilverStripe (comcom.govt.nz, acc.co.nz, hdc.org.nz, privacy.org.nz, most council sites), which serve semantic HTML5 that content extractors handle well. Legacy exceptions: medsafe.govt.nz (older ASP), asa.co.nz (industry body, likely WordPress).
Escalate only as needed:
1. **`markitdown`** (default): works for most pages.
2. **`trafilatura`** (Python; `uv add` if needed): drop-in for pages markitdown handles poorly.
3. **`defuddle`** (Node.js; manual pre-processing): escalation only. Likely needed (if at all) for medsafe legacy or asa.co.nz. Run `npx defuddle <url> --markdown > sources/raw/<file>.md` outside the build script, then read the pre-cleaned file.
PDF sources stay on `markitdown` / `pymupdf`. `legislation.govt.nz` content uses `convert_legislation_html.py` (custom parser tuned to its parliamentary drafting markup); don't apply general extractors there.
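One way a build script could express the first two rungs of the ladder is as an ordered list of extractor callables. This is a sketch under assumptions: `extract_with_fallback` is a hypothetical helper, the extractors are passed in as plain functions (thin wrappers around `markitdown` and `trafilatura` would be written separately), and the `min_chars` threshold is an arbitrary stand-in for "the output looks usable".

```python
from typing import Callable, Optional

# An extractor takes raw HTML and returns markdown, or None/"" on failure.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    html: str,
    ladder: list[tuple[str, Extractor]],
    min_chars: int = 200,
) -> tuple[str, str]:
    """Try each extractor in order; return (name, markdown) from the first usable result."""
    for name, extractor in ladder:
        try:
            markdown = extractor(html)
        except Exception:
            # A crashing extractor is treated the same as an empty result.
            markdown = None
        if markdown and len(markdown.strip()) >= min_chars:
            return name, markdown
    raise ValueError("no extractor produced usable output; pre-process the page manually")
```

Returning the extractor name alongside the text makes it easy to log which pages needed escalation, which in turn shows whether the `defuddle` step is ever required.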
## Section-level metadata flags
Some provisions are retrieved disproportionately often and benefit from explicit metadata so the router/retriever can surface them precisely. Build scripts attach a `tags:` line under the relevant H2/H3 heading:
```markdown
## Section 58 - Restrictions on advertising of medicines
Source: https://www.legislation.govt.nz/...
tags: testimonial-ban, medicines, frequently-cited
Section content...
```
**v2 mandatory flags:**
| Provision | Flag tags | Why |
|---|---|---|
| Medicines Act 1981 **s58** | `testimonial-ban, medicines, frequently-cited` | Workhorse provision for testimonial bans on medicines, devices, methods of treatment |
| Fair Trading Act 1986 **s12A** | `substantiation, claims, frequently-cited` | Most-tripped-over provision in health/wellness advertising |
| HPCA Act 2003 **title-protection sections** | `title-use, registration, scope` | Foundation for "can I call myself X?" questions |
| Dietary Supplements Regs 1985 **r3 & r5** | `classification, supplements, therapeutic-claim` | The "marketed therapeutically → reclassified as medicine" rule |
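To make the `tags:` convention concrete, here is a minimal sketch of how a downstream step might read the flags back out of a corpus file. It assumes the layout shown in the example above (heading, then `Source:`, then `tags:`, then body); `collect_tags` is a hypothetical helper, and how the existing router or retriever would consume its output is not specified here.

```python
def collect_tags(markdown: str) -> dict[str, list[str]]:
    """Map each heading title to the tags declared under it, before any body text."""
    tags_by_heading = {}
    current = None  # heading whose metadata block we are still inside
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            current = stripped.lstrip("#").strip()
        elif current and stripped.startswith("tags:"):
            raw = stripped[len("tags:"):]
            tags_by_heading[current] = [t.strip() for t in raw.split(",") if t.strip()]
        elif current and stripped and not stripped.startswith("Source:"):
            # The first body line ends the metadata block for this heading.
            current = None
    return tags_by_heading
```

Restricting tag detection to the metadata block directly under a heading avoids picking up a stray `tags:` string that happens to appear in quoted body text.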
**Council standards - scope tags (mandatory):**
Each council document gets a `binds:` metadata field listing the practitioner classes it binds. The retriever filters by binding when the query mentions a practitioner type.
```markdown
# Medical Council of NZ - Statement on Advertising
binds: medical-practitioners
benchmark-only: true
Source: https://www.mcnz.org.nz/...
```
This prevents the model from citing the Medical Council statement as authoritative for chiropractors or naturopaths (a real failure mode without the metadata).
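The filtering step can be sketched as follows. This is an illustration rather than the retriever's actual logic: `CouncilDoc`, `parse_header` and `split_by_binding` are hypothetical names, and the practitioner slug `chiropractors` used in the usage note is an assumed value (only `medical-practitioners` appears in the example above).

```python
from dataclasses import dataclass


@dataclass
class CouncilDoc:
    title: str
    binds: set[str]
    benchmark_only: bool = False


def parse_header(markdown: str) -> CouncilDoc:
    """Read the H1 title plus `binds:` / `benchmark-only:` lines from the top of a document."""
    title, binds, benchmark = "", set(), False
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("# "):
            title = line[2:].strip()
        elif line.startswith("binds:"):
            binds = {b.strip() for b in line[len("binds:"):].split(",") if b.strip()}
        elif line.startswith("benchmark-only:"):
            benchmark = line.split(":", 1)[1].strip().lower() == "true"
        elif line.startswith("## "):
            break  # the header block ends at the first H2
    return CouncilDoc(title, binds, benchmark)


def split_by_binding(docs: list[CouncilDoc], practitioner: str):
    """Separate documents that bind the practitioner from benchmark-only references."""
    authoritative = [d for d in docs if practitioner in d.binds]
    benchmark = [d for d in docs if practitioner not in d.binds and d.benchmark_only]
    return authoritative, benchmark
```

For a query about `chiropractors`, the Medical Council statement would land in the second list, so the generator can still mention it as a comparison point without citing it as binding.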
## Transition window metadata (ASA TAC)
The ASA Therapeutic and Health Advertising Code is mid-transition:
- **Current code**: applies until 1 April 2026 (for new ads), 1 July 2026 (for all ads)
- **December 2025 code**: applies from 1 April 2026 (new ads), 1 July 2026 (all ads). Materially different rules on testimonials, user-generated content, vulnerable audiences
Both codes go in the corpus with `effective_from` / `effective_until` metadata on each section. The generator's system prompt includes "today's date is X" and instructs the model to answer based on the code in force on that date.
```markdown
# ASA Therapeutic and Health Advertising Code (current)
effective_from: <unknown - in force prior to 2026>
effective_until: 2026-04-01 (for new advertising), 2026-07-01 (for all advertising)
```
```markdown
# ASA Therapeutic and Health Advertising Code (December 2025)
effective_from: 2026-04-01 (for new advertising), 2026-07-01 (for all advertising)
effective_until: null
```
Without this, the corpus answers correctly today but silently goes stale on a known date.
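The date logic implied by the two metadata blocks can be written down in a few lines. This is a sketch under stated assumptions: `code_in_force` and the labels `"current"` / `"december-2025"` are hypothetical, and the cutover dates are treated as inclusive start dates for the new code, which should be confirmed against the ASA's own wording.

```python
from datetime import date

# Cutover dates taken from the transition description above.
NEW_ADS_CUTOVER = date(2026, 4, 1)
ALL_ADS_CUTOVER = date(2026, 7, 1)


def code_in_force(today: date, new_advertising: bool) -> str:
    """Pick which ASA TAC version governs an ad on a given date.

    New advertising switches to the December 2025 code on 1 April 2026;
    advertising already in market switches on 1 July 2026.
    """
    cutover = NEW_ADS_CUTOVER if new_advertising else ALL_ADS_CUTOVER
    return "december-2025" if today >= cutover else "current"
```

Between the two cutover dates the answer depends on whether the ad is new, so a question asked in that window may need the generator to cover both cases rather than assume one.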
## Strategic notes
Commercial / expansion strategy (e.g. AU opportunity sizing) lives in `STRATEGY.md` at the repo root (gitignored, local-only).
## Build sequence
### Phase A - Repo bootstrap
| Step | Action | Command |
|---|---|---|
| A1 | Create new repo, copy ECE codebase | shell ops |
| A2 | Strip ECE corpus, indexes, raw sources, ECE-only build scripts | shell ops |
| A3 | Update `pyproject.toml` name/description | edit |
| A4 | Keep `convert_legislation_html.py`, `propagate_source_urls.py`, `build_indexes.py` | no-op |
### Phase B - Source acquisition (v2: 6 domains, ~17 documents)
| Step | Action | Notes |
|---|---|---|
| B1 | Identify canonical URLs and PDF locations per domain | see "Domains (v2)" table |
| B2 | Write `scripts/download_sources.sh` (or per-domain fetchers) | mirror ECE's pattern using curl/wget |
| B3 | Acquire raw HTML/PDF into `sources/raw/<domain>/` | run download script |
| B4 | Write `scripts/build_medicines_and_supplements_compilation.py` | Medicines Act (flag s58), Medicines Regs, **Dietary Supplements Regs**, Medsafe guidance |
| B5 | Write `scripts/build_advertising_standards_compilation.py` | ASA TAC current + Dec 2025 (with effective-date metadata) + General ASA Code |
| B6 | Write `scripts/build_consumer_protection_compilation.py` | Fair Trading Act (flag s12A) + ComCom guidance |
| B7 | Write `scripts/build_marketing_comms_compilation.py` | Privacy Act + HIPC + UEMA |
| B8 | Write `scripts/build_practitioner_regulation_compilation.py` | HPCA Act + HDC Code + ACC rules |
| B9 | Write `scripts/build_professional_codes_compilation.py` | Chiro/Osteo/Physio/Chinese Medicine + Medical Council benchmark; **emit `binds:` scope metadata** |
| B10 | `make corpus`: run all build scripts | produces `corpus/*.md` |
| B11 | Sanity-check: `wc -l corpus/*.md`, spot-check Source URLs and section flags | manual |
### Phase C - Domain registry & prompt tuning
| Step | Action | Command |
|---|---|---|
| C1 | Fill `DOCUMENT_REGISTRY` in `src/config.py` | edit |
| C2 | Rewrite router few-shot examples for healthcare scenarios | edit `src/router.py` |
| C3 | Rewrite generator SYSTEM_PROMPT for healthcare marketing voice | edit `src/generator.py` |
| C4 | Rewrite welcome + starter questions | edit `app.py` |
### Phase D - Index, test, explore
| Step | Action | Command |
|---|---|---|
| D1 | Verify cost estimate | `make dry-run` |
| D2 | Build indexes with Opus + auto-propagate URLs | `make index` |
| D3 | Smoke test | `make test` |
| D4 | Write 10–20 healthcare marketing benchmark questions | edit `benchmark/questions.json` |
| D5 | Quick benchmark | `make benchmark-quick` |
| D6 | Manual exploration | `make app-sonnet` |
## Verification
| Check | Command / action | Pass criteria |
|---|---|---|
| Repo boots | `make app-sonnet` | Streamlit serves on :8501, welcome renders |
| Index builds | `make dry-run` then `make index` | All 6 domains build, propagation runs after |
| URL propagation | `make propagate-urls` | All deep nodes have inherited Source URLs |
| Pipeline test | `make test` | One LLM call returns answer + citation |
| Benchmark sanity | `make benchmark-quick` | 3 questions, no exceptions, citations present |
| Single-domain query | Manual: one question per domain | Right domain selected, right content cited |
| Cross-domain query | Manual: "can I post a patient testimonial about my chiro practice on Instagram?" | Routes to ≥3 domains (advertising_standards + medicines_and_supplements [s58] + professional_codes [Chiro Board]); ASA answer reflects current code + flags Dec 2025 changes |
| Council scope-tagging | Manual: same question for "physio"; should NOT cite Medical Council statement as authoritative | `binds:` metadata respected by retriever |
| Transition window | Manual: ask same testimonial question with system date set to 2026-05-01 | Answer reflects Dec 2025 ASA TAC, not the current code |
| Local model swap | `make app-qwen` | Same questions answer via Qwen + MLX |
## Critical files (in ECE repo, paths in new repo will mirror)
- `src/config.py`: DOCUMENT_REGISTRY structure to copy
- `src/router.py`: few-shot pattern to replicate with healthcare scenarios
- `src/generator.py`: SYSTEM_PROMPT to rewrite without ECE-specific tone
- `src/retriever.py`: keeps existing source URL fallback logic
- `app.py`: welcome message + starter question buttons
- `Makefile`: build/index/propagate targets already wired
- `scripts/propagate_source_urls.py`: generic, ships with new repo unchanged
- `scripts/build_indexes.py`: generic, ships with new repo unchanged
## Out of scope for prototype
- Cloud deployment (defer)
- Auth / user accounts
- Persistent conversation history
- Te reo Māori handling
- Multi-tenancy
- Legal review of answer accuracy (disclaimer-only)
- Productisation as multi-domain framework
## Open questions
### Resolved in v2
- ✅ Repo location: `/Users/gregf/Documents/Workspace/Projects/Personal/health-marketing-compliance-rag/`
- ✅ Domain count: 6 domains (was 5 in v1)
- ✅ Audience: complementary/alternative practitioners + supplement sellers
- ✅ Council standards selection: Chiro/Osteo/Physio/Chinese Medicine + Medical (benchmark only)
- ✅ UEMA inclusion: yes, in marketing-comms cluster
### For Becki to confirm before Phase B starts
- **TAPS in v1?** Becki didn't explicitly comment. Recommendation: defer to v2, since TAPS is geared at therapeutic-product advertisers and is less directly relevant to the practitioner audience. Confirm.
- **ASA TAC handling at the transition**: does Becki want answers framed as "current code says X, from 1 April 2026 the new code says Y" *throughout*, or strictly "the code in force today says X, with no forward-looking commentary"? Affects generator prompt design.
- **Chinese Medicine Council vs HPCA Chinese Medicine specifics**: is the council document materially different from the HPCA-derived rules, or do they overlap heavily? Affects whether to ship both or just one.
- **Food Act / Standard 1.2.7 scope**: defer to v2 unless audience has serious supplemented-food product lines. Confirm.
### Acquisition phase (Phase B)
- ASA TAC PDF parsing quality: if structure is messy, may need manual cleanup pass. Both versions (current + Dec 2025) should be eyeballed before committing build script effort.
- Council standards PDF quality: Becki flagged this is the most likely domain to need manual cleanup. Eyeball each before committing build script effort.
- ACC provider rules: canonical source location and format are less standardised than the legislation/council docs; may need manual curation.