# Healthcare Marketing Compliance — RAG Prototype

## Document version

**v2 — incorporates Becki's domain-expert feedback** (collaborator review of v1 source-acquisition recommendations).

### v1 → v2 changes

- **Audience narrowed and clarified** — from generic "clinic/pharmacy/healthcare-product owner-operators" to **complementary/alternative practitioners + supplement sellers** (chiros, osteos, physios, Chinese medicine, naturopaths, supplement retailers)
- **Council standards swap** — drop Pharmacy/Dental Council standards from v1; add Chiropractic Board, Osteopathic Council, Physiotherapy Board, Chinese Medicine Council; keep Medical Council as scope-tagged benchmark only
- **Five new mandatory additions** — HPCA Act 2003, Dietary Supplements Regulations 1985, HDC Code of Rights, ASA TAC Dec 2025 (alongside current), ACC provider rules
- **Section-level metadata flagging** — Medicines Act s58 (testimonial bans) and Fair Trading Act s12A (substantiation) called out explicitly in build scripts
- **Transition window metadata** — ASA TAC effective dates (1 April 2026 / 1 July 2026) encoded as section metadata so corpus answers stay accurate across the transition
- **Domain restructure** — 5 → 6 domains, regrouped to match how marketers actually think about the problem
- **Acquisition list grows** from 11 to ~17 documents; cost still ~$10–$15 with Opus
- Commercial / expansion strategy notes added (held in local-only `STRATEGY.md`)

## Context

The ECE Compliance RAG prototype proved that PageIndex tree retrieval + LLM reasoning + multi-regulator corpus indexing produces a working compliance assistant. The architecture, pipeline and UI are reusable.

The next prototype applies the same approach to **NZ healthcare marketing regulation**, scoped to **complementary/alternative practitioners and supplement sellers** — an audience with high-stakes compliance pain (advertising claims, testimonials, title use, supplement classification, registration declarations) that spans multiple regulators with overlapping but non-identical requirements.

Goal: validate that the architecture transfers cleanly to a different legislative domain, and that targeting a coherent practitioner segment produces actionable answers. This is a second proof of concept, not a productisation step.

## Decisions locked in

- **Same tech stack** — Streamlit + uv + litellm + PageIndex
- **Keep local model switching** — MLX (Qwen, Gemma) options preserved alongside Claude API
- **No te reo Māori in prototype** — language detection code stays but is bypassed
- **Source acquisition via Python scripts** — same pattern as ECE. Per-domain `build_*_compilation.py` scripts read raw HTML/PDF from `sources/raw/` and emit markdown with `Source:` URLs in place
- **Deployment flexible** — local-only for prototype; cloud later if value is validated
- **Audience** — complementary/alternative practitioners (chiros, osteos, physios, Chinese medicine, naturopaths, acupuncturists) + supplement sellers
- **Indexing budget** — Opus is fine (~$10–$15 expected for the v2 corpus, slightly larger than v1)

## Repo strategy — staged

The long-term goal is a reusable compliance-RAG framework that benefits multiple domain projects. The path to get there is staged, not upfront.

### Stage 1 (now) — sibling fork

- Create new repo `health-marketing-compliance-rag` as a sibling to `ece-compliance-rag`
- Clean-copy the ECE codebase, strip ECE-specific content, refit for healthcare marketing
- Repos diverge freely; bug fixes copy-paste between them
- Deliberate discipline: keep `convert_legislation_html.py`, `propagate_source_urls.py`, `build_indexes.py` byte-identical between both repos to make later extraction painless

### Stage 2 (trigger: scoping a third domain) — template repo

- Create `compliance-rag-template` containing only what's genuinely shared across both ECE and healthcare-marketing experience
- New domain projects start with `gh repo create --template compliance-rag-template`
- Existing prototypes can optionally rebase onto template or stay as-is — per-project decision
- This is where the "rule of three" pays off: by then we'll have hindsight on what was shared vs domain-coloured

### Stage 3 (trigger: this becomes a product, not just prototypes) — installable package

- Publish framework as a versioned Python package (`compliance-rag-core`) on private PyPI or GitHub Packages
- Domain repos depend on a pinned version: `compliance-rag-core==0.3.1`
- Framework releases propagate to all consumers explicitly, on next bump
- This is where "shared improvements" actually scales

### Why not extract framework now

Rule of three. Frameworks built from one prototype over-fit to that prototype. ECE alone produced the *current* shape, but healthcare marketing will reveal which parts of that shape were ECE-specific in disguise. Premature abstraction bakes in wrong assumptions. Forking is cheap; refactoring a wrong abstraction is expensive.

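The Stage 1 discipline of keeping the three shared scripts byte-identical is easy to let slip once the repos diverge. A minimal sketch of a drift check follows, assuming both repos are checked out locally and the shared scripts live under `scripts/` in each; `sha256_of` and `find_drift` are hypothetical helper names, not part of either repo.

```python
import hashlib
from pathlib import Path
from typing import List, Optional

# The three scripts the plan keeps byte-identical across both repos.
SHARED_SCRIPTS = [
    "scripts/convert_legislation_html.py",
    "scripts/propagate_source_urls.py",
    "scripts/build_indexes.py",
]


def sha256_of(path: Path) -> Optional[str]:
    """Return the SHA-256 hex digest of a file, or None if it is missing."""
    if not path.is_file():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_drift(repo_a: Path, repo_b: Path, rel_paths=SHARED_SCRIPTS) -> List[str]:
    """List relative paths that are missing from repo_a or differ between the repos."""
    drifted = []
    for rel in rel_paths:
        digest_a = sha256_of(repo_a / rel)
        digest_b = sha256_of(repo_b / rel)
        # A file missing from repo_a, missing from repo_b, or with different
        # bytes all count as drift.
        if digest_a is None or digest_a != digest_b:
            drifted.append(rel)
    return drifted
```

Calling `find_drift(Path("../ece-compliance-rag"), Path("."))` from the new repo would return an empty list while the discipline holds. Comparing hashes rather than diffs keeps the check cheap enough to run on every commit.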
### Anti-patterns explicitly avoided

- **Git submodules** — drift management nightmare; skip directly to package-based sharing at Stage 3
- **Monorepo tooling** (Bazel, Nx, Turborepo) — overhead too high for solo prototype work
- **Speculative abstraction** — no shared base classes "just in case"; only extract what's already shared verbatim across two repos

## Reuse vs. replace

### Reused as-is

- `src/pipeline.py`, `src/retriever.py`, `src/router.py`, `src/generator.py` — pipeline core
- `src/config.py` framework (DOCUMENT_REGISTRY pattern, model presets)
- `src/usage.py`, `src/language.py` (kept dormant — no language switching active)
- Multi-model switching (litellm + MLX server orchestration in Makefile)
- `scripts/build_indexes.py` — PageIndex tree builder
- `scripts/propagate_source_urls.py` — URL inheritance through tree
- `app.py` — Streamlit shell (welcome + chat + sidebar)
- `Makefile` — build pipeline, per-domain index targets, propagate hooks
- `test_pipeline.py`, `benchmark/run_benchmark.py` — eval framework

### Replaced (domain-specific, written fresh)

- `corpus/*.md` — produced by new per-domain build scripts
- `indexes/*.json` — rebuilt from new corpus
- `sources/raw/*` — raw HTML/PDF acquired from healthcare regulators
- `DOCUMENT_REGISTRY` in `src/config.py`
- Router few-shot examples in `src/router.py`
- Generator `SYSTEM_PROMPT` in `src/generator.py`
- Welcome message + starter questions in `app.py`
- `benchmark/questions.json`

### Adapted from ECE templates (rewritten with healthcare URLs/parsers)

- `scripts/build_medicines_and_supplements_compilation.py` — Medicines Act 1981 (with **s58 flagged**), Medicines Regs 1984, **Dietary Supplements Regs 1985**, Medsafe advertising guidance
- `scripts/build_advertising_standards_compilation.py` — ASA TAC current + **ASA TAC Dec 2025 (with effective-date metadata)** + General ASA Advertising Standards Code
- `scripts/build_consumer_protection_compilation.py` — Fair Trading Act 1986 (with **s12A flagged**) + ComCom Health & Wellness Claims guidance
- `scripts/build_marketing_comms_compilation.py` — Privacy Act 2020, Health Information Privacy Code 2020, **UEMA 2007** (the "can I email this list?" cluster)
- `scripts/build_practitioner_regulation_compilation.py` — **HPCA Act 2003**, **HDC Code of Rights**, **ACC provider rules**
- `scripts/build_professional_codes_compilation.py` — **Chiropractic Board, Osteopathic Council, Physiotherapy Board, Chinese Medicine Council** standards + Medical Council (scope-tagged as benchmark only)
- `scripts/convert_legislation_html.py` — copied from ECE, works unchanged for Medicines Act, Medicines Regs, Dietary Supplements Regs, Fair Trading Act, Privacy Act, HPCA Act, UEMA (all on legislation.govt.nz with same structure)
- `scripts/download_sources.sh` (or equivalent) — source-fetching commands

### Removed

- ECE-specific build scripts (`scripts/build_ero_compilation.py`, `build_reform_compilation.py`)
- ECE corpus, indexes, source raw files
- ECE benchmark questions

## Domains (v2)

Six domains, restructured to match how the audience thinks about compliance ("can I say X?", "can I email this list?", "can I call myself Y?"):

| Domain key | Coverage | Acquisition source |
|---|---|---|
| `medicines_and_supplements` | Medicines Act 1981 (Parts 4 & 5; **s58 testimonial ban flagged**), Medicines Regs 1984, **Dietary Supplements Regs 1985** (the "marketed therapeutically → reclassified as medicine" trapdoor), Medsafe advertising guidance | legislation.govt.nz (HTML) + medsafe.govt.nz (HTML/PDF) |
| `advertising_standards` | ASA Therapeutic and Health Advertising Code (current) + **ASA TAC Dec 2025** (applies 1 Apr 2026 / 1 Jul 2026 — both kept with effective-date metadata) + General ASA Advertising Standards Code | asa.co.nz (PDFs) |
| `consumer_protection` | Fair Trading Act 1986 (Part 1; **s12A substantiation flagged**) + ComCom Health & Wellness Claims guidance | legislation.govt.nz (HTML) + comcom.govt.nz (HTML/PDF) |
| `marketing_comms` | Privacy Act 2020 (IPP 10, IPP 11), Health Information Privacy Code 2020, **Unsolicited Electronic Messages Act 2007** | legislation.govt.nz (HTML) + privacy.org.nz (PDF) |
| `practitioner_regulation` | **HPCA Act 2003** (titles, scopes of practice, restricted activities), **HDC Code of Rights** (Rights 6 & 7 — information and informed consent), **ACC provider rules** | legislation.govt.nz + hdc.org.nz + acc.co.nz |
| `professional_codes` | **Chiropractic Board**, **Osteopathic Council**, **Physiotherapy Board**, **Chinese Medicine Council** standards on advertising; Medical Council Statement on Advertising (**scope-tagged as benchmark only — does NOT bind non-MD practitioners**) | chiropracticboard.org.nz, osteopathiccouncil.org.nz, physioboard.org.nz, chinesemedicinecouncil.org.nz, mcnz.org.nz |

Six domains, ~17 documents, comparable to ECE's complexity. The `professional_codes` domain requires per-document scope metadata (which practitioners each standard binds) — see "Section-level metadata flags" below.

## Corpus format produced by build scripts

Each `build_*_compilation.py` script must emit markdown matching the conventions the existing pipeline reads.

**File layout (v2):**

```
corpus/
  medicines-and-supplements.md
  advertising-standards.md
  consumer-protection.md
  marketing-comms.md
  practitioner-regulation.md
  professional-codes.md
```

**Per-file structure:**

```markdown
# Domain Title
Source: https://canonical-hub-url

One-paragraph orientation describing what this corpus covers and who issues it.

## Section Title (H2)
Source: https://specific-page-or-section-url

Section content in plain markdown...

### Subsection (H3)

Subsection content...
```

**Conventions enforced by the pipeline:**

- Each H2 has a `Source: https://...` line directly under it. The propagation script inherits this URL down through all descendants — build scripts do NOT need to repeat URLs at H3/H4.
- Heading hierarchy must not skip levels (H2 → H4 confuses the tree builder).
- Plain markdown only — no HTML, no front matter.
- File slug becomes domain key (slug case → snake case in `DOCUMENT_REGISTRY`).
- One Act / one regulator / one code per file when possible. Cross-references between files are fine in body text.

For legislation-sourced content, `convert_legislation_html.py` (copied from ECE) already produces this format with per-section `LMS#####`/`DLM#####` source URLs. For PDF-sourced content (ASA TAC, council standards), the build script extracts text via `markitdown` or `pymupdf`, structures into H2/H3 sections, and inserts the canonical hub URL under each H2.

### HTML extraction fallback ladder

Most NZ govt targets are SilverStripe (comcom.govt.nz, acc.co.nz, hdc.org.nz, privacy.org.nz, most council sites) — semantic HTML5, content extractors work well. Legacy exceptions: medsafe.govt.nz (older ASP), asa.co.nz (industry body, likely WordPress). Escalate only as needed:

1. **`markitdown`** (default) — works for most pages.
2. **`trafilatura`** (Python; `uv add` if needed) — drop-in for pages markitdown handles poorly.
3. **`defuddle`** (Node.js; manual pre-processing) — escalation only. Likely needed (if at all) for medsafe legacy or asa.co.nz. Run `npx defuddle --markdown > sources/raw/.md` outside the build script, then read the pre-cleaned file.

PDF sources stay on `markitdown` / `pymupdf`. `legislation.govt.nz` content uses `convert_legislation_html.py` (custom parser tuned to its parliamentary drafting markup) — don't apply general extractors there.

## Section-level metadata flags

Some provisions are retrieved disproportionately often and benefit from explicit metadata so the router/retriever can surface them precisely. Build scripts attach a `tags:` line under the relevant H2/H3 heading:

```markdown
## Section 58 — Restrictions on advertising of medicines
Source: https://www.legislation.govt.nz/...
tags: testimonial-ban, medicines, frequently-cited

Section content...
```

**v2 mandatory flags:**

| Provision | Flag tags | Why |
|---|---|---|
| Medicines Act 1981 **s58** | `testimonial-ban, medicines, frequently-cited` | Workhorse provision for testimonial bans on medicines, devices, methods of treatment |
| Fair Trading Act 1986 **s12A** | `substantiation, claims, frequently-cited` | Most-tripped-over provision in health/wellness advertising |
| HPCA Act 2003 **title-protection sections** | `title-use, registration, scope` | Foundation for "can I call myself X?" questions |
| Dietary Supplements Regs 1985 **r3 & r5** | `classification, supplements, therapeutic-claim` | The "marketed therapeutically → reclassified as medicine" rule |

**Council standards — scope tags (mandatory):**

Each council document gets a `binds:` metadata field listing the practitioner classes it binds. The retriever filters by binding when the query mentions a practitioner type.

```markdown
# Medical Council of NZ — Statement on Advertising
binds: medical-practitioners
benchmark-only: true
Source: https://www.mcnz.org.nz/...
```

This prevents the model from citing the Medical Council statement as authoritative for chiropractors or naturopaths (a real failure mode without the metadata).

## Transition window metadata (ASA TAC)

The ASA Therapeutic and Health Advertising Code is mid-transition:

- **Current code** — applies until 1 April 2026 (for new ads), 1 July 2026 (for all ads)
- **December 2025 code** — applies from 1 April 2026 (new ads), 1 July 2026 (all ads). Materially different rules on testimonials, user-generated content, vulnerable audiences

Both codes go in the corpus with `effective_from` / `effective_until` metadata on each section. The generator's system prompt includes "today's date is X" and instructs the model to answer based on the code in force on that date.

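The date-based selection can be sketched in a few lines. This is an illustration under stated assumptions, not the generator's actual code: `code_in_force` and `date_preamble` are hypothetical names, the preamble wording is invented for the example, and the cutover day itself is assumed to be the first day the December 2025 code applies.

```python
from datetime import date

# Cutover dates from the transition description above.
NEW_ADS_CUTOVER = date(2026, 4, 1)  # Dec 2025 code applies to new advertising
ALL_ADS_CUTOVER = date(2026, 7, 1)  # Dec 2025 code applies to all advertising

CURRENT = "ASA Therapeutic and Health Advertising Code (current)"
DEC_2025 = "ASA Therapeutic and Health Advertising Code (December 2025)"


def code_in_force(today: date, is_new_ad: bool) -> str:
    """Return the title of the code version that applies on `today`.

    Assumption: on the cutover date itself the December 2025 code applies.
    """
    cutover = NEW_ADS_CUTOVER if is_new_ad else ALL_ADS_CUTOVER
    return DEC_2025 if today >= cutover else CURRENT


def date_preamble(today: date) -> str:
    """Build a hypothetical system-prompt preamble stating which code applies."""
    return (
        f"Today's date is {today.isoformat()}. "
        f"For new advertising, answer from: {code_in_force(today, True)}. "
        f"For advertising already in market, answer from: {code_in_force(today, False)}."
    )
```

With the system date set to 2026-05-01, `code_in_force(date(2026, 5, 1), True)` selects the December 2025 code for a new ad, while an ad already in market still falls under the current code until 1 July 2026.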
```markdown
# ASA Therapeutic and Health Advertising Code (current)
effective_from:
effective_until: 2026-04-01 (for new advertising), 2026-07-01 (for all advertising)
```

```markdown
# ASA Therapeutic and Health Advertising Code (December 2025)
effective_from: 2026-04-01 (for new advertising), 2026-07-01 (for all advertising)
effective_until: null
```

Without this, the corpus answers correctly today but silently goes stale on a known date.

## Strategic notes

Commercial / expansion strategy (e.g. AU opportunity sizing) lives in `STRATEGY.md` at the repo root — gitignored, local-only.

## Build sequence

### Phase A — Repo bootstrap

| Step | Action | Command |
|---|---|---|
| A1 | Create new repo, copy ECE codebase | shell ops |
| A2 | Strip ECE corpus, indexes, raw sources, ECE-only build scripts | shell ops |
| A3 | Update `pyproject.toml` name/description | edit |
| A4 | Keep `convert_legislation_html.py`, `propagate_source_urls.py`, `build_indexes.py` | no-op |

### Phase B — Source acquisition (v2: 6 domains, ~17 documents)

| Step | Action | Notes |
|---|---|---|
| B1 | Identify canonical URLs and PDF locations per domain | see "Domains (v2)" table |
| B2 | Write `scripts/download_sources.sh` (or per-domain fetchers) | mirror ECE's pattern using curl/wget |
| B3 | Acquire raw HTML/PDF into `sources/raw//` | run download script |
| B4 | Write `scripts/build_medicines_and_supplements_compilation.py` | Medicines Act (flag s58), Medicines Regs, **Dietary Supplements Regs**, Medsafe guidance |
| B5 | Write `scripts/build_advertising_standards_compilation.py` | ASA TAC current + Dec 2025 (with effective-date metadata) + General ASA Code |
| B6 | Write `scripts/build_consumer_protection_compilation.py` | Fair Trading Act (flag s12A) + ComCom guidance |
| B7 | Write `scripts/build_marketing_comms_compilation.py` | Privacy Act + HIPC + UEMA |
| B8 | Write `scripts/build_practitioner_regulation_compilation.py` | HPCA Act + HDC Code + ACC rules |
| B9 | Write `scripts/build_professional_codes_compilation.py` | Chiro/Osteo/Physio/Chinese Medicine + Medical Council benchmark; **emit `binds:` scope metadata** |
| B10 | `make corpus` — run all build scripts | produces `corpus/*.md` |
| B11 | Sanity-check: `wc -l corpus/*.md`, spot-check Source URLs and section flags | manual |

### Phase C — Domain registry & prompt tuning

| Step | Action | Command |
|---|---|---|
| C1 | Fill `DOCUMENT_REGISTRY` in `src/config.py` | edit |
| C2 | Rewrite router few-shot examples for healthcare scenarios | edit `src/router.py` |
| C3 | Rewrite generator SYSTEM_PROMPT for healthcare marketing voice | edit `src/generator.py` |
| C4 | Rewrite welcome + starter questions | edit `app.py` |

### Phase D — Index, test, explore

| Step | Action | Command |
|---|---|---|
| D1 | Verify cost estimate | `make dry-run` |
| D2 | Build indexes with Opus + auto-propagate URLs | `make index` |
| D3 | Smoke test | `make test` |
| D4 | Write 10–20 healthcare marketing benchmark questions | edit `benchmark/questions.json` |
| D5 | Quick benchmark | `make benchmark-quick` |
| D6 | Manual exploration | `make app-sonnet` |

## Verification

| Check | Command / action | Pass criteria |
|---|---|---|
| Repo boots | `make app-sonnet` | Streamlit serves on :8501, welcome renders |
| Index builds | `make dry-run` then `make index` | All 6 domains build, propagation runs after |
| URL propagation | `make propagate-urls` | All deep nodes have inherited Source URLs |
| Pipeline test | `make test` | One LLM call returns answer + citation |
| Benchmark sanity | `make benchmark-quick` | 3 questions, no exceptions, citations present |
| Single-domain query | Manual — one question per domain | Right domain selected, right content cited |
| Cross-domain query | Manual: "can I post a patient testimonial about my chiro practice on Instagram?" | Routes to ≥3 domains (`advertising_standards` + `medicines_and_supplements` [s58] + `professional_codes` [Chiro Board]); ASA answer reflects current code + flags Dec 2025 changes |
| Council scope-tagging | Manual: same question for "physio" — should NOT cite Medical Council statement as authoritative | `binds:` metadata respected by retriever |
| Transition window | Manual: ask same testimonial question with system date set to 2026-05-01 | Answer reflects Dec 2025 ASA TAC, not the current code |
| Local model swap | `make app-qwen` | Same questions answer via Qwen + MLX |

## Critical files (in ECE repo, paths in new repo will mirror)

- `src/config.py` — DOCUMENT_REGISTRY structure to copy
- `src/router.py` — few-shot pattern to replicate with healthcare scenarios
- `src/generator.py` — SYSTEM_PROMPT to rewrite without ECE-specific tone
- `src/retriever.py` — keeps existing source URL fallback logic
- `app.py` — welcome message + starter question buttons
- `Makefile` — build/index/propagate targets already wired
- `scripts/propagate_source_urls.py` — generic, ships with new repo unchanged
- `scripts/build_indexes.py` — generic, ships with new repo unchanged

## Out of scope for prototype

- Cloud deployment (defer)
- Auth / user accounts
- Persistent conversation history
- Te reo Māori handling
- Multi-tenancy
- Legal review of answer accuracy (disclaimer-only)
- Productisation as multi-domain framework

## Open questions

### Resolved in v2

- ✅ Repo location — `/Users/gregf/Documents/Workspace/Projects/Personal/health-marketing-compliance-rag/`
- ✅ Domain count — 6 domains (was 5 in v1)
- ✅ Audience — complementary/alternative practitioners + supplement sellers
- ✅ Council standards selection — Chiro/Osteo/Physio/Chinese Medicine + Medical (benchmark only)
- ✅ UEMA inclusion — yes, in marketing-comms cluster

### For Becki to confirm before Phase B starts

- **TAPS in v1?** Becki didn't explicitly comment. Recommendation: defer to v2 — TAPS is geared at therapeutic-product advertisers, less directly relevant to the practitioner audience. Confirm.
- **ASA TAC handling at the transition** — does Becki want answers framed as "current code says X, from 1 April 2026 the new code says Y" *throughout*, or strictly "the code in force today says X, with no forward-looking commentary"? Affects generator prompt design.
- **Chinese Medicine Council vs HPCA Chinese Medicine specifics** — is the council document materially different from the HPCA-derived rules, or do they overlap heavily? Affects whether to ship both or just one.
- **Food Act / Standard 1.2.7 scope** — defer to v2 unless audience has serious supplemented-food product lines. Confirm.

### Acquisition phase (Phase B)

- ASA TAC PDF parsing quality — if structure is messy, may need manual cleanup pass. Both versions (current + Dec 2025) should be eyeballed before committing build script effort.
- Council standards PDF quality — Becki flagged this is the most likely domain to need manual cleanup. Eyeball each before committing build script effort.
- ACC provider rules — canonical source location and format are less standardised than the legislation/council docs; may need manual curation.
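
Before eyeballing PDF-derived corpus files, the mechanical conventions listed under "Corpus format produced by build scripts" can be checked automatically. The sketch below is an assumption about how such a check might look, not part of the existing pipeline: `check_corpus` is a hypothetical name, and it assumes blank lines between an H2 and its `Source:` line are tolerated.

```python
import re
from typing import List

# ATX heading: one to six '#' characters followed by whitespace and text.
HEADING_RE = re.compile(r"^(#{1,6})\s+\S")


def check_corpus(text: str) -> List[str]:
    """Return a list of convention violations found in one corpus file."""
    problems: List[str] = []
    lines = text.splitlines()
    prev_level = 0
    in_fence = False
    for idx, line in enumerate(lines):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore '#' lines inside fenced blocks
            continue
        if in_fence:
            continue
        match = HEADING_RE.match(line)
        if not match:
            continue
        level = len(match.group(1))
        # Convention: heading hierarchy must not skip levels (e.g. H2 -> H4).
        if prev_level and level > prev_level + 1:
            problems.append(
                f"line {idx + 1}: heading jumps from H{prev_level} to H{level}"
            )
        prev_level = level
        # Convention: every H2 carries a Source URL under it.
        if level == 2:
            following = next(
                (later.strip() for later in lines[idx + 1:] if later.strip()), ""
            )
            if not following.startswith("Source: https://"):
                problems.append(f"line {idx + 1}: H2 has no Source: URL under it")
    return problems
```

Running this over each file in `corpus/` after `make corpus` would narrow the manual pass to files that report problems or whose extracted text looks wrong despite passing.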