# Healthcare Marketing Compliance: RAG Prototype

## Document version

**v2: incorporates Becki's domain-expert feedback** (collaborator review of v1 source-acquisition recommendations).

### v1 → v2 changes

- **Audience narrowed and clarified**: from generic "clinic/pharmacy/healthcare-product owner-operators" to **complementary/alternative practitioners + supplement sellers** (chiros, osteos, physios, Chinese medicine, naturopaths, supplement retailers)
- **Council standards swap**: drop Pharmacy/Dental Council standards from v1; add Chiropractic Board, Osteopathic Council, Physiotherapy Board, Chinese Medicine Council; keep Medical Council as scope-tagged benchmark only
- **Five new mandatory additions**: HPCA Act 2003, Dietary Supplements Regulations 1985, HDC Code of Rights, ASA TAC Dec 2025 (alongside current), ACC provider rules
- **Section-level metadata flagging**: Medicines Act s58 (testimonial bans) and Fair Trading Act s12A (substantiation) called out explicitly in build scripts
- **Transition window metadata**: ASA TAC effective dates (1 April 2026 / 1 July 2026) encoded as section metadata so corpus answers stay accurate across the transition
- **Domain restructure**: 5 → 6 domains, regrouped to match how marketers actually think about the problem
- **Acquisition list grows** from 11 to ~17 documents; cost still ~$10-$15 with Opus
- Commercial / expansion strategy notes added (held in local-only `STRATEGY.md`)
## Context

The ECE Compliance RAG prototype proved that PageIndex tree retrieval + LLM reasoning + multi-regulator corpus indexing produces a working compliance assistant. The architecture, pipeline and UI are reusable. The next prototype applies the same approach to **NZ healthcare marketing regulation**, scoped to **complementary/alternative practitioners and supplement sellers**, an audience with high-stakes compliance pain (advertising claims, testimonials, title use, supplement classification, registration declarations) that spans multiple regulators with overlapping but non-identical requirements.

Goal: validate that the architecture transfers cleanly to a different legislative domain, and that targeting a coherent practitioner segment produces actionable answers. This is a second proof of concept, not a productisation step.
## Decisions locked in

- **Same tech stack**: Streamlit + uv + litellm + PageIndex
- **Keep local model switching**: MLX (Qwen, Gemma) options preserved alongside Claude API
- **No te reo Māori in prototype**: language detection code stays but is bypassed
- **Source acquisition via Python scripts**: same pattern as ECE. Per-domain `build_*_compilation.py` scripts read raw HTML/PDF from `sources/raw/` and emit markdown with `Source:` URLs in place
- **Deployment flexible**: local-only for prototype; cloud later if value is validated
- **Audience**: complementary/alternative practitioners (chiros, osteos, physios, Chinese medicine, naturopaths, acupuncturists) + supplement sellers
- **Indexing budget**: Opus is fine (~$10-$15 expected for the v2 corpus, slightly larger than v1)
## Repo strategy: staged

The long-term goal is a reusable compliance-RAG framework that benefits multiple domain projects. The path to get there is staged, not upfront.

### Stage 1 (now): sibling fork

- Create new repo `health-marketing-compliance-rag` as a sibling to `ece-compliance-rag`
- Clean-copy the ECE codebase, strip ECE-specific content, refit for healthcare marketing
- Repos diverge freely; bug fixes are copy-pasted between them
- Deliberate discipline: keep `convert_legislation_html.py`, `propagate_source_urls.py`, `build_indexes.py` byte-identical between both repos to make later extraction painless
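The byte-identical discipline is easy to let slip when fixes are copy-pasted by hand, so a small drift check may help. The sketch below compares SHA-256 digests of the three shared scripts across two checkouts. The function names and the idea of running it as a pre-commit or CI step are assumptions for illustration, not part of the existing ECE tooling.

```python
import hashlib
from pathlib import Path

# Scripts the plan says must stay byte-identical across both repos.
SHARED_SCRIPTS = (
    "scripts/convert_legislation_html.py",
    "scripts/propagate_source_urls.py",
    "scripts/build_indexes.py",
)


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_drift(repo_a: Path, repo_b: Path, relative_paths=SHARED_SCRIPTS) -> list[str]:
    """List shared scripts that are missing from either repo or differ in content."""
    drifted = []
    for rel in relative_paths:
        a, b = repo_a / rel, repo_b / rel
        if not (a.is_file() and b.is_file()) or sha256_of(a) != sha256_of(b):
            drifted.append(rel)
    return drifted
```

Calling `find_drift(Path("../ece-compliance-rag"), Path("."))` from the new repo would return an empty list while the discipline holds; the relative checkout locations here are placeholders.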
### Stage 2 (trigger: scoping a third domain): template repo

- Create `compliance-rag-template` containing only what's genuinely shared across both ECE and healthcare-marketing experience
- New domain projects start with `gh repo create --template compliance-rag-template`
- Existing prototypes can optionally rebase onto template or stay as-is (per-project decision)
- This is where the "rule of three" pays off: by then we'll have hindsight on what was shared vs domain-coloured

### Stage 3 (trigger: this becomes a product, not just prototypes): installable package

- Publish framework as a versioned Python package (`compliance-rag-core`) on private PyPI or GitHub Packages
- Domain repos depend on a pinned version: `compliance-rag-core==0.3.1`
- Framework releases propagate to all consumers explicitly, on next bump
- This is where "shared improvements" actually scales

### Why not extract framework now

Rule of three. Frameworks built from one prototype over-fit to that prototype. ECE alone produced the *current* shape, but healthcare marketing will reveal which parts of that shape were ECE-specific in disguise. Premature abstraction bakes in wrong assumptions. Forking is cheap; refactoring a wrong abstraction is expensive.

### Anti-patterns explicitly avoided

- **Git submodules**: drift management nightmare; skip directly to package-based sharing at Stage 3
- **Monorepo tooling** (Bazel, Nx, Turborepo): overhead too high for solo prototype work
- **Speculative abstraction**: no shared base classes "just in case"; only extract what's already shared verbatim across two repos
## Reuse vs. replace

### Reused as-is

- `src/pipeline.py`, `src/retriever.py`, `src/router.py`, `src/generator.py`: pipeline core
- `src/config.py` framework (DOCUMENT_REGISTRY pattern, model presets)
- `src/usage.py`, `src/language.py` (kept dormant; no language switching active)
- Multi-model switching (litellm + MLX server orchestration in Makefile)
- `scripts/build_indexes.py`: PageIndex tree builder
- `scripts/propagate_source_urls.py`: URL inheritance through tree
- `app.py`: Streamlit shell (welcome + chat + sidebar)
- `Makefile`: build pipeline, per-domain index targets, propagate hooks
- `test_pipeline.py`, `benchmark/run_benchmark.py`: eval framework

### Replaced (domain-specific, written fresh)

- `corpus/*.md`: produced by new per-domain build scripts
- `indexes/*.json`: rebuilt from new corpus
- `sources/raw/*`: raw HTML/PDF acquired from healthcare regulators
- `DOCUMENT_REGISTRY` in `src/config.py`
- Router few-shot examples in `src/router.py`
- Generator `SYSTEM_PROMPT` in `src/generator.py`
- Welcome message + starter questions in `app.py`
- `benchmark/questions.json`

### Adapted from ECE templates (rewritten with healthcare URLs/parsers)

- `scripts/build_medicines_and_supplements_compilation.py`: Medicines Act 1981 (with **s58 flagged**), Medicines Regs 1984, **Dietary Supplements Regs 1985**, Medsafe advertising guidance
- `scripts/build_advertising_standards_compilation.py`: ASA TAC current + **ASA TAC Dec 2025 (with effective-date metadata)** + General ASA Advertising Standards Code
- `scripts/build_consumer_protection_compilation.py`: Fair Trading Act 1986 (with **s12A flagged**) + ComCom Health & Wellness Claims guidance
- `scripts/build_marketing_comms_compilation.py`: Privacy Act 2020, Health Information Privacy Code 2020, **UEMA 2007** (the "can I email this list?" cluster)
- `scripts/build_practitioner_regulation_compilation.py`: **HPCA Act 2003**, **HDC Code of Rights**, **ACC provider rules**
- `scripts/build_professional_codes_compilation.py`: **Chiropractic Board, Osteopathic Council, Physiotherapy Board, Chinese Medicine Council** standards + Medical Council (scope-tagged as benchmark only)
- `scripts/convert_legislation_html.py`: copied from ECE, works unchanged for Medicines Act, Medicines Regs, Dietary Supplements Regs, Fair Trading Act, Privacy Act, HPCA Act, UEMA (all on legislation.govt.nz with same structure)
- `scripts/download_sources.sh` (or equivalent): source-fetching commands

### Removed

- ECE-specific build scripts (`scripts/build_ero_compilation.py`, `build_reform_compilation.py`)
- ECE corpus, indexes, source raw files
- ECE benchmark questions
## Domains (v2)

Six domains, restructured to match how the audience thinks about compliance ("can I say X?", "can I email this list?", "can I call myself Y?"):

| Domain key | Coverage | Acquisition source |
|---|---|---|
| `medicines_and_supplements` | Medicines Act 1981 (Parts 4 & 5; **s58 testimonial ban flagged**), Medicines Regs 1984, **Dietary Supplements Regs 1985** (the "marketed therapeutically → reclassified as medicine" trapdoor), Medsafe advertising guidance | legislation.govt.nz (HTML) + medsafe.govt.nz (HTML/PDF) |
| `advertising_standards` | ASA Therapeutic and Health Advertising Code (current) + **ASA TAC Dec 2025** (applies 1 Apr 2026 / 1 Jul 2026; both kept with effective-date metadata) + General ASA Advertising Standards Code | asa.co.nz (PDFs) |
| `consumer_protection` | Fair Trading Act 1986 (Part 1; **s12A substantiation flagged**) + ComCom Health & Wellness Claims guidance | legislation.govt.nz (HTML) + comcom.govt.nz (HTML/PDF) |
| `marketing_comms` | Privacy Act 2020 (IPP 10, IPP 11), Health Information Privacy Code 2020, **Unsolicited Electronic Messages Act 2007** | legislation.govt.nz (HTML) + privacy.org.nz (PDF) |
| `practitioner_regulation` | **HPCA Act 2003** (titles, scopes of practice, restricted activities), **HDC Code of Rights** (Rights 6 & 7: information and informed consent), **ACC provider rules** | legislation.govt.nz + hdc.org.nz + acc.co.nz |
| `professional_codes` | **Chiropractic Board**, **Osteopathic Council**, **Physiotherapy Board**, **Chinese Medicine Council** standards on advertising; Medical Council Statement on Advertising (**scope-tagged as benchmark only; does NOT bind non-MD practitioners**) | chiropracticboard.org.nz, osteopathiccouncil.org.nz, physioboard.org.nz, chinesemedicinecouncil.org.nz, mcnz.org.nz |

Six domains, ~17 documents, comparable to ECE's complexity. The `professional_codes` domain requires per-document scope metadata (which practitioners each standard binds); see "Section-level metadata flags" below.
## Corpus format produced by build scripts

Each `build_<domain>_compilation.py` script must emit markdown matching the conventions the existing pipeline reads.

**File layout (v2):**

```
corpus/
  medicines-and-supplements.md
  advertising-standards.md
  consumer-protection.md
  marketing-comms.md
  practitioner-regulation.md
  professional-codes.md
```
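Each file slug in this layout corresponds to a snake-case domain key in `DOCUMENT_REGISTRY`. A minimal sketch of that mapping follows; the helper name `domain_key` is hypothetical, and the real registry may derive keys differently.

```python
from pathlib import Path


def domain_key(corpus_file: str) -> str:
    """Map a corpus filename such as 'marketing-comms.md' to 'marketing_comms'."""
    # Drop the directory and extension, then swap hyphens for underscores.
    return Path(corpus_file).stem.replace("-", "_")
```

Applied to `corpus/medicines-and-supplements.md`, this yields `medicines_and_supplements`, matching the domain key used in the domains table.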
**Per-file structure:**

```markdown
# Domain Title
Source: https://canonical-hub-url

One-paragraph orientation describing what this corpus covers and who issues it.

## Section Title (H2)
Source: https://specific-page-or-section-url

Section content in plain markdown...

### Subsection (H3)

Subsection content...
```

**Conventions enforced by the pipeline:**

- Each H2 has a `Source: https://...` line directly under it. The propagation script inherits this URL down through all descendants, so build scripts do NOT need to repeat URLs at H3/H4.
- Heading hierarchy must not skip levels (H2 → H4 confuses the tree builder).
- Plain markdown only: no HTML, no front matter.
- File slug becomes domain key (kebab-case slug → snake_case key in `DOCUMENT_REGISTRY`, e.g. `marketing-comms.md` → `marketing_comms`).
- One Act / one regulator / one code per file when possible. Cross-references between files are fine in body text.

For legislation-sourced content, `convert_legislation_html.py` (copied from ECE) already produces this format with per-section `LMS#####`/`DLM#####` source URLs. For PDF-sourced content (ASA TAC, council standards), the build script extracts text via `markitdown` or `pymupdf`, structures it into H2/H3 sections, and inserts the canonical hub URL under each H2.
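Two of the conventions above (a `Source:` line directly under every H2, and no skipped heading levels) lend themselves to an automated check before indexing. The following is a minimal sketch, assuming "directly under" means the very next line; `check_corpus` is a hypothetical helper and not part of the existing pipeline.

```python
import re

# An ATX heading: one to six '#' characters followed by whitespace and text.
HEADING = re.compile(r"^(#{1,6})\s+\S")


def check_corpus(markdown: str) -> list[str]:
    """Return human-readable problems found in one corpus file."""
    problems = []
    lines = markdown.splitlines()
    previous_level = 0
    for number, line in enumerate(lines, start=1):
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        # A heading may go at most one level deeper than the heading before it.
        if previous_level and level > previous_level + 1:
            problems.append(f"line {number}: heading jumps from H{previous_level} to H{level}")
        previous_level = level
        if level == 2:
            # `number` is 1-based, so lines[number] is the line after the heading.
            following = lines[number] if number < len(lines) else ""
            if not following.startswith("Source: https://"):
                problems.append(f"line {number}: H2 has no Source line directly under it")
    return problems
```

Running this over each `corpus/*.md` file during the Phase B sanity check would turn a manual spot-check into a quick pass/fail list.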
### HTML extraction fallback ladder

Most NZ govt targets are SilverStripe (comcom.govt.nz, acc.co.nz, hdc.org.nz, privacy.org.nz, most council sites): semantic HTML5, so content extractors work well. Legacy exceptions: medsafe.govt.nz (older ASP), asa.co.nz (industry body, likely WordPress).

Escalate only as needed:

1. **`markitdown`** (default): works for most pages.
2. **`trafilatura`** (Python; `uv add` if needed): drop-in for pages markitdown handles poorly.
3. **`defuddle`** (Node.js; manual pre-processing): escalation only. Likely needed (if at all) for medsafe legacy or asa.co.nz. Run `npx defuddle <url> --markdown > sources/raw/<file>.md` outside the build script, then read the pre-cleaned file.

PDF sources stay on `markitdown` / `pymupdf`. `legislation.govt.nz` content uses `convert_legislation_html.py` (custom parser tuned to its parliamentary drafting markup); don't apply general extractors there.
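The ladder can be expressed as an ordered list of extractor callables tried in turn. The sketch below is library-agnostic: the caller supplies thin wrappers around `markitdown` and `trafilatura`, which are not shown here. The `min_chars` threshold for "usable output" is an assumption, and what counts as a page being "handled poorly" is ultimately a judgment call.

```python
from typing import Callable, Iterable, Optional

# An extractor takes raw HTML and returns markdown, or None/"" on failure.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    html: str,
    extractors: Iterable[tuple[str, Extractor]],
    min_chars: int = 200,
) -> tuple[str, str]:
    """Try each extractor in order; return (name, markdown) for the first usable result."""
    for name, extractor in extractors:
        try:
            text = extractor(html)
        except Exception:
            # A crashing extractor counts as a miss; move to the next rung.
            continue
        if text and len(text.strip()) >= min_chars:
            return name, text
    raise ValueError("no extractor produced usable output; pre-process the page manually")
```

Returning the extractor name alongside the text lets a build script log which rung each page needed, which helps decide whether the manual `defuddle` step is worth it for a given site.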
## Section-level metadata flags

Some provisions are retrieved disproportionately often and benefit from explicit metadata so the router/retriever can surface them precisely. Build scripts attach a `tags:` line under the relevant H2/H3 heading:

```markdown
## Section 58: Restrictions on advertising of medicines
Source: https://www.legislation.govt.nz/...
tags: testimonial-ban, medicines, frequently-cited

Section content...
```
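A build script could attach the `tags:` line as a post-processing step over the converted markdown. The sketch below assumes the heading text is matched exactly and that the tags belong after the `Source:` line when one is present; `attach_tags` is a hypothetical helper, and it does not guard against being run twice on the same file.

```python
def attach_tags(markdown: str, heading: str, tags: list[str]) -> str:
    """Insert a 'tags:' line after the Source line that follows `heading`."""
    lines = markdown.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == heading:
            # Place the tags after the Source line when one sits directly under the heading.
            has_source = i + 1 < len(lines) and lines[i + 1].startswith("Source:")
            insert_at = i + 2 if has_source else i + 1
            lines.insert(insert_at, "tags: " + ", ".join(tags))
            return "\n".join(lines) + "\n"
    raise ValueError(f"heading not found: {heading!r}")
```

Raising on a missing heading is deliberate: if legislation.govt.nz renames or renumbers a flagged section, the build fails loudly instead of silently dropping the flag.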
**v2 mandatory flags:**

| Provision | Flag tags | Why |
|---|---|---|
| Medicines Act 1981 **s58** | `testimonial-ban, medicines, frequently-cited` | Workhorse provision for testimonial bans on medicines, devices, methods of treatment |
| Fair Trading Act 1986 **s12A** | `substantiation, claims, frequently-cited` | Most-tripped-over provision in health/wellness advertising |
| HPCA Act 2003 **title-protection sections** | `title-use, registration, scope` | Foundation for "can I call myself X?" questions |
| Dietary Supplements Regs 1985 **r3 & r5** | `classification, supplements, therapeutic-claim` | The "marketed therapeutically → reclassified as medicine" rule |

**Council standards: scope tags (mandatory):**

Each council document gets a `binds:` metadata field listing the practitioner classes it binds. The retriever filters by binding when the query mentions a practitioner type.

```markdown
# Medical Council of NZ: Statement on Advertising
binds: medical-practitioners
benchmark-only: true
Source: https://www.mcnz.org.nz/...
```

This prevents the model from citing the Medical Council statement as authoritative for chiropractors or naturopaths (a real failure mode without the metadata).
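The scope check itself can be small. The sketch below parses the header fields shown above and decides whether a document is authoritative for a given practitioner class. The `CouncilDoc` type, both function names, and any class label other than `medical-practitioners` (such as `chiropractors`) are assumptions for illustration; the real retriever may store this metadata differently.

```python
from dataclasses import dataclass


@dataclass
class CouncilDoc:
    title: str
    binds: frozenset[str]
    benchmark_only: bool = False


def parse_council_header(markdown: str) -> CouncilDoc:
    """Read the title plus `binds:` / `benchmark-only:` lines from a document."""
    title, binds, benchmark_only = "", frozenset(), False
    for line in markdown.splitlines():
        if line.startswith("# ") and not title:
            title = line[2:].strip()
        elif line.startswith("binds:"):
            # Comma-separated practitioner classes; blanks are ignored.
            values = line.split(":", 1)[1].split(",")
            binds = frozenset(v.strip() for v in values if v.strip())
        elif line.startswith("benchmark-only:"):
            benchmark_only = line.split(":", 1)[1].strip().lower() == "true"
    return CouncilDoc(title, binds, benchmark_only)


def authoritative_for(doc: CouncilDoc, practitioner: str) -> bool:
    """A document is authoritative only for the practitioner classes it binds."""
    return practitioner in doc.binds
```

With the Medical Council header above, `authoritative_for(doc, "chiropractors")` is false, so the generator can still quote the statement as a benchmark (via `benchmark_only`) without presenting it as binding.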
## Transition window metadata (ASA TAC)

The ASA Therapeutic and Health Advertising Code is mid-transition:

- **Current code**: applies until 1 April 2026 (for new ads), 1 July 2026 (for all ads)
- **December 2025 code**: applies from 1 April 2026 (new ads), 1 July 2026 (all ads). Materially different rules on testimonials, user-generated content, vulnerable audiences

Both codes go in the corpus with `effective_from` / `effective_until` metadata on each section. The generator's system prompt includes "today's date is X" and instructs the model to answer based on the code in force on that date.

```markdown
# ASA Therapeutic and Health Advertising Code (current)
effective_from: <unknown - in force prior to 2026>
effective_until: 2026-04-01 (for new advertising), 2026-07-01 (for all advertising)
```

```markdown
# ASA Therapeutic and Health Advertising Code (December 2025)
effective_from: 2026-04-01 (for new advertising), 2026-07-01 (for all advertising)
effective_until: null
```

Without this, the corpus answers correctly today but silently goes stale on a known date.
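The date logic the model is asked to follow can also be written down deterministically, which is useful for benchmark assertions. The sketch below assumes the December 2025 code applies on and after each switch date (that is, `effective_until` is exclusive); the function name and the returned labels are hypothetical.

```python
from datetime import date

NEW_ADS_SWITCH = date(2026, 4, 1)  # Dec 2025 code applies to new advertising from this date
ALL_ADS_SWITCH = date(2026, 7, 1)  # ...and to all advertising from this date


def code_in_force(on: date, is_new_ad: bool) -> str:
    """Return which ASA code governs an ad on a given date."""
    switch = NEW_ADS_SWITCH if is_new_ad else ALL_ADS_SWITCH
    return "december-2025" if on >= switch else "current"
```

Between the two switch dates the answer depends on whether the ad is new: on 2026-05-01 a new ad falls under the December 2025 code while an existing ad is still governed by the current one.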
## Strategic notes

Commercial / expansion strategy (e.g. AU opportunity sizing) lives in `STRATEGY.md` at the repo root (gitignored, local-only).
## Build sequence

### Phase A: Repo bootstrap

| Step | Action | Command |
|---|---|---|
| A1 | Create new repo, copy ECE codebase | shell ops |
| A2 | Strip ECE corpus, indexes, raw sources, ECE-only build scripts | shell ops |
| A3 | Update `pyproject.toml` name/description | edit |
| A4 | Keep `convert_legislation_html.py`, `propagate_source_urls.py`, `build_indexes.py` | no-op |

### Phase B: Source acquisition (v2: 6 domains, ~17 documents)

| Step | Action | Notes |
|---|---|---|
| B1 | Identify canonical URLs and PDF locations per domain | see "Domains (v2)" table |
| B2 | Write `scripts/download_sources.sh` (or per-domain fetchers) | mirror ECE's pattern using curl/wget |
| B3 | Acquire raw HTML/PDF into `sources/raw/<domain>/` | run download script |
| B4 | Write `scripts/build_medicines_and_supplements_compilation.py` | Medicines Act (flag s58), Medicines Regs, **Dietary Supplements Regs**, Medsafe guidance |
| B5 | Write `scripts/build_advertising_standards_compilation.py` | ASA TAC current + Dec 2025 (with effective-date metadata) + General ASA Code |
| B6 | Write `scripts/build_consumer_protection_compilation.py` | Fair Trading Act (flag s12A) + ComCom guidance |
| B7 | Write `scripts/build_marketing_comms_compilation.py` | Privacy Act + HIPC + UEMA |
| B8 | Write `scripts/build_practitioner_regulation_compilation.py` | HPCA Act + HDC Code + ACC rules |
| B9 | Write `scripts/build_professional_codes_compilation.py` | Chiro/Osteo/Physio/Chinese Medicine + Medical Council benchmark; **emit `binds:` scope metadata** |
| B10 | `make corpus` (run all build scripts) | produces `corpus/*.md` |
| B11 | Sanity-check: `wc -l corpus/*.md`, spot-check Source URLs and section flags | manual |

### Phase C: Domain registry & prompt tuning

| Step | Action | Command |
|---|---|---|
| C1 | Fill `DOCUMENT_REGISTRY` in `src/config.py` | edit |
| C2 | Rewrite router few-shot examples for healthcare scenarios | edit `src/router.py` |
| C3 | Rewrite generator SYSTEM_PROMPT for healthcare marketing voice | edit `src/generator.py` |
| C4 | Rewrite welcome + starter questions | edit `app.py` |

### Phase D: Index, test, explore

| Step | Action | Command |
|---|---|---|
| D1 | Verify cost estimate | `make dry-run` |
| D2 | Build indexes with Opus + auto-propagate URLs | `make index` |
| D3 | Smoke test | `make test` |
| D4 | Write 10-20 healthcare marketing benchmark questions | edit `benchmark/questions.json` |
| D5 | Quick benchmark | `make benchmark-quick` |
| D6 | Manual exploration | `make app-sonnet` |
## Verification

| Check | Command / action | Pass criteria |
|---|---|---|
| Repo boots | `make app-sonnet` | Streamlit serves on :8501, welcome renders |
| Index builds | `make dry-run` then `make index` | All 6 domains build, propagation runs after |
| URL propagation | `make propagate-urls` | All deep nodes have inherited Source URLs |
| Pipeline test | `make test` | One LLM call returns answer + citation |
| Benchmark sanity | `make benchmark-quick` | 3 questions, no exceptions, citations present |
| Single-domain query | Manual: one question per domain | Right domain selected, right content cited |
| Cross-domain query | Manual: "can I post a patient testimonial about my chiro practice on Instagram?" | Routes to ≥3 domains (advertising_standards + medicines_and_supplements [s58] + professional_codes [Chiro Board]); ASA answer reflects current code + flags Dec 2025 changes |
| Council scope-tagging | Manual: same question for "physio"; should NOT cite Medical Council statement as authoritative | `binds:` metadata respected by retriever |
| Transition window | Manual: ask same testimonial question with system date set to 2026-05-01 | Answer reflects Dec 2025 ASA TAC, not the current code |
| Local model swap | `make app-qwen` | Same questions answer via Qwen + MLX |
## Critical files (in ECE repo, paths in new repo will mirror)

- `src/config.py`: DOCUMENT_REGISTRY structure to copy
- `src/router.py`: few-shot pattern to replicate with healthcare scenarios
- `src/generator.py`: SYSTEM_PROMPT to rewrite without ECE-specific tone
- `src/retriever.py`: keeps existing source URL fallback logic
- `app.py`: welcome message + starter question buttons
- `Makefile`: build/index/propagate targets already wired
- `scripts/propagate_source_urls.py`: generic, ships with new repo unchanged
- `scripts/build_indexes.py`: generic, ships with new repo unchanged

## Out of scope for prototype

- Cloud deployment (defer)
- Auth / user accounts
- Persistent conversation history
- Te reo Māori handling
- Multi-tenancy
- Legal review of answer accuracy (disclaimer-only)
- Productisation as multi-domain framework
## Open questions

### Resolved in v2

- ✓ Repo location: `/Users/gregf/Documents/Workspace/Projects/Personal/health-marketing-compliance-rag/`
- ✓ Domain count: 6 domains (was 5 in v1)
- ✓ Audience: complementary/alternative practitioners + supplement sellers
- ✓ Council standards selection: Chiro/Osteo/Physio/Chinese Medicine + Medical (benchmark only)
- ✓ UEMA inclusion: yes, in marketing-comms cluster

### For Becki to confirm before Phase B starts

- **TAPS in v1?** Becki didn't explicitly comment. Recommendation: defer to v2, since TAPS is geared at therapeutic-product advertisers and is less directly relevant to the practitioner audience. Confirm.
- **ASA TAC handling at the transition**: does Becki want answers framed as "current code says X, from 1 April 2026 the new code says Y" *throughout*, or strictly "the code in force today says X", with no forward-looking commentary? Affects generator prompt design.
- **Chinese Medicine Council vs HPCA Chinese Medicine specifics**: is the council document materially different from the HPCA-derived rules, or do they overlap heavily? Affects whether to ship both or just one.
- **Food Act / Standard 1.2.7 scope**: defer to v2 unless audience has serious supplemented-food product lines. Confirm.

### Acquisition phase (Phase B)

- ASA TAC PDF parsing quality: if structure is messy, may need manual cleanup pass. Both versions (current + Dec 2025) should be eyeballed before committing build script effort.
- Council standards PDF quality: Becki flagged this is the most likely domain to need manual cleanup. Eyeball each before committing build script effort.
- ACC provider rules: canonical source location and format are less standardised than the legislation/council docs; may need manual curation.