
Healthcare Marketing Compliance - RAG Prototype

Document version

v2 - incorporates Becki's domain-expert feedback (collaborator review of v1 source-acquisition recommendations).

v1 → v2 changes

  • Audience narrowed and clarified - from generic "clinic/pharmacy/healthcare-product owner-operators" to complementary/alternative practitioners + supplement sellers (chiros, osteos, physios, Chinese medicine, naturopaths, supplement retailers)
  • Council standards swap - drop Pharmacy/Dental Council standards from v1; add Chiropractic Board, Osteopathic Council, Physiotherapy Board, Chinese Medicine Council; keep Medical Council as scope-tagged benchmark only
  • Five new mandatory additions - HPCA Act 2003, Dietary Supplements Regulations 1985, HDC Code of Rights, ASA TAC Dec 2025 (alongside current), ACC provider rules
  • Section-level metadata flagging - Medicines Act s58 (testimonial bans) and Fair Trading Act s12A (substantiation) called out explicitly in build scripts
  • Transition window metadata - ASA TAC effective dates (1 April 2026 / 1 July 2026) encoded as section metadata so corpus answers stay accurate across the transition
  • Domain restructure - 5 → 6 domains, regrouped to match how marketers actually think about the problem
  • Acquisition list grows from 11 to ~17 documents; cost still ~$10–$15 with Opus
  • Commercial / expansion strategy notes added (held in local-only STRATEGY.md)

Context

The ECE Compliance RAG prototype proved that PageIndex tree retrieval + LLM reasoning + multi-regulator corpus indexing produces a working compliance assistant. The architecture, pipeline and UI are reusable. The next prototype applies the same approach to NZ healthcare marketing regulation, scoped to complementary/alternative practitioners and supplement sellers. This audience has high-stakes compliance pain (advertising claims, testimonials, title use, supplement classification, registration declarations) that spans multiple regulators with overlapping but non-identical requirements.

Goal: validate that the architecture transfers cleanly to a different legislative domain, and that targeting a coherent practitioner segment produces actionable answers. This is a second proof of concept, not a productisation step.

Decisions locked in

  • Same tech stack - Streamlit + uv + litellm + PageIndex
  • Keep local model switching - MLX (Qwen, Gemma) options preserved alongside Claude API
  • No te reo Māori in prototype - language detection code stays but is bypassed
  • Source acquisition via Python scripts - same pattern as ECE. Per-domain build_*_compilation.py scripts read raw HTML/PDF from sources/raw/ and emit markdown with Source: URLs in place
  • Deployment flexible - local-only for prototype; cloud later if value is validated
  • Audience - complementary/alternative practitioners (chiros, osteos, physios, Chinese medicine, naturopaths, acupuncturists) + supplement sellers
  • Indexing budget - Opus is fine (~$10–$15 expected for the v2 corpus, slightly larger than v1)

Repo strategy - staged

The long-term goal is a reusable compliance-RAG framework that benefits multiple domain projects. The path to get there is staged, not upfront.

Stage 1 (now) - sibling fork

  • Create new repo health-marketing-compliance-rag as a sibling to ece-compliance-rag
  • Clean-copy the ECE codebase, strip ECE-specific content, refit for healthcare marketing
  • Repos diverge freely; bug fixes copy-paste between them
  • Deliberate discipline: keep convert_legislation_html.py, propagate_source_urls.py, build_indexes.py byte-identical between both repos to make later extraction painless
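
The byte-identical discipline can be checked mechanically rather than by eye. The following is a minimal sketch, not part of the existing tooling: the function names are hypothetical, and the scripts/ paths are assumed from the "Reuse vs. replace" list (convert_legislation_html.py is assumed to live in scripts/ in both repos).

```python
import hashlib
from pathlib import Path
from typing import Optional

# The three files the plan wants kept byte-identical across both repos.
SHARED_FILES = [
    "scripts/convert_legislation_html.py",
    "scripts/propagate_source_urls.py",
    "scripts/build_indexes.py",
]


def file_digest(path: Path) -> Optional[str]:
    """Return the SHA-256 hex digest of a file, or None if it is missing."""
    if not path.is_file():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def drifted_files(repo_a: Path, repo_b: Path, names=SHARED_FILES) -> list:
    """List shared files that are missing or differ in bytes between two repos."""
    drifted = []
    for name in names:
        digest_a = file_digest(repo_a / name)
        digest_b = file_digest(repo_b / name)
        if digest_a is None or digest_a != digest_b:
            drifted.append(name)
    return drifted
```

Calling drifted_files(Path("../ece-compliance-rag"), Path(".")) from the new repo would return an empty list while the discipline holds; the relative path to the ECE checkout is an assumption about local layout.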

Stage 2 (trigger: scoping a third domain) - template repo

  • Create compliance-rag-template containing only what's genuinely shared across both ECE and healthcare-marketing experience
  • New domain projects start with gh repo create --template compliance-rag-template
  • Existing prototypes can optionally rebase onto template or stay as-is - per-project decision
  • This is where the "rule of three" pays off: by then we'll have hindsight on what was shared vs domain-coloured

Stage 3 (trigger: this becomes a product, not just prototypes) - installable package

  • Publish framework as a versioned Python package (compliance-rag-core) on private PyPI or GitHub Packages
  • Domain repos depend on a pinned version: compliance-rag-core==0.3.1
  • Framework releases propagate to all consumers explicitly, on next bump
  • This is where "shared improvements" actually scales

Why not extract framework now

Rule of three. Frameworks built from one prototype over-fit to that prototype. ECE alone produced the current shape, but healthcare marketing will reveal which parts of that shape were ECE-specific in disguise. Premature abstraction bakes in wrong assumptions. Forking is cheap; refactoring a wrong abstraction is expensive.

Anti-patterns explicitly avoided

  • Git submodules - drift management nightmare; skip directly to package-based sharing at Stage 3
  • Monorepo tooling (Bazel, Nx, Turborepo) - overhead too high for solo prototype work
  • Speculative abstraction - no shared base classes "just in case"; only extract what's already shared verbatim across two repos

Reuse vs. replace

Reused as-is

  • src/pipeline.py, src/retriever.py, src/router.py, src/generator.py - pipeline core
  • src/config.py framework (DOCUMENT_REGISTRY pattern, model presets)
  • src/usage.py, src/language.py (kept dormant - no language switching active)
  • Multi-model switching (litellm + MLX server orchestration in Makefile)
  • scripts/build_indexes.py - PageIndex tree builder
  • scripts/propagate_source_urls.py - URL inheritance through tree
  • app.py - Streamlit shell (welcome + chat + sidebar)
  • Makefile - build pipeline, per-domain index targets, propagate hooks
  • test_pipeline.py, benchmark/run_benchmark.py - eval framework

Replaced (domain-specific, written fresh)

  • corpus/*.md - produced by new per-domain build scripts
  • indexes/*.json - rebuilt from new corpus
  • sources/raw/* - raw HTML/PDF acquired from healthcare regulators
  • DOCUMENT_REGISTRY in src/config.py
  • Router few-shot examples in src/router.py
  • Generator SYSTEM_PROMPT in src/generator.py
  • Welcome message + starter questions in app.py
  • benchmark/questions.json

Adapted from ECE templates (rewritten with healthcare URLs/parsers)

  • scripts/build_medicines_and_supplements_compilation.py - Medicines Act 1981 (with s58 flagged), Medicines Regs 1984, Dietary Supplements Regs 1985, Medsafe advertising guidance
  • scripts/build_advertising_standards_compilation.py - ASA TAC current + ASA TAC Dec 2025 (with effective-date metadata) + General ASA Advertising Standards Code
  • scripts/build_consumer_protection_compilation.py - Fair Trading Act 1986 (with s12A flagged) + ComCom Health & Wellness Claims guidance
  • scripts/build_marketing_comms_compilation.py - Privacy Act 2020, Health Information Privacy Code 2020, UEMA 2007 (the "can I email this list?" cluster)
  • scripts/build_practitioner_regulation_compilation.py - HPCA Act 2003, HDC Code of Rights, ACC provider rules
  • scripts/build_professional_codes_compilation.py - Chiropractic Board, Osteopathic Council, Physiotherapy Board, Chinese Medicine Council standards + Medical Council (scope-tagged as benchmark only)
  • scripts/convert_legislation_html.py - copied from ECE, works unchanged for Medicines Act, Medicines Regs, Dietary Supplements Regs, Fair Trading Act, Privacy Act, HPCA Act, UEMA (all on legislation.govt.nz with same structure)
  • scripts/download_sources.sh (or equivalent) - source-fetching commands

Removed

  • ECE-specific build scripts (scripts/build_ero_compilation.py, build_reform_compilation.py)
  • ECE corpus, indexes, source raw files
  • ECE benchmark questions

Domains (v2)

Six domains, restructured to match how the audience thinks about compliance ("can I say X?", "can I email this list?", "can I call myself Y?"):

| Domain key | Coverage | Acquisition source |
|---|---|---|
| medicines_and_supplements | Medicines Act 1981 (Parts 4 & 5; s58 testimonial ban flagged), Medicines Regs 1984, Dietary Supplements Regs 1985 (the "marketed therapeutically → reclassified as medicine" trapdoor), Medsafe advertising guidance | legislation.govt.nz (HTML) + medsafe.govt.nz (HTML/PDF) |
| advertising_standards | ASA Therapeutic and Health Advertising Code (current) + ASA TAC Dec 2025 (applies 1 Apr 2026 / 1 Jul 2026; both kept with effective-date metadata) + General ASA Advertising Standards Code | asa.co.nz (PDFs) |
| consumer_protection | Fair Trading Act 1986 (Part 1; s12A substantiation flagged) + ComCom Health & Wellness Claims guidance | legislation.govt.nz (HTML) + comcom.govt.nz (HTML/PDF) |
| marketing_comms | Privacy Act 2020 (IPP 10, IPP 11), Health Information Privacy Code 2020, Unsolicited Electronic Messages Act 2007 | legislation.govt.nz (HTML) + privacy.org.nz (PDF) |
| practitioner_regulation | HPCA Act 2003 (titles, scopes of practice, restricted activities), HDC Code of Rights (Rights 6 & 7: information and informed consent), ACC provider rules | legislation.govt.nz + hdc.org.nz + acc.co.nz |
| professional_codes | Chiropractic Board, Osteopathic Council, Physiotherapy Board, Chinese Medicine Council standards on advertising; Medical Council Statement on Advertising (scope-tagged as benchmark only; does NOT bind non-MD practitioners) | chiropracticboard.org.nz, osteopathiccouncil.org.nz, physioboard.org.nz, chinesemedicinecouncil.org.nz, mcnz.org.nz |

Six domains, ~17 documents, comparable to ECE's complexity. The professional_codes domain requires per-document scope metadata (which practitioners each standard binds); see "Section-level metadata flags" below.

Corpus format produced by build scripts

Each build_<domain>_compilation.py script must emit markdown matching the conventions the existing pipeline reads.

File layout (v2):

corpus/
  medicines-and-supplements.md
  advertising-standards.md
  consumer-protection.md
  marketing-comms.md
  practitioner-regulation.md
  professional-codes.md

Per-file structure:

# Domain Title

Source: https://canonical-hub-url

One-paragraph orientation describing what this corpus covers and who issues it.

## Section Title (H2)

Source: https://specific-page-or-section-url

Section content in plain markdown...

### Subsection (H3)

Subsection content...

Conventions enforced by the pipeline:

  • Each H2 has a Source: https://... line directly under it. The propagation script inherits this URL down through all descendants, so build scripts do NOT need to repeat URLs at H3/H4.
  • Heading hierarchy must not skip levels (H2 → H4 confuses the tree builder).
  • Plain markdown only - no HTML, no front matter.
  • File slug becomes domain key (kebab-case slug → snake_case key in DOCUMENT_REGISTRY, e.g. medicines-and-supplements.md → medicines_and_supplements).
  • One Act / one regulator / one code per file when possible. Cross-references between files are fine in body text.
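
Two of these conventions are easy to illustrate in code: the slug-to-key mapping and Source URL inheritance. The sketch below is illustrative only and is not the contents of propagate_source_urls.py. The function names and the node keys "source_url" and "nodes" are assumptions; the real PageIndex tree may use different field names.

```python
from pathlib import Path
from typing import Optional


def domain_key_from_path(corpus_path: str) -> str:
    """Map a corpus file name (kebab-case slug) to its snake_case domain key."""
    return Path(corpus_path).stem.replace("-", "_")


def propagate_source_urls(node: dict, inherited: Optional[str] = None) -> dict:
    """Fill missing source URLs from the nearest ancestor, modifying the tree in place.

    Assumed node shape: {"title": ..., "source_url": ..., "nodes": [children]}.
    """
    # A node's own URL wins; otherwise fall back to whatever the parent passed down.
    url = node.get("source_url") or inherited
    if url is not None:
        node["source_url"] = url
    for child in node.get("nodes", []):
        propagate_source_urls(child, url)
    return node
```

With this shape, an H3 or H4 node without its own Source line ends up carrying the URL of its enclosing H2, which is why build scripts only need to emit the URL once per H2.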

For legislation-sourced content, convert_legislation_html.py (copied from ECE) already produces this format with per-section LMS#####/DLM##### source URLs. For PDF-sourced content (ASA TAC, council standards), the build script extracts text via markitdown or pymupdf, structures into H2/H3 sections, and inserts the canonical hub URL under each H2.

HTML extraction fallback ladder

Most NZ govt targets are SilverStripe (comcom.govt.nz, acc.co.nz, hdc.org.nz, privacy.org.nz, most council sites): semantic HTML5, so content extractors work well. Legacy exceptions: medsafe.govt.nz (older ASP), asa.co.nz (industry body, likely WordPress).

Escalate only as needed:

  1. markitdown (default) - works for most pages.
  2. trafilatura (Python; uv add if needed) - drop-in for pages markitdown handles poorly.
  3. defuddle (Node.js; manual pre-processing) - escalation only. Likely needed (if at all) for medsafe legacy or asa.co.nz. Run npx defuddle <url> --markdown > sources/raw/<file>.md outside the build script, then read the pre-cleaned file.

PDF sources stay on markitdown / pymupdf. legislation.govt.nz content uses convert_legislation_html.py (custom parser tuned to its parliamentary drafting markup); don't apply general extractors there.
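
One way to express the ladder inside a build script is as an ordered list of extractor callables, tried until one yields usable output. This is a sketch under assumptions: the function name, the Extractor signature, and the min_chars threshold for "usable" are all hypothetical, and the real markitdown and trafilatura calls would each need wrapping in a small function that takes HTML and returns markdown or None.

```python
from typing import Callable, Iterable, Optional, Tuple

# An extractor takes raw HTML and returns markdown, or None/"" when it fails.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    html: str,
    ladder: Iterable[Tuple[str, Extractor]],
    min_chars: int = 200,  # assumed threshold for "usable" output
) -> Tuple[str, str]:
    """Try each extractor in order; return (name, markdown) for the first usable result."""
    for name, extractor in ladder:
        try:
            text = extractor(html)
        except Exception:
            text = None  # a crashing extractor is treated like an empty result
        if text and len(text.strip()) >= min_chars:
            return name, text
    raise ValueError("no extractor produced usable output; pre-process manually")
```

Returning the extractor name alongside the text makes it cheap to log which rung each source needed, which helps decide whether a site belongs on the manual defuddle path.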

Section-level metadata flags

Some provisions are retrieved disproportionately often and benefit from explicit metadata so the router/retriever can surface them precisely. Build scripts attach a tags: line under the relevant H2/H3 heading:

## Section 58 - Restrictions on advertising of medicines

Source: https://www.legislation.govt.nz/...

tags: testimonial-ban, medicines, frequently-cited

Section content...

v2 mandatory flags:

| Provision | Flag tags | Why |
|---|---|---|
| Medicines Act 1981 s58 | testimonial-ban, medicines, frequently-cited | Workhorse provision for testimonial bans on medicines, devices, methods of treatment |
| Fair Trading Act 1986 s12A | substantiation, claims, frequently-cited | Most-tripped-over provision in health/wellness advertising |
| HPCA Act 2003 title-protection sections | title-use, registration, scope | Foundation for "can I call myself X?" questions |
| Dietary Supplements Regs 1985 r3 & r5 | classification, supplements, therapeutic-claim | The "marketed therapeutically → reclassified as medicine" rule |
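
Reading these flags back out of the corpus could look like the sketch below. It is an assumption about how the pipeline might parse them, not existing code: the function name is hypothetical, and it treats any run of recognised key: value lines between a heading and the first body line as metadata. The recognised keys are the ones this plan uses (Source, tags, binds, benchmark-only, effective_from, effective_until).

```python
import re

# Keys whose values are comma-separated lists rather than single strings.
LIST_KEYS = {"tags", "binds"}
META_LINE = re.compile(
    r"^(Source|tags|binds|benchmark-only|effective_from|effective_until):\s*(.*)$"
)


def parse_section_metadata(section: str) -> dict:
    """Collect metadata lines that sit between a heading and the first body line."""
    meta = {}
    for line in section.splitlines()[1:]:  # skip the heading line itself
        stripped = line.strip()
        if not stripped:
            continue  # blank lines between metadata lines are allowed
        match = META_LINE.match(stripped)
        if match is None:
            break  # first non-metadata line marks the start of the body
        key, value = match.group(1), match.group(2).strip()
        if key in LIST_KEYS:
            meta[key] = [item.strip() for item in value.split(",") if item.strip()]
        else:
            meta[key] = value
    return meta
```

Stopping at the first non-metadata line keeps a body sentence that happens to begin with "Source:" further down the section from being misread as a flag.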

Council standards - scope tags (mandatory):

Each council document gets a binds: metadata field listing the practitioner classes it binds. The retriever filters by binding when the query mentions a practitioner type.

# Medical Council of NZ - Statement on Advertising

binds: medical-practitioners
benchmark-only: true

Source: https://www.mcnz.org.nz/...

This prevents the model from citing the Medical Council statement as authoritative for chiropractors or naturopaths (a real failure mode without the metadata).
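
A minimal sketch of the binding filter follows, assuming the retriever has already resolved the query to a single practitioner class. The CouncilDoc type, the function name, and every class slug other than medical-practitioners are hypothetical. The sketch also assumes a benchmark-only document is still surfaced, but in a separate list so the generator can label it as comparison material rather than a binding standard.

```python
from dataclasses import dataclass


@dataclass
class CouncilDoc:
    title: str
    binds: set  # practitioner classes this document binds
    benchmark_only: bool = False


def split_by_binding(docs, practitioner_class: str):
    """Return (authoritative, benchmark) document lists for one practitioner class."""
    # Authoritative: the document explicitly binds this practitioner class.
    authoritative = [d for d in docs if practitioner_class in d.binds]
    # Benchmark: flagged benchmark-only and not binding on this class.
    benchmark = [
        d for d in docs if d.benchmark_only and practitioner_class not in d.binds
    ]
    return authoritative, benchmark
```

Documents that neither bind the class nor carry the benchmark flag drop out entirely, so a physiotherapy query would not see the Chiropractic Board standard at all.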

Transition window metadata (ASA TAC)

The ASA Therapeutic and Health Advertising Code is mid-transition:

  • Current code - applies until 1 April 2026 (for new ads), 1 July 2026 (for all ads)
  • December 2025 code - applies from 1 April 2026 (new ads), 1 July 2026 (all ads). Materially different rules on testimonials, user-generated content, vulnerable audiences

Both codes go in the corpus with effective_from / effective_until metadata on each section. The generator's system prompt includes "today's date is X" and instructs the model to answer based on the code in force on that date.

# ASA Therapeutic and Health Advertising Code (current)

effective_from: <unknown - in force prior to 2026>
effective_until: 2026-04-01 (for new advertising), 2026-07-01 (for all advertising)

# ASA Therapeutic and Health Advertising Code (December 2025)

effective_from: 2026-04-01 (for new advertising), 2026-07-01 (for all advertising)
effective_until: null

Without this, the corpus answers correctly today but silently goes stale on a known date.
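
The date logic implied by the two cutovers can be sketched as below. The function name and the return labels "current" and "december-2025" are hypothetical, and the sketch assumes the new code applies on the cutover date itself (the "applies from" reading of the dates above).

```python
from datetime import date

# Cutover dates taken from the transition description above.
NEW_ADS_CUTOVER = date(2026, 4, 1)
ALL_ADS_CUTOVER = date(2026, 7, 1)


def code_in_force(today: date, is_new_ad: bool) -> str:
    """Return which ASA TAC version governs an ad on a given date."""
    # New advertising switches first; existing advertising switches three months later.
    cutover = NEW_ADS_CUTOVER if is_new_ad else ALL_ADS_CUTOVER
    return "december-2025" if today >= cutover else "current"
```

Between the two cutovers the answer depends on whether the ad is new, so a generator prompt that only knows today's date may also need to ask, or state an assumption about, when the ad was first published.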

Strategic notes

Commercial / expansion strategy (e.g. AU opportunity sizing) lives in STRATEGY.md at the repo root (gitignored, local-only).

Build sequence

Phase A - Repo bootstrap

| Step | Action | Command |
|---|---|---|
| A1 | Create new repo, copy ECE codebase | shell ops |
| A2 | Strip ECE corpus, indexes, raw sources, ECE-only build scripts | shell ops |
| A3 | Update pyproject.toml name/description | edit |
| A4 | Keep convert_legislation_html.py, propagate_source_urls.py, build_indexes.py | no-op |

Phase B - Source acquisition (v2: 6 domains, ~17 documents)

| Step | Action | Notes |
|---|---|---|
| B1 | Identify canonical URLs and PDF locations per domain | see "Domains (v2)" table |
| B2 | Write scripts/download_sources.sh (or per-domain fetchers) | mirror ECE's pattern using curl/wget |
| B3 | Acquire raw HTML/PDF into sources/raw/<domain>/ | run download script |
| B4 | Write scripts/build_medicines_and_supplements_compilation.py | Medicines Act (flag s58), Medicines Regs, Dietary Supplements Regs, Medsafe guidance |
| B5 | Write scripts/build_advertising_standards_compilation.py | ASA TAC current + Dec 2025 (with effective-date metadata) + General ASA Code |
| B6 | Write scripts/build_consumer_protection_compilation.py | Fair Trading Act (flag s12A) + ComCom guidance |
| B7 | Write scripts/build_marketing_comms_compilation.py | Privacy Act + HIPC + UEMA |
| B8 | Write scripts/build_practitioner_regulation_compilation.py | HPCA Act + HDC Code + ACC rules |
| B9 | Write scripts/build_professional_codes_compilation.py | Chiro/Osteo/Physio/Chinese Medicine + Medical Council benchmark; emit binds: scope metadata |
| B10 | make corpus - run all build scripts | produces corpus/*.md |
| B11 | Sanity-check: wc -l corpus/*.md, spot-check Source URLs and section flags | manual |
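
Part of the B11 spot-check could be automated. The sketch below checks two of the corpus conventions (every H2 followed by a Source: line, no skipped heading levels); it is a hypothetical helper, not an existing script, and it assumes corpus files contain no fenced code, so any line starting with # is a heading.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def check_corpus_markdown(text: str) -> list:
    """Return a list of human-readable convention violations for one corpus file."""
    problems = []
    lines = text.splitlines()
    previous_level = 0
    for index, line in enumerate(lines):
        match = HEADING.match(line)
        if match is None:
            continue
        level, title = len(match.group(1)), match.group(2).strip()
        # Heading levels may go deeper by at most one step at a time.
        if previous_level and level > previous_level + 1:
            problems.append(
                f"line {index + 1}: skips from H{previous_level} to H{level}: {title}"
            )
        previous_level = level
        if level == 2:
            # The first non-blank line after an H2 must be its Source URL.
            next_text = next((l.strip() for l in lines[index + 1:] if l.strip()), "")
            if not next_text.startswith("Source: https://"):
                problems.append(f"line {index + 1}: H2 without Source line: {title}")
    return problems
```

Running this over each corpus/*.md file and failing the build on a non-empty result would catch these two classes of error before indexing spend; URL correctness and section flags would still need the manual check.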

Phase C - Domain registry & prompt tuning

| Step | Action | Command |
|---|---|---|
| C1 | Fill DOCUMENT_REGISTRY in src/config.py | edit |
| C2 | Rewrite router few-shot examples for healthcare scenarios | edit src/router.py |
| C3 | Rewrite generator SYSTEM_PROMPT for healthcare marketing voice | edit src/generator.py |
| C4 | Rewrite welcome + starter questions | edit app.py |

Phase D - Index, test, explore

| Step | Action | Command |
|---|---|---|
| D1 | Verify cost estimate | make dry-run |
| D2 | Build indexes with Opus + auto-propagate URLs | make index |
| D3 | Smoke test | make test |
| D4 | Write 10–20 healthcare marketing benchmark questions | edit benchmark/questions.json |
| D5 | Quick benchmark | make benchmark-quick |
| D6 | Manual exploration | make app-sonnet |

Verification

| Check | Command / action | Pass criteria |
|---|---|---|
| Repo boots | make app-sonnet | Streamlit serves on :8501, welcome renders |
| Index builds | make dry-run then make index | All 6 domains build, propagation runs after |
| URL propagation | make propagate-urls | All deep nodes have inherited Source URLs |
| Pipeline test | make test | One LLM call returns answer + citation |
| Benchmark sanity | make benchmark-quick | 3 questions, no exceptions, citations present |
| Single-domain query | Manual: one question per domain | Right domain selected, right content cited |
| Cross-domain query | Manual: "can I post a patient testimonial about my chiro practice on Instagram?" | Routes to ≥3 domains (advertising_standards + medicines_and_supplements [s58] + professional_codes [Chiro Board]); ASA answer reflects current code + flags Dec 2025 changes |
| Council scope-tagging | Manual: same question for "physio"; should NOT cite Medical Council statement as authoritative | binds: metadata respected by retriever |
| Transition window | Manual: ask same testimonial question with system date set to 2026-05-01 | Answer reflects Dec 2025 ASA TAC, not the current code |
| Local model swap | make app-qwen | Same questions answer via Qwen + MLX |

Critical files (in ECE repo, paths in new repo will mirror)

  • src/config.py - DOCUMENT_REGISTRY structure to copy
  • src/router.py - few-shot pattern to replicate with healthcare scenarios
  • src/generator.py - SYSTEM_PROMPT to rewrite without ECE-specific tone
  • src/retriever.py - keeps existing source URL fallback logic
  • app.py - welcome message + starter question buttons
  • Makefile - build/index/propagate targets already wired
  • scripts/propagate_source_urls.py - generic, ships with new repo unchanged
  • scripts/build_indexes.py - generic, ships with new repo unchanged

Out of scope for prototype

  • Cloud deployment (defer)
  • Auth / user accounts
  • Persistent conversation history
  • Te reo Māori handling
  • Multi-tenancy
  • Legal review of answer accuracy (disclaimer-only)
  • Productisation as multi-domain framework

Open questions

Resolved in v2

  • ✅ Repo location - /Users/gregf/Documents/Workspace/Projects/Personal/health-marketing-compliance-rag/
  • ✅ Domain count - 6 domains (was 5 in v1)
  • ✅ Audience - complementary/alternative practitioners + supplement sellers
  • ✅ Council standards selection - Chiro/Osteo/Physio/Chinese Medicine + Medical (benchmark only)
  • ✅ UEMA inclusion - yes, in marketing-comms cluster

For Becki to confirm before Phase B starts

  • TAPS in v1? Becki didn't explicitly comment. Recommendation: defer to v2, since TAPS is geared at therapeutic-product advertisers and is less directly relevant to the practitioner audience. Confirm.
  • ASA TAC handling at the transition - does Becki want answers framed as "current code says X, from 1 April 2026 the new code says Y" throughout, or strictly "the code in force today says X, with no forward-looking commentary"? Affects generator prompt design.
  • Chinese Medicine Council vs HPCA Chinese Medicine specifics - is the council document materially different from the HPCA-derived rules, or do they overlap heavily? Affects whether to ship both or just one.
  • Food Act / Standard 1.2.7 scope - defer to v2 unless audience has serious supplemented-food product lines. Confirm.

Acquisition phase (Phase B)

  • ASA TAC PDF parsing quality - if structure is messy, may need manual cleanup pass. Both versions (current + Dec 2025) should be eyeballed before committing build script effort.
  • Council standards PDF quality - Becki flagged this is the most likely domain to need manual cleanup. Eyeball each before committing build script effort.
  • ACC provider rules - canonical source location and format are less standardised than the legislation/council docs; may need manual curation.