Healthcare Marketing Compliance: RAG Prototype
Document version
v2: incorporates Becki's domain-expert feedback (collaborator review of v1 source-acquisition recommendations).
v1 → v2 changes
- Audience narrowed and clarified: from generic "clinic/pharmacy/healthcare-product owner-operators" to complementary/alternative practitioners + supplement sellers (chiros, osteos, physios, Chinese medicine, naturopaths, supplement retailers)
- Council standards swap: drop Pharmacy/Dental Council standards from v1; add Chiropractic Board, Osteopathic Council, Physiotherapy Board, Chinese Medicine Council; keep Medical Council as scope-tagged benchmark only
- Five new mandatory additions: HPCA Act 2003, Dietary Supplements Regulations 1985, HDC Code of Rights, ASA TAC Dec 2025 (alongside current), ACC provider rules
- Section-level metadata flagging: Medicines Act s58 (testimonial bans) and Fair Trading Act s12A (substantiation) called out explicitly in build scripts
- Transition window metadata: ASA TAC effective dates (1 April 2026 / 1 July 2026) encoded as section metadata so corpus answers stay accurate across the transition
- Domain restructure: 5 → 6 domains, regrouped to match how marketers actually think about the problem
- Acquisition list grows from 11 to ~17 documents; cost still ~$10-$15 with Opus
- Commercial / expansion strategy notes added (held in local-only STRATEGY.md)
Context
The ECE Compliance RAG prototype proved that PageIndex tree retrieval + LLM reasoning + multi-regulator corpus indexing produces a working compliance assistant. The architecture, pipeline and UI are reusable. The next prototype applies the same approach to NZ healthcare marketing regulation, scoped to complementary/alternative practitioners and supplement sellers: an audience with high-stakes compliance pain (advertising claims, testimonials, title use, supplement classification, registration declarations) that spans multiple regulators with overlapping but non-identical requirements.
Goal: validate that the architecture transfers cleanly to a different legislative domain, and that targeting a coherent practitioner segment produces actionable answers. This is a second proof of concept, not a productisation step.
Decisions locked in
- Same tech stack: Streamlit + uv + litellm + PageIndex
- Keep local model switching: MLX (Qwen, Gemma) options preserved alongside Claude API
- No te reo Māori in prototype: language detection code stays but is bypassed
- Source acquisition via Python scripts: same pattern as ECE. Per-domain build_*_compilation.py scripts read raw HTML/PDF from sources/raw/ and emit markdown with Source: URLs in place
- Deployment flexible: local-only for prototype; cloud later if value is validated
- Audience: complementary/alternative practitioners (chiros, osteos, physios, Chinese medicine, naturopaths, acupuncturists) + supplement sellers
- Indexing budget: Opus is fine (~$10-$15 expected for the v2 corpus, slightly larger than v1)
Repo strategy: staged
The long-term goal is a reusable compliance-RAG framework that benefits multiple domain projects. The path to get there is staged, not upfront.
Stage 1 (now): sibling fork
- Create new repo health-marketing-compliance-rag as a sibling to ece-compliance-rag
- Clean-copy the ECE codebase, strip ECE-specific content, refit for healthcare marketing
- Repos diverge freely; bug fixes copy-paste between them
- Deliberate discipline: keep convert_legislation_html.py, propagate_source_urls.py, build_indexes.py byte-identical between both repos to make later extraction painless
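The byte-identical discipline can be checked mechanically instead of by eye. The following is a minimal sketch, assuming both repos are checked out locally and that the three shared files live under scripts/ as listed elsewhere in this plan; the function names are hypothetical and not part of either codebase.

```python
import hashlib
from pathlib import Path
from typing import List, Optional

# The three files the plan asks to keep byte-identical across both repos.
SHARED_FILES = [
    "scripts/convert_legislation_html.py",
    "scripts/propagate_source_urls.py",
    "scripts/build_indexes.py",
]


def file_digest(path: Path) -> Optional[str]:
    """Return the SHA-256 hex digest of a file, or None if it is missing."""
    if not path.is_file():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_drift(repo_a: Path, repo_b: Path, files=SHARED_FILES) -> List[str]:
    """List the shared files that differ, or are missing, between two repos."""
    drifted = []
    for rel in files:
        digest_a = file_digest(repo_a / rel)
        digest_b = file_digest(repo_b / rel)
        if digest_a is None or digest_b is None or digest_a != digest_b:
            drifted.append(rel)
    return drifted
```

Calling find_drift(Path("../ece-compliance-rag"), Path(".")) from the new repo would return an empty list while the discipline holds; the relative paths in that call are an assumption about local checkout layout.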
Stage 2 (trigger: scoping a third domain): template repo
- Create compliance-rag-template containing only what's genuinely shared across both ECE and healthcare-marketing experience
- New domain projects start with gh repo create --template compliance-rag-template
- Existing prototypes can optionally rebase onto template or stay as-is (per-project decision)
- This is where the "rule of three" pays off: by then we'll have hindsight on what was shared vs domain-coloured
Stage 3 (trigger: this becomes a product, not just prototypes): installable package
- Publish framework as a versioned Python package (compliance-rag-core) on private PyPI or GitHub Packages
- Domain repos depend on a pinned version: compliance-rag-core==0.3.1
- Framework releases propagate to all consumers explicitly, on next bump
- This is where "shared improvements" actually scales
Why not extract framework now
Rule of three. Frameworks built from one prototype over-fit to that prototype. ECE alone produced the current shape, but healthcare marketing will reveal which parts of that shape were ECE-specific in disguise. Premature abstraction bakes in wrong assumptions. Forking is cheap; refactoring a wrong abstraction is expensive.
Anti-patterns explicitly avoided
- Git submodules: drift management nightmare; skip directly to package-based sharing at Stage 3
- Monorepo tooling (Bazel, Nx, Turborepo): overhead too high for solo prototype work
- Speculative abstraction: no shared base classes "just in case"; only extract what's already shared verbatim across two repos
Reuse vs. replace
Reused as-is
- src/pipeline.py, src/retriever.py, src/router.py, src/generator.py: pipeline core
- src/config.py framework (DOCUMENT_REGISTRY pattern, model presets)
- src/usage.py, src/language.py (kept dormant; no language switching active)
- Multi-model switching (litellm + MLX server orchestration in Makefile)
- scripts/build_indexes.py: PageIndex tree builder
- scripts/propagate_source_urls.py: URL inheritance through tree
- app.py: Streamlit shell (welcome + chat + sidebar)
- Makefile: build pipeline, per-domain index targets, propagate hooks
- test_pipeline.py, benchmark/run_benchmark.py: eval framework
Replaced (domain-specific, written fresh)
- corpus/*.md: produced by new per-domain build scripts
- indexes/*.json: rebuilt from new corpus
- sources/raw/*: raw HTML/PDF acquired from healthcare regulators
- DOCUMENT_REGISTRY in src/config.py
- Router few-shot examples in src/router.py
- Generator SYSTEM_PROMPT in src/generator.py
- Welcome message + starter questions in app.py
- benchmark/questions.json
Adapted from ECE templates (rewritten with healthcare URLs/parsers)
- scripts/build_medicines_and_supplements_compilation.py: Medicines Act 1981 (with s58 flagged), Medicines Regs 1984, Dietary Supplements Regs 1985, Medsafe advertising guidance
- scripts/build_advertising_standards_compilation.py: ASA TAC current + ASA TAC Dec 2025 (with effective-date metadata) + General ASA Advertising Standards Code
- scripts/build_consumer_protection_compilation.py: Fair Trading Act 1986 (with s12A flagged) + ComCom Health & Wellness Claims guidance
- scripts/build_marketing_comms_compilation.py: Privacy Act 2020, Health Information Privacy Code 2020, UEMA 2007 (the "can I email this list?" cluster)
- scripts/build_practitioner_regulation_compilation.py: HPCA Act 2003, HDC Code of Rights, ACC provider rules
- scripts/build_professional_codes_compilation.py: Chiropractic Board, Osteopathic Council, Physiotherapy Board, Chinese Medicine Council standards + Medical Council (scope-tagged as benchmark only)
- scripts/convert_legislation_html.py: copied from ECE, works unchanged for Medicines Act, Medicines Regs, Dietary Supplements Regs, Fair Trading Act, Privacy Act, HPCA Act, UEMA (all on legislation.govt.nz with same structure)
- scripts/download_sources.sh (or equivalent): source-fetching commands
Removed
- ECE-specific build scripts (scripts/build_ero_compilation.py, build_reform_compilation.py)
- ECE corpus, indexes, source raw files
- ECE benchmark questions
Domains (v2)
Six domains, restructured to match how the audience thinks about compliance ("can I say X?", "can I email this list?", "can I call myself Y?"):
| Domain key | Coverage | Acquisition source |
|---|---|---|
| medicines_and_supplements | Medicines Act 1981 (Parts 4 & 5; s58 testimonial ban flagged), Medicines Regs 1984, Dietary Supplements Regs 1985 (the "marketed therapeutically → reclassified as medicine" trapdoor), Medsafe advertising guidance | legislation.govt.nz (HTML) + medsafe.govt.nz (HTML/PDF) |
| advertising_standards | ASA Therapeutic and Health Advertising Code (current) + ASA TAC Dec 2025 (applies 1 Apr 2026 / 1 Jul 2026; both kept with effective-date metadata) + General ASA Advertising Standards Code | asa.co.nz (PDFs) |
| consumer_protection | Fair Trading Act 1986 (Part 1; s12A substantiation flagged) + ComCom Health & Wellness Claims guidance | legislation.govt.nz (HTML) + comcom.govt.nz (HTML/PDF) |
| marketing_comms | Privacy Act 2020 (IPP 10, IPP 11), Health Information Privacy Code 2020, Unsolicited Electronic Messages Act 2007 | legislation.govt.nz (HTML) + privacy.org.nz (PDF) |
| practitioner_regulation | HPCA Act 2003 (titles, scopes of practice, restricted activities), HDC Code of Rights (Rights 6 & 7: information and informed consent), ACC provider rules | legislation.govt.nz + hdc.org.nz + acc.co.nz |
| professional_codes | Chiropractic Board, Osteopathic Council, Physiotherapy Board, Chinese Medicine Council standards on advertising; Medical Council Statement on Advertising (scope-tagged as benchmark only; does NOT bind non-MD practitioners) | chiropracticboard.org.nz, osteopathiccouncil.org.nz, physioboard.org.nz, chinesemedicinecouncil.org.nz, mcnz.org.nz |
Six domains, ~17 documents, comparable to ECE's complexity. The professional_codes domain requires per-document scope metadata (which practitioners each standard binds); see "Section-level metadata flags" below.
Corpus format produced by build scripts
Each build_<domain>_compilation.py script must emit markdown matching the conventions the existing pipeline reads.
File layout (v2):
corpus/
medicines-and-supplements.md
advertising-standards.md
consumer-protection.md
marketing-comms.md
practitioner-regulation.md
professional-codes.md
Per-file structure:
# Domain Title
Source: https://canonical-hub-url
One-paragraph orientation describing what this corpus covers and who issues it.
## Section Title (H2)
Source: https://specific-page-or-section-url
Section content in plain markdown...
### Subsection (H3)
Subsection content...
Conventions enforced by the pipeline:
- Each H2 has a Source: https://... line directly under it. The propagation script inherits this URL down through all descendants, so build scripts do NOT need to repeat URLs at H3/H4.
- Heading hierarchy must not skip levels (H2 → H4 confuses the tree builder).
- Plain markdown only: no HTML, no front matter.
- File slug becomes domain key (slug case → snake case in DOCUMENT_REGISTRY).
- One Act / one regulator / one code per file when possible. Cross-references between files are fine in body text.
For legislation-sourced content, convert_legislation_html.py (copied from ECE) already produces this format with per-section LMS#####/DLM##### source URLs. For PDF-sourced content (ASA TAC, council standards), the build script extracts text via markitdown or pymupdf, structures into H2/H3 sections, and inserts the canonical hub URL under each H2.
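Two of the conventions listed above (a Source: line under every H2, and no skipped heading levels) are easy to lint before indexing, as is the slug-to-key mapping. The sketch below is illustrative only: check_corpus and domain_key are hypothetical helpers, not part of the existing pipeline, and the checker assumes the corpus contains no fenced code blocks whose lines start with #.

```python
import re
from pathlib import Path
from typing import List

HEADING = re.compile(r"^(#{1,6})\s+\S")


def check_corpus(markdown: str) -> List[str]:
    """Return a list of convention violations; an empty list means the file passes."""
    problems = []
    lines = markdown.splitlines()
    prev_level = 0
    for i, line in enumerate(lines):
        match = HEADING.match(line)
        if match is None:
            continue
        level = len(match.group(1))
        # A heading may go at most one level deeper than the previous heading.
        if prev_level and level > prev_level + 1:
            problems.append(f"line {i + 1}: heading jumps from H{prev_level} to H{level}")
        prev_level = level
        if level == 2:
            # The first non-blank line after an H2 must be its Source: URL.
            following = next((l.strip() for l in lines[i + 1:] if l.strip()), "")
            if not following.startswith(("Source: http://", "Source: https://")):
                problems.append(f"line {i + 1}: H2 has no Source: line directly under it")
    return problems


def domain_key(corpus_path: str) -> str:
    """Map a corpus file slug to its snake-case domain key."""
    return Path(corpus_path).stem.replace("-", "_")
```

For example, domain_key("corpus/medicines-and-supplements.md") gives medicines_and_supplements, matching the domain key used in the Domains (v2) table.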
HTML extraction fallback ladder
Most NZ govt targets are SilverStripe (comcom.govt.nz, acc.co.nz, hdc.org.nz, privacy.org.nz, most council sites): semantic HTML5, content extractors work well. Legacy exceptions: medsafe.govt.nz (older ASP), asa.co.nz (industry body, likely WordPress).
Escalate only as needed:
- markitdown (default): works for most pages.
- trafilatura (Python; uv add if needed): drop-in for pages markitdown handles poorly.
- defuddle (Node.js; manual pre-processing): escalation only. Likely needed (if at all) for medsafe legacy or asa.co.nz. Run npx defuddle <url> --markdown > sources/raw/<file>.md outside the build script, then read the pre-cleaned file.
PDF sources stay on markitdown / pymupdf. legislation.govt.nz content uses convert_legislation_html.py (custom parser tuned to its parliamentary drafting markup); don't apply general extractors there.
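The escalation ladder for general HTML pages can be expressed as a small loop over extractors. This is a sketch under stated assumptions: the extractors are passed in as plain callables so the example stays library-agnostic (wiring in markitdown or trafilatura is left to the build script), and the min_chars threshold for "handled poorly" is an invented heuristic, not something the ECE pipeline defines.

```python
from typing import Callable, Iterable, Optional, Tuple

# An extractor takes raw HTML and returns markdown/text, or None on failure.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    html: str,
    ladder: Iterable[Tuple[str, Extractor]],
    min_chars: int = 200,  # assumed threshold for "usable" output
) -> Tuple[str, str]:
    """Try each extractor in order and return (name, text) for the first usable result."""
    for name, extractor in ladder:
        try:
            text = extractor(html)
        except Exception:
            # A crashing extractor counts as a miss; move to the next rung.
            continue
        if text and len(text.strip()) >= min_chars:
            return name, text
    raise ValueError("no extractor produced usable output; consider manual pre-processing")
```

Returning the extractor name alongside the text makes it easy to log which pages needed escalation, which helps decide whether the manual defuddle step is worth setting up at all.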
Section-level metadata flags
Some provisions are retrieved disproportionately often and benefit from explicit metadata so the router/retriever can surface them precisely. Build scripts attach a tags: line under the relevant H2/H3 heading:
## Section 58: Restrictions on advertising of medicines
Source: https://www.legislation.govt.nz/...
tags: testimonial-ban, medicines, frequently-cited
Section content...
v2 mandatory flags:
| Provision | Flag tags | Why |
|---|---|---|
| Medicines Act 1981 s58 | testimonial-ban, medicines, frequently-cited | Workhorse provision for testimonial bans on medicines, devices, methods of treatment |
| Fair Trading Act 1986 s12A | substantiation, claims, frequently-cited | Most-tripped-over provision in health/wellness advertising |
| HPCA Act 2003 title-protection sections | title-use, registration, scope | Foundation for "can I call myself X?" questions |
| Dietary Supplements Regs 1985 r3 & r5 | classification, supplements, therapeutic-claim | The "marketed therapeutically → reclassified as medicine" rule |
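On the reading side, the metadata lines under a heading need to be separated from body text. A minimal sketch of one way to do that follows; it assumes metadata lines sit between the heading and the first body line, restricts itself to the keys this plan uses, and keys results by heading title (so duplicate titles in one file would overwrite each other). The function names are hypothetical.

```python
import re
from typing import Dict, List

# Metadata keys used in this plan; anything else is treated as body text.
KNOWN_KEYS = {"Source", "tags", "binds", "benchmark-only", "effective_from", "effective_until"}
HEADING = re.compile(r"^#{1,6}\s+(.*\S)\s*$")
META_LINE = re.compile(r"^([A-Za-z][\w-]*):\s*(.*)$")


def parse_section_metadata(markdown: str) -> Dict[str, Dict[str, str]]:
    """Map each heading title to the key: value lines that sit directly under it."""
    sections: Dict[str, Dict[str, str]] = {}
    current = None  # heading whose metadata block is still open
    for line in markdown.splitlines():
        heading = HEADING.match(line)
        if heading:
            current = heading.group(1)
            sections[current] = {}
            continue
        if current is None or not line.strip():
            continue
        meta = META_LINE.match(line)
        if meta and meta.group(1) in KNOWN_KEYS:
            sections[current][meta.group(1)] = meta.group(2).strip()
        else:
            current = None  # the first body line closes the metadata block
    return sections


def split_tags(value: str) -> List[str]:
    """Split a comma-separated tags: value into a clean list."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]
```

Closing the metadata block at the first body line means a sentence that happens to start with "tags:" deeper in a section is not mistaken for a flag.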
Council standards β scope tags (mandatory):
Each council document gets a binds: metadata field listing the practitioner classes it binds. The retriever filters by binding when the query mentions a practitioner type.
# Medical Council of NZ: Statement on Advertising
binds: medical-practitioners
benchmark-only: true
Source: https://www.mcnz.org.nz/...
This prevents the model from citing the Medical Council statement as authoritative for chiropractors or naturopaths (a real failure mode without the metadata).
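The binding filter could look roughly like the sketch below. Only the medical-practitioners value comes from the example above; the other binds values, the query-term mapping and the CouncilDoc shape are assumptions made for illustration, and the real retriever may represent documents differently.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple

# Hypothetical mapping from words in a query to binds: values.
PRACTITIONER_TERMS = {
    "chiro": "chiropractors",
    "osteo": "osteopaths",
    "physio": "physiotherapists",
    "doctor": "medical-practitioners",
}


@dataclass(frozen=True)
class CouncilDoc:
    title: str
    binds: FrozenSet[str]
    benchmark_only: bool = False


def detect_practitioner(query: str) -> Optional[str]:
    """Return the practitioner class a query mentions, or None if it mentions none."""
    lowered = query.lower()
    for term, practitioner_class in PRACTITIONER_TERMS.items():
        if term in lowered:
            return practitioner_class
    return None


def split_by_binding(
    docs: List[CouncilDoc], practitioner_class: str
) -> Tuple[List[CouncilDoc], List[CouncilDoc]]:
    """Separate documents that bind the practitioner from benchmark-only ones."""
    binding = [d for d in docs if practitioner_class in d.binds]
    benchmark = [d for d in docs if d.benchmark_only and practitioner_class not in d.binds]
    return binding, benchmark
```

Keeping the benchmark list separate, instead of dropping it, lets the generator mention the Medical Council statement as a comparison point while citing only the binding standard as authoritative.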
Transition window metadata (ASA TAC)
The ASA Therapeutic and Health Advertising Code is mid-transition:
- Current code: applies until 1 April 2026 (for new ads), 1 July 2026 (for all ads)
- December 2025 code: applies from 1 April 2026 (new ads), 1 July 2026 (all ads). Materially different rules on testimonials, user-generated content, vulnerable audiences
Both codes go in the corpus with effective_from / effective_until metadata on each section. The generator's system prompt includes "today's date is X" and instructs the model to answer based on the code in force on that date.
# ASA Therapeutic and Health Advertising Code (current)
effective_from: <unknown - in force prior to 2026>
effective_until: 2026-04-01 (for new advertising), 2026-07-01 (for all advertising)
# ASA Therapeutic and Health Advertising Code (December 2025)
effective_from: 2026-04-01 (for new advertising), 2026-07-01 (for all advertising)
effective_until: null
Without this, the corpus answers correctly today but silently goes stale on a known date.
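The date logic the system prompt asks the model to apply can also be pinned down in code, which is useful for the benchmark and for pre-filtering sections. A minimal sketch, assuming the two cutover dates stated above and treating each cutover day itself as the first day of the December 2025 code; the function name and return labels are hypothetical.

```python
from datetime import date

# Cutover dates from the ASA TAC transition described above.
NEW_ADS_CUTOVER = date(2026, 4, 1)   # December 2025 code applies to new advertising
ALL_ADS_CUTOVER = date(2026, 7, 1)   # December 2025 code applies to all advertising


def code_in_force(today: date, is_new_ad: bool) -> str:
    """Return which ASA TAC version governs an ad on the given date."""
    cutover = NEW_ADS_CUTOVER if is_new_ad else ALL_ADS_CUTOVER
    # Assumption: the new code applies from the cutover date inclusive.
    return "december-2025" if today >= cutover else "current"
```

Between the two cutover dates the answer depends on whether the ad is new or already running, so a query that does not say which may need both codes surfaced.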
Strategic notes
Commercial / expansion strategy (e.g. AU opportunity sizing) lives in STRATEGY.md at the repo root (gitignored, local-only).
Build sequence
Phase A: Repo bootstrap
| Step | Action | Command |
|---|---|---|
| A1 | Create new repo, copy ECE codebase | shell ops |
| A2 | Strip ECE corpus, indexes, raw sources, ECE-only build scripts | shell ops |
| A3 | Update pyproject.toml name/description | edit |
| A4 | Keep convert_legislation_html.py, propagate_source_urls.py, build_indexes.py | no-op |
Phase B: Source acquisition (v2: 6 domains, ~17 documents)
| Step | Action | Notes |
|---|---|---|
| B1 | Identify canonical URLs and PDF locations per domain | see "Domains (v2)" table |
| B2 | Write scripts/download_sources.sh (or per-domain fetchers) | mirror ECE's pattern using curl/wget |
| B3 | Acquire raw HTML/PDF into sources/raw/<domain>/ | run download script |
| B4 | Write scripts/build_medicines_and_supplements_compilation.py | Medicines Act (flag s58), Medicines Regs, Dietary Supplements Regs, Medsafe guidance |
| B5 | Write scripts/build_advertising_standards_compilation.py | ASA TAC current + Dec 2025 (with effective-date metadata) + General ASA Code |
| B6 | Write scripts/build_consumer_protection_compilation.py | Fair Trading Act (flag s12A) + ComCom guidance |
| B7 | Write scripts/build_marketing_comms_compilation.py | Privacy Act + HIPC + UEMA |
| B8 | Write scripts/build_practitioner_regulation_compilation.py | HPCA Act + HDC Code + ACC rules |
| B9 | Write scripts/build_professional_codes_compilation.py | Chiro/Osteo/Physio/Chinese Medicine + Medical Council benchmark; emit binds: scope metadata |
| B10 | make corpus (run all build scripts) | produces corpus/*.md |
| B11 | Sanity-check: wc -l corpus/*.md, spot-check Source URLs and section flags | manual |
Phase C: Domain registry & prompt tuning
| Step | Action | Command |
|---|---|---|
| C1 | Fill DOCUMENT_REGISTRY in src/config.py | edit |
| C2 | Rewrite router few-shot examples for healthcare scenarios | edit src/router.py |
| C3 | Rewrite generator SYSTEM_PROMPT for healthcare marketing voice | edit src/generator.py |
| C4 | Rewrite welcome + starter questions | edit app.py |
Phase D: Index, test, explore
| Step | Action | Command |
|---|---|---|
| D1 | Verify cost estimate | make dry-run |
| D2 | Build indexes with Opus + auto-propagate URLs | make index |
| D3 | Smoke test | make test |
| D4 | Write 10-20 healthcare marketing benchmark questions | edit benchmark/questions.json |
| D5 | Quick benchmark | make benchmark-quick |
| D6 | Manual exploration | make app-sonnet |
Verification
| Check | Command / action | Pass criteria |
|---|---|---|
| Repo boots | make app-sonnet | Streamlit serves on :8501, welcome renders |
| Index builds | make dry-run then make index | All 6 domains build, propagation runs after |
| URL propagation | make propagate-urls | All deep nodes have inherited Source URLs |
| Pipeline test | make test | One LLM call returns answer + citation |
| Benchmark sanity | make benchmark-quick | 3 questions, no exceptions, citations present |
| Single-domain query | Manual: one question per domain | Right domain selected, right content cited |
| Cross-domain query | Manual: "can I post a patient testimonial about my chiro practice on Instagram?" | Routes to ≥3 domains (advertising_standards + medicines_and_supplements [s58] + professional_codes [Chiro Board]); ASA answer reflects current code + flags Dec 2025 changes |
| Council scope-tagging | Manual: same question for "physio"; should NOT cite Medical Council statement as authoritative | binds: metadata respected by retriever |
| Transition window | Manual: ask same testimonial question with system date set to 2026-05-01 | Answer reflects Dec 2025 ASA TAC, not the current code |
| Local model swap | make app-qwen | Same questions answer via Qwen + MLX |
Critical files (in ECE repo, paths in new repo will mirror)
- src/config.py: DOCUMENT_REGISTRY structure to copy
- src/router.py: few-shot pattern to replicate with healthcare scenarios
- src/generator.py: SYSTEM_PROMPT to rewrite without ECE-specific tone
- src/retriever.py: keeps existing source URL fallback logic
- app.py: welcome message + starter question buttons
- Makefile: build/index/propagate targets already wired
- scripts/propagate_source_urls.py: generic, ships with new repo unchanged
- scripts/build_indexes.py: generic, ships with new repo unchanged
Out of scope for prototype
- Cloud deployment (defer)
- Auth / user accounts
- Persistent conversation history
- Te reo Māori handling
- Multi-tenancy
- Legal review of answer accuracy (disclaimer-only)
- Productisation as multi-domain framework
Open questions
Resolved in v2
- ✅ Repo location: /Users/gregf/Documents/Workspace/Projects/Personal/health-marketing-compliance-rag/
- ✅ Domain count: 6 domains (was 5 in v1)
- ✅ Audience: complementary/alternative practitioners + supplement sellers
- ✅ Council standards selection: Chiro/Osteo/Physio/Chinese Medicine + Medical (benchmark only)
- ✅ UEMA inclusion: yes, in marketing-comms cluster
For Becki to confirm before Phase B starts
- TAPS in v1? Becki didn't explicitly comment. Recommendation: defer to v2; TAPS is geared at therapeutic-product advertisers, less directly relevant to the practitioner audience. Confirm.
- ASA TAC handling at the transition: does Becki want answers framed as "current code says X, from 1 April 2026 the new code says Y" throughout, or strictly "the code in force today says X, with no forward-looking commentary"? Affects generator prompt design.
- Chinese Medicine Council vs HPCA Chinese Medicine specifics: is the council document materially different from the HPCA-derived rules, or do they overlap heavily? Affects whether to ship both or just one.
- Food Act / Standard 1.2.7 scope: defer to v2 unless audience has serious supplemented-food product lines. Confirm.
Acquisition phase (Phase B)
- ASA TAC PDF parsing quality: if structure is messy, may need manual cleanup pass. Both versions (current + Dec 2025) should be eyeballed before committing build script effort.
- Council standards PDF quality: Becki flagged this is the most likely domain to need manual cleanup. Eyeball each before committing build script effort.
- ACC provider rules: canonical source location and format are less standardised than the legislation/council docs; may need manual curation.