# Healthcare Marketing Compliance - RAG Prototype
## Document version
**v2 - incorporates Becki's domain-expert feedback** (collaborator review of v1 source-acquisition recommendations).
### v1 → v2 changes
- **Audience narrowed and clarified** - from generic "clinic/pharmacy/healthcare-product owner-operators" to **complementary/alternative practitioners + supplement sellers** (chiros, osteos, physios, Chinese medicine, naturopaths, supplement retailers)
- **Council standards swap** - drop Pharmacy/Dental Council standards from v1; add Chiropractic Board, Osteopathic Council, Physiotherapy Board, Chinese Medicine Council; keep Medical Council as scope-tagged benchmark only
- **Five new mandatory additions** - HPCA Act 2003, Dietary Supplements Regulations 1985, HDC Code of Rights, ASA TAC Dec 2025 (alongside current), ACC provider rules
- **Section-level metadata flagging** - Medicines Act s58 (testimonial bans) and Fair Trading Act s12A (substantiation) called out explicitly in build scripts
- **Transition window metadata** - ASA TAC effective dates (1 April 2026 / 1 July 2026) encoded as section metadata so corpus answers stay accurate across the transition
- **Domain restructure** - 5 → 6 domains, regrouped to match how marketers actually think about the problem
- **Acquisition list grows** from 11 to ~17 documents; cost still ~$10-$15 with Opus
- Commercial / expansion strategy notes added (held in local-only STRATEGY.md)
## Context
The ECE Compliance RAG prototype proved that PageIndex tree retrieval + LLM reasoning + multi-regulator corpus indexing produces a working compliance assistant. The architecture, pipeline and UI are reusable. The next prototype applies the same approach to **NZ healthcare marketing regulation**, scoped to **complementary/alternative practitioners and supplement sellers**: an audience with high-stakes compliance pain (advertising claims, testimonials, title use, supplement classification, registration declarations) that spans multiple regulators with overlapping but non-identical requirements.
Goal: validate that the architecture transfers cleanly to a different legislative domain, and that targeting a coherent practitioner segment produces actionable answers. This is a second proof of concept, not a productisation step.
## Decisions locked in
- **Same tech stack** - Streamlit + uv + litellm + PageIndex
- **Keep local model switching** - MLX (Qwen, Gemma) options preserved alongside Claude API
- **No te reo Māori in prototype** - language detection code stays but is bypassed
- **Source acquisition via Python scripts** - same pattern as ECE. Per-domain `build_*_compilation.py` scripts read raw HTML/PDF from `sources/raw/` and emit markdown with `Source:` URLs in place
- **Deployment flexible** - local-only for prototype; cloud later if value is validated
- **Audience** - complementary/alternative practitioners (chiros, osteos, physios, Chinese medicine, naturopaths, acupuncturists) + supplement sellers
- **Indexing budget** - Opus is fine (~$10-$15 expected for the v2 corpus, slightly larger than v1)
## Repo strategy - staged
The long-term goal is a reusable compliance-RAG framework that benefits multiple domain projects. The path to get there is staged, not upfront.
### Stage 1 (now) - sibling fork
- Create new repo `health-marketing-compliance-rag` as a sibling to `ece-compliance-rag`
- Clean-copy the ECE codebase, strip ECE-specific content, refit for healthcare marketing
- Repos diverge freely; bug fixes are copy-pasted between them
- Deliberate discipline: keep `convert_legislation_html.py`, `propagate_source_urls.py`, `build_indexes.py` byte-identical between both repos to make later extraction painless
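The byte-identical discipline is easy to let slip, so a small drift check can help. The following is a minimal sketch, not part of the existing codebase: `find_drift` and `sha256_of` are hypothetical helper names, and the `scripts/` prefix on the three files is assumed from the paths used elsewhere in this plan.

```python
import hashlib
from pathlib import Path

# Files the plan says should stay byte-identical across both repos.
# The scripts/ prefix is an assumption based on paths used elsewhere in this plan.
SHARED_SCRIPTS = [
    "scripts/convert_legislation_html.py",
    "scripts/propagate_source_urls.py",
    "scripts/build_indexes.py",
]


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_drift(repo_a: Path, repo_b: Path, rel_paths=SHARED_SCRIPTS) -> list[str]:
    """List relative paths that are missing in either repo or differ in content."""
    drifted = []
    for rel in rel_paths:
        a, b = repo_a / rel, repo_b / rel
        if not (a.is_file() and b.is_file()) or sha256_of(a) != sha256_of(b):
            drifted.append(rel)
    return drifted
```

Calling `find_drift(Path("../ece-compliance-rag"), Path("."))` from the new repo would return an empty list while the two copies still match; any non-empty result names the files to reconcile before extraction.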
### Stage 2 (trigger: scoping a third domain) - template repo
- Create `compliance-rag-template` containing only what's genuinely shared across both ECE and healthcare-marketing experience
- New domain projects start with `gh repo create --template compliance-rag-template`
- Existing prototypes can optionally rebase onto template or stay as-is - per-project decision
- This is where the "rule of three" pays off: by then we'll have hindsight on what was shared vs domain-coloured
### Stage 3 (trigger: this becomes a product, not just prototypes) - installable package
- Publish framework as a versioned Python package (`compliance-rag-core`) on private PyPI or GitHub Packages
- Domain repos depend on a pinned version: `compliance-rag-core==0.3.1`
- Framework releases propagate to all consumers explicitly, on next bump
- This is where "shared improvements" actually scales
### Why not extract framework now
Rule of three. Frameworks built from one prototype over-fit to that prototype. ECE alone produced the *current* shape, but healthcare marketing will reveal which parts of that shape were ECE-specific in disguise. Premature abstraction bakes in wrong assumptions. Forking is cheap; refactoring a wrong abstraction is expensive.
### Anti-patterns explicitly avoided
- **Git submodules** - drift management nightmare; skip directly to package-based sharing at Stage 3
- **Monorepo tooling** (Bazel, Nx, Turborepo) - overhead too high for solo prototype work
- **Speculative abstraction** - no shared base classes "just in case"; only extract what's already shared verbatim across two repos
## Reuse vs. replace
### Reused as-is
- `src/pipeline.py`, `src/retriever.py`, `src/router.py`, `src/generator.py` - pipeline core
- `src/config.py` framework (DOCUMENT_REGISTRY pattern, model presets)
- `src/usage.py`, `src/language.py` (kept dormant - no language switching active)
- Multi-model switching (litellm + MLX server orchestration in Makefile)
- `scripts/build_indexes.py` - PageIndex tree builder
- `scripts/propagate_source_urls.py` - URL inheritance through tree
- `app.py` - Streamlit shell (welcome + chat + sidebar)
- `Makefile` - build pipeline, per-domain index targets, propagate hooks
- `test_pipeline.py`, `benchmark/run_benchmark.py` - eval framework
### Replaced (domain-specific, written fresh)
- `corpus/*.md` - produced by new per-domain build scripts
- `indexes/*.json` - rebuilt from new corpus
- `sources/raw/*` - raw HTML/PDF acquired from healthcare regulators
- `DOCUMENT_REGISTRY` in `src/config.py`
- Router few-shot examples in `src/router.py`
- Generator `SYSTEM_PROMPT` in `src/generator.py`
- Welcome message + starter questions in `app.py`
- `benchmark/questions.json`
### Adapted from ECE templates (rewritten with healthcare URLs/parsers)
- `scripts/build_medicines_and_supplements_compilation.py` - Medicines Act 1981 (with **s58 flagged**), Medicines Regs 1984, **Dietary Supplements Regs 1985**, Medsafe advertising guidance
- `scripts/build_advertising_standards_compilation.py` - ASA TAC current + **ASA TAC Dec 2025 (with effective-date metadata)** + General ASA Advertising Standards Code
- `scripts/build_consumer_protection_compilation.py` - Fair Trading Act 1986 (with **s12A flagged**) + ComCom Health & Wellness Claims guidance
- `scripts/build_marketing_comms_compilation.py` - Privacy Act 2020, Health Information Privacy Code 2020, **UEMA 2007** (the "can I email this list?" cluster)
- `scripts/build_practitioner_regulation_compilation.py` - **HPCA Act 2003**, **HDC Code of Rights**, **ACC provider rules**
- `scripts/build_professional_codes_compilation.py` - **Chiropractic Board, Osteopathic Council, Physiotherapy Board, Chinese Medicine Council** standards + Medical Council (scope-tagged as benchmark only)
- `scripts/convert_legislation_html.py` - copied from ECE, works unchanged for Medicines Act, Medicines Regs, Dietary Supplements Regs, Fair Trading Act, Privacy Act, HPCA Act, UEMA (all on legislation.govt.nz with same structure)
- `scripts/download_sources.sh` (or equivalent) - source-fetching commands
### Removed
- ECE-specific build scripts (`scripts/build_ero_compilation.py`, `build_reform_compilation.py`)
- ECE corpus, indexes, source raw files
- ECE benchmark questions
## Domains (v2)
Six domains, restructured to match how the audience thinks about compliance ("can I say X?", "can I email this list?", "can I call myself Y?"):
| Domain key | Coverage | Acquisition source |
|---|---|---|
| `medicines_and_supplements` | Medicines Act 1981 (Parts 4 & 5; **s58 testimonial ban flagged**), Medicines Regs 1984, **Dietary Supplements Regs 1985** (the "marketed therapeutically → reclassified as medicine" trapdoor), Medsafe advertising guidance | legislation.govt.nz (HTML) + medsafe.govt.nz (HTML/PDF) |
| `advertising_standards` | ASA Therapeutic and Health Advertising Code (current) + **ASA TAC Dec 2025** (applies 1 Apr 2026 / 1 Jul 2026; both kept with effective-date metadata) + General ASA Advertising Standards Code | asa.co.nz (PDFs) |
| `consumer_protection` | Fair Trading Act 1986 (Part 1; **s12A substantiation flagged**) + ComCom Health & Wellness Claims guidance | legislation.govt.nz (HTML) + comcom.govt.nz (HTML/PDF) |
| `marketing_comms` | Privacy Act 2020 (IPP 10, IPP 11), Health Information Privacy Code 2020, **Unsolicited Electronic Messages Act 2007** | legislation.govt.nz (HTML) + privacy.org.nz (PDF) |
| `practitioner_regulation` | **HPCA Act 2003** (titles, scopes of practice, restricted activities), **HDC Code of Rights** (Rights 6 & 7: information and informed consent), **ACC provider rules** | legislation.govt.nz + hdc.org.nz + acc.co.nz |
| `professional_codes` | **Chiropractic Board**, **Osteopathic Council**, **Physiotherapy Board**, **Chinese Medicine Council** standards on advertising; Medical Council Statement on Advertising (**scope-tagged as benchmark only - does NOT bind non-MD practitioners**) | chiropracticboard.org.nz, osteopathiccouncil.org.nz, physioboard.org.nz, chinesemedicinecouncil.org.nz, mcnz.org.nz |
Six domains, ~17 documents, comparable to ECE's complexity. The professional_codes domain requires per-document scope metadata (which practitioners each standard binds); see "Section-level metadata flags" below.
## Corpus format produced by build scripts
Each `build_<domain>_compilation.py` script must emit markdown matching the conventions the existing pipeline reads.
**File layout (v2):**
```
corpus/
medicines-and-supplements.md
advertising-standards.md
consumer-protection.md
marketing-comms.md
practitioner-regulation.md
professional-codes.md
```
**Per-file structure:**
```markdown
# Domain Title
Source: https://canonical-hub-url
One-paragraph orientation describing what this corpus covers and who issues it.
## Section Title (H2)
Source: https://specific-page-or-section-url
Section content in plain markdown...
### Subsection (H3)
Subsection content...
```
**Conventions enforced by the pipeline:**
- Each H2 has a `Source: https://...` line directly under it. The propagation script inherits this URL down through all descendants, so build scripts do NOT need to repeat URLs at H3/H4.
- Heading hierarchy must not skip levels (H2 → H4 confuses the tree builder).
- Plain markdown only - no HTML, no front matter.
- File slug becomes domain key (slug case → snake case in `DOCUMENT_REGISTRY`).
- One Act / one regulator / one code per file when possible. Cross-references between files are fine in body text.
For legislation-sourced content, `convert_legislation_html.py` (copied from ECE) already produces this format with per-section `LMS#####`/`DLM#####` source URLs. For PDF-sourced content (ASA TAC, council standards), the build script extracts text via `markitdown` or `pymupdf`, structures into H2/H3 sections, and inserts the canonical hub URL under each H2.
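A build script's output can be checked against these conventions before indexing. The sketch below is illustrative only and is not the pipeline's own validator: `check_corpus_markdown` and `domain_key` are hypothetical names, and "directly under" is assumed to mean the first non-blank line after the heading.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+\S")


def next_nonblank(lines: list[str], start: int) -> str:
    """Return the first non-blank line at or after index start, or '' if none."""
    for line in lines[start:]:
        if line.strip():
            return line
    return ""


def check_corpus_markdown(text: str) -> list[str]:
    """Return convention violations found in one corpus file (empty list = clean)."""
    problems = []
    lines = text.splitlines()
    prev_level = 0
    for i, line in enumerate(lines):
        match = HEADING_RE.match(line)
        if not match:
            continue
        level = len(match.group(1))
        # Rule: heading hierarchy must not skip levels (e.g. H2 straight to H4).
        if prev_level and level > prev_level + 1:
            problems.append(f"line {i + 1}: heading jumps from H{prev_level} to H{level}")
        prev_level = level
        # Rule: every H2 carries a Source line directly under it.
        if level == 2 and not next_nonblank(lines, i + 1).startswith("Source: https://"):
            problems.append(f"line {i + 1}: H2 has no Source line directly under it")
    return problems


def domain_key(filename: str) -> str:
    """Map a corpus file slug to its snake-case registry key."""
    return filename.removesuffix(".md").replace("-", "_")
```

For example, `domain_key("marketing-comms.md")` gives `marketing_comms`, matching the domain keys in the table above. Running `check_corpus_markdown` over each file in `corpus/` could slot into the manual sanity check in Phase B.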
### HTML extraction fallback ladder
Most NZ govt targets are SilverStripe (comcom.govt.nz, acc.co.nz, hdc.org.nz, privacy.org.nz, most council sites): semantic HTML5, content extractors work well. Legacy exceptions: medsafe.govt.nz (older ASP), asa.co.nz (industry body, likely WordPress).
Escalate only as needed:
1. **`markitdown`** (default) - works for most pages.
2. **`trafilatura`** (Python; `uv add` if needed) - drop-in for pages markitdown handles poorly.
3. **`defuddle`** (Node.js; manual pre-processing) - escalation only. Likely needed (if at all) for medsafe legacy or asa.co.nz. Run `npx defuddle <url> --markdown > sources/raw/<file>.md` outside the build script, then read the pre-cleaned file.
PDF sources stay on `markitdown` / `pymupdf`. `legislation.govt.nz` content uses `convert_legislation_html.py` (custom parser tuned to its parliamentary drafting markup); don't apply general extractors there.
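The ladder for general HTML pages can be expressed as an ordered list of extractor callables. The following is a sketch under stated assumptions: `extract_with_fallback` is a hypothetical helper, the `min_chars` threshold is an arbitrary usability heuristic, and the real `markitdown` / `trafilatura` wrappers are deliberately left out (they would be passed in as plain functions taking HTML and returning text).

```python
from typing import Callable, Iterable, Optional

# An extractor takes raw HTML and returns markdown/text, or None/"" on failure.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    html: str,
    extractors: Iterable[tuple[str, Extractor]],
    min_chars: int = 200,
) -> tuple[str, str]:
    """Try each extractor in order and return (name, text) from the first usable one.

    A result counts as usable when it has at least min_chars characters after
    stripping whitespace. The threshold is a heuristic, not a pipeline rule.
    """
    for name, extract in extractors:
        try:
            text = extract(html)
        except Exception:
            # A failing extractor simply moves us to the next rung of the ladder.
            continue
        if text and len(text.strip()) >= min_chars:
            return name, text.strip()
    raise ValueError("no extractor produced usable text; pre-process the page manually")
```

Returning the extractor name alongside the text makes it easy to log which rung each page needed, which in turn shows whether the `defuddle` escalation is ever required.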
## Section-level metadata flags
Some provisions are retrieved disproportionately often and benefit from explicit metadata so the router/retriever can surface them precisely. Build scripts attach a `tags:` line under the relevant H2/H3 heading:
```markdown
## Section 58 - Restrictions on advertising of medicines
Source: https://www.legislation.govt.nz/...
tags: testimonial-ban, medicines, frequently-cited
Section content...
```
**v2 mandatory flags:**
| Provision | Flag tags | Why |
|---|---|---|
| Medicines Act 1981 **s58** | `testimonial-ban, medicines, frequently-cited` | Workhorse provision for testimonial bans on medicines, devices, methods of treatment |
| Fair Trading Act 1986 **s12A** | `substantiation, claims, frequently-cited` | Most-tripped-over provision in health/wellness advertising |
| HPCA Act 2003 **title-protection sections** | `title-use, registration, scope` | Foundation for "can I call myself X?" questions |
| Dietary Supplements Regs 1985 **r3 & r5** | `classification, supplements, therapeutic-claim` | The "marketed therapeutically → reclassified as medicine" rule |
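To illustrate how a `tags:` line might be read back out of the corpus, here is a minimal sketch. It is an assumption about the parsing, not the router's actual code: `collect_tags` is a hypothetical name, and a tag line is only recognised in the header block of a section (after the heading, alongside `Source:`, before the first body line).

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def collect_tags(markdown: str) -> dict[str, list[str]]:
    """Map heading titles to the tags declared directly beneath them."""
    tags_by_heading: dict[str, list[str]] = {}
    current = None  # heading whose header block we are still inside
    for line in markdown.splitlines():
        heading = HEADING_RE.match(line)
        if heading:
            current = heading.group(2)
            continue
        if current is None:
            continue
        if line.startswith("Source:") or not line.strip():
            continue  # still inside the header block
        if line.startswith("tags:"):
            raw = line[len("tags:"):]
            tags_by_heading[current] = [t.strip() for t in raw.split(",") if t.strip()]
        # Either the tags line or the first body line ends the header block.
        current = None
    return tags_by_heading
```

Restricting recognition to the header block means body text that happens to contain "tags:" is not mistaken for metadata.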
**Council standards - scope tags (mandatory):**
Each council document gets a `binds:` metadata field listing the practitioner classes it binds. The retriever filters by binding when the query mentions a practitioner type.
```markdown
# Medical Council of NZ - Statement on Advertising
binds: medical-practitioners
benchmark-only: true
Source: https://www.mcnz.org.nz/...
```
This prevents the model from citing the Medical Council statement as authoritative for chiropractors or naturopaths (a real failure mode without the metadata).
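The binding filter could look something like the sketch below. Everything here is hypothetical scaffolding: the `CouncilDoc` type, the `split_by_binding` helper, and the practitioner slugs other than `medical-practitioners` are illustrative, and the rule that a benchmark-only document is surfaced as comparison material when it does not bind the asker is an assumed reading of the metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CouncilDoc:
    title: str
    binds: frozenset[str]         # practitioner classes the document binds
    benchmark_only: bool = False  # True => usable as a comparison point only


def split_by_binding(docs: list[CouncilDoc], practitioner: str):
    """Return (authoritative, benchmark) documents for one practitioner class.

    Authoritative: documents whose binds set includes the practitioner.
    Benchmark: benchmark-only documents that do not bind the practitioner.
    Everything else is excluded from the answer context.
    """
    authoritative = [d for d in docs if practitioner in d.binds]
    benchmark = [d for d in docs if d.benchmark_only and practitioner not in d.binds]
    return authoritative, benchmark
```

Keeping the two lists separate lets the generator cite the first as binding and present the second explicitly as a benchmark, rather than dropping the Medical Council statement altogether.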
## Transition window metadata (ASA TAC)
The ASA Therapeutic and Health Advertising Code is mid-transition:
- **Current code** - applies until 1 April 2026 (for new ads), 1 July 2026 (for all ads)
- **December 2025 code** - applies from 1 April 2026 (new ads), 1 July 2026 (all ads). Materially different rules on testimonials, user-generated content, vulnerable audiences
Both codes go in the corpus with `effective_from` / `effective_until` metadata on each section. The generator's system prompt includes "today's date is X" and instructs the model to answer based on the code in force on that date.
```markdown
# ASA Therapeutic and Health Advertising Code (current)
effective_from: <unknown - in force prior to 2026>
effective_until: 2026-04-01 (for new advertising), 2026-07-01 (for all advertising)
```
```markdown
# ASA Therapeutic and Health Advertising Code (December 2025)
effective_from: 2026-04-01 (for new advertising), 2026-07-01 (for all advertising)
effective_until: null
```
Without this, the corpus answers correctly today but silently goes stale on a known date.
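The date logic the generator is asked to follow can be sketched as a small function. This is an illustration under assumptions, not the prototype's implementation: `code_in_force` is a hypothetical name, a "new" ad is taken to mean one first published on or after 1 April 2026, and both switch-over dates are treated as inclusive.

```python
from datetime import date

# Dates taken from the transition description above.
NEW_ADS_SWITCH = date(2026, 4, 1)  # Dec 2025 code starts applying to new advertising
ALL_ADS_SWITCH = date(2026, 7, 1)  # Dec 2025 code applies to all advertising


def code_in_force(today: date, ad_first_published: date) -> str:
    """Return which ASA TAC version governs an ad on a given date.

    Assumption: an ad counts as new if first published on or after
    NEW_ADS_SWITCH, and both switch dates are inclusive.
    """
    if today >= ALL_ADS_SWITCH:
        return "december-2025"
    if today >= NEW_ADS_SWITCH and ad_first_published >= NEW_ADS_SWITCH:
        return "december-2025"
    return "current"
```

Between the two dates the answer depends on when the ad first ran, which is why the metadata carries separate dates for new and all advertising instead of a single cut-over.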
## Strategic notes
Commercial / expansion strategy (e.g. AU opportunity sizing) lives in `STRATEGY.md` at the repo root (gitignored, local-only).
## Build sequence
### Phase A - Repo bootstrap
| Step | Action | Command |
|---|---|---|
| A1 | Create new repo, copy ECE codebase | shell ops |
| A2 | Strip ECE corpus, indexes, raw sources, ECE-only build scripts | shell ops |
| A3 | Update `pyproject.toml` name/description | edit |
| A4 | Keep `convert_legislation_html.py`, `propagate_source_urls.py`, `build_indexes.py` | no-op |
### Phase B - Source acquisition (v2: 6 domains, ~17 documents)
| Step | Action | Notes |
|---|---|---|
| B1 | Identify canonical URLs and PDF locations per domain | see "Domains (v2)" table |
| B2 | Write `scripts/download_sources.sh` (or per-domain fetchers) | mirror ECE's pattern using curl/wget |
| B3 | Acquire raw HTML/PDF into `sources/raw/<domain>/` | run download script |
| B4 | Write `scripts/build_medicines_and_supplements_compilation.py` | Medicines Act (flag s58), Medicines Regs, **Dietary Supplements Regs**, Medsafe guidance |
| B5 | Write `scripts/build_advertising_standards_compilation.py` | ASA TAC current + Dec 2025 (with effective-date metadata) + General ASA Code |
| B6 | Write `scripts/build_consumer_protection_compilation.py` | Fair Trading Act (flag s12A) + ComCom guidance |
| B7 | Write `scripts/build_marketing_comms_compilation.py` | Privacy Act + HIPC + UEMA |
| B8 | Write `scripts/build_practitioner_regulation_compilation.py` | HPCA Act + HDC Code + ACC rules |
| B9 | Write `scripts/build_professional_codes_compilation.py` | Chiro/Osteo/Physio/Chinese Medicine + Medical Council benchmark; **emit `binds:` scope metadata** |
| B10 | `make corpus` - run all build scripts | produces `corpus/*.md` |
| B11 | Sanity-check: `wc -l corpus/*.md`, spot-check Source URLs and section flags | manual |
### Phase C - Domain registry & prompt tuning
| Step | Action | Command |
|---|---|---|
| C1 | Fill `DOCUMENT_REGISTRY` in `src/config.py` | edit |
| C2 | Rewrite router few-shot examples for healthcare scenarios | edit `src/router.py` |
| C3 | Rewrite generator SYSTEM_PROMPT for healthcare marketing voice | edit `src/generator.py` |
| C4 | Rewrite welcome + starter questions | edit `app.py` |
### Phase D - Index, test, explore
| Step | Action | Command |
|---|---|---|
| D1 | Verify cost estimate | `make dry-run` |
| D2 | Build indexes with Opus + auto-propagate URLs | `make index` |
| D3 | Smoke test | `make test` |
| D4 | Write 10-20 healthcare marketing benchmark questions | edit `benchmark/questions.json` |
| D5 | Quick benchmark | `make benchmark-quick` |
| D6 | Manual exploration | `make app-sonnet` |
## Verification
| Check | Command / action | Pass criteria |
|---|---|---|
| Repo boots | `make app-sonnet` | Streamlit serves on :8501, welcome renders |
| Index builds | `make dry-run` then `make index` | All 6 domains build, propagation runs after |
| URL propagation | `make propagate-urls` | All deep nodes have inherited Source URLs |
| Pipeline test | `make test` | One LLM call returns answer + citation |
| Benchmark sanity | `make benchmark-quick` | 3 questions, no exceptions, citations present |
| Single-domain query | Manual: one question per domain | Right domain selected, right content cited |
| Cross-domain query | Manual: "can I post a patient testimonial about my chiro practice on Instagram?" | Routes to ≥3 domains (advertising_standards + medicines_and_supplements [s58] + professional_codes [Chiro Board]); ASA answer reflects current code + flags Dec 2025 changes |
| Council scope-tagging | Manual: same question for "physio"; should NOT cite Medical Council statement as authoritative | `binds:` metadata respected by retriever |
| Transition window | Manual: ask same testimonial question with system date set to 2026-05-01 | Answer reflects Dec 2025 ASA TAC, not the current code |
| Local model swap | `make app-qwen` | Same questions answer via Qwen + MLX |
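The "URL propagation" check can be automated once the index JSON is loaded. The sketch below assumes a node shape that has not been confirmed against the real index files: each node is a dict with a `title`, an optional `source_url`, and children under `nodes`. All three key names, and both function names, are placeholders to adjust to the actual schema.

```python
from typing import Optional


def propagate_source_urls(node: dict, inherited: Optional[str] = None) -> None:
    """Fill in missing source_url values from the nearest ancestor, in place."""
    url = node.get("source_url") or inherited
    if url:
        node["source_url"] = url
    for child in node.get("nodes", []):
        propagate_source_urls(child, url)


def nodes_missing_url(node: dict, path: str = "") -> list[str]:
    """Return slash-joined title paths of every node that still lacks a source_url."""
    here = f"{path}/{node.get('title', '?')}"
    missing = [] if node.get("source_url") else [here]
    for child in node.get("nodes", []):
        missing.extend(nodes_missing_url(child, here))
    return missing
```

After propagation, a non-empty result from `nodes_missing_url` points at sections whose H2 never had a `Source:` line, which is a corpus problem rather than an indexing one.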
## Critical files (in ECE repo, paths in new repo will mirror)
- `src/config.py` - DOCUMENT_REGISTRY structure to copy
- `src/router.py` - few-shot pattern to replicate with healthcare scenarios
- `src/generator.py` - SYSTEM_PROMPT to rewrite without ECE-specific tone
- `src/retriever.py` - keeps existing source URL fallback logic
- `app.py` - welcome message + starter question buttons
- `Makefile` - build/index/propagate targets already wired
- `scripts/propagate_source_urls.py` - generic, ships with new repo unchanged
- `scripts/build_indexes.py` - generic, ships with new repo unchanged
## Out of scope for prototype
- Cloud deployment (defer)
- Auth / user accounts
- Persistent conversation history
- Te reo Māori handling
- Multi-tenancy
- Legal review of answer accuracy (disclaimer-only)
- Productisation as multi-domain framework
## Open questions
### Resolved in v2
- ✅ Repo location - `/Users/gregf/Documents/Workspace/Projects/Personal/health-marketing-compliance-rag/`
- ✅ Domain count - 6 domains (was 5 in v1)
- ✅ Audience - complementary/alternative practitioners + supplement sellers
- ✅ Council standards selection - Chiro/Osteo/Physio/Chinese Medicine + Medical (benchmark only)
- ✅ UEMA inclusion - yes, in marketing-comms cluster
### For Becki to confirm before Phase B starts
- **TAPS in v1?** Becki didn't explicitly comment. Recommendation: defer to v2, since TAPS is geared at therapeutic-product advertisers and is less directly relevant to the practitioner audience. Confirm.
- **ASA TAC handling at the transition** - does Becki want answers framed as "current code says X, from 1 April 2026 the new code says Y" *throughout*, or strictly "the code in force today says X, with no forward-looking commentary"? Affects generator prompt design.
- **Chinese Medicine Council vs HPCA Chinese Medicine specifics** - is the council document materially different from the HPCA-derived rules, or do they overlap heavily? Affects whether to ship both or just one.
- **Food Act / Standard 1.2.7 scope** - defer to v2 unless audience has serious supplemented-food product lines. Confirm.
### Acquisition phase (Phase B)
- ASA TAC PDF parsing quality - if structure is messy, may need manual cleanup pass. Both versions (current + Dec 2025) should be eyeballed before committing build script effort.
- Council standards PDF quality - Becki flagged this is the most likely domain to need manual cleanup. Eyeball each before committing build script effort.
- ACC provider rules - canonical source location and format are less standardised than the legislation/council docs; may need manual curation.