Spaces:

rohitsar567
/

InsuranceBot

Sleeping

App Files Files Community

InsuranceBot / README.md

rohitsar567

docs: Cluster A (count drift) + Cluster B (deleted-module refs) sweep

4c728a9 about 1 month ago

preview code

Raw

History Blame Contribute Delete

78 kB

metadata

title: Insurance Sales Portfolio Expert
emoji: 🏥
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Voice-first AI advisor for Indian health insurance

Insurance Sales Portfolio Expert

A health-insurance advisory web app for the Indian market (presented in-app as "Insurance Advisor"). You describe your situation in plain language (typed or spoken, English or Hindi/Hinglish); it asks a few clarifying questions, then recommends and explains real policies — grounded in the actual policy documents, with every claim traceable to a source clause. It also lets you upload your own policy PDF and ask questions about it.

Live: https://rohitsar567-insurancebot.hf.space

Reading this cold? §1 is plain English. §2 walks you down four levels of abstraction: the user journey (§2.1), the building blocks (§2.2), the functional abstraction inside each block (§2.3), then deep-dives per building block (§2.4–§2.9). §3 gives a function-by-function sequence-diagram view of the six most important jobs. §4–§8 are safety, stack, repo map, run-it-locally, and deployment.

What this is
How it works, end to end
Key functions in plain language
Safety & quality
Tech stack & key decisions
Repository map
Run it locally
Deployment

1. What this is

The short answer. A health-insurance advisor that behaves like a knowledgeable, unbiased human advisor — not a lead-generation funnel. You describe your situation; it asks a few clarifying questions; it recommends real plans that fit, with every factual claim backed by the exact clause in the real policy document. No lead capture. No commission bias. If the honest answer is "this isn't in the document," it says so — instead of guessing.

It works by chat or voice, in English or Hindi/Hinglish, on desktop and mobile.

The problem this solves

Buying health insurance in India is hard for an ordinary person. A first-time buyer faces three concrete problems:

Too much to compare. ~148 plans across 21 insurers, each with dozens of decision-relevant fields (waiting periods, room-rent caps, co-pay, maternity, sub-limits, network size). No human reads them all.
The truth is buried. The number that decides whether a plan is right for you is on page 47 of a PDF written by lawyers.
Most "advice" is conflicted. Aggregator sites optimise for the sale, not the fit.

The cost of getting this wrong is real money and denied claims years later. The goal is a tool a non-expert can trust the way they would trust a good independent advisor: personalised to their profile, sourced, and never fabricating.

What it does, concretely

Conversational fact-find — short natural back-and-forth establishes your profile (age, dependants, budget, pre-existing conditions, priorities) instead of a long form.
Personalised recommendations — plans ranked for fit to your profile. A fixed-benefit plan is not pushed to someone who needs comprehensive cover; a plan whose entry age excludes you is filtered out.
Grounded answers — every factual claim about a policy is retrieved from that policy's actual document and shown with its source. Weak or missing evidence produces an honest "not stated in the document."
Marketplace & compare — browse the full indexed catalogue, open a detailed scorecard per plan, compare up to four side by side.
Profile → premium (illustrative) — a live ballpark premium range that updates as you change your profile. Not real underwriting — a multivariate range from public rate-card combinations (see §3.3).
Bring your own document — upload any policy PDF; it is safely indexed for the rest of your session so you can ask questions about your document.
Voice — speak instead of typing (tap-to-talk on mobile, push-to-talk on desktop); replies are spoken back. Indian-accent and Hinglish aware.

2. How it works, end to end

The short answer. A Next.js browser app talks to a FastAPI backend. Every chat turn goes to a single LLM "brain" (Google Gemini) with a small set of function-calling tools — most importantly a retrieval tool over a Chroma vector store built from the real policy documents. The brain decides when to retrieve, what to retrieve, and how to answer; it cannot state a policy fact it did not retrieve. If Gemini is unavailable, the turn transparently falls back to an NVIDIA NIM open-model chain. Voice in/out is handled by Sarvam (Indian-language STT/TTS). Heavy data (PDF corpus + prebuilt vectors) lives in a separate Hugging Face dataset, not the code repo.

The rest of this section walks you down four levels of abstraction: §2.1 the user's journey (plain English, no tech); §2.2 the building blocks at the highest level (the four canonical buckets); §2.3 the functional abstraction — what happens inside each bucket; and §2.4–§2.9 the deep dives per building block. Every diagram is followed by a ≤50-word summary and a hierarchical how it flows breakdown.

2.1 The user's journey (plain English — no tech)

Before the engineering detail, here is what actually happens for the person using it. No code, no jargon — just the path from opening the app to deciding with confidence.

flowchart TD
    S["🌐 You open the app — web or mobile, nothing to install"] --> TELL["🗣️ Tell it about you — a short chat, typed OR spoken, English / Hindi-Hinglish<br/>age · family · budget · health · what you care about"]
    TELL --> ASK["❓ It asks just 2–3 clarifying questions<br/>(a real conversation, never a long form)"]
    ASK --> REC["🎯 A personalised shortlist — plans ranked for YOUR fit, each with the reason it fits"]
    REC --> WHY["🔍 Open any plan: every fact is backed by the exact clause in the real policy PDF<br/>an honest &quot;not stated in the document&quot; instead of a guess"]
    WHY --> EXPLORE{"Want to dig deeper?"}
    EXPLORE -->|"Compare"| CMP["⚖️ Compare up to 4 plans side by side · full scorecard per plan"]
    EXPLORE -->|"Browse"| MKT["📚 Browse the full indexed marketplace"]
    EXPLORE -->|"Ask"| QA["💬 Ask follow-up questions — answered only from the actual documents"]
    EXPLORE -->|"My own policy"| UP["📄 Upload your own policy PDF"]
    UP --> UPIDX["⏳ Quick ack — 'Reading it through, ~30–60 s'<br/>(everything in chat is gated while the analysis runs)"]
    UPIDX --> UPCARD["📊 Inline scorecard card with FULL data:<br/>grade letter · 6 sub-scores · verbatim signals · insurer reputation"]
    UPCARD --> UPCHOICE{"How would you like to proceed?"}
    UPCHOICE -->|"Finish profile"| TELL
    UPCHOICE -->|"Dive into the PDF"| QA
    CMP --> PREM
    MKT --> PREM
    QA --> PREM
    PREM["💸 A live premium estimate that updates as you change your profile"] --> DONE["✅ Decide with confidence — no lead capture, no commission bias"]
    VOICE["🎙️ Optional the whole way: speak instead of type — it speaks the answers back"] -.-> TELL
    VOICE -.-> QA

Summary. A user opens the app and ends the session having decided on a plan with confidence — and how the system loops through compare / browse / Q&A / upload along the way. No backend in this view; just the human path. Every session starts fresh — there is no cross-session memory; closing the tab forgets you (privacy-by-design, see ADR-043).

How it flows:

Conversational fact-find. A short typed-or-spoken back-and-forth (English or Hindi-Hinglish) captures age, family, budget, health and what you care about — instead of a long form.
Personalised shortlist + a "why". Plans are ranked for your fit; every fact about a plan is backed by the exact clause in the real policy PDF, never invented.
Branches from the shortlist. Compare side by side, browse the full marketplace, ask follow-up questions, or upload your own policy PDF and ask about your document (kept private to your session).
Upload-PDF flow is a staged sequence (ADR-044, 2026-05-27): upload → bot says "reading it through, ~30–60 s" → all chat input is gated during the wait (Send button, textarea, voice paths all blocked so nothing can interrupt the staging) → bot pushes the inline scorecard card with FULL extracted data once the LLM pass lands → bot then asks whether you'd like to finish your profile or dive into the PDF. The card is the same shape as any catalogued policy card — six sub-scores, verbatim signals, real claim-settlement data when the insurer is recognised.
Live premium. Updates as you change the profile.
Decision. No lead capture and no commission bias — the path ends at decide, not at a sales handoff.

2.2 System at a glance — the big building blocks

The short answer. The system has four "tall buckets": Frontend (what you see), Backend (what runs on the server), Data layer (the policy knowledge), and Voice (in and out). They talk to each other over standard HTTP / JSON.

Two terms first, in one sentence each:

Frontend = everything you see on screen — the chat box, marketplace cards, sliders, profile builder. Built with Next.js + React (a standard, well-supported web-UI library). Runs in your browser.
Backend = everything that runs on the server — the LLM brain, the retrieval, the scoring/pricing logic, the upload-security gates. Built with FastAPI (a standard Python HTTP framework). Think of the frontend as the menu + waiter; the backend is the kitchen.

Both Next.js and FastAPI are deliberately boring, standard choices — they let us not spend engineering on the UI layer or the HTTP plumbing, so we spend that effort on the brain and the data, where the product differentiation actually lives.

Now the big picture — the buckets and how they talk:

flowchart LR
    subgraph FE["🌐 Frontend (browser · Next.js)"]
        UI["Chat · Marketplace · Compare · Profile builder<br/>Voice capture & playback"]
    end
    subgraph BE["⚙️ Backend (FastAPI server)"]
        API["HTTP endpoints + orchestration<br/>backend/main.py"]
        BRAIN["🧠 LLM Brain<br/>Google Gemini + function-calling tools<br/>(NIM fallback chain on failure)"]
        SCORE["🎯 Scoring + Pricing<br/>scorecard.py · premium_calculator.py"]
        PROF["👤 Profile (in-memory only)<br/>session_state.SessionState · 1h idle TTL"]
    end
    subgraph DATA["📚 Data layer"]
        VEC["Vector DB (Chroma) — policy chunks<br/>+ per-session quarantine (uploads)"]
        FACTS["Curated facts JSON<br/>40-data/policy_facts/*.json"]
    end
    subgraph VOICE["🎙️ Voice"]
        STT["Sarvam STT (in)"]
        TTS["Sarvam TTS (out)"]
    end

    UI <-->|"text · JSON"| API
    UI -->|"audio"| STT --> API
    API --> TTS --> UI
    API <--> BRAIN
    BRAIN <-->|"retrieve_policies"| VEC
    BRAIN <-->|"get_policy_facts"| FACTS
    BRAIN <-->|"save_profile_field"| PROF
    BRAIN --> SCORE
    SCORE <--> FACTS
    SCORE <--> PROF

Summary. Four building blocks talk over HTTP / JSON: Frontend (the chat UI you see), Voice (Sarvam STT in + TTS out), Backend (FastAPI with four sub-blocks — orchestration, LLM Brain, Scoring + Pricing, Profile & Persistence), and the Data layer (Chroma vectors + curated JSON facts).

How it flows:

1. Frontend (browser · Next.js). Renders chat, marketplace, compare, and the profile builder. Sends typed text and audio over HTTP, plays the synthesised reply.
2. Voice. Sarvam STT (in) turns spoken audio into a text turn; Sarvam TTS (out) turns the reply text back into spoken audio.
3. Backend (FastAPI). Four sub-blocks — 3a HTTP endpoints + orchestration (backend/main.py); 3b LLM Brain (Gemini + function-calling tools; NIM fallback on failure); 3c Scoring + Pricing (scorecard.py + premium_calculator.py); 3d Profile (in-memory only — session_state.SessionState, no disk).
4. Data layer. Two stores — the Chroma vector DB (shared policy chunks + per-session quarantine for uploads) and curated JSON facts at 40-data/policy_facts/*.json. The brain, scoring, and pricing all read from these.

Diagram legend (used throughout §2):

Solid arrow (→) = a real call / data flow on the request path.
Double arrow (⇄) = bidirectional — one side calls, the other returns.
Dotted arrow (-.->) = a side-channel or async event — voice playback, barge-in interrupt, end-of-turn persistence, etc. — not on the main request path.
Subgraph box = everything inside runs in one place (one process / one service / one storage layer).
Edge labels (e.g. "retrieve_policies") name the actual function or signal carried on that edge.

2.3 Functional abstraction — what happens inside each building block

flowchart TB
    subgraph FE["1. Frontend"]
        direction TB
        F1["capture_input<br/>typed text · spoken audio"]
        F2["render_reply<br/>chat · cards · scorecard · audio"]
    end

    subgraph V["2. Voice"]
        direction TB
        V1["transcribe<br/>Sarvam Saarika STT"]
        V2["synthesize<br/>voice_format → Sarvam Bulbul TTS"]
    end

    subgraph BE["3. Backend"]
        direction TB
        subgraph BE_API["3a. HTTP + orchestration"]
            A1["route_request"]
            A2["orchestrate_turn"]
        end
        subgraph BE_BRAIN["3b. LLM Brain"]
            B1["handle_turn<br/>one Gemini call + tool loop"]
            B2["fact_find<br/>save_profile_field"]
            B3["retrieve<br/>retrieve_policies"]
            B4["lookup_facts<br/>get_policy_facts"]
            B5["recommend<br/>mark_recommendation"]
        end
        subgraph BE_SCORE["3c. Scoring + Pricing"]
            SC1["grade_per_profile<br/>scorecard.py"]
            SC2["estimate_premium<br/>premium_calculator.py"]
        end
        subgraph BE_PROF["3d. Profile (in-memory)"]
            P1["update_session_profile<br/>session_state.SessionState"]
            P2["evict_on_idle<br/>1h TTL · no disk"]
        end
    end

    subgraph DATA["4. Data layer"]
        direction TB
        D1["vector_search<br/>Chroma · BGE-small"]
        D2["fact_lookup<br/>40-data/policy_facts/*.json"]
    end

    %% forward edges (input / down the pipeline)
    F1 -->|"audio"| V1
    F1 -->|"text · JSON"| A1
    V1 --> A1
    A1 --> A2
    A2 --> B1
    B1 --> B2
    B1 --> B3
    B1 --> B4
    B1 --> B5
    B2 --> P1
    B3 --> D1
    B4 --> D2
    A2 --> SC1
    A2 --> SC2
    SC1 -->|"reads"| D2
    SC2 -->|"reads"| D2
    SC1 -->|"reads"| P1
    SC2 -->|"reads"| P1
    P1 -.->|"idle 1h"| P2

    %% return edges (output / back to caller)
    D1 -.->|"top-k chunks"| B3
    D2 -.->|"per-policy facts"| B4
    SC1 -.->|"grade"| A2
    SC2 -.->|"premium range"| A2
    B1 -.->|"reply + citations"| A2
    A2 -.->|"text"| F2
    A2 -.->|"speak?"| V2
    V2 -.->|"audio"| F2

    %% blue solid = forward · orange dashed = return
    linkStyle 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17 stroke:#1565c0,stroke-width:2px
    linkStyle 18,19,20,21,22,23,24,25 stroke:#e65100,stroke-width:2px,stroke-dasharray:6 3

Legend. Blue solid = forward flow (input / call down the pipeline). Orange dashed = return flow (result / reply back up).

Summary. Inside each building block from §2.2, a small set of named functions fires per turn — Frontend captures and renders, Voice transcribes and synthesises, the four Backend sub-blocks orchestrate / decide / score / remember, and the Data layer answers their reads.

How it flows:

1. Frontend. capture_input accepts typed text or recorded audio; render_reply paints chat + marketplace cards + scorecard + audio playback.
2. Voice. transcribe is the inbound path (Sarvam Saarika STT); synthesize is the outbound path (voice_format normalises money / Indic shorthand → Sarvam Bulbul TTS).
3a. HTTP + orchestration. route_request maps the URL to a handler; orchestrate_turn is the per-turn supervisor — it owns the request lifecycle and ties brain + scoring + voice + persistence together.
3b. LLM Brain. One handle_turn per turn calls Gemini, which chooses which of fact_find / retrieve / lookup_facts / recommend to run as tools. The brain may only state what its tools returned.
3c. Scoring + Pricing. grade_per_profile and estimate_premium read curated facts and the live profile, compute on every request (never stored), and hand back to orchestrate_turn.
3d. Profile (in-memory). update_session_profile reflects each fact_find write into the live SessionState.profile. State lives in process memory only; an idle session is evicted after 1 h. There is no disk persistence and no cross-session recall (see ADR-043, 2026-05-27).
4. Data layer. Two reads — vector_search for free-form Q&A, and fact_lookup for decision-critical numbers with verbatim quotes. The data layer does no writes during a request — those happen offline only (vector ingest, curated-facts edits).

2.4 LLM brain + fail-loud fallback chain

flowchart LR
    Q["chat turn"] --> G{"Gemini<br/>gemini-2.5-flash-lite"}
    G -->|"OK"| ANS["grounded reply<br/>(only from tool results)"]
    G -->|"real failure / cold-start 503"| H["backend/llm_health.py<br/>background probe + sticky-primary election"]
    H --> NIM["NVIDIA NIM open-model chain<br/>backend/nim_fallback.py"]
    NIM -->|"healthy model"| ANS
    NIM -->|"whole chain down"| LOUD["explicit 'service degraded'<br/>(never a silently wrong answer)"]
    ANS --> GUARD["prose-grounding guard:<br/>every policy/UIN named is verified<br/>against retrieve_policies + get_policy_facts"]
    GUARD --> OUT["sent to user"]

Summary. How a chat turn is served by the primary LLM, what happens when it fails, and the structural guard that prevents a silently wrong answer.

How it flows:

Primary path. Gemini (gemini-2.5-flash-lite). On a healthy response → the reply is built only from what the tools returned.
Fallback path (fail-loud). A real Gemini failure or a cold-start 503 routes through backend/llm_health.py (a background probe with sticky-primary election) to the NVIDIA NIM open-model chain (nim_fallback.py). One healthy model in that chain serves the turn.
Last resort. If the whole chain is down, the user gets an explicit "service degraded" message — never a silently wrong answer.
Prose-grounding guard. Before a reply is sent, every policy / UIN named in the prose is verified against the same retrieve_policies and get_policy_facts results the brain saw (with an exemption for genuine catalogue UINs). Faithfulness is structural, not bolt-on.

Why a single brain (not a multi-model pipeline). Earlier designs split the work across several LLM passes (a separate fact-find brain, a QA brain, a faithfulness-judge). That scaffolding was removed: a single frontier model with well-designed tools is more accurate, far simpler, and eliminates a whole class of cross-model contract bugs. Today there is exactly one brain call per turn plus its tool calls. Faithfulness is enforced structurally — the brain can only state what retrieve_policies and get_policy_facts returned — rather than by a second grader model.

More on the fallback chain. The brain's primary is Gemini (gemini-2.5-flash-lite). On a real Gemini failure or a cold-start 503, the turn falls back to an NVIDIA NIM chain of open models. Candidate selection uses a background health probe with sticky-primary election (backend/llm_health.py) so one healthy model is chosen per call. The fallback is fail-loud: if the whole chain is down the user gets an explicit "service degraded" message, never a silently wrong answer. (A separate LLM "judge" existed historically and has been retired — the single-brain design made it redundant.)

Sticky-session retry policy (hardened 2026-05-27). Once a session has completed at least one successful single-brain turn, it stays on single_brain for the rest of its lifetime — cross-fading to nim_fallback mid-stream would discard last_recommendation_ids / last_retrieved_chunks / slug_to_insurer. To absorb Gemini's intermittent "high demand" 503 bursts on sticky sessions, _gemini_call now uses an adaptive retry schedule: non-sticky session keeps 1 retry with a 1.5 s backoff (fast-fail to NIM on cold-start); sticky session gets 2 retries with jittered exponential backoffs (1.5 s → 3 s, ±25 % jitter). If the chain still fails after retries, the user sees a plain, honest reply "My model service had a brief blip on that turn — please send the same message again." (no more misleading "could you say that again?").

2.5 Voice pipeline (in / out, with barge-in)

flowchart LR
    MIC["mic — tap-to-talk (touch) / push-to-talk (desktop)"] --> MR["MediaRecorder (authoritative audio)"]
    MIC -.->|"live interim text"| WS["Web Speech API (display only)"]
    MR --> STT["/api/transcribe → Sarvam Saarika STT"]
    STT --> BR["single_brain.handle_turn"]
    BR --> RPL["reply text + citations"]
    RPL --> VF["voice_format.py<br/>money/Indic normalise · chunk at sentence bounds"]
    VF --> BUL["Sarvam Bulbul TTS"]
    BUL --> PLAY["in-DOM &lt;audio&gt;"]
    SPK["user speaks over bot"] -.->|"barge-in"| PLAY
    SPK -.->|"abort in-flight"| BR

Summary. How spoken input becomes a chat turn, how the reply becomes speech back, and how the user can interrupt mid-answer.

How it flows:

Capture. Tap-to-talk (touch) or push-to-talk (desktop) starts MediaRecorder (the authoritative audio) and Web Speech (a live interim transcript shown on screen but never trusted for the turn).
STT. The authoritative audio is sent to /api/transcribe (Sarvam Saarika — Indian-accent + Hinglish aware).
Brain → reply. The transcript runs through single_brain.handle_turn exactly like a typed turn.
TTS. voice_format.py normalises money / Indic shorthand and chunks at sentence bounds (so long replies are spoken in full); Sarvam Bulbul speaks; an in-DOM <audio> element plays.
Barge-in. The user speaking over the bot pauses playback and aborts the in-flight /api/chat, so the bot stops mid-thought rather than over-talking.

More on voice. The browser shows a live interim transcript via the Web Speech API while MediaRecorder captures the authoritative audio, which is sent to /api/transcribe (Sarvam Saarika STT). Replies are synthesised by Sarvam Bulbul TTS, with money / Indic shorthand normalised in backend/voice_format.py before synthesis (long replies are chunked at sentence boundaries so the full answer is spoken, not just the first sentence), and played through an in-DOM <audio> element. Speaking over the bot (barge-in) pauses that audio and aborts the in-flight /api/chat request. On touch devices voice is tap-to-talk; on desktop, push-to-talk; the live interim transcript accumulates the full utterance while you speak.

2.6 Profile & personalisation (in-memory only)

flowchart TB
    A["user answers (chat or profile builder)"] --> SPF["save_profile_field → SessionState.profile<br/>(in-memory dict only)"]
    SPF --> FIT["scorecard fit + grade<br/>(reads live profile)"]
    SPF --> PREM["illustrative premium<br/>(reads live profile)"]
    SPF --> TURN["next chat turn<br/>(brain sees full live profile)"]
    SPF -.->|"1h idle"| EVICT["session evicted from memory<br/>profile gone forever"]
    SPF -.->|"close tab"| EVICT
    RECOV["server restart / 1h idle WHILE tab still open<br/>+ chat_history carried by browser"] -.-> SR["STATE-RECOVERY MODE<br/>brain rebuilds profile from chat_history<br/>(in-session only · never reads disk)"]
    SR --> SPF

Summary. The profile is captured into a per-session in-memory dict (SessionState.profile), feeds scoring + pricing + the next turn's brain prompt, and is discarded the moment the session evicts (1 h idle or "Clear chat"). There is no on-disk persistence and no cross-session recall. The in-session STATE-RECOVERY path covers container restarts by rebuilding the profile from the chat history the browser still carries — it never touches disk.

How it flows:

Capture. Every answer (chat or profile-builder form) is written via save_profile_field (or the POST /api/profile endpoint) into the live SessionState.profile. This is a regular Python dataclass field on the in-memory session object.
Drives scoring + pricing. The same profile feeds the scorecard fit-and-grade (§3.2) and the live premium estimate (§3.3) on every request — both reads, never persisted.
Evicted on idle / close. A session is evicted from the _sessions dict after 1 h of inactivity (_TTL_SECONDS). Hitting Clear chat (POST /api/session/clear) evicts immediately. Closing the tab disconnects the browser — the server-side session ages out on the same TTL.
State recovery (in-session only). If the server restarted or the session evicted while the browser still has the chat open, the client re-sends its chat_history with the next turn. The brain enters STATE-RECOVERY MODE and silently re-captures the facts already stated in history — without ever asking the user's name again. This is not cross-session; it only resolves the case where the user is still in the same conversation.

Why no cross-session recall (ADR-043, 2026-05-27). An earlier design persisted profiles to 40-data/profiles/<name>.json and offered a "Welcome back, ?" prompt on return. The name-only slug key collided across distinct users (every "Rohit" wrote to the same file), which required four sequential hardening passes — prompt redaction, match-before-merge guards, same-turn fact extractors, a two-fact gate — to keep contained. The cost/benefit for an insurance-shopping app (rare-purchase, return sessions uncommon) didn't justify the surface. The simpler "session is in-memory only" model matches the privacy story the product wants to tell.

2.7 Data architecture — where the JSON lives, where the vectors live

Summary. Two complementary kinds of data power the bot: small JSON files of human-reviewed facts versioned with the code, and a Chroma vector database of the full policy text held in a separate HF dataset. JSON answers exact-number questions; vectors answer free-form Q&A.

2.7.1 The two data kinds, side by side

	JSON files	Vector database (Chroma)
What's in it	Per-policy curated structured fields — CSR%, ICR%, complaints/10k, room-rent rule, waiting periods, sub-limits, grade — each value carries its verbatim `source_quote` from the PDF	Full text of every policy PDF, chunked into ~500-token overlapping pieces, embedded as 384-d vectors with BGE-small-en-v1.5
File location	`40-data/policy_facts/.json` — inside the code repo*, versioned in git	`rag/vectors/` — git-ignored on the laptop, lives in the HF dataset `rohitsar567/insurance-bot-data`, pulled at Docker build via `huggingface_hub.snapshot_download`
Size	~150 files × few KB each — tiny, fits in git	~7.3 k chunks × 384 dims + raw PDFs (hundreds of MB) — too big for git
Built by	Offline ingest (`rag/extract.py` + `schema.py`) — LLM-assisted extraction, human-reviewed, committed to git	Offline ingest (`rag/ingest.py`) — chunked + embedded once, published to HF dataset
Read by (at request time)	`get_policy_facts` tool (LLM brain) · `scorecard.py` · `premium_calculator.py` · marketplace-card renderer	`retrieve_policies` tool (LLM brain) — only
Used for	Decision-critical exact numbers cited on cards, in the scorecard, and in pricing — with the verbatim PDF quote shown	Free-form Q&A grounded in actual policy wording — "what does this plan cover during pregnancy?"

2.7.2 Other data files (smaller, supporting)

40-data/reviews/ — sourced insurer reviews (claims stories, regulator notes).
40-data/premiums/ — illustrative public rate-card combinations consumed by the multivariate premium estimator (§3.3).
40-data/insurer_network.json — hospital-network counts per insurer. Pre-ADR-043 there was also a 40-data/profiles/<name>.json directory of saved user profiles for cross-session recall. That mechanism was removed (see §2.6 — sessions are now in-memory only).

All three remaining stores sit in the code repo (under 40-data/) because they're small, human-reviewed, and decision-critical — safe to version alongside the code.

2.7.3 Where each piece physically lives

flowchart LR
    LAPTOP["💻 Local dev laptop<br/>source of truth before any push"]
    subgraph CODE["📁 Code repository (mirrored to two remotes)"]
        direction TB
        GH["GitHub — public mirror<br/>rohitsar567/insurance-sales-bot"]
        HFS["HF Space — code + running app<br/>rohitsar567/InsuranceBot"]
    end
    HFD["🤗 HF Dataset<br/>rohitsar567/insurance-bot-data<br/>PDF corpus + prebuilt Chroma vectors"]
    APP["🌐 Live app (HF Space container)"]

    LAPTOP -->|"git push github"| GH
    LAPTOP -->|"git push origin"| HFS
    LAPTOP -.->|"offline ingest publishes here"| HFD
    HFS -->|"Docker build"| APP
    HFD -.->|"snapshot_download at build"| APP

    linkStyle 0,1,3 stroke:#1565c0,stroke-width:2px
    linkStyle 2,4 stroke:#e65100,stroke-width:2px,stroke-dasharray:6 3

Summary. Three physical homes: the laptop (source of truth before any push), the code repo (mirrored to two git remotes — GitHub for reviewers and the HF Space's own repo for deployment), and the HF dataset (the heavy binaries).

How it flows:

Local laptop. Single source of truth before any push. All editing happens here.
Code repo — two remotes. git push github updates the GitHub public mirror (for reviewers). git push origin updates the HF Space's own repo and triggers the Docker rebuild.
HF dataset (offline channel). Heavy binaries — PDF corpus + prebuilt Chroma vectors — are published here separately from the code so the deployable image stays small.
Live container. On every Docker build, huggingface_hub.snapshot_download hydrates rag/corpus/ and rag/vectors/ from the HF dataset; the FastAPI app then has both data kinds available.

2.7.4 Offline ingest pipeline (built once, not on the request path)

flowchart LR
    PDF["📄 Raw policy PDFs<br/>rag/corpus/"] --> ING["rag/ingest.py<br/>chunk pages"]
    ING --> EMB["embed<br/>BGE-small · local · 384-d"]
    EMB --> VEC["Chroma vector store<br/>rag/vectors/"]
    ING --> XT["rag/extract.py + schema.py<br/>structured fact extraction (LLM-assisted)"]
    XT --> JSON["40-data/policy_facts/*.json<br/>+ verbatim source_quote"]
    XT --> DUCK["policies.duckdb<br/>structured rollup"]
    VEC -.->|"published to"| HFD["HF dataset"]
    PDF -.->|"published to"| HFD
    JSON -->|"versioned with code"| CODE["Code repo"]

    linkStyle 0,1,2,3,4,5,8 stroke:#1565c0,stroke-width:2px
    linkStyle 6,7 stroke:#e65100,stroke-width:2px,stroke-dasharray:6 3

Summary. A single offline pipeline turns each raw PDF into two artefacts: chunked embeddings for free-form retrieval, and a structured JSON of decision-critical fields with verbatim quotes. Vectors → HF dataset; JSON → code repo.

How it flows:

Chunking + embedding. rag/ingest.py splits each PDF into overlapping ~500-token chunks; BGE-small encodes each chunk into a 384-d vector; Chroma persists them to rag/vectors/.
Extraction. rag/extract.py + schema.py run an LLM-assisted pass to pull structured fields (waiting periods, room-rent caps, CSR%, etc.) into a schema-validated JSON — with the verbatim source_quote that justifies each value.
Two destinations. Vectors and raw PDFs are published to the HF dataset (too big for git); JSON files are versioned with the code so a reviewer can see exactly what facts feed the marketplace cards.
Why split. Chunks power free-form Q&A; the JSON powers the marketplace cards, scoring, and pricing — two different queries, two different data shapes.

Provenance rule. Every policy fact shown to a user traces to a real clause in a real PDF. Where a document genuinely doesn't state something, it is recorded as a sourced-null ("not stated in <file>.pdf") — never invented or back-filled.

2.8 Uploaded-PDF flow — 8 security gates → Gemini extraction → catalogued-grade card

flowchart TB
    UP["/api/upload-policy (public web)"] --> G1["1 File mechanics<br/>%PDF · size band · %%EOF · no exe/JS"]
    G1 --> G2["2 Content quality<br/>≥1500 chars · ≥3 pp · domain keyword"]
    G2 --> G3["3 Prompt-injection sweep"]
    G3 --> G4["4 Per-session rate limit"]
    G4 --> G5["5 Per-IP rate limit"]
    G5 --> G6["6 Encrypted/locked → reject"]
    G6 --> G7["7 Page-count ceiling (>200)"]
    G7 --> G8["8 Hash dedupe + reject-cache"]
    G8 -->|"pass"| QC["per-session QUARANTINE Chroma + global policies collection<br/>BGE-small embeddings · 24h idle TTL"]
    QC --> HEUR["Heuristic baseline (synchronous, sub-second)<br/>regex/keyword over PDF text<br/>writes UPLOADED_DOCS_DIR/&lt;pid&gt;/record.json @ ~30-50% completeness"]
    HEUR -.->|"detect_insurer_slug<br/>match against 21 known insurers"| INS["insurer_slug = manipalcigna / hdfc-ergo / ...<br/>(or 'user-upload' on no match — fail-closed)"]
    HEUR --> ACK["HTTP 200 returns immediately<br/>frontend pushes ack 'Got it — reading X, ~30-60s'<br/>EVERY chat input GATED via extractionInFlight"]
    ACK --> CACHE{"sha256(pdf_bytes)<br/>seen before?"}
    CACHE -->|"hit"| COPY["copy prior rag/extracted/&lt;other_pid&gt;.json → this pid<br/>llm_used='hash-cache' · ~1s"]
    CACHE -->|"miss"| LLM["Background extract_one_for_upload<br/>Gemini 2.5-flash · 3 retries (2/4/8s ± 25% jitter)<br/>llm_used='gemini-2.5-flash#N'<br/>llm_response_chars logged for ops"]
    LLM -->|"all fail"| NIM["NIM fallback chain<br/>llm_used='nim-fallback'"]
    LLM -->|"success"| WRITE["write rag/extracted/&lt;pid&gt;.json"]
    NIM -->|"success"| WRITE
    NIM -->|"fail"| FLOOR["heuristic record.json wins<br/>status='failed' but card still renders at ~47% / grade C"]
    WRITE --> MERGE["merge LLM scalars INTO record.json<br/>LLM value wins where non-empty<br/>heuristic stays where LLM silent"]
    COPY --> MERGE
    MERGE --> BUST["invalidate _MG_CACHE (marketplace grade cache)"]
    BUST --> RESOLVE["_catalogue_scorecard(pid, None)<br/>SAME resolver /api/policies/&lcub;id&rcub;/scorecard uses"]
    RESOLVE --> STATUS["_set_extraction_status(complete, comp, grade, llm_used, llm_response_chars)<br/>BY CONSTRUCTION equal to card endpoint"]
    FLOOR --> STATUS
    STATUS --> POLL["frontend GET /api/upload/extraction-status/&lcub;pid&rcub;<br/>every 3s, max 120s"]
    POLL --> CARD["pushAssistant(card_ready, citations=[&lcub;pid&rcub;])<br/>then pushAssistant(choice_prompt)<br/>setActiveUploadPid(pid) · setExtractionInFlight(false)"]
    CARD --> ENABLE["Send + textarea + PDF + voice all re-enabled<br/>view_context.active_policy_id=&lcub;pid&rcub; on next chat turn<br/>→ single_brain enters ACTIVE POLICY DIVE-IN mode (KI-330)"]
    G1 & G2 & G3 & G4 & G5 & G6 & G7 & G8 -->|"fail"| REJ["clean rejection (reason surfaced)"]

Summary. The full pipeline an uploaded PDF traverses to become a catalogued-grade card with the same data depth as the 148 pre-curated policies — from HTTP request through Gemini extraction through inline chat card.

How it flows:

The 8 security gates, in order. (1) File mechanics — %PDF magic, 5 KB–25 MB size band, well-formed %%EOF, no embedded executables / JavaScript / launch actions. (2) Content quality — ≥1500 extractable chars, ≥3 pages, at least one insurance-domain keyword. (3) Prompt-injection sweep — "ignore previous instructions", "reveal your system prompt", jailbreak patterns. (4) Per-session rate limit. (5) Per-IP rate limit (catches session-ID rotation). (6) Encrypted/locked PDF — rejected cleanly. (7) Page-count ceiling (>200 pages — an abuse/bundle vector). (8) Hash dedupe + reject-cache — identical re-uploads short-circuit.
Beyond identical-file dedup. A UIN net-new check also runs — if the PDF's IRDAI UIN already belongs to a catalogued policy, the caller is pointed at the existing marketplace card instead of indexing a duplicate. PDF-text fuzzy matching also runs — if the upload content identifies as a known catalogued product (matching insurer + product-name patterns), the upload endpoint resolves to the existing <insurer-slug>__<product> id and reuses the curated card, skipping fresh extraction entirely (best UX for known products).
On pass. Chunks land in a per-session quarantine Chroma collection — session-isolated, 24 h idle TTL — AND in the shared policies collection so the upload can become a marketplace card.
The locked chat sequence (ADR-044). ack → gated wait → card → choice. The frontend's extractionInFlight flag disables Send, textarea, PDF button, and EVERY voice path (PTT / Sarvam / voice auto-submit) for the entire wait window. The choice prompt NEVER fires before the card-bearing message lands. Live-verified by Playwright: ack at idx 155 → card/fail at idx 315 → choice at idx 500, strictly ordered. Inputs re-enable in the same render as the choice prompt.
The two LLM fast paths. Hash-cache — sha256(pdf_bytes) matches a prior successful extraction → copy that rag/extracted/<pid>.json to this pid, surface llm_used="hash-cache", ~1 s. Gemini path — 3 retries with jittered exponential backoff (2/4/8 s ± 25 %), surfaces llm_used="gemini-2.5-flash#N" where N is the successful attempt, plus llm_response_chars so the operator can see WHICH LLM landed the extraction without HF Space stdout access.
The heuristic floor is a hard guarantee, now significantly fatter (KI-332, 2026-05-27). build_record() runs synchronously inside the upload HTTP call (sub-second) and writes record.json BEFORE the LLM ever fires. The pattern set was expanded from ~16 fields to ~28+ on the 2026-05-27 hardening pass: sum-insured ladder detection (₹3L / ₹5L / ₹10L → list), policy_type, min entry age, child entry days, lifelong-renewability flag, grace period, free-look period, geographic coverage, ICU capping, deductible amount, NCB cap %, organ donor / critical illness / preventive checkup / domiciliary / newborn presence booleans, premium payment modes. Local synthetic test hits 32 fields. Expected upload completeness on LLM-fail rises from ~47.8 % to ~65–70 %. If Gemini fails all 3 retries AND the NIM fallback fails, the card still renders at this richer floor — never fabricated, never a generic "Retry" placeholder. Verified on a hard Test Policy.pdf (8 MB) where Gemini 3/3 retries returned malformed JSON: card still landed with the expanded heuristic data.
Multi-pass per-section extraction for big PDFs (KI-332, 2026-05-27). For uploads with ≥ 25 K chars of extracted text (e.g. dense 100+ page policy wordings, 8 MB PDFs), the single-pass Gemini call reliably truncates JSON mid-emission — the HealthPolicy schema has ~40 fields and a complete output with verbatim quotes can exceed Gemini 2.5-flash's reliable output budget. Solution: split the schema into 7 logical sections (identity, eligibility, financial, waiting periods, coverage, limits, network+claims) and run each as its own smaller Gemini call IN PARALLEL via asyncio.gather. Each call carries ~15 % of the schema → fits comfortably in budget. Failure-isolated: 6/7 sections landing produces a partial extraction strictly better than the heuristic floor. Same wall-clock cost as single-pass (parallel). On total multi-pass failure, falls through to the legacy single-pass + NIM chain (heuristic floor still wins). Activation: len(text) ≥ 25_000 triggers multi-pass; smaller PDFs keep using single-pass (faster, cheaper, works fine).
Status endpoint == scorecard endpoint by construction. When the background extraction finalises, it calls the SAME _catalogue_scorecard(pid, None) resolver that /api/policies/{id}/ scorecard uses. The completeness_pct and overall_grade on GET /api/upload/extraction-status/{pid} are therefore byte- identical to what the inline card renders. Same applies to the hash-cache short-circuit (which had to be fixed separately — earlier draft of the cache branch called build_scorecard(...) without insurer_reviews AND read .overall_grade instead of .grade, silently reporting status=17.4 % / grade None while the card showed 47.8 % / C). 2026-05-27 multi-PDF audit (commit 58e3c82) confirmed parity across manipalcigna, hdfc-ergo, care-health, icici-lombard, star-health, Test Policy.pdf.
Post-card dive-in mode (KI-330). After the card lands the frontend sets activeUploadPid and plumbs it into every chat turn's view_context.active_policy_id. single_brain.handle_turn reads that and prepends an ACTIVE POLICY DIVE-IN block to the system instruction, forcing the brain to answer policy-specific questions via retrieve_policies + get_policy_facts on that pid instead of pivoting to "let me pull your recommendations". Verified 9/10 on the post-fix audit (up from 0/10 pre-fix).
On fail. A clean rejection naming the gate; the file is deleted; nothing is embedded.

Operator endpoints (admin-only):

POST /api/admin/upload/reextract?force=<bool> — re-runs extract_one_for_upload for every persisted upload that lacks a rag/extracted/<pid>.json, or for ALL persisted uploads with force=true. Wired to a startup hook so every container boot upgrades legacy uploads automatically.
GET /api/upload/extraction-status/{policy_id} — the live state of the in-memory _UPLOAD_EXTRACTION_STATUS dict. Fields: status (pending | running | complete | failed | unknown), llm_used (gemini-2.5-flash#N | nim-fallback | hash-cache), llm_response_chars, completeness_pct, overall_grade, started_at, completed_at, error.

2.9 Deployment

flowchart LR
    DEV["git push"] --> ORI["HF Space remote (origin)"]
    DEV --> GH["GitHub mirror (github)"]
    ORI --> BUILD["HF Space — Docker build"]
    BUILD --> SNAP["huggingface_hub.snapshot_download<br/>hydrate rag/ (corpus + vectors) from HF dataset"]
    BUILD --> FE["build Next.js static export"]
    SNAP --> RUN["entrypoint.sh → uvicorn backend.main:app :7860"]
    FE --> RUN
    RUN --> LIVE["live Space (FastAPI also serves the frontend)"]
    LIVE --> CHK["verify reported build SHA advanced<br/>(LFS/quota push can fail silently)"]

Summary. How a git push becomes a live Space, end to end.

How it flows:

Two remotes. origin = the Hugging Face Space (a push here triggers the Docker rebuild). github = the public mirror reviewers read.
Build. The HF Space rebuilds the image, installs the backend, builds the Next.js static frontend, and runs huggingface_hub.snapshot_download to hydrate rag/ (PDF corpus + prebuilt vectors) from the insurance-bot-data dataset — so the Space repo itself stays code-only and small.
Start. entrypoint.sh launches uvicorn backend.main:app on $PORT (default 7860); FastAPI also serves the exported frontend.
Verify. Always confirm the Space's reported build SHA actually advanced before trusting that new code is live — an LFS/quota push can fail without surfacing an error.

3. Key functions in plain language

Summary. Seven internal jobs make the bot work. Each one gets a sequence diagram showing what calls what, a ≤50-word summary, and a step-by-step explanation. An eighth subsection makes explicit what is stored vs what is live-only.

3.1 Profile construction

sequenceDiagram
    autonumber
    participant U as User
    participant API as 3a. /api/chat
    participant B as 3b. LLM Brain
    participant T as save_profile_field tool
    participant S as 3d. session_state
    U->>API: typed / spoken reply
    API->>B: handle_turn(text, profile)
    B->>T: save_profile_field("age", 32)
    T->>S: session.profile.age = 32
    S-->>T: ok
    T-->>B: saved
    B-->>API: next question (or proceed to scoring + pricing)

Summary. Every user reply runs the brain, which extracts one fact at a time and saves it via the save_profile_field tool into the live session_state.profile. No regex, no separate extractor model — extraction is a tool call.

How it flows:

User reply arrives at /api/chat (typed text or post-STT transcript).
Brain runs. handle_turn calls Gemini with the reply + the current profile + the tool schema.
Tool call. When Gemini decides a fact is captured, it calls save_profile_field(field, value) — one call per fact.
Write. The tool sets the field on session.profile (in memory). All fact-find captures are concentrated here; there is no second LLM pass.
Continue. Brain returns the next question, or — if the profile is complete enough — proceeds to scoring + pricing.
Persistence is later. End-of-turn, the whole profile is auto-persisted to disk by the orchestrator (see §3.6).

3.2 Profile-aware scoring

sequenceDiagram
    autonumber
    participant API as 3a. /api/scorecard
    participant SC as 3c. scorecard.py
    participant J as 4. policy_facts JSON
    participant R as 4. reviews JSON
    participant P as 3d. session.profile
    API->>SC: build_scorecard(policy_id, profile)
    SC->>J: read curated facts (CSR, room-rent, waitings)
    SC->>R: read insurer reviews
    SC->>P: read live profile (age, family, conditions)
    SC->>SC: 6 sub-scores → letter grade + personalised summary
    SC-->>API: grade · sub-scores · summary

Summary. Scoring grades every policy for this user — same policy, two users, two grades. Six sub-scores roll up into a letter, with a personalised one-line summary naming the strengths for this profile and the one honest caveat.

How it flows:

Trigger. Brain (or /api/scorecard directly) asks for a per-policy grade.
Three reads. scorecard.py reads (a) the policy's policy_facts JSON, (b) the insurer's reviews JSON, and (c) the live profile from session_state.
Six sub-scores. Coverage · predictability · claims · network · renewal · terms — each weighted against this profile (e.g. a smoker is penalised more in pricing predictability; an elder is penalised more in waiting-period structure).
Letter grade + summary. A grade band (A–E) and a one-line summary listing this user's top strengths and the one capping caveat.
Live, not stored. Recomputed per request. Storage would be wrong here — the grade depends on who is asking.

3.3 Profile-aware pricing — the multivariate ballpark

sequenceDiagram
    autonumber
    participant API as 3a. /api/coverage
    participant PR as 3c. premium_calculator.py
    participant J as 4. policy_facts JSON
    participant RC as 4. rate-card combinations<br/>age × metro × smoker × PED ×<br/>co-pay × deductible × tenure
    participant P as 3d. session.profile
    API->>PR: estimate(policy_id, profile)
    PR->>J: read pricing structure
    PR->>P: read profile (7 dimensions)
    PR->>RC: match combination → rate range
    PR->>PR: interpolate + apply caveats
    PR-->>API: ₹low – ₹high · point ≈ ₹X (illustrative)

Summary. Pricing is a multivariate ballpark from public rate-card combinations — age × metro × smoker × PED × chosen co-pay × deductible × sum insured. Same plan, two users, two ranges. Explicitly not real underwriting.

How it flows:

Inputs. The user's profile (seven dimensions) and the policy's pricing characteristics.
Multivariate match. premium_calculator.estimate looks up combinations across the seven dimensions and interpolates within bands.
Output. A range (e.g. ₹12 500 – ₹17 200 / yr) with a midpoint and a clear illustrative-only disclaimer.
Honest caveat. Final premium depends on the insurer's underwriting + medicals + IRDAI-filed loadings — this is a directional ballpark, not a quote.
Live, not stored. Same as scoring — recomputed per request because the answer is a function of this user.

3.4 Retrieval — `retrieve_policies` over the vector store

sequenceDiagram
    autonumber
    participant B as 3b. LLM Brain
    participant T as retrieve_policies tool
    participant E as BGE-small embedder
    participant C as 4. Chroma<br/>shared 'policies' +<br/>per-session 'quarantine'
    B->>T: retrieve_policies(query, filters, top_k)
    T->>E: encode(query) → 384-d vector
    T->>C: nearest-neighbour + metadata filter
    C-->>T: top-k chunks (text · pdf · page · policy_id)
    T-->>B: chunks — the brain may quote only from these

Summary. When the brain needs to cite policy wording, it asks the retrieval tool. The query is embedded with BGE-small, looked up in Chroma, and the top-k chunks come back with their source PDF, page, and policy_id. The brain may state nothing the tool did not return.

How it flows:

Query. Brain composes a query from the user's question + the profile.
Embed. Local BGE-small turns the query into a 384-d vector (no API hop, no rate limit).
Search. Chroma returns nearest chunks, scoped to either the shared policies collection (catalogue) or the per-session quarantine collection (user-uploaded PDFs — never crosses sessions, 24 h TTL).
Faithfulness. The brain may only state what these chunks (or get_policy_facts) returned. Anything else is a violation of the structural grounding guard (§2.4).

3.5 Curated facts — `get_policy_facts` (no embedding hop)

sequenceDiagram
    autonumber
    participant B as 3b. LLM Brain
    participant T as get_policy_facts tool
    participant F as 4. policy_facts JSON file
    B->>T: get_policy_facts([policy_id_1, policy_id_2])
    T->>F: read JSON file(s) directly
    F-->>T: CSR · complaints · grade · source_quote
    T-->>B: structured fields with verbatim PDF quote

Summary. For decision-critical numbers (claim-settlement ratio, complaints volume, waiting periods, room-rent rule, grade), the brain calls get_policy_facts which reads the curated JSON directly — no embedding hop, no LLM, exact values with the verbatim PDF quote.

How it flows:

Why this exists alongside §3.4. Retrieval is for free-form text Q&A; this is for exact, fast, source-cited number lookups. Different query shape, different mechanism.
No LLM in the path. A plain file read of 40-data/policy_facts/<id>.json. The brain quotes the value (and the source_quote) verbatim.
Used by more than the brain. scorecard.py and premium_calculator.py also read these JSON files directly — see §2.3 (the brain's edge is not the only edge into the JSON).

3.6 In-session state recovery (server restart resilience)

sequenceDiagram
    autonumber
    participant U as User
    participant Br as Browser (carries chat_history)
    participant B as 3b. LLM Brain
    participant S as 3d. SessionState (in-memory)
    U->>Br: "what about premium?"
    Br->>B: chat turn + chat_history[N msgs]
    B->>S: get_session(session_id)
    Note over S: session was evicted<br/>(1h idle / restart)<br/>profile = BLANK
    B->>B: STATE-RECOVERY MODE<br/>chat_history has prior facts
    B->>S: save_profile_field for each fact in history
    B-->>Br: reply that picks up where it left off<br/>(never re-asks the name)

Summary. Sessions are in-memory only (_TTL_SECONDS = 1h), so a container restart or long idle wipes the server-side profile. When the browser still carries the conversation, the brain silently rebuilds the profile from the chat history instead of starting over. No disk read, no cross-session memory — purely a same-conversation resilience path.

How it flows:

Detect. get_session() returns a blank SessionState, but chat_history arrives with ≥2 messages including a prior user turn → state was lost, not "fresh user".
Re-capture from history. A high-priority STATE-RECOVERY MODE prompt block tells the LLM: do not say you lost anything, do not re-ask the name, instead call save_profile_field for every fact present in the conversation so far, then continue.
Resume. From the LLM's point of view the next reply is just the next turn in an ongoing chat — the user never perceives the eviction.

(There is no cross-session recall — see §2.6 and ADR-043 for why that was removed.)

3.7 Uploaded-PDF LLM extraction (the §2.8 pipeline in code terms)

sequenceDiagram
    autonumber
    participant FE as Frontend (page.tsx)
    participant API as /api/upload-policy
    participant SEC as security.py (8 gates)
    participant UD as uploaded_docs.py
    participant H as heuristic build_record()
    participant CK as hash cache lookup
    participant G as Gemini 2.5-flash (3 retries)
    participant N as NIM fallback chain
    participant SC as scorecard.py + main._catalogue_scorecard
    participant ST as _UPLOAD_EXTRACTION_STATUS dict
    participant FE2 as Frontend poller
    FE->>API: POST multipart (PDF + session_id)
    API->>SEC: run 8 gates
    SEC-->>API: pass or reject
    API->>UD: persist_upload — writes source.pdf + meta.json to UPLOADED_DOCS_DIR/pid/
    UD->>H: build_record — regex/keyword over text
    H-->>UD: record.json at ~30-50% completeness
    UD-->>FE: HTTP 200 with policy_id
    FE->>FE: setExtractionInFlight(true) · gate ALL inputs · pushAssistant(ack)
    UD->>UD: asyncio.create_task(extract_one_for_upload)
    Note over UD: _set_extraction_status(status='running')
    UD->>CK: _find_cached_extraction(sha256(pdf_bytes))
    alt cache hit
        CK-->>UD: copy prior rag/extracted/other_pid.json
        UD->>UD: llm_used='hash-cache'
    else cache miss + text ≥25K chars
        Note over UD,G: MULTI-PASS (KI-332): 7 sections in parallel via asyncio.gather
        UD->>G: identity / eligibility / financial / waiting / coverage / limits / network
        G-->>UD: 7 partial JSONs (any landing counts as success)
        UD->>UD: merge sections · llm_used='gemini-2.5-flash-multipass'
    else cache miss + text under 25K chars
        UD->>G: single-pass chat with full schema
        alt success
            G-->>UD: HealthPolicy JSON · llm_used='gemini-2.5-flash#1'
        else timeout or malformed JSON
            UD->>G: retry #2 (2s ± 25%) then retry #3 (4s ± 25%)
            G-->>UD: HealthPolicy or fail
            UD->>N: NIM fallback (single attempt)
            N-->>UD: HealthPolicy or all-fail
        end
        UD->>UD: write rag/extracted/pid.json
        UD->>UD: merge LLM scalars INTO record.json
    end
    UD->>SC: _catalogue_scorecard(pid, None) — same as /api/policies/.../scorecard
    SC-->>UD: Scorecard (grade, data_completeness_pct, sub_scores)
    UD->>ST: _set_extraction_status(complete, comp, grade, llm_used, llm_response_chars)
    loop every 3s up to 120s
        FE2->>UD: GET /api/upload/extraction-status/pid
        UD-->>FE2: status snapshot
    end
    FE2->>FE2: pushAssistant(card_ready, citations include pid)
    FE2->>FE2: setActiveUploadPid(pid)
    FE2->>FE2: pushAssistant(choice_prompt) · setExtractionInFlight(false)

Summary. When a user uploads a PDF the backend writes a heuristic-baseline record first (sub-second), HTTP returns, and a background asyncio task either copies a prior extraction (hash cache hit) or runs Gemini 2.5-flash with 3 jittered retries → NIM fallback → heuristic floor. The status endpoint reports the SAME completeness_pct + overall_grade the card endpoint serves, by construction.

How it flows:

HTTP returns before extraction starts. extract_one_for_upload is fired with asyncio.create_task so the user sees the card-ready ack inside one second, not after 30–60 s.
Provenance, always. Every _set_extraction_status call carries llm_used (gemini-2.5-flash#1 | #2 | #3 | nim-fallback | hash-cache) and llm_response_chars. The operator can see WHICH LLM landed the extraction without HF Space stdout access — verified live on 2026-05-27.
Hash cache short-circuit. _find_cached_extraction(sha256(pdf_bytes)) looks for a prior successful extraction with the same content. On hit, the prior rag/extracted/<other_pid>.json is copied to this pid in ~1 s.
Retries are jittered exponential. 2 s / 4 s / 8 s backoffs each multiplied by random.uniform(0.75, 1.25) so repeated transient blips on a single Gemini instance don't synchronise.
Merge model. LLM output is merged INTO the heuristic record (LLM wins per-field where non-empty, heuristic stays where LLM silent) — the same "extracted + curated overlay" model the catalogued 148 use via 40-data/policy_facts/.
Status == card by construction. The status endpoint calls _catalogue_scorecard(pid, None) — the SAME resolver /api/policies/{id}/scorecard uses. If they differ, that's a bug; today they match across 5 verified uploads.

3.8 What is stored vs what is live-only

What	Where	Why
Policy PDFs + vector chunks	`rag/corpus/` + Chroma store (HF dataset → pulled at build)	Built once, offline; read every request
Curated policy facts (per policy)	`40-data/policy_facts/*.json` (code repo)	Small, human-reviewed, versioned with code
User profile (current session)	In-memory only — `SessionState.profile` (1 h idle TTL, no disk)	Closing the tab / clearing chat forgets the profile by design (ADR-043)
Per-policy grade / scorecard	Not stored — live per request	Two users get two grades for the same policy (profile-aware)
Premium range for a policy	Not stored — live per request	Same reason as the grade
Uploaded PDFs	Per-session Chroma quarantine, 24 h TTL	Isolated to the uploader, never the shared corpus
Per-turn reasoning	`logs/turns.jsonl` (one JSON line per turn)	Replay / audit; never echoed to other users

4. Safety & quality

4.1 Uploaded-PDF defence (8 gates)

/api/upload-policy accepts arbitrary PDFs from the public web — a real attack surface. backend/security.py runs every upload through eight gates before the file is ever embedded or shown to the model:

File mechanics — %PDF magic, 5 KB–25 MB size band, well-formed %%EOF, and a scan for embedded executables / JavaScript / launch actions.
Content quality — ≥1500 chars of extractable text, ≥3 pages, and at least one insurance-domain keyword (rejects scans, junk, off-topic docs).
Prompt-injection — regex sweep for "ignore previous instructions", "reveal your system prompt", jailbreak patterns, etc.
Per-session rate limit — caps uploads / chunk quota per session.
Per-IP rate limit — catches session-ID rotation.
Encrypted/locked PDF — rejected cleanly rather than stored opaque.
Page-count ceiling — >200 pages is an abuse/bundle vector.
Hash dedupe + reject-cache — identical re-uploads short-circuit.

Beyond identical-file dedup, a UIN net-new check runs on every upload: if the PDF's IRDAI UIN already belongs to a catalogued policy, the upload is recognised as not net-new and the caller is pointed at the existing marketplace card instead of a duplicate being indexed.

Accepted uploads are embedded into a separate, per-session quarantine Chroma collection (never the shared corpus), scoped by session_id so one user's document is invisible to another, and auto-purged after a 24-hour idle TTL.

4.2 Answer faithfulness

Faithfulness is structural, not bolt-on: the brain answers only from what its tools returned — retrieve_policies (policy-wording chunks) and get_policy_facts (claim-settlement ratio, complaints, scorecard and insurer-review data) — must cite, and is instructed to refuse when that grounding is weak. A prose-grounding guard verifies any policy / UIN named in the reply against both tools' returned policies before it is sent. Recommendation fit is gated (backend/scorecard.py, retrieval_filters.py) so plans that structurally don't fit the user's stated constraints are dropped, with the reason surfaced.

4.3 Evaluation

A gold Q&A harness lives at eval/run.py. Status: it is pending a re-port to the single-brain architecture (it targeted the removed orchestrator) and is intentionally hard-guarded from running so it cannot publish stale scores; see its module docstring. The automated test suite (tests/, run with pytest) is the current green gate and covers routing, scoring, premium, recall, the upload security gates, and conversation logic.

4.4 Known limitations (honest)

These are real and stated up front rather than buried:

Uploaded-doc persistence is within-session, not across restarts. Upload → graded marketplace card → grounded Q&A about the PDF all work live within a running container. But the Hugging Face Space's working filesystem is ephemeral by design (a fresh Chroma snapshot is pulled on every rebuild — see §2.7), and in practice an uploaded doc does not survive a Space rebuild/restart: the marketplace reverts to its curated/extracted baseline. Treat uploads as session-scoped. An operator/abuse prune endpoint exists (POST /api/admin/uploaded-docs/ prune, password-gated) to remove a persisted upload by id or prefix.
Uploaded-PDF field extraction is LLM-assisted, with a deterministic heuristic floor (ADR-044, 2026-05-27 hardening bundle). Every upload runs through two passes with a hash-cache fast path:
- Heuristic baseline — regex + keyword extraction over the PDF text, synchronous inside the upload HTTP call (sub-second), populates common fields like waiting periods and room-rent rule. Yields ~30–50 % data_completeness.
- Gemini extraction (3 jittered retries) — fires as a background asyncio task after the upload returns. Same Gemini 2.5-flash + same EXTRACT_SYSTEM prompt + same HealthPolicy Pydantic schema the catalogued 148 use offline. Backoffs 2 / 4 / 8 s ± 25 % jitter. On success, writes rag/extracted/<policy_id>.json and merges INTO the persisted record.json (LLM values override where present, heuristic stays where the LLM was silent). ~10–60 s total.
- NIM fallback — single attempt if all three Gemini retries fail.
- Heuristic floor — if NIM also fails, the card still renders at the heuristic baseline (~47.8 %, grade C). Verified live on Test Policy.pdf (8 MB) where Gemini 3/3 retries returned malformed JSON.
- Hash-cache short-circuit — if sha256(pdf_bytes) matches a prior successful extraction, that file is copied (~1 s) with llm_used="hash-cache" surfaced for ops. The frontend polls GET /api/upload/extraction-status/<policy_id> during the wait and renders the inline scorecard card ONLY after the LLM pass completes / fails / hits its 120 s timeout. The card is catalogued-grade — same PolicyScorecardWidget, same six sub-scores, same insurer-reputation data (detect_insurer_slug matches the PDF's legal name against the 21 known insurer slugs we have reviews data for and flips insurer_slug off the generic user-upload on a hit). Status ↔ card parity by construction: the status endpoint and the card endpoint both call _catalogue_scorecard(pid, None), so completeness_pct + overall_grade are byte-identical. Operator provenance: every status response carries llm_used (gemini-2.5- flash#N | nim-fallback | hash-cache) and llm_response_chars so the question "did Gemini actually run?" is answerable without HF Space stdout access. Post-card dive-in mode (KI-330): the just-uploaded pid becomes view_context.active_policy_id on the next chat turn, so single_brain answers policy-specific questions via retrieve_policies + get_policy_facts instead of pivoting to recommendations.
Live (BETA) voice mode uses the browser's in-built speech recognition and is labelled unstable; push-to-talk is the reliable path (warm-armed mic + pre-roll so the first word is never clipped, and long answers are chunked so nothing is truncated).
Recommendation vs. factual lookup. A factual question that names a specific policy is answerable on a cold session; broad "recommend me a plan" requests still require the short fact-find first (by design).
Admin LLM Chain — manual Refresh now and 30 s auto-poll (hardened 2026-05-27). POST /api/admin/probe runs a real serial probe of every candidate model and updates tested_at on each ModelHealth row; the admin UI's Refresh now button and the LLM Chain tab's 30 s poll now both re-fetch /api/admin/health and call renderUpdatedLabel() so the top-left "Last refresh / Next in" timer reflects the actual just-completed probe (previously it stayed frozen at the login-time snapshot, so the operator could not tell from the timer whether probing had actually happened).

5. Tech stack & key decisions

Layer	Choice	One-line why
Frontend	Next.js 16 (App Router), React 19, Tailwind v4, static export	Production-pattern UI; static export serves straight from the Space
Backend	FastAPI + Pydantic	Async I/O, typed request/response, auto OpenAPI
Brain	Google Gemini (`gemini-2.5-flash-lite`) + function calling	Frontier free-tier quality; one model + tools beats a multi-pass pipeline
Fallback	NVIDIA NIM open-model chain, health-elected	Free, diverse; fail-loud, never silently wrong
Retrieval	Chroma + BGE-small-en-v1.5 (local, 384-d)	Embedded, no infra, free, offline embeddings
Voice	Sarvam Saarika (STT) + Bulbul (TTS) + Sarvam-M (Indic)	First-class Indian-accent / Hinglish handling
Hosting	Hugging Face Space (Docker) + companion HF dataset	Free, GitHub-mirrored; code/data split keeps the image small

Decisions are deliberately biased toward one deployable artifact, no fabrication, fail loud. The single-brain consolidation, the NIM-only fallback (structured-output reliability over cross-provider breadth), the local embeddings (zero rate limits, offline ingest) and the code/data repo split are the load-bearing ones.

6. Repository map

At a glance — the root is intentionally small; you only need to know these:

backend/ — FastAPI app + the brain, tools, retrieval, scoring, security
frontend/ — the Next.js web app
rag/ — retrieval + offline ingest (corpus/vectors are git-ignored, pulled at build)
40-data/ — curated, human-reviewed policy facts (versioned with code)
tests/ — the pytest green gate
root files: Dockerfile, entrypoint.sh, requirements.txt, pytest.ini, README.md

Full directory tree — click to expand

.
├── backend/                  FastAPI app
│   ├── main.py               HTTP routes (chat, transcribe, upload, profile, …)
│   ├── single_brain.py       THE brain — Gemini + function-calling tools
│   ├── brain_tools.py        the tools the brain can call (retrieval, profile, …)
│   ├── nim_fallback.py       NIM fallback when Gemini fails / cold-start 503
│   ├── llm_health.py         background probe + sticky-primary election
│   ├── security.py           the 8 upload-defence gates
│   ├── scorecard.py /        recommendation fit + scoring
│   │   retrieval_filters.py
│   ├── premium_calculator.py profile → illustrative premium
│   │   sum_insured.py
│   ├── session_state.py      per-session profile (in-memory only, ADR-043)
│   ├── uploaded_docs.py      user-uploaded PDF pipeline (ADR-044):
│   │                          - persist_upload() — heuristic baseline
│   │                            (sub-second regex/keyword) + sha256 of
│   │                            PDF bytes → record.json + meta.json
│   │                          - build_record() — heuristic floor;
│   │                            guaranteed BEFORE any LLM fires
│   │                          - extract_fields_from_text() — 28+ regex
│   │                            patterns (KI-332 expansion 2026-05-27):
│   │                            sum_insured_options_inr ladder, policy_type,
│   │                            min/max entry age, child entry days,
│   │                            lifelong renewability flag, grace period,
│   │                            free-look period, geographic_coverage,
│   │                            ICU capping, deductible, NCB cap,
│   │                            organ/CI/preventive/domiciliary/newborn
│   │                            presence, premium payment modes — lifts
│   │                            floor from ~47.8% to ~65-70% even when
│   │                            ALL LLM passes fail
│   │                          - detect_insurer_slug() — match PDF text
│   │                            against 21 known insurer patterns; flips
│   │                            insurer_slug off 'user-upload' on hit
│   │                          - _multipass_extract_with_gemini() —
│   │                            (KI-332) 7-section parallel extraction:
│   │                            identity/eligibility/financial/waiting/
│   │                            coverage/limits/network_claims, each as
│   │                            its own Gemini 2.5-flash call via
│   │                            asyncio.gather. Fires for PDFs ≥25K chars
│   │                            where single-pass would truncate.
│   │                          - extract_one_for_upload() — background
│   │                            asyncio task. Resolution order:
│   │                              1. hash-cache (sha256 hit)
│   │                              2. multi-pass per-section (≥25K chars)
│   │                              3. single-pass Gemini (3 jittered retries)
│   │                              4. NIM fallback (single attempt)
│   │                              5. heuristic floor (always wins
│   │                                 because record.json already exists)
│   │                            Writes rag/extracted/<pid>.json, merges
│   │                            LLM scalars INTO record.json (LLM wins
│   │                            where non-empty, heuristic stays where
│   │                            silent)
│   │                          - _find_cached_extraction() — sha256
│   │                            lookup across UPLOADED_DOCS_DIR/*/meta.json
│   │                            for prior successful extractions
│   │                          - _set_extraction_status() — finalises
│   │                            status using main._catalogue_scorecard(pid)
│   │                            so completeness_pct + overall_grade match
│   │                            the card endpoint BY CONSTRUCTION (success
│   │                            path AND fail path, post-KI-333)
│   │                          - Provenance fields: llm_used
│   │                            (gemini-2.5-flash#N | gemini-2.5-flash-multipass
│   │                             | nim-fallback | hash-cache) +
│   │                            llm_response_chars
│   │                          - backfill_extractions() — startup hook
│   │                            re-runs LLM extraction on every
│   │                            UPLOADED_DOCS_DIR/<pid>/ missing
│   │                            rag/extracted/<pid>.json
│   │                          - _UPLOAD_EXTRACTION_STATUS dict + endpoint
│   │                            GET /api/upload/extraction-status/{pid}
│   │                          - Admin endpoint
│   │                            POST /api/admin/upload/reextract?force=...
│   ├── voice_format.py       TTS pre-processing (money/Indic normalisation)
│   ├── admin.py              /api/admin/* (health, telemetry)
│   └── providers/            thin clients: google_gemini, nvidia_nim, sarvam_*,
│                             local_embeddings (BGE), openrouter/groq (dormant)
├── frontend/                 Next.js 16 app (src/app/page.tsx, src/lib/*)
├── rag/                      retrieval + offline ingest pipeline
│   ├── retrieve.py           query → top-k chunks (request hot path)
│   ├── ingest.py/extract.py/schema.py   offline corpus build
│   ├── corpus/ vectors/      data — git-ignored, from the HF dataset
│   └── policies.duckdb       offline structured rollup
├── 40-data/                  curated, version-with-code structured facts
│   ├── policy_facts/*.json   per-policy facts + verbatim source_quote
│   └── reviews/ premiums/ insurer_network.json
├── eval/                     gold Q&A harness (pending single-brain re-port)
├── 70-docs/                  design docs & ADRs  ⚠️ see note below
├── 80-audit/                 defect register / audit transcripts
├── tools/                    operational scripts (corpus, probes, link-rot)
├── tests/                    pytest suite — the green gate (`pytest`)
├── Dockerfile / entrypoint.sh   HF Space image (pulls the data dataset)
├── pytest.ini                scopes pytest to tests/ (clean on a fresh clone)
└── requirements.txt

⚠️ Note on 70-docs/ and ADRs: these capture design history and rationale; some predate the single-brain rewrite and are being brought into line with the system as it actually runs today. This README is the authoritative present-state map; the ADRs are decision context.

7. Run it locally

Prerequisites: Python 3.11+, Node 20+, the API keys below.

# 1. Code
git clone <code-repo-url> "Insurance Sales Bot"
cd "Insurance Sales Bot"
python -m venv .venv && . .venv/bin/activate
pip install -r requirements.txt

# 2. Data (corpus + prebuilt vectors live in the companion dataset)
python -c "from huggingface_hub import snapshot_download; \
  snapshot_download(repo_id='rohitsar567/insurance-bot-data', \
  repo_type='dataset', local_dir='rag/_hf_dataset_backup')"
#   then place rag/corpus and rag/vectors from the snapshot into rag/
#   (the Docker build does this automatically; see entrypoint.sh)

# 3. Secrets — copy the example and fill in:
cp .env.example .env
#   GOOGLE_API_KEY      — Gemini brain (primary)         [required]
#   NVIDIA_NIM_API_KEY  — NIM fallback chain             [required]
#   SARVAM_API_KEY      — STT / TTS / Indic              [required for voice]
#   HF_TOKEN            — pull the data dataset at boot  [required]
#   ADMIN_PASSWORD      — gates /api/admin/*             [required]
#   VOYAGE_API_KEY      — offline ingest embeddings only [ingest only]
#   OPENROUTER/GROQ_API_KEY — dormant (kept for one-flip re-enable)

# 4. Backend
uvicorn backend.main:app --host 127.0.0.1 --port 8000 --reload

# 5. Frontend (separate terminal)
cd frontend && npm install && npm run dev      # http://localhost:3000

# Tests (the green gate)
pytest                                          # collects tests/ only

8. Deployment

Hosting is a Hugging Face Space running the Dockerfile:

The image installs the backend and builds the static frontend.
At build time it runs huggingface_hub.snapshot_download to hydrate rag/ (corpus + vectors) from the rohitsar567/insurance-bot-data dataset, so the Space repo itself stays code-only and small.
entrypoint.sh starts uvicorn backend.main:app on $PORT (default 7860, the port HF Spaces routes to); FastAPI also serves the exported frontend.

The code repo is mirrored to both the HF Space remote (origin) and a GitHub remote (github); the heavy data is updated on the HF dataset side. Space repository secrets supply the API keys listed in §7. After any deploy, verify the Space's reported build SHA actually advanced before trusting that new code is live (a quota/LFS push can fail without surfacing an error).

Insurance Sales Portfolio Expert

Table of contents

1. What this is

The problem this solves

What it does, concretely

2. How it works, end to end

2.1 The user's journey (plain English — no tech)

2.2 System at a glance — the big building blocks

2.3 Functional abstraction — what happens inside each building block

2.4 LLM brain + fail-loud fallback chain

2.5 Voice pipeline (in / out, with barge-in)

2.6 Profile & personalisation (in-memory only)

2.7 Data architecture — where the JSON lives, where the vectors live

2.7.1 The two data kinds, side by side

2.7.2 Other data files (smaller, supporting)

2.7.3 Where each piece physically lives

2.7.4 Offline ingest pipeline (built once, not on the request path)

2.8 Uploaded-PDF flow — 8 security gates → Gemini extraction → catalogued-grade card

2.9 Deployment

3. Key functions in plain language

3.1 Profile construction

3.2 Profile-aware scoring

3.3 Profile-aware pricing — the multivariate ballpark

3.4 Retrieval — retrieve_policies over the vector store

3.5 Curated facts — get_policy_facts (no embedding hop)

3.6 In-session state recovery (server restart resilience)

3.7 Uploaded-PDF LLM extraction (the §2.8 pipeline in code terms)

3.8 What is stored vs what is live-only

4. Safety & quality

4.1 Uploaded-PDF defence (8 gates)

4.2 Answer faithfulness

4.3 Evaluation

4.4 Known limitations (honest)

5. Tech stack & key decisions

6. Repository map

7. Run it locally

8. Deployment

3.4 Retrieval — `retrieve_policies` over the vector store

3.5 Curated facts — `get_policy_facts` (no embedding hop)