Spaces:
Sleeping
Sleeping
| title: Insurance Sales Portfolio Expert | |
| emoji: π₯ | |
| colorFrom: blue | |
| colorTo: indigo | |
| sdk: docker | |
| app_port: 7860 | |
| pinned: false | |
| license: mit | |
| short_description: Voice-first AI advisor for Indian health insurance | |
| <!-- The YAML block above is Hugging Face Space configuration β it is parsed | |
| by HF to provision the Space (docker SDK, port 7860). Do not remove. --> | |
| # Insurance Sales Portfolio Expert | |
| A health-insurance advisory web app for the Indian market (presented in-app as | |
| **"Insurance Advisor"**). You describe your situation in plain language (typed | |
| or spoken, English or Hindi/Hinglish); it asks a few clarifying questions, then | |
| recommends and explains real policies β grounded in the actual policy | |
| documents, with every claim traceable to a source clause. It also lets you | |
| upload your own policy PDF and ask questions about it. | |
| Live: **https://rohitsar567-insurancebot.hf.space** | |
| > **Reading this cold?** Β§1 is plain English. Β§2 walks you down four levels of abstraction: the user journey (Β§2.1), the building blocks (Β§2.2), the functional abstraction inside each block (Β§2.3), then deep-dives per building block (Β§2.4βΒ§2.9). Β§3 gives a function-by-function sequence-diagram view of the six most important jobs. Β§4βΒ§8 are safety, stack, repo map, run-it-locally, and deployment. | |
| --- | |
| ## Table of contents | |
| 1. [What this is](#1-what-this-is) | |
| 2. [How it works, end to end](#2-how-it-works-end-to-end) | |
| 3. [Key functions in plain language](#3-key-functions-in-plain-language) | |
| 4. [Safety & quality](#4-safety--quality) | |
| 5. [Tech stack & key decisions](#5-tech-stack--key-decisions) | |
| 6. [Repository map](#6-repository-map) | |
| 7. [Run it locally](#7-run-it-locally) | |
| 8. [Deployment](#8-deployment) | |
| --- | |
| ## 1. What this is | |
| **The short answer.** A health-insurance advisor that behaves like a | |
| knowledgeable, unbiased human advisor β *not* a lead-generation funnel. | |
| You describe your situation; it asks a few clarifying questions; it | |
| recommends real plans that fit, with every factual claim backed by the | |
| exact clause in the real policy document. No lead capture. No commission | |
| bias. If the honest answer is *"this isn't in the document,"* it says so β | |
| instead of guessing. | |
| It works by chat or voice, in English or Hindi/Hinglish, on desktop and | |
| mobile. | |
| ### The problem this solves | |
| Buying health insurance in India is hard for an ordinary person. A | |
| first-time buyer faces three concrete problems: | |
| 1. **Too much to compare.** ~148 plans across 21 insurers, each with | |
| dozens of decision-relevant fields (waiting periods, room-rent caps, | |
| co-pay, maternity, sub-limits, network size). No human reads them all. | |
| 2. **The truth is buried.** The number that decides whether a plan is | |
| right for *you* is on page 47 of a PDF written by lawyers. | |
| 3. **Most "advice" is conflicted.** Aggregator sites optimise for the | |
| sale, not the fit. | |
| The cost of getting this wrong is real money and denied claims years | |
| later. The goal is a tool a non-expert can trust the way they would trust | |
| a good independent advisor: personalised to *their* profile, sourced, and | |
| never fabricating. | |
| ### What it does, concretely | |
| - **Conversational fact-find** β short natural back-and-forth establishes | |
| your profile (age, dependants, budget, pre-existing conditions, | |
| priorities) instead of a long form. | |
| - **Personalised recommendations** β plans ranked for *fit to your | |
| profile*. A fixed-benefit plan is not pushed to someone who needs | |
| comprehensive cover; a plan whose entry age excludes you is filtered | |
| out. | |
| - **Grounded answers** β every factual claim about a policy is retrieved | |
| from that policy's actual document and shown with its source. Weak or | |
| missing evidence produces an honest "not stated in the document." | |
| - **Marketplace & compare** β browse the full indexed catalogue, open a | |
| detailed scorecard per plan, compare up to four side by side. | |
| - **Profile β premium (illustrative)** β a live ballpark premium range | |
| that updates as you change your profile. *Not* real underwriting β a | |
| multivariate range from public rate-card combinations (see Β§3.3). | |
| - **Bring your own document** β upload any policy PDF; it is safely | |
| indexed for the rest of your session so you can ask questions about | |
| *your* document. | |
| - **Voice** β speak instead of typing (tap-to-talk on mobile, | |
| push-to-talk on desktop); replies are spoken back. Indian-accent and | |
| Hinglish aware. | |
| --- | |
| ## 2. How it works, end to end | |
| **The short answer.** A Next.js browser app talks to a FastAPI backend. | |
| Every chat turn goes to a **single LLM "brain"** (Google **Gemini**) with | |
| a small set of **function-calling tools** β most importantly a retrieval | |
| tool over a **Chroma** vector store built from the real policy documents. | |
| The brain decides when to retrieve, what to retrieve, and how to answer; | |
| it *cannot* state a policy fact it did not retrieve. If Gemini is | |
| unavailable, the turn transparently falls back to an **NVIDIA NIM** | |
| open-model chain. Voice in/out is handled by **Sarvam** (Indian-language | |
| STT/TTS). Heavy data (PDF corpus + prebuilt vectors) lives in a separate | |
| Hugging Face **dataset**, not the code repo. | |
| The rest of this section walks you down four levels of abstraction: | |
| Β§2.1 the user's journey (plain English, no tech); Β§2.2 the building | |
| blocks at the highest level (the four canonical buckets); Β§2.3 the | |
| functional abstraction β what happens inside each bucket; and Β§2.4βΒ§2.9 | |
| the deep dives per building block. Every diagram is followed by a | |
| β€50-word summary and a hierarchical *how it flows* breakdown. | |
| ### 2.1 The user's journey (plain English β no tech) | |
| Before the engineering detail, here is what actually happens for the | |
| person using it. No code, no jargon β just the path from opening the app | |
| to deciding with confidence. | |
| ```mermaid | |
| flowchart TD | |
| S["π You open the app β web or mobile, nothing to install"] --> TELL["π£οΈ Tell it about you β a short chat, typed OR spoken, English / Hindi-Hinglish<br/>age Β· family Β· budget Β· health Β· what you care about"] | |
| TELL --> ASK["β It asks just 2β3 clarifying questions<br/>(a real conversation, never a long form)"] | |
| ASK --> REC["π― A personalised shortlist β plans ranked for YOUR fit, each with the reason it fits"] | |
| REC --> WHY["π Open any plan: every fact is backed by the exact clause in the real policy PDF<br/>an honest "not stated in the document" instead of a guess"] | |
| WHY --> EXPLORE{"Want to dig deeper?"} | |
| EXPLORE -->|"Compare"| CMP["βοΈ Compare up to 4 plans side by side Β· full scorecard per plan"] | |
| EXPLORE -->|"Browse"| MKT["π Browse the full indexed marketplace"] | |
| EXPLORE -->|"Ask"| QA["π¬ Ask follow-up questions β answered only from the actual documents"] | |
| EXPLORE -->|"My own policy"| UP["π Upload your own policy PDF"] | |
| UP --> UPIDX["β³ Quick ack β 'Reading it through, ~30β60 s'<br/>(everything in chat is gated while the analysis runs)"] | |
| UPIDX --> UPCARD["π Inline scorecard card with FULL data:<br/>grade letter Β· 6 sub-scores Β· verbatim signals Β· insurer reputation"] | |
| UPCARD --> UPCHOICE{"How would you like to proceed?"} | |
| UPCHOICE -->|"Finish profile"| TELL | |
| UPCHOICE -->|"Dive into the PDF"| QA | |
| CMP --> PREM | |
| MKT --> PREM | |
| QA --> PREM | |
| PREM["πΈ A live premium estimate that updates as you change your profile"] --> DONE["β Decide with confidence β no lead capture, no commission bias"] | |
| VOICE["ποΈ Optional the whole way: speak instead of type β it speaks the answers back"] -.-> TELL | |
| VOICE -.-> QA | |
| ``` | |
| **Summary.** A user opens the app and ends the session having decided on | |
| a plan with confidence β and how the system loops through compare / | |
| browse / Q&A / upload along the way. No backend in this view; just the | |
| human path. Every session starts fresh β there is no cross-session | |
| memory; closing the tab forgets you (privacy-by-design, see ADR-043). | |
| **How it flows:** | |
| - **Conversational fact-find.** A short typed-or-spoken back-and-forth | |
| (English or Hindi-Hinglish) captures age, family, budget, health and | |
| what you care about β instead of a long form. | |
| - **Personalised shortlist + a "why".** Plans are ranked for *your* fit; | |
| every fact about a plan is backed by the exact clause in the real | |
| policy PDF, never invented. | |
| - **Branches from the shortlist.** Compare side by side, browse the full | |
| marketplace, ask follow-up questions, or upload your *own* policy PDF | |
| and ask about your document (kept private to your session). | |
| - **Upload-PDF flow is a staged sequence** (ADR-044, 2026-05-27): | |
| upload β bot says *"reading it through, ~30β60 s"* β all chat input is | |
| gated during the wait (Send button, textarea, voice paths all blocked | |
| so nothing can interrupt the staging) β bot pushes the inline | |
| scorecard card with FULL extracted data once the LLM pass lands β bot | |
| then asks whether you'd like to finish your profile or dive into the | |
| PDF. The card is the same shape as any catalogued policy card β six | |
| sub-scores, verbatim signals, real claim-settlement data when the | |
| insurer is recognised. | |
| - **Live premium.** Updates as you change the profile. | |
| - **Decision.** No lead capture and no commission bias β the path ends at | |
| *decide*, not at a sales handoff. | |
| ### 2.2 System at a glance β the big building blocks | |
| **The short answer.** The system has four "tall buckets": | |
| **Frontend** (what you see), **Backend** (what runs on the server), | |
| **Data layer** (the policy knowledge), and **Voice** (in and out). They | |
| talk to each other over standard HTTP / JSON. | |
| **Two terms first, in one sentence each:** | |
| - **Frontend** = everything you see on screen β the chat box, marketplace | |
| cards, sliders, profile builder. Built with **Next.js + React** (a | |
| standard, well-supported web-UI library). Runs in your browser. | |
| - **Backend** = everything that *runs on the server* β the LLM brain, the | |
| retrieval, the scoring/pricing logic, the upload-security gates. Built | |
| with **FastAPI** (a standard Python HTTP framework). Think of the | |
| frontend as the menu + waiter; the backend is the kitchen. | |
| Both Next.js and FastAPI are deliberately boring, standard choices β they | |
| let us not spend engineering on the UI layer or the HTTP plumbing, so we | |
| spend that effort on the brain and the data, where the product | |
| differentiation actually lives. | |
| **Now the big picture β the buckets and how they talk:** | |
| ```mermaid | |
| flowchart LR | |
| subgraph FE["π Frontend (browser Β· Next.js)"] | |
| UI["Chat Β· Marketplace Β· Compare Β· Profile builder<br/>Voice capture & playback"] | |
| end | |
| subgraph BE["βοΈ Backend (FastAPI server)"] | |
| API["HTTP endpoints + orchestration<br/>backend/main.py"] | |
| BRAIN["π§ LLM Brain<br/>Google Gemini + function-calling tools<br/>(NIM fallback chain on failure)"] | |
| SCORE["π― Scoring + Pricing<br/>scorecard.py Β· premium_calculator.py"] | |
| PROF["π€ Profile (in-memory only)<br/>session_state.SessionState Β· 1h idle TTL"] | |
| end | |
| subgraph DATA["π Data layer"] | |
| VEC["Vector DB (Chroma) β policy chunks<br/>+ per-session quarantine (uploads)"] | |
| FACTS["Curated facts JSON<br/>40-data/policy_facts/*.json"] | |
| end | |
| subgraph VOICE["ποΈ Voice"] | |
| STT["Sarvam STT (in)"] | |
| TTS["Sarvam TTS (out)"] | |
| end | |
| UI <-->|"text Β· JSON"| API | |
| UI -->|"audio"| STT --> API | |
| API --> TTS --> UI | |
| API <--> BRAIN | |
| BRAIN <-->|"retrieve_policies"| VEC | |
| BRAIN <-->|"get_policy_facts"| FACTS | |
| BRAIN <-->|"save_profile_field"| PROF | |
| BRAIN --> SCORE | |
| SCORE <--> FACTS | |
| SCORE <--> PROF | |
| ``` | |
| **Summary.** Four building blocks talk over HTTP / JSON: Frontend (the chat UI you see), Voice (Sarvam STT in + TTS out), Backend (FastAPI with four sub-blocks β orchestration, LLM Brain, Scoring + Pricing, Profile & Persistence), and the Data layer (Chroma vectors + curated JSON facts). | |
| **How it flows:** | |
| - **1. Frontend (browser Β· Next.js).** Renders chat, marketplace, compare, and the profile builder. Sends typed text and audio over HTTP, plays the synthesised reply. | |
| - **2. Voice.** `Sarvam STT (in)` turns spoken audio into a text turn; `Sarvam TTS (out)` turns the reply text back into spoken audio. | |
| - **3. Backend (FastAPI).** Four sub-blocks β **3a** HTTP endpoints + orchestration (`backend/main.py`); **3b** LLM Brain (Gemini + function-calling tools; NIM fallback on failure); **3c** Scoring + Pricing (`scorecard.py` + `premium_calculator.py`); **3d** Profile (in-memory only β `session_state.SessionState`, no disk). | |
| - **4. Data layer.** Two stores β the Chroma **vector DB** (shared policy chunks + per-session quarantine for uploads) and curated **JSON facts** at `40-data/policy_facts/*.json`. The brain, scoring, and pricing all read from these. | |
| **Diagram legend (used throughout Β§2):** | |
| - **Solid arrow (`β`)** = a real call / data flow on the request path. | |
| - **Double arrow (`β`)** = bidirectional β one side calls, the other returns. | |
| - **Dotted arrow (`-.->`)** = a side-channel or async event β voice | |
| playback, barge-in interrupt, end-of-turn persistence, etc. β not on | |
| the main request path. | |
| - **Subgraph box** = everything inside runs in one place (one process / | |
| one service / one storage layer). | |
| - Edge labels (e.g. *"retrieve_policies"*) name the actual function or | |
| signal carried on that edge. | |
| ### 2.3 Functional abstraction β what happens inside each building block | |
| ```mermaid | |
| flowchart TB | |
| subgraph FE["1. Frontend"] | |
| direction TB | |
| F1["capture_input<br/>typed text Β· spoken audio"] | |
| F2["render_reply<br/>chat Β· cards Β· scorecard Β· audio"] | |
| end | |
| subgraph V["2. Voice"] | |
| direction TB | |
| V1["transcribe<br/>Sarvam Saarika STT"] | |
| V2["synthesize<br/>voice_format β Sarvam Bulbul TTS"] | |
| end | |
| subgraph BE["3. Backend"] | |
| direction TB | |
| subgraph BE_API["3a. HTTP + orchestration"] | |
| A1["route_request"] | |
| A2["orchestrate_turn"] | |
| end | |
| subgraph BE_BRAIN["3b. LLM Brain"] | |
| B1["handle_turn<br/>one Gemini call + tool loop"] | |
| B2["fact_find<br/>save_profile_field"] | |
| B3["retrieve<br/>retrieve_policies"] | |
| B4["lookup_facts<br/>get_policy_facts"] | |
| B5["recommend<br/>mark_recommendation"] | |
| end | |
| subgraph BE_SCORE["3c. Scoring + Pricing"] | |
| SC1["grade_per_profile<br/>scorecard.py"] | |
| SC2["estimate_premium<br/>premium_calculator.py"] | |
| end | |
| subgraph BE_PROF["3d. Profile (in-memory)"] | |
| P1["update_session_profile<br/>session_state.SessionState"] | |
| P2["evict_on_idle<br/>1h TTL Β· no disk"] | |
| end | |
| end | |
| subgraph DATA["4. Data layer"] | |
| direction TB | |
| D1["vector_search<br/>Chroma Β· BGE-small"] | |
| D2["fact_lookup<br/>40-data/policy_facts/*.json"] | |
| end | |
| %% forward edges (input / down the pipeline) | |
| F1 -->|"audio"| V1 | |
| F1 -->|"text Β· JSON"| A1 | |
| V1 --> A1 | |
| A1 --> A2 | |
| A2 --> B1 | |
| B1 --> B2 | |
| B1 --> B3 | |
| B1 --> B4 | |
| B1 --> B5 | |
| B2 --> P1 | |
| B3 --> D1 | |
| B4 --> D2 | |
| A2 --> SC1 | |
| A2 --> SC2 | |
| SC1 -->|"reads"| D2 | |
| SC2 -->|"reads"| D2 | |
| SC1 -->|"reads"| P1 | |
| SC2 -->|"reads"| P1 | |
| P1 -.->|"idle 1h"| P2 | |
| %% return edges (output / back to caller) | |
| D1 -.->|"top-k chunks"| B3 | |
| D2 -.->|"per-policy facts"| B4 | |
| SC1 -.->|"grade"| A2 | |
| SC2 -.->|"premium range"| A2 | |
| B1 -.->|"reply + citations"| A2 | |
| A2 -.->|"text"| F2 | |
| A2 -.->|"speak?"| V2 | |
| V2 -.->|"audio"| F2 | |
| %% blue solid = forward Β· orange dashed = return | |
| linkStyle 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17 stroke:#1565c0,stroke-width:2px | |
| linkStyle 18,19,20,21,22,23,24,25 stroke:#e65100,stroke-width:2px,stroke-dasharray:6 3 | |
| ``` | |
| **Legend.** Blue solid = forward flow (input / call down the pipeline). Orange dashed = return flow (result / reply back up). | |
| **Summary.** Inside each building block from Β§2.2, a small set of named functions fires per turn β Frontend captures and renders, Voice transcribes and synthesises, the four Backend sub-blocks orchestrate / decide / score / remember, and the Data layer answers their reads. | |
| **How it flows:** | |
| - **1. Frontend.** `capture_input` accepts typed text or recorded audio; `render_reply` paints chat + marketplace cards + scorecard + audio playback. | |
| - **2. Voice.** `transcribe` is the inbound path (Sarvam Saarika STT); `synthesize` is the outbound path (`voice_format` normalises money / Indic shorthand β Sarvam Bulbul TTS). | |
| - **3a. HTTP + orchestration.** `route_request` maps the URL to a handler; `orchestrate_turn` is the per-turn supervisor β it owns the request lifecycle and ties brain + scoring + voice + persistence together. | |
| - **3b. LLM Brain.** One `handle_turn` per turn calls Gemini, which chooses which of `fact_find` / `retrieve` / `lookup_facts` / `recommend` to run as tools. The brain may only state what its tools returned. | |
| - **3c. Scoring + Pricing.** `grade_per_profile` and `estimate_premium` read curated facts **and** the live profile, compute on every request (never stored), and hand back to `orchestrate_turn`. | |
| - **3d. Profile (in-memory).** `update_session_profile` reflects each `fact_find` write into the live `SessionState.profile`. State lives in process memory only; an idle session is evicted after 1 h. There is no disk persistence and no cross-session recall (see ADR-043, 2026-05-27). | |
| - **4. Data layer.** Two reads β `vector_search` for free-form Q&A, and `fact_lookup` for decision-critical numbers with verbatim quotes. The data layer does no writes during a request β those happen offline only (vector ingest, curated-facts edits). | |
| ### 2.4 LLM brain + fail-loud fallback chain | |
| ```mermaid | |
| flowchart LR | |
| Q["chat turn"] --> G{"Gemini<br/>gemini-2.5-flash-lite"} | |
| G -->|"OK"| ANS["grounded reply<br/>(only from tool results)"] | |
| G -->|"real failure / cold-start 503"| H["backend/llm_health.py<br/>background probe + sticky-primary election"] | |
| H --> NIM["NVIDIA NIM open-model chain<br/>backend/nim_fallback.py"] | |
| NIM -->|"healthy model"| ANS | |
| NIM -->|"whole chain down"| LOUD["explicit 'service degraded'<br/>(never a silently wrong answer)"] | |
| ANS --> GUARD["prose-grounding guard:<br/>every policy/UIN named is verified<br/>against retrieve_policies + get_policy_facts"] | |
| GUARD --> OUT["sent to user"] | |
| ``` | |
| **Summary.** How a chat turn is served by the primary | |
| LLM, what happens when it fails, and the structural guard that prevents a | |
| silently wrong answer. | |
| **How it flows:** | |
| - **Primary path.** Gemini (`gemini-2.5-flash-lite`). On a healthy | |
| response β the reply is built *only* from what the tools returned. | |
| - **Fallback path (fail-loud).** A real Gemini failure or a cold-start | |
| 503 routes through `backend/llm_health.py` (a background probe with | |
| sticky-primary election) to the NVIDIA NIM open-model chain | |
| (`nim_fallback.py`). One healthy model in that chain serves the turn. | |
| - **Last resort.** If the whole chain is down, the user gets an explicit | |
| *"service degraded"* message β never a silently wrong answer. | |
| - **Prose-grounding guard.** Before a reply is sent, every policy / UIN | |
| named in the prose is verified against the same `retrieve_policies` | |
| and `get_policy_facts` results the brain saw (with an exemption for | |
| genuine catalogue UINs). Faithfulness is structural, not bolt-on. | |
| **Why a single brain (not a multi-model pipeline).** Earlier designs split | |
| the work across several LLM passes (a separate fact-find brain, a QA | |
| brain, a faithfulness-judge). That scaffolding was removed: a single | |
| frontier model with well-designed tools is more accurate, far simpler, | |
| and eliminates a whole class of cross-model contract bugs. Today there is | |
| exactly **one** brain call per turn plus its tool calls. Faithfulness is | |
| enforced *structurally* β the brain can only state what `retrieve_policies` | |
| and `get_policy_facts` returned β rather than by a second grader model. | |
| **More on the fallback chain.** The brain's primary is Gemini | |
| (`gemini-2.5-flash-lite`). On a real Gemini failure or a cold-start 503, | |
| the turn falls back to an NVIDIA NIM chain of open models. Candidate | |
| selection uses a background health probe with sticky-primary election | |
| (`backend/llm_health.py`) so one healthy model is chosen per call. The | |
| fallback is **fail-loud**: if the whole chain is down the user gets an | |
| explicit *"service degraded"* message, never a silently wrong answer. | |
| (A separate LLM "judge" existed historically and has been retired β the | |
| single-brain design made it redundant.) | |
| **Sticky-session retry policy (hardened 2026-05-27).** Once a session | |
| has completed at least one successful single-brain turn, it stays on | |
| single_brain for the rest of its lifetime β cross-fading to | |
| `nim_fallback` mid-stream would discard `last_recommendation_ids / | |
| last_retrieved_chunks / slug_to_insurer`. To absorb Gemini's | |
| intermittent "high demand" 503 bursts on sticky sessions, | |
| `_gemini_call` now uses an adaptive retry schedule: **non-sticky** | |
| session keeps 1 retry with a 1.5 s backoff (fast-fail to NIM on | |
| cold-start); **sticky** session gets 2 retries with jittered | |
| exponential backoffs (1.5 s β 3 s, Β±25 % jitter). If the chain still | |
| fails after retries, the user sees a plain, honest reply *"My model | |
| service had a brief blip on that turn β please send the same message | |
| again."* (no more misleading *"could you say that again?"*). | |
| ### 2.5 Voice pipeline (in / out, with barge-in) | |
| ```mermaid | |
| flowchart LR | |
| MIC["mic β tap-to-talk (touch) / push-to-talk (desktop)"] --> MR["MediaRecorder (authoritative audio)"] | |
| MIC -.->|"live interim text"| WS["Web Speech API (display only)"] | |
| MR --> STT["/api/transcribe β Sarvam Saarika STT"] | |
| STT --> BR["single_brain.handle_turn"] | |
| BR --> RPL["reply text + citations"] | |
| RPL --> VF["voice_format.py<br/>money/Indic normalise Β· chunk at sentence bounds"] | |
| VF --> BUL["Sarvam Bulbul TTS"] | |
| BUL --> PLAY["in-DOM <audio>"] | |
| SPK["user speaks over bot"] -.->|"barge-in"| PLAY | |
| SPK -.->|"abort in-flight"| BR | |
| ``` | |
| **Summary.** How spoken input becomes a chat turn, how | |
| the reply becomes speech back, and how the user can interrupt mid-answer. | |
| **How it flows:** | |
| - **Capture.** Tap-to-talk (touch) or push-to-talk (desktop) starts | |
| `MediaRecorder` (the authoritative audio) and Web Speech (a live | |
| interim transcript shown on screen but never trusted for the turn). | |
| - **STT.** The authoritative audio is sent to `/api/transcribe` | |
| (Sarvam Saarika β Indian-accent + Hinglish aware). | |
| - **Brain β reply.** The transcript runs through `single_brain.handle_turn` | |
| exactly like a typed turn. | |
| - **TTS.** `voice_format.py` normalises money / Indic shorthand and chunks | |
| at sentence bounds (so long replies are spoken in full); Sarvam Bulbul | |
| speaks; an in-DOM `<audio>` element plays. | |
| - **Barge-in.** The user speaking over the bot pauses playback **and** | |
| aborts the in-flight `/api/chat`, so the bot stops mid-thought rather | |
| than over-talking. | |
| **More on voice.** The browser shows a live *interim* transcript via the | |
| Web Speech API while `MediaRecorder` captures the **authoritative** audio, | |
| which is sent to `/api/transcribe` (**Sarvam Saarika** STT). Replies are | |
| synthesised by **Sarvam Bulbul** TTS, with money / Indic shorthand | |
| normalised in `backend/voice_format.py` before synthesis (long replies are | |
| chunked at sentence boundaries so the full answer is spoken, not just the | |
| first sentence), and played through an in-DOM `<audio>` element. Speaking | |
| over the bot (**barge-in**) pauses that audio **and** aborts the in-flight | |
| `/api/chat` request. On touch devices voice is tap-to-talk; on desktop, | |
| push-to-talk; the live interim transcript accumulates the full utterance | |
| while you speak. | |
| ### 2.6 Profile & personalisation (in-memory only) | |
| ```mermaid | |
| flowchart TB | |
| A["user answers (chat or profile builder)"] --> SPF["save_profile_field β SessionState.profile<br/>(in-memory dict only)"] | |
| SPF --> FIT["scorecard fit + grade<br/>(reads live profile)"] | |
| SPF --> PREM["illustrative premium<br/>(reads live profile)"] | |
| SPF --> TURN["next chat turn<br/>(brain sees full live profile)"] | |
| SPF -.->|"1h idle"| EVICT["session evicted from memory<br/>profile gone forever"] | |
| SPF -.->|"close tab"| EVICT | |
| RECOV["server restart / 1h idle WHILE tab still open<br/>+ chat_history carried by browser"] -.-> SR["STATE-RECOVERY MODE<br/>brain rebuilds profile from chat_history<br/>(in-session only Β· never reads disk)"] | |
| SR --> SPF | |
| ``` | |
| **Summary.** The profile is captured into a per-session in-memory dict | |
| (`SessionState.profile`), feeds scoring + pricing + the next turn's | |
| brain prompt, and is discarded the moment the session evicts (1 h idle | |
| or "Clear chat"). There is no on-disk persistence and no cross-session | |
| recall. The in-session **STATE-RECOVERY** path covers container | |
| restarts by rebuilding the profile from the chat history the browser | |
| still carries β it never touches disk. | |
| **How it flows:** | |
| - **Capture.** Every answer (chat or profile-builder form) is written via | |
| `save_profile_field` (or the `POST /api/profile` endpoint) into the | |
| live `SessionState.profile`. This is a regular Python dataclass field | |
| on the in-memory session object. | |
| - **Drives scoring + pricing.** The same profile feeds the scorecard | |
| fit-and-grade (Β§3.2) and the live premium estimate (Β§3.3) on every | |
| request β both reads, never persisted. | |
| - **Evicted on idle / close.** A session is evicted from the | |
| `_sessions` dict after 1 h of inactivity (`_TTL_SECONDS`). Hitting | |
| *Clear chat* (`POST /api/session/clear`) evicts immediately. Closing | |
| the tab disconnects the browser β the server-side session ages out on | |
| the same TTL. | |
| - **State recovery (in-session only).** If the server restarted or the | |
| session evicted *while the browser still has the chat open*, the | |
| client re-sends its `chat_history` with the next turn. The brain | |
| enters **STATE-RECOVERY MODE** and silently re-captures the facts | |
| already stated in history β without ever asking the user's name | |
| again. This is **not** cross-session; it only resolves the case | |
| where the user is still in the same conversation. | |
| **Why no cross-session recall (ADR-043, 2026-05-27).** An earlier | |
| design persisted profiles to `40-data/profiles/<name>.json` and offered | |
| a "Welcome back, <name>?" prompt on return. The name-only slug key | |
| collided across distinct users (every "Rohit" wrote to the same file), | |
| which required four sequential hardening passes β prompt redaction, | |
| match-before-merge guards, same-turn fact extractors, a two-fact gate β | |
| to keep contained. The cost/benefit for an insurance-shopping app | |
| (rare-purchase, return sessions uncommon) didn't justify the surface. | |
| The simpler "session is in-memory only" model matches the privacy story | |
| the product wants to tell. | |
| ### 2.7 Data architecture β where the JSON lives, where the vectors live | |
| **Summary.** Two complementary kinds of data power the bot: small JSON files of human-reviewed facts versioned with the code, and a Chroma vector database of the full policy text held in a separate HF dataset. JSON answers exact-number questions; vectors answer free-form Q&A. | |
| #### 2.7.1 The two data kinds, side by side | |
| | | **JSON files** | **Vector database (Chroma)** | | |
| |---|---|---| | |
| | **What's in it** | Per-policy curated structured fields β CSR%, ICR%, complaints/10k, room-rent rule, waiting periods, sub-limits, grade β **each value carries its verbatim `source_quote` from the PDF** | Full text of every policy PDF, chunked into ~500-token overlapping pieces, embedded as 384-d vectors with BGE-small-en-v1.5 | | |
| | **File location** | `40-data/policy_facts/*.json` β inside the **code repo**, versioned in git | `rag/vectors/` β git-**ignored** on the laptop, lives in the **HF dataset** `rohitsar567/insurance-bot-data`, pulled at Docker build via `huggingface_hub.snapshot_download` | | |
| | **Size** | ~150 files Γ few KB each β tiny, fits in git | ~7.3 k chunks Γ 384 dims + raw PDFs (hundreds of MB) β too big for git | | |
| | **Built by** | Offline ingest (`rag/extract.py` + `schema.py`) β LLM-assisted extraction, **human-reviewed**, committed to git | Offline ingest (`rag/ingest.py`) β chunked + embedded once, published to HF dataset | | |
| | **Read by (at request time)** | `get_policy_facts` tool (LLM brain) Β· `scorecard.py` Β· `premium_calculator.py` Β· marketplace-card renderer | `retrieve_policies` tool (LLM brain) β only | | |
| | **Used for** | Decision-critical exact numbers cited on cards, in the scorecard, and in pricing β with the verbatim PDF quote shown | Free-form Q&A grounded in actual policy wording β *"what does this plan cover during pregnancy?"* | | |
| #### 2.7.2 Other data files (smaller, supporting) | |
| - **`40-data/reviews/`** β sourced insurer reviews (claims stories, regulator notes). | |
| - **`40-data/premiums/`** β illustrative public rate-card combinations consumed by the multivariate premium estimator (Β§3.3). | |
| - **`40-data/insurer_network.json`** β hospital-network counts per insurer. | |
| _Pre-ADR-043 there was also a `40-data/profiles/<name>.json` directory of saved user profiles for cross-session recall. That mechanism was removed (see Β§2.6 β sessions are now in-memory only)._ | |
| All three remaining stores sit in the **code repo** (under `40-data/`) because they're small, human-reviewed, and decision-critical β safe to version alongside the code. | |
| #### 2.7.3 Where each piece physically lives | |
| ```mermaid | |
| flowchart LR | |
| LAPTOP["π» Local dev laptop<br/>source of truth before any push"] | |
| subgraph CODE["π Code repository (mirrored to two remotes)"] | |
| direction TB | |
| GH["GitHub β public mirror<br/>rohitsar567/insurance-sales-bot"] | |
| HFS["HF Space β code + running app<br/>rohitsar567/InsuranceBot"] | |
| end | |
| HFD["π€ HF Dataset<br/>rohitsar567/insurance-bot-data<br/>PDF corpus + prebuilt Chroma vectors"] | |
| APP["π Live app (HF Space container)"] | |
| LAPTOP -->|"git push github"| GH | |
| LAPTOP -->|"git push origin"| HFS | |
| LAPTOP -.->|"offline ingest publishes here"| HFD | |
| HFS -->|"Docker build"| APP | |
| HFD -.->|"snapshot_download at build"| APP | |
| linkStyle 0,1,3 stroke:#1565c0,stroke-width:2px | |
| linkStyle 2,4 stroke:#e65100,stroke-width:2px,stroke-dasharray:6 3 | |
| ``` | |
| **Summary.** Three physical homes: the laptop (source of truth before any push), the code repo (mirrored to two git remotes β GitHub for reviewers and the HF Space's own repo for deployment), and the HF dataset (the heavy binaries). | |
| **How it flows:** | |
| - **Local laptop.** Single source of truth before any push. All editing happens here. | |
| - **Code repo β two remotes.** `git push github` updates the GitHub public mirror (for reviewers). `git push origin` updates the HF Space's own repo and **triggers the Docker rebuild**. | |
| - **HF dataset (offline channel).** Heavy binaries β PDF corpus + prebuilt Chroma vectors β are published here separately from the code so the deployable image stays small. | |
| - **Live container.** On every Docker build, `huggingface_hub.snapshot_download` hydrates `rag/corpus/` and `rag/vectors/` from the HF dataset; the FastAPI app then has both data kinds available. | |
| #### 2.7.4 Offline ingest pipeline (built once, not on the request path) | |
| ```mermaid | |
| flowchart LR | |
| PDF["π Raw policy PDFs<br/>rag/corpus/"] --> ING["rag/ingest.py<br/>chunk pages"] | |
| ING --> EMB["embed<br/>BGE-small Β· local Β· 384-d"] | |
| EMB --> VEC["Chroma vector store<br/>rag/vectors/"] | |
| ING --> XT["rag/extract.py + schema.py<br/>structured fact extraction (LLM-assisted)"] | |
| XT --> JSON["40-data/policy_facts/*.json<br/>+ verbatim source_quote"] | |
| XT --> DUCK["policies.duckdb<br/>structured rollup"] | |
| VEC -.->|"published to"| HFD["HF dataset"] | |
| PDF -.->|"published to"| HFD | |
| JSON -->|"versioned with code"| CODE["Code repo"] | |
| linkStyle 0,1,2,3,4,5,8 stroke:#1565c0,stroke-width:2px | |
| linkStyle 6,7 stroke:#e65100,stroke-width:2px,stroke-dasharray:6 3 | |
| ``` | |
| **Summary.** A single offline pipeline turns each raw PDF into two artefacts: chunked embeddings for free-form retrieval, and a structured JSON of decision-critical fields with verbatim quotes. Vectors β HF dataset; JSON β code repo. | |
| **How it flows:** | |
| - **Chunking + embedding.** `rag/ingest.py` splits each PDF into overlapping ~500-token chunks; BGE-small encodes each chunk into a 384-d vector; Chroma persists them to `rag/vectors/`. | |
| - **Extraction.** `rag/extract.py` + `schema.py` run an LLM-assisted pass to pull structured fields (waiting periods, room-rent caps, CSR%, etc.) into a schema-validated JSON β **with the verbatim `source_quote` that justifies each value**. | |
| - **Two destinations.** Vectors and raw PDFs are published to the HF dataset (too big for git); JSON files are versioned with the code so a reviewer can see exactly what facts feed the marketplace cards. | |
| - **Why split.** Chunks power free-form Q&A; the JSON powers the marketplace cards, scoring, and pricing β two different queries, two different data shapes. | |
| **Provenance rule.** Every policy fact shown to a user traces to a real clause in a real PDF. Where a document genuinely doesn't state something, it is recorded as a sourced-null (*"not stated in <file>.pdf"*) β never invented or back-filled. | |
| ### 2.8 Uploaded-PDF flow β 8 security gates β Gemini extraction β catalogued-grade card | |
| ```mermaid | |
| flowchart TB | |
| UP["/api/upload-policy (public web)"] --> G1["1 File mechanics<br/>%PDF Β· size band Β· %%EOF Β· no exe/JS"] | |
| G1 --> G2["2 Content quality<br/>β₯1500 chars Β· β₯3 pp Β· domain keyword"] | |
| G2 --> G3["3 Prompt-injection sweep"] | |
| G3 --> G4["4 Per-session rate limit"] | |
| G4 --> G5["5 Per-IP rate limit"] | |
| G5 --> G6["6 Encrypted/locked β reject"] | |
| G6 --> G7["7 Page-count ceiling (>200)"] | |
| G7 --> G8["8 Hash dedupe + reject-cache"] | |
| G8 -->|"pass"| QC["per-session QUARANTINE Chroma + global policies collection<br/>BGE-small embeddings Β· 24h idle TTL"] | |
| QC --> HEUR["Heuristic baseline (synchronous, sub-second)<br/>regex/keyword over PDF text<br/>writes UPLOADED_DOCS_DIR/<pid>/record.json @ ~30-50% completeness"] | |
| HEUR -.->|"detect_insurer_slug<br/>match against 21 known insurers"| INS["insurer_slug = manipalcigna / hdfc-ergo / ...<br/>(or 'user-upload' on no match β fail-closed)"] | |
| HEUR --> ACK["HTTP 200 returns immediately<br/>frontend pushes ack 'Got it β reading X, ~30-60s'<br/>EVERY chat input GATED via extractionInFlight"] | |
| ACK --> CACHE{"sha256(pdf_bytes)<br/>seen before?"} | |
| CACHE -->|"hit"| COPY["copy prior rag/extracted/<other_pid>.json β this pid<br/>llm_used='hash-cache' Β· ~1s"] | |
| CACHE -->|"miss"| LLM["Background extract_one_for_upload<br/>Gemini 2.5-flash Β· 3 retries (2/4/8s Β± 25% jitter)<br/>llm_used='gemini-2.5-flash#N'<br/>llm_response_chars logged for ops"] | |
| LLM -->|"all fail"| NIM["NIM fallback chain<br/>llm_used='nim-fallback'"] | |
| LLM -->|"success"| WRITE["write rag/extracted/<pid>.json"] | |
| NIM -->|"success"| WRITE | |
| NIM -->|"fail"| FLOOR["heuristic record.json wins<br/>status='failed' but card still renders at ~47% / grade C"] | |
| WRITE --> MERGE["merge LLM scalars INTO record.json<br/>LLM value wins where non-empty<br/>heuristic stays where LLM silent"] | |
| COPY --> MERGE | |
| MERGE --> BUST["invalidate _MG_CACHE (marketplace grade cache)"] | |
| BUST --> RESOLVE["_catalogue_scorecard(pid, None)<br/>SAME resolver /api/policies/{id}/scorecard uses"] | |
| RESOLVE --> STATUS["_set_extraction_status(complete, comp, grade, llm_used, llm_response_chars)<br/>BY CONSTRUCTION equal to card endpoint"] | |
| FLOOR --> STATUS | |
| STATUS --> POLL["frontend GET /api/upload/extraction-status/{pid}<br/>every 3s, max 120s"] | |
| POLL --> CARD["pushAssistant(card_ready, citations=[{pid}])<br/>then pushAssistant(choice_prompt)<br/>setActiveUploadPid(pid) Β· setExtractionInFlight(false)"] | |
| CARD --> ENABLE["Send + textarea + PDF + voice all re-enabled<br/>view_context.active_policy_id={pid} on next chat turn<br/>β single_brain enters ACTIVE POLICY DIVE-IN mode (KI-330)"] | |
| G1 & G2 & G3 & G4 & G5 & G6 & G7 & G8 -->|"fail"| REJ["clean rejection (reason surfaced)"] | |
| ``` | |
| **Summary.** The full pipeline an uploaded PDF traverses to become a | |
| catalogued-grade card with the same data depth as the 148 pre-curated policies | |
| β from HTTP request through Gemini extraction through inline chat card. | |
| **How it flows:** | |
| - **The 8 security gates, in order.** (1) **File mechanics** β `%PDF` | |
| magic, 5 KBβ25 MB size band, well-formed `%%EOF`, no embedded | |
| executables / JavaScript / launch actions. (2) **Content quality** β | |
| β₯1500 extractable chars, β₯3 pages, at least one insurance-domain | |
| keyword. (3) **Prompt-injection sweep** β "ignore previous | |
| instructions", "reveal your system prompt", jailbreak patterns. | |
| (4) **Per-session rate limit.** (5) **Per-IP rate limit** (catches | |
| session-ID rotation). (6) **Encrypted/locked PDF** β rejected | |
| cleanly. (7) **Page-count ceiling** (>200 pages β an abuse/bundle | |
| vector). (8) **Hash dedupe + reject-cache** β identical re-uploads | |
| short-circuit. | |
| - **Beyond identical-file dedup.** A **UIN net-new check** also runs β | |
| if the PDF's IRDAI UIN already belongs to a catalogued policy, the | |
| caller is pointed at the existing marketplace card instead of indexing | |
| a duplicate. **PDF-text fuzzy matching** also runs β if the upload | |
| content identifies as a known catalogued product (matching insurer + | |
| product-name patterns), the upload endpoint resolves to the existing | |
| `<insurer-slug>__<product>` id and reuses the curated card, skipping | |
| fresh extraction entirely (best UX for known products). | |
| - **On pass.** Chunks land in a per-session **quarantine** Chroma | |
| collection β session-isolated, 24 h idle TTL β AND in the shared | |
| `policies` collection so the upload can become a marketplace card. | |
| - **The locked chat sequence (ADR-044).** ack β gated wait β card β | |
| choice. The frontend's `extractionInFlight` flag disables Send, | |
| textarea, PDF button, and EVERY voice path (PTT / Sarvam / voice | |
| auto-submit) for the entire wait window. The choice prompt NEVER | |
| fires before the card-bearing message lands. Live-verified by | |
| Playwright: ack at idx 155 β card/fail at idx 315 β choice at idx | |
| 500, strictly ordered. Inputs re-enable in the same render as the | |
| choice prompt. | |
| - **The two LLM fast paths.** *Hash-cache* β `sha256(pdf_bytes)` matches | |
| a prior successful extraction β copy that `rag/extracted/<pid>.json` | |
| to this pid, surface `llm_used="hash-cache"`, ~1 s. *Gemini path* β | |
| 3 retries with jittered exponential backoff (2/4/8 s Β± 25 %), | |
| surfaces `llm_used="gemini-2.5-flash#N"` where N is the successful | |
| attempt, plus `llm_response_chars` so the operator can see WHICH LLM | |
| landed the extraction without HF Space stdout access. | |
| - **The heuristic floor is a hard guarantee, now significantly fatter | |
| (KI-332, 2026-05-27).** `build_record()` runs synchronously inside | |
| the upload HTTP call (sub-second) and writes `record.json` BEFORE | |
| the LLM ever fires. The pattern set was expanded from ~16 fields to | |
| ~28+ on the 2026-05-27 hardening pass: sum-insured ladder detection | |
| (`βΉ3L / βΉ5L / βΉ10L` β list), policy_type, min entry age, child entry | |
| days, lifelong-renewability flag, grace period, free-look period, | |
| geographic coverage, ICU capping, deductible amount, NCB cap %, | |
| organ donor / critical illness / preventive checkup / domiciliary / | |
| newborn presence booleans, premium payment modes. Local synthetic | |
| test hits 32 fields. Expected upload completeness on LLM-fail rises | |
| from ~47.8 % to ~65β70 %. If Gemini fails all 3 retries AND the NIM | |
| fallback fails, the card still renders at this richer floor β never | |
| fabricated, never a generic "Retry" placeholder. Verified on a hard | |
| Test Policy.pdf (8 MB) where Gemini 3/3 retries returned malformed | |
| JSON: card still landed with the expanded heuristic data. | |
| - **Multi-pass per-section extraction for big PDFs (KI-332, 2026-05-27).** | |
| For uploads with β₯ 25 K chars of extracted text (e.g. dense | |
| 100+ page policy wordings, 8 MB PDFs), the single-pass Gemini call | |
| reliably truncates JSON mid-emission β the HealthPolicy schema has | |
| ~40 fields and a complete output with verbatim quotes can exceed | |
| Gemini 2.5-flash's reliable output budget. **Solution:** split the | |
| schema into 7 logical sections (identity, eligibility, financial, | |
| waiting periods, coverage, limits, network+claims) and run each as | |
| its own smaller Gemini call IN PARALLEL via `asyncio.gather`. Each | |
| call carries ~15 % of the schema β fits comfortably in budget. | |
| Failure-isolated: 6/7 sections landing produces a partial extraction | |
| strictly better than the heuristic floor. Same wall-clock cost as | |
| single-pass (parallel). On total multi-pass failure, falls through | |
| to the legacy single-pass + NIM chain (heuristic floor still wins). | |
| Activation: `len(text) β₯ 25_000` triggers multi-pass; smaller PDFs | |
| keep using single-pass (faster, cheaper, works fine). | |
| - **Status endpoint == scorecard endpoint by construction.** When the | |
| background extraction finalises, it calls the SAME | |
| `_catalogue_scorecard(pid, None)` resolver that `/api/policies/{id}/ | |
| scorecard` uses. The `completeness_pct` and `overall_grade` on | |
| `GET /api/upload/extraction-status/{pid}` are therefore byte- | |
| identical to what the inline card renders. Same applies to the | |
| hash-cache short-circuit (which had to be fixed separately β | |
| earlier draft of the cache branch called `build_scorecard(...)` | |
| without `insurer_reviews` AND read `.overall_grade` instead of | |
| `.grade`, silently reporting status=17.4 % / grade None while the | |
| card showed 47.8 % / C). 2026-05-27 multi-PDF audit (commit | |
| `58e3c82`) confirmed parity across manipalcigna, hdfc-ergo, | |
| care-health, icici-lombard, star-health, Test Policy.pdf. | |
| - **Post-card dive-in mode (KI-330).** After the card lands the | |
| frontend sets `activeUploadPid` and plumbs it into every chat | |
| turn's `view_context.active_policy_id`. `single_brain.handle_turn` | |
| reads that and prepends an ACTIVE POLICY DIVE-IN block to the | |
| system instruction, forcing the brain to answer policy-specific | |
| questions via `retrieve_policies` + `get_policy_facts` on that | |
| pid instead of pivoting to "let me pull your recommendations". | |
| Verified 9/10 on the post-fix audit (up from 0/10 pre-fix). | |
| - **On fail.** A clean rejection naming the gate; the file is deleted; | |
| nothing is embedded. | |
| **Operator endpoints (admin-only):** | |
| - `POST /api/admin/upload/reextract?force=<bool>` β re-runs | |
| `extract_one_for_upload` for every persisted upload that lacks a | |
| `rag/extracted/<pid>.json`, or for ALL persisted uploads with | |
| `force=true`. Wired to a startup hook so every container boot | |
| upgrades legacy uploads automatically. | |
| - `GET /api/upload/extraction-status/{policy_id}` β the live state of | |
| the in-memory `_UPLOAD_EXTRACTION_STATUS` dict. Fields: `status` | |
| (`pending | running | complete | failed | unknown`), `llm_used` | |
| (`gemini-2.5-flash#N | nim-fallback | hash-cache`), | |
| `llm_response_chars`, `completeness_pct`, `overall_grade`, | |
| `started_at`, `completed_at`, `error`. | |
| ### 2.9 Deployment | |
| ```mermaid | |
| flowchart LR | |
| DEV["git push"] --> ORI["HF Space remote (origin)"] | |
| DEV --> GH["GitHub mirror (github)"] | |
| ORI --> BUILD["HF Space β Docker build"] | |
| BUILD --> SNAP["huggingface_hub.snapshot_download<br/>hydrate rag/ (corpus + vectors) from HF dataset"] | |
| BUILD --> FE["build Next.js static export"] | |
| SNAP --> RUN["entrypoint.sh β uvicorn backend.main:app :7860"] | |
| FE --> RUN | |
| RUN --> LIVE["live Space (FastAPI also serves the frontend)"] | |
| LIVE --> CHK["verify reported build SHA advanced<br/>(LFS/quota push can fail silently)"] | |
| ``` | |
| **Summary.** How a `git push` becomes a live Space, end | |
| to end. | |
| **How it flows:** | |
| - **Two remotes.** `origin` = the Hugging Face Space (a push here | |
| triggers the Docker rebuild). `github` = the public mirror reviewers | |
| read. | |
| - **Build.** The HF Space rebuilds the image, installs the backend, | |
| builds the Next.js static frontend, and runs | |
| `huggingface_hub.snapshot_download` to hydrate `rag/` (PDF corpus + | |
| prebuilt vectors) from the `insurance-bot-data` dataset β so the Space | |
| repo itself stays code-only and small. | |
| - **Start.** `entrypoint.sh` launches `uvicorn backend.main:app` on | |
| `$PORT` (default 7860); FastAPI also serves the exported frontend. | |
| - **Verify.** Always confirm the Space's reported build SHA actually | |
| advanced before trusting that new code is live β an LFS/quota push | |
| can fail without surfacing an error. | |
| --- | |
| ## 3. Key functions in plain language | |
| **Summary.** Seven internal jobs make the bot work. Each one gets a sequence diagram showing what calls what, a β€50-word summary, and a step-by-step explanation. An eighth subsection makes explicit *what is stored vs what is live-only*. | |
| ### 3.1 Profile construction | |
| ```mermaid | |
| sequenceDiagram | |
| autonumber | |
| participant U as User | |
| participant API as 3a. /api/chat | |
| participant B as 3b. LLM Brain | |
| participant T as save_profile_field tool | |
| participant S as 3d. session_state | |
| U->>API: typed / spoken reply | |
| API->>B: handle_turn(text, profile) | |
| B->>T: save_profile_field("age", 32) | |
| T->>S: session.profile.age = 32 | |
| S-->>T: ok | |
| T-->>B: saved | |
| B-->>API: next question (or proceed to scoring + pricing) | |
| ``` | |
| **Summary.** Every user reply runs the brain, which extracts one fact at a time and saves it via the `save_profile_field` tool into the live `session_state.profile`. No regex, no separate extractor model β extraction is a tool call. | |
| **How it flows:** | |
| - **User reply arrives** at `/api/chat` (typed text or post-STT transcript). | |
| - **Brain runs.** `handle_turn` calls Gemini with the reply + the current profile + the tool schema. | |
| - **Tool call.** When Gemini decides a fact is captured, it calls `save_profile_field(field, value)` β one call per fact. | |
| - **Write.** The tool sets the field on `session.profile` (in memory). All fact-find captures are concentrated here; there is no second LLM pass. | |
| - **Continue.** Brain returns the next question, or β if the profile is complete enough β proceeds to scoring + pricing. | |
| - **Persistence is later.** End-of-turn, the whole profile is auto-persisted to disk by the orchestrator (see Β§3.6). | |
| ### 3.2 Profile-aware scoring | |
| ```mermaid | |
| sequenceDiagram | |
| autonumber | |
| participant API as 3a. /api/scorecard | |
| participant SC as 3c. scorecard.py | |
| participant J as 4. policy_facts JSON | |
| participant R as 4. reviews JSON | |
| participant P as 3d. session.profile | |
| API->>SC: build_scorecard(policy_id, profile) | |
| SC->>J: read curated facts (CSR, room-rent, waitings) | |
| SC->>R: read insurer reviews | |
| SC->>P: read live profile (age, family, conditions) | |
| SC->>SC: 6 sub-scores β letter grade + personalised summary | |
| SC-->>API: grade Β· sub-scores Β· summary | |
| ``` | |
| **Summary.** Scoring grades every policy *for this user* β same policy, two users, two grades. Six sub-scores roll up into a letter, with a personalised one-line summary naming the strengths for this profile and the one honest caveat. | |
| **How it flows:** | |
| - **Trigger.** Brain (or `/api/scorecard` directly) asks for a per-policy grade. | |
| - **Three reads.** `scorecard.py` reads (a) the policy's `policy_facts` JSON, (b) the insurer's reviews JSON, and (c) the live profile from `session_state`. | |
| - **Six sub-scores.** Coverage Β· predictability Β· claims Β· network Β· renewal Β· terms β each weighted against this profile (e.g. a smoker is penalised more in pricing predictability; an elder is penalised more in waiting-period structure). | |
| - **Letter grade + summary.** A grade band (AβE) and a one-line summary listing this user's top strengths and the one capping caveat. | |
| - **Live, not stored.** Recomputed per request. Storage would be wrong here β the grade depends on *who is asking*. | |
| ### 3.3 Profile-aware pricing β the multivariate ballpark | |
| ```mermaid | |
| sequenceDiagram | |
| autonumber | |
| participant API as 3a. /api/coverage | |
| participant PR as 3c. premium_calculator.py | |
| participant J as 4. policy_facts JSON | |
| participant RC as 4. rate-card combinations<br/>age Γ metro Γ smoker Γ PED Γ<br/>co-pay Γ deductible Γ tenure | |
| participant P as 3d. session.profile | |
| API->>PR: estimate(policy_id, profile) | |
| PR->>J: read pricing structure | |
| PR->>P: read profile (7 dimensions) | |
| PR->>RC: match combination β rate range | |
| PR->>PR: interpolate + apply caveats | |
| PR-->>API: βΉlow β βΉhigh Β· point β βΉX (illustrative) | |
| ``` | |
| **Summary.** Pricing is a multivariate ballpark from public rate-card combinations β age Γ metro Γ smoker Γ PED Γ chosen co-pay Γ deductible Γ sum insured. Same plan, two users, two ranges. Explicitly **not** real underwriting. | |
| **How it flows:** | |
| - **Inputs.** The user's profile (seven dimensions) and the policy's pricing characteristics. | |
| - **Multivariate match.** `premium_calculator.estimate` looks up combinations across the seven dimensions and interpolates within bands. | |
| - **Output.** A range (e.g. βΉ12 500 β βΉ17 200 / yr) with a midpoint and a clear illustrative-only disclaimer. | |
| - **Honest caveat.** Final premium depends on the insurer's underwriting + medicals + IRDAI-filed loadings β this is a directional ballpark, not a quote. | |
| - **Live, not stored.** Same as scoring β recomputed per request because the answer is a function of *this* user. | |
| ### 3.4 Retrieval β `retrieve_policies` over the vector store | |
| ```mermaid | |
| sequenceDiagram | |
| autonumber | |
| participant B as 3b. LLM Brain | |
| participant T as retrieve_policies tool | |
| participant E as BGE-small embedder | |
| participant C as 4. Chroma<br/>shared 'policies' +<br/>per-session 'quarantine' | |
| B->>T: retrieve_policies(query, filters, top_k) | |
| T->>E: encode(query) β 384-d vector | |
| T->>C: nearest-neighbour + metadata filter | |
| C-->>T: top-k chunks (text Β· pdf Β· page Β· policy_id) | |
| T-->>B: chunks β the brain may quote only from these | |
| ``` | |
| **Summary.** When the brain needs to *cite* policy wording, it asks the retrieval tool. The query is embedded with BGE-small, looked up in Chroma, and the top-k chunks come back with their source PDF, page, and policy_id. The brain may state nothing the tool did not return. | |
| **How it flows:** | |
| - **Query.** Brain composes a query from the user's question + the profile. | |
| - **Embed.** Local BGE-small turns the query into a 384-d vector (no API hop, no rate limit). | |
| - **Search.** Chroma returns nearest chunks, scoped to either the shared `policies` collection (catalogue) or the per-session `quarantine` collection (user-uploaded PDFs β never crosses sessions, 24 h TTL). | |
| - **Faithfulness.** The brain may *only* state what these chunks (or `get_policy_facts`) returned. Anything else is a violation of the structural grounding guard (Β§2.4). | |
| ### 3.5 Curated facts β `get_policy_facts` (no embedding hop) | |
| ```mermaid | |
| sequenceDiagram | |
| autonumber | |
| participant B as 3b. LLM Brain | |
| participant T as get_policy_facts tool | |
| participant F as 4. policy_facts JSON file | |
| B->>T: get_policy_facts([policy_id_1, policy_id_2]) | |
| T->>F: read JSON file(s) directly | |
| F-->>T: CSR Β· complaints Β· grade Β· source_quote | |
| T-->>B: structured fields with verbatim PDF quote | |
| ``` | |
| **Summary.** For decision-critical numbers (claim-settlement ratio, complaints volume, waiting periods, room-rent rule, grade), the brain calls `get_policy_facts` which reads the curated JSON directly β no embedding hop, no LLM, exact values with the verbatim PDF quote. | |
| **How it flows:** | |
| - **Why this exists alongside Β§3.4.** Retrieval is for free-form text Q&A; this is for exact, fast, source-cited number lookups. Different query shape, different mechanism. | |
| - **No LLM in the path.** A plain file read of `40-data/policy_facts/<id>.json`. The brain quotes the value (and the `source_quote`) verbatim. | |
| - **Used by more than the brain.** `scorecard.py` and `premium_calculator.py` also read these JSON files directly β see Β§2.3 (the brain's edge is not the only edge into the JSON). | |
| ### 3.6 In-session state recovery (server restart resilience) | |
| ```mermaid | |
| sequenceDiagram | |
| autonumber | |
| participant U as User | |
| participant Br as Browser (carries chat_history) | |
| participant B as 3b. LLM Brain | |
| participant S as 3d. SessionState (in-memory) | |
| U->>Br: "what about premium?" | |
| Br->>B: chat turn + chat_history[N msgs] | |
| B->>S: get_session(session_id) | |
| Note over S: session was evicted<br/>(1h idle / restart)<br/>profile = BLANK | |
| B->>B: STATE-RECOVERY MODE<br/>chat_history has prior facts | |
| B->>S: save_profile_field for each fact in history | |
| B-->>Br: reply that picks up where it left off<br/>(never re-asks the name) | |
| ``` | |
| **Summary.** Sessions are in-memory only (`_TTL_SECONDS = 1h`), so a | |
| container restart or long idle wipes the server-side profile. When the | |
| browser still carries the conversation, the brain silently rebuilds the | |
| profile from the chat history instead of starting over. No disk read, | |
| no cross-session memory β purely a same-conversation resilience path. | |
| **How it flows:** | |
| - **Detect.** `get_session()` returns a blank `SessionState`, but `chat_history` arrives with β₯2 messages including a prior user turn β state was lost, not "fresh user". | |
| - **Re-capture from history.** A high-priority **STATE-RECOVERY MODE** prompt block tells the LLM: do not say you lost anything, do not re-ask the name, instead call `save_profile_field` for every fact present in the conversation so far, then continue. | |
| - **Resume.** From the LLM's point of view the next reply is just the next turn in an ongoing chat β the user never perceives the eviction. | |
| (There is no cross-session recall β see Β§2.6 and ADR-043 for why that | |
| was removed.) | |
| ### 3.7 Uploaded-PDF LLM extraction (the Β§2.8 pipeline in code terms) | |
| ```mermaid | |
| sequenceDiagram | |
| autonumber | |
| participant FE as Frontend (page.tsx) | |
| participant API as /api/upload-policy | |
| participant SEC as security.py (8 gates) | |
| participant UD as uploaded_docs.py | |
| participant H as heuristic build_record() | |
| participant CK as hash cache lookup | |
| participant G as Gemini 2.5-flash (3 retries) | |
| participant N as NIM fallback chain | |
| participant SC as scorecard.py + main._catalogue_scorecard | |
| participant ST as _UPLOAD_EXTRACTION_STATUS dict | |
| participant FE2 as Frontend poller | |
| FE->>API: POST multipart (PDF + session_id) | |
| API->>SEC: run 8 gates | |
| SEC-->>API: pass or reject | |
| API->>UD: persist_upload β writes source.pdf + meta.json to UPLOADED_DOCS_DIR/pid/ | |
| UD->>H: build_record β regex/keyword over text | |
| H-->>UD: record.json at ~30-50% completeness | |
| UD-->>FE: HTTP 200 with policy_id | |
| FE->>FE: setExtractionInFlight(true) Β· gate ALL inputs Β· pushAssistant(ack) | |
| UD->>UD: asyncio.create_task(extract_one_for_upload) | |
| Note over UD: _set_extraction_status(status='running') | |
| UD->>CK: _find_cached_extraction(sha256(pdf_bytes)) | |
| alt cache hit | |
| CK-->>UD: copy prior rag/extracted/other_pid.json | |
| UD->>UD: llm_used='hash-cache' | |
| else cache miss + text β₯25K chars | |
| Note over UD,G: MULTI-PASS (KI-332): 7 sections in parallel via asyncio.gather | |
| UD->>G: identity / eligibility / financial / waiting / coverage / limits / network | |
| G-->>UD: 7 partial JSONs (any landing counts as success) | |
| UD->>UD: merge sections Β· llm_used='gemini-2.5-flash-multipass' | |
| else cache miss + text under 25K chars | |
| UD->>G: single-pass chat with full schema | |
| alt success | |
| G-->>UD: HealthPolicy JSON Β· llm_used='gemini-2.5-flash#1' | |
| else timeout or malformed JSON | |
| UD->>G: retry #2 (2s Β± 25%) then retry #3 (4s Β± 25%) | |
| G-->>UD: HealthPolicy or fail | |
| UD->>N: NIM fallback (single attempt) | |
| N-->>UD: HealthPolicy or all-fail | |
| end | |
| UD->>UD: write rag/extracted/pid.json | |
| UD->>UD: merge LLM scalars INTO record.json | |
| end | |
| UD->>SC: _catalogue_scorecard(pid, None) β same as /api/policies/.../scorecard | |
| SC-->>UD: Scorecard (grade, data_completeness_pct, sub_scores) | |
| UD->>ST: _set_extraction_status(complete, comp, grade, llm_used, llm_response_chars) | |
| loop every 3s up to 120s | |
| FE2->>UD: GET /api/upload/extraction-status/pid | |
| UD-->>FE2: status snapshot | |
| end | |
| FE2->>FE2: pushAssistant(card_ready, citations include pid) | |
| FE2->>FE2: setActiveUploadPid(pid) | |
| FE2->>FE2: pushAssistant(choice_prompt) Β· setExtractionInFlight(false) | |
| ``` | |
| **Summary.** When a user uploads a PDF the backend writes a heuristic-baseline record first (sub-second), HTTP returns, and a background asyncio task either copies a prior extraction (hash cache hit) or runs Gemini 2.5-flash with 3 jittered retries β NIM fallback β heuristic floor. The status endpoint reports the SAME `completeness_pct` + `overall_grade` the card endpoint serves, by construction. | |
| **How it flows:** | |
| - **HTTP returns before extraction starts.** `extract_one_for_upload` is fired with `asyncio.create_task` so the user sees the card-ready ack inside one second, not after 30β60 s. | |
| - **Provenance, always.** Every `_set_extraction_status` call carries `llm_used` (`gemini-2.5-flash#1 | #2 | #3 | nim-fallback | hash-cache`) and `llm_response_chars`. The operator can see WHICH LLM landed the extraction without HF Space stdout access β verified live on 2026-05-27. | |
| - **Hash cache short-circuit.** `_find_cached_extraction(sha256(pdf_bytes))` looks for a prior successful extraction with the same content. On hit, the prior `rag/extracted/<other_pid>.json` is copied to this pid in ~1 s. | |
| - **Retries are jittered exponential.** 2 s / 4 s / 8 s backoffs each multiplied by `random.uniform(0.75, 1.25)` so repeated transient blips on a single Gemini instance don't synchronise. | |
| - **Merge model.** LLM output is merged INTO the heuristic record (LLM wins per-field where non-empty, heuristic stays where LLM silent) β the same "extracted + curated overlay" model the catalogued 148 use via `40-data/policy_facts/`. | |
| - **Status == card by construction.** The status endpoint calls `_catalogue_scorecard(pid, None)` β the SAME resolver `/api/policies/{id}/scorecard` uses. If they differ, that's a bug; today they match across 5 verified uploads. | |
| ### 3.8 What is stored vs what is live-only | |
| | What | Where | Why | | |
| |---|---|---| | |
| | Policy PDFs + vector chunks | `rag/corpus/` + Chroma store (HF dataset β pulled at build) | Built once, offline; read every request | | |
| | Curated policy facts (per policy) | `40-data/policy_facts/*.json` (code repo) | Small, human-reviewed, versioned with code | | |
| | User profile (current session) | **In-memory only β `SessionState.profile` (1 h idle TTL, no disk)** | Closing the tab / clearing chat forgets the profile by design (ADR-043) | | |
| | Per-policy **grade / scorecard** | **Not stored β live per request** | Two users get two grades for the same policy (profile-aware) | | |
| | **Premium range** for a policy | **Not stored β live per request** | Same reason as the grade | | |
| | Uploaded PDFs | Per-session Chroma quarantine, **24 h TTL** | Isolated to the uploader, never the shared corpus | | |
| | Per-turn reasoning | `logs/turns.jsonl` (one JSON line per turn) | Replay / audit; never echoed to other users | | |
| ## 4. Safety & quality | |
| ### 4.1 Uploaded-PDF defence (8 gates) | |
| `/api/upload-policy` accepts arbitrary PDFs from the public web β a real | |
| attack surface. `backend/security.py` runs every upload through eight gates | |
| before the file is ever embedded or shown to the model: | |
| 1. **File mechanics** β `%PDF` magic, 5 KBβ25 MB size band, well-formed | |
| `%%EOF`, and a scan for embedded executables / JavaScript / launch actions. | |
| 2. **Content quality** β β₯1500 chars of extractable text, β₯3 pages, and at | |
| least one insurance-domain keyword (rejects scans, junk, off-topic docs). | |
| 3. **Prompt-injection** β regex sweep for "ignore previous instructions", | |
| "reveal your system prompt", jailbreak patterns, etc. | |
| 4. **Per-session rate limit** β caps uploads / chunk quota per session. | |
| 5. **Per-IP rate limit** β catches session-ID rotation. | |
| 6. **Encrypted/locked PDF** β rejected cleanly rather than stored opaque. | |
| 7. **Page-count ceiling** β >200 pages is an abuse/bundle vector. | |
| 8. **Hash dedupe + reject-cache** β identical re-uploads short-circuit. | |
| Beyond identical-file dedup, a **UIN net-new check** runs on every upload: | |
| if the PDF's IRDAI UIN already belongs to a catalogued policy, the upload | |
| is recognised as *not* net-new and the caller is pointed at the existing | |
| marketplace card instead of a duplicate being indexed. | |
| Accepted uploads are embedded into a **separate, per-session quarantine** | |
| Chroma collection (never the shared corpus), scoped by `session_id` so one | |
| user's document is invisible to another, and auto-purged after a 24-hour idle | |
| TTL. | |
| ### 4.2 Answer faithfulness | |
| Faithfulness is structural, not bolt-on: the brain answers only from what | |
| its tools returned β `retrieve_policies` (policy-wording chunks) and | |
| `get_policy_facts` (claim-settlement ratio, complaints, scorecard and | |
| insurer-review data) β must cite, and is instructed to refuse when that | |
| grounding is weak. A prose-grounding guard verifies any policy / UIN named | |
| in the reply against both tools' returned policies before it is sent. | |
| Recommendation fit is gated (`backend/scorecard.py`, | |
| `retrieval_filters.py`) so plans that structurally don't fit the user's stated | |
| constraints are dropped, with the reason surfaced. | |
| ### 4.3 Evaluation | |
| A gold Q&A harness lives at `eval/run.py`. **Status:** it is pending a re-port | |
| to the single-brain architecture (it targeted the removed orchestrator) and is | |
| intentionally hard-guarded from running so it cannot publish stale scores; see | |
| its module docstring. The automated test suite (`tests/`, run with `pytest`) | |
| is the current green gate and covers routing, scoring, premium, recall, the | |
| upload security gates, and conversation logic. | |
| ### 4.4 Known limitations (honest) | |
| These are real and stated up front rather than buried: | |
| - **Uploaded-doc persistence is within-session, not across restarts.** | |
| Upload β graded marketplace card β grounded Q&A about the PDF all work | |
| live within a running container. But the Hugging Face Space's working | |
| filesystem is ephemeral by design (a fresh Chroma snapshot is pulled on | |
| every rebuild β see Β§2.7), and in practice an uploaded doc does **not** | |
| survive a Space rebuild/restart: the marketplace reverts to its | |
| curated/extracted baseline. Treat uploads as session-scoped. An | |
| operator/abuse prune endpoint exists (`POST /api/admin/uploaded-docs/ | |
| prune`, password-gated) to remove a persisted upload by id or prefix. | |
| - **Uploaded-PDF field extraction is LLM-assisted, with a deterministic | |
| heuristic floor (ADR-044, 2026-05-27 hardening bundle).** Every upload | |
| runs through two passes with a hash-cache fast path: | |
| - **Heuristic baseline** β regex + keyword extraction over the PDF | |
| text, synchronous inside the upload HTTP call (sub-second), populates | |
| common fields like waiting periods and room-rent rule. Yields | |
| ~30β50 % `data_completeness`. | |
| - **Gemini extraction (3 jittered retries)** β fires as a background | |
| asyncio task after the upload returns. Same Gemini 2.5-flash + same | |
| `EXTRACT_SYSTEM` prompt + same `HealthPolicy` Pydantic schema the | |
| catalogued 148 use offline. Backoffs 2 / 4 / 8 s Β± 25 % jitter. On | |
| success, writes `rag/extracted/<policy_id>.json` and merges INTO | |
| the persisted `record.json` (LLM values override where present, | |
| heuristic stays where the LLM was silent). ~10β60 s total. | |
| - **NIM fallback** β single attempt if all three Gemini retries fail. | |
| - **Heuristic floor** β if NIM also fails, the card still renders at | |
| the heuristic baseline (~47.8 %, grade C). Verified live on Test | |
| Policy.pdf (8 MB) where Gemini 3/3 retries returned malformed JSON. | |
| - **Hash-cache short-circuit** β if `sha256(pdf_bytes)` matches a | |
| prior successful extraction, that file is copied (~1 s) with | |
| `llm_used="hash-cache"` surfaced for ops. | |
| The frontend polls `GET /api/upload/extraction-status/<policy_id>` | |
| during the wait and renders the inline scorecard card ONLY after the | |
| LLM pass completes / fails / hits its 120 s timeout. The card is | |
| catalogued-grade β same `PolicyScorecardWidget`, same six sub-scores, | |
| same insurer-reputation data (`detect_insurer_slug` matches the PDF's | |
| legal name against the 21 known insurer slugs we have reviews data | |
| for and flips `insurer_slug` off the generic `user-upload` on a hit). | |
| **Status β card parity by construction:** the status endpoint and the | |
| card endpoint both call `_catalogue_scorecard(pid, None)`, so | |
| `completeness_pct` + `overall_grade` are byte-identical. **Operator | |
| provenance**: every status response carries `llm_used` (`gemini-2.5- | |
| flash#N | nim-fallback | hash-cache`) and `llm_response_chars` so the | |
| question "did Gemini actually run?" is answerable without HF Space | |
| stdout access. **Post-card dive-in mode (KI-330)**: the just-uploaded | |
| pid becomes `view_context.active_policy_id` on the next chat turn, | |
| so `single_brain` answers policy-specific questions via | |
| `retrieve_policies` + `get_policy_facts` instead of pivoting to | |
| recommendations. | |
| - **Live (BETA) voice mode** uses the browser's in-built speech | |
| recognition and is labelled unstable; **push-to-talk** is the reliable | |
| path (warm-armed mic + pre-roll so the first word is never clipped, and | |
| long answers are chunked so nothing is truncated). | |
| - **Recommendation vs. factual lookup.** A factual question that names a | |
| specific policy is answerable on a cold session; broad "recommend me a | |
| plan" requests still require the short fact-find first (by design). | |
| - **Admin LLM Chain β manual *Refresh now* and 30 s auto-poll | |
| (hardened 2026-05-27).** `POST /api/admin/probe` runs a real | |
| serial probe of every candidate model and updates `tested_at` on each | |
| `ModelHealth` row; the admin UI's `Refresh now` button and the LLM | |
| Chain tab's 30 s poll now both re-fetch `/api/admin/health` and call | |
| `renderUpdatedLabel()` so the top-left *"Last refresh / Next in"* | |
| timer reflects the actual just-completed probe (previously it stayed | |
| frozen at the login-time snapshot, so the operator could not tell | |
| from the timer whether probing had actually happened). | |
| --- | |
| ## 5. Tech stack & key decisions | |
| | Layer | Choice | One-line why | | |
| | --- | --- | --- | | |
| | Frontend | Next.js 16 (App Router), React 19, Tailwind v4, static export | Production-pattern UI; static export serves straight from the Space | | |
| | Backend | FastAPI + Pydantic | Async I/O, typed request/response, auto OpenAPI | | |
| | Brain | Google Gemini (`gemini-2.5-flash-lite`) + function calling | Frontier free-tier quality; one model + tools beats a multi-pass pipeline | | |
| | Fallback | NVIDIA NIM open-model chain, health-elected | Free, diverse; fail-loud, never silently wrong | | |
| | Retrieval | Chroma + BGE-small-en-v1.5 (local, 384-d) | Embedded, no infra, free, offline embeddings | | |
| | Voice | Sarvam Saarika (STT) + Bulbul (TTS) + Sarvam-M (Indic) | First-class Indian-accent / Hinglish handling | | |
| | Hosting | Hugging Face Space (Docker) + companion HF dataset | Free, GitHub-mirrored; code/data split keeps the image small | | |
| Decisions are deliberately biased toward *one deployable artifact, no | |
| fabrication, fail loud*. The single-brain consolidation, the NIM-only fallback | |
| (structured-output reliability over cross-provider breadth), the local | |
| embeddings (zero rate limits, offline ingest) and the code/data repo split are | |
| the load-bearing ones. | |
| --- | |
| ## 6. Repository map | |
| **At a glance** β the root is intentionally small; you only need to know | |
| these: | |
| - **`backend/`** β FastAPI app + the brain, tools, retrieval, scoring, security | |
| - **`frontend/`** β the Next.js web app | |
| - **`rag/`** β retrieval + offline ingest (corpus/vectors are git-ignored, pulled at build) | |
| - **`40-data/`** β curated, human-reviewed policy facts (versioned with code) | |
| - **`tests/`** β the pytest green gate | |
| - root files: `Dockerfile`, `entrypoint.sh`, `requirements.txt`, `pytest.ini`, `README.md` | |
| <details> | |
| <summary><b>Full directory tree</b> β click to expand</summary> | |
| ``` | |
| . | |
| βββ backend/ FastAPI app | |
| β βββ main.py HTTP routes (chat, transcribe, upload, profile, β¦) | |
| β βββ single_brain.py THE brain β Gemini + function-calling tools | |
| β βββ brain_tools.py the tools the brain can call (retrieval, profile, β¦) | |
| β βββ nim_fallback.py NIM fallback when Gemini fails / cold-start 503 | |
| β βββ llm_health.py background probe + sticky-primary election | |
| β βββ security.py the 8 upload-defence gates | |
| β βββ scorecard.py / recommendation fit + scoring | |
| β β retrieval_filters.py | |
| β βββ premium_calculator.py profile β illustrative premium | |
| β β sum_insured.py | |
| β βββ session_state.py per-session profile (in-memory only, ADR-043) | |
| β βββ uploaded_docs.py user-uploaded PDF pipeline (ADR-044): | |
| β β - persist_upload() β heuristic baseline | |
| β β (sub-second regex/keyword) + sha256 of | |
| β β PDF bytes β record.json + meta.json | |
| β β - build_record() β heuristic floor; | |
| β β guaranteed BEFORE any LLM fires | |
| β β - extract_fields_from_text() β 28+ regex | |
| β β patterns (KI-332 expansion 2026-05-27): | |
| β β sum_insured_options_inr ladder, policy_type, | |
| β β min/max entry age, child entry days, | |
| β β lifelong renewability flag, grace period, | |
| β β free-look period, geographic_coverage, | |
| β β ICU capping, deductible, NCB cap, | |
| β β organ/CI/preventive/domiciliary/newborn | |
| β β presence, premium payment modes β lifts | |
| β β floor from ~47.8% to ~65-70% even when | |
| β β ALL LLM passes fail | |
| β β - detect_insurer_slug() β match PDF text | |
| β β against 21 known insurer patterns; flips | |
| β β insurer_slug off 'user-upload' on hit | |
| β β - _multipass_extract_with_gemini() β | |
| β β (KI-332) 7-section parallel extraction: | |
| β β identity/eligibility/financial/waiting/ | |
| β β coverage/limits/network_claims, each as | |
| β β its own Gemini 2.5-flash call via | |
| β β asyncio.gather. Fires for PDFs β₯25K chars | |
| β β where single-pass would truncate. | |
| β β - extract_one_for_upload() β background | |
| β β asyncio task. Resolution order: | |
| β β 1. hash-cache (sha256 hit) | |
| β β 2. multi-pass per-section (β₯25K chars) | |
| β β 3. single-pass Gemini (3 jittered retries) | |
| β β 4. NIM fallback (single attempt) | |
| β β 5. heuristic floor (always wins | |
| β β because record.json already exists) | |
| β β Writes rag/extracted/<pid>.json, merges | |
| β β LLM scalars INTO record.json (LLM wins | |
| β β where non-empty, heuristic stays where | |
| β β silent) | |
| β β - _find_cached_extraction() β sha256 | |
| β β lookup across UPLOADED_DOCS_DIR/*/meta.json | |
| β β for prior successful extractions | |
| β β - _set_extraction_status() β finalises | |
| β β status using main._catalogue_scorecard(pid) | |
| β β so completeness_pct + overall_grade match | |
| β β the card endpoint BY CONSTRUCTION (success | |
| β β path AND fail path, post-KI-333) | |
| β β - Provenance fields: llm_used | |
| β β (gemini-2.5-flash#N | gemini-2.5-flash-multipass | |
| β β | nim-fallback | hash-cache) + | |
| β β llm_response_chars | |
| β β - backfill_extractions() β startup hook | |
| β β re-runs LLM extraction on every | |
| β β UPLOADED_DOCS_DIR/<pid>/ missing | |
| β β rag/extracted/<pid>.json | |
| β β - _UPLOAD_EXTRACTION_STATUS dict + endpoint | |
| β β GET /api/upload/extraction-status/{pid} | |
| β β - Admin endpoint | |
| β β POST /api/admin/upload/reextract?force=... | |
| β βββ voice_format.py TTS pre-processing (money/Indic normalisation) | |
| β βββ admin.py /api/admin/* (health, telemetry) | |
| β βββ providers/ thin clients: google_gemini, nvidia_nim, sarvam_*, | |
| β local_embeddings (BGE), openrouter/groq (dormant) | |
| βββ frontend/ Next.js 16 app (src/app/page.tsx, src/lib/*) | |
| βββ rag/ retrieval + offline ingest pipeline | |
| β βββ retrieve.py query β top-k chunks (request hot path) | |
| β βββ ingest.py/extract.py/schema.py offline corpus build | |
| β βββ corpus/ vectors/ data β git-ignored, from the HF dataset | |
| β βββ policies.duckdb offline structured rollup | |
| βββ 40-data/ curated, version-with-code structured facts | |
| β βββ policy_facts/*.json per-policy facts + verbatim source_quote | |
| β βββ reviews/ premiums/ insurer_network.json | |
| βββ eval/ gold Q&A harness (pending single-brain re-port) | |
| βββ 70-docs/ design docs & ADRs β οΈ see note below | |
| βββ 80-audit/ defect register / audit transcripts | |
| βββ tools/ operational scripts (corpus, probes, link-rot) | |
| βββ tests/ pytest suite β the green gate (`pytest`) | |
| βββ Dockerfile / entrypoint.sh HF Space image (pulls the data dataset) | |
| βββ pytest.ini scopes pytest to tests/ (clean on a fresh clone) | |
| βββ requirements.txt | |
| ``` | |
| </details> | |
| > β οΈ **Note on `70-docs/` and ADRs:** these capture design history and | |
| > rationale; some predate the single-brain rewrite and are being brought into | |
| > line with the system as it actually runs today. **This README is the | |
| > authoritative present-state map**; the ADRs are decision context. | |
| --- | |
| ## 7. Run it locally | |
| **Prerequisites:** Python 3.11+, Node 20+, the API keys below. | |
| ```bash | |
| # 1. Code | |
| git clone <code-repo-url> "Insurance Sales Bot" | |
| cd "Insurance Sales Bot" | |
| python -m venv .venv && . .venv/bin/activate | |
| pip install -r requirements.txt | |
| # 2. Data (corpus + prebuilt vectors live in the companion dataset) | |
| python -c "from huggingface_hub import snapshot_download; \ | |
| snapshot_download(repo_id='rohitsar567/insurance-bot-data', \ | |
| repo_type='dataset', local_dir='rag/_hf_dataset_backup')" | |
| # then place rag/corpus and rag/vectors from the snapshot into rag/ | |
| # (the Docker build does this automatically; see entrypoint.sh) | |
| # 3. Secrets β copy the example and fill in: | |
| cp .env.example .env | |
| # GOOGLE_API_KEY β Gemini brain (primary) [required] | |
| # NVIDIA_NIM_API_KEY β NIM fallback chain [required] | |
| # SARVAM_API_KEY β STT / TTS / Indic [required for voice] | |
| # HF_TOKEN β pull the data dataset at boot [required] | |
| # ADMIN_PASSWORD β gates /api/admin/* [required] | |
| # VOYAGE_API_KEY β offline ingest embeddings only [ingest only] | |
| # OPENROUTER/GROQ_API_KEY β dormant (kept for one-flip re-enable) | |
| # 4. Backend | |
| uvicorn backend.main:app --host 127.0.0.1 --port 8000 --reload | |
| # 5. Frontend (separate terminal) | |
| cd frontend && npm install && npm run dev # http://localhost:3000 | |
| # Tests (the green gate) | |
| pytest # collects tests/ only | |
| ``` | |
| --- | |
| ## 8. Deployment | |
| Hosting is a **Hugging Face Space** running the `Dockerfile`: | |
| 1. The image installs the backend and builds the static frontend. | |
| 2. At build time it runs `huggingface_hub.snapshot_download` to hydrate | |
| `rag/` (corpus + vectors) from the `rohitsar567/insurance-bot-data` | |
| dataset, so the Space repo itself stays code-only and small. | |
| 3. `entrypoint.sh` starts `uvicorn backend.main:app` on `$PORT` (default | |
| `7860`, the port HF Spaces routes to); FastAPI also serves the exported | |
| frontend. | |
| The code repo is mirrored to **both** the HF Space remote (`origin`) and a | |
| GitHub remote (`github`); the heavy data is updated on the HF **dataset** | |
| side. Space repository secrets supply the API keys listed in Β§7. After any | |
| deploy, verify the Space's reported build SHA actually advanced before | |
| trusting that new code is live (a quota/LFS push can fail without surfacing | |
| an error). | |