Spaces:
Sleeping
title: Insurance Sales Portfolio Expert
emoji: π₯
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Voice-first AI advisor for Indian health insurance
Insurance Sales Portfolio Expert
A health-insurance advisory web app for the Indian market (presented in-app as "Insurance Advisor"). You describe your situation in plain language (typed or spoken, English or Hindi/Hinglish); it asks a few clarifying questions, then recommends and explains real policies β grounded in the actual policy documents, with every claim traceable to a source clause. It also lets you upload your own policy PDF and ask questions about it.
Live: https://rohitsar567-insurancebot.hf.space
Reading this cold? Β§1 is plain English. Β§2 walks you down four levels of abstraction: the user journey (Β§2.1), the building blocks (Β§2.2), the functional abstraction inside each block (Β§2.3), then deep-dives per building block (Β§2.4βΒ§2.9). Β§3 gives a function-by-function sequence-diagram view of the six most important jobs. Β§4βΒ§8 are safety, stack, repo map, run-it-locally, and deployment.
Table of contents
- What this is
- How it works, end to end
- Key functions in plain language
- Safety & quality
- Tech stack & key decisions
- Repository map
- Run it locally
- Deployment
1. What this is
The short answer. A health-insurance advisor that behaves like a knowledgeable, unbiased human advisor β not a lead-generation funnel. You describe your situation; it asks a few clarifying questions; it recommends real plans that fit, with every factual claim backed by the exact clause in the real policy document. No lead capture. No commission bias. If the honest answer is "this isn't in the document," it says so β instead of guessing.
It works by chat or voice, in English or Hindi/Hinglish, on desktop and mobile.
The problem this solves
Buying health insurance in India is hard for an ordinary person. A first-time buyer faces three concrete problems:
- Too much to compare. ~148 plans across 21 insurers, each with dozens of decision-relevant fields (waiting periods, room-rent caps, co-pay, maternity, sub-limits, network size). No human reads them all.
- The truth is buried. The number that decides whether a plan is right for you is on page 47 of a PDF written by lawyers.
- Most "advice" is conflicted. Aggregator sites optimise for the sale, not the fit.
The cost of getting this wrong is real money and denied claims years later. The goal is a tool a non-expert can trust the way they would trust a good independent advisor: personalised to their profile, sourced, and never fabricating.
What it does, concretely
- Conversational fact-find β short natural back-and-forth establishes your profile (age, dependants, budget, pre-existing conditions, priorities) instead of a long form.
- Personalised recommendations β plans ranked for fit to your profile. A fixed-benefit plan is not pushed to someone who needs comprehensive cover; a plan whose entry age excludes you is filtered out.
- Grounded answers β every factual claim about a policy is retrieved from that policy's actual document and shown with its source. Weak or missing evidence produces an honest "not stated in the document."
- Marketplace & compare β browse the full indexed catalogue, open a detailed scorecard per plan, compare up to four side by side.
- Profile β premium (illustrative) β a live ballpark premium range that updates as you change your profile. Not real underwriting β a multivariate range from public rate-card combinations (see Β§3.3).
- Bring your own document β upload any policy PDF; it is safely indexed for the rest of your session so you can ask questions about your document.
- Voice β speak instead of typing (tap-to-talk on mobile, push-to-talk on desktop); replies are spoken back. Indian-accent and Hinglish aware.
2. How it works, end to end
The short answer. A Next.js browser app talks to a FastAPI backend. Every chat turn goes to a single LLM "brain" (Google Gemini) with a small set of function-calling tools β most importantly a retrieval tool over a Chroma vector store built from the real policy documents. The brain decides when to retrieve, what to retrieve, and how to answer; it cannot state a policy fact it did not retrieve. If Gemini is unavailable, the turn transparently falls back to an NVIDIA NIM open-model chain. Voice in/out is handled by Sarvam (Indian-language STT/TTS). Heavy data (PDF corpus + prebuilt vectors) lives in a separate Hugging Face dataset, not the code repo.
The rest of this section walks you down four levels of abstraction: Β§2.1 the user's journey (plain English, no tech); Β§2.2 the building blocks at the highest level (the four canonical buckets); Β§2.3 the functional abstraction β what happens inside each bucket; and Β§2.4βΒ§2.9 the deep dives per building block. Every diagram is followed by a β€50-word summary and a hierarchical how it flows breakdown.
2.1 The user's journey (plain English β no tech)
Before the engineering detail, here is what actually happens for the person using it. No code, no jargon β just the path from opening the app to deciding with confidence.
flowchart TD
S["π You open the app β web or mobile, nothing to install"] --> TELL["π£οΈ Tell it about you β a short chat, typed OR spoken, English / Hindi-Hinglish<br/>age Β· family Β· budget Β· health Β· what you care about"]
TELL --> ASK["β It asks just 2β3 clarifying questions<br/>(a real conversation, never a long form)"]
ASK --> REC["π― A personalised shortlist β plans ranked for YOUR fit, each with the reason it fits"]
REC --> WHY["π Open any plan: every fact is backed by the exact clause in the real policy PDF<br/>an honest "not stated in the document" instead of a guess"]
WHY --> EXPLORE{"Want to dig deeper?"}
EXPLORE -->|"Compare"| CMP["βοΈ Compare up to 4 plans side by side Β· full scorecard per plan"]
EXPLORE -->|"Browse"| MKT["π Browse the full indexed marketplace"]
EXPLORE -->|"Ask"| QA["π¬ Ask follow-up questions β answered only from the actual documents"]
EXPLORE -->|"My own policy"| UP["π Upload your own policy PDF"]
UP --> UPIDX["β³ Quick ack β 'Reading it through, ~30β60 s'<br/>(everything in chat is gated while the analysis runs)"]
UPIDX --> UPCARD["π Inline scorecard card with FULL data:<br/>grade letter Β· 6 sub-scores Β· verbatim signals Β· insurer reputation"]
UPCARD --> UPCHOICE{"How would you like to proceed?"}
UPCHOICE -->|"Finish profile"| TELL
UPCHOICE -->|"Dive into the PDF"| QA
CMP --> PREM
MKT --> PREM
QA --> PREM
PREM["πΈ A live premium estimate that updates as you change your profile"] --> DONE["β
Decide with confidence β no lead capture, no commission bias"]
VOICE["ποΈ Optional the whole way: speak instead of type β it speaks the answers back"] -.-> TELL
VOICE -.-> QA
Summary. A user opens the app and ends the session having decided on a plan with confidence β and how the system loops through compare / browse / Q&A / upload along the way. No backend in this view; just the human path. Every session starts fresh β there is no cross-session memory; closing the tab forgets you (privacy-by-design, see ADR-043).
How it flows:
- Conversational fact-find. A short typed-or-spoken back-and-forth (English or Hindi-Hinglish) captures age, family, budget, health and what you care about β instead of a long form.
- Personalised shortlist + a "why". Plans are ranked for your fit; every fact about a plan is backed by the exact clause in the real policy PDF, never invented.
- Branches from the shortlist. Compare side by side, browse the full marketplace, ask follow-up questions, or upload your own policy PDF and ask about your document (kept private to your session).
- Upload-PDF flow is a staged sequence (ADR-044, 2026-05-27): upload β bot says "reading it through, ~30β60 s" β all chat input is gated during the wait (Send button, textarea, voice paths all blocked so nothing can interrupt the staging) β bot pushes the inline scorecard card with FULL extracted data once the LLM pass lands β bot then asks whether you'd like to finish your profile or dive into the PDF. The card is the same shape as any catalogued policy card β six sub-scores, verbatim signals, real claim-settlement data when the insurer is recognised.
- Live premium. Updates as you change the profile.
- Decision. No lead capture and no commission bias β the path ends at decide, not at a sales handoff.
2.2 System at a glance β the big building blocks
The short answer. The system has four "tall buckets": Frontend (what you see), Backend (what runs on the server), Data layer (the policy knowledge), and Voice (in and out). They talk to each other over standard HTTP / JSON.
Two terms first, in one sentence each:
- Frontend = everything you see on screen β the chat box, marketplace cards, sliders, profile builder. Built with Next.js + React (a standard, well-supported web-UI library). Runs in your browser.
- Backend = everything that runs on the server β the LLM brain, the retrieval, the scoring/pricing logic, the upload-security gates. Built with FastAPI (a standard Python HTTP framework). Think of the frontend as the menu + waiter; the backend is the kitchen.
Both Next.js and FastAPI are deliberately boring, standard choices β they let us not spend engineering on the UI layer or the HTTP plumbing, so we spend that effort on the brain and the data, where the product differentiation actually lives.
Now the big picture β the buckets and how they talk:
flowchart LR
subgraph FE["π Frontend (browser Β· Next.js)"]
UI["Chat Β· Marketplace Β· Compare Β· Profile builder<br/>Voice capture & playback"]
end
subgraph BE["βοΈ Backend (FastAPI server)"]
API["HTTP endpoints + orchestration<br/>backend/main.py"]
BRAIN["π§ LLM Brain<br/>Google Gemini + function-calling tools<br/>(NIM fallback chain on failure)"]
SCORE["π― Scoring + Pricing<br/>scorecard.py Β· premium_calculator.py"]
PROF["π€ Profile (in-memory only)<br/>session_state.SessionState Β· 1h idle TTL"]
end
subgraph DATA["π Data layer"]
VEC["Vector DB (Chroma) β policy chunks<br/>+ per-session quarantine (uploads)"]
FACTS["Curated facts JSON<br/>40-data/policy_facts/*.json"]
end
subgraph VOICE["ποΈ Voice"]
STT["Sarvam STT (in)"]
TTS["Sarvam TTS (out)"]
end
UI <-->|"text Β· JSON"| API
UI -->|"audio"| STT --> API
API --> TTS --> UI
API <--> BRAIN
BRAIN <-->|"retrieve_policies"| VEC
BRAIN <-->|"get_policy_facts"| FACTS
BRAIN <-->|"save_profile_field"| PROF
BRAIN --> SCORE
SCORE <--> FACTS
SCORE <--> PROF
Summary. Four building blocks talk over HTTP / JSON: Frontend (the chat UI you see), Voice (Sarvam STT in + TTS out), Backend (FastAPI with four sub-blocks β orchestration, LLM Brain, Scoring + Pricing, Profile & Persistence), and the Data layer (Chroma vectors + curated JSON facts).
How it flows:
- 1. Frontend (browser Β· Next.js). Renders chat, marketplace, compare, and the profile builder. Sends typed text and audio over HTTP, plays the synthesised reply.
- 2. Voice.
Sarvam STT (in)turns spoken audio into a text turn;Sarvam TTS (out)turns the reply text back into spoken audio. - 3. Backend (FastAPI). Four sub-blocks β 3a HTTP endpoints + orchestration (
backend/main.py); 3b LLM Brain (Gemini + function-calling tools; NIM fallback on failure); 3c Scoring + Pricing (scorecard.py+premium_calculator.py); 3d Profile (in-memory only βsession_state.SessionState, no disk). - 4. Data layer. Two stores β the Chroma vector DB (shared policy chunks + per-session quarantine for uploads) and curated JSON facts at
40-data/policy_facts/*.json. The brain, scoring, and pricing all read from these.
Diagram legend (used throughout Β§2):
- Solid arrow (
β) = a real call / data flow on the request path. - Double arrow (
β) = bidirectional β one side calls, the other returns. - Dotted arrow (
-.->) = a side-channel or async event β voice playback, barge-in interrupt, end-of-turn persistence, etc. β not on the main request path. - Subgraph box = everything inside runs in one place (one process / one service / one storage layer).
- Edge labels (e.g. "retrieve_policies") name the actual function or signal carried on that edge.
2.3 Functional abstraction β what happens inside each building block
flowchart TB
subgraph FE["1. Frontend"]
direction TB
F1["capture_input<br/>typed text Β· spoken audio"]
F2["render_reply<br/>chat Β· cards Β· scorecard Β· audio"]
end
subgraph V["2. Voice"]
direction TB
V1["transcribe<br/>Sarvam Saarika STT"]
V2["synthesize<br/>voice_format β Sarvam Bulbul TTS"]
end
subgraph BE["3. Backend"]
direction TB
subgraph BE_API["3a. HTTP + orchestration"]
A1["route_request"]
A2["orchestrate_turn"]
end
subgraph BE_BRAIN["3b. LLM Brain"]
B1["handle_turn<br/>one Gemini call + tool loop"]
B2["fact_find<br/>save_profile_field"]
B3["retrieve<br/>retrieve_policies"]
B4["lookup_facts<br/>get_policy_facts"]
B5["recommend<br/>mark_recommendation"]
end
subgraph BE_SCORE["3c. Scoring + Pricing"]
SC1["grade_per_profile<br/>scorecard.py"]
SC2["estimate_premium<br/>premium_calculator.py"]
end
subgraph BE_PROF["3d. Profile (in-memory)"]
P1["update_session_profile<br/>session_state.SessionState"]
P2["evict_on_idle<br/>1h TTL Β· no disk"]
end
end
subgraph DATA["4. Data layer"]
direction TB
D1["vector_search<br/>Chroma Β· BGE-small"]
D2["fact_lookup<br/>40-data/policy_facts/*.json"]
end
%% forward edges (input / down the pipeline)
F1 -->|"audio"| V1
F1 -->|"text Β· JSON"| A1
V1 --> A1
A1 --> A2
A2 --> B1
B1 --> B2
B1 --> B3
B1 --> B4
B1 --> B5
B2 --> P1
B3 --> D1
B4 --> D2
A2 --> SC1
A2 --> SC2
SC1 -->|"reads"| D2
SC2 -->|"reads"| D2
SC1 -->|"reads"| P1
SC2 -->|"reads"| P1
P1 -.->|"idle 1h"| P2
%% return edges (output / back to caller)
D1 -.->|"top-k chunks"| B3
D2 -.->|"per-policy facts"| B4
SC1 -.->|"grade"| A2
SC2 -.->|"premium range"| A2
B1 -.->|"reply + citations"| A2
A2 -.->|"text"| F2
A2 -.->|"speak?"| V2
V2 -.->|"audio"| F2
%% blue solid = forward Β· orange dashed = return
linkStyle 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17 stroke:#1565c0,stroke-width:2px
linkStyle 18,19,20,21,22,23,24,25 stroke:#e65100,stroke-width:2px,stroke-dasharray:6 3
Legend. Blue solid = forward flow (input / call down the pipeline). Orange dashed = return flow (result / reply back up).
Summary. Inside each building block from Β§2.2, a small set of named functions fires per turn β Frontend captures and renders, Voice transcribes and synthesises, the four Backend sub-blocks orchestrate / decide / score / remember, and the Data layer answers their reads.
How it flows:
- 1. Frontend.
capture_inputaccepts typed text or recorded audio;render_replypaints chat + marketplace cards + scorecard + audio playback. - 2. Voice.
transcribeis the inbound path (Sarvam Saarika STT);synthesizeis the outbound path (voice_formatnormalises money / Indic shorthand β Sarvam Bulbul TTS). - 3a. HTTP + orchestration.
route_requestmaps the URL to a handler;orchestrate_turnis the per-turn supervisor β it owns the request lifecycle and ties brain + scoring + voice + persistence together. - 3b. LLM Brain. One
handle_turnper turn calls Gemini, which chooses which offact_find/retrieve/lookup_facts/recommendto run as tools. The brain may only state what its tools returned. - 3c. Scoring + Pricing.
grade_per_profileandestimate_premiumread curated facts and the live profile, compute on every request (never stored), and hand back toorchestrate_turn. - 3d. Profile (in-memory).
update_session_profilereflects eachfact_findwrite into the liveSessionState.profile. State lives in process memory only; an idle session is evicted after 1 h. There is no disk persistence and no cross-session recall (see ADR-043, 2026-05-27). - 4. Data layer. Two reads β
vector_searchfor free-form Q&A, andfact_lookupfor decision-critical numbers with verbatim quotes. The data layer does no writes during a request β those happen offline only (vector ingest, curated-facts edits).
2.4 LLM brain + fail-loud fallback chain
flowchart LR
Q["chat turn"] --> G{"Gemini<br/>gemini-2.5-flash-lite"}
G -->|"OK"| ANS["grounded reply<br/>(only from tool results)"]
G -->|"real failure / cold-start 503"| H["backend/llm_health.py<br/>background probe + sticky-primary election"]
H --> NIM["NVIDIA NIM open-model chain<br/>backend/nim_fallback.py"]
NIM -->|"healthy model"| ANS
NIM -->|"whole chain down"| LOUD["explicit 'service degraded'<br/>(never a silently wrong answer)"]
ANS --> GUARD["prose-grounding guard:<br/>every policy/UIN named is verified<br/>against retrieve_policies + get_policy_facts"]
GUARD --> OUT["sent to user"]
Summary. How a chat turn is served by the primary LLM, what happens when it fails, and the structural guard that prevents a silently wrong answer.
How it flows:
- Primary path. Gemini (
gemini-2.5-flash-lite). On a healthy response β the reply is built only from what the tools returned. - Fallback path (fail-loud). A real Gemini failure or a cold-start
503 routes through
backend/llm_health.py(a background probe with sticky-primary election) to the NVIDIA NIM open-model chain (nim_fallback.py). One healthy model in that chain serves the turn. - Last resort. If the whole chain is down, the user gets an explicit "service degraded" message β never a silently wrong answer.
- Prose-grounding guard. Before a reply is sent, every policy / UIN
named in the prose is verified against the same
retrieve_policiesandget_policy_factsresults the brain saw (with an exemption for genuine catalogue UINs). Faithfulness is structural, not bolt-on.
Why a single brain (not a multi-model pipeline). Earlier designs split
the work across several LLM passes (a separate fact-find brain, a QA
brain, a faithfulness-judge). That scaffolding was removed: a single
frontier model with well-designed tools is more accurate, far simpler,
and eliminates a whole class of cross-model contract bugs. Today there is
exactly one brain call per turn plus its tool calls. Faithfulness is
enforced structurally β the brain can only state what retrieve_policies
and get_policy_facts returned β rather than by a second grader model.
More on the fallback chain. The brain's primary is Gemini
(gemini-2.5-flash-lite). On a real Gemini failure or a cold-start 503,
the turn falls back to an NVIDIA NIM chain of open models. Candidate
selection uses a background health probe with sticky-primary election
(backend/llm_health.py) so one healthy model is chosen per call. The
fallback is fail-loud: if the whole chain is down the user gets an
explicit "service degraded" message, never a silently wrong answer.
(A separate LLM "judge" existed historically and has been retired β the
single-brain design made it redundant.)
Sticky-session retry policy (hardened 2026-05-27). Once a session
has completed at least one successful single-brain turn, it stays on
single_brain for the rest of its lifetime β cross-fading to
nim_fallback mid-stream would discard last_recommendation_ids / last_retrieved_chunks / slug_to_insurer. To absorb Gemini's
intermittent "high demand" 503 bursts on sticky sessions,
_gemini_call now uses an adaptive retry schedule: non-sticky
session keeps 1 retry with a 1.5 s backoff (fast-fail to NIM on
cold-start); sticky session gets 2 retries with jittered
exponential backoffs (1.5 s β 3 s, Β±25 % jitter). If the chain still
fails after retries, the user sees a plain, honest reply "My model
service had a brief blip on that turn β please send the same message
again." (no more misleading "could you say that again?").
2.5 Voice pipeline (in / out, with barge-in)
flowchart LR
MIC["mic β tap-to-talk (touch) / push-to-talk (desktop)"] --> MR["MediaRecorder (authoritative audio)"]
MIC -.->|"live interim text"| WS["Web Speech API (display only)"]
MR --> STT["/api/transcribe β Sarvam Saarika STT"]
STT --> BR["single_brain.handle_turn"]
BR --> RPL["reply text + citations"]
RPL --> VF["voice_format.py<br/>money/Indic normalise Β· chunk at sentence bounds"]
VF --> BUL["Sarvam Bulbul TTS"]
BUL --> PLAY["in-DOM <audio>"]
SPK["user speaks over bot"] -.->|"barge-in"| PLAY
SPK -.->|"abort in-flight"| BR
Summary. How spoken input becomes a chat turn, how the reply becomes speech back, and how the user can interrupt mid-answer.
How it flows:
- Capture. Tap-to-talk (touch) or push-to-talk (desktop) starts
MediaRecorder(the authoritative audio) and Web Speech (a live interim transcript shown on screen but never trusted for the turn). - STT. The authoritative audio is sent to
/api/transcribe(Sarvam Saarika β Indian-accent + Hinglish aware). - Brain β reply. The transcript runs through
single_brain.handle_turnexactly like a typed turn. - TTS.
voice_format.pynormalises money / Indic shorthand and chunks at sentence bounds (so long replies are spoken in full); Sarvam Bulbul speaks; an in-DOM<audio>element plays. - Barge-in. The user speaking over the bot pauses playback and
aborts the in-flight
/api/chat, so the bot stops mid-thought rather than over-talking.
More on voice. The browser shows a live interim transcript via the
Web Speech API while MediaRecorder captures the authoritative audio,
which is sent to /api/transcribe (Sarvam Saarika STT). Replies are
synthesised by Sarvam Bulbul TTS, with money / Indic shorthand
normalised in backend/voice_format.py before synthesis (long replies are
chunked at sentence boundaries so the full answer is spoken, not just the
first sentence), and played through an in-DOM <audio> element. Speaking
over the bot (barge-in) pauses that audio and aborts the in-flight
/api/chat request. On touch devices voice is tap-to-talk; on desktop,
push-to-talk; the live interim transcript accumulates the full utterance
while you speak.
2.6 Profile & personalisation (in-memory only)
flowchart TB
A["user answers (chat or profile builder)"] --> SPF["save_profile_field β SessionState.profile<br/>(in-memory dict only)"]
SPF --> FIT["scorecard fit + grade<br/>(reads live profile)"]
SPF --> PREM["illustrative premium<br/>(reads live profile)"]
SPF --> TURN["next chat turn<br/>(brain sees full live profile)"]
SPF -.->|"1h idle"| EVICT["session evicted from memory<br/>profile gone forever"]
SPF -.->|"close tab"| EVICT
RECOV["server restart / 1h idle WHILE tab still open<br/>+ chat_history carried by browser"] -.-> SR["STATE-RECOVERY MODE<br/>brain rebuilds profile from chat_history<br/>(in-session only Β· never reads disk)"]
SR --> SPF
Summary. The profile is captured into a per-session in-memory dict
(SessionState.profile), feeds scoring + pricing + the next turn's
brain prompt, and is discarded the moment the session evicts (1 h idle
or "Clear chat"). There is no on-disk persistence and no cross-session
recall. The in-session STATE-RECOVERY path covers container
restarts by rebuilding the profile from the chat history the browser
still carries β it never touches disk.
How it flows:
- Capture. Every answer (chat or profile-builder form) is written via
save_profile_field(or thePOST /api/profileendpoint) into the liveSessionState.profile. This is a regular Python dataclass field on the in-memory session object. - Drives scoring + pricing. The same profile feeds the scorecard fit-and-grade (Β§3.2) and the live premium estimate (Β§3.3) on every request β both reads, never persisted.
- Evicted on idle / close. A session is evicted from the
_sessionsdict after 1 h of inactivity (_TTL_SECONDS). Hitting Clear chat (POST /api/session/clear) evicts immediately. Closing the tab disconnects the browser β the server-side session ages out on the same TTL. - State recovery (in-session only). If the server restarted or the
session evicted while the browser still has the chat open, the
client re-sends its
chat_historywith the next turn. The brain enters STATE-RECOVERY MODE and silently re-captures the facts already stated in history β without ever asking the user's name again. This is not cross-session; it only resolves the case where the user is still in the same conversation.
Why no cross-session recall (ADR-043, 2026-05-27). An earlier
design persisted profiles to 40-data/profiles/<name>.json and offered
a "Welcome back, ?" prompt on return. The name-only slug key
collided across distinct users (every "Rohit" wrote to the same file),
which required four sequential hardening passes β prompt redaction,
match-before-merge guards, same-turn fact extractors, a two-fact gate β
to keep contained. The cost/benefit for an insurance-shopping app
(rare-purchase, return sessions uncommon) didn't justify the surface.
The simpler "session is in-memory only" model matches the privacy story
the product wants to tell.
2.7 Data architecture β where the JSON lives, where the vectors live
Summary. Two complementary kinds of data power the bot: small JSON files of human-reviewed facts versioned with the code, and a Chroma vector database of the full policy text held in a separate HF dataset. JSON answers exact-number questions; vectors answer free-form Q&A.
2.7.1 The two data kinds, side by side
| JSON files | Vector database (Chroma) | |
|---|---|---|
| What's in it | Per-policy curated structured fields β CSR%, ICR%, complaints/10k, room-rent rule, waiting periods, sub-limits, grade β each value carries its verbatim source_quote from the PDF |
Full text of every policy PDF, chunked into ~500-token overlapping pieces, embedded as 384-d vectors with BGE-small-en-v1.5 |
| File location | 40-data/policy_facts/*.json β inside the code repo, versioned in git |
rag/vectors/ β git-ignored on the laptop, lives in the HF dataset rohitsar567/insurance-bot-data, pulled at Docker build via huggingface_hub.snapshot_download |
| Size | ~150 files Γ few KB each β tiny, fits in git | ~7.3 k chunks Γ 384 dims + raw PDFs (hundreds of MB) β too big for git |
| Built by | Offline ingest (rag/extract.py + schema.py) β LLM-assisted extraction, human-reviewed, committed to git |
Offline ingest (rag/ingest.py) β chunked + embedded once, published to HF dataset |
| Read by (at request time) | get_policy_facts tool (LLM brain) Β· scorecard.py Β· premium_calculator.py Β· marketplace-card renderer |
retrieve_policies tool (LLM brain) β only |
| Used for | Decision-critical exact numbers cited on cards, in the scorecard, and in pricing β with the verbatim PDF quote shown | Free-form Q&A grounded in actual policy wording β "what does this plan cover during pregnancy?" |
2.7.2 Other data files (smaller, supporting)
40-data/reviews/β sourced insurer reviews (claims stories, regulator notes).40-data/premiums/β illustrative public rate-card combinations consumed by the multivariate premium estimator (Β§3.3).40-data/insurer_network.jsonβ hospital-network counts per insurer. Pre-ADR-043 there was also a40-data/profiles/<name>.jsondirectory of saved user profiles for cross-session recall. That mechanism was removed (see Β§2.6 β sessions are now in-memory only).
All three remaining stores sit in the code repo (under 40-data/) because they're small, human-reviewed, and decision-critical β safe to version alongside the code.
2.7.3 Where each piece physically lives
flowchart LR
LAPTOP["π» Local dev laptop<br/>source of truth before any push"]
subgraph CODE["π Code repository (mirrored to two remotes)"]
direction TB
GH["GitHub β public mirror<br/>rohitsar567/insurance-sales-bot"]
HFS["HF Space β code + running app<br/>rohitsar567/InsuranceBot"]
end
HFD["π€ HF Dataset<br/>rohitsar567/insurance-bot-data<br/>PDF corpus + prebuilt Chroma vectors"]
APP["π Live app (HF Space container)"]
LAPTOP -->|"git push github"| GH
LAPTOP -->|"git push origin"| HFS
LAPTOP -.->|"offline ingest publishes here"| HFD
HFS -->|"Docker build"| APP
HFD -.->|"snapshot_download at build"| APP
linkStyle 0,1,3 stroke:#1565c0,stroke-width:2px
linkStyle 2,4 stroke:#e65100,stroke-width:2px,stroke-dasharray:6 3
Summary. Three physical homes: the laptop (source of truth before any push), the code repo (mirrored to two git remotes β GitHub for reviewers and the HF Space's own repo for deployment), and the HF dataset (the heavy binaries).
How it flows:
- Local laptop. Single source of truth before any push. All editing happens here.
- Code repo β two remotes.
git push githubupdates the GitHub public mirror (for reviewers).git push originupdates the HF Space's own repo and triggers the Docker rebuild. - HF dataset (offline channel). Heavy binaries β PDF corpus + prebuilt Chroma vectors β are published here separately from the code so the deployable image stays small.
- Live container. On every Docker build,
huggingface_hub.snapshot_downloadhydratesrag/corpus/andrag/vectors/from the HF dataset; the FastAPI app then has both data kinds available.
2.7.4 Offline ingest pipeline (built once, not on the request path)
flowchart LR
PDF["π Raw policy PDFs<br/>rag/corpus/"] --> ING["rag/ingest.py<br/>chunk pages"]
ING --> EMB["embed<br/>BGE-small Β· local Β· 384-d"]
EMB --> VEC["Chroma vector store<br/>rag/vectors/"]
ING --> XT["rag/extract.py + schema.py<br/>structured fact extraction (LLM-assisted)"]
XT --> JSON["40-data/policy_facts/*.json<br/>+ verbatim source_quote"]
XT --> DUCK["policies.duckdb<br/>structured rollup"]
VEC -.->|"published to"| HFD["HF dataset"]
PDF -.->|"published to"| HFD
JSON -->|"versioned with code"| CODE["Code repo"]
linkStyle 0,1,2,3,4,5,8 stroke:#1565c0,stroke-width:2px
linkStyle 6,7 stroke:#e65100,stroke-width:2px,stroke-dasharray:6 3
Summary. A single offline pipeline turns each raw PDF into two artefacts: chunked embeddings for free-form retrieval, and a structured JSON of decision-critical fields with verbatim quotes. Vectors β HF dataset; JSON β code repo.
How it flows:
- Chunking + embedding.
rag/ingest.pysplits each PDF into overlapping ~500-token chunks; BGE-small encodes each chunk into a 384-d vector; Chroma persists them torag/vectors/. - Extraction.
rag/extract.py+schema.pyrun an LLM-assisted pass to pull structured fields (waiting periods, room-rent caps, CSR%, etc.) into a schema-validated JSON β with the verbatimsource_quotethat justifies each value. - Two destinations. Vectors and raw PDFs are published to the HF dataset (too big for git); JSON files are versioned with the code so a reviewer can see exactly what facts feed the marketplace cards.
- Why split. Chunks power free-form Q&A; the JSON powers the marketplace cards, scoring, and pricing β two different queries, two different data shapes.
Provenance rule. Every policy fact shown to a user traces to a real clause in a real PDF. Where a document genuinely doesn't state something, it is recorded as a sourced-null ("not stated in <file>.pdf") β never invented or back-filled.
2.8 Uploaded-PDF flow β 8 security gates β Gemini extraction β catalogued-grade card
flowchart TB
UP["/api/upload-policy (public web)"] --> G1["1 File mechanics<br/>%PDF Β· size band Β· %%EOF Β· no exe/JS"]
G1 --> G2["2 Content quality<br/>β₯1500 chars Β· β₯3 pp Β· domain keyword"]
G2 --> G3["3 Prompt-injection sweep"]
G3 --> G4["4 Per-session rate limit"]
G4 --> G5["5 Per-IP rate limit"]
G5 --> G6["6 Encrypted/locked β reject"]
G6 --> G7["7 Page-count ceiling (>200)"]
G7 --> G8["8 Hash dedupe + reject-cache"]
G8 -->|"pass"| QC["per-session QUARANTINE Chroma + global policies collection<br/>BGE-small embeddings Β· 24h idle TTL"]
QC --> HEUR["Heuristic baseline (synchronous, sub-second)<br/>regex/keyword over PDF text<br/>writes UPLOADED_DOCS_DIR/<pid>/record.json @ ~30-50% completeness"]
HEUR -.->|"detect_insurer_slug<br/>match against 21 known insurers"| INS["insurer_slug = manipalcigna / hdfc-ergo / ...<br/>(or 'user-upload' on no match β fail-closed)"]
HEUR --> ACK["HTTP 200 returns immediately<br/>frontend pushes ack 'Got it β reading X, ~30-60s'<br/>EVERY chat input GATED via extractionInFlight"]
ACK --> CACHE{"sha256(pdf_bytes)<br/>seen before?"}
CACHE -->|"hit"| COPY["copy prior rag/extracted/<other_pid>.json β this pid<br/>llm_used='hash-cache' Β· ~1s"]
CACHE -->|"miss"| LLM["Background extract_one_for_upload<br/>Gemini 2.5-flash Β· 3 retries (2/4/8s Β± 25% jitter)<br/>llm_used='gemini-2.5-flash#N'<br/>llm_response_chars logged for ops"]
LLM -->|"all fail"| NIM["NIM fallback chain<br/>llm_used='nim-fallback'"]
LLM -->|"success"| WRITE["write rag/extracted/<pid>.json"]
NIM -->|"success"| WRITE
NIM -->|"fail"| FLOOR["heuristic record.json wins<br/>status='failed' but card still renders at ~47% / grade C"]
WRITE --> MERGE["merge LLM scalars INTO record.json<br/>LLM value wins where non-empty<br/>heuristic stays where LLM silent"]
COPY --> MERGE
MERGE --> BUST["invalidate _MG_CACHE (marketplace grade cache)"]
BUST --> RESOLVE["_catalogue_scorecard(pid, None)<br/>SAME resolver /api/policies/{id}/scorecard uses"]
RESOLVE --> STATUS["_set_extraction_status(complete, comp, grade, llm_used, llm_response_chars)<br/>BY CONSTRUCTION equal to card endpoint"]
FLOOR --> STATUS
STATUS --> POLL["frontend GET /api/upload/extraction-status/{pid}<br/>every 3s, max 120s"]
POLL --> CARD["pushAssistant(card_ready, citations=[{pid}])<br/>then pushAssistant(choice_prompt)<br/>setActiveUploadPid(pid) Β· setExtractionInFlight(false)"]
CARD --> ENABLE["Send + textarea + PDF + voice all re-enabled<br/>view_context.active_policy_id={pid} on next chat turn<br/>β single_brain enters ACTIVE POLICY DIVE-IN mode (KI-330)"]
G1 & G2 & G3 & G4 & G5 & G6 & G7 & G8 -->|"fail"| REJ["clean rejection (reason surfaced)"]
Summary. The full pipeline an uploaded PDF traverses to become a catalogued-grade card with the same data depth as the 148 pre-curated policies β from HTTP request through Gemini extraction through inline chat card.
How it flows:
- The 8 security gates, in order. (1) File mechanics β
%PDFmagic, 5 KBβ25 MB size band, well-formed%%EOF, no embedded executables / JavaScript / launch actions. (2) Content quality β β₯1500 extractable chars, β₯3 pages, at least one insurance-domain keyword. (3) Prompt-injection sweep β "ignore previous instructions", "reveal your system prompt", jailbreak patterns. (4) Per-session rate limit. (5) Per-IP rate limit (catches session-ID rotation). (6) Encrypted/locked PDF β rejected cleanly. (7) Page-count ceiling (>200 pages β an abuse/bundle vector). (8) Hash dedupe + reject-cache β identical re-uploads short-circuit. - Beyond identical-file dedup. A UIN net-new check also runs β
if the PDF's IRDAI UIN already belongs to a catalogued policy, the
caller is pointed at the existing marketplace card instead of indexing
a duplicate. PDF-text fuzzy matching also runs β if the upload
content identifies as a known catalogued product (matching insurer +
product-name patterns), the upload endpoint resolves to the existing
<insurer-slug>__<product>id and reuses the curated card, skipping fresh extraction entirely (best UX for known products). - On pass. Chunks land in a per-session quarantine Chroma
collection β session-isolated, 24 h idle TTL β AND in the shared
policiescollection so the upload can become a marketplace card. - The locked chat sequence (ADR-044). ack β gated wait β card β
choice. The frontend's
extractionInFlightflag disables Send, textarea, PDF button, and EVERY voice path (PTT / Sarvam / voice auto-submit) for the entire wait window. The choice prompt NEVER fires before the card-bearing message lands. Live-verified by Playwright: ack at idx 155 β card/fail at idx 315 β choice at idx 500, strictly ordered. Inputs re-enable in the same render as the choice prompt. - The two LLM fast paths. Hash-cache β
sha256(pdf_bytes)matches a prior successful extraction β copy thatrag/extracted/<pid>.jsonto this pid, surfacellm_used="hash-cache", ~1 s. Gemini path β 3 retries with jittered exponential backoff (2/4/8 s Β± 25 %), surfacesllm_used="gemini-2.5-flash#N"where N is the successful attempt, plusllm_response_charsso the operator can see WHICH LLM landed the extraction without HF Space stdout access. - The heuristic floor is a hard guarantee, now significantly fatter
(KI-332, 2026-05-27).
build_record()runs synchronously inside the upload HTTP call (sub-second) and writesrecord.jsonBEFORE the LLM ever fires. The pattern set was expanded from ~16 fields to ~28+ on the 2026-05-27 hardening pass: sum-insured ladder detection (βΉ3L / βΉ5L / βΉ10Lβ list), policy_type, min entry age, child entry days, lifelong-renewability flag, grace period, free-look period, geographic coverage, ICU capping, deductible amount, NCB cap %, organ donor / critical illness / preventive checkup / domiciliary / newborn presence booleans, premium payment modes. Local synthetic test hits 32 fields. Expected upload completeness on LLM-fail rises from ~47.8 % to ~65β70 %. If Gemini fails all 3 retries AND the NIM fallback fails, the card still renders at this richer floor β never fabricated, never a generic "Retry" placeholder. Verified on a hard Test Policy.pdf (8 MB) where Gemini 3/3 retries returned malformed JSON: card still landed with the expanded heuristic data. - Multi-pass per-section extraction for big PDFs (KI-332, 2026-05-27).
For uploads with β₯ 25 K chars of extracted text (e.g. dense
100+ page policy wordings, 8 MB PDFs), the single-pass Gemini call
reliably truncates JSON mid-emission β the HealthPolicy schema has
~40 fields and a complete output with verbatim quotes can exceed
Gemini 2.5-flash's reliable output budget. Solution: split the
schema into 7 logical sections (identity, eligibility, financial,
waiting periods, coverage, limits, network+claims) and run each as
its own smaller Gemini call IN PARALLEL via
asyncio.gather. Each call carries ~15 % of the schema β fits comfortably in budget. Failure-isolated: 6/7 sections landing produces a partial extraction strictly better than the heuristic floor. Same wall-clock cost as single-pass (parallel). On total multi-pass failure, falls through to the legacy single-pass + NIM chain (heuristic floor still wins). Activation:len(text) β₯ 25_000triggers multi-pass; smaller PDFs keep using single-pass (faster, cheaper, works fine). - Status endpoint == scorecard endpoint by construction. When the
background extraction finalises, it calls the SAME
_catalogue_scorecard(pid, None)resolver that/api/policies/{id}/ scorecarduses. Thecompleteness_pctandoverall_gradeonGET /api/upload/extraction-status/{pid}are therefore byte- identical to what the inline card renders. Same applies to the hash-cache short-circuit (which had to be fixed separately β earlier draft of the cache branch calledbuild_scorecard(...)withoutinsurer_reviewsAND read.overall_gradeinstead of.grade, silently reporting status=17.4 % / grade None while the card showed 47.8 % / C). 2026-05-27 multi-PDF audit (commit58e3c82) confirmed parity across manipalcigna, hdfc-ergo, care-health, icici-lombard, star-health, Test Policy.pdf. - Post-card dive-in mode (KI-330). After the card lands the
frontend sets
activeUploadPidand plumbs it into every chat turn'sview_context.active_policy_id.single_brain.handle_turnreads that and prepends an ACTIVE POLICY DIVE-IN block to the system instruction, forcing the brain to answer policy-specific questions viaretrieve_policies+get_policy_factson that pid instead of pivoting to "let me pull your recommendations". Verified 9/10 on the post-fix audit (up from 0/10 pre-fix). - On fail. A clean rejection naming the gate; the file is deleted; nothing is embedded.
Operator endpoints (admin-only):
POST /api/admin/upload/reextract?force=<bool>β re-runsextract_one_for_uploadfor every persisted upload that lacks arag/extracted/<pid>.json, or for ALL persisted uploads withforce=true. Wired to a startup hook so every container boot upgrades legacy uploads automatically.GET /api/upload/extraction-status/{policy_id}β the live state of the in-memory_UPLOAD_EXTRACTION_STATUSdict. Fields:status(pending | running | complete | failed | unknown),llm_used(gemini-2.5-flash#N | nim-fallback | hash-cache),llm_response_chars,completeness_pct,overall_grade,started_at,completed_at,error.
2.9 Deployment
flowchart LR
DEV["git push"] --> ORI["HF Space remote (origin)"]
DEV --> GH["GitHub mirror (github)"]
ORI --> BUILD["HF Space β Docker build"]
BUILD --> SNAP["huggingface_hub.snapshot_download<br/>hydrate rag/ (corpus + vectors) from HF dataset"]
BUILD --> FE["build Next.js static export"]
SNAP --> RUN["entrypoint.sh β uvicorn backend.main:app :7860"]
FE --> RUN
RUN --> LIVE["live Space (FastAPI also serves the frontend)"]
LIVE --> CHK["verify reported build SHA advanced<br/>(LFS/quota push can fail silently)"]
Summary. How a git push becomes a live Space, end
to end.
How it flows:
- Two remotes.
origin= the Hugging Face Space (a push here triggers the Docker rebuild).github= the public mirror reviewers read. - Build. The HF Space rebuilds the image, installs the backend,
builds the Next.js static frontend, and runs
huggingface_hub.snapshot_downloadto hydraterag/(PDF corpus + prebuilt vectors) from theinsurance-bot-datadataset β so the Space repo itself stays code-only and small. - Start.
entrypoint.shlaunchesuvicorn backend.main:appon$PORT(default 7860); FastAPI also serves the exported frontend. - Verify. Always confirm the Space's reported build SHA actually advanced before trusting that new code is live β an LFS/quota push can fail without surfacing an error.
3. Key functions in plain language
Summary. Seven internal jobs make the bot work. Each one gets a sequence diagram showing what calls what, a β€50-word summary, and a step-by-step explanation. An eighth subsection makes explicit what is stored vs what is live-only.
3.1 Profile construction
sequenceDiagram
autonumber
participant U as User
participant API as 3a. /api/chat
participant B as 3b. LLM Brain
participant T as save_profile_field tool
participant S as 3d. session_state
U->>API: typed / spoken reply
API->>B: handle_turn(text, profile)
B->>T: save_profile_field("age", 32)
T->>S: session.profile.age = 32
S-->>T: ok
T-->>B: saved
B-->>API: next question (or proceed to scoring + pricing)
Summary. Every user reply runs the brain, which extracts one fact at a time and saves it via the save_profile_field tool into the live session_state.profile. No regex, no separate extractor model β extraction is a tool call.
How it flows:
- User reply arrives at
/api/chat(typed text or post-STT transcript). - Brain runs.
handle_turncalls Gemini with the reply + the current profile + the tool schema. - Tool call. When Gemini decides a fact is captured, it calls
save_profile_field(field, value)β one call per fact. - Write. The tool sets the field on
session.profile(in memory). All fact-find captures are concentrated here; there is no second LLM pass. - Continue. Brain returns the next question, or β if the profile is complete enough β proceeds to scoring + pricing.
- Persistence is later. End-of-turn, the whole profile is auto-persisted to disk by the orchestrator (see Β§3.6).
3.2 Profile-aware scoring
sequenceDiagram
autonumber
participant API as 3a. /api/scorecard
participant SC as 3c. scorecard.py
participant J as 4. policy_facts JSON
participant R as 4. reviews JSON
participant P as 3d. session.profile
API->>SC: build_scorecard(policy_id, profile)
SC->>J: read curated facts (CSR, room-rent, waitings)
SC->>R: read insurer reviews
SC->>P: read live profile (age, family, conditions)
SC->>SC: 6 sub-scores β letter grade + personalised summary
SC-->>API: grade Β· sub-scores Β· summary
Summary. Scoring grades every policy for this user β same policy, two users, two grades. Six sub-scores roll up into a letter, with a personalised one-line summary naming the strengths for this profile and the one honest caveat.
How it flows:
- Trigger. Brain (or
/api/scorecarddirectly) asks for a per-policy grade. - Three reads.
scorecard.pyreads (a) the policy'spolicy_factsJSON, (b) the insurer's reviews JSON, and (c) the live profile fromsession_state. - Six sub-scores. Coverage Β· predictability Β· claims Β· network Β· renewal Β· terms β each weighted against this profile (e.g. a smoker is penalised more in pricing predictability; an elder is penalised more in waiting-period structure).
- Letter grade + summary. A grade band (AβE) and a one-line summary listing this user's top strengths and the one capping caveat.
- Live, not stored. Recomputed per request. Storage would be wrong here β the grade depends on who is asking.
3.3 Profile-aware pricing β the multivariate ballpark
sequenceDiagram
autonumber
participant API as 3a. /api/coverage
participant PR as 3c. premium_calculator.py
participant J as 4. policy_facts JSON
participant RC as 4. rate-card combinations<br/>age Γ metro Γ smoker Γ PED Γ<br/>co-pay Γ deductible Γ tenure
participant P as 3d. session.profile
API->>PR: estimate(policy_id, profile)
PR->>J: read pricing structure
PR->>P: read profile (7 dimensions)
PR->>RC: match combination β rate range
PR->>PR: interpolate + apply caveats
PR-->>API: βΉlow β βΉhigh Β· point β βΉX (illustrative)
Summary. Pricing is a multivariate ballpark from public rate-card combinations β age Γ metro Γ smoker Γ PED Γ chosen co-pay Γ deductible Γ sum insured. Same plan, two users, two ranges. Explicitly not real underwriting.
How it flows:
- Inputs. The user's profile (seven dimensions) and the policy's pricing characteristics.
- Multivariate match.
premium_calculator.estimatelooks up combinations across the seven dimensions and interpolates within bands. - Output. A range (e.g. βΉ12 500 β βΉ17 200 / yr) with a midpoint and a clear illustrative-only disclaimer.
- Honest caveat. Final premium depends on the insurer's underwriting + medicals + IRDAI-filed loadings β this is a directional ballpark, not a quote.
- Live, not stored. Same as scoring β recomputed per request because the answer is a function of this user.
3.4 Retrieval β retrieve_policies over the vector store
sequenceDiagram
autonumber
participant B as 3b. LLM Brain
participant T as retrieve_policies tool
participant E as BGE-small embedder
participant C as 4. Chroma<br/>shared 'policies' +<br/>per-session 'quarantine'
B->>T: retrieve_policies(query, filters, top_k)
T->>E: encode(query) β 384-d vector
T->>C: nearest-neighbour + metadata filter
C-->>T: top-k chunks (text Β· pdf Β· page Β· policy_id)
T-->>B: chunks β the brain may quote only from these
Summary. When the brain needs to cite policy wording, it asks the retrieval tool. The query is embedded with BGE-small, looked up in Chroma, and the top-k chunks come back with their source PDF, page, and policy_id. The brain may state nothing the tool did not return.
How it flows:
- Query. Brain composes a query from the user's question + the profile.
- Embed. Local BGE-small turns the query into a 384-d vector (no API hop, no rate limit).
- Search. Chroma returns nearest chunks, scoped to either the shared
policiescollection (catalogue) or the per-sessionquarantinecollection (user-uploaded PDFs β never crosses sessions, 24 h TTL). - Faithfulness. The brain may only state what these chunks (or
get_policy_facts) returned. Anything else is a violation of the structural grounding guard (Β§2.4).
3.5 Curated facts β get_policy_facts (no embedding hop)
sequenceDiagram
autonumber
participant B as 3b. LLM Brain
participant T as get_policy_facts tool
participant F as 4. policy_facts JSON file
B->>T: get_policy_facts([policy_id_1, policy_id_2])
T->>F: read JSON file(s) directly
F-->>T: CSR Β· complaints Β· grade Β· source_quote
T-->>B: structured fields with verbatim PDF quote
Summary. For decision-critical numbers (claim-settlement ratio, complaints volume, waiting periods, room-rent rule, grade), the brain calls get_policy_facts which reads the curated JSON directly β no embedding hop, no LLM, exact values with the verbatim PDF quote.
How it flows:
- Why this exists alongside Β§3.4. Retrieval is for free-form text Q&A; this is for exact, fast, source-cited number lookups. Different query shape, different mechanism.
- No LLM in the path. A plain file read of
40-data/policy_facts/<id>.json. The brain quotes the value (and thesource_quote) verbatim. - Used by more than the brain.
scorecard.pyandpremium_calculator.pyalso read these JSON files directly β see Β§2.3 (the brain's edge is not the only edge into the JSON).
3.6 In-session state recovery (server restart resilience)
sequenceDiagram
autonumber
participant U as User
participant Br as Browser (carries chat_history)
participant B as 3b. LLM Brain
participant S as 3d. SessionState (in-memory)
U->>Br: "what about premium?"
Br->>B: chat turn + chat_history[N msgs]
B->>S: get_session(session_id)
Note over S: session was evicted<br/>(1h idle / restart)<br/>profile = BLANK
B->>B: STATE-RECOVERY MODE<br/>chat_history has prior facts
B->>S: save_profile_field for each fact in history
B-->>Br: reply that picks up where it left off<br/>(never re-asks the name)
Summary. Sessions are in-memory only (_TTL_SECONDS = 1h), so a
container restart or long idle wipes the server-side profile. When the
browser still carries the conversation, the brain silently rebuilds the
profile from the chat history instead of starting over. No disk read,
no cross-session memory β purely a same-conversation resilience path.
How it flows:
- Detect.
get_session()returns a blankSessionState, butchat_historyarrives with β₯2 messages including a prior user turn β state was lost, not "fresh user". - Re-capture from history. A high-priority STATE-RECOVERY MODE prompt block tells the LLM: do not say you lost anything, do not re-ask the name, instead call
save_profile_fieldfor every fact present in the conversation so far, then continue. - Resume. From the LLM's point of view the next reply is just the next turn in an ongoing chat β the user never perceives the eviction.
(There is no cross-session recall β see Β§2.6 and ADR-043 for why that was removed.)
3.7 Uploaded-PDF LLM extraction (the Β§2.8 pipeline in code terms)
sequenceDiagram
autonumber
participant FE as Frontend (page.tsx)
participant API as /api/upload-policy
participant SEC as security.py (8 gates)
participant UD as uploaded_docs.py
participant H as heuristic build_record()
participant CK as hash cache lookup
participant G as Gemini 2.5-flash (3 retries)
participant N as NIM fallback chain
participant SC as scorecard.py + main._catalogue_scorecard
participant ST as _UPLOAD_EXTRACTION_STATUS dict
participant FE2 as Frontend poller
FE->>API: POST multipart (PDF + session_id)
API->>SEC: run 8 gates
SEC-->>API: pass or reject
API->>UD: persist_upload β writes source.pdf + meta.json to UPLOADED_DOCS_DIR/pid/
UD->>H: build_record β regex/keyword over text
H-->>UD: record.json at ~30-50% completeness
UD-->>FE: HTTP 200 with policy_id
FE->>FE: setExtractionInFlight(true) Β· gate ALL inputs Β· pushAssistant(ack)
UD->>UD: asyncio.create_task(extract_one_for_upload)
Note over UD: _set_extraction_status(status='running')
UD->>CK: _find_cached_extraction(sha256(pdf_bytes))
alt cache hit
CK-->>UD: copy prior rag/extracted/other_pid.json
UD->>UD: llm_used='hash-cache'
else cache miss + text β₯25K chars
Note over UD,G: MULTI-PASS (KI-332): 7 sections in parallel via asyncio.gather
UD->>G: identity / eligibility / financial / waiting / coverage / limits / network
G-->>UD: 7 partial JSONs (any landing counts as success)
UD->>UD: merge sections Β· llm_used='gemini-2.5-flash-multipass'
else cache miss + text under 25K chars
UD->>G: single-pass chat with full schema
alt success
G-->>UD: HealthPolicy JSON Β· llm_used='gemini-2.5-flash#1'
else timeout or malformed JSON
UD->>G: retry #2 (2s Β± 25%) then retry #3 (4s Β± 25%)
G-->>UD: HealthPolicy or fail
UD->>N: NIM fallback (single attempt)
N-->>UD: HealthPolicy or all-fail
end
UD->>UD: write rag/extracted/pid.json
UD->>UD: merge LLM scalars INTO record.json
end
UD->>SC: _catalogue_scorecard(pid, None) β same as /api/policies/.../scorecard
SC-->>UD: Scorecard (grade, data_completeness_pct, sub_scores)
UD->>ST: _set_extraction_status(complete, comp, grade, llm_used, llm_response_chars)
loop every 3s up to 120s
FE2->>UD: GET /api/upload/extraction-status/pid
UD-->>FE2: status snapshot
end
FE2->>FE2: pushAssistant(card_ready, citations include pid)
FE2->>FE2: setActiveUploadPid(pid)
FE2->>FE2: pushAssistant(choice_prompt) Β· setExtractionInFlight(false)
Summary. When a user uploads a PDF the backend writes a heuristic-baseline record first (sub-second), HTTP returns, and a background asyncio task either copies a prior extraction (hash cache hit) or runs Gemini 2.5-flash with 3 jittered retries β NIM fallback β heuristic floor. The status endpoint reports the SAME completeness_pct + overall_grade the card endpoint serves, by construction.
How it flows:
- HTTP returns before extraction starts.
extract_one_for_uploadis fired withasyncio.create_taskso the user sees the card-ready ack inside one second, not after 30β60 s. - Provenance, always. Every
_set_extraction_statuscall carriesllm_used(gemini-2.5-flash#1 | #2 | #3 | nim-fallback | hash-cache) andllm_response_chars. The operator can see WHICH LLM landed the extraction without HF Space stdout access β verified live on 2026-05-27. - Hash cache short-circuit.
_find_cached_extraction(sha256(pdf_bytes))looks for a prior successful extraction with the same content. On hit, the priorrag/extracted/<other_pid>.jsonis copied to this pid in ~1 s. - Retries are jittered exponential. 2 s / 4 s / 8 s backoffs each multiplied by
random.uniform(0.75, 1.25)so repeated transient blips on a single Gemini instance don't synchronise. - Merge model. LLM output is merged INTO the heuristic record (LLM wins per-field where non-empty, heuristic stays where LLM silent) β the same "extracted + curated overlay" model the catalogued 148 use via
40-data/policy_facts/. - Status == card by construction. The status endpoint calls
_catalogue_scorecard(pid, None)β the SAME resolver/api/policies/{id}/scorecarduses. If they differ, that's a bug; today they match across 5 verified uploads.
3.8 What is stored vs what is live-only
| What | Where | Why |
|---|---|---|
| Policy PDFs + vector chunks | rag/corpus/ + Chroma store (HF dataset β pulled at build) |
Built once, offline; read every request |
| Curated policy facts (per policy) | 40-data/policy_facts/*.json (code repo) |
Small, human-reviewed, versioned with code |
| User profile (current session) | In-memory only β SessionState.profile (1 h idle TTL, no disk) |
Closing the tab / clearing chat forgets the profile by design (ADR-043) |
| Per-policy grade / scorecard | Not stored β live per request | Two users get two grades for the same policy (profile-aware) |
| Premium range for a policy | Not stored β live per request | Same reason as the grade |
| Uploaded PDFs | Per-session Chroma quarantine, 24 h TTL | Isolated to the uploader, never the shared corpus |
| Per-turn reasoning | logs/turns.jsonl (one JSON line per turn) |
Replay / audit; never echoed to other users |
4. Safety & quality
4.1 Uploaded-PDF defence (8 gates)
/api/upload-policy accepts arbitrary PDFs from the public web β a real
attack surface. backend/security.py runs every upload through eight gates
before the file is ever embedded or shown to the model:
- File mechanics β
%PDFmagic, 5 KBβ25 MB size band, well-formed%%EOF, and a scan for embedded executables / JavaScript / launch actions. - Content quality β β₯1500 chars of extractable text, β₯3 pages, and at least one insurance-domain keyword (rejects scans, junk, off-topic docs).
- Prompt-injection β regex sweep for "ignore previous instructions", "reveal your system prompt", jailbreak patterns, etc.
- Per-session rate limit β caps uploads / chunk quota per session.
- Per-IP rate limit β catches session-ID rotation.
- Encrypted/locked PDF β rejected cleanly rather than stored opaque.
- Page-count ceiling β >200 pages is an abuse/bundle vector.
- Hash dedupe + reject-cache β identical re-uploads short-circuit.
Beyond identical-file dedup, a UIN net-new check runs on every upload: if the PDF's IRDAI UIN already belongs to a catalogued policy, the upload is recognised as not net-new and the caller is pointed at the existing marketplace card instead of a duplicate being indexed.
Accepted uploads are embedded into a separate, per-session quarantine
Chroma collection (never the shared corpus), scoped by session_id so one
user's document is invisible to another, and auto-purged after a 24-hour idle
TTL.
4.2 Answer faithfulness
Faithfulness is structural, not bolt-on: the brain answers only from what
its tools returned β retrieve_policies (policy-wording chunks) and
get_policy_facts (claim-settlement ratio, complaints, scorecard and
insurer-review data) β must cite, and is instructed to refuse when that
grounding is weak. A prose-grounding guard verifies any policy / UIN named
in the reply against both tools' returned policies before it is sent.
Recommendation fit is gated (backend/scorecard.py,
retrieval_filters.py) so plans that structurally don't fit the user's stated
constraints are dropped, with the reason surfaced.
4.3 Evaluation
A gold Q&A harness lives at eval/run.py. Status: it is pending a re-port
to the single-brain architecture (it targeted the removed orchestrator) and is
intentionally hard-guarded from running so it cannot publish stale scores; see
its module docstring. The automated test suite (tests/, run with pytest)
is the current green gate and covers routing, scoring, premium, recall, the
upload security gates, and conversation logic.
4.4 Known limitations (honest)
These are real and stated up front rather than buried:
- Uploaded-doc persistence is within-session, not across restarts.
Upload β graded marketplace card β grounded Q&A about the PDF all work
live within a running container. But the Hugging Face Space's working
filesystem is ephemeral by design (a fresh Chroma snapshot is pulled on
every rebuild β see Β§2.7), and in practice an uploaded doc does not
survive a Space rebuild/restart: the marketplace reverts to its
curated/extracted baseline. Treat uploads as session-scoped. An
operator/abuse prune endpoint exists (
POST /api/admin/uploaded-docs/ prune, password-gated) to remove a persisted upload by id or prefix. - Uploaded-PDF field extraction is LLM-assisted, with a deterministic
heuristic floor (ADR-044, 2026-05-27 hardening bundle). Every upload
runs through two passes with a hash-cache fast path:
- Heuristic baseline β regex + keyword extraction over the PDF
text, synchronous inside the upload HTTP call (sub-second), populates
common fields like waiting periods and room-rent rule. Yields
~30β50 %
data_completeness. - Gemini extraction (3 jittered retries) β fires as a background
asyncio task after the upload returns. Same Gemini 2.5-flash + same
EXTRACT_SYSTEMprompt + sameHealthPolicyPydantic schema the catalogued 148 use offline. Backoffs 2 / 4 / 8 s Β± 25 % jitter. On success, writesrag/extracted/<policy_id>.jsonand merges INTO the persistedrecord.json(LLM values override where present, heuristic stays where the LLM was silent). ~10β60 s total. - NIM fallback β single attempt if all three Gemini retries fail.
- Heuristic floor β if NIM also fails, the card still renders at the heuristic baseline (~47.8 %, grade C). Verified live on Test Policy.pdf (8 MB) where Gemini 3/3 retries returned malformed JSON.
- Hash-cache short-circuit β if
sha256(pdf_bytes)matches a prior successful extraction, that file is copied (~1 s) withllm_used="hash-cache"surfaced for ops. The frontend pollsGET /api/upload/extraction-status/<policy_id>during the wait and renders the inline scorecard card ONLY after the LLM pass completes / fails / hits its 120 s timeout. The card is catalogued-grade β samePolicyScorecardWidget, same six sub-scores, same insurer-reputation data (detect_insurer_slugmatches the PDF's legal name against the 21 known insurer slugs we have reviews data for and flipsinsurer_slugoff the genericuser-uploadon a hit). Status β card parity by construction: the status endpoint and the card endpoint both call_catalogue_scorecard(pid, None), socompleteness_pct+overall_gradeare byte-identical. Operator provenance: every status response carriesllm_used(gemini-2.5- flash#N | nim-fallback | hash-cache) andllm_response_charsso the question "did Gemini actually run?" is answerable without HF Space stdout access. Post-card dive-in mode (KI-330): the just-uploaded pid becomesview_context.active_policy_idon the next chat turn, sosingle_brainanswers policy-specific questions viaretrieve_policies+get_policy_factsinstead of pivoting to recommendations.
- Heuristic baseline β regex + keyword extraction over the PDF
text, synchronous inside the upload HTTP call (sub-second), populates
common fields like waiting periods and room-rent rule. Yields
~30β50 %
- Live (BETA) voice mode uses the browser's in-built speech recognition and is labelled unstable; push-to-talk is the reliable path (warm-armed mic + pre-roll so the first word is never clipped, and long answers are chunked so nothing is truncated).
- Recommendation vs. factual lookup. A factual question that names a specific policy is answerable on a cold session; broad "recommend me a plan" requests still require the short fact-find first (by design).
- Admin LLM Chain β manual Refresh now and 30 s auto-poll
(hardened 2026-05-27).
POST /api/admin/proberuns a real serial probe of every candidate model and updatestested_aton eachModelHealthrow; the admin UI'sRefresh nowbutton and the LLM Chain tab's 30 s poll now both re-fetch/api/admin/healthand callrenderUpdatedLabel()so the top-left "Last refresh / Next in" timer reflects the actual just-completed probe (previously it stayed frozen at the login-time snapshot, so the operator could not tell from the timer whether probing had actually happened).
5. Tech stack & key decisions
| Layer | Choice | One-line why |
|---|---|---|
| Frontend | Next.js 16 (App Router), React 19, Tailwind v4, static export | Production-pattern UI; static export serves straight from the Space |
| Backend | FastAPI + Pydantic | Async I/O, typed request/response, auto OpenAPI |
| Brain | Google Gemini (gemini-2.5-flash-lite) + function calling |
Frontier free-tier quality; one model + tools beats a multi-pass pipeline |
| Fallback | NVIDIA NIM open-model chain, health-elected | Free, diverse; fail-loud, never silently wrong |
| Retrieval | Chroma + BGE-small-en-v1.5 (local, 384-d) | Embedded, no infra, free, offline embeddings |
| Voice | Sarvam Saarika (STT) + Bulbul (TTS) + Sarvam-M (Indic) | First-class Indian-accent / Hinglish handling |
| Hosting | Hugging Face Space (Docker) + companion HF dataset | Free, GitHub-mirrored; code/data split keeps the image small |
Decisions are deliberately biased toward one deployable artifact, no fabrication, fail loud. The single-brain consolidation, the NIM-only fallback (structured-output reliability over cross-provider breadth), the local embeddings (zero rate limits, offline ingest) and the code/data repo split are the load-bearing ones.
6. Repository map
At a glance β the root is intentionally small; you only need to know these:
backend/β FastAPI app + the brain, tools, retrieval, scoring, securityfrontend/β the Next.js web apprag/β retrieval + offline ingest (corpus/vectors are git-ignored, pulled at build)40-data/β curated, human-reviewed policy facts (versioned with code)tests/β the pytest green gate- root files:
Dockerfile,entrypoint.sh,requirements.txt,pytest.ini,README.md
Full directory tree β click to expand
.
βββ backend/ FastAPI app
β βββ main.py HTTP routes (chat, transcribe, upload, profile, β¦)
β βββ single_brain.py THE brain β Gemini + function-calling tools
β βββ brain_tools.py the tools the brain can call (retrieval, profile, β¦)
β βββ nim_fallback.py NIM fallback when Gemini fails / cold-start 503
β βββ llm_health.py background probe + sticky-primary election
β βββ security.py the 8 upload-defence gates
β βββ scorecard.py / recommendation fit + scoring
β β retrieval_filters.py
β βββ premium_calculator.py profile β illustrative premium
β β sum_insured.py
β βββ session_state.py per-session profile (in-memory only, ADR-043)
β βββ uploaded_docs.py user-uploaded PDF pipeline (ADR-044):
β β - persist_upload() β heuristic baseline
β β (sub-second regex/keyword) + sha256 of
β β PDF bytes β record.json + meta.json
β β - build_record() β heuristic floor;
β β guaranteed BEFORE any LLM fires
β β - extract_fields_from_text() β 28+ regex
β β patterns (KI-332 expansion 2026-05-27):
β β sum_insured_options_inr ladder, policy_type,
β β min/max entry age, child entry days,
β β lifelong renewability flag, grace period,
β β free-look period, geographic_coverage,
β β ICU capping, deductible, NCB cap,
β β organ/CI/preventive/domiciliary/newborn
β β presence, premium payment modes β lifts
β β floor from ~47.8% to ~65-70% even when
β β ALL LLM passes fail
β β - detect_insurer_slug() β match PDF text
β β against 21 known insurer patterns; flips
β β insurer_slug off 'user-upload' on hit
β β - _multipass_extract_with_gemini() β
β β (KI-332) 7-section parallel extraction:
β β identity/eligibility/financial/waiting/
β β coverage/limits/network_claims, each as
β β its own Gemini 2.5-flash call via
β β asyncio.gather. Fires for PDFs β₯25K chars
β β where single-pass would truncate.
β β - extract_one_for_upload() β background
β β asyncio task. Resolution order:
β β 1. hash-cache (sha256 hit)
β β 2. multi-pass per-section (β₯25K chars)
β β 3. single-pass Gemini (3 jittered retries)
β β 4. NIM fallback (single attempt)
β β 5. heuristic floor (always wins
β β because record.json already exists)
β β Writes rag/extracted/<pid>.json, merges
β β LLM scalars INTO record.json (LLM wins
β β where non-empty, heuristic stays where
β β silent)
β β - _find_cached_extraction() β sha256
β β lookup across UPLOADED_DOCS_DIR/*/meta.json
β β for prior successful extractions
β β - _set_extraction_status() β finalises
β β status using main._catalogue_scorecard(pid)
β β so completeness_pct + overall_grade match
β β the card endpoint BY CONSTRUCTION (success
β β path AND fail path, post-KI-333)
β β - Provenance fields: llm_used
β β (gemini-2.5-flash#N | gemini-2.5-flash-multipass
β β | nim-fallback | hash-cache) +
β β llm_response_chars
β β - backfill_extractions() β startup hook
β β re-runs LLM extraction on every
β β UPLOADED_DOCS_DIR/<pid>/ missing
β β rag/extracted/<pid>.json
β β - _UPLOAD_EXTRACTION_STATUS dict + endpoint
β β GET /api/upload/extraction-status/{pid}
β β - Admin endpoint
β β POST /api/admin/upload/reextract?force=...
β βββ voice_format.py TTS pre-processing (money/Indic normalisation)
β βββ admin.py /api/admin/* (health, telemetry)
β βββ providers/ thin clients: google_gemini, nvidia_nim, sarvam_*,
β local_embeddings (BGE), openrouter/groq (dormant)
βββ frontend/ Next.js 16 app (src/app/page.tsx, src/lib/*)
βββ rag/ retrieval + offline ingest pipeline
β βββ retrieve.py query β top-k chunks (request hot path)
β βββ ingest.py/extract.py/schema.py offline corpus build
β βββ corpus/ vectors/ data β git-ignored, from the HF dataset
β βββ policies.duckdb offline structured rollup
βββ 40-data/ curated, version-with-code structured facts
β βββ policy_facts/*.json per-policy facts + verbatim source_quote
β βββ reviews/ premiums/ insurer_network.json
βββ eval/ gold Q&A harness (pending single-brain re-port)
βββ 70-docs/ design docs & ADRs β οΈ see note below
βββ 80-audit/ defect register / audit transcripts
βββ tools/ operational scripts (corpus, probes, link-rot)
βββ tests/ pytest suite β the green gate (`pytest`)
βββ Dockerfile / entrypoint.sh HF Space image (pulls the data dataset)
βββ pytest.ini scopes pytest to tests/ (clean on a fresh clone)
βββ requirements.txt
β οΈ Note on
70-docs/and ADRs: these capture design history and rationale; some predate the single-brain rewrite and are being brought into line with the system as it actually runs today. This README is the authoritative present-state map; the ADRs are decision context.
7. Run it locally
Prerequisites: Python 3.11+, Node 20+, the API keys below.
# 1. Code
git clone <code-repo-url> "Insurance Sales Bot"
cd "Insurance Sales Bot"
python -m venv .venv && . .venv/bin/activate
pip install -r requirements.txt
# 2. Data (corpus + prebuilt vectors live in the companion dataset)
python -c "from huggingface_hub import snapshot_download; \
snapshot_download(repo_id='rohitsar567/insurance-bot-data', \
repo_type='dataset', local_dir='rag/_hf_dataset_backup')"
# then place rag/corpus and rag/vectors from the snapshot into rag/
# (the Docker build does this automatically; see entrypoint.sh)
# 3. Secrets β copy the example and fill in:
cp .env.example .env
# GOOGLE_API_KEY β Gemini brain (primary) [required]
# NVIDIA_NIM_API_KEY β NIM fallback chain [required]
# SARVAM_API_KEY β STT / TTS / Indic [required for voice]
# HF_TOKEN β pull the data dataset at boot [required]
# ADMIN_PASSWORD β gates /api/admin/* [required]
# VOYAGE_API_KEY β offline ingest embeddings only [ingest only]
# OPENROUTER/GROQ_API_KEY β dormant (kept for one-flip re-enable)
# 4. Backend
uvicorn backend.main:app --host 127.0.0.1 --port 8000 --reload
# 5. Frontend (separate terminal)
cd frontend && npm install && npm run dev # http://localhost:3000
# Tests (the green gate)
pytest # collects tests/ only
8. Deployment
Hosting is a Hugging Face Space running the Dockerfile:
- The image installs the backend and builds the static frontend.
- At build time it runs
huggingface_hub.snapshot_downloadto hydraterag/(corpus + vectors) from therohitsar567/insurance-bot-datadataset, so the Space repo itself stays code-only and small. entrypoint.shstartsuvicorn backend.main:appon$PORT(default7860, the port HF Spaces routes to); FastAPI also serves the exported frontend.
The code repo is mirrored to both the HF Space remote (origin) and a
GitHub remote (github); the heavy data is updated on the HF dataset
side. Space repository secrets supply the API keys listed in Β§7. After any
deploy, verify the Space's reported build SHA actually advanced before
trusting that new code is live (a quota/LFS push can fail without surfacing
an error).