---
title: Insurance Sales Portfolio Expert
emoji: 🏥
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Voice-first AI advisor for Indian health insurance
---
# Insurance Sales Portfolio Expert
A health-insurance advisory web app for the Indian market (presented in-app as
**"Insurance Advisor"**). You describe your situation in plain language (typed
or spoken, English or Hindi/Hinglish); it asks a few clarifying questions, then
recommends and explains real policies, grounded in the actual policy
documents, with every claim traceable to a source clause. It also lets you
upload your own policy PDF and ask questions about it.
Live: **https://rohitsar567-insurancebot.hf.space**
> **Reading this cold?** §1 is plain English. §2 walks you down four levels of abstraction: the user journey (§2.1), the building blocks (§2.2), the functional abstraction inside each block (§2.3), then deep-dives per building block (§2.4–§2.9). §3 gives a function-by-function sequence-diagram view of the six most important jobs. §4–§8 are safety, stack, repo map, run-it-locally, and deployment.
---
## Table of contents
1. [What this is](#1-what-this-is)
2. [How it works, end to end](#2-how-it-works-end-to-end)
3. [Key functions in plain language](#3-key-functions-in-plain-language)
4. [Safety & quality](#4-safety--quality)
5. [Tech stack & key decisions](#5-tech-stack--key-decisions)
6. [Repository map](#6-repository-map)
7. [Run it locally](#7-run-it-locally)
8. [Deployment](#8-deployment)
---
## 1. What this is
**The short answer.** A health-insurance advisor that behaves like a
knowledgeable, unbiased human advisor, *not* a lead-generation funnel.
You describe your situation; it asks a few clarifying questions; it
recommends real plans that fit, with every factual claim backed by the
exact clause in the real policy document. No lead capture. No commission
bias. If the honest answer is *"this isn't in the document,"* it says so
instead of guessing.
It works by chat or voice, in English or Hindi/Hinglish, on desktop and
mobile.
### The problem this solves
Buying health insurance in India is hard for an ordinary person. A
first-time buyer faces three concrete problems:
1. **Too much to compare.** ~148 plans across 21 insurers, each with
dozens of decision-relevant fields (waiting periods, room-rent caps,
co-pay, maternity, sub-limits, network size). No human reads them all.
2. **The truth is buried.** The number that decides whether a plan is
right for *you* is on page 47 of a PDF written by lawyers.
3. **Most "advice" is conflicted.** Aggregator sites optimise for the
sale, not the fit.
The cost of getting this wrong is real money and denied claims years
later. The goal is a tool a non-expert can trust the way they would trust
a good independent advisor: personalised to *their* profile, sourced, and
never fabricating.
### What it does, concretely
- **Conversational fact-find**: a short, natural back-and-forth establishes
your profile (age, dependants, budget, pre-existing conditions,
priorities) instead of a long form.
- **Personalised recommendations**: plans ranked for *fit to your
profile*. A fixed-benefit plan is not pushed to someone who needs
comprehensive cover; a plan whose entry age excludes you is filtered
out.
- **Grounded answers**: every factual claim about a policy is retrieved
from that policy's actual document and shown with its source. Weak or
missing evidence produces an honest "not stated in the document."
- **Marketplace & compare**: browse the full indexed catalogue, open a
detailed scorecard per plan, compare up to four side by side.
- **Profile → premium (illustrative)**: a live ballpark premium range
that updates as you change your profile. *Not* real underwriting, but a
multivariate range from public rate-card combinations (see §3.3).
- **Bring your own document**: upload any policy PDF; it is safely
indexed for the rest of your session so you can ask questions about
*your* document.
- **Voice**: speak instead of typing (tap-to-talk on mobile,
push-to-talk on desktop); replies are spoken back. Indian-accent and
Hinglish aware.
---
## 2. How it works, end to end
**The short answer.** A Next.js browser app talks to a FastAPI backend.
Every chat turn goes to a **single LLM "brain"** (Google **Gemini**) with
a small set of **function-calling tools**, most importantly a retrieval
tool over a **Chroma** vector store built from the real policy documents.
The brain decides when to retrieve, what to retrieve, and how to answer;
it *cannot* state a policy fact it did not retrieve. If Gemini is
unavailable, the turn transparently falls back to an **NVIDIA NIM**
open-model chain. Voice in/out is handled by **Sarvam** (Indian-language
STT/TTS). Heavy data (PDF corpus + prebuilt vectors) lives in a separate
Hugging Face **dataset**, not the code repo.
The rest of this section walks you down four levels of abstraction:
§2.1 the user's journey (plain English, no tech); §2.2 the building
blocks at the highest level (the four canonical buckets); §2.3 the
functional abstraction (what happens inside each bucket); and §2.4–§2.9
the deep dives per building block. Every diagram is followed by a
≤50-word summary and a hierarchical *how it flows* breakdown.
### 2.1 The user's journey (plain English, no tech)
Before the engineering detail, here is what actually happens for the
person using it. No code, no jargon: just the path from opening the app
to deciding with confidence.
```mermaid
flowchart TD
S["You open the app: web or mobile, nothing to install"] --> TELL["🗣️ Tell it about you in a short chat, typed OR spoken, English / Hindi-Hinglish: age · family · budget · health · what you care about"]
TELL --> ASK["It asks just 2–3 clarifying questions (a real conversation, never a long form)"]
ASK --> REC["🎯 A personalised shortlist: plans ranked for YOUR fit, each with the reason it fits"]
REC --> WHY["Open any plan: every fact is backed by the exact clause in the real policy PDF, with an honest 'not stated in the document' instead of a guess"]
WHY --> EXPLORE{"Want to dig deeper?"}
EXPLORE -->|"Compare"| CMP["Compare up to 4 plans side by side · full scorecard per plan"]
EXPLORE -->|"Browse"| MKT["Browse the full indexed marketplace"]
EXPLORE -->|"Ask"| QA["💬 Ask follow-up questions, answered only from the actual documents"]
EXPLORE -->|"My own policy"| UP["Upload your own policy PDF"]
UP --> UPIDX["⏳ Quick ack: 'Reading it through, ~30–60 s' (everything in chat is gated while the analysis runs)"]
UPIDX --> UPCARD["Inline scorecard card with FULL data: grade letter · 6 sub-scores · verbatim signals · insurer reputation"]
UPCARD --> UPCHOICE{"How would you like to proceed?"}
UPCHOICE -->|"Finish profile"| TELL
UPCHOICE -->|"Dive into the PDF"| QA
CMP --> PREM
MKT --> PREM
QA --> PREM
PREM["💸 A live premium estimate that updates as you change your profile"] --> DONE["Decide with confidence: no lead capture, no commission bias"]
VOICE["Optional the whole way: speak instead of type, and it speaks the answers back"] -.-> TELL
VOICE -.-> QA
```
**Summary.** How a user opens the app and ends the session having decided
on a plan with confidence, and how the system loops through compare /
browse / Q&A / upload along the way. No backend in this view; just the
human path. Every session starts fresh: there is no cross-session
memory, and closing the tab forgets you (privacy-by-design, see ADR-043).
**How it flows:**
- **Conversational fact-find.** A short typed-or-spoken back-and-forth
(English or Hindi-Hinglish) captures age, family, budget, health and
what you care about, instead of a long form.
- **Personalised shortlist + a "why".** Plans are ranked for *your* fit;
every fact about a plan is backed by the exact clause in the real
policy PDF, never invented.
- **Branches from the shortlist.** Compare side by side, browse the full
marketplace, ask follow-up questions, or upload your *own* policy PDF
and ask about your document (kept private to your session).
- **Upload-PDF flow is a staged sequence** (ADR-044, 2026-05-27):
upload → bot says *"reading it through, ~30–60 s"* → all chat input is
gated during the wait (Send button, textarea, voice paths all blocked
so nothing can interrupt the staging) → bot pushes the inline
scorecard card with FULL extracted data once the LLM pass lands → bot
then asks whether you'd like to finish your profile or dive into the
PDF. The card is the same shape as any catalogued policy card: six
sub-scores, verbatim signals, real claim-settlement data when the
insurer is recognised.
- **Live premium.** Updates as you change the profile.
- **Decision.** No lead capture and no commission bias: the path ends at
*decide*, not at a sales handoff.
### 2.2 System at a glance: the big building blocks
**The short answer.** The system has four "tall buckets":
**Frontend** (what you see), **Backend** (what runs on the server),
**Data layer** (the policy knowledge), and **Voice** (in and out). They
talk to each other over standard HTTP / JSON.
**Two terms first, in one sentence each:**
- **Frontend** = everything you see on screen: the chat box, marketplace
cards, sliders, profile builder. Built with **Next.js + React** (a
standard, well-supported web-UI library). Runs in your browser.
- **Backend** = everything that *runs on the server*: the LLM brain, the
retrieval, the scoring/pricing logic, the upload-security gates. Built
with **FastAPI** (a standard Python HTTP framework). Think of the
frontend as the menu + waiter; the backend is the kitchen.

Both Next.js and FastAPI are deliberately boring, standard choices: they
let us not spend engineering on the UI layer or the HTTP plumbing, so we
spend that effort on the brain and the data, where the product
differentiation actually lives.
**Now the big picture, the buckets and how they talk:**
```mermaid
flowchart LR
subgraph FE["Frontend (browser · Next.js)"]
UI["Chat · Marketplace · Compare · Profile builder · Voice capture & playback"]
end
subgraph BE["Backend (FastAPI server)"]
API["HTTP endpoints + orchestration: backend/main.py"]
BRAIN["🧠 LLM Brain: Google Gemini + function-calling tools (NIM fallback chain on failure)"]
SCORE["🎯 Scoring + Pricing: scorecard.py · premium_calculator.py"]
PROF["👤 Profile (in-memory only): session_state.SessionState · 1h idle TTL"]
end
subgraph DATA["Data layer"]
VEC["Vector DB (Chroma): policy chunks + per-session quarantine (uploads)"]
FACTS["Curated facts JSON: 40-data/policy_facts/*.json"]
end
subgraph VOICE["Voice"]
STT["Sarvam STT (in)"]
TTS["Sarvam TTS (out)"]
end
UI <-->|"text · JSON"| API
UI -->|"audio"| STT --> API
API --> TTS --> UI
API <--> BRAIN
BRAIN <-->|"retrieve_policies"| VEC
BRAIN <-->|"get_policy_facts"| FACTS
BRAIN <-->|"save_profile_field"| PROF
BRAIN --> SCORE
SCORE <--> FACTS
SCORE <--> PROF
```
**Summary.** Four building blocks talk over HTTP / JSON: Frontend (the chat UI you see), Voice (Sarvam STT in + TTS out), Backend (FastAPI with four sub-blocks: orchestration, LLM Brain, Scoring + Pricing, in-memory Profile), and the Data layer (Chroma vectors + curated JSON facts).
**How it flows:**
- **1. Frontend (browser · Next.js).** Renders chat, marketplace, compare, and the profile builder. Sends typed text and audio over HTTP, plays the synthesised reply.
- **2. Voice.** `Sarvam STT (in)` turns spoken audio into a text turn; `Sarvam TTS (out)` turns the reply text back into spoken audio.
- **3. Backend (FastAPI).** Four sub-blocks: **3a** HTTP endpoints + orchestration (`backend/main.py`); **3b** LLM Brain (Gemini + function-calling tools; NIM fallback on failure); **3c** Scoring + Pricing (`scorecard.py` + `premium_calculator.py`); **3d** Profile (in-memory only: `session_state.SessionState`, no disk).
- **4. Data layer.** Two stores: the Chroma **vector DB** (shared policy chunks + per-session quarantine for uploads) and curated **JSON facts** at `40-data/policy_facts/*.json`. The brain, scoring, and pricing all read from these.
**Diagram legend (used throughout ยง2):**
- **Solid arrow (`→`)** = a real call / data flow on the request path.
- **Double arrow (`↔`)** = bidirectional: one side calls, the other returns.
- **Dotted arrow (`-.->`)** = a side-channel or async event (voice
playback, barge-in interrupt, end-of-turn persistence, etc.) that is not
on the main request path.
- **Subgraph box** = everything inside runs in one place (one process /
one service / one storage layer).
- Edge labels (e.g. *"retrieve_policies"*) name the actual function or
signal carried on that edge.
### 2.3 Functional abstraction: what happens inside each building block
```mermaid
flowchart TB
subgraph FE["1. Frontend"]
direction TB
F1["capture_input: typed text · spoken audio"]
F2["render_reply: chat · cards · scorecard · audio"]
end
subgraph V["2. Voice"]
direction TB
V1["transcribe: Sarvam Saarika STT"]
V2["synthesize: voice_format → Sarvam Bulbul TTS"]
end
subgraph BE["3. Backend"]
direction TB
subgraph BE_API["3a. HTTP + orchestration"]
A1["route_request"]
A2["orchestrate_turn"]
end
subgraph BE_BRAIN["3b. LLM Brain"]
B1["handle_turn: one Gemini call + tool loop"]
B2["fact_find: save_profile_field"]
B3["retrieve: retrieve_policies"]
B4["lookup_facts: get_policy_facts"]
B5["recommend: mark_recommendation"]
end
subgraph BE_SCORE["3c. Scoring + Pricing"]
SC1["grade_per_profile: scorecard.py"]
SC2["estimate_premium: premium_calculator.py"]
end
subgraph BE_PROF["3d. Profile (in-memory)"]
P1["update_session_profile: session_state.SessionState"]
P2["evict_on_idle: 1h TTL · no disk"]
end
end
subgraph DATA["4. Data layer"]
direction TB
D1["vector_search: Chroma · BGE-small"]
D2["fact_lookup: 40-data/policy_facts/*.json"]
end
%% forward edges (input / down the pipeline)
F1 -->|"audio"| V1
F1 -->|"text ยท JSON"| A1
V1 --> A1
A1 --> A2
A2 --> B1
B1 --> B2
B1 --> B3
B1 --> B4
B1 --> B5
B2 --> P1
B3 --> D1
B4 --> D2
A2 --> SC1
A2 --> SC2
SC1 -->|"reads"| D2
SC2 -->|"reads"| D2
SC1 -->|"reads"| P1
SC2 -->|"reads"| P1
P1 -.->|"idle 1h"| P2
%% return edges (output / back to caller)
D1 -.->|"top-k chunks"| B3
D2 -.->|"per-policy facts"| B4
SC1 -.->|"grade"| A2
SC2 -.->|"premium range"| A2
B1 -.->|"reply + citations"| A2
A2 -.->|"text"| F2
A2 -.->|"speak?"| V2
V2 -.->|"audio"| F2
%% blue = forward (edges 0-18) · orange dashed = return (edges 19-26)
linkStyle 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18 stroke:#1565c0,stroke-width:2px
linkStyle 19,20,21,22,23,24,25,26 stroke:#e65100,stroke-width:2px,stroke-dasharray:6 3
```
**Legend.** Blue solid = forward flow (input / call down the pipeline). Orange dashed = return flow (result / reply back up).
**Summary.** Inside each building block from §2.2, a small set of named functions fires per turn: Frontend captures and renders, Voice transcribes and synthesises, the four Backend sub-blocks orchestrate / decide / score / remember, and the Data layer answers their reads.
**How it flows:**
- **1. Frontend.** `capture_input` accepts typed text or recorded audio; `render_reply` paints chat + marketplace cards + scorecard + audio playback.
- **2. Voice.** `transcribe` is the inbound path (Sarvam Saarika STT); `synthesize` is the outbound path (`voice_format` normalises money / Indic shorthand → Sarvam Bulbul TTS).
- **3a. HTTP + orchestration.** `route_request` maps the URL to a handler; `orchestrate_turn` is the per-turn supervisor: it owns the request lifecycle and ties brain + scoring + voice + session state together.
- **3b. LLM Brain.** One `handle_turn` per turn calls Gemini, which chooses which of `fact_find` / `retrieve` / `lookup_facts` / `recommend` to run as tools. The brain may only state what its tools returned.
- **3c. Scoring + Pricing.** `grade_per_profile` and `estimate_premium` read curated facts **and** the live profile, compute on every request (never stored), and hand back to `orchestrate_turn`.
- **3d. Profile (in-memory).** `update_session_profile` reflects each `fact_find` write into the live `SessionState.profile`. State lives in process memory only; an idle session is evicted after 1 h. There is no disk persistence and no cross-session recall (see ADR-043, 2026-05-27).
- **4. Data layer.** Two reads: `vector_search` for free-form Q&A, and `fact_lookup` for decision-critical numbers with verbatim quotes. The data layer does no writes during a request; those happen offline only (vector ingest, curated-facts edits).
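
The in-memory profile with a 1 h idle TTL (3d above) can be pictured roughly as follows. This is a minimal sketch, not the code in `session_state.py`: only the names `SessionState` and `profile` come from this README, while `SessionStore`, `update_profile`, `evict_idle`, and the `last_seen` field are hypothetical names chosen for illustration.

```python
import time
from dataclasses import dataclass, field

IDLE_TTL_SECONDS = 60 * 60  # 1 h idle TTL, as described in 3d


@dataclass
class SessionState:
    """Per-session state held in process memory only (never written to disk)."""
    profile: dict = field(default_factory=dict)
    last_seen: float = 0.0  # hypothetical field: time of the last write


class SessionStore:
    """Hypothetical registry of live sessions; an assumption, not the real class."""

    def __init__(self, ttl=IDLE_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock  # injectable so eviction can be tested without waiting
        self._sessions = {}

    def update_profile(self, session_id, key, value):
        """Record one fact-find answer and refresh the session's idle timer."""
        state = self._sessions.setdefault(session_id, SessionState())
        state.profile[key] = value
        state.last_seen = self._clock()
        return state

    def get(self, session_id):
        return self._sessions.get(session_id)

    def evict_idle(self):
        """Drop every session idle for longer than the TTL; return their ids."""
        now = self._clock()
        stale = [sid for sid, s in self._sessions.items()
                 if now - s.last_seen > self._ttl]
        for sid in stale:
            del self._sessions[sid]
        return stale
```

Because nothing is written to disk, a process restart or an eviction is equivalent to the user starting a fresh session.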
### 2.4 LLM brain + fail-loud fallback chain
```mermaid
flowchart LR
Q["chat turn"] --> G{"Gemini gemini-2.5-flash-lite"}
G -->|"OK"| ANS["grounded reply (only from tool results)"]
G -->|"real failure / cold-start 503"| H["backend/llm_health.py background probe + sticky-primary election"]
H --> NIM["NVIDIA NIM open-model chain backend/nim_fallback.py"]
NIM -->|"healthy model"| ANS
NIM -->|"whole chain down"| LOUD["explicit 'service degraded' (never a silently wrong answer)"]
ANS --> GUARD["prose-grounding guard: every policy/UIN named is verified against retrieve_policies + get_policy_facts"]
GUARD --> OUT["sent to user"]
```
**Summary.** How a chat turn is served by the primary
LLM, what happens when it fails, and the structural guard that prevents a
silently wrong answer.
**How it flows:**
- **Primary path.** Gemini (`gemini-2.5-flash-lite`). On a healthy
response, the reply is built *only* from what the tools returned.
- **Fallback path (fail-loud).** A real Gemini failure or a cold-start
503 routes through `backend/llm_health.py` (a background probe with
sticky-primary election) to the NVIDIA NIM open-model chain
(`nim_fallback.py`). One healthy model in that chain serves the turn.
- **Last resort.** If the whole chain is down, the user gets an explicit
*"service degraded"* message, never a silently wrong answer.
- **Prose-grounding guard.** Before a reply is sent, every policy / UIN
named in the prose is verified against the same `retrieve_policies`
and `get_policy_facts` results the brain saw (with an exemption for
genuine catalogue UINs). Faithfulness is structural, not bolt-on.
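
One way to picture the prose-grounding guard is as a set check over the names a reply mentions. The sketch below is an assumption about the shape of such a check, not the project's implementation: `find_ungrounded` and its parameters are hypothetical, and the plan names and UIN in the usage lines are made-up placeholders.

```python
def find_ungrounded(reply, known_names, grounded, catalogue_uins=frozenset()):
    """Return known policy names / UINs that the reply mentions without support.

    reply          -- the prose about to be sent to the user
    known_names    -- every policy name / UIN the guard knows how to spot
    grounded       -- names / UINs present in this turn's tool results
    catalogue_uins -- UINs exempt from the check (genuine catalogue entries)
    """
    text = reply.casefold()
    mentioned = {name for name in known_names if name.casefold() in text}
    allowed = ({name.casefold() for name in grounded}
               | {uin.casefold() for uin in catalogue_uins})
    return sorted(name for name in mentioned if name.casefold() not in allowed)


# Usage with placeholder data: "Plan Beta" was never retrieved this turn.
problems = find_ungrounded(
    reply="Plan Alpha covers maternity; Plan Beta has a 2-year wait (UIN-0001).",
    known_names={"Plan Alpha", "Plan Beta", "UIN-0001"},
    grounded={"Plan Alpha"},
    catalogue_uins={"UIN-0001"},
)
```

An empty result means the reply can go out; a non-empty one means the reply names something the tools never returned.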
**Why a single brain (not a multi-model pipeline).** Earlier designs split
the work across several LLM passes (a separate fact-find brain, a QA
brain, a faithfulness-judge). That scaffolding was removed: a single
frontier model with well-designed tools is more accurate, far simpler,
and eliminates a whole class of cross-model contract bugs. Today there is
exactly **one** brain call per turn plus its tool calls. Faithfulness is
enforced *structurally* (the brain can only state what `retrieve_policies`
and `get_policy_facts` returned) rather than by a second grader model.
**More on the fallback chain.** The brain's primary is Gemini
(`gemini-2.5-flash-lite`). On a real Gemini failure or a cold-start 503,
the turn falls back to an NVIDIA NIM chain of open models. Candidate
selection uses a background health probe with sticky-primary election
(`backend/llm_health.py`) so one healthy model is chosen per call. The
fallback is **fail-loud**: if the whole chain is down the user gets an
explicit *"service degraded"* message, never a silently wrong answer.
(A separate LLM "judge" existed historically and has been retired because
the single-brain design made it redundant.)
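
The fail-loud idea can be sketched in a few lines. This is an illustration only: it walks the candidates in order and deliberately leaves out the background health probe and sticky-primary election in `backend/llm_health.py`. `ServiceDegraded` and `answer_with_fallback` are hypothetical names, and each model is represented by a plain callable.

```python
class ServiceDegraded(RuntimeError):
    """Raised when the primary and every fallback model have failed."""


def answer_with_fallback(prompt, primary, fallbacks):
    """Try the primary, then each fallback in order; never return a guess.

    primary and each item of fallbacks are (name, callable) pairs, where the
    callable takes the prompt and returns reply text or raises on failure.
    """
    errors = []
    for name, call in [primary, *fallbacks]:
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure moves on to the next candidate
            errors.append(f"{name}: {exc}")
    # Fail loud: the caller turns this into the explicit "service degraded" reply.
    raise ServiceDegraded("service degraded; " + "; ".join(errors))
```

The point of raising instead of returning a default string is that the caller cannot mistake a total outage for a real answer.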
**Sticky-session retry policy (hardened 2026-05-27).** Once a session
has completed at least one successful single-brain turn, it stays on
`single_brain` for the rest of its lifetime, because cross-fading to
`nim_fallback` mid-stream would discard `last_recommendation_ids`,
`last_retrieved_chunks`, and `slug_to_insurer`. To absorb Gemini's
intermittent "high demand" 503 bursts on sticky sessions,
`_gemini_call` now uses an adaptive retry schedule: a **non-sticky**
session keeps 1 retry with a 1.5 s backoff (fast-fail to NIM on
cold-start); a **sticky** session gets 2 retries with jittered
exponential backoffs (1.5 s → 3 s, ±25 % jitter). If the chain still
fails after retries, the user sees a plain, honest reply: *"My model
service had a brief blip on that turn - please send the same message
again."* (no more misleading *"could you say that again?"*).
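
The retry schedule described above can be sketched as two small helpers. This is an illustrative reading of the policy, not the body of `_gemini_call`: `retry_delays` and `call_with_retries` are hypothetical names, and leaving the non-sticky backoff unjittered is an assumption, since the text only mentions jitter for sticky sessions.

```python
import random
import time


def retry_delays(sticky, rng=random.random):
    """Backoff delays in seconds to wait before each retry.

    Non-sticky: one retry after 1.5 s (assumed unjittered), then fail fast.
    Sticky: two retries at 1.5 s and 3 s, each scaled by a factor in [0.75, 1.25].
    """
    if not sticky:
        return [1.5]
    jitter = 0.25
    return [base * (1 + jitter * (2 * rng() - 1)) for base in (1.5, 3.0)]


def call_with_retries(call, delays, sleep=time.sleep):
    """Run call(); on failure wait the next delay and retry, then re-raise."""
    for delay in [*delays, None]:
        try:
            return call()
        except Exception:
            if delay is None:  # retries exhausted: surface the failure
                raise
            sleep(delay)
```

A caller might write `call_with_retries(send_turn, retry_delays(sticky=True))`, where `send_turn` stands in for whatever function performs the model request.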
### 2.5 Voice pipeline (in / out, with barge-in)
```mermaid
flowchart LR
MIC["mic: tap-to-talk (touch) / push-to-talk (desktop)"] --> MR["MediaRecorder (authoritative audio)"]
MIC -.->|"live interim text"| WS["Web Speech API (display only)"]
MR --> STT["/api/transcribe → Sarvam Saarika STT"]
STT --> BR["single_brain.handle_turn"]
BR --> RPL["reply text + citations"]
RPL --> VF["voice_format.py: money/Indic normalise · chunk at sentence bounds"]
VF --> BUL["Sarvam Bulbul TTS"]
BUL --> PLAY["in-DOM audio element"]
SPK["user speaks over bot"] -.->|"barge-in"| PLAY
SPK -.->|"abort in-flight"| BR
```
**Summary.** How spoken input becomes a chat turn, how
the reply becomes speech back, and how the user can interrupt mid-answer.
**How it flows:**
- **Capture.** Tap-to-talk (touch) or push-to-talk (desktop) starts
`MediaRecorder` (the authoritative audio) and Web Speech (a live
interim transcript shown on screen but never trusted for the turn).
- **STT.** The authoritative audio is sent to `/api/transcribe`
(Sarvam Saarika: Indian-accent + Hinglish aware).
- **Brain → reply.** The transcript runs through `single_brain.handle_turn`
exactly like a typed turn.
- **TTS.** `voice_format.py` normalises money / Indic shorthand and chunks
at sentence bounds (so long replies are spoken in full); Sarvam Bulbul
speaks; an in-DOM `