meet4150's picture
|
download
raw
4.94 kB
# Insurance RAG System (Production-Ready)
A complete Retrieval-Augmented Generation (RAG) pipeline for insurance brokers:
- Ingests 3 policy documents (health, auto, home)
- Chunks and embeds locally with Hugging Face BGE
- Stores vectors in FAISS
- Retrieves + optional cross-encoder reranking
- Generates grounded answers via local Hugging Face LLM
- Adds source attribution (document + section)
- Supports continuous multi-turn chat sessions
- Ships with a frontend to test quality and performance
## 1) Setup
```bash
cd insurance-rag
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Create `.env` from template:
```bash
cp .env.example .env
```
Set at least:
```env
HF_EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
HF_LLM_MODEL=Qwen/Qwen2.5-1.5B-Instruct
HF_LLM_DEVICE=auto
HF_LLM_DTYPE=auto
HF_LLM_LOAD_IN_4BIT=false
CHUNK_SIZE=512
CHUNK_OVERLAP=64
TOP_K=4
```
GPU profile recommendations:
- `4GB VRAM`: `HF_LLM_MODEL=Qwen/Qwen2.5-1.5B-Instruct`, `HF_LLM_LOAD_IN_4BIT=false`
- `16GB VRAM` (better quality): `HF_LLM_MODEL=Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B`, `HF_LLM_LOAD_IN_4BIT=true`
## 2) Build Index (One-time)
```bash
python -c "from rag.vector_store import build_index; build_index()"
```
Or through API:
```bash
curl -X POST http://localhost:8000/index/build
```
## 3) Run API + Frontend
```bash
python -m uvicorn main:app --reload --port 8000
```
Open browser:
- `http://localhost:8000/` (frontend test console)
- API docs: `http://localhost:8000/docs`
## 4) Quick Tests
Health:
```bash
curl http://localhost:8000/health
```
RAG query:
```bash
curl -X POST http://localhost:8000/rag/query \
-H "Content-Type: application/json" \
-d '{"question": "Is flood damage covered in home insurance?", "session_id": "test-1", "use_reranker": true}'
```
Continuous chat:
```bash
curl -X POST http://localhost:8000/chat/continuous \
-H "Content-Type: application/json" \
-d '{"message": "What is the flood deductible?", "session_id": "broker-42", "use_rag_context": true, "use_reranker": true}'
```
Agent chat (tool-calling loop):
```bash
curl -X POST http://localhost:8000/agent/chat \
-H "Content-Type: application/json" \
-d '{"message":"What is my premium for coverage 120000 and risk score 1.2?","session_id":"agent-1","use_reranker":true,"max_turns":6}'
```
## 5) Architecture
1. Broker asks question
2. `BAAI/bge-small-en-v1.5` creates query embedding (local)
3. FAISS retrieves top-k policy chunks
4. Cross-encoder reranker reorders chunks (bonus layer)
5. Prompt injects context + question
6. Local Hugging Face LLM produces grounded answer with citations
## 6) Chunking Strategy (Justification)
- Split preference: paragraph (`\n\n`) -> line (`\n`) -> sentence (`.`) -> word (` `)
- `CHUNK_SIZE=512`: insurance clauses are usually 300-500 words, so one clause fits in one chunk
- `CHUNK_OVERLAP=64`: avoids orphaned tail phrases like "without rider" when they straddle boundaries
Example:
- `Section 8: Exclusions ... Flood (without rider)`
- Overlap preserves the clause boundary so retrieval still captures meaning and context
## 7) Continuous Chat
- Session memory keyed by `session_id`
- Stores recent turns up to `MAX_HISTORY`
- Optional `use_rag_context=true` injects retrieved policy context every turn
- Streaming variant: `/chat/continuous/stream`
## 8) Tool-Calling Agent (Part 3)
- Endpoint: `/agent/chat`
- Stateful per-session history (in-memory)
- Hand-rolled loop with max 6 turns
- Tool set:
- `search_policy(query)` -> RAG retrieval
- `calculate_premium(coverage, risk_score)` -> `coverage * 0.002 * risk_score`
- `check_claim_status(claim_id)` -> hardcoded mock claim map
- Malformed tool-call JSON is handled gracefully and retried inside the loop
- Session management:
- `GET /agent/sessions`
- `DELETE /agent/session/{session_id}`
## 9) Performance Testing
Use frontend + API:
- `/rag/benchmark` for repeated run stats (`avg`, `p50`, `p95`, `min`, `max`)
- `/rag/evaluate/json` to validate model answers against a JSON QA set (example: `../rag.json`)
- `/metrics` for rolling endpoint latency
- Frontend includes one-click benchmark and metrics refresh
## 10) Project Layout
```text
insurance-rag/
├── .env
├── .env.example
├── requirements.txt
├── README.md
├── main.py
├── rag/
│ ├── __init__.py
│ ├── documents/
│ │ ├── health_policy.txt
│ │ ├── auto_policy.txt
│ │ └── home_policy.txt
│ ├── embedder.py
│ ├── chunker.py
│ ├── vector_store.py
│ ├── reranker.py
│ └── pipeline.py
├── model_server/
│ ├── __init__.py
│ └── llm_client.py
├── agent/
│ ├── __init__.py
│ ├── engine.py
│ └── tools.py
└── frontend/
├── index.html
├── styles.css
└── app.js
```

Xet Storage Details

Size:
4.94 kB
·
Xet hash:
f66408a139332175ee3206a7168686e61754e334c9ae49e805d332f19ffc3c07

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.