Buckets:
Insurance RAG System (Production-Ready)
A complete Retrieval-Augmented Generation (RAG) pipeline for insurance brokers:
- Ingests 3 policy documents (health, auto, home)
- Chunks and embeds locally with Hugging Face BGE
- Stores vectors in FAISS
- Retrieves + optional cross-encoder reranking
- Generates grounded answers via local Hugging Face LLM
- Adds source attribution (document + section)
- Supports continuous multi-turn chat sessions
- Ships with a frontend to test quality and performance
1) Setup
cd insurance-rag
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Create .env from template:
cp .env.example .env
Set at least:
HF_EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
HF_LLM_MODEL=Qwen/Qwen2.5-1.5B-Instruct
HF_LLM_DEVICE=auto
HF_LLM_DTYPE=auto
HF_LLM_LOAD_IN_4BIT=false
CHUNK_SIZE=512
CHUNK_OVERLAP=64
TOP_K=4
GPU profile recommendations:
4GB VRAM:HF_LLM_MODEL=Qwen/Qwen2.5-1.5B-Instruct,HF_LLM_LOAD_IN_4BIT=false16GB VRAM(better quality):HF_LLM_MODEL=Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B,HF_LLM_LOAD_IN_4BIT=true
2) Build Index (One-time)
python -c "from rag.vector_store import build_index; build_index()"
Or through API:
curl -X POST http://localhost:8000/index/build
3) Run API + Frontend
python -m uvicorn main:app --reload --port 8000
Open browser:
http://localhost:8000/(frontend test console)- API docs:
http://localhost:8000/docs
4) Quick Tests
Health:
curl http://localhost:8000/health
RAG query:
curl -X POST http://localhost:8000/rag/query \
-H "Content-Type: application/json" \
-d '{"question": "Is flood damage covered in home insurance?", "session_id": "test-1", "use_reranker": true}'
Continuous chat:
curl -X POST http://localhost:8000/chat/continuous \
-H "Content-Type: application/json" \
-d '{"message": "What is the flood deductible?", "session_id": "broker-42", "use_rag_context": true, "use_reranker": true}'
Agent chat (tool-calling loop):
curl -X POST http://localhost:8000/agent/chat \
-H "Content-Type: application/json" \
-d '{"message":"What is my premium for coverage 120000 and risk score 1.2?","session_id":"agent-1","use_reranker":true,"max_turns":6}'
5) Architecture
- Broker asks question
BAAI/bge-small-en-v1.5creates query embedding (local)- FAISS retrieves top-k policy chunks
- Cross-encoder reranker reorders chunks (bonus layer)
- Prompt injects context + question
- Local Hugging Face LLM produces grounded answer with citations
6) Chunking Strategy (Justification)
- Split preference: paragraph (
\n\n) -> line (\n) -> sentence (.) -> word () CHUNK_SIZE=512: insurance clauses are usually 300-500 words, so one clause fits in one chunkCHUNK_OVERLAP=64: avoids orphaned tail phrases like "without rider" when they straddle boundaries
Example:
Section 8: Exclusions ... Flood (without rider)- Overlap preserves the clause boundary so retrieval still captures meaning and context
7) Continuous Chat
- Session memory keyed by
session_id - Stores recent turns up to
MAX_HISTORY - Optional
use_rag_context=trueinjects retrieved policy context every turn - Streaming variant:
/chat/continuous/stream
8) Tool-Calling Agent (Part 3)
- Endpoint:
/agent/chat - Stateful per-session history (in-memory)
- Hand-rolled loop with max 6 turns
- Tool set:
search_policy(query)-> RAG retrievalcalculate_premium(coverage, risk_score)->coverage * 0.002 * risk_scorecheck_claim_status(claim_id)-> hardcoded mock claim map
- Malformed tool-call JSON is handled gracefully and retried inside the loop
- Session management:
GET /agent/sessionsDELETE /agent/session/{session_id}
9) Performance Testing
Use frontend + API:
/rag/benchmarkfor repeated run stats (avg,p50,p95,min,max)/rag/evaluate/jsonto validate model answers against a JSON QA set (example:../rag.json)/metricsfor rolling endpoint latency- Frontend includes one-click benchmark and metrics refresh
10) Project Layout
insurance-rag/
├── .env
├── .env.example
├── requirements.txt
├── README.md
├── main.py
├── rag/
│ ├── __init__.py
│ ├── documents/
│ │ ├── health_policy.txt
│ │ ├── auto_policy.txt
│ │ └── home_policy.txt
│ ├── embedder.py
│ ├── chunker.py
│ ├── vector_store.py
│ ├── reranker.py
│ └── pipeline.py
├── model_server/
│ ├── __init__.py
│ └── llm_client.py
├── agent/
│ ├── __init__.py
│ ├── engine.py
│ └── tools.py
└── frontend/
├── index.html
├── styles.css
└── app.js
Xet Storage Details
- Size:
- 4.94 kB
- Xet hash:
- f66408a139332175ee3206a7168686e61754e334c9ae49e805d332f19ffc3c07
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.