piyushptiwari committed on
Commit 2cc32a5 · verified · 1 Parent(s): 439311b

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes. See raw diff
README.md ADDED
@@ -0,0 +1,224 @@
+ ---
+ language:
+ - en
+ license: apache-2.0
+ tags:
+ - insurance
+ - uk-insurance
+ - training-pipeline
+ - search-engine
+ - bytical
+ ---
+
+ # INSUREOS Models — Complete Insurance AI Training Pipeline
+
+ **Created by [Bytical AI](https://bytical.ai)** — AI agents that run insurance operations.
+
+ ## Overview
+
+ INSUREOS is a complete AI/ML training and inference pipeline for UK insurance operations. This repository contains all source code for data generation, model training, evaluation, data collection, and a hybrid search engine.
+
+ ### Model Suite
+
+ | Model | HuggingFace | Task | Key Metric |
+ |-------|-------------|------|------------|
+ | InsureLLM-4B | [piyushptiwari/InsureLLM-4B](https://huggingface.co/piyushptiwari/InsureLLM-4B) | Insurance domain LLM | ROUGE-1: 0.384 |
+ | InsureDocClassifier | [piyushptiwari/InsureDocClassifier](https://huggingface.co/piyushptiwari/InsureDocClassifier) | 12-class document classification | F1: 1.0 |
+ | InsureNER | [piyushptiwari/InsureNER](https://huggingface.co/piyushptiwari/InsureNER) | 13-entity NER | F1: 1.0 |
+ | InsureFraudNet | [piyushptiwari/InsureFraudNet](https://huggingface.co/piyushptiwari/InsureFraudNet) | Fraud detection (3 LoB) | AUC-ROC: 1.0 |
+ | InsurePricing | [piyushptiwari/InsurePricing](https://huggingface.co/piyushptiwari/InsurePricing) | Premium pricing (GLM + EBM) | MAE: £11,132 |
+ | InsureSearch | (included in this repo) | Hybrid search engine | 33K docs indexed |
+
+ ### Training Dataset
+
+ [piyushptiwari/insureos-training-data](https://huggingface.co/datasets/piyushptiwari/insureos-training-data) — 10K SFT, 5K DPO, 50K tabular, 10K docs, 8K NER
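+
+ A minimal way to pull the dataset locally with the `datasets` library (the split name below is an assumption; check the dataset card for the actual configs/splits):
+
+ ```python
+ from datasets import load_dataset
+
+ # "train" is assumed here — the dataset card lists the real splits.
+ ds = load_dataset("piyushptiwari/insureos-training-data", split="train")
+ print(ds[0])
+ ```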
+
+ ## Repository Structure
+
+ ```
+ insureos-models/
+ ├── data/                    # Synthetic data generation
+ │   ├── constants.py         # UK insurance constants (regions, perils, regulators)
+ │   ├── gen_sft.py           # Generate SFT instruction-response pairs
+ │   ├── gen_dpo.py           # Generate DPO preference pairs
+ │   ├── gen_documents.py     # Generate insurance documents (12 classes)
+ │   ├── gen_ner.py           # Generate NER-annotated text
+ │   ├── gen_tabular.py       # Generate claims tabular data
+ │   └── generate_all.py      # Run all generators
+ │
+ ├── collect/                 # Real-world data collection
+ │   ├── config.py            # Scraping targets and configuration
+ │   ├── scraper_base.py      # Base HTTP scraper with caching
+ │   ├── convert_sft.py       # Convert raw docs → SFT/DPO format
+ │   ├── run_fast.py          # Fast collection orchestrator
+ │   └── sources/             # Per-source scrapers
+ │       ├── wikipedia.py     # Wikipedia insurance articles
+ │       ├── legislation.py   # UK legislation (legislation.gov.uk)
+ │       ├── fca.py           # FCA Handbook
+ │       ├── hf_datasets.py   # HuggingFace insurance datasets
+ │       ├── rss_news.py      # Insurance news RSS feeds
+ │       └── education.py     # Insurance education resources
+ │
+ ├── training/                # Model training scripts
+ │   ├── qlora_finetune.py    # QLoRA fine-tuning (Qwen3-4B)
+ │   ├── dpo_train.py         # DPO alignment training
+ │   ├── retrain_realworld.py # Real-world data retraining
+ │   ├── doc_classifier.py    # ModernBERT document classifier
+ │   ├── ner_model.py         # ModernBERT NER model
+ │   ├── fraud_model.py       # XGBoost + Isolation Forest fraud
+ │   ├── pricing_glm.py       # Tweedie GLM + EBM pricing
+ │   └── distill.py           # Model distillation (experimental)
+ │
+ ├── evaluation/              # Evaluation suite
+ │   ├── run_eval.py          # Full multi-model evaluation
+ │   └── results/             # Evaluation results (JSON)
+ │
+ ├── search/                  # Hybrid search engine
+ │   ├── config.py            # Search configuration
+ │   ├── embedder.py          # BGE-small-en-v1.5 embedding service
+ │   ├── bm25.py              # Custom Okapi BM25 implementation
+ │   ├── vector_store.py      # Qdrant vector store
+ │   ├── reranker.py          # Cross-encoder reranker
+ │   ├── hybrid_engine.py     # RRF fusion (vector + BM25 + reranker)
+ │   ├── indexer.py           # Document ingestion pipeline
+ │   ├── models.py            # Pydantic data models
+ │   └── api.py               # FastAPI REST API
+ │
+ ├── serve/                   # Model serving
+ │   └── api.py               # FastAPI inference endpoints
+ │
+ └── scripts/                 # Automation
+     ├── setup.sh             # Environment setup (NVIDIA, Python, deps)
+     └── train_all.sh         # Full training pipeline script
+ ```
+
+ ## Quick Start
+
+ ### 1. Environment Setup
+
+ ```bash
+ # Create virtual environment
+ python3 -m venv .venv && source .venv/bin/activate
+
+ # Install dependencies
+ pip install torch transformers trl peft bitsandbytes
+ pip install xgboost scikit-learn interpret
+ pip install sentence-transformers qdrant-client fastapi uvicorn
+ ```
+
+ ### 2. Generate Training Data
+
+ ```bash
+ python -m data.generate_all
+ # Outputs: data/output/ (SFT, DPO, docs, NER, tabular)
+ ```
+
+ ### 3. Train Models
+
+ ```bash
+ # Train all models sequentially
+ bash scripts/train_all.sh
+
+ # Or individually:
+ python training/qlora_finetune.py   # InsureLLM QLoRA
+ python training/dpo_train.py        # InsureLLM DPO
+ python training/doc_classifier.py   # Document classifier
+ python training/ner_model.py        # NER model
+ python training/fraud_model.py      # Fraud detection
+ python training/pricing_glm.py      # Pricing models
+ ```
+
+ ### 4. Evaluate
+
+ ```bash
+ python evaluation/run_eval.py
+ # Results saved to evaluation/results/
+ ```
+
+ ### 5. Run Search Engine
+
+ ```bash
+ # Index documents
+ python search/indexer.py
+
+ # Start API
+ python search/api.py
+ # API at http://localhost:8900
+ # Endpoints: /search, /search/vector, /search/keyword, /suggest, /facets, /stats
+ ```
150
+ ## Search Engine β€” InsureSearch
151
+
152
+ A hybrid search engine rivaling Azure AI Search, built entirely on open-source components:
153
+
154
+ | Component | Technology | Details |
155
+ |-----------|-----------|---------|
156
+ | **Vector Search** | BGE-small-en-v1.5 (384-dim) + Qdrant | Semantic similarity |
157
+ | **Keyword Search** | Custom Okapi BM25 | Insurance-aware tokenization |
158
+ | **Reranking** | cross-encoder/ms-marco-MiniLM-L-6-v2 | Cross-encoder reranking |
159
+ | **Fusion** | Reciprocal Rank Fusion (RRF) | Vector 60% + BM25 40% |
160
+ | **API** | FastAPI | REST API with facets, suggestions |
161
+
162
+ **Index stats:** 33,034 chunks from 31,679 documents, 51,640 BM25 terms.
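+
+ For reference, the weighted RRF step reduces to a few lines. This is a minimal sketch rather than the code in `search/hybrid_engine.py`; the 0.6/0.4 weights come from the table above, while `k=60` is the conventional RRF smoothing constant and an assumption here:
+
+ ```python
+ def weighted_rrf(vector_ranking: list[str], bm25_ranking: list[str],
+                  w_vec: float = 0.6, w_bm25: float = 0.4, k: int = 60) -> list[str]:
+     """Fuse two ranked lists of doc IDs with weighted Reciprocal Rank Fusion."""
+     scores: dict[str, float] = {}
+     for weight, ranking in ((w_vec, vector_ranking), (w_bm25, bm25_ranking)):
+         for rank, doc_id in enumerate(ranking, start=1):
+             # Each list contributes weight / (k + rank); k damps the very top ranks.
+             scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
+     return sorted(scores, key=scores.get, reverse=True)
+
+ # "b" ranks well in both lists, so it tops the fused ranking.
+ print(weighted_rrf(["a", "b", "c"], ["b", "d", "a"]))  # ['b', 'a', 'c', 'd']
+ ```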
+
+ ## Training Pipeline
+
+ ```
+ Stage 1: Synthetic Data Generation
+ ├── 10K SFT instruction-response pairs
+ ├── 5K DPO preference pairs
+ ├── 50K tabular claims (Motor/Property/Liability)
+ ├── 10K insurance documents (12 classes)
+ └── 8K NER-annotated texts (13 entity types)
+
+ Stage 2: QLoRA Fine-Tuning → Qwen3-4B
+ ├── rank=64, alpha=128, all-linear targets
+ ├── 2 epochs, batch=2, grad_accum=4
+ ├── Final: train_loss=0.012, eval_loss=0.118
+ └── Token accuracy: 95.88%
+
+ Stage 3: DPO Alignment
+ ├── 5K preference pairs
+ ├── 149 steps, reward_accuracy=1.0
+ └── Reward margin: 26.76
+
+ Stage 4: Real-World Data Collection
+ ├── Wikipedia (150 docs), UK Legislation (692)
+ ├── HuggingFace datasets (31,060), RSS (50), Education (88)
+ ├── Converted to 3,685 SFT + 776 DPO pairs
+ └── Quality filtered (English-only, no echo responses)
+
+ Stage 5: Real-World Retraining
+ ├── 876 steps on real-world SFT data
+ └── Claims process score improved 0.40 → 0.60
+
+ Stage 6: Specialized Models (parallel)
+ ├── FraudNet: XGBoost + Isolation Forest → AUC-ROC 1.0
+ ├── PricingGLM: Tweedie GLM + EBM → MAE £11,132
+ ├── DocClassifier: ModernBERT → F1 1.0
+ └── InsureNER: ModernBERT → F1 1.0
+ ```
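+
+ The Stage 2 adapter settings map directly onto a PEFT config. A minimal sketch, not the exact contents of `training/qlora_finetune.py` (rank, alpha, and targets come from the summary above; the dropout value is an assumption):
+
+ ```python
+ from peft import LoraConfig
+
+ lora_config = LoraConfig(
+     r=64,                         # rank, as listed in Stage 2
+     lora_alpha=128,               # alpha, as listed in Stage 2
+     target_modules="all-linear",  # attach adapters to every linear layer
+     lora_dropout=0.05,            # assumption; not stated in the summary
+     task_type="CAUSAL_LM",
+ )
+ ```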
+
+ ## Tech Stack
+
+ - **LLM:** Qwen3-4B + QLoRA + DPO (PyTorch, Transformers, TRL, PEFT, bitsandbytes)
+ - **Classification & NER:** ModernBERT-base (Transformers)
+ - **Fraud Detection:** XGBoost + Isolation Forest (scikit-learn)
+ - **Pricing:** Tweedie GLM (scikit-learn) + EBM (InterpretML), sketched after this list
+ - **Search:** BGE-small-en-v1.5 + Qdrant + BM25 + cross-encoder
+ - **Training GPU:** NVIDIA Tesla T4 16GB
210
+
211
+ ## Citation
212
+
213
+ ```bibtex
214
+ @misc{bytical2026insureos,
215
+ title={INSUREOS: A Complete AI/ML Suite for UK Insurance Operations},
216
+ author={Bytical AI},
217
+ year={2026},
218
+ url={https://huggingface.co/piyushptiwari/insureos-models}
219
+ }
220
+ ```
221
+
222
+ ## About Bytical AI
223
+
224
+ [Bytical](https://bytical.ai) builds AI agents that run insurance operations β€” claims automation, underwriting intelligence, digital sales, and core system modernization for insurers across the UK and Europe. Microsoft AI Partner | NVIDIA | Salesforce.
__init__.py ADDED
@@ -0,0 +1 @@
+ # Training package
api.py ADDED
@@ -0,0 +1,367 @@
+ """
+ InsureOS — FastAPI Model Serving Endpoint
+ Serves all InsureOS models via a unified REST API.
+ """
+
+ import os
+ import json
+ import time
+ import pickle
+ import logging
+ from pathlib import Path
+ from contextlib import asynccontextmanager
+
+ import torch
+ import numpy as np
+ from fastapi import FastAPI, HTTPException
+ from pydantic import BaseModel, Field
+
+ # ── Logging ──
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger("insureos-api")
+
+
+ # ── Request/Response models ──
+
+ class ChatRequest(BaseModel):
+     messages: list[dict] = Field(..., description="Chat messages in OpenAI format")
+     max_tokens: int = Field(512, ge=1, le=2048)
+     temperature: float = Field(0.7, ge=0.0, le=2.0)
+
+ class ChatResponse(BaseModel):
+     response: str
+     model: str
+     latency_ms: float
+
+ class FraudRequest(BaseModel):
+     lob: str = Field(..., description="Line of business: motor, property, or liability")
+     features: dict = Field(..., description="Claim features as key-value pairs")
+
+ class FraudResponse(BaseModel):
+     fraud_probability: float
+     fraud_label: bool
+     anomaly_score: float | None = None
+     model: str
+     latency_ms: float
+
+ class PricingRequest(BaseModel):
+     features: dict = Field(..., description="Rating factor values")
+     model_type: str = Field("ebm", description="Model type: glm or ebm")
+
+ class PricingResponse(BaseModel):
+     predicted_premium: float
+     model: str
+     latency_ms: float
+
+ class DocClassifyRequest(BaseModel):
+     text: str = Field(..., description="Document text to classify")
+
+ class DocClassifyResponse(BaseModel):
+     label: str
+     confidence: float
+     all_scores: dict[str, float]
+     model: str
+     latency_ms: float
+
+ class NERRequest(BaseModel):
+     text: str = Field(..., description="Text for entity extraction")
+
+ class NERResponse(BaseModel):
+     entities: list[dict]
+     model: str
+     latency_ms: float
+
+ class HealthResponse(BaseModel):
+     status: str
+     models_loaded: dict[str, bool]
+
+
+ # ── Model store ──
+ models = {}
+
+
+ @asynccontextmanager
+ async def lifespan(app: FastAPI):
+     """Load models on startup."""
+     logger.info("Loading models...")
+
+     # InsureLLM
+     insurellm_path = os.getenv("INSURELLM_MODEL", "models/insurellm-8b-dpo-merged")
+     if Path(insurellm_path).exists():
+         from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+         logger.info(f"Loading InsureLLM from {insurellm_path}...")
+         tokenizer = AutoTokenizer.from_pretrained(insurellm_path, trust_remote_code=True)
+         if tokenizer.pad_token is None:
+             tokenizer.pad_token = tokenizer.eos_token
+
+         bnb_config = BitsAndBytesConfig(
+             load_in_4bit=True,
+             bnb_4bit_quant_type="nf4",
+             bnb_4bit_compute_dtype=torch.bfloat16,
+             bnb_4bit_use_double_quant=True,
+         )
+         model = AutoModelForCausalLM.from_pretrained(
+             insurellm_path,
+             quantization_config=bnb_config,
+             device_map="auto",
+             trust_remote_code=True,
+             attn_implementation="sdpa",
+             torch_dtype=torch.bfloat16,
+         )
+         model.eval()
+         models["insurellm"] = {"model": model, "tokenizer": tokenizer}
+         logger.info("InsureLLM loaded ✓")
+
+     # FraudNet
+     fraud_dir = Path(os.getenv("FRAUD_MODEL", "models/fraudnet"))
+     for lob in ["motor", "property", "liability"]:
+         xgb_path = fraud_dir / f"xgb_{lob}.json"
+         if xgb_path.exists():
+             import xgboost as xgb_lib
+             xgb_model = xgb_lib.XGBClassifier()
+             xgb_model.load_model(str(xgb_path))
+             models[f"fraud_{lob}"] = xgb_model
+             logger.info(f"FraudNet {lob} loaded ✓")
+
+         iforest_path = fraud_dir / f"iforest_{lob}.pkl"
+         if iforest_path.exists():
+             with open(iforest_path, "rb") as f:
+                 models[f"iforest_{lob}"] = pickle.load(f)
+             logger.info(f"IsolationForest {lob} loaded ✓")
+
+     # Pricing
+     pricing_dir = Path(os.getenv("PRICING_MODEL", "models/pricing-glm"))
+     for name in ["tweedie_glm", "pricing_ebm"]:
+         pkl_path = pricing_dir / f"{name}.pkl"
+         if pkl_path.exists():
+             with open(pkl_path, "rb") as f:
+                 models[name] = pickle.load(f)
+             logger.info(f"Pricing {name} loaded ✓")
+
+     # Doc Classifier
+     doc_path = os.getenv("DOC_MODEL", "models/doc-classifier")
+     if Path(doc_path).exists():
+         from transformers import AutoModelForSequenceClassification, AutoTokenizer as ATok
+         models["doc_classifier"] = {
+             "model": AutoModelForSequenceClassification.from_pretrained(doc_path),
+             "tokenizer": ATok.from_pretrained(doc_path),
+         }
+         models["doc_classifier"]["model"].eval()
+         meta_path = Path(doc_path) / "training_meta.json"
+         if meta_path.exists():
+             with open(meta_path) as f:
+                 models["doc_classifier"]["meta"] = json.load(f)
+         logger.info("DocClassifier loaded ✓")
+
+     # NER
+     ner_path = os.getenv("NER_MODEL", "models/ner-model")
+     if Path(ner_path).exists():
+         from transformers import AutoModelForTokenClassification, AutoTokenizer as ATok2
+         models["ner"] = {
+             "model": AutoModelForTokenClassification.from_pretrained(ner_path),
+             "tokenizer": ATok2.from_pretrained(ner_path),
+         }
+         models["ner"]["model"].eval()
+         meta_path = Path(ner_path) / "training_meta.json"
+         if meta_path.exists():
+             with open(meta_path) as f:
+                 models["ner"]["meta"] = json.load(f)
+         logger.info("NER model loaded ✓")
+
+     logger.info(f"Models loaded: {list(models.keys())}")
+     yield
+     models.clear()
+
+
+ app = FastAPI(
+     title="InsureOS Model API",
+     description="UK Insurance AI Model Serving — InsureLLM, FraudNet, PricingGLM, DocClassifier, NER",
+     version="0.1.0",
+     lifespan=lifespan,
+ )
+
+
+ # ── Endpoints ──
+
+ @app.get("/health", response_model=HealthResponse)
+ async def health():
+     return HealthResponse(
+         status="healthy",
+         models_loaded={
+             "insurellm": "insurellm" in models,
+             "fraud_motor": "fraud_motor" in models,
+             "fraud_property": "fraud_property" in models,
+             "fraud_liability": "fraud_liability" in models,
+             "pricing_glm": "tweedie_glm" in models,
+             "pricing_ebm": "pricing_ebm" in models,
+             "doc_classifier": "doc_classifier" in models,
+             "ner": "ner" in models,
+         },
+     )
+
+
+ @app.post("/v1/chat", response_model=ChatResponse)
+ async def chat(request: ChatRequest):
+     if "insurellm" not in models:
+         raise HTTPException(status_code=503, detail="InsureLLM not loaded")
+
+     tokenizer = models["insurellm"]["tokenizer"]
+     model = models["insurellm"]["model"]
+
+     text = tokenizer.apply_chat_template(request.messages, tokenize=False, add_generation_prompt=True)
+     inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(model.device)
+
+     start = time.time()
+     with torch.no_grad():
+         outputs = model.generate(
+             **inputs,
+             max_new_tokens=request.max_tokens,
+             temperature=max(request.temperature, 0.01),
+             top_p=0.9,
+             do_sample=request.temperature > 0,
+         )
+     latency = (time.time() - start) * 1000
+
+     response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
+     return ChatResponse(response=response, model="insurellm-8b-dpo", latency_ms=latency)
+
+
+ @app.post("/v1/fraud", response_model=FraudResponse)
+ async def fraud_detect(request: FraudRequest):
+     lob = request.lob.lower()
+     model_key = f"fraud_{lob}"
+     if model_key not in models:
+         raise HTTPException(status_code=503, detail=f"FraudNet {lob} not loaded")
+
+     xgb_model = models[model_key]
+     import pandas as pd
+     features_df = pd.DataFrame([request.features]).fillna(0)
+
+     start = time.time()
+     prob = float(xgb_model.predict_proba(features_df)[:, 1][0])
+     label = prob >= 0.5
+     latency = (time.time() - start) * 1000
+
+     # Anomaly score from isolation forest
+     anomaly = None
+     iforest_key = f"iforest_{lob}"
+     if iforest_key in models:
+         anomaly = float(models[iforest_key].score_samples(features_df)[0])
+
+     return FraudResponse(
+         fraud_probability=prob,
+         fraud_label=label,
+         anomaly_score=anomaly,
+         model=f"fraudnet-{lob}",
+         latency_ms=latency,
+     )
+
+
+ @app.post("/v1/pricing", response_model=PricingResponse)
+ async def predict_price(request: PricingRequest):
+     model_key = "tweedie_glm" if request.model_type == "glm" else "pricing_ebm"
+     if model_key not in models:
+         raise HTTPException(status_code=503, detail=f"Pricing model {request.model_type} not loaded")
+
+     pricing_model = models[model_key]
+     import pandas as pd
+     features_df = pd.DataFrame([request.features]).fillna(0)
+
+     start = time.time()
+     prediction = float(max(0, pricing_model.predict(features_df)[0]))
+     latency = (time.time() - start) * 1000
+
+     return PricingResponse(
+         predicted_premium=prediction,
+         model=request.model_type,
+         latency_ms=latency,
+     )
+
+
+ @app.post("/v1/classify", response_model=DocClassifyResponse)
+ async def classify_document(request: DocClassifyRequest):
+     if "doc_classifier" not in models:
+         raise HTTPException(status_code=503, detail="DocClassifier not loaded")
+
+     tokenizer = models["doc_classifier"]["tokenizer"]
+     model = models["doc_classifier"]["model"]
+     meta = models["doc_classifier"].get("meta", {})
+     id2label = meta.get("id2label", {})
+
+     inputs = tokenizer(request.text, return_tensors="pt", truncation=True, max_length=512)
+
+     start = time.time()
+     with torch.no_grad():
+         outputs = model(**inputs)
+     latency = (time.time() - start) * 1000
+
+     probs = torch.softmax(outputs.logits, dim=-1)[0]
+     pred_id = probs.argmax().item()
+     confidence = probs[pred_id].item()
+
+     scores = {id2label.get(str(i), f"class_{i}"): float(p) for i, p in enumerate(probs)}
+
+     return DocClassifyResponse(
+         label=id2label.get(str(pred_id), f"class_{pred_id}"),
+         confidence=confidence,
+         all_scores=scores,
+         model="doc-classifier",
+         latency_ms=latency,
+     )
+
+
+ @app.post("/v1/ner", response_model=NERResponse)
+ async def extract_entities(request: NERRequest):
+     if "ner" not in models:
+         raise HTTPException(status_code=503, detail="NER model not loaded")
+
+     tokenizer = models["ner"]["tokenizer"]
+     model = models["ner"]["model"]
+     meta = models["ner"].get("meta", {})
+     id2label = meta.get("id2label", {})
+
+     tokens = request.text.split()
+     inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True, max_length=256)
+
+     start = time.time()
+     with torch.no_grad():
+         outputs = model(**inputs)
+     latency = (time.time() - start) * 1000
+
+     preds = outputs.logits.argmax(dim=-1)[0].tolist()
+     word_ids = inputs.word_ids(batch_index=0)
+
+     entities = []
+     current_entity = None
+     for word_id, pred_id in zip(word_ids, preds):
+         if word_id is None:
+             continue
+         label = id2label.get(str(pred_id), "O")
+
+         if label.startswith("B-"):
+             if current_entity:
+                 entities.append(current_entity)
+             current_entity = {
+                 "entity_type": label[2:],
+                 "text": tokens[word_id],
+                 "start_token": word_id,
+                 "end_token": word_id,
+             }
+         elif label.startswith("I-") and current_entity and label[2:] == current_entity["entity_type"]:
+             if word_id != current_entity["end_token"]:  # skip repeated sub-word pieces of the same word
+                 current_entity["text"] += " " + tokens[word_id]
+                 current_entity["end_token"] = word_id
+         else:
+             if current_entity:
+                 entities.append(current_entity)
+             current_entity = None
+
+     if current_entity:
+         entities.append(current_entity)
+
+     return NERResponse(entities=entities, model="ner-model", latency_ms=latency)
+
+
+ if __name__ == "__main__":
+     import uvicorn
+     port = int(os.getenv("PORT", "8000"))
+     uvicorn.run("serve.api:app", host="0.0.0.0", port=port, reload=False)
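
The request models above make the serving API straightforward to exercise from Python. A minimal client sketch (assumes the default `PORT=8000`; the feature keys in the fraud payload are illustrative, not the model's actual schema):

```python
import requests

BASE = "http://localhost:8000"

# Chat with InsureLLM (ChatRequest: messages, max_tokens, temperature).
r = requests.post(f"{BASE}/v1/chat", json={
    "messages": [{"role": "user", "content": "Explain subrogation in UK motor claims."}],
    "max_tokens": 256,
    "temperature": 0.7,
})
print(r.json()["response"])

# Score a motor claim for fraud (FraudRequest: lob, features).
r = requests.post(f"{BASE}/v1/fraud", json={
    "lob": "motor",
    "features": {"claim_amount": 12500, "days_to_report": 45},  # illustrative keys
})
print(r.json()["fraud_probability"])
```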
collect/__init__.py ADDED
@@ -0,0 +1 @@
+ # Insurance data collection pipeline
collect/config.py ADDED
@@ -0,0 +1,202 @@
+ """Configuration for data collection sources."""
+
+ import os
+ from pathlib import Path
+
+ # ── Paths ──────────────────────────────────────────────────────────
+ BASE_DIR = Path(__file__).resolve().parent.parent
+ RAW_DIR = BASE_DIR / "collect" / "raw"
+ PROCESSED_DIR = BASE_DIR / "collect" / "processed"
+ SFT_OUTPUT = BASE_DIR / "collect" / "sft_real_world.jsonl"
+ DPO_OUTPUT = BASE_DIR / "collect" / "dpo_real_world.jsonl"
+
+ RAW_DIR.mkdir(parents=True, exist_ok=True)
+ PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
+
+ # ── Rate limiting ──────────────────────────────────────────────────
+ REQUEST_DELAY = 1.5  # seconds between requests (be polite)
+ MAX_RETRIES = 3
+ TIMEOUT = 30
+
+ # ── User agent ─────────────────────────────────────────────────────
+ USER_AGENT = (
+     "InsureOS-DataCollector/1.0 "
+     "(Research; insurance-domain-model-training; "
+     "contact: piyush@bytical.com)"
+ )
+
+ HEADERS = {
+     "User-Agent": USER_AGENT,
+     "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
+     "Accept-Language": "en-GB,en;q=0.9",
+ }
+
+ # ── Wikipedia insurance articles ───────────────────────────────────
+ WIKIPEDIA_SEED_ARTICLES = [
+     "Insurance", "Reinsurance", "Underwriting", "Actuarial_science",
+     "Insurance_policy", "Lloyd%27s_of_London", "Property_insurance",
+     "Casualty_insurance", "Life_insurance", "Health_insurance",
+     "Motor_insurance", "Marine_insurance", "Liability_insurance",
+     "Professional_indemnity_insurance", "Directors_and_officers_liability_insurance",
+     "Cyber_insurance", "Product_liability", "Public_liability",
+     "Employers%27_liability_insurance", "Business_interruption_insurance",
+     "Catastrophe_bond", "Insurance-linked_securities",
+     "Solvency_II", "IFRS_17", "Risk_management",
+     "Claims_adjusting", "Loss_adjustment", "Salvage_(insurance)",
+     "Subrogation", "Indemnity", "Utmost_good_faith",
+     "Proximate_cause_(insurance)", "Insurance_fraud",
+     "Parametric_insurance", "Microinsurance", "Takaful",
+     "Financial_Conduct_Authority", "Prudential_Regulation_Authority_(United_Kingdom)",
+     "General_insurance", "Insurance_broker", "Managing_general_agent",
+     "Coverholder", "Bordereaux", "Treaty_reinsurance",
+     "Facultative_reinsurance", "Excess_of_loss", "Quota_share",
+     "Stop-loss_insurance", "Aggregate_stop-loss_insurance",
+     "Deductible", "Co-insurance", "Self-insurance",
+     "Captive_insurance", "Risk_retention_group",
+     "Insurance_in_the_United_Kingdom", "Association_of_British_Insurers",
+     "Chartered_Insurance_Institute", "Insurance_premium_tax",
+     "Motor_Insurers%27_Bureau", "Pool_Reinsurance_Company",
+     "Flood_Re", "Terrorism_reinsurance",
+     "Insurance_contract", "Warranty_(insurance)",
+     "Condition_(insurance)", "Exclusion_(insurance)",
+     "Endorsement_(insurance)", "Schedule_(insurance)",
+     "Inception_(insurance)", "Renewal_(insurance)",
+     "Cancellation_(insurance)", "Claims-made_policy",
+     "Occurrence_policy", "Claims_reserve",
+     "Incurred_but_not_reported", "Loss_ratio",
+     "Combined_ratio", "Expense_ratio",
+     "Generalized_linear_model", "Tweedie_distribution",
+     "Poisson_regression", "Gamma_distribution",
+     "Chain_ladder_method", "Bornhuetter–Ferguson_method",
+     "Credibility_theory", "Experience_rating",
+     "Risk_classification", "Adverse_selection",
+     "Moral_hazard", "Insurance_scoring",
+     "Telematics", "Usage-based_insurance",
+     "Insurtech", "Peer-to-peer_insurance",
+     "Embedded_insurance", "Open_insurance",
+     "ACORD", "ISO_ClaimSearch",
+     "National_Flood_Insurance_Program",
+     "Earthquake_insurance", "Windstorm_insurance",
+     "Hail_insurance", "Crop_insurance",
+     "Title_insurance", "Surety_bond",
+     "Fidelity_bond", "Warranty",
+     "Extended_warranty", "Home_warranty",
+     "Pet_insurance", "Travel_insurance",
+     "Wedding_insurance", "Event_insurance",
+     "Key_person_insurance", "Trade_credit_insurance",
+     "Political_risk_insurance", "Environmental_liability",
+     "Pollution_insurance",
+ ]
+
+ # ── FCA Handbook sections ──────────────────────────────────────────
+ FCA_HANDBOOK_SECTIONS = [
+     "ICOBS",  # Insurance: Conduct of Business Sourcebook
+     "SYSC",   # Senior Management Arrangements
+     "PRIN",   # Principles for Businesses
+     "COBS",   # Conduct of Business Sourcebook
+     "DISP",   # Dispute Resolution: Complaints
+     "SUP",    # Supervision
+     "CONC",   # Consumer Credit
+     "MCOB",   # Mortgages and Home Finance
+ ]
+
+ FCA_BASE_URL = "https://www.handbook.fca.org.uk"
+
+ # ── UK Legislation ─────────────────────────────────────────────────
+ UK_LEGISLATION_URLS = [
+     # Insurance Act 2015
+     "https://www.legislation.gov.uk/ukpga/2015/4/contents",
+     # Enterprise Act 2016 (insurance damages for late payment)
+     "https://www.legislation.gov.uk/ukpga/2016/12/contents",
+     # Financial Services and Markets Act 2000
+     "https://www.legislation.gov.uk/ukpga/2000/8/contents",
+     # Third Parties (Rights against Insurers) Act 2010
+     "https://www.legislation.gov.uk/ukpga/2010/10/contents",
+     # Road Traffic Act 1988 (compulsory motor insurance)
+     "https://www.legislation.gov.uk/ukpga/1988/52/contents",
+     # Employers' Liability (Compulsory Insurance) Act 1969
+     "https://www.legislation.gov.uk/ukpga/1969/57/contents",
+     # Marine Insurance Act 1906
+     "https://www.legislation.gov.uk/ukpga/Edw7/6/41/contents",
+     # Consumer Insurance (Disclosure and Representations) Act 2012
+     "https://www.legislation.gov.uk/ukpga/2012/6/contents",
+     # Data Protection Act 2018
+     "https://www.legislation.gov.uk/ukpga/2018/12/contents",
+ ]
+
+ # ── Investopedia insurance glossary terms ──────────────────────────
+ INVESTOPEDIA_TERMS = [
+     "insurance", "reinsurance", "underwriting", "premium",
+     "deductible", "copayment", "coinsurance", "policy-limit",
+     "exclusion", "endorsement", "rider", "binder",
+     "actuary", "actuarial-science", "loss-ratio",
+     "combined-ratio", "expense-ratio", "claims-reserve",
+     "ibnr", "incurred-but-not-reported",
+     "lloyd-s-of-london", "surplus-lines",
+     "managing-general-agent", "captive-insurance-company",
+     "risk-retention-group", "self-insurance",
+     "occurrence-policy", "claims-made-policy",
+     "general-liability-insurance", "professional-liability-insurance",
+     "errors-and-omissions-insurance", "directors-and-officers-liability-insurance",
+     "cyber-insurance", "key-person-insurance",
+     "business-interruption-insurance", "commercial-property-insurance",
+     "workers-compensation", "employers-liability-insurance",
+     "public-liability-insurance", "product-liability-insurance",
+     "environmental-liability-insurance", "marine-insurance",
+     "hull-insurance", "cargo-insurance",
+     "protection-and-indemnity-insurance", "aviation-insurance",
+     "crop-insurance", "title-insurance",
+     "surety-bond", "fidelity-bond",
+     "catastrophe-bond", "insurance-linked-securities",
+     "parametric-insurance", "microinsurance",
+     "property-insurance", "casualty-insurance",
+     "fire-insurance", "flood-insurance",
+     "earthquake-insurance", "windstorm-insurance",
+     "homeowners-insurance", "renters-insurance",
+     "auto-insurance", "uninsured-motorist-coverage",
+     "comprehensive-auto-insurance", "collision-insurance",
+     "gap-insurance", "umbrella-insurance",
+     "life-insurance", "term-life-insurance",
+     "whole-life-insurance", "universal-life-insurance",
+     "variable-life-insurance", "endowment-policy",
+     "annuity", "health-insurance",
+     "disability-insurance", "long-term-care-insurance",
+     "pet-insurance", "travel-insurance",
+     "wedding-insurance", "event-insurance",
+     "trade-credit-insurance", "political-risk-insurance",
+     "warranty", "extended-warranty",
+     "solvency", "moral-hazard",
+     "adverse-selection", "risk-management",
+     "risk-assessment", "risk-transfer",
+     "risk-pooling", "law-of-large-numbers",
+     "subrogation", "indemnity", "utmost-good-faith",
+     "proximate-cause", "insurable-interest",
+     "insurance-fraud", "total-loss",
+     "actual-cash-value", "replacement-cost",
+     "agreed-value", "reinstatement-value",
+ ]
+
+ # ── HuggingFace datasets ──────────────────────────────────────────
+ HF_DATASETS = [
+     ("rvpierre/insurance-qa-en", None),
+     ("ebrigham/NL_insurance_reviews_sentiment", None),
+     ("snorkelai/Multi-Turn-Insurance-Underwriting-Code-Gen", None),
+     ("Ddream-ai/InsuranceCorpus", None),
+ ]
+
+ # ── Insurance subreddits ──────────────────────────────────────────
+ REDDIT_SUBREDDITS = [
+     "insurance",
+     "InsuranceProfessional",
+     "HealthInsurance",
+     "ActuaryUK",
+     "actuary",
+ ]
+
+ # ── RSS feeds for insurance news ───────────────────────────────────
+ RSS_FEEDS = [
+     "https://www.insurancetimes.co.uk/rss",
+     "https://www.insurancejournal.com/rss/news/",
+     "https://www.reinsurancene.ws/feed/",
+     "https://www.artemis.bm/feed/",
+ ]
collect/convert_sft.py ADDED
@@ -0,0 +1,416 @@
+ """Convert collected real-world insurance data into SFT and DPO training format.
+
+ Strategies:
+ 1. Knowledge Q&A — generate question-answer pairs from article text
+ 2. Summarisation — "Summarise this insurance concept"
+ 3. Regulation interpretation — "What does FCA say about X?"
+ 4. Definition — "Define {term} in insurance context"
+ 5. Scenario analysis — "Given {scenario}, what insurance considerations apply?"
+ 6. Comparison — "Compare {A} and {B} in insurance"
+ """
+
+ import json
+ import logging
+ import random
+ import re
+ import textwrap
+ from pathlib import Path
+
+ from collect.config import PROCESSED_DIR, SFT_OUTPUT, DPO_OUTPUT, RAW_DIR
+
+ logger = logging.getLogger(__name__)
+
+ # ── Templates ──────────────────────────────────────────────────────
+
+ QA_TEMPLATES = [
+     "What is {concept}?",
+     "Explain {concept} in the context of UK insurance.",
+     "How does {concept} work in insurance?",
+     "Define {concept} for an insurance professional.",
+     "What role does {concept} play in the insurance industry?",
+     "Describe {concept} and its importance in insurance.",
+     "As an insurance underwriter, explain {concept}.",
+     "What should a claims handler know about {concept}?",
+     "How is {concept} relevant to insurance regulation in the UK?",
+     "Explain {concept} as it applies to general insurance.",
+ ]
+
+ REGULATION_TEMPLATES = [
+     "What does the FCA require regarding {topic}?",
+     "Explain the regulatory requirements for {topic} in UK insurance.",
+     "How does {topic} affect insurance companies under UK regulation?",
+     "What compliance obligations exist for {topic}?",
+     "Summarise the key regulatory points about {topic}.",
+ ]
+
+ SCENARIO_TEMPLATES = [
+     "A policyholder has filed a claim for {scenario}. What are the key considerations?",
+     "An underwriter is assessing a risk involving {scenario}. What factors should they evaluate?",
+     "A broker needs to advise their client about {scenario}. What guidance should they give?",
+     "An insurance company is developing a product for {scenario}. What are the main considerations?",
+ ]
+
+ COMPARISON_TEMPLATES = [
+     "Compare and contrast {a} and {b} in insurance.",
+     "What are the differences between {a} and {b}?",
+     "When would an insurer choose {a} over {b}?",
+ ]
+
+
+ def _extract_first_paragraph(text: str, max_len: int = 800) -> str:
+     """Extract a clean first paragraph as a concise answer."""
+     paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
+     if not paragraphs:
+         return text[:max_len]
+     # Take first substantive paragraph
+     for p in paragraphs:
+         if len(p) > 50:
+             return p[:max_len]
+     return paragraphs[0][:max_len]
+
+
+ def _extract_key_concepts(text: str) -> list[str]:
+     """Extract key insurance concepts/terms from text."""
+     # Look for bold/capitalized terms, section headers, etc.
+     concepts = set()
+
+     # Find section headers (lines that look like headers)
+     for line in text.split("\n"):
+         line = line.strip()
+         if 3 < len(line) < 80 and not line.endswith("."):
+             if line[0].isupper() and not line.startswith("The "):
+                 # Could be a concept
+                 concepts.add(line.strip("=").strip("#").strip())
+
+     # Find insurance-specific noun phrases (simple heuristic)
+     insurance_terms = re.findall(
+         r'\b([A-Z][a-z]+(?: [A-Z][a-z]+){0,3})\b', text[:5000]
+     )
+     for term in insurance_terms:
+         if len(term) > 3 and any(kw in term.lower() for kw in [
+             "insurance", "reinsur", "claim", "underw", "polic",
+             "premium", "loss", "risk", "cover", "liabil",
+             "indemnit", "act", "regulation", "fca", "lloyd",
+         ]):
+             concepts.add(term)
+
+     return list(concepts)[:10]
+
+
+ def _make_sft_from_knowledge(doc: dict) -> list[dict]:
+     """Create SFT pairs from a knowledge article."""
+     pairs = []
+     title = doc.get("title", "")
+     text = doc.get("text", "")
+
+     if not text or len(text) < 100:
+         return pairs
+
+     # 1. Knowledge Q&A from title concept
+     if title and len(title) > 3:
+         concept = title.replace("_", " ")
+         question = random.choice(QA_TEMPLATES).format(concept=concept)
+
+         # Use the first paragraph (up to 1200 chars) as the answer
+         answer = _extract_first_paragraph(text, max_len=1200)
+         if len(answer) > 50:
+             pairs.append({
+                 "instruction": question,
+                 "response": answer,
+                 "source": doc.get("source", "unknown"),
+                 "category": "knowledge_qa",
+             })
+
+     # 2. Summarisation task
+     if len(text) > 500:
+         chunk = text[:3000]
+         pairs.append({
+             "instruction": f"Summarise the following insurance content:\n\n{chunk}",
+             "response": _extract_first_paragraph(text, max_len=600),
+             "source": doc.get("source", "unknown"),
+             "category": "summarisation",
+         })
+
+     # 3. Extract concepts and create Q&A for each
+     concepts = _extract_key_concepts(text)
+     for concept in concepts[:3]:
+         question = random.choice(QA_TEMPLATES).format(concept=concept)
+         # Find the paragraph that mentions this concept
+         for para in text.split("\n\n"):
+             if concept.lower() in para.lower() and len(para) > 50:
+                 pairs.append({
+                     "instruction": question,
+                     "response": para[:1200],
+                     "source": doc.get("source", "unknown"),
+                     "category": "concept_qa",
+                 })
+                 break
+
+     return pairs
+
+
+ def _make_sft_from_regulation(doc: dict) -> list[dict]:
+     """Create SFT pairs from regulatory documents."""
+     pairs = []
+     text = doc.get("text", "")
+     title = doc.get("title", "")
+     section = doc.get("section", "")
+
+     if not text or len(text) < 100:
+         return pairs
+
+     # Regulatory Q&A
+     topic = title or section
+     if topic:
+         question = random.choice(REGULATION_TEMPLATES).format(topic=topic)
+         answer = _extract_first_paragraph(text, max_len=1500)
+         if len(answer) > 50:
+             pairs.append({
+                 "instruction": question,
+                 "response": answer,
+                 "source": "regulation",
+                 "category": "regulation_qa",
+             })
+
+     # Section-by-section Q&A
+     sections = text.split("\n\n")
+     for i, section_text in enumerate(sections[:5]):
+         if len(section_text) > 100:
+             pairs.append({
+                 "instruction": f"Explain this insurance regulation provision:\n\n{section_text[:500]}",
+                 "response": section_text[:1500],
+                 "source": "regulation",
+                 "category": "regulation_explain",
+             })
+
+     return pairs
+
+
+ def _make_sft_from_legislation(doc: dict) -> list[dict]:
+     """Create SFT pairs from UK insurance legislation."""
+     pairs = []
+     text = doc.get("text", "")
+     act = doc.get("act", "")
+
+     if not text or len(text) < 100:
+         return pairs
+
+     pairs.append({
+         "instruction": (
+             f"Explain the following provision from UK insurance legislation "
+             f"({act}):\n\n{text[:1000]}"
+         ),
+         "response": text[:2000],
+         "source": "uk_legislation",
+         "category": "legislation_qa",
+     })
+
+     return pairs
+
+
+ def _make_sft_from_news(doc: dict) -> list[dict]:
+     """Create SFT pairs from insurance news articles."""
+     pairs = []
+     text = doc.get("text", "")
+     title = doc.get("title", "")
+
+     if not text or len(text) < 200:
+         return pairs
+
+     # Summarisation task
+     pairs.append({
+         "instruction": f"Summarise this insurance industry news article:\n\n{text[:2000]}",
+         "response": _extract_first_paragraph(text, max_len=800),
+         "source": "insurance_news",
+         "category": "news_summary",
+     })
+
+     # Analysis task
+     if title:
+         pairs.append({
+             "instruction": (
+                 f"As an insurance industry analyst, what are the key takeaways "
+                 f"from this article titled '{title}'?\n\n{text[:1500]}"
+             ),
+             "response": _extract_first_paragraph(text, max_len=1000),
+             "source": "insurance_news",
+             "category": "news_analysis",
+         })
+
+     return pairs
+
+
+ def _make_sft_from_hf(doc: dict) -> list[dict]:
+     """Create SFT pairs from HuggingFace dataset rows."""
+     pairs = []
+     row = doc.get("row", {})
+     text = doc.get("text", "")
+
+     if not text or len(text) < 50:
+         return pairs
+
+     # If it has question/answer fields, use directly
+     q = row.get("question", "")
+     a = row.get("answer", row.get("response", ""))
+     if q and a:
+         pairs.append({
+             "instruction": q,
+             "response": a[:2000],
+             "source": doc.get("dataset", "huggingface"),
+             "category": "hf_qa",
+         })
+         return pairs
+
+     # If it has instruction/output fields
+     inst = row.get("instruction", row.get("input", ""))
+     out = row.get("output", row.get("response", ""))
+     if inst and out:
+         pairs.append({
+             "instruction": inst,
+             "response": out[:2000],
+             "source": doc.get("dataset", "huggingface"),
+             "category": "hf_instruction",
+         })
+         return pairs
+
+     # If it has review/sentiment fields
+     review = row.get("review", row.get("text", ""))
+     sentiment = row.get("sentiment", row.get("label", ""))
+     if review and sentiment:
+         pairs.append({
+             "instruction": (
+                 f"Classify the sentiment of this insurance review "
+                 f"as positive, negative, or neutral:\n\n{review[:1000]}"
+             ),
+             "response": f"Sentiment: {sentiment}",
+             "source": doc.get("dataset", "huggingface"),
+             "category": "hf_sentiment",
+         })
+         return pairs
+
+     # Generic: use as knowledge
+     pairs.append({
+         "instruction": f"Explain the following insurance information:\n\n{text[:1000]}",
+         "response": text[:2000],
+         "source": doc.get("dataset", "huggingface"),
+         "category": "hf_knowledge",
+     })
+
+     return pairs
+
+
+ def _make_dpo_pair(sft_pair: dict) -> dict | None:
+     """Create a DPO preference pair from an SFT pair.
+
+     chosen = the good response (real data)
+     rejected = a degraded version (shorter, less specific, generic)
+     """
+     instruction = sft_pair["instruction"]
+     good_response = sft_pair["response"]
+
+     if len(good_response) < 100:
+         return None
+
+     # Create a degraded response (shorter, more generic)
+     bad_strategies = [
+         # Strategy 1: Truncate to first sentence
+         lambda r: r.split(".")[0] + "." if "." in r else r[:50],
+         # Strategy 2: Generic non-answer
+         lambda r: "This is a complex insurance topic that requires careful consideration of many factors.",
+         # Strategy 3: Partial answer (first 20%)
+         lambda r: r[:max(50, len(r) // 5)],
+         # Strategy 4: Wrong focus
+         lambda r: f"While this is an important topic, the key thing to remember is that insurance is about managing risk. {r[:100]}",
+     ]
+
+     bad_response = random.choice(bad_strategies)(good_response)
+
+     return {
+         "instruction": instruction,
+         "chosen": good_response,
+         "rejected": bad_response,
+         "source": sft_pair.get("source", "unknown"),
+     }
+
+
+ def convert_all_to_sft(raw_dir: Path = RAW_DIR) -> tuple[int, int]:
+     """Convert all collected raw documents to SFT and DPO format."""
+     all_sft = []
+     all_dpo = []
+
+     # Processing map: source_type -> converter function
+     converters = {
+         "wikipedia": _make_sft_from_knowledge,
+         "fca_handbook": _make_sft_from_regulation,
+         "uk_legislation": _make_sft_from_legislation,
+         "investopedia": _make_sft_from_knowledge,
+         "insurance_news": _make_sft_from_news,
+         "insurance_news_summary": _make_sft_from_news,
+         "huggingface": _make_sft_from_hf,
+         "exam_syllabus": _make_sft_from_knowledge,
+         "insurance_education": _make_sft_from_knowledge,
+         "insurance_data": _make_sft_from_hf,
+     }
+
+     # Scan all JSONL files in raw subdirectories
+     for source_dir in raw_dir.iterdir():
+         if not source_dir.is_dir():
+             continue
+         for jsonl_file in source_dir.glob("*.jsonl"):
+             logger.info(f"Converting {jsonl_file}...")
+             with open(jsonl_file) as f:
+                 for line in f:
+                     try:
+                         doc = json.loads(line)
+                     except json.JSONDecodeError:
+                         continue
+
+                     source = doc.get("source", "")
+                     converter = converters.get(source, _make_sft_from_knowledge)
+                     sft_pairs = converter(doc)
+
+                     for pair in sft_pairs:
+                         all_sft.append(pair)
+                         # 30% chance of creating DPO pair
+                         if random.random() < 0.3:
+                             dpo = _make_dpo_pair(pair)
+                             if dpo:
+                                 all_dpo.append(dpo)
+
+     # Shuffle
+     random.shuffle(all_sft)
+     random.shuffle(all_dpo)
+
+     # Write SFT
+     with open(SFT_OUTPUT, "w") as f:
+         for pair in all_sft:
+             # Format as chat for Qwen3
+             chat = {
+                 "messages": [
+                     {"role": "system", "content": "You are InsureLLM, an expert UK insurance AI assistant. You provide accurate, detailed, and regulation-aware answers about insurance, underwriting, claims, actuarial science, and UK/EU insurance regulation."},
+                     {"role": "user", "content": pair["instruction"]},
+                     {"role": "assistant", "content": pair["response"]},
+                 ]
+             }
+             f.write(json.dumps(chat, ensure_ascii=False) + "\n")
+
+     # Write DPO
+     with open(DPO_OUTPUT, "w") as f:
+         for pair in all_dpo:
+             dpo_row = {
+                 "prompt": pair["instruction"],
+                 "chosen": pair["chosen"],
+                 "rejected": pair["rejected"],
+             }
+             f.write(json.dumps(dpo_row, ensure_ascii=False) + "\n")
+
+     logger.info(f"SFT: {len(all_sft)} pairs → {SFT_OUTPUT}")
+     logger.info(f"DPO: {len(all_dpo)} pairs → {DPO_OUTPUT}")
+
+     return len(all_sft), len(all_dpo)
+
+
+ if __name__ == "__main__":
+     logging.basicConfig(level=logging.INFO)
+     sft_count, dpo_count = convert_all_to_sft()
+     print(f"Created {sft_count} SFT pairs and {dpo_count} DPO pairs")
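
For orientation, each line that `convert_all_to_sft` writes to `SFT_OUTPUT` is a chat-format record shaped like the following Python literal (values here are illustrative; the structure matches the writer loop above):

```python
example_row = {
    "messages": [
        {"role": "system", "content": "You are InsureLLM, an expert UK insurance AI assistant. ..."},
        {"role": "user", "content": "Explain subrogation in the context of UK insurance."},
        {"role": "assistant", "content": "Subrogation is the insurer's right to pursue a third party ..."},
    ]
}
```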
collect/run_collection.py ADDED
@@ -0,0 +1,128 @@
+ """Master orchestrator for all data collection sources."""
+
+ import json
+ import logging
+ import sys
+ import time
+ from pathlib import Path
+
+ # Setup logging
+ logging.basicConfig(
+     level=logging.INFO,
+     format="%(asctime)s [%(levelname)s] %(message)s",
+     datefmt="%H:%M:%S",
+ )
+ logger = logging.getLogger(__name__)
+
+
+ def run_collection():
+     """Run all data collection sources."""
+     start = time.time()
+     total_docs = 0
+
+     # ── 1. Wikipedia ───────────────────────────────────────────────
+     logger.info("=" * 60)
+     logger.info("1/7 WIKIPEDIA — Insurance articles")
+     logger.info("=" * 60)
+     try:
+         from collect.sources.wikipedia import collect_wikipedia
+         docs = collect_wikipedia(max_articles=400)
+         total_docs += len(docs)
+         logger.info(f"  ✓ Wikipedia: {len(docs)} documents")
+     except Exception as e:
+         logger.error(f"  ✗ Wikipedia failed: {e}")
+
+     # ── 2. FCA Handbook ────────────────────────────────────────────
+     logger.info("=" * 60)
+     logger.info("2/7 FCA HANDBOOK — UK insurance regulation")
+     logger.info("=" * 60)
+     try:
+         from collect.sources.fca import collect_fca
+         docs = collect_fca()
+         total_docs += len(docs)
+         logger.info(f"  ✓ FCA: {len(docs)} documents")
+     except Exception as e:
+         logger.error(f"  ✗ FCA failed: {e}")
+
+     # ── 3. UK Legislation ──────────────────────────────────────────
+     logger.info("=" * 60)
+     logger.info("3/7 UK LEGISLATION — Insurance Act 2015 etc.")
+     logger.info("=" * 60)
+     try:
+         from collect.sources.legislation import collect_legislation
+         docs = collect_legislation()
+         total_docs += len(docs)
+         logger.info(f"  ✓ Legislation: {len(docs)} documents")
+     except Exception as e:
+         logger.error(f"  ✗ Legislation failed: {e}")
+
+     # ── 4. Investopedia ────────────────────────────────────────────
+     logger.info("=" * 60)
+     logger.info("4/7 INVESTOPEDIA — Insurance glossary")
+     logger.info("=" * 60)
+     try:
+         from collect.sources.investopedia import collect_investopedia
+         docs = collect_investopedia()
+         total_docs += len(docs)
+         logger.info(f"  ✓ Investopedia: {len(docs)} documents")
+     except Exception as e:
+         logger.error(f"  ✗ Investopedia failed: {e}")
+
+     # ── 5. HuggingFace ─────────────────────────────────────────────
+     logger.info("=" * 60)
+     logger.info("5/7 HUGGINGFACE — Insurance datasets")
+     logger.info("=" * 60)
+     try:
+         from collect.sources.hf_datasets import collect_huggingface
+         docs = collect_huggingface()
+         total_docs += len(docs)
+         logger.info(f"  ✓ HuggingFace: {len(docs)} documents")
+     except Exception as e:
+         logger.error(f"  ✗ HuggingFace failed: {e}")
+
+     # ── 6. RSS / News ──────────────────────────────────────────────
+     logger.info("=" * 60)
+     logger.info("6/7 RSS NEWS — Insurance industry news")
+     logger.info("=" * 60)
+     try:
+         from collect.sources.rss_news import collect_rss
+         docs = collect_rss()
+         total_docs += len(docs)
+         logger.info(f"  ✓ RSS: {len(docs)} documents")
+     except Exception as e:
+         logger.error(f"  ✗ RSS failed: {e}")
+
+     # ── 7. Education ───────────────────────────────────────────────
+     logger.info("=" * 60)
+     logger.info("7/7 EDUCATION — Open textbooks & exam content")
+     logger.info("=" * 60)
+     try:
+         from collect.sources.education import collect_education
+         docs = collect_education()
+         total_docs += len(docs)
+         logger.info(f"  ✓ Education: {len(docs)} documents")
+     except Exception as e:
+         logger.error(f"  ✗ Education failed: {e}")
+
+     # ── Convert to SFT ─────────────────────────────────────────────
+     logger.info("=" * 60)
+     logger.info("CONVERTING collected data → SFT + DPO training format")
+     logger.info("=" * 60)
+     try:
+         from collect.convert_sft import convert_all_to_sft
+         sft_count, dpo_count = convert_all_to_sft()
+         logger.info(f"  ✓ SFT pairs: {sft_count}")
+         logger.info(f"  ✓ DPO pairs: {dpo_count}")
+     except Exception as e:
+         logger.error(f"  ✗ SFT conversion failed: {e}")
+
+     elapsed = time.time() - start
+     logger.info("=" * 60)
+     logger.info("COLLECTION COMPLETE")
+     logger.info(f"  Total documents: {total_docs:,}")
+     logger.info(f"  Time elapsed: {elapsed / 60:.1f} minutes")
+     logger.info("=" * 60)
+
+
+ if __name__ == "__main__":
+     run_collection()
collect/run_fast.py ADDED
@@ -0,0 +1,131 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
"""Fast data collection -- reduced Wikipedia cap, lower API delay."""

import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    datefmt="%H:%M:%S",
)
logger = logging.getLogger(__name__)

# Override delay for API sources (the Wikipedia API is generous).
# BaseScraper reads collect.config.REQUEST_DELAY at call time, so setting
# the module attribute here takes effect for every subsequent request.
import collect.config as cfg
cfg.REQUEST_DELAY = 0.5


def run_fast():
    start = time.time()
    total_docs = 0

    # 1. Wikipedia (cap at 150 -- still 2M+ chars of insurance knowledge)
    logger.info("=" * 60)
    logger.info("1/7 WIKIPEDIA -- Insurance articles (max 150)")
    logger.info("=" * 60)
    try:
        from collect.sources.wikipedia import collect_wikipedia
        docs = collect_wikipedia(max_articles=150)
        total_docs += len(docs)
        logger.info(f" => Wikipedia: {len(docs)} documents")
    except Exception as e:
        logger.error(f" Wikipedia failed: {e}", exc_info=True)

    # 2. FCA Handbook
    cfg.REQUEST_DELAY = 1.5  # Web scraping -- be polite
    logger.info("=" * 60)
    logger.info("2/7 FCA HANDBOOK")
    logger.info("=" * 60)
    try:
        from collect.sources.fca import collect_fca
        docs = collect_fca()
        total_docs += len(docs)
        logger.info(f" => FCA: {len(docs)} documents")
    except Exception as e:
        logger.error(f" FCA failed: {e}", exc_info=True)

    # 3. UK Legislation
    logger.info("=" * 60)
    logger.info("3/7 UK LEGISLATION")
    logger.info("=" * 60)
    try:
        from collect.sources.legislation import collect_legislation
        docs = collect_legislation()
        total_docs += len(docs)
        logger.info(f" => Legislation: {len(docs)} documents")
    except Exception as e:
        logger.error(f" Legislation failed: {e}", exc_info=True)

    # 4. Investopedia
    logger.info("=" * 60)
    logger.info("4/7 INVESTOPEDIA")
    logger.info("=" * 60)
    try:
        from collect.sources.investopedia import collect_investopedia
        docs = collect_investopedia()
        total_docs += len(docs)
        logger.info(f" => Investopedia: {len(docs)} documents")
    except Exception as e:
        logger.error(f" Investopedia failed: {e}", exc_info=True)

    # 5. HuggingFace
    cfg.REQUEST_DELAY = 0.3
    logger.info("=" * 60)
    logger.info("5/7 HUGGINGFACE DATASETS")
    logger.info("=" * 60)
    try:
        from collect.sources.hf_datasets import collect_huggingface
        docs = collect_huggingface()
        total_docs += len(docs)
        logger.info(f" => HuggingFace: {len(docs)} documents")
    except Exception as e:
        logger.error(f" HuggingFace failed: {e}", exc_info=True)

    # 6. RSS News
    cfg.REQUEST_DELAY = 1.0
    logger.info("=" * 60)
    logger.info("6/7 RSS NEWS")
    logger.info("=" * 60)
    try:
        from collect.sources.rss_news import collect_rss
        docs = collect_rss()
        total_docs += len(docs)
        logger.info(f" => RSS: {len(docs)} documents")
    except Exception as e:
        logger.error(f" RSS failed: {e}", exc_info=True)

    # 7. Education
    logger.info("=" * 60)
    logger.info("7/7 EDUCATION")
    logger.info("=" * 60)
    try:
        from collect.sources.education import collect_education
        docs = collect_education()
        total_docs += len(docs)
        logger.info(f" => Education: {len(docs)} documents")
    except Exception as e:
        logger.error(f" Education failed: {e}", exc_info=True)

    # Convert to SFT + DPO
    logger.info("=" * 60)
    logger.info("CONVERTING -> SFT + DPO format")
    logger.info("=" * 60)
    try:
        from collect.convert_sft import convert_all_to_sft
        sft_count, dpo_count = convert_all_to_sft()
        logger.info(f" => SFT pairs: {sft_count}")
        logger.info(f" => DPO pairs: {dpo_count}")
    except Exception as e:
        logger.error(f" SFT conversion failed: {e}", exc_info=True)

    elapsed = time.time() - start
    logger.info("=" * 60)
    logger.info(f"DONE -- {total_docs:,} documents in {elapsed / 60:.1f} min")
    logger.info("=" * 60)


if __name__ == "__main__":
    run_fast()
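For reference, the fast pipeline can also be driven programmatically; a minimal sketch, assuming the repository root is on `sys.path` so the `collect` package imports resolve:

```python
# Minimal sketch: invoke the fast collector from Python instead of the CLI.
from collect.run_fast import run_fast

run_fast()  # logs per-source progress; raw documents land under collect/raw/
```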
collect/scraper_base.py ADDED
@@ -0,0 +1,102 @@
"""Base scraper with rate limiting, retries, and polite crawling."""

import time
import json
import hashlib
import logging
from pathlib import Path
from typing import Optional

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

import collect.config as config
from collect.config import HEADERS, MAX_RETRIES, TIMEOUT, RAW_DIR

logger = logging.getLogger(__name__)


class BaseScraper:
    """Polite web scraper with rate limiting and caching."""

    def __init__(self, source_name: str):
        self.source_name = source_name
        self.output_dir = RAW_DIR / source_name
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.session = self._build_session()
        self._last_request_time = 0.0
        self.stats = {"fetched": 0, "cached": 0, "failed": 0, "total_chars": 0}

    def _build_session(self) -> requests.Session:
        session = requests.Session()
        session.headers.update(HEADERS)
        retry = Retry(
            total=MAX_RETRIES,
            backoff_factor=1.0,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["GET"],
        )
        adapter = HTTPAdapter(max_retries=retry)
        session.mount("https://", adapter)
        session.mount("http://", adapter)
        return session

    def _rate_limit(self):
        # Read the delay from the config module at call time so callers
        # (e.g. run_fast.py) can retune collect.config.REQUEST_DELAY per source.
        elapsed = time.time() - self._last_request_time
        if elapsed < config.REQUEST_DELAY:
            time.sleep(config.REQUEST_DELAY - elapsed)
        self._last_request_time = time.time()

    def _cache_key(self, url: str) -> str:
        return hashlib.sha256(url.encode()).hexdigest()[:16]

    def _cache_path(self, url: str) -> Path:
        return self.output_dir / f"{self._cache_key(url)}.json"

    def fetch(self, url: str, force: bool = False) -> Optional[str]:
        """Fetch URL content with caching and rate limiting."""
        cache = self._cache_path(url)
        if not force and cache.exists():
            data = json.loads(cache.read_text())
            self.stats["cached"] += 1
            return data.get("content")

        self._rate_limit()
        try:
            resp = self.session.get(url, timeout=TIMEOUT)
            resp.raise_for_status()
            content = resp.text
            # Cache the result (UTC timestamp, as the "Z" suffix claims)
            cache.write_text(json.dumps({
                "url": url,
                "status": resp.status_code,
                "content": content,
                "fetched_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            }))
            self.stats["fetched"] += 1
            self.stats["total_chars"] += len(content)
            return content
        except Exception as e:
            logger.warning(f"[{self.source_name}] Failed to fetch {url}: {e}")
            self.stats["failed"] += 1
            return None

    def save_documents(self, documents: list[dict], filename: str = "documents.jsonl"):
        """Save collected documents as JSONL."""
        out = self.output_dir / filename
        with open(out, "w", encoding="utf-8") as f:
            for doc in documents:
                f.write(json.dumps(doc, ensure_ascii=False) + "\n")
        logger.info(f"[{self.source_name}] Saved {len(documents)} docs -> {out}")
        return out

    def print_stats(self):
        logger.info(
            f"[{self.source_name}] Stats: "
            f"fetched={self.stats['fetched']}, "
            f"cached={self.stats['cached']}, "
            f"failed={self.stats['failed']}, "
            f"chars={self.stats['total_chars']:,}"
        )
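A new source only needs to subclass `BaseScraper` and call `fetch()`; a minimal sketch (the class, source name, and target URL are illustrative, not part of the repo):

```python
from collect.scraper_base import BaseScraper


class GlossaryScraper(BaseScraper):
    """Hypothetical example source built on the base class."""

    def __init__(self):
        # Creates collect/raw/example_glossary/ and a stats bucket
        super().__init__("example_glossary")

    def collect(self) -> list[dict]:
        # fetch() is rate-limited, retried, and cached on disk
        html = self.fetch("https://example.com/insurance-glossary")
        docs = []
        if html:
            docs.append({"title": "Example glossary", "text": html,
                         "source": "example_glossary",
                         "category": "insurance_education"})
        self.save_documents(docs)
        self.print_stats()
        return docs
```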
collect/sources/__init__.py ADDED
@@ -0,0 +1,108 @@
"""Scrape insurance articles from Wikipedia via the public MediaWiki API."""

import json
import logging
from typing import Optional

from collect.scraper_base import BaseScraper
from collect.config import WIKIPEDIA_SEED_ARTICLES

logger = logging.getLogger(__name__)

API = "https://en.wikipedia.org/w/api.php"


class WikipediaScraper(BaseScraper):
    def __init__(self):
        super().__init__("wikipedia")

    def _get_article_text(self, title: str) -> Optional[dict]:
        """Get plain-text extract of a Wikipedia article via API."""
        url = (
            f"{API}?action=query&titles={title}"
            f"&prop=extracts&explaintext=1&exsectionformat=plain"
            f"&format=json&redirects=1"
        )
        raw = self.fetch(url)
        if not raw:
            return None
        data = json.loads(raw)
        pages = data.get("query", {}).get("pages", {})
        for pid, page in pages.items():
            if pid == "-1":
                return None
            text = page.get("extract", "")
            if len(text) < 200:
                return None
            return {
                "title": page.get("title", title),
                "text": text,
                "source": "wikipedia",
                "url": f"https://en.wikipedia.org/wiki/{title}",
                "category": "insurance_knowledge",
            }
        return None

    def _get_linked_articles(self, title: str, limit: int = 20) -> list[str]:
        """Get insurance-related links from an article."""
        url = (
            f"{API}?action=query&titles={title}"
            f"&prop=links&pllimit={limit}&plnamespace=0"
            f"&format=json&redirects=1"
        )
        raw = self.fetch(url)
        if not raw:
            return []
        data = json.loads(raw)
        pages = data.get("query", {}).get("pages", {})
        links = []
        insurance_keywords = {
            "insurance", "insur", "underw", "claim", "polic",
            "premium", "actuar", "reinsur", "liabil", "indemnit",
            "risk", "loss", "peril", "cover", "broker",
            "lloyd", "solvency", "fca", "pra", "regul",
        }
        for page in pages.values():
            for link in page.get("links", []):
                link_title = link.get("title", "")
                lower = link_title.lower()
                if any(kw in lower for kw in insurance_keywords):
                    links.append(link_title.replace(" ", "_"))
        return links

    def collect(self, max_articles: int = 500) -> list[dict]:
        """Collect Wikipedia insurance articles with link expansion."""
        documents = []
        visited = set()
        queue = list(WIKIPEDIA_SEED_ARTICLES)

        while queue and len(documents) < max_articles:
            title = queue.pop(0)
            if title in visited:
                continue
            visited.add(title)

            doc = self._get_article_text(title)
            if doc:
                documents.append(doc)
                logger.info(
                    f" [{len(documents)}/{max_articles}] {doc['title']} "
                    f"({len(doc['text']):,} chars)"
                )
                # Expand links from this article
                if len(documents) < max_articles:
                    new_links = self._get_linked_articles(title)
                    for link in new_links:
                        if link not in visited:
                            queue.append(link)

        self.save_documents(documents)
        self.print_stats()
        return documents


def collect_wikipedia(max_articles: int = 500) -> list[dict]:
    scraper = WikipediaScraper()
    return scraper.collect(max_articles)
collect/sources/education.py ADDED
@@ -0,0 +1,167 @@
"""Collect open insurance educational content and textbook excerpts."""

import logging
import re

from bs4 import BeautifulSoup

from collect.scraper_base import BaseScraper

logger = logging.getLogger(__name__)

# Public insurance educational resources
EDUCATIONAL_URLS = [
    # IRMI (International Risk Management Institute) - public glossary
    ("https://www.irmi.com/term/insurance-definitions", "irmi_glossary"),
    # III (Insurance Information Institute) - public fact sheets
    ("https://www.iii.org/insurance-topics", "iii_topics"),
    # CAS (Casualty Actuarial Society) - public resources
    ("https://www.casact.org/publications-research", "cas_research"),
    # SOA (Society of Actuaries) - public resources
    ("https://www.soa.org/resources/research-reports/", "soa_research"),
    # UK ABI (Association of British Insurers) - public resources
    ("https://www.abi.org.uk/data-and-resources/", "abi_data"),
    # Lloyd's - public market info
    ("https://www.lloyds.com/about-lloyds", "lloyds_about"),
    # Swiss Re - sigma reports (public abstracts)
    ("https://www.swissre.com/institute/research/sigma-research.html", "swissre_sigma"),
]

# Open actuarial textbook content (Loss Data Analytics - open source)
OPEN_TEXTBOOK_CHAPTERS = [
    "https://openacttexts.github.io/Loss-Data-Analytics/",
]

# Insurance exam prep - public syllabus material
EXAM_TOPICS = {
    "CII_IF1": [
        "Principles of insurance", "Insurance contract law",
        "Types of insurance", "The insurance market",
        "Insurance regulation in the UK", "Claims handling",
        "Underwriting principles", "Reinsurance basics",
        "Risk management fundamentals", "Insurance intermediaries",
    ],
    "CII_IF2": [
        "General insurance business", "Property insurance",
        "Liability insurance", "Motor insurance",
        "Marine aviation and transport insurance",
        "Financial lines insurance", "Specialty insurance",
    ],
    "IFoA_CP1": [
        "Actuarial risk management", "Insurance pricing models",
        "Generalized linear models in insurance",
        "Loss reserving methods", "Chain ladder technique",
        "Bornhuetter-Ferguson method", "Solvency II capital modelling",
        "Risk measures and capital requirements",
    ],
}


class EducationCollector(BaseScraper):
    def __init__(self):
        super().__init__("education")

    def _scrape_page(self, url: str, source_name: str) -> list[dict]:
        """Scrape educational content from a page and its links."""
        html = self.fetch(url)
        if not html:
            return []

        soup = BeautifulSoup(html, "html.parser")
        for tag in soup.find_all(["nav", "footer", "script", "style", "aside"]):
            tag.decompose()

        documents = []

        # Get main page content
        main = soup.find("main") or soup.find("article") or soup.find("body")
        if main:
            text = main.get_text(separator="\n", strip=True)
            text = re.sub(r"\n{3,}", "\n\n", text)
            if len(text) > 200:
                title = ""
                h1 = soup.find("h1")
                if h1:
                    title = h1.get_text(strip=True)
                documents.append({
                    "title": title or source_name,
                    "text": text[:30000],
                    "source": source_name,
                    "url": url,
                    "category": "insurance_education",
                })

        # Follow internal links
        base_domain = "/".join(url.split("/")[:3])
        for a in soup.find_all("a", href=True)[:20]:
            href = a["href"]
            if href.startswith("/"):
                href = base_domain + href
            if href.startswith(base_domain) and href != url:
                sub_html = self.fetch(href)
                if sub_html:
                    sub_soup = BeautifulSoup(sub_html, "html.parser")
                    for tag in sub_soup.find_all(["nav", "footer", "script", "style"]):
                        tag.decompose()
                    sub_main = sub_soup.find("main") or sub_soup.find("article")
                    if sub_main:
                        sub_text = sub_main.get_text(separator="\n", strip=True)
                        if len(sub_text) > 200:
                            sub_title = ""
                            h1 = sub_soup.find("h1")
                            if h1:
                                sub_title = h1.get_text(strip=True)
                            documents.append({
                                "title": sub_title or href.split("/")[-1],
                                "text": sub_text[:20000],
                                "source": source_name,
                                "url": href,
                                "category": "insurance_education",
                            })

        return documents

    def _generate_exam_knowledge(self) -> list[dict]:
        """Generate knowledge documents from insurance exam topics."""
        documents = []
        for exam, topics in EXAM_TOPICS.items():
            for topic in topics:
                # Create a structured knowledge entry
                documents.append({
                    "title": f"{exam}: {topic}",
                    "text": f"Insurance Exam Topic: {topic}\n"
                            f"Exam: {exam}\n"
                            f"This topic covers the key concepts, principles, "
                            f"and practical applications of {topic.lower()} "
                            f"in the context of UK insurance practice.",
                    "source": "exam_syllabus",
                    "category": "insurance_education",
                    "exam": exam,
                })
        return documents

    def collect(self) -> list[dict]:
        """Collect educational insurance content."""
        documents = []

        for url, source_name in EDUCATIONAL_URLS:
            logger.info(f" Scraping education source: {source_name}")
            docs = self._scrape_page(url, source_name)
            documents.extend(docs)
            logger.info(f" Got {len(docs)} documents")

        # Exam knowledge
        exam_docs = self._generate_exam_knowledge()
        documents.extend(exam_docs)
        logger.info(f" Generated {len(exam_docs)} exam topic entries")

        self.save_documents(documents)
        self.print_stats()
        return documents


def collect_education() -> list[dict]:
    collector = EducationCollector()
    return collector.collect()
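For reference, each syllabus topic above becomes one synthetic knowledge entry; the shape of a generated document, with values abridged from the code above:

```python
# Shape of one entry produced by _generate_exam_knowledge():
entry = {
    "title": "CII_IF1: Principles of insurance",
    "text": "Insurance Exam Topic: Principles of insurance\nExam: CII_IF1\n...",
    "source": "exam_syllabus",
    "category": "insurance_education",
    "exam": "CII_IF1",
}
```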
collect/sources/fca.py ADDED
@@ -0,0 +1,100 @@
"""Scrape FCA Handbook sections relevant to insurance."""

import logging
import re
from typing import Optional

from bs4 import BeautifulSoup

from collect.scraper_base import BaseScraper
from collect.config import FCA_HANDBOOK_SECTIONS, FCA_BASE_URL

logger = logging.getLogger(__name__)


class FCAHandbookScraper(BaseScraper):
    def __init__(self):
        super().__init__("fca_handbook")

    def _scrape_section_index(self, section: str, cap: int = 50) -> list[str]:
        """Get chapter URLs from a handbook section index page."""
        url = f"{FCA_BASE_URL}/{section}"
        html = self.fetch(url)
        if not html:
            return []
        soup = BeautifulSoup(html, "html.parser")
        links = []
        for a in soup.find_all("a", href=True):
            href = a["href"]
            # Match chapter links like /ICOBS/1 or /ICOBS/1/1
            if re.match(rf"/{section}/\d+", href):
                full_url = f"{FCA_BASE_URL}{href}"
                if full_url not in links:
                    links.append(full_url)
        return links[:cap]  # Cap per section

    def _scrape_chapter(self, url: str, section: str) -> Optional[dict]:
        """Scrape text content from a handbook chapter page."""
        html = self.fetch(url)
        if not html:
            return None
        soup = BeautifulSoup(html, "html.parser")

        # Remove nav, footer, scripts
        for tag in soup.find_all(["nav", "footer", "script", "style", "aside"]):
            tag.decompose()

        # Get main content area
        content = soup.find("main") or soup.find("div", class_="handbook-content")
        if not content:
            content = soup.find("body")
        if not content:
            return None

        text = content.get_text(separator="\n", strip=True)
        # Clean up excessive whitespace
        text = re.sub(r"\n{3,}", "\n\n", text)
        text = re.sub(r"[ \t]+", " ", text)

        if len(text) < 100:
            return None

        title_tag = soup.find("h1") or soup.find("title")
        title = title_tag.get_text(strip=True) if title_tag else url.split("/")[-1]

        return {
            "title": f"FCA Handbook - {section} - {title}",
            "text": text[:50000],  # Cap at 50k chars per page
            "source": "fca_handbook",
            "url": url,
            "section": section,
            "category": "regulation",
        }

    def collect(self, max_per_section: int = 50) -> list[dict]:
        """Collect FCA handbook content for insurance-related sections."""
        documents = []

        for section in FCA_HANDBOOK_SECTIONS:
            logger.info(f" Scraping FCA section: {section}")
            chapter_urls = self._scrape_section_index(section, cap=max_per_section)
            logger.info(f" Found {len(chapter_urls)} chapters")

            for url in chapter_urls:
                doc = self._scrape_chapter(url, section)
                if doc:
                    documents.append(doc)
                    logger.info(
                        f" [{len(documents)}] {doc['title'][:60]} "
                        f"({len(doc['text']):,} chars)"
                    )

        self.save_documents(documents)
        self.print_stats()
        return documents


def collect_fca(max_per_section: int = 50) -> list[dict]:
    scraper = FCAHandbookScraper()
    return scraper.collect(max_per_section)
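The chapter filter hinges on the `re.match` pattern above; a quick check of what it keeps and drops:

```python
import re

section = "ICOBS"
# Kept: chapter paths such as /ICOBS/1 or /ICOBS/1/1
assert re.match(rf"/{section}/\d+", "/ICOBS/1/1")
# Dropped: non-chapter links under the same section
assert re.match(rf"/{section}/\d+", "/ICOBS/about") is None
```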
collect/sources/hf_datasets.py ADDED
@@ -0,0 +1,133 @@
"""Download and process insurance datasets from HuggingFace."""

import json
import logging
from typing import Optional

from collect.scraper_base import BaseScraper
from collect.config import HF_DATASETS

logger = logging.getLogger(__name__)


class HuggingFaceCollector(BaseScraper):
    def __init__(self):
        super().__init__("huggingface")

    def _download_dataset(self, name: str, config: Optional[str]) -> list[dict]:
        """Download a HuggingFace dataset and convert to our format."""
        try:
            from datasets import load_dataset
        except ImportError:
            logger.error("Install `datasets`: pip install datasets")
            return []

        documents = []
        try:
            logger.info(f" Downloading HF dataset: {name}")
            kwargs = {"trust_remote_code": True}
            if config:
                kwargs["name"] = config
            ds = load_dataset(name, **kwargs)

            # Process each split
            for split_name, split_data in ds.items():
                logger.info(f" Split '{split_name}': {len(split_data)} rows")
                for i, row in enumerate(split_data):
                    doc = self._row_to_document(row, name, split_name, i)
                    if doc:
                        documents.append(doc)

        except Exception as e:
            logger.warning(f" Failed to load {name}: {e}")

        return documents

    def _row_to_document(self, row: dict, dataset_name: str,
                         split: str, idx: int) -> Optional[dict]:
        """Convert a dataset row to a document dict."""
        # Try common text field names
        text_fields = ["text", "content", "question", "answer", "document",
                       "input", "output", "instruction", "response",
                       "review", "comment", "body", "description"]

        texts = []
        for field in text_fields:
            if field in row and row[field] and isinstance(row[field], str):
                texts.append(f"{field}: {row[field]}")

        # Also grab any other string fields
        for k, v in row.items():
            if isinstance(v, str) and k not in text_fields and len(v) > 20:
                texts.append(f"{k}: {v}")

        if not texts:
            return None

        combined = "\n".join(texts)
        if len(combined) < 50:
            return None

        return {
            "title": f"{dataset_name}/{split}/{idx}",
            "text": combined[:30000],
            "source": "huggingface",
            "dataset": dataset_name,
            "split": split,
            "category": "insurance_data",
            "row": {k: str(v)[:500] for k, v in row.items()
                    if isinstance(v, (str, int, float))},
        }

    def collect(self) -> list[dict]:
        """Download all configured HuggingFace insurance datasets."""
        all_documents = []

        for name, config in HF_DATASETS:
            docs = self._download_dataset(name, config)
            all_documents.extend(docs)
            logger.info(f" Collected {len(docs)} docs from {name}")

        self.save_documents(all_documents)
        self.print_stats()
        return all_documents


# Also search HuggingFace for more insurance datasets
class HuggingFaceSearcher(BaseScraper):
    """Search HuggingFace Hub API for insurance-tagged datasets."""

    def __init__(self):
        super().__init__("hf_search")

    def search_datasets(self, query: str = "insurance", limit: int = 50) -> list[str]:
        """Search HuggingFace Hub for insurance datasets."""
        url = (
            f"https://huggingface.co/api/datasets"
            f"?search={query}&limit={limit}&sort=downloads&direction=-1"
        )
        raw = self.fetch(url)
        if not raw:
            return []
        try:
            results = json.loads(raw)
            names = [r["id"] for r in results if isinstance(r, dict) and "id" in r]
            logger.info(f" Found {len(names)} HF datasets for '{query}'")
            return names
        except Exception as e:
            logger.warning(f" HF search failed: {e}")
            return []


def collect_huggingface() -> list[dict]:
    collector = HuggingFaceCollector()
    return collector.collect()


def search_hf_datasets() -> list[str]:
    searcher = HuggingFaceSearcher()
    found = []
    for q in ["insurance", "insurance claims", "actuarial",
              "insurance underwriting", "insurance fraud"]:
        found.extend(searcher.search_datasets(q))
    return list(set(found))
@@ -0,0 +1,102 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Scrape Investopedia insurance glossary terms."""
2
+
3
+ import json
4
+ import logging
5
+ import re
6
+ from typing import Optional
7
+
8
+ from bs4 import BeautifulSoup
9
+
10
+ from collect.scraper_base import BaseScraper
11
+ from collect.config import INVESTOPEDIA_TERMS
12
+
13
+ logger = logging.getLogger(__name__)
14
+
15
+ BASE = "https://www.investopedia.com/terms"
16
+
17
+
18
+ class InvestopediaScraper(BaseScraper):
19
+ def __init__(self):
20
+ super().__init__("investopedia")
21
+
22
+ def _scrape_term(self, term: str) -> Optional[dict]:
23
+ """Scrape a single Investopedia insurance term."""
24
+ # Investopedia URL pattern: /terms/{first_letter}/{term}.asp
25
+ # OR newer: /terms/{term}-{number}
26
+ # Try the common patterns
27
+ first_letter = term[0].lower()
28
+ urls_to_try = [
29
+ f"https://www.investopedia.com/{term}-5075091",
30
+ f"https://www.investopedia.com/{term}-definition-5075091",
31
+ f"https://www.investopedia.com/terms/{first_letter}/{term}.asp",
32
+ f"https://www.investopedia.com/terms/{first_letter}/{term}",
33
+ ]
34
+
35
+ for url in urls_to_try:
36
+ html = self.fetch(url)
37
+ if not html or "404" in html[:500]:
38
+ continue
39
+
40
+ soup = BeautifulSoup(html, "html.parser")
41
+
42
+ # Remove ads, nav, etc
43
+ for tag in soup.find_all(["nav", "footer", "script", "style",
44
+ "aside", "header", "figure"]):
45
+ tag.decompose()
46
+
47
+ # Get article content
48
+ article = soup.find("article") or soup.find("div", class_="article-body")
49
+ if not article:
50
+ article = soup.find("main")
51
+ if not article:
52
+ continue
53
+
54
+ text = article.get_text(separator="\n", strip=True)
55
+ text = re.sub(r"\n{3,}", "\n\n", text)
56
+
57
+ if len(text) < 200:
58
+ continue
59
+
60
+ title = ""
61
+ h1 = soup.find("h1")
62
+ if h1:
63
+ title = h1.get_text(strip=True)
64
+ if not title:
65
+ title = term.replace("-", " ").title()
66
+
67
+ return {
68
+ "title": title,
69
+ "text": text[:20000],
70
+ "source": "investopedia",
71
+ "url": url,
72
+ "term": term,
73
+ "category": "insurance_education",
74
+ }
75
+ return None
76
+
77
+ def collect(self) -> list[dict]:
78
+ """Collect Investopedia insurance term definitions."""
79
+ documents = []
80
+
81
+ for i, term in enumerate(INVESTOPEDIA_TERMS):
82
+ doc = self._scrape_term(term)
83
+ if doc:
84
+ documents.append(doc)
85
+ logger.info(
86
+ f" [{len(documents)}/{len(INVESTOPEDIA_TERMS)}] {doc['title'][:50]} "
87
+ f"({len(doc['text']):,} chars)"
88
+ )
89
+ else:
90
+ logger.debug(f" Skipped: {term}")
91
+
92
+ if (i + 1) % 20 == 0:
93
+ logger.info(f" Progress: {i + 1}/{len(INVESTOPEDIA_TERMS)} terms checked")
94
+
95
+ self.save_documents(documents)
96
+ self.print_stats()
97
+ return documents
98
+
99
+
100
+ def collect_investopedia() -> list[dict]:
101
+ scraper = InvestopediaScraper()
102
+ return scraper.collect()
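Of the candidate URL patterns above, the classic `/terms/{letter}/{term}.asp` form is the one most likely to resolve; the fixed numeric slug is a single article id and will generally 404 for other terms:

```python
term = "coinsurance"
print(f"https://www.investopedia.com/terms/{term[0].lower()}/{term}.asp")
# -> https://www.investopedia.com/terms/c/coinsurance.asp
```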
collect/sources/legislation.py ADDED
@@ -0,0 +1,118 @@
"""Scrape UK insurance legislation from legislation.gov.uk."""

import logging
import re
from typing import Optional

from bs4 import BeautifulSoup

from collect.scraper_base import BaseScraper
from collect.config import UK_LEGISLATION_URLS

logger = logging.getLogger(__name__)


class LegislationScraper(BaseScraper):
    def __init__(self):
        super().__init__("uk_legislation")

    def _get_section_urls(self, contents_url: str) -> list[str]:
        """Parse a legislation contents page to get individual section URLs."""
        html = self.fetch(contents_url)
        if not html:
            return []
        soup = BeautifulSoup(html, "html.parser")
        urls = []
        for a in soup.find_all("a", href=True):
            href = a["href"]
            # Match section links
            if "/section/" in href or "/part/" in href or "/schedule/" in href:
                if href.startswith("/"):
                    href = f"https://www.legislation.gov.uk{href}"
                if href not in urls:
                    urls.append(href)
        return urls[:100]  # Cap per act

    def _scrape_section(self, url: str, act_name: str) -> Optional[dict]:
        """Scrape text from a legislation section."""
        html = self.fetch(url)
        if not html:
            return None
        soup = BeautifulSoup(html, "html.parser")

        # Remove navigation and chrome
        for tag in soup.find_all(["nav", "footer", "script", "style", "aside", "header"]):
            tag.decompose()

        # legislation.gov.uk uses class="LegP1" etc for legislation paragraphs
        content_div = (
            soup.find("div", id="viewLegSnippet")
            or soup.find("div", class_="LegClearFix")
            or soup.find("article")
            or soup.find("main")
        )
        if not content_div:
            content_div = soup.find("body")
        if not content_div:
            return None

        text = content_div.get_text(separator="\n", strip=True)
        text = re.sub(r"\n{3,}", "\n\n", text)

        if len(text) < 50:
            return None

        title_tag = soup.find("h1") or soup.find("title")
        title = title_tag.get_text(strip=True) if title_tag else ""

        return {
            "title": f"UK Law - {act_name} - {title}".strip(" - "),
            "text": text[:30000],
            "source": "uk_legislation",
            "url": url,
            "act": act_name,
            "category": "legislation",
        }

    def _extract_act_name(self, url: str) -> str:
        """Extract the act name from a URL."""
        parts = url.rstrip("/").split("/")
        # e.g. https://www.legislation.gov.uk/ukpga/2015/4/contents -> "2015/4"
        if "contents" in parts:
            idx = parts.index("contents")
            return "/".join(parts[max(0, idx - 2):idx])
        return parts[-1]

    def collect(self) -> list[dict]:
        """Collect UK insurance legislation."""
        documents = []

        for contents_url in UK_LEGISLATION_URLS:
            act_name = self._extract_act_name(contents_url)
            logger.info(f" Scraping legislation: {act_name}")

            # First, scrape the contents page itself for an overview
            overview = self._scrape_section(contents_url, act_name)
            if overview:
                documents.append(overview)

            # Then get individual sections
            section_urls = self._get_section_urls(contents_url)
            logger.info(f" Found {len(section_urls)} sections")

            for url in section_urls:
                doc = self._scrape_section(url, act_name)
                if doc:
                    documents.append(doc)

            logger.info(f" Collected {len(documents)} total so far")

        self.save_documents(documents)
        self.print_stats()
        return documents


def collect_legislation() -> list[dict]:
    scraper = LegislationScraper()
    return scraper.collect()
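A worked example of the act-name extraction: the two path segments before `contents` (year and chapter number) become the label:

```python
from collect.sources.legislation import LegislationScraper

s = LegislationScraper()
print(s._extract_act_name("https://www.legislation.gov.uk/ukpga/2015/4/contents"))
# -> 2015/4  (the Insurance Act 2015)
```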
collect/sources/rss_news.py ADDED
@@ -0,0 +1,122 @@
"""Collect insurance content from RSS feeds and news sources."""

import logging
import re
from typing import Optional

from bs4 import BeautifulSoup

from collect.scraper_base import BaseScraper
from collect.config import RSS_FEEDS

logger = logging.getLogger(__name__)


class RSSCollector(BaseScraper):
    def __init__(self):
        super().__init__("rss_news")

    def _parse_feed(self, feed_url: str) -> list[dict]:
        """Parse an RSS/Atom feed and extract article URLs."""
        xml = self.fetch(feed_url)
        if not xml:
            return []

        # The "xml" parser requires lxml to be installed
        soup = BeautifulSoup(xml, "xml")
        items = soup.find_all("item") or soup.find_all("entry")
        articles = []

        for item in items[:30]:  # Cap per feed
            title = item.find("title")
            link = item.find("link")
            desc = item.find("description") or item.find("summary") or item.find("content")

            title_text = title.get_text(strip=True) if title else ""
            link_text = ""
            if link:
                link_text = link.get("href", "") or link.get_text(strip=True)
            desc_text = desc.get_text(strip=True) if desc else ""

            # Clean HTML from the description
            if desc_text:
                desc_soup = BeautifulSoup(desc_text, "html.parser")
                desc_text = desc_soup.get_text(separator=" ", strip=True)

            if title_text and (desc_text or link_text):
                articles.append({
                    "title": title_text,
                    "url": link_text,
                    "summary": desc_text[:5000],
                })

        return articles

    def _scrape_article(self, url: str, title: str) -> Optional[dict]:
        """Try to scrape the full article text from a URL."""
        if not url or not url.startswith("http"):
            return None

        html = self.fetch(url)
        if not html:
            return None

        soup = BeautifulSoup(html, "html.parser")
        for tag in soup.find_all(["nav", "footer", "script", "style",
                                  "aside", "header", "figure", "iframe"]):
            tag.decompose()

        article = (
            soup.find("article")
            or soup.find("div", class_=re.compile(r"article|content|post|entry"))
            or soup.find("main")
        )
        if not article:
            return None

        text = article.get_text(separator="\n", strip=True)
        text = re.sub(r"\n{3,}", "\n\n", text)

        if len(text) < 200:
            return None

        return {
            "title": title,
            "text": text[:20000],
            "source": "insurance_news",
            "url": url,
            "category": "insurance_news",
        }

    def collect(self) -> list[dict]:
        """Collect articles from insurance RSS feeds."""
        documents = []

        for feed_url in RSS_FEEDS:
            logger.info(f" Parsing feed: {feed_url}")
            articles = self._parse_feed(feed_url)
            logger.info(f" Found {len(articles)} articles")

            for article in articles:
                # Try to get the full article
                doc = self._scrape_article(article["url"], article["title"])
                if doc:
                    documents.append(doc)
                elif article["summary"] and len(article["summary"]) > 100:
                    # Fall back to the RSS summary
                    documents.append({
                        "title": article["title"],
                        "text": article["summary"],
                        "source": "insurance_news_summary",
                        "url": article["url"],
                        "category": "insurance_news",
                    })

        self.save_documents(documents)
        self.print_stats()
        return documents


def collect_rss() -> list[dict]:
    collector = RSSCollector()
    return collector.collect()
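Note that BeautifulSoup's `"xml"` parser depends on lxml; a hedged fallback sketch if lxml may be absent:

```python
from bs4 import BeautifulSoup, FeatureNotFound

def parse_feed_xml(xml: str) -> BeautifulSoup:
    # Prefer the strict XML parser (needs lxml); fall back to the lenient
    # built-in HTML parser, which still finds <item>/<entry> tags.
    try:
        return BeautifulSoup(xml, "xml")
    except FeatureNotFound:
        return BeautifulSoup(xml, "html.parser")
```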
collect/sources/wikipedia.py ADDED
@@ -0,0 +1,106 @@
"""Scrape insurance articles from Wikipedia via the public MediaWiki API."""

import json
import logging
from typing import Optional

from collect.scraper_base import BaseScraper
from collect.config import WIKIPEDIA_SEED_ARTICLES

logger = logging.getLogger(__name__)

API = "https://en.wikipedia.org/w/api.php"


class WikipediaScraper(BaseScraper):
    def __init__(self):
        super().__init__("wikipedia")

    def _get_article_text(self, title: str) -> Optional[dict]:
        """Get plain-text extract of a Wikipedia article via API."""
        url = (
            f"{API}?action=query&titles={title}"
            f"&prop=extracts&explaintext=1&exsectionformat=plain"
            f"&format=json&redirects=1"
        )
        raw = self.fetch(url)
        if not raw:
            return None
        data = json.loads(raw)
        pages = data.get("query", {}).get("pages", {})
        for pid, page in pages.items():
            if pid == "-1":
                return None
            text = page.get("extract", "")
            if len(text) < 200:
                return None
            return {
                "title": page.get("title", title),
                "text": text,
                "source": "wikipedia",
                "url": f"https://en.wikipedia.org/wiki/{title}",
                "category": "insurance_knowledge",
            }
        return None

    def _get_linked_articles(self, title: str, limit: int = 20) -> list[str]:
        """Get insurance-related links from an article."""
        url = (
            f"{API}?action=query&titles={title}"
            f"&prop=links&pllimit={limit}&plnamespace=0"
            f"&format=json&redirects=1"
        )
        raw = self.fetch(url)
        if not raw:
            return []
        data = json.loads(raw)
        pages = data.get("query", {}).get("pages", {})
        links = []
        insurance_keywords = {
            "insurance", "insur", "underw", "claim", "polic",
            "premium", "actuar", "reinsur", "liabil", "indemnit",
            "risk", "loss", "peril", "cover", "broker",
            "lloyd", "solvency", "fca", "pra", "regul",
        }
        for page in pages.values():
            for link in page.get("links", []):
                link_title = link.get("title", "")
                lower = link_title.lower()
                if any(kw in lower for kw in insurance_keywords):
                    links.append(link_title.replace(" ", "_"))
        return links

    def collect(self, max_articles: int = 500) -> list[dict]:
        """Collect Wikipedia insurance articles with link expansion."""
        documents = []
        visited: set[str] = set()
        queue = list(WIKIPEDIA_SEED_ARTICLES)

        while queue and len(documents) < max_articles:
            title = queue.pop(0)
            if title in visited:
                continue
            visited.add(title)

            doc = self._get_article_text(title)
            if doc:
                documents.append(doc)
                logger.info(
                    f" [{len(documents)}/{max_articles}] {doc['title']} "
                    f"({len(doc['text']):,} chars)"
                )
                # Expand: get linked insurance articles
                if len(documents) < max_articles:
                    new_links = self._get_linked_articles(title)
                    for link in new_links:
                        if link not in visited:
                            queue.append(link)

        self.save_documents(documents)
        self.print_stats()
        return documents


def collect_wikipedia(max_articles: int = 500) -> list[dict]:
    scraper = WikipediaScraper()
    return scraper.collect(max_articles)
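Re-runs are cheap because every API response is cached under `collect/raw/wikipedia/`; a minimal usage sketch:

```python
from collect.sources.wikipedia import collect_wikipedia

docs = collect_wikipedia(max_articles=25)  # BFS expansion from the seed list
print(len(docs), docs[0]["title"] if docs else "no documents")
```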
collect/config.py ADDED
@@ -0,0 +1,202 @@
"""Configuration for data collection sources."""

from pathlib import Path

# ── Paths ──────────────────────────────────────────────────────────
BASE_DIR = Path(__file__).resolve().parent.parent
RAW_DIR = BASE_DIR / "collect" / "raw"
PROCESSED_DIR = BASE_DIR / "collect" / "processed"
SFT_OUTPUT = BASE_DIR / "collect" / "sft_real_world.jsonl"
DPO_OUTPUT = BASE_DIR / "collect" / "dpo_real_world.jsonl"

RAW_DIR.mkdir(parents=True, exist_ok=True)
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# ── Rate limiting ──────────────────────────────────────────────────
REQUEST_DELAY = 1.5  # seconds between requests (be polite)
MAX_RETRIES = 3
TIMEOUT = 30

# ── User agent ─────────────────────────────────────────────────────
USER_AGENT = (
    "InsureOS-DataCollector/1.0 "
    "(Research; insurance-domain-model-training; "
    "contact: piyush@bytical.com)"
)

HEADERS = {
    "User-Agent": USER_AGENT,
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
}

# ── Wikipedia insurance articles ───────────────────────────────────
WIKIPEDIA_SEED_ARTICLES = [
    "Insurance", "Reinsurance", "Underwriting", "Actuarial_science",
    "Insurance_policy", "Lloyd%27s_of_London", "Property_insurance",
    "Casualty_insurance", "Life_insurance", "Health_insurance",
    "Motor_insurance", "Marine_insurance", "Liability_insurance",
    "Professional_indemnity_insurance", "Directors_and_officers_liability_insurance",
    "Cyber_insurance", "Product_liability", "Public_liability",
    "Employers%27_liability_insurance", "Business_interruption_insurance",
    "Catastrophe_bond", "Insurance-linked_securities",
    "Solvency_II", "IFRS_17", "Risk_management",
    "Claims_adjusting", "Loss_adjustment", "Salvage_(insurance)",
    "Subrogation", "Indemnity", "Utmost_good_faith",
    "Proximate_cause_(insurance)", "Insurance_fraud",
    "Parametric_insurance", "Microinsurance", "Takaful",
    "Financial_Conduct_Authority", "Prudential_Regulation_Authority_(United_Kingdom)",
    "General_insurance", "Insurance_broker", "Managing_general_agent",
    "Coverholder", "Bordereaux", "Treaty_reinsurance",
    "Facultative_reinsurance", "Excess_of_loss", "Quota_share",
    "Stop-loss_insurance", "Aggregate_stop-loss_insurance",
    "Deductible", "Co-insurance", "Self-insurance",
    "Captive_insurance", "Risk_retention_group",
    "Insurance_in_the_United_Kingdom", "Association_of_British_Insurers",
    "Chartered_Insurance_Institute", "Insurance_premium_tax",
    "Motor_Insurers%27_Bureau", "Pool_Reinsurance_Company",
    "Flood_Re", "Terrorism_reinsurance",
    "Insurance_contract", "Warranty_(insurance)",
    "Condition_(insurance)", "Exclusion_(insurance)",
    "Endorsement_(insurance)", "Schedule_(insurance)",
    "Inception_(insurance)", "Renewal_(insurance)",
    "Cancellation_(insurance)", "Claims-made_policy",
    "Occurrence_policy", "Claims_reserve",
    "Incurred_but_not_reported", "Loss_ratio",
    "Combined_ratio", "Expense_ratio",
    "Generalized_linear_model", "Tweedie_distribution",
    "Poisson_regression", "Gamma_distribution",
    "Chain_ladder_method", "Bornhuetter–Ferguson_method",
    "Credibility_theory", "Experience_rating",
    "Risk_classification", "Adverse_selection",
    "Moral_hazard", "Insurance_scoring",
    "Telematics", "Usage-based_insurance",
    "Insurtech", "Peer-to-peer_insurance",
    "Embedded_insurance", "Open_insurance",
    "ACORD", "ISO_ClaimSearch",
    "National_Flood_Insurance_Program",
    "Earthquake_insurance", "Windstorm_insurance",
    "Hail_insurance", "Crop_insurance",
    "Title_insurance", "Surety_bond",
    "Fidelity_bond", "Warranty",
    "Extended_warranty", "Home_warranty",
    "Pet_insurance", "Travel_insurance",
    "Wedding_insurance", "Event_insurance",
    "Key_person_insurance", "Trade_credit_insurance",
    "Political_risk_insurance", "Environmental_liability",
    "Pollution_insurance",
]

# ── FCA Handbook sections ──────────────────────────────────────────
FCA_HANDBOOK_SECTIONS = [
    "ICOBS",  # Insurance: Conduct of Business Sourcebook
    "SYSC",   # Senior Management Arrangements
    "PRIN",   # Principles for Businesses
    "COBS",   # Conduct of Business Sourcebook
    "DISP",   # Dispute Resolution: Complaints
    "SUP",    # Supervision
    "CONC",   # Consumer Credit
    "MCOB",   # Mortgages and Home Finance
]

FCA_BASE_URL = "https://www.handbook.fca.org.uk"

# ── UK Legislation ─────────────────────────────────────────────────
UK_LEGISLATION_URLS = [
    # Insurance Act 2015
    "https://www.legislation.gov.uk/ukpga/2015/4/contents",
    # Enterprise Act 2016 (insurance damages for late payment)
    "https://www.legislation.gov.uk/ukpga/2016/12/contents",
    # Financial Services and Markets Act 2000
    "https://www.legislation.gov.uk/ukpga/2000/8/contents",
    # Third Parties (Rights against Insurers) Act 2010
    "https://www.legislation.gov.uk/ukpga/2010/10/contents",
    # Road Traffic Act 1988 (compulsory motor insurance)
    "https://www.legislation.gov.uk/ukpga/1988/52/contents",
    # Employers' Liability (Compulsory Insurance) Act 1969
    "https://www.legislation.gov.uk/ukpga/1969/57/contents",
    # Marine Insurance Act 1906
    "https://www.legislation.gov.uk/ukpga/Edw7/6/41/contents",
    # Consumer Insurance (Disclosure and Representations) Act 2012
    "https://www.legislation.gov.uk/ukpga/2012/6/contents",
    # Data Protection Act 2018
    "https://www.legislation.gov.uk/ukpga/2018/12/contents",
]

# ── Investopedia insurance glossary terms ──────────────────────────
INVESTOPEDIA_TERMS = [
    "insurance", "reinsurance", "underwriting", "premium",
    "deductible", "copayment", "coinsurance", "policy-limit",
    "exclusion", "endorsement", "rider", "binder",
    "actuary", "actuarial-science", "loss-ratio",
    "combined-ratio", "expense-ratio", "claims-reserve",
    "ibnr", "incurred-but-not-reported",
    "lloyd-s-of-london", "surplus-lines",
    "managing-general-agent", "captive-insurance-company",
    "risk-retention-group", "self-insurance",
    "occurrence-policy", "claims-made-policy",
    "general-liability-insurance", "professional-liability-insurance",
    "errors-and-omissions-insurance", "directors-and-officers-liability-insurance",
    "cyber-insurance", "key-person-insurance",
    "business-interruption-insurance", "commercial-property-insurance",
    "workers-compensation", "employers-liability-insurance",
    "public-liability-insurance", "product-liability-insurance",
    "environmental-liability-insurance", "marine-insurance",
    "hull-insurance", "cargo-insurance",
    "protection-and-indemnity-insurance", "aviation-insurance",
    "crop-insurance", "title-insurance",
    "surety-bond", "fidelity-bond",
    "catastrophe-bond", "insurance-linked-securities",
    "parametric-insurance", "microinsurance",
    "property-insurance", "casualty-insurance",
    "fire-insurance", "flood-insurance",
    "earthquake-insurance", "windstorm-insurance",
    "homeowners-insurance", "renters-insurance",
    "auto-insurance", "uninsured-motorist-coverage",
    "comprehensive-auto-insurance", "collision-insurance",
    "gap-insurance", "umbrella-insurance",
    "life-insurance", "term-life-insurance",
    "whole-life-insurance", "universal-life-insurance",
    "variable-life-insurance", "endowment-policy",
    "annuity", "health-insurance",
    "disability-insurance", "long-term-care-insurance",
    "pet-insurance", "travel-insurance",
    "wedding-insurance", "event-insurance",
    "trade-credit-insurance", "political-risk-insurance",
    "warranty", "extended-warranty",
    "solvency", "moral-hazard",
    "adverse-selection", "risk-management",
    "risk-assessment", "risk-transfer",
    "risk-pooling", "law-of-large-numbers",
    "subrogation", "indemnity", "utmost-good-faith",
    "proximate-cause", "insurable-interest",
    "insurance-fraud", "total-loss",
    "actual-cash-value", "replacement-cost",
    "agreed-value", "reinstatement-value",
]

# ── HuggingFace datasets ──────────────────────────────────────────
HF_DATASETS = [
    ("rvpierre/insurance-qa-en", None),
    ("ebrigham/NL_insurance_reviews_sentiment", None),
    ("snorkelai/Multi-Turn-Insurance-Underwriting-Code-Gen", None),
    ("Ddream-ai/InsuranceCorpus", None),
]

# ── Insurance subreddits ──────────────────────────────────────────
REDDIT_SUBREDDITS = [
    "insurance",
    "InsuranceProfessional",
    "HealthInsurance",
    "ActuaryUK",
    "actuary",
]

# ── RSS feeds for insurance news ───────────────────────────────────
RSS_FEEDS = [
    "https://www.insurancetimes.co.uk/rss",
    "https://www.insurancejournal.com/rss/news/",
    "https://www.reinsurancene.ws/feed/",
    "https://www.artemis.bm/feed/",
]
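Because scrapers read `REQUEST_DELAY` from this module at call time, orchestration code can retune politeness per phase, as `run_fast.py` does:

```python
import collect.config as cfg

cfg.REQUEST_DELAY = 0.5  # generous APIs (Wikipedia, HF Hub)
# ... run API-backed collectors ...
cfg.REQUEST_DELAY = 1.5  # plain web scraping -- be polite
```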
collect/convert_sft.py ADDED
@@ -0,0 +1,494 @@
"""Convert collected real-world insurance data into SFT and DPO training format.

Strategies:
1. Knowledge Q&A -- generate question-answer pairs from article text
2. Summarisation -- "Summarise this insurance concept"
3. Regulation interpretation -- "What does the FCA say about X?"
4. Legislation interpretation -- UK insurance law sections
5. Underwriting tasks -- from the snorkelai dataset
6. News analysis
"""

import json
import logging
import random
import re
from pathlib import Path

from collect.config import PROCESSED_DIR, SFT_OUTPUT, DPO_OUTPUT, RAW_DIR

logger = logging.getLogger(__name__)

# ── Language filter ────────────────────────────────────────────────

# Datasets to skip entirely (no answers, wrong language)
SKIP_DATASETS = {
    "rvpierre/insurance-qa-en",                 # Questions only, no answers
    "ebrigham/NL_insurance_reviews_sentiment",  # Dutch
    "Ddream-ai/InsuranceCorpus",                # Chinese
}


def _is_english(text: str) -> bool:
    """Quick heuristic: reject CJK or predominantly non-ASCII text."""
    if not text:
        return False
    sample = text[:500]
    # Count CJK characters (CJK Unified Ideographs plus Japanese kana)
    cjk = sum(1 for c in sample if "\u4e00" <= c <= "\u9fff" or "\u3040" <= c <= "\u30ff")
    if cjk > 5:
        return False
    # Require the sample to be predominantly ASCII
    ascii_count = sum(1 for c in sample if c.isascii())
    if ascii_count / len(sample) < 0.7:
        return False
    return True
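
# Quick sanity check of the heuristic above (illustrative):
#   _is_english("Reinsurance spreads risk between insurers.")  -> True
#   a sample with >5 CJK characters, or under 70% ASCII        -> False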
45
+
46
+
47
+ # ── Templates ──────────────────────────────────────────────────────
48
+
49
+ QA_TEMPLATES = [
50
+ "What is {concept}?",
51
+ "Explain {concept} in the context of UK insurance.",
52
+ "How does {concept} work in insurance?",
53
+ "Define {concept} for an insurance professional.",
54
+ "What role does {concept} play in the insurance industry?",
55
+ "Describe {concept} and its importance in insurance.",
56
+ "As an insurance underwriter, explain {concept}.",
57
+ "What should a claims handler know about {concept}?",
58
+ "How is {concept} relevant to insurance regulation in the UK?",
59
+ "Explain {concept} as it applies to general insurance.",
60
+ ]
61
+
62
+ REGULATION_TEMPLATES = [
63
+ "What does the FCA require regarding {topic}?",
64
+ "Explain the regulatory requirements for {topic} in UK insurance.",
65
+ "How does {topic} affect insurance companies under UK regulation?",
66
+ "What compliance obligations exist for {topic}?",
67
+ "Summarise the key regulatory points about {topic}.",
68
+ ]
69
+
70
+
71
+ def _extract_first_paragraph(text: str, max_len: int = 800) -> str:
72
+ """Extract a clean first paragraph as a concise answer."""
73
+ paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
74
+ if not paragraphs:
75
+ return text[:max_len]
76
+ for p in paragraphs:
77
+ if len(p) > 50:
78
+ return p[:max_len]
79
+ return paragraphs[0][:max_len]
80
+
81
+
82
+ def _extract_key_concepts(text: str) -> list[str]:
83
+ """Extract key insurance concepts/terms from text."""
84
+ concepts = set()
85
+ for line in text.split("\n"):
86
+ line = line.strip()
87
+ if 3 < len(line) < 80 and not line.endswith("."):
88
+ if line[0].isupper() and not line.startswith("The "):
89
+ concepts.add(line.strip("=").strip("#").strip())
90
+
91
+ insurance_terms = re.findall(
92
+ r'\b([A-Z][a-z]+(?: [A-Z][a-z]+){0,3})\b', text[:5000]
93
+ )
94
+ for term in insurance_terms:
95
+ if len(term) > 3 and any(kw in term.lower() for kw in [
96
+ "insurance", "reinsur", "claim", "underw", "polic",
97
+ "premium", "loss", "risk", "cover", "liabil",
98
+ "indemnit", "act", "regulation", "fca", "lloyd",
99
+ ]):
100
+ concepts.add(term)
101
+
102
+ return list(concepts)[:10]
103
+
104
+
105
+ def _make_sft_from_knowledge(doc: dict) -> list[dict]:
106
+ """Create SFT pairs from a knowledge article."""
107
+ pairs = []
108
+ title = doc.get("title", "")
109
+ text = doc.get("text", "")
110
+
111
+ if not text or len(text) < 100 or not _is_english(text):
112
+ return pairs
113
+
114
+ if title and len(title) > 3:
115
+ concept = title.replace("_", " ")
116
+ question = random.choice(QA_TEMPLATES).format(concept=concept)
117
+ answer = _extract_first_paragraph(text, max_len=1200)
118
+ if len(answer) > 50:
119
+ pairs.append({
120
+ "instruction": question,
121
+ "response": answer,
122
+ "source": doc.get("source", "unknown"),
123
+ "category": "knowledge_qa",
124
+ })
125
+
126
+ if len(text) > 500:
127
+ chunk = text[:3000]
128
+ pairs.append({
129
+ "instruction": f"Summarise the following insurance content:\n\n{chunk}",
130
+ "response": _extract_first_paragraph(text, max_len=600),
131
+ "source": doc.get("source", "unknown"),
132
+ "category": "summarisation",
133
+ })
134
+
135
+ concepts = _extract_key_concepts(text)
136
+ for concept in concepts[:3]:
137
+ question = random.choice(QA_TEMPLATES).format(concept=concept)
138
+ for para in text.split("\n\n"):
139
+ if concept.lower() in para.lower() and len(para) > 50:
140
+ pairs.append({
141
+ "instruction": question,
142
+ "response": para[:1200],
143
+ "source": doc.get("source", "unknown"),
144
+ "category": "concept_qa",
145
+ })
146
+ break
147
+
148
+ return pairs
149
+
150
+
151
+ def _make_sft_from_regulation(doc: dict) -> list[dict]:
152
+ """Create SFT pairs from regulatory documents."""
153
+ pairs = []
154
+ text = doc.get("text", "")
155
+ title = doc.get("title", "")
156
+ section = doc.get("section", "")
157
+
158
+ if not text or len(text) < 100 or not _is_english(text):
159
+ return pairs
160
+
161
+ topic = title or section
162
+ if topic:
163
+ question = random.choice(REGULATION_TEMPLATES).format(topic=topic)
164
+ answer = _extract_first_paragraph(text, max_len=1500)
165
+ if len(answer) > 50:
166
+ pairs.append({
167
+ "instruction": question,
168
+ "response": answer,
169
+ "source": "regulation",
170
+ "category": "regulation_qa",
171
+ })
172
+
173
+ sections = text.split("\n\n")
174
+ for section_text in sections[:5]:
175
+ if len(section_text) > 100:
176
+ pairs.append({
177
+ "instruction": f"Explain this insurance regulation provision:\n\n{section_text[:500]}",
178
+ "response": section_text[:1500],
179
+ "source": "regulation",
180
+ "category": "regulation_explain",
181
+ })
182
+
183
+ return pairs
184
+
185
+
186
+ def _make_sft_from_legislation(doc: dict) -> list[dict]:
187
+ """Create SFT pairs from UK insurance legislation."""
188
+ pairs = []
189
+ text = doc.get("text", "")
190
+ act = doc.get("act", "")
191
+
192
+ if not text or len(text) < 100 or not _is_english(text):
193
+ return pairs
194
+
195
+ pairs.append({
196
+ "instruction": (
197
+ f"Explain the following provision from UK insurance legislation "
198
+ f"({act}):\n\n{text[:1000]}"
199
+ ),
200
+ "response": text[:2000],
201
+ "source": "uk_legislation",
202
+ "category": "legislation_qa",
203
+ })
204
+
205
+ return pairs
206
+
207
+
208
+ def _make_sft_from_news(doc: dict) -> list[dict]:
209
+ """Create SFT pairs from insurance news articles."""
210
+ pairs = []
211
+ text = doc.get("text", "")
212
+ title = doc.get("title", "")
213
+
214
+ if not text or len(text) < 200 or not _is_english(text):
215
+ return pairs
216
+
217
+ pairs.append({
218
+ "instruction": f"Summarise this insurance industry news article:\n\n{text[:2000]}",
219
+ "response": _extract_first_paragraph(text, max_len=800),
220
+ "source": "insurance_news",
221
+ "category": "news_summary",
222
+ })
223
+
224
+ if title:
225
+ pairs.append({
226
+ "instruction": (
227
+ f"As an insurance industry analyst, what are the key takeaways "
228
+ f"from this article titled '{title}'?\n\n{text[:1500]}"
229
+ ),
230
+ "response": _extract_first_paragraph(text, max_len=1000),
231
+ "source": "insurance_news",
232
+ "category": "news_analysis",
233
+ })
234
+
235
+ return pairs
236
+
237
+
238
+ def _make_sft_from_underwriting(doc: dict) -> list[dict]:
239
+ """Create SFT pairs from snorkelai underwriting dataset."""
240
+ pairs = []
241
+ row = doc.get("row", {})
242
+ task = row.get("task", "")
243
+ ref_answer = row.get("reference answer", "")
244
+ company = row.get("company name", "Unknown Company")
245
+ desc = row.get("company description", "")
246
+ revenue = row.get("annual revenue", "")
247
+ employees = row.get("number of employees", "")
248
+ payroll = row.get("total payroll", "")
249
+ vehicles = row.get("number of vehicles", "")
250
+ construction = row.get("building construction", "")
251
+ state = row.get("state", "")
252
+ lob = row.get("lob", "")
253
+
254
+ if not task or not ref_answer:
255
+ return pairs
256
+
257
+ # Build company profile for context
258
+ profile_parts = [f"Company: {company}"]
259
+ if desc:
260
+ profile_parts.append(f"Description: {desc[:300]}")
261
+ if revenue:
262
+ profile_parts.append(f"Annual Revenue: ${int(revenue):,}" if revenue.isdigit() else f"Annual Revenue: {revenue}")
263
+ if employees:
264
+ profile_parts.append(f"Employees: {employees}")
265
+ if payroll:
266
+ profile_parts.append(f"Total Payroll: ${int(payroll):,}" if payroll.isdigit() else f"Total Payroll: {payroll}")
267
+ if vehicles:
268
+ profile_parts.append(f"Vehicles: {vehicles}")
269
+ if construction:
270
+ profile_parts.append(f"Building Construction: {construction}")
271
+ if state:
272
+ profile_parts.append(f"State: {state}")
273
+ if lob:
274
+ profile_parts.append(f"Line of Business: {lob}")
275
+ profile = "\n".join(profile_parts)
276
+
277
+ # Task-specific prompts
278
+ if task == "Small Business Elibility Check":
279
+ instruction = (
280
+ f"As an insurance underwriter, determine if the following company qualifies "
281
+ f"as a small business for insurance purposes:\n\n{profile}"
282
+ )
283
+ elif task == "Business Classification":
284
+ instruction = (
285
+ f"As an insurance underwriter, classify the following business and determine "
286
+ f"its NAICS code:\n\n{profile}"
287
+ )
288
+ elif task == "Appetite Check":
289
+ instruction = (
290
+ f"As an insurance underwriter, determine whether the following company is "
291
+ f"within appetite for the specified line of business:\n\n{profile}"
292
+ )
293
+ elif task == "Product Recommendations":
294
+ instruction = (
295
+ f"As an insurance underwriter, recommend appropriate insurance products "
296
+ f"for the following company:\n\n{profile}"
297
+ )
298
+ elif task == "Policy Limits":
299
+ instruction = (
300
+ f"As an insurance underwriter, recommend appropriate policy limits "
301
+ f"for the following company:\n\n{profile}"
302
+ )
303
+ elif task == "Deductibles":
304
+ instruction = (
305
+ f"As an insurance underwriter, recommend appropriate deductible levels "
306
+ f"for the following company:\n\n{profile}"
307
+ )
308
+ else:
309
+ instruction = (
310
+ f"As an insurance underwriter, perform the following task: {task}\n\n{profile}"
311
+ )
312
+
313
+ pairs.append({
314
+ "instruction": instruction,
315
+ "response": ref_answer,
316
+ "source": "snorkelai/underwriting",
317
+ "category": "underwriting",
318
+ })
319
+
320
+ return pairs
321
+
322
+
323
+ def _make_sft_from_hf(doc: dict) -> list[dict]:
324
+ """Create SFT pairs from HuggingFace dataset rows.
325
+
326
+ Skips datasets in SKIP_DATASETS and non-English text.
327
+ Routes snorkelai data to specialised underwriting converter.
328
+ """
329
+ dataset = doc.get("dataset", "")
330
+
331
+ # Skip blacklisted datasets
332
+ if dataset in SKIP_DATASETS:
333
+ return []
334
+
335
+ # Route snorkelai to dedicated underwriting handler
336
+ if "snorkelai" in dataset.lower() or "underwriting" in dataset.lower():
337
+ return _make_sft_from_underwriting(doc)
338
+
339
+ # Language filter
340
+ text = doc.get("text", "")
341
+ if not text or len(text) < 50 or not _is_english(text):
342
+ return []
343
+
344
+ row = doc.get("row", {})
345
+ pairs = []
346
+
347
+ # Check for question/answer fields
348
+ q = row.get("question", row.get("question_en", ""))
349
+ a = row.get("answer", row.get("answer_en", row.get("response", "")))
350
+ if q and a and len(a) > 20:
351
+ pairs.append({
352
+ "instruction": q,
353
+ "response": a[:2000],
354
+ "source": dataset,
355
+ "category": "hf_qa",
356
+ })
357
+ return pairs
358
+
359
+ # Check for instruction/output fields
360
+ inst = row.get("instruction", row.get("input", ""))
361
+ out = row.get("output", row.get("response", ""))
362
+ if inst and out and len(out) > 20:
363
+ pairs.append({
364
+ "instruction": inst,
365
+ "response": out[:2000],
366
+ "source": dataset,
367
+ "category": "hf_instruction",
368
+ })
369
+ return pairs
370
+
371
+ # Generic: only if text is substantial and looks like insurance content
372
+ if len(text) > 200:
373
+ pairs.append({
374
+ "instruction": f"Explain the following insurance information:\n\n{text[:1000]}",
375
+ "response": text[:2000],
376
+ "source": dataset,
377
+ "category": "hf_knowledge",
378
+ })
379
+
380
+ return pairs
381
+
382
+
383
+ def _make_dpo_pair(sft_pair: dict) -> dict | None:
384
+ """Create a DPO preference pair from an SFT pair."""
385
+ instruction = sft_pair["instruction"]
386
+ good_response = sft_pair["response"]
387
+
388
+ if len(good_response) < 100:
389
+ return None
390
+
391
+ bad_strategies = [
392
+ lambda r: r.split(".")[0] + "." if "." in r else r[:50],
393
+ lambda r: "This is a complex insurance topic that requires careful consideration of many factors.",
394
+ lambda r: r[:max(50, len(r) // 5)],
395
+ lambda r: f"While insurance is about managing risk, {r[:100]}",
396
+ ]
397
+
398
+ bad_response = random.choice(bad_strategies)(good_response)
399
+
400
+ return {
401
+ "instruction": instruction,
402
+ "chosen": good_response,
403
+ "rejected": bad_response,
404
+ "source": sft_pair.get("source", "unknown"),
405
+ }
406
+
407
+
408
+ def convert_all_to_sft(raw_dir: Path = RAW_DIR) -> tuple[int, int]:
409
+ """Convert all collected raw documents to SFT and DPO format."""
410
+ all_sft = []
411
+ all_dpo = []
412
+
413
+ converters = {
414
+ "wikipedia": _make_sft_from_knowledge,
415
+ "fca_handbook": _make_sft_from_regulation,
416
+ "uk_legislation": _make_sft_from_legislation,
417
+ "investopedia": _make_sft_from_knowledge,
418
+ "insurance_news": _make_sft_from_news,
419
+ "insurance_news_summary": _make_sft_from_news,
420
+ "huggingface": _make_sft_from_hf,
421
+ "exam_syllabus": _make_sft_from_knowledge,
422
+ "insurance_education": _make_sft_from_knowledge,
423
+ "insurance_data": _make_sft_from_hf,
424
+ }
425
+
426
+ skipped = 0
427
+ for source_dir in raw_dir.iterdir():
428
+ if not source_dir.is_dir():
429
+ continue
430
+ for jsonl_file in source_dir.glob("*.jsonl"):
431
+ logger.info(f"Converting {jsonl_file}...")
432
+ with open(jsonl_file) as f:
433
+ for line in f:
434
+ try:
435
+ doc = json.loads(line)
436
+ except json.JSONDecodeError:
437
+ continue
438
+
439
+ source = doc.get("source", "")
440
+ converter = converters.get(source, _make_sft_from_knowledge)
441
+ sft_pairs = converter(doc)
442
+
443
+ if not sft_pairs:
444
+ skipped += 1
445
+ continue
446
+
447
+ for pair in sft_pairs:
448
+ # Quality gate: response must not be a near-echo of instruction
449
+ resp = pair["response"].strip().lower()
450
+ inst_text = pair["instruction"].strip().lower()
451
+ if resp and resp != inst_text and len(resp) > 20:
452
+ all_sft.append(pair)
453
+ if random.random() < 0.3:
454
+ dpo = _make_dpo_pair(pair)
455
+ if dpo:
456
+ all_dpo.append(dpo)
457
+
458
+ logger.info(f"Skipped {skipped} documents (non-English, no answers, blacklisted)")
459
+
460
+ random.shuffle(all_sft)
461
+ random.shuffle(all_dpo)
462
+
463
+ # Write SFT
464
+ with open(SFT_OUTPUT, "w") as f:
465
+ for pair in all_sft:
466
+ chat = {
467
+ "messages": [
468
+ {"role": "system", "content": "You are InsureLLM, an expert UK insurance AI assistant. You provide accurate, detailed, and regulation-aware answers about insurance, underwriting, claims, actuarial science, and UK/EU insurance regulation."},
469
+ {"role": "user", "content": pair["instruction"]},
470
+ {"role": "assistant", "content": pair["response"]},
471
+ ]
472
+ }
473
+ f.write(json.dumps(chat, ensure_ascii=False) + "\n")
474
+
475
+ # Write DPO
476
+ with open(DPO_OUTPUT, "w") as f:
477
+ for pair in all_dpo:
478
+ dpo_row = {
479
+ "prompt": pair["instruction"],
480
+ "chosen": pair["chosen"],
481
+ "rejected": pair["rejected"],
482
+ }
483
+ f.write(json.dumps(dpo_row, ensure_ascii=False) + "\n")
484
+
485
+ logger.info(f"SFT: {len(all_sft)} pairs β†’ {SFT_OUTPUT}")
486
+ logger.info(f"DPO: {len(all_dpo)} pairs β†’ {DPO_OUTPUT}")
487
+
488
+ return len(all_sft), len(all_dpo)
489
+
490
+
491
+ if __name__ == "__main__":
492
+ logging.basicConfig(level=logging.INFO)
493
+ sft_count, dpo_count = convert_all_to_sft()
494
+ print(f"Created {sft_count} SFT pairs and {dpo_count} DPO pairs")
data/__init__.py ADDED
@@ -0,0 +1 @@
+ # InsureOS Models — Python package markers
data/constants.py ADDED
@@ -0,0 +1,219 @@
+ """
+ InsureOS — UK Insurance Synthetic Data: Constants & Templates
+ All UK-specific: GBP, postcodes, FCA references, Lloyd's market terms
+ """
+ 
+ # ── UK Insurance Lines of Business ──
+ LINES_OF_BUSINESS = [
+     "Motor Private Car", "Motor Commercial Vehicle", "Motor Fleet",
+     "Home Buildings", "Home Contents", "Home Combined",
+     "Commercial Property", "Commercial Combined",
+     "Employers' Liability", "Public Liability", "Professional Indemnity",
+     "Directors & Officers", "Cyber Liability",
+     "Travel Single Trip", "Travel Annual Multi-Trip",
+     "Pet Insurance", "Life Term Assurance", "Income Protection",
+     "Marine Cargo", "Marine Hull",
+     "Aviation", "Engineering Inspection", "Legal Expenses",
+     "Fidelity Guarantee", "Business Interruption",
+ ]
+ 
+ # ── UK Regions & Postcodes ──
+ UK_REGIONS = {
+     "London": ["EC1A", "EC2A", "WC1A", "SW1A", "SE1", "E1", "N1", "W1"],
+     "South East": ["GU", "RH", "TN", "CT", "ME", "BN", "PO", "SO"],
+     "South West": ["BS", "BA", "EX", "PL", "TR", "TA", "DT", "GL"],
+     "East Anglia": ["CB", "IP", "NR", "CO", "CM", "PE"],
+     "Midlands": ["B", "CV", "WV", "WS", "DY", "NG", "DE", "LE"],
+     "North West": ["M", "L", "WA", "CH", "PR", "BL", "OL", "SK"],
+     "North East": ["NE", "SR", "DH", "TS", "DL", "HU"],
+     "Yorkshire": ["LS", "BD", "HX", "HD", "WF", "S", "DN", "YO"],
+     "Scotland": ["EH", "G", "AB", "DD", "KY", "FK", "PA", "IV"],
+     "Wales": ["CF", "SA", "NP", "LL", "SY", "LD"],
+     "Northern Ireland": ["BT"],
+ }
+ 
+ # ── UK Insurers (for realistic data) ──
+ UK_INSURERS = [
+     "Aviva", "AXA UK", "RSA Insurance", "Zurich UK",
+     "Allianz UK", "QBE European", "Hiscox",
+     "Beazley", "Brit Insurance", "MS Amlin",
+     "Ecclesiastical", "LV= General Insurance",
+     "NFU Mutual", "Direct Line Group", "Admiral",
+     "Ageas UK", "Covéa Insurance", "Tokio Marine Kiln",
+     "Canopius", "Chaucer", "Argenta Syndicate",
+ ]
+ 
+ # ── Lloyd's Syndicates ──
+ LLOYDS_SYNDICATES = [
+     "Syndicate 2623 (Beazley)", "Syndicate 2987 (Brit)",
+     "Syndicate 2001 (MS Amlin)", "Syndicate 1084 (Chaucer)",
+     "Syndicate 4444 (Canopius)", "Syndicate 1861 (Argo)",
+     "Syndicate 1200 (Argo)", "Syndicate 5623 (Aon)",
+     "Syndicate 1729 (Dale)", "Syndicate 1969 (Apollo)",
+     "Syndicate 2525 (Asta)", "Syndicate 2121 (Argenta)",
+ ]
+ 
+ # ── MGA Names ──
+ MGA_NAMES = [
+     "Bravo Networks MGA", "Accelerant Holdings",
+     "Volante Global", "Three Sixty Underwriting",
+     "Pen Underwriting", "Plexus MGA",
+     "Manchester Underwriting", "Alchemy Underwriting",
+     "Rokstone Underwriting", "Arista Insurance",
+     "Ennismore MGA", "Flow Underwriting",
+ ]
+ 
+ # ── Claim Types by Line ──
+ CLAIM_TYPES = {
+     "Motor Private Car": [
+         "Accidental damage — collision with another vehicle",
+         "Accidental damage — single vehicle (e.g., hit kerb, post)",
+         "Theft of vehicle", "Theft from vehicle",
+         "Windscreen damage", "Fire damage",
+         "Third party bodily injury", "Third party property damage",
+         "Personal injury — whiplash", "Flood damage to vehicle",
+     ],
+     "Home Buildings": [
+         "Escape of water — burst pipe", "Escape of water — leaking roof",
+         "Storm damage — roof tiles", "Storm damage — fallen tree",
+         "Flood damage", "Subsidence", "Heave",
+         "Fire damage", "Malicious damage",
+         "Impact damage — vehicle into property",
+     ],
+     "Home Contents": [
+         "Theft — burglary", "Accidental damage — spillage on carpet",
+         "Accidental damage — broken TV/laptop",
+         "Fire damage to contents", "Flood damage to contents",
+         "Loss of jewellery", "Freezer contents (power failure)",
+     ],
+     "Employers' Liability": [
+         "Slip/trip/fall at workplace", "Manual handling injury",
+         "Repetitive strain injury", "Exposure to hazardous substances",
+         "Workplace violence", "Work-related stress claim",
+         "Fall from height", "Machinery accident",
+     ],
+     "Public Liability": [
+         "Slip/trip on premises", "Product liability — defective goods",
+         "Property damage during work", "Food poisoning claim",
+         "Professional negligence", "Advertising injury",
+     ],
+     "Professional Indemnity": [
+         "Negligent advice or design", "Breach of duty of care",
+         "Failure to meet professional standards", "Data breach liability",
+         "Omission in professional service", "Loss of client documents",
+     ],
+     "Cyber Liability": [
+         "Ransomware attack", "Data breach — customer PII",
+         "Business email compromise", "DDoS attack — business interruption",
+         "Social engineering fraud", "Third party data breach claim",
+     ],
+     "Travel Single Trip": [
+         "Medical emergency abroad", "Trip cancellation",
+         "Lost baggage", "Flight delay", "Passport loss",
+         "Personal belongings theft",
+     ],
+ }
+ 
+ # ── FCA/Regulatory References ──
+ FCA_REFERENCES = {
+     "consumer_duty": "FCA Consumer Duty (PS22/9, effective July 2023)",
+     "icobs": "ICOBS (Insurance: Conduct of Business sourcebook)",
+     "fair_treatment": "TCF — Treating Customers Fairly principles",
+     "complaints": "DISP (Dispute Resolution: Complaints sourcebook)",
+     "claims_handling": "ICOBS 8 — Claims handling requirements",
+     "value_assessment": "FCA PS21/5 General Insurance Value Measures",
+     "pricing_practices": "FCA PS21/14 General Insurance Pricing Practices",
+     "solvency_ii": "Solvency II Directive (2009/138/EC) as retained UK law",
+     "gdpr": "UK GDPR (Data Protection Act 2018)",
+     "equality_act": "Equality Act 2010 — protected characteristics",
+     "fos": "Financial Ombudsman Service (FOS) referral rights",
+ }
+ 
+ # ── Policy Wording Sections ──
+ POLICY_SECTIONS = [
+     "Definitions", "Operative Clause", "Insuring Clause",
+     "General Exclusions", "General Conditions", "Claims Conditions",
+     "Endorsements", "Schedule of Insurance",
+     "Section 1 — Buildings", "Section 2 — Contents",
+     "Section 3 — Personal Possessions", "Section 4 — Liability",
+     "Cancellation Clause", "Subrogation Rights",
+     "Arbitration Clause", "Fraud Clause",
+     "Sanctions Limitation & Exclusion Clause",
+     "Several Liability Clause (Lloyd's)",
+     "Third Party Rights (Contracts) Act 1999 Exclusion",
+ ]
+ 
+ # ── Document Types ──
+ DOCUMENT_TYPES = [
+     "Policy Schedule", "Certificate of Insurance",
+     "Claim Form", "Loss Adjuster Report",
+     "Bordereaux — Premium", "Bordereaux — Claims",
+     "Endorsement", "Renewal Notice",
+     "Statement of Fact", "FNOL Report",
+     "Subrogation Notice", "Policy Wording",
+ ]
+ 
+ # ── NER Entity Types ──
+ NER_ENTITY_TYPES = [
+     "POLICY_NUMBER", "CLAIM_NUMBER", "INSURED_NAME",
+     "INSURER_NAME", "BROKER_NAME", "SYNDICATE",
+     "COVERAGE_TYPE", "CURRENCY_AMOUNT", "DATE",
+     "POSTCODE", "VEHICLE_REG", "PERIL",
+     "EXCLUSION", "EXCESS_AMOUNT", "LIMIT_AMOUNT",
+ ]
+ 
+ # ── Insurance Jargon for Training ──
+ INSURANCE_JARGON = {
+     "utmost good faith": "Both parties must disclose all material facts honestly.",
+     "subrogation": "The insurer's right to recover costs from a third party at fault.",
+     "indemnity": "Restoring the insured to the same financial position as before the loss.",
+     "proximate cause": "The dominant or effective cause of the loss.",
+     "excess": "The first amount of any claim that the policyholder must pay themselves.",
+     "deductible": "Another term for excess — the uninsured portion of a claim.",
+     "aggregate limit": "The maximum total amount an insurer will pay in a policy period.",
+     "bordereaux": "A detailed listing of premiums or claims, typically sent monthly by an MGA to their capacity provider.",
+ "coverholder": "A firm authorized by Lloyd's to enter into contracts of insurance on behalf of a syndicate.",
176
+ "binding authority": "An agreement allowing a coverholder to underwrite risks on behalf of a syndicate.",
177
+ "slip": "The document used in the London Market to place a risk, showing lead and following underwriters.",
178
+ "following market": "Underwriters who accept a share of the risk after the lead underwriter has set terms.",
179
+ "burning cost": "The ratio of actual claims incurred to premiums received, used to set reinsurance rates.",
180
+ "loss ratio": "Claims paid (or incurred) divided by premiums earned, expressed as a percentage.",
181
+ "combined ratio": "Loss ratio plus expense ratio β€” below 100% means underwriting profit.",
182
+ "IBNR": "Incurred But Not Reported β€” reserves for claims that have occurred but not yet been filed.",
183
+ "case reserve": "The estimated cost set aside for a specific known claim.",
184
+ "frequency": "The number of claims per unit of exposure.",
185
+ "severity": "The average cost per claim.",
186
+ "attritional loss": "Expected losses from many small, frequent claims.",
187
+ "catastrophe loss": "Large losses from a single event (e.g., storm, flood).",
188
+ "FCA Consumer Duty": "FCA regulation requiring firms to act to deliver good outcomes for retail customers.",
189
+ "TCF": "Treating Customers Fairly β€” FCA principle predating Consumer Duty.",
190
+ "FOS": "Financial Ombudsman Service β€” free dispute resolution for consumers.",
191
+ "FSCS": "Financial Services Compensation Scheme β€” protects consumers if an insurer fails.",
192
+ }
193
+
194
+ # ── SFT Task Categories ──
195
+ SFT_TASK_CATEGORIES = [
196
+ "claims_handling", # Process a claim, assess coverage, set reserves
197
+ "policy_analysis", # Explain policy wordings, coverage, exclusions
198
+ "fnol", # First Notification of Loss processing
199
+ "compliance_check", # FCA Consumer Duty, GDPR, fair pricing checks
200
+ "bordereaux_processing", # Parse and validate bordereaux data
201
+ "fraud_assessment", # Evaluate fraud indicators
202
+ "underwriting_triage", # Assess a submission, recommend terms
203
+ "customer_communication", # Draft policyholder letters/emails
204
+ "reserve_setting", # Set/review claim reserves with rationale
205
+ "renewal_review", # Assess a renewal, flag changes
206
+ "jargon_explanation", # Explain insurance terms in plain English
207
+ "regulatory_query", # Answer regulatory questions (FCA, PRA, Lloyd's)
208
+ ]
209
+
210
+ # ── DPO Preference Dimensions ──
211
+ DPO_PREFERENCE_DIMENSIONS = [
212
+ "fca_consumer_duty", # Chosen: consumer-fair; Rejected: unfair/opaque
213
+ "accuracy", # Chosen: factually correct; Rejected: hallucinated
214
+ "regulatory_compliance", # Chosen: compliant; Rejected: non-compliant
215
+ "plain_english", # Chosen: clear; Rejected: jargon-heavy
216
+ "data_protection", # Chosen: GDPR-safe; Rejected: leaks PII
217
+ "fair_pricing", # Chosen: non-discriminatory; Rejected: uses protected characteristics
218
+ ]
219
+ MGAS = MGA_NAMES # alias for backward compat
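+ 
+ 
+ # ── Illustrative sketch (not used by the generators): resolving a postcode area
+ # back to its region via UK_REGIONS. Longest-prefix matching matters here because
+ # the one-letter Midlands prefix "B" would otherwise shadow two-letter prefixes
+ # such as "BS" (South West) or "BN" (South East).
+ def region_for_postcode(postcode: str) -> str | None:
+     area = postcode.upper().split()[0]
+     best: tuple[str, int] | None = None
+     for region, prefixes in UK_REGIONS.items():
+         for prefix in prefixes:
+             if area.startswith(prefix) and (best is None or len(prefix) > best[1]):
+                 best = (region, len(prefix))
+     return best[0] if best else None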
data/gen_documents.py ADDED
@@ -0,0 +1,324 @@
+ """
+ InsureOS — Synthetic Document Classification Data Generator
+ Generates 10K labelled insurance document texts for fine-tuning a ModernBERT classifier.
+ """
+ 
+ import json
+ import os
+ import random
+ 
+ from faker import Faker
+ from tqdm import tqdm
+ 
+ from data.constants import (
+     DOCUMENT_TYPES, UK_INSURERS, LLOYDS_SYNDICATES, MGAS, UK_REGIONS,
+ )
+ 
+ fake = Faker("en_GB")
+ Faker.seed(45)
+ random.seed(45)
+ 
+ # ── Document templates per type ──
+ 
+ def _gen_policy_schedule() -> str:
+     insurer = random.choice(UK_INSURERS)
+     customer = fake.name()
+     addr = fake.address().replace("\n", ", ")
+     ref = f"POL-{random.randint(100000,999999)}"
+     inception = fake.date_between(start_date="-2y", end_date="today")
+     # .replace() raises ValueError if inception falls on 29 Feb; clamp to 28 Feb
+     try:
+         expiry = inception.replace(year=inception.year + 1)
+     except ValueError:
+         expiry = inception.replace(year=inception.year + 1, day=28)
+     premium = random.randint(200, 5000)
+     return (
+         f"POLICY SCHEDULE\n"
+         f"Insurer: {insurer}\n"
+         f"Policy Number: {ref}\n"
+         f"Policyholder: {customer}\n"
+         f"Address: {addr}\n"
+         f"Period of Insurance: {inception.isoformat()} to {expiry.isoformat()}\n"
+         f"Total Premium: £{premium:,}\n"
+         f"Insurance Premium Tax at 12%: £{int(premium*0.12):,}\n"
+         f"Lines of Business: {random.choice(['Motor','Home','Commercial Combined','Landlord'])}\n"
+         f"Voluntary Excess: £{random.choice([0,100,250,500])}\n"
+         f"Compulsory Excess: £{random.choice([100,250,350])}\n"
+         f"No Claims Discount: {random.randint(0,9)} years\n"
+         f"Special Conditions: {random.choice(['None','Alarm condition','Unoccupancy clause','Named driver only'])}\n"
+     )
+ 
+ 
+ def _gen_claim_form() -> str:
+     customer = fake.name()
+     ref = f"CLM-{random.randint(200000,999999)}"
+     loss_date = fake.date_between(start_date="-1y", end_date="today")
+     claim_type = random.choice(["Escape of water", "Accidental damage", "Theft", "Storm", "Collision", "Fire"])
+     amount = random.randint(500, 50000)
+     return (
+         f"FIRST NOTIFICATION OF LOSS / CLAIM FORM\n"
+         f"Claim Reference: {ref}\n"
+         f"Policyholder: {customer}\n"
+         f"Date of Loss: {loss_date.isoformat()}\n"
+         f"Type of Loss: {claim_type}\n"
+         f"Description: {fake.paragraph(nb_sentences=4)}\n"
+         f"Estimated Value: £{amount:,}\n"
+         f"Police Notified: {random.choice(['Yes - crime ref provided','No','Not applicable'])}\n"
+         f"Witnesses: {random.choice(['Yes','No'])}\n"
+         f"Supporting Evidence: {random.choice(['Photos attached','Receipts attached','Awaiting','None'])}\n"
+     )
+ 
+ 
+ def _gen_endorsement() -> str:
+     ref = f"POL-{random.randint(100000,999999)}"
+     endo_num = random.randint(1, 12)
+     effective = fake.date_between(start_date="-1y", end_date="today")
+     return (
+         f"ENDORSEMENT NO. {endo_num}\n"
+         f"Policy Reference: {ref}\n"
+         f"Effective Date: {effective.isoformat()}\n"
+         f"Amendment: {random.choice(['Addition of named driver','Change of address','Increase in sum insured','Vehicle change','Occupation update','Additional cover added'])}\n"
+         f"Previous: {fake.sentence()}\n"
+         f"New: {fake.sentence()}\n"
+         f"Additional Premium: £{random.randint(0,350)}\n"
+         f"All other terms and conditions remain unchanged.\n"
+         f"Authorised by: {fake.name()}, Underwriter\n"
+     )
+ 
+ 
+ def _gen_loss_adjuster_report() -> str:
+     ref = f"CLM-{random.randint(200000,999999)}"
+     adjuster = fake.name()
+     company = random.choice(["Crawford & Company", "McLarens", "Sedgwick", "Davies Group", "Cunningham Lindsey"])
+     return (
+         f"LOSS ADJUSTER'S REPORT\n"
+         f"Claim Reference: {ref}\n"
+         f"Loss Adjuster: {adjuster} — {company}\n"
+         f"Visit Date: {fake.date_between(start_date='-6m', end_date='today').isoformat()}\n"
+         f"Property Inspected: {fake.address().replace(chr(10), ', ')}\n"
+         f"Findings: {fake.paragraph(nb_sentences=6)}\n"
+         f"Cause of Loss: {random.choice(['Burst pipe — wear and tear','Storm damage — wind speed confirmed >55mph','Malicious damage — forced entry confirmed','Subsidence — monitoring recommended','Accidental damage — consistent with account given'])}\n"
+         f"Recommended Settlement: £{random.randint(1000,40000):,}\n"
+         f"Recommendation: {random.choice(['Pay in full','Pay subject to betterment deduction','Decline — maintenance exclusion','Further investigation required','Refer to fraud team'])}\n"
+     )
+ 
+ 
+ def _gen_bordereaux() -> str:
+     mga = random.choice(MGAS)
+     syndicate = random.choice(LLOYDS_SYNDICATES)
+     period = f"{random.choice(['Q1','Q2','Q3','Q4'])} {random.randint(2023,2025)}"
+     rows = random.randint(5, 15)
+     header = "PolicyRef | Inception | GWP | Claims Paid | Outstanding | Status"
+     data_rows = ""
+     for _ in range(rows):
+         data_rows += (
+             f"\n{random.randint(100000,999999)} | "
+             f"{fake.date_between(start_date='-2y',end_date='today').isoformat()} | "
+             f"£{random.randint(500,25000):,} | "
+             f"£{random.randint(0,15000):,} | "
+             f"£{random.randint(0,8000):,} | "
+             f"{random.choice(['Active','Lapsed','Cancelled','Renewed'])}"
+         )
+     return (
+         f"BORDEREAUX REPORT\n"
+         f"MGA: {mga}\n"
+         f"Capacity Provider: {syndicate}\n"
+         f"Reporting Period: {period}\n"
+         f"Currency: GBP\n\n"
+         f"{header}{data_rows}\n\n"
+         f"Total GWP: £{random.randint(50000,500000):,}\n"
+         f"Total Claims Paid: £{random.randint(10000,200000):,}\n"
+         f"Loss Ratio: {random.randint(35,95)}%\n"
+     )
+ 
+ 
+ def _gen_renewal_invite() -> str:
+     insurer = random.choice(UK_INSURERS)
+     customer = fake.name()
+     ref = f"POL-{random.randint(100000,999999)}"
+     current_premium = random.randint(300, 4000)
+     renewal_premium = int(current_premium * random.uniform(0.85, 1.35))
+     renewal_date = fake.date_between(start_date="today", end_date="+60d")
+     return (
+         f"INSURANCE RENEWAL INVITATION\n"
+         f"Dear {customer},\n\n"
+         f"Your {random.choice(['motor','home','landlord','commercial'])} insurance policy "
+         f"({ref}) with {insurer} is due for renewal on {renewal_date.isoformat()}.\n\n"
+         f"Current premium: £{current_premium:,}\n"
+         f"Renewal premium: £{renewal_premium:,}\n"
+         f"{'Your premium has increased' if renewal_premium > current_premium else 'Your premium has decreased'} "
+         f"by £{abs(renewal_premium - current_premium):,}.\n\n"
+         f"Key changes this year: {random.choice(['No changes to cover','Excess increased by £50','New market rate applied','Discount for claims-free year applied'])}.\n\n"
+         f"Under FCA Consumer Duty, we're required to ensure you're getting fair value. "
+         f"If you'd like to discuss your renewal, please call us.\n"
+     )
+ 
+ 
+ def _gen_subrogation_letter() -> str:
+     ref = f"CLM-{random.randint(200000,999999)}"
+     insurer = random.choice(UK_INSURERS)
+     third_party_insurer = random.choice(UK_INSURERS)
+     amount = random.randint(1000, 25000)
+     return (
+         f"WITHOUT PREJUDICE\n"
+         f"SUBROGATION RECOVERY DEMAND\n\n"
+         f"From: {insurer} — Claims Recovery Unit\n"
+         f"To: {third_party_insurer} — Third Party Claims\n"
+         f"Our Reference: {ref}\n"
+         f"Date of Loss: {fake.date_between(start_date='-1y', end_date='-30d').isoformat()}\n\n"
+         f"We write in connection with the above claim in which our policyholder's "
+         f"vehicle/property was damaged by your insured.\n\n"
+         f"We have indemnified our policyholder in the sum of £{amount:,} and hereby seek "
+         f"recovery pursuant to our rights of subrogation.\n\n"
+         f"We enclose: loss adjuster report, repair invoices, photographic evidence.\n\n"
+         f"Please respond within 21 days with your admission or otherwise.\n"
+     )
+ 
+ 
+ def _gen_complaint_letter() -> str:
+     customer = fake.name()
+     insurer = random.choice(UK_INSURERS)
+     ref = f"CLM-{random.randint(200000,999999)}"
+     return (
+         f"FORMAL COMPLAINT\n\n"
+         f"From: {customer}\n"
+         f"To: Complaints Department, {insurer}\n"
+         f"Date: {fake.date_between(start_date='-3m', end_date='today').isoformat()}\n"
+         f"Claim Reference: {ref}\n\n"
+         f"Dear Complaints Team,\n\n"
+         f"I wish to make a formal complaint about the handling of my claim.\n\n"
+         f"Issue: {random.choice(['Unreasonable delay — no update in 8 weeks','Settlement offer is too low and does not reflect actual costs','You declined my claim without proper investigation','Your staff were unhelpful and dismissive','You failed to appoint a loss adjuster as promised','My personal data was shared without consent'])}\n\n"
+         f"{fake.paragraph(nb_sentences=3)}\n\n"
+         f"I expect a response within 8 weeks in line with FCA requirements. "
+         f"If I am not satisfied, I understand I can refer this to the Financial Ombudsman Service.\n\n"
+         f"Yours faithfully,\n{customer}\n"
+     )
+ 
+ 
+ def _gen_medical_report() -> str:
+     claimant = fake.name()
+     ref = f"CLM-{random.randint(200000,999999)}"
+     doctor = f"Dr {fake.last_name()}"
+     return (
+         f"MEDICO-LEGAL REPORT\n"
+         f"Claim Reference: {ref}\n"
+         f"Claimant: {claimant}\n"
+         f"Examining Doctor: {doctor}, {random.choice(['GP','Orthopaedic Consultant','Neurologist','Psychiatrist'])}\n"
+         f"Date of Examination: {fake.date_between(start_date='-3m', end_date='today').isoformat()}\n"
+         f"Date of Accident: {fake.date_between(start_date='-1y', end_date='-3m').isoformat()}\n\n"
+         f"HISTORY: {fake.paragraph(nb_sentences=3)}\n\n"
+         f"EXAMINATION FINDINGS: {fake.paragraph(nb_sentences=3)}\n\n"
+         f"DIAGNOSIS: {random.choice(['Whiplash Associated Disorder Grade II','Lumbar disc protrusion','Fractured clavicle — healed','Adjustment disorder with anxiety','Soft tissue injury — resolving','Post-traumatic stress disorder — moderate'])}\n\n"
+         f"PROGNOSIS: Recovery expected within {random.choice(['3-6 months','6-12 months','12-18 months','Ongoing — chronic'])}.\n"
+         f"Employment Impact: {random.choice(['None','2 weeks off work','4 weeks reduced duties','Ongoing inability to work'])}\n"
+     )
+ 
+ 
+ def _gen_fca_letter() -> str:
+     insurer = random.choice(UK_INSURERS)
+     return (
+         f"FINANCIAL CONDUCT AUTHORITY\n"
+         f"25 The North Colonnade, London E14 5HS\n\n"
+         f"To: Chief Executive, {insurer}\n"
+         f"Date: {fake.date_between(start_date='-1y', end_date='today').isoformat()}\n\n"
+         f"Dear Sir/Madam,\n\n"
+         f"RE: {random.choice(['Section 166 Skilled Person Review','Dear CEO letter — General Insurance Pricing','Thematic Review — Claims Handling Practices','Consumer Duty Implementation Assessment','Complaints Handling Review'])}\n\n"
+         f"{fake.paragraph(nb_sentences=5)}\n\n"
+         f"We require your response by {fake.date_between(start_date='today', end_date='+60d').isoformat()}.\n\n"
+         f"Yours faithfully,\n"
+         f"Director of Insurance Supervision\n"
+         f"Financial Conduct Authority\n"
+     )
+ 
+ 
+ def _gen_risk_survey() -> str:
+     surveyor = fake.name()
+     region_name, region = random.choice(list(UK_REGIONS.items()))
+     return (
+         f"COMMERCIAL RISK SURVEY REPORT\n"
+         f"Surveyor: {surveyor} — {random.choice(['Zurich Risk Engineering','AXA Risk Consulting','RSA Risk Control','Aviva Risk Management'])}\n"
+         f"Property: {fake.company()} — {fake.address().replace(chr(10), ', ')}\n"
+         f"Date: {fake.date_between(start_date='-6m', end_date='today').isoformat()}\n\n"
+         f"OCCUPANCY: {random.choice(['Office','Warehouse','Retail','Manufacturing','Restaurant','Hotel'])}\n"
+         f"CONSTRUCTION: {random.choice(['Brick/tile','Steel frame/composite','Timber frame','Concrete'])}\n"
+         f"FIRE PROTECTION: {random.choice(['Sprinklers — full','Sprinklers — partial','Extinguishers only','None'])}\n"
+         f"SECURITY: {random.choice(['Intruder alarm — monitored','CCTV + alarm','Basic locks only','Security guard 24/7'])}\n"
+         f"FLOOD RISK: {random.choice(['Zone 1 — minimal','Zone 2 — low','Zone 3a — moderate','Zone 3b — high'])}\n\n"
+         f"RECOMMENDATIONS:\n"
+         f"1. {fake.sentence()}\n"
+         f"2. {fake.sentence()}\n"
+         f"3. {fake.sentence()}\n\n"
+         f"OVERALL RISK GRADE: {random.choice(['A — Excellent','B — Good','C — Average','D — Below Average','E — Poor'])}\n"
+     )
+ 
+ 
+ def _gen_slip() -> str:
+     syndicate = random.choice(LLOYDS_SYNDICATES)
+     broker = random.choice(["Aon", "Marsh", "WTW", "Howden", "Lockton", "Gallagher"])
+     return (
+         f"LLOYD'S MARKET PLACING SLIP\n"
+         f"UMR: B{random.randint(1000,9999)}{random.choice('ABCDEFGH')}{random.randint(10000,99999)}\n"
+         f"Broker: {broker}\n"
+         f"Lead Underwriter: {syndicate}\n\n"
+         f"ASSURED: {fake.company()}\n"
+         f"PERIOD: {fake.date_between(start_date='today', end_date='+30d').isoformat()} to "
+         f"{fake.date_between(start_date='+365d', end_date='+395d').isoformat()}\n"
+         f"TYPE: {random.choice(['Property All Risks','General Liability','Professional Indemnity','Cyber','Marine Cargo','D&O'])}\n"
+         f"LIMIT: {random.choice(['£1,000,000','£2,500,000','£5,000,000','£10,000,000'])} any one occurrence\n"
+         f"DEDUCTIBLE: {random.choice(['£10,000','£25,000','£50,000','£100,000'])}\n"
+         f"RATE: {random.uniform(0.1, 2.5):.3f}%\n"
+         f"PREMIUM: £{random.randint(10000, 500000):,}\n"
+         f"LEAD LINE: {random.randint(10, 40)}%\n"
+         f"FOLLOW CAPACITY: {random.choice(['Fully placed','85% placed — seeking balance','Open — marketing'])}\n"
+     )
+ 
+ 
+ # ── Generator map ──
+ 
+ DOC_GENERATORS = {
+     "Policy Schedule": _gen_policy_schedule,
+     "Claim Form / FNOL": _gen_claim_form,
+     "Endorsement": _gen_endorsement,
+     "Loss Adjuster Report": _gen_loss_adjuster_report,
+     "Bordereaux": _gen_bordereaux,
+     "Renewal Notice": _gen_renewal_invite,
+     "Subrogation Letter": _gen_subrogation_letter,
+     "Complaint": _gen_complaint_letter,
+     "Medical Report": _gen_medical_report,
+     "Regulatory Correspondence": _gen_fca_letter,
+     "Risk Survey": _gen_risk_survey,
+     "Lloyd's Slip": _gen_slip,
+ }
+ 
+ 
+ def generate_document_dataset(n: int = 10000, output_path: str = "data/output/insurance_docs_10k.jsonl"):
+     """Generate n labelled document classification examples."""
+     os.makedirs(os.path.dirname(output_path), exist_ok=True)
+ 
+     labels = list(DOC_GENERATORS.keys())
+     per_label = n // len(labels)
+     remainder = n % len(labels)
+ 
+     records = []
+     for i, (label, gen_fn) in enumerate(DOC_GENERATORS.items()):
+         count = per_label + (1 if i < remainder else 0)
+         for _ in tqdm(range(count), desc=f"Docs — {label}"):
+             records.append({
+                 "text": gen_fn(),
+                 "label": label,
+                 "label_id": labels.index(label),
+             })
+ 
+     random.shuffle(records)
+ 
+     with open(output_path, "w") as f:
+         for rec in records:
+             f.write(json.dumps(rec, ensure_ascii=False) + "\n")
+ 
+     print(f"\n✓ Generated {len(records)} document classification examples → {output_path}")
+     from collections import Counter
+     dist = Counter(r["label"] for r in records)
+     for lab, count in sorted(dist.items()):
+         print(f"  {lab}: {count}")
+ 
+     return output_path
+ 
+ 
+ if __name__ == "__main__":
+     generate_document_dataset()
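+ 
+ 
+ # ── Illustrative sketch (assumes the HuggingFace `datasets` library is
+ # installed): loading the generated JSONL and deriving the label maps a
+ # classifier head needs. The label_id values written above follow
+ # DOC_GENERATORS key order, so the mappings below agree with the file by
+ # construction.
+ def _example_label_maps(path: str = "data/output/insurance_docs_10k.jsonl"):
+     from datasets import load_dataset
+ 
+     ds = load_dataset("json", data_files=path, split="train")
+     labels = list(DOC_GENERATORS.keys())
+     label2id = {lab: i for i, lab in enumerate(labels)}
+     id2label = {i: lab for lab, i in label2id.items()}
+     assert ds[0]["label_id"] == label2id[ds[0]["label"]]
+     return label2id, id2label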
data/gen_dpo.py ADDED
@@ -0,0 +1,375 @@
+ """
+ InsureOS — Synthetic DPO Preference Data Generator
+ Generates chosen/rejected pairs for Direct Preference Optimization.
+ Chosen = FCA-compliant, accurate, plain English
+ Rejected = Non-compliant, hallucinated, jargon-heavy, or unfair
+ """
+ 
+ import json
+ import random
+ import os
+ from pathlib import Path
+ 
+ from faker import Faker
+ from tqdm import tqdm
+ 
+ from data.constants import (
+     UK_INSURERS, FCA_REFERENCES, CLAIM_TYPES, INSURANCE_JARGON,
+     DPO_PREFERENCE_DIMENSIONS,
+ )
+ 
+ fake = Faker("en_GB")
+ Faker.seed(43)
+ random.seed(43)
+ 
+ 
+ def _rand_gbp(low: int, high: int) -> str:
+     return f"£{random.randint(low, high):,}"
+ 
+ 
+ def _rand_claim_number() -> str:
+     return f"CLM-{random.randint(200000, 999999)}"
+ 
+ 
+ def gen_fca_consumer_duty_pair() -> dict:
+     """Chosen: consumer-fair response. Rejected: opaque, unhelpful response."""
+     customer = fake.name()
+     insurer = random.choice(UK_INSURERS)
+     claim_num = _rand_claim_number()
+     claim_value = random.randint(2000, 30000)
+ 
+     prompt = (
+         f"{customer} has submitted a home insurance claim ({claim_num}) for escape of water "
+         f"damage valued at £{claim_value:,}. The claim has been open for 6 weeks with no update. "
+         f"The customer calls asking for a progress update. Draft the response."
+     )
+ 
+     chosen = (
+         f"Thank you for calling, {customer}. I sincerely apologise for the delay in updating you "
+         f"on claim {claim_num} — you should have received regular updates, and I'm sorry that hasn't happened.\n\n"
+         f"Here's where we are:\n"
+         f"- Our loss adjuster visited on [date] and their report is now with our assessment team\n"
+         f"- We expect to have a decision within the next 5 working days\n"
+         f"- I'm assigning myself as your dedicated handler so you have one person to contact\n\n"
+         f"My direct number is [number] and email is [email]. I'll call you by [specific date] with an update, "
+         f"even if we're still working on it.\n\n"
+         f"I understand this is stressful, especially when you're living with the damage. Is there anything "
+         f"else I can help with right now? If the damage is causing any issues with living in the property, "
+         f"please let me know and we can look at interim measures.\n\n"
+         f"If you're unhappy with how we've handled this, you have the right to make a formal complaint, "
+         f"and if we can't resolve it, you can refer to the Financial Ombudsman Service free of charge."
+     )
+ 
+     rejected = (
+         f"As per your policy reference, claim {claim_num} is currently under assessment. "
+         f"The claims process can take up to 90 days as stipulated in Section 4, Clause 7(b) of your "
+         f"policy schedule. We will contact you in due course when a determination has been made "
+         f"regarding the indemnification of your peril-related loss under the escape of water cover "
+         f"provision. No further action is required from you at this time. Please do not contact us "
+         f"again as this may delay the process. Ref: FCA/ICOBS/8.1.1R. Claims Department."
+     )
+ 
+     return {
+         "prompt": prompt,
+         "chosen": chosen,
+         "rejected": rejected,
+         "dimension": "fca_consumer_duty",
+     }
+ 
+ 
+ def gen_accuracy_pair() -> dict:
+     """Chosen: factually accurate. Rejected: hallucinated facts."""
+     customer = fake.name()
+     line = "Motor Private Car"
+     excess = random.choice([250, 350, 500])
+     claim_amount = random.randint(3000, 15000)
+ 
+     prompt = (
+         f"{customer} asks: 'My car was hit in a car park while I was shopping. "
+         f"The other driver left a note with their details. My excess is £{excess}. "
+         f"How does the claims process work and will I get my excess back?'"
+     )
+ 
+     chosen = (
+         f"Here's how this works:\n\n"
+         f"1. **Report the claim** to your insurer with the other driver's details from the note\n"
+         f"2. **Your insurer assesses** the damage and arranges repairs\n"
+         f"3. **You pay your excess** (£{excess}) upfront when repairs are done\n"
+         f"4. **Subrogation** — because the other driver was at fault, your insurer will pursue them "
+         f"(or their insurer) to recover the claim cost\n"
+         f"5. **If recovery is successful**, you get your £{excess} excess back\n\n"
+         f"**Important points:**\n"
+         f"- This should be a non-fault claim, so it shouldn't affect your no-claims discount (NCD) "
+         f"once the other party's insurer admits liability\n"
+         f"- However, while the claim is open, your NCD may be temporarily affected at renewal\n"
+         f"- Keep the note safe — it's your key evidence\n"
+         f"- If the other driver disputes fault, your insurer handles the negotiation\n"
+         f"- Typical timeline: 6-12 weeks for a straightforward non-fault claim"
+     )
+ 
+     rejected = (
+         f"Since the other driver left a note, this is automatically a non-fault claim and you won't "
+         f"need to pay any excess at all. Your insurer is legally required to waive the excess under "
+         f"Section 152 of the Road Traffic Act 1998 when fault is admitted in writing. Your no-claims "
+         f"bonus is protected by law in this scenario and cannot be affected. The other driver's insurer "
+         f"must pay within 14 days under the Motor Insurers' Bureau agreement. You should receive "
+         f"£{claim_amount:,} plus £{excess} compensation directly to your bank account."
+     )
+ 
+     return {
+         "prompt": prompt,
+         "chosen": chosen,
+         "rejected": rejected,
+         "dimension": "accuracy",
+     }
+ 
+ 
+ def gen_regulatory_compliance_pair() -> dict:
+     """Chosen: compliant advice. Rejected: non-compliant recommendation."""
+     mga = random.choice(["Plexus MGA", "Pen Underwriting", "Three Sixty Underwriting"])
+     syndicate = random.choice(["Syndicate 2623", "Syndicate 2001", "Syndicate 4444"])
+ 
+     prompt = (
+         f"As an MGA ({mga}), we want to speed up our claims settlement process. "
+         f"Our current average is 45 days. Our capacity provider ({syndicate}) has asked us "
+         f"to settle faster. What can we do?"
+     )
+ 
+     chosen = (
+         f"Here are compliant approaches to reduce settlement time:\n\n"
+         f"**Process improvements:**\n"
+         f"1. Implement automated FNOL with structured data capture — reduces manual re-keying by 60%\n"
+         f"2. Use AI-assisted document extraction for supporting evidence (photos, invoices, reports)\n"
+         f"3. Create fast-track thresholds: auto-settle claims under £1,000 with documented approval rules\n"
+         f"4. Set up same-day payment processing for approved settlements\n\n"
+         f"**Important compliance guardrails:**\n"
+         f"- All fast-track settlements must still comply with {FCA_REFERENCES['claims_handling']}\n"
+         f"- Even automated decisions need a human review mechanism for customers to challenge\n"
+         f"- Under {FCA_REFERENCES['consumer_duty']}, faster isn't always better — ensure the settlement "
+         f"amount is fair, not just fast\n"
+         f"- Data sharing between you and {syndicate} must comply with {FCA_REFERENCES['gdpr']} — "
+         f"ensure your data processing agreement covers automated decision-making\n"
+         f"- Maintain full audit trail for {syndicate}'s regulatory reporting\n"
+         f"- Check your binding authority agreement for any settlement authority limits\n\n"
+         f"**Target:** 15-20 days average for straightforward claims is achievable with these changes."
+     )
+ 
+     rejected = (
+         f"The fastest way to settle claims is to just approve everything under £5,000 automatically "
+         f"without investigation. Most small claims are genuine anyway. You can also skip the loss adjuster "
+         f"for any claim under £10,000 — they just slow things down. For larger claims, just offer 70% of "
+         f"the claimed amount as a quick settlement — most people will accept rather than wait. "
+         f"Don't worry too much about documentation, {syndicate} mainly cares about the numbers. "
+         f"If customers complain, tell them the settlement is final and they signed the policy terms."
+     )
+ 
+     return {
+         "prompt": prompt,
+         "chosen": chosen,
+         "rejected": rejected,
+         "dimension": "regulatory_compliance",
+     }
+ 
+ 
+ def gen_plain_english_pair() -> dict:
+     """Chosen: plain English. Rejected: impenetrable jargon."""
+     customer = fake.name()
+     insurer = random.choice(UK_INSURERS)
+ 
+     prompt = (
+         f"{customer} has received a letter from {insurer} declining part of their "
+         f"home insurance claim. They don't understand why. Explain the partial decline."
+     )
+ 
+     chosen = (
+         f"Dear {customer},\n\n"
+         f"I understand the letter from {insurer} was confusing. Let me explain what happened "
+         f"in straightforward terms.\n\n"
+         f"**What was claimed:** Your claim was for water damage to your kitchen ceiling and "
+         f"the cost of fixing the leaking pipe that caused it.\n\n"
+         f"**What's being paid:** The damage to your ceiling — £2,800 for replastering and "
+         f"redecorating, minus your £250 excess = £2,550 being paid to you.\n\n"
+         f"**What's NOT being paid (and why):** The cost of repairing the pipe itself (£450). "
+         f"This is because your policy covers **damage caused by** the leak, but not the repair "
+         f"of the pipe that leaked. Think of it like this — if a tree falls on your roof, the "
+         f"insurance covers fixing the roof, but not removing the tree from the garden.\n\n"
+         f"This is a standard term in almost all home insurance policies. It's not unique to {insurer}.\n\n"
+         f"**What you can do:**\n"
+         f"- If you think this is wrong, call us and we'll review\n"
+         f"- If you're still unhappy, you can make a formal complaint\n"
+         f"- If we can't resolve it, the Financial Ombudsman Service is free to use\n\n"
+         f"Is there anything else I can clarify?"
+     )
+ 
+     rejected = (
+         f"Re: Policy {random.randint(100000, 999999)} — Partial Declinature Notification\n\n"
+         f"Dear Policyholder,\n\n"
+         f"Further to our loss adjuster's report dated [date] and pursuant to the terms, conditions, "
+         f"and exclusions set forth in your policy schedule (Section 2, Subsection 4(a)(iii)), we have "
+         f"determined that the proximate cause of the peril giving rise to the consequential damage falls "
+         f"within the indemnifiable perils as enumerated in the Operative Clause; however, the antecedent "
+         f"mechanical failure of the plumbing infrastructure constitutes a maintenance obligation under "
+         f"General Condition 7(b) — Maintenance and Reasonable Precautions, and is therefore excluded "
+         f"from indemnification under the aforegoing policy provisions. The quantum of the remaining "
+         f"indemnifiable loss, net of the contractual voluntary excess, has been calculated on an "
+         f"indemnity basis per the Basis of Settlement clause. No betterment has been applied. "
+         f"Please refer to your policy booklet, pages 47-62, for the complete exclusionary provisions. "
+         f"Yours faithfully, Claims Department."
+     )
+ 
+     return {
+         "prompt": prompt,
+         "chosen": chosen,
+         "rejected": rejected,
+         "dimension": "plain_english",
+     }
+ 
+ 
+ def gen_data_protection_pair() -> dict:
+     """Chosen: GDPR-safe. Rejected: casually leaks PII."""
+     customer_a = fake.name()
+     customer_b = fake.name()
+     insurer = random.choice(UK_INSURERS)
+ 
+     prompt = (
+         f"A third party ({customer_b}) calls {insurer} asking about the status of a "
+         f"motor insurance claim they've made against {customer_a}'s policy. "
+         f"{customer_b} says they were the other driver in the accident and wants to know "
+         f"if the claim has been accepted and when they'll be paid. How should you respond?"
+     )
+ 
+     chosen = (
+         f"I understand you're waiting for an update on your claim, and I appreciate your patience.\n\n"
+         f"However, I'm unable to share any details about the claim status or the other "
+         f"party's policy with you, as this is protected by data protection law "
+         f"({FCA_REFERENCES['gdpr']}).\n\n"
+         f"**What you should do:**\n"
+         f"1. Contact **your own insurer** — they can liaise with us on your behalf through "
+         f"the standard inter-insurer claims process\n"
+         f"2. If you don't have insurance (uninsured loss), you can pursue the claim through "
+         f"the other party's insurer, but we can only correspond with you about YOUR claim "
+         f"against our policyholder — we cannot discuss our policyholder's policy details\n"
+         f"3. We can confirm we've received notification of the incident — but nothing more\n\n"
+         f"If you provide your contact details and your own insurer's reference, we can ensure "
+         f"any correspondence is directed appropriately.\n\n"
+         f"I know this might feel frustrating, but these protections exist for everyone, "
+         f"including you if someone were to call about your policy."
+     )
+ 
+     rejected = (
+         f"Yes, {customer_a}'s claim has been accepted. Their policy number is [number] and "
+         f"they have fully comprehensive cover with us. The claim was valued at £4,500 and we've "
+         f"already approved it. {customer_a} lives at {fake.address()} and has been with us for "
+         f"3 years. We'll be settling the third party element with you directly. Can I take your "
+         f"bank details and I'll process the payment? The settlement should be in your account "
+         f"within 5 working days."
+     )
+ 
+     return {
+         "prompt": prompt,
+         "chosen": chosen,
+         "rejected": rejected,
+         "dimension": "data_protection",
+     }
+ 
+ 
+ def gen_fair_pricing_pair() -> dict:
+     """Chosen: non-discriminatory pricing. Rejected: uses protected characteristics."""
+     insurer = random.choice(UK_INSURERS)
+ 
+     prompt = (
+         f"As a pricing actuary at {insurer}, you're asked to review a new rating factor "
+         f"proposal for motor insurance. The data science team wants to add 'first name' as "
+         f"a predictive feature because it improves the model's loss ratio prediction by 3%. "
+         f"Should we include it?"
+     )
+ 
+     chosen = (
+         f"**Recommendation: DO NOT include 'first name' as a rating factor.**\n\n"
+         f"**Why it's statistically predictive:** First names correlate with age, gender, ethnicity, "
+         f"and socioeconomic background. The 3% improvement is almost certainly because the model is "
+         f"using first name as a proxy for one or more protected characteristics.\n\n"
+         f"**Why it's not allowed:**\n\n"
+         f"1. **Gender proxy** — under EU/UK law (Test-Achats ruling + {FCA_REFERENCES['equality_act']}), "
+         f"gender cannot be used as a rating factor. First names like 'Sarah' vs 'James' directly "
+         f"proxy gender.\n\n"
+         f"2. **Ethnicity proxy** — names like 'Mohammed' or 'Priya' correlate with ethnicity. "
+         f"Using these would constitute indirect race discrimination under the {FCA_REFERENCES['equality_act']}.\n\n"
+         f"3. **Age proxy** — naming trends change over decades (e.g., 'Ethel' suggests older, "
+         f"'Jayden' suggests younger). This could amount to indirect age discrimination.\n\n"
+         f"4. **FCA enforcement** — the FCA explicitly monitors for proxy discrimination in pricing "
+         f"under {FCA_REFERENCES['pricing_practices']}. Using first name would trigger supervisory action.\n\n"
+         f"**What to do instead:**\n"
+         f"- Review what legitimate risk factors the 3% improvement might be capturing\n"
+         f"- Can you add those factors directly (e.g., vehicle type, postcode, occupation)?\n"
+         f"- Document this decision for your fair pricing audit trail"
+     )
+ 
+     rejected = (
+         f"The 3% improvement is significant and would give us a competitive edge. First name is "
+         f"publicly available information, so there's no GDPR issue. We should include it. "
+         f"Technically, we're not using gender directly — we're using a name, which is different. "
+         f"If the regulator asks, we can explain it's a legitimate predictor based on claims data. "
+         f"Lots of insurers use similar features, they just don't talk about it. The model is only "
+         f"reflecting real-world risk patterns, and it's up to the market to price accurately."
+     )
+ 
+     return {
+         "prompt": prompt,
+         "chosen": chosen,
+         "rejected": rejected,
+         "dimension": "fair_pricing",
+     }
+ 
+ 
+ # ── Master Generator ──
+ 
+ DPO_GENERATORS = [
+     gen_fca_consumer_duty_pair,
+     gen_accuracy_pair,
+     gen_regulatory_compliance_pair,
+     gen_plain_english_pair,
+     gen_data_protection_pair,
+     gen_fair_pricing_pair,
+ ]
+ 
+ 
+ def generate_dpo_dataset(n: int = 5000, output_path: str = "data/output/insurance_dpo_5k.jsonl"):
+     """Generate n DPO preference pairs."""
+     os.makedirs(os.path.dirname(output_path), exist_ok=True)
+ 
+     per_gen = n // len(DPO_GENERATORS)
+     remainder = n % len(DPO_GENERATORS)
+ 
+     records = []
+     for i, gen_fn in enumerate(DPO_GENERATORS):
+         count = per_gen + (1 if i < remainder else 0)
+         for _ in tqdm(range(count), desc=f"DPO — {gen_fn.__name__}"):
+             pair = gen_fn()
+             records.append({
+                 "prompt": [
+                     {"role": "system", "content": "You are InsureLLM, a specialist UK insurance AI assistant."},
+                     {"role": "user", "content": pair["prompt"]},
+                 ],
+                 "chosen": [{"role": "assistant", "content": pair["chosen"]}],
+                 "rejected": [{"role": "assistant", "content": pair["rejected"]}],
+                 "dimension": pair["dimension"],
+             })
+ 
+     random.shuffle(records)
+ 
+     with open(output_path, "w") as f:
+         for record in records:
+             f.write(json.dumps(record, ensure_ascii=False) + "\n")
+ 
+     print(f"\n✓ Generated {len(records)} DPO preference pairs → {output_path}")
+     from collections import Counter
+     dist = Counter(r["dimension"] for r in records)
+     for dim, count in sorted(dist.items()):
+         print(f"  {dim}: {count}")
+ 
+     return output_path
+ 
+ 
+ if __name__ == "__main__":
+     generate_dpo_dataset()
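+ 
+ 
+ # ── Illustrative sketch (assumes the `datasets` library is installed): the
+ # records written above already use a conversational preference layout, i.e.
+ # prompt is a list of chat messages and chosen/rejected are single assistant
+ # turns, which is the shape preference-tuning frameworks such as TRL's
+ # DPOTrainer expect.
+ def _example_load_dpo(path: str = "data/output/insurance_dpo_5k.jsonl"):
+     from datasets import load_dataset
+ 
+     ds = load_dataset("json", data_files=path, split="train")
+     assert {"prompt", "chosen", "rejected"} <= set(ds.column_names)
+     return ds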
data/gen_ner.py ADDED
@@ -0,0 +1,258 @@
+ """
+ InsureOS — Synthetic NER (Named Entity Recognition) Data Generator
+ Generates 8K token-labelled insurance text examples in IOB2 format for ModernBERT NER.
+ """
+ 
+ import json
+ import os
+ import random
+ from datetime import timedelta
+ 
+ from faker import Faker
+ from tqdm import tqdm
+ 
+ from data.constants import (
+     UK_INSURERS, LLOYDS_SYNDICATES, MGAS, UK_REGIONS,
+     NER_ENTITY_TYPES, FCA_REFERENCES,
+ )
+ 
+ fake = Faker("en_GB")
+ Faker.seed(46)
+ random.seed(46)
+ 
+ # Entity types with IOB2 labels:
+ # PERSON, ORG, INSURER, MGA, SYNDICATE, POLICY_NUMBER, CLAIM_NUMBER,
+ # MONEY, DATE, POSTCODE, LOB, REGULATION, PERIL, VEHICLE, ADDRESS
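+ #
+ # Worked IOB2 illustration (added for clarity; not generator output): for the
+ # sentence "Jane Smith reported a burst pipe claim to Aviva" the token labels
+ # would be:
+ #   Jane -> B-PERSON, Smith -> I-PERSON, reported -> O, a -> O,
+ #   burst -> B-PERIL, pipe -> I-PERIL, claim -> O, to -> O, Aviva -> B-INSURER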
+
+
+ def _postcode() -> str:
+     region_name, region = random.choice(list(UK_REGIONS.items()))
+     prefix = random.choice(region)
+     return f"{prefix}{random.randint(1,29)} {random.randint(1,9)}{random.choice('ABCDEFGHJKLMNPRSTUVWXY')}{random.choice('ABCDEFGHJKLMNPRSTUVWXY')}"
+
+
+ def _policy_ref() -> str:
+     return f"POL-{random.randint(100000, 999999)}"
+
+
+ def _claim_ref() -> str:
+     return f"CLM-{random.randint(200000, 999999)}"
+
+
+ def _amount() -> str:
+     val = random.choice([
+         random.randint(100, 999),
+         random.randint(1000, 9999),
+         random.randint(10000, 99999),
+         random.randint(100000, 999999),
+     ])
+     return f"Β£{val:,}"
+
+
+ def _date_str() -> str:
+     d = fake.date_between(start_date="-3y", end_date="+1y")
+     return d.strftime(random.choice(["%d/%m/%Y", "%d %B %Y", "%Y-%m-%d"]))
+
+
+ def _vehicle() -> str:
+     makes = ["Ford Fiesta", "VW Golf", "BMW 3 Series", "Toyota Yaris", "Kia Sportage",
+              "Vauxhall Corsa", "Mercedes A-Class", "Tesla Model 3", "Nissan Qashqai", "Audi A3"]
+     return random.choice(makes)
+
+
+ def _peril() -> str:
+     return random.choice([
+         "escape of water", "storm damage", "theft", "fire", "flood",
+         "accidental damage", "subsidence", "malicious damage", "collision",
+         "burst pipe", "lightning strike", "impact damage", "vandalism",
+     ])
+
+
+ def _regulation() -> str:
+     return random.choice(list(FCA_REFERENCES.values()) + [
+         "ICOBS 8.1.1R", "DISP 1.3", "PRIN 2A", "Consumer Duty",
+         "FCA PS21/5", "Equality Act 2010", "GDPR Article 6",
+     ])
+
+
+ def _lob() -> str:
+     return random.choice([
+         "motor insurance", "home insurance", "commercial combined",
+         "employers' liability", "public liability", "professional indemnity",
+         "property insurance", "cyber insurance", "D&O insurance",
+     ])
+
+
+ # ── Sentence templates with entity slots ──
+
+ TEMPLATES = [
+     # 0 β€” claim notification
+     lambda: _build(
+         "{PERSON} reported a {PERIL} claim ({CLAIM_NUMBER}) on {DATE}. "
+         "The loss occurred at {POSTCODE} and is covered under {LOB} policy {POLICY_NUMBER} "
+         "with {INSURER}. Estimated value: {MONEY}."
+     ),
+     # 1 β€” subrogation
+     lambda: _build(
+         "{INSURER} is pursuing subrogation recovery of {MONEY} against {ORG} "
+         "in respect of claim {CLAIM_NUMBER} dated {DATE}. "
+         "The policyholder {PERSON} resides at {POSTCODE}."
+     ),
+     # 2 β€” Lloyd's placement
+     lambda: _build(
+         "{SYNDICATE} has written a {MONEY} line on the {LOB} facility "
+         "brokered for {ORG} by {MGA}. Inception date {DATE}."
+     ),
+     # 3 β€” regulatory
+     lambda: _build(
+         "Under {REGULATION}, {INSURER} must provide {PERSON} with a final response "
+         "to their {PERIL} claim ({CLAIM_NUMBER}) by {DATE}. "
+         "The claim value is {MONEY}."
+     ),
+     # 4 β€” renewal
+     lambda: _build(
+         "{PERSON}'s {LOB} policy {POLICY_NUMBER} with {INSURER} is due for renewal on {DATE}. "
+         "Current premium: {MONEY}. Property at {POSTCODE}."
+     ),
+     # 5 β€” vehicle claim
+     lambda: _build(
+         "{PERSON} was driving a {VEHICLE} when the {PERIL} incident occurred on {DATE} "
+         "near {POSTCODE}. Claim {CLAIM_NUMBER} has been opened with {INSURER} for {MONEY}."
+     ),
+     # 6 β€” MGA bordereaux
+     lambda: _build(
+         "{MGA} submitted the {DATE} bordereaux to {SYNDICATE} showing {MONEY} GWP "
+         "across {LOB} business. Contact: {PERSON}."
+     ),
+     # 7 β€” complaint
+     lambda: _build(
+         "{PERSON} has filed a complaint against {INSURER} regarding claim {CLAIM_NUMBER}. "
+         "Per {REGULATION}, we must respond by {DATE}. Claim relates to {PERIL} at {POSTCODE}. "
+         "Amount disputed: {MONEY}."
+     ),
+     # 8 β€” loss adjuster
+     lambda: _build(
+         "Loss adjuster {PERSON} from {ORG} inspected the {PERIL} damage at {POSTCODE} on {DATE}. "
+         "They recommend a settlement of {MONEY} on claim {CLAIM_NUMBER} under {LOB} cover."
+     ),
+     # 9 β€” medical
+     lambda: _build(
+         "Dr {PERSON} examined the claimant in connection with claim {CLAIM_NUMBER} "
+         "dated {DATE}. The {PERIL} incident at {POSTCODE} resulted in injuries. "
+         "{INSURER} has reserved {MONEY} under the {LOB} policy."
+     ),
+     # 10 β€” endorsement
+     lambda: _build(
+         "Endorsement applied to {POLICY_NUMBER}: {PERSON} has changed vehicle to {VEHICLE}. "
+         "Effective {DATE}. Additional premium: {MONEY}. Insurer: {INSURER}."
+     ),
+     # 11 β€” fraud referral
+     lambda: _build(
+         "Claim {CLAIM_NUMBER} by {PERSON} for {PERIL} ({MONEY}) has been referred to the fraud team. "
+         "Policy {POLICY_NUMBER} with {INSURER} started on {DATE}. "
+         "Property postcode: {POSTCODE}. Cf. {REGULATION}."
+     ),
+ ]
+
+
+ def _build(template: str) -> tuple[list[str], list[str]]:
+     """Fill template slots and return (tokens, iob_tags)."""
+     # Generate entity values
+     entities = {
+         "PERSON": fake.name(),
+         "ORG": fake.company(),
+         "INSURER": random.choice(UK_INSURERS),
+         "MGA": random.choice(MGAS),
+         "SYNDICATE": random.choice(LLOYDS_SYNDICATES),
+         "POLICY_NUMBER": _policy_ref(),
+         "CLAIM_NUMBER": _claim_ref(),
+         "MONEY": _amount(),
+         "DATE": _date_str(),
+         "POSTCODE": _postcode(),
+         "LOB": _lob(),
+         "REGULATION": _regulation(),
+         "PERIL": _peril(),
+         "VEHICLE": _vehicle(),
+     }
+
+     # Parse template to get ordered (text_fragment, entity_type) pairs
+     tokens = []
+     tags = []
+
+     remaining = template
+     while remaining:
+         # Find next entity slot
+         best_pos = len(remaining)
+         best_key = None
+         for key in entities:
+             marker = "{" + key + "}"
+             pos = remaining.find(marker)
+             if pos != -1 and pos < best_pos:
+                 best_pos = pos
+                 best_key = key
+
+         if best_key is None:
+             # No more entities β€” tokenize remaining text
+             for tok in remaining.split():
+                 tokens.append(tok)
+                 tags.append("O")
+             break
+
+         marker = "{" + best_key + "}"
+
+         # Text before entity
+         before = remaining[:best_pos]
+         for tok in before.split():
+             if tok:
+                 tokens.append(tok)
+                 tags.append("O")
+
+         # Entity tokens
+         entity_value = entities[best_key]
+         entity_tokens = entity_value.split()
+         for j, etok in enumerate(entity_tokens):
+             tokens.append(etok)
+             tags.append(f"B-{best_key}" if j == 0 else f"I-{best_key}")
+
+         remaining = remaining[best_pos + len(marker):]
+
+     return tokens, tags
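+
+ # Illustrative trace (values are random at runtime): for the template
+ # "{PERSON} filed claim {CLAIM_NUMBER}." with PERSON="Jane Smith" and
+ # CLAIM_NUMBER="CLM-412345", _build returns
+ #   tokens = ["Jane", "Smith", "filed", "claim", "CLM-412345", "."]
+ #   tags   = ["B-PERSON", "I-PERSON", "O", "O", "B-CLAIM_NUMBER", "O"]
+ # (the leftover "." after the marker is emitted as its own O token).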
+
+
+ def generate_ner_dataset(n: int = 8000, output_path: str = "data/output/insurance_ner_8k.jsonl"):
+     """Generate n NER examples in token-level IOB2 format."""
+     os.makedirs(os.path.dirname(output_path), exist_ok=True)
+
+     records = []
+     for _ in tqdm(range(n), desc="NER examples"):
+         gen_fn = random.choice(TEMPLATES)
+         tokens, tags = gen_fn()
+         records.append({
+             "tokens": tokens,
+             "ner_tags": tags,
+             "text": " ".join(tokens),
+         })
+
+     random.shuffle(records)
+
+     with open(output_path, "w") as f:
+         for rec in records:
+             f.write(json.dumps(rec, ensure_ascii=False) + "\n")
+
+     # Stats
+     all_tags = set()
+     for rec in records:
+         all_tags.update(rec["ner_tags"])
+     entity_tags = sorted(t for t in all_tags if t != "O")
+
+     print(f"\nβœ“ Generated {len(records)} NER examples β†’ {output_path}")
+     print(f" Entity types found: {len(entity_tags)}")
+     for t in entity_tags:
+         count = sum(1 for rec in records for tag in rec["ner_tags"] if tag == t)
+         print(f" {t}: {count}")
+
+     return output_path
+
+
+ if __name__ == "__main__":
+     generate_ner_dataset()
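
A quick IOB2 sanity check that can be run over the generated file (a sketch using the default `output_path` above):

```python
import json

path = "data/output/insurance_ner_8k.jsonl"  # default output of generate_ner_dataset()

with open(path, encoding="utf-8") as f:
    for i, line in enumerate(f):
        rec = json.loads(line)
        tokens, tags = rec["tokens"], rec["ner_tags"]
        assert len(tokens) == len(tags), f"length mismatch in record {i}"
        prev = "O"
        for tag in tags:
            # an I-X tag must continue a B-X/I-X span of the same entity type
            if tag.startswith("I-") and prev[2:] != tag[2:]:
                raise ValueError(f"orphan {tag} in record {i}")
            prev = tag

print("IOB2 check passed")
```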
data/gen_sft.py ADDED
@@ -0,0 +1,1192 @@
+ """
+ InsureOS β€” Synthetic SFT Data Generator
+ Generates 10K instruction-response pairs for UK insurance fine-tuning.
+ 100% synthetic β€” no real PII, no real policy data.
+ """
+
+ import json
+ import random
+ import os
+ from datetime import datetime, timedelta
+ from pathlib import Path
+
+ from faker import Faker
+ from tqdm import tqdm
+
+ from data.constants import (
+     LINES_OF_BUSINESS, UK_REGIONS, UK_INSURERS, LLOYDS_SYNDICATES,
+     MGA_NAMES, CLAIM_TYPES, FCA_REFERENCES, POLICY_SECTIONS,
+     INSURANCE_JARGON, SFT_TASK_CATEGORIES,
+ )
+
+ fake = Faker("en_GB")
+ Faker.seed(42)
+ random.seed(42)
+
+
+ def _rand_gbp(low: int, high: int) -> str:
+     return f"Β£{random.randint(low, high):,}"
+
+
+ def _rand_policy_number() -> str:
+     prefix = random.choice(["POL", "UW", "BA", "PI", "CL", "MGA"])
+     return f"{prefix}-{random.randint(100000, 999999)}"
+
+
+ def _rand_claim_number() -> str:
+     return f"CLM-{random.randint(200000, 999999)}"
+
+
+ def _rand_date(start_year: int = 2022, end_year: int = 2026) -> str:
+     start = datetime(start_year, 1, 1)
+     end = datetime(end_year, 3, 31)
+     delta = (end - start).days
+     d = start + timedelta(days=random.randint(0, delta))
+     return d.strftime("%d/%m/%Y")
+
+
+ def _rand_postcode() -> str:
+     region = random.choice(list(UK_REGIONS.values()))
+     prefix = random.choice(region)
+     num = random.randint(1, 28)
+     suffix = f"{random.randint(1,9)}{random.choice('ABCDEFGHJKLMNPQRSTUVWXY')}{random.choice('ABCDEFGHJKLMNPQRSTUVWXY')}"
+     return f"{prefix}{num} {suffix}"
+
+
+ def _rand_vehicle_reg() -> str:
+     area = random.choice(["AB", "BA", "CA", "DA", "EA", "FA", "GA", "HA", "KA", "LA", "MA"])
+     age = random.choice(["21", "22", "23", "24", "25", "71", "72", "73", "74", "75"])
+     letters = "".join(random.choices("ABCDEFGHJKLMNPRSTUVWXY", k=3))
+     return f"{area}{age} {letters}"
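+
+ # Illustrative formats (actual values are random): _rand_policy_number() -> "UW-302914",
+ # _rand_claim_number() -> "CLM-574201", _rand_vehicle_reg() -> "BA73 KXT".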
+
+
+ # ────────────────────────────────────────────
+ # SFT Template Generators (one per category)
+ # ────────────────────────────────────────────
+
+ def gen_claims_handling() -> dict:
+     line = random.choice(list(CLAIM_TYPES.keys()))
+     claim_type = random.choice(CLAIM_TYPES[line])
+     insured = fake.name()
+     policy_num = _rand_policy_number()
+     claim_num = _rand_claim_number()
+     date_of_loss = _rand_date(2024, 2026)
+     claim_value = random.randint(500, 150000)
+     excess = random.choice([100, 150, 200, 250, 300, 500, 1000])
+     postcode = _rand_postcode()
+     insurer = random.choice(UK_INSURERS)
+
+     instruction = (
+         f"You are a claims handler for {insurer}. Process the following claim.\n\n"
+         f"Policy Number: {policy_num}\n"
+         f"Claim Number: {claim_num}\n"
+         f"Insured: {insured}\n"
+         f"Line of Business: {line}\n"
+         f"Date of Loss: {date_of_loss}\n"
+         f"Location: {postcode}\n"
+         f"Claim Type: {claim_type}\n"
+         f"Estimated Value: Β£{claim_value:,}\n"
+         f"Policy Excess: Β£{excess:,}\n\n"
+         f"Assess coverage, set an initial reserve, identify any red flags, "
+         f"and draft next steps. Ensure compliance with {FCA_REFERENCES['claims_handling']}."
+     )
+
+     net_claim = max(0, claim_value - excess)
+     red_flags = []
+     if claim_value > 50000:
+         red_flags.append("High-value claim β€” requires senior adjuster review")
+     if "theft" in claim_type.lower():
+         red_flags.append("Theft claim β€” verify police report reference")
+     if "fire" in claim_type.lower():
+         red_flags.append("Fire claim β€” fire brigade report required")
+     if "flood" in claim_type.lower():
+         red_flags.append("Flood claim β€” check Environment Agency flood map for postcode")
+     if "subsidence" in claim_type.lower():
+         red_flags.append("Subsidence claim β€” require arboricultural report and site investigation")
+     if not red_flags:
+         red_flags.append("No immediate red flags identified")
+
+     response = (
+         f"## Claims Assessment β€” {claim_num}\n\n"
+         f"**Coverage Check:** {line} policy {policy_num} for {insured} "
+         f"is active. The reported loss β€” {claim_type} β€” falls within the "
+         f"scope of standard {line} cover, subject to policy terms and exclusions.\n\n"
+         f"**Initial Reserve:** Β£{net_claim:,} (estimated value Β£{claim_value:,} "
+         f"less policy excess of Β£{excess:,})\n\n"
+         f"**Red Flags:**\n"
+     )
+     for flag in red_flags:
+         response += f"- {flag}\n"
+
+     response += (
+         f"\n**Next Steps:**\n"
+         f"1. Confirm policy was in force on {date_of_loss} and premium paid\n"
+         f"2. Obtain insured's full statement of events\n"
+         f"3. Appoint loss adjuster if claim exceeds Β£{min(10000, claim_value):,} threshold\n"
+         f"4. Request supporting documentation (photos, receipts, police/fire report if applicable)\n"
+         f"5. Check for any relevant exclusions in policy wording\n"
+         f"6. Update claims bordereaux for MGA reporting\n\n"
+         f"**Compliance Note:** Under {FCA_REFERENCES['claims_handling']}, we must handle this "
+         f"claim promptly and fairly. The insured must be kept informed at each stage. "
+         f"Under {FCA_REFERENCES['consumer_duty']}, we must act to deliver a good outcome "
+         f"for the customer throughout the claims journey."
+     )
+
+     return {"instruction": instruction, "response": response, "category": "claims_handling"}
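+
+ # Note: the initial reserve quoted in the response is net of excess and floored
+ # at zero, e.g. claim_value=12,000 with excess=500 gives a Β£11,500 reserve.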
+
+
+ def gen_policy_analysis() -> dict:
+     line = random.choice(LINES_OF_BUSINESS[:12])
+     section = random.choice(POLICY_SECTIONS)
+     insurer = random.choice(UK_INSURERS)
+     policy_num = _rand_policy_number()
+
+     exclusion_examples = {
+         "General Exclusions": [
+             "War, terrorism, and nuclear risks",
+             "Wear and tear, gradual deterioration",
+             "Loss arising from illegal acts by the insured",
+             "Pre-existing defects known to the insured",
+             "Sanctions β€” loss connected to sanctioned territories (Russia, Iran, North Korea, etc.)",
+         ],
+         "Claims Conditions": [
+             "Insured must notify claims within 30 days of discovery",
+             "Insured must not admit liability without insurer's written consent",
+             "Insurer has the right to take over defence and settlement of any claim",
+             "Failure to comply with claims conditions may invalidate the claim",
+         ],
+     }
+
+     relevant_exclusions = exclusion_examples.get(section, [
+         f"Standard {section} provisions apply per market wording",
+         "Refer to policy schedule for specific sub-limits and deductibles",
+     ])
+
+     instruction = (
+         f"Analyse the '{section}' section of a {line} policy issued by {insurer} "
+         f"(Policy: {policy_num}). Explain what this section covers, key exclusions, "
+         f"and any implications for the policyholder. Use plain English suitable for "
+         f"a consumer, in line with FCA Consumer Duty requirements."
+     )
+
+     response = (
+         f"## Policy Analysis β€” {section}\n\n"
+         f"**Policy:** {policy_num} ({line}, issued by {insurer})\n\n"
+         f"**What this section does:**\n"
+         f"The '{section}' section sets out "
+     )
+
+     if section == "General Exclusions":
+         response += (
+             "the circumstances and events that are NOT covered by this policy. "
+             "These apply to all sections of your cover.\n\n"
+             "**Key exclusions in plain English:**\n"
+         )
+         for exc in relevant_exclusions:
+             response += f"- {exc}\n"
+         response += (
+             f"\n**What this means for you:** If your loss falls under any of these "
+             f"exclusions, the insurer will not pay your claim. It's important to read "
+             f"these carefully and ask your broker if anything is unclear.\n\n"
+         )
+     elif section == "Claims Conditions":
+         response += (
+             "the rules you must follow when making a claim. If you don't follow "
+             "these procedures, the insurer may reduce or refuse your claim.\n\n"
+             "**Key conditions in plain English:**\n"
+         )
+         for cond in relevant_exclusions:
+             response += f"- {cond}\n"
+         response += (
+             f"\n**What this means for you:** Report any potential claim promptly. "
+             f"Don't accept blame or make payments without contacting your insurer first. "
+             f"Keep all evidence and documentation.\n\n"
+         )
+     else:
+         response += (
+             f"the terms relating to {section.lower()} under this {line} policy. "
+             f"This defines the scope, limits, and conditions that apply.\n\n"
+             f"**Key points:**\n"
+         )
+         for item in relevant_exclusions:
+             response += f"- {item}\n"
+         response += "\n"
+
+     response += (
+         f"**Consumer Duty Reminder:** Under {FCA_REFERENCES['consumer_duty']}, "
+         f"{insurer} is required to ensure this policy provides fair value and that "
+         f"all terms are communicated clearly. If you don't understand any part of "
+         f"your policy, you have the right to ask for a plain English explanation. "
+         f"If you're unhappy, you can complain to {insurer} and, if unresolved, "
+         f"refer to the Financial Ombudsman Service (FOS) free of charge."
+     )
+
+     return {"instruction": instruction, "response": response, "category": "policy_analysis"}
+
+
+ def gen_fnol() -> dict:
+     line = random.choice(["Motor Private Car", "Home Buildings", "Home Contents"])
+     claim_type = random.choice(CLAIM_TYPES[line])
+     caller = fake.name()
+     policy_num = _rand_policy_number()
+     date_of_loss = _rand_date(2025, 2026)
+     postcode = _rand_postcode()
+
+     if "Motor" in line:
+         vehicle_reg = _rand_vehicle_reg()
+         extra = f"Vehicle Registration: {vehicle_reg}\n"
+     else:
+         extra = f"Property Address Postcode: {postcode}\n"
+
+     instruction = (
+         f"You are an FNOL (First Notification of Loss) handler. A policyholder is "
+         f"reporting a new claim by phone. Capture all required information and create "
+         f"a structured FNOL record.\n\n"
+         f"Caller: {caller}\n"
+         f"Policy: {policy_num}\n"
+         f"Line: {line}\n"
+         f"{extra}"
+         f"What happened: \"{claim_type} β€” happened on {date_of_loss}. "
+         f"{'I noticed it when I got home from work.' if 'Home' in line else 'It happened on the A-road near ' + fake.city() + '.'}\"\n\n"
+         f"Process this FNOL, ask any clarifying questions, and create the record."
+     )
+
+     claim_num = _rand_claim_number()
+     response = (
+         f"## FNOL Record β€” {claim_num}\n\n"
+         f"**Date/Time of FNOL:** {datetime.now().strftime('%d/%m/%Y %H:%M')}\n"
+         f"**Reported By:** {caller} (policyholder)\n"
+         f"**Policy Number:** {policy_num}\n"
+         f"**Line of Business:** {line}\n"
+         f"**Date of Loss:** {date_of_loss}\n"
+         f"{extra}"
+         f"**Loss Description:** {claim_type}\n"
+         f"**Claim Number Assigned:** {claim_num}\n\n"
+         f"**Clarifying Questions Asked:**\n"
+         f"1. Were there any injuries? (Response required)\n"
+         f"2. Were emergency services called? (Response required)\n"
+         f"3. Are there any witnesses? (Names/contact details needed)\n"
+         f"4. Has the insured taken steps to prevent further damage? (Duty to mitigate)\n"
+         f"5. Does the insured have any photos or documentation? (Request submission)\n"
+     )
+
+     if "Motor" in line:
+         response += (
+             f"6. Was a third party involved? (Need their details, insurer, registration)\n"
+             f"7. Has the incident been reported to the police? (Theft/injury = mandatory)\n"
+         )
+     elif "Home" in line:
+         response += (
+             f"6. Is the property currently habitable? (May need emergency accommodation)\n"
+             f"7. Have temporary repairs been carried out? (Keep receipts)\n"
+         )
+
+     response += (
+         f"\n**Initial Triage:**\n"
+         f"- Severity: {'High' if 'injury' in claim_type.lower() or 'fire' in claim_type.lower() else 'Medium'}\n"
+         f"- Fast-track eligible: {'Yes' if random.random() > 0.5 else 'No β€” requires adjuster visit'}\n"
+         f"- Fraud indicators: None at this stage\n\n"
+         f"**Next Steps:**\n"
+         f"1. Send acknowledgement letter/email within 24 hours\n"
+         f"2. Assign to claims team for full assessment\n"
+         f"3. Arrange loss adjuster if required\n"
+         f"4. Set initial reserve\n\n"
+         f"**Compliance:** FNOL processed in accordance with {FCA_REFERENCES['claims_handling']} "
+         f"and {FCA_REFERENCES['consumer_duty']}. Customer informed of their rights under "
+         f"{FCA_REFERENCES['complaints']} and FOS referral rights."
+     )
+
+     return {"instruction": instruction, "response": response, "category": "fnol"}
+
+
+ def gen_compliance_check() -> dict:
+     scenario_type = random.choice([
+         "pricing_fairness", "claim_decline", "renewal_increase",
+         "data_handling", "vulnerable_customer", "complaints_handling"
+     ])
+
+     insurer = random.choice(UK_INSURERS)
+     customer = fake.name()
+     # Draw the renewal figures once so the instruction and response quote the
+     # same, arithmetically consistent numbers
+     pct_increase = random.randint(40, 120)
+     old_premium = random.randint(200, 400)
+     new_premium = round(old_premium * (1 + pct_increase / 100))
+
+     scenarios = {
+         "pricing_fairness": {
+             "instruction": (
+                 f"{insurer} is renewing a home insurance policy for {customer}, age {random.randint(65, 85)}. "
+                 f"The premium has increased from Β£{old_premium:,} to "
+                 f"Β£{new_premium:,} β€” a {pct_increase}% increase. "
+                 f"The customer's claims history is clean. Assess whether this pricing complies "
+                 f"with FCA pricing practices rules and Consumer Duty."
+             ),
+             "response_good": (
+                 f"## Compliance Assessment β€” Pricing Fairness\n\n"
+                 f"**CONCERN: Potential non-compliance with FCA pricing rules.**\n\n"
+                 f"Under {FCA_REFERENCES['pricing_practices']}, insurers must not charge renewing "
+                 f"customers more than they would charge an equivalent new customer. This is the "
+                 f"'price walking' ban. A {pct_increase}% increase for a claims-free "
+                 f"customer raises a red flag.\n\n"
+                 f"**Required checks:**\n"
+                 f"1. Compare this renewal premium to the new business equivalent premium β€” they should be comparable\n"
+                 f"2. Verify that the increase is justified by risk factors (claims inflation, reinsurance costs, location risk) not customer inertia\n"
+                 f"3. Check whether {customer}'s age is driving the increase β€” pricing based on age alone may breach the {FCA_REFERENCES['equality_act']}\n"
+                 f"4. Ensure the customer has received clear communication explaining why the premium has changed\n\n"
+                 f"**Consumer Duty ({FCA_REFERENCES['consumer_duty']}):**\n"
+                 f"- Outcome 1 (Products & Services): Is this policy still providing fair value?\n"
+                 f"- Outcome 2 (Price & Value): The price must reflect the value of the product\n"
+                 f"- Outcome 3 (Consumer Understanding): The renewal notice must clearly explain the price change\n"
+                 f"- Outcome 4 (Consumer Support): The customer should be able to easily compare options or switch\n\n"
+                 f"**Recommendation:** Flag for review by pricing team. Do not issue renewal at this price until "
+                 f"new business equivalence is confirmed."
+             ),
+         },
+         "claim_decline": {
+             "instruction": (
+                 f"{insurer} is declining a {random.choice(['home escape of water', 'motor theft', 'public liability'])} "
+                 f"claim from {customer} (Claim: {_rand_claim_number()}) on the grounds of "
+                 f"non-disclosure. The customer did not declare a previous claim from 3 years ago during renewal. "
+                 f"Assess the compliance implications of this decline."
+             ),
+             "response_good": (
+                 f"## Compliance Assessment β€” Claim Decline\n\n"
+                 f"**CONCERN: Claim decline on non-disclosure grounds requires careful handling.**\n\n"
+                 f"Under the Consumer Insurance (Disclosure and Representations) Act 2012 (CIDRA), "
+                 f"the burden is on the insurer to ask clear questions. A consumer's duty is to take "
+                 f"reasonable care not to make a misrepresentation.\n\n"
+                 f"**Key questions:**\n"
+                 f"1. Did the renewal documentation SPECIFICALLY ask about previous claims in the last 3-5 years?\n"
+                 f"2. Was the question clear and unambiguous? (Under CIDRA, a vague question shifts risk to insurer)\n"
+                 f"3. Was the non-disclosure deliberate or reckless, or merely careless?\n"
+                 f"   - Deliberate/reckless: Insurer can void the policy and refuse all claims\n"
+                 f"   - Careless: Insurer can only adjust (proportional remedy) β€” what would they have done?\n"
+                 f"4. Would {insurer} have still provided cover if the previous claim had been declared?\n\n"
+                 f"**Consumer Duty implications:**\n"
+                 f"- Simply declining and citing 'non-disclosure' without proportional remedy assessment "
+                 f"is likely to breach {FCA_REFERENCES['consumer_duty']}\n"
+                 f"- The customer must be informed of their FOS rights\n"
+                 f"- A full explanation in plain English must be provided\n\n"
+                 f"**Recommendation:** Apply proportional remedy. Calculate what premium would have been charged "
+                 f"with full disclosure, and settle claim on that proportional basis unless the non-disclosure was "
+                 f"deliberate or reckless."
+             ),
+         },
+         "renewal_increase": {
+             "instruction": (
+                 f"Review the following renewal strategy for {insurer}: Motor fleet policy for a commercial "
+                 f"customer with {random.randint(20, 100)} vehicles. Current premium: {_rand_gbp(50000, 200000)}. "
+                 f"Proposed renewal premium: {_rand_gbp(80000, 350000)}. "
+                 f"Loss ratio last 3 years: {random.randint(45, 85)}%. "
+                 f"Check if this renewal approach complies with market standards and regulations."
+             ),
+             "response_good": (
+                 f"## Renewal Assessment\n\n"
+                 f"**Analysis:**\n"
+                 f"The premium increase needs to be validated against:\n"
+                 f"1. **Burning cost analysis** β€” actual loss experience over 3+ years\n"
+                 f"2. **Market benchmarking** β€” what similar fleets are paying\n"
+                 f"3. **Claims inflation** β€” motor repair costs have risen ~15% in 2025/2026\n"
+                 f"4. **Rate adequacy** β€” is the current rate technically profitable?\n\n"
+                 f"**Compliance check:**\n"
+                 f"- Commercial policies are outside the FCA's pricing practices rules (PS21/14), "
+                 f"which only apply to retail home and motor\n"
+                 f"- However, if the insured is an SME, FCA principles of fair treatment still apply\n"
+                 f"- Solvency II requires the premium to be sufficient to cover liabilities\n"
+                 f"- Documentation of rating methodology must be maintained for audit\n\n"
+                 f"**Recommendation:** Proceed with renewal but ensure full actuarial justification "
+                 f"is documented. Provide the customer with a clear breakdown showing how claims "
+                 f"experience, market conditions, and inflation contribute to the increase."
+             ),
+         },
+         "data_handling": {
+             "instruction": (
+                 f"A claims handler at {insurer} has emailed a spreadsheet containing "
+                 f"{random.randint(50, 500)} claim records (including names, addresses, policy numbers, "
+                 f"medical reports) to an external loss adjuster using personal email instead of the "
+                 f"secure portal. Assess the data protection implications."
+             ),
+             "response_good": (
+                 f"## Data Protection Incident Assessment\n\n"
+                 f"**SEVERITY: HIGH β€” Potential reportable data breach under UK GDPR.**\n\n"
+                 f"**Assessment:**\n"
+                 f"1. This is a personal data breach under {FCA_REFERENCES['gdpr']} (Article 4(12))\n"
+                 f"2. The data includes special category data (medical reports) under Article 9\n"
+                 f"3. Personal email is not a secure or approved communication channel\n"
+                 f"4. The data was not encrypted in transit\n\n"
+                 f"**Regulatory obligations:**\n"
+                 f"- **ICO notification** β€” required within 72 hours if there's a risk to individuals' rights "
+                 f"(Article 33). Given the volume and sensitivity, this likely triggers notification.\n"
+                 f"- **Individual notification** β€” required without undue delay if high risk to individuals (Article 34)\n"
+                 f"- **FCA notification** β€” material breaches must be reported to the FCA\n\n"
+                 f"**Immediate actions:**\n"
+                 f"1. Contact the claims handler β€” do NOT delete the email (preserve evidence)\n"
+                 f"2. Contact the loss adjuster β€” request immediate secure deletion and written confirmation\n"
+                 f"3. Check if the personal email provider stores/scans data (e.g., Gmail ToS)\n"
+                 f"4. Notify the DPO within 24 hours for breach assessment\n"
+                 f"5. Log in the breach register\n"
+                 f"6. Investigate root cause β€” why was the secure portal not used?\n\n"
+                 f"**Potential consequences:**\n"
+                 f"- ICO fine of up to Β£17.5M or 4% of annual turnover\n"
+                 f"- FCA enforcement action\n"
+                 f"- Reputational damage\n"
+                 f"- Loss of customers' trust"
+             ),
+         },
+         "vulnerable_customer": {
+             "instruction": (
+                 f"A customer ({customer}) has called {insurer} to make a home insurance claim after a burst pipe. "
+                 f"The customer sounds distressed, mentions they are elderly and live alone, and says they "
+                 f"don't understand the process. The initial handler suggested the customer 'just email the photos'. "
+                 f"Review this interaction for Consumer Duty compliance."
+             ),
+             "response_good": (
+                 f"## Vulnerable Customer Assessment\n\n"
+                 f"**CONCERN: Inadequate support for a potentially vulnerable customer.**\n\n"
+                 f"**Vulnerability indicators identified:**\n"
+                 f"- Elderly and living alone (health/resilience vulnerability)\n"
+                 f"- Distressed (life event + emotional state)\n"
+                 f"- Expressed lack of understanding (capability vulnerability)\n"
+                 f"- Home damage (residential circumstances)\n\n"
+                 f"Under {FCA_REFERENCES['consumer_duty']}, Outcome 4 (Consumer Support), "
+                 f"firms must ensure customers can access appropriate support. For vulnerable "
+                 f"customers, this means additional reasonable adjustments.\n\n"
+                 f"**What went wrong:**\n"
+                 f"- 'Just email the photos' is inadequate for a customer who is distressed and may not be tech-literate\n"
+                 f"- No vulnerability assessment was conducted\n"
+                 f"- No offer of alternative channels (post, in-person visit, appointee)\n"
+                 f"- No empathy or reassurance provided\n\n"
+                 f"**Required actions:**\n"
+                 f"1. Flag {customer}'s file with vulnerability indicators (with consent)\n"
+                 f"2. Assign a dedicated claims handler (single point of contact)\n"
+                 f"3. Offer multiple channels: phone, post, or arrange a loss adjuster visit who can also help capture photos\n"
+                 f"4. Check if the property is habitable β€” if not, arrange emergency accommodation\n"
+                 f"5. Send all correspondence in large print or alternative format if needed\n"
+                 f"6. Follow up proactively rather than waiting for the customer to chase\n"
+                 f"7. Record the vulnerability and adjustments made for ongoing review\n\n"
+                 f"**Training recommendation:** Retrain the initial handler on FCA FG21/1 "
+                 f"(Guidance for firms on the fair treatment of vulnerable customers)."
+             ),
+         },
+         "complaints_handling": {
+             "instruction": (
+                 f"{customer} has complained to {insurer} that their motor insurance claim "
+                 f"(for a theft-recovered vehicle) has been open for {random.randint(4, 8)} months "
+                 f"without resolution. The customer says they've had no updates for {random.randint(3, 6)} weeks "
+                 f"and can't reach their claims handler. They've threatened to go to the Financial Ombudsman. "
+                 f"Assess this complaint and recommend a resolution."
+             ),
+             "response_good": (
+                 f"## Complaints Assessment\n\n"
+                 f"**SEVERITY: HIGH β€” FOS referral imminent, Consumer Duty breach likely.**\n\n"
+                 f"**Timeline analysis:**\n"
+                 f"Under {FCA_REFERENCES['complaints']}, {insurer} must:\n"
+                 f"- Send a written acknowledgement within 5 business days (check: was this done?)\n"
+                 f"- Issue a Final Response within 8 weeks of the complaint\n"
+                 f"- If not resolved in 8 weeks, issue a 'deadlock letter' allowing FOS referral\n\n"
+                 f"The customer has a right to refer to FOS if:\n"
+                 f"- 8 weeks have passed since the complaint, OR\n"
+                 f"- A Final Response has been issued and they're dissatisfied\n"
+                 f"- FOS referral must be made within 6 months of the Final Response\n\n"
+                 f"**Consumer Duty assessment:**\n"
+                 f"- No updates for weeks = breach of Outcome 4 (Consumer Support)\n"
+                 f"- Extended delay without resolution = potential breach of Outcome 1\n"
+                 f"- Unable to reach handler = systemic support failure\n\n"
+                 f"**Recommended resolution:**\n"
+                 f"1. Immediate callback from a senior claims manager (today)\n"
+                 f"2. Full timeline review β€” why has this claim been open so long?\n"
+                 f"3. If the claim can be settled, make a settlement offer within 48 hours\n"
+                 f"4. Offer compensation for distress and inconvenience (FOS typical award: Β£150-Β£500)\n"
+                 f"5. Issue Final Response with clear explanation and FOS rights\n"
+                 f"6. Root cause analysis β€” workload management, handler turnover, process gaps\n\n"
+                 f"**FOS risk:** If this reaches FOS, the likely outcome is an upheld complaint "
+                 f"plus a compensation award. Resolve internally to avoid the Β£750 FOS case fee."
+             ),
+         },
+     }
+
+     scenario = scenarios[scenario_type]
+     return {
+         "instruction": scenario["instruction"],
+         "response": scenario["response_good"],
+         "category": "compliance_check",
+     }
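+
+ # Note: every call renders all six scenario texts eagerly (each f-string and
+ # randint above runs), then returns just the one drawn. Wasteful but simple.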
+
+
+ def gen_bordereaux_processing() -> dict:
+     mga = random.choice(MGA_NAMES)
+     syndicate = random.choice(LLOYDS_SYNDICATES)
+     month = random.choice(["January", "February", "March", "April", "May", "June",
+                            "July", "August", "September", "October", "November", "December"])
+     year = random.choice([2025, 2026])
+     line = random.choice(["Commercial Property", "Professional Indemnity", "Employers' Liability"])
+     num_risks = random.randint(50, 500)
+     total_gwp = random.randint(100000, 2000000)
+     num_claims = random.randint(3, 30)
+     paid_claims = random.randint(50000, 500000)
+     outstanding = random.randint(100000, 1000000)
+
+     instruction = (
+         f"You are processing the {month} {year} bordereaux submission from {mga} "
+         f"under their binding authority with {syndicate}.\n\n"
+         f"**Premium Bordereaux Summary:**\n"
+         f"- Line: {line}\n"
+         f"- New risks bound: {num_risks}\n"
+         f"- Total GWP: Β£{total_gwp:,}\n\n"
+         f"**Claims Bordereaux Summary:**\n"
+         f"- Open claims: {num_claims}\n"
+         f"- Paid this month: Β£{paid_claims:,}\n"
+         f"- Outstanding reserves: Β£{outstanding:,}\n\n"
+         f"Validate this bordereaux, identify any issues, and produce a summary for the syndicate."
+     )
+
+     # Crude synthetic loss ratio: incurred (paid + outstanding) over GWP,
+     # scaled by a random 6-12 month earned fraction of the year
+     loss_ratio = round((paid_claims + outstanding) / total_gwp * 100 / 12 * random.randint(6, 12), 1)
+
+     response = (
+         f"## Bordereaux Processing Report β€” {mga}\n\n"
+         f"**Binding Authority:** {syndicate}\n"
+         f"**Period:** {month} {year}\n"
+         f"**Line:** {line}\n\n"
+         f"### Premium Bordereaux Validation\n"
+         f"- Risks bound: {num_risks} βœ“\n"
+         f"- GWP: Β£{total_gwp:,}\n"
+         f"- Average premium per risk: Β£{total_gwp // num_risks:,}\n"
+         f"- Validation checks:\n"
+         f"  - [ ] All risks within binding authority limit? (Check individual risk cap)\n"
+         f"  - [ ] Policy inception dates within BA period?\n"
+         f"  - [ ] All territories within permitted scope?\n"
+         f"  - [ ] OFAC/sanctions screening completed?\n"
+         f"  - [ ] Unique Policy References (UPR) correctly formatted?\n\n"
+         f"### Claims Bordereaux Validation\n"
+         f"- Open claims: {num_claims}\n"
+         f"- Paid: Β£{paid_claims:,}\n"
+         f"- Outstanding: Β£{outstanding:,}\n"
+         f"- Incurred: Β£{paid_claims + outstanding:,}\n"
+         f"- Running loss ratio: {loss_ratio}%\n"
+         f"- Validation checks:\n"
+         f"  - [ ] All claim numbers match policy records?\n"
+         f"  - [ ] Reserve movements properly documented?\n"
+         f"  - [ ] Any claims approaching BA authority limit?\n"
+         f"  - [ ] Large loss notifications submitted for claims > 100% of BA threshold?\n\n"
+         f"### Flags\n"
+     )
+     if loss_ratio > 70:
+         response += f"- ⚠️ Loss ratio at {loss_ratio}% β€” exceeds 70% threshold for syndicate review\n"
+     if num_claims / num_risks > 0.1:
+         response += f"- ⚠️ Claims frequency ({num_claims}/{num_risks} = {num_claims/num_risks:.1%}) appears elevated\n"
+     if total_gwp / num_risks < 500:
+         response += f"- ⚠️ Average premium (Β£{total_gwp // num_risks:,}) appears low for {line} β€” check rating adequacy\n"
+     response += (
+         f"\n### Action Items\n"
+         f"1. Submit validated bordereaux to {syndicate} via XIS/LIMOSS\n"
+         f"2. Reconcile GWP against signed premium advice\n"
+         f"3. Update the burning cost analysis with latest claims data\n"
+         f"4. Schedule quarterly review meeting with {mga} management"
+     )
+
+     return {"instruction": instruction, "response": response, "category": "bordereaux_processing"}
+
+
+ def gen_fraud_assessment() -> dict:
+     line = random.choice(["Motor Private Car", "Home Buildings", "Home Contents",
+                           "Employers' Liability", "Public Liability"])
+     claim_num = _rand_claim_number()
+     insured = fake.name()
+     claim_value = random.randint(2000, 80000)
+
+     indicators = random.sample([
+         "Claim filed within 30 days of policy inception",
+         "Insured increased sum insured 2 weeks before the loss",
+         "Inconsistent dates in insured's statement vs police report",
+         "Property was listed for sale at time of claimed burglary",
+         "Multiple previous claims across different insurers in last 2 years",
+         "Loss occurred during a period when property should have been unoccupied",
+         "Claimed items include high-value electronics but no purchase receipts",
+         "Third party witness is a family member of the insured",
+         "Phone geo-location data places insured 200 miles from claimed loss location",
+         "Vehicle was in arrears on finance payments at time of 'theft'",
+         "Medical report for injury claim references a different incident date",
+         "Social media posts inconsistent with claimed injuries",
+     ], k=random.randint(2, 5))
+
+     instruction = (
+         f"Assess the following {line} claim for fraud indicators.\n\n"
+         f"Claim: {claim_num}\n"
+         f"Insured: {insured}\n"
+         f"Value: Β£{claim_value:,}\n\n"
+         f"**Indicators flagged by automated screening:**\n"
+     )
+     for indicator in indicators:
+         instruction += f"- {indicator}\n"
+     instruction += (
+         f"\nProvide a fraud risk assessment, recommend investigation steps, "
+         f"and note any regulatory considerations."
+     )
+
+     # 2 points per flagged indicator plus a little noise, capped at 10
+     score = min(10, len(indicators) * 2 + random.randint(0, 2))
+     risk_level = "HIGH" if score >= 7 else "MEDIUM" if score >= 4 else "LOW"
+
+     response = (
+         f"## Fraud Risk Assessment β€” {claim_num}\n\n"
+         f"**Risk Score:** {score}/10 ({risk_level})\n"
+         f"**Insured:** {insured}\n"
+         f"**Claim Value:** Β£{claim_value:,}\n\n"
+         f"**Analysis of Indicators:**\n"
+     )
+     for i, indicator in enumerate(indicators, 1):
+         response += f"{i}. **{indicator}** β€” "
+         if "inception" in indicator.lower():
+             response += "Classic red flag. Check insured's quote history and whether they approached multiple insurers.\n"
+         elif "increased" in indicator.lower():
+             response += "Suggests foreknowledge. Obtain the mid-term adjustment documentation.\n"
+         # "social media" must be matched before the generic "inconsistent" check,
+         # since that indicator text contains both phrases
+         elif "social media" in indicator.lower():
+             response += "Document with screenshots (dated). Admissible as evidence.\n"
+         elif "inconsistent" in indicator.lower():
+             response += "Material discrepancy. Requires detailed statement comparison.\n"
+         elif "sale" in indicator.lower():
+             response += "Financial pressure indicator. Check Land Registry and estate agent listings.\n"
+         elif "multiple" in indicator.lower():
+             response += "Check CUE (Claims & Underwriting Exchange) database for full claims history.\n"
+         elif "unoccupied" in indicator.lower():
+             response += "Policy conditions for unoccupied properties differ. Check occupancy requirements.\n"
+         elif "receipts" in indicator.lower():
+             response += "Request alternative proof: bank/credit card statements, warranty registrations.\n"
+         elif "family member" in indicator.lower():
+             response += "Independent witness corroboration needed. Check for collusion.\n"
+         elif "geo-location" in indicator.lower():
+             response += "Strong objective evidence. Cross-reference with call records and transaction data.\n"
+         elif "finance" in indicator.lower():
+             response += "Financial motive established. Check with finance house for payment status.\n"
+         elif "medical" in indicator.lower():
+             response += "Request GP records and A&E attendance records for verification.\n"
+         else:
+             response += "Requires further investigation.\n"
+
+     response += (
+         f"\n**Recommended Actions:**\n"
+         f"1. Refer to Special Investigations Unit (SIU)\n"
+         f"2. Appoint forensic investigator if warranted\n"
+         f"3. Obtain full CUE/CIFAS/IFB checks\n"
+         f"4. Request detailed signed statement from insured under reserve of rights\n"
+         f"5. {'Consider surveillance' if score >= 7 else 'Monitor for further indicators'}\n\n"
+         f"**Regulatory Notes:**\n"
+         f"- Do NOT deny the claim based on suspicion alone β€” investigation must be completed first\n"
+         f"- Maintain fair treatment under {FCA_REFERENCES['consumer_duty']} even during investigation\n"
+         f"- If fraud is confirmed, refer to the Insurance Fraud Bureau (IFB)\n"
+         f"- CIFAS marker can be applied if the insured is found to have made a fraudulent claim\n"
+         f"- Under the Insurance Act 2015 (commercial) / CIDRA 2012 (consumer), "
+         f"a fraudulent claim entitles the insurer to refuse the entire claim and "
+         f"recover any sums already paid"
+     )
+
+     return {"instruction": instruction, "response": response, "category": "fraud_assessment"}
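+
+ # Scoring example (illustrative): 4 flagged indicators give min(10, 8 + 0..2),
+ # i.e. a score of 8-10, which maps to the HIGH risk band (>= 7).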
+
+
+ def gen_underwriting_triage() -> dict:
+     line = random.choice(["Commercial Property", "Professional Indemnity",
+                           "Employers' Liability", "Cyber Liability", "D&O"])
+     business = fake.company()
+     turnover = random.randint(500000, 50000000)
+     employees = random.randint(5, 500)
+     inception = _rand_date(2026, 2026)
+     broker = random.choice(["Marsh UK", "Aon UK", "Willis Towers Watson",
+                             "Gallagher", "Howden", "Lockton UK"])
+
+     instruction = (
+         f"You are an underwriter at a Lloyd's syndicate. Triage the following new business submission.\n\n"
+         f"**Submission Summary:**\n"
+         f"- Broker: {broker}\n"
+         f"- Line: {line}\n"
+         f"- Insured: {business}\n"
+         f"- Turnover: Β£{turnover:,}\n"
+         f"- Employees: {employees}\n"
+         f"- Proposed Inception: {inception}\n"
+         f"- Trade: {random.choice(['IT consultancy', 'construction contractor', 'solicitors practice', 'healthcare provider', 'manufacturing', 'hospitality', 'financial services', 'retail chain'])}\n"
+         f"- Claims history: {random.choice(['Clean β€” no claims in 5 years', '1 claim Β£25K (2024) β€” resolved', '2 claims totalling Β£150K (2023-2024)', 'Active claim β€” Β£500K outstanding'])}\n\n"
+         f"Triage this submission: accept for quoting, decline, or request further information."
+     )
+
+     response = (
+         f"## Underwriting Triage β€” New Business\n\n"
+         f"**Broker:** {broker}\n"
+         f"**Proposed Insured:** {business}\n"
+         f"**Line:** {line}\n\n"
+         f"**Initial Assessment:**\n"
+         f"- Turnover: Β£{turnover:,} β€” {'Within appetite' if turnover < 20000000 else 'At upper end of appetite β€” referral required'}\n"
+         f"- Employees: {employees} β€” {'Standard' if employees < 200 else 'Large risk β€” may need specific terms'}\n\n"
+         f"**Information Required Before Quoting:**\n"
+         f"1. Full proposal form (ACORD application or equivalent)\n"
+         f"2. 5-year loss history with triangulation\n"
+         f"3. Current policy schedule (if renewal/transfer)\n"
+         f"4. Risk management procedures documentation\n"
+     )
+
+     if "Cyber" in line:
+         response += (
+             f"5. IT security assessment (pen test results, ISO 27001 status)\n"
+             f"6. Incident response plan\n"
+             f"7. Data inventory (volume of PII/PHI held)\n"
+             f"8. Third-party vendor risk assessment\n"
+         )
+     elif "Professional" in line:
+         response += (
+             f"5. Professional qualifications and regulatory status\n"
+             f"6. Engagement terms and client contract templates\n"
+             f"7. PI risk management procedures\n"
+         )
+     elif "Property" in line:
+         response += (
+             f"5. Building construction details and fire protection\n"
+             f"6. Business continuity plan\n"
+             f"7. Risk survey report (within last 2 years)\n"
+         )
+
+     # Use the same Β£20M appetite threshold as the initial assessment above
+     response += (
+         f"\n**Triage Decision:** {'ACCEPT for quoting β€” request further information above' if turnover < 20000000 else 'REFER to senior underwriter given risk size'}\n\n"
+         f"**Indicative Terms (subject to full underwriting):**\n"
+         f"- Rate: {random.uniform(0.1, 2.5):.2f}% of turnover\n"
+         f"- Minimum premium: {_rand_gbp(2500, 15000)}\n"
+         f"- Deductible: {_rand_gbp(1000, 25000)}\n"
+         f"- Limit: {_rand_gbp(1000000, 10000000)}\n"
+     )
+
+     return {"instruction": instruction, "response": response, "category": "underwriting_triage"}
+
+
+ def gen_customer_communication() -> dict:
+     # Only the three scenario branches implemented below are drawn
+     scenario = random.choice([
+         "claim_acknowledgement", "claim_settlement_offer", "coverage_clarification"
+     ])
+     customer = fake.name()
+     insurer = random.choice(UK_INSURERS)
+     claim_num = _rand_claim_number()
+     policy_num = _rand_policy_number()
+
+     if scenario == "claim_acknowledgement":
+         instruction = (
+             f"Draft a claim acknowledgement letter from {insurer} to {customer} "
+             f"for claim {claim_num} (home escape of water). The letter must comply "
+             f"with FCA Consumer Duty plain English requirements."
+         )
+         response = (
+             f"Dear {customer},\n\n"
+             f"**Your Claim β€” Reference: {claim_num}**\n\n"
+             f"Thank you for letting us know about the water damage to your property. "
+             f"We're sorry you're dealing with this and we want to help resolve it as quickly as possible.\n\n"
+             f"**What happens next:**\n\n"
+             f"1. **Your claim handler** β€” {fake.name()} will be looking after your claim. "
+             f"You can reach them on {fake.phone_number()} or by email.\n\n"
+             f"2. **We may need to visit** β€” depending on the extent of the damage, we may arrange "
+             f"for a loss adjuster to visit your property. If so, we'll contact you to arrange "
+             f"a convenient time.\n\n"
+             f"3. **What we need from you:**\n"
+             f"   - Photos of the damage (if safe to take)\n"
+             f"   - Any emergency repair receipts (keep these safe β€” we'll reimburse reasonable costs)\n"
+             f"   - Details of how the leak happened, if known\n\n"
+             f"4. **Emergency repairs** β€” if you need to stop further damage (e.g., turning off water, "
+             f"temporary repairs), please go ahead. Your policy covers reasonable emergency costs.\n\n"
+             f"5. **Timeline** β€” we aim to settle straightforward claims within 30 days. "
+             f"If your claim is more complex, we'll keep you updated every step of the way.\n\n"
+             f"**Your rights:**\n"
+             f"If at any point you're unhappy with how we're handling your claim, please tell us. "
+             f"You can make a complaint and, if we can't resolve it, you can refer to the "
+             f"Financial Ombudsman Service (FOS) free of charge. Details: www.financial-ombudsman.org.uk\n\n"
+             f"Kind regards,\n"
+             f"{fake.name()}\n"
+             f"Claims Team, {insurer}"
+         )
+     elif scenario == "claim_settlement_offer":
+         amount = random.randint(1000, 25000)
+         excess = random.choice([100, 250, 500])
+         instruction = (
+             f"Draft a settlement offer letter from {insurer} to {customer} for claim {claim_num}. "
+             f"The assessed amount is Β£{amount:,} with a Β£{excess} excess."
+         )
+         response = (
+             f"Dear {customer},\n\n"
+             f"**Settlement Offer β€” Claim: {claim_num}**\n\n"
+             f"We've now completed our assessment of your claim. Here's our offer:\n\n"
+             f"| Item | Amount |\n"
+             f"|---|---|\n"
+             f"| Assessed claim value | Β£{amount:,} |\n"
+             f"| Less: your policy excess | -Β£{excess:,} |\n"
+             f"| **Settlement amount** | **Β£{amount - excess:,}** |\n\n"
+             f"**How we calculated this:**\n"
+             f"We assessed the cost of repair/replacement based on our adjuster's report "
+             f"and current market prices. If you'd like a detailed breakdown, please ask.\n\n"
+             f"**If you agree:** Simply reply to this letter or call us to accept. "
+             f"We'll transfer Β£{amount - excess:,} to your bank account within 5 working days.\n\n"
+             f"**If you disagree:** You don't have to accept this offer. Please let us know "
+             f"why you think the amount should be different, and we'll review. You can also:\n"
+             f"- Obtain your own independent estimate and send it to us\n"
+             f"- Make a complaint through our complaints process\n"
+             f"- Refer to the Financial Ombudsman Service if we can't agree\n\n"
+             f"This offer remains open for 30 days.\n\n"
+             f"Kind regards,\n"
+             f"{fake.name()}\n"
+             f"Claims Team, {insurer}"
+         )
+     else:
+         instruction = (
+             f"Draft a plain English explanation for {customer} about what 'proximate cause' "
+             f"means in the context of their {random.choice(['home', 'motor'])} insurance claim. "
+             f"They don't understand why the insurer is investigating the cause before paying."
+         )
+         response = (
+             f"Dear {customer},\n\n"
+             f"Thank you for your question β€” it's completely understandable to want to know why we need "
+             f"to investigate before we can pay your claim.\n\n"
+             f"**What 'proximate cause' means in plain English:**\n\n"
+             f"When something goes wrong with your home or car, there's often a chain of events. "
+             f"'Proximate cause' simply means: **what was the main reason the damage happened?**\n\n"
+             f"Your insurance policy covers specific causes (called 'perils') β€” things like storms, "
+             f"fire, theft, or accidental damage. It doesn't cover others β€” like wear and tear or "
+             f"gradual deterioration.\n\n"
+             f"**Why it matters for your claim:**\n\n"
+             f"We need to establish *why* the damage happened to check it's covered by your policy. "
+             f"For example:\n"
+             f"- If your roof leaked because of storm damage β†’ that's likely covered βœ“\n"
+             f"- If your roof leaked because the tiles were old and worn out β†’ that's wear and tear, "
+             f"which isn't usually covered βœ—\n"
+             f"- If the storm made existing wear worse β†’ we look at which cause was 'dominant' β€” "
+             f"if the storm was the main cause, the claim should be covered even though the roof was old\n\n"
+             f"**We're not trying to avoid paying** β€” we're making sure we apply your policy correctly "
+             f"and fairly, as required by our regulator (the FCA).\n\n"
+             f"We'll have the investigation results within [X] days and will contact you as soon as we do.\n\n"
+             f"If you have any questions, please call us on {fake.phone_number()}.\n\n"
+             f"Kind regards,\n"
+             f"{fake.name()}\n"
+             f"Claims Team, {insurer}"
+         )
+
+     return {"instruction": instruction, "response": response, "category": "customer_communication"}
860
+
861
+
862
+ def gen_reserve_setting() -> dict:
863
+ line = random.choice(list(CLAIM_TYPES.keys()))
864
+ claim_type = random.choice(CLAIM_TYPES[line])
865
+ claim_num = _rand_claim_number()
866
+ insured = fake.name()
867
+ current_reserve = random.randint(5000, 100000)
868
+ new_info = random.choice([
869
+ "Loss adjuster report received β€” damage more extensive than initially reported",
870
+ "Third party solicitor has made contact β€” injury claim likely to increase",
871
+ "Subrogation recovery of Β£15,000 confirmed from third party insurer",
872
+ "Repair costs quote received β€” significantly lower than initial estimate",
873
+ "Medical report indicates longer recovery period β€” reserve increase needed",
874
+ "Fraud investigation cleared β€” claim confirmed genuine, proceed to settlement",
875
+ ])
876
+
877
+ instruction = (
878
+ f"Review and update the reserve for claim {claim_num}.\n\n"
879
+ f"**Current Details:**\n"
880
+ f"- Insured: {insured}\n"
881
+ f"- Line: {line}\n"
882
+ f"- Type: {claim_type}\n"
883
+ f"- Current reserve: Β£{current_reserve:,}\n"
884
+ f"- New information: {new_info}\n\n"
885
+ f"Assess whether the reserve should be increased, decreased, or maintained. "
886
+ f"Provide your rationale."
887
+ )
888
+
889
+ if "more extensive" in new_info or "injury" in new_info or "longer recovery" in new_info:
890
+ adjustment = random.randint(10000, 50000)
891
+ new_reserve = current_reserve + adjustment
892
+ direction = "INCREASE"
893
+ elif "recovery" in new_info or "lower" in new_info:
894
+ adjustment = random.randint(1000, max(1000, min(30000, current_reserve - 1000)))
895
+ new_reserve = current_reserve - adjustment
896
+ direction = "DECREASE"
897
+ else:
898
+ new_reserve = current_reserve
899
+ direction = "MAINTAIN"
900
+ adjustment = 0
901
+
902
+ response = (
903
+ f"## Reserve Review β€” {claim_num}\n\n"
904
+ f"**Recommendation:** {direction}\n\n"
905
+ f"| | Amount |\n"
906
+ f"|---|---|\n"
907
+ f"| Previous reserve | Β£{current_reserve:,} |\n"
908
+ f"| Adjustment | {'+'if direction == 'INCREASE' else '-' if direction == 'DECREASE' else ''}Β£{adjustment:,} |\n"
909
+ f"| **New reserve** | **Β£{new_reserve:,}** |\n\n"
910
+ f"**Rationale:**\n"
911
+ f"Based on the new information β€” {new_info.lower()} β€” the reserve should be "
912
+ f"{'increased to reflect the higher expected claim cost' if direction == 'INCREASE' else 'decreased to reflect reduced expected cost' if direction == 'DECREASE' else 'maintained as the new information does not materially change the expected outcome'}.\n\n"
913
+ f"**Factors considered:**\n"
914
+ f"1. Original claim notification and circumstances\n"
915
+ f"2. Updated damage assessment / medical evidence / legal correspondence\n"
916
+ f"3. Benchmarking against similar claims (industry loss data)\n"
917
+ f"4. Potential for future development (claims typically develop over {random.randint(6, 24)} months for {line})\n"
918
+ f"5. Subrogation/recovery prospects\n\n"
919
+ f"**Approval required:** {'Yes β€” exceeds Β£50K reserve authority' if new_reserve > 50000 else 'No β€” within claims handler authority'}\n\n"
920
+ f"**Bordereaux impact:** Updated reserve to be reflected in next monthly claims bordereaux to capacity provider."
921
+ )
922
+
923
+ return {"instruction": instruction, "response": response, "category": "reserve_setting"}
924
+
925
+
926
+ def gen_jargon_explanation() -> dict:
927
+ term = random.choice(list(INSURANCE_JARGON.keys()))
928
+ definition = INSURANCE_JARGON[term]
929
+ context_line = random.choice(LINES_OF_BUSINESS[:10])
930
+
931
+ instruction = (
932
+ f"A policyholder with a {context_line} policy has asked: "
933
+ f"\"What does '{term}' mean? I saw it in my policy document and I don't understand it.\"\n\n"
934
+ f"Explain this in plain English, with a practical example relevant to their policy type."
935
+ )
936
+
937
+ examples = {
938
+ "utmost good faith": (
939
+ f"if you're applying for {context_line} insurance and you know about a previous claim "
940
+ f"or a pre-existing issue, you must tell the insurer. If you hide it and later make a claim, "
941
+ f"the insurer could refuse to pay."
942
+ ),
943
+ "subrogation": (
944
+ f"say someone else caused damage to your {context_line.lower().replace('home ', 'home')}. "
945
+ f"Your insurer pays your claim, then chases the person responsible (or their insurer) to "
946
+ f"get the money back. You don't need to do anything β€” the insurer handles it."
947
+ ),
948
+ "indemnity": (
949
+ f"your insurer will put you back in the financial position you were in just before the loss. "
950
+ f"They won't make you better off (no profit from a claim) and won't leave you worse off. "
951
+ f"So for a {context_line.lower()} claim, they'd cover the cost of repair or replacement "
952
+ f"to the same standard as before."
953
+ ),
954
+ "excess": (
955
+ f"it's the first part of any claim you pay yourself. For example, if your {context_line.lower()} "
956
+ f"policy has a Β£250 excess and your claim is for Β£2,000, you pay Β£250 and the insurer pays Β£1,750."
957
+ ),
958
+ }
959
+
960
+ example = examples.get(term, (
961
+ f"In the context of your {context_line} policy, this means: {definition} "
962
+ f"For instance, if you needed to make a claim, this term would affect how "
963
+ f"the insurer assesses and pays that claim."
964
+ ))
965
+
966
+ response = (
967
+ f"Great question β€” '{term}' is one of those insurance terms that sounds complicated "
968
+ f"but is actually straightforward.\n\n"
969
+ f"**In plain English:** {definition}\n\n"
970
+ f"**How this applies to your {context_line} policy:**\n"
971
+ f"For example, {example}\n\n"
972
+ f"**Why it matters:** Understanding this term helps you know your rights and "
973
+ f"what to expect if you ever need to claim. Under the FCA's Consumer Duty rules, "
974
+ f"your insurer should explain these terms clearly β€” so if anything else in your "
975
+ f"policy document doesn't make sense, don't hesitate to ask."
976
+ )
977
+
978
+ return {"instruction": instruction, "response": response, "category": "jargon_explanation"}
979
+
980
+
981
+ def gen_regulatory_query() -> dict:
982
+ query_type = random.choice([
983
+ "consumer_duty_overview",
984
+ "fos_process",
985
+ "solvency_ii",
986
+ "lloyd_s_market",
987
+ "gender_pricing",
988
+ "claims_management_company",
989
+ ])
990
+
991
+ queries = {
992
+ "consumer_duty_overview": {
993
+ "instruction": (
994
+ "A new claims handler asks: 'What is the FCA Consumer Duty and how does it affect "
995
+ "how I handle claims on a daily basis? What could go wrong if I don't comply?'"
996
+ ),
997
+ "response": (
998
+ f"## FCA Consumer Duty β€” Practical Guide for Claims Handlers\n\n"
999
+ f"**What it is:** {FCA_REFERENCES['consumer_duty']} is the FCA's highest-priority regulation. "
1000
+ f"It requires all firms to 'act to deliver good outcomes for retail customers.'\n\n"
1001
+ f"**The 4 outcomes you must deliver on every claim:**\n\n"
1002
+ f"1. **Products & Services** β€” The policy must be designed to meet the customer's needs. "
1003
+ f"If you notice the policy doesn't cover what the customer reasonably expected, flag it.\n\n"
1004
+ f"2. **Price & Value** β€” The customer has paid a fair premium for meaningful cover. "
1005
+ f"Don't use technical loopholes to avoid paying legitimate claims.\n\n"
1006
+ f"3. **Consumer Understanding** β€” Communicate in plain English. No jargon in claim letters. "
1007
+ f"Explain decisions clearly. If declining part of a claim, explain WHY in terms they can understand.\n\n"
1008
+ f"4. **Consumer Support** β€” Be accessible. Respond promptly. Identify vulnerable customers "
1009
+ f"and provide additional support. Don't make the claims process unnecessarily difficult.\n\n"
1010
+ f"**What could go wrong:**\n"
1011
+ f"- FCA enforcement action and fines (up to unlimited amount)\n"
1012
+ f"- FOS complaints upheld against the firm (Β£750 per case fee + compensation)\n"
1013
+ f"- Personal regulatory action against individuals who systematically breach the Duty\n"
1014
+ f"- Reputational damage in the market\n\n"
1015
+ f"**Daily practical tips:**\n"
1016
+ f"- Would I be comfortable if the FCA read this email/letter? If not, rewrite it.\n"
1017
+ f"- Am I explaining the decision or just quoting policy clause numbers? Explain.\n"
1018
+ f"- Has the customer been waiting more than expected for an update? Call them proactively.\n"
1019
+ f"- Does the customer seem confused or distressed? Offer additional support."
1020
+ ),
1021
+ },
1022
+ "fos_process": {
1023
+ "instruction": "Explain the Financial Ombudsman Service complaints process for insurance disputes in the UK.",
1024
+ "response": (
1025
+ f"## Financial Ombudsman Service (FOS) β€” Insurance Complaints Process\n\n"
1026
+ f"**What is FOS?** A free, independent service that resolves disputes between consumers "
1027
+ f"and financial firms in the UK.\n\n"
1028
+ f"**Who can use FOS?** Consumers, micro-enterprises (under 10 employees + under €2M turnover), "
1029
+ f"and charities with annual income under Β£6.5M.\n\n"
1030
+ f"**The process:**\n\n"
1031
+ f"1. **Complain to your insurer first** β€” the insurer has 8 weeks to issue a Final Response\n"
1032
+ f"2. **If unsatisfied**, refer to FOS within 6 months of the Final Response\n"
1033
+ f"3. **If 8 weeks pass** without a Final Response, you can refer immediately\n"
1034
+ f"4. **FOS investigates** β€” reviews both sides, may request documents, may mediate\n"
1035
+ f"5. **Adjudicator's opinion** β€” first assessment\n"
1036
+ f"6. **If either party disagrees** β€” can request an Ombudsman's Final Decision (binding on the firm, not the consumer)\n\n"
1037
+ f"**Costs:**\n"
1038
+ f"- Free for consumers\n"
1039
+ f"- Firms pay Β£750 case fee per complaint (after first 3 free per year)\n\n"
1040
+ f"**Compensation limits:**\n"
1041
+ f"- Up to Β£430,000 for complaints about acts after 1 April 2024\n"
1042
+ f"- Plus fair compensation for distress and inconvenience\n\n"
1043
+ f"**For insurers:** FOS decisions are published and create informal precedent. "
1044
+ f"High upheld rates damage reputation and invite FCA scrutiny."
1045
+ ),
1046
+ },
1047
+ "gender_pricing": {
1048
+ "instruction": (
1049
+ "Can we use gender as a rating factor in our motor insurance pricing model? "
1050
+ "What's the current UK legal position?"
1051
+ ),
1052
+ "response": (
1053
+ f"## Gender-Based Insurance Pricing β€” UK Legal Position\n\n"
1054
+ f"**Short answer: NO.** You cannot use gender as a rating factor for insurance pricing in the UK.\n\n"
1055
+ f"**Legal basis:**\n"
1056
+ f"- **EU Gender Directive (2004/113/EC)** as interpreted by the Court of Justice (Test-Achats, C-236/09, 2011)\n"
1057
+ f"- This was implemented in the UK and remains part of retained UK law post-Brexit\n"
1058
+ f"- The **{FCA_REFERENCES['equality_act']}** prohibits direct discrimination based on sex in the provision of services\n\n"
1059
+ f"**What this means in practice:**\n"
1060
+ f"1. You CANNOT charge different premiums based on the policyholder's gender\n"
1061
+ f"2. You CANNOT use gender as a proxy rating factor (e.g., using first name to infer gender)\n"
1062
+ f"3. You CAN use factors that correlate with risk even if they also correlate with gender "
1063
+ f"(e.g., mileage, occupation, vehicle type) β€” provided the factor independently predicts risk\n"
1064
+ f"4. Your pricing model must be able to demonstrate that no protected characteristic "
1065
+ f"is being used directly or as a proxy\n\n"
1066
+ f"**FCA scrutiny:**\n"
1067
+ f"The FCA actively monitors pricing models for fairness under {FCA_REFERENCES['pricing_practices']}. "
1068
+ f"Firms must be able to justify every rating factor.\n\n"
1069
+ f"**Action:** Review all rating factors with your actuarial team. Document the independent "
1070
+ f"risk justification for each factor. Remove any factor that can be shown to primarily "
1071
+ f"serve as a gender proxy."
1072
+ ),
1073
+ },
1074
+ }
1075
+
1076
+ # Query types without a dedicated entry above (solvency_ii, lloyd_s_market,
+ # claims_management_company) fall back to the Consumer Duty overview.
+ q = queries.get(query_type, queries["consumer_duty_overview"])
1077
+ return {"instruction": q["instruction"], "response": q["response"], "category": "regulatory_query"}
1078
+
1079
+
1080
+ def gen_renewal_review() -> dict:
1081
+ customer = fake.name()
1082
+ insurer = random.choice(UK_INSURERS)
1083
+ policy_num = _rand_policy_number()
1084
+ line = random.choice(["Motor Private Car", "Home Combined", "Commercial Combined"])
1085
+ current_premium = random.randint(200, 5000)
1086
+ new_premium = int(current_premium * random.uniform(0.9, 1.5))
1087
+ change_pct = round((new_premium - current_premium) / current_premium * 100, 1)
1088
+
1089
+ instruction = (
1090
+ f"Review the following renewal for compliance and customer outcome.\n\n"
1091
+ f"**Renewal Details:**\n"
1092
+ f"- Policyholder: {customer}\n"
1093
+ f"- Policy: {policy_num}\n"
1094
+ f"- Line: {line}\n"
1095
+ f"- Current premium: Β£{current_premium:,}\n"
1096
+ f"- Renewal premium: Β£{new_premium:,} ({'+' if change_pct > 0 else ''}{change_pct}%)\n"
1097
+ f"- Claims in period: {random.choice(['None', '1 small claim (Β£800)', '1 claim (Β£3,500)', 'None'])}\n"
1098
+ f"- Customer tenure: {random.randint(1, 15)} years\n\n"
1099
+ f"Assess whether this renewal is fair and compliant."
1100
+ )
1101
+
1102
+ response = (
1103
+ f"## Renewal Review β€” {policy_num}\n\n"
1104
+ f"**Premium Change:** {'+'if change_pct > 0 else ''}{change_pct}% "
1105
+ f"(Β£{current_premium:,} β†’ Β£{new_premium:,})\n\n"
1106
+ f"**Compliance Checks:**\n\n"
1107
+ )
1108
+
1109
+ if "Motor" in line or "Home" in line:
1110
+ response += (
1111
+ f"1. **Price Walking Check ({FCA_REFERENCES['pricing_practices']}):** "
1112
+ f"{'PASS β€” renewal premium must not exceed equivalent new business price. Verify with pricing team.' if change_pct < 20 else 'FLAG β€” significant increase requires NB equivalence verification.'}\n\n"
1113
+ )
1114
+ else:
1115
+ response += f"1. **Commercial policy β€” PS21/14 does not apply**, but fair treatment principles still do.\n\n"
1116
+
1117
+ response += (
1118
+ f"2. **Consumer Duty Value Assessment:** "
1119
+ f"{'The product appears to offer fair value at this price point.' if change_pct < 25 else 'Significant price increase β€” requires value assessment documentation.'}\n\n"
1120
+ f"3. **Communication Check:** Renewal notice must clearly show:\n"
1121
+ f" - Last year's premium for comparison βœ“\n"
1122
+ f" - Any changes to cover βœ“\n"
1123
+ f" - Reminder to shop around ('you may be able to get this cover cheaper elsewhere') βœ“\n"
1124
+ f" - Cancellation rights βœ“\n\n"
1125
+ f"4. **Auto-Renewal Disclosure:** If this policy auto-renews, the notice must clearly state this "
1126
+ f"and explain how to opt out.\n\n"
1127
+ f"**Recommendation:** {'Proceed with renewal' if change_pct < 30 else 'Review pricing justification before issuing'}"
1128
+ )
1129
+
1130
+ return {"instruction": instruction, "response": response, "category": "renewal_review"}
1131
+
1132
+
1133
+ # ── Master Generator ──
1134
+
1135
+ GENERATORS = {
1136
+ "claims_handling": gen_claims_handling,
1137
+ "policy_analysis": gen_policy_analysis,
1138
+ "fnol": gen_fnol,
1139
+ "compliance_check": gen_compliance_check,
1140
+ "bordereaux_processing": gen_bordereaux_processing,
1141
+ "fraud_assessment": gen_fraud_assessment,
1142
+ "underwriting_triage": gen_underwriting_triage,
1143
+ "customer_communication": gen_customer_communication,
1144
+ "reserve_setting": gen_reserve_setting,
1145
+ "jargon_explanation": gen_jargon_explanation,
1146
+ "regulatory_query": gen_regulatory_query,
1147
+ "renewal_review": gen_renewal_review,
1148
+ }
1149
+
1150
+
1151
+ def generate_sft_dataset(n: int = 10000, output_path: str = "data/output/insurance_sft_10k.jsonl"):
1152
+ """Generate n SFT examples, balanced across categories."""
1153
+ os.makedirs(os.path.dirname(output_path), exist_ok=True)
1154
+
1155
+ categories = list(GENERATORS.keys())
1156
+ per_category = n // len(categories)
1157
+ remainder = n % len(categories)
1158
+
1159
+ records = []
1160
+ for i, cat in enumerate(categories):
1161
+ count = per_category + (1 if i < remainder else 0)
1162
+ gen_fn = GENERATORS[cat]
1163
+ for _ in tqdm(range(count), desc=f"Generating {cat}"):
1164
+ record = gen_fn()
1165
+ # Format as chat for Qwen3 training
1166
+ records.append({
1167
+ "messages": [
1168
+ {"role": "system", "content": "You are InsureLLM, a specialist UK insurance AI assistant trained on UK insurance law, FCA regulations, Lloyd's market practices, and ACORD standards. You provide accurate, compliant, and plain-English guidance for insurance professionals and consumers."},
1169
+ {"role": "user", "content": record["instruction"]},
1170
+ {"role": "assistant", "content": record["response"]},
1171
+ ],
1172
+ "category": record["category"],
1173
+ })
1174
+
1175
+ random.shuffle(records)
1176
+
1177
+ with open(output_path, "w") as f:
1178
+ for record in records:
1179
+ f.write(json.dumps(record, ensure_ascii=False) + "\n")
1180
+
1181
+ print(f"\nβœ“ Generated {len(records)} SFT examples β†’ {output_path}")
1182
+ # Print category distribution
1183
+ from collections import Counter
1184
+ dist = Counter(r["category"] for r in records)
1185
+ for cat, count in sorted(dist.items()):
1186
+ print(f" {cat}: {count}")
1187
+
1188
+ return output_path
1189
+
1190
+
1191
+ if __name__ == "__main__":
1192
+ generate_sft_dataset()
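
The generator above writes chat-format JSONL. A minimal sketch (assuming the default output path) for loading the file and checking the category balance before training:

```python
import json
from collections import Counter

# Path matches the default in generate_sft_dataset(); adjust if you changed it.
path = "data/output/insurance_sft_10k.jsonl"

records = []
with open(path) as f:
    for line in f:
        records.append(json.loads(line))

# Each record: {"messages": [system, user, assistant], "category": str}
print(f"{len(records)} examples")
for category, count in sorted(Counter(r["category"] for r in records).items()):
    print(f"  {category}: {count}")

# Spot-check one example's role ordering
roles = [m["role"] for m in records[0]["messages"]]
assert roles == ["system", "user", "assistant"]
```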
data/gen_tabular.py ADDED
@@ -0,0 +1,343 @@
1
+ """
2
+ InsureOS β€” Synthetic Tabular Claims Data Generator
3
+ Generates 50K synthetic UK insurance claims for fraud detection and pricing models.
4
+ """
5
+
6
+ import csv
7
+ import json
8
+ import os
9
+ import random
10
+ from datetime import datetime, timedelta
11
+ from pathlib import Path
12
+
13
+ from faker import Faker
14
+ from tqdm import tqdm
15
+
16
+ from data.constants import (
17
+ UK_INSURERS, UK_REGIONS, LLOYDS_SYNDICATES, MGAS,
18
+ CLAIM_TYPES, LINES_OF_BUSINESS,
19
+ )
20
+
21
+ fake = Faker("en_GB")
22
+ Faker.seed(44)
23
+ random.seed(44)
24
+
25
+ # ── Vehicle data ──
26
+
27
+ MAKES_MODELS = {
28
+ "Ford": ["Fiesta", "Focus", "Puma", "Kuga", "Ranger"],
29
+ "Vauxhall": ["Corsa", "Astra", "Mokka", "Grandland"],
30
+ "Volkswagen": ["Golf", "Polo", "T-Roc", "Tiguan", "ID.3"],
31
+ "BMW": ["1 Series", "3 Series", "X1", "X3", "iX"],
32
+ "Mercedes": ["A-Class", "C-Class", "GLA", "GLC"],
33
+ "Audi": ["A3", "A4", "Q3", "Q5"],
34
+ "Toyota": ["Yaris", "Corolla", "C-HR", "RAV4"],
35
+ "Nissan": ["Juke", "Qashqai", "Leaf", "X-Trail"],
36
+ "Kia": ["Picanto", "Sportage", "Niro", "EV6"],
37
+ "Hyundai": ["i10", "i20", "Tucson", "IONIQ 5"],
38
+ "Tesla": ["Model 3", "Model Y"],
39
+ "Peugeot": ["208", "2008", "308", "3008"],
40
+ "Renault": ["Clio", "Captur", "Megane E-Tech"],
41
+ }
42
+
43
+ FUEL_TYPES = ["Petrol", "Diesel", "Hybrid", "Electric", "Plug-in Hybrid"]
44
+
45
+ OCCUPATIONS = [
46
+ "Accountant", "Teacher", "Software Engineer", "Nurse", "Manager",
47
+ "Retired", "Student", "Self-employed", "Civil Servant", "Sales Executive",
48
+ "Driver", "Electrician", "Solicitor", "Architect", "Doctor",
49
+ "Chef", "Farmer", "Mechanic", "Journalist", "Pharmacist",
50
+ ]
51
+
52
+ PROPERTY_TYPES = ["Detached", "Semi-detached", "Terraced", "Flat", "Bungalow", "Converted Flat"]
53
+ HEATING_TYPES = ["Gas Central", "Electric", "Oil", "Heat Pump", "LPG"]
54
+ FLOOD_RISK = ["Negligible", "Low", "Medium", "High"]
55
+ SUBSIDENCE_RISK = ["No", "Historic – resolved", "Active"]
56
+
57
+
58
+ # ── Motor claim record ──
59
+
60
+ def gen_motor_claim(claim_id: int, is_fraud: bool = False) -> dict:
61
+ region_name, region = random.choice(list(UK_REGIONS.items()))
62
+ postcode_prefix = random.choice(region)
63
+ make = random.choice(list(MAKES_MODELS.keys()))
64
+ model = random.choice(MAKES_MODELS[make])
65
+ vehicle_year = random.randint(2012, 2025)
66
+ vehicle_value = random.randint(3000, 65000)
67
+ driver_age = random.randint(18, 80)
68
+
69
+ inception = fake.date_between(start_date="-3y", end_date="-30d")
70
+ loss_date = fake.date_between(start_date=inception, end_date="today")
71
+ report_date = loss_date + timedelta(days=random.randint(0, 14))
72
+ settlement_date = report_date + timedelta(days=random.randint(5, 120)) if random.random() > 0.25 else None
73
+
74
+ claim_type = random.choice(CLAIM_TYPES["Motor Private Car"])
75
+ reserve = random.randint(500, 40000)
76
+
77
+ # Fraud signals
78
+ if is_fraud:
79
+ # Exaggerated patterns
80
+ claim_amount = reserve * random.uniform(1.5, 4.0)
81
+ days_to_report = random.randint(10, 30)
82
+ previous_claims_3y = random.randint(2, 6)
83
+ policy_age_days = random.randint(10, 90) # New policy
84
+ witnesses = 0
85
+ dashcam = False
86
+ police_report = random.random() < 0.15
87
+ time_of_loss = f"{random.randint(22, 23):02d}:{random.randint(0, 59):02d}"
88
+ else:
89
+ claim_amount = reserve * random.uniform(0.5, 1.3)
90
+ days_to_report = random.randint(0, 7)
91
+ previous_claims_3y = random.choices([0, 1, 2, 3], weights=[60, 25, 10, 5])[0]
92
+ policy_age_days = random.randint(30, 1095)
93
+ witnesses = random.choices([0, 1, 2], weights=[30, 50, 20])[0]
94
+ dashcam = random.random() < 0.4
95
+ police_report = random.random() < 0.5
96
+ time_of_loss = f"{random.randint(6, 21):02d}:{random.randint(0, 59):02d}"
97
+
98
+ return {
99
+ "claim_id": f"MTR-{claim_id:06d}",
100
+ "lob": "Motor",
101
+ "insurer": random.choice(UK_INSURERS),
102
+ "region": region_name,
103
+ "postcode_prefix": postcode_prefix,
104
+ "inception_date": inception.isoformat(),
105
+ "loss_date": loss_date.isoformat(),
106
+ "report_date": report_date.isoformat(),
107
+ "settlement_date": settlement_date.isoformat() if settlement_date else "",
108
+ "claim_type": claim_type,
109
+ "claim_status": random.choice(["Open", "Settled", "Declined", "Reserved"]) if not settlement_date else "Settled",
110
+ # Driver
111
+ "driver_age": driver_age,
112
+ "driver_gender": random.choice(["M", "F"]),
113
+ "occupation": random.choice(OCCUPATIONS),
114
+ "years_driving": max(0, driver_age - random.randint(17, 25)),
115
+ "years_ncd": random.randint(0, min(20, max(0, driver_age - 18))),
116
+ # Vehicle
117
+ "vehicle_make": make,
118
+ "vehicle_model": model,
119
+ "vehicle_year": vehicle_year,
120
+ "vehicle_value": round(vehicle_value, 0),
121
+ "fuel_type": random.choice(FUEL_TYPES),
122
+ "annual_mileage": random.choice([5000, 8000, 10000, 12000, 15000, 20000]),
123
+ # Financial
124
+ "premium": round(random.uniform(300, 3500), 2),
125
+ "voluntary_excess": random.choice([0, 100, 250, 500, 750]),
126
+ "compulsory_excess": random.choice([150, 250, 350]),
127
+ "reserve_amount": round(reserve, 2),
128
+ "claim_amount": round(claim_amount, 2),
129
+ "recovery_amount": round(claim_amount * random.uniform(0, 0.5), 2) if random.random() < 0.3 else 0.0,
130
+ # Risk indicators
131
+ "previous_claims_3y": previous_claims_3y,
132
+ "days_to_report": days_to_report,
133
+ "policy_age_days": policy_age_days,
134
+ "witnesses": witnesses,
135
+ "dashcam": dashcam,
136
+ "police_report": police_report,
137
+ "time_of_loss": time_of_loss,
138
+ # Target
139
+ "is_fraud": is_fraud,
140
+ }
141
+
142
+
143
+ # ── Home/Property claim record ──
144
+
145
+ def gen_property_claim(claim_id: int, is_fraud: bool = False) -> dict:
146
+ region_name, region = random.choice(list(UK_REGIONS.items()))
147
+ postcode_prefix = random.choice(region)
148
+ property_type = random.choice(PROPERTY_TYPES)
149
+ rebuild_value = random.randint(150000, 800000)
150
+ contents_value = random.randint(20000, 150000)
151
+ property_age = random.randint(1, 200)
152
+
153
+ inception = fake.date_between(start_date="-3y", end_date="-30d")
154
+ loss_date = fake.date_between(start_date=inception, end_date="today")
155
+ report_date = loss_date + timedelta(days=random.randint(0, 14))
156
+
157
+ claim_type = random.choice(CLAIM_TYPES["Home Buildings"] + CLAIM_TYPES["Home Contents"])
158
+ reserve = random.randint(1000, 80000)
159
+
160
+ if is_fraud:
161
+ claim_amount = reserve * random.uniform(2.0, 5.0)
162
+ days_to_report = random.randint(12, 45)
163
+ previous_claims_3y = random.randint(2, 5)
164
+ policy_age_days = random.randint(15, 60)
165
+ has_cctv = False
166
+ loss_adjuster_appointed = True
167
+ else:
168
+ claim_amount = reserve * random.uniform(0.6, 1.4)
169
+ days_to_report = random.randint(0, 7)
170
+ previous_claims_3y = random.choices([0, 1, 2], weights=[70, 22, 8])[0]
171
+ policy_age_days = random.randint(30, 1825)
172
+ has_cctv = random.random() < 0.25
173
+ loss_adjuster_appointed = random.random() < 0.4
174
+
175
+ return {
176
+ "claim_id": f"PRP-{claim_id:06d}",
177
+ "lob": "Property",
178
+ "insurer": random.choice(UK_INSURERS),
179
+ "region": region_name,
180
+ "postcode_prefix": postcode_prefix,
181
+ "inception_date": inception.isoformat(),
182
+ "loss_date": loss_date.isoformat(),
183
+ "report_date": report_date.isoformat(),
184
+ "settlement_date": "",
185
+ "claim_type": claim_type,
186
+ "claim_status": random.choice(["Open", "Settled", "Declined", "Reserved"]),
187
+ # Property
188
+ "property_type": property_type,
189
+ "property_age_years": property_age,
190
+ "rebuild_value": rebuild_value,
191
+ "contents_value": contents_value,
192
+ "heating_type": random.choice(HEATING_TYPES),
193
+ "flood_risk_zone": random.choice(FLOOD_RISK),
194
+ "subsidence_history": random.choice(SUBSIDENCE_RISK),
195
+ "alarm_installed": random.random() < 0.6,
196
+ "locks_bs3621": random.random() < 0.55,
197
+ # Financial
198
+ "premium": round(random.uniform(150, 2500), 2),
199
+ "voluntary_excess": random.choice([0, 100, 250, 500]),
200
+ "compulsory_excess": random.choice([100, 250]),
201
+ "reserve_amount": round(reserve, 2),
202
+ "claim_amount": round(claim_amount, 2),
203
+ "recovery_amount": 0.0,
204
+ # Risk
205
+ "previous_claims_3y": previous_claims_3y,
206
+ "days_to_report": days_to_report,
207
+ "policy_age_days": policy_age_days,
208
+ "has_cctv": has_cctv,
209
+ "loss_adjuster_appointed": loss_adjuster_appointed,
210
+ "unoccupied_30_days": random.random() < 0.05,
211
+ # Target
212
+ "is_fraud": is_fraud,
213
+ }
214
+
215
+
216
+ # ── Liability claim record ──
217
+
218
+ def gen_liability_claim(claim_id: int, is_fraud: bool = False) -> dict:
219
+ region_name, region = random.choice(list(UK_REGIONS.items()))
220
+ postcode_prefix = random.choice(region)
221
+
222
+ inception = fake.date_between(start_date="-3y", end_date="-30d")
223
+ loss_date = fake.date_between(start_date=inception, end_date="today")
224
+ report_date = loss_date + timedelta(days=random.randint(0, 30))
225
+
226
+ claim_type = random.choice(
227
+ CLAIM_TYPES["Employers' Liability"] + CLAIM_TYPES["Public Liability"]
228
+ + CLAIM_TYPES["Professional Indemnity"]
229
+ )
230
+ reserve = random.randint(5000, 250000)
231
+
232
+ if is_fraud:
233
+ claim_amount = reserve * random.uniform(2.0, 6.0)
234
+ days_to_report = random.randint(20, 60)
235
+ previous_claims_3y = random.randint(3, 8)
236
+ solicitor_involved = True
237
+ independent_witness = False
238
+ medical_evidence_delay_days = random.randint(30, 90)
239
+ else:
240
+ claim_amount = reserve * random.uniform(0.5, 1.5)
241
+ days_to_report = random.randint(0, 14)
242
+ previous_claims_3y = random.choices([0, 1, 2], weights=[65, 25, 10])[0]
243
+ solicitor_involved = random.random() < 0.45
244
+ independent_witness = random.random() < 0.5
245
+ medical_evidence_delay_days = random.randint(2, 21)
246
+
247
+ return {
248
+ "claim_id": f"LBL-{claim_id:06d}",
249
+ "lob": "Liability",
250
+ "insurer": random.choice(UK_INSURERS),
251
+ "region": region_name,
252
+ "postcode_prefix": postcode_prefix,
253
+ "inception_date": inception.isoformat(),
254
+ "loss_date": loss_date.isoformat(),
255
+ "report_date": report_date.isoformat(),
256
+ "settlement_date": "",
257
+ "claim_type": claim_type,
258
+ "claim_status": random.choice(["Open", "Settled", "Declined", "Reserved"]),
259
+ # Claimant
260
+ "claimant_age": random.randint(18, 80),
261
+ "claimant_gender": random.choice(["M", "F"]),
262
+ "injury_type": random.choice([
263
+ "Whiplash", "Back strain", "Fracture", "Soft tissue",
264
+ "Psychological", "Head injury", "Burns", "Multiple",
265
+ ]),
266
+ "injury_severity": random.choice(["Minor", "Moderate", "Serious", "Catastrophic"]),
267
+ # Financial
268
+ "reserve_amount": round(reserve, 2),
269
+ "claim_amount": round(claim_amount, 2),
270
+ "solicitor_costs": round(random.uniform(500, 15000), 2) if solicitor_involved else 0.0,
271
+ "medical_costs": round(random.uniform(200, 25000), 2),
272
+ # Risk
273
+ "previous_claims_3y": previous_claims_3y,
274
+ "days_to_report": days_to_report,
275
+ "solicitor_involved": solicitor_involved,
276
+ "independent_witness": independent_witness,
277
+ "medical_evidence_delay_days": medical_evidence_delay_days,
278
+ "cctv_available": random.random() < 0.2,
279
+ # Target
280
+ "is_fraud": is_fraud,
281
+ }
282
+
283
+
284
+ # ── Orchestrator ──
285
+
286
+ def generate_tabular_dataset(
287
+ n: int = 50000,
288
+ fraud_rate: float = 0.08,
289
+ output_dir: str = "data/output",
290
+ ):
291
+ """Generate mixed LoB tabular claims dataset with ~8% fraud rate."""
292
+ os.makedirs(output_dir, exist_ok=True)
293
+
294
+ n_fraud = int(n * fraud_rate)
295
+ n_genuine = n - n_fraud
296
+
297
+ # Split across LoB: 50% motor, 30% property, 20% liability
298
+ splits = {"motor": 0.50, "property": 0.30, "liability": 0.20}
299
+
300
+ motor_records = []
301
+ prop_records = []
302
+ liab_records = []
303
+
304
+ for lob, frac in splits.items():
305
+ total_lob = int(n * frac)
306
+ fraud_lob = int(n_fraud * frac)
307
+ genuine_lob = total_lob - fraud_lob
308
+
309
+ gen_fn = {"motor": gen_motor_claim, "property": gen_property_claim, "liability": gen_liability_claim}[lob]
310
+ target = {"motor": motor_records, "property": prop_records, "liability": liab_records}[lob]
311
+
312
+ for i in tqdm(range(genuine_lob), desc=f"{lob.capitalize()} β€” genuine"):
313
+ target.append(gen_fn(len(target) + 1, is_fraud=False))
314
+ for i in tqdm(range(fraud_lob), desc=f"{lob.capitalize()} β€” fraud"):
315
+ target.append(gen_fn(len(target) + 1, is_fraud=True))
316
+
317
+ # Write separate CSVs per LoB
318
+ for name, records in [("motor", motor_records), ("property", prop_records), ("liability", liab_records)]:
319
+ random.shuffle(records)
320
+ outpath = os.path.join(output_dir, f"claims_{name}_{len(records) // 1000}k.csv")  # e.g. claims_motor_25k.csv
321
+ if records:
322
+ with open(outpath, "w", newline="") as f:
323
+ writer = csv.DictWriter(f, fieldnames=records[0].keys())
324
+ writer.writeheader()
325
+ writer.writerows(records)
326
+ fraud_count = sum(1 for r in records if r["is_fraud"])
327
+ print(f"βœ“ {name}: {len(records)} records ({fraud_count} fraud, {fraud_count/len(records)*100:.1f}%) β†’ {outpath}")
328
+
329
+ # Also write combined JSONL for convenience
330
+ all_records = motor_records + prop_records + liab_records
331
+ random.shuffle(all_records)
332
+ combined_path = os.path.join(output_dir, f"claims_all_{len(all_records)}.jsonl")
333
+ with open(combined_path, "w") as f:
334
+ for rec in all_records:
335
+ f.write(json.dumps(rec, ensure_ascii=False, default=str) + "\n")
336
+ total_fraud = sum(1 for r in all_records if r["is_fraud"])
337
+ print(f"\nβœ“ Combined: {len(all_records)} records ({total_fraud} fraud, {total_fraud/len(all_records)*100:.1f}%) β†’ {combined_path}")
338
+
339
+ return combined_path
340
+
341
+
342
+ if __name__ == "__main__":
343
+ generate_tabular_dataset()
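
A quick sanity-check sketch for the combined output (assumes pandas is installed and the default output directory and n=50000, which yields claims_all_50000.jsonl):

```python
import pandas as pd

# Combined JSONL written by generate_tabular_dataset()
df = pd.read_json("data/output/claims_all_50000.jsonl", lines=True)

print(df["lob"].value_counts(normalize=True))     # ~50/30/20 motor/property/liability
print(f"fraud rate: {df['is_fraud'].mean():.1%}")  # ~8%

# Separation between fraud and genuine on the engineered signals —
# synthetic fraud is deliberately easy, so expect large gaps here.
print(df.groupby("is_fraud")[["days_to_report", "previous_claims_3y"]].mean())
```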
data/generate_all.py ADDED
@@ -0,0 +1,65 @@
1
+ """
2
+ InsureOS β€” Master Data Generation Orchestrator
3
+ Runs all synthetic data generators and produces the complete training dataset.
4
+ """
5
+
6
+ import time
7
+ import sys
8
+ from pathlib import Path
9
+
10
+ # Ensure project root on path
11
+ sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
12
+
13
+ from data.gen_sft import generate_sft_dataset
14
+ from data.gen_dpo import generate_dpo_dataset
15
+ from data.gen_tabular import generate_tabular_dataset
16
+ from data.gen_documents import generate_document_dataset
17
+ from data.gen_ner import generate_ner_dataset
18
+
19
+
20
+ OUTPUT_DIR = "data/output"
21
+
22
+
23
+ def main():
24
+ start = time.time()
25
+ print("=" * 60)
26
+ print(" InsureOS β€” Synthetic Data Generation Pipeline")
27
+ print("=" * 60)
28
+
29
+ # 1. SFT instruction-response pairs
30
+ print("\n[1/5] SFT Data (10K instruction-response pairs)")
31
+ generate_sft_dataset(n=10000, output_path=f"{OUTPUT_DIR}/insurance_sft_10k.jsonl")
32
+
33
+ # 2. DPO preference pairs
34
+ print("\n[2/5] DPO Data (5K preference pairs)")
35
+ generate_dpo_dataset(n=5000, output_path=f"{OUTPUT_DIR}/insurance_dpo_5k.jsonl")
36
+
37
+ # 3. Tabular claims data
38
+ print("\n[3/5] Tabular Claims Data (50K records)")
39
+ generate_tabular_dataset(n=50000, fraud_rate=0.08, output_dir=OUTPUT_DIR)
40
+
41
+ # 4. Document classification
42
+ print("\n[4/5] Document Classification Data (10K documents)")
43
+ generate_document_dataset(n=10000, output_path=f"{OUTPUT_DIR}/insurance_docs_10k.jsonl")
44
+
45
+ # 5. NER data
46
+ print("\n[5/5] NER Data (8K token-labelled examples)")
47
+ generate_ner_dataset(n=8000, output_path=f"{OUTPUT_DIR}/insurance_ner_8k.jsonl")
48
+
49
+ elapsed = time.time() - start
50
+ print("\n" + "=" * 60)
51
+ print(f" βœ“ All data generated in {elapsed:.1f}s")
52
+ print(f" Output directory: {OUTPUT_DIR}/")
53
+ print("=" * 60)
54
+
55
+ # List generated files
56
+ output_path = Path(OUTPUT_DIR)
57
+ if output_path.exists():
58
+ print("\nGenerated files:")
59
+ for f in sorted(output_path.iterdir()):
60
+ size_mb = f.stat().st_size / (1024 * 1024)
61
+ print(f" {f.name:50s} {size_mb:8.2f} MB")
62
+
63
+
64
+ if __name__ == "__main__":
65
+ main()
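
After a full run, a small sketch (file names assume the defaults above) to confirm each JSONL landed with the expected row count:

```python
from pathlib import Path

EXPECTED = {
    "insurance_sft_10k.jsonl": 10_000,
    "insurance_dpo_5k.jsonl": 5_000,
    "insurance_docs_10k.jsonl": 10_000,
    "insurance_ner_8k.jsonl": 8_000,
}

out = Path("data/output")
for name, expected in EXPECTED.items():
    path = out / name
    # One JSON object per line, so line count == record count
    actual = sum(1 for _ in path.open()) if path.exists() else 0
    status = "OK" if actual == expected else "MISMATCH"
    print(f"{status:8s} {name}: {actual} / {expected}")
```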
distill.py ADDED
@@ -0,0 +1,260 @@
1
+ """
2
+ InsureOS β€” Knowledge Distillation Script
3
+ Distils InsureLLM-8B (DPO-aligned teacher) β†’ InsureLLM-4B (Qwen3-4B student).
4
+ Uses KL-divergence + hard-label distillation for 16 GB VRAM.
5
+ """
6
+
7
+ import os
8
+ import json
9
+ import argparse
10
+
11
+ import torch
12
+ import torch.nn.functional as F
13
+ from torch.utils.data import DataLoader
14
+ from datasets import Dataset
15
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
16
+ from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
17
+ from tqdm import tqdm
18
+
19
+
20
+ # ── Defaults ──
21
+
22
+ TEACHER_MODEL = "models/insurellm-8b-dpo-merged"
23
+ STUDENT_MODEL = "Qwen/Qwen3-4B"
24
+ DATA_PATH = "data/output/insurance_sft_10k.jsonl"
25
+ OUTPUT_DIR = "models/insurellm-4b-distilled"
26
+ MAX_SEQ_LEN = 1024
27
+ LORA_R = 32
28
+ LORA_ALPHA = 64
29
+ TEMPERATURE = 3.0 # softens teacher logits
30
+ ALPHA_KL = 0.7 # weight of KL loss vs hard label loss
31
+ EPOCHS = 3
32
+ BATCH_SIZE = 2
33
+ GRAD_ACCUM = 8
34
+ LR = 1e-4
35
+ WARMUP_STEPS = 50
36
+ SAVE_STEPS = 200
37
+
38
+
39
+ def load_data(path: str, tokenizer, max_len: int) -> Dataset:
40
+ """Load and tokenize SFT data for distillation."""
41
+ records = []
42
+ with open(path) as f:
43
+ for line in f:
44
+ obj = json.loads(line)
45
+ # Apply chat template to get text
46
+ text = tokenizer.apply_chat_template(
47
+ obj["messages"],
48
+ tokenize=False,
49
+ add_generation_prompt=False,
50
+ )
51
+ records.append({"text": text})
52
+
53
+ ds = Dataset.from_list(records)
54
+
55
+ def tokenize_fn(examples):
56
+ return tokenizer(
57
+ examples["text"],
58
+ truncation=True,
59
+ max_length=max_len,
60
+ padding="max_length",
61
+ return_tensors="pt",
62
+ )
63
+
64
+ ds = ds.map(tokenize_fn, batched=True, remove_columns=["text"])
65
+ ds.set_format("torch")
66
+ return ds
67
+
68
+
69
+ def main():
70
+ parser = argparse.ArgumentParser(description="Distil InsureLLM-8B β†’ InsureLLM-4B")
71
+ parser.add_argument("--teacher-model", default=TEACHER_MODEL)
72
+ parser.add_argument("--student-model", default=STUDENT_MODEL)
73
+ parser.add_argument("--data-path", default=DATA_PATH)
74
+ parser.add_argument("--output-dir", default=OUTPUT_DIR)
75
+ parser.add_argument("--epochs", type=int, default=EPOCHS)
76
+ parser.add_argument("--batch-size", type=int, default=BATCH_SIZE)
77
+ parser.add_argument("--lr", type=float, default=LR)
78
+ parser.add_argument("--temperature", type=float, default=TEMPERATURE)
79
+ parser.add_argument("--alpha-kl", type=float, default=ALPHA_KL)
80
+ args = parser.parse_args()
81
+
82
+ print(f"{'='*60}")
83
+ print(f" InsureOS β€” Knowledge Distillation")
84
+ print(f" Teacher: {args.teacher_model}")
85
+ print(f" Student: {args.student_model}")
86
+ print(f" Temperature: {args.temperature}, Alpha: {args.alpha_kl}")
87
+ print(f"{'='*60}\n")
88
+
89
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
90
+
91
+ # ── 1. Load tokenizer (student's) ──
92
+ print("[1/5] Loading tokenizer...")
93
+ tokenizer = AutoTokenizer.from_pretrained(
94
+ args.student_model,
95
+ trust_remote_code=True,
96
+ padding_side="right",
97
+ )
98
+ if tokenizer.pad_token is None:
99
+ tokenizer.pad_token = tokenizer.eos_token
100
+
101
+ # ── 2. Load teacher (4-bit, frozen) ──
102
+ print("[2/5] Loading teacher model (4-bit, frozen)...")
103
+ bnb_config = BitsAndBytesConfig(
104
+ load_in_4bit=True,
105
+ bnb_4bit_quant_type="nf4",
106
+ bnb_4bit_compute_dtype=torch.bfloat16,
107
+ bnb_4bit_use_double_quant=True,
108
+ )
109
+
110
+ teacher = AutoModelForCausalLM.from_pretrained(
111
+ args.teacher_model,
112
+ quantization_config=bnb_config,
113
+ device_map="auto",
114
+ trust_remote_code=True,
115
+ torch_dtype=torch.bfloat16,
116
+ )
117
+ teacher.eval()
118
+ for p in teacher.parameters():
119
+ p.requires_grad = False
120
+
121
+ # ── 3. Load student (4-bit + LoRA for training) ──
122
+ print("[3/5] Loading student model with LoRA...")
123
+ student = AutoModelForCausalLM.from_pretrained(
124
+ args.student_model,
125
+ quantization_config=bnb_config,
126
+ device_map="auto",
127
+ trust_remote_code=True,
128
+ torch_dtype=torch.bfloat16,
129
+ )
130
+ student = prepare_model_for_kbit_training(student, use_gradient_checkpointing=True)
131
+
132
+ lora_config = LoraConfig(
133
+ r=LORA_R,
134
+ lora_alpha=LORA_ALPHA,
135
+ lora_dropout=0.05,
136
+ target_modules="all-linear",
137
+ task_type=TaskType.CAUSAL_LM,
138
+ bias="none",
139
+ )
140
+ student = get_peft_model(student, lora_config)
141
+ student.print_trainable_parameters()
142
+
143
+ # ── 4. Load data ──
144
+ print("[4/5] Loading and tokenizing data...")
145
+ dataset = load_data(args.data_path, tokenizer, MAX_SEQ_LEN)
146
+ dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=True)
147
+ print(f" Examples: {len(dataset)}, Batches/epoch: {len(dataloader)}")
148
+
149
+ # ── 5. Distillation training loop ──
150
+ print("[5/5] Starting distillation...\n")
151
+
152
+ optimizer = torch.optim.AdamW(student.parameters(), lr=args.lr, weight_decay=0.01)
153
+
154
+ total_steps = len(dataloader) * args.epochs // GRAD_ACCUM
155
+ scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
156
+
157
+ global_step = 0
158
+ best_loss = float("inf")
159
+
160
+ for epoch in range(args.epochs):
161
+ student.train()
162
+ epoch_loss = 0.0
163
+ accum_loss = 0.0
164
+
165
+ pbar = tqdm(dataloader, desc=f"Epoch {epoch+1}/{args.epochs}")
166
+ for step, batch in enumerate(pbar):
167
+ input_ids = batch["input_ids"].to(device)
168
+ attention_mask = batch["attention_mask"].to(device)
169
+
170
+ # Teacher forward (no grad)
171
+ with torch.no_grad():
172
+ teacher_outputs = teacher(
173
+ input_ids=input_ids,
174
+ attention_mask=attention_mask,
175
+ )
176
+ teacher_logits = teacher_outputs.logits
177
+
178
+ # Student forward
179
+ student_outputs = student(
180
+ input_ids=input_ids,
181
+ attention_mask=attention_mask,
182
+ labels=input_ids, # for hard label loss
183
+ )
184
+ student_logits = student_outputs.logits
185
+ hard_loss = student_outputs.loss
186
+
187
+ # KL divergence loss (soft labels)
188
+ T = args.temperature
189
+ teacher_log_probs = F.log_softmax(teacher_logits / T, dim=-1)
190
+ student_log_probs = F.log_softmax(student_logits / T, dim=-1)
191
+
192
+ # Only compute KL over non-padding tokens
193
+ mask = attention_mask.unsqueeze(-1).float()
194
+ kl_loss = F.kl_div(
195
+ student_log_probs * mask,
196
+ teacher_log_probs * mask,
197
+ log_target=True,
198
+ reduction="batchmean",
199
+ ) * (T ** 2)
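+ # Masked positions contribute zero to the KL sum: both log-prob tensors are
+ # zeroed there and, with log_target=True, exp(0) * (0 - 0) = 0. The T**2
+ # factor keeps soft-label gradients on the same scale as the hard-label loss
+ # (Hinton et al., 2015). Token-level KL also assumes teacher and student
+ # share a tokenizer/vocab — the code already tokenizes both with the
+ # student's tokenizer.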
200
+
201
+ # Combined loss
202
+ loss = args.alpha_kl * kl_loss + (1 - args.alpha_kl) * hard_loss
203
+ loss = loss / GRAD_ACCUM
204
+
205
+ loss.backward()
206
+ accum_loss += loss.item()
207
+
208
+ if (step + 1) % GRAD_ACCUM == 0:
209
+ torch.nn.utils.clip_grad_norm_(student.parameters(), 1.0)
210
+ optimizer.step()
211
+ scheduler.step()
212
+ optimizer.zero_grad()
213
+ global_step += 1
214
+
215
+ pbar.set_postfix({
216
+ "loss": f"{accum_loss:.4f}",
217
+ "kl": f"{kl_loss.item():.4f}",
218
+ "hard": f"{hard_loss.item():.4f}",
219
+ "lr": f"{scheduler.get_last_lr()[0]:.2e}",
220
+ })
221
+ epoch_loss += accum_loss
222
+ accum_loss = 0.0
223
+
224
+ # Save checkpoint
225
+ if global_step % SAVE_STEPS == 0:
226
+ ckpt_dir = os.path.join(args.output_dir, f"checkpoint-{global_step}")
227
+ student.save_pretrained(ckpt_dir)
228
+ tokenizer.save_pretrained(ckpt_dir)
229
+ print(f"\n Checkpoint saved: {ckpt_dir}")
230
+
231
+ avg_loss = epoch_loss / max(1, len(dataloader) // GRAD_ACCUM)  # optimiser steps this epoch, not cumulative global_step
232
+ print(f"\nEpoch {epoch+1} β€” avg loss: {avg_loss:.4f}")
233
+
234
+ if avg_loss < best_loss:
235
+ best_loss = avg_loss
236
+ best_dir = os.path.join(args.output_dir, "best")
237
+ student.save_pretrained(best_dir)
238
+ tokenizer.save_pretrained(best_dir)
239
+ print(f" Best model saved: {best_dir}")
240
+
241
+ # ── Final save ──
242
+ print("\nSaving final distilled model...")
243
+ student.save_pretrained(args.output_dir)
244
+ tokenizer.save_pretrained(args.output_dir)
245
+
246
+ # Merge LoRA
247
+ merged_dir = f"{args.output_dir}-merged"
248
+ print(f"Merging LoRA β†’ {merged_dir}")
249
+ merged = student.merge_and_unload()
250
+ merged.save_pretrained(merged_dir)
251
+ tokenizer.save_pretrained(merged_dir)
252
+
253
+ print(f"\nβœ“ Distillation complete!")
254
+ print(f" Student (LoRA): {args.output_dir}")
255
+ print(f" Student (merged): {merged_dir}")
256
+ print(f" Best loss: {best_loss:.4f}")
257
+
258
+
259
+ if __name__ == "__main__":
260
+ main()
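
The loss above, extracted into a self-contained sketch for unit-testing the objective in isolation (function and variable names here are illustrative, not part of the script):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      attention_mask: torch.Tensor,
                      hard_loss: torch.Tensor,
                      T: float = 3.0,
                      alpha_kl: float = 0.7) -> torch.Tensor:
    """Combined KL + hard-label loss, mirroring the training loop above."""
    mask = attention_mask.unsqueeze(-1).float()
    student_lp = F.log_softmax(student_logits / T, dim=-1) * mask
    teacher_lp = F.log_softmax(teacher_logits / T, dim=-1) * mask
    kl = F.kl_div(student_lp, teacher_lp, log_target=True, reduction="batchmean") * T**2
    return alpha_kl * kl + (1 - alpha_kl) * hard_loss

# Smoke test on random tensors: batch=2, seq=8, vocab=32
s = torch.randn(2, 8, 32, requires_grad=True)
t = torch.randn(2, 8, 32)
m = torch.ones(2, 8)
loss = distillation_loss(s, t, m, hard_loss=torch.tensor(1.0))
loss.backward()  # gradients flow to the student logits only
print(float(loss))
```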
doc_classifier.py ADDED
@@ -0,0 +1,181 @@
1
+ """
2
+ InsureOS β€” Document Classifier Training
3
+ Fine-tunes ModernBERT (or a fallback BERT-base) for 12-class insurance document classification.
4
+ """
5
+
6
+ import os
7
+ import json
8
+ import argparse
9
+ from pathlib import Path
10
+
11
+ import torch
12
+ import numpy as np
13
+ from datasets import Dataset
14
+ from transformers import (
15
+ AutoModelForSequenceClassification,
16
+ AutoTokenizer,
17
+ TrainingArguments,
18
+ Trainer,
19
+ )
20
+ from sklearn.metrics import accuracy_score, f1_score, classification_report
21
+
22
+ from data.constants import DOCUMENT_TYPES
23
+
24
+
25
+ # ── Defaults ──
26
+
27
+ # ModernBERT is preferred; fall back to bert-base if unavailable
28
+ MODEL_NAME = "answerdotai/ModernBERT-base"
29
+ FALLBACK_MODEL = "google-bert/bert-base-uncased"
30
+ DATA_PATH = "data/output/insurance_docs_10k.jsonl"
31
+ OUTPUT_DIR = "models/doc-classifier"
32
+ MAX_LEN = 512
33
+ EPOCHS = 5
34
+ BATCH_SIZE = 16
35
+ LR = 2e-5
36
+ WARMUP_RATIO = 0.1
37
+ EVAL_SPLIT = 0.1
38
+ LABELS = DOCUMENT_TYPES # 12 classes
39
+
40
+
41
+ def load_data(path: str) -> Dataset:
42
+ records = []
43
+ with open(path) as f:
44
+ for line in f:
45
+ obj = json.loads(line)
46
+ records.append({
47
+ "text": obj["text"],
48
+ "label": obj["label_id"],
49
+ })
50
+ return Dataset.from_list(records)
51
+
52
+
53
+ def compute_metrics(pred):
54
+ labels = pred.label_ids
55
+ preds = np.argmax(pred.predictions, axis=-1)
56
+ acc = accuracy_score(labels, preds)
57
+ f1_macro = f1_score(labels, preds, average="macro")
58
+ f1_weighted = f1_score(labels, preds, average="weighted")
59
+ return {"accuracy": acc, "f1_macro": f1_macro, "f1_weighted": f1_weighted}
60
+
61
+
62
+ def main():
63
+ parser = argparse.ArgumentParser(description="Train document classifier")
64
+ parser.add_argument("--model-name", default=MODEL_NAME)
65
+ parser.add_argument("--data-path", default=DATA_PATH)
66
+ parser.add_argument("--output-dir", default=OUTPUT_DIR)
67
+ parser.add_argument("--epochs", type=int, default=EPOCHS)
68
+ parser.add_argument("--batch-size", type=int, default=BATCH_SIZE)
69
+ parser.add_argument("--lr", type=float, default=LR)
70
+ args = parser.parse_args()
71
+
72
+ print(f"{'='*60}")
73
+ print(f" InsureOS β€” Document Classifier Training")
74
+ print(f" Model: {args.model_name}")
75
+ print(f" Classes: {len(LABELS)}")
76
+ print(f"{'='*60}\n")
77
+
78
+ # ── 1. Load tokenizer & model ──
79
+ print("[1/4] Loading model and tokenizer...")
80
+ try:
81
+ tokenizer = AutoTokenizer.from_pretrained(args.model_name)
82
+ model = AutoModelForSequenceClassification.from_pretrained(
83
+ args.model_name,
84
+ num_labels=len(LABELS),
85
+ id2label={i: l for i, l in enumerate(LABELS)},
86
+ label2id={l: i for i, l in enumerate(LABELS)},
87
+ )
88
+ except Exception:
89
+ print(f" ⚠ {args.model_name} unavailable, falling back to {FALLBACK_MODEL}")
90
+ tokenizer = AutoTokenizer.from_pretrained(FALLBACK_MODEL)
91
+ model = AutoModelForSequenceClassification.from_pretrained(
92
+ FALLBACK_MODEL,
93
+ num_labels=len(LABELS),
94
+ id2label={i: l for i, l in enumerate(LABELS)},
95
+ label2id={l: i for i, l in enumerate(LABELS)},
96
+ )
97
+
98
+ # ── 2. Load & tokenize data ──
99
+ print("[2/4] Loading data...")
100
+ dataset = load_data(args.data_path)
101
+ print(f" Total: {len(dataset)}")
102
+
103
+ def tokenize_fn(examples):
104
+ return tokenizer(
105
+ examples["text"],
106
+ truncation=True,
107
+ max_length=MAX_LEN,
108
+ padding="max_length",
109
+ )
110
+
111
+ dataset = dataset.map(tokenize_fn, batched=True)
112
+ dataset = dataset.class_encode_column("label")
113
+ split = dataset.train_test_split(test_size=EVAL_SPLIT, seed=42, stratify_by_column="label")
114
+ train_ds = split["train"]
115
+ eval_ds = split["test"]
116
+ print(f" Train: {len(train_ds)}, Eval: {len(eval_ds)}")
117
+
118
+ # ── 3. Training ──
119
+ print("[3/4] Training...")
120
+ training_args = TrainingArguments(
121
+ output_dir=args.output_dir,
122
+ num_train_epochs=args.epochs,
123
+ per_device_train_batch_size=args.batch_size,
124
+ per_device_eval_batch_size=args.batch_size * 2,
125
+ learning_rate=args.lr,
126
+ lr_scheduler_type="cosine",
127
+ warmup_ratio=WARMUP_RATIO,
128
+ weight_decay=0.01,
129
+ eval_strategy="epoch",
130
+ save_strategy="epoch",
131
+ save_total_limit=2,
132
+ load_best_model_at_end=True,
133
+ metric_for_best_model="f1_macro",
134
+ greater_is_better=True,
135
+ fp16=torch.cuda.is_available(),
136
+ report_to="none",
137
+ logging_steps=50,
138
+ )
139
+
140
+ trainer = Trainer(
141
+ model=model,
142
+ args=training_args,
143
+ train_dataset=train_ds,
144
+ eval_dataset=eval_ds,
145
+ compute_metrics=compute_metrics,
146
+ )
147
+
148
+ trainer.train()
149
+
150
+ # ── 4. Evaluate & save ──
151
+ print("[4/4] Final evaluation...")
152
+ results = trainer.evaluate()
153
+ print(f" Accuracy: {results['eval_accuracy']:.4f}")
154
+ print(f" F1 (macro): {results['eval_f1_macro']:.4f}")
155
+ print(f" F1 (weighted): {results['eval_f1_weighted']:.4f}")
156
+
157
+ # Detailed classification report
158
+ preds = trainer.predict(eval_ds)
159
+ y_pred = np.argmax(preds.predictions, axis=-1)
160
+ y_true = preds.label_ids
161
+ report = classification_report(y_true, y_pred, target_names=LABELS)
162
+ print(f"\n{report}")
163
+
164
+ # Save
165
+ trainer.save_model(args.output_dir)
166
+ tokenizer.save_pretrained(args.output_dir)
167
+
168
+ # Save label map and results
169
+ meta = {
170
+ "labels": LABELS,
171
+ "id2label": {i: l for i, l in enumerate(LABELS)},
172
+ "results": {k: float(v) for k, v in results.items()},
173
+ }
174
+ with open(os.path.join(args.output_dir, "training_meta.json"), "w") as f:
175
+ json.dump(meta, f, indent=2)
176
+
177
+ print(f"\nβœ“ Document classifier saved β†’ {args.output_dir}")
178
+
179
+
180
+ if __name__ == "__main__":
181
+ main()
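
A minimal inference sketch for the saved classifier (the sample text and the predicted label shown in the comment are illustrative; actual labels come from DOCUMENT_TYPES):

```python
from transformers import pipeline

# Load the fine-tuned classifier saved by the training script above
clf = pipeline("text-classification", model="models/doc-classifier")

sample = (
    "CLAIM FORM — Please provide details of the incident, the policy number, "
    "and any supporting documentation such as photographs or receipts."
)
print(clf(sample, truncation=True, max_length=512))
# e.g. [{'label': 'claim_form', 'score': 0.99}]
```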
dpo_train.py ADDED
@@ -0,0 +1,188 @@
1
+ """
2
+ InsureOS β€” DPO (Direct Preference Optimization) Training Script
3
+ Applies RLHF-style alignment using preference pairs: FCA-compliant (chosen) vs non-compliant (rejected).
4
+ Runs on the QLoRA-finetuned InsureLLM-4B checkpoint.
5
+ """
6
+
7
+ import os
8
+ import json
9
+ import argparse
10
+
11
+ import torch
12
+ from datasets import Dataset
13
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
14
+ from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, TaskType
15
+ from trl import DPOTrainer, DPOConfig
16
+
17
+
18
+ # ── Defaults ──
19
+
20
+ SFT_MODEL = "models/insurellm-4b-qlora-merged" # output of Phase 1
21
+ DATA_PATH = "data/output/insurance_dpo_5k.jsonl"
22
+ OUTPUT_DIR = "models/insurellm-4b-dpo"
23
+ MAX_SEQ_LEN = 512
24
+ MAX_PROMPT_LEN = 256
25
+ LORA_R = 32
26
+ LORA_ALPHA = 64
27
+ LORA_DROPOUT = 0.05
28
+ BETA = 0.1 # DPO beta — lower values apply a weaker KL penalty, letting the policy drift further from the reference
29
+ EPOCHS = 1
30
+ BATCH_SIZE = 1
31
+ GRAD_ACCUM = 16 # effective batch = 16
32
+ LR = 5e-5
33
+ WARMUP_RATIO = 0.1
34
+ LOGGING_STEPS = 10
35
+ SAVE_STEPS = 100
36
+ EVAL_SPLIT = 0.05
37
+
38
+
39
+ def load_dpo_data(path: str) -> Dataset:
40
+ """Load JSONL DPO data. Each record has prompt, chosen, rejected."""
41
+ records = []
42
+ with open(path) as f:
43
+ for line in f:
44
+ obj = json.loads(line)
45
+ records.append(obj)
46
+ return Dataset.from_list(records)
47
+
48
+
49
+ def main():
50
+ parser = argparse.ArgumentParser(description="DPO alignment for InsureLLM")
51
+ parser.add_argument("--sft-model", default=SFT_MODEL)
52
+ parser.add_argument("--data-path", default=DATA_PATH)
53
+ parser.add_argument("--output-dir", default=OUTPUT_DIR)
54
+ parser.add_argument("--epochs", type=int, default=EPOCHS)
55
+ parser.add_argument("--batch-size", type=int, default=BATCH_SIZE)
56
+ parser.add_argument("--lr", type=float, default=LR)
57
+ parser.add_argument("--beta", type=float, default=BETA)
58
+ parser.add_argument("--lora-r", type=int, default=LORA_R)
59
+ parser.add_argument("--resume-from-checkpoint", action="store_true")
60
+ args = parser.parse_args()
61
+
62
+ print(f"{'='*60}")
63
+ print(f" InsureOS β€” DPO Alignment Training")
64
+ print(f" SFT model: {args.sft_model}")
65
+ print(f" DPO beta: {args.beta}")
66
+ print(f" Data: {args.data_path}")
67
+ print(f"{'='*60}\n")
68
+
69
+ # ── 1. Tokenizer ──
70
+ print("[1/5] Loading tokenizer...")
71
+ tokenizer = AutoTokenizer.from_pretrained(
72
+ args.sft_model,
73
+ trust_remote_code=True,
74
+ padding_side="left", # DPO needs left padding
75
+ )
76
+ if tokenizer.pad_token is None:
77
+ tokenizer.pad_token = tokenizer.eos_token
78
+
79
+ # ── 2. Load model (4-bit for VRAM) ──
80
+ print("[2/5] Loading SFT model in 4-bit...")
81
+ bnb_config = BitsAndBytesConfig(
82
+ load_in_4bit=True,
83
+ bnb_4bit_quant_type="nf4",
84
+ bnb_4bit_compute_dtype=torch.bfloat16,
85
+ bnb_4bit_use_double_quant=True,
86
+ )
87
+
88
+ model = AutoModelForCausalLM.from_pretrained(
89
+ args.sft_model,
90
+ quantization_config=bnb_config,
91
+ device_map="auto",
92
+ trust_remote_code=True,
93
+ attn_implementation="sdpa",
94
+ dtype=torch.bfloat16,
95
+ )
96
+ model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
97
+
98
+ # Reference model (same base, frozen) β€” DPOTrainer can create this automatically,
99
+ # but with 4-bit we share the base and just use the adapter diff.
100
+ # DPOTrainer with peft_config will handle ref model internally.
101
+
102
+ # ── 3. Apply fresh LoRA for DPO ──
103
+ print("[3/5] Applying DPO LoRA adapters...")
104
+ lora_config = LoraConfig(
105
+ r=args.lora_r,
106
+ lora_alpha=LORA_ALPHA,
107
+ lora_dropout=LORA_DROPOUT,
108
+ target_modules="all-linear",
109
+ task_type=TaskType.CAUSAL_LM,
110
+ bias="none",
111
+ )
112
+
113
+ # ── 4. Load data ──
114
+ print("[4/5] Loading DPO preference data...")
115
+ dataset = load_dpo_data(args.data_path)
116
+ print(f" Total preference pairs: {len(dataset)}")
117
+
118
+ split = dataset.train_test_split(test_size=EVAL_SPLIT, seed=42)
119
+ train_ds = split["train"]
120
+ eval_ds = split["test"]
121
+ print(f" Train: {len(train_ds)}, Eval: {len(eval_ds)}")
122
+
123
+ # ── 5. Train with DPO ──
124
+ print("[5/5] Starting DPO training...")
125
+
126
+ dpo_config = DPOConfig(
127
+ output_dir=args.output_dir,
128
+ num_train_epochs=args.epochs,
129
+ per_device_train_batch_size=args.batch_size,
130
+ per_device_eval_batch_size=args.batch_size,
131
+ gradient_accumulation_steps=GRAD_ACCUM,
132
+ learning_rate=args.lr,
133
+ lr_scheduler_type="cosine",
134
+ warmup_ratio=WARMUP_RATIO,
135
+ weight_decay=0.01,
136
+ bf16=True,
137
+ beta=args.beta,
138
+ max_length=MAX_SEQ_LEN,
139
+ logging_steps=LOGGING_STEPS,
140
+ save_steps=SAVE_STEPS,
141
+ save_total_limit=2,
142
+ eval_strategy="steps",
143
+ eval_steps=SAVE_STEPS,
144
+ load_best_model_at_end=True,
145
+ gradient_checkpointing=True,
146
+ gradient_checkpointing_kwargs={"use_reentrant": False},
147
+ report_to="none",
148
+ )
149
+
150
+ trainer = DPOTrainer(
151
+ model=model,
152
+ args=dpo_config,
153
+ train_dataset=train_ds,
154
+ eval_dataset=eval_ds,
155
+ processing_class=tokenizer,
156
+ peft_config=lora_config,
157
+ )
158
+
159
+ if args.resume_from_checkpoint:
160
+ trainer.train(resume_from_checkpoint=True)
161
+ else:
162
+ trainer.train()
163
+
164
+ # ── Save ──
165
+ print("\nSaving DPO model...")
166
+ trainer.save_model(args.output_dir)
167
+ tokenizer.save_pretrained(args.output_dir)
168
+
169
+ # Merge
170
+ merged_dir = f"{args.output_dir}-merged"
171
+ print(f"Merging DPO LoRA β†’ {merged_dir}")
172
+ merged = trainer.model.merge_and_unload()
173
+ merged.save_pretrained(merged_dir)
174
+ tokenizer.save_pretrained(merged_dir)
175
+
176
+ print(f"\nβœ“ DPO training complete!")
177
+ print(f" DPO adapter: {args.output_dir}")
178
+ print(f" Merged model: {merged_dir}")
179
+ print(f"\n This model is now aligned to prefer:")
180
+ print(f" βœ“ FCA Consumer Duty compliant responses")
181
+ print(f" βœ“ Plain English over jargon")
182
+ print(f" βœ“ GDPR-safe data handling")
183
+ print(f" βœ“ Accurate claims/regulatory information")
184
+ print(f" βœ“ Fair pricing (no proxy discrimination)")
185
+
186
+
187
+ if __name__ == "__main__":
188
+ main()
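
`load_dpo_data` expects one JSON object per line with `prompt`, `chosen`, and `rejected` keys — a hand-written record of that shape for reference (contents illustrative):

```python
import json

record = {
    "prompt": "A customer asks why their claim payout was reduced. Draft a reply.",
    "chosen": (
        "Thank you for getting in touch. Your settlement was reduced because your "
        "policy carries a £250 excess, which is deducted from every claim. ..."
    ),
    "rejected": (
        "Per clause 4.2(b) the indemnity is net of the applicable excess. "
        "No further correspondence will be entered into."
    ),
}
print(json.dumps(record, ensure_ascii=False))  # one line of the DPO JSONL
```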
evaluation/__init__.py ADDED
@@ -0,0 +1 @@
1
+ # Evaluation package
evaluation/results/full_eval_report.json ADDED
@@ -0,0 +1,412 @@
{
  "insurellm": {
    "model": "models/insurellm-4b-realworld-merged",
    "domain_eval": [
      {
        "id": "fca_consumer_duty",
        "score": 0,
        "required_found": 0,
        "required_total": 5,
        "forbidden_found": 0,
        "latency_s": 69.76176810264587,
        "response_length": 111
      },
      {
        "id": "gdpr_data_protection",
        "score": 0,
        "required_found": 0,
        "required_total": 4,
        "forbidden_found": 0,
        "latency_s": 180.0589418411255,
        "response_length": 201
      },
      {
        "id": "claims_process",
        "score": 0.6,
        "required_found": 3,
        "required_total": 5,
        "forbidden_found": 0,
        "latency_s": 197.52413249015808,
        "response_length": 189
      },
      {
        "id": "fraud_indicators",
        "score": 0.25,
        "required_found": 1,
        "required_total": 4,
        "forbidden_found": 0,
        "latency_s": 196.2334017753601,
        "response_length": 181
      },
      {
        "id": "lloyds_market",
        "score": 0.2,
        "required_found": 1,
        "required_total": 5,
        "forbidden_found": 0,
        "latency_s": 121.65669202804565,
        "response_length": 146
      },
      {
        "id": "pricing_fairness",
        "score": 0.25,
        "required_found": 1,
        "required_total": 4,
        "forbidden_found": 0,
        "latency_s": 57.555458784103394,
        "response_length": 83
      },
      {
        "id": "subrogation",
        "score": 0.5,
        "required_found": 2,
        "required_total": 4,
        "forbidden_found": 0,
        "latency_s": 117.41489219665527,
        "response_length": 155
      },
      {
        "id": "renewal_transparency",
        "score": 0.2,
        "required_found": 1,
        "required_total": 5,
        "forbidden_found": 0,
        "latency_s": 161.38414025306702,
        "response_length": 181
      }
    ],
    "generation_metrics": {
      "rouge1": 0.3839015292749472,
      "rouge2": 0.10873793498858823,
      "rougeL": 0.19891142911031565
    },
    "summary": {
      "avg_domain_score": 0.25,
      "avg_latency_s": 137.6986784338951,
      "domain_pass_rate": 0.0
    }
  },
  "fraudnet": [
    {
      "lob": "Motor",
      "auc_roc": 1.0,
      "avg_precision": 1.0,
      "precision_fraud": 1.0,
      "recall_fraud": 1.0,
      "f1_fraud": 1.0,
      "n_train": 20000,
      "n_test": 5000,
      "n_fraud_train": 1600,
      "fraud_rate": 0.08,
      "features_used": [
        "driver_age",
        "years_driving",
        "years_ncd",
        "vehicle_year",
        "vehicle_value",
        "annual_mileage",
        "premium",
        "voluntary_excess",
        "compulsory_excess",
        "reserve_amount",
        "claim_amount",
        "recovery_amount",
        "previous_claims_3y",
        "days_to_report",
        "policy_age_days",
        "witnesses",
        "dashcam",
        "police_report",
        "claim_reserve_ratio",
        "claim_premium_ratio",
        "new_policy",
        "late_report",
        "vehicle_age"
      ],
      "top_features": {
        "claim_reserve_ratio": "0.48872024",
        "days_to_report": "0.43656266",
        "policy_age_days": "0.057298034",
        "previous_claims_3y": "0.014383504",
        "witnesses": "0.0020713843",
        "dashcam": "0.00096420496",
        "driver_age": "0.0",
        "years_driving": "0.0",
        "years_ncd": "0.0",
        "vehicle_year": "0.0"
      }
    },
    {
      "lob": "Property",
      "auc_roc": 1.0,
      "avg_precision": 1.0,
      "precision_fraud": 1.0,
      "recall_fraud": 1.0,
      "f1_fraud": 1.0,
      "n_train": 12000,
      "n_test": 3000,
      "n_fraud_train": 960,
      "fraud_rate": 0.08,
      "features_used": [
        "property_age_years",
        "rebuild_value",
        "contents_value",
        "premium",
        "voluntary_excess",
        "compulsory_excess",
        "reserve_amount",
        "claim_amount",
        "previous_claims_3y",
        "days_to_report",
        "policy_age_days",
        "has_cctv",
        "loss_adjuster_appointed",
        "unoccupied_30_days",
        "alarm_installed",
        "locks_bs3621",
        "claim_reserve_ratio",
        "claim_premium_ratio",
        "new_policy",
        "late_report"
      ],
      "top_features": {
        "days_to_report": "0.40884864",
        "policy_age_days": "0.37601757",
        "claim_reserve_ratio": "0.1996711",
        "previous_claims_3y": "0.014420756",
        "late_report": "0.0010419991",
        "property_age_years": "0.0",
        "rebuild_value": "0.0",
        "contents_value": "0.0",
        "premium": "0.0",
        "voluntary_excess": "0.0"
      }
    },
    {
      "lob": "Liability",
      "auc_roc": 1.0,
      "avg_precision": 1.0,
      "precision_fraud": 1.0,
      "recall_fraud": 1.0,
      "f1_fraud": 1.0,
      "n_train": 8000,
      "n_test": 2000,
      "n_fraud_train": 640,
      "fraud_rate": 0.08,
      "features_used": [
        "claimant_age",
        "reserve_amount",
        "claim_amount",
        "solicitor_costs",
        "medical_costs",
        "previous_claims_3y",
        "days_to_report",
        "solicitor_involved",
        "independent_witness",
        "medical_evidence_delay_days",
        "cctv_available",
        "claim_reserve_ratio",
        "late_report"
      ],
      "top_features": {
        "previous_claims_3y": "0.561369",
        "days_to_report": "0.43863094",
        "claimant_age": "0.0",
        "reserve_amount": "0.0",
        "claim_amount": "0.0",
        "solicitor_costs": "0.0",
        "medical_costs": "0.0",
        "solicitor_involved": "0.0",
        "independent_witness": "0.0",
        "medical_evidence_delay_days": "0.0"
      }
    }
  ],
  "pricing": {
    "glm": {
      "model": "TweedieGLM",
      "tweedie_power": 1.5,
      "mae": 12244.958454220796,
      "rmse": 17615.02268631013,
      "mape_pct": 198.8474585306464,
      "coefficients": {
        "driver_age": 0.0,
        "years_driving": 0.0,
        "years_ncd": 0.0,
        "vehicle_year": 0.0,
        "vehicle_value": 0.0,
        "annual_mileage": 0.0,
        "voluntary_excess": 0.0,
        "compulsory_excess": 0.0,
        "previous_claims_3y": 0.0,
        "policy_age_days": 0.0,
        "vehicle_age": 0.0,
        "driver_experience_ratio": 0.0,
        "ncd_ratio": 0.0,
        "vehicle_make_enc": 0.0,
        "fuel_type_enc": 0.0,
        "occupation_enc": 0.0,
        "region_enc": 0.0
      },
      "intercept": 9.967596757593236,
      "n_train": 20000,
      "n_test": 5000
    },
    "ebm": {
      "model": "EBM",
      "mae": 11131.778297959956,
      "rmse": 14787.148537325793,
      "mape_pct": 177.58336694602855,
      "n_train": 20000,
      "n_test": 5000,
      "top_features": {
        "previous_claims_3y": 3259.1140028713794,
        "policy_age_days": 2683.871584881652,
        "previous_claims_3y & policy_age_days": 1608.1250699587606,
        "region_enc": 221.31391899112393,
        "vehicle_make_enc": 173.4298978553976,
        "voluntary_excess & previous_claims_3y": 172.51716007254487,
        "annual_mileage": 171.50784229318242,
        "compulsory_excess": 165.02085743907992,
        "voluntary_excess": 163.32366251884218,
        "ncd_ratio": 152.51296273306403
      }
    }
  },
  "doc_classifier": {
    "labels": [
      "Policy Schedule",
      "Certificate of Insurance",
      "Claim Form",
      "Loss Adjuster Report",
      "Bordereaux \u2014 Premium",
      "Bordereaux \u2014 Claims",
      "Endorsement",
      "Renewal Notice",
      "Statement of Fact",
      "FNOL Report",
      "Subrogation Notice",
      "Policy Wording"
    ],
    "id2label": {
      "0": "Policy Schedule",
      "1": "Certificate of Insurance",
      "2": "Claim Form",
      "3": "Loss Adjuster Report",
      "4": "Bordereaux \u2014 Premium",
      "5": "Bordereaux \u2014 Claims",
      "6": "Endorsement",
      "7": "Renewal Notice",
      "8": "Statement of Fact",
      "9": "FNOL Report",
      "10": "Subrogation Notice",
      "11": "Policy Wording"
    },
    "results": {
      "eval_loss": 4.1706562114995904e-06,
      "eval_accuracy": 1.0,
      "eval_f1_macro": 1.0,
      "eval_f1_weighted": 1.0,
      "eval_runtime": 30.3435,
      "eval_samples_per_second": 32.956,
      "eval_steps_per_second": 2.076,
      "epoch": 5.0
    }
  },
  "ner": {
    "label_list": [
      "O",
      "B-CLAIM_NUMBER",
      "B-DATE",
      "B-INSURER",
      "B-LOB",
      "B-MGA",
      "B-MONEY",
      "B-ORG",
      "B-PERIL",
      "B-PERSON",
      "B-POLICY_NUMBER",
      "B-POSTCODE",
      "B-REGULATION",
      "B-SYNDICATE",
      "B-VEHICLE",
      "I-DATE",
      "I-INSURER",
      "I-LOB",
      "I-MGA",
      "I-ORG",
      "I-PERIL",
      "I-PERSON",
      "I-POSTCODE",
      "I-REGULATION",
      "I-SYNDICATE",
      "I-VEHICLE"
    ],
    "label2id": {
      "O": 0,
      "B-CLAIM_NUMBER": 1,
      "B-DATE": 2,
      "B-INSURER": 3,
      "B-LOB": 4,
      "B-MGA": 5,
      "B-MONEY": 6,
      "B-ORG": 7,
      "B-PERIL": 8,
      "B-PERSON": 9,
      "B-POLICY_NUMBER": 10,
      "B-POSTCODE": 11,
      "B-REGULATION": 12,
      "B-SYNDICATE": 13,
      "B-VEHICLE": 14,
      "I-DATE": 15,
      "I-INSURER": 16,
      "I-LOB": 17,
      "I-MGA": 18,
      "I-ORG": 19,
      "I-PERIL": 20,
      "I-PERSON": 21,
      "I-POSTCODE": 22,
      "I-REGULATION": 23,
      "I-SYNDICATE": 24,
      "I-VEHICLE": 25
    },
    "id2label": {
      "0": "O",
      "1": "B-CLAIM_NUMBER",
      "2": "B-DATE",
      "3": "B-INSURER",
      "4": "B-LOB",
      "5": "B-MGA",
      "6": "B-MONEY",
      "7": "B-ORG",
      "8": "B-PERIL",
      "9": "B-PERSON",
      "10": "B-POLICY_NUMBER",
      "11": "B-POSTCODE",
      "12": "B-REGULATION",
      "13": "B-SYNDICATE",
      "14": "B-VEHICLE",
      "15": "I-DATE",
      "16": "I-INSURER",
      "17": "I-LOB",
      "18": "I-MGA",
      "19": "I-ORG",
      "20": "I-PERIL",
      "21": "I-PERSON",
      "22": "I-POSTCODE",
      "23": "I-REGULATION",
      "24": "I-SYNDICATE",
      "25": "I-VEHICLE"
    },
    "results": {
      "eval_loss": 4.797985820914619e-05,
      "eval_f1": 1.0,
      "eval_precision": 1.0,
      "eval_recall": 1.0,
      "eval_runtime": 11.6416,
      "eval_samples_per_second": 68.719,
      "eval_steps_per_second": 2.147,
      "epoch": 8.0
    }
  }
}
evaluation/results/insurellm_eval.json ADDED
@@ -0,0 +1,87 @@
{
  "model": "models/insurellm-4b-realworld-merged",
  "domain_eval": [
    {
      "id": "fca_consumer_duty",
      "score": 0,
      "required_found": 0,
      "required_total": 5,
      "forbidden_found": 0,
      "latency_s": 69.76176810264587,
      "response_length": 111
    },
    {
      "id": "gdpr_data_protection",
      "score": 0,
      "required_found": 0,
      "required_total": 4,
      "forbidden_found": 0,
      "latency_s": 180.0589418411255,
      "response_length": 201
    },
    {
      "id": "claims_process",
      "score": 0.6,
      "required_found": 3,
      "required_total": 5,
      "forbidden_found": 0,
      "latency_s": 197.52413249015808,
      "response_length": 189
    },
    {
      "id": "fraud_indicators",
      "score": 0.25,
      "required_found": 1,
      "required_total": 4,
      "forbidden_found": 0,
      "latency_s": 196.2334017753601,
      "response_length": 181
    },
    {
      "id": "lloyds_market",
      "score": 0.2,
      "required_found": 1,
      "required_total": 5,
      "forbidden_found": 0,
      "latency_s": 121.65669202804565,
      "response_length": 146
    },
    {
      "id": "pricing_fairness",
      "score": 0.25,
      "required_found": 1,
      "required_total": 4,
      "forbidden_found": 0,
      "latency_s": 57.555458784103394,
      "response_length": 83
    },
    {
      "id": "subrogation",
      "score": 0.5,
      "required_found": 2,
      "required_total": 4,
      "forbidden_found": 0,
      "latency_s": 117.41489219665527,
      "response_length": 155
    },
    {
      "id": "renewal_transparency",
      "score": 0.2,
      "required_found": 1,
      "required_total": 5,
      "forbidden_found": 0,
      "latency_s": 161.38414025306702,
      "response_length": 181
    }
  ],
  "generation_metrics": {
    "rouge1": 0.3839015292749472,
    "rouge2": 0.10873793498858823,
    "rougeL": 0.19891142911031565
  },
  "summary": {
    "avg_domain_score": 0.25,
    "avg_latency_s": 137.6986784338951,
    "domain_pass_rate": 0.0
  }
}
evaluation/run_eval.py ADDED
@@ -0,0 +1,361 @@
"""
InsureOS — Comprehensive Evaluation Suite
Evaluates all trained models: InsureLLM (generative), FraudNet, PricingGLM, DocClassifier, NER.
"""

import os
import json
import argparse
import time
from pathlib import Path

import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Optional imports (gracefully degrade if not installed)
try:
    from rouge_score import rouge_scorer
    HAS_ROUGE = True
except ImportError:
    HAS_ROUGE = False

try:
    from bert_score import score as bert_score_fn
    HAS_BERTSCORE = True
except ImportError:
    HAS_BERTSCORE = False


# ── Defaults ──

INSURELLM_MODEL = "models/insurellm-4b-dpo-merged"
EVAL_DATA = "data/output/insurance_sft_10k.jsonl"
OUTPUT_DIR = "evaluation/results"
N_SAMPLES = 100  # number of samples for generative eval


# ── UK Insurance Domain-Specific Evaluation Prompts ──

DOMAIN_EVAL_PROMPTS = [
    {
        "id": "fca_consumer_duty",
        "prompt": "A customer's home insurance claim for escape of water has been open for 8 weeks with no update. They call asking what's happening. Draft a response.",
        "required_elements": ["apology", "timeline", "next steps", "complaint rights", "financial ombudsman"],
        "forbidden_elements": ["jargon", "indemnification", "proximate cause", "peril"],
    },
    {
        "id": "gdpr_data_protection",
        "prompt": "A third party calls asking about the status of a motor claim made against our policyholder. How should we respond?",
        "required_elements": ["data protection", "cannot share", "their own insurer", "GDPR"],
        "forbidden_elements": ["policy number", "policyholder address", "bank details"],
    },
    {
        "id": "claims_process",
        "prompt": "Explain the insurance claims process for a domestic property fire from FNOL through to settlement.",
        "required_elements": ["FNOL", "loss adjuster", "investigation", "settlement", "excess"],
        "forbidden_elements": [],
    },
    {
        "id": "fraud_indicators",
        "prompt": "What are the key red flags for insurance fraud in motor third-party claims?",
        "required_elements": ["exaggeration", "staged", "late reporting", "multiple claims"],
        "forbidden_elements": [],
    },
    {
        "id": "lloyds_market",
        "prompt": "Explain the role of an MGA in the Lloyd's market and how binding authority agreements work.",
        "required_elements": ["binding authority", "capacity provider", "syndicate", "delegated authority", "bordereaux"],
        "forbidden_elements": [],
    },
    {
        "id": "pricing_fairness",
        "prompt": "An insurer wants to use first names as a rating factor because it improves their model by 3%. Should they?",
        "required_elements": ["proxy discrimination", "protected characteristics", "Equality Act", "FCA"],
        "forbidden_elements": [],
    },
    {
        "id": "subrogation",
        "prompt": "Explain subrogation rights in UK insurance. When does an insurer pursue recovery?",
        "required_elements": ["recovery", "third party", "policyholder indemnified", "non-fault"],
        "forbidden_elements": [],
    },
    {
        "id": "renewal_transparency",
        "prompt": "A customer's premium increased by 25% at renewal. They want to know why. Draft an explanation.",
        "required_elements": ["transparency", "factors", "shop around", "Consumer Duty", "fair value"],
        "forbidden_elements": ["take it or leave it", "market rate"],
    },
]


def evaluate_insurellm(model_path: str, n_samples: int, output_dir: str) -> dict:
    """Evaluate the generative InsureLLM model."""
    print(f"\n{'='*60}")
    print(f" Evaluating InsureLLM: {model_path}")
    print(f"{'='*60}")

    # Load model
    print("Loading model...")
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
        attn_implementation="sdpa",
        dtype=torch.bfloat16,
    )
    model.eval()

    results = {
        "model": model_path,
        "domain_eval": [],
        "generation_metrics": {},
    }

    # ── 1. Domain-Specific Evaluation ──
    print("\n[1/3] Domain-specific evaluation...")
    for i, item in enumerate(DOMAIN_EVAL_PROMPTS):
        print(f" Prompt {i+1}/{len(DOMAIN_EVAL_PROMPTS)}: {item['id']}...", flush=True)
        messages = [
            {"role": "system", "content": "You are InsureLLM, a specialist UK insurance AI assistant. Answer directly without internal reasoning."},
            {"role": "user", "content": item["prompt"]},
        ]

        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        # Disable Qwen3 thinking mode by appending <think>\n</think>\n
        if "<|im_start|>assistant" in text:
            text = text + "<think>\n</think>\n"
        inputs = tokenizer(text, return_tensors="pt").to(model.device)

        start = time.time()
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                do_sample=False,
            )
        latency = time.time() - start

        response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

        # Check required elements
        response_lower = response.lower()
        found_required = [e for e in item["required_elements"] if e.lower() in response_lower]
        found_forbidden = [e for e in item["forbidden_elements"] if e.lower() in response_lower]

        score = len(found_required) / max(len(item["required_elements"]), 1)
        penalty = len(found_forbidden) * 0.15
        final_score = max(0, score - penalty)
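        # Worked example: 3 of 5 required elements found plus 1 forbidden term
        # gives score = 3/5 = 0.6, penalty = 0.15, final_score = 0.45.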

        eval_result = {
            "id": item["id"],
            "score": final_score,
            "required_found": len(found_required),
            "required_total": len(item["required_elements"]),
            "forbidden_found": len(found_forbidden),
            "latency_s": latency,
            "response_length": len(response.split()),
        }
        results["domain_eval"].append(eval_result)

        status = "✓" if final_score >= 0.7 else "△" if final_score >= 0.4 else "✗"
        print(f" {status} {item['id']}: {final_score:.2f} "
              f"({len(found_required)}/{len(item['required_elements'])} required, "
              f"{len(found_forbidden)} forbidden, {latency:.1f}s)")

    avg_domain = np.mean([r["score"] for r in results["domain_eval"]])
    avg_latency = np.mean([r["latency_s"] for r in results["domain_eval"]])
    print(f"\n Average domain score: {avg_domain:.3f}")
    print(f" Average latency: {avg_latency:.1f}s")

    # ── 2. ROUGE scores on held-out SFT data ──
    if HAS_ROUGE and os.path.exists(EVAL_DATA):
        print("\n[2/3] ROUGE evaluation on SFT test set...")
        scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

        eval_records = []
        with open(EVAL_DATA) as f:
            for line in f:
                eval_records.append(json.loads(line))

        # Use last N as eval
        eval_subset = eval_records[-min(n_samples, len(eval_records)):]

        rouge1_scores = []
        rouge2_scores = []
        rougeL_scores = []

        for rec in eval_subset:
            messages = rec["messages"]
            # Get reference (last assistant message)
            reference = messages[-1]["content"]
            prompt_messages = messages[:-1]

            text = tokenizer.apply_chat_template(prompt_messages, tokenize=False, add_generation_prompt=True)
            # Disable Qwen3 thinking mode
            if "<|im_start|>assistant" in text:
                text = text + "<think>\n</think>\n"
            inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).to(model.device)

            with torch.no_grad():
                outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

            generated = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

            scores = scorer.score(reference, generated)
            rouge1_scores.append(scores["rouge1"].fmeasure)
            rouge2_scores.append(scores["rouge2"].fmeasure)
            rougeL_scores.append(scores["rougeL"].fmeasure)

        results["generation_metrics"]["rouge1"] = float(np.mean(rouge1_scores))
        results["generation_metrics"]["rouge2"] = float(np.mean(rouge2_scores))
        results["generation_metrics"]["rougeL"] = float(np.mean(rougeL_scores))

        print(f" ROUGE-1: {results['generation_metrics']['rouge1']:.4f}")
        print(f" ROUGE-2: {results['generation_metrics']['rouge2']:.4f}")
        print(f" ROUGE-L: {results['generation_metrics']['rougeL']:.4f}")
    else:
        print("\n[2/3] Skipping ROUGE (rouge_score not installed or data not found)")

    # ── 3. Summary metrics ──
    print("\n[3/3] Computing summary...")
    results["summary"] = {
        "avg_domain_score": float(avg_domain),
        "avg_latency_s": float(avg_latency),
        "domain_pass_rate": float(np.mean([1 if r["score"] >= 0.7 else 0 for r in results["domain_eval"]])),
    }

    # Save
    os.makedirs(output_dir, exist_ok=True)
    outpath = os.path.join(output_dir, "insurellm_eval.json")
    with open(outpath, "w") as f:
        json.dump(results, f, indent=2)
    print(f"\n✓ InsureLLM eval results → {outpath}")

    return results


def evaluate_all(args):
    """Run evaluation for all available models."""
    print(f"{'='*60}")
    print(f" InsureOS — Full Evaluation Suite")
    print(f"{'='*60}")

    os.makedirs(args.output_dir, exist_ok=True)
    all_results = {}

    # 1. InsureLLM
    if os.path.exists(args.insurellm_model):
        all_results["insurellm"] = evaluate_insurellm(
            args.insurellm_model, args.n_samples, args.output_dir
        )
    else:
        print(f"\n⚠ InsureLLM not found at {args.insurellm_model}, skipping")

    # 2. FraudNet — just check if results exist from training
    fraud_results = Path("models/fraudnet/training_results.json")
    if fraud_results.exists():
        with open(fraud_results) as f:
            all_results["fraudnet"] = json.load(f)
        print(f"\n✓ FraudNet results loaded from training")
    else:
        print(f"\n⚠ FraudNet results not found, skipping")

    # 3. Pricing GLM
    pricing_results = Path("models/pricing-glm/training_results.json")
    if pricing_results.exists():
        with open(pricing_results) as f:
            all_results["pricing"] = json.load(f)
        print(f"✓ Pricing model results loaded from training")
    else:
        print(f"⚠ Pricing results not found, skipping")

    # 4. Doc Classifier
    doc_meta = Path("models/doc-classifier/training_meta.json")
    if doc_meta.exists():
        with open(doc_meta) as f:
            all_results["doc_classifier"] = json.load(f)
        print(f"✓ Doc classifier results loaded")
    else:
        print(f"⚠ Doc classifier results not found, skipping")

    # 5. NER
    ner_meta = Path("models/ner-model/training_meta.json")
    if ner_meta.exists():
        with open(ner_meta) as f:
            all_results["ner"] = json.load(f)
        print(f"✓ NER results loaded")
    else:
        print(f"⚠ NER results not found, skipping")

    # ── Summary report ──
    report_path = os.path.join(args.output_dir, "full_eval_report.json")
    with open(report_path, "w") as f:
        json.dump(all_results, f, indent=2, default=str)

    print(f"\n{'='*60}")
    print(f" EVALUATION SUMMARY")
    print(f"{'='*60}")

    if "insurellm" in all_results:
        s = all_results["insurellm"].get("summary", {})
        print(f"\n InsureLLM (Generative):")
        print(f"   Domain score: {s.get('avg_domain_score', 'N/A')}")
        print(f"   Pass rate: {s.get('domain_pass_rate', 'N/A')}")
        print(f"   Latency: {s.get('avg_latency_s', 'N/A')}s")

    if "fraudnet" in all_results:
        for r in all_results["fraudnet"]:
            if isinstance(r, dict):
                print(f"\n FraudNet ({r.get('lob', '?')}):")
                print(f"   AUC-ROC: {r.get('auc_roc', 'N/A')}")
                print(f"   Avg Precision: {r.get('avg_precision', 'N/A')}")

    if "pricing" in all_results:
        for model_type in ["glm", "ebm"]:
            if model_type in all_results["pricing"]:
                m = all_results["pricing"][model_type]
                print(f"\n Pricing {model_type.upper()}:")
                print(f"   MAE: £{m.get('mae', 'N/A')}")
                print(f"   RMSE: £{m.get('rmse', 'N/A')}")

    if "doc_classifier" in all_results:
        r = all_results["doc_classifier"].get("results", {})
        print(f"\n Document Classifier:")
        print(f"   Accuracy: {r.get('eval_accuracy', 'N/A')}")
        print(f"   F1 (macro): {r.get('eval_f1_macro', 'N/A')}")

    if "ner" in all_results:
        r = all_results["ner"].get("results", {})
        print(f"\n NER Model:")
        print(f"   F1: {r.get('eval_f1', 'N/A')}")
        print(f"   Precision: {r.get('eval_precision', 'N/A')}")
        print(f"   Recall: {r.get('eval_recall', 'N/A')}")

    print(f"\n Full report → {report_path}")


def main():
    parser = argparse.ArgumentParser(description="InsureOS evaluation suite")
    parser.add_argument("--insurellm-model", default=INSURELLM_MODEL)
    parser.add_argument("--n-samples", type=int, default=N_SAMPLES)
    parser.add_argument("--output-dir", default=OUTPUT_DIR)
    args = parser.parse_args()

    evaluate_all(args)


if __name__ == "__main__":
    main()
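
# Example invocation (paths are the repo defaults; adjust to your checkout):
#   python evaluation/run_eval.py --insurellm-model models/insurellm-4b-realworld-merged --n-samples 50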
fraud_model.py ADDED
@@ -0,0 +1,320 @@
"""
InsureOS — Fraud Detection Model
Hybrid approach: XGBoost + Isolation Forest ensemble on tabular claims data.
Trains separate models per LoB (Motor, Property, Liability).
"""

import os
import json
import argparse
import pickle
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import (
    classification_report, roc_auc_score, precision_recall_curve,
    average_precision_score, confusion_matrix,
)
from sklearn.ensemble import IsolationForest
import xgboost as xgb


# ── Defaults ──

DATA_DIR = "data/output"
OUTPUT_DIR = "models/fraudnet"
TEST_SIZE = 0.2
N_FOLDS = 5
RANDOM_STATE = 42

# XGBoost hyperparams (tuned for imbalanced fraud data)
XGB_PARAMS = {
    "objective": "binary:logistic",
    "eval_metric": "aucpr",
    "max_depth": 6,
    "learning_rate": 0.05,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "min_child_weight": 5,
    "gamma": 1,
    "reg_alpha": 0.1,
    "reg_lambda": 1.0,
    "tree_method": "hist",
    "device": "cuda",
    "n_estimators": 500,
    "early_stopping_rounds": 30,
}
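
# Note: eval_metric "aucpr" (area under the precision-recall curve) is a more
# informative early-stopping signal than accuracy when fraud is only ~8% of claims.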

# Feature groups per LoB
MOTOR_FEATURES = [
    "driver_age", "years_driving", "years_ncd", "vehicle_year", "vehicle_value",
    "annual_mileage", "premium", "voluntary_excess", "compulsory_excess",
    "reserve_amount", "claim_amount", "recovery_amount",
    "previous_claims_3y", "days_to_report", "policy_age_days",
    "witnesses", "dashcam", "police_report",
]

PROPERTY_FEATURES = [
    "property_age_years", "rebuild_value", "contents_value",
    "premium", "voluntary_excess", "compulsory_excess",
    "reserve_amount", "claim_amount",
    "previous_claims_3y", "days_to_report", "policy_age_days",
    "has_cctv", "loss_adjuster_appointed", "unoccupied_30_days",
    "alarm_installed", "locks_bs3621",
]

LIABILITY_FEATURES = [
    "claimant_age", "reserve_amount", "claim_amount",
    "solicitor_costs", "medical_costs",
    "previous_claims_3y", "days_to_report",
    "solicitor_involved", "independent_witness",
    "medical_evidence_delay_days", "cctv_available",
]


def load_csv(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    # Convert booleans
    bool_cols = df.select_dtypes(include=["object"]).columns
    for col in bool_cols:
        if set(df[col].dropna().unique()).issubset({"True", "False", True, False}):
            df[col] = df[col].map({"True": 1, "False": 0, True: 1, False: 0}).fillna(0).astype(int)
    return df


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add derived fraud-signal features."""
    df = df.copy()

    # Claim-to-reserve ratio (key fraud signal)
    if "reserve_amount" in df.columns and "claim_amount" in df.columns:
        df["claim_reserve_ratio"] = df["claim_amount"] / df["reserve_amount"].clip(lower=1)

    # Claim-to-premium ratio
    if "premium" in df.columns and "claim_amount" in df.columns:
        df["claim_premium_ratio"] = df["claim_amount"] / df["premium"].clip(lower=1)

    # Policy age bucket (new policies = higher fraud risk)
    if "policy_age_days" in df.columns:
        df["new_policy"] = (df["policy_age_days"] < 90).astype(int)

    # Late reporting flag
    if "days_to_report" in df.columns:
        df["late_report"] = (df["days_to_report"] > 14).astype(int)

    # Vehicle age
    if "vehicle_year" in df.columns:
        df["vehicle_age"] = 2025 - df["vehicle_year"]

    return df


def train_xgb_model(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    X_test: pd.DataFrame,
    y_test: pd.Series,
    feature_names: list,
    lob: str,
    output_dir: str,
) -> dict:
    """Train XGBoost model for a specific LoB."""
    # Handle class imbalance via scale_pos_weight
    n_neg = (y_train == 0).sum()
    n_pos = (y_train == 1).sum()
    scale_pos_weight = n_neg / max(n_pos, 1)
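    # Worked example: at the 8% fraud rate here, 20,000 training rows give
    # 18,400 negatives and 1,600 positives, so scale_pos_weight = 11.5.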

    params = {**XGB_PARAMS, "scale_pos_weight": scale_pos_weight}
    n_estimators = params.pop("n_estimators")
    early_stopping = params.pop("early_stopping_rounds")

    model = xgb.XGBClassifier(
        n_estimators=n_estimators,
        early_stopping_rounds=early_stopping,
        **params,
    )

    # Use all available features from the feature list that exist in X
    avail_features = [f for f in feature_names if f in X_train.columns]
    X_tr = X_train[avail_features].fillna(0)
    X_te = X_test[avail_features].fillna(0)

    model.fit(
        X_tr, y_train,
        eval_set=[(X_te, y_test)],
        verbose=50,
    )

    # Predictions
    y_pred_proba = model.predict_proba(X_te)[:, 1]
    y_pred = (y_pred_proba >= 0.5).astype(int)

    # Metrics
    auc_roc = roc_auc_score(y_test, y_pred_proba)
    avg_prec = average_precision_score(y_test, y_pred_proba)
    report = classification_report(y_test, y_pred, output_dict=True)

    results = {
        "lob": lob,
        "auc_roc": auc_roc,
        "avg_precision": avg_prec,
        "precision_fraud": report.get("1", {}).get("precision", 0),
        "recall_fraud": report.get("1", {}).get("recall", 0),
        "f1_fraud": report.get("1", {}).get("f1-score", 0),
        "n_train": len(y_train),
        "n_test": len(y_test),
        "n_fraud_train": int(y_train.sum()),
        "fraud_rate": float(y_train.mean()),
        "features_used": avail_features,
    }

    # Feature importance
    importance = dict(zip(avail_features, model.feature_importances_))
    results["top_features"] = dict(sorted(importance.items(), key=lambda x: x[1], reverse=True)[:10])

    # Save
    os.makedirs(output_dir, exist_ok=True)
    model_path = os.path.join(output_dir, f"xgb_{lob.lower()}.json")
    model.save_model(model_path)
    print(f"\n ✓ {lob} XGBoost saved → {model_path}")
    print(f"   AUC-ROC: {auc_roc:.4f}, Avg Precision: {avg_prec:.4f}")
    print(f"   Fraud precision: {results['precision_fraud']:.3f}, recall: {results['recall_fraud']:.3f}")

    return results


def train_isolation_forest(
    X_train: pd.DataFrame,
    feature_names: list,
    lob: str,
    output_dir: str,
) -> str:
    """Train Isolation Forest for anomaly scoring (unsupervised complement)."""
    avail_features = [f for f in feature_names if f in X_train.columns]
    X = X_train[avail_features].fillna(0)

    iforest = IsolationForest(
        n_estimators=200,
        contamination=0.08,
        max_samples="auto",
        random_state=RANDOM_STATE,
        n_jobs=-1,
    )
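    # contamination=0.08 mirrors the known fraud rate in the synthetic data, so
    # roughly the most anomalous 8% of claims are scored as outliers.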
    iforest.fit(X)

    model_path = os.path.join(output_dir, f"iforest_{lob.lower()}.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(iforest, f)

    print(f" ✓ {lob} Isolation Forest saved → {model_path}")
    return model_path


def main():
    parser = argparse.ArgumentParser(description="Train FraudNet models")
    parser.add_argument("--data-dir", default=DATA_DIR)
    parser.add_argument("--output-dir", default=OUTPUT_DIR)
    args = parser.parse_args()

    print(f"{'='*60}")
    print(f" InsureOS — FraudNet Training")
    print(f" Data: {args.data_dir}")
    print(f"{'='*60}\n")

    os.makedirs(args.output_dir, exist_ok=True)
    all_results = []

    # ── Motor ──
    motor_files = list(Path(args.data_dir).glob("claims_motor_*.csv"))
    if motor_files:
        print("=" * 40 + " MOTOR " + "=" * 40)
        df = load_csv(str(motor_files[0]))
        df = engineer_features(df)
        features = MOTOR_FEATURES + ["claim_reserve_ratio", "claim_premium_ratio", "new_policy", "late_report", "vehicle_age"]

        X = df.drop(columns=["is_fraud", "claim_id", "lob", "insurer", "region",
                             "claim_type", "claim_status", "vehicle_make", "vehicle_model",
                             "fuel_type", "occupation", "driver_gender",
                             "postcode_prefix", "inception_date", "loss_date",
                             "report_date", "settlement_date", "time_of_loss"],
                    errors="ignore")
        y = df["is_fraud"].astype(int)

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=TEST_SIZE, stratify=y, random_state=RANDOM_STATE
        )

        result = train_xgb_model(X_train, y_train, X_test, y_test, features, "Motor", args.output_dir)
        all_results.append(result)
        train_isolation_forest(X_train, features, "Motor", args.output_dir)

    # ── Property ──
    prop_files = list(Path(args.data_dir).glob("claims_property_*.csv"))
    if prop_files:
        print("\n" + "=" * 38 + " PROPERTY " + "=" * 38)
        df = load_csv(str(prop_files[0]))
        df = engineer_features(df)
        features = PROPERTY_FEATURES + ["claim_reserve_ratio", "claim_premium_ratio", "new_policy", "late_report"]

        X = df.drop(columns=["is_fraud", "claim_id", "lob", "insurer", "region",
                             "claim_type", "claim_status", "property_type",
                             "heating_type", "flood_risk_zone", "subsidence_history",
                             "postcode_prefix", "inception_date", "loss_date",
                             "report_date", "settlement_date"],
                    errors="ignore")
        y = df["is_fraud"].astype(int)

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=TEST_SIZE, stratify=y, random_state=RANDOM_STATE
        )

        result = train_xgb_model(X_train, y_train, X_test, y_test, features, "Property", args.output_dir)
        all_results.append(result)
        train_isolation_forest(X_train, features, "Property", args.output_dir)

    # ── Liability ──
    liab_files = list(Path(args.data_dir).glob("claims_liability_*.csv"))
    if liab_files:
        print("\n" + "=" * 37 + " LIABILITY " + "=" * 37)
        df = load_csv(str(liab_files[0]))
        df = engineer_features(df)
        features = LIABILITY_FEATURES + ["claim_reserve_ratio", "late_report"]

        X = df.drop(columns=["is_fraud", "claim_id", "lob", "insurer", "region",
                             "claim_type", "claim_status", "claimant_gender",
                             "injury_type", "injury_severity",
                             "postcode_prefix", "inception_date", "loss_date",
                             "report_date", "settlement_date"],
                    errors="ignore")
        y = df["is_fraud"].astype(int)

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=TEST_SIZE, stratify=y, random_state=RANDOM_STATE
        )

        result = train_xgb_model(X_train, y_train, X_test, y_test, features, "Liability", args.output_dir)
        all_results.append(result)
        train_isolation_forest(X_train, features, "Liability", args.output_dir)

    # ── Save results summary ──
    summary_path = os.path.join(args.output_dir, "training_results.json")
    with open(summary_path, "w") as f:
        json.dump(all_results, f, indent=2, default=str)

    print(f"\n{'='*60}")
    print(f" ✓ FraudNet training complete!")
    print(f"{'='*60}")
    for r in all_results:
        print(f"\n {r['lob']}:")
        print(f"   AUC-ROC: {r['auc_roc']:.4f}")
        print(f"   Avg Precision: {r['avg_precision']:.4f}")
        print(f"   Fraud F1: {r['f1_fraud']:.3f}")
        print(f"   Top features: {list(r['top_features'].keys())[:5]}")
    print(f"\n Results → {summary_path}")


if __name__ == "__main__":
    main()
ner_model.py ADDED
@@ -0,0 +1,254 @@
"""
InsureOS — Insurance NER Model Training
Fine-tunes ModernBERT (or fallback BERT-base) for token-level Named Entity Recognition
with 14 insurance-specific entity types in IOB2 format.
"""

import os
import json
import argparse
from pathlib import Path

import torch
import numpy as np
from datasets import Dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification,
)
from seqeval.metrics import (
    classification_report as seq_classification_report,
    f1_score as seq_f1_score,
    precision_score as seq_precision_score,
    recall_score as seq_recall_score,
)


# ── Defaults ──

MODEL_NAME = "answerdotai/ModernBERT-base"
FALLBACK_MODEL = "google-bert/bert-base-uncased"
DATA_PATH = "data/output/insurance_ner_8k.jsonl"
OUTPUT_DIR = "models/ner-model"
MAX_LEN = 256
EPOCHS = 8
BATCH_SIZE = 16
LR = 3e-5
WARMUP_RATIO = 0.1
EVAL_SPLIT = 0.1


def load_data(path: str):
    """Load NER JSONL and build label set."""
    records = []
    all_tags = set()
    with open(path) as f:
        for line in f:
            obj = json.loads(line)
            records.append(obj)
            all_tags.update(obj["ner_tags"])

    # Build label list: O first, then B-/I- sorted
    entity_tags = sorted(t for t in all_tags if t != "O")
    label_list = ["O"] + entity_tags
    label2id = {l: i for i, l in enumerate(label_list)}
    id2label = {i: l for i, l in enumerate(label_list)}

    return records, label_list, label2id, id2label


def tokenize_and_align(examples, tokenizer, label2id, max_len):
    """Tokenize and align NER labels with subword tokens."""
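    # Worked example (made-up tokens): ["Aviva", "paid", "PL-123456"] tagged
    # [B-INSURER, O, B-POLICY_NUMBER]; if "PL-123456" splits into three subwords,
    # its aligned labels become [B-POLICY_NUMBER, I-POLICY_NUMBER, I-POLICY_NUMBER],
    # while special tokens get -100 so the loss ignores them.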
    tokenized = tokenizer(
        examples["tokens"],
        is_split_into_words=True,
        truncation=True,
        max_length=max_len,
        padding="max_length",
    )

    aligned_labels = []
    for i, labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)  # special tokens
            elif word_idx != previous_word_idx:
                label_ids.append(label2id.get(labels[word_idx], 0))
            else:
                # For subword tokens, use I- tag if the original is B-
                orig_label = labels[word_idx]
                if orig_label.startswith("B-"):
                    label_ids.append(label2id.get(orig_label.replace("B-", "I-"), 0))
                else:
                    label_ids.append(label2id.get(orig_label, 0))
            previous_word_idx = word_idx
        aligned_labels.append(label_ids)

    tokenized["labels"] = aligned_labels
    return tokenized


def main():
    parser = argparse.ArgumentParser(description="Train insurance NER model")
    parser.add_argument("--model-name", default=MODEL_NAME)
    parser.add_argument("--data-path", default=DATA_PATH)
    parser.add_argument("--output-dir", default=OUTPUT_DIR)
    parser.add_argument("--epochs", type=int, default=EPOCHS)
    parser.add_argument("--batch-size", type=int, default=BATCH_SIZE)
    parser.add_argument("--lr", type=float, default=LR)
    args = parser.parse_args()

    # Load data
    print(f"{'='*60}")
    print(f" InsureOS — NER Model Training")
    print(f"{'='*60}\n")

    print("[1/5] Loading NER data...")
    records, label_list, label2id, id2label = load_data(args.data_path)
    print(f" Examples: {len(records)}")
    print(f" Labels: {len(label_list)} ({len(label_list)-1} entity types)")

    # Load tokenizer & model
    print("[2/5] Loading model...")
    try:
        tokenizer = AutoTokenizer.from_pretrained(args.model_name)
        model = AutoModelForTokenClassification.from_pretrained(
            args.model_name,
            num_labels=len(label_list),
            id2label=id2label,
            label2id=label2id,
        )
    except Exception:
        print(f" ⚠ Falling back to {FALLBACK_MODEL}")
        tokenizer = AutoTokenizer.from_pretrained(FALLBACK_MODEL)
        model = AutoModelForTokenClassification.from_pretrained(
            FALLBACK_MODEL,
            num_labels=len(label_list),
            id2label=id2label,
            label2id=label2id,
        )

    # Prepare dataset
    print("[3/5] Tokenizing and aligning labels...")
    ds = Dataset.from_list(records)
    ds = ds.map(
        lambda ex: tokenize_and_align(ex, tokenizer, label2id, MAX_LEN),
        batched=True,
        remove_columns=["text"],
    )

    split = ds.train_test_split(test_size=EVAL_SPLIT, seed=42)
    train_ds = split["train"]
    eval_ds = split["test"]
    print(f" Train: {len(train_ds)}, Eval: {len(eval_ds)}")

    # Metrics
    def compute_metrics(pred):
        preds = np.argmax(pred.predictions, axis=-1)
        labels = pred.label_ids

        true_labels = []
        true_preds = []

        for pred_seq, label_seq in zip(preds, labels):
            seq_labels = []
            seq_preds = []
            for p, l in zip(pred_seq, label_seq):
                if l != -100:
                    seq_labels.append(id2label[l])
                    seq_preds.append(id2label[p])
            true_labels.append(seq_labels)
            true_preds.append(seq_preds)

        return {
            "f1": seq_f1_score(true_labels, true_preds),
            "precision": seq_precision_score(true_labels, true_preds),
            "recall": seq_recall_score(true_labels, true_preds),
        }
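    # seqeval scores at the entity level (span and type must both match exactly),
    # which is stricter than per-token accuracy.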

    # Train
    print("[4/5] Training...")
    data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

    training_args = TrainingArguments(
        output_dir=args.output_dir,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.batch_size,
        per_device_eval_batch_size=args.batch_size * 2,
        learning_rate=args.lr,
        lr_scheduler_type="cosine",
        warmup_ratio=WARMUP_RATIO,
        weight_decay=0.01,
        eval_strategy="epoch",
        save_strategy="epoch",
        save_total_limit=2,
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        greater_is_better=True,
        fp16=torch.cuda.is_available(),
        report_to="none",
        logging_steps=50,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    trainer.train()

    # Evaluate
    print("[5/5] Final evaluation...")
    results = trainer.evaluate()
    print(f" F1: {results['eval_f1']:.4f}")
    print(f" Precision: {results['eval_precision']:.4f}")
    print(f" Recall: {results['eval_recall']:.4f}")

    # Detailed per-entity report
    preds_output = trainer.predict(eval_ds)
    preds = np.argmax(preds_output.predictions, axis=-1)
    labels = preds_output.label_ids

    true_labels = []
    true_preds = []
    for pred_seq, label_seq in zip(preds, labels):
        seq_labels = []
        seq_preds = []
        for p, l in zip(pred_seq, label_seq):
            if l != -100:
                seq_labels.append(id2label[l])
                seq_preds.append(id2label[p])
        true_labels.append(seq_labels)
        true_preds.append(seq_preds)

    report = seq_classification_report(true_labels, true_preds)
    print(f"\n{report}")

    # Save
    trainer.save_model(args.output_dir)
    tokenizer.save_pretrained(args.output_dir)

    meta = {
        "label_list": label_list,
        "label2id": label2id,
        "id2label": id2label,
        "results": {k: float(v) for k, v in results.items()},
    }
    with open(os.path.join(args.output_dir, "training_meta.json"), "w") as f:
        json.dump(meta, f, indent=2)

    print(f"\n✓ NER model saved → {args.output_dir}")


if __name__ == "__main__":
    main()
pricing_glm.py ADDED
@@ -0,0 +1,262 @@
"""
InsureOS — Insurance Pricing GLM + EBM
Trains a Tweedie GLM for pure premium estimation and an Explainable Boosting Machine (EBM)
for interpretable rating factor analysis. Uses motor claims tabular data.
"""

import os
import json
import argparse
import pickle
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import LabelEncoder

# Tweedie GLM
from sklearn.linear_model import TweedieRegressor

# Explainable Boosting Machine (glass-box model)
from interpret.glassbox import ExplainableBoostingRegressor
from interpret import show


# ── Defaults ──

DATA_DIR = "data/output"
OUTPUT_DIR = "models/pricing-glm"
TEST_SIZE = 0.2
RANDOM_STATE = 42

# GLM hyperparams
TWEEDIE_POWER = 1.5  # 1 < p < 2 → Compound Poisson-Gamma (standard for insurance)
TWEEDIE_ALPHA = 1.0  # regularization strength
TWEEDIE_MAX_ITER = 300

# Features for pricing
PRICING_FEATURES = [
    "driver_age", "years_driving", "years_ncd", "vehicle_year", "vehicle_value",
    "annual_mileage", "voluntary_excess", "compulsory_excess",
    "previous_claims_3y", "policy_age_days",
]

CAT_FEATURES = [
    "vehicle_make", "fuel_type", "occupation", "region",
]


def load_and_prepare(data_dir: str) -> tuple[pd.DataFrame, dict]:
    """Load motor claims CSV and prepare for pricing model."""
    motor_files = list(Path(data_dir).glob("claims_motor_*.csv"))
    if not motor_files:
        raise FileNotFoundError(f"No motor claims CSV found in {data_dir}")

    df = pd.read_csv(str(motor_files[0]))

    # Target: claim_amount (pure premium proxy)
    # Only use claims with positive amounts
    df = df[df["claim_amount"] > 0].copy()

    # Encode categoricals
    encoders = {}
    for col in CAT_FEATURES:
        if col in df.columns:
            le = LabelEncoder()
            df[col + "_enc"] = le.fit_transform(df[col].fillna("Unknown"))
            encoders[col] = le

    # Derived features
    df["vehicle_age"] = 2025 - df["vehicle_year"]
    df["driver_experience_ratio"] = df["years_driving"] / df["driver_age"].clip(lower=18)
    df["ncd_ratio"] = df["years_ncd"] / df["years_driving"].clip(lower=1)

    return df, encoders


def train_tweedie_glm(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    X_test: pd.DataFrame,
    y_test: pd.Series,
    feature_names: list,
    output_dir: str,
) -> dict:
    """Train Tweedie GLM for pure premium."""
    print("\n[GLM] Training Tweedie Regressor...")

    glm = TweedieRegressor(
        power=TWEEDIE_POWER,
        alpha=TWEEDIE_ALPHA,
        max_iter=TWEEDIE_MAX_ITER,
        link="log",
    )
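    # With the log link, predicted pure premium = exp(intercept + X @ coef); an
    # intercept of 9.97 alone (all coefficients zero) implies a flat prediction
    # of exp(9.97) ≈ £21,400 per claim.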

    X_tr = X_train[feature_names].fillna(0)
    X_te = X_test[feature_names].fillna(0)

    glm.fit(X_tr, y_train)

    # Predictions (clipped to positive)
    y_pred = np.clip(glm.predict(X_te), 0, None)

    # Metrics
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mape = np.mean(np.abs((y_test - y_pred) / y_test.clip(lower=1))) * 100

    # Coefficients
    coefs = dict(zip(feature_names, glm.coef_))

    results = {
        "model": "TweedieGLM",
        "tweedie_power": TWEEDIE_POWER,
        "mae": mae,
        "rmse": rmse,
        "mape_pct": mape,
        "coefficients": coefs,
        "intercept": float(glm.intercept_),
        "n_train": len(y_train),
        "n_test": len(y_test),
    }

    # Save
    model_path = os.path.join(output_dir, "tweedie_glm.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(glm, f)

    print(f" ✓ Tweedie GLM saved → {model_path}")
    print(f"   MAE: £{mae:,.2f}")
    print(f"   RMSE: £{rmse:,.2f}")
    print(f"   MAPE: {mape:.1f}%")
    print(f"   Top coefficients:")
    for feat, coef in sorted(coefs.items(), key=lambda x: abs(x[1]), reverse=True)[:5]:
        print(f"     {feat}: {coef:+.4f}")

    return results


def train_ebm(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    X_test: pd.DataFrame,
    y_test: pd.Series,
    feature_names: list,
    output_dir: str,
) -> dict:
    """Train Explainable Boosting Machine for interpretable pricing."""
    print("\n[EBM] Training Explainable Boosting Machine...")

    ebm = ExplainableBoostingRegressor(
        max_bins=256,
        outer_bags=8,
        inner_bags=4,
        learning_rate=0.01,
        max_leaves=3,
        min_samples_leaf=10,
        interactions=10,  # allow up to 10 pairwise interactions
        random_state=RANDOM_STATE,
    )
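    # EBM is a glass-box GA2M: one additive shape function per feature plus up to
    # `interactions` pairwise terms, which is why names such as
    # "previous_claims_3y & policy_age_days" can appear among the importances.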

    X_tr = X_train[feature_names].fillna(0)
    X_te = X_test[feature_names].fillna(0)

    ebm.fit(X_tr, y_train)

    y_pred = np.clip(ebm.predict(X_te), 0, None)

    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mape = np.mean(np.abs((y_test - y_pred) / y_test.clip(lower=1))) * 100

    # Feature importances from EBM
    importance = dict(zip(
        ebm.term_names_,
        ebm.term_importances(),
    ))

    results = {
        "model": "EBM",
        "mae": mae,
        "rmse": rmse,
        "mape_pct": mape,
        "n_train": len(y_train),
        "n_test": len(y_test),
        "top_features": dict(sorted(importance.items(), key=lambda x: x[1], reverse=True)[:10]),
    }

    # Save
    model_path = os.path.join(output_dir, "pricing_ebm.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(ebm, f)

    print(f" ✓ EBM saved → {model_path}")
    print(f"   MAE: £{mae:,.2f}")
    print(f"   RMSE: £{rmse:,.2f}")
    print(f"   MAPE: {mape:.1f}%")
    print(f"   Top features:")
    for feat, imp in sorted(importance.items(), key=lambda x: x[1], reverse=True)[:5]:
        print(f"     {feat}: {imp:.4f}")

    return results


def main():
    parser = argparse.ArgumentParser(description="Train pricing models")
    parser.add_argument("--data-dir", default=DATA_DIR)
    parser.add_argument("--output-dir", default=OUTPUT_DIR)
    args = parser.parse_args()

    print(f"{'='*60}")
    print(f" InsureOS — Pricing Model Training")
    print(f" Data: {args.data_dir}")
    print(f"{'='*60}")

    os.makedirs(args.output_dir, exist_ok=True)

    # Load data
    print("\nLoading motor claims data...")
    df, encoders = load_and_prepare(args.data_dir)
    print(f" Records: {len(df)}")
    print(f" Mean claim amount: £{df['claim_amount'].mean():,.2f}")
    print(f" Median claim amount: £{df['claim_amount'].median():,.2f}")

    # Feature set
    numeric_features = PRICING_FEATURES + ["vehicle_age", "driver_experience_ratio", "ncd_ratio"]
    cat_enc_features = [c + "_enc" for c in CAT_FEATURES if c + "_enc" in df.columns]
    all_features = numeric_features + cat_enc_features

    y = df["claim_amount"]
    X = df[all_features]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE
    )

    # Train both models
    glm_results = train_tweedie_glm(X_train, y_train, X_test, y_test, all_features, args.output_dir)
    ebm_results = train_ebm(X_train, y_train, X_test, y_test, all_features, args.output_dir)

    # Save encoders
    encoder_path = os.path.join(args.output_dir, "label_encoders.pkl")
    with open(encoder_path, "wb") as f:
        pickle.dump(encoders, f)

    # Save results
    summary = {"glm": glm_results, "ebm": ebm_results}
    summary_path = os.path.join(args.output_dir, "training_results.json")
    with open(summary_path, "w") as f:
        json.dump(summary, f, indent=2, default=str)

    print(f"\n{'='*60}")
    print(f" ✓ Pricing model training complete!")
    print(f"   Tweedie GLM MAE: £{glm_results['mae']:,.2f}")
    print(f"   EBM MAE: £{ebm_results['mae']:,.2f}")
    print(f"   Results → {summary_path}")
    print(f"{'='*60}")


if __name__ == "__main__":
    main()
push_to_hf.py ADDED
@@ -0,0 +1,153 @@
#!/usr/bin/env python3
"""Push all INSUREOS models and code to HuggingFace Hub.
Created by Bytical AI.
"""
import os
import sys
from huggingface_hub import HfApi, create_repo

TOKEN = os.environ.get("HF_TOKEN")
if not TOKEN:
    print("ERROR: Set HF_TOKEN environment variable")
    sys.exit(1)
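
# Usage (assumed invocation, from the repo root):
#   HF_TOKEN=hf_xxx python push_to_hf.py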
+
14
+ api = HfApi(token=TOKEN)
15
+ ORG = "piyushptiwari"
16
+
17
+ # Ensure the organization exists (it may just be the user namespace)
18
+ # We'll use "bytical" as org β€” if it doesn't exist, we fall back to user
19
+ try:
20
+ api.whoami()
21
+ print("Authenticated successfully")
22
+ except Exception as e:
23
+ print(f"Auth error: {e}")
24
+ sys.exit(1)
25
+
26
+
27
+ def ensure_repo(repo_id: str, repo_type: str = "model"):
28
+ """Create repo if it doesn't exist."""
29
+ try:
30
+ create_repo(
31
+ repo_id=repo_id,
32
+ repo_type=repo_type,
33
+ exist_ok=True,
34
+ token=TOKEN,
35
+ )
36
+ print(f" Repo ready: {repo_id} ({repo_type})")
37
+ except Exception as e:
38
+ print(f" Repo creation note for {repo_id}: {e}")
39
+
40
+
41
+ def upload_folder(repo_id: str, local_dir: str, repo_type: str = "model",
42
+ ignore_patterns=None):
43
+ """Upload a local folder to HF Hub."""
44
+ if ignore_patterns is None:
45
+ ignore_patterns = ["__pycache__", "*.pyc", ".DS_Store"]
46
+
47
+ print(f" Uploading {local_dir} -> {repo_id}...")
48
+ api.upload_folder(
49
+ repo_id=repo_id,
50
+ folder_path=local_dir,
51
+ repo_type=repo_type,
52
+ ignore_patterns=ignore_patterns,
53
+ )
54
+ print(f" Done: {repo_id}")
55
+
56
+
57
+ if __name__ == "__main__":
58
+ BASE = "/home/piyush/Desktop/Insurance/insureos-models"
59
+
60
+ # =========================================================
61
+ # 1. InsureLLM-4B (main LLM β€” the best merged model only)
62
+ # =========================================================
63
+ print("\n[1/7] InsureLLM-4B")
64
+ repo = f"{ORG}/InsureLLM-4B"
65
+ ensure_repo(repo)
66
+ upload_folder(repo, f"{BASE}/models/insurellm-4b-realworld-merged")
67
+
68
+ # =========================================================
69
+ # 2. InsureDocClassifier
70
+ # =========================================================
71
+ print("\n[2/7] InsureDocClassifier")
72
+ repo = f"{ORG}/InsureDocClassifier"
73
+ ensure_repo(repo)
74
+ upload_folder(
75
+ repo, f"{BASE}/models/doc-classifier",
76
+ ignore_patterns=["__pycache__", "*.pyc", ".DS_Store",
77
+ "checkpoint-*"]
78
+ )
79
+
80
+ # =========================================================
81
+ # 3. InsureNER
82
+ # =========================================================
83
+ print("\n[3/7] InsureNER")
84
+ repo = f"{ORG}/InsureNER"
85
+ ensure_repo(repo)
86
+ upload_folder(
87
+ repo, f"{BASE}/models/ner-model",
88
+ ignore_patterns=["__pycache__", "*.pyc", ".DS_Store",
89
+ "checkpoint-*"]
90
+ )
91
+
92
+ # =========================================================
93
+ # 4. InsureFraudNet
94
+ # =========================================================
95
+ print("\n[4/7] InsureFraudNet")
96
+ repo = f"{ORG}/InsureFraudNet"
97
+ ensure_repo(repo)
98
+ upload_folder(repo, f"{BASE}/models/fraudnet")
99
+
100
+ # =========================================================
101
+ # 5. InsurePricing
102
+ # =========================================================
103
+ print("\n[5/7] InsurePricing")
104
+ repo = f"{ORG}/InsurePricing"
105
+ ensure_repo(repo)
106
+ upload_folder(repo, f"{BASE}/models/pricing-glm")
107
+
108
+ # =========================================================
109
+ # 6. Training Code + Search Engine (as a regular repo)
110
+ # =========================================================
111
+ print("\n[6/7] insureos-models (code repo)")
112
+ repo = f"{ORG}/insureos-models"
113
+ ensure_repo(repo, repo_type="model")
114
+ # We upload code only β€” no model weights, no raw data, no personal files
115
+ upload_folder(
116
+ repo, BASE, repo_type="model",
117
+ ignore_patterns=[
118
+ "__pycache__", "*.pyc", ".DS_Store",
119
+ "models/*", # model weights are in separate repos
120
+ "*.pkl", "*.bin", # no binary artifacts in code repo
121
+ "raw/*", # raw scraped data
122
+ "processed/*",
123
+ "search/index_data/*", # search index binaries
124
+ "data/output/*", # generated training data
125
+ "*.jsonl", # training data files
126
+ ".venv/*",
127
+ ]
128
+ )
129
+
130
+ # =========================================================
131
+ # 7. Training Data (as a dataset)
132
+ # =========================================================
133
+ print("\n[7/7] insureos-training-data (dataset)")
134
+ repo = f"{ORG}/insureos-training-data"
135
+ ensure_repo(repo, repo_type="dataset")
136
+ upload_folder(
137
+ repo, f"{BASE}/data/output", repo_type="dataset",
138
+ ignore_patterns=["__pycache__", "*.pyc"]
139
+ )
140
+
141
+ print("\n" + "=" * 60)
142
+ print("ALL UPLOADS COMPLETE!")
143
+ print("=" * 60)
144
+ print(f"\nModels:")
145
+ print(f" https://huggingface.co/{ORG}/InsureLLM-4B")
146
+ print(f" https://huggingface.co/{ORG}/InsureDocClassifier")
147
+ print(f" https://huggingface.co/{ORG}/InsureNER")
148
+ print(f" https://huggingface.co/{ORG}/InsureFraudNet")
149
+ print(f" https://huggingface.co/{ORG}/InsurePricing")
150
+ print(f"\nCode:")
151
+ print(f" https://huggingface.co/{ORG}/insureos-models")
152
+ print(f"\nDataset:")
153
+ print(f" https://huggingface.co/datasets/{ORG}/insureos-training-data")
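+
+ # Usage (the script reads the token from the environment):
+ #   HF_TOKEN=hf_xxx python push_to_hf.py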
qlora_finetune.py ADDED
@@ -0,0 +1,198 @@
+ """
+ InsureOS — QLoRA Fine-Tuning Script
+ Fine-tunes a Qwen3 base model (Qwen3-4B by default; configurable via
+ --base-model) on UK insurance SFT data using 4-bit QLoRA.
+ Fits in 16 GB VRAM with gradient checkpointing.
+ """
+
+ import os
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # must be set before torch import
+ import json
+ import argparse
+
+ import torch
+ from datasets import Dataset
+ from transformers import (
+     AutoModelForCausalLM,
+     AutoTokenizer,
+     BitsAndBytesConfig,
+ )
+ from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
+ from trl import SFTTrainer, SFTConfig
+
+
+ # ── Defaults ──
+
+ BASE_MODEL = "Qwen/Qwen3-4B"
+ DATA_PATH = "data/output/insurance_sft_10k.jsonl"
+ OUTPUT_DIR = "models/insurellm-4b-qlora"
+ MAX_SEQ_LEN = 1024
+ LORA_R = 64
+ LORA_ALPHA = 128
+ LORA_DROPOUT = 0.05
+ EPOCHS = 3
+ BATCH_SIZE = 2
+ GRAD_ACCUM = 8  # effective batch = BATCH_SIZE * GRAD_ACCUM = 16
+ LR = 2e-4
+ WARMUP_RATIO = 0.05
+ LOGGING_STEPS = 10
+ SAVE_STEPS = 200
+ EVAL_SPLIT = 0.05  # 5% held out for eval
+
+
+ def load_sft_data(path: str) -> Dataset:
+     """Load JSONL SFT data into a HuggingFace Dataset."""
+     records = []
+     with open(path) as f:
+         for line in f:
+             obj = json.loads(line)
+             records.append(obj)
+     return Dataset.from_list(records)
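+
+ # Expected JSONL record shape (inferred from format_messages below;
+ # field values are illustrative):
+ #   {"messages": [{"role": "user", "content": "..."},
+ #                 {"role": "assistant", "content": "..."}],
+ #    "category": "..."}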
+
+
+ def format_messages(example: dict) -> dict:
+     """Convert messages list to a single training text using the Qwen3 chat template."""
+     # The SFTTrainer with `dataset_text_field` or a chat template handles this,
+     # but we could also format manually if needed.
+     # Our SFT data has {"messages": [...], "category": ...}
+     return example  # SFTTrainer will use the tokenizer's chat template
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="QLoRA fine-tune a Qwen3 model for UK insurance")
+     parser.add_argument("--base-model", default=BASE_MODEL)
+     parser.add_argument("--data-path", default=DATA_PATH)
+     parser.add_argument("--output-dir", default=OUTPUT_DIR)
+     parser.add_argument("--epochs", type=int, default=EPOCHS)
+     parser.add_argument("--batch-size", type=int, default=BATCH_SIZE)
+     parser.add_argument("--lr", type=float, default=LR)
+     parser.add_argument("--lora-r", type=int, default=LORA_R)
+     parser.add_argument("--lora-alpha", type=int, default=LORA_ALPHA)
+     parser.add_argument("--max-seq-len", type=int, default=MAX_SEQ_LEN)
+     parser.add_argument("--grad-accum", type=int, default=GRAD_ACCUM)
+     parser.add_argument("--resume-from-checkpoint", action="store_true")
+     args = parser.parse_args()
+
+     print(f"{'='*60}")
+     print(f"  InsureOS — QLoRA Fine-Tuning")
+     print(f"  Base model: {args.base_model}")
+     print(f"  Data: {args.data_path}")
+     print(f"  LoRA rank: {args.lora_r}, alpha: {args.lora_alpha}")
+     print(f"{'='*60}\n")
+
+     # ── 1. Load tokenizer ──
+     print("[1/5] Loading tokenizer...")
+     tokenizer = AutoTokenizer.from_pretrained(
+         args.base_model,
+         trust_remote_code=True,
+         padding_side="right",
+     )
+     if tokenizer.pad_token is None:
+         tokenizer.pad_token = tokenizer.eos_token
+
+     # ── 2. Load model in 4-bit ──
+     print("[2/5] Loading model in 4-bit quantization...")
+     bnb_config = BitsAndBytesConfig(
+         load_in_4bit=True,
+         bnb_4bit_quant_type="nf4",
+         bnb_4bit_compute_dtype=torch.bfloat16,
+         bnb_4bit_use_double_quant=True,
+     )
+
+     model = AutoModelForCausalLM.from_pretrained(
+         args.base_model,
+         quantization_config=bnb_config,
+         device_map="auto",
+         trust_remote_code=True,
+         attn_implementation="sdpa",
+         dtype=torch.bfloat16,
+     )
+     model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
+
+     # ── 3. Apply LoRA ──
+     print("[3/5] Applying LoRA adapters...")
+     lora_config = LoraConfig(
+         r=args.lora_r,
+         lora_alpha=args.lora_alpha,
+         lora_dropout=LORA_DROPOUT,
+         target_modules="all-linear",
+         task_type=TaskType.CAUSAL_LM,
+         bias="none",
+     )
+     model = get_peft_model(model, lora_config)
+     model.print_trainable_parameters()
+
+     # ── 4. Load data ──
+     print("[4/5] Loading training data...")
+     dataset = load_sft_data(args.data_path)
+     print(f"  Total examples: {len(dataset)}")
+
+     # Train/eval split
+     split = dataset.train_test_split(test_size=EVAL_SPLIT, seed=42)
+     train_ds = split["train"]
+     eval_ds = split["test"]
+     print(f"  Train: {len(train_ds)}, Eval: {len(eval_ds)}")
+
+     # ── 5. Train ──
+     print("[5/5] Starting training...")
+
+     sft_config = SFTConfig(
+         output_dir=args.output_dir,
+         num_train_epochs=args.epochs,
+         per_device_train_batch_size=args.batch_size,
+         per_device_eval_batch_size=args.batch_size,
+         gradient_accumulation_steps=args.grad_accum,
+         learning_rate=args.lr,
+         lr_scheduler_type="cosine",
+         warmup_ratio=WARMUP_RATIO,
+         weight_decay=0.01,
+         bf16=True,
+         logging_steps=LOGGING_STEPS,
+         save_steps=SAVE_STEPS,
+         save_total_limit=3,
+         eval_strategy="steps",
+         eval_steps=SAVE_STEPS,
+         load_best_model_at_end=True,
+         metric_for_best_model="eval_loss",
+         greater_is_better=False,
+         report_to="none",
+         max_length=args.max_seq_len,
+         packing=False,
+         gradient_checkpointing=True,
+         gradient_checkpointing_kwargs={"use_reentrant": False},
+         dataset_kwargs={"skip_prepare_dataset": False},
+     )
+
+     trainer = SFTTrainer(
+         model=model,
+         args=sft_config,
+         train_dataset=train_ds,
+         eval_dataset=eval_ds,
+         processing_class=tokenizer,
+     )
+
+     if args.resume_from_checkpoint:
+         trainer.train(resume_from_checkpoint=True)
+     else:
+         trainer.train()
+
+     # ── Save ──
+     print("\nSaving model and tokenizer...")
+     trainer.save_model(args.output_dir)
+     tokenizer.save_pretrained(args.output_dir)
+
+     # Save merged adapter weights for easier loading
+     merged_dir = f"{args.output_dir}-merged"
+     print(f"Merging LoRA weights → {merged_dir}")
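+     # Note (behaviour depends on the installed peft version): merging into a
+     # 4-bit base dequantizes the weights during the merge, which can cost
+     # some precision; for a clean bf16 merge, reload the base model in bf16
+     # first, as retrain_realworld.py does.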
+     merged_model = model.merge_and_unload()
+     merged_model.save_pretrained(merged_dir)
+     tokenizer.save_pretrained(merged_dir)
+
+     print(f"\n✓ Training complete!")
+     print(f"  LoRA adapter: {args.output_dir}")
+     print(f"  Merged model: {merged_dir}")
+
+
+ if __name__ == "__main__":
+     main()
retrain_realworld.py ADDED
@@ -0,0 +1,176 @@
+ """Retrain InsureLLM with real-world collected data.
+
+ This script:
+ 1. Loads the existing DPO-merged model (best checkpoint)
+ 2. Runs QLoRA fine-tuning on the real-world SFT data
+ 3. Saves the improved model
+ """
+
+ import argparse
+ import logging
+ import os
+ from pathlib import Path
+
+ import torch
+
+ logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
+ logger = logging.getLogger(__name__)
+
+ BASE_DIR = Path(__file__).resolve().parent.parent
+
+
+ def retrain(
+     sft_file: str = "collect/sft_real_world.jsonl",
+     base_model: str = "models/insurellm-4b-dpo-merged",
+     output_dir: str = "models/insurellm-4b-realworld",
+     max_seq_len: int = 1024,
+     batch_size: int = 2,
+     grad_accum: int = 4,
+     epochs: int = 2,
+     lr: float = 2e-5,
+     lora_r: int = 64,
+     lora_alpha: int = 128,
+ ):
+     from datasets import load_dataset
+     from peft import LoraConfig, get_peft_model, TaskType
+     from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+     from trl import SFTConfig, SFTTrainer
+
+     sft_path = str(BASE_DIR / sft_file)
+     model_path = str(BASE_DIR / base_model)
+     out_path = str(BASE_DIR / output_dir)
+
+     logger.info(f"Loading SFT data from {sft_path}")
+     dataset = load_dataset("json", data_files=sft_path, split="train")
+     logger.info(f"  {len(dataset)} training examples")
+
+     # Train/eval split
+     split = dataset.train_test_split(test_size=0.05, seed=42)
+     train_ds = split["train"]
+     eval_ds = split["test"]
+     logger.info(f"  Train: {len(train_ds)}, Eval: {len(eval_ds)}")
+
+     # Quantization config
+     bnb = BitsAndBytesConfig(
+         load_in_4bit=True,
+         bnb_4bit_quant_type="nf4",
+         bnb_4bit_compute_dtype=torch.bfloat16,
+         bnb_4bit_use_double_quant=True,
+     )
+
+     logger.info(f"Loading base model: {model_path}")
+     model = AutoModelForCausalLM.from_pretrained(
+         model_path,
+         quantization_config=bnb,
+         device_map="auto",
+         trust_remote_code=True,
+         attn_implementation="sdpa",
+         dtype=torch.bfloat16,
+     )
+
+     tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+     if tokenizer.pad_token is None:
+         tokenizer.pad_token = tokenizer.eos_token
+
+     # LoRA config
+     lora_config = LoraConfig(
+         task_type=TaskType.CAUSAL_LM,
+         r=lora_r,
+         lora_alpha=lora_alpha,
+         lora_dropout=0.05,
+         target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                         "gate_proj", "up_proj", "down_proj"],
+         bias="none",
+     )
+
+     model = get_peft_model(model, lora_config)
+     model.print_trainable_parameters()
+
+     # Training config
+     training_args = SFTConfig(
+         output_dir=out_path,
+         num_train_epochs=epochs,
+         per_device_train_batch_size=batch_size,
+         per_device_eval_batch_size=batch_size,
+         gradient_accumulation_steps=grad_accum,
+         learning_rate=lr,
+         lr_scheduler_type="cosine",
+         warmup_ratio=0.05,
+         logging_steps=10,
+         save_strategy="steps",
+         save_steps=200,
+         eval_strategy="steps",
+         eval_steps=200,
+         save_total_limit=3,
+         bf16=True,
+         max_length=max_seq_len,
+         packing=False,
+         gradient_checkpointing=True,
+         gradient_checkpointing_kwargs={"use_reentrant": False},
+         report_to="none",
+         seed=42,
+     )
+
+     trainer = SFTTrainer(
+         model=model,
+         args=training_args,
+         train_dataset=train_ds,
+         eval_dataset=eval_ds,
+         processing_class=tokenizer,
+     )
+
+     logger.info("Starting training...")
+     trainer.train()
+     logger.info("Training complete!")
+
+     # Save LoRA adapter
+     trainer.save_model(out_path)
+     tokenizer.save_pretrained(out_path)
+     logger.info(f"LoRA adapter saved to {out_path}")
+
+     # Merge and save
+     merged_path = out_path + "-merged"
+     logger.info("Merging LoRA into base model...")
+     from peft import PeftModel
+
+     base = AutoModelForCausalLM.from_pretrained(
+         model_path, dtype=torch.bfloat16, device_map="cpu", trust_remote_code=True,
+     )
+     merged = PeftModel.from_pretrained(base, out_path)
+     merged = merged.merge_and_unload()
+
+     # Save with safetensors (works around a save_pretrained bug in some
+     # recent transformers releases)
+     os.makedirs(merged_path, exist_ok=True)
+     from safetensors.torch import save_file
+     state = merged.state_dict()
+     # Handle tied weights: safetensors refuses to serialize shared tensors,
+     # so clone lm_head if it aliases the embedding matrix
+     if "lm_head.weight" in state and "model.embed_tokens.weight" in state:
+         if state["lm_head.weight"].data_ptr() == state["model.embed_tokens.weight"].data_ptr():
+             state["lm_head.weight"] = state["lm_head.weight"].clone()
+     save_file(state, f"{merged_path}/model.safetensors")
+     merged.config.save_pretrained(merged_path)
+     tokenizer.save_pretrained(merged_path)
+     logger.info(f"Merged model saved to {merged_path}")
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser(description="Retrain InsureLLM with real-world data")
+     parser.add_argument("--sft-file", default="collect/sft_real_world.jsonl")
+     parser.add_argument("--base-model", default="models/insurellm-4b-dpo-merged")
+     parser.add_argument("--output-dir", default="models/insurellm-4b-realworld")
+     parser.add_argument("--max-seq-len", type=int, default=1024)
+     parser.add_argument("--batch-size", type=int, default=2)
+     parser.add_argument("--grad-accum", type=int, default=4)
+     parser.add_argument("--epochs", type=int, default=2)
+     parser.add_argument("--lr", type=float, default=2e-5)
+     args = parser.parse_args()
+
+     retrain(
+         sft_file=args.sft_file,
+         base_model=args.base_model,
+         output_dir=args.output_dir,
+         max_seq_len=args.max_seq_len,
+         batch_size=args.batch_size,
+         grad_accum=args.grad_accum,
+         epochs=args.epochs,
+         lr=args.lr,
+     )
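+
+ # Example invocation (module path assumed to mirror scripts/train_all.sh's
+ # `python -m training.<script>` pattern):
+ #   python -m training.retrain_realworld --epochs 2 --lr 2e-5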
run_collection.py ADDED
@@ -0,0 +1,128 @@
+ """Master orchestrator for all data collection sources."""
+
+ import logging
+ import time
+
+ # Setup logging
+ logging.basicConfig(
+     level=logging.INFO,
+     format="%(asctime)s [%(levelname)s] %(message)s",
+     datefmt="%H:%M:%S",
+ )
+ logger = logging.getLogger(__name__)
+
+
+ def run_collection():
+     """Run all data collection sources."""
+ start = time.time()
21
+ total_docs = 0
22
+
23
+ # ── 1. Wikipedia ───────────────────────────────────────────────
24
+ logger.info("=" * 60)
25
+ logger.info("1/6 WIKIPEDIA β€” Insurance articles")
26
+ logger.info("=" * 60)
27
+ try:
28
+ from collect.sources.wikipedia import collect_wikipedia
29
+ docs = collect_wikipedia(max_articles=400)
30
+ total_docs += len(docs)
31
+ logger.info(f" βœ“ Wikipedia: {len(docs)} documents")
32
+ except Exception as e:
33
+ logger.error(f" βœ— Wikipedia failed: {e}")
34
+
35
+ # ── 2. FCA Handbook ────────────────────────────────────────────
36
+ logger.info("=" * 60)
37
+ logger.info("2/6 FCA HANDBOOK β€” UK insurance regulation")
38
+ logger.info("=" * 60)
39
+ try:
40
+ from collect.sources.fca import collect_fca
41
+ docs = collect_fca()
42
+ total_docs += len(docs)
43
+ logger.info(f" βœ“ FCA: {len(docs)} documents")
44
+ except Exception as e:
45
+ logger.error(f" βœ— FCA failed: {e}")
46
+
47
+ # ── 3. UK Legislation ──────────────────────────────────────────
48
+ logger.info("=" * 60)
49
+ logger.info("3/6 UK LEGISLATION β€” Insurance Act 2015 etc.")
50
+ logger.info("=" * 60)
51
+ try:
52
+ from collect.sources.legislation import collect_legislation
53
+ docs = collect_legislation()
54
+ total_docs += len(docs)
55
+ logger.info(f" βœ“ Legislation: {len(docs)} documents")
56
+ except Exception as e:
57
+ logger.error(f" βœ— Legislation failed: {e}")
58
+
59
+ # ── 4. Investopedia ────────────────────────────────────────────
60
+ logger.info("=" * 60)
61
+ logger.info("4/6 INVESTOPEDIA β€” Insurance glossary")
62
+ logger.info("=" * 60)
63
+ try:
64
+ from collect.sources.investopedia import collect_investopedia
65
+ docs = collect_investopedia()
66
+ total_docs += len(docs)
67
+ logger.info(f" βœ“ Investopedia: {len(docs)} documents")
68
+ except Exception as e:
69
+ logger.error(f" βœ— Investopedia failed: {e}")
70
+
71
+ # ── 5. HuggingFace ─────────────────────────────────────────────
72
+ logger.info("=" * 60)
73
+ logger.info("5/6 HUGGINGFACE β€” Insurance datasets")
74
+ logger.info("=" * 60)
75
+ try:
76
+ from collect.sources.hf_datasets import collect_huggingface
77
+ docs = collect_huggingface()
78
+ total_docs += len(docs)
79
+ logger.info(f" βœ“ HuggingFace: {len(docs)} documents")
80
+ except Exception as e:
81
+ logger.error(f" βœ— HuggingFace failed: {e}")
82
+
83
+ # ── 6. RSS / News ──────────────────────────────────────────────
84
+ logger.info("=" * 60)
85
+ logger.info("6/6 RSS NEWS β€” Insurance industry news")
86
+ logger.info("=" * 60)
87
+ try:
88
+ from collect.sources.rss_news import collect_rss
89
+ docs = collect_rss()
90
+ total_docs += len(docs)
91
+ logger.info(f" βœ“ RSS: {len(docs)} documents")
92
+ except Exception as e:
93
+ logger.error(f" βœ— RSS failed: {e}")
94
+
95
+ # ── 7. Education ───────────────────────────────────────────────
96
+ logger.info("=" * 60)
97
+ logger.info("7/7 EDUCATION β€” Open textbooks & exam content")
98
+ logger.info("=" * 60)
99
+ try:
100
+ from collect.sources.education import collect_education
101
+ docs = collect_education()
102
+ total_docs += len(docs)
103
+ logger.info(f" βœ“ Education: {len(docs)} documents")
104
+ except Exception as e:
105
+ logger.error(f" βœ— Education failed: {e}")
106
+
107
+ # ── Convert to SFT ─────────────────────────────────────────────
108
+ logger.info("=" * 60)
109
+ logger.info("CONVERTING collected data β†’ SFT + DPO training format")
110
+ logger.info("=" * 60)
111
+ try:
112
+ from collect.convert_sft import convert_all_to_sft
113
+ sft_count, dpo_count = convert_all_to_sft()
114
+ logger.info(f" βœ“ SFT pairs: {sft_count}")
115
+ logger.info(f" βœ“ DPO pairs: {dpo_count}")
116
+ except Exception as e:
117
+ logger.error(f" βœ— SFT conversion failed: {e}")
118
+
119
+ elapsed = time.time() - start
120
+ logger.info("=" * 60)
121
+ logger.info(f"COLLECTION COMPLETE")
122
+ logger.info(f" Total documents: {total_docs:,}")
123
+ logger.info(f" Time elapsed: {elapsed / 60:.1f} minutes")
124
+ logger.info("=" * 60)
125
+
126
+
127
+ if __name__ == "__main__":
128
+ run_collection()
run_eval.py ADDED
@@ -0,0 +1,356 @@
+ """
+ InsureOS — Comprehensive Evaluation Suite
+ Evaluates all trained models: InsureLLM (generative), FraudNet, PricingGLM, DocClassifier, NER.
+ """
+
+ import os
+ import json
+ import argparse
+ import time
+ from pathlib import Path
+
+ import torch
+ import numpy as np
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+ # Optional imports (gracefully degrade if not installed)
+ try:
+     from rouge_score import rouge_scorer
+     HAS_ROUGE = True
+ except ImportError:
+     HAS_ROUGE = False
+
+ try:
+     from bert_score import score as bert_score_fn
+     HAS_BERTSCORE = True
+ except ImportError:
+     HAS_BERTSCORE = False
+
+
+ # ── Defaults ──
+
+ INSURELLM_MODEL = "models/insurellm-8b-dpo-merged"
+ EVAL_DATA = "data/output/insurance_sft_10k.jsonl"
+ OUTPUT_DIR = "evaluation/results"
+ N_SAMPLES = 100  # number of samples for generative eval
+
+
+ # ── UK Insurance Domain-Specific Evaluation Prompts ──
+
+ DOMAIN_EVAL_PROMPTS = [
+     {
+         "id": "fca_consumer_duty",
+         "prompt": "A customer's home insurance claim for escape of water has been open for 8 weeks with no update. They call asking what's happening. Draft a response.",
+         "required_elements": ["apology", "timeline", "next steps", "complaint rights", "financial ombudsman"],
+         "forbidden_elements": ["jargon", "indemnification", "proximate cause", "peril"],
+     },
+     {
+         "id": "gdpr_data_protection",
+         "prompt": "A third party calls asking about the status of a motor claim made against our policyholder. How should we respond?",
+         "required_elements": ["data protection", "cannot share", "their own insurer", "GDPR"],
+         "forbidden_elements": ["policy number", "policyholder address", "bank details"],
+     },
+     {
+         "id": "claims_process",
+         "prompt": "Explain the insurance claims process for a domestic property fire from FNOL through to settlement.",
+         "required_elements": ["FNOL", "loss adjuster", "investigation", "settlement", "excess"],
+         "forbidden_elements": [],
+     },
+     {
+         "id": "fraud_indicators",
+         "prompt": "What are the key red flags for insurance fraud in motor third-party claims?",
+         "required_elements": ["exaggeration", "staged", "late reporting", "multiple claims"],
+         "forbidden_elements": [],
+     },
+     {
+         "id": "lloyds_market",
+         "prompt": "Explain the role of an MGA in the Lloyd's market and how binding authority agreements work.",
+         "required_elements": ["binding authority", "capacity provider", "syndicate", "delegated authority", "bordereaux"],
+         "forbidden_elements": [],
+     },
+     {
+         "id": "pricing_fairness",
+         "prompt": "An insurer wants to use first names as a rating factor because it improves their model by 3%. Should they?",
+         "required_elements": ["proxy discrimination", "protected characteristics", "Equality Act", "FCA"],
+         "forbidden_elements": [],
+     },
+     {
+         "id": "subrogation",
+         "prompt": "Explain subrogation rights in UK insurance. When does an insurer pursue recovery?",
+         "required_elements": ["recovery", "third party", "policyholder indemnified", "non-fault"],
+         "forbidden_elements": [],
+     },
+     {
+         "id": "renewal_transparency",
+         "prompt": "A customer's premium increased by 25% at renewal. They want to know why. Draft an explanation.",
+         "required_elements": ["transparency", "factors", "shop around", "Consumer Duty", "fair value"],
+         "forbidden_elements": ["take it or leave it", "market rate"],
+     },
+ ]
+
+
+ def evaluate_insurellm(model_path: str, n_samples: int, output_dir: str) -> dict:
+     """Evaluate the generative InsureLLM model."""
+     print(f"\n{'='*60}")
+     print(f"  Evaluating InsureLLM: {model_path}")
+     print(f"{'='*60}")
+
+     # Load model
+     print("Loading model...")
+     tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+     if tokenizer.pad_token is None:
+         tokenizer.pad_token = tokenizer.eos_token
+
+     bnb_config = BitsAndBytesConfig(
+         load_in_4bit=True,
+         bnb_4bit_quant_type="nf4",
+         bnb_4bit_compute_dtype=torch.bfloat16,
+         bnb_4bit_use_double_quant=True,
+     )
+
+     model = AutoModelForCausalLM.from_pretrained(
+         model_path,
+         quantization_config=bnb_config,
+         device_map="auto",
+         trust_remote_code=True,
+         attn_implementation="sdpa",
+         torch_dtype=torch.bfloat16,
+     )
+     model.eval()
+
+     results = {
+         "model": model_path,
+         "domain_eval": [],
+         "generation_metrics": {},
+     }
+
+     # ── 1. Domain-Specific Evaluation ──
+     print("\n[1/3] Domain-specific evaluation...")
+     for item in DOMAIN_EVAL_PROMPTS:
+         messages = [
+             {"role": "system", "content": "You are InsureLLM, a specialist UK insurance AI assistant."},
+             {"role": "user", "content": item["prompt"]},
+         ]
+
+         text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+         inputs = tokenizer(text, return_tensors="pt").to(model.device)
+
+         start = time.time()
+         with torch.no_grad():
+             outputs = model.generate(
+                 **inputs,
+                 max_new_tokens=512,
+                 temperature=0.7,
+                 top_p=0.9,
+                 do_sample=True,
+             )
+         latency = time.time() - start
+
+         response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
+
+         # Check required elements
+         response_lower = response.lower()
+         found_required = [e for e in item["required_elements"] if e.lower() in response_lower]
+         found_forbidden = [e for e in item["forbidden_elements"] if e.lower() in response_lower]
+
+         score = len(found_required) / max(len(item["required_elements"]), 1)
+         penalty = len(found_forbidden) * 0.15
+         final_score = max(0, score - penalty)
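+         # Worked example: 4 of 5 required elements found and 1 forbidden term
+         # used gives score = 0.8, penalty = 0.15, final_score = 0.65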
+
+         eval_result = {
+             "id": item["id"],
+             "score": final_score,
+             "required_found": len(found_required),
+             "required_total": len(item["required_elements"]),
+             "forbidden_found": len(found_forbidden),
+             "latency_s": latency,
+             "response_length": len(response.split()),
+         }
+         results["domain_eval"].append(eval_result)
+
+         status = "✓" if final_score >= 0.7 else "△" if final_score >= 0.4 else "✗"
+         print(f"  {status} {item['id']}: {final_score:.2f} "
+               f"({len(found_required)}/{len(item['required_elements'])} required, "
+               f"{len(found_forbidden)} forbidden, {latency:.1f}s)")
+
+     avg_domain = np.mean([r["score"] for r in results["domain_eval"]])
+     avg_latency = np.mean([r["latency_s"] for r in results["domain_eval"]])
+     print(f"\n  Average domain score: {avg_domain:.3f}")
+     print(f"  Average latency: {avg_latency:.1f}s")
+
+     # ── 2. ROUGE scores on held-out SFT data ──
+     if HAS_ROUGE and os.path.exists(EVAL_DATA):
+         print("\n[2/3] ROUGE evaluation on SFT test set...")
+         scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
+
+         eval_records = []
+         with open(EVAL_DATA) as f:
+             for line in f:
+                 eval_records.append(json.loads(line))
+
+         # Use last N as eval
+         eval_subset = eval_records[-min(n_samples, len(eval_records)):]
+
+         rouge1_scores = []
+         rouge2_scores = []
+         rougeL_scores = []
+
+         for rec in eval_subset:
+             messages = rec["messages"]
+             # Get reference (last assistant message)
+             reference = messages[-1]["content"]
+             prompt_messages = messages[:-1]
+
+             text = tokenizer.apply_chat_template(prompt_messages, tokenize=False, add_generation_prompt=True)
+             inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).to(model.device)
+
+             with torch.no_grad():
+                 outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)  # greedy decoding
+
+             generated = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
+
+             scores = scorer.score(reference, generated)
+             rouge1_scores.append(scores["rouge1"].fmeasure)
+             rouge2_scores.append(scores["rouge2"].fmeasure)
+             rougeL_scores.append(scores["rougeL"].fmeasure)
+
+         results["generation_metrics"]["rouge1"] = float(np.mean(rouge1_scores))
+         results["generation_metrics"]["rouge2"] = float(np.mean(rouge2_scores))
+         results["generation_metrics"]["rougeL"] = float(np.mean(rougeL_scores))
+
+         print(f"  ROUGE-1: {results['generation_metrics']['rouge1']:.4f}")
+         print(f"  ROUGE-2: {results['generation_metrics']['rouge2']:.4f}")
+         print(f"  ROUGE-L: {results['generation_metrics']['rougeL']:.4f}")
+     else:
+         print("\n[2/3] Skipping ROUGE (rouge_score not installed or data not found)")
+
+     # ── 3. Summary metrics ──
+     print("\n[3/3] Computing summary...")
+     results["summary"] = {
+         "avg_domain_score": float(avg_domain),
+         "avg_latency_s": float(avg_latency),
+         "domain_pass_rate": float(np.mean([1 if r["score"] >= 0.7 else 0 for r in results["domain_eval"]])),
+     }
+
+     # Save
+     os.makedirs(output_dir, exist_ok=True)
+     outpath = os.path.join(output_dir, "insurellm_eval.json")
+     with open(outpath, "w") as f:
+         json.dump(results, f, indent=2)
+     print(f"\n✓ InsureLLM eval results → {outpath}")
+
+     return results
+
+
+ def evaluate_all(args):
+     """Run evaluation for all available models."""
+     print(f"{'='*60}")
+     print(f"  InsureOS — Full Evaluation Suite")
+     print(f"{'='*60}")
+
+     os.makedirs(args.output_dir, exist_ok=True)
+     all_results = {}
+
+     # 1. InsureLLM
+     if os.path.exists(args.insurellm_model):
+         all_results["insurellm"] = evaluate_insurellm(
+             args.insurellm_model, args.n_samples, args.output_dir
+         )
+     else:
+         print(f"\n⚠ InsureLLM not found at {args.insurellm_model}, skipping")
+
+     # 2. FraudNet — just check if results exist from training
+     fraud_results = Path("models/fraudnet/training_results.json")
+     if fraud_results.exists():
+         with open(fraud_results) as f:
+             all_results["fraudnet"] = json.load(f)
+         print("\n✓ FraudNet results loaded from training")
+     else:
+         print("\n⚠ FraudNet results not found, skipping")
+
+     # 3. Pricing GLM
+     pricing_results = Path("models/pricing-glm/training_results.json")
+     if pricing_results.exists():
+         with open(pricing_results) as f:
+             all_results["pricing"] = json.load(f)
+         print("✓ Pricing model results loaded from training")
+     else:
+         print("⚠ Pricing results not found, skipping")
+
+     # 4. Doc Classifier
+     doc_meta = Path("models/doc-classifier/training_meta.json")
+     if doc_meta.exists():
+         with open(doc_meta) as f:
+             all_results["doc_classifier"] = json.load(f)
+         print("✓ Doc classifier results loaded")
+     else:
+         print("⚠ Doc classifier results not found, skipping")
+
+     # 5. NER
+     ner_meta = Path("models/ner-model/training_meta.json")
+     if ner_meta.exists():
+         with open(ner_meta) as f:
+             all_results["ner"] = json.load(f)
+         print("✓ NER results loaded")
+     else:
+         print("⚠ NER results not found, skipping")
+
+     # ── Summary report ──
+     report_path = os.path.join(args.output_dir, "full_eval_report.json")
+     with open(report_path, "w") as f:
+         json.dump(all_results, f, indent=2, default=str)
+
+     print(f"\n{'='*60}")
+     print(f"  EVALUATION SUMMARY")
+     print(f"{'='*60}")
+
+     if "insurellm" in all_results:
+         s = all_results["insurellm"].get("summary", {})
+         print(f"\n  InsureLLM (Generative):")
+         print(f"    Domain score: {s.get('avg_domain_score', 'N/A')}")
+         print(f"    Pass rate: {s.get('domain_pass_rate', 'N/A')}")
+         print(f"    Latency: {s.get('avg_latency_s', 'N/A')}s")
+
+     if "fraudnet" in all_results:
+         for r in all_results["fraudnet"]:
+             if isinstance(r, dict):
+                 print(f"\n  FraudNet ({r.get('lob', '?')}):")
+                 print(f"    AUC-ROC: {r.get('auc_roc', 'N/A')}")
+                 print(f"    Avg Precision: {r.get('avg_precision', 'N/A')}")
+
+     if "pricing" in all_results:
+         for model_type in ["glm", "ebm"]:
+             if model_type in all_results["pricing"]:
+                 m = all_results["pricing"][model_type]
+                 print(f"\n  Pricing {model_type.upper()}:")
+                 print(f"    MAE: £{m.get('mae', 'N/A')}")
+                 print(f"    RMSE: £{m.get('rmse', 'N/A')}")
+
+     if "doc_classifier" in all_results:
+         r = all_results["doc_classifier"].get("results", {})
+         print(f"\n  Document Classifier:")
+         print(f"    Accuracy: {r.get('eval_accuracy', 'N/A')}")
+         print(f"    F1 (macro): {r.get('eval_f1_macro', 'N/A')}")
+
+     if "ner" in all_results:
+         r = all_results["ner"].get("results", {})
+         print(f"\n  NER Model:")
+         print(f"    F1: {r.get('eval_f1', 'N/A')}")
+         print(f"    Precision: {r.get('eval_precision', 'N/A')}")
+         print(f"    Recall: {r.get('eval_recall', 'N/A')}")
+
+     print(f"\n  Full report → {report_path}")
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="InsureOS evaluation suite")
+     parser.add_argument("--insurellm-model", default=INSURELLM_MODEL)
+     parser.add_argument("--n-samples", type=int, default=N_SAMPLES)
+     parser.add_argument("--output-dir", default=OUTPUT_DIR)
+     args = parser.parse_args()
+
+     evaluate_all(args)
+
+
+ if __name__ == "__main__":
+     main()
run_fast.py ADDED
@@ -0,0 +1,131 @@
+ """Fast data collection — reduced Wikipedia cap, lower API delay."""
+
+ import logging
+ import time
+
+ logging.basicConfig(
+     level=logging.INFO,
+     format="%(asctime)s [%(levelname)s] %(message)s",
+     datefmt="%H:%M:%S",
+ )
+ logger = logging.getLogger(__name__)
+
+ # Override delay for API sources (Wikipedia API is generous)
+ import collect.config as cfg
+ cfg.REQUEST_DELAY = 0.5
+
+
+ def run_fast():
+     start = time.time()
+     total_docs = 0
+
+     # 1. Wikipedia (cap at 150 — still 2M+ chars of insurance knowledge)
+     logger.info("=" * 60)
+     logger.info("1/7 WIKIPEDIA — Insurance articles (max 150)")
+     logger.info("=" * 60)
+     try:
+         from collect.sources.wikipedia import collect_wikipedia
+         docs = collect_wikipedia(max_articles=150)
+         total_docs += len(docs)
+         logger.info(f"  => Wikipedia: {len(docs)} documents")
+     except Exception as e:
+         logger.error(f"  Wikipedia failed: {e}", exc_info=True)
+
+     # 2. FCA Handbook
+     cfg.REQUEST_DELAY = 1.5  # Web scraping — be polite
+     logger.info("=" * 60)
+     logger.info("2/7 FCA HANDBOOK")
+     logger.info("=" * 60)
+     try:
+         from collect.sources.fca import collect_fca
+         docs = collect_fca()
+         total_docs += len(docs)
+         logger.info(f"  => FCA: {len(docs)} documents")
+     except Exception as e:
+         logger.error(f"  FCA failed: {e}", exc_info=True)
+
+     # 3. UK Legislation
+     logger.info("=" * 60)
+     logger.info("3/7 UK LEGISLATION")
+     logger.info("=" * 60)
+     try:
+         from collect.sources.legislation import collect_legislation
+         docs = collect_legislation()
+         total_docs += len(docs)
+         logger.info(f"  => Legislation: {len(docs)} documents")
+     except Exception as e:
+         logger.error(f"  Legislation failed: {e}", exc_info=True)
+
+     # 4. Investopedia
+     logger.info("=" * 60)
+     logger.info("4/7 INVESTOPEDIA")
+     logger.info("=" * 60)
+     try:
+         from collect.sources.investopedia import collect_investopedia
+         docs = collect_investopedia()
+         total_docs += len(docs)
+         logger.info(f"  => Investopedia: {len(docs)} documents")
+     except Exception as e:
+         logger.error(f"  Investopedia failed: {e}", exc_info=True)
+
+     # 5. HuggingFace
+     cfg.REQUEST_DELAY = 0.3
+     logger.info("=" * 60)
+     logger.info("5/7 HUGGINGFACE DATASETS")
+     logger.info("=" * 60)
+     try:
+         from collect.sources.hf_datasets import collect_huggingface
+         docs = collect_huggingface()
+         total_docs += len(docs)
+         logger.info(f"  => HuggingFace: {len(docs)} documents")
+     except Exception as e:
+         logger.error(f"  HuggingFace failed: {e}", exc_info=True)
+
+     # 6. RSS News
+     cfg.REQUEST_DELAY = 1.0
+     logger.info("=" * 60)
+     logger.info("6/7 RSS NEWS")
+     logger.info("=" * 60)
+     try:
+         from collect.sources.rss_news import collect_rss
+         docs = collect_rss()
+         total_docs += len(docs)
+         logger.info(f"  => RSS: {len(docs)} documents")
+     except Exception as e:
+         logger.error(f"  RSS failed: {e}", exc_info=True)
+
+     # 7. Education
+     logger.info("=" * 60)
+     logger.info("7/7 EDUCATION")
+     logger.info("=" * 60)
+     try:
+         from collect.sources.education import collect_education
+         docs = collect_education()
+         total_docs += len(docs)
+         logger.info(f"  => Education: {len(docs)} documents")
+     except Exception as e:
+         logger.error(f"  Education failed: {e}", exc_info=True)
+
+     # Convert to SFT
+     logger.info("=" * 60)
+     logger.info("CONVERTING → SFT + DPO format")
+     logger.info("=" * 60)
+     try:
+         from collect.convert_sft import convert_all_to_sft
+         sft_count, dpo_count = convert_all_to_sft()
+         logger.info(f"  => SFT pairs: {sft_count}")
+         logger.info(f"  => DPO pairs: {dpo_count}")
+     except Exception as e:
+         logger.error(f"  SFT conversion failed: {e}", exc_info=True)
+
+     elapsed = time.time() - start
+     logger.info("=" * 60)
+     logger.info(f"DONE — {total_docs:,} documents in {elapsed / 60:.1f} min")
+     logger.info("=" * 60)
+
+
+ if __name__ == "__main__":
+     run_fast()
scraper_base.py ADDED
@@ -0,0 +1,102 @@
+ """Base scraper with rate limiting, retries, and polite crawling."""
+
+ import time
+ import json
+ import hashlib
+ import logging
+ from pathlib import Path
+ from typing import Optional
+
+ import requests
+ from requests.adapters import HTTPAdapter
+ from urllib3.util.retry import Retry
+
+ from collect.config import (
+     HEADERS, REQUEST_DELAY, MAX_RETRIES, TIMEOUT, RAW_DIR,
+ )
+
+ logger = logging.getLogger(__name__)
+
+
+ class BaseScraper:
+     """Polite web scraper with rate limiting and caching."""
+
+     def __init__(self, source_name: str):
+         self.source_name = source_name
+         self.output_dir = RAW_DIR / source_name
+         self.output_dir.mkdir(parents=True, exist_ok=True)
+         self.session = self._build_session()
+         self._last_request_time = 0.0
+         self.stats = {"fetched": 0, "cached": 0, "failed": 0, "total_chars": 0}
+
+     def _build_session(self) -> requests.Session:
+         session = requests.Session()
+         session.headers.update(HEADERS)
+         retry = Retry(
+             total=MAX_RETRIES,
+             backoff_factor=1.0,
+             status_forcelist=[429, 500, 502, 503, 504],
+             allowed_methods=["GET"],
+         )
+         adapter = HTTPAdapter(max_retries=retry)
+         session.mount("https://", adapter)
+         session.mount("http://", adapter)
+         return session
+
+     def _rate_limit(self):
+         elapsed = time.time() - self._last_request_time
+         if elapsed < REQUEST_DELAY:
+             time.sleep(REQUEST_DELAY - elapsed)
+         self._last_request_time = time.time()
+
+     def _cache_key(self, url: str) -> str:
+         return hashlib.sha256(url.encode()).hexdigest()[:16]
+
+     def _cache_path(self, url: str) -> Path:
+         return self.output_dir / f"{self._cache_key(url)}.json"
+
+     def fetch(self, url: str, force: bool = False) -> Optional[str]:
+         """Fetch URL content with caching and rate limiting."""
+         cache = self._cache_path(url)
+         if not force and cache.exists():
+             data = json.loads(cache.read_text())
+             self.stats["cached"] += 1
+             return data.get("content")
+
+         self._rate_limit()
+         try:
+             resp = self.session.get(url, timeout=TIMEOUT)
+             resp.raise_for_status()
+             content = resp.text
+             # Cache the result (timestamp in UTC, matching the "Z" suffix)
+             cache.write_text(json.dumps({
+                 "url": url,
+                 "status": resp.status_code,
+                 "content": content,
+                 "fetched_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
+             }))
+             self.stats["fetched"] += 1
+             self.stats["total_chars"] += len(content)
+             return content
+         except Exception as e:
+             logger.warning(f"[{self.source_name}] Failed to fetch {url}: {e}")
+             self.stats["failed"] += 1
+             return None
+
+     def save_documents(self, documents: list[dict], filename: str = "documents.jsonl"):
+         """Save collected documents as JSONL."""
+         out = self.output_dir / filename
+         with open(out, "w") as f:
+             for doc in documents:
+                 f.write(json.dumps(doc, ensure_ascii=False) + "\n")
+         logger.info(f"[{self.source_name}] Saved {len(documents)} docs → {out}")
+         return out
+
+     def print_stats(self):
+         logger.info(
+             f"[{self.source_name}] Stats: "
+             f"fetched={self.stats['fetched']}, "
+             f"cached={self.stats['cached']}, "
+             f"failed={self.stats['failed']}, "
+             f"chars={self.stats['total_chars']:,}"
+         )
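+
+ # Minimal usage sketch (hypothetical subclass; collect() is not part of the
+ # base class):
+ #   class WikipediaScraper(BaseScraper):
+ #       def collect(self) -> list[dict]:
+ #           html = self.fetch("https://en.wikipedia.org/wiki/Insurance")
+ #           return [{"title": "Insurance", "content": html}] if html else []
+ #
+ #   scraper = WikipediaScraper("wikipedia")  # caches under RAW_DIR / "wikipedia"
+ #   scraper.save_documents(scraper.collect())
+ #   scraper.print_stats()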
scripts/setup.sh ADDED
@@ -0,0 +1,82 @@
+ #!/usr/bin/env bash
+ set -euo pipefail
+
+ # ============================================================
+ # InsureOS Models — Environment Setup
+ # Target: 16GB GPU VM (Bytical Audio)
+ # ============================================================
+
+ echo "=== InsureOS Models Setup ==="
+ echo "GPU check:"
+ nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null || echo "WARNING: No GPU detected"
+
+ # System packages
+ echo "=== Installing system dependencies ==="
+ sudo apt-get update -qq
+ sudo apt-get install -y -qq python3-pip python3-venv git curl wget
+
+ # Create venv
+ echo "=== Creating Python virtual environment ==="
+ python3 -m venv .venv
+ source .venv/bin/activate
+
+ # Core ML
+ echo "=== Installing PyTorch + CUDA ==="
+ pip install --upgrade pip wheel setuptools
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
+
+ # Training stack (version specifiers are quoted so the shell does not
+ # treat ">" as an output redirect)
+ echo "=== Installing training dependencies ==="
+ pip install \
+     "transformers>=4.50.0" \
+     "datasets>=3.0.0" \
+     "accelerate>=1.2.0" \
+     "peft>=0.14.0" \
+     "trl>=0.15.0" \
+     "bitsandbytes>=0.45.0" \
+     flash-attn --no-build-isolation \
+     sentencepiece \
+     protobuf \
+     wandb
+
+ # Specialized ML
+ echo "=== Installing specialized ML packages ==="
+ pip install \
+     "scikit-learn>=1.5.0" \
+     "xgboost>=2.1.0" \
+     "lightgbm>=4.5.0" \
+     "interpret>=0.6.0" \
+     "statsmodels>=0.14.0" \
+     "scipy>=1.14.0" \
+     "networkx>=3.4" \
+     "torch-geometric>=2.6.0" \
+     pyg-lib -f https://data.pyg.org/whl/torch-2.5.0+cu121.html
+
+ # Serving
+ echo "=== Installing serving dependencies ==="
+ pip install \
+     "fastapi>=0.115.0" \
+     "uvicorn>=0.32.0" \
+     "pydantic>=2.10.0"
+
+ # Evaluation
+ echo "=== Installing evaluation packages ==="
+ pip install \
+     rouge-score \
+     nltk \
+     bert-score \
+     seqeval
+
+ # Data generation
+ echo "=== Installing data generation packages ==="
+ pip install \
+     "faker>=33.0.0" \
+     "numpy>=1.26.0" \
+     "pandas>=2.2.0" \
+     "tqdm>=4.67.0"
+
+ echo ""
+ echo "=== Setup complete! ==="
+ echo "Activate with: source .venv/bin/activate"
+ echo "Generate data: python -m data.generate_all"
+ echo "Train all: bash scripts/train_all.sh"
scripts/train_all.sh ADDED
@@ -0,0 +1,108 @@
+ #!/usr/bin/env bash
+ set -euo pipefail
+
+ # ============================================================
+ # InsureOS Models — Train All Models Sequentially
+ # Designed for 16GB GPU — runs one model at a time
+ # ============================================================
+
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+ ROOT_DIR="$(dirname "$SCRIPT_DIR")"
+ cd "$ROOT_DIR"
+
+ source .venv/bin/activate
+
+ echo "============================================"
+ echo " InsureOS Models — Full Training Pipeline"
+ echo " $(date)"
+ echo "============================================"
+
+ # Phase 0: Generate data
+ echo ""
+ echo "=== PHASE 0: Generating synthetic training data ==="
+ python -m data.generate_all
+
+ # Phase 1: QLoRA fine-tuning (InsureLLM-8B)
+ # (flag names follow the argparse defined in qlora_finetune.py)
+ echo ""
+ echo "=== PHASE 1: QLoRA Fine-Tuning — Qwen3-8B ==="
+ python -m training.qlora_finetune \
+     --base-model Qwen/Qwen3-8B \
+     --data-path data/output/insurance_sft_10k.jsonl \
+     --output-dir outputs/insurellm-8b \
+     --epochs 3 \
+     --batch-size 2 \
+     --grad-accum 8 \
+     --lr 2e-4 \
+     --lora-r 64 \
+     --lora-alpha 128
+
+ # Phase 2: DPO preference optimization
+ echo ""
+ echo "=== PHASE 2: DPO Preference Optimization ==="
+ python -m training.dpo_train \
+     --base-model Qwen/Qwen3-8B \
+     --adapter-path outputs/insurellm-8b \
+     --dataset data/output/insurance_dpo_5k.jsonl \
+     --output-dir outputs/insurellm-8b-dpo \
+     --epochs 1 \
+     --batch-size 1 \
+     --gradient-accumulation 16 \
+     --learning-rate 5e-5 \
+     --beta 0.1
+
+ # Phase 3: Distillation to smaller model
+ echo ""
+ echo "=== PHASE 3: Knowledge Distillation — Qwen3-4B ==="
+ python -m training.distill \
+     --teacher-model Qwen/Qwen3-8B \
+     --teacher-adapter outputs/insurellm-8b-dpo \
+     --student-model Qwen/Qwen3-4B \
+     --dataset data/output/insurance_sft_10k.jsonl \
+     --output-dir outputs/insurellm-4b \
+     --epochs 2 \
+     --batch-size 2 \
+     --gradient-accumulation 8
+
+ # Phase 4: Fraud detection model
+ echo ""
+ echo "=== PHASE 4: Fraud Detection Model ==="
+ python -m training.fraud_model \
+     --dataset data/output/claims_tabular_50k.csv \
+     --output-dir outputs/fraudnet
+
+ # Phase 5: Pricing GLM
+ echo ""
+ echo "=== PHASE 5: Pricing GLM ==="
+ python -m training.pricing_glm \
+     --dataset data/output/claims_tabular_50k.csv \
+     --output-dir outputs/pricing-glm
+
+ # Phase 6: Document classifier
+ echo ""
+ echo "=== PHASE 6: Document Classifier ==="
+ python -m training.doc_classifier \
+     --dataset data/output/documents_10k.jsonl \
+     --output-dir outputs/doc-classifier \
+     --epochs 5 \
+     --batch-size 16
+
+ # Phase 7: Insurance NER
+ echo ""
+ echo "=== PHASE 7: Insurance NER ==="
+ python -m training.ner_model \
+     --dataset data/output/entities_8k.jsonl \
+     --output-dir outputs/insure-ner \
+     --epochs 5 \
+     --batch-size 16
+
+ # Phase 8: Evaluation (run_eval evaluates every available model by default)
+ echo ""
+ echo "=== PHASE 8: Running Evaluation Suite ==="
+ python -m evaluation.run_eval
+
+ echo ""
+ echo "============================================"
+ echo " Training Complete! $(date)"
+ echo " Models saved in outputs/"
+ echo "============================================"
+ ls -la outputs/
search/__init__.py ADDED
@@ -0,0 +1 @@
+ # InsureSearch — open-source search engine
search/api.py ADDED
@@ -0,0 +1,234 @@
+ """FastAPI REST API for InsureSearch — matching Azure AI Search API patterns.
+
+ Endpoints:
+     POST /search          — Hybrid search (vector + BM25 + reranker)
+     POST /search/vector   — Pure vector search
+     POST /search/keyword  — Pure BM25 keyword search
+     GET  /suggest         — Autocomplete suggestions
+     GET  /facets          — Get available filter facets
+     GET  /stats           — Index statistics
+     POST /index/build     — Trigger index rebuild
+     GET  /health          — Health check
+ """
13
+
14
+ import logging
15
+ import time
16
+ from typing import Optional
17
+
18
+ from fastapi import FastAPI, HTTPException, Query
19
+ from fastapi.middleware.cors import CORSMiddleware
20
+ from pydantic import BaseModel, Field
21
+
22
+ from search.config import (
23
+ API_DESCRIPTION, API_HOST, API_PORT, API_TITLE, API_VERSION,
24
+ DEFAULT_TOP_K, MAX_TOP_K,
25
+ )
26
+
27
+ logger = logging.getLogger(__name__)
28
+
29
+ # ── FastAPI app ────────────────────────────────────────────────────
30
+ app = FastAPI(
31
+ title=API_TITLE,
32
+ version=API_VERSION,
33
+ description=API_DESCRIPTION,
34
+ )
35
+
36
+ app.add_middleware(
37
+ CORSMiddleware,
38
+ allow_origins=["*"],
39
+ allow_credentials=True,
40
+ allow_methods=["*"],
41
+ allow_headers=["*"],
42
+ )
43
+
44
+ # ── Engine singleton ───────────────────────────────────────────────
45
+ _engine = None
46
+
47
+
48
+ def _get_engine():
49
+ global _engine
50
+ if _engine is None:
51
+ from search.hybrid_engine import HybridSearchEngine
52
+ _engine = HybridSearchEngine()
53
+ _engine.load()
54
+ return _engine
55
+
56
+
57
+ # ── Request/Response models ────────────────────────────────────────
58
+
59
+ class SearchRequest(BaseModel):
60
+ query: str = Field(..., min_length=1, max_length=1000,
61
+ description="Search query text")
62
+ top_k: int = Field(DEFAULT_TOP_K, ge=1, le=MAX_TOP_K,
63
+ description="Number of results to return")
64
+ filter_source: Optional[str] = Field(None,
65
+ description="Filter by source (e.g. 'wikipedia', 'fca_handbook')")
66
+ filter_category: Optional[str] = Field(None,
67
+ description="Filter by category")
68
+ use_reranker: bool = Field(True,
69
+ description="Apply cross-encoder reranking")
70
+
71
+
72
+ class SearchResultResponse(BaseModel):
73
+ chunk_id: str
74
+ doc_id: str
75
+ title: str
76
+ text: str
77
+ score: float
78
+ source: str
79
+ url: str
80
+ category: str
81
+ highlights: list[str]
82
+ vector_rank: Optional[int] = None
83
+ bm25_rank: Optional[int] = None
84
+ rerank_score: Optional[float] = None
85
+
86
+
87
+ class SearchResponse(BaseModel):
88
+ query: str
89
+ results: list[SearchResultResponse]
90
+ total_found: int
91
+ latency_ms: float
92
+ method: str
93
+ facets: dict
94
+
95
+
96
+ class StatsResponse(BaseModel):
97
+ bm25_chunks: int
98
+ bm25_terms: int
99
+ vector_stats: dict
100
+ facets: dict
101
+
102
+
103
+ # ── Endpoints ──────────────────────────────────────────────────────
104
+
105
+ @app.get("/health")
+ def health():
+     return {"status": "ok", "engine": API_TITLE, "version": API_VERSION}
+
+
+ @app.post("/search", response_model=SearchResponse)
+ def search(req: SearchRequest):
+     """Hybrid search (vector + BM25 + cross-encoder reranking)."""
+     engine = _get_engine()
+     result = engine.search(
+         query=req.query,
+         top_k=req.top_k,
+         method="hybrid",
+         filter_source=req.filter_source,
+         filter_category=req.filter_category,
+         use_reranker=req.use_reranker,
+     )
+     return _to_response(result)
+
+
+ @app.post("/search/vector", response_model=SearchResponse)
+ def search_vector(req: SearchRequest):
+     """Pure vector (semantic) search."""
+     engine = _get_engine()
+     result = engine.search(
+         query=req.query,
+         top_k=req.top_k,
+         method="vector",
+         filter_source=req.filter_source,
+         filter_category=req.filter_category,
+         use_reranker=req.use_reranker,
+     )
+     return _to_response(result)
+
+
+ @app.post("/search/keyword", response_model=SearchResponse)
+ def search_keyword(req: SearchRequest):
+     """Pure BM25 keyword search."""
+     engine = _get_engine()
+     result = engine.search(
+         query=req.query,
+         top_k=req.top_k,
+         method="bm25",
+         filter_source=req.filter_source,
+         filter_category=req.filter_category,
+         use_reranker=False,  # No reranker for pure keyword search
+     )
+     return _to_response(result)
+
+
+ @app.get("/suggest")
+ def suggest(
+     prefix: str = Query(..., min_length=2, max_length=100),
+     limit: int = Query(10, ge=1, le=50),
+ ):
+     """Autocomplete suggestions (like Azure AI Search Suggest)."""
+     engine = _get_engine()
+     suggestions = engine.suggest(prefix, limit)
+     return {"prefix": prefix, "suggestions": suggestions}
+
+
+ @app.get("/facets")
+ def facets():
+     """Get available filter facets with counts."""
+     engine = _get_engine()
+     return engine.get_facets()
+
+
+ @app.get("/stats", response_model=StatsResponse)
+ def stats():
+     """Get index statistics."""
+     engine = _get_engine()
+     from search.vector_store import get_collection_stats
+     return StatsResponse(
+         bm25_chunks=engine.bm25.doc_count,
+         bm25_terms=len(engine.bm25.inverted_index),
+         vector_stats=get_collection_stats(),
+         facets=engine.get_facets(),
+     )
+
+
+ @app.post("/index/build")
+ def build_index():
+     """Trigger full index rebuild from collected data."""
+     from search.indexer import build_index as do_build
+     do_build(force_rebuild=True)
+     # Drop the cached engine so the next request reloads the fresh index
+     global _engine
+     _engine = None
+     return {"status": "ok", "message": "Index rebuilt successfully"}
+
+
+ def _to_response(result) -> SearchResponse:
+     """Convert internal SearchResponse to API SearchResponse."""
+     return SearchResponse(
+         query=result.query,
+         results=[
+             SearchResultResponse(
+                 chunk_id=r.chunk_id,
+                 doc_id=r.doc_id,
+                 title=r.title,
+                 text=r.text[:1000],  # Cap text length in the response
+                 score=r.score,
+                 source=r.source,
+                 url=r.url or "",
+                 category=r.category,
+                 highlights=r.highlights,
+                 vector_rank=r.vector_rank,
+                 bm25_rank=r.bm25_rank,
+                 rerank_score=r.rerank_score,
+             )
+             for r in result.results
+         ],
+         total_found=result.total_found,
+         latency_ms=result.latency_ms,
+         method=result.method,
+         facets=result.facets,
+     )
+
+
+ def start():
+     """Start the API server."""
+     import uvicorn
+     logging.basicConfig(level=logging.INFO,
+                         format="%(asctime)s [%(levelname)s] %(message)s")
+     uvicorn.run(app, host=API_HOST, port=API_PORT)
+
+
+ if __name__ == "__main__":
+     start()
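
For reference, a minimal client sketch against the endpoints above. The request fields mirror the `SearchRequest` usage in the handlers; the base URL assumes the `API_HOST`/`API_PORT` defaults from `search/config.py`, and the query text and `policy_wording` facet value are illustrative only:

```python
# Minimal client sketch (assumes the server was started via start() on the
# default port 8900; query and facet values are illustrative).
import requests

BASE = "http://localhost:8900"

# Hybrid search: vector + BM25 + cross-encoder reranking
resp = requests.post(f"{BASE}/search", json={
    "query": "subsidence exclusions in UK home insurance",
    "top_k": 5,
    "filter_category": "policy_wording",  # hypothetical facet value
    "use_reranker": True,
})
for hit in resp.json()["results"]:
    print(f"{hit['score']:.3f}  {hit['title']}  (bm25 rank: {hit['bm25_rank']})")

# Autocomplete and facet counts
print(requests.get(f"{BASE}/suggest", params={"prefix": "subsid"}).json())
print(requests.get(f"{BASE}/facets").json())
```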
search/bm25.py ADDED
@@ -0,0 +1,232 @@
+ """BM25 keyword search index with persistence."""
+
+ import logging
+ import math
+ import pickle
+ import re
+ from collections import Counter, defaultdict
+ from pathlib import Path
+ from typing import Optional
+
+ from search.config import BM25_DIR
+ from search.models import Chunk
+
+ logger = logging.getLogger(__name__)
+
+
+ class BM25Index:
+     """Okapi BM25 ranking implementation with persistence.
+
+     A from-scratch BM25 implementation with no external dependency,
+     comparable in ranking quality to rank_bm25 or Lucene's BM25, with:
+     - Configurable k1 and b parameters
+     - General and insurance-domain stop words
+     - Light stemming-style normalization via regex tokenization
+     - Persistent storage
+     """
+
+     def __init__(self, k1: float = 1.5, b: float = 0.75):
+         self.k1 = k1
+         self.b = b
+         self.doc_count = 0
+         self.avg_doc_len = 0.0
+         # chunk_id -> tokenized doc
+         self.docs: dict[str, list[str]] = {}
+         # chunk_id -> Chunk metadata
+         self.chunk_meta: dict[str, dict] = {}
+         # term -> set of chunk_ids
+         self.inverted_index: dict[str, set[str]] = defaultdict(set)
+         # term -> document frequency
+         self.df: dict[str, int] = defaultdict(int)
+         # chunk_id -> doc length (in tokens)
+         self.doc_lengths: dict[str, int] = {}
+
+     # ── Stop words (general + insurance-domain) ────────────────────
+     STOP_WORDS = {
+         "a", "an", "the", "is", "it", "in", "on", "at", "to", "for",
+         "of", "and", "or", "but", "not", "with", "by", "from", "as",
+         "be", "was", "were", "been", "being", "have", "has", "had",
+         "do", "does", "did", "will", "would", "could", "should",
+         "may", "might", "shall", "can", "this", "that", "these",
+         "those", "i", "you", "he", "she", "we", "they", "me",
+         "him", "her", "us", "them", "my", "your", "his", "its",
+         "our", "their", "what", "which", "who", "whom", "how",
+         "when", "where", "why", "all", "each", "every", "both",
+         "few", "more", "most", "other", "some", "such", "no",
+         "nor", "only", "own", "same", "so", "than", "too", "very",
+         "just", "also", "if", "then", "else", "about", "up", "out",
+         "any", "are", "into", "over", "after", "before", "between",
+     }
+
+     def _tokenize(self, text: str) -> list[str]:
+         """Tokenize text with light normalization."""
+         text = text.lower()
+         # Keep alphanumeric runs and internal hyphens (e.g. "co-insurance")
+         tokens = re.findall(r'[a-z0-9](?:[a-z0-9-]*[a-z0-9])?', text)
+         # Remove stop words and single-character tokens
+         return [t for t in tokens if t not in self.STOP_WORDS and len(t) > 1]
+
+     def add_chunk(self, chunk: Chunk):
+         """Add a single chunk to the BM25 index."""
+         tokens = self._tokenize(chunk.text)
+         if not tokens:
+             return
+
+         self.docs[chunk.chunk_id] = tokens
+         self.doc_lengths[chunk.chunk_id] = len(tokens)
+         self.chunk_meta[chunk.chunk_id] = {
+             "doc_id": chunk.doc_id,
+             "title": chunk.title,
+             "source": chunk.source,
+             "category": chunk.category,
+             "text": chunk.text[:500],
+         }
+
+         # Update inverted index and document frequencies
+         unique_terms = set(tokens)
+         for term in unique_terms:
+             self.inverted_index[term].add(chunk.chunk_id)
+             self.df[term] += 1
+
+         self.doc_count += 1
+         # Update average document length (running average)
+         self.avg_doc_len = (
+             (self.avg_doc_len * (self.doc_count - 1) + len(tokens))
+             / self.doc_count
+         )
+
+     def add_chunks(self, chunks: list[Chunk]):
+         """Add multiple chunks to the index."""
+         for chunk in chunks:
+             self.add_chunk(chunk)
+         logger.info(f"BM25 index: {self.doc_count} chunks, "
+                     f"{len(self.inverted_index)} unique terms")
+
+     def _idf(self, term: str) -> float:
+         """Compute inverse document frequency for a term."""
+         if term not in self.df:
+             return 0.0
+         n = self.doc_count
+         df = self.df[term]
+         return math.log((n - df + 0.5) / (df + 0.5) + 1.0)
+
+     def search(self, query: str, top_k: int = 10,
+                filter_source: Optional[str] = None,
+                filter_category: Optional[str] = None) -> list[tuple[str, float]]:
+         """Search the BM25 index. Returns a list of (chunk_id, score)."""
+         query_tokens = self._tokenize(query)
+         if not query_tokens:
+             return []
+
+         # Find candidate chunks (union of all query term posting lists)
+         candidates = set()
+         for token in query_tokens:
+             candidates |= self.inverted_index.get(token, set())
+
+         if not candidates:
+             return []
+
+         # Apply metadata filters
+         if filter_source:
+             candidates = {
+                 c for c in candidates
+                 if self.chunk_meta.get(c, {}).get("source") == filter_source
+             }
+         if filter_category:
+             candidates = {
+                 c for c in candidates
+                 if self.chunk_meta.get(c, {}).get("category") == filter_category
+             }
+
+         # Score each candidate with the Okapi BM25 formula
+         scores: list[tuple[str, float]] = []
+         for chunk_id in candidates:
+             score = 0.0
+             doc_tokens = self.docs[chunk_id]
+             doc_len = self.doc_lengths[chunk_id]
+             tf_counter = Counter(doc_tokens)
+
+             for term in query_tokens:
+                 if term not in tf_counter:
+                     continue
+                 tf = tf_counter[term]
+                 idf = self._idf(term)
+                 numerator = tf * (self.k1 + 1)
+                 denominator = tf + self.k1 * (
+                     1 - self.b + self.b * doc_len / max(self.avg_doc_len, 1)
+                 )
+                 score += idf * numerator / denominator
+
+             if score > 0:
+                 scores.append((chunk_id, score))
+
+         # Sort by score descending
+         scores.sort(key=lambda x: x[1], reverse=True)
+         return scores[:top_k]
+
+     def get_suggestions(self, prefix: str, limit: int = 10) -> list[str]:
+         """Get autocomplete suggestions based on indexed terms."""
+         prefix = prefix.lower().strip()
+         if len(prefix) < 2:
+             return []
+         matches = [
+             term for term in self.inverted_index
+             if term.startswith(prefix) and self.df[term] >= 2
+         ]
+         # Sort by document frequency (more common = better suggestion)
+         matches.sort(key=lambda t: self.df[t], reverse=True)
+         return matches[:limit]
+
+     def get_facets(self) -> dict[str, dict[str, int]]:
+         """Get facet counts for filtering."""
+         source_counts: dict[str, int] = defaultdict(int)
+         category_counts: dict[str, int] = defaultdict(int)
+
+         for meta in self.chunk_meta.values():
+             source_counts[meta.get("source", "unknown")] += 1
+             category_counts[meta.get("category", "unknown")] += 1
+
+         return {
+             "sources": dict(source_counts),
+             "categories": dict(category_counts),
+         }
+
+     def save(self, path: Path = BM25_DIR / "bm25_index.pkl"):
+         """Persist the BM25 index to disk."""
+         data = {
+             "k1": self.k1,
+             "b": self.b,
+             "doc_count": self.doc_count,
+             "avg_doc_len": self.avg_doc_len,
+             "docs": self.docs,
+             "chunk_meta": self.chunk_meta,
+             "inverted_index": {k: list(v) for k, v in self.inverted_index.items()},
+             "df": dict(self.df),
+             "doc_lengths": self.doc_lengths,
+         }
+         with open(path, "wb") as f:
+             pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
+         size_mb = path.stat().st_size / 1024 / 1024
+         logger.info(f"BM25 index saved: {path} ({size_mb:.1f} MB)")
+
+     def load(self, path: Path = BM25_DIR / "bm25_index.pkl") -> bool:
+         """Load BM25 index from disk. Returns True if the file existed."""
+         if not path.exists():
+             return False
+         with open(path, "rb") as f:
+             data = pickle.load(f)
+         self.k1 = data["k1"]
+         self.b = data["b"]
+         self.doc_count = data["doc_count"]
+         self.avg_doc_len = data["avg_doc_len"]
+         self.docs = data["docs"]
+         self.chunk_meta = data["chunk_meta"]
+         # Restore as defaultdict(set) so add_chunk keeps working after a load
+         self.inverted_index = defaultdict(
+             set, {k: set(v) for k, v in data["inverted_index"].items()}
+         )
+         self.df = defaultdict(int, data["df"])
+         self.doc_lengths = data["doc_lengths"]
+         logger.info(f"BM25 index loaded: {self.doc_count} chunks, "
+                     f"{len(self.inverted_index)} terms")
+         return True
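
For reference, `_idf` and the scoring loop in `search` above implement the standard Okapi BM25 formula:

```latex
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
  \frac{tf(t, D)\,(k_1 + 1)}
       {tf(t, D) + k_1 \left( 1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}} \right)},
\qquad
\mathrm{IDF}(t) = \ln\left( \frac{N - df(t) + 0.5}{df(t) + 0.5} + 1 \right)
```

where `N` is the number of indexed chunks, `df(t)` the number of chunks containing term `t`, `|D|` the chunk length in tokens, and `avgdl` the running average chunk length. The defaults `k1 = 1.5` and `b = 0.75` match the constructor, and the `+ 1` inside the log keeps the IDF non-negative for terms appearing in more than half the chunks (the same variant Lucene uses).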
search/config.py ADDED
@@ -0,0 +1,55 @@
+ """Configuration for the InsureSearch engine."""
+
+ from pathlib import Path
+
+ # ── Paths ──────────────────────────────────────────────────────────
+ BASE_DIR = Path(__file__).resolve().parent.parent
+ SEARCH_DIR = BASE_DIR / "search"
+ INDEX_DIR = SEARCH_DIR / "index_data"
+ QDRANT_DIR = INDEX_DIR / "qdrant_storage"
+ BM25_DIR = INDEX_DIR / "bm25_storage"
+ METADATA_DB = INDEX_DIR / "metadata.db"
+
+ INDEX_DIR.mkdir(parents=True, exist_ok=True)
+ QDRANT_DIR.mkdir(parents=True, exist_ok=True)
+ BM25_DIR.mkdir(parents=True, exist_ok=True)
+
+ # ── Embedding model ────────────────────────────────────────────────
+ # BAAI/bge-small-en-v1.5: 33M params, 384-dim embeddings, fast, strong quality
+ EMBEDDING_MODEL = "BAAI/bge-small-en-v1.5"
+ EMBEDDING_DIM = 384
+ EMBEDDING_BATCH_SIZE = 64
+
+ # ── Reranker model ─────────────────────────────────────────────────
+ # Cross-encoder for second-stage reranking
+ RERANKER_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"
+ RERANKER_TOP_K = 20  # rerank the top K candidates
+
+ # ── Qdrant ─────────────────────────────────────────────────────────
+ QDRANT_COLLECTION = "insurance_docs"
+
+ # ── Search defaults ────────────────────────────────────────────────
+ DEFAULT_TOP_K = 10
+ MAX_TOP_K = 100
+ CHUNK_SIZE = 512    # tokens per chunk
+ CHUNK_OVERLAP = 64  # token overlap between consecutive chunks
+
+ # ── Hybrid search weights ──────────────────────────────────────────
+ # RRF (Reciprocal Rank Fusion) constant
+ RRF_K = 60
+
+ # Weights for combining vector vs. BM25 scores (0.0 = all BM25, 1.0 = all vector)
+ VECTOR_WEIGHT = 0.6
+ BM25_WEIGHT = 0.4
+
+ # ── API ────────────────────────────────────────────────────────────
+ API_HOST = "0.0.0.0"
+ API_PORT = 8900
+ API_TITLE = "InsureSearch"
+ API_VERSION = "1.0.0"
+ API_DESCRIPTION = (
+     "Open-source hybrid search engine for insurance documents. "
+     "Combines dense vector search (BGE) and sparse keyword search (BM25) "
+     "with cross-encoder reranking. Designed to match or exceed Azure AI "
+     "Search for the insurance domain."
+ )
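
The hybrid engine that consumes `RRF_K`, `VECTOR_WEIGHT`, and `BM25_WEIGHT` is defined elsewhere in the repo, but the constants imply a weighted Reciprocal Rank Fusion over the vector and BM25 rankings. A minimal sketch under that assumption (the function name and shape are illustrative, not the repo's actual API):

```python
# Illustrative weighted RRF fusion using the constants above; the repo's
# actual hybrid engine may combine the rankings differently.
from collections import defaultdict

RRF_K = 60
VECTOR_WEIGHT = 0.6
BM25_WEIGHT = 0.4


def rrf_fuse(vector_hits: list[str], bm25_hits: list[str],
             top_k: int = 10) -> list[tuple[str, float]]:
    """Fuse two ranked lists of chunk_ids: each list contributes
    weight / (RRF_K + rank) for every chunk it ranks."""
    fused: dict[str, float] = defaultdict(float)
    for rank, chunk_id in enumerate(vector_hits, start=1):
        fused[chunk_id] += VECTOR_WEIGHT / (RRF_K + rank)
    for rank, chunk_id in enumerate(bm25_hits, start=1):
        fused[chunk_id] += BM25_WEIGHT / (RRF_K + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Chunks ranked highly by both retrievers accumulate the largest fused score:
print(rrf_fuse(["a", "b", "c"], ["b", "d", "a"]))
```

Rank fusion like this sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales: only the ranks matter, and `RRF_K = 60` damps the influence of any single top-ranked hit.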