INSUREOS Models: Complete Insurance AI Training Pipeline

Created by Bytical AI: AI agents that run insurance operations.

Overview

INSUREOS is a complete AI/ML training and inference pipeline for UK insurance operations. This repository contains all source code for data generation, model training, evaluation, data collection, and a hybrid search engine.

Model Suite

| Model | HuggingFace | Task | Key Metric |
|---|---|---|---|
| InsureLLM-4B | piyushptiwari/InsureLLM-4B | Insurance domain LLM | ROUGE-1: 0.384 |
| InsureDocClassifier | piyushptiwari/InsureDocClassifier | 12-class document classification | F1: 1.0 |
| InsureNER | piyushptiwari/InsureNER | 13-entity NER | F1: 1.0 |
| InsureFraudNet | piyushptiwari/InsureFraudNet | Fraud detection (3 LoB) | AUC-ROC: 1.0 |
| InsurePricing | piyushptiwari/InsurePricing | Premium pricing (GLM + EBM) | MAE: £11,132 |
| InsureSearch | (included in this repo) | Hybrid search engine | 33K chunks indexed |
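For readers unfamiliar with the InsureLLM-4B metric, the following is a minimal sketch of ROUGE-1 as a unigram-overlap F1 score. It assumes lowercase whitespace tokenization with no stemming; the function name `rouge1_f1` is hypothetical, and the scoring in `evaluation/run_eval.py` may differ in tokenization or in which ROUGE variant it reports.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference text and a candidate text."""
    # Assumption: lowercase whitespace tokens, no stemming or stopword removal.
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: a token is matched at most as often as it occurs in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge1_f1("the claim was approved", "the claim was rejected"))
```

A score of 0.384 on this scale means that, on average, a little over a third of the unigrams are shared between generated and reference answers once precision and recall are balanced.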

Training Dataset

piyushptiwari/insureos-training-data: 10K SFT, 5K DPO, 50K tabular, 10K docs, 8K NER

Repository Structure

insureos-models/
├── data/                        # Synthetic data generation
│   ├── constants.py             # UK insurance constants (regions, perils, regulators)
│   ├── gen_sft.py               # Generate SFT instruction-response pairs
│   ├── gen_dpo.py               # Generate DPO preference pairs
│   ├── gen_documents.py         # Generate insurance documents (12 classes)
│   ├── gen_ner.py               # Generate NER-annotated text
│   ├── gen_tabular.py           # Generate claims tabular data
│   └── generate_all.py          # Run all generators
│
├── collect/                     # Real-world data collection
│   ├── config.py                # Scraping targets and configuration
│   ├── scraper_base.py          # Base HTTP scraper with caching
│   ├── convert_sft.py           # Convert raw docs → SFT/DPO format
│   ├── run_fast.py              # Fast collection orchestrator
│   └── sources/                 # Per-source scrapers
│       ├── wikipedia.py         # Wikipedia insurance articles
│       ├── legislation.py       # UK legislation (legislation.gov.uk)
│       ├── fca.py               # FCA Handbook
│       ├── hf_datasets.py       # HuggingFace insurance datasets
│       ├── rss_news.py          # Insurance news RSS feeds
│       └── education.py         # Insurance education resources
│
├── training/                    # Model training scripts
│   ├── qlora_finetune.py        # QLoRA fine-tuning (Qwen3-4B)
│   ├── dpo_train.py             # DPO alignment training
│   ├── retrain_realworld.py     # Real-world data retraining
│   ├── doc_classifier.py        # ModernBERT document classifier
│   ├── ner_model.py             # ModernBERT NER model
│   ├── fraud_model.py           # XGBoost + Isolation Forest fraud
│   ├── pricing_glm.py           # Tweedie GLM + EBM pricing
│   └── distill.py               # Model distillation (experimental)
│
├── evaluation/                  # Evaluation suite
│   ├── run_eval.py              # Full multi-model evaluation
│   └── results/                 # Evaluation results (JSON)
│
├── search/                      # Hybrid search engine
│   ├── config.py                # Search configuration
│   ├── embedder.py              # BGE-small-en-v1.5 embedding service
│   ├── bm25.py                  # Custom Okapi BM25 implementation
│   ├── vector_store.py          # Qdrant vector store
│   ├── reranker.py              # Cross-encoder reranker
│   ├── hybrid_engine.py         # RRF fusion (vector + BM25 + reranker)
│   ├── indexer.py               # Document ingestion pipeline
│   ├── models.py                # Pydantic data models
│   └── api.py                   # FastAPI REST API
│
├── serve/                       # Model serving
│   └── api.py                   # FastAPI inference endpoints
│
└── scripts/                     # Automation
    ├── setup.sh                 # Environment setup (NVIDIA, Python, deps)
    └── train_all.sh             # Full training pipeline script

Quick Start

1. Environment Setup

# Create virtual environment
python3 -m venv .venv && source .venv/bin/activate

# Install dependencies
pip install torch transformers trl peft bitsandbytes
pip install xgboost scikit-learn interpret
pip install sentence-transformers qdrant-client fastapi uvicorn

2. Generate Training Data

python -m data.generate_all
# Outputs: data/output/ (SFT, DPO, docs, NER, tabular)

3. Train Models

# Train all models sequentially
bash scripts/train_all.sh

# Or individually:
python training/qlora_finetune.py          # InsureLLM QLoRA
python training/dpo_train.py               # InsureLLM DPO
python training/doc_classifier.py          # Document classifier
python training/ner_model.py               # NER model
python training/fraud_model.py             # Fraud detection
python training/pricing_glm.py             # Pricing models

4. Evaluate

python evaluation/run_eval.py
# Results saved to evaluation/results/

5. Run Search Engine

# Index documents
python search/indexer.py

# Start API
python search/api.py
# API at http://localhost:8900
# Endpoints: /search, /search/vector, /search/keyword, /suggest, /facets, /stats

Search Engine: InsureSearch

A hybrid search engine designed as an open-source alternative to Azure AI Search, built entirely on open-source components:

| Component | Technology | Details |
|---|---|---|
| Vector Search | BGE-small-en-v1.5 (384-dim) + Qdrant | Semantic similarity |
| Keyword Search | Custom Okapi BM25 | Insurance-aware tokenization |
| Reranking | cross-encoder/ms-marco-MiniLM-L-6-v2 | Cross-encoder reranking |
| Fusion | Reciprocal Rank Fusion (RRF) | Vector 60% + BM25 40% |
| API | FastAPI | REST API with facets, suggestions |
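The fusion row can be illustrated with a minimal sketch of weighted Reciprocal Rank Fusion. It assumes the 60/40 split scales each retriever's RRF contribution and uses the conventional smoothing constant k=60; neither detail is stated above, and the function name `weighted_rrf` is hypothetical, so `search/hybrid_engine.py` may combine the lists differently.

```python
from typing import Dict, List, Tuple


def weighted_rrf(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,  # assumption: conventional RRF constant, not confirmed by the repo
) -> List[Tuple[str, float]]:
    """Fuse ranked ID lists: score(doc) = sum over retrievers of weight / (k + rank)."""
    scores: Dict[str, float] = {}
    for retriever, ranked_ids in rankings.items():
        weight = weights[retriever]
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    fused = weighted_rrf(
        rankings={
            "vector": ["d1", "d2", "d3"],  # hypothetical document IDs
            "bm25": ["d3", "d1", "d4"],
        },
        weights={"vector": 0.6, "bm25": 0.4},
    )
    for doc_id, score in fused:
        print(doc_id, round(score, 5))
```

Because RRF works on ranks rather than raw scores, cosine similarities and BM25 scores do not need to be normalized onto a common scale before fusion. The top fused candidates would then typically be passed to the cross-encoder for reranking.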

Index stats: 33,034 chunks from 31,679 documents, 51,640 BM25 terms.
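As a rough illustration of the keyword side, here is a minimal Okapi BM25 sketch. The parameters k1=1.5 and b=0.75 are common defaults and are assumptions, as is the `tokenize` helper, which is a hypothetical stand-in for "insurance-aware tokenization" that keeps hyphenated or slashed identifiers such as policy numbers as single tokens. The implementation in `search/bm25.py` may differ.

```python
import math
import re
from collections import Counter
from typing import List

TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-/][a-z0-9]+)*")


def tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: lowercase, keep IDs like 'pol-2024-001' whole."""
    return TOKEN_RE.findall(text.lower())


class BM25:
    def __init__(self, docs: List[List[str]], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.term_freqs = [Counter(doc) for doc in docs]
        self.lengths = [len(doc) for doc in docs]
        self.avgdl = sum(self.lengths) / len(docs)
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for tf in self.term_freqs for term in tf)

    def idf(self, term: str) -> float:
        n = self.df.get(term, 0)
        total = len(self.term_freqs)
        return math.log(1 + (total - n + 0.5) / (n + 0.5))

    def score(self, query: List[str], idx: int) -> float:
        tf = self.term_freqs[idx]
        # Length normalization: longer-than-average documents are penalized.
        norm = self.k1 * (1 - self.b + self.b * self.lengths[idx] / self.avgdl)
        return sum(
            self.idf(t) * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in query
            if t in tf
        )


if __name__ == "__main__":
    corpus = [
        tokenize("Flood damage claim for home policy"),
        tokenize("Motor claim after collision"),
        tokenize("Subsidence exclusion in home policy"),
    ]
    bm25 = BM25(corpus)
    query = tokenize("flood claim")
    print([round(bm25.score(query, i), 3) for i in range(len(corpus))])
```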

Training Pipeline

Stage 1: Synthetic Data Generation
├── 10K SFT instruction-response pairs
├── 5K DPO preference pairs
├── 50K tabular claims (Motor/Property/Liability)
├── 10K insurance documents (12 classes)
└── 8K NER-annotated texts (13 entity types)

Stage 2: QLoRA Fine-Tuning → Qwen3-4B
├── rank=64, alpha=128, all-linear targets
├── 2 epochs, batch=2, grad_accum=4
├── Final: train_loss=0.012, eval_loss=0.118
└── Token accuracy: 95.88%
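The Stage 2 hyperparameters imply two derived quantities that are worth making explicit: the LoRA scaling factor and the effective batch size. The sketch below works them out under two assumptions: standard LoRA scaling (alpha divided by rank) and a single GPU. The `QLoRAHyperparams` class and the example count of 1,000 training examples are hypothetical and not taken from `training/qlora_finetune.py`.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class QLoRAHyperparams:
    rank: int = 64
    alpha: int = 128
    epochs: int = 2
    per_device_batch: int = 2
    grad_accum: int = 4

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update by alpha / rank.
        return self.alpha / self.rank

    @property
    def effective_batch(self) -> int:
        # Assumption: one GPU, so no multiplication by device count.
        return self.per_device_batch * self.grad_accum

    def optimizer_steps(self, num_examples: int) -> int:
        # One optimizer step per effective batch; a partial final batch counts.
        steps_per_epoch = math.ceil(num_examples / self.effective_batch)
        return steps_per_epoch * self.epochs


if __name__ == "__main__":
    hp = QLoRAHyperparams()
    print("LoRA scaling:", hp.lora_scaling)
    print("Effective batch size:", hp.effective_batch)
    print("Steps for 1,000 examples:", hp.optimizer_steps(1000))
```

Gradient accumulation is what makes a per-device batch of 2 workable on a 16 GB card: gradients from four small batches are summed before each optimizer update.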

Stage 3: DPO Alignment
├── 5K preference pairs
├── 149 steps, reward_accuracy=1.0
└── Reward margin: 26.76
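To show how figures like reward accuracy and reward margin are usually derived, here is a minimal sketch of the standard DPO objective in plain Python. It assumes beta=0.1 and sequence-level log-probabilities as inputs; the function name `dpo_stats` and the example numbers are hypothetical, and `training/dpo_train.py` may use different settings.

```python
import math
from typing import Dict, List, Tuple

# (policy_chosen, policy_rejected, reference_chosen, reference_rejected) log-probs
LogProbs = Tuple[float, float, float, float]


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_stats(batch: List[LogProbs], beta: float = 0.1) -> Dict[str, float]:
    losses, margins, correct = [], [], 0
    for pol_c, pol_r, ref_c, ref_r in batch:
        # Implicit rewards: how far the policy moved from the reference model.
        reward_chosen = beta * (pol_c - ref_c)
        reward_rejected = beta * (pol_r - ref_r)
        margin = reward_chosen - reward_rejected
        # DPO loss is -log(sigmoid(margin)), which equals softplus(-margin).
        losses.append(softplus(-margin))
        margins.append(margin)
        correct += margin > 0
    n = len(batch)
    return {
        "loss": sum(losses) / n,
        "reward_accuracy": correct / n,
        "reward_margin": sum(margins) / n,
    }


if __name__ == "__main__":
    example = [
        (-10.0, -30.0, -12.0, -14.0),  # hypothetical log-probs
        (-8.0, -9.0, -8.0, -8.5),
    ]
    print(dpo_stats(example))
```

Under this definition, a reward accuracy of 1.0 means the chosen response received the higher implicit reward in every evaluated pair, and the reward margin is the average gap between the two.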

Stage 4: Real-World Data Collection
├── Wikipedia (150 docs), UK Legislation (692)
├── HuggingFace datasets (31,060), RSS (50), Education (88)
├── Converted to 3,685 SFT + 776 DPO pairs
└── Quality filtered (English-only, no echo responses)
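The quality filter in Stage 4 could look roughly like the sketch below. Both heuristics are assumptions, not the logic in `collect/convert_sft.py`: "English-only" is approximated by the share of ASCII letters, which only rejects non-Latin scripts and would let other Latin-script languages through, and an "echo response" is taken to mean a response that is empty or merely repeats the prompt. All function names are hypothetical.

```python
import re


def looks_english(text: str, min_ascii_letter_ratio: float = 0.9) -> bool:
    """Crude script check: most alphabetic characters must be ASCII."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(ch.isascii() for ch in letters)
    return ascii_letters / len(letters) >= min_ascii_letter_ratio


def _normalize(text: str) -> str:
    # Lowercase and collapse punctuation/whitespace so trivial edits still match.
    return re.sub(r"\W+", " ", text.lower()).strip()


def is_echo(prompt: str, response: str) -> bool:
    """True when the response is empty or just repeats the prompt."""
    normalized_response = _normalize(response)
    return not normalized_response or normalized_response == _normalize(prompt)


def keep_pair(prompt: str, response: str) -> bool:
    return (
        looks_english(prompt)
        and looks_english(response)
        and not is_echo(prompt, response)
    )


if __name__ == "__main__":
    pairs = [
        ("What is an excess?", "An excess is the amount you pay towards a claim."),
        ("What is an excess?", "what is an excess"),
    ]
    print([keep_pair(p, r) for p, r in pairs])
```

A production filter would more likely use a language-identification model than a character-ratio check; the ratio is used here only to keep the sketch dependency-free.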

Stage 5: Real-World Retraining
├── 876 steps on real-world SFT data
└── Claims process score improved 0.40 → 0.60

Stage 6: Specialized Models (parallel)
├── FraudNet: XGBoost + Isolation Forest → AUC-ROC 1.0
├── PricingGLM: Tweedie GLM + EBM → MAE £11,132
├── DocClassifier: ModernBERT → F1 1.0
└── InsureNER: ModernBERT → F1 1.0
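For the pricing model, a Tweedie GLM is a common choice because claim amounts are typically zero for most policies and right-skewed otherwise. The sketch below computes the Tweedie unit deviance that such a model minimizes, for a power parameter strictly between 1 and 2 (the compound Poisson-gamma case). The value power=1.5 and the function name are assumptions; the setting used in `training/pricing_glm.py` is not stated above.

```python
def tweedie_unit_deviance(y: float, mu: float, power: float = 1.5) -> float:
    """Tweedie unit deviance for 1 < power < 2, with y >= 0 and mu > 0."""
    if not 1.0 < power < 2.0:
        raise ValueError("this sketch only covers 1 < power < 2")
    if y < 0 or mu <= 0:
        raise ValueError("requires y >= 0 and mu > 0")
    return 2.0 * (
        y ** (2.0 - power) / ((1.0 - power) * (2.0 - power))
        - y * mu ** (1.0 - power) / (1.0 - power)
        + mu ** (2.0 - power) / (2.0 - power)
    )


def mean_tweedie_deviance(y_true, y_pred, power: float = 1.5) -> float:
    """Average unit deviance over paired observations and predictions."""
    pairs = list(zip(y_true, y_pred))
    return sum(tweedie_unit_deviance(y, mu, power) for y, mu in pairs) / len(pairs)


if __name__ == "__main__":
    observed = [0.0, 0.0, 1200.0, 0.0, 8500.0]      # hypothetical claim amounts
    predicted = [150.0, 90.0, 900.0, 200.0, 6000.0]  # hypothetical predictions
    print(mean_tweedie_deviance(observed, predicted))
```

Unlike squared error, this deviance stays finite and well defined when the observed amount is exactly zero, which is why it suits claims data with many zero-claim policies.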

Tech Stack

  • LLM: Qwen3-4B + QLoRA + DPO (PyTorch, Transformers, TRL, PEFT, bitsandbytes)
  • Classification & NER: ModernBERT-base (Transformers)
  • Fraud Detection: XGBoost + Isolation Forest (scikit-learn)
  • Pricing: Tweedie GLM (scikit-learn) + EBM (InterpretML)
  • Search: BGE-small-en-v1.5 + Qdrant + BM25 + cross-encoder
  • Training GPU: NVIDIA Tesla T4 16GB

Citation

@misc{bytical2026insureos,
  title={INSUREOS: A Complete AI/ML Suite for UK Insurance Operations},
  author={Bytical AI},
  year={2026},
  url={https://huggingface.co/piyushptiwari/insureos-models}
}

About Bytical AI

Bytical builds AI agents that run insurance operations: claims automation, underwriting intelligence, digital sales, and core system modernization for insurers across the UK and Europe. Microsoft AI Partner | NVIDIA | Salesforce.
