# Fraud Simulator Dataset

## Overview

This dataset contains synthetic insurance claims for fraud detection training and validation.

## Dataset Structure

### Files

- `claims_normal.csv` - Legitimate insurance claims
- `claims_fraudulent.csv` - Fraudulent insurance claims
- `claims_combined.csv` - Combined dataset with labels
- `metadata.json` - Dataset metadata and statistics

### Schema

**Claim Record:**

```json
{
  "claim_id": "string",
  "amount": "float",
  "type": "string (auto|property|health|life)",
  "claimant_id": "string",
  "days_since_policy_start": "integer",
  "claimant_history": {
    "claim_count": "integer",
    "avg_amount": "float",
    "total_paid": "float"
  },
  "document_consistency_score": "float (0.0-1.0)",
  "linked_suspicious_entities": "integer",
  "label": "string (fraud|legitimate)"
}
```

## Fraud Patterns Included

1. **Staged Accidents**: Multiple claims with similar patterns
2. **Document Mismatch**: Inconsistent documentation
3. **Early Claims**: Claims filed shortly after policy inception
4. **Amount Inflation**: Claims significantly above average
5. **Entity Networks**: Connected suspicious entities
6. **High Frequency**: Repeated claims from same claimant

## Dataset Statistics

- **Total Claims**: 10,000
- **Fraudulent**: 2,500 (25%)
- **Legitimate**: 7,500 (75%)
- **Claim Types**: Auto (40%), Property (30%), Health (20%), Life (10%)
- **Average Claim Amount**: $5,000
- **Date Range**: 2020-2026

## Usage

This dataset is used for:

- Model training and validation
- Fraud pattern simulation
- Stress testing
- Drift scenario testing
- Performance benchmarking

## Data Quality

- No missing values
- Balanced across claim types
- Realistic fraud patterns based on industry data
- Regular updates with new fraud patterns

## Privacy

All data is synthetic and does not contain real PII.

## License

For internal use only. Part of BDR-Agent-Factory ecosystem.
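
## Example: Checking a Claim Record Against the Schema

The claim record schema above lists field names, types, and allowed values, so a small validator can catch malformed records before they reach a model. The following is a minimal sketch, not part of the dataset tooling. It assumes a record has already been loaded into a nested Python `dict` shaped like the JSON schema; how the CSV files flatten `claimant_history` is not documented here, so that loading step is left out. The names `validate_claim`, `TOP_LEVEL_FIELDS`, and `HISTORY_FIELDS` are hypothetical, and the sample values are made up for illustration.

```python
# Sketch of a schema check for one claim record (helper names are hypothetical).
ALLOWED_TYPES = {"auto", "property", "health", "life"}
ALLOWED_LABELS = {"fraud", "legitimate"}

# Field name -> expected Python type, taken from the Schema section.
TOP_LEVEL_FIELDS = {
    "claim_id": str,
    "amount": float,
    "type": str,
    "claimant_id": str,
    "days_since_policy_start": int,
    "claimant_history": dict,
    "document_consistency_score": float,
    "linked_suspicious_entities": int,
    "label": str,
}
HISTORY_FIELDS = {"claim_count": int, "avg_amount": float, "total_paid": float}


def _type_ok(value, expected):
    """Return True if value matches the expected schema type."""
    # bool is a subclass of int in Python; no schema field is boolean.
    if isinstance(value, bool):
        return False
    # Assumption: whole numbers are acceptable where the schema says float.
    if expected is float:
        return isinstance(value, (int, float))
    return isinstance(value, expected)


def _check_fields(container, spec, prefix, problems):
    """Append a message for every missing or wrongly typed field."""
    for name, expected in spec.items():
        if name not in container:
            problems.append(f"missing field: {prefix}{name}")
        elif not _type_ok(container[name], expected):
            problems.append(f"wrong type for {prefix}{name}")


def validate_claim(record):
    """Return a list of problems found in record; an empty list means it passed."""
    problems = []
    _check_fields(record, TOP_LEVEL_FIELDS, "", problems)

    history = record.get("claimant_history")
    if isinstance(history, dict):
        _check_fields(history, HISTORY_FIELDS, "claimant_history.", problems)

    claim_type = record.get("type")
    if isinstance(claim_type, str) and claim_type not in ALLOWED_TYPES:
        problems.append(f"type not in allowed set: {claim_type!r}")

    label = record.get("label")
    if isinstance(label, str) and label not in ALLOWED_LABELS:
        problems.append(f"label not in allowed set: {label!r}")

    score = record.get("document_consistency_score")
    if _type_ok(score, float) and not 0.0 <= score <= 1.0:
        problems.append(f"document_consistency_score out of range: {score}")

    return problems


# Illustrative record; the values are invented, not drawn from the dataset.
example = {
    "claim_id": "C-0001",
    "amount": 4200.0,
    "type": "auto",
    "claimant_id": "P-0042",
    "days_since_policy_start": 12,
    "claimant_history": {"claim_count": 3, "avg_amount": 3900.0, "total_paid": 11700.0},
    "document_consistency_score": 0.35,
    "linked_suspicious_entities": 2,
    "label": "fraud",
}

if __name__ == "__main__":
    print(validate_claim(example))
```

Returning a list of messages rather than raising on the first failure makes it easier to report every problem in a record at once, which tends to be more useful when auditing a whole file.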