Fraud Simulator Dataset
Overview
This dataset contains synthetic insurance claims for fraud detection training and validation.
Dataset Structure
Files
- claims_normal.csv: Legitimate insurance claims
- claims_fraudulent.csv: Fraudulent insurance claims
- claims_combined.csv: Combined dataset with labels
- metadata.json: Dataset metadata and statistics
Schema
Claim Record:
{
"claim_id": "string",
"amount": "float",
"type": "string (auto|property|health|life)",
"claimant_id": "string",
"days_since_policy_start": "integer",
"claimant_history": {
"claim_count": "integer",
"avg_amount": "float",
"total_paid": "float"
},
"document_consistency_score": "float (0.0-1.0)",
"linked_suspicious_entities": "integer",
"label": "string (fraud|legitimate)"
}
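As a minimal sketch of how the schema above might be checked in code, the following validator tests one record's field types and allowed values. The function name `validate_claim` and its helpers are hypothetical, and it assumes records arrive as parsed JSON-style dictionaries; the CSV files may flatten `claimant_history` differently.

```python
from numbers import Real

CLAIM_TYPES = {"auto", "property", "health", "life"}
LABELS = {"fraud", "legitimate"}


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _is_number(value):
    # Accept ints and floats for "float" fields, but not booleans.
    return isinstance(value, Real) and not isinstance(value, bool)


def validate_claim(record):
    """Return a list of schema problems; an empty list means the record passed."""
    problems = []

    def check(ok, message):
        if not ok:
            problems.append(message)

    check(isinstance(record.get("claim_id"), str), "claim_id must be a string")
    check(_is_number(record.get("amount")), "amount must be a number")
    check(record.get("type") in CLAIM_TYPES, "type must be auto|property|health|life")
    check(isinstance(record.get("claimant_id"), str), "claimant_id must be a string")
    check(_is_int(record.get("days_since_policy_start")),
          "days_since_policy_start must be an integer")

    history = record.get("claimant_history")
    if isinstance(history, dict):
        check(_is_int(history.get("claim_count")), "claim_count must be an integer")
        check(_is_number(history.get("avg_amount")), "avg_amount must be a number")
        check(_is_number(history.get("total_paid")), "total_paid must be a number")
    else:
        problems.append("claimant_history must be an object")

    score = record.get("document_consistency_score")
    check(_is_number(score) and 0.0 <= score <= 1.0,
          "document_consistency_score must be in [0.0, 1.0]")
    check(_is_int(record.get("linked_suspicious_entities")),
          "linked_suspicious_entities must be an integer")
    check(record.get("label") in LABELS, "label must be fraud|legitimate")
    return problems
```

Returning a list of messages rather than raising on the first failure makes it easier to report every problem in a record at once.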
Fraud Patterns Included
- Staged Accidents: Multiple claims with similar patterns
- Document Mismatch: Inconsistent documentation
- Early Claims: Claims filed shortly after policy inception
- Amount Inflation: Claims significantly above average
- Entity Networks: Connected suspicious entities
- High Frequency: Repeated claims from same claimant
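Most of the patterns above map onto individual schema fields, so a rough rule-based tagger can be sketched as follows. Every threshold and the `pattern_flags` name are illustrative assumptions, not values taken from the dataset or its generator. Staged accidents are left out because detecting them requires comparing several claims rather than inspecting one record.

```python
# All thresholds below are illustrative assumptions, not values from the dataset.
EARLY_CLAIM_DAYS = 30        # "shortly after policy inception"
MIN_CONSISTENCY = 0.5        # below this, documents count as mismatched
INFLATION_RATIO = 3.0        # claim amount relative to claimant's average
MIN_LINKED_ENTITIES = 1      # any linked suspicious entity raises the flag
HIGH_FREQUENCY_CLAIMS = 5    # prior claims by the same claimant


def pattern_flags(record):
    """Return the names of the single-record fraud patterns a claim matches."""
    history = record["claimant_history"]
    flags = []
    if record["document_consistency_score"] < MIN_CONSISTENCY:
        flags.append("document_mismatch")
    if record["days_since_policy_start"] <= EARLY_CLAIM_DAYS:
        flags.append("early_claim")
    avg = history["avg_amount"]
    # Skip the ratio test when the claimant has no meaningful average.
    if avg > 0 and record["amount"] >= INFLATION_RATIO * avg:
        flags.append("amount_inflation")
    if record["linked_suspicious_entities"] >= MIN_LINKED_ENTITIES:
        flags.append("entity_network")
    if history["claim_count"] >= HIGH_FREQUENCY_CLAIMS:
        flags.append("high_frequency")
    return flags
```

Flags like these are useful as a sanity check on the labels or as baseline features; they are not a substitute for a trained model.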
Dataset Statistics
- Total Claims: 10,000
- Fraudulent: 2,500 (25%)
- Legitimate: 7,500 (75%)
- Claim Types: Auto (40%), Property (30%), Health (20%), Life (10%)
- Average Claim Amount: $5,000
- Date Range: 2020-2026
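One way to confirm that a loaded copy matches the proportions above is to compute the share of each value in a column. The helper below is a small sketch; the name `shares` is hypothetical, and it takes a plain list of values so it does not depend on how the CSV files are read.

```python
from collections import Counter


def shares(values):
    """Map each distinct value to its fraction of the total."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


# Labels built to mirror the stated 2,500 / 7,500 split.
labels = ["fraud"] * 2500 + ["legitimate"] * 7500
print(shares(labels))  # {'fraud': 0.25, 'legitimate': 0.75}
```

The same function can be applied to the `type` column to compare against the listed claim-type percentages.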
Usage
This dataset is used for:
- Model training and validation
- Fraud pattern simulation
- Stress testing
- Drift scenario testing
- Performance benchmarking
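For the training and validation use case, a split that preserves the fraud ratio in both parts is usually preferable to a plain random split. The following is a minimal sketch using only the standard library; `stratified_split`, the 20% validation fraction, and the fixed seed are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(records, validation_fraction=0.2, seed=0):
    """Split records into (train, validation), preserving the label ratio."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    rng = random.Random(seed)  # local generator keeps the split reproducible
    train, validation = [], []
    for label in sorted(by_label):  # sorted for a deterministic group order
        group = by_label[label][:]
        rng.shuffle(group)
        cut = round(len(group) * validation_fraction)
        validation.extend(group[:cut])
        train.extend(group[cut:])
    return train, validation
```

With 10,000 claims at 25% fraud and the default fraction, this would yield a validation set of 2,000 claims containing 500 fraudulent ones.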
Data Quality
- No missing values
- All four claim types are represented (see Dataset Statistics for the split)
- Realistic fraud patterns based on industry data
- Regular updates with new fraud patterns
Privacy
All data is synthetic and does not contain real PII.
License
For internal use only. Part of the BDR-Agent-Factory ecosystem.