Credit-Card-Anomaly / docs /DATA_FORMAT_GUIDE.md
Zayeemk's picture
Rename DATA_FORMAT_GUIDE.md to docs/DATA_FORMAT_GUIDE.md
52da4b8 verified
# Data Format Guide
## Overview
This guide explains the required data format for the Credit Card Anomaly Detection System.
## Supported File Formats
### 1. CSV (Comma Separated Values) - Recommended
**Extension**: `.csv`
**File Size Limit**: 16MB
**Best For**: Most use cases, easy to create and edit
### 2. Excel Files
**Extensions**: `.xlsx`, `.xls`
**File Size Limit**: 16MB
**Best For**: Data from Excel spreadsheets
### 3. JSON
**Extension**: `.json`
**File Size Limit**: 16MB
**Best For**: Programmatic data export/import
### 4. Parquet
**Extension**: `.parquet`
**File Size Limit**: 16MB
**Best For**: Large datasets, efficient storage
---
## Required Columns
### Essential Columns (Must Include)
These columns are required for the system to work properly:
| Column Name | Data Type | Description | Example |
|-------------|-----------|-------------|---------|
| Transaction ID | String | Unique identifier for each transaction | TX001, TX002, ORD-123 |
| User ID | String | Identifier for the user making the transaction | USER001, CUST-456, john_doe |
| Amount | Numeric | Transaction amount (positive number) | 150.50, 89.99, 1250.00 |
| Timestamp | DateTime | Date and time of the transaction | 2024-01-15 10:30:00, 15/01/2024 14:22 |
| Merchant Category | String | Category of the merchant | Grocery, Restaurant, Electronics |
### Optional Columns (Recommended)
These columns improve detection accuracy but are not required:
| Column Name | Data Type | Description | Example |
|-------------|-----------|-------------|---------|
| Location | String | Location where transaction occurred | New York, Los Angeles, Chicago |
| Merchant Name | String | Name of the merchant | Walmart, Amazon, Target |
| Payment Method | String | How payment was made | Credit Card, Debit Card, Online |
---
## CSV Format Example
### Standard Format
```csv
Transaction ID,User ID,Amount,Timestamp,Merchant Category,Location
TX001,USER001,150.50,2024-01-15 10:30:00,Grocery,New York
TX002,USER001,89.99,2024-01-15 14:22:00,Restaurant,Los Angeles
TX003,USER002,1250.00,2024-01-16 09:15:00,Electronics,Chicago
TX004,USER003,45.25,2024-01-16 11:45:00,Gas Station,Houston
TX005,USER002,2100.00,2024-01-17 16:30:00,Electronics,Phoenix
TX006,USER004,78.50,2024-01-17 18:00:00,Clothing,New York
TX007,USER005,320.00,2024-01-18 08:15:00,Restaurant,Chicago
TX008,USER001,195.00,2024-01-18 12:30:00,Grocery,Los Angeles
TX009,USER003,56.75,2024-01-19 07:45:00,Gas Station,Houston
TX010,USER005,8900.00,2024-01-19 15:00:00,Electronics,Phoenix
```
### Minimal Format (Only Required Columns)
```csv
Transaction ID,User ID,Amount,Timestamp,Merchant Category
TX001,USER001,150.50,2024-01-15 10:30:00,Grocery
TX002,USER001,89.99,2024-01-15 14:22:00,Restaurant
TX003,USER002,1250.00,2024-01-16 09:15:00,Electronics
```
---
## JSON Format Example
### Standard Format
```json
[
{
"Transaction ID": "TX001",
"User ID": "USER001",
"Amount": 150.50,
"Timestamp": "2024-01-15 10:30:00",
"Merchant Category": "Grocery",
"Location": "New York"
},
{
"Transaction ID": "TX002",
"User ID": "USER001",
"Amount": 89.99,
"Timestamp": "2024-01-15 14:22:00",
"Merchant Category": "Restaurant",
"Location": "Los Angeles"
}
]
```
---
## Excel Format Example
### Worksheet Structure
| Transaction ID | User ID | Amount | Timestamp | Merchant Category | Location |
|----------------|---------|--------|-----------|-------------------|----------|
| TX001 | USER001 | 150.50 | 2024-01-15 10:30:00 | Grocery | New York |
| TX002 | USER001 | 89.99 | 2024-01-15 14:22:00 | Restaurant | Los Angeles |
| TX003 | USER002 | 1250.00 | 2024-01-16 09:15:00 | Electronics | Chicago |
---
## Data Quality Requirements
### Transaction ID
- **Must be unique** for each transaction
- **Can be alphanumeric** (TX001, ORD-123, etc.)
- **Cannot be empty**
### User ID
- **Must identify the user** making the transaction
- **Can be alphanumeric** (USER001, CUST-456, email, etc.)
- **Cannot be empty**
### Amount
- **Must be a positive number**
- **Can include decimals** (150.50, 89.99)
- **Cannot be negative or zero**
- **Currency symbol not required** (use plain numbers)
### Timestamp
- **Must be a valid date/time**
- **Supported formats**:
- `YYYY-MM-DD HH:MM:SS` (recommended)
- `YYYY-MM-DD`
- `DD/MM/YYYY`
- `MM/DD/YYYY`
- **Cannot be empty**
### Merchant Category
- **Must be a string**
- **Common categories**: Grocery, Restaurant, Electronics, Clothing, Gas Station, Travel, Entertainment, Healthcare, etc.
- **Cannot be empty**
---
## Common Merchant Categories
### Food & Dining
- Grocery
- Restaurant
- Fast Food
- Coffee Shop
- Bakery
### Shopping
- Electronics
- Clothing
- Department Store
- Online Shopping
- Jewelry
### Services
- Healthcare
- Education
- Professional Services
- Utilities
- Insurance
### Travel & Transportation
- Gas Station
- Airline
- Hotel
- Car Rental
- Public Transport
### Entertainment
- Movies
- Gaming
- Sports
- Concerts
- Streaming Services
---
## Sample Dataset
The project includes a sample dataset at `data/sample.csv` with:
- **50 transactions** across 5 users
- **Transaction amounts** ranging from $32.50 to $8,900.00
- **Timestamps** across multiple days
- **Merchant categories**: Grocery, Restaurant, Electronics, Clothing, Gas Station, etc.
- **Locations**: New York, Los Angeles, Chicago, Houston, Phoenix
You can download this sample to test the system before uploading your own data.
---
## Creating Your Dataset
### Option 1: Export from Existing System
1. Export transaction data from your database or payment system
2. Ensure it has the required columns
3. Save as CSV, Excel, JSON, or Parquet
4. Upload to the system
### Option 2: Create from Scratch
1. Create a new CSV/Excel file
2. Add the required column headers
3. Add your transaction data
4. Ensure data quality (no missing required fields)
5. Upload to the system
### Option 3: Use Sample Data
1. Download the sample dataset from the upload page
2. Use it as a template
3. Replace with your own data
4. Upload to the system
---
## Data Validation
### Before Upload
- Check for duplicate Transaction IDs
- Ensure all required columns are present
- Verify data types are correct
- Remove or handle missing values
- Check for invalid amounts (negative or zero)
### After Upload
The system will:
- Auto-generate missing Transaction IDs
- Auto-generate missing User IDs
- Use current timestamp if Timestamp is missing
- Use "Unknown" for missing Merchant Category
- Use "Unknown" for missing Location
---
## Common Data Issues and Solutions
### Issue: Missing Required Columns
**Solution**: The system will auto-generate missing columns, but results may be less accurate. It's better to provide complete data.
### Issue: Invalid Date Format
**Solution**: Use `YYYY-MM-DD HH:MM:SS` format for best compatibility.
### Issue: Duplicate Transaction IDs
**Solution**: Ensure each transaction has a unique identifier. Remove duplicates before upload.
### Issue: Negative or Zero Amounts
**Solution**: Transaction amounts must be positive. Remove or correct invalid amounts.
### Issue: Missing Values
**Solution**:
- For numeric columns: System fills with median
- For categorical columns: System fills with mode
- For best results: Fill missing values before upload
---
## File Size Limits
### Maximum File Size: 16MB
### What This Means
- **CSV**: Approximately 500,000 - 1,000,000 rows (depending on columns)
- **Excel**: Similar to CSV
- **JSON**: Similar to CSV
- **Parquet**: Can hold more rows due to compression
### If Your File is Too Large
- Split the file into smaller chunks
- Remove unnecessary columns
- Use data sampling
- Compress the file
---
## Best Practices
### Data Preparation
1. **Clean your data** before upload
2. **Remove duplicates** to avoid confusion
3. **Handle missing values** appropriately
4. **Use consistent date/time formats**
5. **Validate amounts** are positive numbers
### Data Organization
1. **Use consistent User IDs** across all transactions
2. **Use standard merchant categories** for better detection
3. **Include location data** if available (improves accuracy)
4. **Sort by timestamp** for easier analysis
### Data Quality
1. **Minimum 100 transactions** recommended for accurate detection
2. **Minimum 5 users** recommended for user-based normalization
3. **Variety in amounts** helps the system learn patterns
4. **Time span** of at least a few days recommended
---
## Troubleshooting
### Upload Fails
- Check file format is supported (CSV, Excel, JSON, Parquet)
- Verify file size is under 16MB
- Ensure file is not corrupted
- Check required columns are present
### No Anomalies Detected
- Increase contamination parameter
- Check if data has enough variation
- Ensure amounts are realistic (not all the same)
- Verify data has enough transactions
### Too Many Anomalies Detected
- Decrease contamination parameter
- Check for data quality issues
- Remove obvious outliers from training data
- Verify amounts are in consistent currency
---
## External Data Sources
### Public Datasets for Testing
- **Kaggle Credit Card Fraud Detection**: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
- **PaySim Synthetic Data**: https://www.kaggle.com/datasets/ealaxi/paysim1
- **Fraud Detection Dataset**: https://www.kaggle.com/datasets/kartik2112/fraud-detection
### Note on External Datasets
- External datasets may have different column names
- You may need to rename columns to match required format
- Preprocessing may be needed before upload
---
## Next Steps
After preparing your data:
1. Refer to [Installation Guide](INSTALLATION_GUIDE.md) to set up the system
2. Follow [Usage Guide](USAGE_GUIDE.md) to upload and analyze your data