Spaces:
Sleeping
Sleeping
| # Data Format Guide | |
| ## Overview | |
| This guide explains the required data format for the Credit Card Anomaly Detection System. | |
| ## Supported File Formats | |
| ### 1. CSV (Comma Separated Values) - Recommended | |
| **Extension**: `.csv` | |
| **File Size Limit**: 16MB | |
| **Best For**: Most use cases, easy to create and edit | |
| ### 2. Excel Files | |
| **Extensions**: `.xlsx`, `.xls` | |
| **File Size Limit**: 16MB | |
| **Best For**: Data from Excel spreadsheets | |
| ### 3. JSON | |
| **Extension**: `.json` | |
| **File Size Limit**: 16MB | |
| **Best For**: Programmatic data export/import | |
| ### 4. Parquet | |
| **Extension**: `.parquet` | |
| **File Size Limit**: 16MB | |
| **Best For**: Large datasets, efficient storage | |
| --- | |
| ## Required Columns | |
| ### Essential Columns (Must Include) | |
| These columns are required for the system to work properly: | |
| | Column Name | Data Type | Description | Example | | |
| |-------------|-----------|-------------|---------| | |
| | Transaction ID | String | Unique identifier for each transaction | TX001, TX002, ORD-123 | | |
| | User ID | String | Identifier for the user making the transaction | USER001, CUST-456, john_doe | | |
| | Amount | Numeric | Transaction amount (positive number) | 150.50, 89.99, 1250.00 | | |
| | Timestamp | DateTime | Date and time of the transaction | 2024-01-15 10:30:00, 15/01/2024 14:22 | | |
| | Merchant Category | String | Category of the merchant | Grocery, Restaurant, Electronics | | |
| ### Optional Columns (Recommended) | |
| These columns improve detection accuracy but are not required: | |
| | Column Name | Data Type | Description | Example | | |
| |-------------|-----------|-------------|---------| | |
| | Location | String | Location where transaction occurred | New York, Los Angeles, Chicago | | |
| | Merchant Name | String | Name of the merchant | Walmart, Amazon, Target | | |
| | Payment Method | String | How payment was made | Credit Card, Debit Card, Online | | |
| --- | |
| ## CSV Format Example | |
| ### Standard Format | |
| ```csv | |
| Transaction ID,User ID,Amount,Timestamp,Merchant Category,Location | |
| TX001,USER001,150.50,2024-01-15 10:30:00,Grocery,New York | |
| TX002,USER001,89.99,2024-01-15 14:22:00,Restaurant,Los Angeles | |
| TX003,USER002,1250.00,2024-01-16 09:15:00,Electronics,Chicago | |
| TX004,USER003,45.25,2024-01-16 11:45:00,Gas Station,Houston | |
| TX005,USER002,2100.00,2024-01-17 16:30:00,Electronics,Phoenix | |
| TX006,USER004,78.50,2024-01-17 18:00:00,Clothing,New York | |
| TX007,USER005,320.00,2024-01-18 08:15:00,Restaurant,Chicago | |
| TX008,USER001,195.00,2024-01-18 12:30:00,Grocery,Los Angeles | |
| TX009,USER003,56.75,2024-01-19 07:45:00,Gas Station,Houston | |
| TX010,USER005,8900.00,2024-01-19 15:00:00,Electronics,Phoenix | |
| ``` | |
| ### Minimal Format (Only Required Columns) | |
| ```csv | |
| Transaction ID,User ID,Amount,Timestamp,Merchant Category | |
| TX001,USER001,150.50,2024-01-15 10:30:00,Grocery | |
| TX002,USER001,89.99,2024-01-15 14:22:00,Restaurant | |
| TX003,USER002,1250.00,2024-01-16 09:15:00,Electronics | |
| ``` | |
| --- | |
| ## JSON Format Example | |
| ### Standard Format | |
| ```json | |
| [ | |
| { | |
| "Transaction ID": "TX001", | |
| "User ID": "USER001", | |
| "Amount": 150.50, | |
| "Timestamp": "2024-01-15 10:30:00", | |
| "Merchant Category": "Grocery", | |
| "Location": "New York" | |
| }, | |
| { | |
| "Transaction ID": "TX002", | |
| "User ID": "USER001", | |
| "Amount": 89.99, | |
| "Timestamp": "2024-01-15 14:22:00", | |
| "Merchant Category": "Restaurant", | |
| "Location": "Los Angeles" | |
| } | |
| ] | |
| ``` | |
| --- | |
| ## Excel Format Example | |
| ### Worksheet Structure | |
| | Transaction ID | User ID | Amount | Timestamp | Merchant Category | Location | | |
| |----------------|---------|--------|-----------|-------------------|----------| | |
| | TX001 | USER001 | 150.50 | 2024-01-15 10:30:00 | Grocery | New York | | |
| | TX002 | USER001 | 89.99 | 2024-01-15 14:22:00 | Restaurant | Los Angeles | | |
| | TX003 | USER002 | 1250.00 | 2024-01-16 09:15:00 | Electronics | Chicago | | |
| --- | |
| ## Data Quality Requirements | |
| ### Transaction ID | |
| - **Must be unique** for each transaction | |
| - **Can be alphanumeric** (TX001, ORD-123, etc.) | |
| - **Cannot be empty** | |
| ### User ID | |
| - **Must identify the user** making the transaction | |
| - **Can be alphanumeric** (USER001, CUST-456, email, etc.) | |
| - **Cannot be empty** | |
| ### Amount | |
| - **Must be a positive number** | |
| - **Can include decimals** (150.50, 89.99) | |
| - **Cannot be negative or zero** | |
| - **Currency symbol not required** (use plain numbers) | |
| ### Timestamp | |
| - **Must be a valid date/time** | |
| - **Supported formats**: | |
| - `YYYY-MM-DD HH:MM:SS` (recommended) | |
| - `YYYY-MM-DD` | |
| - `DD/MM/YYYY` | |
| - `MM/DD/YYYY` | |
| - **Cannot be empty** | |
| ### Merchant Category | |
| - **Must be a string** | |
| - **Common categories**: Grocery, Restaurant, Electronics, Clothing, Gas Station, Travel, Entertainment, Healthcare, etc. | |
| - **Cannot be empty** | |
| --- | |
| ## Common Merchant Categories | |
| ### Food & Dining | |
| - Grocery | |
| - Restaurant | |
| - Fast Food | |
| - Coffee Shop | |
| - Bakery | |
| ### Shopping | |
| - Electronics | |
| - Clothing | |
| - Department Store | |
| - Online Shopping | |
| - Jewelry | |
| ### Services | |
| - Healthcare | |
| - Education | |
| - Professional Services | |
| - Utilities | |
| - Insurance | |
| ### Travel & Transportation | |
| - Gas Station | |
| - Airline | |
| - Hotel | |
| - Car Rental | |
| - Public Transport | |
| ### Entertainment | |
| - Movies | |
| - Gaming | |
| - Sports | |
| - Concerts | |
| - Streaming Services | |
| --- | |
| ## Sample Dataset | |
| The project includes a sample dataset at `data/sample.csv` with: | |
| - **50 transactions** across 5 users | |
| - **Transaction amounts** ranging from $32.50 to $8,900.00 | |
| - **Timestamps** across multiple days | |
| - **Merchant categories**: Grocery, Restaurant, Electronics, Clothing, Gas Station, etc. | |
| - **Locations**: New York, Los Angeles, Chicago, Houston, Phoenix | |
| You can download this sample to test the system before uploading your own data. | |
| --- | |
| ## Creating Your Dataset | |
| ### Option 1: Export from Existing System | |
| 1. Export transaction data from your database or payment system | |
| 2. Ensure it has the required columns | |
| 3. Save as CSV, Excel, JSON, or Parquet | |
| 4. Upload to the system | |
| ### Option 2: Create from Scratch | |
| 1. Create a new CSV/Excel file | |
| 2. Add the required column headers | |
| 3. Add your transaction data | |
| 4. Ensure data quality (no missing required fields) | |
| 5. Upload to the system | |
| ### Option 3: Use Sample Data | |
| 1. Download the sample dataset from the upload page | |
| 2. Use it as a template | |
| 3. Replace with your own data | |
| 4. Upload to the system | |
| --- | |
| ## Data Validation | |
| ### Before Upload | |
| - Check for duplicate Transaction IDs | |
| - Ensure all required columns are present | |
| - Verify data types are correct | |
| - Remove or handle missing values | |
| - Check for invalid amounts (negative or zero) | |
| ### After Upload | |
| The system will: | |
| - Auto-generate missing Transaction IDs | |
| - Auto-generate missing User IDs | |
| - Use current timestamp if Timestamp is missing | |
| - Use "Unknown" for missing Merchant Category | |
| - Use "Unknown" for missing Location | |
| --- | |
| ## Common Data Issues and Solutions | |
| ### Issue: Missing Required Columns | |
| **Solution**: The system will auto-generate missing columns, but results may be less accurate. It's better to provide complete data. | |
| ### Issue: Invalid Date Format | |
| **Solution**: Use `YYYY-MM-DD HH:MM:SS` format for best compatibility. | |
| ### Issue: Duplicate Transaction IDs | |
| **Solution**: Ensure each transaction has a unique identifier. Remove duplicates before upload. | |
| ### Issue: Negative or Zero Amounts | |
| **Solution**: Transaction amounts must be positive. Remove or correct invalid amounts. | |
| ### Issue: Missing Values | |
| **Solution**: | |
| - For numeric columns: System fills with median | |
| - For categorical columns: System fills with mode | |
| - For best results: Fill missing values before upload | |
| --- | |
| ## File Size Limits | |
| ### Maximum File Size: 16MB | |
| ### What This Means | |
| - **CSV**: Approximately 500,000 - 1,000,000 rows (depending on columns) | |
| - **Excel**: Similar to CSV | |
| - **JSON**: Similar to CSV | |
| - **Parquet**: Can hold more rows due to compression | |
| ### If Your File is Too Large | |
| - Split the file into smaller chunks | |
| - Remove unnecessary columns | |
| - Use data sampling | |
| - Compress the file | |
| --- | |
| ## Best Practices | |
| ### Data Preparation | |
| 1. **Clean your data** before upload | |
| 2. **Remove duplicates** to avoid confusion | |
| 3. **Handle missing values** appropriately | |
| 4. **Use consistent date/time formats** | |
| 5. **Validate amounts** are positive numbers | |
| ### Data Organization | |
| 1. **Use consistent User IDs** across all transactions | |
| 2. **Use standard merchant categories** for better detection | |
| 3. **Include location data** if available (improves accuracy) | |
| 4. **Sort by timestamp** for easier analysis | |
| ### Data Quality | |
| 1. **Minimum 100 transactions** recommended for accurate detection | |
| 2. **Minimum 5 users** recommended for user-based normalization | |
| 3. **Variety in amounts** helps the system learn patterns | |
| 4. **Time span** of at least a few days recommended | |
| --- | |
| ## Troubleshooting | |
| ### Upload Fails | |
| - Check file format is supported (CSV, Excel, JSON, Parquet) | |
| - Verify file size is under 16MB | |
| - Ensure file is not corrupted | |
| - Check required columns are present | |
| ### No Anomalies Detected | |
| - Increase contamination parameter | |
| - Check if data has enough variation | |
| - Ensure amounts are realistic (not all the same) | |
| - Verify data has enough transactions | |
| ### Too Many Anomalies Detected | |
| - Decrease contamination parameter | |
| - Check for data quality issues | |
| - Remove obvious outliers from training data | |
| - Verify amounts are in consistent currency | |
| --- | |
| ## External Data Sources | |
| ### Public Datasets for Testing | |
| - **Kaggle Credit Card Fraud Detection**: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud | |
| - **PaySim Synthetic Data**: https://www.kaggle.com/datasets/ealaxi/paysim1 | |
| - **Fraud Detection Dataset**: https://www.kaggle.com/datasets/kartik2112/fraud-detection | |
| ### Note on External Datasets | |
| - External datasets may have different column names | |
| - You may need to rename columns to match required format | |
| - Preprocessing may be needed before upload | |
| --- | |
| ## Next Steps | |
| After preparing your data: | |
| 1. Refer to [Installation Guide](INSTALLATION_GUIDE.md) to set up the system | |
| 2. Follow [Usage Guide](USAGE_GUIDE.md) to upload and analyze your data | |