Spaces:
Sleeping
Data Format Guide
Overview
This guide explains the required data format for the Credit Card Anomaly Detection System.
Supported File Formats
1. CSV (Comma Separated Values) - Recommended
Extension: .csv
File Size Limit: 16MB
Best For: Most use cases, easy to create and edit
2. Excel Files
Extensions: .xlsx, .xls
File Size Limit: 16MB
Best For: Data from Excel spreadsheets
3. JSON
Extension: .json
File Size Limit: 16MB
Best For: Programmatic data export/import
4. Parquet
Extension: .parquet
File Size Limit: 16MB
Best For: Large datasets, efficient storage
Required Columns
Essential Columns (Must Include)
These columns are required for the system to work properly:
| Column Name | Data Type | Description | Example |
|---|---|---|---|
| Transaction ID | String | Unique identifier for each transaction | TX001, TX002, ORD-123 |
| User ID | String | Identifier for the user making the transaction | USER001, CUST-456, john_doe |
| Amount | Numeric | Transaction amount (positive number) | 150.50, 89.99, 1250.00 |
| Timestamp | DateTime | Date and time of the transaction | 2024-01-15 10:30:00, 15/01/2024 14:22 |
| Merchant Category | String | Category of the merchant | Grocery, Restaurant, Electronics |
Optional Columns (Recommended)
These columns improve detection accuracy but are not required:
| Column Name | Data Type | Description | Example |
|---|---|---|---|
| Location | String | Location where transaction occurred | New York, Los Angeles, Chicago |
| Merchant Name | String | Name of the merchant | Walmart, Amazon, Target |
| Payment Method | String | How payment was made | Credit Card, Debit Card, Online |
CSV Format Example
Standard Format
Transaction ID,User ID,Amount,Timestamp,Merchant Category,Location
TX001,USER001,150.50,2024-01-15 10:30:00,Grocery,New York
TX002,USER001,89.99,2024-01-15 14:22:00,Restaurant,Los Angeles
TX003,USER002,1250.00,2024-01-16 09:15:00,Electronics,Chicago
TX004,USER003,45.25,2024-01-16 11:45:00,Gas Station,Houston
TX005,USER002,2100.00,2024-01-17 16:30:00,Electronics,Phoenix
TX006,USER004,78.50,2024-01-17 18:00:00,Clothing,New York
TX007,USER005,320.00,2024-01-18 08:15:00,Restaurant,Chicago
TX008,USER001,195.00,2024-01-18 12:30:00,Grocery,Los Angeles
TX009,USER003,56.75,2024-01-19 07:45:00,Gas Station,Houston
TX010,USER005,8900.00,2024-01-19 15:00:00,Electronics,Phoenix
Minimal Format (Only Required Columns)
Transaction ID,User ID,Amount,Timestamp,Merchant Category
TX001,USER001,150.50,2024-01-15 10:30:00,Grocery
TX002,USER001,89.99,2024-01-15 14:22:00,Restaurant
TX003,USER002,1250.00,2024-01-16 09:15:00,Electronics
JSON Format Example
Standard Format
[
{
"Transaction ID": "TX001",
"User ID": "USER001",
"Amount": 150.50,
"Timestamp": "2024-01-15 10:30:00",
"Merchant Category": "Grocery",
"Location": "New York"
},
{
"Transaction ID": "TX002",
"User ID": "USER001",
"Amount": 89.99,
"Timestamp": "2024-01-15 14:22:00",
"Merchant Category": "Restaurant",
"Location": "Los Angeles"
}
]
Excel Format Example
Worksheet Structure
| Transaction ID | User ID | Amount | Timestamp | Merchant Category | Location |
|---|---|---|---|---|---|
| TX001 | USER001 | 150.50 | 2024-01-15 10:30:00 | Grocery | New York |
| TX002 | USER001 | 89.99 | 2024-01-15 14:22:00 | Restaurant | Los Angeles |
| TX003 | USER002 | 1250.00 | 2024-01-16 09:15:00 | Electronics | Chicago |
Data Quality Requirements
Transaction ID
- Must be unique for each transaction
- Can be alphanumeric (TX001, ORD-123, etc.)
- Cannot be empty
User ID
- Must identify the user making the transaction
- Can be alphanumeric (USER001, CUST-456, email, etc.)
- Cannot be empty
Amount
- Must be a positive number
- Can include decimals (150.50, 89.99)
- Cannot be negative or zero
- Currency symbol not required (use plain numbers)
Timestamp
- Must be a valid date/time
- Supported formats:
YYYY-MM-DD HH:MM:SS(recommended)YYYY-MM-DDDD/MM/YYYYMM/DD/YYYY
- Cannot be empty
Merchant Category
- Must be a string
- Common categories: Grocery, Restaurant, Electronics, Clothing, Gas Station, Travel, Entertainment, Healthcare, etc.
- Cannot be empty
Common Merchant Categories
Food & Dining
- Grocery
- Restaurant
- Fast Food
- Coffee Shop
- Bakery
Shopping
- Electronics
- Clothing
- Department Store
- Online Shopping
- Jewelry
Services
- Healthcare
- Education
- Professional Services
- Utilities
- Insurance
Travel & Transportation
- Gas Station
- Airline
- Hotel
- Car Rental
- Public Transport
Entertainment
- Movies
- Gaming
- Sports
- Concerts
- Streaming Services
Sample Dataset
The project includes a sample dataset at data/sample.csv with:
- 50 transactions across 5 users
- Transaction amounts ranging from $32.50 to $8,900.00
- Timestamps across multiple days
- Merchant categories: Grocery, Restaurant, Electronics, Clothing, Gas Station, etc.
- Locations: New York, Los Angeles, Chicago, Houston, Phoenix
You can download this sample to test the system before uploading your own data.
Creating Your Dataset
Option 1: Export from Existing System
- Export transaction data from your database or payment system
- Ensure it has the required columns
- Save as CSV, Excel, JSON, or Parquet
- Upload to the system
Option 2: Create from Scratch
- Create a new CSV/Excel file
- Add the required column headers
- Add your transaction data
- Ensure data quality (no missing required fields)
- Upload to the system
Option 3: Use Sample Data
- Download the sample dataset from the upload page
- Use it as a template
- Replace with your own data
- Upload to the system
Data Validation
Before Upload
- Check for duplicate Transaction IDs
- Ensure all required columns are present
- Verify data types are correct
- Remove or handle missing values
- Check for invalid amounts (negative or zero)
After Upload
The system will:
- Auto-generate missing Transaction IDs
- Auto-generate missing User IDs
- Use current timestamp if Timestamp is missing
- Use "Unknown" for missing Merchant Category
- Use "Unknown" for missing Location
Common Data Issues and Solutions
Issue: Missing Required Columns
Solution: The system will auto-generate missing columns, but results may be less accurate. It's better to provide complete data.
Issue: Invalid Date Format
Solution: Use YYYY-MM-DD HH:MM:SS format for best compatibility.
Issue: Duplicate Transaction IDs
Solution: Ensure each transaction has a unique identifier. Remove duplicates before upload.
Issue: Negative or Zero Amounts
Solution: Transaction amounts must be positive. Remove or correct invalid amounts.
Issue: Missing Values
Solution:
- For numeric columns: System fills with median
- For categorical columns: System fills with mode
- For best results: Fill missing values before upload
File Size Limits
Maximum File Size: 16MB
What This Means
- CSV: Approximately 500,000 - 1,000,000 rows (depending on columns)
- Excel: Similar to CSV
- JSON: Similar to CSV
- Parquet: Can hold more rows due to compression
If Your File is Too Large
- Split the file into smaller chunks
- Remove unnecessary columns
- Use data sampling
- Compress the file
Best Practices
Data Preparation
- Clean your data before upload
- Remove duplicates to avoid confusion
- Handle missing values appropriately
- Use consistent date/time formats
- Validate amounts are positive numbers
Data Organization
- Use consistent User IDs across all transactions
- Use standard merchant categories for better detection
- Include location data if available (improves accuracy)
- Sort by timestamp for easier analysis
Data Quality
- Minimum 100 transactions recommended for accurate detection
- Minimum 5 users recommended for user-based normalization
- Variety in amounts helps the system learn patterns
- Time span of at least a few days recommended
Troubleshooting
Upload Fails
- Check file format is supported (CSV, Excel, JSON, Parquet)
- Verify file size is under 16MB
- Ensure file is not corrupted
- Check required columns are present
No Anomalies Detected
- Increase contamination parameter
- Check if data has enough variation
- Ensure amounts are realistic (not all the same)
- Verify data has enough transactions
Too Many Anomalies Detected
- Decrease contamination parameter
- Check for data quality issues
- Remove obvious outliers from training data
- Verify amounts are in consistent currency
External Data Sources
Public Datasets for Testing
- Kaggle Credit Card Fraud Detection: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
- PaySim Synthetic Data: https://www.kaggle.com/datasets/ealaxi/paysim1
- Fraud Detection Dataset: https://www.kaggle.com/datasets/kartik2112/fraud-detection
Note on External Datasets
- External datasets may have different column names
- You may need to rename columns to match required format
- Preprocessing may be needed before upload
Next Steps
After preparing your data:
- Refer to Installation Guide to set up the system
- Follow Usage Guide to upload and analyze your data