Credit-Card-Anomaly / docs /DATA_FORMAT_GUIDE.md
Zayeemk's picture
Rename DATA_FORMAT_GUIDE.md to docs/DATA_FORMAT_GUIDE.md
52da4b8 verified

Data Format Guide

Overview

This guide explains the required data format for the Credit Card Anomaly Detection System.

Supported File Formats

1. CSV (Comma Separated Values) - Recommended

Extension: .csv File Size Limit: 16MB Best For: Most use cases, easy to create and edit

2. Excel Files

Extensions: .xlsx, .xls File Size Limit: 16MB Best For: Data from Excel spreadsheets

3. JSON

Extension: .json File Size Limit: 16MB Best For: Programmatic data export/import

4. Parquet

Extension: .parquet File Size Limit: 16MB Best For: Large datasets, efficient storage


Required Columns

Essential Columns (Must Include)

These columns are required for the system to work properly:

Column Name Data Type Description Example
Transaction ID String Unique identifier for each transaction TX001, TX002, ORD-123
User ID String Identifier for the user making the transaction USER001, CUST-456, john_doe
Amount Numeric Transaction amount (positive number) 150.50, 89.99, 1250.00
Timestamp DateTime Date and time of the transaction 2024-01-15 10:30:00, 15/01/2024 14:22
Merchant Category String Category of the merchant Grocery, Restaurant, Electronics

Optional Columns (Recommended)

These columns improve detection accuracy but are not required:

Column Name Data Type Description Example
Location String Location where transaction occurred New York, Los Angeles, Chicago
Merchant Name String Name of the merchant Walmart, Amazon, Target
Payment Method String How payment was made Credit Card, Debit Card, Online

CSV Format Example

Standard Format

Transaction ID,User ID,Amount,Timestamp,Merchant Category,Location
TX001,USER001,150.50,2024-01-15 10:30:00,Grocery,New York
TX002,USER001,89.99,2024-01-15 14:22:00,Restaurant,Los Angeles
TX003,USER002,1250.00,2024-01-16 09:15:00,Electronics,Chicago
TX004,USER003,45.25,2024-01-16 11:45:00,Gas Station,Houston
TX005,USER002,2100.00,2024-01-17 16:30:00,Electronics,Phoenix
TX006,USER004,78.50,2024-01-17 18:00:00,Clothing,New York
TX007,USER005,320.00,2024-01-18 08:15:00,Restaurant,Chicago
TX008,USER001,195.00,2024-01-18 12:30:00,Grocery,Los Angeles
TX009,USER003,56.75,2024-01-19 07:45:00,Gas Station,Houston
TX010,USER005,8900.00,2024-01-19 15:00:00,Electronics,Phoenix

Minimal Format (Only Required Columns)

Transaction ID,User ID,Amount,Timestamp,Merchant Category
TX001,USER001,150.50,2024-01-15 10:30:00,Grocery
TX002,USER001,89.99,2024-01-15 14:22:00,Restaurant
TX003,USER002,1250.00,2024-01-16 09:15:00,Electronics

JSON Format Example

Standard Format

[
  {
    "Transaction ID": "TX001",
    "User ID": "USER001",
    "Amount": 150.50,
    "Timestamp": "2024-01-15 10:30:00",
    "Merchant Category": "Grocery",
    "Location": "New York"
  },
  {
    "Transaction ID": "TX002",
    "User ID": "USER001",
    "Amount": 89.99,
    "Timestamp": "2024-01-15 14:22:00",
    "Merchant Category": "Restaurant",
    "Location": "Los Angeles"
  }
]

Excel Format Example

Worksheet Structure

Transaction ID User ID Amount Timestamp Merchant Category Location
TX001 USER001 150.50 2024-01-15 10:30:00 Grocery New York
TX002 USER001 89.99 2024-01-15 14:22:00 Restaurant Los Angeles
TX003 USER002 1250.00 2024-01-16 09:15:00 Electronics Chicago

Data Quality Requirements

Transaction ID

  • Must be unique for each transaction
  • Can be alphanumeric (TX001, ORD-123, etc.)
  • Cannot be empty

User ID

  • Must identify the user making the transaction
  • Can be alphanumeric (USER001, CUST-456, email, etc.)
  • Cannot be empty

Amount

  • Must be a positive number
  • Can include decimals (150.50, 89.99)
  • Cannot be negative or zero
  • Currency symbol not required (use plain numbers)

Timestamp

  • Must be a valid date/time
  • Supported formats:
    • YYYY-MM-DD HH:MM:SS (recommended)
    • YYYY-MM-DD
    • DD/MM/YYYY
    • MM/DD/YYYY
  • Cannot be empty

Merchant Category

  • Must be a string
  • Common categories: Grocery, Restaurant, Electronics, Clothing, Gas Station, Travel, Entertainment, Healthcare, etc.
  • Cannot be empty

Common Merchant Categories

Food & Dining

  • Grocery
  • Restaurant
  • Fast Food
  • Coffee Shop
  • Bakery

Shopping

  • Electronics
  • Clothing
  • Department Store
  • Online Shopping
  • Jewelry

Services

  • Healthcare
  • Education
  • Professional Services
  • Utilities
  • Insurance

Travel & Transportation

  • Gas Station
  • Airline
  • Hotel
  • Car Rental
  • Public Transport

Entertainment

  • Movies
  • Gaming
  • Sports
  • Concerts
  • Streaming Services

Sample Dataset

The project includes a sample dataset at data/sample.csv with:

  • 50 transactions across 5 users
  • Transaction amounts ranging from $32.50 to $8,900.00
  • Timestamps across multiple days
  • Merchant categories: Grocery, Restaurant, Electronics, Clothing, Gas Station, etc.
  • Locations: New York, Los Angeles, Chicago, Houston, Phoenix

You can download this sample to test the system before uploading your own data.


Creating Your Dataset

Option 1: Export from Existing System

  1. Export transaction data from your database or payment system
  2. Ensure it has the required columns
  3. Save as CSV, Excel, JSON, or Parquet
  4. Upload to the system

Option 2: Create from Scratch

  1. Create a new CSV/Excel file
  2. Add the required column headers
  3. Add your transaction data
  4. Ensure data quality (no missing required fields)
  5. Upload to the system

Option 3: Use Sample Data

  1. Download the sample dataset from the upload page
  2. Use it as a template
  3. Replace with your own data
  4. Upload to the system

Data Validation

Before Upload

  • Check for duplicate Transaction IDs
  • Ensure all required columns are present
  • Verify data types are correct
  • Remove or handle missing values
  • Check for invalid amounts (negative or zero)

After Upload

The system will:

  • Auto-generate missing Transaction IDs
  • Auto-generate missing User IDs
  • Use current timestamp if Timestamp is missing
  • Use "Unknown" for missing Merchant Category
  • Use "Unknown" for missing Location

Common Data Issues and Solutions

Issue: Missing Required Columns

Solution: The system will auto-generate missing columns, but results may be less accurate. It's better to provide complete data.

Issue: Invalid Date Format

Solution: Use YYYY-MM-DD HH:MM:SS format for best compatibility.

Issue: Duplicate Transaction IDs

Solution: Ensure each transaction has a unique identifier. Remove duplicates before upload.

Issue: Negative or Zero Amounts

Solution: Transaction amounts must be positive. Remove or correct invalid amounts.

Issue: Missing Values

Solution:

  • For numeric columns: System fills with median
  • For categorical columns: System fills with mode
  • For best results: Fill missing values before upload

File Size Limits

Maximum File Size: 16MB

What This Means

  • CSV: Approximately 500,000 - 1,000,000 rows (depending on columns)
  • Excel: Similar to CSV
  • JSON: Similar to CSV
  • Parquet: Can hold more rows due to compression

If Your File is Too Large

  • Split the file into smaller chunks
  • Remove unnecessary columns
  • Use data sampling
  • Compress the file

Best Practices

Data Preparation

  1. Clean your data before upload
  2. Remove duplicates to avoid confusion
  3. Handle missing values appropriately
  4. Use consistent date/time formats
  5. Validate amounts are positive numbers

Data Organization

  1. Use consistent User IDs across all transactions
  2. Use standard merchant categories for better detection
  3. Include location data if available (improves accuracy)
  4. Sort by timestamp for easier analysis

Data Quality

  1. Minimum 100 transactions recommended for accurate detection
  2. Minimum 5 users recommended for user-based normalization
  3. Variety in amounts helps the system learn patterns
  4. Time span of at least a few days recommended

Troubleshooting

Upload Fails

  • Check file format is supported (CSV, Excel, JSON, Parquet)
  • Verify file size is under 16MB
  • Ensure file is not corrupted
  • Check required columns are present

No Anomalies Detected

  • Increase contamination parameter
  • Check if data has enough variation
  • Ensure amounts are realistic (not all the same)
  • Verify data has enough transactions

Too Many Anomalies Detected

  • Decrease contamination parameter
  • Check for data quality issues
  • Remove obvious outliers from training data
  • Verify amounts are in consistent currency

External Data Sources

Public Datasets for Testing

Note on External Datasets

  • External datasets may have different column names
  • You may need to rename columns to match required format
  • Preprocessing may be needed before upload

Next Steps

After preparing your data:

  1. Refer to Installation Guide to set up the system
  2. Follow Usage Guide to upload and analyze your data