Database Population Guide

This guide explains how to populate your Supabase database with synthetic data for testing the Cognexa ML pipeline.

Overview

The populate_database.py script creates realistic synthetic data that simulates real users and their behavior. This allows you to:

  • Test the full ML pipeline from database → Spring Boot → FastAPI → predictions
  • Verify feature calculation matches training data
  • Test your backend API endpoints with realistic data
  • Validate your ML service can predict on real database rows

Prerequisites

  1. Supabase Database: Ensure your Supabase database is created and running
  2. Database Credentials: Get your Supabase connection details from the Supabase dashboard
  3. Python Dependencies: Install required packages

Setup

1. Configure Database Credentials

Copy .env.example to .env:

cd ML
cp .env.example .env

Edit .env with your Supabase connection details:

SUPABASE_DB_URL=jdbc:postgresql://aws-1-ap-southeast-2.pooler.supabase.com:6543/postgres?prepareThreshold=0&tcpKeepAlive=true&socketTimeout=0&connectTimeout=30
SUPABASE_DB_USERNAME=postgres.your-project-ref
SUPABASE_DB_PASSWORD=your-supabase-password
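
The URL above is in JDBC form, which Spring Boot understands but psycopg2 does not accept directly. How populate_database.py handles this is not shown in this guide, so the following is only a minimal sketch of one possible conversion. The function name `jdbc_url_to_conn_params` is hypothetical, and the JDBC-only query options (such as prepareThreshold) are simply dropped.

```python
from urllib.parse import urlsplit


def jdbc_url_to_conn_params(jdbc_url, username, password):
    """Translate a JDBC PostgreSQL URL into keyword arguments for psycopg2.connect().

    Hypothetical helper for illustration; the real script may do this differently.
    """
    prefix = "jdbc:"
    # Strip the JDBC prefix so the rest parses as an ordinary URL.
    url = jdbc_url[len(prefix):] if jdbc_url.startswith(prefix) else jdbc_url
    parts = urlsplit(url)
    if parts.scheme != "postgresql":
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,
        "dbname": parts.path.lstrip("/") or "postgres",
        "user": username,
        "password": password,
        # The query string (prepareThreshold, socketTimeout, ...) is ignored
        # here because those options are specific to the JDBC driver.
    }
```

The resulting dictionary could then be passed as `psycopg2.connect(**params)`, with the three values read from .env via python-dotenv.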

2. Install Dependencies

cd ML
pip install psycopg2-binary python-dotenv numpy

Usage

Basic Usage

cd ML
python populate_database.py

By default, this creates 50 users with 100 tasks each (5,000 tasks in total).

Custom Parameters

# Create 20 users with 50 tasks each
python populate_database.py --users 20 --tasks 50

# Create 100 users with 200 tasks each
python populate_database.py --users 100 --tasks 200
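
For readers who want to see how such flags are typically wired up, here is a minimal argparse sketch. Only the flag names (--users, --tasks) and the defaults (50 and 100) come from this guide; the function name, help texts, and validation are assumptions and may not match the actual script.

```python
import argparse


def parse_args(argv=None):
    """Parse --users and --tasks, using the defaults described in this guide."""
    parser = argparse.ArgumentParser(
        description="Populate the database with synthetic data"
    )
    parser.add_argument("--users", type=int, default=50,
                        help="number of synthetic users to create")
    parser.add_argument("--tasks", type=int, default=100,
                        help="number of tasks to create per user")
    args = parser.parse_args(argv)
    # Assumed validation: reject zero or negative counts early.
    if args.users < 1 or args.tasks < 1:
        parser.error("--users and --tasks must be positive integers")
    return args
```

With no arguments, `parse_args([])` yields 50 users and 100 tasks per user, so the total task count is `args.users * args.tasks`.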

What Gets Created

Per User

| Data Type | Quantity | Description |
|---|---|---|
| Users | 1 | User account with personality profile |
| Tasks | 100 | Realistic tasks with completion probabilities |
| Behavior Events | 50 | Task and focus session events |
| Habits | 5 | Habit tracking with completions |
| Analytics | 30 | Daily productivity metrics |
| Task Predictions | 100 | ML predictions for each task |
| Interventions | 20 | Proactive behavioral interventions |
| Coaching Insights | 15 | AI-generated coaching recommendations |
| Notifications | 30 | User notifications |
| Prediction Feedback | 40 | User feedback on predictions |
| Activity Logs | 50 | User activity tracking |
| User Streaks | 8 | Habit streak tracking |
| Achievements | 10 | Gamification achievements |
| Focus Sessions | 25 | Pomodoro session tracking |
| User Settings | 1 | User preferences |
| Wellbeing Data | 30 | Daily mood and energy tracking |
| Goals | 5 | User goals with progress |
| Recurring Tasks | 10 | Recurring task patterns |
| Saved Filters | 5 | Task filter configurations |
| Research Data | 1 | Research consent and participation |
| Export History | 3 | Data export records |
| ML Experiments | 1 | Model training records |
| AB Testing | 1 | Feature experiment records |
| EMA Prompts | 20 | Experience sampling prompts |
| Task Templates | 10 | Reusable task templates |

Data Generation Logic

Personality Profiles

Uses the Big Five (OCEAN) personality model with realistic correlations:

  • Openness: Creativity and curiosity
  • Conscientiousness: Organization and dependability
  • Extraversion: Sociability and energy
  • Agreeableness: Cooperativeness
  • Neuroticism: Emotional instability (high scores mean lower emotional stability)
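
To illustrate what "realistic correlations" can mean in practice, the sketch below samples the five traits as scores in [0, 1] and gives neuroticism a mild negative correlation with conscientiousness. The correlation value (-0.3), the score scale, and the function name `sample_profile` are all assumptions for illustration; the guide does not state which correlations the script actually uses.

```python
import math
import random

TRAITS = ("openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism")


def sample_profile(rng, rho=-0.3):
    """Return a dict of OCEAN scores in [0, 1].

    rho is an assumed correlation between the latent conscientiousness
    and neuroticism scores; it is not taken from the real script.
    """
    # Independent standard-normal draws for every trait.
    z = {trait: rng.gauss(0.0, 1.0) for trait in TRAITS}
    # Mix two independent normals so the pair has correlation rho.
    z["neuroticism"] = (rho * z["conscientiousness"]
                        + math.sqrt(1.0 - rho ** 2) * z["neuroticism"])
    # Map latent scores to [0, 1]: centered at 0.5, spread 0.15, clipped.
    return {t: min(1.0, max(0.0, 0.5 + 0.15 * v)) for t, v in z.items()}


rng = random.Random(0)  # fixed seed so repeated runs produce the same users
profiles = [sample_profile(rng) for _ in range(2000)]
```

Passing a seeded `random.Random` instance keeps the generated users reproducible between runs, which helps when comparing pipeline outputs.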

Task Completion Probability

Calculated based on:

  • Personality-task interaction (conscientiousness is the strongest predictor)
  • Task priority and complexity
  • Time pressure
  • Category-specific effects
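
One way to combine the factors listed above is a logistic model, sketched below. Every weight, the 1 to 5 scales for priority and complexity, the category names, and the direction of the time-pressure effect are hypothetical; only the list of inputs and the dominance of conscientiousness come from this guide.

```python
import math

# Hypothetical category offsets; the real categories and effects are not documented here.
CATEGORY_EFFECTS = {"work": 0.2, "health": -0.1, "personal": 0.0}


def completion_probability(conscientiousness, priority, complexity,
                           hours_until_due, category):
    """Return an illustrative completion probability in (0, 1).

    Assumes conscientiousness in [0, 1] and priority/complexity on a 1-5 scale.
    """
    # Time pressure rises toward 1 as the deadline approaches.
    time_pressure = 1.0 / (1.0 + max(hours_until_due, 0.0) / 24.0)
    logit = (
        2.0 * (conscientiousness - 0.5)   # largest weight: strongest predictor
        + 0.3 * (priority - 3)            # higher priority helps
        - 0.4 * (complexity - 3)          # higher complexity hurts
        + 0.5 * time_pressure             # assumed: deadlines increase completion
        + CATEGORY_EFFECTS.get(category, 0.0)
    )
    # The logistic function maps the score to a probability.
    return 1.0 / (1.0 + math.exp(-logit))
```

A synthetic completion flag can then be drawn by comparing this probability against a uniform random number.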

Behavioral Patterns

  • Realistic task creation timestamps
  • Historical behavior tracking
  • Habit completion streaks
  • Focus session metrics

Testing the Pipeline

1. Start Spring Boot Backend

cd server
mvn spring-boot:run

2. Start FastAPI ML Service

cd ML
python main.py

3. Test Predictions

-- Get a task ID from the database (run in the Supabase SQL Editor or psql)
SELECT id FROM tasks LIMIT 1;

# Test the prediction endpoint, replacing {taskId} with the ID returned above
curl -X POST http://localhost:8080/api/predictions/task/{taskId}
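
The same request can be issued from Python when scripting checks across many tasks. The sketch below uses only the standard library and mirrors the curl call above. The helper names are hypothetical, and the response is returned as raw text because its format is not described in this guide.

```python
import urllib.request


def build_prediction_request(task_id, base_url="http://localhost:8080"):
    """Build a POST request for the prediction endpoint shown above."""
    url = f"{base_url}/api/predictions/task/{task_id}"
    return urllib.request.Request(url, method="POST")


def request_prediction(task_id, base_url="http://localhost:8080", timeout=10):
    """Send the request and return the response body as text.

    Requires the Spring Boot backend to be running.
    """
    req = build_prediction_request(task_id, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `request_prediction(task_id)` with an ID from the tasks table should return whatever the backend sends for that task.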

4. Verify Feature Calculation

Check that the features calculated by Spring Boot match the features used during training. The following query returns the raw task and personality fields for comparison:

SELECT 
    t.id,
    t.complexity,
    t.priority,
    t.category,
    p.openness,
    p.conscientiousness,
    p.extraversion,
    p.agreeableness,
    p.neuroticism
FROM tasks t
JOIN personality_profiles p ON t.user_id = p.user_id
LIMIT 10;
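
Once a row from this query and the feature payload sent to the ML service are both available as dictionaries, they can be compared field by field. The sketch below is one hypothetical way to do that; the function name, the tolerance, and the idea of comparing dictionaries are assumptions, not part of the existing pipeline.

```python
import math


def feature_mismatches(expected, actual, tol=1e-6):
    """Return {feature: (expected, actual)} for every field that differs.

    Numeric values are compared with an absolute tolerance; everything else
    must match exactly. Missing keys are reported with a value of None.
    """
    mismatches = {}
    for name in sorted(set(expected) | set(actual)):
        e, a = expected.get(name), actual.get(name)
        both_numeric = (isinstance(e, (int, float))
                        and isinstance(a, (int, float)))
        same = math.isclose(e, a, abs_tol=tol) if both_numeric else e == a
        if not same:
            mismatches[name] = (e, a)
    return mismatches
```

An empty result means the two sources agree within the tolerance; any entry points to a field worth inspecting, such as a category whose casing differs between the database and the service.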

Troubleshooting

Connection Issues

If you see connection errors:

  1. Verify Supabase credentials in .env
  2. Check Supabase dashboard for database status
  3. Ensure your IP address is allowed in the Supabase network settings
  4. Verify the database URL format

Permission Errors

If you see permission errors:

  1. Ensure the database user has INSERT permissions
  2. Check table constraints are satisfied
  3. Verify that a UUID generation extension (for example uuid-ossp or pgcrypto) is enabled if the schema relies on one

Data Already Exists

If data already exists, the script will skip duplicate entries using ON CONFLICT DO NOTHING.
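
The effect of ON CONFLICT DO NOTHING can be demonstrated without a Supabase connection. The sketch below uses an in-memory SQLite database, which accepts the same clause, so it runs anywhere. The table layout and column names are hypothetical, and SQLite uses `?` placeholders where psycopg2 uses `%s`.

```python
import sqlite3


def insert_users(conn, rows):
    """Insert rows, silently skipping any whose primary key already exists."""
    conn.executemany(
        "INSERT INTO users (id, email) VALUES (?, ?) "
        "ON CONFLICT (id) DO NOTHING",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT NOT NULL)")

rows = [("u1", "a@example.com"), ("u2", "b@example.com")]
insert_users(conn, rows)
# Second run: u1 and u2 conflict and are skipped, only u3 is added.
insert_users(conn, rows + [("u3", "c@example.com")])

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)
```

Note that skipped rows keep their original values: a rerun does not refresh existing data, it only adds rows that are missing.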

Data Cleanup

To clear the database and start fresh:

-- In Supabase SQL Editor or psql
TRUNCATE TABLE user_achievements CASCADE;
TRUNCATE TABLE user_streaks CASCADE;
TRUNCATE TABLE habit_completions CASCADE;
TRUNCATE TABLE habits CASCADE;
TRUNCATE TABLE task_predictions CASCADE;
TRUNCATE TABLE behavior_events CASCADE;
TRUNCATE TABLE tasks CASCADE;
TRUNCATE TABLE personality_profiles CASCADE;
TRUNCATE TABLE users CASCADE;

Performance Notes

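  • Uses batch inserts (executemany) to group rows per table. In psycopg2, executemany is not faster than calling execute in a loop, so psycopg2.extras.execute_values or execute_batch are worth considering if inserts become slow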
  • Connection pooling recommended for large datasets
  • Consider running during off-peak hours for production databases
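
For large runs, splitting rows into fixed-size batches keeps memory use and transaction size predictable. The helper below is a small, hypothetical sketch of that idea and is independent of any database driver; the batch size to use is left as an assumption to tune.

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        # Each batch could be handed to cursor.executemany(...) or to
        # psycopg2.extras.execute_values(...), followed by a commit.
        yield batch
```

For example, `for batch in chunked(task_rows, 1000): ...` processes 1,000 rows at a time, with a shorter final batch when the total is not a multiple of the size.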

Next Steps

After populating the database:

  1. Test your Spring Boot API endpoints
  2. Verify ML service predictions
  3. Check feature calculation accuracy
  4. Validate frontend displays data correctly
  5. Run end-to-end tests