Database Population Guide

This guide explains how to populate your Supabase database with synthetic data for testing the Cognexa ML pipeline.

Overview

The populate_database.py script creates realistic synthetic data that simulates real users and their behavior. This allows you to:

Test the full ML pipeline from database → Spring Boot → FastAPI → predictions
Verify feature calculation matches training data
Test your backend API endpoints with realistic data
Validate your ML service can predict on real database rows

Prerequisites

Supabase Database: Ensure your Supabase database is created and running
Database Credentials: Get your Supabase connection details from the Supabase dashboard
Python Dependencies: Install required packages

Setup

1. Configure Database Credentials

Copy .env.example to .env and update with your Supabase credentials:

cd ML
cp .env.example .env

Edit .env with your Supabase connection details:

SUPABASE_DB_URL=jdbc:postgresql://aws-1-ap-southeast-2.pooler.supabase.com:6543/postgres?prepareThreshold=0&tcpKeepAlive=true&socketTimeout=0&connectTimeout=30
SUPABASE_DB_USERNAME=postgres.your-project-ref
SUPABASE_DB_PASSWORD=your-supabase-password
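
The URL above is in JDBC form, which suits the Spring Boot backend, but psycopg2 does not accept the jdbc: prefix. How populate_database.py handles this is not shown in this guide. As a minimal sketch, assuming the script converts the URL into keyword arguments for psycopg2.connect(), the conversion could look like this (the function name jdbc_to_pg_kwargs is hypothetical):

```python
from urllib.parse import urlparse


def jdbc_to_pg_kwargs(jdbc_url: str, user: str, password: str) -> dict:
    """Convert a jdbc:postgresql:// URL into keyword arguments for psycopg2.connect()."""
    prefix = "jdbc:"
    if not jdbc_url.startswith(prefix + "postgresql://"):
        raise ValueError("expected a jdbc:postgresql:// URL")
    # Strip the prefix; JDBC-only query options such as prepareThreshold are dropped.
    parsed = urlparse(jdbc_url[len(prefix):])
    return {
        "host": parsed.hostname,
        "port": parsed.port or 5432,
        "dbname": parsed.path.lstrip("/") or "postgres",
        "user": user,
        "password": password,
    }


kwargs = jdbc_to_pg_kwargs(
    "jdbc:postgresql://aws-1-ap-southeast-2.pooler.supabase.com:6543/postgres?prepareThreshold=0",
    "postgres.your-project-ref",
    "your-supabase-password",
)
print(kwargs["host"], kwargs["port"], kwargs["dbname"])
```

The resulting dictionary can be passed as psycopg2.connect(**kwargs).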

2. Install Dependencies

cd ML
pip install psycopg2-binary python-dotenv numpy

Usage

Basic Usage

cd ML
python populate_database.py

With no arguments, the script creates 50 users with 100 tasks each (5,000 tasks in total).

Custom Parameters

# Create 20 users with 50 tasks each
python populate_database.py --users 20 --tasks 50

# Create 100 users with 200 tasks each
python populate_database.py --users 100 --tasks 200
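
For orientation, the two flags could be parsed along these lines. This is a sketch rather than the script's actual code: the flag names and defaults come from this guide, while the function name build_parser and the help texts are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two flags described in this guide."""
    parser = argparse.ArgumentParser(description="Populate the database with synthetic data")
    parser.add_argument("--users", type=int, default=50, help="number of users to create")
    parser.add_argument("--tasks", type=int, default=100, help="number of tasks to create per user")
    return parser


# Equivalent to: python populate_database.py --users 20 --tasks 50
args = build_parser().parse_args(["--users", "20", "--tasks", "50"])
print(args.users * args.tasks)  # total tasks: 1000
```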

What Gets Created

Per User

Data Type	Quantity	Description
Users	1	User account with personality profile
Tasks	100	Realistic tasks with completion probabilities
Behavior Events	50	Task and focus session events
Habits	5	Habit tracking with completions
Analytics	30	Daily productivity metrics
Task Predictions	100	ML predictions for each task
Interventions	20	Proactive behavioral interventions
Coaching Insights	15	AI-generated coaching recommendations
Notifications	30	User notifications
Prediction Feedback	40	User feedback on predictions
Activity Logs	50	User activity tracking
User Streaks	8	Habit streak tracking
Achievements	10	Gamification achievements
Focus Sessions	25	Pomodoro session tracking
User Settings	1	User preferences
Wellbeing Data	30	Daily mood and energy tracking
Goals	5	User goals with progress
Recurring Tasks	10	Recurring task patterns
Saved Filters	5	Task filter configurations
Research Data	1	Research consent and participation
Export History	3	Data export records
ML Experiments	1	Model training records
AB Testing	1	Feature experiment records
EMA Prompts	20	Experience sampling prompts
Task Templates	10	Reusable task templates

Data Generation Logic

Personality Profiles

Uses the Big Five (OCEAN) personality model with realistic correlations:

Openness: Creativity and curiosity
Conscientiousness: Organization and dependability
Extraversion: Sociability and energy
Agreeableness: Cooperativeness
Neuroticism: Emotional instability and sensitivity to stress
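
The guide does not list the correlations the script uses. As an illustration only, the sketch below draws one profile with a single hypothetical negative correlation (rho = -0.3) between conscientiousness and neuroticism, and assumes trait scores are stored on a 0 to 1 scale. The function name sample_profile and all constants are assumptions.

```python
import math
import random

TRAITS = ("openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism")


def sample_profile(rng: random.Random, rho: float = -0.3) -> dict:
    """Draw one OCEAN profile with every trait score in [0, 1].

    rho is a hypothetical correlation between conscientiousness and
    neuroticism; the other traits are sampled independently here.
    """
    z = {trait: rng.gauss(0.0, 1.0) for trait in TRAITS}
    # Mix conscientiousness into neuroticism so the two z-scores correlate by rho.
    z["neuroticism"] = rho * z["conscientiousness"] + math.sqrt(1 - rho**2) * z["neuroticism"]
    # Centre at 0.5 with spread 0.15, then clamp to the valid range.
    return {trait: min(1.0, max(0.0, 0.5 + 0.15 * value)) for trait, value in z.items()}


profile = sample_profile(random.Random(42))
print(profile)
```

Passing a seeded random.Random keeps the generated profiles reproducible between runs.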

Task Completion Probability

Calculated based on:

Personality-task interaction (conscientiousness is the strongest predictor)
Task priority and complexity
Time pressure
Category-specific effects
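
One common way to combine factors like these into a probability is a logistic function. The sketch below is not the script's formula: the weights, the 0 to 1 input scales, the direction of the time-pressure effect, and the function name completion_probability are all assumptions chosen for illustration, with conscientiousness given the largest weight to mirror the list above.

```python
import math


def completion_probability(
    conscientiousness: float,
    priority: float,
    complexity: float,
    hours_until_due: float,
    category_effect: float = 0.0,
) -> float:
    """Return a probability in (0, 1) from a logistic model with made-up weights.

    conscientiousness, priority and complexity are assumed to lie in [0, 1].
    """
    # 1.0 when the task is due now, approaching 0 as the deadline moves away.
    time_pressure = 1.0 / (1.0 + max(hours_until_due, 0.0) / 24.0)
    score = (
        1.5 * (conscientiousness - 0.5)  # largest weight: strongest predictor
        + 0.6 * (priority - 0.5)
        - 0.8 * (complexity - 0.5)       # harder tasks are completed less often
        + 0.4 * (time_pressure - 0.5)
        + category_effect
    )
    return 1.0 / (1.0 + math.exp(-score))


low = completion_probability(0.2, 0.5, 0.5, 24.0)
high = completion_probability(0.9, 0.5, 0.5, 24.0)
print(round(low, 3), round(high, 3))
```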

Behavioral Patterns

Realistic task creation timestamps
Historical behavior tracking
Habit completion streaks
Focus session metrics
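
As an example of one of these patterns, a habit streak can be derived from the set of days on which the habit was completed. The sketch below assumes a streak means consecutive completed days ending today; the function name current_streak is hypothetical and the script may define streaks differently.

```python
from datetime import date, timedelta


def current_streak(completed_on: set, today: date) -> int:
    """Count consecutive completed days, walking backwards from `today`."""
    streak = 0
    day = today
    while day in completed_on:
        streak += 1
        day -= timedelta(days=1)
    return streak


today = date(2024, 1, 10)
completions = {date(2024, 1, 10), date(2024, 1, 9), date(2024, 1, 8), date(2024, 1, 5)}
print(current_streak(completions, today))  # 3: the gap on 7 January ends the streak
```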

Testing the Pipeline

1. Start Spring Boot Backend

cd server
mvn spring-boot:run

2. Start FastAPI ML Service

cd ML
python main.py

3. Test Predictions

-- Get a task ID from the database (run in the Supabase SQL Editor or psql)
SELECT id FROM tasks LIMIT 1;

# Test the prediction endpoint, replacing {taskId} with the ID returned above
curl -X POST http://localhost:8080/api/predictions/task/{taskId}
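
The same call can be made from Python using only the standard library. This sketch assumes the endpoint returns a JSON body, which this guide does not confirm; the function names are hypothetical and the all-zero UUID is a placeholder for a real task ID.

```python
import json
import urllib.request


def build_prediction_request(task_id: str, base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build the POST request for the prediction endpoint without sending it."""
    url = f"{base_url}/api/predictions/task/{task_id}"
    return urllib.request.Request(url, method="POST")


def request_prediction(task_id: str) -> dict:
    """Send the request and decode the body, assuming the endpoint returns JSON."""
    with urllib.request.urlopen(build_prediction_request(task_id), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_prediction_request("00000000-0000-0000-0000-000000000000")
print(req.get_method(), req.full_url)
```

Call request_prediction(task_id) once both services are running.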

4. Verify Feature Calculation

Check that the features calculated by Spring Boot match what was used during training:

SELECT 
    t.id,
    t.complexity,
    t.priority,
    t.category,
    p.openness,
    p.conscientiousness,
    p.extraversion,
    p.agreeableness,
    p.neuroticism
FROM tasks t
JOIN personality_profiles p ON t.user_id = p.user_id
LIMIT 10;
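
Once you have a row from this query and the matching feature vector from training, the comparison itself can be automated. The sketch below is an assumption about how such a check might look; the function name feature_mismatches and the sample values are illustrative, not taken from the pipeline.

```python
import math


def feature_mismatches(training: dict, serving: dict, tol: float = 1e-6) -> list:
    """Return human-readable differences between two feature dictionaries."""
    problems = []
    for name in sorted(set(training) | set(serving)):
        if name not in training or name not in serving:
            problems.append(f"{name}: missing on one side")
            continue
        a, b = training[name], serving[name]
        both_numeric = isinstance(a, (int, float)) and isinstance(b, (int, float))
        # Numbers are compared with a tolerance, everything else exactly.
        same = math.isclose(a, b, abs_tol=tol) if both_numeric else a == b
        if not same:
            problems.append(f"{name}: training={a!r} serving={b!r}")
    return problems


training_row = {"complexity": 3, "priority": "HIGH", "conscientiousness": 0.72}
serving_row = {"complexity": 3, "priority": "high", "conscientiousness": 0.72}
print(feature_mismatches(training_row, serving_row))
```

In this example only priority is reported, because the two sides encode it with different casing.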

Troubleshooting

Connection Issues

If you see connection errors:

Verify Supabase credentials in .env
Check Supabase dashboard for database status
Ensure your IP is whitelisted in Supabase network settings
Verify the database URL format

Permission Errors

If you see permission errors:

Ensure the database user has INSERT permissions
Check table constraints are satisfied
Verify UUID generation extension is enabled

Data Already Exists

If data already exists, the script will skip duplicate entries using ON CONFLICT DO NOTHING.
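
The effect of ON CONFLICT DO NOTHING can be seen in a small self-contained demo. To avoid needing a Supabase connection, this sketch uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (it needs SQLite 3.24 or newer, and uses ? placeholders where psycopg2 uses %s). The two-column users table is a simplified assumption, not the real schema.

```python
import sqlite3

# In-memory stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT NOT NULL)")

rows = [("u1", "a@example.com"), ("u2", "b@example.com")]
insert_sql = "INSERT INTO users (id, email) VALUES (?, ?) ON CONFLICT DO NOTHING"

# Run the same insert twice: the second pass hits the primary key and is skipped.
for _ in range(2):
    conn.executemany(insert_sql, rows)
conn.commit()

user_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(user_count)
```

Without the ON CONFLICT clause, the second pass would raise an integrity error instead of being skipped.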

Data Cleanup

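To clear the database and start fresh, run the statements below. TRUNCATE permanently deletes every row in the listed tables, and CASCADE extends this to any tables that reference them through foreign keys: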

-- In Supabase SQL Editor or psql
TRUNCATE TABLE user_achievements CASCADE;
TRUNCATE TABLE user_streaks CASCADE;
TRUNCATE TABLE habit_completions CASCADE;
TRUNCATE TABLE habits CASCADE;
TRUNCATE TABLE task_predictions CASCADE;
TRUNCATE TABLE behavior_events CASCADE;
TRUNCATE TABLE tasks CASCADE;
TRUNCATE TABLE personality_profiles CASCADE;
TRUNCATE TABLE users CASCADE;

Performance Notes

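Uses batch inserts (executemany). In psycopg2, executemany is not faster than calling execute in a loop, so for large datasets the execute_values or execute_batch helpers in psycopg2.extras typically give better throughput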
Connection pooling recommended for large datasets
Consider running during off-peak hours for production databases
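
For large runs, splitting rows into fixed-size batches keeps each insert statement and transaction bounded. The helper below is a generic sketch, not code from populate_database.py; the name chunked and the batch size of 1000 are assumptions.

```python
from itertools import islice


def chunked(rows, size: int):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# 5000 rows (the default run) split into batches of 1000.
batches = list(chunked(range(5000), 1000))
print(len(batches), len(batches[0]))
```

Each batch can then be handed to a single insert call, for example psycopg2.extras.execute_values.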

Next Steps

After populating the database:

Test your Spring Boot API endpoints
Verify ML service predictions
Check feature calculation accuracy
Validate frontend displays data correctly
Run end-to-end tests