# Database Population Guide

This guide explains how to populate your Supabase database with synthetic data for testing the Cognexa ML pipeline.

## Overview

The `populate_database.py` script creates realistic synthetic data that simulates real users and their behavior. This allows you to:

- Test the full ML pipeline from database → Spring Boot → FastAPI → predictions
- Verify that feature calculation matches the training data
- Test your backend API endpoints with realistic data
- Validate that your ML service can predict on real database rows

## Prerequisites

1. **Supabase Database**: Ensure your Supabase database is created and running
2. **Database Credentials**: Get your Supabase connection details from the Supabase dashboard
3. **Python Dependencies**: Install the required packages

## Setup

### 1. Configure Database Credentials

Copy `.env.example` to `.env`:

```bash
cd ML
cp .env.example .env
```

Edit `.env` with your Supabase connection details:

```env
SUPABASE_DB_URL=jdbc:postgresql://aws-1-ap-southeast-2.pooler.supabase.com:6543/postgres?prepareThreshold=0&tcpKeepAlive=true&socketTimeout=0&connectTimeout=30
SUPABASE_DB_USERNAME=postgres.your-project-ref
SUPABASE_DB_PASSWORD=your-supabase-password
```

### 2. Install Dependencies

```bash
cd ML
pip install psycopg2-binary python-dotenv numpy
```

## Usage

### Basic Usage

```bash
cd ML
python populate_database.py
```

This creates 50 users with 100 tasks each (5,000 tasks in total).

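
The `SUPABASE_DB_URL` value shown in the setup step is in JDBC form, which `psycopg2` does not accept directly. How `populate_database.py` handles this is not shown in this guide; the following is a minimal sketch, assuming the URL is split into keyword arguments before connecting. The helper name `jdbc_to_psycopg2_kwargs` is hypothetical, and only `connectTimeout` is carried over because it has a direct libpq counterpart (`connect_timeout`, in seconds).

```python
from urllib.parse import urlsplit, parse_qs


def jdbc_to_psycopg2_kwargs(jdbc_url, username, password):
    """Hypothetical helper: turn a JDBC PostgreSQL URL into psycopg2.connect() kwargs."""
    prefix = "jdbc:"
    if not jdbc_url.startswith(prefix + "postgresql://"):
        raise ValueError("expected a jdbc:postgresql:// URL")

    # Drop the "jdbc:" prefix so the rest parses as an ordinary URL.
    parts = urlsplit(jdbc_url[len(prefix):])
    query = parse_qs(parts.query)

    kwargs = {
        "host": parts.hostname,
        "port": parts.port or 5432,
        "dbname": parts.path.lstrip("/"),
        "user": username,
        "password": password,
    }
    # JDBC-only options (prepareThreshold, socketTimeout, ...) are ignored here.
    if "connectTimeout" in query:
        kwargs["connect_timeout"] = int(query["connectTimeout"][0])
    return kwargs
```

The resulting dictionary could then be passed as `psycopg2.connect(**kwargs)`, with the three values read from `.env` via `python-dotenv`.
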
### Custom Parameters

```bash
# Create 20 users with 50 tasks each
python populate_database.py --users 20 --tasks 50

# Create 100 users with 200 tasks each
python populate_database.py --users 100 --tasks 200
```

## What Gets Created

### Per User

| Data Type | Quantity | Description |
|-----------|----------|-------------|
| Users | 1 | User account with personality profile |
| Tasks | 100 | Realistic tasks with completion probabilities |
| Behavior Events | 50 | Task and focus session events |
| Habits | 5 | Habit tracking with completions |
| Analytics | 30 | Daily productivity metrics |
| Task Predictions | 100 | ML predictions for each task |
| Interventions | 20 | Proactive behavioral interventions |
| Coaching Insights | 15 | AI-generated coaching recommendations |
| Notifications | 30 | User notifications |
| Prediction Feedback | 40 | User feedback on predictions |
| Activity Logs | 50 | User activity tracking |
| User Streaks | 8 | Habit streak tracking |
| Achievements | 10 | Gamification achievements |
| Focus Sessions | 25 | Pomodoro session tracking |
| User Settings | 1 | User preferences |
| Wellbeing Data | 30 | Daily mood and energy tracking |
| Goals | 5 | User goals with progress |
| Recurring Tasks | 10 | Recurring task patterns |
| Saved Filters | 5 | Task filter configurations |
| Research Data | 1 | Research consent and participation |
| Export History | 3 | Data export records |
| ML Experiments | 1 | Model training records |
| AB Testing | 1 | Feature experiment records |
| EMA Prompts | 20 | Experience sampling prompts |
| Task Templates | 10 | Reusable task templates |

## Data Generation Logic

### Personality Profiles

Profiles use the Big Five (OCEAN) personality model with realistic correlations between traits:

- **Openness**: Creativity and curiosity
- **Conscientiousness**: Organization and dependability
- **Extraversion**: Sociability and energy
- **Agreeableness**: Cooperativeness
- **Neuroticism**: Emotional instability (higher scores mean lower emotional stability)

### Task Completion Probability

Calculated based on:

- Personality-task interaction (conscientiousness is the strongest predictor)
- Task priority and complexity
- Time pressure
- Category-specific effects

### Behavioral Patterns

- Realistic task creation timestamps
- Historical behavior tracking
- Habit completion streaks
- Focus session metrics

## Testing the Pipeline

### 1. Start Spring Boot Backend

```bash
cd server
mvn spring-boot:run
```

### 2. Start FastAPI ML Service

```bash
cd ML
python main.py
```

### 3. Test Predictions

```bash
# Get a task ID from the database by running this query
# in the Supabase SQL Editor or psql:
#   SELECT id FROM tasks LIMIT 1;

# Test the prediction endpoint (replace {taskId} with the ID returned above)
curl -X POST http://localhost:8080/api/predictions/task/{taskId}
```

### 4. Verify Feature Calculation

Check that the features calculated by Spring Boot match what was used during training:

```sql
SELECT
    t.id,
    t.complexity,
    t.priority,
    t.category,
    p.openness,
    p.conscientiousness,
    p.extraversion,
    p.agreeableness,
    p.neuroticism
FROM tasks t
JOIN personality_profiles p ON t.user_id = p.user_id
LIMIT 10;
```

## Troubleshooting

### Connection Issues

If you see connection errors:

1. Verify the Supabase credentials in `.env`
2. Check the Supabase dashboard for database status
3. Ensure your IP is whitelisted in the Supabase network settings
4. Verify the database URL format

### Permission Errors

If you see permission errors:

1. Ensure the database user has INSERT permissions
2. Check that table constraints are satisfied
3. Verify that the UUID generation extension is enabled

### Data Already Exists

If data already exists, the script skips duplicate entries using `ON CONFLICT DO NOTHING`, so it can be re-run safely.

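
The effect of `ON CONFLICT DO NOTHING` on a re-run can be illustrated with a small, self-contained sketch. It uses Python's built-in `sqlite3` (SQLite 3.24 or newer) as a stand-in for PostgreSQL so that it runs without a database server; the two-column `users` table and the `insert_users` helper are hypothetical and do not reflect the script's actual schema. With `psycopg2` the placeholders would be `%s` rather than `?`.

```python
import sqlite3


def insert_users(conn, rows):
    """Insert (id, email) rows in one batch, skipping ids that already exist."""
    conn.executemany(
        "INSERT INTO users (id, email) VALUES (?, ?) ON CONFLICT DO NOTHING",
        rows,
    )
    conn.commit()
    # Return the table size so the caller can see how many rows were kept.
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT)")

rows = [("u1", "a@example.com"), ("u2", "b@example.com")]
first = insert_users(conn, rows)
# The second run repeats u1 and u2 and adds one new row; only u3 is inserted.
second = insert_users(conn, rows + [("u3", "c@example.com")])
```

Rows whose primary key is already present are silently dropped rather than raising a uniqueness error, which is why a repeated run adds only the rows that are new.
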
## Data Cleanup

To clear the database and start fresh, run the following in the Supabase SQL Editor or psql. Note that `TRUNCATE ... CASCADE` also empties any table that references these tables through a foreign key.

```sql
TRUNCATE TABLE user_achievements CASCADE;
TRUNCATE TABLE user_streaks CASCADE;
TRUNCATE TABLE habit_completions CASCADE;
TRUNCATE TABLE habits CASCADE;
TRUNCATE TABLE task_predictions CASCADE;
TRUNCATE TABLE behavior_events CASCADE;
TRUNCATE TABLE tasks CASCADE;
TRUNCATE TABLE personality_profiles CASCADE;
TRUNCATE TABLE users CASCADE;
```

## Performance Notes

- The script uses batch inserts (`executemany`) for performance
- Connection pooling is recommended for large datasets
- Consider running during off-peak hours against production databases

## Next Steps

After populating the database:

1. Test your Spring Boot API endpoints
2. Verify ML service predictions
3. Check feature calculation accuracy
4. Validate that the frontend displays data correctly
5. Run end-to-end tests