# Database Population Guide

This guide explains how to populate your Supabase database with synthetic data for testing the Cognexa ML pipeline.

## Overview

The `populate_database.py` script creates realistic synthetic data that simulates real users and their behavior. This allows you to:

- Test the full ML pipeline from database → Spring Boot → FastAPI → predictions
- Verify feature calculation matches training data
- Test your backend API endpoints with realistic data
- Validate your ML service can predict on real database rows

## Prerequisites

1. **Supabase Database**: Ensure your Supabase database is created and running
2. **Database Credentials**: Get your Supabase connection details from the Supabase dashboard
3. **Python Dependencies**: Install required packages

## Setup

### 1. Configure Database Credentials

Copy `.env.example` to `.env` and update with your Supabase credentials:

```bash
cd ML
cp .env.example .env
```

Edit `.env` with your Supabase connection details:

```env
SUPABASE_DB_URL=jdbc:postgresql://aws-1-ap-southeast-2.pooler.supabase.com:6543/postgres?prepareThreshold=0&tcpKeepAlive=true&socketTimeout=0&connectTimeout=30
SUPABASE_DB_USERNAME=postgres.your-project-ref
SUPABASE_DB_PASSWORD=your-supabase-password
```
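
The URL above is in JDBC format, which psycopg2 does not accept directly. How `populate_database.py` reads these variables is not shown here, so the following is only a sketch of one possible conversion. The helper name `jdbc_to_psycopg2_kwargs` is hypothetical, and it assumes the JDBC-only query parameters can simply be dropped.

```python
from urllib.parse import urlparse


def jdbc_to_psycopg2_kwargs(jdbc_url: str, user: str, password: str) -> dict:
    """Convert a JDBC PostgreSQL URL into keyword arguments for psycopg2.connect().

    Hypothetical helper: JDBC-only query parameters (prepareThreshold,
    socketTimeout, ...) are discarded because libpq does not understand them.
    """
    prefix = "jdbc:"
    if not jdbc_url.startswith(prefix + "postgresql://"):
        raise ValueError("expected a URL starting with jdbc:postgresql://")
    # Strip the "jdbc:" prefix so the rest parses as an ordinary URL.
    parsed = urlparse(jdbc_url[len(prefix):])
    return {
        "host": parsed.hostname,
        "port": parsed.port or 5432,
        "dbname": parsed.path.lstrip("/") or "postgres",
        "user": user,
        "password": password,
    }
```

The returned dictionary could then be passed on as `psycopg2.connect(**kwargs)`, with the three arguments read from `SUPABASE_DB_URL`, `SUPABASE_DB_USERNAME`, and `SUPABASE_DB_PASSWORD`.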

### 2. Install Dependencies

```bash
cd ML
pip install psycopg2-binary python-dotenv numpy
```

## Usage

### Basic Usage

```bash
cd ML
python populate_database.py
```

By default, this creates 50 users with 100 tasks each (5,000 tasks in total).

### Custom Parameters

```bash
# Create 20 users with 50 tasks each
python populate_database.py --users 20 --tasks 50

# Create 100 users with 200 tasks each
python populate_database.py --users 100 --tasks 200
```
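
For orientation, the two options could be parsed roughly as follows. This is a sketch rather than the script's actual code: only the flag names and the defaults (50 users, 100 tasks) come from this guide, and `parse_args` is a hypothetical function name.

```python
import argparse


def parse_args(argv=None):
    """Parse --users and --tasks; the defaults mirror the documented 50 x 100."""
    parser = argparse.ArgumentParser(
        description="Populate the database with synthetic data"
    )
    parser.add_argument("--users", type=int, default=50,
                        help="number of users to create")
    parser.add_argument("--tasks", type=int, default=100,
                        help="number of tasks per user")
    return parser.parse_args(argv)


# Example: the first custom invocation shown above.
args = parse_args(["--users", "20", "--tasks", "50"])
print(f"{args.users} users x {args.tasks} tasks = {args.users * args.tasks} total tasks")
```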

## What Gets Created

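### Per User

The quantities below are per user for a default run; the number of tasks follows the `--tasks` option.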

| Data Type | Quantity | Description |
|-----------|----------|-------------|
| Users | 1 | User account with personality profile |
| Tasks | 100 | Realistic tasks with completion probabilities |
| Behavior Events | 50 | Task and focus session events |
| Habits | 5 | Habit tracking with completions |
| Analytics | 30 | Daily productivity metrics |
| Task Predictions | 100 | ML predictions for each task |
| Interventions | 20 | Proactive behavioral interventions |
| Coaching Insights | 15 | AI-generated coaching recommendations |
| Notifications | 30 | User notifications |
| Prediction Feedback | 40 | User feedback on predictions |
| Activity Logs | 50 | User activity tracking |
| User Streaks | 8 | Habit streak tracking |
| Achievements | 10 | Gamification achievements |
| Focus Sessions | 25 | Pomodoro session tracking |
| User Settings | 1 | User preferences |
| Wellbeing Data | 30 | Daily mood and energy tracking |
| Goals | 5 | User goals with progress |
| Recurring Tasks | 10 | Recurring task patterns |
| Saved Filters | 5 | Task filter configurations |
| Research Data | 1 | Research consent and participation |
| Export History | 3 | Data export records |
| ML Experiments | 1 | Model training records |
| AB Testing | 1 | Feature experiment records |
| EMA Prompts | 20 | Experience sampling prompts |
| Task Templates | 10 | Reusable task templates |

## Data Generation Logic

### Personality Profiles

Uses the Big Five (OCEAN) personality model with realistic correlations:
- **Openness**: Creativity and curiosity
- **Conscientiousness**: Organization and dependability
- **Extraversion**: Sociability and energy
- **Agreeableness**: Cooperativeness
- **Neuroticism**: Emotional instability (the inverse of emotional stability)
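
One common way to draw correlated traits is to multiply independent normal samples by the Cholesky factor of a correlation matrix. The sketch below shows that idea using only the standard library; the script itself may use NumPy instead. The correlation values, the 0 to 1 trait scale, and the mean and standard deviation are all illustrative assumptions, not values taken from `populate_database.py`.

```python
import math
import random

TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]

# Hypothetical correlation matrix (illustrative values only).
# Neuroticism is negatively correlated with the other four traits.
CORR = [
    [1.0, 0.1, 0.2, 0.1, -0.1],
    [0.1, 1.0, 0.1, 0.2, -0.3],
    [0.2, 0.1, 1.0, 0.1, -0.2],
    [0.1, 0.2, 0.1, 1.0, -0.2],
    [-0.1, -0.3, -0.2, -0.2, 1.0],
]


def cholesky(matrix):
    """Return the lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


CHOL = cholesky(CORR)


def sample_profile(rng, mean=0.5, std=0.15):
    """Draw one correlated OCEAN profile, with every trait clipped to [0, 1]."""
    z = [rng.gauss(0.0, 1.0) for _ in TRAITS]
    # Mixing independent draws through the Cholesky factor induces CORR.
    correlated = [sum(CHOL[i][k] * z[k] for k in range(i + 1))
                  for i in range(len(TRAITS))]
    return {trait: min(1.0, max(0.0, mean + std * value))
            for trait, value in zip(TRAITS, correlated)}


profile = sample_profile(random.Random(42))
```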

### Task Completion Probability

Calculated based on:
- Personality-task interaction (conscientiousness is strongest predictor)
- Task priority and complexity
- Time pressure
- Category-specific effects
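
The factors above could be combined on a log-odds scale and passed through a sigmoid, as in the sketch below. Every weight, the sign chosen for time pressure, the category names, and the assumption that inputs are normalised to the range 0 to 1 are hypothetical; only the choice of conscientiousness as the largest weight reflects this guide.

```python
import math

# Hypothetical category offsets on the log-odds scale.
CATEGORY_EFFECTS = {"work": 0.2, "health": -0.1, "personal": 0.0}


def completion_probability(conscientiousness, priority, complexity,
                           time_pressure, category):
    """Combine the listed factors on the log-odds scale and apply a sigmoid.

    All numeric inputs are assumed to be normalised to [0, 1].
    """
    log_odds = (
        2.0 * (conscientiousness - 0.5)   # strongest predictor
        + 0.8 * (priority - 0.5)          # higher priority: more likely done
        - 1.0 * (complexity - 0.5)        # harder tasks: less likely done
        - 0.6 * (time_pressure - 0.5)     # tight deadlines: less likely done (assumed)
        + CATEGORY_EFFECTS.get(category, 0.0)
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


p = completion_probability(0.8, 0.6, 0.4, 0.3, "work")
```

Working in log-odds keeps the result strictly between 0 and 1 no matter how the individual terms add up.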

### Behavioral Patterns

- Realistic task creation timestamps
- Historical behavior tracking
- Habit completion streaks
- Focus session metrics
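
As an illustration of what a habit completion streak means, the sketch below counts consecutive days with a completion, ending at a given day. The function name and the day-level granularity are assumptions; the script may define streaks differently.

```python
from datetime import date, timedelta


def current_streak(completion_dates, today):
    """Count consecutive days with a completion, ending at `today`."""
    done = set(completion_dates)
    streak = 0
    day = today
    while day in done:
        streak += 1
        day -= timedelta(days=1)
    return streak


today = date(2024, 1, 10)
history = [today, today - timedelta(days=1), today - timedelta(days=3)]
streak = current_streak(history, today)  # the gap two days ago ends the streak
```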

## Testing the Pipeline

### 1. Start Spring Boot Backend

```bash
cd server
mvn spring-boot:run
```

### 2. Start FastAPI ML Service

```bash
cd ML
python main.py
```

### 3. Test Predictions

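```bash
# Get a task ID from the database (run this query in the Supabase SQL Editor or psql):
#   SELECT id FROM tasks LIMIT 1;

# Test the prediction endpoint, substituting the ID returned by the query
TASK_ID="replace-with-task-id"
curl -X POST "http://localhost:8080/api/predictions/task/${TASK_ID}"
```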
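
The same call can be scripted from Python with the standard library. This sketch assumes the endpoint needs no authentication and returns JSON; neither is confirmed by this guide, and both function names are hypothetical.

```python
import json
import urllib.request


def build_prediction_request(task_id, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for the prediction endpoint."""
    url = f"{base_url}/api/predictions/task/{task_id}"
    return urllib.request.Request(url, method="POST")


def request_prediction(task_id, base_url="http://localhost:8080", timeout=10):
    """Send the request and return the decoded JSON body.

    Assumes a JSON response; the response schema is not documented here.
    """
    request = build_prediction_request(task_id, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With both services running, `request_prediction("<task id from the query>")` would issue the same request as the `curl` command.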

### 4. Verify Feature Calculation

Check that the features calculated by Spring Boot match what was used during training:

```sql
SELECT 
    t.id,
    t.complexity,
    t.priority,
    t.category,
    p.openness,
    p.conscientiousness,
    p.extraversion,
    p.agreeableness,
    p.neuroticism
FROM tasks t
JOIN personality_profiles p ON t.user_id = p.user_id
LIMIT 10;
```
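
Once you have a row from the query above and the corresponding feature values the model saw, a small comparison helper can flag differences. This is a sketch: the function name, the dictionary-based representation, and the tolerance are assumptions.

```python
import math


def feature_mismatches(served, expected, tol=1e-6):
    """Return {feature: (served, expected)} for every feature that differs.

    Numeric values are compared with an absolute tolerance; everything else
    (for example category strings) must match exactly.
    """
    mismatches = {}
    for name in sorted(set(served) | set(expected)):
        a, b = served.get(name), expected.get(name)
        both_numeric = isinstance(a, (int, float)) and isinstance(b, (int, float))
        equal = math.isclose(a, b, abs_tol=tol) if both_numeric else a == b
        if not equal:
            mismatches[name] = (a, b)
    return mismatches


served = {"complexity": 3, "category": "work", "conscientiousness": 0.62}
expected = {"complexity": 3, "category": "work", "conscientiousness": 0.71}
diff = feature_mismatches(served, expected)
```

An empty result means the two feature sets agree within the tolerance; a feature missing from one side is reported with `None` in its place.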

## Troubleshooting

### Connection Issues

If you see connection errors:

1. Verify Supabase credentials in `.env`
2. Check Supabase dashboard for database status
3. Ensure your IP is whitelisted in Supabase network settings
4. Verify the database URL format

### Permission Errors

If you see permission errors:

1. Ensure the database user has INSERT permissions
2. Check table constraints are satisfied
3. Verify UUID generation extension is enabled

### Data Already Exists

If data already exists, the script will skip duplicate entries using `ON CONFLICT DO NOTHING`.
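
The effect of that clause can be seen in a small self-contained example. SQLite is used here only as a stand-in so the code runs without a database server; PostgreSQL accepts the same `ON CONFLICT ... DO NOTHING` clause, and psycopg2 uses `%s` placeholders instead of `?`. The `users` table layout is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT)")

insert_sql = (
    "INSERT INTO users (id, email) VALUES (?, ?) "
    "ON CONFLICT (id) DO NOTHING"
)
rows = [("u1", "a@example.com"), ("u2", "b@example.com")]

conn.executemany(insert_sql, rows)  # first run inserts both rows
conn.executemany(insert_sql, rows)  # second run conflicts on id and is skipped

# The table still holds one row per id, and no error was raised.
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)
```

Note that skipped rows are left unchanged, so rerunning the script does not refresh existing records.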

## Data Cleanup

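To clear the database and start fresh, run the statements below. They permanently delete all rows in the listed tables and, because of `CASCADE`, in any tables that reference them: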

```sql
-- In Supabase SQL Editor or psql
TRUNCATE TABLE user_achievements CASCADE;
TRUNCATE TABLE user_streaks CASCADE;
TRUNCATE TABLE habit_completions CASCADE;
TRUNCATE TABLE habits CASCADE;
TRUNCATE TABLE task_predictions CASCADE;
TRUNCATE TABLE behavior_events CASCADE;
TRUNCATE TABLE tasks CASCADE;
TRUNCATE TABLE personality_profiles CASCADE;
TRUNCATE TABLE users CASCADE;
```

## Performance Notes

- The script uses batch inserts (`executemany`) to load rows in bulk
- Connection pooling is recommended for large datasets
- For production databases, consider running the script during off-peak hours
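
For very large runs, it can help to cap how many rows go into each `executemany` call. The sketch below shows one way to do that with any DB-API cursor; the helper names and the batch size of 1000 are assumptions, not taken from the script. With psycopg2 specifically, `psycopg2.extras.execute_values` is usually a faster alternative to `executemany`.

```python
from itertools import islice


def batched(rows, size):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


def insert_in_batches(cursor, sql, rows, size=1000):
    """Call cursor.executemany once per batch and return the number of batches."""
    batches = 0
    for batch in batched(rows, size):
        cursor.executemany(sql, batch)
        batches += 1
    return batches
```

Because `batched` consumes an iterator lazily, the rows can come from a generator without holding the whole dataset in memory.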

## Next Steps

After populating the database:

1. Test your Spring Boot API endpoints
2. Verify ML service predictions
3. Check feature calculation accuracy
4. Validate frontend displays data correctly
5. Run end-to-end tests