# Database Population Guide
This guide explains how to populate your Supabase database with synthetic data for testing the Cognexa ML pipeline.
## Overview
The `populate_database.py` script creates realistic synthetic data that simulates real users and their behavior. This allows you to:
- Test the full ML pipeline from database → Spring Boot → FastAPI → predictions
- Verify feature calculation matches training data
- Test your backend API endpoints with realistic data
- Validate your ML service can predict on real database rows
## Prerequisites
1. **Supabase Database**: Ensure your Supabase database is created and running
2. **Database Credentials**: Get your Supabase connection details from the Supabase dashboard
3. **Python Dependencies**: Install required packages
## Setup
### 1. Configure Database Credentials
Copy `.env.example` to `.env` and update with your Supabase credentials:
```bash
cd ML
cp .env.example .env
```
Edit `.env` with your Supabase connection details:
```env
SUPABASE_DB_URL=jdbc:postgresql://aws-1-ap-southeast-2.pooler.supabase.com:6543/postgres?prepareThreshold=0&tcpKeepAlive=true&socketTimeout=0&connectTimeout=30
SUPABASE_DB_USERNAME=postgres.your-project-ref
SUPABASE_DB_PASSWORD=your-supabase-password
```
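The example URL above is in JDBC form, which the Spring Boot backend understands but `psycopg2.connect()` does not accept directly. How `populate_database.py` handles this is not shown here; the following is a minimal sketch of one way to convert such a URL into keyword arguments for `psycopg2.connect()`. The helper name `jdbc_to_conn_kwargs` is hypothetical, and the sketch assumes the JDBC-only query parameters (such as `prepareThreshold`) can simply be dropped.

```python
from urllib.parse import urlsplit


def jdbc_to_conn_kwargs(jdbc_url, user, password):
    """Turn a jdbc:postgresql:// URL into keyword arguments for psycopg2.connect().

    Hypothetical helper for illustration; the real script may parse the URL differently.
    """
    prefix = "jdbc:"
    if not jdbc_url.startswith(prefix + "postgresql://"):
        raise ValueError("expected a jdbc:postgresql:// URL")
    # Strip the "jdbc:" prefix so the rest parses as an ordinary URL.
    parts = urlsplit(jdbc_url[len(prefix):])
    # JDBC-specific query parameters are intentionally ignored (assumption).
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,
        "dbname": parts.path.lstrip("/") or "postgres",
        "user": user,
        "password": password,
    }
```

With the values loaded from `.env` (for example via `python-dotenv`), the result could be passed as `psycopg2.connect(**kwargs)`.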
### 2. Install Dependencies
```bash
cd ML
pip install psycopg2-binary python-dotenv numpy
```
## Usage
### Basic Usage
```bash
cd ML
python populate_database.py
```
With the default settings, this creates 50 users with 100 tasks each (5,000 tasks in total).
### Custom Parameters
```bash
# Create 20 users with 50 tasks each
python populate_database.py --users 20 --tasks 50
# Create 100 users with 200 tasks each
python populate_database.py --users 100 --tasks 200
```
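For orientation, flags like `--users` and `--tasks` are commonly declared with `argparse`. The sketch below is an assumption about how such a command line could be defined, not the script's actual code; only the flag names and the defaults (50 and 100) come from this guide.

```python
import argparse


def build_parser():
    """Illustrative parser; flag names and defaults follow this guide, the rest is assumed."""
    parser = argparse.ArgumentParser(
        description="Populate the database with synthetic data"
    )
    parser.add_argument("--users", type=int, default=50,
                        help="number of synthetic users to create")
    parser.add_argument("--tasks", type=int, default=100,
                        help="number of tasks to create per user")
    return parser
```

Calling `build_parser().parse_args(["--users", "20", "--tasks", "50"])` yields an object with `users == 20` and `tasks == 50`.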
## What Gets Created
### Per User
| Data Type | Quantity | Description |
|-----------|----------|-------------|
| Users | 1 | User account with personality profile |
| Tasks | 100 | Realistic tasks with completion probabilities |
| Behavior Events | 50 | Task and focus session events |
| Habits | 5 | Habit tracking with completions |
| Analytics | 30 | Daily productivity metrics |
| Task Predictions | 100 | ML predictions for each task |
| Interventions | 20 | Proactive behavioral interventions |
| Coaching Insights | 15 | AI-generated coaching recommendations |
| Notifications | 30 | User notifications |
| Prediction Feedback | 40 | User feedback on predictions |
| Activity Logs | 50 | User activity tracking |
| User Streaks | 8 | Habit streak tracking |
| Achievements | 10 | Gamification achievements |
| Focus Sessions | 25 | Pomodoro session tracking |
| User Settings | 1 | User preferences |
| Wellbeing Data | 30 | Daily mood and energy tracking |
| Goals | 5 | User goals with progress |
| Recurring Tasks | 10 | Recurring task patterns |
| Saved Filters | 5 | Task filter configurations |
| Research Data | 1 | Research consent and participation |
| Export History | 3 | Data export records |
| ML Experiments | 1 | Model training records |
| AB Testing | 1 | Feature experiment records |
| EMA Prompts | 20 | Experience sampling prompts |
| Task Templates | 10 | Reusable task templates |
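As a rough sizing aid, the per-user quantities in the table can be multiplied out to estimate how many rows a run will insert. The sketch below assumes that only Tasks and Task Predictions scale with `--tasks` and that every other quantity stays fixed per user; it also ignores any child tables not listed above (such as habit completions), so treat the result as an approximation.

```python
# Per-user quantities copied from the table above, excluding the two
# task-dependent rows (Tasks and Task Predictions).
FIXED_PER_USER = {
    "users": 1, "behavior_events": 50, "habits": 5, "analytics": 30,
    "interventions": 20, "coaching_insights": 15, "notifications": 30,
    "prediction_feedback": 40, "activity_logs": 50, "user_streaks": 8,
    "achievements": 10, "focus_sessions": 25, "user_settings": 1,
    "wellbeing_data": 30, "goals": 5, "recurring_tasks": 10,
    "saved_filters": 5, "research_data": 1, "export_history": 3,
    "ml_experiments": 1, "ab_testing": 1, "ema_prompts": 20,
    "task_templates": 10,
}


def estimate_rows(users=50, tasks_per_user=100):
    """Estimate total inserted rows, assuming one prediction per task."""
    per_user = sum(FIXED_PER_USER.values()) + 2 * tasks_per_user
    return users * per_user
```

Under these assumptions, the default run inserts 571 rows per user.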
## Data Generation Logic
### Personality Profiles
Uses the Big Five (OCEAN) personality model with realistic correlations:
- **Openness**: Creativity and curiosity
- **Conscientiousness**: Organization and dependability
- **Extraversion**: Sociability and energy
- **Agreeableness**: Cooperativeness
- **Neuroticism**: Emotional instability (the inverse of emotional stability)
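One way to generate trait scores with correlations is to draw independent standard normal values and mix them through the Cholesky factor of a correlation matrix. The sketch below uses only the standard library; the correlation values, the 0.5 mean, and the 0.15 spread are illustrative assumptions and are not taken from `populate_database.py`.

```python
import math
import random

TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]

# Illustrative correlation matrix (assumed values). Rows and columns follow TRAITS.
CORR = [
    [1.0,  0.0,  0.2,  0.0,  0.0],
    [0.0,  1.0,  0.0,  0.2, -0.2],
    [0.2,  0.0,  1.0,  0.0, -0.2],
    [0.0,  0.2,  0.0,  1.0, -0.2],
    [0.0, -0.2, -0.2, -0.2,  1.0],
]


def cholesky(matrix):
    """Return lower-triangular L with L * L^T == matrix (positive definite input)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def sample_profiles(n_users, seed=0):
    """Draw n_users trait dictionaries with scores clipped to the range 0..1."""
    rng = random.Random(seed)
    lower = cholesky(CORR)
    profiles = []
    for _ in range(n_users):
        z = [rng.gauss(0.0, 1.0) for _ in TRAITS]
        # Mixing independent draws through L introduces the target correlations.
        mixed = [sum(lower[i][k] * z[k] for k in range(i + 1))
                 for i in range(len(TRAITS))]
        scores = [min(1.0, max(0.0, 0.5 + 0.15 * value)) for value in mixed]
        profiles.append(dict(zip(TRAITS, scores)))
    return profiles
```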
### Task Completion Probability
Calculated based on:
- Personality-task interaction (conscientiousness is the strongest predictor)
- Task priority and complexity
- Time pressure
- Category-specific effects
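The factors above can be combined in many ways; a logistic model is one common choice. The sketch below is an assumption for illustration only: the weights, the sign of each term, the 0..1 input scale, and the category names are hypothetical. The only detail taken from this guide is that conscientiousness carries the largest weight.

```python
import math

# Hypothetical category effects on the log-odds scale.
CATEGORY_EFFECT = {"work": 0.2, "personal": 0.0, "health": -0.1}


def completion_probability(conscientiousness, priority, complexity,
                           time_pressure, category):
    """Return a completion probability in (0, 1).

    All numeric inputs are assumed to be scaled to the range 0..1.
    """
    logit = (
        -0.5
        + 2.0 * conscientiousness   # strongest predictor, per this guide
        + 0.8 * priority            # higher priority raises the odds (assumed)
        - 1.0 * complexity          # harder tasks lower the odds (assumed)
        - 0.6 * time_pressure       # tight deadlines lower the odds (assumed)
        + CATEGORY_EFFECT.get(category, 0.0)
    )
    return 1.0 / (1.0 + math.exp(-logit))
```

A generator could then mark a task as completed when `random.random()` falls below this probability.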
### Behavioral Patterns
- Realistic task creation timestamps
- Historical behavior tracking
- Habit completion streaks
- Focus session metrics
## Testing the Pipeline
### 1. Start Spring Boot Backend
```bash
cd server
mvn spring-boot:run
```
### 2. Start FastAPI ML Service
```bash
cd ML
python main.py
```
### 3. Test Predictions
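```bash
# Get a task ID from the database by running this query
# in the Supabase SQL Editor or psql:
#   SELECT id FROM tasks LIMIT 1;

# Test the prediction endpoint, replacing {taskId} with the returned ID
curl -X POST "http://localhost:8080/api/predictions/task/{taskId}"
```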
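The same request can be issued from Python, which is convenient when checking many task IDs. This is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes the endpoint returns a JSON body, which this guide does not specify.

```python
import json
import urllib.request


def build_prediction_request(task_id, base_url="http://localhost:8080"):
    """Build a POST request for the prediction endpoint without sending it."""
    url = f"{base_url}/api/predictions/task/{task_id}"
    return urllib.request.Request(url, method="POST")


def request_prediction(task_id, base_url="http://localhost:8080", timeout=10):
    """Send the request and decode the response (assumed to be JSON)."""
    request = build_prediction_request(task_id, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

`request_prediction` requires both the Spring Boot backend and the FastAPI service to be running.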
### 4. Verify Feature Calculation
Check that the features calculated by Spring Boot match what was used during training:
```sql
SELECT
    t.id,
    t.complexity,
    t.priority,
    t.category,
    p.openness,
    p.conscientiousness,
    p.extraversion,
    p.agreeableness,
    p.neuroticism
FROM tasks t
JOIN personality_profiles p ON t.user_id = p.user_id
LIMIT 10;
```
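Once a row from this query and the feature values seen by the ML service are both available as dictionaries, a small comparison helper makes mismatches easy to spot. The sketch below is an assumption about how such a check could look; how the service exposes its features is not covered in this guide, and the helper name and tolerance are hypothetical.

```python
import math


def feature_mismatches(expected, actual, tol=1e-6):
    """Return {name: (expected, actual)} for features that differ or are missing.

    Numeric values are compared with an absolute tolerance; everything else
    must be exactly equal.
    """
    mismatches = {}
    for name, want in expected.items():
        got = actual.get(name)
        both_numeric = (isinstance(want, (int, float))
                        and isinstance(got, (int, float)))
        if both_numeric:
            same = math.isclose(want, got, abs_tol=tol)
        else:
            same = want == got
        if not same:
            mismatches[name] = (want, got)
    return mismatches
```

An empty result means every expected feature was found with a matching value.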
## Troubleshooting
### Connection Issues
If you see connection errors:
1. Verify Supabase credentials in `.env`
2. Check Supabase dashboard for database status
3. Ensure your IP address is allowed in the Supabase network settings
4. Verify the database URL format
### Permission Errors
If you see permission errors:
1. Ensure the database user has INSERT permissions
2. Check table constraints are satisfied
3. Verify UUID generation extension is enabled
### Data Already Exists
If data already exists, the script will skip duplicate entries using `ON CONFLICT DO NOTHING`.
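The effect of `ON CONFLICT DO NOTHING` can be tried locally without touching Supabase. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL (SQLite 3.24 or newer supports the same clause); the `users` columns shown are hypothetical, and with `psycopg2` the placeholders would be `%s` rather than `?`.

```python
import sqlite3


def insert_users(conn, rows):
    """Insert rows, silently skipping any whose primary key already exists."""
    conn.executemany(
        "INSERT INTO users (id, email) VALUES (?, ?) "
        "ON CONFLICT (id) DO NOTHING",
        rows,
    )
    conn.commit()


def demo():
    """Run the same insert twice and return the resulting row count."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT)")
    rows = [("u1", "a@example.com"), ("u2", "b@example.com")]
    insert_users(conn, rows)
    insert_users(conn, rows)  # second run hits the conflict clause for every row
    count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    conn.close()
    return count
```

Note that the clause only skips rows that collide on a unique key; rows generated with fresh random IDs would be inserted again.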
## Data Cleanup
To clear the database and start fresh:
```sql
-- In Supabase SQL Editor or psql
TRUNCATE TABLE user_achievements CASCADE;
TRUNCATE TABLE user_streaks CASCADE;
TRUNCATE TABLE habit_completions CASCADE;
TRUNCATE TABLE habits CASCADE;
TRUNCATE TABLE task_predictions CASCADE;
TRUNCATE TABLE behavior_events CASCADE;
TRUNCATE TABLE tasks CASCADE;
TRUNCATE TABLE personality_profiles CASCADE;
TRUNCATE TABLE users CASCADE;
```
## Performance Notes
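- Uses batch inserts (`executemany`); for very large runs, `psycopg2.extras.execute_values` is typically faster because it sends many rows per statement
- Connection pooling is recommended for large datasets
- Consider running during off-peak hours if you populate a production database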
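For large datasets, splitting the generated rows into fixed-size batches keeps each insert statement and transaction at a manageable size. The helper below is a generic sketch, not code from `populate_database.py`; the name `chunked` and any batch size you choose are assumptions.

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each batch can then be passed to `cursor.executemany(...)` and committed before moving on to the next one.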
## Next Steps
After populating the database:
1. Test your Spring Boot API endpoints
2. Verify ML service predictions
3. Check feature calculation accuracy
4. Validate frontend displays data correctly
5. Run end-to-end tests