	# Database Population Guide

	This guide explains how to populate your Supabase database with synthetic data for testing the Cognexa ML pipeline.

	## Overview

	The `populate_database.py` script creates realistic synthetic data that simulates real users and their behavior. This allows you to:

	- Test the full ML pipeline from database → Spring Boot → FastAPI → predictions
	- Verify feature calculation matches training data
	- Test your backend API endpoints with realistic data
	- Validate your ML service can predict on real database rows

	## Prerequisites

	1. Supabase Database: Ensure your Supabase database is created and running
	2. Database Credentials: Get your Supabase connection details from the Supabase dashboard
	3. Python Dependencies: Install required packages

	## Setup

	### 1. Configure Database Credentials

	Copy `.env.example` to `.env` and update with your Supabase credentials:

	```bash
	cd ML
	cp .env.example .env
	```

	Edit `.env` with your Supabase connection details:

	```env
	SUPABASE_DB_URL=jdbc:postgresql://aws-1-ap-southeast-2.pooler.supabase.com:6543/postgres?prepareThreshold=0&tcpKeepAlive=true&socketTimeout=0&connectTimeout=30
	SUPABASE_DB_USERNAME=postgres.your-project-ref
	SUPABASE_DB_PASSWORD=your-supabase-password
	```
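
The `SUPABASE_DB_URL` above is in JDBC format, which Spring Boot can read but which psycopg2 does not accept as a connection string. How `populate_database.py` handles this is not shown in this guide; the following is a minimal sketch of one possible conversion. The helper name `jdbc_to_pg_params` is hypothetical, and JDBC-only query options such as `prepareThreshold` are simply dropped.

```python
from urllib.parse import urlparse


def jdbc_to_pg_params(jdbc_url: str, user: str, password: str) -> dict:
    """Convert a JDBC PostgreSQL URL into keyword arguments for psycopg2.connect().

    Hypothetical helper for illustration; the real script may parse the URL differently.
    """
    prefix = "jdbc:"
    if not jdbc_url.startswith(prefix + "postgresql://"):
        raise ValueError("expected a URL starting with jdbc:postgresql://")
    # Strip the "jdbc:" prefix so the rest parses as an ordinary URL.
    parsed = urlparse(jdbc_url[len(prefix):])
    return {
        "host": parsed.hostname,
        "port": parsed.port or 5432,
        "dbname": parsed.path.lstrip("/") or "postgres",
        "user": user,
        "password": password,
        # The query string (JDBC driver options) is intentionally ignored.
    }
```

The resulting dictionary could then be passed as `psycopg2.connect(**params)`.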

	### 2. Install Dependencies

	```bash
	cd ML
	pip install psycopg2-binary python-dotenv numpy
	```

	## Usage

	### Basic Usage

	```bash
	cd ML
	python populate_database.py
	```

	This creates 50 users with 100 tasks each (5000 total tasks).

	### Custom Parameters

	```bash
	# Create 20 users with 50 tasks each
	python populate_database.py --users 20 --tasks 50

	# Create 100 users with 200 tasks each
	python populate_database.py --users 100 --tasks 200
	```
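
For reference, flags like these are typically wired up with `argparse`. The sketch below assumes the defaults match the basic usage described above (50 users, 100 tasks each); the function names `build_parser` and `total_tasks` are hypothetical and not taken from the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two flags shown above (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Populate the database with synthetic data"
    )
    parser.add_argument("--users", type=int, default=50,
                        help="number of users to create")
    parser.add_argument("--tasks", type=int, default=100,
                        help="number of tasks to create per user")
    return parser


def total_tasks(args: argparse.Namespace) -> int:
    """Total number of task rows the run would create."""
    return args.users * args.tasks
```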

	## What Gets Created

	### Per User

	| Data Type | Quantity | Description |
	|-----------|----------|-------------|
	| Users | 1 | User account with personality profile |
	| Tasks | 100 | Realistic tasks with completion probabilities |
	| Behavior Events | 50 | Task and focus session events |
	| Habits | 5 | Habit tracking with completions |
	| Analytics | 30 | Daily productivity metrics |
	| Task Predictions | 100 | ML predictions for each task |
	| Interventions | 20 | Proactive behavioral interventions |
	| Coaching Insights | 15 | AI-generated coaching recommendations |
	| Notifications | 30 | User notifications |
	| Prediction Feedback | 40 | User feedback on predictions |
	| Activity Logs | 50 | User activity tracking |
	| User Streaks | 8 | Habit streak tracking |
	| Achievements | 10 | Gamification achievements |
	| Focus Sessions | 25 | Pomodoro session tracking |
	| User Settings | 1 | User preferences |
	| Wellbeing Data | 30 | Daily mood and energy tracking |
	| Goals | 5 | User goals with progress |
	| Recurring Tasks | 10 | Recurring task patterns |
	| Saved Filters | 5 | Task filter configurations |
	| Research Data | 1 | Research consent and participation |
	| Export History | 3 | Data export records |
	| ML Experiments | 1 | Model training records |
	| AB Testing | 1 | Feature experiment records |
	| EMA Prompts | 20 | Experience sampling prompts |
	| Task Templates | 10 | Reusable task templates |

	## Data Generation Logic

	### Personality Profiles

	Uses the Big Five (OCEAN) personality model with realistic correlations:
	- Openness: Creativity and curiosity
	- Conscientiousness: Organization and dependability
	- Extraversion: Sociability and energy
	- Agreeableness: Cooperativeness
	- Neuroticism: Emotional instability (the inverse of emotional stability)
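
One common way to generate correlated trait scores is to draw from a multivariate normal distribution. The sketch below illustrates that technique only: the correlation values, the mean of 0.5, the standard deviation of 0.15, and the 0 to 1 scale are all assumptions made for the example, not values taken from `populate_database.py`.

```python
import numpy as np

TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]

# Illustrative correlation matrix (assumed values). Rows and columns follow
# the order in TRAITS; neuroticism correlates negatively with the others.
CORR = np.array([
    [1.0,  0.1,  0.2,  0.1, -0.1],
    [0.1,  1.0,  0.1,  0.2, -0.3],
    [0.2,  0.1,  1.0,  0.1, -0.2],
    [0.1,  0.2,  0.1,  1.0, -0.2],
    [-0.1, -0.3, -0.2, -0.2, 1.0],
])


def sample_profiles(n_users: int, seed: int = 0,
                    mean: float = 0.5, std: float = 0.15) -> np.ndarray:
    """Return an (n_users, 5) array of trait scores clipped to [0, 1]."""
    rng = np.random.default_rng(seed)
    cov = CORR * std ** 2  # scale correlations into a covariance matrix
    scores = rng.multivariate_normal(np.full(len(TRAITS), mean), cov,
                                     size=n_users)
    return np.clip(scores, 0.0, 1.0)
```

Each row is one user's profile, with columns in the order given by `TRAITS`.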

	### Task Completion Probability

	Calculated based on:
	- Personality-task interaction (conscientiousness is strongest predictor)
	- Task priority and complexity
	- Time pressure
	- Category-specific effects
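
The factors above can be combined in many ways; a logistic model is one simple option. In this sketch every weight, the 0 to 1 input scale, and the direction of the time-pressure effect are assumptions chosen for illustration. The only property taken from this guide is that conscientiousness carries the largest weight.

```python
import math


def completion_probability(conscientiousness: float, neuroticism: float,
                           priority: float, complexity: float,
                           time_pressure: float,
                           category_effect: float = 0.0) -> float:
    """Hypothetical completion probability; all inputs are assumed to lie in [0, 1]."""
    logit = (
        2.0 * (conscientiousness - 0.5)   # strongest predictor (per the guide)
        - 0.8 * (neuroticism - 0.5)       # assumed weight
        + 0.5 * (priority - 0.5)          # assumed weight
        - 0.7 * (complexity - 0.5)        # assumed weight
        - 0.4 * (time_pressure - 0.5)     # assumed weight and direction
        + category_effect                 # assumed additive per-category offset
    )
    return 1.0 / (1.0 + math.exp(-logit))
```

With every input at its midpoint of 0.5 and no category offset, the logit is zero and the function returns 0.5.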

	### Behavioral Patterns

	- Realistic task creation timestamps
	- Historical behavior tracking
	- Habit completion streaks
	- Focus session metrics
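
As an illustration of what "realistic" timestamps could mean, the sketch below spreads task creation times over the preceding 30 days and restricts them to daytime hours. The 30-day window, the 08:00 to 21:59 range, and the function name `sample_task_timestamps` are assumptions for this example; the script's actual rules are not documented here.

```python
import random
from datetime import datetime, timedelta


def sample_task_timestamps(n_tasks: int, now: datetime,
                           days_back: int = 30, seed: int = 0) -> list:
    """Return n_tasks creation timestamps, sorted oldest first."""
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    stamps = []
    for _ in range(n_tasks):
        # Pick a day between 1 and days_back days before `now`.
        day = now.date() - timedelta(days=rng.randint(1, days_back))
        hour = rng.randint(8, 21)     # assumed waking/working hours
        minute = rng.randint(0, 59)
        stamps.append(datetime(day.year, day.month, day.day, hour, minute))
    return sorted(stamps)
```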

	## Testing the Pipeline

	### 1. Start Spring Boot Backend

	```bash
	cd server
	mvn spring-boot:run
	```

	### 2. Start FastAPI ML Service

	```bash
	cd ML
	python main.py
	```

	### 3. Test Predictions

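	```bash
	# Get a task ID from the database first (run in the Supabase SQL Editor or psql):
	#   SELECT id FROM tasks LIMIT 1;

	# Test the prediction endpoint, replacing {taskId} with the ID returned above
	curl -X POST "http://localhost:8080/api/predictions/task/{taskId}"
	```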
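
The same request can be issued from Python using only the standard library. This is a minimal sketch: `build_prediction_request` is a hypothetical helper, the base URL assumes the Spring Boot backend is listening on port 8080 as in the curl command, and the response format is not documented in this guide.

```python
import urllib.request


def build_prediction_request(task_id: str,
                             base_url: str = "http://localhost:8080"
                             ) -> urllib.request.Request:
    """Build a POST request for the prediction endpoint without sending it."""
    url = f"{base_url}/api/predictions/task/{task_id}"
    # An empty body mirrors `curl -X POST` with no payload.
    return urllib.request.Request(url, data=b"", method="POST")
```

To send it, pass the result to `urllib.request.urlopen(request, timeout=10)` while both services are running, then read the response body.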

	### 4. Verify Feature Calculation

	Check that the features calculated by Spring Boot match what was used during training:

	```sql
	SELECT
	t.id,
	t.complexity,
	t.priority,
	t.category,
	p.openness,
	p.conscientiousness,
	p.extraversion,
	p.agreeableness,
	p.neuroticism
	FROM tasks t
	JOIN personality_profiles p ON t.user_id = p.user_id
	LIMIT 10;
	```
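
Once the rows from this query are available, they can be compared with the feature values the backend sends to the ML service. The sketch below is one hedged way to do that comparison; `diff_features` is a hypothetical helper, and how you capture the backend's features (logs, a debug endpoint) is left open because this guide does not specify it.

```python
import math


def diff_features(expected: dict, actual: dict, tol: float = 1e-6) -> dict:
    """Return {name: (expected, actual)} for every feature that differs.

    Numeric values are compared with an absolute tolerance; everything else
    (such as category strings) must match exactly. Missing keys show as None.
    """
    mismatches = {}
    for name in sorted(set(expected) | set(actual)):
        exp, act = expected.get(name), actual.get(name)
        if isinstance(exp, (int, float)) and isinstance(act, (int, float)):
            same = math.isclose(exp, act, abs_tol=tol)
        else:
            same = exp == act
        if not same:
            mismatches[name] = (exp, act)
    return mismatches
```

An empty result means the two feature sets agree within the tolerance.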

	## Troubleshooting

	### Connection Issues

	If you see connection errors:

	1. Verify Supabase credentials in `.env`
	2. Check Supabase dashboard for database status
	3. Ensure your IP is whitelisted in Supabase network settings
	4. Verify the database URL format

	### Permission Errors

	If you see permission errors:

	1. Ensure the database user has INSERT permissions
	2. Check table constraints are satisfied
	3. Verify that a UUID generation function is available (for example `gen_random_uuid()`, or `uuid_generate_v4()` from the `uuid-ossp` extension)

	### Data Already Exists

	If data already exists, the script will skip duplicate entries using `ON CONFLICT DO NOTHING`.
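
The effect of `ON CONFLICT DO NOTHING` can be demonstrated without a Supabase connection. The sketch below uses Python's built-in `sqlite3` (SQLite 3.24 or later) as a stand-in for PostgreSQL, because both accept this clause; the `users` table and its columns are hypothetical. With psycopg2 the placeholders would be `%s` instead of `?`.

```python
import sqlite3


def insert_users(conn: sqlite3.Connection, rows: list) -> None:
    """Insert rows in one batch, silently skipping any that already exist."""
    conn.executemany(
        "INSERT INTO users (id, email) VALUES (?, ?) ON CONFLICT DO NOTHING",
        rows,
    )
    conn.commit()


# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT UNIQUE)")

rows = [("u1", "a@example.com"), ("u2", "b@example.com")]
insert_users(conn, rows)
insert_users(conn, rows)  # second run hits the primary key and inserts nothing

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

After both calls `count` is 2, which is why rerunning a script built this way does not raise duplicate-key errors.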

	## Data Cleanup

	To clear the database and start fresh:

	```sql
	-- In Supabase SQL Editor or psql
	TRUNCATE TABLE user_achievements CASCADE;
	TRUNCATE TABLE user_streaks CASCADE;
	TRUNCATE TABLE habit_completions CASCADE;
	TRUNCATE TABLE habits CASCADE;
	TRUNCATE TABLE task_predictions CASCADE;
	TRUNCATE TABLE behavior_events CASCADE;
	TRUNCATE TABLE tasks CASCADE;
	TRUNCATE TABLE personality_profiles CASCADE;
	TRUNCATE TABLE users CASCADE;
	```

	## Performance Notes

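	- Uses batch inserts (`executemany`); note that psycopg2's `executemany` is not faster than calling `execute` in a loop, so for large datasets `execute_values` or `execute_batch` from `psycopg2.extras` is usually the quicker choice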
	- Connection pooling recommended for large datasets
	- Consider running during off-peak hours for production databases
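
For large runs it can also help to split rows into fixed-size batches so that each insert call and commit stays small. The generator below is a generic sketch of that idea; `chunked` is a hypothetical helper and the batch size is left to the caller.

```python
from itertools import islice


def chunked(rows, size: int):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return  # iterable exhausted
        yield batch
```

Each batch can then be passed to one `executemany` (or `execute_values`) call, followed by a commit.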

	## Next Steps

	After populating the database:

	1. Test your Spring Boot API endpoints
	2. Verify ML service predictions
	3. Check feature calculation accuracy
	4. Validate frontend displays data correctly
	5. Run end-to-end tests