Commit fc23b4d · Parent(s): 7bccc03
Pulastya B committed
docs: Clean up repository and update README
- Remove redundant documentation files (CHECKLIST, FRONTEND_INTEGRATION, MIGRATION_COMPLETE, etc.)
- Remove old/unused files (chat_ui.py, test_environment.py, cloudbuild.yaml)
- Delete duplicate README in FRRONTEEEND folder
- Create comprehensive single README with clear sections
- Add badges, architecture diagram, and detailed usage instructions
- Include example workflow and tech stack details
- Add contact information and acknowledgments
- BIGQUERY_SCHEMAS.md +0 -691
- CHECKLIST.md +0 -97
- DEPLOYMENT.md +0 -495
- FRONTEND_INTEGRATION.md +0 -234
- FRRONTEEEND/README.md +0 -20
- GEMINI_UPDATE.md +0 -93
- MIGRATION_COMPLETE.md +0 -325
- QUICK_REFERENCE.txt +0 -71
- README.md +248 -515
- chat_ui.py +0 -1073
- cloudbuild.yaml +0 -69
- setup-deployment.sh +0 -78
- test_environment.py +0 -48
BIGQUERY_SCHEMAS.md
DELETED
@@ -1,691 +0,0 @@
# BigQuery Output Schemas for Looker Compatibility

**Purpose**: Define stable BigQuery table schemas that BI tools (Looker, Data Studio) can query reliably.

**Design Principles**:
- ✅ **Stable Schema**: No breaking changes without versioning
- ✅ **Consistent Naming**: snake_case columns, clear dimension/metric separation
- ✅ **BI-Friendly Types**: Standard SQL types, no complex nested structures
- ✅ **Documented Grain**: Clear primary keys and update patterns
- ✅ **Dashboard-Ready**: Metrics aligned with common visualizations

---

## 📊 Table 1: `model_metrics`

**Description**: Model performance metrics tracked over time for monitoring and comparison.

**Use Cases**:
- Performance dashboards
- Model comparison reports
- Drift detection alerts
- A/B test analysis

**Update Frequency**: On every model training run

**Grain**: One row per model training execution

### Schema

| Column Name | Type | Description | Dimension/Metric | Example |
|------------|------|-------------|------------------|---------|
| `project_id` | STRING | Google Cloud project ID | Dimension | `my-ml-project` |
| `dataset_id` | STRING | BigQuery dataset name | Dimension | `ml_models` |
| `model_id` | STRING | Unique model identifier | Dimension (Primary Key) | `xgboost_churn_20251223_153045` |
| `model_name` | STRING | Human-readable model name | Dimension | `Customer Churn Predictor` |
| `model_type` | STRING | Algorithm used | Dimension | `XGBoost`, `RandomForest`, `LightGBM` |
| `task_type` | STRING | ML task category | Dimension | `classification`, `regression` |
| `training_dataset` | STRING | Source table/file reference | Dimension | `project.dataset.train_data` |
| `target_column` | STRING | Prediction target name | Dimension | `churn`, `price`, `survived` |
| `created_at` | TIMESTAMP | Model training timestamp | Dimension (Time) | `2025-12-23 15:30:45 UTC` |
| `created_date` | DATE | Training date (for partitioning) | Dimension (Time) | `2025-12-23` |
| `feature_count` | INTEGER | Number of features used | Metric | `42` |
| `training_rows` | INTEGER | Training set size | Metric | `10000` |
| `test_rows` | INTEGER | Test set size | Metric | `2500` |
| `training_duration_seconds` | FLOAT | Time to train model | Metric | `123.45` |
| `accuracy` | FLOAT | Overall accuracy (0-1) | Metric | `0.95` |
| `precision` | FLOAT | Precision score (0-1) | Metric | `0.92` |
| `recall` | FLOAT | Recall score (0-1) | Metric | `0.88` |
| `f1_score` | FLOAT | F1 score (0-1) | Metric | `0.90` |
| `roc_auc` | FLOAT | ROC AUC score (0-1) | Metric | `0.94` |
| `pr_auc` | FLOAT | Precision-Recall AUC (0-1) | Metric | `0.91` |
| `mae` | FLOAT | Mean Absolute Error (regression) | Metric | `1234.56` |
| `mse` | FLOAT | Mean Squared Error (regression) | Metric | `567890.12` |
| `rmse` | FLOAT | Root Mean Squared Error (regression) | Metric | `753.59` |
| `r2_score` | FLOAT | R² coefficient (regression) | Metric | `0.85` |
| `cross_val_mean` | FLOAT | Mean CV score | Metric | `0.93` |
| `cross_val_std` | FLOAT | CV score std deviation | Metric | `0.02` |
| `hyperparameters` | STRING (JSON) | Model hyperparameters | Metadata | `{"max_depth": 6, "n_estimators": 100}` |
| `version` | STRING | Model version tag | Dimension | `v1.2.3` |
| `environment` | STRING | Training environment | Dimension | `production`, `staging`, `development` |
| `user_email` | STRING | User who trained model | Dimension | `data-scientist@company.com` |

### Partitioning & Clustering

```sql
-- Recommended table setup
CREATE TABLE `project.dataset.model_metrics`
(
  -- columns as above
)
PARTITION BY created_date
CLUSTER BY model_type, task_type, environment
OPTIONS(
  description="Model performance metrics for BI dashboards",
  require_partition_filter=true
);
```

### Primary Dimensions for Looker

- **Time**: `created_at`, `created_date`
- **Model**: `model_type`, `model_name`, `task_type`
- **Performance Tier**: CASE expression on `accuracy`/`f1_score`
  - `Excellent` (>0.90)
  - `Good` (0.80-0.90)
  - `Fair` (0.70-0.80)
  - `Poor` (<0.70)

### Sample Looker View

```lookml
view: model_metrics {
  sql_table_name: `project.dataset.model_metrics` ;;

  dimension: model_id {
    primary_key: yes
    type: string
    sql: ${TABLE}.model_id ;;
  }

  dimension_group: created {
    type: time
    timeframes: [date, week, month, quarter, year]
    sql: ${TABLE}.created_at ;;
  }

  dimension: model_type {
    type: string
    sql: ${TABLE}.model_type ;;
  }

  dimension: performance_tier {
    type: string
    sql: CASE
      WHEN ${TABLE}.accuracy >= 0.90 THEN 'Excellent'
      WHEN ${TABLE}.accuracy >= 0.80 THEN 'Good'
      WHEN ${TABLE}.accuracy >= 0.70 THEN 'Fair'
      ELSE 'Poor'
    END ;;
  }

  measure: count {
    type: count
  }

  measure: avg_accuracy {
    type: average
    sql: ${TABLE}.accuracy ;;
    value_format_name: percent_2
  }

  measure: avg_f1_score {
    type: average
    sql: ${TABLE}.f1_score ;;
    value_format_name: percent_2
  }
}
```

---

## 🎯 Table 2: `feature_importance`

**Description**: Feature importance scores for model interpretability.

**Use Cases**:
- Feature impact analysis
- Feature selection dashboards
- Model explainability reports

**Update Frequency**: On every model training run

**Grain**: One row per feature per model

### Schema

| Column Name | Type | Description | Dimension/Metric | Example |
|------------|------|-------------|------------------|---------|
| `model_id` | STRING | Foreign key to model_metrics | Dimension (Foreign Key) | `xgboost_churn_20251223_153045` |
| `feature_name` | STRING | Name of the feature | Dimension (Primary Key) | `age`, `total_purchases`, `days_since_last_login` |
| `importance_score` | FLOAT | Importance value (0-1) | Metric | `0.35` |
| `importance_rank` | INTEGER | Rank by importance (1=most important) | Metric | `1`, `2`, `3` |
| `importance_type` | STRING | Calculation method | Dimension | `gain`, `weight`, `cover`, `shap` |
| `feature_type` | STRING | Data type category | Dimension | `numeric`, `categorical`, `datetime`, `text` |
| `is_engineered` | BOOLEAN | Created by feature engineering? | Dimension | `true`, `false` |
| `created_at` | TIMESTAMP | When importance was calculated | Dimension (Time) | `2025-12-23 15:30:45 UTC` |
| `created_date` | DATE | Calculation date | Dimension (Time) | `2025-12-23` |

### Partitioning & Clustering

```sql
CREATE TABLE `project.dataset.feature_importance`
(
  -- columns as above
)
PARTITION BY created_date
CLUSTER BY model_id, importance_rank
OPTIONS(
  description="Feature importance scores for model explainability",
  require_partition_filter=false -- Allow cross-model queries
);
```

### Primary Dimensions for Looker

- **Feature**: `feature_name`, `feature_type`, `is_engineered`
- **Model**: `model_id` (join to model_metrics)
- **Importance**: `importance_rank`, `importance_type`

### Sample Looker View

```lookml
view: feature_importance {
  sql_table_name: `project.dataset.feature_importance` ;;

  dimension: compound_key {
    primary_key: yes
    hidden: yes
    sql: CONCAT(${TABLE}.model_id, '|', ${TABLE}.feature_name) ;;
  }

  dimension: feature_name {
    type: string
    sql: ${TABLE}.feature_name ;;
  }

  dimension: is_top_10 {
    type: yesno
    sql: ${TABLE}.importance_rank <= 10 ;;
  }

  measure: avg_importance {
    type: average
    sql: ${TABLE}.importance_score ;;
    value_format_name: percent_2
  }

  measure: count_features {
    type: count_distinct
    sql: ${TABLE}.feature_name ;;
  }
}
```

---

## 🔮 Table 3: `predictions`

**Description**: Model predictions with actuals for monitoring and evaluation.

**Use Cases**:
- Prediction monitoring
- Accuracy tracking over time
- Segment performance analysis
- Business impact measurement

**Update Frequency**: Real-time or batch (daily/hourly)

**Grain**: One row per prediction

### Schema

| Column Name | Type | Description | Dimension/Metric | Example |
|------------|------|-------------|------------------|---------|
| `prediction_id` | STRING | Unique prediction identifier | Dimension (Primary Key) | `pred_abc123xyz` |
| `model_id` | STRING | Model used for prediction | Dimension (Foreign Key) | `xgboost_churn_20251223_153045` |
| `entity_id` | STRING | Entity being predicted (customer_id, product_id, etc.) | Dimension | `customer_12345` |
| `predicted_at` | TIMESTAMP | When prediction was made | Dimension (Time) | `2025-12-23 15:30:45 UTC` |
| `predicted_date` | DATE | Prediction date (for partitioning) | Dimension (Time) | `2025-12-23` |
| `prediction_value` | FLOAT | Predicted value | Metric | `0.85` (probability), `49.99` (price) |
| `prediction_class` | STRING | Predicted class (classification) | Dimension | `churn`, `not_churn` |
| `prediction_confidence` | FLOAT | Model confidence (0-1) | Metric | `0.92` |
| `actual_value` | FLOAT | True value (when available) | Metric | `1.0` (churned), `52.50` (actual price) |
| `actual_class` | STRING | True class (when available) | Dimension | `churn`, `not_churn` |
| `actual_recorded_at` | TIMESTAMP | When actual became known | Dimension (Time) | `2025-12-30 10:00:00 UTC` |
| `is_correct` | BOOLEAN | Prediction was correct? | Dimension | `true`, `false` |
| `absolute_error` | FLOAT | \|predicted - actual\| | Metric | `2.51` |
| `squared_error` | FLOAT | (predicted - actual)² | Metric | `6.30` |
| `feature_values` | STRING (JSON) | Input features used | Metadata | `{"age": 35, "tenure": 24}` |
| `segment` | STRING | Business segment | Dimension | `enterprise`, `smb`, `consumer` |
| `region` | STRING | Geographic region | Dimension | `us-west`, `eu-central` |
| `model_version` | STRING | Model version | Dimension | `v1.2.3` |
| `prediction_latency_ms` | FLOAT | Inference time | Metric | `23.4` |

### Partitioning & Clustering

```sql
CREATE TABLE `project.dataset.predictions`
(
  -- columns as above
)
PARTITION BY predicted_date
CLUSTER BY model_id, segment, is_correct
OPTIONS(
  description="Model predictions with actuals for monitoring",
  require_partition_filter=true,
  partition_expiration_days=730 -- 2 years retention
);
```

### Primary Dimensions for Looker

- **Time**: `predicted_date`, days since prediction
- **Model**: `model_id`, `model_version`
- **Segment**: `segment`, `region`
- **Accuracy**: `is_correct`, error buckets

### Sample Looker View

```lookml
view: predictions {
  sql_table_name: `project.dataset.predictions` ;;

  dimension: prediction_id {
    primary_key: yes
    type: string
    sql: ${TABLE}.prediction_id ;;
  }

  dimension_group: predicted {
    type: time
    timeframes: [date, week, month]
    sql: ${TABLE}.predicted_at ;;
  }

  dimension: segment {
    type: string
    sql: ${TABLE}.segment ;;
  }

  dimension: error_bucket {
    type: string
    sql: CASE
      WHEN ${TABLE}.absolute_error IS NULL THEN 'No Actual Yet'
      WHEN ${TABLE}.absolute_error <= 0.1 THEN '0-10%'
      WHEN ${TABLE}.absolute_error <= 0.2 THEN '10-20%'
      ELSE '>20%'
    END ;;
  }

  measure: count {
    type: count
  }

  measure: accuracy_rate {
    type: average
    sql: CAST(${TABLE}.is_correct AS FLOAT64) ;;
    value_format_name: percent_1
  }

  measure: avg_confidence {
    type: average
    sql: ${TABLE}.prediction_confidence ;;
    value_format_name: percent_2
  }

  measure: mae {
    type: average
    sql: ${TABLE}.absolute_error ;;
    value_format_name: decimal_2
  }
}
```

---

## 📋 Table 4: `data_profile_summary`

**Description**: Dataset profiling statistics for data quality monitoring.

**Use Cases**:
- Data quality dashboards
- Schema drift detection
- Data validation reports
- Column-level monitoring

**Update Frequency**: Daily or on-demand

**Grain**: One row per column per dataset per run

### Schema

| Column Name | Type | Description | Dimension/Metric | Example |
|------------|------|-------------|------------------|---------|
| `profile_id` | STRING | Unique profile run identifier | Dimension (Primary Key) | `profile_abc123xyz` |
| `dataset_name` | STRING | Source table/file name | Dimension | `project.dataset.customers` |
| `column_name` | STRING | Column being profiled | Dimension | `age`, `email`, `signup_date` |
| `profiled_at` | TIMESTAMP | When profiling ran | Dimension (Time) | `2025-12-23 15:30:45 UTC` |
| `profiled_date` | DATE | Profiling date | Dimension (Time) | `2025-12-23` |
| `data_type` | STRING | Column data type | Dimension | `INTEGER`, `STRING`, `FLOAT`, `TIMESTAMP` |
| `inferred_type` | STRING | Smart type inference | Dimension | `numeric`, `categorical`, `datetime`, `text`, `email` |
| `row_count` | INTEGER | Total rows in dataset | Metric | `10000` |
| `non_null_count` | INTEGER | Non-null values | Metric | `9850` |
| `null_count` | INTEGER | Null values | Metric | `150` |
| `null_percentage` | FLOAT | % null (0-100) | Metric | `1.5` |
| `unique_count` | INTEGER | Distinct values | Metric | `450` |
| `uniqueness_percentage` | FLOAT | % unique (0-100) | Metric | `4.5` |
| `min_value` | STRING | Minimum value (as string) | Metadata | `18`, `2020-01-01` |
| `max_value` | STRING | Maximum value (as string) | Metadata | `95`, `2025-12-23` |
| `mean_value` | FLOAT | Mean (numeric only) | Metric | `42.5` |
| `median_value` | FLOAT | Median (numeric only) | Metric | `38.0` |
| `std_dev` | FLOAT | Standard deviation (numeric only) | Metric | `15.2` |
| `skewness` | FLOAT | Distribution skewness | Metric | `0.85` |
| `kurtosis` | FLOAT | Distribution kurtosis | Metric | `2.1` |
| `top_value` | STRING | Most common value | Metadata | `male`, `active` |
| `top_value_frequency` | INTEGER | Count of most common value | Metric | `6500` |
| `top_value_percentage` | FLOAT | % of most common value | Metric | `65.0` |
| `has_outliers` | BOOLEAN | Outliers detected? | Dimension | `true`, `false` |
| `outlier_count` | INTEGER | Number of outliers | Metric | `23` |
| `outlier_percentage` | FLOAT | % outliers | Metric | `0.23` |
| `quality_score` | FLOAT | Overall quality score (0-100) | Metric | `92.5` |
| `quality_issues` | STRING (JSON) | Detected issues | Metadata | `["high_nulls", "duplicate_values"]` |
| `validation_status` | STRING | Quality check result | Dimension | `pass`, `warn`, `fail` |

### Partitioning & Clustering

```sql
CREATE TABLE `project.dataset.data_profile_summary`
(
  -- columns as above
)
PARTITION BY profiled_date
CLUSTER BY dataset_name, validation_status
OPTIONS(
  description="Dataset profiling for data quality monitoring",
  require_partition_filter=true,
  partition_expiration_days=90 -- 3 months retention
);
```

### Primary Dimensions for Looker

- **Dataset**: `dataset_name`
- **Column**: `column_name`, `data_type`, `inferred_type`
- **Quality**: `validation_status`, `quality_score` buckets
- **Time**: `profiled_date`

### Sample Looker View

```lookml
view: data_profile_summary {
  sql_table_name: `project.dataset.data_profile_summary` ;;

  dimension: compound_key {
    primary_key: yes
    hidden: yes
    sql: CONCAT(${TABLE}.profile_id, '|', ${TABLE}.column_name) ;;
  }

  dimension: column_name {
    type: string
    sql: ${TABLE}.column_name ;;
  }

  dimension: quality_tier {
    type: string
    sql: CASE
      WHEN ${TABLE}.quality_score >= 90 THEN 'Excellent'
      WHEN ${TABLE}.quality_score >= 75 THEN 'Good'
      WHEN ${TABLE}.quality_score >= 60 THEN 'Fair'
      ELSE 'Poor'
    END ;;
  }

  dimension: has_quality_issues {
    type: yesno
    sql: ${TABLE}.validation_status IN ('warn', 'fail') ;;
  }

  measure: count_columns {
    type: count_distinct
    sql: ${TABLE}.column_name ;;
  }

  measure: avg_quality_score {
    type: average
    sql: ${TABLE}.quality_score ;;
    value_format_name: decimal_1
  }

  measure: avg_null_percentage {
    type: average
    sql: ${TABLE}.null_percentage ;;
    value_format_name: percent_1
  }

  measure: columns_with_issues {
    type: count_distinct
    sql: ${TABLE}.column_name ;;
    filters: [has_quality_issues: "yes"]
  }
}
```

---

## 🔄 Schema Evolution Guidelines

### ✅ **SAFE Changes** (Non-Breaking)

1. **Add new columns** (always nullable or with defaults)
   ```sql
   ALTER TABLE `project.dataset.model_metrics`
   ADD COLUMN IF NOT EXISTS new_metric FLOAT64;
   ```

2. **Add new tables** (doesn't affect existing dashboards)

3. **Lengthen STRING columns** (VARCHAR(50) → VARCHAR(100))

4. **Add indexes/clustering** (performance only)

5. **Add column descriptions**
   ```sql
   ALTER TABLE `project.dataset.model_metrics`
   ALTER COLUMN accuracy SET OPTIONS (description='Model accuracy (0-1)');
   ```

### ❌ **BREAKING Changes** (Require Dashboard Updates)

1. **Rename columns** → Use views for backward compatibility:
   ```sql
   CREATE OR REPLACE VIEW `project.dataset.model_metrics_v2` AS
   SELECT
     model_id,
     accuracy AS acc,  -- renamed column
     ...
   FROM `project.dataset.model_metrics`;
   ```

2. **Change data types** → Create new column, migrate, deprecate old:
   ```sql
   -- Step 1: Add new column
   ALTER TABLE model_metrics ADD COLUMN created_at_new TIMESTAMP;

   -- Step 2: Backfill
   UPDATE model_metrics SET created_at_new = CAST(created_at AS TIMESTAMP) WHERE true;

   -- Step 3: Update dashboards to use new column

   -- Step 4: Drop old column after validation period
   ALTER TABLE model_metrics DROP COLUMN created_at;
   ```

3. **Remove columns** → Deprecate first, remove after 90 days

4. **Change partitioning** → Requires table recreation

### 🔄 **Versioning Strategy**

For major schema changes, create versioned tables:

```
project.dataset.model_metrics_v1 (deprecated, keep 90 days)
project.dataset.model_metrics_v2 (current)
project.dataset.model_metrics (view pointing to latest version)
```
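
A minimal sketch of that last step, assuming `model_metrics_v2` is the current table and the bare `model_metrics` name is reserved for the alias view:

```sql
-- Sketch: repoint the stable name at the newest versioned table.
-- Dashboards keep querying `model_metrics`; only this view changes on upgrades.
CREATE OR REPLACE VIEW `project.dataset.model_metrics` AS
SELECT * FROM `project.dataset.model_metrics_v2`;
```
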
---

## 📊 Dashboard-Ready Metrics Catalog

### Model Performance Metrics

| Metric Name | Calculation | Use Case |
|------------|-------------|----------|
| **Model Count** | `COUNT(DISTINCT model_id)` | Total models trained |
| **Avg Accuracy** | `AVG(accuracy)` | Overall model quality |
| **Accuracy Trend** | `AVG(accuracy) OVER (ORDER BY created_date)` | Performance over time |
| **Best Model** | `model_id WHERE accuracy = MAX(accuracy)` | Top performer |
| **Models by Type** | `COUNT(*) GROUP BY model_type` | Algorithm distribution |
| **Training Time** | `AVG(training_duration_seconds)` | Resource usage |
| **Recent Models** | `WHERE created_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)` | Latest activity |

### Feature Importance Metrics

| Metric Name | Calculation | Use Case |
|------------|-------------|----------|
| **Top Features** | `WHERE importance_rank <= 10` | Most impactful features |
| **Avg Importance** | `AVG(importance_score)` | Feature impact distribution |
| **Engineered Features** | `COUNT(*) WHERE is_engineered = true` | Feature engineering effectiveness |
| **Feature Stability** | `STDDEV(importance_score) GROUP BY feature_name` | Consistent predictors |

### Prediction Metrics

| Metric Name | Calculation | Use Case |
|------------|-------------|----------|
| **Accuracy Rate** | `AVG(CAST(is_correct AS FLOAT64))` | Real-world performance |
| **MAE** | `AVG(absolute_error)` | Average error magnitude |
| **RMSE** | `SQRT(AVG(squared_error))` | Error with outlier penalty |
| **Predictions/Day** | `COUNT(*) GROUP BY predicted_date` | Volume tracking |
| **Confidence Distribution** | `APPROX_QUANTILES(prediction_confidence, 10)` | Model calibration |
| **Segment Performance** | `AVG(is_correct) GROUP BY segment` | Fairness check |

### Data Quality Metrics

| Metric Name | Calculation | Use Case |
|------------|-------------|----------|
| **Data Quality Score** | `AVG(quality_score)` | Overall health |
| **Null Rate** | `AVG(null_percentage)` | Completeness |
| **Columns with Issues** | `COUNT(DISTINCT column_name) WHERE validation_status != 'pass'` | Problem areas |
| **Quality Trend** | `AVG(quality_score) OVER (ORDER BY profiled_date)` | Improving/degrading? |
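
As an illustration, several catalog entries can be combined in one query. This is a sketch against the `model_metrics` and `data_profile_summary` tables defined above (both require a partition filter, hence the date predicates):

```sql
-- Sketch: last 7 days of model quality next to data health, one row per day.
WITH model_daily AS (
  SELECT
    created_date,
    COUNT(DISTINCT model_id)       AS model_count,       -- "Model Count"
    AVG(accuracy)                  AS avg_accuracy,      -- "Avg Accuracy"
    AVG(training_duration_seconds) AS avg_training_secs  -- "Training Time"
  FROM `project.dataset.model_metrics`
  WHERE created_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)  -- "Recent Models"
  GROUP BY created_date
),
quality_daily AS (
  SELECT
    profiled_date,
    AVG(quality_score) AS data_quality_score  -- "Data Quality Score"
  FROM `project.dataset.data_profile_summary`
  WHERE profiled_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
  GROUP BY profiled_date
)
SELECT m.*, q.data_quality_score
FROM model_daily m
LEFT JOIN quality_daily q
  ON q.profiled_date = m.created_date
ORDER BY m.created_date;
```
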
---

## 🎯 Sample Looker Explores

### Explore 1: Model Performance Analysis

```lookml
explore: model_metrics {
  label: "Model Performance"
  description: "Track model accuracy, training time, and comparison"

  join: feature_importance {
    type: left_outer
    sql_on: ${model_metrics.model_id} = ${feature_importance.model_id} ;;
    relationship: one_to_many
  }
}
```

### Explore 2: Prediction Monitoring

```lookml
explore: predictions {
  label: "Prediction Monitoring"
  description: "Real-time prediction accuracy and drift"

  join: model_metrics {
    type: left_outer
    sql_on: ${predictions.model_id} = ${model_metrics.model_id} ;;
    relationship: many_to_one
  }
}
```

### Explore 3: Data Quality Dashboard

```lookml
explore: data_profile_summary {
  label: "Data Quality"
  description: "Monitor data health and schema drift"
}
```

---

## 📝 Implementation Checklist

### Phase 1: Setup (Week 1)
- [ ] Create all 4 BigQuery tables with partitioning
- [ ] Set up service account permissions
- [ ] Configure table expiration policies
- [ ] Document table owners and update SLAs

### Phase 2: Integration (Week 2)
- [ ] Update tools to write to these schemas
- [ ] Add schema validation in CI/CD
- [ ] Create data dictionary in Looker
- [ ] Set up table monitoring alerts

### Phase 3: BI Layer (Week 3)
- [ ] Create Looker views for all 4 tables
- [ ] Build explores with joins
- [ ] Create initial dashboards
- [ ] Set up scheduled data refreshes

### Phase 4: Validation (Week 4)
- [ ] Backfill historical data
- [ ] Verify dashboard accuracy
- [ ] Train stakeholders on dashboards
- [ ] Document runbooks for common issues

---

## 🔗 Related Tools

**BigQuery Write Tools** (src/bigquery/):
- `bigquery_write_results()` - Generic write function
- Helper: `bigquery_write_model_metrics()` - Specialized writer
- Helper: `bigquery_write_feature_importance()` - Specialized writer
- Helper: `bigquery_write_predictions()` - Specialized writer
- Helper: `bigquery_write_data_profile()` - Specialized writer

**Example Usage**:
```python
from src.bigquery import bigquery_write_results

# Write model metrics
bigquery_write_results(
    data=metrics_df,
    table_id="project.dataset.model_metrics",
    write_disposition="WRITE_APPEND"
)
```

---

## 📚 Additional Resources

- [BigQuery Best Practices](https://cloud.google.com/bigquery/docs/best-practices)
- [Looker LookML Reference](https://cloud.google.com/looker/docs/reference/lookml-quick-reference)
- [Schema Design for BI](https://cloud.google.com/architecture/bigquery-data-warehouse)

---

**Last Updated**: December 23, 2025
**Schema Version**: 1.0.0
**Maintained By**: Data Science Team
**Review Cadence**: Quarterly
CHECKLIST.md
DELETED
@@ -1,97 +0,0 @@
# ✅ Pre-Launch Checklist

## Before Running the Application

### 1. Environment Variables ⚠️ **REQUIRED**

You MUST set your API key before starting:

```powershell
# Windows PowerShell
$env:GOOGLE_API_KEY="your-google-api-key-here"

# Verify it's set
echo $env:GOOGLE_API_KEY
```

### 2. Build Status ✅

- [x] Frontend dependencies installed
- [x] Frontend built (FRRONTEEEND/dist exists)
- [x] Backend code updated with new endpoints
- [x] Configuration files in place

### 3. Quick Start Commands

**Option A - Use the start script:**
```powershell
.\start.ps1
```

**Option B - Manual start:**
```powershell
# Make sure you're in the project root
Set-Location "c:\Users\Pulastya\Videos\DS AGENTTTT"

# Set API key (if not already set)
$env:GOOGLE_API_KEY="your-key-here"

# Start the server
python src\api\app.py
```

### 4. Access the Application

Once the server starts, open your browser to:
**http://localhost:8080**

You should see:
1. **Landing Page** - Professional homepage with agent features
2. **Launch Console** button - Click to open the chat interface
3. **Chat Interface** - Modern conversational UI

### 5. Test the Chat

Try these sample prompts:
- "What can you do?"
- "Explain your data science capabilities"
- "How do I upload a dataset?"
- "What ML models do you support?"

### 6. Expected Console Output

When you start the server, you should see:
```
INFO: Started server process [####]
INFO: Waiting for application startup.
✅ Agent initialized with provider: groq
✅ Frontend assets mounted from C:\Users\Pulastya\Videos\DS AGENTTTT\FRRONTEEEND\dist
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080
```

### 7. Troubleshooting Quick Reference

| Issue | Solution |
|-------|----------|
| "Agent not initialized" | Set GOOGLE_API_KEY environment variable |
| "Frontend not found" | Run `cd FRRONTEEEND && npm run build` |
| Port 8080 in use | Kill the process or change PORT env var |
| Import errors | Run `pip install -r requirements.txt` |
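
For the "Port 8080 in use" row, one PowerShell sketch to find and stop the listener (inspect the process before killing it):

```powershell
# Sketch: identify and stop whatever is listening on port 8080
$owner = (Get-NetTCPConnection -LocalPort 8080 -State Listen).OwningProcess
Get-Process -Id $owner              # confirm it's the old server first
Stop-Process -Id $owner -Force      # or set $env:PORT="8081" and restart instead
```
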
## Next Steps After Launch

1. **Test the chat** with the agent
2. **Upload a dataset** (feature coming soon in chat)
3. **Try the API endpoints** at http://localhost:8080/docs
4. **Customize the frontend** in FRRONTEEEND/components/

## Documentation

- 📖 [MIGRATION_COMPLETE.md](MIGRATION_COMPLETE.md) - What was changed
- 📖 [FRONTEND_INTEGRATION.md](FRONTEND_INTEGRATION.md) - Technical details
- 📖 [README.md](README.md) - Main project docs

---

**Ready to launch?** Run `.\start.ps1` and visit http://localhost:8080 🚀
DEPLOYMENT.md
DELETED
@@ -1,495 +0,0 @@
# 🚀 Google Cloud Run Deployment Guide

Complete guide to deploy the Data Science Agent to Google Cloud Run as a serverless API.

## 📋 Prerequisites

1. **Google Cloud Platform Account**
   - Active GCP account with billing enabled
   - Project created (or use existing project)

2. **Install Google Cloud SDK**
   ```bash
   # macOS (Homebrew)
   brew install --cask google-cloud-sdk

   # Or download from: https://cloud.google.com/sdk/install
   ```

3. **Authenticate with GCP**
   ```bash
   gcloud auth login
   gcloud auth application-default login
   ```

4. **Set Your Project**
   ```bash
   gcloud config set project YOUR_PROJECT_ID
   ```

---

## 🎯 Deployment Options

### Option 1: Automated Deployment (Recommended)

Use the provided deployment script for one-command deployment:

```bash
# Set required environment variables
export GCP_PROJECT_ID="your-project-id"
export GROQ_API_KEY="your-groq-api-key"
export GOOGLE_API_KEY="your-google-api-key"  # Optional for Gemini

# Run deployment script
./deploy.sh
```

**What it does:**
- ✅ Enables required GCP APIs (Cloud Build, Cloud Run, Secret Manager)
- ✅ Creates secrets for API keys
- ✅ Builds Docker container
- ✅ Deploys to Cloud Run
- ✅ Returns service URL

**Configuration options:**
```bash
# Optional: Customize deployment
export CLOUD_RUN_REGION="us-central1"  # Change region
export MEMORY="4Gi"                    # Increase memory
export CPU="2"                         # Set CPU count
export MAX_INSTANCES="10"              # Scale limit
export TIMEOUT="900"                   # Request timeout (15 min)

./deploy.sh
```

---

### Option 2: Manual Deployment

Step-by-step manual deployment for full control:

#### Step 1: Enable APIs
```bash
gcloud services enable \
  cloudbuild.googleapis.com \
  run.googleapis.com \
  containerregistry.googleapis.com \
  secretmanager.googleapis.com
```

#### Step 2: Create Secrets
```bash
# Create GROQ API key secret
echo -n "your-groq-api-key" | gcloud secrets create GROQ_API_KEY --data-file=-

# Create Google API key secret (optional)
echo -n "your-google-api-key" | gcloud secrets create GOOGLE_API_KEY --data-file=-

# Grant Cloud Run access to secrets
PROJECT_NUMBER=$(gcloud projects describe $(gcloud config get-value project) --format="value(projectNumber)")
gcloud secrets add-iam-policy-binding GROQ_API_KEY \
  --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
  --role="roles/secretmanager.secretAccessor"
```

#### Step 3: Build Container
```bash
gcloud builds submit --tag gcr.io/$(gcloud config get-value project)/data-science-agent
```

#### Step 4: Deploy to Cloud Run
```bash
gcloud run deploy data-science-agent \
  --image gcr.io/$(gcloud config get-value project)/data-science-agent \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --memory 4Gi \
  --cpu 2 \
  --timeout 900 \
  --max-instances 10 \
  --set-env-vars LLM_PROVIDER=groq,REASONING_EFFORT=medium \
  --set-secrets GROQ_API_KEY=GROQ_API_KEY:latest,GOOGLE_API_KEY=GOOGLE_API_KEY:latest
```

---

### Option 3: CI/CD with Cloud Build Triggers

Automated deployment on git push:

#### Step 1: Connect Repository
```bash
# Connect GitHub/GitLab/Bitbucket repository
gcloud beta builds connections create github connection-name \
  --region=us-central1
```

#### Step 2: Create Build Trigger
```bash
gcloud builds triggers create github \
  --name="deploy-data-science-agent" \
  --repo-name="Data-Science-Agent" \
  --repo-owner="Surfing-Ninja" \
  --branch-pattern="^main$" \
  --build-config="cloudbuild.yaml"
```

Now every push to `main` branch automatically deploys! 🎉

---

## 🧪 Testing the Deployment

### 1. Health Check
```bash
SERVICE_URL=$(gcloud run services describe data-science-agent \
  --region us-central1 \
  --format 'value(status.url)')

curl $SERVICE_URL/health
```

**Expected response:**
```json
{
  "status": "healthy",
  "agent_ready": true,
  "provider": "groq",
  "tools_count": 82
}
```

### 2. List Available Tools
```bash
curl $SERVICE_URL/tools | jq
```

### 3. Profile a Dataset
```bash
curl -X POST $SERVICE_URL/profile \
  -F "file=@test_data/sample.csv"
```

### 4. Run Full Analysis
```bash
curl -X POST $SERVICE_URL/run \
  -F "file=@test_data/sample.csv" \
  -F "task_description=Analyze this dataset, detect outliers, and train a prediction model" \
  -F "target_col=target" \
  | jq
```

---

## 📊 Monitoring & Logs

### View Real-time Logs
```bash
gcloud run logs tail data-science-agent --region us-central1
```

### View Recent Logs
```bash
gcloud run logs read data-science-agent \
  --region us-central1 \
  --limit 50
```

### Cloud Console Monitoring
- Go to: https://console.cloud.google.com/run
- Click on `data-science-agent`
- View: Metrics, Logs, Revisions

---

## 💰 Cost Estimation

### Cloud Run Pricing (as of Dec 2024)
**Free Tier** (per month):
- 2 million requests
- 360,000 GB-seconds of memory
- 180,000 vCPU-seconds

**Paid Tier** (us-central1):
- CPU: $0.00002400 per vCPU-second
- Memory: $0.00000250 per GB-second
- Requests: $0.40 per million requests

**Example Cost for 4Gi Memory, 2 vCPU:**
- 1 request taking 60 seconds
- CPU: 2 vCPU × 60s × $0.000024 = $0.00288
- Memory: 4GB × 60s × $0.0000025 = $0.0006
- Request: $0.0000004
- **Total: ~$0.0035 per request**

**Monthly estimate for 1000 requests/month:**
- ~$3.50/month (well within free tier for testing!)
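
A quick sanity check of that arithmetic in Python (a sketch using only the rates quoted above; an actual bill also depends on CPU allocation mode and free-tier credits):

```python
# Sketch: reproduce the per-request estimate from the quoted Dec 2024 rates.
CPU_RATE = 0.000024    # $ per vCPU-second
MEM_RATE = 0.0000025   # $ per GB-second
REQ_RATE = 0.40 / 1e6  # $ per request

vcpus, mem_gb, seconds = 2, 4, 60
per_request = vcpus * seconds * CPU_RATE + mem_gb * seconds * MEM_RATE + REQ_RATE
print(f"Per request:    ${per_request:.4f}")         # ~$0.0035
print(f"1000 req/month: ${1000 * per_request:.2f}")  # ~$3.48, i.e. the ~$3.50 above
```
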
---

## 🔒 Security Best Practices

### 1. Enable Authentication (Production)
```bash
# Deploy with authentication required
gcloud run deploy data-science-agent \
  --no-allow-unauthenticated \
  --region us-central1 \
  --image gcr.io/PROJECT_ID/data-science-agent

# Create service account for clients
gcloud iam service-accounts create api-client

# Grant invoker role
gcloud run services add-iam-policy-binding data-science-agent \
  --member="serviceAccount:api-client@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/run.invoker" \
  --region us-central1
```

### 2. Use VPC Connector (For BigQuery/GCS)
```bash
# Create VPC connector
gcloud compute networks vpc-access connectors create ds-agent-connector \
  --network default \
  --region us-central1 \
  --range 10.8.0.0/28

# Deploy with VPC
gcloud run deploy data-science-agent \
  --vpc-connector ds-agent-connector \
  --region us-central1
```

### 3. Restrict API Keys
- Set **Application restrictions** in Google Cloud Console
- Whitelist only Cloud Run service URL
- Set **API restrictions** to only required APIs

---

## 🔧 Configuration Options

### Environment Variables
```bash
# Set during deployment
--set-env-vars KEY1=value1,KEY2=value2

# Available variables:
LLM_PROVIDER=groq           # or "gemini"
REASONING_EFFORT=medium     # low, medium, high
CACHE_TTL_SECONDS=86400     # Cache lifetime
ARTIFACT_BACKEND=local      # or "gcs" for cloud storage
GCS_BUCKET_NAME=your-bucket # If using GCS backend
OUTPUT_DIR=/tmp/outputs     # Output directory
MAX_PARALLEL_TOOLS=5        # Concurrent tool execution
MAX_RETRIES=3               # Tool retry attempts
TIMEOUT_SECONDS=300         # Tool timeout
```

### Resource Limits
```bash
--memory 4Gi       # 128Mi to 32Gi
--cpu 2            # 1 to 8 vCPU
--timeout 900      # Max 3600s (1 hour)
--max-instances 10 # Scale limit
--min-instances 0  # Always-warm instances
--concurrency 10   # Requests per instance
```

---

## 🐛 Troubleshooting

### Build Fails
```bash
# Check build logs
gcloud builds list --limit=5
gcloud builds log BUILD_ID

# Common fixes:
# - Ensure Dockerfile is in root directory
# - Check requirements.txt has all dependencies
# - Increase build timeout: --timeout=1200s
```

### Deployment Fails
```bash
# Check service status
gcloud run services describe data-science-agent --region us-central1

# Common fixes:
# - Ensure APIs are enabled
# - Check secrets exist and are accessible
# - Verify service account permissions
```

### Runtime Errors
```bash
# View logs
gcloud run logs tail data-science-agent --region us-central1

# Common issues:
# - API keys not set: Check secrets
# - Import errors: Ensure all dependencies in requirements.txt
# - Memory issues: Increase --memory limit
# - Timeout: Increase --timeout value
```

### Container Crashes
```bash
# Test locally first
docker build -t ds-agent .
docker run -p 8080:8080 \
  -e GROQ_API_KEY="your-key" \
  ds-agent

curl http://localhost:8080/health
```

---

## 🚀 Advanced Features

### Custom Domain
```bash
# Map custom domain
gcloud run domain-mappings create \
  --service data-science-agent \
  --domain api.yourdomain.com \
  --region us-central1
```

### Load Balancing
```bash
# Create multiple regional deployments
for region in us-central1 us-east1 europe-west1; do
  gcloud run deploy data-science-agent \
    --image gcr.io/PROJECT_ID/data-science-agent \
    --region $region
done

# Set up global load balancer
# Follow: https://cloud.google.com/load-balancing/docs/https/setup-global-ext-https-serverless
```

### Multi-Region Deployment
```bash
# Deploy to multiple regions for high availability
./deploy.sh CLOUD_RUN_REGION=us-central1
./deploy.sh CLOUD_RUN_REGION=europe-west1
./deploy.sh CLOUD_RUN_REGION=asia-east1
```

---

## 📝 API Documentation

Once deployed, access Swagger docs at:
```
https://YOUR_SERVICE_URL/docs
```

### Available Endpoints

#### `GET /` - Health Check
Returns service status and tool count.

#### `GET /health` - Detailed Health
Returns agent readiness and provider info.

#### `GET /tools` - List Tools
Returns all 82 available tools organized by category.

#### `POST /run` - Run Full Analysis
Upload dataset and execute complete data science workflow.

**Parameters:**
- `file`: CSV/Parquet file (multipart/form-data)
- `task_description`: Natural language task description
- `target_col`: Target column for ML (optional)
- `use_cache`: Enable caching (default: true)
- `max_iterations`: Max workflow steps (default: 20)
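
For programmatic clients, the equivalent of the earlier curl call might look like this Python `requests` sketch (field names are the parameters above; `SERVICE_URL` and the sample file path are placeholders):

```python
import requests

SERVICE_URL = "https://YOUR_SERVICE_URL"  # from: gcloud run services describe ...

with open("test_data/sample.csv", "rb") as f:
    resp = requests.post(
        f"{SERVICE_URL}/run",
        files={"file": f},  # multipart/form-data upload
        data={
            "task_description": "Analyze this dataset and train a prediction model",
            "target_col": "target",   # optional
            "use_cache": "true",      # default: true
            "max_iterations": "20",   # default: 20
        },
        timeout=900,  # workflows can run long; mirror the service timeout
    )
resp.raise_for_status()
print(resp.json())
```
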
#### `POST /profile` - Quick Profile
Quick dataset profiling without full workflow.

**Parameters:**
- `file`: CSV/Parquet file (multipart/form-data)

---

## 🔄 Updates & Rollbacks

### Update Deployment
```bash
# Rebuild and redeploy
./deploy.sh
```

### Rollback to Previous Revision
```bash
# List revisions
gcloud run revisions list --service data-science-agent --region us-central1

# Rollback
gcloud run services update-traffic data-science-agent \
  --to-revisions REVISION_NAME=100 \
  --region us-central1
```

### Blue/Green Deployment
```bash
# Deploy new version with tag
gcloud run deploy data-science-agent \
  --tag blue \
  --no-traffic \
  --region us-central1

# Test: https://blue---data-science-agent-HASH.run.app

# Switch traffic
gcloud run services update-traffic data-science-agent \
  --to-tags blue=100 \
  --region us-central1
```

---

## 📚 Additional Resources

- **Cloud Run Docs**: https://cloud.google.com/run/docs
- **Pricing Calculator**: https://cloud.google.com/products/calculator
- **Best Practices**: https://cloud.google.com/run/docs/tips
- **Quotas & Limits**: https://cloud.google.com/run/quotas

---

## ✅ Deployment Checklist

- [ ] GCP project created and billing enabled
- [ ] Google Cloud SDK installed and authenticated
- [ ] API keys obtained (GROQ_API_KEY, GOOGLE_API_KEY)
- [ ] Secrets created in Secret Manager
- [ ] Docker container builds successfully locally
- [ ] Cloud Run APIs enabled
- [ ] Service deployed to Cloud Run
- [ ] Health check endpoint returns 200
- [ ] Test dataset profiled successfully
- [ ] Full analysis workflow tested
- [ ] Monitoring/logging configured
- [ ] Cost alerts set up (optional)
- [ ] Custom domain mapped (optional)
- [ ] CI/CD pipeline configured (optional)

---

**Need help?** Check the troubleshooting section or view logs with:
```bash
gcloud run logs tail data-science-agent --region us-central1
```

Happy deploying! 🎉
FRONTEND_INTEGRATION.md
DELETED
@@ -1,234 +0,0 @@

# Data Science Agent - Frontend Integration Guide

## 🎉 New React Frontend

The application now features a modern, professional React frontend that replaces the old Gradio interface.

### Features

- **Beautiful Landing Page**: Showcases the agent's capabilities with modern design
- **Professional Chat Interface**: NextChat-style conversational UI
- **Direct Backend Integration**: Communicates with your FastAPI backend
- **Responsive Design**: Works on all devices
- **Dark Theme**: Modern, eye-friendly interface

## 🚀 Quick Start

### Prerequisites

- Python 3.13+
- Node.js 20+
- npm (comes with Node.js)

### Running the Application

#### Option 1: Using the Build Script (Recommended)

**Windows:**
```powershell
.\build-and-deploy.ps1
```

**Linux/Mac:**
```bash
chmod +x build-and-deploy.sh
./build-and-deploy.sh
```

Then start the server:
```bash
python src/api/app.py
```

#### Option 2: Manual Steps

1. **Build the Frontend:**
```bash
cd FRRONTEEEND
npm.cmd install
npm.cmd run build
cd ..
```

2. **Install Python Dependencies:**
```bash
pip install -r requirements.txt
```

3. **Start the Backend Server:**
```bash
python src/api/app.py
```

4. **Access the Application:**
Open your browser and navigate to: http://localhost:8080

## 🏗️ Architecture

### Backend (FastAPI)
- **Location**: `src/api/app.py`
- **Port**: 8080
- **Endpoints**:
  - `GET /` - Health check & landing page
  - `POST /chat` - Chat interface endpoint
  - `POST /run` - Full data science workflow
  - `POST /profile` - Dataset profiling
  - `GET /tools` - List available tools

### Frontend (React + Vite)
- **Location**: `FRRONTEEEND/`
- **Build Output**: `FRRONTEEEND/dist/`
- **Dev Port**: 3000 (development mode)
- **Production**: Served by FastAPI at port 8080

## 🔧 Development Mode

If you want to develop the frontend with hot-reloading:

1. **Terminal 1 - Backend:**
```bash
python src/api/app.py
```

2. **Terminal 2 - Frontend:**
```bash
cd FRRONTEEEND
npm.cmd run dev
```

Access:
- Frontend (dev): http://localhost:3000
- Backend API: http://localhost:8080

## 🌐 API Integration

The frontend now communicates with your FastAPI backend instead of calling external APIs directly.

### Environment Variables

Create `FRRONTEEEND/.env` for local development:
```env
VITE_API_URL=http://localhost:8080
```

For production, update `FRRONTEEEND/.env.production`:
```env
VITE_API_URL=https://your-cloud-run-url.run.app
```
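You can also exercise the `/chat` contract the frontend relies on directly (a minimal sketch; the request and response fields follow the `/chat` examples documented for this project):

```python
import requests

API_URL = "http://localhost:8080"  # mirrors VITE_API_URL in development

payload = {
    "messages": [
        {"role": "user", "content": "What can you do with a CSV of house prices?"}
    ],
    "stream": False,
}
resp = requests.post(f"{API_URL}/chat", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["message"])  # the assistant's reply text
```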
## 📦 Deployment

### Docker Build

The Dockerfile now includes a multi-stage build that:
1. Builds the React frontend
2. Builds the Python environment
3. Combines both in the final image

```bash
docker build -t data-science-agent .
docker run -p 8080:8080 data-science-agent
```

### Google Cloud Run

```bash
gcloud builds submit --tag gcr.io/YOUR-PROJECT-ID/data-science-agent
gcloud run deploy data-science-agent \
  --image gcr.io/YOUR-PROJECT-ID/data-science-agent \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --set-env-vars GROQ_API_KEY=your-api-key
```

## 🔄 What Changed

### Removed
- ❌ Gradio interface (`chat_ui.py` - kept for reference)
- ❌ Direct Google GenAI calls from frontend
- ❌ Gradio dependency

### Added
- ✅ React + TypeScript frontend with Vite
- ✅ Professional landing page
- ✅ Modern chat interface
- ✅ `/chat` API endpoint
- ✅ CORS support in FastAPI
- ✅ Static file serving for React app
- ✅ Multi-stage Docker build

## 🛠️ Tech Stack

### Frontend
- React 19
- TypeScript 5.8
- Vite 6
- Tailwind CSS
- Framer Motion (animations)
- Lucide React (icons)

### Backend (unchanged)
- FastAPI
- Python 3.13
- Groq API
- Polars, DuckDB
- Scikit-learn, XGBoost, LightGBM

## 📁 Project Structure

```
.
├── FRRONTEEEND/             # React frontend
│   ├── components/          # React components
│   ├── dist/                # Built frontend (after npm run build)
│   ├── package.json
│   ├── vite.config.ts
│   └── .env                 # Frontend environment variables
├── src/
│   ├── api/
│   │   └── app.py           # FastAPI backend (updated)
│   ├── tools/               # Data science tools
│   └── orchestrator.py      # Main agent logic
├── requirements.txt         # Python dependencies (updated)
├── Dockerfile               # Multi-stage build (updated)
├── build-and-deploy.ps1     # Windows build script
└── build-and-deploy.sh      # Linux/Mac build script
```

## 🐛 Troubleshooting

### Frontend doesn't load
- Make sure you've run `npm run build` in the FRRONTEEEND directory
- Check that `FRRONTEEEND/dist/` exists and contains files

### API errors in chat
- Ensure the backend is running on port 8080
- Check that `GROQ_API_KEY` is set in your environment
- Verify the API URL in the `.env` file

### CORS errors
- The backend now has CORS enabled for development
- For production, update the `allow_origins` in `src/api/app.py`
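For reference, restricting origins in production looks roughly like this (a sketch of the relevant middleware call, not the exact code in `src/api/app.py`; the origin is a placeholder):

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-cloud-run-url.run.app"],  # placeholder origin
    allow_methods=["*"],
    allow_headers=["*"],
)
```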
## 📝 Notes

- The old `chat_ui.py` has been kept for reference but is no longer used
- All chat functionality now goes through the `/chat` endpoint
- The frontend is automatically served by FastAPI in production mode
- Session history is maintained in the frontend (browser)

## 🎯 Next Steps

1. **Customize the frontend**: Edit files in `FRRONTEEEND/components/`
2. **Add file upload**: Extend `ChatInterface.tsx` to handle file uploads
3. **Add visualization**: Display charts from the backend in the chat
4. **Authentication**: Add user authentication if needed

## 📞 Support

For issues or questions:
1. Check the console logs (browser & terminal)
2. Verify environment variables
3. Ensure all dependencies are installed
4. Review the API documentation at http://localhost:8080/docs
FRRONTEEEND/README.md
DELETED
@@ -1,20 +0,0 @@

<div align="center">
<img width="1200" height="475" alt="GHBanner" src="https://github.com/user-attachments/assets/0aa67016-6eaf-458a-adb2-6e31a0763ed6" />
</div>

# Run and deploy your AI Studio app

This contains everything you need to run your app locally.

View your app in AI Studio: https://ai.studio/apps/drive/1gChoktTuh429q26FzxS4BPo0q0LnlRE9

## Run Locally

**Prerequisites:** Node.js

1. Install dependencies:
   `npm install`
2. Set the `GEMINI_API_KEY` in [.env.local](.env.local) to your Gemini API key
3. Run the app:
   `npm run dev`
GEMINI_UPDATE.md
DELETED
@@ -1,93 +0,0 @@

# 🔄 Updated to Use Google Gemini!

## What Changed

The application now uses **Google Gemini (gemini-2.0-flash-exp)** instead of Groq for the chat interface.

## Required Setup

### 1. Set Your Google API Key

```powershell
# Windows PowerShell
$env:GOOGLE_API_KEY="your-google-api-key-here"

# Verify it's set
echo $env:GOOGLE_API_KEY
```

### 2. Get Your API Key

If you don't have a Google API key:
1. Go to [Google AI Studio](https://aistudio.google.com/app/apikey)
2. Create a new API key
3. Copy and set it as shown above

## Quick Start

```powershell
# Set your API key
$env:GOOGLE_API_KEY="your-key-here"

# Run the application
.\start.ps1
```

Then open: **http://localhost:8080**

## What's Using Gemini

- ✅ **Chat Interface** (`/chat` endpoint) - Uses Gemini 2.0 Flash
- ℹ️ **Full Workflow** (`/run` endpoint) - Uses the main agent (configurable via LLM_PROVIDER)

## Technical Details

The `/chat` endpoint now:
- Uses the `google.generativeai` SDK
- Model: `gemini-2.0-flash-exp`
- Maintains conversation history
- Applies a professional data science system instruction
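A minimal sketch of that wiring with the `google.generativeai` SDK (simplified; the actual endpoint code differs in detail):

```python
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# System instruction plus per-session history, as listed above
model = genai.GenerativeModel(
    "gemini-2.0-flash-exp",
    system_instruction="You are a professional data science assistant.",
)
chat = model.start_chat(history=[])  # prior turns would be replayed here
reply = chat.send_message("How should I handle missing values in a CSV?")
print(reply.text)
```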
## Expected Console Output

When you start the server:
```
INFO:     Started server process [####]
INFO:     Waiting for application startup.
✅ Agent initialized with provider: gemini
✅ Frontend assets mounted from C:\Users\Pulastya\Videos\DS AGENTTTT\FRRONTEEEND\dist
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080
```

## Files Updated

- ✅ [src/api/app.py](src/api/app.py) - `/chat` endpoint now uses Gemini
- ✅ [.env.example](.env.example) - Updated to GOOGLE_API_KEY
- ✅ [start.ps1](start.ps1) - Updated environment variable reference
- ✅ [start.sh](start.sh) - Updated environment variable reference
- ✅ [CHECKLIST.md](CHECKLIST.md) - Updated instructions
- ✅ [FRRONTEEEND/.env](FRRONTEEEND/.env) - Added note about Gemini

## Troubleshooting

### Error: "API key not configured"
**Solution**: Make sure you've set the environment variable:
```powershell
$env:GOOGLE_API_KEY="your-actual-api-key"
```

### Error: "Module google.generativeai not found"
**Solution**: The dependency is already in requirements.txt. Verify it's installed:
```bash
pip install google-generativeai
```

### Rate Limits
Gemini 2.0 Flash has generous rate limits:
- Free tier: 15 RPM (requests per minute)
- 1 million TPM (tokens per minute)

---

**Ready?** Set your `GOOGLE_API_KEY` and run `.\start.ps1` 🚀
MIGRATION_COMPLETE.md
DELETED
@@ -1,325 +0,0 @@

# 🎉 Frontend Migration Complete!

## Summary

Successfully replaced the old Gradio interface with a modern React-based frontend featuring:
- **Professional Landing Page**: Showcases the agent's capabilities
- **Modern Chat Interface**: NextChat-style conversational UI
- **Direct Backend Integration**: Communicates with FastAPI backend
- **Beautiful Design**: Dark theme with animations and responsive layout

## What Was Changed

### ✅ Backend Updates ([src/api/app.py](src/api/app.py))
1. **Added CORS middleware** for frontend communication
2. **Created `/chat` endpoint** for conversational interface
3. **Static file serving** for built React app
4. **Catch-all route** to serve `index.html` for client-side routing
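Items 3 and 4 translate to roughly the following FastAPI wiring (a simplified sketch, not the exact code in `src/api/app.py`; the real app registers its API routes before the catch-all so they take precedence):

```python
from pathlib import Path

from fastapi import FastAPI
from fastapi.responses import FileResponse
from fastapi.staticfiles import StaticFiles

app = FastAPI()
DIST = Path("FRRONTEEEND") / "dist"

# Serve the built JS/CSS bundles produced by `npm run build`
app.mount("/assets", StaticFiles(directory=DIST / "assets"), name="assets")

# Catch-all: any non-API path falls back to index.html so
# client-side routing survives refreshes and deep links
@app.get("/{full_path:path}")
async def serve_spa(full_path: str) -> FileResponse:
    return FileResponse(DIST / "index.html")
```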
### ✅ Frontend Updates
1. **Removed Google GenAI dependency** from [package.json](FRRONTEEEND/package.json)
2. **Updated ChatInterface.tsx** to call backend `/chat` endpoint instead of external API
3. **Added environment configuration**:
   - `.env` for local development
   - `.env.production` for production builds
4. **Updated vite.config.ts** with proxy configuration

### ✅ Configuration Files
1. **requirements.txt**: Commented out Gradio (no longer needed)
2. **Dockerfile**: Added multi-stage build for React frontend
3. **.dockerignore**: Excluded node_modules and frontend dev files
4. **New Scripts**:
   - `start.ps1` / `start.sh` - Quick start scripts
   - `build-and-deploy.ps1` / `build-and-deploy.sh` - Build scripts

### ✅ Documentation
- **FRONTEND_INTEGRATION.md**: Complete integration guide
- **README.md**: Updated with frontend announcement

## 🚀 How to Run

### Quick Start (Recommended)

**Windows:**
```powershell
.\start.ps1
```

**Linux/Mac:**
```bash
chmod +x start.sh
./start.sh
```

### Manual Steps

1. **Build Frontend** (already done ✅):
```bash
cd FRRONTEEEND
npm.cmd install
npm.cmd run build
cd ..
```

2. **Set Environment Variables**:
```powershell
# Required
$env:GROQ_API_KEY="your-groq-api-key-here"

# Optional
$env:GOOGLE_API_KEY="your-google-api-key"
```

3. **Start Backend**:
```bash
python src\api\app.py
```

4. **Access Application**:
Open browser to: **http://localhost:8080**

## 🏗️ Architecture

```
┌─────────────────────────────────────────────────────────┐
│                        Browser                          │
│                                                         │
│  ┌──────────────────────────────────────────────────┐  │
│  │  React Frontend (Port 8080)                      │  │
│  │  - Landing Page (HeroGeometric, etc.)            │  │
│  │  - Chat Interface (ChatInterface.tsx)            │  │
│  └──────────────────────────────────────────────────┘  │
│                          │                              │
│                          │ HTTP POST /chat              │
└──────────────────────────┼──────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────┐
│               FastAPI Backend (Port 8080)               │
│                                                         │
│  ┌──────────────────────────────────────────────────┐  │
│  │  API Endpoints                                   │  │
│  │  - POST /chat     → Chat with agent              │  │
│  │  - POST /run      → Full workflow                │  │
│  │  - POST /profile  → Dataset profiling            │  │
│  │  - GET  /tools    → List tools                   │  │
│  │  - GET  /*        → Serve React app              │  │
│  └──────────────────────────────────────────────────┘  │
│                          │                              │
│                          ▼                              │
│  ┌──────────────────────────────────────────────────┐  │
│  │  DataScienceCopilot (orchestrator.py)            │  │
│  │  - 82+ Tools                                     │  │
│  │  - Groq LLM                                      │  │
│  │  - Session Memory                                │  │
│  └──────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
```

## 🎯 Key Endpoints

### `/chat` - Conversational Interface
```typescript
POST /chat
Content-Type: application/json

{
  "messages": [
    {"role": "user", "content": "Profile my dataset"},
    {"role": "assistant", "content": "..."}
  ],
  "stream": false
}
```

**Response:**
```json
{
  "success": true,
  "message": "I can help you profile your dataset...",
  "model": "llama-3.3-70b-versatile",
  "provider": "groq"
}
```

### `/run` - Complete Workflow
```bash
POST /run
Content-Type: multipart/form-data

file: <dataset.csv>
task_description: "Predict house prices"
target_col: "price"
```

### `/profile` - Quick Profiling
```bash
POST /profile
Content-Type: multipart/form-data

file: <dataset.csv>
```

## 📝 Environment Variables

### Backend (.env or system)
```env
# Required
GROQ_API_KEY=your-groq-api-key

# Optional
GOOGLE_API_KEY=your-google-api-key
GCP_PROJECT_ID=your-project-id
LLM_PROVIDER=groq  # or "gemini"
```

### Frontend (FRRONTEEEND/.env)
```env
# Development
VITE_API_URL=http://localhost:8080

# Production (FRRONTEEEND/.env.production)
VITE_API_URL=https://your-cloud-run-url.run.app
```

## 🐳 Docker Deployment

The Dockerfile now includes a multi-stage build:

```bash
# Build image
docker build -t data-science-agent .

# Run container
docker run -p 8080:8080 \
  -e GROQ_API_KEY=your-key \
  data-science-agent
```

## ☁️ Google Cloud Run Deployment

```bash
# Build and push
gcloud builds submit --tag gcr.io/YOUR-PROJECT-ID/data-science-agent

# Deploy
gcloud run deploy data-science-agent \
  --image gcr.io/YOUR-PROJECT-ID/data-science-agent \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --set-env-vars GROQ_API_KEY=your-api-key
```

## 🔍 Testing

### Test Backend API
```bash
# Health check
curl http://localhost:8080/health

# List tools
curl http://localhost:8080/tools

# Chat
curl -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello, what can you do?"}
    ]
  }'
```

### Test Frontend
1. Open browser: http://localhost:8080
2. Click "Launch Console"
3. Type a message and send

## 🎨 Frontend Development

For frontend development with hot-reloading:

**Terminal 1 - Backend:**
```bash
python src\api\app.py
```

**Terminal 2 - Frontend:**
```bash
cd FRRONTEEEND
npm.cmd run dev
```

Access:
- Frontend Dev: http://localhost:3000
- Backend API: http://localhost:8080

## 📦 Build Status

✅ **Frontend Built**: FRRONTEEEND/dist/ contains:
- index.html
- assets/index-[hash].js (384 KB)

✅ **Backend Ready**: src/api/app.py configured to:
- Serve static files from FRRONTEEEND/dist/assets
- Route all non-API requests to index.html
- Handle /chat endpoint

## 🔄 Migration Notes

### What's Deprecated
- ❌ `chat_ui.py` - Old Gradio interface (kept for reference)
- ❌ Direct Google GenAI calls from frontend

### What's New
- ✅ React 19 + TypeScript
- ✅ Vite 6 build system
- ✅ Tailwind CSS styling
- ✅ Framer Motion animations
- ✅ Backend-first architecture

## 🐛 Troubleshooting

### Issue: Frontend shows 404
**Solution**: Make sure you've built the frontend:
```bash
cd FRRONTEEEND
npm.cmd run build
```

### Issue: API errors in chat
**Solution**:
1. Check the backend is running: `python src\api\app.py`
2. Verify GROQ_API_KEY is set
3. Check the console for errors

### Issue: CORS errors
**Solution**: The backend has CORS enabled. If issues persist, check the `allow_origins` in app.py

### Issue: Module import errors
**Solution**: Make sure all Python dependencies are installed:
```bash
pip install -r requirements.txt
```

## 📚 Additional Resources

- **[FRONTEND_INTEGRATION.md](FRONTEND_INTEGRATION.md)** - Detailed integration guide
- **[README.md](README.md)** - Main project documentation
- **[DEPLOYMENT.md](DEPLOYMENT.md)** - Cloud deployment guide

## ✨ Next Steps

1. **File Upload**: Add file upload capability to ChatInterface
2. **Visualizations**: Display charts and plots in chat
3. **Session Persistence**: Store chat history in backend
4. **Authentication**: Add user authentication
5. **Streaming**: Implement streaming responses
6. **Dark/Light Mode**: Add theme toggle

---

**Status**: ✅ Ready to use!

**Last Updated**: December 27, 2025
QUICK_REFERENCE.txt
DELETED
@@ -1,71 +0,0 @@

╔═══════════════════════════════════════════════════════════════╗
║        🚀 DATA SCIENCE AGENT - QUICK REFERENCE                ║
║             Now powered by Google Gemini! 🤖                  ║
╚═══════════════════════════════════════════════════════════════╝

┌───────────────────────────────────────────────────────────────┐
│  1. SET API KEY (REQUIRED!)                                   │
└───────────────────────────────────────────────────────────────┘

PowerShell:
  $env:GOOGLE_API_KEY="your-google-api-key-here"

Get your key: https://aistudio.google.com/app/apikey

┌───────────────────────────────────────────────────────────────┐
│  2. START THE APPLICATION                                     │
└───────────────────────────────────────────────────────────────┘

  .\start.ps1

┌───────────────────────────────────────────────────────────────┐
│  3. ACCESS THE APP                                            │
└───────────────────────────────────────────────────────────────┘

Open browser: http://localhost:8080

┌───────────────────────────────────────────────────────────────┐
│  WHAT'S INCLUDED                                              │
└───────────────────────────────────────────────────────────────┘

✅ Modern React frontend with landing page
✅ Professional chat interface
✅ Google Gemini 2.0 Flash integration
✅ 82+ data science tools
✅ Complete ML pipeline automation

┌───────────────────────────────────────────────────────────────┐
│  KEY FILES                                                    │
└───────────────────────────────────────────────────────────────┘

📖 GEMINI_UPDATE.md        - Gemini migration details
📖 CHECKLIST.md            - Pre-launch checklist
📖 MIGRATION_COMPLETE.md   - Full change log
📖 FRONTEND_INTEGRATION.md - Technical docs

┌───────────────────────────────────────────────────────────────┐
│  TROUBLESHOOTING                                              │
└───────────────────────────────────────────────────────────────┘

Issue: "API key not configured"
→ Set: $env:GOOGLE_API_KEY="your-key"

Issue: "Frontend not found"
→ Run: cd FRRONTEEEND && npm run build

Issue: "Module not found"
→ Run: pip install -r requirements.txt

┌───────────────────────────────────────────────────────────────┐
│  API ENDPOINTS                                                │
└───────────────────────────────────────────────────────────────┘

POST /chat    - Chat with Gemini agent
POST /run     - Full ML workflow
POST /profile - Quick dataset profiling
GET  /tools   - List available tools
GET  /docs    - API documentation

╔═══════════════════════════════════════════════════════════════╗
║          Ready to start? Run: .\start.ps1                     ║
╚═══════════════════════════════════════════════════════════════╝
README.md
CHANGED
@@ -1,632 +1,365 @@

-# Data Science Agent

-

-

-
-
-
-
-> The application now features a **professional React-based web interface** with a beautiful landing page and chat UI, replacing the old Gradio interface.
->
-> **Quick Start:**
-> ```powershell
-> .\start.ps1  # Windows
-> ```
-> or
-> ```bash
-> ./start.sh  # Linux/Mac
-> ```
->
-> 📖 **[See Full Frontend Integration Guide →](FRONTEND_INTEGRATION.md)**

---

-##

-
-
-

-
-
-
-- **
-- **
-- **
-- **Session Memory**: Contextual awareness across conversations ("cross-validate it", "try with Ridge")
-- **Code Interpreter**: Write and execute custom Python code for tasks beyond predefined tools
-- **Error Recovery**: Automatic retry with corrected parameters
-- **Reasoning Modules**: Dedicated LLM reasoning layer with 19 specialized functions
-- **Cloud Integration**: BigQuery data access + GCS artifact storage
-
-### 🎨 **Multiple Interfaces**
-- **Gradio Web UI** (`chat_ui.py`): Upload files, chat interface, visual plots
-- **CLI Interface** (`src/cli.py`): Command-line workflow automation
-- **REST API** (`src/api/app.py`): Cloud Run-ready FastAPI wrapper
-- **Python SDK**: Direct programmatic access

### 📊 **Complete ML Pipeline**
-1. **Data Profiling**
-2. **Data Cleaning**
-3. **Feature Engineering**
-4. **Model Training**
-5. **Hyperparameter Tuning**
-6. **
-7. **
-8. **
-
-### ⚡ **
-- **
-- **
-- **
-- **
-- **
-
----
-
-## 🏗️ Architecture
-
-### **System Design**
-
-```
-┌─────────────────────────────────────────────────────────────┐
-│                      User Interfaces                        │
-│       Gradio UI │ CLI │ REST API │ Python SDK               │
-└─────────────────────────┬───────────────────────────────────┘
-                          ▼
-┌─────────────────────────────────────────────────────────────┐
-│              DataScienceCopilot Orchestrator                │
-│   • LLM Function Calling (Groq/Gemini)                      │
-│   • Session Memory Management                               │
-│   • Tool Execution & Chaining                               │
-│   • Error Recovery & Retry Logic                            │
-└─────────────────────────┬───────────────────────────────────┘
-                          ▼
-┌─────────────────────────────────────────────────────────────┐
-│                   75+ Specialized Tools                     │
-│   Data Profiling │ Cleaning │ Feature Engineering           │
-│   Model Training │ Visualization │ EDA Reports              │
-│   NLP/Text │ Computer Vision │ Time Series │ MLOps          │
-└─────────────────────────┬───────────────────────────────────┘
-                          ▼
-┌─────────────────────────────────────────────────────────────┐
-│               Execution & Storage Backends                  │
-│   Local: Polars, sklearn, XGBoost                           │
-│   Cloud: BigQuery, Vertex AI, Cloud Storage (planned)       │
-│   Cache: SQLite with TTL                                    │
-└─────────────────────────────────────────────────────────────┘
-```
-
-### **Tech Stack**
-
-| Layer | Technologies |
-|-------|-------------|
-| **LLM** | Groq (llama-3.3-70b), Google Gemini (2.0-flash-exp) |
-| **Data Processing** | Polars, DuckDB, Pandas, PyArrow, BigQuery |
-| **ML/AI** | scikit-learn, XGBoost, LightGBM, CatBoost, Optuna |
-| **Visualization** | Matplotlib, Seaborn, Plotly |
-| **EDA Reports** | Sweetviz, ydata-profiling |
-| **Explainability** | SHAP, LIME |
-| **APIs** | FastAPI, Uvicorn |
-| **UI** | Gradio, Typer + Rich (CLI) |
-| **Storage** | SQLite (cache), CSV, Parquet, Google Cloud Storage |
-| **Cloud** | Google Cloud Run, BigQuery, GCS, Vertex AI (planned) |

---

## 🚀 Quick Start

-###
-- Python 3.
--

-###

```bash
-
-
-
-
-# Create virtual environment
-python -m venv .venv
-source .venv/bin/activate  # On Windows: .venv\Scripts\activate
-
-# Install dependencies
-pip install -r requirements.txt

-
cp .env.example .env
-# Edit .env and add your
-# GROQ_API_KEY=your_groq_key
-# GOOGLE_API_KEY=your_google_key (optional)
-# LLM_PROVIDER=groq  # or "gemini"
```

-
-
-#### **1. Gradio Web UI** (Recommended for beginners)
```bash
-
-# Opens at http://localhost:7860
-# Upload CSV → Ask: "Analyze this data and predict house prices"
```

-
```bash
-
-
-
-
-python src/cli.py profile data.csv
-
-# Train models only
-python src/cli.py train cleaned.csv Survived --task-type classification
```

-
-
-
-
-
-agent = DataScienceCopilot(
-    provider="groq",  # or "gemini"
-    reasoning_effort="medium"
-)
-
-# Run workflow
-result = agent.analyze(
-    file_path="titanic.csv",
-    task_description="Build a model to predict passenger survival",
-    target_col="Survived"
-)
-
-print(f"Status: {result['status']}")
-print(f"Best Model: {result['best_model']}")
-print(f"Accuracy: {result['best_score']}")
```

-
```bash
-
-
-python app.py
-# Server runs at http://localhost:8080
-
-# Make API call
-curl -X POST http://localhost:8080/run \
-  -F "file=@data.csv" \
-  -F "task_description=Analyze and predict churn" \
-  -F "target_col=churn"
```

-
-
-## 📁 Project Structure
-
-```
-Data-Science-Agent/
-├── src/
-│   ├── orchestrator.py            # Main agent brain (1,136 lines)
-│   ├── cli.py                     # CLI interface (346 lines)
-│   ├── api/
-│   │   └── app.py                 # FastAPI Cloud Run wrapper (331 lines)
-│   ├── bigquery/                  # BigQuery integration 🆕
-│   │   ├── __init__.py            # BigQuery tools (4 functions)
-│   │   └── client.py              # BigQuery client wrapper
-│   ├── storage/                   # Artifact storage 🆕
-│   │   ├── artifact_store.py      # Local + GCS backends (613 lines)
-│   │   └── helpers.py             # Storage helper functions (125 lines)
-│   ├── reasoning/                 # LLM reasoning layer 🆕
-│   │   ├── __init__.py            # Core reasoning engine (350 lines)
-│   │   ├── data_understanding.py  # Data insights (6 functions)
-│   │   ├── model_explanation.py   # Model interpretation (6 functions)
-│   │   └── business_summary.py    # Business translations (7 functions)
-│   ├── cache/
-│   │   └── cache_manager.py       # SQLite caching with TTL
-│   ├── tools/                     # 82+ specialized tools
-│   │   ├── data_profiling.py      # Dataset analysis
-│   │   ├── data_cleaning.py       # Cleaning & preprocessing
-│   │   ├── feature_engineering.py # Feature creation
-│   │   ├── model_training.py      # ML training
-│   │   ├── visualization_engine.py   # Matplotlib/Seaborn plots
-│   │   ├── plotly_visualizations.py  # Interactive charts
-│   │   ├── eda_reports.py         # Sweetviz, ydata-profiling
-│   │   ├── advanced_*.py          # Advanced features
-│   │   └── tools_registry.py      # All 82 tool definitions (1,600+ lines)
-│   └── utils/                     # Helper utilities
-│       ├── polars_helpers.py      # Data manipulation
-│       └── validation.py          # Input validation
-├── chat_ui.py                     # Gradio web interface (912 lines)
-├── examples/
-│   └── titanic_example.py         # Complete workflow demo
-├── outputs/
-│   ├── data/                      # Processed datasets
-│   ├── models/                    # Trained models (.pkl)
-│   ├── plots/                     # Visualizations (.png, .html)
-│   └── reports/                   # EDA reports (.html)
-├── cache_db/                      # SQLite cache storage
-├── requirements.txt               # Python dependencies
-├── .env.example                   # Environment template
-└── README.md                      # This file
-```

---

-##
-
-### **📊 Data Profiling & Analysis (7 tools)**
-- `profile_dataset`, `detect_data_quality_issues`, `analyze_correlations`, `get_smart_summary`, `compare_datasets`, `calculate_statistics`, `detect_skewness`
-
-### **☁️ BigQuery Integration (4 tools)** 🆕
-- `bigquery_profile_table`, `bigquery_load_table`, `bigquery_execute_query`, `bigquery_write_results`
-
-### **🧹 Data Cleaning (8 tools)**
-- `clean_missing_values`, `handle_outliers`, `remove_duplicates`, `filter_rows`, `rename_columns`, `drop_columns`, `sort_data`, `fix_data_types`
-
-### **🔧 Feature Engineering (13 tools)**
-- `encode_categorical`, `force_numeric_conversion`, `smart_type_inference`, `create_time_features`, `create_interaction_features`, `create_aggregation_features`, `create_ratio_features`, `create_statistical_features`, `create_log_features`, `create_binned_features`, `engineer_text_features`, `auto_feature_engineering`, `auto_feature_selection`
-
-### **🤖 Model Training & Tuning (6 tools)**
-- `train_baseline_models`, `hyperparameter_tuning`, `train_ensemble_models`, `perform_cross_validation`, `generate_model_report`, `auto_ml_pipeline`
-
-### **📈 Visualization (11 tools)**
-- `generate_all_plots`, `generate_data_quality_plots`, `generate_eda_plots`, `generate_model_performance_plots`, `generate_feature_importance_plot`, `generate_interactive_scatter`, `generate_interactive_histogram`, `generate_interactive_correlation_heatmap`, `generate_interactive_box_plots`, `generate_interactive_time_series`, `generate_plotly_dashboard`
-
-### **📊 EDA Reports (3 tools)**
-- `generate_sweetviz_report`, `generate_ydata_profiling_report`, `generate_combined_eda_report`

-###
-- `perform_eda_analysis`, `detect_model_issues`, `detect_anomalies`, `detect_and_handle_multicollinearity`, `perform_statistical_tests`, `analyze_root_cause`, `detect_trends_and_seasonality`, `detect_anomalies_advanced`, `perform_hypothesis_testing`, `analyze_distribution`, `perform_segment_analysis`

-
-

-###
-- `monitor_model_drift`, `explain_predictions`, `generate_model_card`, `perform_ab_test_analysis`, `detect_feature_leakage`

-
-
-
-### **💼 Business Intelligence (4 tools)**
-- `perform_cohort_analysis`, `perform_rfm_analysis`, `detect_causal_relationships`, `generate_business_insights`
-
-### **📚 NLP/Text (4 tools)**
-- `perform_topic_modeling`, `perform_named_entity_recognition`, `analyze_sentiment_advanced`, `perform_text_similarity`

-
-- `extract_image_features`, `perform_image_clustering`, `analyze_tabular_image_hybrid`

-

-

-
-The agent remembers context across conversations:

-
-# Conversation 1
-"Train a model on earthquake.csv to predict magnitude"
-→ Agent trains XGBoost, achieves 0.92 R²
-
-# Conversation 2 (Same session)
-"Cross-validate it"
-→ Agent knows: model=XGBoost, dataset=earthquake.csv, target=magnitude
-→ Runs 5-fold CV automatically
```

-
-Execute custom Python code for tasks beyond predefined tools:

-
-User: "Make a Plotly scatter with custom dropdown filters"

-Agent: execute_python_code(code='''
-import plotly.graph_objects as go
-df = pd.read_csv('./temp/data.csv')
-# Custom visualization code...
-fig.write_html('./outputs/code/custom_plot.html')
-''')
```
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
```

---

-##

-
-Direct access to BigQuery tables without local downloads:

-
-# Profile a BigQuery table
-agent.chat("Profile the table project.dataset.sales")

-
-agent.chat("Query top 10 customers by revenue from BigQuery")

-
-
```

-**
-- `bigquery_profile_table`: Get statistics for any BigQuery table
-- `bigquery_load_table`: Load BigQuery data into local Polars DataFrame
-- `bigquery_execute_query`: Run SQL queries directly on BigQuery
-- `bigquery_write_results`: Write processed data back to BigQuery

-**Setup:**
```bash
-#
-
-
-# Set environment variable
-export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
```

-
-
-The project defines stable BigQuery table schemas for BI tools (see [`BIGQUERY_SCHEMAS.md`](BIGQUERY_SCHEMAS.md)):
-- 📊 `model_metrics` - Model performance tracking over time
-- 🎯 `feature_importance` - Feature impact analysis
-- 🔮 `predictions` - Prediction monitoring with actuals
-- 📋 `data_profile_summary` - Data quality metrics

-
-- Stable schemas (no breaking changes without versioning)
-- Consistent snake_case naming
-- Clear dimension/metric separation
-- Dashboard-ready with sample Looker views

-
-

-
-# Local storage (default)
-agent.save_model(model, "my_model.pkl")
-# → Saves to outputs/models/my_model.pkl

-
-agent.save_model(model, "my_model.pkl")
-# → Saves to gs://your-bucket/models/my_model_v1.pkl with versioning
-```

-
-- **Automatic Backend Selection**: Uses GCS if credentials available, falls back to local
-- **Versioning**: Automatic version suffixes for GCS artifacts
-- **Metadata**: Stores creation time, size, checksums
-- **Unified API**: Same code works for local and cloud storage

-**Setup:**
```bash
-#
-

-#
-
-```

-
-
-
-```python
-from reasoning.data_understanding import explain_dataset
-from reasoning.model_explanation import explain_model_performance
-from reasoning.business_summary import create_executive_summary
-
-# Data insights
-insights = explain_dataset(summary={
-    "rows": 10000,
-    "columns": 20,
-    "missing_values": {"age": {"count": 150, "percentage": 1.5}}
-})
-
-# Model explanations
-explanation = explain_model_performance(metrics={
-    "accuracy": 0.95,
-    "precision": 0.92,
-    "recall": 0.88
-}, task_type="classification")
-
-# Business summaries
-summary = create_executive_summary(
-    project_results={"model_accuracy": 0.95},
-    project_name="churn_prediction",
-    business_objective="Reduce customer churn"
-)
-```

-
-
-
-- **Business Summary**: create_executive_summary, estimate_business_impact, create_stakeholder_report, translate_technical_to_business, prioritize_next_steps, explain_to_customer, assess_deployment_readiness (7 functions)

-
-
-
-
-- ✅ **Dual Backend**: Works with both Gemini and Groq

---

-##

-###

-
-
-
-
-
-
-# Model Selection
-GROQ_MODEL=llama-3.3-70b-versatile
-GEMINI_MODEL=gemini-2.0-flash-exp
-REASONING_EFFORT=medium  # low, medium, high
-
-# Cache Settings
-CACHE_DB_PATH=./cache_db/cache.db
-CACHE_TTL_SECONDS=86400  # 24 hours

-
-
-GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-key.json  # For BigQuery + GCS

-
-
```

-###

-
-
-
-
-| **Free Tier** | 100K tokens/day | 1,500 requests/day |
-| **Rate Limit** | 12K tokens/min | 10 requests/min |
-| **Best For** | High-volume, low-latency | Free tier, high quota |

---

-##
-
-### **Deploy REST API**

-
-# 1. Build Docker image (Dockerfile provided)
-docker build -t data-science-agent .
-
-# 2. Push to Google Container Registry
-gcloud builds submit --tag gcr.io/PROJECT_ID/data-science-agent
-
-# 3. Deploy to Cloud Run
-gcloud run deploy data-science-agent \
-  --image gcr.io/PROJECT_ID/data-science-agent \
-  --platform managed \
-  --region us-central1 \
-  --allow-unauthenticated \
-  --memory 4Gi \
-  --timeout 3600 \
-  --set-env-vars GROQ_API_KEY=your_key,LLM_PROVIDER=groq
-
-# 4. Test deployment
-curl -X POST https://your-service-url/run \
-  -F "file=@data.csv" \
-  -F "task_description=Predict churn"
-```
-
-### **API Endpoints**

-
-- `GET /health` - Readiness probe
-- `POST /run` - Full analysis workflow
-- `POST /profile` - Quick dataset profiling
-- `GET /tools` - List all available tools

-

-
-
-### **Phase 1: Core Agent** ✅ COMPLETE
-- [x] 75 specialized tools
-- [x] Dual LLM support (Groq + Gemini)
-- [x] CLI + Gradio UI
-- [x] SQLite caching
-- [x] Token optimization
-
-### **Phase 2: Intelligence** ✅ COMPLETE
-- [x] Session memory
-- [x] Code interpreter
-- [x] Error recovery
-- [x] EDA reports (Sweetviz, ydata-profiling)
-- [x] Interactive Plotly visualizations
-
-### **Phase 3: Cloud Native** ✅ COMPLETE
-- [x] FastAPI Cloud Run wrapper with 4 REST endpoints
-- [x] BigQuery integration (4 tools: profile, load, query, write)
-- [x] Artifact Storage abstraction (Local ↔ GCS switching)
-- [x] Reasoning modules for LLM explanations (19 functions)
-- [x] Looker-compatible BigQuery schemas (4 stable tables)
-- [ ] Vertex AI model training (planned)
-- [ ] Cloud Logging & Monitoring (planned)
-
-### **Phase 4: Enterprise** 📋 PLANNED
-- [ ] Multi-user authentication
-- [ ] Team workspaces
-- [ ] Model registry
-- [ ] Automated retraining pipelines
-
-### **Phase 5: Kaggle Integration** 🎯 FUTURE
-- [ ] Direct Kaggle API integration
-- [ ] Automated competition workflow
-- [ ] Ensemble strategies
-- [ ] Submission automation

---

## 🤝 Contributing

-Contributions welcome!
-
-1. **New Tools**: Time series forecasting, NLP preprocessing, image augmentation
-2. **Cloud Backends**: AWS, Azure support
-3. **Performance**: Optimize tool execution, reduce latency
-4. **UI/UX**: Better visualization, workflow builder
-5. **Documentation**: Tutorials, video guides, blog posts

---

-##

-

---

-##

-- **
-- **

---

-##

-
--
--
-- **Supported Models**: 10+ (LR, Ridge, Lasso, RF, XGBoost, LightGBM, CatBoost, etc.)
-- **Visualization Types**: 20+ (static + interactive)
-- **Data Formats**: CSV, Parquet, JSON, BigQuery tables
-- **Cloud Platforms**: Google Cloud (Run, BigQuery, GCS) - AWS/Azure planned

---

<div align="center">

-**Built with ❤️ for
-
-*"Making data science accessible through AI automation"*

-⭐ Star this repo if you find it

</div>

+# 🤖 AI-Powered Data Science Agent

+> **An intelligent autonomous agent that performs end-to-end data science workflows through natural language**

+Upload your dataset, describe what you want in plain English, and watch as the AI agent handles profiling, cleaning, feature engineering, model training, hyperparameter tuning, and comprehensive reporting - all automatically.

+[](https://reactjs.org/)
+[](https://fastapi.tiangolo.com/)
+[](https://ai.google.dev/)
+[](https://python.org/)

---

+## ✨ Key Features

+### 🎯 **Autonomous AI Agent**
+- **82+ Specialized ML Tools** organized across data profiling, cleaning, feature engineering, model training, and visualization
+- **Intelligent Orchestration** with Google Gemini 2.5 Flash for function calling and decision-making
+- **Session Memory** for contextual awareness across conversations
+- **Smart Intent Detection** automatically classifies tasks (ML pipeline, cleaning only, visualization, etc.)
+- **Error Recovery** with automatic retry logic and file tracking

+### 🎨 **Modern Web Interface**
+- **Beautiful React Frontend** with glassmorphism design and smooth animations
+- **Interactive Chat** with file upload support (CSV, Parquet)
+- **Report Viewer** to view YData profiling and Sweetviz HTML reports in-app
+- **Markdown Support** for formatted responses
+- **Session Management** to maintain conversation history

### 📊 **Complete ML Pipeline**
+1. **Data Profiling** - Automated statistical analysis and data quality assessment
+2. **Data Cleaning** - Smart missing value handling, outlier treatment, type conversion
+3. **Feature Engineering** - Time-based features, encoding, interactions, statistical features
+4. **Model Training** - Ridge, Lasso, Random Forest, XGBoost, LightGBM, CatBoost
+5. **Hyperparameter Tuning** - Optuna-based optimization with 50+ trials
+6. **Cross-Validation** - Stratified K-fold validation for robust evaluation
+7. **Visualization** - Interactive Plotly dashboards and correlation heatmaps
+8. **Reporting** - Comprehensive HTML reports with YData Profiling
+
+
### ⚡ **Production Ready**
|
| 41 |
+
- **FastAPI Backend** with async support and automatic API documentation
|
| 42 |
+
- **Docker Support** with multi-stage builds for optimized deployment
|
| 43 |
+
- **Rate Limiting** configured for Gemini API (6.5s intervals for 10 RPM limit)
|
| 44 |
+
- **Caching System** for faster repeated queries
|
| 45 |
+
- **CORS Enabled** for frontend-backend communication
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
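The 6.5-second spacing follows directly from the quota arithmetic: 10 requests per minute allows at most one call every 60 / 10 = 6 seconds, and the extra half second is safety margin. A minimal sketch of such a spacing guard (illustrative only; the backend's actual limiter may be implemented differently):

```python
import time

# Assumed quota: 10 requests/minute -> one call every 6s, padded to 6.5s.
MIN_INTERVAL_SECONDS = 6.5

_last_call = 0.0

def wait_for_slot() -> None:
    """Block just long enough to keep calls at least MIN_INTERVAL_SECONDS apart."""
    global _last_call
    elapsed = time.monotonic() - _last_call
    if elapsed < MIN_INTERVAL_SECONDS:
        time.sleep(MIN_INTERVAL_SECONDS - elapsed)
    _last_call = time.monotonic()
```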
---

## 🚀 Quick Start

### Prerequisites
- Python 3.10+
- Node.js 18+ (for frontend)
- Google Gemini API key ([Get one here](https://ai.google.dev/))

### Installation

**1. Clone the repository**
```bash
git clone https://github.com/Pulastya-B/DevSprint-Data-Science-Agent.git
cd DevSprint-Data-Science-Agent
```

**2. Set up environment variables**
```bash
cp .env.example .env
# Edit .env and add your GOOGLE_API_KEY
```

**3. Install Python dependencies**
```bash
pip install -r requirements.txt
```

**4. Install frontend dependencies**
```bash
cd FRRONTEEEND
npm install
npm run build
cd ..
```

**5. Run the application**

**Windows:**
```powershell
.\start.ps1
```

**Linux/Mac:**
```bash
chmod +x start.sh
./start.sh
```

The application will be available at **http://localhost:8080**

---

## 📖 Usage

### Web Interface

1. **Navigate to http://localhost:8080**
2. **Click "Launch Agent"** from the landing page
3. **Upload your dataset** (CSV or Parquet format)
4. **Type your request** in natural language:
   - "Generate a comprehensive report on this dataset"
   - "Train a model to predict [target_column]"
   - "Clean the data and show me visualizations"
   - "Perform feature engineering and train the best model"
5. **View results** in the chat and click "View Report" buttons to see detailed HTML reports

### Example Queries

```
📊 "Profile this dataset and tell me about data quality issues"

🧹 "Clean the missing values and handle outliers"

🎯 "Train a model to predict house prices with target column 'price'"

📈 "Generate a correlation heatmap and feature importance plot"

🔧 "Create time-based features and perform hyperparameter tuning"

📋 "Generate a comprehensive YData profiling report"
```
---

## 🏗️ Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                  React Frontend (Port 8080)                 │
│    Landing Page  │  Chat Interface  │  Report Viewer        │
└─────────────────────────┬───────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│               FastAPI Backend (Python 3.10+)                │
│    /chat  │  /run  │  /outputs  │  /api/health              │
└─────────────────────────┬───────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│              DataScienceCopilot Orchestrator                │
│  • Gemini 2.5 Flash Integration                             │
│  • 82+ Specialized Tools                                    │
│  • Session Memory & Context                                 │
│  • Intelligent Intent Detection                             │
│  • Error Recovery & Loop Prevention                         │
└─────────────────────────┬───────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│                      Tool Categories                        │
│   Profiling │ Cleaning │ Feature Engineering │ ML Training  │
│   Visualization │ EDA Reports │ Data Wrangling              │
└─────────────────────────────────────────────────────────────┘
```
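The endpoints above can also be exercised programmatically with any HTTP client. The sketch below is assumption-laden: it reuses the multipart `file` + `task_description` request shape documented for the project's earlier Cloud Run deployment of `/run`; confirm the current schema against the FastAPI docs at `/docs`.

```python
import requests

BASE_URL = "http://localhost:8080"  # local run from the Quick Start above

# Assumed request shape (multipart file + task description), carried over
# from the earlier /run docs -- verify against the live OpenAPI schema.
with open("data.csv", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/run",
        files={"file": ("data.csv", f, "text/csv")},
        data={"task_description": "Train a model to predict churn"},
        timeout=600,  # a full ML pipeline can take several minutes
    )
resp.raise_for_status()
print(resp.json())
```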
---

## 🛠️ Tech Stack

### Frontend
- **React 19** - Modern UI library
- **TypeScript 5.8** - Type-safe development
- **Vite 6** - Lightning-fast build tool
- **Tailwind CSS** - Utility-first styling
- **Framer Motion** - Smooth animations
- **React Markdown** - Formatted responses

### Backend
- **FastAPI** - High-performance Python web framework
- **Google Gemini 2.5 Flash** - LLM for agent orchestration
- **Polars** - Fast dataframe library (10-100x faster than pandas; see the sketch below)
- **Scikit-learn** - Classical ML algorithms
- **XGBoost / LightGBM / CatBoost** - Gradient boosting frameworks
- **Optuna** - Hyperparameter optimization
- **YData Profiling** - Automated EDA reports
- **Plotly / Matplotlib** - Interactive visualizations

### DevOps
- **Docker** - Containerization with multi-stage builds
- **Python-dotenv** - Environment variable management
- **SQLite** - Caching layer for performance
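Much of the Polars speedup comes from its Rust core and lazy execution: `scan_csv` builds a query plan, so filters and column selections are pushed into the reader before any data is materialized. A small illustrative example (not code from this repo; the column names are hypothetical):

```python
import polars as pl

# Lazy scan: nothing is read until .collect(), so the planner can prune
# columns and push the filter down into the CSV reader.
result = (
    pl.scan_csv("data.csv")
    .filter(pl.col("magnitude") > 4.0)  # hypothetical column
    .group_by("region")                 # hypothetical column
    .agg(pl.col("depth").mean().alias("mean_depth"))
    .collect()
)
print(result)
```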
---

## 🐳 Docker Deployment

**Build and run with Docker:**

```bash
docker build -t ds-agent .
docker run -p 8080:8080 --env-file .env ds-agent
```

**Or use the deployment script:**

```bash
.\build-and-deploy.ps1  # Windows
./build-and-deploy.sh   # Linux/Mac
```
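Once the container is up, a quick smoke test against the readiness endpoint confirms the API is serving (the `/api/health` path is taken from the architecture diagram above; adjust if your build exposes a different one):

```python
import requests

# /api/health is the readiness endpoint listed in the architecture diagram.
resp = requests.get("http://localhost:8080/api/health", timeout=10)
print(resp.status_code, resp.text)
```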
---
## 📂 Project Structure

```
.
├── FRRONTEEEND/                  # React frontend
│   ├── components/               # UI components
│   │   ├── ChatInterface.tsx     # Main chat interface
│   │   ├── HeroGeometric.tsx     # Landing page hero
│   │   └── ...
│   ├── dist/                     # Built frontend
│   └── package.json
│
├── src/                          # Python backend
│   ├── api/
│   │   └── app.py                # FastAPI application
│   ├── orchestrator.py           # Agent orchestrator
│   ├── session_memory.py         # Session management
│   ├── tools/                    # 82+ ML tools
│   │   ├── data_profiling.py
│   │   ├── data_cleaning.py
│   │   ├── feature_engineering.py
│   │   ├── model_training.py
│   │   └── ...
│   └── utils/                    # Helper utilities
│
├── Dockerfile                    # Multi-stage Docker build
├── requirements.txt              # Python dependencies
├── start.ps1 / start.sh          # Quick start scripts
└── README.md                     # This file
```

---
## 🔑 Environment Variables

Create a `.env` file in the root directory:

```bash
# LLM Provider Configuration
LLM_PROVIDER=gemini

# API Keys
GOOGLE_API_KEY=your_gemini_api_key_here

# Model Configuration
GEMINI_MODEL=gemini-2.5-flash

# Cache Configuration
CACHE_DB_PATH=./cache_db/cache.db
CACHE_TTL_SECONDS=86400

# Output Configuration
OUTPUT_DIR=./outputs
DATA_DIR=./data
```
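These variables are read through python-dotenv (listed under DevOps above). A minimal sketch of how a backend module can load them at startup:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
GEMINI_MODEL = os.getenv("GEMINI_MODEL", "gemini-2.5-flash")
CACHE_TTL_SECONDS = int(os.getenv("CACHE_TTL_SECONDS", "86400"))

if not GOOGLE_API_KEY:
    raise RuntimeError("GOOGLE_API_KEY is not set; see the .env template above.")
```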
---

## 🎯 Features in Detail

### Intelligent Intent Detection
The agent automatically classifies your request and applies the appropriate workflow (a minimal sketch of the idea follows the list):
- **Full ML Pipeline** - Complete end-to-end workflow with training
- **Exploratory Analysis** - Data profiling and visualization only
- **Cleaning Only** - Data quality improvements without modeling
- **Visualization Only** - Generate plots and dashboards
- **Multi-Intent** - Combine multiple tasks intelligently
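Purely as an illustration of the idea (the real detector lives in the orchestrator and is more involved), intent routing can be reduced to keyword scoring over the request text; the keyword table below is hypothetical:

```python
# Hypothetical keyword-based router; the orchestrator's actual logic differs.
INTENT_KEYWORDS = {
    "full_ml_pipeline": ["train", "model", "predict"],
    "cleaning_only": ["clean", "missing", "outlier"],
    "visualization_only": ["plot", "chart", "heatmap", "visualiz"],
    "exploratory_analysis": ["profile", "report", "eda", "quality"],
}

def detect_intents(request: str) -> list[str]:
    """Return every intent whose keywords appear in the request (multi-intent)."""
    text = request.lower()
    hits = [
        intent
        for intent, words in INTENT_KEYWORDS.items()
        if any(word in text for word in words)
    ]
    return hits or ["exploratory_analysis"]  # safe default

print(detect_intents("Clean the data and show me visualizations"))
# -> ['cleaning_only', 'visualization_only']
```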
### Session Memory
The agent remembers context across messages:
```
You: "Train a model on this dataset"
Agent: [Trains XGBoost model with R² = 0.85]

You: "Now try hyperparameter tuning"
Agent: [Automatically uses previous model and dataset]

You: "Cross-validate it"
Agent: [Applies CV to tuned model from context]
```
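Conceptually, session memory just carries the transcript plus the latest artifacts (dataset path, trained model) into each new turn. A stripped-down sketch, not the actual `src/session_memory.py`:

```python
from dataclasses import dataclass, field

@dataclass
class SessionMemory:
    """Minimal conversational state: transcript plus last-known artifacts."""
    messages: list[dict] = field(default_factory=list)
    last_dataset: str | None = None
    last_model: str | None = None

    def remember(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def context_for_llm(self, max_turns: int = 10) -> list[dict]:
        # Only the most recent turns are replayed to keep the prompt small.
        return self.messages[-max_turns:]

# "Now try hyperparameter tuning" can resolve its targets from this state.
memory = SessionMemory()
memory.remember("user", "Train a model on this dataset")
memory.last_dataset = "./temp/earthquake_data.csv"  # hypothetical path
memory.last_model = "xgboost_baseline"              # hypothetical model id
```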
### Error Recovery
- Automatic retry with corrected parameters
- File existence validation before execution
- Recovery guidance showing last successful file
- Loop detection to prevent infinite retries (sketched below)
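A hedged sketch of the retry-plus-loop-detection pattern these bullets describe (illustrative only; the orchestrator's real recovery logic, file tracking, and fallback path differ):

```python
# Illustrative sketch: retry a tool call, refusing to repeat an identical failure.
def run_with_recovery(tool, params: dict, fallback_file: str, max_retries: int = 3):
    seen_attempts: set[tuple] = set()
    for attempt in range(1, max_retries + 1):
        signature = (tool.__name__, tuple(sorted(params.items())))
        if signature in seen_attempts:
            raise RuntimeError("Loop detected: identical call already failed.")
        seen_attempts.add(signature)
        try:
            return tool(**params)
        except FileNotFoundError as err:
            # Recovery guidance: retry against the last file known to exist.
            print(f"Retry {attempt}: {err}; falling back to {fallback_file}")
            params = {**params, "file_path": fallback_file}
    raise RuntimeError("Exceeded retry budget.")
```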
### Report Viewing
- Click "View Report" buttons to see HTML reports in-app
- Full-screen modal with professional styling
- Supports YData Profiling, Sweetviz, and custom dashboards

---

## 📊 Example Workflow

**Upload:** `earthquake_data.csv` (175K rows, 22 columns)

**Prompt:** "Train a model to predict earthquake magnitude"

**Agent Actions:**
1. ✅ Profiles dataset (175,947 rows, 22 columns)
2. ✅ Detects data quality issues (11.67% missing, outliers)
3. ✅ Drops high-missing columns (>40% missing)
4. ✅ Imputes remaining missing values with median/mode
5. ✅ Handles outliers with IQR clipping
6. ✅ Extracts time-based features (year, month, hour, cyclical)
7. ✅ Encodes categorical variables
8. ✅ Trains 6 baseline models (XGBoost wins with R² = 0.716)
9. ✅ Performs hyperparameter tuning (R² = 0.743)
10. ✅ Runs 5-fold cross-validation (RMSE = 0.167 ± 0.0005)
11. ✅ Generates YData profiling report
12. ✅ Creates interactive Plotly dashboard

**Result:** Trained and tuned XGBoost model ready for deployment!

---

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

---

## 📄 License

This project is licensed under the MIT License.

---

## 🙏 Acknowledgments

- **Google Gemini** for powerful LLM capabilities
- **FastAPI** for excellent async Python framework
- **React** community for amazing UI libraries
- **Polars** for blazing-fast data processing
- **YData Profiling** for comprehensive EDA reports

---

## 📧 Contact

**Pulastya B**
- GitHub: [@Pulastya-B](https://github.com/Pulastya-B)
- Project: [DevSprint-Data-Science-Agent](https://github.com/Pulastya-B/DevSprint-Data-Science-Agent)

---

<div align="center">

**Built with ❤️ for DevSprint Hackathon**

⭐ Star this repo if you find it helpful!

</div>
chat_ui.py
DELETED
|
@@ -1,1073 +0,0 @@
|
|
| 1 |
-
"""
|
| 2 |
-
AI Agent Data Scientist - Interactive Chat UI
|
| 3 |
-
==============================================
|
| 4 |
-
|
| 5 |
-
A simple web interface to interact with your AI Agent.
|
| 6 |
-
Upload datasets, ask questions, and get AI-powered insights!
|
| 7 |
-
"""
|
| 8 |
-
|
| 9 |
-
import gradio as gr
|
| 10 |
-
import sys
|
| 11 |
-
import os
|
| 12 |
-
import shutil
|
| 13 |
-
from pathlib import Path
|
| 14 |
-
import traceback
|
| 15 |
-
|
| 16 |
-
# Add src to path
|
| 17 |
-
sys.path.append('src')
|
| 18 |
-
|
| 19 |
-
from tools.data_profiling import profile_dataset, detect_data_quality_issues
|
| 20 |
-
from tools.model_training import train_baseline_models
|
| 21 |
-
|
| 22 |
-
# Try to import AI agent (optional)
|
| 23 |
-
try:
|
| 24 |
-
from orchestrator import DataScienceCopilot
|
| 25 |
-
agent = DataScienceCopilot()
|
| 26 |
-
AI_ENABLED = True
|
| 27 |
-
print("✅ AI Agent loaded successfully!")
|
| 28 |
-
print(f"📊 Model: {agent.model}")
|
| 29 |
-
print(f"🔧 Tools available: {len(agent.tool_functions)}")
|
| 30 |
-
except Exception as e:
|
| 31 |
-
print(f"ℹ️ Running in manual mode (AI agent not available)")
|
| 32 |
-
print(f" Error: {str(e)}")
|
| 33 |
-
print("💡 You can still use all the quick actions and tools!")
|
| 34 |
-
AI_ENABLED = False
|
| 35 |
-
agent = None
|
| 36 |
-
|
| 37 |
-
# Store uploaded file path
|
| 38 |
-
current_file = None
|
| 39 |
-
current_profile = None
|
| 40 |
-
last_agent_response = None # Store last agent response for visualization extraction
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
# Helper functions for Gradio 6.x message format
|
| 44 |
-
def add_message(history, role, content):
|
| 45 |
-
"""Add a message to history in Gradio 6.x format."""
|
| 46 |
-
if history is None:
|
| 47 |
-
history = []
|
| 48 |
-
history.append({"role": role, "content": content})
|
| 49 |
-
return history
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
def add_user_message(history, content):
|
| 53 |
-
"""Add a user message to history."""
|
| 54 |
-
return add_message(history, "user", content)
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
def add_assistant_message(history, content):
|
| 58 |
-
"""Add an assistant message to history."""
|
| 59 |
-
return add_message(history, "assistant", content)
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
def update_last_assistant_message(history, content):
|
| 63 |
-
"""Update the last assistant message in history."""
|
| 64 |
-
if history and len(history) > 0 and history[-1].get("role") == "assistant":
|
| 65 |
-
history[-1]["content"] = content
|
| 66 |
-
return history
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
def get_last_user_content(history):
|
| 70 |
-
"""Get the content of the last user message."""
|
| 71 |
-
if history:
|
| 72 |
-
for msg in reversed(history):
|
| 73 |
-
if msg.get("role") == "user":
|
| 74 |
-
return msg.get("content", "")
|
| 75 |
-
return ""
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
def analyze_dataset(file, user_message, history):
|
| 79 |
-
"""Process uploaded dataset(s) and user message. Supports single or multiple file uploads."""
|
| 80 |
-
global current_file, current_profile, last_agent_response
|
| 81 |
-
|
| 82 |
-
# Initialize with empty plot list (will collect PNG file paths)
|
| 83 |
-
plots_paths = []
|
| 84 |
-
html_reports = [] # Initialize HTML reports list
|
| 85 |
-
|
| 86 |
-
# Initialize history if None
|
| 87 |
-
if history is None:
|
| 88 |
-
history = []
|
| 89 |
-
|
| 90 |
-
# Debug: Log the call
|
| 91 |
-
print(f"[DEBUG] analyze_dataset called - file: {file is not None}, message: '{user_message}', current_file: {current_file}")
|
| 92 |
-
|
| 93 |
-
try:
|
| 94 |
-
# Handle file uploads (single or multiple)
|
| 95 |
-
if file is not None:
|
| 96 |
-
# file can be a single filepath or a list of filepaths
|
| 97 |
-
files_to_process = file if isinstance(file, list) else [file]
|
| 98 |
-
|
| 99 |
-
# Filter out None values
|
| 100 |
-
files_to_process = [f for f in files_to_process if f is not None]
|
| 101 |
-
|
| 102 |
-
if len(files_to_process) > 0:
|
| 103 |
-
print(f"[DEBUG] Processing {len(files_to_process)} file(s) upload")
|
| 104 |
-
|
| 105 |
-
# Copy all files to simpler paths
|
| 106 |
-
os.makedirs("./temp", exist_ok=True)
|
| 107 |
-
processed_files = []
|
| 108 |
-
seen_files = {} # Track files by content hash to detect duplicates
|
| 109 |
-
duplicate_count = 0
|
| 110 |
-
|
| 111 |
-
for uploaded_file in files_to_process:
|
| 112 |
-
simple_filename = Path(uploaded_file.name if hasattr(uploaded_file, 'name') else uploaded_file).name
|
| 113 |
-
file_source = uploaded_file.name if hasattr(uploaded_file, 'name') else uploaded_file
|
| 114 |
-
|
| 115 |
-
# Calculate file hash to detect duplicates (even with different names)
|
| 116 |
-
import hashlib
|
| 117 |
-
hasher = hashlib.md5()
|
| 118 |
-
with open(file_source, 'rb') as f:
|
| 119 |
-
# Read file in chunks to handle large files efficiently
|
| 120 |
-
for chunk in iter(lambda: f.read(8192), b""):
|
| 121 |
-
hasher.update(chunk)
|
| 122 |
-
file_hash = hasher.hexdigest()
|
| 123 |
-
|
| 124 |
-
# Check if this exact file was already uploaded
|
| 125 |
-
if file_hash in seen_files:
|
| 126 |
-
print(f"[DEBUG] Duplicate file detected: {simple_filename} (same as {seen_files[file_hash]})")
|
| 127 |
-
duplicate_count += 1
|
| 128 |
-
continue # Skip duplicate
|
| 129 |
-
|
| 130 |
-
# Not a duplicate - process it
|
| 131 |
-
simple_path = f"./temp/{simple_filename}"
|
| 132 |
-
|
| 133 |
-
# Handle filename collision (different files with same name)
|
| 134 |
-
if os.path.exists(simple_path):
|
| 135 |
-
# Check if existing file is the same (by comparing with already processed files)
|
| 136 |
-
existing_in_processed = simple_path in processed_files
|
| 137 |
-
if not existing_in_processed:
|
| 138 |
-
# Different file with same name - add suffix
|
| 139 |
-
base_name = Path(simple_filename).stem
|
| 140 |
-
extension = Path(simple_filename).suffix
|
| 141 |
-
counter = 1
|
| 142 |
-
while os.path.exists(f"./temp/{base_name}_{counter}{extension}"):
|
| 143 |
-
counter += 1
|
| 144 |
-
simple_filename = f"{base_name}_{counter}{extension}"
|
| 145 |
-
simple_path = f"./temp/{simple_filename}"
|
| 146 |
-
print(f"[DEBUG] Filename collision - renamed to: {simple_filename}")
|
| 147 |
-
|
| 148 |
-
shutil.copy2(file_source, simple_path)
|
| 149 |
-
processed_files.append(simple_path)
|
| 150 |
-
seen_files[file_hash] = simple_filename
|
| 151 |
-
print(f"[DEBUG] Copied file to: {simple_path}")
|
| 152 |
-
|
| 153 |
-
# Set current_file to the first file (for single-file operations)
|
| 154 |
-
# For multi-file operations, the agent will use all files from ./temp/
|
| 155 |
-
current_file = processed_files[0] if processed_files else None
|
| 156 |
-
|
| 157 |
-
# Only show file upload response if there's no user message
|
| 158 |
-
if not (user_message and user_message.strip()):
|
| 159 |
-
if len(processed_files) == 0:
|
| 160 |
-
# All files were duplicates
|
| 161 |
-
response = f"⚠️ **No New Files Uploaded**\n\n"
|
| 162 |
-
response += f"All {len(files_to_process)} file(s) were duplicates of already uploaded files.\n\n"
|
| 163 |
-
response += "Your previously uploaded dataset is still active."
|
| 164 |
-
elif len(processed_files) == 1:
|
| 165 |
-
# Single file upload - show detailed profile
|
| 166 |
-
response = f"📊 **Dataset Uploaded Successfully!**\n\n"
|
| 167 |
-
if duplicate_count > 0:
|
| 168 |
-
response += f"ℹ️ *({duplicate_count} duplicate file(s) were skipped)*\n\n"
|
| 169 |
-
response += f"**File:** {Path(current_file).name}\n\n"
|
| 170 |
-
|
| 171 |
-
# Get basic profile
|
| 172 |
-
profile = profile_dataset(current_file)
|
| 173 |
-
current_profile = profile
|
| 174 |
-
|
| 175 |
-
response += f"**Dataset Overview:**\n"
|
| 176 |
-
response += f"- Rows: {profile['shape']['rows']:,}\n"
|
| 177 |
-
response += f"- Columns: {profile['shape']['columns']}\n"
|
| 178 |
-
|
| 179 |
-
# Handle memory_usage (can be float or dict)
|
| 180 |
-
memory = profile.get('memory_usage', 0)
|
| 181 |
-
if isinstance(memory, dict):
|
| 182 |
-
memory = memory.get('total_mb', 0)
|
| 183 |
-
response += f"- Memory: {memory:.2f} MB\n\n"
|
| 184 |
-
|
| 185 |
-
response += f"**Column Types:**\n"
|
| 186 |
-
response += f"- Numeric: {len(profile['column_types']['numeric'])} columns\n"
|
| 187 |
-
response += f"- Categorical: {len(profile['column_types']['categorical'])} columns\n"
|
| 188 |
-
response += f"- Datetime: {len(profile['column_types']['datetime'])} columns\n\n"
|
| 189 |
-
|
| 190 |
-
# Check data quality
|
| 191 |
-
quality = detect_data_quality_issues(current_file)
|
| 192 |
-
if quality['critical']:
|
| 193 |
-
response += f"🔴 **Critical Issues:** {len(quality['critical'])}\n"
|
| 194 |
-
for issue in quality['critical'][:3]:
|
| 195 |
-
response += f" - {issue['message']}\n"
|
| 196 |
-
if quality['warning']:
|
| 197 |
-
response += f"🟡 **Warnings:** {len(quality['warning'])}\n"
|
| 198 |
-
for issue in quality['warning'][:3]:
|
| 199 |
-
response += f" - {issue['message']}\n"
|
| 200 |
-
else:
|
| 201 |
-
# Multiple files uploaded
|
| 202 |
-
response = f"📊 **{len(processed_files)} Datasets Uploaded Successfully!**\n\n"
|
| 203 |
-
if duplicate_count > 0:
|
| 204 |
-
response += f"ℹ️ *({duplicate_count} duplicate file(s) were skipped)*\n\n"
|
| 205 |
-
response += f"**Files:**\n"
|
| 206 |
-
for i, fp in enumerate(processed_files, 1):
|
| 207 |
-
response += f"{i}. {Path(fp).name}\n"
|
| 208 |
-
response += f"\n**💡 You can now use multi-dataset operations!**\n\n"
|
| 209 |
-
|
| 210 |
-
response += f"\n\n💬 **What would you like to do with {'this dataset' if len(processed_files) == 1 else 'these datasets'}?**\n\n"
|
| 211 |
-
response += "You can ask me to:\n"
|
| 212 |
-
if len(processed_files) > 1:
|
| 213 |
-
response += "- **Merge these datasets** (e.g., 'merge customers and orders on customer_id')\n"
|
| 214 |
-
response += "- **Combine/concatenate** them (e.g., 'combine all monthly sales files')\n"
|
| 215 |
-
response += "- Train a classification or regression model\n"
|
| 216 |
-
response += "- Analyze specific columns\n"
|
| 217 |
-
response += "- Detect outliers\n"
|
| 218 |
-
response += "- Engineer features\n"
|
| 219 |
-
response += "- Generate predictions\n"
|
| 220 |
-
response += "- And much more!\n"
|
| 221 |
-
|
| 222 |
-
# Add assistant message to history
|
| 223 |
-
history = add_assistant_message(history, response)
|
| 224 |
-
yield history, "", [], []
|
| 225 |
-
return
|
| 226 |
-
# If user uploaded file AND sent a message, don't return - continue to process the message
|
| 227 |
-
elif user_message and user_message.strip():
|
| 228 |
-
# Continue processing the message below
|
| 229 |
-
pass
|
| 230 |
-
|
| 231 |
-
# If user sends a message about the current file
|
| 232 |
-
print(f"[DEBUG] Checking message conditions: user_message={bool(user_message and user_message.strip())}, current_file={bool(current_file)}")
|
| 233 |
-
if user_message and user_message.strip() and current_file:
|
| 234 |
-
print(f"[DEBUG] User message detected. AI_ENABLED={AI_ENABLED}, agent={agent is not None}")
|
| 235 |
-
if AI_ENABLED and agent:
|
| 236 |
-
print(f"[DEBUG] Entering AI Agent block...")
|
| 237 |
-
try:
|
| 238 |
-
# Show immediate processing message
|
| 239 |
-
print(f"🤖 AI Agent analyzing: {user_message}")
|
| 240 |
-
history = add_user_message(history, user_message)
|
| 241 |
-
history = add_assistant_message(history, "🤖 **AI Agent is thinking...**\n\n⏳ Analyzing your request and planning the workflow...")
|
| 242 |
-
yield history, "", [], []
|
| 243 |
-
|
| 244 |
-
# Use the AI agent to process the request
|
| 245 |
-
print(f"📂 File path: {current_file}")
|
| 246 |
-
print(f"📝 Task: {user_message}")
|
| 247 |
-
print(f"🚀 Calling agent.analyze()...")
|
| 248 |
-
|
| 249 |
-
agent_response = agent.analyze(
|
| 250 |
-
file_path=current_file,
|
| 251 |
-
task_description=user_message,
|
| 252 |
-
use_cache=False, # Disable cache to avoid dict hashing issues
|
| 253 |
-
stream=False
|
| 254 |
-
)
|
| 255 |
-
|
| 256 |
-
print(f"✅ Agent response received: {agent_response.get('status', 'unknown')}")
|
| 257 |
-
|
| 258 |
-
# Store agent response for visualization extraction
|
| 259 |
-
last_agent_response = agent_response
|
| 260 |
-
|
| 261 |
-
# Format the response
|
| 262 |
-
if agent_response.get('status') == 'success':
|
| 263 |
-
response = f"🤖 **AI Agent Analysis Complete!**\n\n"
|
| 264 |
-
response += f"{agent_response.get('summary', '')}\n\n"
|
| 265 |
-
|
| 266 |
-
if 'workflow_history' in agent_response and agent_response['workflow_history']:
|
| 267 |
-
response += f"**Execution Summary:**\n"
|
| 268 |
-
response += f"- Tools Executed: {len(agent_response['workflow_history'])}\n"
|
| 269 |
-
response += f"- Iterations: {agent_response.get('iterations', 0)}\n"
|
| 270 |
-
response += f"- Time: {agent_response.get('execution_time', 0):.1f}s\n\n"
|
| 271 |
-
|
| 272 |
-
# Find and display MODEL TRAINING RESULTS with ALL METRICS
|
| 273 |
-
model_results = None
|
| 274 |
-
for step in agent_response['workflow_history']:
|
| 275 |
-
if step.get('tool') == 'train_baseline_models':
|
| 276 |
-
result = step.get('result', {})
|
| 277 |
-
if isinstance(result, dict) and 'result' in result:
|
| 278 |
-
model_results = result['result']
|
| 279 |
-
elif isinstance(result, dict):
|
| 280 |
-
model_results = result
|
| 281 |
-
break
|
| 282 |
-
|
| 283 |
-
if model_results and 'models' in model_results:
|
| 284 |
-
response += f"## 🎯 Model Training Results\n\n"
|
| 285 |
-
task_type = model_results.get('task_type', 'unknown')
|
| 286 |
-
response += f"**Task Type:** {task_type.title()}\n"
|
| 287 |
-
response += f"**Features:** {model_results.get('n_features', 0)}\n"
|
| 288 |
-
response += f"**Training Samples:** {model_results.get('train_size', 0):,}\n"
|
| 289 |
-
response += f"**Test Samples:** {model_results.get('test_size', 0):,}\n\n"
|
| 290 |
-
|
| 291 |
-
# Show ALL models tested
|
| 292 |
-
response += "### 📊 All Models Tested:\n\n"
|
| 293 |
-
models_data = model_results.get('models', {})
|
| 294 |
-
|
| 295 |
-
for model_name, model_info in models_data.items():
|
| 296 |
-
if 'test_metrics' in model_info:
|
| 297 |
-
metrics = model_info['test_metrics']
|
| 298 |
-
response += f"**{model_name}:**\n"
|
| 299 |
-
|
| 300 |
-
if task_type == 'classification':
|
| 301 |
-
response += f"- Accuracy: {metrics.get('accuracy', 0):.4f}\n"
|
| 302 |
-
response += f"- Precision: {metrics.get('precision', 0):.4f}\n"
|
| 303 |
-
response += f"- Recall: {metrics.get('recall', 0):.4f}\n"
|
| 304 |
-
response += f"- F1 Score: {metrics.get('f1', 0):.4f}\n"
|
| 305 |
-
else:
|
| 306 |
-
response += f"- R² Score: {metrics.get('r2', 0):.4f}\n"
|
| 307 |
-
response += f"- RMSE: {metrics.get('rmse', 0):.2f}\n"
|
| 308 |
-
response += f"- MAE: {metrics.get('mae', 0):.2f}\n"
|
| 309 |
-
response += f"- MAPE: {metrics.get('mape', 0):.2f}%\n"
|
| 310 |
-
response += "\n"
|
| 311 |
-
|
| 312 |
-
# Highlight BEST MODEL
|
| 313 |
-
best_model = model_results.get('best_model', {})
|
| 314 |
-
if best_model and best_model.get('name'):
|
| 315 |
-
response += f"### 🏆 Best Model: **{best_model['name']}**\n"
|
| 316 |
-
response += f"Score: {best_model.get('score', 0):.4f}\n\n"
|
| 317 |
-
|
| 318 |
-
# Show workflow execution summary
|
| 319 |
-
response += "### 🔧 Workflow Steps:\n"
|
| 320 |
-
for i, step in enumerate(agent_response['workflow_history'], 1):
|
| 321 |
-
tool_name = step['tool']
|
| 322 |
-
success = step['result'].get('success', False)
|
| 323 |
-
icon = "✅" if success else "❌"
|
| 324 |
-
response += f"{i}. {icon} {tool_name}\n"
|
| 325 |
-
response += "\n"
|
| 326 |
-
|
| 327 |
-
# Check for plots AND reports in workflow results
|
| 328 |
-
html_reports = [] # Separate list for HTML reports
|
| 329 |
-
|
| 330 |
-
for step in agent_response['workflow_history']:
|
| 331 |
-
result = step.get('result', {})
|
| 332 |
-
|
| 333 |
-
# Deep search for plots and reports in nested results
|
| 334 |
-
def find_plots_and_reports(obj, plots_list, reports_list):
|
| 335 |
-
if isinstance(obj, dict):
|
| 336 |
-
# Check direct plot/report keys
|
| 337 |
-
for key in ['plot_path', 'plot_file', 'output_path', 'html_path', 'report_path',
|
| 338 |
-
'plots', 'plot_paths', 'performance_plots', 'feature_importance_plot']:
|
| 339 |
-
if key in obj and obj[key]:
|
| 340 |
-
if isinstance(obj[key], list):
|
| 341 |
-
for path in obj[key]:
|
| 342 |
-
if isinstance(path, str) and os.path.exists(path):
|
| 343 |
-
if path.endswith('.html'):
|
| 344 |
-
# Check if it's a report (in reports folder) or interactive plot
|
| 345 |
-
if '/reports/' in path or 'report' in Path(path).stem.lower():
|
| 346 |
-
reports_list.append(path)
|
| 347 |
-
else:
|
| 348 |
-
reports_list.append(path) # Interactive plots also go to reports
|
| 349 |
-
elif path.endswith(('.png', '.jpg', '.jpeg')):
|
| 350 |
-
plots_list.append(path)
|
| 351 |
-
elif isinstance(obj[key], str) and os.path.exists(obj[key]):
|
| 352 |
-
if obj[key].endswith('.html'):
|
| 353 |
-
if '/reports/' in obj[key] or 'report' in Path(obj[key]).stem.lower():
|
| 354 |
-
reports_list.append(obj[key])
|
| 355 |
-
else:
|
| 356 |
-
reports_list.append(obj[key])
|
| 357 |
-
elif obj[key].endswith(('.png', '.jpg', '.jpeg')):
|
| 358 |
-
plots_list.append(obj[key])
|
| 359 |
-
# Recursively search nested dicts
|
| 360 |
-
for value in obj.values():
|
| 361 |
-
find_plots_and_reports(value, plots_list, reports_list)
|
| 362 |
-
|
| 363 |
-
find_plots_and_reports(result, plots_paths, html_reports)
|
| 364 |
-
|
| 365 |
-
# Remove duplicates while preserving order
|
| 366 |
-
plots_paths = list(dict.fromkeys(plots_paths))
|
| 367 |
-
html_reports = list(dict.fromkeys(html_reports))
|
| 368 |
-
|
| 369 |
-
# Display visualization and report information in response
|
| 370 |
-
if plots_paths or html_reports:
|
| 371 |
-
response += f"## 📊 Generated Outputs\n\n"
|
| 372 |
-
|
| 373 |
-
if plots_paths:
|
| 374 |
-
response += f"### 📈 Visualizations ({len(plots_paths)} plots)\n"
|
| 375 |
-
response += "✅ Plots are displayed in the **Visualization Gallery** below!\n\n"
|
| 376 |
-
|
| 377 |
-
# List plot files
|
| 378 |
-
for i, plot_path in enumerate(plots_paths[:10], 1):
|
| 379 |
-
try:
|
| 380 |
-
plot_name = Path(plot_path).stem.replace('_', ' ').title()
|
| 381 |
-
rel_path = os.path.relpath(plot_path, '.')
|
| 382 |
-
response += f"{i}. 📊 **{plot_name}**\n"
|
| 383 |
-
response += f" 📁 `{rel_path}`\n\n"
|
| 384 |
-
except Exception as e:
|
| 385 |
-
response += f"{i}. ❌ Error: {str(e)}\n"
|
| 386 |
-
|
| 387 |
-
if html_reports:
|
| 388 |
-
response += f"### 📋 Reports & Interactive Plots ({len(html_reports)} files)\n"
|
| 389 |
-
response += "✅ Reports are displayed in the **Reports Viewer** below!\n\n"
|
| 390 |
-
|
| 391 |
-
# List report files
|
| 392 |
-
for i, report_path in enumerate(html_reports[:10], 1):
|
| 393 |
-
try:
|
| 394 |
-
report_name = Path(report_path).stem.replace('_', ' ').title()
|
| 395 |
-
rel_path = os.path.relpath(report_path, '.')
|
| 396 |
-
file_size = os.path.getsize(report_path) / 1024 # KB
|
| 397 |
-
response += f"{i}. 📄 **{report_name}**\n"
|
| 398 |
-
response += f" 📁 `{rel_path}` ({file_size:.1f} KB)\n\n"
|
| 399 |
-
except Exception as e:
|
| 400 |
-
response += f"{i}. ❌ Error: {str(e)}\n"
|
| 401 |
-
else:
|
| 402 |
-
response += "ℹ️ No visualizations or reports were generated in this workflow.\n"
|
| 403 |
-
else:
|
| 404 |
-
response = f"⚠️ **AI Agent Status:** {agent_response.get('status', 'unknown')}\n\n"
|
| 405 |
-
response += f"{agent_response.get('message', agent_response.get('error', 'Unknown error'))}\n"
|
| 406 |
-
|
| 407 |
-
# Update the last assistant message with the response
|
| 408 |
-
history = update_last_assistant_message(history, response)
|
| 409 |
-
|
| 410 |
-
# Return plot paths for gallery and html_reports for HTML viewer
|
| 411 |
-
# Store html_reports in a format the HTML component can use
|
| 412 |
-
yield history, "", plots_paths if plots_paths else [], html_reports if html_reports else []
|
| 413 |
-
return
|
| 414 |
-
except Exception as e:
|
| 415 |
-
import sys
|
| 416 |
-
exc_type, exc_value, exc_traceback = sys.exc_info()
|
| 417 |
-
response = f"⚠️ **AI Agent Error:**\n\n"
|
| 418 |
-
response += f"**Error Type:** {exc_type.__name__}\n\n"
|
| 419 |
-
response += f"**Error Message:** {str(e)}\n\n"
|
| 420 |
-
response += f"**Full Traceback:**\n```python\n{traceback.format_exc()}\n```\n\n"
|
| 421 |
-
response += "💡 **Fallback Options:**\n"
|
| 422 |
-
response += "- Use the **Quick Train** feature on the right\n"
|
| 423 |
-
response += "- Try manual commands: `profile`, `quality`, `columns`\n"
|
| 424 |
-
# Update the last assistant message with error
|
| 425 |
-
history = update_last_assistant_message(history, response)
|
| 426 |
-
yield history, "", plots_paths if plots_paths else []
|
| 427 |
-
return
|
| 428 |
-
else:
|
| 429 |
-
# Manual mode - Handle commands directly
|
| 430 |
-
user_msg_lower = user_message.lower().strip()
|
| 431 |
-
|
| 432 |
-
# Handle simple commands manually
|
| 433 |
-
if 'profile' in user_msg_lower:
|
| 434 |
-
response = "📊 **Dataset Profile:**\n\n"
|
| 435 |
-
if current_profile:
|
| 436 |
-
response += f"**Shape:** {current_profile['shape']['rows']:,} rows × {current_profile['shape']['columns']} columns\n\n"
|
| 437 |
-
response += f"**Column Types:**\n"
|
| 438 |
-
response += f"- Numeric: {len(current_profile['column_types']['numeric'])} columns\n"
|
| 439 |
-
response += f"- Categorical: {len(current_profile['column_types']['categorical'])} columns\n"
|
| 440 |
-
response += f"- Datetime: {len(current_profile['column_types']['datetime'])} columns\n\n"
|
| 441 |
-
response += f"**Overall Stats:**\n"
|
| 442 |
-
response += f"- Total cells: {current_profile['overall_stats']['total_cells']:,}\n"
|
| 443 |
-
response += f"- Null values: {current_profile['overall_stats']['total_nulls']} ({current_profile['overall_stats']['null_percentage']:.1f}%)\n"
|
| 444 |
-
response += f"- Duplicates: {current_profile['overall_stats']['duplicate_rows']}\n"
|
| 445 |
-
else:
|
| 446 |
-
response += "Profile information is available at the top of the chat!"
|
| 447 |
-
|
| 448 |
-
elif 'quality' in user_msg_lower or 'issues' in user_msg_lower:
|
| 449 |
-
quality = detect_data_quality_issues(current_file)
|
| 450 |
-
response = "🔍 **Data Quality Report:**\n\n"
|
| 451 |
-
|
| 452 |
-
if quality['critical']:
|
| 453 |
-
response += f"🔴 **Critical Issues:** {len(quality['critical'])}\n"
|
| 454 |
-
for issue in quality['critical']:
|
| 455 |
-
response += f" • {issue['message']}\n"
|
| 456 |
-
response += "\n"
|
| 457 |
-
|
| 458 |
-
if quality['warning']:
|
| 459 |
-
response += f"🟡 **Warnings:** {len(quality['warning'])}\n"
|
| 460 |
-
for issue in quality['warning'][:5]: # Show first 5
|
| 461 |
-
response += f" • {issue['message']}\n"
|
| 462 |
-
if len(quality['warning']) > 5:
|
| 463 |
-
response += f" • ... and {len(quality['warning']) - 5} more\n"
|
| 464 |
-
response += "\n"
|
| 465 |
-
|
| 466 |
-
if quality['info']:
|
| 467 |
-
response += f"🔵 **Info:** {len(quality['info'])} observations\n"
|
| 468 |
-
|
| 469 |
-
if not quality['critical'] and not quality['warning'] and not quality['info']:
|
| 470 |
-
response += "✅ No issues detected! Your data looks good.\n"
|
| 471 |
-
|
| 472 |
-
elif 'columns' in user_msg_lower or 'column' in user_msg_lower:
|
| 473 |
-
if current_profile:
|
| 474 |
-
response = "📋 **Dataset Columns:**\n\n"
|
| 475 |
-
for col, info in current_profile['columns'].items():
|
| 476 |
-
nulls = info.get('null_count', 0)
|
| 477 |
-
null_pct = (nulls / current_profile['shape']['rows'] * 100) if current_profile['shape']['rows'] > 0 else 0
|
| 478 |
-
response += f"• **{col}** ({info['type']})\n"
|
| 479 |
-
response += f" - Nulls: {nulls} ({null_pct:.1f}%)\n"
|
| 480 |
-
if 'unique' in info:
|
| 481 |
-
response += f" - Unique: {info['unique']}\n"
|
| 482 |
-
else:
|
| 483 |
-
response = "📋 **Columns:** Please upload a file first to see column information."
|
| 484 |
-
|
| 485 |
-
elif 'help' in user_msg_lower:
|
| 486 |
-
response = "💡 **Available Commands:**\n\n"
|
| 487 |
-
response += "**Manual Commands:**\n"
|
| 488 |
-
response += "• `profile` - Show detailed dataset statistics\n"
|
| 489 |
-
response += "• `quality` - Check data quality issues\n"
|
| 490 |
-
response += "• `columns` - List all columns with details\n"
|
| 491 |
-
response += "• `help` - Show this help message\n\n"
|
| 492 |
-
response += "**Quick Actions:**\n"
|
| 493 |
-
response += "• Use the **Quick Train** panel on the right to train models\n"
|
| 494 |
-
response += "• Check **Dataset Info** in the sidebar for quick stats\n"
|
| 495 |
-
|
| 496 |
-
else:
|
| 497 |
-
# Default response for unrecognized commands
|
| 498 |
-
response = f"💬 **You said:** {user_message}\n\n"
|
| 499 |
-
response += "⚠️ AI agent is not available. I can respond to these commands:\n\n"
|
| 500 |
-
response += "• `profile` - Show detailed statistics\n"
|
| 501 |
-
response += "• `quality` - Check data quality\n"
|
| 502 |
-
response += "• `columns` - List all columns\n"
|
| 503 |
-
response += "• `help` - Show available commands\n\n"
|
| 504 |
-
response += "**Or use Quick Train** on the right to train models directly!\n"
|
| 505 |
-
|
| 506 |
-
# Add user message and assistant response
|
| 507 |
-
history = add_user_message(history, user_message)
|
| 508 |
-
history = add_assistant_message(history, response)
|
| 509 |
-
yield history, "", [], []
|
| 510 |
-
return
|
| 511 |
-
|
| 512 |
-
# If no file is uploaded yet
|
| 513 |
-
if user_message and user_message.strip() and not current_file:
|
| 514 |
-
response = "⚠️ **Please upload a dataset first!**\n\n"
|
| 515 |
-
response += "Click the 'Upload Dataset' button above and select a CSV or Parquet file."
|
| 516 |
-
# Add user message and assistant response
|
| 517 |
-
history = add_user_message(history, user_message)
|
| 518 |
-
history = add_assistant_message(history, response)
|
| 519 |
-
yield history, "", [], []
|
| 520 |
-
return
|
| 521 |
-
|
| 522 |
-
except Exception as e:
|
| 523 |
-
error_msg = f"❌ **Error:** {str(e)}\n\n"
|
| 524 |
-
error_msg += "**Traceback:**\n```\n" + traceback.format_exc() + "\n```"
|
| 525 |
-
if user_message:
|
| 526 |
-
# Check if we already added the user message
|
| 527 |
-
last_user = get_last_user_content(history)
|
| 528 |
-
if last_user != user_message:
|
| 529 |
-
history = add_user_message(history, user_message)
|
| 530 |
-
history = add_assistant_message(history, error_msg)
|
| 531 |
-
else:
|
| 532 |
-
history = add_assistant_message(history, error_msg)
|
| 533 |
-
yield history, "", [], []
|
| 534 |
-
return
|
| 535 |
-
|
| 536 |
-
# Default return if nothing matched
|
| 537 |
-
yield history, "", [], []
|
| 538 |
-
|
| 539 |
-
|
| 540 |
-
def quick_profile(file):
|
| 541 |
-
"""Quick profile display in the sidebar."""
|
| 542 |
-
if file is None:
|
| 543 |
-
return "No file uploaded yet."
|
| 544 |
-
|
| 545 |
-
try:
|
| 546 |
-
profile = profile_dataset(file.name)
|
| 547 |
-
|
| 548 |
-
info = f"**{Path(file.name).name}**\n\n"
|
| 549 |
-
info += f"📊 {profile['shape']['rows']:,} rows × {profile['shape']['columns']} cols\n\n"
|
| 550 |
-
info += f"**Columns:**\n"
|
| 551 |
-
for col, col_info in list(profile['columns'].items())[:10]:
|
| 552 |
-
info += f"- {col} ({col_info['type']})\n"
|
| 553 |
-
|
| 554 |
-
if len(profile['columns']) > 10:
|
| 555 |
-
info += f"- ... and {len(profile['columns']) - 10} more\n"
|
| 556 |
-
|
| 557 |
-
return info
|
| 558 |
-
except Exception as e:
|
| 559 |
-
return f"Error: {str(e)}"
|
| 560 |
-
|
| 561 |
-
|
| 562 |
-
def train_model_ui(file, target_col, model_type, test_size, progress=gr.Progress()):
|
| 563 |
-
"""Train a model directly from the UI."""
|
| 564 |
-
if file is None:
|
| 565 |
-
return "⚠️ Please upload a dataset first!"
|
| 566 |
-
|
| 567 |
-
if not target_col:
|
| 568 |
-
return "⚠️ Please specify a target column!"
|
| 569 |
-
|
| 570 |
-
# Clean up the target column name - remove surrounding quotes if present
|
| 571 |
-
target_col = target_col.strip().strip("'").strip('"')
|
| 572 |
-
|
| 573 |
-
try:
|
| 574 |
-
# Show progress
|
| 575 |
-
progress(0, desc="🔄 Loading dataset...")
|
| 576 |
-
yield "⏳ **Training in progress...**\n\n📊 Loading dataset..."
|
| 577 |
-
|
| 578 |
-
import time
|
| 579 |
-
time.sleep(0.5) # Brief pause for UI feedback
|
| 580 |
-
|
| 581 |
-
progress(0.2, desc="🔄 Preparing data...")
|
| 582 |
-
yield "⏳ **Training in progress...**\n\n📊 Dataset loaded\n🔄 Preparing data..."
|
| 583 |
-
|
| 584 |
-
time.sleep(0.3)
|
| 585 |
-
# Determine problem type
|
| 586 |
-
problem_type = "classification" if model_type == "Classification" else "regression"
|
| 587 |
-
|
| 588 |
-
progress(0.4, desc="🤖 Training models...")
|
| 589 |
-
yield "⏳ **Training in progress...**\n\n📊 Dataset loaded\n✅ Data prepared\n🤖 Training multiple models..."
|
| 590 |
-
|
| 591 |
-
# Train baseline models
|
| 592 |
-
result = train_baseline_models(
|
| 593 |
-
file.name,
|
| 594 |
-
target_col=target_col,
|
| 595 |
-
task_type=problem_type,
|
| 596 |
-
test_size=test_size
|
| 597 |
-
)
|
| 598 |
-
|
| 599 |
-
progress(0.9, desc="📊 Evaluating results...")
|
| 600 |
-
|
| 601 |
-
# Check if training was successful
|
| 602 |
-
if result.get('status') == 'error':
|
| 603 |
-
yield f"❌ **Training Failed**\n\n{result.get('message', 'Unknown error')}"
|
| 604 |
-
return
|
| 605 |
-
|
| 606 |
-
if 'best_model' not in result:
|
| 607 |
-
yield f"❌ **Training Failed**\n\nNo models were successfully trained. Result: {result}"
|
| 608 |
-
return
|
| 609 |
-
|
| 610 |
-
# Get the best model
|
| 611 |
-
best_model_name = result['best_model']['name']
|
| 612 |
-
if not best_model_name:
|
| 613 |
-
yield f"❌ **Training Failed**\n\nNo model could be selected as best model."
|
| 614 |
-
return
|
| 615 |
-
|
| 616 |
-
best_model_info = result['models'][best_model_name]
|
| 617 |
-
best_metrics = best_model_info.get('test_metrics', {})
|
| 618 |
-
|
| 619 |
-
output = f"✅ **Model Training Complete!**\n\n"
|
| 620 |
-
output += f"## 🏆 Best Model: **{best_model_name}**\n\n"
|
| 621 |
-
|
| 622 |
-
output += f"**Dataset Info:**\n"
|
| 623 |
-
output += f"- Features: {result.get('n_features', 0)}\n"
|
| 624 |
-
output += f"- Training samples: {result.get('train_size', 0):,}\n"
|
| 625 |
-
output += f"- Test samples: {result.get('test_size', 0):,}\n\n"
|
| 626 |
-
|
| 627 |
-
if problem_type == "classification":
|
| 628 |
-
output += f"**Test Metrics:**\n"
|
| 629 |
-
output += f"- ✅ Accuracy: {best_metrics.get('accuracy', 0):.4f}\n"
|
| 630 |
-
output += f"- 🎯 Precision: {best_metrics.get('precision', 0):.4f}\n"
|
| 631 |
-
output += f"- 📊 Recall: {best_metrics.get('recall', 0):.4f}\n"
|
| 632 |
-
output += f"- 🔥 F1 Score: {best_metrics.get('f1', 0):.4f}\n\n"
|
| 633 |
-
else:
|
| 634 |
-
output += f"**Test Metrics:**\n"
|
| 635 |
-
output += f"- 📈 R² Score: {best_metrics.get('r2', 0):.4f}\n"
|
| 636 |
-
output += f"- 📉 RMSE: {best_metrics.get('rmse', 0):.2f}\n"
|
| 637 |
-
output += f"- 📊 MAE: {best_metrics.get('mae', 0):.2f}\n"
|
| 638 |
-
output += f"- 💯 MAPE: {best_metrics.get('mape', 0):.2f}%\n\n"
|
| 639 |
-
|
| 640 |
-
output += f"## 📊 All Models Comparison:\n\n"
|
| 641 |
-
for model_name, model_info in result['models'].items():
|
| 642 |
-
if 'test_metrics' in model_info:
|
| 643 |
-
test_metrics = model_info['test_metrics']
|
| 644 |
-
indicator = "🏆 " if model_name == best_model_name else " "
|
| 645 |
-
if problem_type == "classification":
|
| 646 |
-
f1 = test_metrics.get('f1', 0)
|
| 647 |
-
acc = test_metrics.get('accuracy', 0)
|
| 648 |
-
output += f"{indicator}**{model_name}:**\n"
|
| 649 |
-
output += f" - F1: {f1:.4f} | Accuracy: {acc:.4f}\n"
|
| 650 |
-
else:
|
| 651 |
-
r2 = test_metrics.get('r2', 0)
|
| 652 |
-
rmse = test_metrics.get('rmse', 0)
|
| 653 |
-
output += f"{indicator}**{model_name}:**\n"
|
| 654 |
-
output += f" - R²: {r2:.4f} | RMSE: {rmse:.2f}\n"
|
| 655 |
-
elif 'status' in model_info and model_info['status'] == 'error':
|
| 656 |
-
output += f" ❌ **{model_name}:** {model_info.get('message', 'Error')}\n"
|
| 657 |
-
|
| 658 |
-
# Display generated plots if available
|
| 659 |
-
plots_to_show = []
|
| 660 |
-
|
| 661 |
-
# Check for performance plots
|
| 662 |
-
if 'performance_plots' in result and result['performance_plots']:
|
| 663 |
-
if isinstance(result['performance_plots'], list):
|
| 664 |
-
plots_to_show.extend(result['performance_plots'])
|
| 665 |
-
else:
|
| 666 |
-
plots_to_show.append(result['performance_plots'])
|
| 667 |
-
|
| 668 |
-
# Check for feature importance plot
|
| 669 |
-
if 'feature_importance_plot' in result and result['feature_importance_plot']:
|
| 670 |
-
plots_to_show.append(result['feature_importance_plot'])
|
| 671 |
-
|
| 672 |
-
# Embed plots
|
| 673 |
-
if plots_to_show:
|
| 674 |
-
output += f"\n\n📊 **Visualizations:**\n\n"
|
| 675 |
-
for plot_path in plots_to_show:
|
| 676 |
-
if isinstance(plot_path, str) and plot_path.endswith('.html') and os.path.exists(plot_path):
|
| 677 |
-
try:
|
| 678 |
-
with open(plot_path, 'r', encoding='utf-8') as f:
|
| 679 |
-
plot_html = f.read()
|
| 680 |
-
# Add plot title based on filename
|
| 681 |
-
plot_name = Path(plot_path).stem.replace('_', ' ').title()
|
| 682 |
-
output += f"**{plot_name}:**\n"
|
| 683 |
-
output += f'<iframe srcdoc="{plot_html.replace(chr(34), """)}" width="100%" height="500" frameborder="0"></iframe>\n\n'
|
| 684 |
-
except Exception as e:
|
| 685 |
-
# Fallback to file path
|
| 686 |
-
output += f"📁 {Path(plot_path).name}: `{plot_path}`\n"
|
| 687 |
-
|
| 688 |
-
progress(1.0, desc="✅ Complete!")
|
| 689 |
-
yield output
|
| 690 |
-
|
| 691 |
-
except Exception as e:
|
| 692 |
-
yield f"❌ **Error:** {str(e)}\n\n```\n{traceback.format_exc()}\n```"
|
| 693 |
-
|
| 694 |
-
|
| 695 |
-
def clear_conversation():
|
| 696 |
-
"""Clear the conversation and reset state."""
|
| 697 |
-
global current_file, current_profile
|
| 698 |
-
current_file = None
|
| 699 |
-
current_profile = None
|
| 700 |
-
return [], None, "", [], ""
|
| 701 |
-
|
| 702 |
-
|
| 703 |
-
def format_html_reports(html_paths):
|
| 704 |
-
"""Format HTML reports/plots for display in HTML component."""
|
| 705 |
-
if not html_paths or len(html_paths) == 0:
|
| 706 |
-
return "<div style='text-align:center; padding:40px; color:#666;'>No reports generated yet. Try: 'Generate a quality report' or 'Create interactive visualizations'</div>"
|
| 707 |
-
|
| 708 |
-
html_output = """
|
| 709 |
-
<style>
|
| 710 |
-
.report-container {
|
| 711 |
-
padding: 20px;
|
| 712 |
-
background: #f8f9fa;
|
| 713 |
-
}
|
| 714 |
-
.report-card {
|
| 715 |
-
margin-bottom: 30px;
|
| 716 |
-
border: 2px solid #dee2e6;
|
| 717 |
-
border-radius: 12px;
|
| 718 |
-
overflow: hidden;
|
| 719 |
-
background: white;
|
| 720 |
-
box-shadow: 0 4px 6px rgba(0,0,0,0.1);
|
| 721 |
-
}
|
| 722 |
-
.report-header {
|
| 723 |
-
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
|
| 724 |
-
color: white;
|
| 725 |
-
padding: 15px 20px;
|
| 726 |
-
font-weight: bold;
|
| 727 |
-
font-size: 18px;
|
| 728 |
-
display: flex;
|
| 729 |
-
justify-content: space-between;
|
| 730 |
-
align-items: center;
|
| 731 |
-
}
|
| 732 |
-
.report-meta {
|
| 733 |
-
font-size: 12px;
|
| 734 |
-
opacity: 0.9;
|
| 735 |
-
}
|
| 736 |
-
.report-iframe {
|
| 737 |
-
width: 100%;
|
| 738 |
-
min-height: 600px;
|
| 739 |
-
border: none;
|
| 740 |
-
background: white;
|
| 741 |
-
}
|
| 742 |
-
.report-footer {
|
| 743 |
-
background: #f8f9fa;
|
| 744 |
-
padding: 10px 20px;
|
| 745 |
-
font-size: 12px;
|
| 746 |
-
color: #666;
|
| 747 |
-
border-top: 1px solid #dee2e6;
|
| 748 |
-
}
|
| 749 |
-
</style>
|
| 750 |
-
<div class="report-container">
|
| 751 |
-
"""
|
| 752 |
-
|
| 753 |
-
html_output += f"<h2 style='color: #667eea; margin-bottom: 20px;'>📋 {len(html_paths)} Report(s) Generated</h2>"
|
| 754 |
-
|
| 755 |
-
for i, html_path in enumerate(html_paths, 1):
|
| 756 |
-
try:
|
| 757 |
-
# Get file metadata
|
| 758 |
-
file_name = Path(html_path).name
|
| 759 |
-
file_size = os.path.getsize(html_path) / 1024 # KB
|
| 760 |
-
report_title = Path(html_path).stem.replace('_', ' ').title()
|
| 761 |
-
|
| 762 |
-
# Read the HTML content
|
| 763 |
-
with open(html_path, 'r', encoding='utf-8') as f:
|
| 764 |
-
html_content = f.read()
|
| 765 |
-
|
| 766 |
-
# Escape the content for embedding
|
| 767 |
-
escaped_content = html_content.replace('\\', '\\\\').replace('"', '"').replace("'", "\\'")
|
| 768 |
-
|
| 769 |
-
html_output += f"""
|
| 770 |
-
<div class="report-card">
|
| 771 |
-
<div class="report-header">
|
| 772 |
-
<span>📊 {i}. {report_title}</span>
|
| 773 |
-
<span class="report-meta">{file_size:.1f} KB</span>
|
| 774 |
-
</div>
|
| 775 |
-
<iframe class="report-iframe" srcdoc="{escaped_content}"></iframe>
|
| 776 |
-
<div class="report-footer">
|
| 777 |
-
📁 {html_path}
|
| 778 |
-
</div>
|
| 779 |
-
</div>
|
| 780 |
-
"""
|
| 781 |
-
except Exception as e:
|
| 782 |
-
html_output += f"""
|
| 783 |
-
<div class="report-card">
|
| 784 |
-
<div class="report-header" style="background: linear-gradient(135deg, #f44336 0%, #e91e63 100%);">
|
| 785 |
-
<span>❌ Error loading: {Path(html_path).name}</span>
|
| 786 |
-
</div>
|
| 787 |
-
<div style="padding: 20px;">
|
| 788 |
-
<p><strong>Error:</strong> {str(e)}</p>
|
| 789 |
-
<p><strong>Path:</strong> {html_path}</p>
|
| 790 |
-
</div>
|
| 791 |
-
</div>
|
| 792 |
-
"""
|
| 793 |
-
|
| 794 |
-
html_output += "</div>"
|
| 795 |
-
|
| 796 |
-
return html_output
|
| 797 |
-
|
| 798 |
-
|
| 799 |
-
```python
def extract_and_display_plots(agent_response):
    """Extract plots from agent response and format them for display."""
    plots_html = ""

    if not agent_response or agent_response.get('status') != 'success':
        return gr.update(value="<p style='text-align:center; color:#666;'>No visualizations generated yet. Upload a dataset and run analysis!</p>")

    workflow_history = agent_response.get('workflow_history', [])
    if not workflow_history:
        return gr.update(value="<p style='text-align:center; color:#666;'>No visualizations in this workflow.</p>")

    # Find all plots
    plots_paths = []

    def find_plots(obj, plots_list):
        if isinstance(obj, dict):
            # Check direct plot keys
            for key in ['plot_path', 'plot_file', 'html_path', 'output_path',
                        'plots', 'plot_paths', 'performance_plots', 'feature_importance_plot']:
                if key in obj and obj[key]:
                    if isinstance(obj[key], list):
                        for plot_path in obj[key]:
                            if isinstance(plot_path, str) and plot_path.endswith('.html') and os.path.exists(plot_path):
                                plots_list.append(plot_path)
                    elif isinstance(obj[key], str) and obj[key].endswith('.html') and os.path.exists(obj[key]):
                        plots_list.append(obj[key])
            # Recursively search nested dicts
            for value in obj.values():
                find_plots(value, plots_list)

    for step in workflow_history:
        result = step.get('result', {})
        find_plots(result, plots_paths)

    # Remove duplicates while preserving order
    plots_paths = list(dict.fromkeys(plots_paths))

    if not plots_paths:
        return gr.update(value="<p style='text-align:center; color:#666;'>No plots were generated in this analysis.</p>")

    # Build HTML gallery
    plots_html = f"""
    <div style='padding: 20px;'>
        <h2 style='color: #1f77b4; margin-bottom: 20px;'>📊 Visualization Gallery ({len(plots_paths)} plots)</h2>
    """

    for i, plot_path in enumerate(plots_paths, 1):
        try:
            with open(plot_path, 'r', encoding='utf-8') as f:
                plot_content = f.read()

            plot_name = Path(plot_path).stem.replace('_', ' ').title()

            plots_html += f"""
            <div style='margin-bottom: 30px; border: 1px solid #ddd; border-radius: 8px; overflow: hidden;'>
                <div style='background: linear-gradient(90deg, #1f77b4, #2ca02c); color: white; padding: 10px 15px; font-weight: bold;'>
                    {i}. {plot_name}
                </div>
                <div style='padding: 10px; background: white;'>
                    <iframe srcdoc='{plot_content.replace("'", "&#39;").replace('"', "&quot;")}'
                            width='100%' height='500' frameborder='0'
                            style='border: none; border-radius: 5px;'></iframe>
                </div>
                <div style='background: #f8f9fa; padding: 8px 15px; font-size: 12px; color: #666;'>
                    📁 {plot_path}
                </div>
            </div>
            """
        except Exception as e:
            plots_html += f"""
            <div style='margin-bottom: 20px; padding: 15px; border: 1px solid #f44336; border-radius: 5px; background: #ffebee;'>
                <strong>❌ Failed to load: {Path(plot_path).name}</strong><br>
                <small>{str(e)}</small>
            </div>
            """

    plots_html += "</div>"

    return gr.update(value=plots_html)
```
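`find_plots` does a depth-first walk over arbitrarily nested tool results: it harvests any of the known plot keys at the current level, then recurses into every dict value, and the later `dict.fromkeys` pass deduplicates while keeping first-seen order. A minimal sketch of the same traversal on a toy payload (the `os.path.exists` guard is dropped so it runs without real files, and the nested structure is invented for illustration):

```python
def find_plots(obj, plots_list):
    # Same traversal as above, minus the filesystem check.
    if isinstance(obj, dict):
        for key in ['plot_path', 'plots']:
            if key in obj and obj[key]:
                if isinstance(obj[key], list):
                    plots_list.extend(p for p in obj[key]
                                      if isinstance(p, str) and p.endswith('.html'))
                elif isinstance(obj[key], str) and obj[key].endswith('.html'):
                    plots_list.append(obj[key])
        # Recurse into nested dicts; note that dicts inside lists are skipped,
        # matching the original's behavior.
        for value in obj.values():
            find_plots(value, plots_list)

result = {
    "profiling": {"plot_path": "dist.html"},
    "training": {"details": {"plots": ["roc.html", "roc.html", "importance.html"]}},
}
found = []
find_plots(result, found)
print(list(dict.fromkeys(found)))  # ['dist.html', 'roc.html', 'importance.html']
```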
```python
# Custom CSS for better visual feedback
custom_css = """
.status-box {
    padding: 10px;
    border-radius: 5px;
    background: linear-gradient(90deg, #e8f5e9 0%, #c8e6c9 100%);
    margin-bottom: 10px;
    text-align: center;
    font-weight: bold;
}
"""

# Create the Gradio interface
with gr.Blocks(title="AI Agent Data Scientist", theme=gr.themes.Soft(), css=custom_css) as demo:
    gr.Markdown("""
    # 🤖 AI Agent Data Scientist

    Upload your dataset and chat with the AI agent to perform data science tasks!

    **Features:**
    - 📊 Automatic dataset profiling
    - 🤖 Natural language queries
    - 🎯 Model training (classification & regression)
    - 🔍 Data quality analysis
    - 📈 Feature engineering
    - 🎨 **NEW:** Automatic visualization generation!
    - And 59 tools total!
    """)

    # Store agent response for visualization extraction
    agent_response_state = gr.State(None)

    with gr.Row():
        # Left column - Main chat interface
        with gr.Column(scale=2):
            # Status indicator
            status_box = gr.Markdown("🟢 **Ready** - Upload a dataset to begin", elem_classes=["status-box"])

            chatbot = gr.Chatbot(
                label="Chat with AI Agent",
                height=450,
                show_label=True,
                avatar_images=(None, "🤖"),
                sanitize_html=False  # Allow HTML content including iframes
            )

            with gr.Row():
                file_upload = gr.File(
                    label="📁 Upload Dataset(s) (CSV/Parquet) - Single or Multiple Files",
                    file_types=[".csv", ".parquet"],
                    file_count="multiple",  # Allow multiple file uploads
                    type="filepath"
                )

            with gr.Row():
                user_input = gr.Textbox(
                    label="Your Message",
                    placeholder="Ask anything: 'train a model', 'analyze my data', 'generate visualizations'",
                    lines=2,
                    scale=4
                )
                submit_btn = gr.Button("📤 Send", variant="primary", scale=1)

            with gr.Row():
                clear_btn = gr.Button("🗑️ Clear", variant="secondary")

        # Right column - Quick actions and info
        with gr.Column(scale=1):
            gr.Markdown("## 📊 Dataset Info")
            dataset_info = gr.Markdown("Upload a dataset to see information here.")

            gr.Markdown("## 🎯 Quick Train")
            with gr.Group():
                target_column = gr.Textbox(
                    label="Target Column",
                    placeholder="e.g., 'price', 'class', 'label'"
                )
                model_type_choice = gr.Radio(
                    ["Classification", "Regression"],
                    label="Model Type",
                    value="Classification"
                )
                test_size_slider = gr.Slider(
                    0.1, 0.5, 0.3,
                    label="Test Size",
                    step=0.05
                )
                train_btn = gr.Button("🚀 Train Model", variant="primary")

            training_output = gr.Markdown("Training results will appear here.")

            gr.Markdown("""
            ## 💡 Example Queries

            - "Train a classification model to predict [target]"
            - "Show me statistics for [column]"
            - "Detect outliers in the dataset"
            - "What are the most important features?"
            - "Generate a quality report"
            - "Create polynomial features"
            - "Balance the dataset using SMOTE"
            """)

    # Visualization Gallery Section (Full Width)
    with gr.Row():
        with gr.Column():
            gr.Markdown("## 🎨 Visualization Gallery")
            visualization_gallery = gr.Gallery(
                label="Generated Plots (PNG/JPG)",
                show_label=True,
                elem_id="gallery",
                columns=2,
                height=400
            )

    # Reports Viewer Section (Full Width)
    with gr.Row():
        with gr.Column():
            gr.Markdown("## 📋 Reports & Interactive Visualizations")
            gr.Markdown("*HTML reports and interactive Plotly charts will be displayed here*")
            reports_viewer = gr.HTML(
                value="<div style='text-align:center; padding:40px; color:#666;'>No reports generated yet. Try: 'Generate a quality report' or 'Create interactive visualizations'</div>",
                elem_id="reports_viewer"
            )

    # Create state to hold HTML report paths
    html_reports_state = gr.State([])

    # Event handlers with streaming support
    submit_result = submit_btn.click(
        fn=analyze_dataset,
        inputs=[file_upload, user_input, chatbot],
        outputs=[chatbot, user_input, visualization_gallery, html_reports_state],
        show_progress="full"  # Show progress bar
    )
    submit_result.then(
        fn=format_html_reports,
        inputs=[html_reports_state],
        outputs=[reports_viewer]
    )

    user_input_result = user_input.submit(
        fn=analyze_dataset,
        inputs=[file_upload, user_input, chatbot],
        outputs=[chatbot, user_input, visualization_gallery, html_reports_state],
        show_progress="full"
    )
    user_input_result.then(
        fn=format_html_reports,
        inputs=[html_reports_state],
        outputs=[reports_viewer]
    )

    file_result = file_upload.change(
        fn=analyze_dataset,
        inputs=[file_upload, gr.Textbox(value="", visible=False), chatbot],
        outputs=[chatbot, user_input, visualization_gallery, html_reports_state],
        show_progress="full"
    )
    file_result.then(
        fn=quick_profile,
        inputs=[file_upload],
        outputs=[dataset_info]
    )
    file_result.then(
        fn=format_html_reports,
        inputs=[html_reports_state],
        outputs=[reports_viewer]
    )

    train_btn.click(
        fn=train_model_ui,
        inputs=[file_upload, target_column, model_type_choice, test_size_slider],
        outputs=[training_output],
        show_progress="full"  # Show progress bar
    )

    clear_btn.click(
        clear_conversation,
        outputs=[chatbot, file_upload, user_input, visualization_gallery, reports_viewer]
    )

if __name__ == "__main__":
    print("=" * 70)
    print("🚀 Starting AI Agent Data Scientist Chat UI...")
    print("=" * 70)
    print("\n🌐 The UI will open in your browser automatically.")
    print("💡 If it doesn't, copy the URL shown below.\n")

    demo.launch(
        share=False,            # Set to True to create a public link
        server_name="0.0.0.0",  # Listen on all interfaces
        server_port=7865,       # Changed port to avoid conflict
        show_error=True,
        inbrowser=True          # Auto-open browser
    )
```
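The event wiring in this file relies on Gradio's handler chaining: `.click()`, `.submit()`, and `.change()` return an event object whose `.then()` schedules a follow-up callback once the first handler completes, which is how the reports viewer refreshes only after `analyze_dataset` has populated `html_reports_state`. A minimal, self-contained sketch of the same pattern (component and function names here are illustrative, not taken from chat_ui.py):

```python
import gradio as gr

def slow_step(text):
    # First handler: produce a result and stash a value in state.
    return f"processed: {text}", [text.upper()]

def render_state(state):
    # Second handler: runs only after slow_step has updated the state.
    return ", ".join(state)

with gr.Blocks() as demo:
    box = gr.Textbox(label="Input")
    out = gr.Textbox(label="Result")
    summary = gr.Textbox(label="Summary")
    state = gr.State([])
    btn = gr.Button("Run")

    # .click() returns an event; .then() chains a dependent update,
    # mirroring the submit_result.then(...) wiring above.
    btn.click(fn=slow_step, inputs=[box], outputs=[out, state]) \
       .then(fn=render_state, inputs=[state], outputs=[summary])

demo.launch()
```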
cloudbuild.yaml
DELETED
@@ -1,69 +0,0 @@

```yaml
# Google Cloud Build configuration for automated deployments
# Triggered on git push to main branch

steps:
  # Step 1: Build the container image
  - name: 'gcr.io/cloud-builders/docker'
    args:
      - 'build'
      - '-t'
      - 'gcr.io/$PROJECT_ID/data-science-agent:$COMMIT_SHA'
      - '-t'
      - 'gcr.io/$PROJECT_ID/data-science-agent:latest'
      - '.'
    timeout: 600s

  # Step 2: Push the container image to Container Registry
  - name: 'gcr.io/cloud-builders/docker'
    args:
      - 'push'
      - 'gcr.io/$PROJECT_ID/data-science-agent:$COMMIT_SHA'

  - name: 'gcr.io/cloud-builders/docker'
    args:
      - 'push'
      - 'gcr.io/$PROJECT_ID/data-science-agent:latest'

  # Step 3: Deploy to Cloud Run
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: gcloud
    args:
      - 'run'
      - 'deploy'
      - 'data-science-agent'
      - '--image'
      - 'gcr.io/$PROJECT_ID/data-science-agent:$COMMIT_SHA'
      - '--region'
      - 'us-central1'
      - '--platform'
      - 'managed'
      - '--allow-unauthenticated'
      - '--memory'
      - '4Gi'
      - '--cpu'
      - '2'
      - '--timeout'
      - '900'
      - '--max-instances'
      - '10'
      - '--min-instances'
      - '0'
      - '--concurrency'
      - '10'
      - '--set-env-vars'
      - 'LLM_PROVIDER=groq,REASONING_EFFORT=medium,CACHE_TTL_SECONDS=86400'
      - '--set-secrets'
      - 'GROQ_API_KEY=GROQ_API_KEY:latest,GOOGLE_API_KEY=GOOGLE_API_KEY:latest,GOOGLE_APPLICATION_CREDENTIALS=GOOGLE_APPLICATION_CREDENTIALS:latest'

# Build timeout
timeout: 1200s

# Images to push to Container Registry
images:
  - 'gcr.io/$PROJECT_ID/data-science-agent:$COMMIT_SHA'
  - 'gcr.io/$PROJECT_ID/data-science-agent:latest'

# Build options
options:
  machineType: 'N1_HIGHCPU_8'
  logging: CLOUD_LOGGING_ONLY
```
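This pipeline is normally run by a Cloud Build trigger on pushes to main, where `$PROJECT_ID` and `$COMMIT_SHA` are injected automatically. A hedged sketch of submitting the same config by hand (assumes the gcloud CLI is installed and authenticated; manual builds do not populate `COMMIT_SHA`, so this passes an explicit value via `--substitutions`):

```python
import subprocess

# Resolve the current commit so the image tag matches what a
# trigger-driven build would have produced.
sha = subprocess.run(
    ["git", "rev-parse", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.strip()

# Submit the build described by cloudbuild.yaml from the repo root.
subprocess.run(
    [
        "gcloud", "builds", "submit",
        "--config", "cloudbuild.yaml",
        "--substitutions", f"COMMIT_SHA={sha}",
        ".",
    ],
    check=True,
)
```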
setup-deployment.sh
DELETED
@@ -1,78 +0,0 @@

```bash
#!/bin/bash
# Quick setup script for macOS deployment prerequisites

set -e

RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

echo -e "${BLUE}🔧 Data Science Agent - Deployment Setup${NC}"
echo "=========================================="
echo ""

# Check if Homebrew is installed
if ! command -v brew &> /dev/null; then
    echo -e "${RED}❌ Homebrew not found${NC}"
    echo "Installing Homebrew..."
    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
else
    echo -e "${GREEN}✅ Homebrew installed${NC}"
fi

# Install Docker Desktop
if ! command -v docker &> /dev/null; then
    echo -e "${YELLOW}📦 Installing Docker Desktop...${NC}"
    brew install --cask docker
    echo -e "${GREEN}✅ Docker Desktop installed${NC}"
    echo -e "${YELLOW}⚠️  Please start Docker Desktop application, then run this script again${NC}"
    exit 0
else
    echo -e "${GREEN}✅ Docker installed${NC}"
fi

# Check if Docker daemon is running
if ! docker info &> /dev/null; then
    echo -e "${YELLOW}⚠️  Docker is installed but not running${NC}"
    echo "Please start Docker Desktop application, then run this script again"
    exit 0
fi

# Install Google Cloud SDK
if ! command -v gcloud &> /dev/null; then
    echo -e "${YELLOW}☁️  Installing Google Cloud SDK...${NC}"
    brew install --cask google-cloud-sdk
    echo -e "${GREEN}✅ Google Cloud SDK installed${NC}"

    echo ""
    echo -e "${YELLOW}📝 Next steps:${NC}"
    echo "1. Restart your terminal to load gcloud"
    echo "2. Run: gcloud auth login"
    echo "3. Run: gcloud auth application-default login"
    echo "4. Run: gcloud config set project YOUR_PROJECT_ID"
    echo "5. Run: ./deploy.sh"
else
    echo -e "${GREEN}✅ Google Cloud SDK installed${NC}"
fi

echo ""
echo -e "${BLUE}========================================${NC}"
echo -e "${GREEN}✅ Setup complete!${NC}"
echo ""
echo "Next steps:"
echo "1. Authenticate with Google Cloud:"
echo "   ${YELLOW}gcloud auth login${NC}"
echo "   ${YELLOW}gcloud auth application-default login${NC}"
echo ""
echo "2. Set your GCP project:"
echo "   ${YELLOW}gcloud config set project YOUR_PROJECT_ID${NC}"
echo ""
echo "3. Set your API keys:"
echo "   ${YELLOW}export GROQ_API_KEY='your-groq-key'${NC}"
echo "   ${YELLOW}export GOOGLE_API_KEY='your-google-key'${NC}"
echo ""
echo "4. Deploy to Cloud Run:"
echo "   ${YELLOW}./deploy.sh${NC}"
echo ""
```
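The same prerequisite checks are easy to run cross-platform before attempting a deploy. A minimal Python sketch of the Docker/gcloud detection logic above (it only checks; unlike the shell script, it does not install anything):

```python
import shutil
import subprocess

def check_prerequisites() -> bool:
    """Mirror setup-deployment.sh's checks: binaries on PATH, Docker daemon up."""
    for tool in ("docker", "gcloud"):
        if shutil.which(tool) is None:
            print(f"❌ {tool} not found - install it first")
            return False
        print(f"✅ {tool} installed")

    # `docker info` exits non-zero when the daemon is not running.
    daemon = subprocess.run(["docker", "info"], capture_output=True)
    if daemon.returncode != 0:
        print("⚠️ Docker is installed but the daemon is not running")
        return False
    return True

if __name__ == "__main__":
    check_prerequisites()
```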
test_environment.py
DELETED
@@ -1,48 +0,0 @@

```python
#!/usr/bin/env python3
"""
Quick test script to verify all core imports work
"""

print("Testing core imports...")

try:
    print("  ✓ Python standard library")
    import sys
    import os

    print("  ✓ Data processing")
    import polars as pl
    import pandas as pd
    import numpy as np

    print("  ✓ Machine learning")
    import sklearn
    import xgboost
    import lightgbm

    print("  ✓ Visualization")
    import matplotlib
    import seaborn
    import plotly

    print("  ✓ LLM clients")
    import groq

    print("  ✓ Web framework")
    import gradio
    import fastapi

    print("\n✅ All core dependencies installed successfully!")
    print(f"\nPython version: {sys.version}")
    print(f"Gradio version: {gradio.__version__}")
    print(f"Polars version: {pl.__version__}")
    print(f"Pandas version: {pd.__version__}")
    print(f"NumPy version: {np.__version__}")
    print(f"Scikit-learn version: {sklearn.__version__}")

except ImportError as e:
    print(f"\n❌ Import failed: {e}")
    sys.exit(1)

print("\n🎉 Environment setup complete! You can now run:")
print("  .venv/bin/python chat_ui.py")
```