Pulastya B committed
Commit fc23b4d · 1 Parent(s): 7bccc03

docs: Clean up repository and update README


- Remove redundant documentation files (CHECKLIST, FRONTEND_INTEGRATION, MIGRATION_COMPLETE, etc.)
- Remove old/unused files (chat_ui.py, test_environment.py, cloudbuild.yaml)
- Delete duplicate README in FRRONTEEEND folder
- Create comprehensive single README with clear sections
- Add badges, architecture diagram, and detailed usage instructions
- Include example workflow and tech stack details
- Add contact information and acknowledgments

BIGQUERY_SCHEMAS.md DELETED
@@ -1,691 +0,0 @@
1
- # BigQuery Output Schemas for Looker Compatibility
2
-
3
- **Purpose**: Define stable BigQuery table schemas that BI tools (Looker, Data Studio) can query reliably.
4
-
5
- **Design Principles**:
6
- - ✅ **Stable Schema**: No breaking changes without versioning
7
- - ✅ **Consistent Naming**: snake_case columns, clear dimension/metric separation
8
- - ✅ **BI-Friendly Types**: Standard SQL types, no complex nested structures
9
- - ✅ **Documented Grain**: Clear primary keys and update patterns
10
- - ✅ **Dashboard-Ready**: Metrics aligned with common visualizations
11
-
12
- ---
13
-
14
- ## 📊 Table 1: `model_metrics`
15
-
16
- **Description**: Model performance metrics tracked over time for monitoring and comparison.
17
-
18
- **Use Cases**:
19
- - Performance dashboards
20
- - Model comparison reports
21
- - Drift detection alerts
22
- - A/B test analysis
23
-
24
- **Update Frequency**: On every model training run
25
-
26
- **Grain**: One row per model training execution
27
-
28
- ### Schema
29
-
30
- | Column Name | Type | Description | Dimension/Metric | Example |
31
- |------------|------|-------------|------------------|---------|
32
- | `project_id` | STRING | Google Cloud project ID | Dimension | `my-ml-project` |
33
- | `dataset_id` | STRING | BigQuery dataset name | Dimension | `ml_models` |
34
- | `model_id` | STRING | Unique model identifier | Dimension (Primary Key) | `xgboost_churn_20251223_153045` |
35
- | `model_name` | STRING | Human-readable model name | Dimension | `Customer Churn Predictor` |
36
- | `model_type` | STRING | Algorithm used | Dimension | `XGBoost`, `RandomForest`, `LightGBM` |
37
- | `task_type` | STRING | ML task category | Dimension | `classification`, `regression` |
38
- | `training_dataset` | STRING | Source table/file reference | Dimension | `project.dataset.train_data` |
39
- | `target_column` | STRING | Prediction target name | Dimension | `churn`, `price`, `survived` |
40
- | `created_at` | TIMESTAMP | Model training timestamp | Dimension (Time) | `2025-12-23 15:30:45 UTC` |
41
- | `created_date` | DATE | Training date (for partitioning) | Dimension (Time) | `2025-12-23` |
42
- | `feature_count` | INTEGER | Number of features used | Metric | `42` |
43
- | `training_rows` | INTEGER | Training set size | Metric | `10000` |
44
- | `test_rows` | INTEGER | Test set size | Metric | `2500` |
45
- | `training_duration_seconds` | FLOAT | Time to train model | Metric | `123.45` |
46
- | `accuracy` | FLOAT | Overall accuracy (0-1) | Metric | `0.95` |
47
- | `precision` | FLOAT | Precision score (0-1) | Metric | `0.92` |
48
- | `recall` | FLOAT | Recall score (0-1) | Metric | `0.88` |
49
- | `f1_score` | FLOAT | F1 score (0-1) | Metric | `0.90` |
50
- | `roc_auc` | FLOAT | ROC AUC score (0-1) | Metric | `0.94` |
51
- | `pr_auc` | FLOAT | Precision-Recall AUC (0-1) | Metric | `0.91` |
52
- | `mae` | FLOAT | Mean Absolute Error (regression) | Metric | `1234.56` |
53
- | `mse` | FLOAT | Mean Squared Error (regression) | Metric | `567890.12` |
54
- | `rmse` | FLOAT | Root Mean Squared Error (regression) | Metric | `753.59` |
55
- | `r2_score` | FLOAT | R² coefficient (regression) | Metric | `0.85` |
56
- | `cross_val_mean` | FLOAT | Mean CV score | Metric | `0.93` |
57
- | `cross_val_std` | FLOAT | CV score std deviation | Metric | `0.02` |
58
- | `hyperparameters` | STRING (JSON) | Model hyperparameters | Metadata | `{"max_depth": 6, "n_estimators": 100}` |
59
- | `version` | STRING | Model version tag | Dimension | `v1.2.3` |
60
- | `environment` | STRING | Training environment | Dimension | `production`, `staging`, `development` |
61
- | `user_email` | STRING | User who trained model | Dimension | `data-scientist@company.com` |
62
-
63
- ### Partitioning & Clustering
64
-
65
- ```sql
66
- -- Recommended table setup
67
- CREATE TABLE `project.dataset.model_metrics`
68
- (
69
- -- columns as above
70
- )
71
- PARTITION BY created_date
72
- CLUSTER BY model_type, task_type, environment
73
- OPTIONS(
74
- description="Model performance metrics for BI dashboards",
75
- require_partition_filter=true
76
- );
77
- ```
78
-
79
- ### Primary Dimensions for Looker
80
-
81
- - **Time**: `created_at`, `created_date`
82
- - **Model**: `model_type`, `model_name`, `task_type`
83
- - **Performance Tier**: CASE expression on `accuracy`/`f1_score`
84
- - `Excellent` (>0.90)
85
- - `Good` (0.80-0.90)
86
- - `Fair` (0.70-0.80)
87
- - `Poor` (<0.70)
88
-
89
- ### Sample Looker View
90
-
91
- ```lookml
92
- view: model_metrics {
93
- sql_table_name: `project.dataset.model_metrics` ;;
94
-
95
- dimension: model_id {
96
- primary_key: yes
97
- type: string
98
- sql: ${TABLE}.model_id ;;
99
- }
100
-
101
- dimension_group: created {
102
- type: time
103
- timeframes: [date, week, month, quarter, year]
104
- sql: ${TABLE}.created_at ;;
105
- }
106
-
107
- dimension: model_type {
108
- type: string
109
- sql: ${TABLE}.model_type ;;
110
- }
111
-
112
- dimension: performance_tier {
113
- type: string
114
- sql: CASE
115
- WHEN ${TABLE}.accuracy >= 0.90 THEN 'Excellent'
116
- WHEN ${TABLE}.accuracy >= 0.80 THEN 'Good'
117
- WHEN ${TABLE}.accuracy >= 0.70 THEN 'Fair'
118
- ELSE 'Poor'
119
- END ;;
120
- }
121
-
122
- measure: count {
123
- type: count
124
- }
125
-
126
- measure: avg_accuracy {
127
- type: average
128
- sql: ${TABLE}.accuracy ;;
129
- value_format_name: percent_2
130
- }
131
-
132
- measure: avg_f1_score {
133
- type: average
134
- sql: ${TABLE}.f1_score ;;
135
- value_format_name: percent_2
136
- }
137
- }
138
- ```
139
-
140
- ---
141
-
142
- ## 🎯 Table 2: `feature_importance`
143
-
144
- **Description**: Feature importance scores for model interpretability.
145
-
146
- **Use Cases**:
147
- - Feature impact analysis
148
- - Feature selection dashboards
149
- - Model explainability reports
150
-
151
- **Update Frequency**: On every model training run
152
-
153
- **Grain**: One row per feature per model
154
-
155
- ### Schema
156
-
157
- | Column Name | Type | Description | Dimension/Metric | Example |
158
- |------------|------|-------------|------------------|---------|
159
- | `model_id` | STRING | Foreign key to model_metrics | Dimension (Foreign Key) | `xgboost_churn_20251223_153045` |
160
- | `feature_name` | STRING | Name of the feature | Dimension (Primary Key) | `age`, `total_purchases`, `days_since_last_login` |
161
- | `importance_score` | FLOAT | Importance value (0-1) | Metric | `0.35` |
162
- | `importance_rank` | INTEGER | Rank by importance (1=most important) | Metric | `1`, `2`, `3` |
163
- | `importance_type` | STRING | Calculation method | Dimension | `gain`, `weight`, `cover`, `shap` |
164
- | `feature_type` | STRING | Data type category | Dimension | `numeric`, `categorical`, `datetime`, `text` |
165
- | `is_engineered` | BOOLEAN | Created by feature engineering? | Dimension | `true`, `false` |
166
- | `created_at` | TIMESTAMP | When importance was calculated | Dimension (Time) | `2025-12-23 15:30:45 UTC` |
167
- | `created_date` | DATE | Calculation date | Dimension (Time) | `2025-12-23` |
168
-
169
- ### Partitioning & Clustering
170
-
171
- ```sql
172
- CREATE TABLE `project.dataset.feature_importance`
173
- (
174
- -- columns as above
175
- )
176
- PARTITION BY created_date
177
- CLUSTER BY model_id, importance_rank
178
- OPTIONS(
179
- description="Feature importance scores for model explainability",
180
- require_partition_filter=false -- Allow cross-model queries
181
- );
182
- ```
183
-
184
- ### Primary Dimensions for Looker
185
-
186
- - **Feature**: `feature_name`, `feature_type`, `is_engineered`
187
- - **Model**: `model_id` (join to model_metrics)
188
- - **Importance**: `importance_rank`, `importance_type`
189
-
190
- ### Sample Looker View
191
-
192
- ```lookml
193
- view: feature_importance {
194
- sql_table_name: `project.dataset.feature_importance` ;;
195
-
196
- dimension: compound_key {
197
- primary_key: yes
198
- hidden: yes
199
- sql: CONCAT(${TABLE}.model_id, '|', ${TABLE}.feature_name) ;;
200
- }
201
-
202
- dimension: feature_name {
203
- type: string
204
- sql: ${TABLE}.feature_name ;;
205
- }
206
-
207
- dimension: is_top_10 {
208
- type: yesno
209
- sql: ${TABLE}.importance_rank <= 10 ;;
210
- }
211
-
212
- measure: avg_importance {
213
- type: average
214
- sql: ${TABLE}.importance_score ;;
215
- value_format_name: percent_2
216
- }
217
-
218
- measure: count_features {
219
- type: count_distinct
220
- sql: ${TABLE}.feature_name ;;
221
- }
222
- }
223
- ```
224
-
225
- ---
226
-
227
- ## 🔮 Table 3: `predictions`
228
-
229
- **Description**: Model predictions with actuals for monitoring and evaluation.
230
-
231
- **Use Cases**:
232
- - Prediction monitoring
233
- - Accuracy tracking over time
234
- - Segment performance analysis
235
- - Business impact measurement
236
-
237
- **Update Frequency**: Real-time or batch (daily/hourly)
238
-
239
- **Grain**: One row per prediction
240
-
241
- ### Schema
242
-
243
- | Column Name | Type | Description | Dimension/Metric | Example |
244
- |------------|------|-------------|------------------|---------|
245
- | `prediction_id` | STRING | Unique prediction identifier | Dimension (Primary Key) | `pred_abc123xyz` |
246
- | `model_id` | STRING | Model used for prediction | Dimension (Foreign Key) | `xgboost_churn_20251223_153045` |
247
- | `entity_id` | STRING | Entity being predicted (customer_id, product_id, etc.) | Dimension | `customer_12345` |
248
- | `predicted_at` | TIMESTAMP | When prediction was made | Dimension (Time) | `2025-12-23 15:30:45 UTC` |
249
- | `predicted_date` | DATE | Prediction date (for partitioning) | Dimension (Time) | `2025-12-23` |
250
- | `prediction_value` | FLOAT | Predicted value | Metric | `0.85` (probability), `49.99` (price) |
251
- | `prediction_class` | STRING | Predicted class (classification) | Dimension | `churn`, `not_churn` |
252
- | `prediction_confidence` | FLOAT | Model confidence (0-1) | Metric | `0.92` |
253
- | `actual_value` | FLOAT | True value (when available) | Metric | `1.0` (churned), `52.50` (actual price) |
254
- | `actual_class` | STRING | True class (when available) | Dimension | `churn`, `not_churn` |
255
- | `actual_recorded_at` | TIMESTAMP | When actual became known | Dimension (Time) | `2025-12-30 10:00:00 UTC` |
256
- | `is_correct` | BOOLEAN | Prediction was correct? | Dimension | `true`, `false` |
257
- | `absolute_error` | FLOAT | \|predicted - actual\| | Metric | `2.51` |
258
- | `squared_error` | FLOAT | (predicted - actual)² | Metric | `6.30` |
259
- | `feature_values` | STRING (JSON) | Input features used | Metadata | `{"age": 35, "tenure": 24}` |
260
- | `segment` | STRING | Business segment | Dimension | `enterprise`, `smb`, `consumer` |
261
- | `region` | STRING | Geographic region | Dimension | `us-west`, `eu-central` |
262
- | `model_version` | STRING | Model version | Dimension | `v1.2.3` |
263
- | `prediction_latency_ms` | FLOAT | Inference time | Metric | `23.4` |
264
-
265
- ### Partitioning & Clustering
266
-
267
- ```sql
268
- CREATE TABLE `project.dataset.predictions`
269
- (
270
- -- columns as above
271
- )
272
- PARTITION BY predicted_date
273
- CLUSTER BY model_id, segment, is_correct
274
- OPTIONS(
275
- description="Model predictions with actuals for monitoring",
276
- require_partition_filter=true,
277
- partition_expiration_days=730 -- 2 years retention
278
- );
279
- ```
280
-
281
- ### Primary Dimensions for Looker
282
-
283
- - **Time**: `predicted_date`, days since prediction
284
- - **Model**: `model_id`, `model_version`
285
- - **Segment**: `segment`, `region`
286
- - **Accuracy**: `is_correct`, error buckets
287
-
288
- ### Sample Looker View
289
-
290
- ```lookml
291
- view: predictions {
292
- sql_table_name: `project.dataset.predictions` ;;
293
-
294
- dimension: prediction_id {
295
- primary_key: yes
296
- type: string
297
- sql: ${TABLE}.prediction_id ;;
298
- }
299
-
300
- dimension_group: predicted {
301
- type: time
302
- timeframes: [date, week, month]
303
- sql: ${TABLE}.predicted_at ;;
304
- }
305
-
306
- dimension: segment {
307
- type: string
308
- sql: ${TABLE}.segment ;;
309
- }
310
-
311
- dimension: error_bucket {
312
- type: string
313
- sql: CASE
314
- WHEN ${TABLE}.absolute_error IS NULL THEN 'No Actual Yet'
315
- WHEN ${TABLE}.absolute_error <= 0.1 THEN '0-10%'
316
- WHEN ${TABLE}.absolute_error <= 0.2 THEN '10-20%'
317
- ELSE '>20%'
318
- END ;;
319
- }
320
-
321
- measure: count {
322
- type: count
323
- }
324
-
325
- measure: accuracy_rate {
326
- type: average
327
- sql: CAST(${TABLE}.is_correct AS FLOAT64) ;;
328
- value_format_name: percent_1
329
- }
330
-
331
- measure: avg_confidence {
332
- type: average
333
- sql: ${TABLE}.prediction_confidence ;;
334
- value_format_name: percent_2
335
- }
336
-
337
- measure: mae {
338
- type: average
339
- sql: ${TABLE}.absolute_error ;;
340
- value_format_name: decimal_2
341
- }
342
- }
343
- ```
344
-
345
- ---
346
-
347
- ## 📋 Table 4: `data_profile_summary`
348
-
349
- **Description**: Dataset profiling statistics for data quality monitoring.
350
-
351
- **Use Cases**:
352
- - Data quality dashboards
353
- - Schema drift detection
354
- - Data validation reports
355
- - Column-level monitoring
356
-
357
- **Update Frequency**: Daily or on-demand
358
-
359
- **Grain**: One row per column per dataset per run
360
-
361
- ### Schema
362
-
363
- | Column Name | Type | Description | Dimension/Metric | Example |
364
- |------------|------|-------------|------------------|---------|
365
- | `profile_id` | STRING | Unique profile run identifier | Dimension (Primary Key) | `profile_abc123xyz` |
366
- | `dataset_name` | STRING | Source table/file name | Dimension | `project.dataset.customers` |
367
- | `column_name` | STRING | Column being profiled | Dimension | `age`, `email`, `signup_date` |
368
- | `profiled_at` | TIMESTAMP | When profiling ran | Dimension (Time) | `2025-12-23 15:30:45 UTC` |
369
- | `profiled_date` | DATE | Profiling date | Dimension (Time) | `2025-12-23` |
370
- | `data_type` | STRING | Column data type | Dimension | `INTEGER`, `STRING`, `FLOAT`, `TIMESTAMP` |
371
- | `inferred_type` | STRING | Smart type inference | Dimension | `numeric`, `categorical`, `datetime`, `text`, `email` |
372
- | `row_count` | INTEGER | Total rows in dataset | Metric | `10000` |
373
- | `non_null_count` | INTEGER | Non-null values | Metric | `9850` |
374
- | `null_count` | INTEGER | Null values | Metric | `150` |
375
- | `null_percentage` | FLOAT | % null (0-100) | Metric | `1.5` |
376
- | `unique_count` | INTEGER | Distinct values | Metric | `450` |
377
- | `uniqueness_percentage` | FLOAT | % unique (0-100) | Metric | `4.5` |
378
- | `min_value` | STRING | Minimum value (as string) | Metadata | `18`, `2020-01-01` |
379
- | `max_value` | STRING | Maximum value (as string) | Metadata | `95`, `2025-12-23` |
380
- | `mean_value` | FLOAT | Mean (numeric only) | Metric | `42.5` |
381
- | `median_value` | FLOAT | Median (numeric only) | Metric | `38.0` |
382
- | `std_dev` | FLOAT | Standard deviation (numeric only) | Metric | `15.2` |
383
- | `skewness` | FLOAT | Distribution skewness | Metric | `0.85` |
384
- | `kurtosis` | FLOAT | Distribution kurtosis | Metric | `2.1` |
385
- | `top_value` | STRING | Most common value | Metadata | `male`, `active` |
386
- | `top_value_frequency` | INTEGER | Count of most common value | Metric | `6500` |
387
- | `top_value_percentage` | FLOAT | % of most common value | Metric | `65.0` |
388
- | `has_outliers` | BOOLEAN | Outliers detected? | Dimension | `true`, `false` |
389
- | `outlier_count` | INTEGER | Number of outliers | Metric | `23` |
390
- | `outlier_percentage` | FLOAT | % outliers | Metric | `0.23` |
391
- | `quality_score` | FLOAT | Overall quality score (0-100) | Metric | `92.5` |
392
- | `quality_issues` | STRING (JSON) | Detected issues | Metadata | `["high_nulls", "duplicate_values"]` |
393
- | `validation_status` | STRING | Quality check result | Dimension | `pass`, `warn`, `fail` |
394
-
395
- ### Partitioning & Clustering
396
-
397
- ```sql
398
- CREATE TABLE `project.dataset.data_profile_summary`
399
- (
400
- -- columns as above
401
- )
402
- PARTITION BY profiled_date
403
- CLUSTER BY dataset_name, validation_status
404
- OPTIONS(
405
- description="Dataset profiling for data quality monitoring",
406
- require_partition_filter=true,
407
- partition_expiration_days=90 -- 3 months retention
408
- );
409
- ```
410
-
411
- ### Primary Dimensions for Looker
412
-
413
- - **Dataset**: `dataset_name`
414
- - **Column**: `column_name`, `data_type`, `inferred_type`
415
- - **Quality**: `validation_status`, `quality_score` buckets
416
- - **Time**: `profiled_date`
417
-
418
- ### Sample Looker View
419
-
420
- ```lookml
421
- view: data_profile_summary {
422
- sql_table_name: `project.dataset.data_profile_summary` ;;
423
-
424
- dimension: compound_key {
425
- primary_key: yes
426
- hidden: yes
427
- sql: CONCAT(${TABLE}.profile_id, '|', ${TABLE}.column_name) ;;
428
- }
429
-
430
- dimension: column_name {
431
- type: string
432
- sql: ${TABLE}.column_name ;;
433
- }
434
-
435
- dimension: quality_tier {
436
- type: string
437
- sql: CASE
438
- WHEN ${TABLE}.quality_score >= 90 THEN 'Excellent'
439
- WHEN ${TABLE}.quality_score >= 75 THEN 'Good'
440
- WHEN ${TABLE}.quality_score >= 60 THEN 'Fair'
441
- ELSE 'Poor'
442
- END ;;
443
- }
444
-
445
- dimension: has_quality_issues {
446
- type: yesno
447
- sql: ${TABLE}.validation_status IN ('warn', 'fail') ;;
448
- }
449
-
450
- measure: count_columns {
451
- type: count_distinct
452
- sql: ${TABLE}.column_name ;;
453
- }
454
-
455
- measure: avg_quality_score {
456
- type: average
457
- sql: ${TABLE}.quality_score ;;
458
- value_format_name: decimal_1
459
- }
460
-
461
- measure: avg_null_percentage {
462
- type: average
463
- sql: ${TABLE}.null_percentage ;;
464
- value_format_name: percent_1
465
- }
466
-
467
- measure: columns_with_issues {
468
- type: count_distinct
469
- sql: ${TABLE}.column_name ;;
470
- filters: [has_quality_issues: "yes"]
471
- }
472
- }
473
- ```
474
-
475
- ---
476
-
477
- ## 🔄 Schema Evolution Guidelines
478
-
479
- ### ✅ **SAFE Changes** (Non-Breaking)
480
-
481
- 1. **Add new columns** (always nullable or with defaults)
482
- ```sql
483
- ALTER TABLE `project.dataset.model_metrics`
484
- ADD COLUMN IF NOT EXISTS new_metric FLOAT64;
485
- ```
486
-
487
- 2. **Add new tables** (doesn't affect existing dashboards)
488
-
489
- 3. **Lengthen STRING columns** (VARCHAR(50) → VARCHAR(100))
490
-
491
- 4. **Add indexes/clustering** (performance only)
492
-
493
- 5. **Add column descriptions**
494
- ```sql
495
- ALTER TABLE `project.dataset.model_metrics`
496
- ALTER COLUMN accuracy SET OPTIONS (description='Model accuracy (0-1)');
497
- ```
498
-
499
- ### ❌ **BREAKING Changes** (Require Dashboard Updates)
500
-
501
- 1. **Rename columns** → Use views for backward compatibility:
502
- ```sql
503
- CREATE OR REPLACE VIEW `project.dataset.model_metrics_v2` AS
504
- SELECT
505
- model_id,
506
- accuracy AS acc, -- renamed column
507
- ...
508
- FROM `project.dataset.model_metrics`;
509
- ```
510
-
511
- 2. **Change data types** → Create new column, migrate, deprecate old:
512
- ```sql
513
- -- Step 1: Add new column
514
- ALTER TABLE model_metrics ADD COLUMN created_at_new TIMESTAMP;
515
-
516
- -- Step 2: Backfill
517
- UPDATE model_metrics SET created_at_new = CAST(created_at AS TIMESTAMP) WHERE true;
518
-
519
- -- Step 3: Update dashboards to use new column
520
-
521
- -- Step 4: Drop old column after validation period
522
- ALTER TABLE model_metrics DROP COLUMN created_at;
523
- ```
524
-
525
- 3. **Remove columns** → Deprecate first, remove after 90 days
526
-
527
- 4. **Change partitioning** → Requires table recreation
528
-
529
- ### 🔄 **Versioning Strategy**
530
-
531
- For major schema changes, create versioned tables:
532
-
533
- ```
534
- project.dataset.model_metrics_v1 (deprecated, keep 90 days)
535
- project.dataset.model_metrics_v2 (current)
536
- project.dataset.model_metrics (view pointing to latest version)
537
- ```
538
-
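A minimal sketch of the repointing step, assuming the placeholder `project.dataset` names used above and the `google-cloud-bigquery` client (actual dataset names and credentials will differ per project):

```python
# Sketch only: repoint the stable `model_metrics` view at the latest versioned table.
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

client.query(
    """
    CREATE OR REPLACE VIEW `project.dataset.model_metrics` AS
    SELECT * FROM `project.dataset.model_metrics_v2`
    """
).result()  # .result() waits for the DDL job to finish
```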
539
- ---
540
-
541
- ## 📊 Dashboard-Ready Metrics Catalog
542
-
543
- ### Model Performance Metrics
544
-
545
- | Metric Name | Calculation | Use Case |
546
- |------------|-------------|----------|
547
- | **Model Count** | `COUNT(DISTINCT model_id)` | Total models trained |
548
- | **Avg Accuracy** | `AVG(accuracy)` | Overall model quality |
549
- | **Accuracy Trend** | `AVG(accuracy) OVER (ORDER BY created_date)` | Performance over time |
550
- | **Best Model** | `model_id WHERE accuracy = MAX(accuracy)` | Top performer |
551
- | **Models by Type** | `COUNT(*) GROUP BY model_type` | Algorithm distribution |
552
- | **Training Time** | `AVG(training_duration_seconds)` | Resource usage |
553
- | **Recent Models** | `WHERE created_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)` | Latest activity |
554
-
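As a rough illustration of how these catalog entries translate into queries, the sketch below computes the daily model count and average accuracy for the last 30 days against the placeholder `model_metrics` table; the date filter also satisfies the `require_partition_filter` option set above.

```python
# Illustrative query: daily model count and average accuracy, last 30 days.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT
      created_date,
      COUNT(DISTINCT model_id) AS model_count,
      AVG(accuracy)            AS avg_accuracy
    FROM `project.dataset.model_metrics`
    WHERE created_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    GROUP BY created_date
    ORDER BY created_date
"""

for row in client.query(sql).result():
    print(row.created_date, row.model_count, row.avg_accuracy)
```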
555
- ### Feature Importance Metrics
556
-
557
- | Metric Name | Calculation | Use Case |
558
- |------------|-------------|----------|
559
- | **Top Features** | `WHERE importance_rank <= 10` | Most impactful features |
560
- | **Avg Importance** | `AVG(importance_score)` | Feature impact distribution |
561
- | **Engineered Features** | `COUNT(*) WHERE is_engineered = true` | Feature engineering effectiveness |
562
- | **Feature Stability** | `STDDEV(importance_score) GROUP BY feature_name` | Consistent predictors |
563
-
564
- ### Prediction Metrics
565
-
566
- | Metric Name | Calculation | Use Case |
567
- |------------|-------------|----------|
568
- | **Accuracy Rate** | `AVG(CAST(is_correct AS FLOAT64))` | Real-world performance |
569
- | **MAE** | `AVG(absolute_error)` | Average error magnitude |
570
- | **RMSE** | `SQRT(AVG(squared_error))` | Error with outlier penalty |
571
- | **Predictions/Day** | `COUNT(*) GROUP BY predicted_date` | Volume tracking |
572
- | **Confidence Distribution** | `APPROX_QUANTILES(prediction_confidence, 10)` | Model calibration |
573
- | **Segment Performance** | `AVG(is_correct) GROUP BY segment` | Fairness check |
574
-
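The segment fairness check in the last row, written out as a standalone query against the placeholder `predictions` table (again a sketch; the required partition filter is included):

```python
# Sketch: per-segment accuracy over the last 7 days of scored predictions.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT
      segment,
      AVG(CAST(is_correct AS FLOAT64)) AS accuracy_rate
    FROM `project.dataset.predictions`
    WHERE predicted_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
      AND is_correct IS NOT NULL  -- ignore rows where the actual is not known yet
    GROUP BY segment
    ORDER BY accuracy_rate
"""

for row in client.query(sql).result():
    print(row.segment, row.accuracy_rate)
```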
575
- ### Data Quality Metrics
576
-
577
- | Metric Name | Calculation | Use Case |
578
- |------------|-------------|----------|
579
- | **Data Quality Score** | `AVG(quality_score)` | Overall health |
580
- | **Null Rate** | `AVG(null_percentage)` | Completeness |
581
- | **Columns with Issues** | `COUNT(DISTINCT column_name) WHERE validation_status != 'pass'` | Problem areas |
582
- | **Quality Trend** | `AVG(quality_score) OVER (ORDER BY profiled_date)` | Improving/degrading? |
583
-
584
- ---
585
-
586
- ## 🎯 Sample Looker Explores
587
-
588
- ### Explore 1: Model Performance Analysis
589
-
590
- ```lookml
591
- explore: model_metrics {
592
- label: "Model Performance"
593
- description: "Track model accuracy, training time, and comparison"
594
-
595
- join: feature_importance {
596
- type: left_outer
597
- sql_on: ${model_metrics.model_id} = ${feature_importance.model_id} ;;
598
- relationship: one_to_many
599
- }
600
- }
601
- ```
602
-
603
- ### Explore 2: Prediction Monitoring
604
-
605
- ```lookml
606
- explore: predictions {
607
- label: "Prediction Monitoring"
608
- description: "Real-time prediction accuracy and drift"
609
-
610
- join: model_metrics {
611
- type: left_outer
612
- sql_on: ${predictions.model_id} = ${model_metrics.model_id} ;;
613
- relationship: many_to_one
614
- }
615
- }
616
- ```
617
-
618
- ### Explore 3: Data Quality Dashboard
619
-
620
- ```lookml
621
- explore: data_profile_summary {
622
- label: "Data Quality"
623
- description: "Monitor data health and schema drift"
624
- }
625
- ```
626
-
627
- ---
628
-
629
- ## 📝 Implementation Checklist
630
-
631
- ### Phase 1: Setup (Week 1)
632
- - [ ] Create all 4 BigQuery tables with partitioning
633
- - [ ] Set up service account permissions
634
- - [ ] Configure table expiration policies
635
- - [ ] Document table owners and update SLAs
636
-
637
- ### Phase 2: Integration (Week 2)
638
- - [ ] Update tools to write to these schemas
639
- - [ ] Add schema validation in CI/CD
640
- - [ ] Create data dictionary in Looker
641
- - [ ] Set up table monitoring alerts
642
-
643
- ### Phase 3: BI Layer (Week 3)
644
- - [ ] Create Looker views for all 4 tables
645
- - [ ] Build explores with joins
646
- - [ ] Create initial dashboards
647
- - [ ] Set up scheduled data refreshes
648
-
649
- ### Phase 4: Validation (Week 4)
650
- - [ ] Backfill historical data
651
- - [ ] Verify dashboard accuracy
652
- - [ ] Train stakeholders on dashboards
653
- - [ ] Document runbooks for common issues
654
-
655
- ---
656
-
657
- ## 🔗 Related Tools
658
-
659
- **BigQuery Write Tools** (src/bigquery/):
660
- - `bigquery_write_results()` - Generic write function
661
- - Helper: `bigquery_write_model_metrics()` - Specialized writer
662
- - Helper: `bigquery_write_feature_importance()` - Specialized writer
663
- - Helper: `bigquery_write_predictions()` - Specialized writer
664
- - Helper: `bigquery_write_data_profile()` - Specialized writer
665
-
666
- **Example Usage**:
667
- ```python
668
- from src.bigquery import bigquery_write_results
669
-
670
- # Write model metrics
671
- bigquery_write_results(
672
- data=metrics_df,
673
- table_id="project.dataset.model_metrics",
674
- write_disposition="WRITE_APPEND"
675
- )
676
- ```
677
-
678
- ---
679
-
680
- ## 📚 Additional Resources
681
-
682
- - [BigQuery Best Practices](https://cloud.google.com/bigquery/docs/best-practices)
683
- - [Looker LookML Reference](https://cloud.google.com/looker/docs/reference/lookml-quick-reference)
684
- - [Schema Design for BI](https://cloud.google.com/architecture/bigquery-data-warehouse)
685
-
686
- ---
687
-
688
- **Last Updated**: December 23, 2025
689
- **Schema Version**: 1.0.0
690
- **Maintained By**: Data Science Team
691
- **Review Cadence**: Quarterly
 
CHECKLIST.md DELETED
@@ -1,97 +0,0 @@
1
- # ✅ Pre-Launch Checklist
2
-
3
- ## Before Running the Application
4
-
5
- ### 1. Environment Variables ⚠️ **REQUIRED**
6
-
7
- You MUST set your API key before starting:
8
-
9
- ```powershell
10
- # Windows PowerShell
11
- $env:GOOGLE_API_KEY="your-google-api-key-here"
12
-
13
- # Verify it's set
14
- echo $env:GOOGLE_API_KEY
15
- ```
16
-
17
- ### 2. Build Status ✅
18
-
19
- - [x] Frontend dependencies installed
20
- - [x] Frontend built (FRRONTEEEND/dist exists)
21
- - [x] Backend code updated with new endpoints
22
- - [x] Configuration files in place
23
-
24
- ### 3. Quick Start Commands
25
-
26
- **Option A - Use the start script:**
27
- ```powershell
28
- .\start.ps1
29
- ```
30
-
31
- **Option B - Manual start:**
32
- ```powershell
33
- # Make sure you're in the project root
34
- Set-Location "c:\Users\Pulastya\Videos\DS AGENTTTT"
35
-
36
- # Set API key (if not already set)
37
- $env:GOOGLE_API_KEY="your-key-here"
38
-
39
- # Start the server
40
- python src\api\app.py
41
- ```
42
-
43
- ### 4. Access the Application
44
-
45
- Once the server starts, open your browser to:
46
- **http://localhost:8080**
47
-
48
- You should see:
49
- 1. **Landing Page** - Professional homepage with agent features
50
- 2. **Launch Console** button - Click to open the chat interface
51
- 3. **Chat Interface** - Modern conversational UI
52
-
53
- ### 5. Test the Chat
54
-
55
- Try these sample prompts:
56
- - "What can you do?"
57
- - "Explain your data science capabilities"
58
- - "How do I upload a dataset?"
59
- - "What ML models do you support?"
60
-
61
- ### 6. Expected Console Output
62
-
63
- When you start the server, you should see:
64
- ```
65
- INFO: Started server process [####]
66
- INFO: Waiting for application startup.
67
- ✅ Agent initialized with provider: groq
68
- ✅ Frontend assets mounted from C:\Users\Pulastya\Videos\DS AGENTTTT\FRRONTEEEND\dist
69
- INFO: Application startup complete.
70
- INFO: Uvicorn running on http://0.0.0.0:8080
71
- ```
72
-
73
- ### 7. Troubleshooting Quick Reference
74
-
75
- | Issue | Solution |
76
- |-------|----------|
77
- | "Agent not initialized" | Set GOOGLE_API_KEY environment variable |
78
- | "Frontend not found" | Run `cd FRRONTEEEND && npm run build` |
79
- | Port 8080 in use | Kill the process or change PORT env var |
80
- | Import errors | Run `pip install -r requirements.txt` |
81
-
82
- ## Next Steps After Launch
83
-
84
- 1. **Test the chat** with the agent
85
- 2. **Upload a dataset** (feature coming soon in chat)
86
- 3. **Try the API endpoints** at http://localhost:8080/docs
87
- 4. **Customize the frontend** in FRRONTEEEND/components/
88
-
89
- ## Documentation
90
-
91
- - 📖 [MIGRATION_COMPLETE.md](MIGRATION_COMPLETE.md) - What was changed
92
- - 📖 [FRONTEND_INTEGRATION.md](FRONTEND_INTEGRATION.md) - Technical details
93
- - 📖 [README.md](README.md) - Main project docs
94
-
95
- ---
96
-
97
- **Ready to launch?** Run `.\start.ps1` and visit http://localhost:8080 🚀
 
DEPLOYMENT.md DELETED
@@ -1,495 +0,0 @@
1
- # 🚀 Google Cloud Run Deployment Guide
2
-
3
- Complete guide to deploy the Data Science Agent to Google Cloud Run as a serverless API.
4
-
5
- ## 📋 Prerequisites
6
-
7
- 1. **Google Cloud Platform Account**
8
- - Active GCP account with billing enabled
9
- - Project created (or use existing project)
10
-
11
- 2. **Install Google Cloud SDK**
12
- ```bash
13
- # macOS (Homebrew)
14
- brew install --cask google-cloud-sdk
15
-
16
- # Or download from: https://cloud.google.com/sdk/install
17
- ```
18
-
19
- 3. **Authenticate with GCP**
20
- ```bash
21
- gcloud auth login
22
- gcloud auth application-default login
23
- ```
24
-
25
- 4. **Set Your Project**
26
- ```bash
27
- gcloud config set project YOUR_PROJECT_ID
28
- ```
29
-
30
- ---
31
-
32
- ## 🎯 Deployment Options
33
-
34
- ### Option 1: Automated Deployment (Recommended)
35
-
36
- Use the provided deployment script for one-command deployment:
37
-
38
- ```bash
39
- # Set required environment variables
40
- export GCP_PROJECT_ID="your-project-id"
41
- export GROQ_API_KEY="your-groq-api-key"
42
- export GOOGLE_API_KEY="your-google-api-key" # Optional for Gemini
43
-
44
- # Run deployment script
45
- ./deploy.sh
46
- ```
47
-
48
- **What it does:**
49
- - ✅ Enables required GCP APIs (Cloud Build, Cloud Run, Secret Manager)
50
- - ✅ Creates secrets for API keys
51
- - ✅ Builds Docker container
52
- - ✅ Deploys to Cloud Run
53
- - ✅ Returns service URL
54
-
55
- **Configuration options:**
56
- ```bash
57
- # Optional: Customize deployment
58
- export CLOUD_RUN_REGION="us-central1" # Change region
59
- export MEMORY="4Gi" # Increase memory
60
- export CPU="2" # Set CPU count
61
- export MAX_INSTANCES="10" # Scale limit
62
- export TIMEOUT="900" # Request timeout (15 min)
63
-
64
- ./deploy.sh
65
- ```
66
-
67
- ---
68
-
69
- ### Option 2: Manual Deployment
70
-
71
- Step-by-step manual deployment for full control:
72
-
73
- #### Step 1: Enable APIs
74
- ```bash
75
- gcloud services enable \
76
- cloudbuild.googleapis.com \
77
- run.googleapis.com \
78
- containerregistry.googleapis.com \
79
- secretmanager.googleapis.com
80
- ```
81
-
82
- #### Step 2: Create Secrets
83
- ```bash
84
- # Create GROQ API key secret
85
- echo -n "your-groq-api-key" | gcloud secrets create GROQ_API_KEY --data-file=-
86
-
87
- # Create Google API key secret (optional)
88
- echo -n "your-google-api-key" | gcloud secrets create GOOGLE_API_KEY --data-file=-
89
-
90
- # Grant Cloud Run access to secrets
91
- PROJECT_NUMBER=$(gcloud projects describe $(gcloud config get-value project) --format="value(projectNumber)")
92
- gcloud secrets add-iam-policy-binding GROQ_API_KEY \
93
- --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
94
- --role="roles/secretmanager.secretAccessor"
95
- ```
96
-
97
- #### Step 3: Build Container
98
- ```bash
99
- gcloud builds submit --tag gcr.io/$(gcloud config get-value project)/data-science-agent
100
- ```
101
-
102
- #### Step 4: Deploy to Cloud Run
103
- ```bash
104
- gcloud run deploy data-science-agent \
105
- --image gcr.io/$(gcloud config get-value project)/data-science-agent \
106
- --platform managed \
107
- --region us-central1 \
108
- --allow-unauthenticated \
109
- --memory 4Gi \
110
- --cpu 2 \
111
- --timeout 900 \
112
- --max-instances 10 \
113
- --set-env-vars LLM_PROVIDER=groq,REASONING_EFFORT=medium \
114
- --set-secrets GROQ_API_KEY=GROQ_API_KEY:latest,GOOGLE_API_KEY=GOOGLE_API_KEY:latest
115
- ```
116
-
117
- ---
118
-
119
- ### Option 3: CI/CD with Cloud Build Triggers
120
-
121
- Automated deployment on git push:
122
-
123
- #### Step 1: Connect Repository
124
- ```bash
125
- # Connect GitHub/GitLab/Bitbucket repository
126
- gcloud beta builds connections create github connection-name \
127
- --region=us-central1
128
- ```
129
-
130
- #### Step 2: Create Build Trigger
131
- ```bash
132
- gcloud builds triggers create github \
133
- --name="deploy-data-science-agent" \
134
- --repo-name="Data-Science-Agent" \
135
- --repo-owner="Surfing-Ninja" \
136
- --branch-pattern="^main$" \
137
- --build-config="cloudbuild.yaml"
138
- ```
139
-
140
- Now every push to the `main` branch automatically deploys! 🎉
141
-
142
- ---
143
-
144
- ## 🧪 Testing the Deployment
145
-
146
- ### 1. Health Check
147
- ```bash
148
- SERVICE_URL=$(gcloud run services describe data-science-agent \
149
- --region us-central1 \
150
- --format 'value(status.url)')
151
-
152
- curl $SERVICE_URL/health
153
- ```
154
-
155
- **Expected response:**
156
- ```json
157
- {
158
- "status": "healthy",
159
- "agent_ready": true,
160
- "provider": "groq",
161
- "tools_count": 82
162
- }
163
- ```
164
-
165
- ### 2. List Available Tools
166
- ```bash
167
- curl $SERVICE_URL/tools | jq
168
- ```
169
-
170
- ### 3. Profile a Dataset
171
- ```bash
172
- curl -X POST $SERVICE_URL/profile \
173
- -F "file=@test_data/sample.csv"
174
- ```
175
-
176
- ### 4. Run Full Analysis
177
- ```bash
178
- curl -X POST $SERVICE_URL/run \
179
- -F "file=@test_data/sample.csv" \
180
- -F "task_description=Analyze this dataset, detect outliers, and train a prediction model" \
181
- -F "target_col=target" \
182
- | jq
183
- ```
184
-
185
- ---
186
-
187
- ## 📊 Monitoring & Logs
188
-
189
- ### View Real-time Logs
190
- ```bash
191
- gcloud run logs tail data-science-agent --region us-central1
192
- ```
193
-
194
- ### View Recent Logs
195
- ```bash
196
- gcloud run logs read data-science-agent \
197
- --region us-central1 \
198
- --limit 50
199
- ```
200
-
201
- ### Cloud Console Monitoring
202
- - Go to: https://console.cloud.google.com/run
203
- - Click on `data-science-agent`
204
- - View: Metrics, Logs, Revisions
205
-
206
- ---
207
-
208
- ## 💰 Cost Estimation
209
-
210
- ### Cloud Run Pricing (as of Dec 2024)
211
- **Free Tier** (per month):
212
- - 2 million requests
213
- - 360,000 GB-seconds of memory
214
- - 180,000 vCPU-seconds
215
-
216
- **Paid Tier** (us-central1):
217
- - CPU: $0.00002400 per vCPU-second
218
- - Memory: $0.00000250 per GB-second
219
- - Requests: $0.40 per million requests
220
-
221
- **Example Cost for 4Gi Memory, 2 vCPU:**
222
- - 1 request taking 60 seconds
223
- - CPU: 2 vCPU × 60s × $0.000024 = $0.00288
224
- - Memory: 4GB × 60s × $0.0000025 = $0.0006
225
- - Request: $0.0000004
226
- - **Total: ~$0.0035 per request**
227
-
228
- **Monthly estimate for 1000 requests/month:**
229
- - ~$3.50/month (well within free tier for testing!)
230
-
231
- ---
232
-
233
- ## 🔒 Security Best Practices
234
-
235
- ### 1. Enable Authentication (Production)
236
- ```bash
237
- # Deploy with authentication required
238
- gcloud run deploy data-science-agent \
239
- --no-allow-unauthenticated \
240
- --region us-central1 \
241
- --image gcr.io/PROJECT_ID/data-science-agent
242
-
243
- # Create service account for clients
244
- gcloud iam service-accounts create api-client
245
-
246
- # Grant invoker role
247
- gcloud run services add-iam-policy-binding data-science-agent \
248
- --member="serviceAccount:api-client@PROJECT_ID.iam.gserviceaccount.com" \
249
- --role="roles/run.invoker" \
250
- --region us-central1
251
- ```
252
-
253
- ### 2. Use VPC Connector (For BigQuery/GCS)
254
- ```bash
255
- # Create VPC connector
256
- gcloud compute networks vpc-access connectors create ds-agent-connector \
257
- --network default \
258
- --region us-central1 \
259
- --range 10.8.0.0/28
260
-
261
- # Deploy with VPC
262
- gcloud run deploy data-science-agent \
263
- --vpc-connector ds-agent-connector \
264
- --region us-central1
265
- ```
266
-
267
- ### 3. Restrict API Keys
268
- - Set **Application restrictions** in Google Cloud Console
269
- - Whitelist only Cloud Run service URL
270
- - Set **API restrictions** to only required APIs
271
-
272
- ---
273
-
274
- ## 🔧 Configuration Options
275
-
276
- ### Environment Variables
277
- ```bash
278
- # Set during deployment
279
- --set-env-vars KEY1=value1,KEY2=value2
280
-
281
- # Available variables:
282
- LLM_PROVIDER=groq # or "gemini"
283
- REASONING_EFFORT=medium # low, medium, high
284
- CACHE_TTL_SECONDS=86400 # Cache lifetime
285
- ARTIFACT_BACKEND=local # or "gcs" for cloud storage
286
- GCS_BUCKET_NAME=your-bucket # If using GCS backend
287
- OUTPUT_DIR=/tmp/outputs # Output directory
288
- MAX_PARALLEL_TOOLS=5 # Concurrent tool execution
289
- MAX_RETRIES=3 # Tool retry attempts
290
- TIMEOUT_SECONDS=300 # Tool timeout
291
- ```
292
-
293
- ### Resource Limits
294
- ```bash
295
- --memory 4Gi # 128Mi to 32Gi
296
- --cpu 2 # 1 to 8 vCPU
297
- --timeout 900 # Max 3600s (1 hour)
298
- --max-instances 10 # Scale limit
299
- --min-instances 0 # Always-warm instances
300
- --concurrency 10 # Requests per instance
301
- ```
302
-
303
- ---
304
-
305
- ## 🐛 Troubleshooting
306
-
307
- ### Build Fails
308
- ```bash
309
- # Check build logs
310
- gcloud builds list --limit=5
311
- gcloud builds log BUILD_ID
312
-
313
- # Common fixes:
314
- # - Ensure Dockerfile is in root directory
315
- # - Check requirements.txt has all dependencies
316
- # - Increase build timeout: --timeout=1200s
317
- ```
318
-
319
- ### Deployment Fails
320
- ```bash
321
- # Check service status
322
- gcloud run services describe data-science-agent --region us-central1
323
-
324
- # Common fixes:
325
- # - Ensure APIs are enabled
326
- # - Check secrets exist and are accessible
327
- # - Verify service account permissions
328
- ```
329
-
330
- ### Runtime Errors
331
- ```bash
332
- # View logs
333
- gcloud run logs tail data-science-agent --region us-central1
334
-
335
- # Common issues:
336
- # - API keys not set: Check secrets
337
- # - Import errors: Ensure all dependencies in requirements.txt
338
- # - Memory issues: Increase --memory limit
339
- # - Timeout: Increase --timeout value
340
- ```
341
-
342
- ### Container Crashes
343
- ```bash
344
- # Test locally first
345
- docker build -t ds-agent .
346
- docker run -p 8080:8080 \
347
- -e GROQ_API_KEY="your-key" \
348
- ds-agent
349
-
350
- curl http://localhost:8080/health
351
- ```
352
-
353
- ---
354
-
355
- ## 🚀 Advanced Features
356
-
357
- ### Custom Domain
358
- ```bash
359
- # Map custom domain
360
- gcloud run domain-mappings create \
361
- --service data-science-agent \
362
- --domain api.yourdomain.com \
363
- --region us-central1
364
- ```
365
-
366
- ### Load Balancing
367
- ```bash
368
- # Create multiple regional deployments
369
- for region in us-central1 us-east1 europe-west1; do
370
- gcloud run deploy data-science-agent \
371
- --image gcr.io/PROJECT_ID/data-science-agent \
372
- --region $region
373
- done
374
-
375
- # Set up global load balancer
376
- # Follow: https://cloud.google.com/load-balancing/docs/https/setup-global-ext-https-serverless
377
- ```
378
-
379
- ### Multi-Region Deployment
380
- ```bash
381
- # Deploy to multiple regions for high availability
382
- CLOUD_RUN_REGION=us-central1 ./deploy.sh
383
- CLOUD_RUN_REGION=europe-west1 ./deploy.sh
384
- CLOUD_RUN_REGION=asia-east1 ./deploy.sh
385
- ```
386
-
387
- ---
388
-
389
- ## 📝 API Documentation
390
-
391
- Once deployed, access Swagger docs at:
392
- ```
393
- https://YOUR_SERVICE_URL/docs
394
- ```
395
-
396
- ### Available Endpoints
397
-
398
- #### `GET /` - Health Check
399
- Returns service status and tool count.
400
-
401
- #### `GET /health` - Detailed Health
402
- Returns agent readiness and provider info.
403
-
404
- #### `GET /tools` - List Tools
405
- Returns all 82 available tools organized by category.
406
-
407
- #### `POST /run` - Run Full Analysis
408
- Upload dataset and execute complete data science workflow.
409
-
410
- **Parameters** (see the example request after this list):
411
- - `file`: CSV/Parquet file (multipart/form-data)
412
- - `task_description`: Natural language task description
413
- - `target_col`: Target column for ML (optional)
414
- - `use_cache`: Enable caching (default: true)
415
- - `max_iterations`: Max workflow steps (default: 20)
416
-
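For Python clients, the same call can be made with `requests`; this is only a sketch using the parameters listed above and a placeholder service URL:

```python
# Illustrative client call to the /run endpoint (multipart upload + form fields).
import requests

SERVICE_URL = "https://YOUR_SERVICE_URL"  # replace with your Cloud Run URL

with open("test_data/sample.csv", "rb") as f:
    response = requests.post(
        f"{SERVICE_URL}/run",
        files={"file": f},
        data={
            "task_description": "Analyze this dataset and train a prediction model",
            "target_col": "target",
            "use_cache": "true",
            "max_iterations": "20",
        },
        timeout=900,  # long-running workflow; matches the Cloud Run request timeout
    )

response.raise_for_status()
print(response.json())
```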
417
- #### `POST /profile` - Quick Profile
418
- Quick dataset profiling without full workflow.
419
-
420
- **Parameters:**
421
- - `file`: CSV/Parquet file (multipart/form-data)
422
-
423
- ---
424
-
425
- ## 🔄 Updates & Rollbacks
426
-
427
- ### Update Deployment
428
- ```bash
429
- # Rebuild and redeploy
430
- ./deploy.sh
431
- ```
432
-
433
- ### Rollback to Previous Revision
434
- ```bash
435
- # List revisions
436
- gcloud run revisions list --service data-science-agent --region us-central1
437
-
438
- # Rollback
439
- gcloud run services update-traffic data-science-agent \
440
- --to-revisions REVISION_NAME=100 \
441
- --region us-central1
442
- ```
443
-
444
- ### Blue/Green Deployment
445
- ```bash
446
- # Deploy new version with tag
447
- gcloud run deploy data-science-agent \
448
- --tag blue \
449
- --no-traffic \
450
- --region us-central1
451
-
452
- # Test: https://blue---data-science-agent-HASH.run.app
453
-
454
- # Switch traffic
455
- gcloud run services update-traffic data-science-agent \
456
- --to-tags blue=100 \
457
- --region us-central1
458
- ```
459
-
460
- ---
461
-
462
- ## 📚 Additional Resources
463
-
464
- - **Cloud Run Docs**: https://cloud.google.com/run/docs
465
- - **Pricing Calculator**: https://cloud.google.com/products/calculator
466
- - **Best Practices**: https://cloud.google.com/run/docs/tips
467
- - **Quotas & Limits**: https://cloud.google.com/run/quotas
468
-
469
- ---
470
-
471
- ## ✅ Deployment Checklist
472
-
473
- - [ ] GCP project created and billing enabled
474
- - [ ] Google Cloud SDK installed and authenticated
475
- - [ ] API keys obtained (GROQ_API_KEY, GOOGLE_API_KEY)
476
- - [ ] Secrets created in Secret Manager
477
- - [ ] Docker container builds successfully locally
478
- - [ ] Cloud Run APIs enabled
479
- - [ ] Service deployed to Cloud Run
480
- - [ ] Health check endpoint returns 200
481
- - [ ] Test dataset profiled successfully
482
- - [ ] Full analysis workflow tested
483
- - [ ] Monitoring/logging configured
484
- - [ ] Cost alerts set up (optional)
485
- - [ ] Custom domain mapped (optional)
486
- - [ ] CI/CD pipeline configured (optional)
487
-
488
- ---
489
-
490
- **Need help?** Check the troubleshooting section or view logs with:
491
- ```bash
492
- gcloud run logs tail data-science-agent --region us-central1
493
- ```
494
-
495
- Happy deploying! 🎉
 
FRONTEND_INTEGRATION.md DELETED
@@ -1,234 +0,0 @@
1
- # Data Science Agent - Frontend Integration Guide
2
-
3
- ## 🎉 New React Frontend
4
-
5
- The application now features a modern, professional React frontend that replaces the old Gradio interface.
6
-
7
- ### Features
8
-
9
- - **Beautiful Landing Page**: Showcases the agent's capabilities with modern design
10
- - **Professional Chat Interface**: NextChat-style conversational UI
11
- - **Direct Backend Integration**: Communicates with your FastAPI backend
12
- - **Responsive Design**: Works on all devices
13
- - **Dark Theme**: Modern, eye-friendly interface
14
-
15
- ## 🚀 Quick Start
16
-
17
- ### Prerequisites
18
-
19
- - Python 3.13+
20
- - Node.js 20+
21
- - npm (comes with Node.js)
22
-
23
- ### Running the Application
24
-
25
- #### Option 1: Using the Build Script (Recommended)
26
-
27
- **Windows:**
28
- ```powershell
29
- .\build-and-deploy.ps1
30
- ```
31
-
32
- **Linux/Mac:**
33
- ```bash
34
- chmod +x build-and-deploy.sh
35
- ./build-and-deploy.sh
36
- ```
37
-
38
- Then start the server:
39
- ```bash
40
- python src/api/app.py
41
- ```
42
-
43
- #### Option 2: Manual Steps
44
-
45
- 1. **Build the Frontend:**
46
- ```bash
47
- cd FRRONTEEEND
48
- npm.cmd install
49
- npm.cmd run build
50
- cd ..
51
- ```
52
-
53
- 2. **Install Python Dependencies:**
54
- ```bash
55
- pip install -r requirements.txt
56
- ```
57
-
58
- 3. **Start the Backend Server:**
59
- ```bash
60
- python src/api/app.py
61
- ```
62
-
63
- 4. **Access the Application:**
64
- Open your browser and navigate to: http://localhost:8080
65
-
66
- ## 🏗️ Architecture
67
-
68
- ### Backend (FastAPI)
69
- - **Location**: `src/api/app.py`
70
- - **Port**: 8080
71
- - **Endpoints**:
72
- - `GET /` - Health check & landing page
73
- - `POST /chat` - Chat interface endpoint
74
- - `POST /run` - Full data science workflow
75
- - `POST /profile` - Dataset profiling
76
- - `GET /tools` - List available tools
77
-
78
- ### Frontend (React + Vite)
79
- - **Location**: `FRRONTEEEND/`
80
- - **Build Output**: `FRRONTEEEND/dist/`
81
- - **Dev Port**: 3000 (development mode)
82
- - **Production**: Served by FastAPI at port 8080
83
-
84
- ## 🔧 Development Mode
85
-
86
- If you want to develop the frontend with hot-reloading:
87
-
88
- 1. **Terminal 1 - Backend:**
89
- ```bash
90
- python src/api/app.py
91
- ```
92
-
93
- 2. **Terminal 2 - Frontend:**
94
- ```bash
95
- cd FRRONTEEEND
96
- npm.cmd run dev
97
- ```
98
-
99
- Access:
100
- - Frontend (dev): http://localhost:3000
101
- - Backend API: http://localhost:8080
102
-
103
- ## 🌐 API Integration
104
-
105
- The frontend now communicates with your FastAPI backend instead of calling external APIs directly.
106
-
107
- ### Environment Variables
108
-
109
- Create `FRRONTEEEND/.env` for local development:
110
- ```env
111
- VITE_API_URL=http://localhost:8080
112
- ```
113
-
114
- For production, update `FRRONTEEEND/.env.production`:
115
- ```env
116
- VITE_API_URL=https://your-cloud-run-url.run.app
117
- ```
118
-
119
- ## 📦 Deployment
120
-
121
- ### Docker Build
122
-
123
- The Dockerfile now includes a multi-stage build that:
124
- 1. Builds the React frontend
125
- 2. Builds the Python environment
126
- 3. Combines both in the final image
127
-
128
- ```bash
129
- docker build -t data-science-agent .
130
- docker run -p 8080:8080 data-science-agent
131
- ```
132
-
133
- ### Google Cloud Run
134
-
135
- ```bash
136
- gcloud builds submit --tag gcr.io/YOUR-PROJECT-ID/data-science-agent
137
- gcloud run deploy data-science-agent \
138
- --image gcr.io/YOUR-PROJECT-ID/data-science-agent \
139
- --platform managed \
140
- --region us-central1 \
141
- --allow-unauthenticated \
142
- --set-env-vars GROQ_API_KEY=your-api-key
143
- ```
144
-
145
- ## 🔄 What Changed
146
-
147
- ### Removed
148
- - ❌ Gradio interface (`chat_ui.py` - kept for reference)
149
- - ❌ Direct Google GenAI calls from frontend
150
- - ❌ Gradio dependency
151
-
152
- ### Added
153
- - ✅ React + TypeScript frontend with Vite
154
- - ✅ Professional landing page
155
- - ✅ Modern chat interface
156
- - ✅ `/chat` API endpoint
157
- - ✅ CORS support in FastAPI
158
- - ✅ Static file serving for React app
159
- - ✅ Multi-stage Docker build
160
-
161
- ## 🛠️ Tech Stack
162
-
163
- ### Frontend
164
- - React 19
165
- - TypeScript 5.8
166
- - Vite 6
167
- - Tailwind CSS
168
- - Framer Motion (animations)
169
- - Lucide React (icons)
170
-
171
- ### Backend (unchanged)
172
- - FastAPI
173
- - Python 3.13
174
- - Groq API
175
- - Polars, DuckDB
176
- - Scikit-learn, XGBoost, LightGBM
177
-
178
- ## 📁 Project Structure
179
-
180
- ```
181
- .
182
- ├── FRRONTEEEND/ # React frontend
183
- │ ├── components/ # React components
184
- │ ├── dist/ # Built frontend (after npm run build)
185
- │ ├── package.json
186
- │ ├── vite.config.ts
187
- │ └── .env # Frontend environment variables
188
- ├── src/
189
- │ ├── api/
190
- │ │ └── app.py # FastAPI backend (updated)
191
- │ ├── tools/ # Data science tools
192
- │ └── orchestrator.py # Main agent logic
193
- ├── requirements.txt # Python dependencies (updated)
194
- ├── Dockerfile # Multi-stage build (updated)
195
- ├── build-and-deploy.ps1 # Windows build script
196
- └── build-and-deploy.sh # Linux/Mac build script
197
- ```
198
-
199
- ## 🐛 Troubleshooting
200
-
201
- ### Frontend doesn't load
202
- - Make sure you've run `npm run build` in the FRRONTEEEND directory
203
- - Check that `FRRONTEEEND/dist/` exists and contains files
204
-
205
- ### API errors in chat
206
- - Ensure the backend is running on port 8080
207
- - Check that `GROQ_API_KEY` is set in your environment
208
- - Verify the API URL in `.env` file
209
-
210
- ### CORS errors
211
- - The backend now has CORS enabled for development
212
- - For production, update the `allow_origins` in `src/api/app.py`
213
-
214
- ## 📝 Notes
215
-
216
- - The old `chat_ui.py` has been kept for reference but is no longer used
217
- - All chat functionality now goes through the `/chat` endpoint
218
- - The frontend is automatically served by FastAPI in production mode
219
- - Session history is maintained in the frontend (browser)
220
-
221
- ## 🎯 Next Steps
222
-
223
- 1. **Customize the frontend**: Edit files in `FRRONTEEEND/components/`
224
- 2. **Add file upload**: Extend `ChatInterface.tsx` to handle file uploads
225
- 3. **Add visualization**: Display charts from the backend in the chat
226
- 4. **Authentication**: Add user authentication if needed
227
-
228
- ## 📞 Support
229
-
230
- For issues or questions:
231
- 1. Check the console logs (browser & terminal)
232
- 2. Verify environment variables
233
- 3. Ensure all dependencies are installed
234
- 4. Review the API documentation at http://localhost:8080/docs
 
FRRONTEEEND/README.md DELETED
@@ -1,20 +0,0 @@
1
- <div align="center">
2
- <img width="1200" height="475" alt="GHBanner" src="https://github.com/user-attachments/assets/0aa67016-6eaf-458a-adb2-6e31a0763ed6" />
3
- </div>
4
-
5
- # Run and deploy your AI Studio app
6
-
7
- This contains everything you need to run your app locally.
8
-
9
- View your app in AI Studio: https://ai.studio/apps/drive/1gChoktTuh429q26FzxS4BPo0q0LnlRE9
10
-
11
- ## Run Locally
12
-
13
- **Prerequisites:** Node.js
14
-
15
-
16
- 1. Install dependencies:
17
- `npm install`
18
- 2. Set the `GEMINI_API_KEY` in [.env.local](.env.local) to your Gemini API key
19
- 3. Run the app:
20
- `npm run dev`
 
GEMINI_UPDATE.md DELETED
@@ -1,93 +0,0 @@
1
- # 🔄 Updated to Use Google Gemini!
2
-
3
- ## What Changed
4
-
5
- The application now uses **Google Gemini (gemini-2.0-flash-exp)** instead of Groq for the chat interface.
6
-
7
- ## Required Setup
8
-
9
- ### 1. Set Your Google API Key
10
-
11
- ```powershell
12
- # Windows PowerShell
13
- $env:GOOGLE_API_KEY="your-google-api-key-here"
14
-
15
- # Verify it's set
16
- echo $env:GOOGLE_API_KEY
17
- ```
18
-
19
- ### 2. Get Your API Key
20
-
21
- If you don't have a Google API key:
22
- 1. Go to [Google AI Studio](https://aistudio.google.com/app/apikey)
23
- 2. Create a new API key
24
- 3. Copy and set it as shown above
25
-
26
- ## Quick Start
27
-
28
- ```powershell
29
- # Set your API key
30
- $env:GOOGLE_API_KEY="your-key-here"
31
-
32
- # Run the application
33
- .\start.ps1
34
- ```
35
-
36
- Then open: **http://localhost:8080**
37
-
38
- ## What's Using Gemini
39
-
40
- - ✅ **Chat Interface** (`/chat` endpoint) - Uses Gemini 2.0 Flash
41
- - ℹ️ **Full Workflow** (`/run` endpoint) - Uses the main agent (configurable via LLM_PROVIDER)
42
-
43
- ## Technical Details
44
-
45
- The `/chat` endpoint now:
46
- - Uses `google.generativeai` SDK
47
- - Model: `gemini-2.0-flash-exp`
48
- - Maintains conversation history
49
- - Professional data science system instruction
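A rough sketch of that wiring with the `google-generativeai` SDK, assuming the model name and behaviour described above (the exact system instruction in `app.py` is not reproduced here):

```python
# Sketch of the /chat wiring described above, not the exact app.py code.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel(
    "gemini-2.0-flash-exp",
    system_instruction="You are a professional data science assistant.",  # placeholder
)

# Conversation history is kept by the chat session between turns.
chat = model.start_chat(history=[])
reply = chat.send_message("What ML models do you support?")
print(reply.text)
```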
50
-
51
- ## Expected Console Output
52
-
53
- When you start the server:
54
- ```
55
- INFO: Started server process [####]
56
- INFO: Waiting for application startup.
57
- ✅ Agent initialized with provider: gemini
58
- ✅ Frontend assets mounted from C:\Users\Pulastya\Videos\DS AGENTTTT\FRRONTEEEND\dist
59
- INFO: Application startup complete.
60
- INFO: Uvicorn running on http://0.0.0.0:8080
61
- ```
62
-
63
- ## Files Updated
64
-
65
- - ✅ [src/api/app.py](src/api/app.py) - `/chat` endpoint now uses Gemini
66
- - ✅ [.env.example](.env.example) - Updated to GOOGLE_API_KEY
67
- - ✅ [start.ps1](start.ps1) - Updated environment variable reference
68
- - ✅ [start.sh](start.sh) - Updated environment variable reference
69
- - ✅ [CHECKLIST.md](CHECKLIST.md) - Updated instructions
70
- - ✅ [FRRONTEEEND/.env](FRRONTEEEND/.env) - Added note about Gemini
71
-
72
- ## Troubleshooting
73
-
74
- ### Error: "API key not configured"
75
- **Solution**: Make sure you've set the environment variable:
76
- ```powershell
77
- $env:GOOGLE_API_KEY="your-actual-api-key"
78
- ```
79
-
80
- ### Error: "Module google.generativeai not found"
81
- **Solution**: The dependency is already in requirements.txt. Verify it's installed:
82
- ```bash
83
- pip install google-generativeai
84
- ```
85
-
86
- ### Rate Limits
87
- Gemini 2.0 Flash has generous rate limits:
88
- - Free tier: 15 RPM (requests per minute)
89
- - 1 million TPM (tokens per minute)
90
-
91
- ---
92
-
93
- **Ready?** Set your `GOOGLE_API_KEY` and run `.\start.ps1` 🚀
 
MIGRATION_COMPLETE.md DELETED
@@ -1,325 +0,0 @@
1
- # 🎉 Frontend Migration Complete!
2
-
3
- ## Summary
4
-
5
- Successfully replaced the old Gradio interface with a modern React-based frontend featuring:
6
- - **Professional Landing Page**: Showcases the agent's capabilities
7
- - **Modern Chat Interface**: NextChat-style conversational UI
8
- - **Direct Backend Integration**: Communicates with FastAPI backend
9
- - **Beautiful Design**: Dark theme with animations and responsive layout
10
-
11
- ## What Was Changed
12
-
13
- ### ✅ Backend Updates ([src/api/app.py](src/api/app.py))
14
- 1. **Added CORS middleware** for frontend communication
15
- 2. **Created `/chat` endpoint** for conversational interface
16
- 3. **Static file serving** for built React app
17
- 4. **Catch-all route** to serve `index.html` for client-side routing
18
-
19
- ### ✅ Frontend Updates
20
- 1. **Removed Google GenAI dependency** from [package.json](FRRONTEEEND/package.json)
21
- 2. **Updated ChatInterface.tsx** to call backend `/chat` endpoint instead of external API
22
- 3. **Added environment configuration**:
23
- - `.env` for local development
24
- - `.env.production` for production builds
25
- 4. **Updated vite.config.ts** with proxy configuration
26
-
27
- ### ✅ Configuration Files
28
- 1. **requirements.txt**: Commented out Gradio (no longer needed)
29
- 2. **Dockerfile**: Added multi-stage build for React frontend
30
- 3. **.dockerignore**: Excluded node_modules and frontend dev files
31
- 4. **New Scripts**:
32
- - `start.ps1` / `start.sh` - Quick start scripts
33
- - `build-and-deploy.ps1` / `build-and-deploy.sh` - Build scripts
34
-
35
- ### ✅ Documentation
36
- - **FRONTEND_INTEGRATION.md**: Complete integration guide
37
- - **README.md**: Updated with frontend announcement
38
-
39
- ## 🚀 How to Run
40
-
41
- ### Quick Start (Recommended)
42
-
43
- **Windows:**
44
- ```powershell
45
- .\start.ps1
46
- ```
47
-
48
- **Linux/Mac:**
49
- ```bash
50
- chmod +x start.sh
51
- ./start.sh
52
- ```
53
-
54
- ### Manual Steps
55
-
56
- 1. **Build Frontend** (already done ✅):
57
- ```bash
58
- cd FRRONTEEEND
59
- npm.cmd install
60
- npm.cmd run build
61
- cd ..
62
- ```
63
-
64
- 2. **Set Environment Variables**:
65
- ```powershell
66
- # Required
67
- $env:GROQ_API_KEY="your-groq-api-key-here"
68
-
69
- # Optional
70
- $env:GOOGLE_API_KEY="your-google-api-key"
71
- ```
72
-
73
- 3. **Start Backend**:
74
- ```bash
75
- python src\api\app.py
76
- ```
77
-
78
- 4. **Access Application**:
79
- Open browser to: **http://localhost:8080**
80
-
81
- ## 🏗️ Architecture
82
-
83
- ```
84
- ┌─────────────────────────────────────────────────────────┐
85
- │ Browser │
86
- │ │
87
- │ ┌──────────────────────────────────────────────────┐ │
88
- │ │ React Frontend (Port 8080) │ │
89
- │ │ - Landing Page (HeroGeometric, etc.) │ │
90
- │ │ - Chat Interface (ChatInterface.tsx) │ │
91
- │ └──────────────────────────────────────────────────┘ │
92
- │ │ │
93
- │ │ HTTP POST /chat │
94
- └─────────────────────────┼────────────────────────────────┘
95
-
96
-
97
- ┌─────────────────────────────────────────────────────────┐
98
- │ FastAPI Backend (Port 8080) │
99
- │ │
100
- │ ┌──────────────────────────────────────────────────┐ │
101
- │ │ API Endpoints │ │
102
- │ │ - POST /chat → Chat with agent │ │
103
- │ │ - POST /run → Full workflow │ │
104
- │ │ - POST /profile → Dataset profiling │ │
105
- │ │ - GET /tools → List tools │ │
106
- │ │ - GET /* → Serve React app │ │
107
- │ └──────────────────────────────────────────────────┘ │
108
- │ │ │
109
- │ ▼ │
110
- │ ┌──────────────────────────────────────────────────┐ │
111
- │ │ DataScienceCopilot (orchestrator.py) │ │
112
- │ │ - 82+ Tools │ │
113
- │ │ - Groq LLM │ │
114
- │ │ - Session Memory │ │
115
- │ └──────────────────────────────────────────────────┘ │
116
- └─────────────────────────────────────────────────────────┘
117
- ```
118
-
119
- ## 🎯 Key Endpoints
120
-
121
- ### `/chat` - Conversational Interface
122
- ```typescript
123
- POST /chat
124
- Content-Type: application/json
125
-
126
- {
127
- "messages": [
128
- {"role": "user", "content": "Profile my dataset"},
129
- {"role": "assistant", "content": "..."}
130
- ],
131
- "stream": false
132
- }
133
- ```
134
-
135
- **Response:**
136
- ```json
137
- {
138
- "success": true,
139
- "message": "I can help you profile your dataset...",
140
- "model": "llama-3.3-70b-versatile",
141
- "provider": "groq"
142
- }
143
- ```
144
-
145
- ### `/run` - Complete Workflow
146
- ```bash
147
- POST /run
148
- Content-Type: multipart/form-data
149
-
150
- file: <dataset.csv>
151
- task_description: "Predict house prices"
152
- target_col: "price"
153
- ```
154
-
155
- ### `/profile` - Quick Profiling
156
- ```bash
157
- POST /profile
158
- Content-Type: multipart/form-data
159
-
160
- file: <dataset.csv>
161
- ```
162
-
163
- ## 📝 Environment Variables
164
-
165
- ### Backend (.env or system)
166
- ```env
167
- # Required
168
- GROQ_API_KEY=your-groq-api-key
169
-
170
- # Optional
171
- GOOGLE_API_KEY=your-google-api-key
172
- GCP_PROJECT_ID=your-project-id
173
- LLM_PROVIDER=groq # or "gemini"
174
- ```
175
-
176
- ### Frontend (FRRONTEEEND/.env)
177
- ```env
178
- # Development
179
- VITE_API_URL=http://localhost:8080
180
-
181
- # Production (FRRONTEEEND/.env.production)
182
- VITE_API_URL=https://your-cloud-run-url.run.app
183
- ```
184
-
185
- ## 🐳 Docker Deployment
186
-
187
- The Dockerfile now includes a multi-stage build:
188
-
189
- ```bash
190
- # Build image
191
- docker build -t data-science-agent .
192
-
193
- # Run container
194
- docker run -p 8080:8080 \
195
- -e GROQ_API_KEY=your-key \
196
- data-science-agent
197
- ```
198
-
199
- ## ☁️ Google Cloud Run Deployment
200
-
201
- ```bash
202
- # Build and push
203
- gcloud builds submit --tag gcr.io/YOUR-PROJECT-ID/data-science-agent
204
-
205
- # Deploy
206
- gcloud run deploy data-science-agent \
207
- --image gcr.io/YOUR-PROJECT-ID/data-science-agent \
208
- --platform managed \
209
- --region us-central1 \
210
- --allow-unauthenticated \
211
- --set-env-vars GROQ_API_KEY=your-api-key
212
- ```
213
-
214
- ## 🔍 Testing
215
-
216
- ### Test Backend API
217
- ```bash
218
- # Health check
219
- curl http://localhost:8080/health
220
-
221
- # List tools
222
- curl http://localhost:8080/tools
223
-
224
- # Chat
225
- curl -X POST http://localhost:8080/chat \
226
- -H "Content-Type: application/json" \
227
- -d '{
228
- "messages": [
229
- {"role": "user", "content": "Hello, what can you do?"}
230
- ]
231
- }'
232
- ```
233
-
234
- ### Test Frontend
235
- 1. Open browser: http://localhost:8080
236
- 2. Click "Launch Console"
237
- 3. Type a message and send
238
-
239
- ## 🎨 Frontend Development
240
-
241
- For frontend development with hot-reloading:
242
-
243
- **Terminal 1 - Backend:**
244
- ```bash
245
- python src\api\app.py
246
- ```
247
-
248
- **Terminal 2 - Frontend:**
249
- ```bash
250
- cd FRRONTEEEND
251
- npm.cmd run dev
252
- ```
253
-
254
- Access:
255
- - Frontend Dev: http://localhost:3000
256
- - Backend API: http://localhost:8080
257
-
258
- ## 📦 Build Status
259
-
260
- ✅ **Frontend Built**: FRRONTEEEND/dist/ contains:
261
- - index.html
262
- - assets/index-[hash].js (384 KB)
263
-
264
- ✅ **Backend Ready**: src/api/app.py configured to:
265
- - Serve static files from FRRONTEEEND/dist/assets
266
- - Route all non-API requests to index.html
267
- - Handle /chat endpoint
268
-
269
- ## 🔄 Migration Notes
270
-
271
- ### What's Deprecated
272
- - ❌ `chat_ui.py` - Old Gradio interface (kept for reference)
273
- - ❌ Direct Google GenAI calls from frontend
274
-
275
- ### What's New
276
- - ✅ React 19 + TypeScript
277
- - ✅ Vite 6 build system
278
- - ✅ Tailwind CSS styling
279
- - ✅ Framer Motion animations
280
- - ✅ Backend-first architecture
281
-
282
- ## 🐛 Troubleshooting
283
-
284
- ### Issue: Frontend shows 404
285
- **Solution**: Make sure you've built the frontend:
286
- ```bash
287
- cd FRRONTEEEND
288
- npm.cmd run build
289
- ```
290
-
291
- ### Issue: API errors in chat
292
- **Solution**:
293
- 1. Check backend is running: `python src\api\app.py`
294
- 2. Verify GROQ_API_KEY is set
295
- 3. Check console for errors
296
-
297
- ### Issue: CORS errors
298
- **Solution**: The backend has CORS enabled. If issues persist, check the `allow_origins` in app.py
299
-
300
- ### Issue: Module import errors
301
- **Solution**: Make sure all Python dependencies are installed:
302
- ```bash
303
- pip install -r requirements.txt
304
- ```
305
-
306
- ## 📚 Additional Resources
307
-
308
- - **[FRONTEND_INTEGRATION.md](FRONTEND_INTEGRATION.md)** - Detailed integration guide
309
- - **[README.md](README.md)** - Main project documentation
310
- - **[DEPLOYMENT.md](DEPLOYMENT.md)** - Cloud deployment guide
311
-
312
- ## ✨ Next Steps
313
-
314
- 1. **File Upload**: Add file upload capability to ChatInterface
315
- 2. **Visualizations**: Display charts and plots in chat
316
- 3. **Session Persistence**: Store chat history in backend
317
- 4. **Authentication**: Add user authentication
318
- 5. **Streaming**: Implement streaming responses
319
- 6. **Dark/Light Mode**: Add theme toggle
320
-
321
- ---
322
-
323
- **Status**: ✅ Ready to use!
324
-
325
- **Last Updated**: December 27, 2025
 
 
QUICK_REFERENCE.txt DELETED
@@ -1,71 +0,0 @@
1
- ╔═══════════════════════════════════════════════════════════════╗
2
- ║ 🚀 DATA SCIENCE AGENT - QUICK REFERENCE ║
3
- ║ Now powered by Google Gemini! 🤖 ║
4
- ╚═══════════════════════════════════════════════════════════════╝
5
-
6
- ┌───────────────────────────────────────────────────────────────┐
7
- │ 1. SET API KEY (REQUIRED!) │
8
- └───────────────────────────────────────────────────────────────┘
9
-
10
- PowerShell:
11
- $env:GOOGLE_API_KEY="your-google-api-key-here"
12
-
13
- Get your key: https://aistudio.google.com/app/apikey
14
-
15
- ┌───────────────────────────────────────────────────────────────┐
16
- │ 2. START THE APPLICATION │
17
- └───────────────────────────────────────────────────────────────┘
18
-
19
- .\start.ps1
20
-
21
- ┌───────────────────────────────────────────────────────────────┐
22
- │ 3. ACCESS THE APP │
23
- └───────────────────────────────────────────────────────────────┘
24
-
25
- Open browser: http://localhost:8080
26
-
27
- ┌───────────────────────────────────────────────────────────────┐
28
- │ WHAT'S INCLUDED │
29
- └───────────────────────────────────────────────────────────────┘
30
-
31
- ✅ Modern React frontend with landing page
32
- ✅ Professional chat interface
33
- ✅ Google Gemini 2.0 Flash integration
34
- ✅ 82+ data science tools
35
- ✅ Complete ML pipeline automation
36
-
37
- ┌───────────────────────────────────────────────────────────────┐
38
- │ KEY FILES │
39
- └───────────────────────────────────────────────────────────────┘
40
-
41
- 📖 GEMINI_UPDATE.md - Gemini migration details
42
- 📖 CHECKLIST.md - Pre-launch checklist
43
- 📖 MIGRATION_COMPLETE.md - Full change log
44
- 📖 FRONTEND_INTEGRATION.md - Technical docs
45
-
46
- ┌───────────────────────────────────────────────────────────────┐
47
- │ TROUBLESHOOTING │
48
- └───────────────────────────────────────────────────────────────┘
49
-
50
- Issue: "API key not configured"
51
- → Set: $env:GOOGLE_API_KEY="your-key"
52
-
53
- Issue: "Frontend not found"
54
- → Run: cd FRRONTEEEND && npm run build
55
-
56
- Issue: "Module not found"
57
- → Run: pip install -r requirements.txt
58
-
59
- ┌───────────────────────────────────────────────────────────────┐
60
- │ API ENDPOINTS │
61
- └───────────────────────────────────────────────────────────────┘
62
-
63
- POST /chat - Chat with Gemini agent
64
- POST /run - Full ML workflow
65
- POST /profile - Quick dataset profiling
66
- GET /tools - List available tools
67
- GET /docs - API documentation
68
-
69
- ╔═══════════════════════════════════════════════════════════════╗
70
- ║ Ready to start? Run: .\start.ps1 ║
71
- ╚═══════════════════════════════════════════════════════════════╝
 
 
README.md CHANGED
@@ -1,632 +1,365 @@
1
- # Data Science Agent 🤖
2
 
3
- A production-grade **autonomous AI agent** for end-to-end data science workflows. Upload datasets, describe your goal in natural language, and let the AI handle profiling, cleaning, feature engineering, model training, and visualization.
4
 
5
- **Key Differentiator**: Not just a chatbot - a true AI agent with 75+ specialized tools, intelligent orchestration, dual LLM support, session memory, code interpreter, and Cloud Run API.
6
 
7
- ---
8
-
9
- > ## 🎉 **NEW: Modern React Frontend!**
10
- >
11
- > The application now features a **professional React-based web interface** with a beautiful landing page and chat UI, replacing the old Gradio interface.
12
- >
13
- > **Quick Start:**
14
- > ```powershell
15
- > .\start.ps1 # Windows
16
- > ```
17
- > or
18
- > ```bash
19
- > ./start.sh # Linux/Mac
20
- > ```
21
- >
22
- > 📖 **[See Full Frontend Integration Guide →](FRONTEND_INTEGRATION.md)**
23
 
24
  ---
25
 
26
- ## 🎯 Project Vision
27
 
28
- Build an **autonomous data science system** that achieves **50-70th percentile performance** on Kaggle competitions through intelligent automation, proving AI agents can handle real-world ML workflows end-to-end.
29
-
30
- ---
 
 
 
31
 
32
- ## Core Features
33
-
34
- ### **🤖 Intelligent Agent System**
35
- - **82+ Specialized Tools** across 11 categories (profiling, cleaning, feature engineering, ML, visualization, BigQuery)
36
- - **Dual LLM Support**: Groq (llama-3.3-70b) + Google Gemini (2.0-flash-exp)
37
- - **Smart Orchestration**: LLM-powered function calling with intelligent tool chaining
38
- - **Session Memory**: Contextual awareness across conversations ("cross-validate it", "try with Ridge")
39
- - **Code Interpreter**: Write and execute custom Python code for tasks beyond predefined tools
40
- - **Error Recovery**: Automatic retry with corrected parameters
41
- - **Reasoning Modules**: Dedicated LLM reasoning layer with 19 specialized functions
42
- - **Cloud Integration**: BigQuery data access + GCS artifact storage
43
-
44
- ### 🎨 **Multiple Interfaces**
45
- - **Gradio Web UI** (`chat_ui.py`): Upload files, chat interface, visual plots
46
- - **CLI Interface** (`src/cli.py`): Command-line workflow automation
47
- - **REST API** (`src/api/app.py`): Cloud Run-ready FastAPI wrapper
48
- - **Python SDK**: Direct programmatic access
49
 
50
  ### 📊 **Complete ML Pipeline**
51
- 1. **Data Profiling** Statistics, types, quality issues
52
- 2. **Data Cleaning** Smart imputation, outlier handling, type conversion
53
- 3. **Feature Engineering** Time features, encoding, interactions, ratios
54
- 4. **Model Training** XGBoost, LightGBM, CatBoost, ensemble methods
55
- 5. **Hyperparameter Tuning** Optuna-based optimization
56
- 6. **Visualization** Matplotlib, Plotly, interactive dashboards
57
- 7. **EDA Reports** Sweetviz, ydata-profiling HTML reports
58
- 8. **Explainability** SHAP values, feature importance
59
-
60
- ### ⚡ **Performance & Scale**
61
- - **Token Optimization**: 34% reduction in LLM context (compressed tool schemas)
62
- - **SQLite Caching**: Memoization of expensive operations with TTL
63
- - **Polars & DuckDB**: 10-100x faster than pandas for large datasets
64
- - **Rate Limiting**: Intelligent API call management (Groq: 12K TPM, Gemini: 10 RPM)
65
- - **Cloud Ready**: FastAPI service for Google Cloud Run deployment
66
-
67
- ---
68
-
69
- ## 🏗️ Architecture
70
-
71
- ### **System Design**
72
-
73
- ```
74
- ┌─────────────────────────────────────────────────────────────┐
75
- │ User Interfaces │
76
- │ Gradio UI │ CLI │ REST API │ Python SDK │
77
- └─────────────────────────┬───────────────────────────────────┘
78
-
79
- ┌─────────────────────────────────────────────────────────────┐
80
- │ DataScienceCopilot Orchestrator │
81
- │ • LLM Function Calling (Groq/Gemini) │
82
- │ • Session Memory Management │
83
- │ • Tool Execution & Chaining │
84
- │ • Error Recovery & Retry Logic │
85
- └─────────────────────────┬───────────────────────────────────┘
86
-
87
- ┌─────────────────────────────────────────────────────────────┐
88
- │ 75+ Specialized Tools │
89
- │ Data Profiling │ Cleaning │ Feature Engineering │
90
- │ Model Training │ Visualization │ EDA Reports │
91
- │ NLP/Text │ Computer Vision │ Time Series │ MLOps │
92
- └─────────────────────────┬───────────────────────────────────┘
93
-
94
- ┌─────────────────────────────────────────────────────────────┐
95
- │ Execution & Storage Backends │
96
- │ Local: Polars, sklearn, XGBoost │
97
- │ Cloud: BigQuery, Vertex AI, Cloud Storage (planned) │
98
- │ Cache: SQLite with TTL │
99
- └─────────────────────────────────────────────────────────────┘
100
- ```
101
-
102
- ### **Tech Stack**
103
-
104
- | Layer | Technologies |
105
- |-------|-------------|
106
- | **LLM** | Groq (llama-3.3-70b), Google Gemini (2.0-flash-exp) |
107
- | **Data Processing** | Polars, DuckDB, Pandas, PyArrow, BigQuery |
108
- | **ML/AI** | scikit-learn, XGBoost, LightGBM, CatBoost, Optuna |
109
- | **Visualization** | Matplotlib, Seaborn, Plotly |
110
- | **EDA Reports** | Sweetviz, ydata-profiling |
111
- | **Explainability** | SHAP, LIME |
112
- | **APIs** | FastAPI, Uvicorn |
113
- | **UI** | Gradio, Typer + Rich (CLI) |
114
- | **Storage** | SQLite (cache), CSV, Parquet, Google Cloud Storage |
115
- | **Cloud** | Google Cloud Run, BigQuery, GCS, Vertex AI (planned) |
116
 
117
  ---
118
 
119
  ## 🚀 Quick Start
120
 
121
- ### **Prerequisites**
122
- - Python 3.9+
123
- - API Keys: [Groq](https://console.groq.com) or [Google AI Studio](https://makersuite.google.com/app/apikey)
 
124
 
125
- ### **Installation**
126
 
 
127
  ```bash
128
- # Clone repository
129
- git clone https://github.com/Surfing-Ninja/Data-Science-Agent.git
130
- cd Data-Science-Agent
131
-
132
- # Create virtual environment
133
- python -m venv .venv
134
- source .venv/bin/activate # On Windows: .venv\Scripts\activate
135
-
136
- # Install dependencies
137
- pip install -r requirements.txt
138
 
139
- # Set up environment variables
 
140
  cp .env.example .env
141
- # Edit .env and add your API keys:
142
- # GROQ_API_KEY=your_groq_key
143
- # GOOGLE_API_KEY=your_google_key (optional)
144
- # LLM_PROVIDER=groq # or "gemini"
145
  ```
146
 
147
- ### **Usage Examples**
148
-
149
- #### **1. Gradio Web UI** (Recommended for beginners)
150
  ```bash
151
- python chat_ui.py
152
- # Opens at http://localhost:7860
153
- # Upload CSV → Ask: "Analyze this data and predict house prices"
154
  ```
155
 
156
- #### **2. CLI Interface**
157
  ```bash
158
- # Complete workflow
159
- python src/cli.py analyze data.csv --target price --task "Predict house prices"
160
-
161
- # Quick profiling
162
- python src/cli.py profile data.csv
163
-
164
- # Train models only
165
- python src/cli.py train cleaned.csv Survived --task-type classification
166
  ```
167
 
168
- #### **3. Python SDK**
169
- ```python
170
- from src.orchestrator import DataScienceCopilot
171
-
172
- # Initialize agent
173
- agent = DataScienceCopilot(
174
- provider="groq", # or "gemini"
175
- reasoning_effort="medium"
176
- )
177
-
178
- # Run workflow
179
- result = agent.analyze(
180
- file_path="titanic.csv",
181
- task_description="Build a model to predict passenger survival",
182
- target_col="Survived"
183
- )
184
-
185
- print(f"Status: {result['status']}")
186
- print(f"Best Model: {result['best_model']}")
187
- print(f"Accuracy: {result['best_score']}")
188
  ```
189
 
190
- #### **4. REST API** (Cloud Run Ready)
191
  ```bash
192
- # Start local server
193
- cd src/api
194
- python app.py
195
- # Server runs at http://localhost:8080
196
-
197
- # Make API call
198
- curl -X POST http://localhost:8080/run \
199
- -F "file=@data.csv" \
200
- -F "task_description=Analyze and predict churn" \
201
- -F "target_col=churn"
202
  ```
203
 
204
- ---
205
-
206
- ## 📁 Project Structure
207
-
208
- ```
209
- Data-Science-Agent/
210
- ├── src/
211
- │ ├── orchestrator.py # Main agent brain (1,136 lines)
212
- │ ├── cli.py # CLI interface (346 lines)
213
- │ ├── api/
214
- │ │ └── app.py # FastAPI Cloud Run wrapper (331 lines)
215
- │ ├── bigquery/ # BigQuery integration 🆕
216
- │ │ ├── __init__.py # BigQuery tools (4 functions)
217
- │ │ └── client.py # BigQuery client wrapper
218
- │ ├── storage/ # Artifact storage 🆕
219
- │ │ ├── artifact_store.py # Local + GCS backends (613 lines)
220
- │ │ └── helpers.py # Storage helper functions (125 lines)
221
- │ ├── reasoning/ # LLM reasoning layer 🆕
222
- │ │ ├── __init__.py # Core reasoning engine (350 lines)
223
- │ │ ├── data_understanding.py # Data insights (6 functions)
224
- │ │ ├── model_explanation.py # Model interpretation (6 functions)
225
- │ │ └── business_summary.py # Business translations (7 functions)
226
- │ ├── cache/
227
- │ │ └── cache_manager.py # SQLite caching with TTL
228
- │ ├── tools/ # 82+ specialized tools
229
- │ │ ├── data_profiling.py # Dataset analysis
230
- │ │ ├── data_cleaning.py # Cleaning & preprocessing
231
- │ │ ├── feature_engineering.py # Feature creation
232
- │ │ ├── model_training.py # ML training
233
- │ │ ├── visualization_engine.py # Matplotlib/Seaborn plots
234
- │ │ ├── plotly_visualizations.py # Interactive charts
235
- │ │ ├── eda_reports.py # Sweetviz, ydata-profiling
236
- │ │ ├── advanced_*.py # Advanced features
237
- │ │ └── tools_registry.py # All 82 tool definitions (1,600+ lines)
238
- │ └── utils/ # Helper utilities
239
- │ ├── polars_helpers.py # Data manipulation
240
- │ └── validation.py # Input validation
241
- ├── chat_ui.py # Gradio web interface (912 lines)
242
- ├── examples/
243
- │ └── titanic_example.py # Complete workflow demo
244
- ├── outputs/
245
- │ ├── data/ # Processed datasets
246
- │ ├── models/ # Trained models (.pkl)
247
- │ ├── plots/ # Visualizations (.png, .html)
248
- │ └── reports/ # EDA reports (.html)
249
- ├── cache_db/ # SQLite cache storage
250
- ├── requirements.txt # Python dependencies
251
- ├── .env.example # Environment template
252
- └── README.md # This file
253
- ```
254
 
255
  ---
256
 
257
- ## 🛠️ Tool Categories (82 Tools Total)
258
-
259
- ### **📊 Data Profiling & Analysis (7 tools)**
260
- - `profile_dataset`, `detect_data_quality_issues`, `analyze_correlations`, `get_smart_summary`, `compare_datasets`, `calculate_statistics`, `detect_skewness`
261
-
262
- ### **☁️ BigQuery Integration (4 tools)** 🆕
263
- - `bigquery_profile_table`, `bigquery_load_table`, `bigquery_execute_query`, `bigquery_write_results`
264
-
265
- ### **🧹 Data Cleaning (8 tools)**
266
- - `clean_missing_values`, `handle_outliers`, `remove_duplicates`, `filter_rows`, `rename_columns`, `drop_columns`, `sort_data`, `fix_data_types`
267
-
268
- ### **🔧 Feature Engineering (13 tools)**
269
- - `encode_categorical`, `force_numeric_conversion`, `smart_type_inference`, `create_time_features`, `create_interaction_features`, `create_aggregation_features`, `create_ratio_features`, `create_statistical_features`, `create_log_features`, `create_binned_features`, `engineer_text_features`, `auto_feature_engineering`, `auto_feature_selection`
270
-
271
- ### **🤖 Model Training & Tuning (6 tools)**
272
- - `train_baseline_models`, `hyperparameter_tuning`, `train_ensemble_models`, `perform_cross_validation`, `generate_model_report`, `auto_ml_pipeline`
273
-
274
- ### **📈 Visualization (11 tools)**
275
- - `generate_all_plots`, `generate_data_quality_plots`, `generate_eda_plots`, `generate_model_performance_plots`, `generate_feature_importance_plot`, `generate_interactive_scatter`, `generate_interactive_histogram`, `generate_interactive_correlation_heatmap`, `generate_interactive_box_plots`, `generate_interactive_time_series`, `generate_plotly_dashboard`
276
-
277
- ### **📊 EDA Reports (3 tools)**
278
- - `generate_sweetviz_report`, `generate_ydata_profiling_report`, `generate_combined_eda_report`
279
 
280
- ### **🔬 Advanced Analysis (11 tools)**
281
- - `perform_eda_analysis`, `detect_model_issues`, `detect_anomalies`, `detect_and_handle_multicollinearity`, `perform_statistical_tests`, `analyze_root_cause`, `detect_trends_and_seasonality`, `detect_anomalies_advanced`, `perform_hypothesis_testing`, `analyze_distribution`, `perform_segment_analysis`
282
 
283
- ### **📝 Data Wrangling (3 tools)**
284
- - `merge_datasets`, `concat_datasets`, `reshape_dataset`
 
 
 
 
 
 
 
285
 
286
- ### **🚀 MLOps & Production (5 tools)**
287
- - `monitor_model_drift`, `explain_predictions`, `generate_model_card`, `perform_ab_test_analysis`, `detect_feature_leakage`
288
 
289
- ### **⏰ Time Series (3 tools)**
290
- - `forecast_time_series`, `detect_seasonality_trends`, `create_time_series_features`
291
-
292
- ### **💼 Business Intelligence (4 tools)**
293
- - `perform_cohort_analysis`, `perform_rfm_analysis`, `detect_causal_relationships`, `generate_business_insights`
294
-
295
- ### **📚 NLP/Text (4 tools)**
296
- - `perform_topic_modeling`, `perform_named_entity_recognition`, `analyze_sentiment_advanced`, `perform_text_similarity`
297
 
298
- ### **🖼️ Computer Vision (3 tools)**
299
- - `extract_image_features`, `perform_image_clustering`, `analyze_tabular_image_hybrid`
300
 
301
- ---
302
 
303
- ## 🎯 Advanced Features
304
 
305
- ### **1. Session Memory**
306
- The agent remembers context across conversations:
307
 
308
- ```python
309
- # Conversation 1
310
- "Train a model on earthquake.csv to predict magnitude"
311
- → Agent trains XGBoost, achieves 0.92 R²
312
-
313
- # Conversation 2 (Same session)
314
- "Cross-validate it"
315
- → Agent knows: model=XGBoost, dataset=earthquake.csv, target=magnitude
316
- → Runs 5-fold CV automatically
317
  ```
318
 
319
- ### **2. Code Interpreter**
320
- Execute custom Python code for tasks beyond predefined tools:
321
 
322
- ```python
323
- User: "Make a Plotly scatter with custom dropdown filters"
324
 
325
- Agent: execute_python_code(code='''
326
- import plotly.graph_objects as go
327
- df = pd.read_csv('./temp/data.csv')
328
- # Custom visualization code...
329
- fig.write_html('./outputs/code/custom_plot.html')
330
- ''')
331
  ```
332
-
333
- ### **3. Token Optimization**
334
- System stays under LLM token limits even with 75 tools:
335
-
336
- | Component | Before | After | Savings |
337
- |-----------|--------|-------|---------|
338
- | Tool Schemas | 8,193 tokens | 5,463 tokens | 34% |
339
- | Tool Results | 5,000+ tokens | 50-200 tokens | 90%+ |
340
-
341
- ### **4. Error Recovery**
342
- Agent learns from errors and auto-corrects:
343
-
344
- ```python
345
- # Attempt 1
346
- train_baseline_models(target_col="magnitude")
347
- Error: Column 'magnitude' not found. Hint: Did you mean 'mag'?
348
-
349
- # Attempt 2 (Automatic)
350
- train_baseline_models(target_col="mag")
351
- → Success! Trained 4 models, best: XGBoost (0.92 R²)
 
 
 
 
 
 
 
352
  ```
353
 
354
  ---
355
 
356
- ## ☁️ Cloud Features
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
357
 
358
- ### **1. BigQuery Integration** 🆕
359
- Direct access to BigQuery tables without local downloads:
360
 
361
- ```python
362
- # Profile a BigQuery table
363
- agent.chat("Profile the table project.dataset.sales")
364
 
365
- # Query and analyze
366
- agent.chat("Query top 10 customers by revenue from BigQuery")
367
 
368
- # Write results back
369
- agent.chat("Write the cleaned data to BigQuery table project.dataset.sales_clean")
 
370
  ```
371
 
372
- **Available Tools:**
373
- - `bigquery_profile_table`: Get statistics for any BigQuery table
374
- - `bigquery_load_table`: Load BigQuery data into local Polars DataFrame
375
- - `bigquery_execute_query`: Run SQL queries directly on BigQuery
376
- - `bigquery_write_results`: Write processed data back to BigQuery
377
 
378
- **Setup:**
379
  ```bash
380
- # Install BigQuery dependencies
381
- pip install google-cloud-bigquery db-dtypes
382
-
383
- # Set environment variable
384
- export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
385
  ```
386
 
387
- **Looker-Compatible Schemas:**
388
-
389
- The project defines stable BigQuery table schemas for BI tools (see [`BIGQUERY_SCHEMAS.md`](BIGQUERY_SCHEMAS.md)):
390
- - 📊 `model_metrics` - Model performance tracking over time
391
- - 🎯 `feature_importance` - Feature impact analysis
392
- - 🔮 `predictions` - Prediction monitoring with actuals
393
- - 📋 `data_profile_summary` - Data quality metrics
394
 
395
- **Design Principles:**
396
- - Stable schemas (no breaking changes without versioning)
397
- - Consistent snake_case naming
398
- - Clear dimension/metric separation
399
- - Dashboard-ready with sample Looker views
400
 
401
- ### **2. Artifact Storage** 🆕
402
- Unified storage abstraction - switch between local and GCS with zero code changes:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
403
 
404
- ```python
405
- # Local storage (default)
406
- agent.save_model(model, "my_model.pkl")
407
- # → Saves to outputs/models/my_model.pkl
408
 
409
- # GCS storage (automatic when GCS credentials present)
410
- agent.save_model(model, "my_model.pkl")
411
- # → Saves to gs://your-bucket/models/my_model_v1.pkl with versioning
412
- ```
413
 
414
- **Features:**
415
- - **Automatic Backend Selection**: Uses GCS if credentials available, falls back to local
416
- - **Versioning**: Automatic version suffixes for GCS artifacts
417
- - **Metadata**: Stores creation time, size, checksums
418
- - **Unified API**: Same code works for local and cloud storage
419
 
420
- **Setup:**
421
  ```bash
422
- # Install GCS dependencies
423
- pip install google-cloud-storage
424
 
425
- # Set bucket (optional, defaults to local)
426
- export GCS_BUCKET="your-gcs-bucket-name"
427
- ```
428
 
429
- ### **3. Reasoning Modules** 🆕
430
- Dedicated LLM reasoning layer with clear boundaries (no raw data access, no training decisions):
431
-
432
- ```python
433
- from reasoning.data_understanding import explain_dataset
434
- from reasoning.model_explanation import explain_model_performance
435
- from reasoning.business_summary import create_executive_summary
436
-
437
- # Data insights
438
- insights = explain_dataset(summary={
439
- "rows": 10000,
440
- "columns": 20,
441
- "missing_values": {"age": {"count": 150, "percentage": 1.5}}
442
- })
443
-
444
- # Model explanations
445
- explanation = explain_model_performance(metrics={
446
- "accuracy": 0.95,
447
- "precision": 0.92,
448
- "recall": 0.88
449
- }, task_type="classification")
450
-
451
- # Business summaries
452
- summary = create_executive_summary(
453
- project_results={"model_accuracy": 0.95},
454
- project_name="churn_prediction",
455
- business_objective="Reduce customer churn"
456
- )
457
- ```
458
 
459
- **19 Reasoning Functions:**
460
- - **Data Understanding**: explain_dataset, suggest_transformations, identify_feature_engineering_opportunities, explain_missing_values, compare_datasets (6 functions)
461
- - **Model Explanation**: explain_model_performance, interpret_feature_importance, diagnose_model_failure, explain_prediction, compare_models, explain_overfitting (6 functions)
462
- - **Business Summary**: create_executive_summary, estimate_business_impact, create_stakeholder_report, translate_technical_to_business, prioritize_next_steps, explain_to_customer, assess_deployment_readiness (7 functions)
463
 
464
- **Design Principles:**
465
- - ✅ **NO Raw Data Access**: Only summaries/statistics allowed
466
- - ✅ **NO Training Decisions**: Only explanations, never execution
467
- - ✅ **Structured Output**: JSON schemas for cacheability
468
- - ✅ **Dual Backend**: Works with both Gemini and Groq
469
 
470
  ---
471
 
472
- ## 🔧 Configuration
473
 
474
- ### **Environment Variables** (`.env`)
 
 
 
 
 
 
475
 
476
- ```bash
477
- # LLM Provider
478
- LLM_PROVIDER=groq # "groq" or "gemini"
479
- GROQ_API_KEY=your_groq_key
480
- GOOGLE_API_KEY=your_google_key # Optional
481
-
482
- # Model Selection
483
- GROQ_MODEL=llama-3.3-70b-versatile
484
- GEMINI_MODEL=gemini-2.0-flash-exp
485
- REASONING_EFFORT=medium # low, medium, high
486
-
487
- # Cache Settings
488
- CACHE_DB_PATH=./cache_db/cache.db
489
- CACHE_TTL_SECONDS=86400 # 24 hours
490
 
491
- # Cloud Features (Optional)
492
- GCS_BUCKET=your-gcs-bucket-name # For artifact storage
493
- GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-key.json # For BigQuery + GCS
494
 
495
- # Cloud Run (for API deployment)
496
- PORT=8080
497
  ```
498
 
499
- ### **Provider Comparison**
 
 
 
 
500
 
501
- | Feature | Groq | Gemini |
502
- |---------|------|--------|
503
- | **Model** | llama-3.3-70b-versatile | gemini-2.0-flash-exp |
504
- | **Speed** | Extremely fast (LPU) | 🚀 Very fast |
505
- | **Free Tier** | 100K tokens/day | 1,500 requests/day |
506
- | **Rate Limit** | 12K tokens/min | 10 requests/min |
507
- | **Best For** | High-volume, low-latency | Free tier, high quota |
508
 
509
  ---
510
 
511
- ## 🚀 Cloud Deployment (Google Cloud Run)
512
-
513
- ### **Deploy REST API**
514
 
515
- ```bash
516
- # 1. Build Docker image (Dockerfile provided)
517
- docker build -t data-science-agent .
518
-
519
- # 2. Push to Google Container Registry
520
- gcloud builds submit --tag gcr.io/PROJECT_ID/data-science-agent
521
-
522
- # 3. Deploy to Cloud Run
523
- gcloud run deploy data-science-agent \
524
- --image gcr.io/PROJECT_ID/data-science-agent \
525
- --platform managed \
526
- --region us-central1 \
527
- --allow-unauthenticated \
528
- --memory 4Gi \
529
- --timeout 3600 \
530
- --set-env-vars GROQ_API_KEY=your_key,LLM_PROVIDER=groq
531
-
532
- # 4. Test deployment
533
- curl -X POST https://your-service-url/run \
534
- -F "file=@data.csv" \
535
- -F "task_description=Predict churn"
536
- ```
537
-
538
- ### **API Endpoints**
539
 
540
- - `GET /` - Health check
541
- - `GET /health` - Readiness probe
542
- - `POST /run` - Full analysis workflow
543
- - `POST /profile` - Quick dataset profiling
544
- - `GET /tools` - List all available tools
545
 
546
- ---
 
 
 
 
 
 
 
 
 
 
 
 
547
 
548
- ## 🗺️ Roadmap
549
-
550
- ### **Phase 1: Core Agent** ✅ COMPLETE
551
- - [x] 75 specialized tools
552
- - [x] Dual LLM support (Groq + Gemini)
553
- - [x] CLI + Gradio UI
554
- - [x] SQLite caching
555
- - [x] Token optimization
556
-
557
- ### **Phase 2: Intelligence** ✅ COMPLETE
558
- - [x] Session memory
559
- - [x] Code interpreter
560
- - [x] Error recovery
561
- - [x] EDA reports (Sweetviz, ydata-profiling)
562
- - [x] Interactive Plotly visualizations
563
-
564
- ### **Phase 3: Cloud Native** ✅ COMPLETE
565
- - [x] FastAPI Cloud Run wrapper with 4 REST endpoints
566
- - [x] BigQuery integration (4 tools: profile, load, query, write)
567
- - [x] Artifact Storage abstraction (Local ↔ GCS switching)
568
- - [x] Reasoning modules for LLM explanations (19 functions)
569
- - [x] Looker-compatible BigQuery schemas (4 stable tables)
570
- - [ ] Vertex AI model training (planned)
571
- - [ ] Cloud Logging & Monitoring (planned)
572
-
573
- ### **Phase 4: Enterprise** 📋 PLANNED
574
- - [ ] Multi-user authentication
575
- - [ ] Team workspaces
576
- - [ ] Model registry
577
- - [ ] Automated retraining pipelines
578
-
579
- ### **Phase 5: Kaggle Integration** 🎯 FUTURE
580
- - [ ] Direct Kaggle API integration
581
- - [ ] Automated competition workflow
582
- - [ ] Ensemble strategies
583
- - [ ] Submission automation
584
 
585
  ---
586
 
587
  ## 🤝 Contributing
588
 
589
- Contributions welcome! Areas for improvement:
590
-
591
- 1. **New Tools**: Time series forecasting, NLP preprocessing, image augmentation
592
- 2. **Cloud Backends**: AWS, Azure support
593
- 3. **Performance**: Optimize tool execution, reduce latency
594
- 4. **UI/UX**: Better visualization, workflow builder
595
- 5. **Documentation**: Tutorials, video guides, blog posts
596
 
597
  ---
598
 
599
- ## 📜 License
600
 
601
- MIT License - See LICENSE file for details
602
 
603
  ---
604
 
605
- ## 📧 Support & Community
606
 
607
- - **Issues**: [GitHub Issues](https://github.com/Surfing-Ninja/Data-Science-Agent/issues)
608
- - **Discussions**: [GitHub Discussions](https://github.com/Surfing-Ninja/Data-Science-Agent/discussions)
 
 
 
609
 
610
  ---
611
 
612
- ## 📊 Project Stats
613
 
614
- - **Lines of Code**: ~18,000+
615
- - **Tools**: 82 specialized functions (75 core + 4 BigQuery + 3 storage helpers)
616
- - **Reasoning Functions**: 19 LLM-powered explanation modules
617
- - **Supported Models**: 10+ (LR, Ridge, Lasso, RF, XGBoost, LightGBM, CatBoost, etc.)
618
- - **Visualization Types**: 20+ (static + interactive)
619
- - **Data Formats**: CSV, Parquet, JSON, BigQuery tables
620
- - **Cloud Platforms**: Google Cloud (Run, BigQuery, GCS) - AWS/Azure planned
621
 
622
  ---
623
 
624
  <div align="center">
625
 
626
- **Built with ❤️ for the Data Science Community**
627
-
628
- *"Making data science accessible through AI automation"*
629
 
630
- ⭐ Star this repo if you find it useful!
631
 
632
  </div>
 
1
+ # 🤖 AI-Powered Data Science Agent
2
 
3
+ > **An intelligent autonomous agent that performs end-to-end data science workflows through natural language**
4
 
5
+ Upload your dataset, describe what you want in plain English, and watch as the AI agent handles profiling, cleaning, feature engineering, model training, hyperparameter tuning, and comprehensive reporting - all automatically.
6
 
7
+ [![React](https://img.shields.io/badge/React-19-61DAFB?logo=react)](https://reactjs.org/)
8
+ [![FastAPI](https://img.shields.io/badge/FastAPI-0.109-009688?logo=fastapi)](https://fastapi.tiangolo.com/)
9
+ [![Gemini](https://img.shields.io/badge/Gemini-2.5_Flash-4285F4?logo=google)](https://ai.google.dev/)
10
+ [![Python](https://img.shields.io/badge/Python-3.10+-3776AB?logo=python)](https://python.org/)
 
 
 
 
 
 
 
 
 
 
 
 
11
 
12
  ---
13
 
14
+ ## Key Features
15
 
16
+ ### 🎯 **Autonomous AI Agent**
17
+ - **82+ Specialized ML Tools** organized across data profiling, cleaning, feature engineering, model training, and visualization
18
+ - **Intelligent Orchestration** with Google Gemini 2.5 Flash for function calling and decision-making
19
+ - **Session Memory** for contextual awareness across conversations
20
+ - **Smart Intent Detection** automatically classifies tasks (ML pipeline, cleaning only, visualization, etc.)
21
+ - **Error Recovery** with automatic retry logic and file tracking
22
 
23
+ ### 🎨 **Modern Web Interface**
24
+ - **Beautiful React Frontend** with glassmorphism design and smooth animations
25
+ - **Interactive Chat** with file upload support (CSV, Parquet)
26
+ - **Report Viewer** to view YData profiling and Sweetviz HTML reports in-app
27
+ - **Markdown Support** for formatted responses
28
+ - **Session Management** to maintain conversation history
 
 
 
 
 
 
 
 
 
 
 
29
 
30
  ### 📊 **Complete ML Pipeline**
31
+ 1. **Data Profiling** - Automated statistical analysis and data quality assessment
32
+ 2. **Data Cleaning** - Smart missing value handling, outlier treatment, type conversion
33
+ 3. **Feature Engineering** - Time-based features, encoding, interactions, statistical features
34
+ 4. **Model Training** - Ridge, Lasso, Random Forest, XGBoost, LightGBM, CatBoost
35
+ 5. **Hyperparameter Tuning** - Optuna-based optimization with 50+ trials
36
+ 6. **Cross-Validation** - Stratified K-fold validation for robust evaluation
37
+ 7. **Visualization** - Interactive Plotly dashboards and correlation heatmaps
38
+ 8. **Reporting** - Comprehensive HTML reports with YData Profiling
39
+
40
+ ### ⚡ **Production Ready**
41
+ - **FastAPI Backend** with async support and automatic API documentation
42
+ - **Docker Support** with multi-stage builds for optimized deployment
43
+ - **Rate Limiting** for the Gemini API (6.5 s minimum interval between calls to stay within the 10 RPM limit; see the sketch after this list)
44
+ - **Caching System** for faster repeated queries
45
+ - **CORS Enabled** for frontend-backend communication
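+
+ As a concrete illustration of the throttling above, the agent only needs to guarantee a minimum gap between successive Gemini calls. The sketch below is illustrative only (the class name and loop are hypothetical, not the project's actual implementation); it enforces the 6.5-second spacing:
+
+ ```python
+ import time
+
+ class MinIntervalLimiter:
+     """Block until at least `min_interval` seconds have passed since the previous call."""
+
+     def __init__(self, min_interval: float = 6.5):  # 6.5 s is roughly 9 calls/minute, under the 10 RPM cap
+         self.min_interval = min_interval
+         self._last_call = 0.0
+
+     def wait(self) -> None:
+         elapsed = time.monotonic() - self._last_call
+         if elapsed < self.min_interval:
+             time.sleep(self.min_interval - elapsed)
+         self._last_call = time.monotonic()
+
+ limiter = MinIntervalLimiter()
+
+ for prompt in ["profile the dataset", "train a baseline model"]:
+     limiter.wait()                                # throttle before each Gemini request
+     print(f"calling Gemini with: {prompt!r}")     # stand-in for the actual API call
+ ```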
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
46
 
47
  ---
48
 
49
  ## 🚀 Quick Start
50
 
51
+ ### Prerequisites
52
+ - Python 3.10+
53
+ - Node.js 18+ (for frontend)
54
+ - Google Gemini API key ([Get one here](https://ai.google.dev/))
55
 
56
+ ### Installation
57
 
58
+ **1. Clone the repository**
59
  ```bash
60
+ git clone https://github.com/Pulastya-B/DevSprint-Data-Science-Agent.git
61
+ cd DevSprint-Data-Science-Agent
62
+ ```
 
 
 
 
 
 
 
63
 
64
+ **2. Set up environment variables**
65
+ ```bash
66
  cp .env.example .env
67
+ # Edit .env and add your GOOGLE_API_KEY
 
 
 
68
  ```
69
 
70
+ **3. Install Python dependencies**
 
 
71
  ```bash
72
+ pip install -r requirements.txt
 
 
73
  ```
74
 
75
+ **4. Install frontend dependencies**
76
  ```bash
77
+ cd FRRONTEEEND
78
+ npm install
79
+ npm run build
80
+ cd ..
 
 
 
 
81
  ```
82
 
83
+ **5. Run the application**
84
+
85
+ **Windows:**
86
+ ```powershell
87
+ .\start.ps1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
88
  ```
89
 
90
+ **Linux/Mac:**
91
  ```bash
92
+ chmod +x start.sh
93
+ ./start.sh
 
 
 
 
 
 
 
 
94
  ```
95
 
96
+ The application will be available at **http://localhost:8080**
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
97
 
98
  ---
99
 
100
+ ## 📖 Usage
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
101
 
102
+ ### Web Interface
 
103
 
104
+ 1. **Navigate to http://localhost:8080**
105
+ 2. **Click "Launch Agent"** from the landing page
106
+ 3. **Upload your dataset** (CSV or Parquet format)
107
+ 4. **Type your request** in natural language:
108
+ - "Generate a comprehensive report on this dataset"
109
+ - "Train a model to predict [target_column]"
110
+ - "Clean the data and show me visualizations"
111
+ - "Perform feature engineering and train the best model"
112
+ 5. **View results** in the chat and click "View Report" buttons to see detailed HTML reports
113
 
114
+ ### Example Queries
 
115
 
116
+ ```
117
+ 📊 "Profile this dataset and tell me about data quality issues"
 
 
 
 
 
 
118
 
119
+ 🧹 "Clean the missing values and handle outliers"
 
120
 
121
+ 🎯 "Train a model to predict house prices with target column 'price'"
122
 
123
+ 📈 "Generate a correlation heatmap and feature importance plot"
124
 
125
+ 🔧 "Create time-based features and perform hyperparameter tuning"
 
126
 
127
+ 📋 "Generate a comprehensive YData profiling report"
 
 
 
 
 
 
 
 
128
  ```
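+
+ The same agent can also be reached programmatically. A minimal sketch, assuming the backend is running locally on port 8080 and that `/chat` accepts a JSON body with a `messages` array and returns the reply under a `message` key (the payload shape the built-in chat interface sends); it uses the third-party `requests` package:
+
+ ```python
+ import requests
+
+ resp = requests.post(
+     "http://localhost:8080/chat",
+     json={"messages": [
+         {"role": "user", "content": "Profile this dataset and list data quality issues"}
+     ]},
+     timeout=120,
+ )
+ resp.raise_for_status()
+ print(resp.json().get("message"))   # the assistant's reply
+ ```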
129
 
130
+ ---
 
131
 
132
+ ## 🏗️ Architecture
 
133
 
 
 
 
 
 
 
134
  ```
135
+ ┌─────────────────────────────────────────────────────────────┐
136
+ │ React Frontend (Port 8080) │
137
+ │ Landing Page Chat Interface Report Viewer │
138
+ └─────────────────────────┬───────────────────────────────────┘
139
+
140
+
141
+ ┌─────────────────────────────────────────────────────────────┐
142
+ │ FastAPI Backend (Python 3.10+) │
143
+ │ /chat │ /run │ /outputs │ /api/health │
144
+ └─────────────────────────┬───────────────────────────────────┘
145
+
146
+
147
+ ┌─────────────────────────────────────────────────────────────┐
148
+ │ DataScienceCopilot Orchestrator │
149
+ │ • Gemini 2.5 Flash Integration │
150
+ │ • 82+ Specialized Tools │
151
+ │ • Session Memory & Context │
152
+ │ • Intelligent Intent Detection │
153
+ │ • Error Recovery & Loop Prevention │
154
+ └─────────────────────────┬───────────────────────────────────┘
155
+
156
+
157
+ ┌─────────────────────────────────────────────────────────────┐
158
+ │ Tool Categories │
159
+ │ Profiling │ Cleaning │ Feature Engineering │ ML Training │
160
+ │ Visualization │ EDA Reports │ Data Wrangling │
161
+ └─────────────────────────────────────────────────────────────┘
162
  ```
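+
+ For scripted use outside the web interface, the orchestrator can be driven directly from Python. A minimal sketch based on the `DataScienceCopilot` interface; the exact constructor arguments and result keys may differ in the current code:
+
+ ```python
+ from src.orchestrator import DataScienceCopilot
+
+ # Initialize the agent (assumes GOOGLE_API_KEY is set in the environment / .env)
+ agent = DataScienceCopilot(provider="gemini")
+
+ # Run an end-to-end workflow on a local CSV (illustrative path)
+ result = agent.analyze(
+     file_path="data/titanic.csv",
+     task_description="Build a model to predict passenger survival",
+     target_col="Survived",
+ )
+
+ print(result.get("status"))
+ print(result.get("best_model"), result.get("best_score"))
+ ```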
163
 
164
  ---
165
 
166
+ ## 🛠️ Tech Stack
167
+
168
+ ### Frontend
169
+ - **React 19** - Modern UI library
170
+ - **TypeScript 5.8** - Type-safe development
171
+ - **Vite 6** - Lightning-fast build tool
172
+ - **Tailwind CSS** - Utility-first styling
173
+ - **Framer Motion** - Smooth animations
174
+ - **React Markdown** - Formatted responses
175
+
176
+ ### Backend
177
+ - **FastAPI** - High-performance Python web framework
178
+ - **Google Gemini 2.5 Flash** - LLM for agent orchestration
179
+ - **Polars** - Fast dataframe library (10-100x faster than pandas)
180
+ - **Scikit-learn** - Classical ML algorithms
181
+ - **XGBoost / LightGBM / CatBoost** - Gradient boosting frameworks
182
+ - **Optuna** - Hyperparameter optimization
183
+ - **YData Profiling** - Automated EDA reports
184
+ - **Plotly / Matplotlib** - Interactive visualizations
185
+
186
+ ### DevOps
187
+ - **Docker** - Containerization with multi-stage builds
188
+ - **Python-dotenv** - Environment variable management
189
+ - **SQLite** - Caching layer for performance
190
 
191
+ ---
 
192
 
193
+ ## 🐳 Docker Deployment
 
 
194
 
195
+ **Build and run with Docker:**
 
196
 
197
+ ```bash
198
+ docker build -t ds-agent .
199
+ docker run -p 8080:8080 --env-file .env ds-agent
200
  ```
201
 
202
+ **Or use the deployment script:**
 
 
 
 
203
 
 
204
  ```bash
205
+ .\build-and-deploy.ps1 # Windows
206
+ ./build-and-deploy.sh # Linux/Mac
 
 
 
207
  ```
208
 
209
+ ---
 
 
 
 
 
 
210
 
211
+ ## 📂 Project Structure
 
 
 
 
212
 
213
+ ```
214
+ .
215
+ ├── FRRONTEEEND/ # React frontend
216
+ │ ├── components/ # UI components
217
+ │ │ ├── ChatInterface.tsx # Main chat interface
218
+ │ │ ├── HeroGeometric.tsx # Landing page hero
219
+ │ │ └── ...
220
+ │ ├── dist/ # Built frontend
221
+ │ └── package.json
222
+
223
+ ├── src/ # Python backend
224
+ │ ├── api/
225
+ │ │ └── app.py # FastAPI application
226
+ │ ├── orchestrator.py # Agent orchestrator
227
+ │ ├── session_memory.py # Session management
228
+ │ ├── tools/ # 82+ ML tools
229
+ │ │ ├── data_profiling.py
230
+ │ │ ├── data_cleaning.py
231
+ │ │ ├── feature_engineering.py
232
+ │ │ ├── model_training.py
233
+ │ │ └── ...
234
+ │ └── utils/ # Helper utilities
235
+
236
+ ├── Dockerfile # Multi-stage Docker build
237
+ ├── requirements.txt # Python dependencies
238
+ ├── start.ps1 / start.sh # Quick start scripts
239
+ └── README.md # This file
240
+ ```
241
 
242
+ ---
 
 
 
243
 
244
+ ## 🔑 Environment Variables
 
 
 
245
 
246
+ Create a `.env` file in the root directory:
 
 
 
 
247
 
 
248
  ```bash
249
+ # LLM Provider Configuration
250
+ LLM_PROVIDER=gemini
251
 
252
+ # API Keys
253
+ GOOGLE_API_KEY=your_gemini_api_key_here
 
254
 
255
+ # Model Configuration
256
+ GEMINI_MODEL=gemini-2.5-flash
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
257
 
258
+ # Cache Configuration
259
+ CACHE_DB_PATH=./cache_db/cache.db
260
+ CACHE_TTL_SECONDS=86400
 
261
 
262
+ # Output Configuration
263
+ OUTPUT_DIR=./outputs
264
+ DATA_DIR=./data
265
+ ```
 
266
 
267
  ---
268
 
269
+ ## 🎯 Features in Detail
270
 
271
+ ### Intelligent Intent Detection
272
+ The agent automatically classifies your request and applies the appropriate workflow (a simplified sketch follows this list):
273
+ - **Full ML Pipeline** - Complete end-to-end workflow with training
274
+ - **Exploratory Analysis** - Data profiling and visualization only
275
+ - **Cleaning Only** - Data quality improvements without modeling
276
+ - **Visualization Only** - Generate plots and dashboards
277
+ - **Multi-Intent** - Combine multiple tasks intelligently
278
 
279
+ ### Session Memory
280
+ The agent remembers context across messages:
281
+ ```
282
+ You: "Train a model on this dataset"
283
+ Agent: [Trains XGBoost model with R² = 0.85]
 
 
 
 
 
 
 
 
 
284
 
285
+ You: "Now try hyperparameter tuning"
286
+ Agent: [Automatically uses previous model and dataset]
 
287
 
288
+ You: "Cross-validate it"
289
+ Agent: [Applies CV to tuned model from context]
290
  ```
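+
+ Under the hood, this only requires remembering the artifacts of the most recent steps per session so that follow-ups like "cross-validate it" can be resolved. A minimal sketch (hypothetical structure, not the actual `session_memory.py` implementation):
+
+ ```python
+ from dataclasses import dataclass, field
+ from typing import Dict, Optional
+
+ @dataclass
+ class SessionState:
+     last_dataset: Optional[str] = None   # path of the latest cleaned/engineered file
+     last_model: Optional[str] = None     # e.g. "XGBoost"
+     target_col: Optional[str] = None     # e.g. "price"
+     history: list = field(default_factory=list)
+
+ sessions: Dict[str, SessionState] = {}
+
+ def remember(session_id: str, **updates) -> SessionState:
+     state = sessions.setdefault(session_id, SessionState())
+     for key, value in updates.items():
+         setattr(state, key, value)
+     return state
+
+ # After training, the agent records what it just did...
+ remember("abc123", last_dataset="outputs/data/clean.csv",
+          last_model="XGBoost", target_col="price")
+
+ # ...so a later "cross-validate it" can be resolved without asking again.
+ print(sessions["abc123"].last_model)   # -> XGBoost
+ ```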
291
 
292
+ ### Error Recovery
293
+ - Automatic retry with corrected parameters (illustrated in the sketch below)
294
+ - File existence validation before execution
295
+ - Recovery guidance showing last successful file
296
+ - Loop detection to prevent infinite retries
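+
+ A simplified view of the retry loop described in the first bullet (all names here are hypothetical; in the real agent the error text is fed back to the LLM, which proposes corrected tool arguments):
+
+ ```python
+ def train(target_col: str) -> dict:
+     """Toy tool that only knows a column named 'mag'."""
+     if target_col != "mag":
+         raise ValueError(f"Column '{target_col}' not found. Hint: did you mean 'mag'?")
+     return {"status": "success", "best_model": "XGBoost"}
+
+ def propose_corrected_args(args: dict, error: Exception) -> dict:
+     """Stand-in for the LLM step that rewrites bad arguments from the error text."""
+     fixed = dict(args)
+     if "did you mean 'mag'" in str(error):
+         fixed["target_col"] = "mag"
+     return fixed
+
+ def run_with_recovery(tool, args: dict, max_attempts: int = 3) -> dict:
+     last_error = None
+     for attempt in range(1, max_attempts + 1):
+         try:
+             return tool(**args)
+         except Exception as exc:
+             last_error = exc
+             args = propose_corrected_args(args, exc)   # retry with corrected parameters
+     raise RuntimeError(f"Gave up after {max_attempts} attempts: {last_error}")
+
+ print(run_with_recovery(train, {"target_col": "magnitude"}))
+ # -> {'status': 'success', 'best_model': 'XGBoost'} on the second attempt
+ ```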
297
 
298
+ ### Report Viewing
299
+ - Click "View Report" buttons to see HTML reports in-app
300
+ - Full-screen modal with professional styling
301
+ - Supports YData Profiling, Sweetviz, and custom dashboards
 
 
 
302
 
303
  ---
304
 
305
+ ## 📊 Example Workflow
 
 
306
 
307
+ **Upload:** `earthquake_data.csv` (175K rows, 22 columns)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
308
 
309
+ **Prompt:** "Train a model to predict earthquake magnitude"
 
 
 
 
310
 
311
+ **Agent Actions:**
312
+ 1. ✅ Profiles dataset (175,947 rows, 22 columns)
313
+ 2. ✅ Detects data quality issues (11.67% missing, outliers)
314
+ 3. ✅ Drops high-missing columns (>40% missing)
315
+ 4. ✅ Imputes remaining missing values with median/mode
316
+ 5. ✅ Handles outliers with IQR clipping
317
+ 6. ✅ Extracts time-based features (year, month, hour, cyclical)
318
+ 7. ✅ Encodes categorical variables
319
+ 8. ✅ Trains 6 baseline models (XGBoost wins with R² = 0.716)
320
+ 9. ✅ Performs hyperparameter tuning (R² = 0.743)
321
+ 10. ✅ Runs 5-fold cross-validation (RMSE = 0.167 ± 0.0005)
322
+ 11. ✅ Generates YData profiling report
323
+ 12. ✅ Creates interactive Plotly dashboard
324
 
325
+ **Result:** Trained and tuned XGBoost model ready for deployment!
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
326
 
327
  ---
328
 
329
  ## 🤝 Contributing
330
 
331
+ Contributions are welcome! Please feel free to submit a Pull Request.
 
 
 
 
 
 
332
 
333
  ---
334
 
335
+ ## 📄 License
336
 
337
+ This project is licensed under the MIT License.
338
 
339
  ---
340
 
341
+ ## 🙏 Acknowledgments
342
 
343
+ - **Google Gemini** for powerful LLM capabilities
344
+ - **FastAPI** for excellent async Python framework
345
+ - **React** community for amazing UI libraries
346
+ - **Polars** for blazing-fast data processing
347
+ - **YData Profiling** for comprehensive EDA reports
348
 
349
  ---
350
 
351
+ ## 📧 Contact
352
 
353
+ **Pulastya B**
354
+ - GitHub: [@Pulastya-B](https://github.com/Pulastya-B)
355
+ - Project: [DevSprint-Data-Science-Agent](https://github.com/Pulastya-B/DevSprint-Data-Science-Agent)
 
 
 
 
356
 
357
  ---
358
 
359
  <div align="center">
360
 
361
+ **Built with ❤️ for DevSprint Hackathon**
 
 
362
 
363
+ ⭐ Star this repo if you find it helpful!
364
 
365
  </div>
chat_ui.py DELETED
@@ -1,1073 +0,0 @@
1
- """
2
- AI Agent Data Scientist - Interactive Chat UI
3
- ==============================================
4
-
5
- A simple web interface to interact with your AI Agent.
6
- Upload datasets, ask questions, and get AI-powered insights!
7
- """
8
-
9
- import gradio as gr
10
- import sys
11
- import os
12
- import shutil
13
- from pathlib import Path
14
- import traceback
15
-
16
- # Add src to path
17
- sys.path.append('src')
18
-
19
- from tools.data_profiling import profile_dataset, detect_data_quality_issues
20
- from tools.model_training import train_baseline_models
21
-
22
- # Try to import AI agent (optional)
23
- try:
24
- from orchestrator import DataScienceCopilot
25
- agent = DataScienceCopilot()
26
- AI_ENABLED = True
27
- print("✅ AI Agent loaded successfully!")
28
- print(f"📊 Model: {agent.model}")
29
- print(f"🔧 Tools available: {len(agent.tool_functions)}")
30
- except Exception as e:
31
- print(f"ℹ️ Running in manual mode (AI agent not available)")
32
- print(f" Error: {str(e)}")
33
- print("💡 You can still use all the quick actions and tools!")
34
- AI_ENABLED = False
35
- agent = None
36
-
37
- # Store uploaded file path
38
- current_file = None
39
- current_profile = None
40
- last_agent_response = None # Store last agent response for visualization extraction
41
-
42
-
43
- # Helper functions for Gradio 6.x message format
44
- def add_message(history, role, content):
45
- """Add a message to history in Gradio 6.x format."""
46
- if history is None:
47
- history = []
48
- history.append({"role": role, "content": content})
49
- return history
50
-
51
-
52
- def add_user_message(history, content):
53
- """Add a user message to history."""
54
- return add_message(history, "user", content)
55
-
56
-
57
- def add_assistant_message(history, content):
58
- """Add an assistant message to history."""
59
- return add_message(history, "assistant", content)
60
-
61
-
62
- def update_last_assistant_message(history, content):
63
- """Update the last assistant message in history."""
64
- if history and len(history) > 0 and history[-1].get("role") == "assistant":
65
- history[-1]["content"] = content
66
- return history
67
-
68
-
69
- def get_last_user_content(history):
70
- """Get the content of the last user message."""
71
- if history:
72
- for msg in reversed(history):
73
- if msg.get("role") == "user":
74
- return msg.get("content", "")
75
- return ""
76
-
77
-
78
- def analyze_dataset(file, user_message, history):
79
- """Process uploaded dataset(s) and user message. Supports single or multiple file uploads."""
80
- global current_file, current_profile, last_agent_response
81
-
82
- # Initialize with empty plot list (will collect PNG file paths)
83
- plots_paths = []
84
- html_reports = [] # Initialize HTML reports list
85
-
86
- # Initialize history if None
87
- if history is None:
88
- history = []
89
-
90
- # Debug: Log the call
91
- print(f"[DEBUG] analyze_dataset called - file: {file is not None}, message: '{user_message}', current_file: {current_file}")
92
-
93
- try:
94
- # Handle file uploads (single or multiple)
95
- if file is not None:
96
- # file can be a single filepath or a list of filepaths
97
- files_to_process = file if isinstance(file, list) else [file]
98
-
99
- # Filter out None values
100
- files_to_process = [f for f in files_to_process if f is not None]
101
-
102
- if len(files_to_process) > 0:
103
- print(f"[DEBUG] Processing {len(files_to_process)} file(s) upload")
104
-
105
- # Copy all files to simpler paths
106
- os.makedirs("./temp", exist_ok=True)
107
- processed_files = []
108
- seen_files = {} # Track files by content hash to detect duplicates
109
- duplicate_count = 0
110
-
111
- for uploaded_file in files_to_process:
112
- simple_filename = Path(uploaded_file.name if hasattr(uploaded_file, 'name') else uploaded_file).name
113
- file_source = uploaded_file.name if hasattr(uploaded_file, 'name') else uploaded_file
114
-
115
- # Calculate file hash to detect duplicates (even with different names)
116
- import hashlib
117
- hasher = hashlib.md5()
118
- with open(file_source, 'rb') as f:
119
- # Read file in chunks to handle large files efficiently
120
- for chunk in iter(lambda: f.read(8192), b""):
121
- hasher.update(chunk)
122
- file_hash = hasher.hexdigest()
123
-
124
- # Check if this exact file was already uploaded
125
- if file_hash in seen_files:
126
- print(f"[DEBUG] Duplicate file detected: {simple_filename} (same as {seen_files[file_hash]})")
127
- duplicate_count += 1
128
- continue # Skip duplicate
129
-
130
- # Not a duplicate - process it
131
- simple_path = f"./temp/{simple_filename}"
132
-
133
- # Handle filename collision (different files with same name)
134
- if os.path.exists(simple_path):
135
- # Check if existing file is the same (by comparing with already processed files)
136
- existing_in_processed = simple_path in processed_files
137
- if not existing_in_processed:
138
- # Different file with same name - add suffix
139
- base_name = Path(simple_filename).stem
140
- extension = Path(simple_filename).suffix
141
- counter = 1
142
- while os.path.exists(f"./temp/{base_name}_{counter}{extension}"):
143
- counter += 1
144
- simple_filename = f"{base_name}_{counter}{extension}"
145
- simple_path = f"./temp/{simple_filename}"
146
- print(f"[DEBUG] Filename collision - renamed to: {simple_filename}")
147
-
148
- shutil.copy2(file_source, simple_path)
149
- processed_files.append(simple_path)
150
- seen_files[file_hash] = simple_filename
151
- print(f"[DEBUG] Copied file to: {simple_path}")
152
-
153
- # Set current_file to the first file (for single-file operations)
154
- # For multi-file operations, the agent will use all files from ./temp/
155
- current_file = processed_files[0] if processed_files else None
156
-
157
- # Only show file upload response if there's no user message
158
- if not (user_message and user_message.strip()):
159
- if len(processed_files) == 0:
160
- # All files were duplicates
161
- response = f"⚠️ **No New Files Uploaded**\n\n"
162
- response += f"All {len(files_to_process)} file(s) were duplicates of already uploaded files.\n\n"
163
- response += "Your previously uploaded dataset is still active."
164
- elif len(processed_files) == 1:
165
- # Single file upload - show detailed profile
166
- response = f"📊 **Dataset Uploaded Successfully!**\n\n"
167
- if duplicate_count > 0:
168
- response += f"ℹ️ *({duplicate_count} duplicate file(s) were skipped)*\n\n"
169
- response += f"**File:** {Path(current_file).name}\n\n"
170
-
171
- # Get basic profile
172
- profile = profile_dataset(current_file)
173
- current_profile = profile
174
-
175
- response += f"**Dataset Overview:**\n"
176
- response += f"- Rows: {profile['shape']['rows']:,}\n"
177
- response += f"- Columns: {profile['shape']['columns']}\n"
178
-
179
- # Handle memory_usage (can be float or dict)
180
- memory = profile.get('memory_usage', 0)
181
- if isinstance(memory, dict):
182
- memory = memory.get('total_mb', 0)
183
- response += f"- Memory: {memory:.2f} MB\n\n"
184
-
185
- response += f"**Column Types:**\n"
186
- response += f"- Numeric: {len(profile['column_types']['numeric'])} columns\n"
187
- response += f"- Categorical: {len(profile['column_types']['categorical'])} columns\n"
188
- response += f"- Datetime: {len(profile['column_types']['datetime'])} columns\n\n"
189
-
190
- # Check data quality
191
- quality = detect_data_quality_issues(current_file)
192
- if quality['critical']:
193
- response += f"🔴 **Critical Issues:** {len(quality['critical'])}\n"
194
- for issue in quality['critical'][:3]:
195
- response += f" - {issue['message']}\n"
196
- if quality['warning']:
197
- response += f"🟡 **Warnings:** {len(quality['warning'])}\n"
198
- for issue in quality['warning'][:3]:
199
- response += f" - {issue['message']}\n"
200
- else:
201
- # Multiple files uploaded
202
- response = f"📊 **{len(processed_files)} Datasets Uploaded Successfully!**\n\n"
203
- if duplicate_count > 0:
204
- response += f"ℹ️ *({duplicate_count} duplicate file(s) were skipped)*\n\n"
205
- response += f"**Files:**\n"
206
- for i, fp in enumerate(processed_files, 1):
207
- response += f"{i}. {Path(fp).name}\n"
208
- response += f"\n**💡 You can now use multi-dataset operations!**\n\n"
209
-
210
- response += f"\n\n💬 **What would you like to do with {'this dataset' if len(processed_files) == 1 else 'these datasets'}?**\n\n"
211
- response += "You can ask me to:\n"
212
- if len(processed_files) > 1:
213
- response += "- **Merge these datasets** (e.g., 'merge customers and orders on customer_id')\n"
214
- response += "- **Combine/concatenate** them (e.g., 'combine all monthly sales files')\n"
215
- response += "- Train a classification or regression model\n"
216
- response += "- Analyze specific columns\n"
217
- response += "- Detect outliers\n"
218
- response += "- Engineer features\n"
219
- response += "- Generate predictions\n"
220
- response += "- And much more!\n"
221
-
222
- # Add assistant message to history
223
- history = add_assistant_message(history, response)
224
- yield history, "", [], []
225
- return
226
- # If user uploaded file AND sent a message, don't return - continue to process the message
227
- elif user_message and user_message.strip():
228
- # Continue processing the message below
229
- pass
230
-
231
- # If user sends a message about the current file
232
- print(f"[DEBUG] Checking message conditions: user_message={bool(user_message and user_message.strip())}, current_file={bool(current_file)}")
233
- if user_message and user_message.strip() and current_file:
234
- print(f"[DEBUG] User message detected. AI_ENABLED={AI_ENABLED}, agent={agent is not None}")
235
- if AI_ENABLED and agent:
236
- print(f"[DEBUG] Entering AI Agent block...")
237
- try:
238
- # Show immediate processing message
239
- print(f"🤖 AI Agent analyzing: {user_message}")
240
- history = add_user_message(history, user_message)
241
- history = add_assistant_message(history, "🤖 **AI Agent is thinking...**\n\n⏳ Analyzing your request and planning the workflow...")
242
- yield history, "", [], []
243
-
244
- # Use the AI agent to process the request
245
- print(f"📂 File path: {current_file}")
246
- print(f"📝 Task: {user_message}")
247
- print(f"🚀 Calling agent.analyze()...")
248
-
249
- agent_response = agent.analyze(
250
- file_path=current_file,
251
- task_description=user_message,
252
- use_cache=False, # Disable cache to avoid dict hashing issues
253
- stream=False
254
- )
255
-
256
- print(f"✅ Agent response received: {agent_response.get('status', 'unknown')}")
257
-
258
- # Store agent response for visualization extraction
259
- last_agent_response = agent_response
260
-
261
- # Format the response
262
- if agent_response.get('status') == 'success':
263
- response = f"🤖 **AI Agent Analysis Complete!**\n\n"
264
- response += f"{agent_response.get('summary', '')}\n\n"
265
-
266
- if 'workflow_history' in agent_response and agent_response['workflow_history']:
267
- response += f"**Execution Summary:**\n"
268
- response += f"- Tools Executed: {len(agent_response['workflow_history'])}\n"
269
- response += f"- Iterations: {agent_response.get('iterations', 0)}\n"
270
- response += f"- Time: {agent_response.get('execution_time', 0):.1f}s\n\n"
271
-
272
- # Find and display MODEL TRAINING RESULTS with ALL METRICS
273
- model_results = None
274
- for step in agent_response['workflow_history']:
275
- if step.get('tool') == 'train_baseline_models':
276
- result = step.get('result', {})
277
- if isinstance(result, dict) and 'result' in result:
278
- model_results = result['result']
279
- elif isinstance(result, dict):
280
- model_results = result
281
- break
282
-
283
- if model_results and 'models' in model_results:
284
- response += f"## 🎯 Model Training Results\n\n"
285
- task_type = model_results.get('task_type', 'unknown')
286
- response += f"**Task Type:** {task_type.title()}\n"
287
- response += f"**Features:** {model_results.get('n_features', 0)}\n"
288
- response += f"**Training Samples:** {model_results.get('train_size', 0):,}\n"
289
- response += f"**Test Samples:** {model_results.get('test_size', 0):,}\n\n"
290
-
291
- # Show ALL models tested
292
- response += "### 📊 All Models Tested:\n\n"
293
- models_data = model_results.get('models', {})
294
-
295
- for model_name, model_info in models_data.items():
296
- if 'test_metrics' in model_info:
297
- metrics = model_info['test_metrics']
298
- response += f"**{model_name}:**\n"
299
-
300
- if task_type == 'classification':
301
- response += f"- Accuracy: {metrics.get('accuracy', 0):.4f}\n"
302
- response += f"- Precision: {metrics.get('precision', 0):.4f}\n"
303
- response += f"- Recall: {metrics.get('recall', 0):.4f}\n"
304
- response += f"- F1 Score: {metrics.get('f1', 0):.4f}\n"
305
- else:
306
- response += f"- R² Score: {metrics.get('r2', 0):.4f}\n"
307
- response += f"- RMSE: {metrics.get('rmse', 0):.2f}\n"
308
- response += f"- MAE: {metrics.get('mae', 0):.2f}\n"
309
- response += f"- MAPE: {metrics.get('mape', 0):.2f}%\n"
310
- response += "\n"
311
-
312
- # Highlight BEST MODEL
313
- best_model = model_results.get('best_model', {})
314
- if best_model and best_model.get('name'):
315
- response += f"### 🏆 Best Model: **{best_model['name']}**\n"
316
- response += f"Score: {best_model.get('score', 0):.4f}\n\n"
317
-
318
- # Show workflow execution summary
319
- response += "### 🔧 Workflow Steps:\n"
320
- for i, step in enumerate(agent_response['workflow_history'], 1):
321
- tool_name = step['tool']
322
- success = step['result'].get('success', False)
323
- icon = "✅" if success else "❌"
324
- response += f"{i}. {icon} {tool_name}\n"
325
- response += "\n"
326
-
327
- # Check for plots AND reports in workflow results
328
- html_reports = [] # Separate list for HTML reports
329
-
330
- for step in agent_response['workflow_history']:
331
- result = step.get('result', {})
332
-
333
- # Deep search for plots and reports in nested results
334
- def find_plots_and_reports(obj, plots_list, reports_list):
335
- if isinstance(obj, dict):
336
- # Check direct plot/report keys
337
- for key in ['plot_path', 'plot_file', 'output_path', 'html_path', 'report_path',
338
- 'plots', 'plot_paths', 'performance_plots', 'feature_importance_plot']:
339
- if key in obj and obj[key]:
340
- if isinstance(obj[key], list):
341
- for path in obj[key]:
342
- if isinstance(path, str) and os.path.exists(path):
343
- if path.endswith('.html'):
344
- # Check if it's a report (in reports folder) or interactive plot
345
- if '/reports/' in path or 'report' in Path(path).stem.lower():
346
- reports_list.append(path)
347
- else:
348
- reports_list.append(path) # Interactive plots also go to reports
349
- elif path.endswith(('.png', '.jpg', '.jpeg')):
350
- plots_list.append(path)
351
- elif isinstance(obj[key], str) and os.path.exists(obj[key]):
352
- if obj[key].endswith('.html'):
353
- if '/reports/' in obj[key] or 'report' in Path(obj[key]).stem.lower():
354
- reports_list.append(obj[key])
355
- else:
356
- reports_list.append(obj[key])
357
- elif obj[key].endswith(('.png', '.jpg', '.jpeg')):
358
- plots_list.append(obj[key])
359
- # Recursively search nested dicts
360
- for value in obj.values():
361
- find_plots_and_reports(value, plots_list, reports_list)
362
-
363
- find_plots_and_reports(result, plots_paths, html_reports)
364
-
365
- # Remove duplicates while preserving order
366
- plots_paths = list(dict.fromkeys(plots_paths))
367
- html_reports = list(dict.fromkeys(html_reports))
368
-
369
- # Display visualization and report information in response
370
- if plots_paths or html_reports:
371
- response += f"## 📊 Generated Outputs\n\n"
372
-
373
- if plots_paths:
374
- response += f"### 📈 Visualizations ({len(plots_paths)} plots)\n"
375
- response += "✅ Plots are displayed in the **Visualization Gallery** below!\n\n"
376
-
377
- # List plot files
378
- for i, plot_path in enumerate(plots_paths[:10], 1):
379
- try:
380
- plot_name = Path(plot_path).stem.replace('_', ' ').title()
381
- rel_path = os.path.relpath(plot_path, '.')
382
- response += f"{i}. 📊 **{plot_name}**\n"
383
- response += f" 📁 `{rel_path}`\n\n"
384
- except Exception as e:
385
- response += f"{i}. ❌ Error: {str(e)}\n"
386
-
387
- if html_reports:
388
- response += f"### 📋 Reports & Interactive Plots ({len(html_reports)} files)\n"
389
- response += "✅ Reports are displayed in the **Reports Viewer** below!\n\n"
390
-
391
- # List report files
392
- for i, report_path in enumerate(html_reports[:10], 1):
393
- try:
394
- report_name = Path(report_path).stem.replace('_', ' ').title()
395
- rel_path = os.path.relpath(report_path, '.')
396
- file_size = os.path.getsize(report_path) / 1024 # KB
397
- response += f"{i}. 📄 **{report_name}**\n"
398
- response += f" 📁 `{rel_path}` ({file_size:.1f} KB)\n\n"
399
- except Exception as e:
400
- response += f"{i}. ❌ Error: {str(e)}\n"
401
- else:
402
- response += "ℹ️ No visualizations or reports were generated in this workflow.\n"
403
- else:
404
- response = f"⚠️ **AI Agent Status:** {agent_response.get('status', 'unknown')}\n\n"
405
- response += f"{agent_response.get('message', agent_response.get('error', 'Unknown error'))}\n"
406
-
407
- # Update the last assistant message with the response
408
- history = update_last_assistant_message(history, response)
409
-
410
- # Return plot paths for gallery and html_reports for HTML viewer
411
- # Store html_reports in a format the HTML component can use
412
- yield history, "", plots_paths if plots_paths else [], html_reports if html_reports else []
413
- return
414
- except Exception as e:
415
- import sys
416
- exc_type, exc_value, exc_traceback = sys.exc_info()
417
- response = f"⚠️ **AI Agent Error:**\n\n"
418
- response += f"**Error Type:** {exc_type.__name__}\n\n"
419
- response += f"**Error Message:** {str(e)}\n\n"
420
- response += f"**Full Traceback:**\n```python\n{traceback.format_exc()}\n```\n\n"
421
- response += "💡 **Fallback Options:**\n"
422
- response += "- Use the **Quick Train** feature on the right\n"
423
- response += "- Try manual commands: `profile`, `quality`, `columns`\n"
424
- # Update the last assistant message with error
425
- history = update_last_assistant_message(history, response)
426
- yield history, "", plots_paths if plots_paths else [], []
427
- return
428
- else:
429
- # Manual mode - Handle commands directly
430
- user_msg_lower = user_message.lower().strip()
431
-
432
- # Handle simple commands manually
433
- if 'profile' in user_msg_lower:
434
- response = "📊 **Dataset Profile:**\n\n"
435
- if current_profile:
436
- response += f"**Shape:** {current_profile['shape']['rows']:,} rows × {current_profile['shape']['columns']} columns\n\n"
437
- response += f"**Column Types:**\n"
438
- response += f"- Numeric: {len(current_profile['column_types']['numeric'])} columns\n"
439
- response += f"- Categorical: {len(current_profile['column_types']['categorical'])} columns\n"
440
- response += f"- Datetime: {len(current_profile['column_types']['datetime'])} columns\n\n"
441
- response += f"**Overall Stats:**\n"
442
- response += f"- Total cells: {current_profile['overall_stats']['total_cells']:,}\n"
443
- response += f"- Null values: {current_profile['overall_stats']['total_nulls']} ({current_profile['overall_stats']['null_percentage']:.1f}%)\n"
444
- response += f"- Duplicates: {current_profile['overall_stats']['duplicate_rows']}\n"
445
- else:
446
- response += "Profile information is available at the top of the chat!"
447
-
448
- elif 'quality' in user_msg_lower or 'issues' in user_msg_lower:
449
- quality = detect_data_quality_issues(current_file)
450
- response = "🔍 **Data Quality Report:**\n\n"
451
-
452
- if quality['critical']:
453
- response += f"🔴 **Critical Issues:** {len(quality['critical'])}\n"
454
- for issue in quality['critical']:
455
- response += f" • {issue['message']}\n"
456
- response += "\n"
457
-
458
- if quality['warning']:
459
- response += f"🟡 **Warnings:** {len(quality['warning'])}\n"
460
- for issue in quality['warning'][:5]: # Show first 5
461
- response += f" • {issue['message']}\n"
462
- if len(quality['warning']) > 5:
463
- response += f" • ... and {len(quality['warning']) - 5} more\n"
464
- response += "\n"
465
-
466
- if quality['info']:
467
- response += f"🔵 **Info:** {len(quality['info'])} observations\n"
468
-
469
- if not quality['critical'] and not quality['warning'] and not quality['info']:
470
- response += "✅ No issues detected! Your data looks good.\n"
471
-
472
- elif 'columns' in user_msg_lower or 'column' in user_msg_lower:
473
- if current_profile:
474
- response = "📋 **Dataset Columns:**\n\n"
475
- for col, info in current_profile['columns'].items():
476
- nulls = info.get('null_count', 0)
477
- null_pct = (nulls / current_profile['shape']['rows'] * 100) if current_profile['shape']['rows'] > 0 else 0
478
- response += f"• **{col}** ({info['type']})\n"
479
- response += f" - Nulls: {nulls} ({null_pct:.1f}%)\n"
480
- if 'unique' in info:
481
- response += f" - Unique: {info['unique']}\n"
482
- else:
483
- response = "📋 **Columns:** Please upload a file first to see column information."
484
-
485
- elif 'help' in user_msg_lower:
486
- response = "💡 **Available Commands:**\n\n"
487
- response += "**Manual Commands:**\n"
488
- response += "• `profile` - Show detailed dataset statistics\n"
489
- response += "• `quality` - Check data quality issues\n"
490
- response += "• `columns` - List all columns with details\n"
491
- response += "• `help` - Show this help message\n\n"
492
- response += "**Quick Actions:**\n"
493
- response += "• Use the **Quick Train** panel on the right to train models\n"
494
- response += "• Check **Dataset Info** in the sidebar for quick stats\n"
495
-
496
- else:
497
- # Default response for unrecognized commands
498
- response = f"💬 **You said:** {user_message}\n\n"
499
- response += "⚠️ AI agent is not available. I can respond to these commands:\n\n"
500
- response += "• `profile` - Show detailed statistics\n"
501
- response += "• `quality` - Check data quality\n"
502
- response += "• `columns` - List all columns\n"
503
- response += "• `help` - Show available commands\n\n"
504
- response += "**Or use Quick Train** on the right to train models directly!\n"
505
-
506
- # Add user message and assistant response
507
- history = add_user_message(history, user_message)
508
- history = add_assistant_message(history, response)
509
- yield history, "", [], []
510
- return
511
-
512
- # If no file is uploaded yet
513
- if user_message and user_message.strip() and not current_file:
514
- response = "⚠️ **Please upload a dataset first!**\n\n"
515
- response += "Click the 'Upload Dataset' button above and select a CSV or Parquet file."
516
- # Add user message and assistant response
517
- history = add_user_message(history, user_message)
518
- history = add_assistant_message(history, response)
519
- yield history, "", [], []
520
- return
521
-
522
- except Exception as e:
523
- error_msg = f"❌ **Error:** {str(e)}\n\n"
524
- error_msg += "**Traceback:**\n```\n" + traceback.format_exc() + "\n```"
525
- if user_message:
526
- # Check if we already added the user message
527
- last_user = get_last_user_content(history)
528
- if last_user != user_message:
529
- history = add_user_message(history, user_message)
530
- history = add_assistant_message(history, error_msg)
531
- else:
532
- history = add_assistant_message(history, error_msg)
533
- yield history, "", [], []
534
- return
535
-
536
- # Default return if nothing matched
537
- yield history, "", [], []
538
-
539
-
540
- def quick_profile(file):
541
- """Quick profile display in the sidebar."""
542
- if file is None:
543
- return "No file uploaded yet."
544
-
545
- try:
546
- profile = profile_dataset(file.name)
547
-
548
- info = f"**{Path(file.name).name}**\n\n"
549
- info += f"📊 {profile['shape']['rows']:,} rows × {profile['shape']['columns']} cols\n\n"
550
- info += f"**Columns:**\n"
551
- for col, col_info in list(profile['columns'].items())[:10]:
552
- info += f"- {col} ({col_info['type']})\n"
553
-
554
- if len(profile['columns']) > 10:
555
- info += f"- ... and {len(profile['columns']) - 10} more\n"
556
-
557
- return info
558
- except Exception as e:
559
- return f"Error: {str(e)}"
560
-
561
-
562
- def train_model_ui(file, target_col, model_type, test_size, progress=gr.Progress()):
563
- """Train a model directly from the UI."""
564
- if file is None:
565
- return "⚠️ Please upload a dataset first!"
566
-
567
- if not target_col:
568
- return "⚠️ Please specify a target column!"
569
-
570
- # Clean up the target column name - remove surrounding quotes if present
571
- target_col = target_col.strip().strip("'").strip('"')
572
-
573
- try:
574
- # Show progress
575
- progress(0, desc="🔄 Loading dataset...")
576
- yield "⏳ **Training in progress...**\n\n📊 Loading dataset..."
577
-
578
- import time
579
- time.sleep(0.5) # Brief pause for UI feedback
580
-
581
- progress(0.2, desc="🔄 Preparing data...")
582
- yield "⏳ **Training in progress...**\n\n📊 Dataset loaded\n🔄 Preparing data..."
583
-
584
- time.sleep(0.3)
585
- # Determine problem type
586
- problem_type = "classification" if model_type == "Classification" else "regression"
587
-
588
- progress(0.4, desc="🤖 Training models...")
589
- yield "⏳ **Training in progress...**\n\n📊 Dataset loaded\n✅ Data prepared\n🤖 Training multiple models..."
590
-
591
- # Train baseline models
592
- result = train_baseline_models(
593
- file.name,
594
- target_col=target_col,
595
- task_type=problem_type,
596
- test_size=test_size
597
- )
598
-
599
- progress(0.9, desc="📊 Evaluating results...")
600
-
601
- # Check if training was successful
602
- if result.get('status') == 'error':
603
- yield f"❌ **Training Failed**\n\n{result.get('message', 'Unknown error')}"
604
- return
605
-
606
- if 'best_model' not in result:
607
- yield f"❌ **Training Failed**\n\nNo models were successfully trained. Result: {result}"
608
- return
609
-
610
- # Get the best model
611
- best_model_name = result['best_model']['name']
612
- if not best_model_name:
613
- yield f"❌ **Training Failed**\n\nNo model could be selected as best model."
614
- return
615
-
616
- best_model_info = result['models'][best_model_name]
617
- best_metrics = best_model_info.get('test_metrics', {})
618
-
619
- output = f"✅ **Model Training Complete!**\n\n"
620
- output += f"## 🏆 Best Model: **{best_model_name}**\n\n"
621
-
622
- output += f"**Dataset Info:**\n"
623
- output += f"- Features: {result.get('n_features', 0)}\n"
624
- output += f"- Training samples: {result.get('train_size', 0):,}\n"
625
- output += f"- Test samples: {result.get('test_size', 0):,}\n\n"
626
-
627
- if problem_type == "classification":
628
- output += f"**Test Metrics:**\n"
629
- output += f"- ✅ Accuracy: {best_metrics.get('accuracy', 0):.4f}\n"
630
- output += f"- 🎯 Precision: {best_metrics.get('precision', 0):.4f}\n"
631
- output += f"- 📊 Recall: {best_metrics.get('recall', 0):.4f}\n"
632
- output += f"- 🔥 F1 Score: {best_metrics.get('f1', 0):.4f}\n\n"
633
- else:
634
- output += f"**Test Metrics:**\n"
635
- output += f"- 📈 R² Score: {best_metrics.get('r2', 0):.4f}\n"
636
- output += f"- 📉 RMSE: {best_metrics.get('rmse', 0):.2f}\n"
637
- output += f"- 📊 MAE: {best_metrics.get('mae', 0):.2f}\n"
638
- output += f"- 💯 MAPE: {best_metrics.get('mape', 0):.2f}%\n\n"
639
-
640
- output += f"## 📊 All Models Comparison:\n\n"
641
- for model_name, model_info in result['models'].items():
642
- if 'test_metrics' in model_info:
643
- test_metrics = model_info['test_metrics']
644
- indicator = "🏆 " if model_name == best_model_name else " "
645
- if problem_type == "classification":
646
- f1 = test_metrics.get('f1', 0)
647
- acc = test_metrics.get('accuracy', 0)
648
- output += f"{indicator}**{model_name}:**\n"
649
- output += f" - F1: {f1:.4f} | Accuracy: {acc:.4f}\n"
650
- else:
651
- r2 = test_metrics.get('r2', 0)
652
- rmse = test_metrics.get('rmse', 0)
653
- output += f"{indicator}**{model_name}:**\n"
654
- output += f" - R²: {r2:.4f} | RMSE: {rmse:.2f}\n"
655
- elif 'status' in model_info and model_info['status'] == 'error':
656
- output += f" ❌ **{model_name}:** {model_info.get('message', 'Error')}\n"
657
-
658
- # Display generated plots if available
659
- plots_to_show = []
660
-
661
- # Check for performance plots
662
- if 'performance_plots' in result and result['performance_plots']:
663
- if isinstance(result['performance_plots'], list):
664
- plots_to_show.extend(result['performance_plots'])
665
- else:
666
- plots_to_show.append(result['performance_plots'])
667
-
668
- # Check for feature importance plot
669
- if 'feature_importance_plot' in result and result['feature_importance_plot']:
670
- plots_to_show.append(result['feature_importance_plot'])
671
-
672
- # Embed plots
673
- if plots_to_show:
674
- output += f"\n\n📊 **Visualizations:**\n\n"
675
- for plot_path in plots_to_show:
676
- if isinstance(plot_path, str) and plot_path.endswith('.html') and os.path.exists(plot_path):
677
- try:
678
- with open(plot_path, 'r', encoding='utf-8') as f:
679
- plot_html = f.read()
680
- # Add plot title based on filename
681
- plot_name = Path(plot_path).stem.replace('_', ' ').title()
682
- output += f"**{plot_name}:**\n"
683
- output += f'<iframe srcdoc="{plot_html.replace(chr(34), "&quot;")}" width="100%" height="500" frameborder="0"></iframe>\n\n'
684
- except Exception as e:
685
- # Fallback to file path
686
- output += f"📁 {Path(plot_path).name}: `{plot_path}`\n"
687
-
688
- progress(1.0, desc="✅ Complete!")
689
- yield output
690
-
691
- except Exception as e:
692
- yield f"❌ **Error:** {str(e)}\n\n```\n{traceback.format_exc()}\n```"
693
-
694
-
695
- def clear_conversation():
696
- """Clear the conversation and reset state."""
697
- global current_file, current_profile
698
- current_file = None
699
- current_profile = None
700
- return [], None, "", [], ""
701
-
702
-
703
- def format_html_reports(html_paths):
704
- """Format HTML reports/plots for display in HTML component."""
705
- if not html_paths or len(html_paths) == 0:
706
- return "<div style='text-align:center; padding:40px; color:#666;'>No reports generated yet. Try: 'Generate a quality report' or 'Create interactive visualizations'</div>"
707
-
708
- html_output = """
709
- <style>
710
- .report-container {
711
- padding: 20px;
712
- background: #f8f9fa;
713
- }
714
- .report-card {
715
- margin-bottom: 30px;
716
- border: 2px solid #dee2e6;
717
- border-radius: 12px;
718
- overflow: hidden;
719
- background: white;
720
- box-shadow: 0 4px 6px rgba(0,0,0,0.1);
721
- }
722
- .report-header {
723
- background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
724
- color: white;
725
- padding: 15px 20px;
726
- font-weight: bold;
727
- font-size: 18px;
728
- display: flex;
729
- justify-content: space-between;
730
- align-items: center;
731
- }
732
- .report-meta {
733
- font-size: 12px;
734
- opacity: 0.9;
735
- }
736
- .report-iframe {
737
- width: 100%;
738
- min-height: 600px;
739
- border: none;
740
- background: white;
741
- }
742
- .report-footer {
743
- background: #f8f9fa;
744
- padding: 10px 20px;
745
- font-size: 12px;
746
- color: #666;
747
- border-top: 1px solid #dee2e6;
748
- }
749
- </style>
750
- <div class="report-container">
751
- """
752
-
753
- html_output += f"<h2 style='color: #667eea; margin-bottom: 20px;'>📋 {len(html_paths)} Report(s) Generated</h2>"
754
-
755
- for i, html_path in enumerate(html_paths, 1):
756
- try:
757
- # Get file metadata
758
- file_name = Path(html_path).name
759
- file_size = os.path.getsize(html_path) / 1024 # KB
760
- report_title = Path(html_path).stem.replace('_', ' ').title()
761
-
762
- # Read the HTML content
763
- with open(html_path, 'r', encoding='utf-8') as f:
764
- html_content = f.read()
765
-
766
- # Escape the content for embedding
767
- escaped_content = html_content.replace('\\', '\\\\').replace('"', '&quot;').replace("'", "\\'")
768
-
769
- html_output += f"""
770
- <div class="report-card">
771
- <div class="report-header">
772
- <span>📊 {i}. {report_title}</span>
773
- <span class="report-meta">{file_size:.1f} KB</span>
774
- </div>
775
- <iframe class="report-iframe" srcdoc="{escaped_content}"></iframe>
776
- <div class="report-footer">
777
- 📁 {html_path}
778
- </div>
779
- </div>
780
- """
781
- except Exception as e:
782
- html_output += f"""
783
- <div class="report-card">
784
- <div class="report-header" style="background: linear-gradient(135deg, #f44336 0%, #e91e63 100%);">
785
- <span>❌ Error loading: {Path(html_path).name}</span>
786
- </div>
787
- <div style="padding: 20px;">
788
- <p><strong>Error:</strong> {str(e)}</p>
789
- <p><strong>Path:</strong> {html_path}</p>
790
- </div>
791
- </div>
792
- """
793
-
794
- html_output += "</div>"
795
-
796
- return html_output
797
-
798
-
799
- def extract_and_display_plots(agent_response):
800
- """Extract plots from agent response and format them for display."""
801
- plots_html = ""
802
-
803
- if not agent_response or agent_response.get('status') != 'success':
804
- return gr.update(value="<p style='text-align:center; color:#666;'>No visualizations generated yet. Upload a dataset and run analysis!</p>")
805
-
806
- workflow_history = agent_response.get('workflow_history', [])
807
- if not workflow_history:
808
- return gr.update(value="<p style='text-align:center; color:#666;'>No visualizations in this workflow.</p>")
809
-
810
- # Find all plots
811
- plots_paths = []
812
-
813
- def find_plots(obj, plots_list):
814
- if isinstance(obj, dict):
815
- # Check direct plot keys
816
- for key in ['plot_path', 'plot_file', 'html_path', 'output_path',
817
- 'plots', 'plot_paths', 'performance_plots', 'feature_importance_plot']:
818
- if key in obj and obj[key]:
819
- if isinstance(obj[key], list):
820
- for plot_path in obj[key]:
821
- if isinstance(plot_path, str) and plot_path.endswith('.html') and os.path.exists(plot_path):
822
- plots_list.append(plot_path)
823
- elif isinstance(obj[key], str) and obj[key].endswith('.html') and os.path.exists(obj[key]):
824
- plots_list.append(obj[key])
825
- # Recursively search nested dicts
826
- for value in obj.values():
827
- find_plots(value, plots_list)
828
-
829
- for step in workflow_history:
830
- result = step.get('result', {})
831
- find_plots(result, plots_paths)
832
-
833
- # Remove duplicates while preserving order
834
- plots_paths = list(dict.fromkeys(plots_paths))
835
-
836
- if not plots_paths:
837
- return gr.update(value="<p style='text-align:center; color:#666;'>No plots were generated in this analysis.</p>")
838
-
839
- # Build HTML gallery
840
- plots_html = f"""
841
- <div style='padding: 20px;'>
842
- <h2 style='color: #1f77b4; margin-bottom: 20px;'>📊 Visualization Gallery ({len(plots_paths)} plots)</h2>
843
- """
844
-
845
- for i, plot_path in enumerate(plots_paths, 1):
846
- try:
847
- with open(plot_path, 'r', encoding='utf-8') as f:
848
- plot_content = f.read()
849
-
850
- plot_name = Path(plot_path).stem.replace('_', ' ').title()
851
-
852
- plots_html += f"""
853
- <div style='margin-bottom: 30px; border: 1px solid #ddd; border-radius: 8px; overflow: hidden;'>
854
- <div style='background: linear-gradient(90deg, #1f77b4, #2ca02c); color: white; padding: 10px 15px; font-weight: bold;'>
855
- {i}. {plot_name}
856
- </div>
857
- <div style='padding: 10px; background: white;'>
858
- <iframe srcdoc='{plot_content.replace("'", "&apos;").replace('"', "&quot;")}'
859
- width='100%' height='500' frameborder='0'
860
- style='border: none; border-radius: 5px;'></iframe>
861
- </div>
862
- <div style='background: #f8f9fa; padding: 8px 15px; font-size: 12px; color: #666;'>
863
- 📁 {plot_path}
864
- </div>
865
- </div>
866
- """
867
- except Exception as e:
868
- plots_html += f"""
869
- <div style='margin-bottom: 20px; padding: 15px; border: 1px solid #f44336; border-radius: 5px; background: #ffebee;'>
870
- <strong>❌ Failed to load: {Path(plot_path).name}</strong><br>
871
- <small>{str(e)}</small>
872
- </div>
873
- """
874
-
875
- plots_html += "</div>"
876
-
877
- return gr.update(value=plots_html)
878
-
879
-
880
- # Custom CSS for better visual feedback
881
- custom_css = """
882
- .status-box {
883
- padding: 10px;
884
- border-radius: 5px;
885
- background: linear-gradient(90deg, #e8f5e9 0%, #c8e6c9 100%);
886
- margin-bottom: 10px;
887
- text-align: center;
888
- font-weight: bold;
889
- }
890
- """
891
-
892
- # Create the Gradio interface
893
- with gr.Blocks(title="AI Agent Data Scientist", theme=gr.themes.Soft(), css=custom_css) as demo:
894
- gr.Markdown("""
895
- # 🤖 AI Agent Data Scientist
896
-
897
- Upload your dataset and chat with the AI agent to perform data science tasks!
898
-
899
- **Features:**
900
- - 📊 Automatic dataset profiling
901
- - 🤖 Natural language queries
902
- - 🎯 Model training (classification & regression)
903
- - 🔍 Data quality analysis
904
- - 📈 Feature engineering
905
- - 🎨 **NEW:** Automatic visualization generation!
906
- - And 59 tools total!
907
- """)
908
-
909
- # Store agent response for visualization extraction
910
- agent_response_state = gr.State(None)
911
-
912
- with gr.Row():
913
- # Left column - Main chat interface
914
- with gr.Column(scale=2):
915
- # Status indicator
916
- status_box = gr.Markdown("🟢 **Ready** - Upload a dataset to begin", elem_classes=["status-box"])
917
-
918
- chatbot = gr.Chatbot(
919
- label="Chat with AI Agent",
920
- height=450,
921
- show_label=True,
922
- avatar_images=(None, "🤖"),
923
- sanitize_html=False # Allow HTML content including iframes
924
- )
925
-
926
- with gr.Row():
927
- file_upload = gr.File(
928
- label="📁 Upload Dataset(s) (CSV/Parquet) - Single or Multiple Files",
929
- file_types=[".csv", ".parquet"],
930
- file_count="multiple", # Allow multiple file uploads
931
- type="filepath"
932
- )
933
-
934
- with gr.Row():
935
- user_input = gr.Textbox(
936
- label="Your Message",
937
- placeholder="Ask anything: 'train a model', 'analyze my data', 'generate visualizations'",
938
- lines=2,
939
- scale=4
940
- )
941
- submit_btn = gr.Button("📤 Send", variant="primary", scale=1)
942
-
943
- with gr.Row():
944
- clear_btn = gr.Button("🗑️ Clear", variant="secondary") # Right column - Quick actions and info
945
- with gr.Column(scale=1):
946
- gr.Markdown("## 📊 Dataset Info")
947
- dataset_info = gr.Markdown("Upload a dataset to see information here.")
948
-
949
- gr.Markdown("## 🎯 Quick Train")
950
- with gr.Group():
951
- target_column = gr.Textbox(
952
- label="Target Column",
953
- placeholder="e.g., 'price', 'class', 'label'"
954
- )
955
- model_type_choice = gr.Radio(
956
- ["Classification", "Regression"],
957
- label="Model Type",
958
- value="Classification"
959
- )
960
- test_size_slider = gr.Slider(
961
- 0.1, 0.5, 0.3,
962
- label="Test Size",
963
- step=0.05
964
- )
965
- train_btn = gr.Button("🚀 Train Model", variant="primary")
966
-
967
- training_output = gr.Markdown("Training results will appear here.")
968
-
969
- gr.Markdown("""
970
- ## 💡 Example Queries
971
-
972
- - "Train a classification model to predict [target]"
973
- - "Show me statistics for [column]"
974
- - "Detect outliers in the dataset"
975
- - "What are the most important features?"
976
- - "Generate a quality report"
977
- - "Create polynomial features"
978
- - "Balance the dataset using SMOTE"
979
- """)
980
-
981
- # Visualization Gallery Section (Full Width)
982
- with gr.Row():
983
- with gr.Column():
984
- gr.Markdown("## 🎨 Visualization Gallery")
985
- visualization_gallery = gr.Gallery(
986
- label="Generated Plots (PNG/JPG)",
987
- show_label=True,
988
- elem_id="gallery",
989
- columns=2,
990
- height=400
991
- )
992
-
993
- # Reports Viewer Section (Full Width)
994
- with gr.Row():
995
- with gr.Column():
996
- gr.Markdown("## 📋 Reports & Interactive Visualizations")
997
- gr.Markdown("*HTML reports and interactive Plotly charts will be displayed here*")
998
- reports_viewer = gr.HTML(
999
- value="<div style='text-align:center; padding:40px; color:#666;'>No reports generated yet. Try: 'Generate a quality report' or 'Create interactive visualizations'</div>",
1000
- elem_id="reports_viewer"
1001
- )
1002
-
1003
- # Create state to hold HTML report paths
1004
- html_reports_state = gr.State([])
1005
-
1006
- # Event handlers with streaming support
1007
- submit_result = submit_btn.click(
1008
- fn=analyze_dataset,
1009
- inputs=[file_upload, user_input, chatbot],
1010
- outputs=[chatbot, user_input, visualization_gallery, html_reports_state],
1011
- show_progress="full" # Show progress bar
1012
- )
1013
- submit_result.then(
1014
- fn=format_html_reports,
1015
- inputs=[html_reports_state],
1016
- outputs=[reports_viewer]
1017
- )
1018
-
1019
- user_input_result = user_input.submit(
1020
- fn=analyze_dataset,
1021
- inputs=[file_upload, user_input, chatbot],
1022
- outputs=[chatbot, user_input, visualization_gallery, html_reports_state],
1023
- show_progress="full"
1024
- )
1025
- user_input_result.then(
1026
- fn=format_html_reports,
1027
- inputs=[html_reports_state],
1028
- outputs=[reports_viewer]
1029
- )
1030
-
1031
- file_result = file_upload.change(
1032
- fn=analyze_dataset,
1033
- inputs=[file_upload, gr.Textbox(value="", visible=False), chatbot],
1034
- outputs=[chatbot, user_input, visualization_gallery, html_reports_state],
1035
- show_progress="full"
1036
- )
1037
- file_result.then(
1038
- fn=quick_profile,
1039
- inputs=[file_upload],
1040
- outputs=[dataset_info]
1041
- )
1042
- file_result.then(
1043
- fn=format_html_reports,
1044
- inputs=[html_reports_state],
1045
- outputs=[reports_viewer]
1046
- )
1047
-
1048
- train_btn.click(
1049
- fn=train_model_ui,
1050
- inputs=[file_upload, target_column, model_type_choice, test_size_slider],
1051
- outputs=[training_output],
1052
- show_progress="full" # Show progress bar
1053
- )
1054
-
1055
- clear_btn.click(
1056
- clear_conversation,
1057
- outputs=[chatbot, file_upload, user_input, visualization_gallery, reports_viewer]
1058
- )
1059
-
1060
- if __name__ == "__main__":
1061
- print("=" * 70)
1062
- print("🚀 Starting AI Agent Data Scientist Chat UI...")
1063
- print("=" * 70)
1064
- print("\n🌐 The UI will open in your browser automatically.")
1065
- print("💡 If it doesn't, copy the URL shown below.\n")
1066
-
1067
- demo.launch(
1068
- share=False, # Set to True to create a public link
1069
- server_name="0.0.0.0", # Listen on all interfaces
1070
- server_port=7865, # Changed port to avoid conflict
1071
- show_error=True,
1072
- inbrowser=True # Auto-open browser
1073
- )
 
cloudbuild.yaml DELETED
@@ -1,69 +0,0 @@
1
- # Google Cloud Build configuration for automated deployments
2
- # Triggered on git push to main branch
3
-
4
- steps:
5
- # Step 1: Build the container image
6
- - name: 'gcr.io/cloud-builders/docker'
7
- args:
8
- - 'build'
9
- - '-t'
10
- - 'gcr.io/$PROJECT_ID/data-science-agent:$COMMIT_SHA'
11
- - '-t'
12
- - 'gcr.io/$PROJECT_ID/data-science-agent:latest'
13
- - '.'
14
- timeout: 600s
15
-
16
- # Step 2: Push the container image to Container Registry
17
- - name: 'gcr.io/cloud-builders/docker'
18
- args:
19
- - 'push'
20
- - 'gcr.io/$PROJECT_ID/data-science-agent:$COMMIT_SHA'
21
-
22
- - name: 'gcr.io/cloud-builders/docker'
23
- args:
24
- - 'push'
25
- - 'gcr.io/$PROJECT_ID/data-science-agent:latest'
26
-
27
- # Step 3: Deploy to Cloud Run
28
- - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
29
- entrypoint: gcloud
30
- args:
31
- - 'run'
32
- - 'deploy'
33
- - 'data-science-agent'
34
- - '--image'
35
- - 'gcr.io/$PROJECT_ID/data-science-agent:$COMMIT_SHA'
36
- - '--region'
37
- - 'us-central1'
38
- - '--platform'
39
- - 'managed'
40
- - '--allow-unauthenticated'
41
- - '--memory'
42
- - '4Gi'
43
- - '--cpu'
44
- - '2'
45
- - '--timeout'
46
- - '900'
47
- - '--max-instances'
48
- - '10'
49
- - '--min-instances'
50
- - '0'
51
- - '--concurrency'
52
- - '10'
53
- - '--set-env-vars'
54
- - 'LLM_PROVIDER=groq,REASONING_EFFORT=medium,CACHE_TTL_SECONDS=86400'
55
- - '--set-secrets'
56
- - 'GROQ_API_KEY=GROQ_API_KEY:latest,GOOGLE_API_KEY=GOOGLE_API_KEY:latest,GOOGLE_APPLICATION_CREDENTIALS=GOOGLE_APPLICATION_CREDENTIALS:latest'
57
-
58
- # Build timeout
59
- timeout: 1200s
60
-
61
- # Images to push to Container Registry
62
- images:
63
- - 'gcr.io/$PROJECT_ID/data-science-agent:$COMMIT_SHA'
64
- - 'gcr.io/$PROJECT_ID/data-science-agent:latest'
65
-
66
- # Build options
67
- options:
68
- machineType: 'N1_HIGHCPU_8'
69
- logging: CLOUD_LOGGING_ONLY
 
setup-deployment.sh DELETED
@@ -1,78 +0,0 @@
1
- #!/bin/bash
2
- # Quick setup script for macOS deployment prerequisites
3
-
4
- set -e
5
-
6
- RED='\033[0;31m'
7
- GREEN='\033[0;32m'
8
- YELLOW='\033[1;33m'
9
- BLUE='\033[0;34m'
10
- NC='\033[0m'
11
-
12
- echo -e "${BLUE}🔧 Data Science Agent - Deployment Setup${NC}"
13
- echo "=========================================="
14
- echo ""
15
-
16
- # Check if Homebrew is installed
17
- if ! command -v brew &> /dev/null; then
18
- echo -e "${RED}❌ Homebrew not found${NC}"
19
- echo "Installing Homebrew..."
20
- /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
21
- else
22
- echo -e "${GREEN}✅ Homebrew installed${NC}"
23
- fi
24
-
25
- # Install Docker Desktop
26
- if ! command -v docker &> /dev/null; then
27
- echo -e "${YELLOW}📦 Installing Docker Desktop...${NC}"
28
- brew install --cask docker
29
- echo -e "${GREEN}✅ Docker Desktop installed${NC}"
30
- echo -e "${YELLOW}⚠️ Please start Docker Desktop application, then run this script again${NC}"
31
- exit 0
32
- else
33
- echo -e "${GREEN}✅ Docker installed${NC}"
34
- fi
35
-
36
- # Check if Docker daemon is running
37
- if ! docker info &> /dev/null; then
38
- echo -e "${YELLOW}⚠️ Docker is installed but not running${NC}"
39
- echo "Please start Docker Desktop application, then run this script again"
40
- exit 0
41
- fi
42
-
43
- # Install Google Cloud SDK
44
- if ! command -v gcloud &> /dev/null; then
45
- echo -e "${YELLOW}☁️ Installing Google Cloud SDK...${NC}"
46
- brew install --cask google-cloud-sdk
47
- echo -e "${GREEN}✅ Google Cloud SDK installed${NC}"
48
-
49
- echo ""
50
- echo -e "${YELLOW}📝 Next steps:${NC}"
51
- echo "1. Restart your terminal to load gcloud"
52
- echo "2. Run: gcloud auth login"
53
- echo "3. Run: gcloud auth application-default login"
54
- echo "4. Run: gcloud config set project YOUR_PROJECT_ID"
55
- echo "5. Run: ./deploy.sh"
56
- else
57
- echo -e "${GREEN}✅ Google Cloud SDK installed${NC}"
58
- fi
59
-
60
- echo ""
61
- echo -e "${BLUE}========================================${NC}"
62
- echo -e "${GREEN}✅ Setup complete!${NC}"
63
- echo ""
64
- echo "Next steps:"
65
- echo "1. Authenticate with Google Cloud:"
66
- echo " ${YELLOW}gcloud auth login${NC}"
67
- echo " ${YELLOW}gcloud auth application-default login${NC}"
68
- echo ""
69
- echo "2. Set your GCP project:"
70
- echo " ${YELLOW}gcloud config set project YOUR_PROJECT_ID${NC}"
71
- echo ""
72
- echo "3. Set your API keys:"
73
- echo " ${YELLOW}export GROQ_API_KEY='your-groq-key'${NC}"
74
- echo " ${YELLOW}export GOOGLE_API_KEY='your-google-key'${NC}"
75
- echo ""
76
- echo "4. Deploy to Cloud Run:"
77
- echo " ${YELLOW}./deploy.sh${NC}"
78
- echo ""
 
test_environment.py DELETED
@@ -1,48 +0,0 @@
1
- #!/usr/bin/env python3
2
- """
3
- Quick test script to verify all core imports work
4
- """
5
-
6
- print("Testing core imports...")
7
-
8
- try:
9
- print(" ✓ Python standard library")
10
- import sys
11
- import os
12
-
13
- print(" ✓ Data processing")
14
- import polars as pl
15
- import pandas as pd
16
- import numpy as np
17
-
18
- print(" ✓ Machine learning")
19
- import sklearn
20
- import xgboost
21
- import lightgbm
22
-
23
- print(" ✓ Visualization")
24
- import matplotlib
25
- import seaborn
26
- import plotly
27
-
28
- print(" ✓ LLM clients")
29
- import groq
30
-
31
- print(" ✓ Web framework")
32
- import gradio
33
- import fastapi
34
-
35
- print("\n✅ All core dependencies installed successfully!")
36
- print(f"\nPython version: {sys.version}")
37
- print(f"Gradio version: {gradio.__version__}")
38
- print(f"Polars version: {pl.__version__}")
39
- print(f"Pandas version: {pd.__version__}")
40
- print(f"NumPy version: {np.__version__}")
41
- print(f"Scikit-learn version: {sklearn.__version__}")
42
-
43
- except ImportError as e:
44
- print(f"\n❌ Import failed: {e}")
45
- sys.exit(1)
46
-
47
- print("\n🎉 Environment setup complete! You can now run:")
48
- print(" .venv/bin/python chat_ui.py")