# AI-Powered Data Science Agent

## Overview

The AI-Powered Data Science Agent is an intelligent autonomous system designed to perform complete end-to-end data science workflows through natural language interaction. This agent leverages Google Gemini 2.5 Flash for advanced reasoning and function calling capabilities, combined with a comprehensive suite of over 82 specialized machine learning tools.

The system enables users to upload datasets in CSV or Parquet format and describe their analytical objectives in plain English. The agent autonomously handles the entire pipeline including data profiling, quality assessment, cleaning, feature engineering, model training, hyperparameter optimization, cross-validation, and comprehensive reporting generation.

Key capabilities include intelligent intent classification, session memory for contextual awareness, error recovery mechanisms, and a modern React-based web interface for seamless user interaction.

[![React](https://img.shields.io/badge/React-19-61DAFB?logo=react)](https://reactjs.org/)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.109-009688?logo=fastapi)](https://fastapi.tiangolo.com/)
[![Gemini](https://img.shields.io/badge/Gemini-2.5_Flash-4285F4?logo=google)](https://ai.google.dev/)
[![Python](https://img.shields.io/badge/Python-3.10+-3776AB?logo=python)](https://python.org/)

---

## Key Features

### Autonomous AI Agent System

The core orchestration engine integrates Google Gemini 2.5 Flash with over 82 specialized machine learning tools organized across multiple categories:

- **Data Profiling Tools**: Generate comprehensive statistical summaries, distribution analysis, correlation matrices, data quality reports, and automated anomaly detection
- **Data Cleaning Tools**: Handle missing values with intelligent imputation strategies (mean, median, mode, forward/backward fill, KNN), outlier detection and treatment using IQR and Z-score methods, duplicate removal, and data type conversions
- **Feature Engineering Tools**: Create time-based features (hour, day, month, year, cyclical encodings), polynomial features, interaction terms, statistical aggregations, lag features, rolling window statistics, and domain-specific transformations
- **Model Training Tools**: Support for multiple algorithm families including linear models (Ridge, Lasso, ElasticNet), tree-based models (Random Forest, Gradient Boosting), and advanced gradient boosting frameworks (XGBoost, LightGBM, CatBoost)
- **Visualization Tools**: Generate interactive Plotly visualizations, Matplotlib static plots, correlation heatmaps, distribution plots, scatter matrices, feature importance charts, and residual analysis plots
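
The IQR-based outlier detection mentioned above boils down to a simple fence computation. A minimal sketch (illustrative only, not the project's actual tool implementation):

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Compute the standard IQR fences used for outlier detection."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return the values falling outside the IQR fences."""
    lo, hi = iqr_bounds(values, k)
    return [v for v in values if v < lo or v > hi]
```

The multiplier `k=1.5` is the conventional Tukey fence; the agent's tools may expose it as a tunable parameter.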

The intelligent orchestration system uses function calling capabilities to dynamically select and execute appropriate tools based on user intent. The agent maintains session memory for contextual awareness across conversation turns, enabling multi-turn dialogues where previous actions and results inform subsequent decisions.

Smart intent detection automatically classifies incoming requests into categories such as full ML pipeline execution, exploratory data analysis, data cleaning only, visualization generation, or multi-intent tasks requiring combined workflows.

Error recovery mechanisms include automatic retry logic with corrected parameters, file existence validation before tool execution, recovery guidance displaying the last successful file state, and loop detection to prevent infinite retry cycles.
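
The retry-with-correction and loop-detection behavior can be sketched as follows. This is a minimal illustration of the pattern, not the project's actual API; `run_with_retry` and `fix_params` are hypothetical names:

```python
def run_with_retry(tool, params, fix_params, max_attempts=3):
    """Call a tool; on failure, let a correction hook adjust the
    parameters and retry. Loop detection bails out if the corrected
    parameters repeat, preventing infinite retry cycles."""
    seen = set()
    for _ in range(max_attempts):
        key = tuple(sorted(params.items()))
        if key in seen:  # same parameters as a previous attempt
            break
        seen.add(key)
        try:
            return tool(**params)
        except Exception as exc:
            params = fix_params(params, exc)  # e.g. an LLM-suggested fix
    raise RuntimeError(f"tool failed after {max_attempts} attempts")
```

In the real agent, the correction step is driven by the model re-reading the error message and emitting a new function call with adjusted arguments.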

### Modern Web Interface

The frontend is built with React 19 and TypeScript 5.8, featuring a modern glassmorphism design aesthetic with smooth animations powered by Framer Motion. Key interface components include:

- **Landing Page**: Geometric hero section with animated background paths, key capabilities showcase, problem-solution presentation, process flow visualization, and technology stack display
- **Chat Interface**: Real-time message streaming, file upload support for CSV and Parquet formats, markdown rendering for formatted responses with code syntax highlighting, loading states with animated indicators, and error handling with user-friendly messages
- **Report Viewer**: In-application modal viewer for HTML reports generated by YData Profiling and custom dashboard tools, presented in a full-screen modal with professional styling, iframe embedding for report content, and download capabilities

- **Session Management**: Maintains conversation history across browser sessions, allows users to review previous analyses, and provides context for follow-up questions

### Complete Machine Learning Pipeline

The agent executes a comprehensive end-to-end pipeline:

1. **Data Profiling and Assessment**: Automatically generates statistical summaries including descriptive statistics (mean, median, standard deviation, quartiles), distribution analysis with histogram generation, correlation analysis with heatmap visualization, missing value analysis with percentage calculations, data type detection and validation, outlier detection using multiple methods (IQR, Z-score, isolation forest), and cardinality analysis for categorical variables

2. **Data Cleaning and Preprocessing**: Handles missing values with context-aware imputation strategies, removes or treats outliers based on statistical thresholds, performs data type conversions and casting, removes duplicate records, handles inconsistent formatting in categorical variables, and validates data integrity constraints

3. **Feature Engineering**: Creates time-based features, interaction terms, encodings, and the other transformations described above

4. **Model Training and Optimization**: Trains multiple algorithm families, selects the best performer, and tunes hyperparameters with Optuna

5. **Validation and Reporting**: Runs cross-validation and generates visualizations and HTML reports

## Quick Start Guide

### Prerequisites

Before beginning the installation, ensure your system meets the following requirements:

- **Python**: Version 3.10 or higher with pip package manager
- **Node.js**: A recent LTS release with the npm package manager

### Installation Steps

**Step 1: Clone the Repository**

Clone the repository from GitHub and navigate to the project directory:

```bash
git clone https://github.com/Pulastya-B/DevSprint-Data-Science-Agent.git
cd DevSprint-Data-Science-Agent
```

**Step 2: Configure Environment Variables**

Create a `.env` file in the root directory with the following configuration:

```bash
# LLM Provider Configuration
LLM_PROVIDER=gemini

# Google Gemini API Key (required)
GOOGLE_API_KEY=your_api_key_here

# Model Configuration
GEMINI_MODEL=gemini-2.5-flash

# Cache Configuration
CACHE_DB_PATH=./cache_db/cache.db
CACHE_TTL_SECONDS=86400

# Output and Data Directories
OUTPUT_DIR=./outputs
DATA_DIR=./data
```

Replace `your_api_key_here` with your actual Google Gemini API key obtained from https://ai.google.dev/

**Step 3: Install Python Dependencies**

Install all required Python packages using pip:

```bash
pip install -r requirements.txt
```

This installs all required backend dependencies.

## Usage Guide

### Web Interface Workflow

**Step 1: Access the Application**

Open your web browser and navigate to http://localhost:8080. You will see the landing page with an overview of the agent's capabilities.

**Step 2: Launch the Chat Interface**

Click the "Launch Agent" button to access the interactive chat interface.

**Step 3: Upload Your Dataset**

Click the file upload button (paperclip icon) and select your dataset file. Supported formats:
- CSV files (.csv) with any delimiter (comma, tab, semicolon, etc.)
- Parquet files (.parquet) for high-performance columnar storage

The agent will automatically detect the file format and load the data using appropriate parsers.
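
Format detection and delimiter handling can be sketched with the standard library. This is an illustration of the idea, not the project's loader; in the real backend these would dispatch to Polars readers:

```python
import csv
from pathlib import Path

def detect_format(filename: str) -> str:
    """Map a file extension to the parser the agent would use
    (e.g. pl.read_csv / pl.read_parquet in the Polars backend)."""
    suffix = Path(filename).suffix.lower()
    parsers = {".csv": "csv", ".parquet": "parquet"}
    if suffix not in parsers:
        raise ValueError(f"unsupported file format: {suffix}")
    return parsers[suffix]

def sniff_delimiter(sample: str) -> str:
    """Guess the delimiter of a CSV sample (comma, tab, semicolon, pipe)."""
    return csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter
```

`csv.Sniffer` only needs the first few lines of the file, so the check stays cheap even for large uploads.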

**Step 4: Describe Your Task**

Type your request in natural language in the chat input box. The agent understands various types of requests and will automatically determine the appropriate workflow.

**Step 5: Review Results**

The agent will execute the requested workflow and display results in the chat interface. For analyses that generate HTML reports (such as YData Profiling), a "View Report" button will appear. Click this button to open the report in a full-screen modal viewer.

### Example Queries and Use Cases

**Data Profiling and Exploration:**
```
"Generate a comprehensive profile report on this dataset"
"Show me the statistical summary and distribution of all variables"
"Analyze data quality issues including missing values and outliers"
"Create a correlation matrix and identify highly correlated features"
```

**Data Cleaning:**
```
"Clean the missing values using median imputation for numeric columns"
"Handle outliers in the dataset using IQR method"
"Remove duplicate records and fix data type inconsistencies"
"Drop columns with more than 50% missing values"
```
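
The median-imputation request above might be served by logic like this minimal sketch (illustrative only, using plain Python lists rather than the project's DataFrame tools):

```python
import statistics

def impute_median(column):
    """Replace missing entries (None) with the column median,
    a common strategy for numeric columns."""
    observed = [v for v in column if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in column]
```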

**Predictive Modeling:**
```
"Train a model to predict the target column 'price' using all features"
"Build a classification model for the 'churn' column"
"Compare multiple regression algorithms and select the best one"
"Train an XGBoost model with default hyperparameters"
```

**Feature Engineering:**
```
"Extract time-based features from the datetime column"
"Create interaction terms between numeric features"
"Apply target encoding for high-cardinality categorical variables"
"Generate polynomial features of degree 2"
```

**Model Optimization:**
```
"Perform hyperparameter tuning on the trained model using Optuna"
"Run 5-fold cross-validation to evaluate model performance"
"Optimize the XGBoost model for better accuracy"
```

**Visualization:**
```
"Generate a correlation heatmap for numeric features"
"Create distribution plots for all numeric columns"
"Show feature importance for the trained model"
"Generate interactive Plotly visualizations"
```

**End-to-End Pipeline:**
```
"Profile the data, clean it, engineer features, and train the best model"
"Perform complete analysis and predict the target column 'sales'"
"Do everything needed to build a production-ready model
.\start.ps1
```

**For Linux/macOS:**
```bash
chmod +x start.sh
./start.sh
```

The startup script prepares the frontend and launches the application server on port 8080.

## Technology Stack

### Frontend Technologies

- **React 19.2.3**: Latest version of React with improved concurrent rendering, automatic batching, and enhanced hooks for building performant user interfaces
- **TypeScript 5.8.2**: Provides static type checking, enhanced IDE support, and improved code maintainability with advanced type inference
- **Vite 6.2.0**: Next-generation frontend build tool offering instant server start, lightning-fast hot module replacement (HMR), and optimized production builds
- **Tailwind CSS 3.4.1**: Utility-first CSS framework enabling rapid UI development with pre-built classes and responsive design utilities
- **Framer Motion 12.23.26**: Production-ready animation library for React with declarative animations, gestures, and smooth transitions
- **React Markdown 9.0.1**: Markdown rendering component supporting GitHub-flavored markdown, code syntax highlighting, and custom renderers
- **Lucide React**: Icon library providing consistent, customizable SVG icons for the user interface

### Backend Technologies

- **FastAPI 0.109+**: Modern, high-performance Python web framework with automatic OpenAPI documentation, async/await support, and built-in request validation
- **Google Gemini 2.5 Flash**: Large language model with advanced reasoning capabilities, function calling support, and high token limits for agent orchestration
- **Polars 0.20+**: High-performance DataFrame library written in Rust, offering 10-100x speed improvements over pandas for large datasets
- **Scikit-learn 1.3+**: Comprehensive machine learning library providing classical algorithms for classification, regression, clustering, and preprocessing
- **XGBoost 2.0+**: Optimized gradient boosting framework with parallel tree construction, regularization, and efficient handling of sparse data
- **LightGBM 4.1+**: Gradient boosting framework by Microsoft with leaf-wise tree growth, categorical feature support, and memory efficiency
- **CatBoost 1.2+**: Gradient boosting library by Yandex with native categorical feature handling, GPU support, and symmetric tree structure
- **Optuna 3.5+**: Hyperparameter optimization framework with Bayesian optimization, pruning strategies, and distributed optimization support
- **YData Profiling 4.6+**: Automated exploratory data analysis tool generating comprehensive HTML reports with statistical summaries and data quality insights
- **Plotly 5.18+**: Interactive visualization library creating web-based charts with zooming, panning, and hover tooltips
- **Matplotlib 3.8+**: Fundamental plotting library for Python offering publication-quality static visualizations
- **Pydantic 2.5+**: Data validation library using Python type annotations for request/response models

## Docker Deployment

The application includes a multi-stage Dockerfile for optimized containerized deployment.

### Building the Docker Image

Build the Docker image with the following command:

```bash
docker build -t ds-agent:latest .
```

The multi-stage build process:
1. **Stage 1 (Builder)**: Installs Node.js dependencies and builds the React frontend
2. **Stage 2 (Runtime)**: Sets up Python environment, installs backend dependencies, and copies built frontend
3. **Result**: An optimized final image that excludes development dependencies and build tools

### Running the Container

Run the containerized application:

```bash
docker run -d \
  -p 8080:8080 \
  --env-file .env \
  --name ds-agent-container \
  ds-agent:latest
```

Parameters explained:
- `-d`: Run container in detached mode (background)
- `-p 8080:8080`: Map container port 8080 to host port 8080
- `--env-file .env`: Load environment variables from .env file
- `--name ds-agent-container`: Assign a name to the container for easy management

### Docker Compose (Recommended)

For easier management, create a `docker-compose.yml` file:

```yaml
version: '3.8'

services:
  ds-agent:
    build: .
    container_name: ds-agent
    ports:
      - "8080:8080"
    env_file:
      - .env
    volumes:
      # Persist generated artifacts and input data on the host
      # (adjust container paths to match OUTPUT_DIR and DATA_DIR)
      - ./outputs:/app/outputs
      - ./data:/app/data
```

Start the stack with `docker compose up -d`.

## Environment Configuration

The application uses environment variables for configuration management. Create a `.env` file in the project root directory with the following variables:

### Required Configuration

```bash
# LLM Provider Selection
LLM_PROVIDER=gemini
# Options: gemini (currently supported)

# Google Gemini API Key (REQUIRED)
GOOGLE_API_KEY=your_api_key_here
# Obtain from: https://ai.google.dev/
# Free tier limits: 10 RPM, 20 RPD

# Gemini Model Selection
GEMINI_MODEL=gemini-2.5-flash
# Options:
#   - gemini-2.5-flash (recommended, balanced performance)
#   - gemini-1.5-pro (higher capability, lower rate limits)
#   - gemini-1.5-flash (faster, lower cost)
```
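
A startup routine might validate these variables before the agent boots. The sketch below is illustrative (not the project's actual loader); it assumes the defaults shown in the configuration above:

```python
import os

REQUIRED = ["GOOGLE_API_KEY"]
DEFAULTS = {"LLM_PROVIDER": "gemini", "GEMINI_MODEL": "gemini-2.5-flash"}

def load_config(env=os.environ):
    """Merge defaults with the environment and fail fast on
    missing required settings."""
    missing = [k for k in REQUIRED if not env.get(k)]
    if missing:
        raise RuntimeError(f"missing required settings: {missing}")
    overrides = {k: env[k] for k in (*DEFAULTS, *REQUIRED) if k in env}
    return {**DEFAULTS, **overrides}
```

Failing fast here gives a clear error message instead of an opaque API failure on the first Gemini call.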

### Optional Configuration

```bash
# Cache Configuration
CACHE_DB_PATH=./cache_db/cache.db
CACHE_TTL_SECONDS=86400

# Output and Data Directories
OUTPUT_DIR=./outputs
DATA_DIR=./data
```

## Advanced Features

### Intelligent Intent Detection and Classification

The orchestration system employs sophisticated intent detection to automatically classify user requests and route them to appropriate workflow pipelines. The classification system analyzes incoming natural language queries using keyword matching, pattern recognition, and contextual understanding.

**Intent Categories:**

1. **Full ML Pipeline Intent**: Triggered by keywords such as "train", "model", "predict", "machine learning", "regression", "classification". Executes complete workflow including data profiling, cleaning, feature engineering, model training, hyperparameter tuning, and evaluation.

2. **Exploratory Analysis Intent**: Activated by keywords like "explore", "profile", "report", "analysis", "overview", "insights", "understand". Performs comprehensive data profiling with statistical summaries, distribution analysis, correlation matrices, and automated insights generation.

3. **Data Cleaning Intent**: Detected via keywords such as "clean", "missing", "outliers", "duplicates", "impute", "handle". Focuses on data quality improvement operations without proceeding to modeling.

4. **Visualization Intent**: Identified through keywords like "plot", "visualize", "chart", "graph", "heatmap", "distribution". Generates requested visualizations without performing modeling or extensive preprocessing.

5. **Feature Engineering Intent**: Recognized by keywords such as "feature", "engineer", "create features", "transform", "encode". Applies feature transformation and creation operations.

6. **Multi-Intent Workflows**: The system can detect and handle requests combining multiple intents, executing them in a logical sequence.

The intent classification system uses confidence scoring to handle ambiguous requests and can ask clarifying questions when intent is unclear.
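
The keyword-matching and confidence-scoring idea can be sketched as follows. Keyword sets are abbreviated from the lists above; the real system also applies pattern recognition and contextual understanding on top of this:

```python
INTENT_KEYWORDS = {
    "full_ml_pipeline": {"train", "model", "predict", "regression", "classification"},
    "exploratory_analysis": {"explore", "profile", "report", "overview", "insights"},
    "data_cleaning": {"clean", "missing", "outliers", "duplicates", "impute"},
    "visualization": {"plot", "visualize", "chart", "heatmap", "distribution"},
}

def classify_intent(query: str):
    """Score each intent by keyword hits; return all matching intents
    (multi-intent requests yield several) sorted by confidence."""
    words = set(query.lower().split())
    scores = {intent: len(words & kws) / len(kws)
              for intent, kws in INTENT_KEYWORDS.items()}
    hits = sorted((i for i, s in scores.items() if s > 0),
                  key=lambda i: -scores[i])
    return hits or ["clarify"]  # no match: ask a clarifying question
```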

### Context-Aware Session Memory

The agent implements persistent session memory that maintains conversation context across multiple turns. This enables natural multi-turn dialogues where subsequent requests can reference previous operations without requiring full context repetition.

**Session Memory Capabilities:**

- **Workflow History**: Stores complete history of executed tools, parameters, and results for the current session
- **File State Tracking**: Maintains references to uploaded files, intermediate processed datasets, and generated outputs
- **Model Persistence**: Remembers trained models and their performance metrics for comparison and further tuning
- **Error Context**: Stores information about encountered errors to avoid repeating failed operations
- **User Preferences**: Learns from user choices (e.g., preferred visualization types, imputation strategies)

**Example Multi-Turn Conversation:**

```
User:  "Profile this dataset and summarize the quality issues."
Agent: Runs profiling; reports missing values and outliers.
User:  "Now clean the issues you just found."
Agent: Applies imputation and outlier treatment to the columns
       identified in the previous turn, without re-asking for context.
```

## Complete Workflow Example

This section demonstrates a complete end-to-end workflow for a real-world dataset, showing the agent's autonomous decision-making and execution capabilities.

### Dataset: Earthquake Magnitude Prediction

**Input Dataset:** `earthquake_data.csv`
- Rows: 175,947 earthquake records
- Columns: 22 features including latitude, longitude, depth, time, location, and magnitude
- Target Variable: Earthquake magnitude (continuous regression task)
- Data Quality: 11.67% missing values, presence of outliers, mixed data types

**User Prompt:**
```
"Train a model to predict earthquake magnitude with the highest possible accuracy"
```

### Automated Workflow Execution

**Phase 1: Data Profiling and Assessment** (Step 1)
- Tool: `generate_ydata_profile`
- Action: Comprehensive statistical analysis of all 22 features
- Findings:
  - Total records: 175,947
  - Missing values detected in 8 columns
  - Outliers present in depth, latitude, longitude
  - High cardinality in location column (15,000+ unique values)
  - Strong correlation between depth and magnitude (r=0.62)
- Output: YData Profiling HTML report saved to `outputs/earthquake_profile.html`
- Time: 18.3 seconds
## API Reference

The FastAPI backend exposes several endpoints for programmatic interaction.

### Endpoints

**POST /chat**
- Description: Send a message to the agent with optional file upload
- Content-Type: multipart/form-data
- Parameters:
  - message (string, required): User's natural language request
  - file (file, optional): Dataset file (CSV or Parquet)
- Response: JSON with agent's response message and workflow history
- Example:
```bash
curl -X POST http://localhost:8080/chat \
  -F "message=Generate a data profile report" \
  -F "file=@dataset.csv"
```

**POST /run**
- Description: Execute a complete analysis workflow
- Content-Type: application/json
- Parameters:
  - query (string, required): Analysis request
  - use_cache (boolean, optional): Enable caching (default: true)
- Response: JSON with analysis results and generated artifacts
- Example:
```json
{
  "query": "Train a regression model to predict sales",
  "use_cache": true
}
```
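
The same `/run` call can be made from Python. The sketch below uses only the standard library (`urllib`); any HTTP client works, and the request shape matches the JSON parameters documented above:

```python
import json
import urllib.request

def build_run_request(query: str, use_cache: bool = True,
                      base_url: str = "http://localhost:8080"):
    """Construct the POST /run request for the agent's API."""
    body = json.dumps({"query": query, "use_cache": use_cache}).encode()
    return urllib.request.Request(
        f"{base_url}/run", data=body,
        headers={"Content-Type": "application/json"}, method="POST")

# Against a running server:
# with urllib.request.urlopen(build_run_request("Profile my data")) as resp:
#     result = json.load(resp)
```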

**GET /outputs/{file_path}**
- Description: Retrieve generated reports and artifacts
- Parameters:
  - file_path (string, required): Path to output file
- Response: File content (HTML, PNG, CSV, etc.)
- Example:
```bash
curl http://localhost:8080/outputs/ydata_profile.html
```

**GET /api/health**
- Description: Health check endpoint
- Response: JSON with status information
- Example response:
```json
{
  "status": "healthy",
  "version": "1.0.0",
  "timestamp": "2025-12-27T10:30:00Z"
}
```

### Interactive API Documentation

FastAPI automatically generates interactive API documentation:
- Swagger UI: http://localhost:8080/docs
- ReDoc: http://localhost:8080/redoc

## Contributing

Contributions to improve the AI-Powered Data Science Agent are welcome. Please follow these guidelines:

### Development Setup

1. Fork the repository and clone your fork
2. Create a new branch for your feature: `git checkout -b feature/your-feature-name`
3. Install development dependencies: `pip install -r requirements-dev.txt`
4. Make your changes with appropriate tests
5. Ensure all tests pass: `pytest tests/`
6. Format code with black: `black src/`
7. Lint code with flake8: `flake8 src/`
8. Commit with descriptive messages
9. Push to your fork and submit a pull request

### Code Style

- Follow PEP 8 guidelines for Python code
- Use type hints for function parameters and return values
- Write docstrings for all functions and classes
- Keep functions focused and under 50 lines when possible
- Use meaningful variable names

### Testing

- Write unit tests for new features
- Ensure existing tests pass before submitting PR
- Aim for >80% code coverage

## License

This project is licensed under the MIT License. See the LICENSE file for complete terms.

Copyright (c) 2025 Pulastya B

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

## Acknowledgments

This project builds upon several excellent open-source technologies and frameworks:

- **Google Gemini 2.5 Flash**: Advanced language model with function calling capabilities enabling intelligent agent orchestration
- **FastAPI**: Modern, high-performance web framework for building APIs with Python, providing automatic documentation and validation
- **React**: JavaScript library for building user interfaces, enabling component-based architecture and efficient rendering
- **Polars**: High-performance DataFrame library written in Rust, offering significant speed improvements over traditional data processing libraries
- **Scikit-learn**: Machine learning library providing simple and efficient tools for data analysis and modeling
- **XGBoost, LightGBM, CatBoost**: Gradient boosting frameworks offering state-of-the-art performance for structured data
- **Optuna**: Hyperparameter optimization framework with efficient search algorithms
- **YData Profiling**: Automated exploratory data analysis tool generating comprehensive reports
- **Plotly**: Interactive visualization library for creating publication-quality graphs
- **TypeScript**: Typed superset of JavaScript enhancing code quality and developer experience
- **Tailwind CSS**: Utility-first CSS framework for rapid UI development
- **Vite**: Next-generation frontend build tool with instant server start

Special thanks to the open-source community for creating and maintaining these exceptional tools.

## Contact and Support

**Developer:** Pulastya B

**GitHub Profile:** [@Pulastya-B](https://github.com/Pulastya-B)

**Project Repository:** [DevSprint-Data-Science-Agent](https://github.com/Pulastya-B/DevSprint-Data-Science-Agent)

**Issues and Bug Reports:** Please use the GitHub Issues page to report bugs or request features

**Documentation:** Additional documentation and tutorials available in the repository wiki

**Project Status:** Active development - Built for DevSprint Hackathon

For questions, suggestions, or collaboration opportunities, please open an issue on GitHub or contact through the repository.

---

**Last Updated:** December 27, 2025

**Version:** 1.0.0
---

### Workflow Example (continued): Phases 4-8

**Phase 4: Feature Engineering** (Steps 6-8)

Step 6 - Temporal Feature Extraction:
- Tool: `extract_time_features`
- Input column: 'timestamp'
- Features created:
  - year, month, day_of_week, hour
  - Cyclical encodings: hour_sin, hour_cos, month_sin, month_cos
- Justification: Earthquakes may have temporal patterns
- New columns: 8 time-based features
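
The cyclical encodings listed above (hour_sin, hour_cos, etc.) map a cyclic quantity onto the unit circle so that, for example, hour 23 and hour 0 end up adjacent. A minimal sketch of the transformation:

```python
import math

def cyclical_encode(value: float, period: float):
    """Encode a cyclic quantity (hour of day, month of year) as
    (sin, cos) coordinates on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

# hour_sin, hour_cos for 6 AM with a 24-hour period (angle = pi/2)
hour_sin, hour_cos = cyclical_encode(6, 24)
```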

Step 7 - Categorical Encoding:
- Tool: `encode_categorical_features`
- Method: Target encoding for 'location' (high cardinality), one-hot encoding for 'type'
- Result: All categorical variables converted to numeric
- New columns: 3 (reduced from high-cardinality location)

Step 8 - Statistical Features:
- Tool: `create_statistical_features`
- Features created:
  - Distance from nearest plate boundary (calculated from lat/lon)
  - Depth-to-magnitude ratio
  - Regional earthquake frequency (rolling count)
- New columns: 3 domain-specific features

Final feature count: 28 engineered features

**Phase 5: Model Training and Selection** (Step 9)
- Tool: `train_baseline_models`
- Algorithms trained in parallel:

1. Ridge Regression: R² = 0.534, RMSE = 0.312
2. Lasso Regression: R² = 0.541, RMSE = 0.309
3. ElasticNet: R² = 0.538, RMSE = 0.311
4. Random Forest: R² = 0.698, RMSE = 0.251
5. XGBoost: R² = 0.716, RMSE = 0.243 (BEST)
6. LightGBM: R² = 0.709, RMSE = 0.247
7. CatBoost: R² = 0.712, RMSE = 0.245

- Best model selected: XGBoost
- Validation split: 80/20 stratified split
- Time: 124.7 seconds

**Phase 6: Hyperparameter Optimization** (Step 10)
- Tool: `optimize_hyperparameters_optuna`
- Framework: Optuna with Tree-structured Parzen Estimator (TPE)
- Search space:
  - max_depth: [3, 10]
  - learning_rate: [0.001, 0.3] (log scale)
  - n_estimators: [100, 1000]
  - min_child_weight: [1, 10]
  - subsample: [0.6, 1.0]
  - colsample_bytree: [0.6, 1.0]
- Trials: 50 iterations
- Best parameters found:
  - max_depth: 7
  - learning_rate: 0.0847
  - n_estimators: 673
  - min_child_weight: 3
  - subsample: 0.8234
  - colsample_bytree: 0.9123
- Optimized performance: R² = 0.743, RMSE = 0.231
- Improvement: +0.027 R² (+3.8% relative) over the baseline
- Time: 312.4 seconds

**Phase 7: Model Validation** (Step 11)
- Tool: `cross_validate_model`
- Method: 5-fold stratified cross-validation
- Results:
  - Fold 1: R² = 0.741, RMSE = 0.232
  - Fold 2: R² = 0.745, RMSE = 0.230
  - Fold 3: R² = 0.738, RMSE = 0.234
  - Fold 4: R² = 0.747, RMSE = 0.229
  - Fold 5: R² = 0.742, RMSE = 0.232
- Mean performance: R² = 0.743 ± 0.003, RMSE = 0.231 ± 0.002
- Interpretation: Low variance across folds indicates robust, generalizable model
- Time: 267.8 seconds
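The validation step follows the standard scikit-learn cross-validation pattern, sketched here with synthetic data. Note that for a continuous target like magnitude, plain shuffled K-fold is the usual choice (stratified splitting is defined for class labels):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_validate

X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=1)

cv = KFold(n_splits=5, shuffle=True, random_state=1)
results = cross_validate(
    GradientBoostingRegressor(random_state=1),
    X, y, cv=cv,
    scoring=("r2", "neg_root_mean_squared_error"),
)
r2_scores = results["test_r2"]
rmse_scores = -results["test_neg_root_mean_squared_error"]
print(f"R² = {r2_scores.mean():.3f} ± {r2_scores.std():.3f}")
```

A small standard deviation across the five folds is what justifies the "robust, generalizable" interpretation above.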

**Phase 8: Visualization and Reporting** (Steps 12-13)

Step 12 - Feature Importance Analysis:
- Tool: `plot_feature_importance`
- Top 10 features by importance:
  1. depth (0.284)
  2. distance_to_plate_boundary (0.167)
  3. latitude (0.142)
  4. longitude (0.138)
  5. regional_frequency (0.095)
  6. depth_magnitude_ratio (0.067)
  7. hour_sin (0.034)
  8. month (0.028)
  9. location_encoded (0.024)
  10. year (0.021)
- Output: Interactive Plotly bar chart saved to `outputs/feature_importance.html`

Step 13 - Comprehensive Dashboard:
- Tool: `create_plotly_dashboard`
- Visualizations included:
  - Correlation heatmap (28x28 features)
  - Actual vs Predicted scatter plot
  - Residual distribution plot
  - Feature importance ranking
  - Temporal patterns in predictions
- Output: Multi-panel interactive dashboard saved to `outputs/model_dashboard.html`

### Final Results Summary

**Model Performance:**
- Algorithm: XGBoost with optimized hyperparameters
- Training R²: 0.743
- Cross-validated R²: 0.743 ± 0.003
- RMSE: 0.231 (on magnitude scale 0-10)
- MAE: 0.176
- Explanation: Model explains 74.3% of variance in earthquake magnitudes

**Artifacts Generated:**
- Trained model file: `outputs/xgboost_model_optimized.pkl`
- YData profiling report: `outputs/earthquake_profile.html`
- Feature importance plot: `outputs/feature_importance.html`
- Interactive dashboard: `outputs/model_dashboard.html`
- Cleaned dataset: `data/earthquake_data_cleaned.parquet`
- Feature engineered dataset: `data/earthquake_data_featured.parquet`

**Total Execution Time:** 12 minutes 43 seconds

**Key Insights:**
1. Depth is the strongest predictor of earthquake magnitude (28.4% importance)
2. Spatial features (distance to plate boundaries, lat/lon) are highly informative
3. Temporal patterns show cyclical variations in earthquake characteristics
4. Model performance is consistent across cross-validation folds (low variance)
5. The optimized XGBoost model provides reliable magnitude predictions suitable for deployment

### Robust Error Recovery System

The agent implements a comprehensive error recovery system designed to handle failures gracefully and guide users toward successful task completion.

**Error Recovery Mechanisms:**

1. **Automatic Retry with Correction**: When a tool execution fails due to incorrect parameters, the agent analyzes the error message, adjusts parameters based on the error type, and automatically retries the operation with corrected inputs.

2. **File Existence Validation**: Before executing tools that require specific file inputs, the system validates file existence and accessibility, providing clear guidance when files are missing.

3. **Column Name Validation**: Validates that requested column names exist in the dataset before performing operations, suggesting similar column names when exact matches aren't found.

4. **Dependency Tracking**: Ensures tools are executed in proper sequence, checking that prerequisite operations (e.g., data cleaning before training) have been completed.

5. **Loop Detection**: Monitors tool execution patterns to detect and prevent infinite retry loops. If the same operation fails multiple times with the same error, the agent stops retrying and requests user intervention.

6. **Recovery Guidance**: When errors cannot be automatically resolved, the system provides detailed guidance including:
   - Clear explanation of what went wrong
   - The last successful file state that can be used to continue
   - Suggested alternative approaches
   - Specific parameter corrections needed

7. **Graceful Degradation**: If a requested operation cannot be completed, the agent attempts to provide partial results or alternative analysis that may still be valuable.

**Example Error Recovery Flow:**

```
Request: "Train a model to predict 'Price' column"

Error Detected: Column 'Price' not found in dataset
Recovery Action: Search for similar columns → Find 'SalePrice', 'price_usd'
Agent Response: "Column 'Price' not found. Did you mean 'SalePrice'? I found these similar columns: ['SalePrice', 'price_usd']. Please specify which column to use."

User: "Yes, use SalePrice"
Agent: [Continues with corrected column name]
```
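The similar-column suggestion in the flow above can be implemented with the standard library's `difflib`. A minimal sketch (the helper name is ours, not the project's actual API):

```python
import difflib

def suggest_columns(requested: str, available: list[str], n: int = 3) -> list[str]:
    """Case-insensitive fuzzy match of a requested column against the schema."""
    lowered = {c.lower(): c for c in available}
    if requested.lower() in lowered:  # exact match up to case
        return [lowered[requested.lower()]]
    matches = difflib.get_close_matches(requested.lower(), lowered.keys(), n=n, cutoff=0.6)
    return [lowered[m] for m in matches]

cols = ["SalePrice", "price_usd", "bedrooms", "lot_area"]
suggest_columns("Price", cols)  # suggests the price-like columns
```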

### Interactive Report Viewing

The web interface includes an integrated report viewer that displays comprehensive HTML reports generated during analysis without requiring users to download files or switch to external tools.

**Report Viewer Features:**

- **In-Application Display**: Reports open in a full-screen modal overlay within the chat interface
- **Multiple Report Types**: Supports YData Profiling reports and custom HTML dashboards
- **Professional Styling**: Modal features glassmorphism design, smooth animations, and responsive layout
- **Interactive Navigation**: Users can zoom, scroll, and interact with report elements directly in the viewer
- **Download Option**: Reports can be downloaded as standalone HTML files for sharing or archival
- **Automatic Detection**: System automatically detects when tools generate HTML reports and creates "View Report" buttons in the chat interface

**Supported Report Types:**

1. **YData Profiling Reports**: Comprehensive automated EDA with variable statistics, distributions, correlations, missing value analysis, and alerts for data quality issues

2. **Custom Dashboards**: User-created Plotly dashboards with multiple interactive visualizations

The report extraction system uses multiple strategies to locate report files, including checking tool return values, parsing workflow history, and using regex pattern matching on agent responses.
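The regex fallback might look like the following sketch. The pattern is an assumption for illustration, matching any `.html` path under an `outputs/` directory mentioned in free-form text:

```python
import re

# Hypothetical pattern: .html paths under an outputs/ directory.
REPORT_PATTERN = re.compile(r"[\w./-]*outputs/[\w.-]+\.html")

def extract_report_paths(agent_response: str) -> list[str]:
    """Last-resort strategy: scrape report paths out of agent text,
    preserving first-seen order and dropping duplicates."""
    seen, paths = set(), []
    for match in REPORT_PATTERN.findall(agent_response):
        if match not in seen:
            seen.add(match)
            paths.append(match)
    return paths

text = ("Saved profile to outputs/earthquake_profile.html "
        "and dashboard to ./outputs/model_dashboard.html")
extract_report_paths(text)
```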
- Use different API keys for development and production
- Rotate API keys periodically
- Set restrictive file permissions on `.env` (chmod 600 on Linux/macOS)

**Linux/macOS:**
```bash
chmod +x build-and-deploy.sh
./build-and-deploy.sh
```

These scripts handle building the image, stopping any existing containers, and starting a new container with proper configuration.

**4. Build the frontend**

```bash
cd FRRONTEEEND
npm install
npm run build
cd ..
```

**5. Run the application**

**Windows:**
```powershell
.\start.ps1
```

**Linux/Mac:**
```bash
chmod +x start.sh
./start.sh
```

The application will be available at **http://localhost:8080**

---

## 📖 Usage

### Web Interface

1. **Navigate to http://localhost:8080**
2. **Click "Launch Agent"** from the landing page
3. **Upload your dataset** (CSV or Parquet format)
4. **Type your request** in natural language:
   - "Generate a comprehensive report on this dataset"
   - "Train a model to predict [target_column]"
   - "Clean the data and show me visualizations"
   - "Perform feature engineering and train the best model"
5. **View results** in the chat and click "View Report" buttons to see detailed HTML reports

### Example Queries

```
📊 "Profile this dataset and tell me about data quality issues"

🧹 "Clean the missing values and handle outliers"

🎯 "Train a model to predict house prices with target column 'price'"

📈 "Generate a correlation heatmap and feature importance plot"

🔧 "Create time-based features and perform hyperparameter tuning"

📋 "Generate a comprehensive YData profiling report"
```

---

## 🏗️ Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                    React Frontend (Port 8080)                │
│  Landing Page │ Chat Interface │ Report Viewer               │
└─────────────────────────┬────────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────────┐
│              FastAPI Backend (Python 3.10+)                  │
│  /chat │ /run │ /outputs │ /api/health                       │
└─────────────────────────┬────────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────────┐
│           DataScienceCopilot Orchestrator                    │
│  • Gemini 2.5 Flash Integration                              │
│  • 82+ Specialized Tools                                     │
│  • Session Memory & Context                                  │
│  • Intelligent Intent Detection                              │
│  • Error Recovery & Loop Prevention                          │
└─────────────────────────┬────────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────────┐
│                     Tool Categories                          │
│  Profiling │ Cleaning │ Feature Engineering │ ML Training    │
│  Visualization │ EDA Reports │ Data Wrangling                │
└──────────────────────────────────────────────────────────────┘
```

---

## 🛠️ Tech Stack

### Frontend
- **React 19** - Modern UI library
- **TypeScript 5.8** - Type-safe development
- **Vite 6** - Lightning-fast build tool
- **Tailwind CSS** - Utility-first styling
- **Framer Motion** - Smooth animations
- **React Markdown** - Formatted responses

### Backend
- **FastAPI** - High-performance Python web framework
- **Google Gemini 2.5 Flash** - LLM for agent orchestration
- **Polars** - Fast dataframe library (10-100x faster than pandas)
- **Scikit-learn** - Classical ML algorithms
- **XGBoost / LightGBM / CatBoost** - Gradient boosting frameworks
- **Optuna** - Hyperparameter optimization
- **YData Profiling** - Automated EDA reports
- **Plotly / Matplotlib** - Interactive visualizations

### DevOps
- **Docker** - Containerization with multi-stage builds
- **Python-dotenv** - Environment variable management
- **SQLite** - Caching layer for performance

---

## 🐳 Docker Deployment

**Build and run with Docker:**

```bash
docker build -t ds-agent .
docker run -p 8080:8080 --env-file .env ds-agent
```

**Or use the deployment script:**

```bash
.\build-and-deploy.ps1  # Windows
./build-and-deploy.sh   # Linux/Mac
```

---

## 📂 Project Structure

```
.
├── FRRONTEEEND/              # React frontend
│   ├── components/           # UI components
│   │   ├── ChatInterface.tsx # Main chat interface
│   │   ├── HeroGeometric.tsx # Landing page hero
│   │   └── ...
│   ├── dist/                 # Built frontend
│   └── package.json
│
├── src/                      # Python backend
│   ├── api/
│   │   └── app.py           # FastAPI application
│   ├── orchestrator.py      # Agent orchestrator
│   ├── session_memory.py    # Session management
│   ├── tools/               # 82+ ML tools
│   │   ├── data_profiling.py
│   │   ├── data_cleaning.py
│   │   ├── feature_engineering.py
│   │   ├── model_training.py
│   │   └── ...
│   └── utils/               # Helper utilities
│
├── Dockerfile               # Multi-stage Docker build
├── requirements.txt         # Python dependencies
├── start.ps1 / start.sh     # Quick start scripts
└── README.md                # This file
```

---

## 🔑 Environment Variables

Create a `.env` file in the root directory:

```bash
# LLM Provider Configuration
LLM_PROVIDER=gemini

# API Keys
GOOGLE_API_KEY=your_gemini_api_key_here

# Model Configuration
GEMINI_MODEL=gemini-2.5-flash

# Cache Configuration
CACHE_DB_PATH=./cache_db/cache.db
CACHE_TTL_SECONDS=86400

# Output Configuration
OUTPUT_DIR=./outputs
DATA_DIR=./data
```
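At startup the backend loads these variables via `python-dotenv`. For illustration, a minimal stdlib-only parser handling the same `KEY=VALUE` format (real `.env` parsing also covers quoting and `export` prefixes):

```python
def load_env_file(path: str = ".env") -> dict[str, str]:
    """Minimal stand-in for python-dotenv's load_dotenv: parse KEY=VALUE
    lines, skipping blanks and # comments."""
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values
```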

---

## 🎯 Features in Detail

### Intelligent Intent Detection
The agent automatically classifies your request and applies the appropriate workflow:
- **Full ML Pipeline** - Complete end-to-end workflow with training
- **Exploratory Analysis** - Data profiling and visualization only
- **Cleaning Only** - Data quality improvements without modeling
- **Visualization Only** - Generate plots and dashboards
- **Multi-Intent** - Combine multiple tasks intelligently

### Session Memory
The agent remembers context across messages:
```
You: "Train a model on this dataset"
Agent: [Trains XGBoost model with R² = 0.85]

You: "Now try hyperparameter tuning"
Agent: [Automatically uses previous model and dataset]

You: "Cross-validate it"
Agent: [Applies CV to tuned model from context]
```
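Conceptually, session memory is a per-conversation store of referents that later turns resolve against. A toy sketch of the idea (class and method names are illustrative, not the project's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class SessionMemory:
    """Per-session context the agent carries between chat turns."""
    context: dict = field(default_factory=dict)

    def remember(self, key, value):
        self.context[key] = value

    def resolve(self, key, default=None):
        return self.context.get(key, default)

mem = SessionMemory()
mem.remember("dataset", "data/earthquake_data_cleaned.parquet")
mem.remember("last_model", "xgboost_baseline")
# A follow-up like "now tune it" resolves its referent from context:
mem.resolve("last_model")
```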

### Error Recovery
- Automatic retry with corrected parameters
- File existence validation before execution
- Recovery guidance showing last successful file
- Loop detection to prevent infinite retries

### Report Viewing
- Click "View Report" buttons to see HTML reports in-app
- Full-screen modal with professional styling
- Supports YData Profiling and custom dashboards

---

## 📊 Example Workflow

**Upload:** `earthquake_data.csv` (175K rows, 22 columns)

**Prompt:** "Train a model to predict earthquake magnitude"

**Agent Actions:**
1. ✅ Profiles dataset (175,947 rows, 22 columns)
2. ✅ Detects data quality issues (11.67% missing, outliers)
3. ✅ Drops high-missing columns (>40% missing)
4. ✅ Imputes remaining missing values with median/mode
5. ✅ Handles outliers with IQR clipping
6. ✅ Extracts time-based features (year, month, hour, cyclical)
7. ✅ Encodes categorical variables
8. ✅ Trains 7 baseline models (XGBoost wins with R² = 0.716)
9. ✅ Performs hyperparameter tuning (R² = 0.743)
10. ✅ Runs 5-fold cross-validation (RMSE = 0.231 ± 0.002)
11. ✅ Generates YData profiling report
12. ✅ Creates interactive Plotly dashboard

**Result:** Trained and tuned XGBoost model ready for deployment!

---

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

---

## 📄 License

This project is licensed under the MIT License.

---

## 🙏 Acknowledgments

- **Google Gemini** for powerful LLM capabilities
- **FastAPI** for excellent async Python framework
- **React** community for amazing UI libraries
- **Polars** for blazing-fast data processing
- **YData Profiling** for comprehensive EDA reports

---

## 📧 Contact

**Pulastya B**
- GitHub: [@Pulastya-B](https://github.com/Pulastya-B)
- Project: [DevSprint-Data-Science-Agent](https://github.com/Pulastya-B/DevSprint-Data-Science-Agent)

---

<div align="center">

**Built with ❀️ for DevSprint Hackathon**

⭐ Star this repo if you find it helpful!

</div>