Pulastya B committed on
Commit
50a857f
·
1 Parent(s): fc23b4d

docs: Enhance README with comprehensive details and professional formatting

- Remove all emojis for professional appearance
- Expand all sections with detailed technical information
- Add comprehensive workflow example with earthquake dataset
- Include detailed feature descriptions and implementations
- Add API reference documentation with examples
- Expand environment configuration with security best practices
- Add detailed Docker deployment instructions
- Include contribution guidelines and code style requirements
- Provide extensive technology stack details with version numbers
- Add licensing information and acknowledgments

Files changed (1)
  1. README.md +207 -47
README.md CHANGED
@@ -1,8 +1,12 @@
1
- # 🤖 AI-Powered Data Science Agent
2
 
3
- > **An intelligent autonomous agent that performs end-to-end data science workflows through natural language**
4
 
5
- Upload your dataset, describe what you want in plain English, and watch as the AI agent handles profiling, cleaning, feature engineering, model training, hyperparameter tuning, and comprehensive reporting - all automatically.
6
 
7
  [![React](https://img.shields.io/badge/React-19-61DAFB?logo=react)](https://reactjs.org/)
8
  [![FastAPI](https://img.shields.io/badge/FastAPI-0.109-009688?logo=fastapi)](https://fastapi.tiangolo.com/)
@@ -11,67 +15,223 @@ Upload your dataset, describe what you want in plain English, and watch as the A
11
 
12
  ---
13
 
14
- ## ✨ Key Features
15
-
16
- ### 🎯 **Autonomous AI Agent**
17
- - **82+ Specialized ML Tools** organized across data profiling, cleaning, feature engineering, model training, and visualization
18
- - **Intelligent Orchestration** with Google Gemini 2.5 Flash for function calling and decision-making
19
- - **Session Memory** for contextual awareness across conversations
20
- - **Smart Intent Detection** automatically classifies tasks (ML pipeline, cleaning only, visualization, etc.)
21
- - **Error Recovery** with automatic retry logic and file tracking
22
-
23
- ### 🎨 **Modern Web Interface**
24
- - **Beautiful React Frontend** with glassmorphism design and smooth animations
25
- - **Interactive Chat** with file upload support (CSV, Parquet)
26
- - **Report Viewer** to view YData profiling and Sweetviz HTML reports in-app
27
- - **Markdown Support** for formatted responses
28
- - **Session Management** to maintain conversation history
29
-
30
- ### 📊 **Complete ML Pipeline**
31
- 1. **Data Profiling** - Automated statistical analysis and data quality assessment
32
- 2. **Data Cleaning** - Smart missing value handling, outlier treatment, type conversion
33
- 3. **Feature Engineering** - Time-based features, encoding, interactions, statistical features
34
- 4. **Model Training** - Ridge, Lasso, Random Forest, XGBoost, LightGBM, CatBoost
35
- 5. **Hyperparameter Tuning** - Optuna-based optimization with 50+ trials
36
- 6. **Cross-Validation** - Stratified K-fold validation for robust evaluation
37
- 7. **Visualization** - Interactive Plotly dashboards and correlation heatmaps
38
- 8. **Reporting** - Comprehensive HTML reports with YData Profiling
39
-
40
- ### ⚡ **Production Ready**
41
- - **FastAPI Backend** with async support and automatic API documentation
42
- - **Docker Support** with multi-stage builds for optimized deployment
43
- - **Rate Limiting** configured for Gemini API (6.5s intervals for 10 RPM limit)
44
- - **Caching System** for faster repeated queries
45
- - **CORS Enabled** for frontend-backend communication
46
 
47
- ---
48
 
49
- ## 🚀 Quick Start
50
 
51
  ### Prerequisites
52
- - Python 3.10+
53
- - Node.js 18+ (for frontend)
54
- - Google Gemini API key ([Get one here](https://ai.google.dev/))
55
 
56
- ### Installation
57
 
58
- **1. Clone the repository**
59
  ```bash
60
  git clone https://github.com/Pulastya-B/DevSprint-Data-Science-Agent.git
61
  cd DevSprint-Data-Science-Agent
62
  ```
63
 
64
- **2. Set up environment variables**
65
  ```bash
66
- cp .env.example .env
67
- # Edit .env and add your GOOGLE_API_KEY
68
  ```
69
 
70
- **3. Install Python dependencies**
71
  ```bash
72
  pip install -r requirements.txt
73
  ```
74
 
75
  **4. Install frontend dependencies**
76
  ```bash
77
  cd FRRONTEEEND
@@ -163,7 +323,7 @@ The application will be available at **http://localhost:8080**
163
 
164
  ---
165
 
166
- ## 🛠️ Tech Stack
167
 
168
  ### Frontend
169
  - **React 19** - Modern UI library
 
1
+ # AI-Powered Data Science Agent
2
 
3
+ ## Overview
4
 
5
+ The AI-Powered Data Science Agent is an autonomous system that performs end-to-end data science workflows through natural language interaction. It uses Google Gemini 2.5 Flash for reasoning and function calling, combined with a suite of 82+ specialized machine learning tools.
6
+
7
+ Users upload a dataset in CSV or Parquet format and describe their analytical objectives in plain English. The agent then handles the entire pipeline: data profiling, quality assessment, cleaning, feature engineering, model training, hyperparameter optimization, cross-validation, and report generation.
8
+
9
+ Key capabilities include intelligent intent classification, session memory for contextual awareness, error recovery mechanisms, and a modern React-based web interface for seamless user interaction.
10
 
11
  [![React](https://img.shields.io/badge/React-19-61DAFB?logo=react)](https://reactjs.org/)
12
  [![FastAPI](https://img.shields.io/badge/FastAPI-0.109-009688?logo=fastapi)](https://fastapi.tiangolo.com/)
 
15
 
16
  ---
17
 
18
+ ## Key Features
19
 
20
+ ### Autonomous AI Agent System
21
+
22
+ The core orchestration engine integrates Google Gemini 2.5 Flash with 82+ specialized machine learning tools organized into the following categories:
23
+
24
+ - **Data Profiling Tools**: Generate comprehensive statistical summaries, distribution analysis, correlation matrices, data quality reports, and automated anomaly detection
25
+ - **Data Cleaning Tools**: Handle missing values with intelligent imputation strategies (mean, median, mode, forward/backward fill, KNN), outlier detection and treatment using IQR and Z-score methods, duplicate removal, and data type conversions
26
+ - **Feature Engineering Tools**: Create time-based features (hour, day, month, year, cyclical encodings), polynomial features, interaction terms, statistical aggregations, lag features, rolling window statistics, and domain-specific transformations
27
+ - **Model Training Tools**: Support for multiple algorithm families including linear models (Ridge, Lasso, ElasticNet), tree-based models (Random Forest, Gradient Boosting), and advanced gradient boosting frameworks (XGBoost, LightGBM, CatBoost)
28
+ - **Visualization Tools**: Generate interactive Plotly visualizations, Matplotlib static plots, correlation heatmaps, distribution plots, scatter matrices, feature importance charts, and residual analysis plots
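To make the IQR method named under Data Cleaning Tools concrete, here is a minimal sketch using only the Python standard library. The function names `iqr_outlier_bounds` and `flag_outliers` and the default multiplier `k=1.5` are assumptions chosen for this example; they are not taken from the agent's actual tool code.

```python
import statistics


def iqr_outlier_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the values that fall outside the IQR fences."""
    lower, upper = iqr_outlier_bounds(values, k)
    return [v for v in values if v < lower or v > upper]


sample = [10, 12, 11, 13, 12, 11, 95]
print(flag_outliers(sample))
```

A larger `k` makes the fences wider and flags fewer points. Whether flagged values are dropped, capped, or kept is a separate decision that depends on the dataset.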
29
+
30
+ The intelligent orchestration system uses function calling capabilities to dynamically select and execute appropriate tools based on user intent. The agent maintains session memory for contextual awareness across conversation turns, enabling multi-turn dialogues where previous actions and results inform subsequent decisions.
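The tool-selection idea above can be pictured as a registry that maps tool names to Python callables, with the model's function call supplying the name and arguments. The sketch below is a hypothetical illustration: `TOOLS`, `register_tool`, `dispatch`, and the two toy tools are invented for this example and do not reflect the project's real registry or the Gemini SDK.

```python
# Hypothetical registry: tool name -> callable.
TOOLS = {}


def register_tool(name):
    """Decorator that records a function under a tool name."""
    def decorator(func):
        TOOLS[name] = func
        return func
    return decorator


@register_tool("row_count")
def row_count(rows):
    return len(rows)


@register_tool("column_mean")
def column_mean(rows, column):
    return sum(row[column] for row in rows) / len(rows)


def dispatch(call, rows):
    """Run the tool named in `call` (a dict with 'name' and optional 'args')."""
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](rows, **call.get("args", {}))


rows = [{"price": 10.0}, {"price": 20.0}]
print(dispatch({"name": "column_mean", "args": {"column": "price"}}, rows))
```

Keeping the registry separate from the model call means a tool can be tested directly, without an API key.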
31
+
32
+ Smart intent detection automatically classifies incoming requests into categories such as full ML pipeline execution, exploratory data analysis, data cleaning only, visualization generation, or multi-intent tasks requiring combined workflows.
33
+
34
+ Error recovery mechanisms include automatic retry logic with corrected parameters, file existence validation before tool execution, recovery guidance displaying the last successful file state, and loop detection to prevent infinite retry cycles.
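A rough sketch of retry logic with loop detection follows. It assumes a hypothetical `fix_params` callback that proposes corrected parameters after a failure; the names `run_with_recovery`, `load_column`, and `lowercase_column` are illustrative only, and file existence validation is left out for brevity.

```python
def run_with_recovery(tool, params, fix_params, max_attempts=3):
    """Call tool(**params), retrying with corrected parameters on failure.

    A retry whose parameters were already tried counts as a loop and aborts.
    Parameter values must be hashable for the loop check to work.
    """
    seen = set()
    last_error = None
    for _attempt in range(max_attempts):
        signature = tuple(sorted(params.items()))
        if signature in seen:
            raise RuntimeError("loop detected: parameters repeated") from last_error
        seen.add(signature)
        try:
            return tool(**params)
        except Exception as exc:  # broad on purpose: any tool failure triggers recovery
            last_error = exc
            params = fix_params(params, exc)
    raise RuntimeError(f"giving up after {max_attempts} attempts") from last_error


# Hypothetical tool that only accepts the lowercase column name.
def load_column(path, column):
    if column != "price":
        raise KeyError(column)
    return f"{path}:{column}"


# Hypothetical correction step: retry with a lowercased column name.
def lowercase_column(params, exc):
    return {**params, "column": params["column"].lower()}


print(run_with_recovery(load_column, {"path": "data.csv", "column": "Price"}, lowercase_column))
```

Tracking the parameter sets already tried is what stops the cycle: if the correction step proposes something that has already failed, the call aborts instead of retrying forever.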
35
+
36
+ ### Modern Web Interface
37
+
38
+ The frontend is built with React 19 and TypeScript 5.8, featuring a modern glassmorphism design aesthetic with smooth animations powered by Framer Motion. Key interface components include:
39
 
40
+ - **Landing Page**: Geometric hero section with animated background paths, key capabilities showcase, problem-solution presentation, process flow visualization, and technology stack display
41
+ - **Chat Interface**: Real-time message streaming, file upload support for CSV and Parquet formats, markdown rendering for formatted responses with code syntax highlighting, loading states with animated indicators, and error handling with user-friendly messages
42
+ - **Report Viewer**: In-application modal viewer for HTML reports generated by YData Profiling, Sweetviz, and custom dashboard tools. Full-screen modal with professional styling, iframe embedding for report content, and download capabilities
43
+ - **Session Management**: Maintains conversation history across browser sessions, allows users to review previous analyses, and provides context for follow-up questions
44
+
45
+ ### Complete Machine Learning Pipeline
46
+
47
+ The agent executes a comprehensive end-to-end pipeline:
48
+
49
+ 1. **Data Profiling and Assessment**: Automatically generates statistical summaries including descriptive statistics (mean, median, standard deviation, quartiles), distribution analysis with histogram generation, correlation analysis with heatmap visualization, missing value analysis with percentage calculations, data type detection and validation, outlier detection using multiple methods (IQR, Z-score, isolation forest), and cardinality analysis for categorical variables
50
+
51
+ 2. **Data Cleaning and Preprocessing**: Handles missing values with context-aware imputation strategies, removes or treats outliers based on statistical thresholds, performs data type conversions and casting, removes duplicate records, handles inconsistent formatting in categorical variables, and validates data integrity constraints
52
+
53
+ ## Quick Start Guide
54
 
55
  ### Prerequisites
56
 
57
+ Before beginning the installation, ensure your system meets the following requirements:
58
+
59
+ - **Python**: Version 3.10 or higher with pip package manager
60
+ - **Node.js**: Version 18 or higher (for the frontend)
+
+ ### Installation Steps
61
+
62
+ **Step 1: Clone the Repository**
63
+
64
+ Clone the repository from GitHub and navigate to the project directory:
65
 
 
66
  ```bash
67
  git clone https://github.com/Pulastya-B/DevSprint-Data-Science-Agent.git
68
  cd DevSprint-Data-Science-Agent
69
  ```
70
 
71
+ **Step 2: Configure Environment Variables**
72
+
73
+ Create a `.env` file in the root directory with the following configuration:
74
+
75
  ```bash
76
+ # LLM Provider Configuration
77
+ LLM_PROVIDER=gemini
78
+
79
+ # Google Gemini API Key (required)
80
+ GOOGLE_API_KEY=your_api_key_here
81
+
82
+ # Model Configuration
83
+ GEMINI_MODEL=gemini-2.5-flash
84
+
85
+ # Cache Configuration
86
+ CACHE_DB_PATH=./cache_db/cache.db
87
+ CACHE_TTL_SECONDS=86400
88
+
89
+ # Output and Data Directories
90
+ OUTPUT_DIR=./outputs
91
+ DATA_DIR=./data
92
  ```
93
 
94
+ Replace `your_api_key_here` with your actual Google Gemini API key obtained from https://ai.google.dev/
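Before starting the server, it can help to check that the `.env` values are usable. The sketch below is a standard-library illustration, not the project's own loader (the tech stack lists Python-dotenv for that); `parse_env` and `check_settings` are hypothetical helper names, and the checks cover only the keys shown above.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def check_settings(settings):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    api_key = settings.get("GOOGLE_API_KEY", "")
    if not api_key or api_key == "your_api_key_here":
        problems.append("GOOGLE_API_KEY is missing or still the placeholder")
    if not settings.get("CACHE_TTL_SECONDS", "").isdigit():
        problems.append("CACHE_TTL_SECONDS must be a whole number of seconds")
    return problems


example = """
# Google Gemini API Key (required)
GOOGLE_API_KEY=your_api_key_here
CACHE_TTL_SECONDS=86400
"""
for problem in check_settings(parse_env(example)):
    print(problem)
```

For reference, `CACHE_TTL_SECONDS=86400` corresponds to 24 hours (24 × 60 × 60 seconds).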
95
+
96
+ **Step 3: Install Python Dependencies**
97
+
98
+ Install all required Python packages using pip:
99
+
100
  ```bash
101
  pip install -r requirements.txt
102
  ```
103
 
104
+ ## Usage Guide
105
+
106
+ ### Web Interface Workflow
107
+
108
+ **Step 1: Access the Application**
109
+
110
+ Open your web browser and navigate to http://localhost:8080. You will see the landing page with an overview of the agent's capabilities.
111
+
112
+ **Step 2: Launch the Chat Interface**
113
+
114
+ Click the "Launch Agent" button to access the interactive chat interface.
115
+
116
+ **Step 3: Upload Your Dataset**
117
+
118
+ Click the file upload button (paperclip icon) and select your dataset file. Supported formats:
119
+ - CSV files (.csv) with any delimiter (comma, tab, semicolon, etc.)
120
+ - Parquet files (.parquet) for high-performance columnar storage
121
+
122
+ The agent will automatically detect the file format and load the data using appropriate parsers.
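One plausible way to implement this detection is to pick the format from the file extension and guess the CSV delimiter with the standard library's `csv.Sniffer`. This is a sketch under that assumption; `detect_format` and `sniff_delimiter` are hypothetical names, and the agent's real loader may work differently.

```python
import csv
from pathlib import Path


def detect_format(filename):
    """Map a file name to 'csv' or 'parquet' based on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".csv":
        return "csv"
    if suffix == ".parquet":
        return "parquet"
    raise ValueError(f"unsupported file type: {suffix or filename}")


def sniff_delimiter(sample_text, candidates=",;\t|"):
    """Guess the delimiter from the first lines of a CSV file."""
    return csv.Sniffer().sniff(sample_text, delimiters=candidates).delimiter


print(detect_format("earthquakes.CSV"))
print(repr(sniff_delimiter("a;b;c\n1;2;3\n4;5;6\n")))
```

Sniffing is a heuristic: on very short or irregular samples `csv.Sniffer` can raise `csv.Error`, so a real loader would likely fall back to a comma.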
123
+
124
+ **Step 4: Describe Your Task**
125
+
126
+ Type your request in natural language in the chat input box. The agent understands various types of requests and will automatically determine the appropriate workflow.
127
+
128
+ **Step 5: Review Results**
129
+
130
+ The agent will execute the requested workflow and display results in the chat interface. For analyses that generate HTML reports (such as YData Profiling or Sweetviz), a "View Report" button will appear. Click this button to open the report in a full-screen modal viewer.
131
+
132
+ ### Example Queries and Use Cases
133
+
134
+ **Data Profiling and Exploration:**
135
+ ```
136
+ "Generate a comprehensive profile report on this dataset"
137
+ "Show me the statistical summary and distribution of all variables"
138
+ "Analyze data quality issues including missing values and outliers"
139
+ "Create a correlation matrix and identify highly correlated features"
140
+ ```
141
+
142
+ **Data Cleaning:**
143
+ ```
144
+ "Clean the missing values using median imputation for numeric columns"
145
+ "Handle outliers in the dataset using IQR method"
146
+ "Remove duplicate records and fix data type inconsistencies"
147
+ "Drop columns with more than 50% missing values"
148
+ ```
149
+
150
+ **Predictive Modeling:**
151
+ ```
152
+ "Train a model to predict the target column 'price' using all features"
153
+ "Build a classification model for the 'churn' column"
154
+ "Compare multiple regression algorithms and select the best one"
155
+ "Train an XGBoost model with default hyperparameters"
156
+ ```
157
+
158
+ **Feature Engineering:**
159
+ ```
160
+ "Extract time-based features from the datetime column"
161
+ "Create interaction terms between numeric features"
162
+ "Apply target encoding for high-cardinality categorical variables"
163
+ "Generate polynomial features of degree 2"
164
+ ```
165
+
166
+ **Model Optimization:**
167
+ ```
168
+ "Perform hyperparameter tuning on the trained model using Optuna"
169
+ "Run 5-fold cross-validation to evaluate model performance"
170
+ "Optimize the XGBoost model for better accuracy"
171
+ ```
172
+
173
+ **Visualization:**
174
+ ```
175
+ "Generate a correlation heatmap for numeric features"
176
+ "Create distribution plots for all numeric columns"
177
+ "Show feature importance for the trained model"
178
+ "Generate interactive Plotly visualizations"
179
+ ```
180
+
181
+ **End-to-End Pipeline:**
182
+ ```
183
+ "Profile the data, clean it, engineer features, and train the best model"
184
+ "Perform complete analysis and predict the target column 'sales'"
185
+ "Do everything needed to build a production-ready model"
+ ```
+
+ **For Windows (PowerShell):**
+ ```powershell
186
+ .\start.ps1
187
+ ```
188
+
189
+ **For Linux/macOS:**
190
+ ```bash
191
+ chmod +x start.sh
192
+ ./start.sh
193
+ ```
194
+
195
+ The startup script will:
196
+ ## Technology Stack
197
+
198
+ ### Frontend Technologies
199
+
200
+ - **React 19.2.3**: Latest version of React with improved concurrent rendering, automatic batching, and enhanced hooks for building performant user interfaces
201
+ - **TypeScript 5.8.2**: Provides static type checking, enhanced IDE support, and improved code maintainability with advanced type inference
202
+ - **Vite 6.2.0**: Next-generation frontend build tool offering instant server start, lightning-fast hot module replacement (HMR), and optimized production builds
203
+ - **Tailwind CSS 3.4.1**: Utility-first CSS framework enabling rapid UI development with pre-built classes and responsive design utilities
204
+ - **Framer Motion 12.23.26**: Production-ready animation library for React with declarative animations, gestures, and smooth transitions
205
+ - **React Markdown 9.0.1**: Markdown rendering component supporting GitHub-flavored markdown, code syntax highlighting, and custom renderers
206
+ - **Lucide React**: Icon library providing consistent, customizable SVG icons for the user interface
207
+
208
+ ### Backend Technologies
209
+
210
+ - **FastAPI 0.109+**: Modern, high-performance Python web framework with automatic OpenAPI documentation, async/await support, and built-in request validation
211
+ - **Google Gemini 2.5 Flash**: Large language model with advanced reasoning capabilities, function calling support, and high token limits for agent orchestration
212
+ - **Polars 0.20+**: High-performance DataFrame library written in Rust, typically much faster than pandas on large datasets (the actual speedup depends on the workload)
213
+ - **Scikit-learn 1.3+**: Comprehensive machine learning library providing classical algorithms for classification, regression, clustering, and preprocessing
214
+ - **XGBoost 2.0+**: Optimized gradient boosting framework with parallel tree construction, regularization, and efficient handling of sparse data
215
+ - **LightGBM 4.1+**: Gradient boosting framework by Microsoft with leaf-wise tree growth, categorical feature support, and memory efficiency
216
+ - **CatBoost 1.2+**: Gradient boosting library by Yandex with native categorical feature handling, GPU support, and symmetric tree structure
217
+ - **Optuna 3.5+**: Hyperparameter optimization framework with Bayesian optimization, pruning strategies, and distributed optimization support
218
+ - **YData Profiling 4.6+**: Automated exploratory data analysis tool generating comprehensive HTML reports with statistical summaries and data quality insights
219
+ - **Plotly 5.18+**: Interactive visualization library creating web-based charts with zooming, panning, and hover tooltips
220
+ - **Matplotlib 3.8+**: Fundamental plotting library for Python offering publication-quality static visualizations
221
+ - **Pydantic 2.5+**: Data validation library using Python type annotations for request/response models
222
+
223
+ ### Data Processing and Storage
224
+
225
+ - **Polars**: Primary dataframe library for all data manipulation operations
226
+ - **Pandas 2.1+**: Secondary support for compatibility with legacy tools and libraries
227
+ - **SQLite**: Embedded database for caching query results and session management
228
+ - **Python-dotenv**: Environment variable management from .env files
229
+
230
+ ### Development and Deployment
231
+
232
+ - **Docker**: Containerization platform with multi-stage builds for optimized image size and consistent deployment
233
+ - **Uvicorn**: Lightning-fast ASGI server for running FastAPI applications
234
+ - **Git**: Version control system for code management and collaboration
235
  **4. Install frontend dependencies**
236
  ```bash
237
  cd FRRONTEEEND
 
323
 
324
  ---
325
 
326
+ ## Tech Stack
327
 
328
  ### Frontend
329
  - **React 19** - Modern UI library