Pulastya B committed
Commit 09cd93c · 1 parent: f1ab2a8

feat: Migrate to HuggingFace Spaces for 16GB free RAM

Files changed (8)
  1. .spacesrc +8 -0
  2. DEPLOYMENT_HUGGINGFACE.md +204 -0
  3. Dockerfile +30 -21
  4. Dockerfile.render +84 -0
  5. Dockerfile.spaces +93 -0
  6. README.md +82 -953
  7. README_SPACES.md +122 -0
  8. README_original.md +993 -0
.spacesrc ADDED
@@ -0,0 +1,8 @@
+ sdk: docker
+ sdk_version: "3.12"
+ app_file: src/api/app.py
+ app_port: 7860
+ emoji: 🤖
+ colorFrom: blue
+ colorTo: purple
+ pinned: false
DEPLOYMENT_HUGGINGFACE.md ADDED
@@ -0,0 +1,204 @@
+ # Deploying to HuggingFace Spaces 🤗
+
+ This guide shows how to deploy the DevSprint Data Science Agent to HuggingFace Spaces with **16GB of RAM for free**, a good fit for memory-intensive data science workloads.
+
+ ## Why HuggingFace Spaces?
+
+ - ✅ **16GB RAM** (vs Render's 512MB free tier)
+ - ✅ **Completely free** for public Spaces
+ - ✅ **Well suited to ML/AI demos**
+ - ✅ **Persistent storage** for uploaded files
+ - ✅ **Auto-restart** on crashes
+ - ✅ **Built-in secrets management**
+
+ ## Prerequisites
+
+ 1. **HuggingFace account**: Sign up at https://huggingface.co/join
+ 2. **Google Gemini API key**: Get one from https://aistudio.google.com/app/apikey
+ 3. **Git**: Installed locally
+
+ ## Quick Deployment
+
+ ### Step 1: Create a New Space
+
+ 1. Go to https://huggingface.co/new-space
+ 2. Fill in the details:
+    - **Owner**: Your username
+    - **Space name**: `devs-print-data-science-agent` (or any name)
+    - **License**: MIT
+    - **Select the Space SDK**: Docker
+    - **Visibility**: Public (required for the free 16GB RAM)
+
+ 3. Click **Create Space**
+
+ ### Step 2: Set Up the Repository
+
+ After creating the Space, HuggingFace will show you a Git repository URL like:
+ ```
+ https://huggingface.co/spaces/YOUR_USERNAME/devs-print-data-science-agent
+ ```
+
+ ### Step 3: Prepare Files
+
+ **IMPORTANT**: Before pushing, back up the originals and put the HuggingFace versions in place as `Dockerfile` and `README.md`:
+
+ ```powershell
+ # Back up the original files
+ Copy-Item Dockerfile Dockerfile.render
+ Copy-Item README.md README_original.md
+
+ # Use the HuggingFace versions
+ Copy-Item Dockerfile.spaces Dockerfile -Force
+ Copy-Item README_SPACES.md README.md -Force
+ ```
+
+ ### Step 4: Push to HuggingFace
+
+ ```powershell
+ # Add the HuggingFace remote
+ git remote add huggingface https://huggingface.co/spaces/YOUR_USERNAME/devs-print-data-science-agent
+
+ # Push to HuggingFace
+ git push huggingface main
+ ```
+
+ **Note**: Authenticate with your HuggingFace username and an access token (not your password).
+ - Get an access token at https://huggingface.co/settings/tokens
+
+ ### Step 5: Configure Secrets
+
+ 1. Go to your Space: `https://huggingface.co/spaces/YOUR_USERNAME/devs-print-data-science-agent`
+ 2. Click the **Settings** tab
+ 3. Scroll to **Repository secrets**
+ 4. Click **New secret**
+ 5. Add:
+    - **Name**: `GEMINI_API_KEY`
+    - **Value**: Your Google Gemini API key
+ 6. Click **Save**
+
+ ### Step 6: Wait for the Build
+
+ HuggingFace will automatically:
+ 1. Build your Docker container (5-10 minutes)
+ 2. Deploy it to a 16GB RAM instance
+ 3. Show build logs in the **Logs** tab
+ 4. Start your app on port 7860
+
+ Once deployed, your Space will be live at:
+ ```
+ https://YOUR_USERNAME-devs-print-data-science-agent.hf.space
+ ```
+
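+ The `*.hf.space` hostname is derived from the Space owner and name. As a quick sanity check before sharing a link, the mapping can be sketched in a few lines of Python. This is a best-effort sketch, not HuggingFace's actual rule: the `space_url` helper and its lowercase/hyphen normalization are assumptions.
+
+ ```python
+ def space_url(owner: str, space_name: str) -> str:
+     # Assumed normalization: lowercase, with "_" and "." mapped to "-".
+     sub = f"{owner}-{space_name}".lower().replace("_", "-").replace(".", "-")
+     return f"https://{sub}.hf.space"
+
+ # → https://your-username-devs-print-data-science-agent.hf.space
+ print(space_url("YOUR_USERNAME", "devs-print-data-science-agent"))
+ ```
+
+ If the guessed URL does not resolve, the Space's own page shows the canonical embed URL under its settings.
+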
+ ## Dockerfile Changes for Spaces
+
+ The `Dockerfile.spaces` includes these HuggingFace-specific optimizations:
+
+ 1. **Port 7860**: the HuggingFace Spaces standard port
+    ```dockerfile
+    ENV PORT=7860
+    EXPOSE 7860
+    CMD ["uvicorn", "src.api.app:app", "--host", "0.0.0.0", "--port", "7860"]
+    ```
+
+ 2. **Non-root user**: a security requirement
+    ```dockerfile
+    RUN useradd -m -u 1000 user
+    USER user
+    WORKDIR /home/user/app
+    ```
+
+ 3. **User-writable directories**: for uploads and outputs
+    ```dockerfile
+    ENV OUTPUT_DIR=/home/user/app/outputs
+    ENV CACHE_DB_PATH=/home/user/app/cache_db/cache.db
+    ```
+
+ ## README Metadata
+
+ The `README_SPACES.md` includes the YAML frontmatter required by HuggingFace:
+
+ ```yaml
+ ---
+ title: DevSprint Data Science Agent
+ emoji: 🤖
+ colorFrom: blue
+ colorTo: purple
+ sdk: docker
+ pinned: false
+ license: mit
+ app_port: 7860
+ ---
+ ```
+
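+ A malformed frontmatter block is an easy mistake to push, so it can be worth checking locally first. A minimal stdlib sketch, assuming only flat `key: value` pairs; `parse_frontmatter` is a hypothetical helper, not HuggingFace's parser:
+
+ ```python
+ def parse_frontmatter(text: str) -> dict:
+     """Extract key/value pairs from a leading '---'-fenced frontmatter block."""
+     lines = text.splitlines()
+     if not lines or lines[0].strip() != "---":
+         return {}  # no frontmatter block at the top of the file
+     meta = {}
+     for line in lines[1:]:
+         if line.strip() == "---":
+             break  # closing fence reached
+         if ":" in line:
+             key, _, value = line.partition(":")
+             meta[key.strip()] = value.strip()
+     return meta
+
+ readme = """---
+ title: DevSprint Data Science Agent
+ sdk: docker
+ app_port: 7860
+ ---
+ # Rest of the README
+ """
+ print(parse_frontmatter(readme)["app_port"])  # prints: 7860
+ ```
+
+ Note that every value comes back as a string; a real YAML parser would type `app_port` as an integer.
+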
+ ## Troubleshooting
+
+ ### Build Fails
+
+ - Check the **Logs** tab for errors
+ - Common cause: missing dependencies in `requirements.txt`
+ - Fix: add the missing packages and push again
+
+ ### App Crashes on Startup
+
+ - Check that `GEMINI_API_KEY` is set in Repository secrets
+ - Verify that port 7860 is exposed in the Dockerfile
+ - Check the logs for Python import errors
+
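+ A fail-fast check at startup makes the missing-secret case obvious in the **Logs** tab instead of surfacing as a cryptic crash later. A sketch of the pattern; `require_env` is a hypothetical helper, not code from `src/api/app.py`:
+
+ ```python
+ import os
+
+ def require_env(name: str) -> str:
+     """Raise a readable error if a required environment variable is unset."""
+     value = os.environ.get(name)
+     if not value:
+         raise RuntimeError(
+             f"{name} is not set. On HuggingFace Spaces, add it under "
+             "Settings -> Repository secrets, then restart the Space."
+         )
+     return value
+
+ # At startup, e.g. near the top of the FastAPI app module:
+ # GEMINI_API_KEY = require_env("GEMINI_API_KEY")
+ ```
+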
+ ### Memory Issues
+
+ - HuggingFace Spaces provides 16GB of RAM
+ - The existing memory optimization (sampling to 50k rows) fits comfortably within it
+ - For even larger datasets (>100MB), consider increasing the sample size in [src/tools/eda_reports.py](src/tools/eda_reports.py)
+
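+ The 50k-row cap mentioned above boils down to a simple guard before profiling: only downsample when the data exceeds the cap. A stdlib sketch of the idea; the names `sample_rows` and `SAMPLE_ROWS` are illustrative, not the actual code in `src/tools/eda_reports.py`:
+
+ ```python
+ import random
+
+ SAMPLE_ROWS = 50_000  # illustrative cap mirroring the 50k-row sampling above
+
+ def sample_rows(rows, k=SAMPLE_ROWS, seed=0):
+     """Return rows unchanged if small enough, else a deterministic random sample."""
+     if len(rows) <= k:
+         return rows
+     return random.Random(seed).sample(rows, k)
+
+ print(len(sample_rows(list(range(120_000)))))  # prints: 50000
+ ```
+
+ Raising `SAMPLE_ROWS` trades profiling time and memory for fidelity; with 16GB available there is far more headroom than on a 512MB instance.
+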
+ ### File Uploads Not Persisting
+
+ - Files in `/home/user/app/outputs` persist between restarts
+ - Temp files in `/tmp` are ephemeral (cleared on restart)
+ - For production, consider HuggingFace Datasets or external storage
+
+ ## Updating Your Space
+
+ To push updates:
+
+ ```powershell
+ # Make your changes locally
+ git add .
+ git commit -m "Your update message"
+
+ # Push to both GitHub and HuggingFace
+ git push origin main
+ git push huggingface main
+ ```
+
+ HuggingFace will automatically rebuild and redeploy.
+
+ ## Comparison: Render vs HuggingFace Spaces
+
+ | Feature | Render (Free) | HuggingFace Spaces |
+ |---------|---------------|--------------------|
+ | RAM | 512MB | **16GB** ✅ |
+ | CPU | Shared | Shared |
+ | Storage | Ephemeral | Persistent |
+ | Cost | Free | Free (public) |
+ | Build Time | 3-5 min | 5-10 min |
+ | Auto-restart | ✅ | ✅ |
+ | Custom Domain | ❌ | ❌ |
+ | Best For | Simple APIs | **ML/Data Science** ✅ |
+
+ ## Going to Production
+
+ For private Spaces with more resources:
+
+ - **HuggingFace Pro**: $9/mo for private Spaces
+ - **Upgraded Hardware**: up to 32GB of RAM; GPUs available
+ - **Custom domains**: available with Pro
+
+ ## Support
+
+ - **HuggingFace Docs**: https://huggingface.co/docs/hub/spaces-overview
+ - **Community Forum**: https://discuss.huggingface.co/
+ - **Status Page**: https://status.huggingface.co/
+
+ ---
+
+ **Ready to deploy?** Follow the steps above and your agent will be live in 10-15 minutes! 🚀
Dockerfile CHANGED
@@ -1,7 +1,11 @@
+ # ===============================
+ # HuggingFace Spaces Dockerfile
+ # ===============================
+ # Optimized for 16GB RAM, Port 7860
+
  # ===============================
  # Stage 1: Build Frontend
  # ===============================
- # Cache bust: 2025-12-28 fix
  FROM node:20-alpine AS frontend-builder

  WORKDIR /frontend
@@ -52,33 +56,38 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
  COPY --from=builder /opt/venv /opt/venv
  ENV PATH="/opt/venv/bin:$PATH"

+ # Create non-root user for HuggingFace Spaces
+ RUN useradd -m -u 1000 user
+ USER user
+
  # App working directory
- WORKDIR /app
+ WORKDIR /home/user/app

- # Copy backend code
- COPY src/ /app/src/
- COPY examples/ /app/examples/
+ # Copy backend code (as user)
+ COPY --chown=user:user src/ ./src/
+ COPY --chown=user:user examples/ ./examples/

  # Copy frontend build
- COPY --from=frontend-builder /frontend/dist /app/FRRONTEEEND/dist
+ COPY --from=frontend-builder --chown=user:user /frontend/dist ./FRRONTEEEND/dist

- # Cloud Run ephemeral directories
+ # HuggingFace Spaces directories (user-writable)
  RUN mkdir -p \
-     /tmp/data_science_agent \
-     /tmp/outputs/models \
-     /tmp/outputs/plots \
-     /tmp/outputs/reports \
-     /tmp/outputs/data \
-     /tmp/cache_db
+     /home/user/app/data \
+     /home/user/app/outputs/models \
+     /home/user/app/outputs/plots \
+     /home/user/app/outputs/reports \
+     /home/user/app/outputs/data \
+     /home/user/app/cache_db

- # Environment variables
+ # Environment variables for HuggingFace Spaces
  ENV PYTHONUNBUFFERED=1
- ENV PORT=8080
- ENV OUTPUT_DIR=/tmp/outputs
- ENV CACHE_DB_PATH=/tmp/cache_db/cache.db
+ ENV PORT=7860
+ ENV OUTPUT_DIR=/home/user/app/outputs
+ ENV CACHE_DB_PATH=/home/user/app/cache_db/cache.db
  ENV ARTIFACT_BACKEND=local

- EXPOSE 8080
+ # HuggingFace Spaces uses port 7860 by default
+ EXPOSE 7860

- # Start FastAPI
- CMD ["uvicorn", "src.api.app:app", "--host", "0.0.0.0", "--port", "8080"]
+ # Start FastAPI on port 7860
+ CMD ["uvicorn", "src.api.app:app", "--host", "0.0.0.0", "--port", "7860"]
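One way to avoid maintaining two hard-coded CMD lines (8080 for Render/Cloud Run, 7860 for Spaces) in separate Dockerfiles is to resolve the port at startup from the `PORT` environment variable both platforms already set. A sketch under the assumption of a hypothetical Python launcher; the repo's actual Dockerfiles bake the port into CMD instead:

```python
import os

def resolve_port(default: int = 7860) -> int:
    """Prefer the platform-provided PORT env var, falling back to the Spaces default."""
    try:
        return int(os.environ.get("PORT", default))
    except ValueError:
        return default  # non-numeric PORT: fall back rather than crash

# A launcher could then start the app with:
#   uvicorn.run("src.api.app:app", host="0.0.0.0", port=resolve_port())
print(resolve_port())
```

With this, a single image works on both platforms and the ENV/EXPOSE lines become documentation rather than behavior.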
Dockerfile.render ADDED
@@ -0,0 +1,84 @@
+ # ===============================
+ # Stage 1: Build Frontend
+ # ===============================
+ # Cache bust: 2025-12-28 fix
+ FROM node:20-alpine AS frontend-builder
+
+ WORKDIR /frontend
+
+ COPY FRRONTEEEND/package*.json ./
+ RUN npm install
+
+ COPY FRRONTEEEND/ ./
+ RUN npm run build
+
+
+ # ===============================
+ # Stage 2: Build Python environment
+ # ===============================
+ FROM python:3.12-slim AS builder
+
+ # Install build dependencies (needed for ML wheels)
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     gcc \
+     g++ \
+     make \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Create virtual environment
+ RUN python -m venv /opt/venv
+ ENV PATH="/opt/venv/bin:$PATH"
+
+ # Upgrade pip tooling
+ RUN pip install --upgrade pip setuptools wheel
+
+ # Install Python dependencies
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+
+ # ===============================
+ # Stage 3: Runtime environment
+ # ===============================
+ FROM python:3.12-slim
+
+ # Install runtime shared libraries
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     libgomp1 \
+     libstdc++6 \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy virtual environment
+ COPY --from=builder /opt/venv /opt/venv
+ ENV PATH="/opt/venv/bin:$PATH"
+
+ # App working directory
+ WORKDIR /app
+
+ # Copy backend code
+ COPY src/ /app/src/
+ COPY examples/ /app/examples/
+
+ # Copy frontend build
+ COPY --from=frontend-builder /frontend/dist /app/FRRONTEEEND/dist
+
+ # Cloud Run ephemeral directories
+ RUN mkdir -p \
+     /tmp/data_science_agent \
+     /tmp/outputs/models \
+     /tmp/outputs/plots \
+     /tmp/outputs/reports \
+     /tmp/outputs/data \
+     /tmp/cache_db
+
+ # Environment variables
+ ENV PYTHONUNBUFFERED=1
+ ENV PORT=8080
+ ENV OUTPUT_DIR=/tmp/outputs
+ ENV CACHE_DB_PATH=/tmp/cache_db/cache.db
+ ENV ARTIFACT_BACKEND=local
+
+ EXPOSE 8080
+
+ # Start FastAPI
+ CMD ["uvicorn", "src.api.app:app", "--host", "0.0.0.0", "--port", "8080"]
Dockerfile.spaces ADDED
@@ -0,0 +1,93 @@
+ # ===============================
+ # HuggingFace Spaces Dockerfile
+ # ===============================
+ # Optimized for 16GB RAM, Port 7860
+
+ # ===============================
+ # Stage 1: Build Frontend
+ # ===============================
+ FROM node:20-alpine AS frontend-builder
+
+ WORKDIR /frontend
+
+ COPY FRRONTEEEND/package*.json ./
+ RUN npm install
+
+ COPY FRRONTEEEND/ ./
+ RUN npm run build
+
+
+ # ===============================
+ # Stage 2: Build Python environment
+ # ===============================
+ FROM python:3.12-slim AS builder
+
+ # Install build dependencies (needed for ML wheels)
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     gcc \
+     g++ \
+     make \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Create virtual environment
+ RUN python -m venv /opt/venv
+ ENV PATH="/opt/venv/bin:$PATH"
+
+ # Upgrade pip tooling
+ RUN pip install --upgrade pip setuptools wheel
+
+ # Install Python dependencies
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+
+ # ===============================
+ # Stage 3: Runtime environment
+ # ===============================
+ FROM python:3.12-slim
+
+ # Install runtime shared libraries
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     libgomp1 \
+     libstdc++6 \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy virtual environment
+ COPY --from=builder /opt/venv /opt/venv
+ ENV PATH="/opt/venv/bin:$PATH"
+
+ # Create non-root user for HuggingFace Spaces
+ RUN useradd -m -u 1000 user
+ USER user
+
+ # App working directory
+ WORKDIR /home/user/app
+
+ # Copy backend code (as user)
+ COPY --chown=user:user src/ ./src/
+ COPY --chown=user:user examples/ ./examples/
+
+ # Copy frontend build
+ COPY --from=frontend-builder --chown=user:user /frontend/dist ./FRRONTEEEND/dist
+
+ # HuggingFace Spaces directories (user-writable)
+ RUN mkdir -p \
+     /home/user/app/data \
+     /home/user/app/outputs/models \
+     /home/user/app/outputs/plots \
+     /home/user/app/outputs/reports \
+     /home/user/app/outputs/data \
+     /home/user/app/cache_db
+
+ # Environment variables for HuggingFace Spaces
+ ENV PYTHONUNBUFFERED=1
+ ENV PORT=7860
+ ENV OUTPUT_DIR=/home/user/app/outputs
+ ENV CACHE_DB_PATH=/home/user/app/cache_db/cache.db
+ ENV ARTIFACT_BACKEND=local
+
+ # HuggingFace Spaces uses port 7860 by default
+ EXPOSE 7860
+
+ # Start FastAPI on port 7860
+ CMD ["uvicorn", "src.api.app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,993 +1,122 @@
1
- # AI-Powered Data Science Agent
2
-
3
- ## Overview
4
-
5
- The AI-Powered Data Science Agent is an intelligent autonomous system designed to perform complete end-to-end data science workflows through natural language interaction. This agent leverages Google Gemini 2.5 Flash for advanced reasoning and function calling capabilities, combined with a comprehensive suite of over 82 specialized machine learning tools.
6
-
7
- The system enables users to upload datasets in CSV or Parquet format and describe their analytical objectives in plain English. The agent autonomously handles the entire pipeline including data profiling, quality assessment, cleaning, feature engineering, model training, hyperparameter optimization, cross-validation, and comprehensive reporting generation.
8
-
9
- Key capabilities include intelligent intent classification, session memory for contextual awareness, error recovery mechanisms, and a modern React-based web interface for seamless user interaction.
10
-
11
- [![React](https://img.shields.io/badge/React-19-61DAFB?logo=react)](https://reactjs.org/)
12
- [![FastAPI](https://img.shields.io/badge/FastAPI-0.109-009688?logo=fastapi)](https://fastapi.tiangolo.com/)
13
- [![Gemini](https://img.shields.io/badge/Gemini-2.5_Flash-4285F4?logo=google)](https://ai.google.dev/)
14
- [![Python](https://img.shields.io/badge/Python-3.10+-3776AB?logo=python)](https://python.org/)
15
-
16
  ---
17
 
18
- ## Key Features
19
-
20
- ### Autonomous AI Agent System
21
-
22
- The core orchestration engine integrates Google Gemini 2.5 Flash with over 82 specialized machine learning tools organized across multiple categories:
23
-
24
- - **Data Profiling Tools**: Generate comprehensive statistical summaries, distribution analysis, correlation matrices, data quality reports, and automated anomaly detection
25
- - **Data Cleaning Tools**: Handle missing values with intelligent imputation strategies (mean, median, mode, forward/backward fill, KNN), outlier detection and treatment using IQR and Z-score methods, duplicate removal, and data type conversions
26
- - **Feature Engineering Tools**: Create time-based features (hour, day, month, year, cyclical encodings), polynomial features, interaction terms, statistical aggregations, lag features, rolling window statistics, and domain-specific transformations
27
- - **Model Training Tools**: Support for multiple algorithm families including linear models (Ridge, Lasso, ElasticNet), tree-based models (Random Forest, Gradient Boosting), and advanced gradient boosting frameworks (XGBoost, LightGBM, CatBoost)
28
- - **Visualization Tools**: Generate interactive Plotly visualizations, Matplotlib static plots, correlation heatmaps, distribution plots, scatter matrices, feature importance charts, and residual analysis plots
29
-
30
- The intelligent orchestration system uses function calling capabilities to dynamically select and execute appropriate tools based on user intent. The agent maintains session memory for contextual awareness across conversation turns, enabling multi-turn dialogues where previous actions and results inform subsequent decisions.
31
-
32
- Smart intent detection automatically classifies incoming requests into categories such as full ML pipeline execution, exploratory data analysis, data cleaning only, visualization generation, or multi-intent tasks requiring combined workflows.
33
-
34
- Error recovery mechanisms include automatic retry logic with corrected parameters, file existence validation before tool execution, recovery guidance displaying the last successful file state, and loop detection to prevent infinite retry cycles.
35
-
36
- ### Modern Web Interface
37
-
38
- The frontend is built with React 19 and TypeScript 5.8, featuring a modern glassmorphism design aesthetic with smooth animations powered by Framer Motion. Key interface components include:
39
-
40
- - **Landing Page**: Geometric hero section with animated background paths, key capabilities showcase, problem-solution presentation, process flow visualization, and technology stack display
41
- - **Chat Interface**: Real-time message streaming, file upload support for CSV and Parquet formats, markdown rendering for formatted responses with code syntax highlighting, loading states with animated indicators, and error handling with user-friendly messages
42
- - **Report Viewer**: In-application modal viewer for HTML reports generated by YData Profiling and custom dashboard tools. Full-screen modal with professional styling, iframe embedding for report content, and download capabilities
43
- - **Session Management**: Maintains conversation history across browser sessions, allows users to review previous analyses, and provides context for follow-up questions
44
-
45
- ### Complete Machine Learning Pipeline
46
-
47
- The agent executes a comprehensive end-to-end pipeline:
48
-
49
- 1. **Data Profiling and Assessment**: Automatically generates statistical summaries including descriptive statistics (mean, median, standard deviation, quartiles), distribution analysis with histogram generation, correlation analysis with heatmap visualization, missing value analysis with percentage calculations, data type detection and validation, outlier detection using multiple methods (IQR, Z-score, isolation forest), and cardinality analysis for categorical variables
50
-
51
- 2. **Data Cleaning and Preprocessing**: Handles missing values with context-aware imputation strategies, removes or treats outliers based on statistical thresholds, performs data type conversions and casting, removes duplicate records, handles inconsistent formatting in categorical variables, and validates data integrity constraints
52
-
53
- 3. Quick Start Guide
54
-
55
- ### Prerequisites
56
-
57
- Before beginning the installation, ensure your system meets the following requirements:
58
-
59
- - **Python**: Version 3.10 or higher with pip package manager
60
- - **Node.js**: V Steps
61
-
62
- **Step 1: Clone the Repository**
63
-
64
- Clone the repository from GitHub and navigate to the project directory:
65
-
66
- ```bash
67
- git clone https://github.com/Pulastya-B/DevSprint-Data-Science-Agent.git
68
- cd DevSprint-Data-Science-Agent
69
- ```
70
-
71
- **Step 2: Configure Environment Variables**
72
-
73
- Create a `.env` file in the root directory with the following configuration:
74
-
75
- ```bash
76
- # LLM Provider Configuration
77
- LLM_PROVIDER=gemini
78
-
79
- # Google Gemini API Key (required)
80
- GOOGLE_API_KEY=your_api_key_here
81
-
82
- # Model Configuration
83
- GEMINI_MODEL=gemini-2.5-flash
84
-
85
- # Cache Configuration
86
- CACHE_DB_PATH=./cache_db/cache.db
87
- CACHE_TTL_SECONDS=86400
88
-
89
- # Output and Data Directories
90
- OUTPUT_DIR=./outputs
91
- DATA_DIR=./data
92
- ```
93
-
94
- Replace `your_api_key_here` with your actual Google Gemini API key obtained from https://ai.google.dev/
95
-
96
- **Step 3: Install Python Dependencies**
97
-
98
- Install all required Python packages using pip:
99
-
100
- ```bash
101
- pip install -r requirements.txt
102
- ```
103
-
104
- ThiUsage Guide
105
-
106
- ### Web Interface Workflow
107
-
108
- **Step 1: Access the Application**
109
-
110
- Open your web browser and navigate to http://localhost:8080. You will see the landing page with an overview of the agent's capabilities.
111
-
112
- **Step 2: Launch the Chat Interface**
113
-
114
- Click the "Launch Agent" button to access the interactive chat interface.
115
-
116
- **Step 3: Upload Your Dataset**
117
-
118
- Click the file upload button (paperclip icon) and select your dataset file. Supported formats:
119
- - CSV files (.csv) with any delimiter (comma, tab, semicolon, etc.)
120
- - Parquet files (.parquet) for high-performance columnar storage
121
-
122
- The agent will automatically detect the file format and load the data using appropriate parsers.
123
-
124
- **Step 4: Describe Your Task**
125
-
126
- Type your request in natural language in the chat input box. The agent understands various types of requests and will automatically determine the appropriate workflow.
127
-
128
- **Step 5: Review Results**
129
-
130
- The agent will execute the requested workflow and display results in the chat interface. For analyses that generate HTML reports (such as YData Profiling), a "View Report" button will appear. Click this button to open the report in a full-screen modal viewer.
131
-
132
- ### Example Queries and Use Cases
133
-
134
- **Data Profiling and Exploration:**
135
- ```
136
- "Generate a comprehensive profile report on this dataset"
137
- "Show me the statistical summary and distribution of all variables"
138
- "Analyze data quality issues including missing values and outliers"
139
- "Create a correlation matrix and identify highly correlated features"
140
- ```
141
-
142
- **Data Cleaning:**
143
- ```
144
- "Clean the missing values using median imputation for numeric columns"
145
- "Handle outliers in the dataset using IQR method"
146
- "Remove duplicate records and fix data type inconsistencies"
147
- "Drop columns with more than 50% missing values"
148
- ```
149
-
150
- **Predictive Modeling:**
151
- ```
152
- "Train a model to predict the target column 'price' using all features"
153
- "Build a classification model for the 'churn' column"
154
- "Compare multiple regression algorithms and select the best one"
155
- "Train an XGBoost model with default hyperparameters"
156
- ```
157
-
158
- **Feature Engineering:**
159
- ```
160
- "Extract time-based features from the datetime column"
161
- "Create interaction terms between numeric features"
162
- "Apply target encoding for high-cardinality categorical variables"
163
- "Generate polynomial features of degree 2"
164
- ```
165
-
166
- **Model Optimization:**
167
- ```
168
- "Perform hyperparameter tuning on the trained model using Optuna"
169
- "Run 5-fold cross-validation to evaluate model performance"
170
- "Optimize the XGBoost model for better accuracy"
171
- ```
172
 
173
- **Visualization:**
174
- ```
175
- "Generate a correlation heatmap for numeric features"
176
- "Create distribution plots for all numeric columns"
177
- "Show feature importance for the trained model"
178
- "Generate interactive Plotly visualizations"
179
- ```
180
 
181
- **End-to-End Pipeline:**
182
- ```
183
- "Profile the data, clean it, engineer features, and train the best model"
184
- "Perform complete analysis and predict the target column 'sales'"
185
- "Do everything needed to build a production-ready model
186
- .\start.ps1
187
- ```
188
 
189
- **For Linux/macOS:**
190
- ```bash
191
- chmod +x start.sh
192
- ./start.sh
193
- ```
 
 
194
 
195
- The startup script will:
196
- 1. Technology Stack
197
 
198
- ### Frontend Technologies
 
 
 
 
199
 
200
- - **React 19.2.3**: Latest version of React with improved concurrent rendering, automatic batching, and enhanced hooks for building performant user interfaces
201
- - **TypeScript 5.8.2**: Provides static type checking, enhanced IDE support, and improved code maintainability with advanced type inference
202
- - **Vite 6.2.0**: Next-generation frontend build tool offering instant server start, lightning-fast hot module replacement (HMR), and optimized production builds
203
- - **Tailwind CSS 3.4.1**: Utility-first CSS framework enabling rapid UI development with pre-built classes and responsive design utilities
204
- - **Framer Motion 12.23.26**: Production-ready animation library for React with declarative animations, gestures, and smooth transitions
205
- - **React Markdown 9.0.1**: Markdown rendering component supporting GitHub-flavored markdown, code syntax highlighting, and custom renderers
206
- - **Lucide React**: Icon library providing consistent, customizable SVG icons for the user interface
207
 
208
- ### Backend Technologies
 
 
 
209
 
210
- - **FastAPI 0.109+**: Modern, high-performance Python web framework with automatic OpenAPI documentation, async/await support, and built-in request validation
211
- - **Google Gemini 2.5 Flash**: Large language model with advanced reasoning capabilities, function calling support, and high token limits for agent orchestration
212
- - **Polars 0.20+**: High-performance DataFrame library written in Rust, offering 10-100x speed improvements over pandas for large datasets
213
- - **Scikit-learn 1.3+**: Comprehensive machine learning library providing classical algorithms for classification, regression, clustering, and preprocessing
214
- - **XGBoost 2.0+**: Optimized gradient boosting framework with parallel tree construction, regularization, and efficient handling of sparse data
215
- - **LightGBM 4.1+**: Gradient boosting framework by Microsoft with leaf-wise tree growth, categorical feature support, and memory efficiency
216
- - **CatBoost 1.2+**: Gradient boosting library by Yandex with native categorical feature handling, GPU support, and symmetric tree structure
217
- - **Optuna 3.5+**: Hyperparameter optimization framework with Bayesian optimization, pruning strategies, and distributed optimization support
218
- - **YData Profiling 4.6+**: Automated exploratory data analysis tool generating comprehensive HTML reports with statistical summaries and data quality insights
219
- - **Plotly 5.18+**: Interactive visualization library creating web-based charts with zooming, panning, and hover tooltips
220
- - **Matplotlib 3.8+**: Fundamental plotting library for Python offering publication-quality static visualizations
221
- - **Pydantic 2.5+**: Data validation library using Python type annotations for request/response models
222
 
223
- ###Docker Deployment
 
 
 
224
 
225
- The application includes a multi-stage Dockerfile for optimized containerized deployment.
226
 
227
- ### Building the Docker Image
228
 
229
- Build the Docker image with the following command:
230
 
231
- ```bash
232
- docker build -t ds-agent:latest .
233
  ```
234
-
235
- The multi-stage build process:
236
- 1. **Stage 1 (Builder)**: Installs Node.js dependencies and builds the React frontend
237
- 2. **Stage 2 (Runtime)**: Sets up Python environment, installs backend dependencies, and copies built frontend
238
- 3. Result: Optimized image size by excluding development dependencies and build tools
239
-
240
- ### Running the Container
241
-
242
- Run the containerized application:
243
-
244
- ```bash
245
- docker run -d \
246
- -p 8080:8080 \
247
- --env-file .env \
248
- --name ds-agent-container \
249
- ds-agent:latest
250
  ```
251
 
252
- Parameters explained:
253
- - `-d`: Run container in detached mode (background)
254
- - `-p 8080:8080`: Map container port 8080 to host port 8080
255
- - `--env-file .env`: Load environment variables from .env file
256
- - `--name ds-agent-container`: Assign a name to the container for easy management
257
-
258
- ### Docker Compose (Recommended)
259
-
260
- For easier management, create a `docker-compose.yml` file:
261
 
262
- ```yaml
263
- version: '3.8'
264
-
265
- services:
266
- ds-agent:
267
- build: .
268
- container_name: ds-agent
269
- ports:
270
- - "8080:8080"
271
- env_file:
272
- - .env
273
- volumes:
274
- Environment Configuration
275
-
276
- The application uses environment variables for configuration management. Create a `.env` file in the project root directory with the following variables:
277
-
278
- ### Required Configuration
279
 
```bash
# LLM Provider Selection
LLM_PROVIDER=gemini
# Options: gemini (currently supported)

# Google Gemini API Key (REQUIRED)
GOOGLE_API_KEY=your_api_key_here
# Obtain from: https://ai.google.dev/
# Free tier limits: 10 RPM, 20 RPD

# Gemini Model Selection
GEMINI_MODEL=gemini-2.5-flash
# Options:
# - gemini-2.5-flash (recommended, balanced performance)
# - gemini-1.5-pro (higher capability, lower rate limits)
# - gemini-1.5-flash (faster, lower cost)
```

### Optional Configuration

## Advanced Features

### Intelligent Intent Detection and Classification

The orchestration system employs intent detection to automatically classify user requests and route them to the appropriate workflow pipeline. The classifier analyzes incoming natural language queries using keyword matching, pattern recognition, and contextual understanding.

**Intent Categories:**

1. **Full ML Pipeline Intent**: Triggered by keywords such as "train", "model", "predict", "machine learning", "regression", "classification". Executes the complete workflow including data profiling, cleaning, feature engineering, model training, hyperparameter tuning, and evaluation.

2. **Exploratory Analysis Intent**: Activated by keywords like "explore", "profile", "report", "analysis", "overview", "insights", "understand". Performs comprehensive data profiling with statistical summaries, distribution analysis, correlation matrices, and automated insight generation.

3. **Data Cleaning Intent**: Detected via keywords such as "clean", "missing", "outliers", "duplicates", "impute", "handle". Focuses on data quality improvement operations without proceeding to modeling.

4. **Visualization Intent**: Identified through keywords like "plot", "visualize", "chart", "graph", "heatmap", "distribution". Generates the requested visualizations without performing modeling or extensive preprocessing.

5. **Feature Engineering Intent**: Recognized by keywords such as "feature", "engineer", "create features", "transform", "encode". Applies feature transformation and creation operations.

6. **Multi-Intent Workflows**: The system can detect and handle requests combining multiple intents, executing them in a logical sequence.

The intent classifier uses confidence scoring to handle ambiguous requests and can ask clarifying questions when intent is unclear.
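The keyword categories above can be sketched as a toy scorer. The intent names and keyword sets here are paraphrased from the list above; this is an illustration, not the orchestrator's actual implementation:

```python
# Toy keyword-based intent scorer (illustrative sketch; intent names
# and keyword sets are assumptions paraphrased from the categories above).
INTENT_KEYWORDS = {
    "ml_pipeline": {"train", "model", "predict", "regression", "classification"},
    "eda": {"explore", "profile", "report", "analysis", "overview", "insights"},
    "cleaning": {"clean", "missing", "outliers", "duplicates", "impute"},
    "visualization": {"plot", "visualize", "chart", "graph", "heatmap"},
    "feature_engineering": {"feature", "engineer", "transform", "encode"},
}

def classify_intents(query: str, threshold: int = 1) -> list[str]:
    """Return every intent whose keyword hits meet the threshold,
    ranked by score (supports multi-intent requests)."""
    words = set(query.lower().split())
    scores = {
        intent: len(words & keywords)
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [intent for intent, score in ranked if score >= threshold]

print(classify_intents("clean the missing values then train a model"))
```

A query hitting two categories returns both, which is how a multi-intent request would fall out of this scheme.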

### Context-Aware Session Memory

The agent implements persistent session memory that maintains conversation context across multiple turns. This enables natural multi-turn dialogues where subsequent requests can reference previous operations without repeating the full context.

**Session Memory Capabilities:**

- **Workflow History**: Stores the complete history of executed tools, parameters, and results for the current session
- **File State Tracking**: Maintains references to uploaded files, intermediate processed datasets, and generated outputs
- **Model Persistence**: Remembers trained models and their performance metrics for comparison and further tuning
- **Error Context**: Stores information about encountered errors to avoid repeating failed operations
- **User Preferences**: Learns from user choices (e.g., preferred visualization types, imputation strategies)

## Complete Workflow Example

This section demonstrates a complete end-to-end workflow for a real-world dataset, showing the agent's autonomous decision-making and execution capabilities.

### Dataset: Earthquake Magnitude Prediction

**Input Dataset:** `earthquake_data.csv`
- Rows: 175,947 earthquake records
- Columns: 22 features including latitude, longitude, depth, time, location, and magnitude
- Target Variable: Earthquake magnitude (continuous regression task)
- Data Quality: 11.67% missing values, presence of outliers, mixed data types

**User Prompt:**
```
"Train a model to predict earthquake magnitude with the highest possible accuracy"
```

### Automated Workflow Execution

**Phase 1: Data Profiling and Assessment** (Step 1)
- Tool: `generate_ydata_profile`
- Action: Comprehensive statistical analysis of all 22 features
- Findings:
  - Total records: 175,947
  - Missing values detected in 8 columns
  - Outliers present in depth, latitude, longitude
  - High cardinality in location column (15,000+ unique values)
  - Strong correlation between depth and magnitude (r=0.62)
- Output: YData Profiling HTML report saved to `outputs/earthquake_profile.html`
- Time: 18.3 seconds

## API Reference

The FastAPI backend exposes several endpoints for programmatic interaction.

### Endpoints

**POST /chat**
- Description: Send a message to the agent with optional file upload
- Content-Type: multipart/form-data
- Parameters:
  - message (string, required): User's natural language request
  - file (file, optional): Dataset file (CSV or Parquet)
- Response: JSON with the agent's response message and workflow history
- Example:
```bash
curl -X POST http://localhost:8080/chat \
  -F "message=Generate a data profile report" \
  -F "file=@dataset.csv"
```
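The same request can be made from Python. This sketch assumes the server is running locally on port 8080 and that the `requests` package is installed:

```python
# Calling POST /chat from Python instead of curl (sketch; assumes a
# local server on port 8080 and the third-party `requests` package).
import requests

def chat(message, file_path=None, base_url="http://localhost:8080"):
    """Send a message, optionally attaching a dataset as multipart data."""
    data = {"message": message}
    if file_path:
        with open(file_path, "rb") as f:
            resp = requests.post(f"{base_url}/chat", data=data, files={"file": f})
    else:
        resp = requests.post(f"{base_url}/chat", data=data)
    resp.raise_for_status()
    return resp.json()

# Example (with the server running):
# result = chat("Generate a data profile report", "dataset.csv")
# print(result["message"])
```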

**POST /run**
- Description: Execute a complete analysis workflow
- Content-Type: application/json
- Parameters:
  - query (string, required): Analysis request
  - use_cache (boolean, optional): Enable caching (default: true)
- Response: JSON with analysis results and generated artifacts
- Example:
```json
{
  "query": "Train a regression model to predict sales",
  "use_cache": true
}
```

**GET /outputs/{file_path}**
- Description: Retrieve generated reports and artifacts
- Parameters:
  - file_path (string, required): Path to the output file
- Response: File content (HTML, PNG, CSV, etc.)
- Example:
```bash
curl http://localhost:8080/outputs/ydata_profile.html
```

**GET /api/health**
- Description: Health check endpoint
- Response: JSON with status information
- Example response:
```json
{
  "status": "healthy",
  "version": "1.0.0",
  "timestamp": "2025-12-27T10:30:00Z"
}
```

### Interactive API Documentation

FastAPI automatically generates interactive API documentation:
- Swagger UI: http://localhost:8080/docs
- ReDoc: http://localhost:8080/redoc

## Contributing

Contributions to improve the AI-Powered Data Science Agent are welcome. Please follow these guidelines:

### Development Setup

1. Fork the repository and clone your fork
2. Create a new branch for your feature: `git checkout -b feature/your-feature-name`
3. Install development dependencies: `pip install -r requirements-dev.txt`
4. Make your changes with appropriate tests
5. Ensure all tests pass: `pytest tests/`
6. Format code with black: `black src/`
7. Lint code with flake8: `flake8 src/`
8. Commit with descriptive messages
9. Push to your fork and submit a pull request

### Code Style

- Follow PEP 8 guidelines for Python code
- Use type hints for function parameters and return values
- Write docstrings for all functions and classes
- Keep functions focused and under 50 lines when possible
- Use meaningful variable names

### Testing

- Write unit tests for new features
- Ensure existing tests pass before submitting a PR
- Aim for >80% code coverage

## License

This project is licensed under the MIT License. See the LICENSE file for complete terms.

Copyright (c) 2025 Pulastya B

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

## Acknowledgments

This project builds upon several excellent open-source technologies and frameworks:

- **Google Gemini 2.5 Flash**: Advanced language model with function calling capabilities enabling intelligent agent orchestration
- **FastAPI**: Modern, high-performance web framework for building APIs with Python, providing automatic documentation and validation
- **React**: JavaScript library for building user interfaces, enabling component-based architecture and efficient rendering
- **Polars**: High-performance DataFrame library written in Rust, offering significant speed improvements over traditional data processing libraries
- **Scikit-learn**: Machine learning library providing simple and efficient tools for data analysis and modeling
- **XGBoost, LightGBM, CatBoost**: Gradient boosting frameworks offering state-of-the-art performance for structured data
- **Optuna**: Hyperparameter optimization framework with efficient search algorithms
- **YData Profiling**: Automated exploratory data analysis tool generating comprehensive reports
- **Plotly**: Interactive visualization library for creating publication-quality graphs
- **TypeScript**: Typed superset of JavaScript enhancing code quality and developer experience
- **Tailwind CSS**: Utility-first CSS framework for rapid UI development
- **Vite**: Next-generation frontend build tool with instant server start

Special thanks to the open-source community for creating and maintaining these exceptional tools.

## Contact and Support

**Developer:** Pulastya B

**GitHub Profile:** [@Pulastya-B](https://github.com/Pulastya-B)

**Project Repository:** [DevSprint-Data-Science-Agent](https://github.com/Pulastya-B/DevSprint-Data-Science-Agent)

**Issues and Bug Reports:** Please use the GitHub Issues page to report bugs or request features

**Documentation:** Additional documentation and tutorials are available in the repository wiki

**Project Status:** Active development, built for the DevSprint Hackathon

For questions, suggestions, or collaboration opportunities, please open an issue on GitHub or contact through the repository.

---

**Last Updated:** December 27, 2025

**Version:** 1.0.0

Step 6 - Temporal Feature Extraction:
- Tool: `extract_time_features`
- Input column: 'timestamp'
- Features created:
  - year, month, day_of_week, hour
  - Cyclical encodings: hour_sin, hour_cos, month_sin, month_cos
- Justification: Earthquakes may have temporal patterns
- New columns: 8 time-based features
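The cyclical encodings listed in this step follow the standard sine/cosine construction, which maps a periodic value onto the unit circle so that the end of a cycle sits next to its beginning. A minimal standalone sketch (the tool's internals are not shown in this README):

```python
# Sine/cosine encoding of a cyclic value such as hour-of-day. Places
# 23:00 and 00:00 close together in feature space, unlike raw |23 - 0|.
import math

def cyclical_encode(value: int, period: int) -> tuple[float, float]:
    """Map a cyclic value (e.g. hour 0-23 with period 24) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

hour_sin, hour_cos = cyclical_encode(23, 24)  # just before midnight
mid_sin, mid_cos = cyclical_encode(0, 24)     # midnight
distance = math.dist((hour_sin, hour_cos), (mid_sin, mid_cos))
print(round(distance, 3))  # small, even though 23 and 0 are far apart numerically
```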

Step 7 - Categorical Encoding:
- Tool: `encode_categorical_features`
- Method: Target encoding for 'location' (high cardinality), one-hot encoding for 'type'
- Result: All categorical variables converted to numeric
- New columns: 3 (reduced from high-cardinality location)

Step 8 - Statistical Features:
- Tool: `create_statistical_features`
- Features created:
  - Distance from nearest plate boundary (calculated from lat/lon)
  - Depth-to-magnitude ratio
  - Regional earthquake frequency (rolling count)
- New columns: 3 domain-specific features

Final feature count: 28 engineered features

**Phase 5: Model Training and Selection** (Step 9)
- Tool: `train_baseline_models`
- Algorithms trained in parallel:

1. Ridge Regression: RΒ² = 0.534, RMSE = 0.312
2. Lasso Regression: RΒ² = 0.541, RMSE = 0.309
3. ElasticNet: RΒ² = 0.538, RMSE = 0.311
4. Random Forest: RΒ² = 0.698, RMSE = 0.251
5. XGBoost: RΒ² = 0.716, RMSE = 0.243 (BEST)
6. LightGBM: RΒ² = 0.709, RMSE = 0.247
7. CatBoost: RΒ² = 0.712, RMSE = 0.245

- Best model selected: XGBoost
- Validation split: 80/20 stratified split
- Time: 124.7 seconds

**Phase 6: Hyperparameter Optimization** (Step 10)
- Tool: `optimize_hyperparameters_optuna`
- Framework: Optuna with Tree-structured Parzen Estimator (TPE)
- Search space:
  - max_depth: [3, 10]
  - learning_rate: [0.001, 0.3] (log scale)
  - n_estimators: [100, 1000]
  - min_child_weight: [1, 10]
  - subsample: [0.6, 1.0]
  - colsample_bytree: [0.6, 1.0]
- Trials: 50 iterations
- Best parameters found:
  - max_depth: 7
  - learning_rate: 0.0847
  - n_estimators: 673
  - min_child_weight: 3
  - subsample: 0.8234
  - colsample_bytree: 0.9123
- Optimized performance: RΒ² = 0.743, RMSE = 0.231
- Improvement: +3.8% RΒ² over baseline
- Time: 312.4 seconds
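For illustration, the search space above can be sampled as follows. This sketch substitutes plain random search and a placeholder objective for Optuna's TPE sampler so that it runs without dependencies; the bounds mirror the list above, everything else is an assumption:

```python
# Random-search stand-in for the Optuna study described above.
# The real tool uses Optuna's TPE sampler and evaluates validation RMSE.
import math
import random

random.seed(0)

def sample_params():
    return {
        "max_depth": random.randint(3, 10),
        # log-uniform over [0.001, 0.3], matching the log-scale bound
        "learning_rate": math.exp(random.uniform(math.log(0.001), math.log(0.3))),
        "n_estimators": random.randint(100, 1000),
        "min_child_weight": random.randint(1, 10),
        "subsample": random.uniform(0.6, 1.0),
        "colsample_bytree": random.uniform(0.6, 1.0),
    }

def objective(params):
    # Placeholder objective; the agent would score a trained model here.
    return abs(params["max_depth"] - 7) + abs(params["learning_rate"] - 0.08)

best = min((sample_params() for _ in range(50)), key=objective)
print(best["max_depth"], round(best["learning_rate"], 4))
```

In Optuna the same space would be expressed with `trial.suggest_int` / `trial.suggest_float(..., log=True)` inside the objective function.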

**Phase 7: Model Validation** (Step 11)
- Tool: `cross_validate_model`
- Method: 5-fold stratified cross-validation
- Results:
  - Fold 1: RΒ² = 0.741, RMSE = 0.232
  - Fold 2: RΒ² = 0.745, RMSE = 0.230
  - Fold 3: RΒ² = 0.738, RMSE = 0.234
  - Fold 4: RΒ² = 0.747, RMSE = 0.229
  - Fold 5: RΒ² = 0.742, RMSE = 0.232
- Mean performance: RΒ² = 0.743 Β± 0.003, RMSE = 0.231 Β± 0.002
- Interpretation: Low variance across folds indicates a robust, generalizable model
- Time: 267.8 seconds

**Phase 8: Visualization and Reporting** (Steps 12-13)

Step 12 - Feature Importance Analysis:
- Tool: `plot_feature_importance`
- Top 10 features by importance:
  1. depth (0.284)
  2. distance_to_plate_boundary (0.167)
  3. latitude (0.142)
  4. longitude (0.138)
  5. regional_frequency (0.095)
  6. depth_magnitude_ratio (0.067)
  7. hour_sin (0.034)
  8. month (0.028)
  9. location_encoded (0.024)
  10. year (0.021)
- Output: Interactive Plotly bar chart saved to `outputs/feature_importance.html`

Step 13 - Comprehensive Dashboard:
- Tool: `create_plotly_dashboard`
- Visualizations included:
  - Correlation heatmap (28x28 features)
  - Actual vs Predicted scatter plot
  - Residual distribution plot
  - Feature importance ranking
  - Temporal patterns in predictions
- Output: Multi-panel interactive dashboard saved to `outputs/model_dashboard.html`

### Final Results Summary

**Model Performance:**
- Algorithm: XGBoost with optimized hyperparameters
- Training RΒ²: 0.743
- Cross-validated RΒ²: 0.743 Β± 0.003
- RMSE: 0.231 (on magnitude scale 0-10)
- MAE: 0.176
- Explanation: The model explains 74.3% of the variance in earthquake magnitudes

**Artifacts Generated:**
- Trained model file: `outputs/xgboost_model_optimized.pkl`
- YData profiling report: `outputs/earthquake_profile.html`
- Feature importance plot: `outputs/feature_importance.html`
- Interactive dashboard: `outputs/model_dashboard.html`
- Cleaned dataset: `data/earthquake_data_cleaned.parquet`
- Feature engineered dataset: `data/earthquake_data_featured.parquet`

**Total Execution Time:** 12 minutes 43 seconds

**Key Insights:**
1. Depth is the strongest predictor of earthquake magnitude (28.4% importance)
2. Spatial features (distance to plate boundaries, lat/lon) are highly informative
3. Temporal patterns show cyclical variations in earthquake characteristics
4. Model performance is consistent across cross-validation folds (low variance)
5. The optimized XGBoost model provides reliable magnitude predictions suitable for deployment

### Robust Error Recovery System

The agent implements a comprehensive error recovery system designed to handle failures gracefully and guide users toward successful task completion.

**Error Recovery Mechanisms:**

1. **Automatic Retry with Correction**: When a tool execution fails due to incorrect parameters, the agent analyzes the error message, adjusts parameters based on the error type, and automatically retries the operation with corrected inputs.

2. **File Existence Validation**: Before executing tools that require specific file inputs, the system validates file existence and accessibility, providing clear guidance when files are missing.

3. **Column Name Validation**: Validates that requested column names exist in the dataset before performing operations, suggesting similar column names when exact matches aren't found.

4. **Dependency Tracking**: Ensures tools are executed in the proper sequence, checking that prerequisite operations (e.g., data cleaning before training) have been completed.

5. **Loop Detection**: Monitors tool execution patterns to detect and prevent infinite retry loops. If the same operation fails multiple times with the same error, the agent stops retrying and requests user intervention.

6. **Recovery Guidance**: When errors cannot be automatically resolved, the system provides detailed guidance including:
   - A clear explanation of what went wrong
   - The last successful file state that can be used to continue
   - Suggested alternative approaches
   - Specific parameter corrections needed

7. **Graceful Degradation**: If a requested operation cannot be completed, the agent attempts to provide partial results or alternative analysis that may still be valuable.

**Example Error Recovery Flow:**

```
Request: "Train a model to predict 'Price' column"

Error Detected: Column 'Price' not found in dataset
Recovery Action: Search for similar columns β†’ Find 'price', 'PRICE', 'SalePrice'
Agent Response: "Column 'Price' not found. Did you mean 'SalePrice'? I found these similar columns: ['SalePrice', 'price_usd']. Please specify which column to use."

User: "Yes, use SalePrice"
Agent: [Continues with corrected column name]
```
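The "similar columns" step in the recovery flow can be approximated with the standard library. The agent's actual matching strategy is not shown in the source, so this `difflib`-based version is purely illustrative:

```python
# Fuzzy column-name suggestions for the recovery flow, using difflib.
import difflib

def suggest_columns(requested, available, n=3):
    """Case-insensitive fuzzy match of a requested name against real columns."""
    lowered = {col.lower(): col for col in available}
    matches = difflib.get_close_matches(requested.lower(), lowered, n=n, cutoff=0.6)
    return [lowered[m] for m in matches]

columns = ["SalePrice", "price_usd", "LotArea", "YearBuilt"]
print(suggest_columns("Price", columns))
```

With the dataset from the example flow above, a request for 'Price' surfaces both 'SalePrice' and 'price_usd' as candidates.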

### Interactive Report Viewing

The web interface includes an integrated report viewer that displays comprehensive HTML reports generated during analysis without requiring users to download files or switch to external tools.

**Report Viewer Features:**

- **In-Application Display**: Reports open in a full-screen modal overlay within the chat interface
- **Multiple Report Types**: Supports YData Profiling reports and custom HTML dashboards
- **Professional Styling**: The modal features a glassmorphism design, smooth animations, and a responsive layout
- **Interactive Navigation**: Users can zoom, scroll, and interact with report elements directly in the viewer
- **Download Option**: Reports can be downloaded as standalone HTML files for sharing or archival
- **Automatic Detection**: The system automatically detects when tools generate HTML reports and creates "View Report" buttons in the chat interface

**Supported Report Types:**

1. **YData Profiling Reports**: Comprehensive automated EDA with variable statistics, distributions, correlations, missing value analysis, and alerts for data quality issues

2. **Custom Dashboards**: User-created Plotly dashboards with multiple interactive visualizations

The report extraction system uses multiple strategies to locate report files, including checking tool return values, parsing workflow history, and using regex pattern matching on agent responses.

### API Key Security

- Use different API keys for development and production
- Rotate API keys periodically
- Set restrictive file permissions on `.env` (chmod 600 on Linux/macOS)

**Linux/macOS:**
```bash
chmod +x build-and-deploy.sh
./build-and-deploy.sh
```

These scripts handle building the image, stopping any existing containers, and starting a new container with proper configuration.

**4. Build the frontend**

```bash
cd FRRONTEEEND
npm install
npm run build
cd ..
```

**5. Run the application**

**Windows:**
```powershell
.\start.ps1
```

**Linux/Mac:**
```bash
chmod +x start.sh
./start.sh
```

The application will be available at **http://localhost:8080**

---

## πŸ“– Usage

### Web Interface

1. **Navigate to http://localhost:8080**
2. **Click "Launch Agent"** from the landing page
3. **Upload your dataset** (CSV or Parquet format)
4. **Type your request** in natural language:
   - "Generate a comprehensive report on this dataset"
   - "Train a model to predict [target_column]"
   - "Clean the data and show me visualizations"
   - "Perform feature engineering and train the best model"
5. **View results** in the chat and click "View Report" buttons to see detailed HTML reports

### Example Queries

```
πŸ“Š "Profile this dataset and tell me about data quality issues"

🧹 "Clean the missing values and handle outliers"

🎯 "Train a model to predict house prices with target column 'price'"

πŸ“ˆ "Generate a correlation heatmap and feature importance plot"

πŸ”§ "Create time-based features and perform hyperparameter tuning"

πŸ“‹ "Generate a comprehensive YData profiling report"
```

---

## πŸ—οΈ Architecture

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  React Frontend (Port 8080)                  β”‚
β”‚        Landing Page β”‚ Chat Interface β”‚ Report Viewer         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                FastAPI Backend (Python 3.10+)                β”‚
β”‚          /chat β”‚ /run β”‚ /outputs β”‚ /api/health               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               DataScienceCopilot Orchestrator                β”‚
β”‚  β€’ Gemini 2.5 Flash Integration                              β”‚
β”‚  β€’ 82+ Specialized Tools                                     β”‚
β”‚  β€’ Session Memory & Context                                  β”‚
β”‚  β€’ Intelligent Intent Detection                              β”‚
β”‚  β€’ Error Recovery & Loop Prevention                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                       Tool Categories                        β”‚
β”‚   Profiling β”‚ Cleaning β”‚ Feature Engineering β”‚ ML Training   β”‚
β”‚         Visualization β”‚ EDA Reports β”‚ Data Wrangling         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

---

## Tech Stack

### Frontend
- **React 19** - Modern UI library
- **TypeScript 5.8** - Type-safe development
- **Vite 6** - Lightning-fast build tool
- **Tailwind CSS** - Utility-first styling
- **Framer Motion** - Smooth animations
- **React Markdown** - Formatted responses

### Backend
- **FastAPI** - High-performance Python web framework
- **Google Gemini 2.5 Flash** - LLM for agent orchestration
- **Polars** - Fast dataframe library (10-100x faster than pandas)
- **Scikit-learn** - Classical ML algorithms
- **XGBoost / LightGBM / CatBoost** - Gradient boosting frameworks
- **Optuna** - Hyperparameter optimization
- **YData Profiling** - Automated EDA reports
- **Plotly / Matplotlib** - Interactive visualizations

### DevOps
- **Docker** - Containerization with multi-stage builds
- **Python-dotenv** - Environment variable management
- **SQLite** - Caching layer for performance

---

## 🐳 Docker Deployment

**Build and run with Docker:**

```bash
docker build -t ds-agent .
docker run -p 8080:8080 --env-file .env ds-agent
```

**Or use the deployment script:**

```bash
.\build-and-deploy.ps1   # Windows
./build-and-deploy.sh    # Linux/Mac
```

---

## πŸ“‚ Project Structure

```
.
β”œβ”€β”€ FRRONTEEEND/                  # React frontend
β”‚   β”œβ”€β”€ components/               # UI components
β”‚   β”‚   β”œβ”€β”€ ChatInterface.tsx     # Main chat interface
β”‚   β”‚   β”œβ”€β”€ HeroGeometric.tsx     # Landing page hero
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ dist/                     # Built frontend
β”‚   └── package.json
β”‚
β”œβ”€β”€ src/                          # Python backend
β”‚   β”œβ”€β”€ api/
β”‚   β”‚   └── app.py                # FastAPI application
β”‚   β”œβ”€β”€ orchestrator.py           # Agent orchestrator
β”‚   β”œβ”€β”€ session_memory.py         # Session management
β”‚   β”œβ”€β”€ tools/                    # 82+ ML tools
β”‚   β”‚   β”œβ”€β”€ data_profiling.py
β”‚   β”‚   β”œβ”€β”€ data_cleaning.py
β”‚   β”‚   β”œβ”€β”€ feature_engineering.py
β”‚   β”‚   β”œβ”€β”€ model_training.py
β”‚   β”‚   └── ...
β”‚   └── utils/                    # Helper utilities
β”‚
β”œβ”€β”€ Dockerfile                    # Multi-stage Docker build
β”œβ”€β”€ requirements.txt              # Python dependencies
β”œβ”€β”€ start.ps1 / start.sh          # Quick start scripts
└── README.md                     # This file
```

---

## πŸ”‘ Environment Variables

Create a `.env` file in the root directory:

```bash
# LLM Provider Configuration
LLM_PROVIDER=gemini

# API Keys
GOOGLE_API_KEY=your_gemini_api_key_here

# Model Configuration
GEMINI_MODEL=gemini-2.5-flash

# Cache Configuration
CACHE_DB_PATH=./cache_db/cache.db
CACHE_TTL_SECONDS=86400

# Output Configuration
OUTPUT_DIR=./outputs
DATA_DIR=./data
```

---

## 🎯 Features in Detail

### Intelligent Intent Detection
The agent automatically classifies your request and applies the appropriate workflow:
- **Full ML Pipeline** - Complete end-to-end workflow with training
- **Exploratory Analysis** - Data profiling and visualization only
- **Cleaning Only** - Data quality improvements without modeling
- **Visualization Only** - Generate plots and dashboards
- **Multi-Intent** - Combine multiple tasks intelligently

### Session Memory
The agent remembers context across messages:
```
You: "Train a model on this dataset"
Agent: [Trains XGBoost model with RΒ² = 0.85]

You: "Now try hyperparameter tuning"
Agent: [Automatically uses previous model and dataset]

You: "Cross-validate it"
Agent: [Applies CV to tuned model from context]
```

### Error Recovery
- Automatic retry with corrected parameters
- File existence validation before execution
- Recovery guidance showing the last successful file
- Loop detection to prevent infinite retries

### Report Viewing
- Click "View Report" buttons to see HTML reports in-app
- Full-screen modal with professional styling
- Supports YData Profiling and custom dashboards

---

## πŸ“Š Example Workflow

**Upload:** `earthquake_data.csv` (175K rows, 22 columns)

**Prompt:** "Train a model to predict earthquake magnitude"

**Agent Actions:**
1. βœ… Profiles dataset (175,947 rows, 22 columns)
2. βœ… Detects data quality issues (11.67% missing, outliers)
3. βœ… Drops high-missing columns (>40% missing)
4. βœ… Imputes remaining missing values with median/mode
5. βœ… Handles outliers with IQR clipping
6. βœ… Extracts time-based features (year, month, hour, cyclical)
7. βœ… Encodes categorical variables
8. βœ… Trains 6 baseline models (XGBoost wins with RΒ² = 0.716)
9. βœ… Performs hyperparameter tuning (RΒ² = 0.743)
10. βœ… Runs 5-fold cross-validation (RMSE = 0.167 Β± 0.0005)
11. βœ… Generates YData profiling report
12. βœ… Creates interactive Plotly dashboard

**Result:** Trained and tuned XGBoost model ready for deployment!
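The IQR clipping in step 5 follows the usual rule of clamping values to [Q1 βˆ’ 1.5Β·IQR, Q3 + 1.5Β·IQR]. A standalone sketch for a single column of values (the agent's tool presumably operates on whole dataframe columns):

```python
# IQR-based outlier clipping for one list of values (illustrative).
def iqr_clip(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    ordered = sorted(values)
    n = len(ordered)

    def quantile(q):
        # Linear-interpolation quantile estimate
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        frac = pos - int(pos)
        return ordered[lo] * (1 - frac) + ordered[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]

print(iqr_clip([1, 2, 3, 4, 100]))  # the extreme 100 is pulled down to 7
```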

---

## 🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

---

## πŸ“„ License

This project is licensed under the MIT License.

---

## πŸ™ Acknowledgments

- **Google Gemini** for powerful LLM capabilities
- **FastAPI** for an excellent async Python framework
- **React** community for amazing UI libraries
- **Polars** for blazing-fast data processing
- **YData Profiling** for comprehensive EDA reports

---

## πŸ“§ Contact

**Pulastya B**
- GitHub: [@Pulastya-B](https://github.com/Pulastya-B)
- Project: [DevSprint-Data-Science-Agent](https://github.com/Pulastya-B/DevSprint-Data-Science-Agent)

---

<div align="center">

**Built with ❀️ for DevSprint Hackathon**

⭐ Star this repo if you find it helpful!

</div>
+ ---
+ title: DevSprint Data Science Agent
+ emoji: πŸ€–
+ colorFrom: blue
+ colorTo: purple
+ sdk: docker
+ pinned: false
+ license: mit
+ app_port: 7860
  ---
+
+ # DevSprint Data Science Agent πŸ€–
+
+ An intelligent AI agent for automated data science workflows, powered by Google Gemini 2.5 Flash with 82+ specialized tools for data analysis, visualization, and machine learning.
+
+ ## Features
+
+ - πŸ” **Automated EDA**: YData profiling, statistical analysis, data quality reports
+ - πŸ“Š **Smart Visualizations**: Plotly dashboards, matplotlib plots, interactive charts
+ - 🧹 **Data Cleaning**: Missing value handling, outlier detection, type conversion
+ - πŸ› οΈ **Feature Engineering**: Automated feature creation, encoding, scaling
+ - πŸ€– **ML Training**: AutoML with XGBoost, LightGBM, CatBoost, Neural Networks
+ - πŸ’¬ **Natural Language Interface**: Chat-based interaction for complex workflows
+ - πŸ“ˆ **Business Intelligence**: KPI tracking, trend analysis, forecasting
+
+ ## Tech Stack
+
+ - **Backend**: FastAPI + Python 3.12
+ - **LLM**: Google Gemini 2.5 Flash (text-based tool calling)
+ - **Data Processing**: Polars (high-performance dataframes)
+ - **Frontend**: React 19 + TypeScript + Vite
+ - **ML Libraries**: Scikit-learn, XGBoost, LightGBM, CatBoost, PyTorch
+
+ ## Usage
+
+ 1. Upload your CSV/Excel dataset
+ 2. Ask questions in natural language (e.g., "Generate a detailed profiling report")
+ 3. The agent automatically selects and executes the right tools
+ 4. View generated reports, visualizations, and insights
+
+ ## Memory Optimization
+
+ For large datasets (>50k rows or >10MB), the agent automatically:
+ - Samples to 50,000 rows for profiling
+ - Enables minimal mode to reduce memory usage
+ - Disables expensive correlation/interaction calculations
+
+ This keeps operation smooth even for large datasets within the 16GB of RAM available on HuggingFace Spaces.
+
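The thresholds above can be sketched as a small decision helper that picks profiling settings based on dataset size. This is an illustrative sketch, not the repo's actual code; the function name and returned keys are assumptions:

```python
# Illustrative sketch of the sampling guard described above
# (names are hypothetical, not this repo's actual API).
MAX_ROWS = 50_000                 # sample above this many rows
MAX_BYTES = 10 * 1024 * 1024      # or above this file size (10 MB)

def profiling_config(n_rows: int, file_bytes: int) -> dict:
    """Decide how profiling should run for a dataset of the given size."""
    large = n_rows > MAX_ROWS or file_bytes > MAX_BYTES
    return {
        "sample_rows": MAX_ROWS if large else None,  # sample before profiling
        "minimal": large,             # low-memory mode for big inputs
        "correlations": not large,    # skip expensive pairwise statistics
    }

print(profiling_config(175_000, 4 * 1024 * 1024))
```

A 175k-row file trips the row threshold even though it is under 10 MB, so it gets sampled and profiled in minimal mode.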
+ ## Environment Variables
+
+ Set `GEMINI_API_KEY` in the HuggingFace Spaces settings (Settings β†’ Repository secrets):
+
  ```
+ GEMINI_API_KEY=your_google_gemini_api_key_here
  ```
+
+ Get your API key from: https://aistudio.google.com/app/apikey
+
+ ## Local Development
+
  ```bash
+ # Clone repository
+ git clone https://huggingface.co/spaces/YOUR_USERNAME/devs-print-data-science-agent
+ cd devs-print-data-science-agent
+
+ # Install dependencies
+ pip install -r requirements.txt
+ npm install --prefix FRRONTEEEND
+
+ # Build frontend
+ cd FRRONTEEEND && npm run build && cd ..
+
+ # Set API key
+ export GEMINI_API_KEY=your_key_here
+
+ # Run server
+ uvicorn src.api.app:app --host 0.0.0.0 --port 7860
  ```
+
+ ## Architecture
+
  ```
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+ β”‚ React Frontend  β”‚  ← User uploads data + asks questions
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+          β”‚
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
+ β”‚ FastAPI Server  β”‚  ← Serves frontend + API endpoints
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+          β”‚
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
+ β”‚  Orchestrator   β”‚  ← LLM-driven tool selection & execution
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+          β”‚
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
+ β”‚    82+ Tools    β”‚  ← Specialized data science functions
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  ```
+
+ ## Key Components
+
+ - **Orchestrator** ([src/orchestrator.py](src/orchestrator.py)): ReAct-based tool calling with Gemini
+ - **Tools Registry** ([src/tools/](src/tools/)): 82+ specialized data science tools
+ - **Session Memory** ([src/session_memory.py](src/session_memory.py)): Conversation history + file tracking
+ - **Artifact Store** ([src/storage/artifact_store.py](src/storage/artifact_store.py)): File management + metadata
+
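The orchestrator's ReAct-style flow can be illustrated with a minimal loop: the LLM proposes a tool call, the tool's result is fed back as an observation, and the loop repeats until a final answer. This is a sketch with a scripted stand-in for Gemini; all names here are illustrative, not the project's actual interfaces:

```python
# Minimal ReAct-style tool loop (illustrative stand-in for src/orchestrator.py;
# the real implementation calls Gemini and a registry of 82+ tools).
def run_agent(llm, tools: dict, question: str, max_steps: int = 5) -> str:
    history = [f"User: {question}"]
    for _ in range(max_steps):
        action = llm("\n".join(history))          # LLM chooses the next step
        if action["type"] == "final":
            return action["answer"]
        name, args = action["tool"], action["args"]
        result = tools[name](**args)              # execute the chosen tool
        history.append(f"Observation[{name}]: {result}")
    return "Step limit reached"

# Toy example: a scripted "LLM" that first profiles a dataset, then answers.
def scripted_llm(prompt: str):
    if "Observation" not in prompt:
        return {"type": "tool", "tool": "profile", "args": {"path": "data.csv"}}
    return {"type": "final", "answer": "Profile complete"}

tools = {"profile": lambda path: f"profiled {path}"}
print(run_agent(scripted_llm, tools, "Profile my data"))  # Profile complete
```

The `max_steps` cap mirrors the loop-detection idea: a runaway chain of tool calls terminates instead of retrying forever.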
+ ## Deployment
+
+ This Space uses a **Docker** deployment for maximum compatibility:
+ - Base image: `python:3.12-slim`
+ - Multi-stage build (Node.js for the frontend, Python for the backend)
+ - Exposes port 7860, the port HuggingFace Spaces expects
+ - All dependencies bundled in the container
+
  ## Contributing
+
+ Built for DevSprint Hackathon 2025. Contributions welcome post-hackathon!
+
  ## License
+
+ MIT License - see the LICENSE file for details
README_SPACES.md ADDED
@@ -0,0 +1,122 @@
+ ---
+ title: DevSprint Data Science Agent
+ emoji: πŸ€–
+ colorFrom: blue
+ colorTo: purple
+ sdk: docker
+ pinned: false
+ license: mit
+ app_port: 7860
+ ---
+
+ # DevSprint Data Science Agent πŸ€–
+
+ An intelligent AI agent for automated data science workflows, powered by Google Gemini 2.5 Flash with 82+ specialized tools for data analysis, visualization, and machine learning.
+
+ ## Features
+
+ - πŸ” **Automated EDA**: YData profiling, statistical analysis, data quality reports
+ - πŸ“Š **Smart Visualizations**: Plotly dashboards, matplotlib plots, interactive charts
+ - 🧹 **Data Cleaning**: Missing value handling, outlier detection, type conversion
+ - πŸ› οΈ **Feature Engineering**: Automated feature creation, encoding, scaling
+ - πŸ€– **ML Training**: AutoML with XGBoost, LightGBM, CatBoost, Neural Networks
+ - πŸ’¬ **Natural Language Interface**: Chat-based interaction for complex workflows
+ - πŸ“ˆ **Business Intelligence**: KPI tracking, trend analysis, forecasting
+
+ ## Tech Stack
+
+ - **Backend**: FastAPI + Python 3.12
+ - **LLM**: Google Gemini 2.5 Flash (text-based tool calling)
+ - **Data Processing**: Polars (high-performance dataframes)
+ - **Frontend**: React 19 + TypeScript + Vite
+ - **ML Libraries**: Scikit-learn, XGBoost, LightGBM, CatBoost, PyTorch
+
+ ## Usage
+
+ 1. Upload your CSV/Excel dataset
+ 2. Ask questions in natural language (e.g., "Generate a detailed profiling report")
+ 3. The agent automatically selects and executes the right tools
+ 4. View generated reports, visualizations, and insights
+
+ ## Memory Optimization
+
+ For large datasets (>50k rows or >10MB), the agent automatically:
+ - Samples to 50,000 rows for profiling
+ - Enables minimal mode to reduce memory usage
+ - Disables expensive correlation/interaction calculations
+
+ This keeps operation smooth even for large datasets within the 16GB of RAM available on HuggingFace Spaces.
+
+ ## Environment Variables
+
+ Set `GEMINI_API_KEY` in the HuggingFace Spaces settings (Settings β†’ Repository secrets):
+
+ ```
+ GEMINI_API_KEY=your_google_gemini_api_key_here
+ ```
+
+ Get your API key from: https://aistudio.google.com/app/apikey
+
+ ## Local Development
+
+ ```bash
+ # Clone repository
+ git clone https://huggingface.co/spaces/YOUR_USERNAME/devs-print-data-science-agent
+ cd devs-print-data-science-agent
+
+ # Install dependencies
+ pip install -r requirements.txt
+ npm install --prefix FRRONTEEEND
+
+ # Build frontend
+ cd FRRONTEEEND && npm run build && cd ..
+
+ # Set API key
+ export GEMINI_API_KEY=your_key_here
+
+ # Run server
+ uvicorn src.api.app:app --host 0.0.0.0 --port 7860
+ ```
+
+ ## Architecture
+
+ ```
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+ β”‚ React Frontend  β”‚  ← User uploads data + asks questions
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+          β”‚
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
+ β”‚ FastAPI Server  β”‚  ← Serves frontend + API endpoints
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+          β”‚
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
+ β”‚  Orchestrator   β”‚  ← LLM-driven tool selection & execution
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+          β”‚
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
+ β”‚    82+ Tools    β”‚  ← Specialized data science functions
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+ ```
+
+ ## Key Components
+
+ - **Orchestrator** ([src/orchestrator.py](src/orchestrator.py)): ReAct-based tool calling with Gemini
+ - **Tools Registry** ([src/tools/](src/tools/)): 82+ specialized data science tools
+ - **Session Memory** ([src/session_memory.py](src/session_memory.py)): Conversation history + file tracking
+ - **Artifact Store** ([src/storage/artifact_store.py](src/storage/artifact_store.py)): File management + metadata
+
+ ## Deployment
+
+ This Space uses a **Docker** deployment for maximum compatibility:
+ - Base image: `python:3.12-slim`
+ - Multi-stage build (Node.js for the frontend, Python for the backend)
+ - Exposes port 7860, the port HuggingFace Spaces expects
+ - All dependencies bundled in the container
+
+ ## Contributing
+
+ Built for DevSprint Hackathon 2025. Contributions welcome post-hackathon!
+
+ ## License
+
+ MIT License - see the LICENSE file for details
README_original.md ADDED
@@ -0,0 +1,993 @@
+ # AI-Powered Data Science Agent
+
+ ## Overview
+
+ The AI-Powered Data Science Agent is an intelligent autonomous system designed to perform complete end-to-end data science workflows through natural language interaction. This agent leverages Google Gemini 2.5 Flash for advanced reasoning and function calling capabilities, combined with a comprehensive suite of over 82 specialized machine learning tools.
+
+ The system enables users to upload datasets in CSV or Parquet format and describe their analytical objectives in plain English. The agent autonomously handles the entire pipeline including data profiling, quality assessment, cleaning, feature engineering, model training, hyperparameter optimization, cross-validation, and comprehensive report generation.
+
+ Key capabilities include intelligent intent classification, session memory for contextual awareness, error recovery mechanisms, and a modern React-based web interface for seamless user interaction.
+
+ [![React](https://img.shields.io/badge/React-19-61DAFB?logo=react)](https://reactjs.org/)
+ [![FastAPI](https://img.shields.io/badge/FastAPI-0.109-009688?logo=fastapi)](https://fastapi.tiangolo.com/)
+ [![Gemini](https://img.shields.io/badge/Gemini-2.5_Flash-4285F4?logo=google)](https://ai.google.dev/)
+ [![Python](https://img.shields.io/badge/Python-3.10+-3776AB?logo=python)](https://python.org/)
+
+ ---
+
+ ## Key Features
+
+ ### Autonomous AI Agent System
+
+ The core orchestration engine integrates Google Gemini 2.5 Flash with over 82 specialized machine learning tools organized across multiple categories:
+
+ - **Data Profiling Tools**: Generate comprehensive statistical summaries, distribution analysis, correlation matrices, data quality reports, and automated anomaly detection
+ - **Data Cleaning Tools**: Handle missing values with intelligent imputation strategies (mean, median, mode, forward/backward fill, KNN), outlier detection and treatment using IQR and Z-score methods, duplicate removal, and data type conversions
+ - **Feature Engineering Tools**: Create time-based features (hour, day, month, year, cyclical encodings), polynomial features, interaction terms, statistical aggregations, lag features, rolling window statistics, and domain-specific transformations
+ - **Model Training Tools**: Support for multiple algorithm families including linear models (Ridge, Lasso, ElasticNet), tree-based models (Random Forest, Gradient Boosting), and advanced gradient boosting frameworks (XGBoost, LightGBM, CatBoost)
+ - **Visualization Tools**: Generate interactive Plotly visualizations, Matplotlib static plots, correlation heatmaps, distribution plots, scatter matrices, feature importance charts, and residual analysis plots
+
+ The intelligent orchestration system uses function calling capabilities to dynamically select and execute appropriate tools based on user intent. The agent maintains session memory for contextual awareness across conversation turns, enabling multi-turn dialogues where previous actions and results inform subsequent decisions.
+
+ Smart intent detection automatically classifies incoming requests into categories such as full ML pipeline execution, exploratory data analysis, data cleaning only, visualization generation, or multi-intent tasks requiring combined workflows.
+
+ Error recovery mechanisms include automatic retry logic with corrected parameters, file existence validation before tool execution, recovery guidance displaying the last successful file state, and loop detection to prevent infinite retry cycles.
+
+ ### Modern Web Interface
+
+ The frontend is built with React 19 and TypeScript 5.8, featuring a modern glassmorphism design aesthetic with smooth animations powered by Framer Motion. Key interface components include:
+
+ - **Landing Page**: Geometric hero section with animated background paths, key capabilities showcase, problem-solution presentation, process flow visualization, and technology stack display
+ - **Chat Interface**: Real-time message streaming, file upload support for CSV and Parquet formats, markdown rendering for formatted responses with code syntax highlighting, loading states with animated indicators, and error handling with user-friendly messages
+ - **Report Viewer**: In-application modal viewer for HTML reports generated by YData Profiling and custom dashboard tools. Full-screen modal with professional styling, iframe embedding for report content, and download capabilities
+ - **Session Management**: Maintains conversation history across browser sessions, allows users to review previous analyses, and provides context for follow-up questions
+
+ ### Complete Machine Learning Pipeline
+
+ The agent executes a comprehensive end-to-end pipeline:
+
+ 1. **Data Profiling and Assessment**: Automatically generates statistical summaries including descriptive statistics (mean, median, standard deviation, quartiles), distribution analysis with histogram generation, correlation analysis with heatmap visualization, missing value analysis with percentage calculations, data type detection and validation, outlier detection using multiple methods (IQR, Z-score, isolation forest), and cardinality analysis for categorical variables
+
+ 2. **Data Cleaning and Preprocessing**: Handles missing values with context-aware imputation strategies, removes or treats outliers based on statistical thresholds, performs data type conversions and casting, removes duplicate records, handles inconsistent formatting in categorical variables, and validates data integrity constraints
+
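Two of the statistics mentioned in the profiling and cleaning steps, IQR-based outlier bounds and median imputation, can be sketched in pure Python. This is an illustration of the math involved, not the repo's Polars-based tools:

```python
# Illustrative sketch of IQR outlier bounds and median imputation
# (pure Python; the project's actual tools operate on Polars dataframes).
from statistics import median, quantiles

def iqr_bounds(values, k: float = 1.5):
    """Return (low, high); points outside this range are flagged as outliers."""
    q1, _, q3 = quantiles(values, n=4)   # first and third quartiles
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    med = median(v for v in values if v is not None)
    return [med if v is None else v for v in values]

print(impute_median([1, None, 3]))   # [1, 2.0, 3]
```

The same `k = 1.5` multiplier is the conventional IQR threshold; larger values flag only more extreme points.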
+ ## Quick Start Guide
+
+ ### Prerequisites
+
+ Before beginning the installation, ensure your system meets the following requirements:
+
+ - **Python**: Version 3.10 or higher with the pip package manager
+ - **Node.js**: Version 18 or higher with the npm package manager (required to build the React frontend)
+
+ ### Installation Steps
+
+ **Step 1: Clone the Repository**
+
+ Clone the repository from GitHub and navigate to the project directory:
+
+ ```bash
+ git clone https://github.com/Pulastya-B/DevSprint-Data-Science-Agent.git
+ cd DevSprint-Data-Science-Agent
+ ```
+
+ **Step 2: Configure Environment Variables**
+
+ Create a `.env` file in the root directory with the following configuration:
+
+ ```bash
+ # LLM Provider Configuration
+ LLM_PROVIDER=gemini
+
+ # Google Gemini API Key (required)
+ GOOGLE_API_KEY=your_api_key_here
+
+ # Model Configuration
+ GEMINI_MODEL=gemini-2.5-flash
+
+ # Cache Configuration
+ CACHE_DB_PATH=./cache_db/cache.db
+ CACHE_TTL_SECONDS=86400
+
+ # Output and Data Directories
+ OUTPUT_DIR=./outputs
+ DATA_DIR=./data
+ ```
+
+ Replace `your_api_key_here` with your actual Google Gemini API key obtained from https://ai.google.dev/
+
+ **Step 3: Install Python Dependencies**
+
+ Install all required Python packages using pip:
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ This installs all of the backend dependencies listed in requirements.txt.
+
+ ## Usage Guide
+
+ ### Web Interface Workflow
+
+ **Step 1: Access the Application**
+
+ Open your web browser and navigate to http://localhost:8080. You will see the landing page with an overview of the agent's capabilities.
+
+ **Step 2: Launch the Chat Interface**
+
+ Click the "Launch Agent" button to access the interactive chat interface.
+
+ **Step 3: Upload Your Dataset**
+
+ Click the file upload button (paperclip icon) and select your dataset file. Supported formats:
+ - CSV files (.csv) with any delimiter (comma, tab, semicolon, etc.)
+ - Parquet files (.parquet) for high-performance columnar storage
+
+ The agent will automatically detect the file format and load the data using appropriate parsers.
+
+ **Step 4: Describe Your Task**
+
+ Type your request in natural language in the chat input box. The agent understands various types of requests and will automatically determine the appropriate workflow.
+
+ **Step 5: Review Results**
+
+ The agent will execute the requested workflow and display results in the chat interface. For analyses that generate HTML reports (such as YData Profiling), a "View Report" button will appear. Click this button to open the report in a full-screen modal viewer.
+
+ ### Example Queries and Use Cases
+
+ **Data Profiling and Exploration:**
+ ```
+ "Generate a comprehensive profile report on this dataset"
+ "Show me the statistical summary and distribution of all variables"
+ "Analyze data quality issues including missing values and outliers"
+ "Create a correlation matrix and identify highly correlated features"
+ ```
+
+ **Data Cleaning:**
+ ```
+ "Clean the missing values using median imputation for numeric columns"
+ "Handle outliers in the dataset using IQR method"
+ "Remove duplicate records and fix data type inconsistencies"
+ "Drop columns with more than 50% missing values"
+ ```
+
+ **Predictive Modeling:**
+ ```
+ "Train a model to predict the target column 'price' using all features"
+ "Build a classification model for the 'churn' column"
+ "Compare multiple regression algorithms and select the best one"
+ "Train an XGBoost model with default hyperparameters"
+ ```
+
+ **Feature Engineering:**
+ ```
+ "Extract time-based features from the datetime column"
+ "Create interaction terms between numeric features"
+ "Apply target encoding for high-cardinality categorical variables"
+ "Generate polynomial features of degree 2"
+ ```
+
+ **Model Optimization:**
+ ```
+ "Perform hyperparameter tuning on the trained model using Optuna"
+ "Run 5-fold cross-validation to evaluate model performance"
+ "Optimize the XGBoost model for better accuracy"
+ ```
+
+ **Visualization:**
+ ```
+ "Generate a correlation heatmap for numeric features"
+ "Create distribution plots for all numeric columns"
+ "Show feature importance for the trained model"
+ "Generate interactive Plotly visualizations"
+ ```
+
+ **End-to-End Pipeline:**
+ ```
+ "Profile the data, clean it, engineer features, and train the best model"
+ "Perform complete analysis and predict the target column 'sales'"
+ "Do everything needed to build a production-ready model"
+ ```
+
+ ### Starting the Application
+
+ **For Windows (PowerShell):**
+ ```powershell
+ .\start.ps1
+ ```
+
+ **For Linux/macOS:**
+ ```bash
+ chmod +x start.sh
+ ./start.sh
+ ```
+
+ The startup script launches the backend server and serves the web interface.
+
+ ## Technology Stack
+
+ ### Frontend Technologies
+
+ - **React 19.2.3**: Latest version of React with improved concurrent rendering, automatic batching, and enhanced hooks for building performant user interfaces
+ - **TypeScript 5.8.2**: Provides static type checking, enhanced IDE support, and improved code maintainability with advanced type inference
+ - **Vite 6.2.0**: Next-generation frontend build tool offering instant server start, lightning-fast hot module replacement (HMR), and optimized production builds
+ - **Tailwind CSS 3.4.1**: Utility-first CSS framework enabling rapid UI development with pre-built classes and responsive design utilities
+ - **Framer Motion 12.23.26**: Production-ready animation library for React with declarative animations, gestures, and smooth transitions
+ - **React Markdown 9.0.1**: Markdown rendering component supporting GitHub-flavored markdown, code syntax highlighting, and custom renderers
+ - **Lucide React**: Icon library providing consistent, customizable SVG icons for the user interface
+
+ ### Backend Technologies
+
+ - **FastAPI 0.109+**: Modern, high-performance Python web framework with automatic OpenAPI documentation, async/await support, and built-in request validation
+ - **Google Gemini 2.5 Flash**: Large language model with advanced reasoning capabilities, function calling support, and high token limits for agent orchestration
+ - **Polars 0.20+**: High-performance DataFrame library written in Rust, offering 10-100x speed improvements over pandas for large datasets
+ - **Scikit-learn 1.3+**: Comprehensive machine learning library providing classical algorithms for classification, regression, clustering, and preprocessing
+ - **XGBoost 2.0+**: Optimized gradient boosting framework with parallel tree construction, regularization, and efficient handling of sparse data
+ - **LightGBM 4.1+**: Gradient boosting framework by Microsoft with leaf-wise tree growth, categorical feature support, and memory efficiency
+ - **CatBoost 1.2+**: Gradient boosting library by Yandex with native categorical feature handling, GPU support, and symmetric tree structure
+ - **Optuna 3.5+**: Hyperparameter optimization framework with Bayesian optimization, pruning strategies, and distributed optimization support
+ - **YData Profiling 4.6+**: Automated exploratory data analysis tool generating comprehensive HTML reports with statistical summaries and data quality insights
+ - **Plotly 5.18+**: Interactive visualization library creating web-based charts with zooming, panning, and hover tooltips
+ - **Matplotlib 3.8+**: Fundamental plotting library for Python offering publication-quality static visualizations
+ - **Pydantic 2.5+**: Data validation library using Python type annotations for request/response models
+
+ ## Docker Deployment
+
+ The application includes a multi-stage Dockerfile for optimized containerized deployment.
+
+ ### Building the Docker Image
+
+ Build the Docker image with the following command:
+
+ ```bash
+ docker build -t ds-agent:latest .
+ ```
+
+ The multi-stage build process:
+ 1. **Stage 1 (Builder)**: Installs Node.js dependencies and builds the React frontend
+ 2. **Stage 2 (Runtime)**: Sets up the Python environment, installs backend dependencies, and copies the built frontend
+ 3. **Result**: Optimized image size by excluding development dependencies and build tools
+
+ ### Running the Container
+
+ Run the containerized application:
+
+ ```bash
+ docker run -d \
+   -p 8080:8080 \
+   --env-file .env \
+   --name ds-agent-container \
+   ds-agent:latest
+ ```
+
+ Parameters explained:
+ - `-d`: Run the container in detached mode (background)
+ - `-p 8080:8080`: Map container port 8080 to host port 8080
+ - `--env-file .env`: Load environment variables from the .env file
+ - `--name ds-agent-container`: Assign a name to the container for easy management
+
+ ### Docker Compose (Recommended)
+
+ For easier management, create a `docker-compose.yml` file:
+
+ ```yaml
+ version: '3.8'
+
+ services:
+   ds-agent:
+     build: .
+     container_name: ds-agent
+     ports:
+       - "8080:8080"
+     env_file:
+       - .env
+     volumes:
+       # Host directories for generated artifacts (illustrative paths)
+       - ./outputs:/app/outputs
+       - ./data:/app/data
+ ```
+
+ Start the stack with `docker compose up -d`.
+
+ ## Environment Configuration
+
+ The application uses environment variables for configuration management. Create a `.env` file in the project root directory with the following variables:
+
+ ### Required Configuration
+
+ ```bash
+ # LLM Provider Selection
+ LLM_PROVIDER=gemini
+ # Options: gemini (currently supported)
+
+ # Google Gemini API Key (REQUIRED)
+ GOOGLE_API_KEY=your_api_key_here
+ # Obtain from: https://ai.google.dev/
+ # Free tier limits: 10 RPM, 20 RPD
+
+ # Gemini Model Selection
+ GEMINI_MODEL=gemini-2.5-flash
+ # Options:
+ # - gemini-2.5-flash (recommended, balanced performance)
+ # - gemini-1.5-pro (higher capability, lower rate limits)
+ # - gemini-1.5-flash (faster, lower cost)
+ ```
+
+ ### Optional Configuration
+
+ Optional variables such as `CACHE_DB_PATH`, `CACHE_TTL_SECONDS`, `OUTPUT_DIR`, and `DATA_DIR` override the defaults shown in the Quick Start configuration.
+
+ ## Advanced Features
+
+ ### Intelligent Intent Detection and Classification
+
+ The orchestration system employs sophisticated intent detection to automatically classify user requests and route them to appropriate workflow pipelines. The classification system analyzes incoming natural language queries using keyword matching, pattern recognition, and contextual understanding.
+
+ **Intent Categories:**
+
+ 1. **Full ML Pipeline Intent**: Triggered by keywords such as "train", "model", "predict", "machine learning", "regression", "classification". Executes the complete workflow including data profiling, cleaning, feature engineering, model training, hyperparameter tuning, and evaluation.
+
+ 2. **Exploratory Analysis Intent**: Activated by keywords like "explore", "profile", "report", "analysis", "overview", "insights", "understand". Performs comprehensive data profiling with statistical summaries, distribution analysis, correlation matrices, and automated insights generation.
+
+ 3. **Data Cleaning Intent**: Detected via keywords such as "clean", "missing", "outliers", "duplicates", "impute", "handle". Focuses on data quality improvement operations without proceeding to modeling.
+
+ 4. **Visualization Intent**: Identified through keywords like "plot", "visualize", "chart", "graph", "heatmap", "distribution". Generates requested visualizations without performing modeling or extensive preprocessing.
+
+ 5. **Feature Engineering Intent**: Recognized by keywords such as "feature", "engineer", "create features", "transform", "encode". Applies feature transformation and creation operations.
+
+ 6. **Multi-Intent Workflows**: The system can detect and handle requests combining multiple intents, executing them in a logical sequence.
+
+ The intent classification system uses confidence scoring to handle ambiguous requests and can ask clarifying questions when intent is unclear.
+
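A keyword-based router of the kind described above can be sketched as a small scorer. The categories and keywords below are taken from the list; the scoring itself is illustrative, not the project's actual classifier:

```python
# Illustrative keyword-based intent scorer (not the project's real classifier).
INTENT_KEYWORDS = {
    "ml_pipeline": {"train", "model", "predict", "regression", "classification"},
    "eda": {"explore", "profile", "report", "overview", "insights"},
    "cleaning": {"clean", "missing", "outliers", "duplicates", "impute"},
    "visualization": {"plot", "visualize", "chart", "graph", "heatmap"},
}

def classify(query: str) -> str:
    words = set(query.lower().split())
    scores = {intent: len(words & kw) for intent, kw in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # Zero matches means low confidence: ask the user instead of guessing.
    return best if scores[best] > 0 else "unclear"

print(classify("Please train a model to predict churn"))  # ml_pipeline
```

Returning `"unclear"` at zero matches mirrors the confidence-scoring idea: an ambiguous request triggers a clarifying question rather than an arbitrary workflow.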
+ ### Context-Aware Session Memory
+
+ The agent implements persistent session memory that maintains conversation context across multiple turns. This enables natural multi-turn dialogues where subsequent requests can reference previous operations without requiring full context repetition.
+
+ **Session Memory Capabilities:**
+
+ - **Workflow History**: Stores the complete history of executed tools, parameters, and results for the current session
+ - **File State Tracking**: Maintains references to uploaded files, intermediate processed datasets, and generated outputs
+ - **Model Persistence**: Remembers trained models and their performance metrics for comparison and further tuning
+ - **Error Context**: Stores information about encountered errors to avoid repeating failed operations
+ - **User Preferences**: Learns from user choices (e.g., preferred visualization types, imputation strategies)
+
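The capabilities above suggest a simple record structure. A minimal sketch of what such a session-memory object might look like (the real implementation lives in src/session_memory.py; these field names are assumptions):

```python
# Illustrative session-memory record (field names are hypothetical, not the
# actual structure in src/session_memory.py).
from dataclasses import dataclass, field

@dataclass
class SessionMemory:
    history: list = field(default_factory=list)   # (tool, params, result) tuples
    files: dict = field(default_factory=dict)     # logical name -> latest path
    errors: list = field(default_factory=list)    # failed calls to avoid retrying

    def record(self, tool, params, result, output_path=None):
        self.history.append((tool, params, result))
        if output_path:
            self.files["latest"] = output_path    # later turns reference this file

mem = SessionMemory()
mem.record("clean_missing", {"strategy": "median"}, "ok", "data_clean.parquet")
print(mem.files["latest"])  # data_clean.parquet
```

Keeping the latest output path lets a follow-up request like "now train a model" operate on the cleaned file without the user restating it.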
+ ## Complete Workflow Example
+
+ This section demonstrates a complete end-to-end workflow for a real-world dataset, showing the agent's autonomous decision-making and execution capabilities.
+
+ ### Dataset: Earthquake Magnitude Prediction
+
+ **Input Dataset:** `earthquake_data.csv`
+ - Rows: 175,947 earthquake records
+ - Columns: 22 features including latitude, longitude, depth, time, location, and magnitude
+ - Target Variable: Earthquake magnitude (continuous regression task)
+ - Data Quality: 11.67% missing values, presence of outliers, mixed data types
+
+ **User Prompt:**
+ ```
+ "Train a model to predict earthquake magnitude with the highest possible accuracy"
+ ```
+
+ ### Automated Workflow Execution
+
+ **Phase 1: Data Profiling and Assessment** (Step 1)
+ - Tool: `generate_ydata_profile`
+ - Action: Comprehensive statistical analysis of all 22 features
+ - Findings:
+   - Total records: 175,947
+   - Missing values detected in 8 columns
+   - Outliers present in depth, latitude, longitude
+   - High cardinality in location column (15,000+ unique values)
+   - Strong correlation between depth and magnitude (r=0.62)
+ - Output: YData Profiling HTML report saved to `outputs/earthquake_profile.html`
+ - Time: 18.3 seconds
+
+ ## API Reference
366
+
367
+ The FastAPI backend exposes several endpoints for programmatic interaction.
368
+
369
+ ### Endpoints
370
+
371
+ **POST /chat**
372
+ - Description: Send a message to the agent with optional file upload
373
+ - Content-Type: multipart/form-data
374
+ - Parameters:
375
+ - message (string, required): User's natural language request
376
+ - file (file, optional): Dataset file (CSV or Parquet)
377
+ - Response: JSON with agent's response message and workflow history
378
+ - Example:
379
+ ```bash
380
+ curl -X POST http://localhost:8080/chat \
381
+ -F "message=Generate a data profile report" \
382
+ -F "file=@dataset.csv"
383
+ ```
384
+
385
+ **POST /run**
386
+ - Description: Execute a complete analysis workflow
387
+ - Content-Type: application/json
388
+ - Parameters:
389
+ - query (string, required): Analysis request
390
+ - use_cache (boolean, optional): Enable caching (default: true)
391
+ - Response: JSON with analysis results and generated artifacts
392
+ - Example:
393
+ ```json
394
+ {
395
+ "query": "Train a regression model to predict sales",
396
+ "use_cache": true
397
+ }
398
+ ```
399
+
400
+ **GET /outputs/{file_path}**
401
+ - Description: Retrieve generated reports and artifacts
402
+ - Parameters:
403
+ - file_path (string, required): Path to output file
404
+ - Response: File content (HTML, PNG, CSV, etc.)
405
+ - Example:
406
+ ```bash
407
+ curl http://localhost:8080/outputs/ydata_profile.html
408
+ ```
409
+
410
+ **GET /api/health**
411
+ - Description: Health check endpoint
412
+ - Response: JSON with status information
413
+ - Example response:
414
+ ```json
415
+ {
416
+ "status": "healthy",
417
+ "version": "1.0.0",
418
+ "timestamp": "2025-12-27T10:30:00Z"
419
+ }
420
+ ```
421
+
422
+ ### Interactive API Documentation
423
+
424
+ FastAPI automatically generates interactive API documentation:
425
+ - Swagger UI: http://localhost:8080/docs
426
+ - ReDoc: http://localhost:8080/redoc
427
+
428
+ ## Contributing
429
+
430
+ Contributions to improve the AI-Powered Data Science Agent are welcome. Please follow these guidelines:
431
+
432
+ ### Development Setup
433
+
434
+ 1. Fork the repository and clone your fork
435
+ 2. Create a new branch for your feature: `git checkout -b feature/your-feature-name`
436
+ 3. Install development dependencies: `pip install -r requirements-dev.txt`
437
+ 4. Make your changes with appropriate tests
438
+ 5. Ensure all tests pass: `pytest tests/`
439
+ 6. Format code with black: `black src/`
440
+ 7. Lint code with flake8: `flake8 src/`
441
+ 8. Commit with descriptive messages
442
+ 9. Push to your fork and submit a pull request
443
+
444
+ ### Code Style
445
+
446
+ - Follow PEP 8 guidelines for Python code
447
+ - Use type hints for function parameters and return values
448
+ - Write docstrings for all functions and classes
449
+ - Keep functions focused and under 50 lines when possible
450
+ - Use meaningful variable names
451
+
452
+ ### Testing
453
+
454
+ - Write unit tests for new features
455
+ - Ensure existing tests pass before submitting PR
456
+ - Aim for >80% code coverage
457
+
458
+ ## License
459
+
460
+ This project is licensed under the MIT License. See the LICENSE file for complete terms.
461
+
462
+ Copyright (c) 2025 Pulastya B
463
+
464
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
465
+
466
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
467
+
468
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
469
+
470
+ ## Acknowledgments
471
+
472
+ This project builds upon several excellent open-source technologies and frameworks:
473
+
474
+ - **Google Gemini 2.5 Flash**: Advanced language model with function calling capabilities enabling intelligent agent orchestration
475
+ - **FastAPI**: Modern, high-performance web framework for building APIs with Python, providing automatic documentation and validation
476
+ - **React**: JavaScript library for building user interfaces, enabling component-based architecture and efficient rendering
477
+ - **Polars**: High-performance DataFrame library written in Rust, offering significant speed improvements over traditional data processing libraries
478
+ - **Scikit-learn**: Machine learning library providing simple and efficient tools for data analysis and modeling
479
+ - **XGBoost, LightGBM, CatBoost**: Gradient boosting frameworks offering state-of-the-art performance for structured data
480
+ - **Optuna**: Hyperparameter optimization framework with efficient search algorithms
481
+ - **YData Profiling**: Automated exploratory data analysis tool generating comprehensive reports
482
+ - **Plotly**: Interactive visualization library for creating publication-quality graphs
483
+ - **TypeScript**: Typed superset of JavaScript enhancing code quality and developer experience
484
+ - **Tailwind CSS**: Utility-first CSS framework for rapid UI development
485
+ - **Vite**: Next-generation frontend build tool with instant server start
486
+
487
+ Special thanks to the open-source community for creating and maintaining these exceptional tools.
488
+
489
+ ## Contact and Support
490
+
491
+ **Developer:** Pulastya B
492
+
493
+ **GitHub Profile:** [@Pulastya-B](https://github.com/Pulastya-B)
494
+
495
+ **Project Repository:** [DevSprint-Data-Science-Agent](https://github.com/Pulastya-B/DevSprint-Data-Science-Agent)
496
+
497
+ **Issues and Bug Reports:** Please use the GitHub Issues page to report bugs or request features
498
+
499
+ **Documentation:** Additional documentation and tutorials available in the repository wiki
500
+
501
+ **Project Status:** Active development - Built for DevSprint Hackathon
502
+
503
+ For questions, suggestions, or collaboration opportunities, please open an issue on GitHub or contact through the repository.
504
+
505
+ ---
506
+
507
+ **Last Updated:** December 27, 2025
508
+
509
+ **Version:** 1.0.0
510
+ **Phase 4: Feature Engineering** (Steps 6-8)
+
+ Step 6 - Temporal Feature Extraction:
511
+ - Tool: `extract_time_features`
512
+ - Input column: 'timestamp'
513
+ - Features created:
514
+ - year, month, day_of_week, hour
515
+ - Cyclical encodings: hour_sin, hour_cos, month_sin, month_cos
516
+ - Justification: Earthquakes may have temporal patterns
517
+ - New columns: 8 time-based features
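The cyclical encodings above can be computed with a tiny helper. This is a generic sketch of the standard sin/cos trick, not the project's `extract_time_features` code.

```python
import math

def cyclical_encode(value: float, period: float) -> tuple[float, float]:
    """Map a cyclic quantity (hour of day, month of year) onto the unit circle
    so period boundaries (e.g. hour 23 -> hour 0) stay adjacent in feature space."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

hour_sin, hour_cos = cyclical_encode(23, 24)    # hour 23 lands next to hour 0
month_sin, month_cos = cyclical_encode(12, 12)  # December wraps around to January
```

A plain `hour` column would put 23:00 and 00:00 at opposite ends of the feature range; the two-component encoding removes that artificial discontinuity.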
518
+
519
+ Step 7 - Categorical Encoding:
520
+ - Tool: `encode_categorical_features`
521
+ - Method: Target encoding for 'location' (high cardinality), one-hot encoding for 'type'
522
+ - Result: All categorical variables converted to numeric
523
+ - New columns: 3 (reduced from high-cardinality location)
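Target encoding for the high-cardinality `location` column works roughly as follows. The sketch below implements a common smoothed variant (blending each category's mean target with the global mean so rare categories are not overfit); the exact smoothing used by `encode_categorical_features` is an assumption.

```python
from collections import defaultdict

def target_encode(categories: list, targets: list[float], smoothing: float = 10.0) -> list[float]:
    """Smoothed mean-target encoding: rare categories shrink toward the global mean."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, t in zip(categories, targets):
        sums[cat] += t
        counts[cat] += 1
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return [encoding[cat] for cat in categories]
```

Unlike one-hot encoding, this replaces 15,000+ location values with a single numeric column, which is why the step reduces rather than explodes the column count.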
524
+
525
+ Step 8 - Statistical Features:
526
+ - Tool: `create_statistical_features`
527
+ - Features created:
528
+ - Distance from nearest plate boundary (calculated from lat/lon)
529
+ - Depth-to-magnitude ratio
530
+ - Regional earthquake frequency (rolling count)
531
+ - New columns: 3 domain-specific features
532
+
533
+ Final feature count: 28 engineered features
534
+
535
+ **Phase 5: Model Training and Selection** (Step 9)
536
+ - Tool: `train_baseline_models`
537
+ - Algorithms trained in parallel:
538
+
539
+ 1. Ridge Regression: RΒ² = 0.534, RMSE = 0.312
540
+ 2. Lasso Regression: RΒ² = 0.541, RMSE = 0.309
541
+ 3. ElasticNet: RΒ² = 0.538, RMSE = 0.311
542
+ 4. Random Forest: RΒ² = 0.698, RMSE = 0.251
543
+ 5. XGBoost: RΒ² = 0.716, RMSE = 0.243 (BEST)
544
+ 6. LightGBM: RΒ² = 0.709, RMSE = 0.247
545
+ 7. CatBoost: RΒ² = 0.712, RMSE = 0.245
546
+
547
+ - Best model selected: XGBoost
548
+ - Validation split: 80/20 stratified split
549
+ - Time: 124.7 seconds
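The selection rule at the end of this phase is simply "highest validation RΒ² wins", which for the scores listed above picks XGBoost:

```python
# Baseline validation RΒ² scores from the run above
results = {
    "Ridge Regression": 0.534, "Lasso Regression": 0.541, "ElasticNet": 0.538,
    "Random Forest": 0.698, "XGBoost": 0.716, "LightGBM": 0.709, "CatBoost": 0.712,
}

best_model = max(results, key=results.get)  # model with the highest RΒ²
print(best_model)  # XGBoost
```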
550
+
551
+ **Phase 6: Hyperparameter Optimization** (Step 10)
552
+ - Tool: `optimize_hyperparameters_optuna`
553
+ - Framework: Optuna with Tree-structured Parzen Estimator (TPE)
554
+ - Search space:
555
+ - max_depth: [3, 10]
556
+ - learning_rate: [0.001, 0.3] (log scale)
557
+ - n_estimators: [100, 1000]
558
+ - min_child_weight: [1, 10]
559
+ - subsample: [0.6, 1.0]
560
+ - colsample_bytree: [0.6, 1.0]
561
+ - Trials: 50 iterations
562
+ - Best parameters found:
563
+ - max_depth: 7
564
+ - learning_rate: 0.0847
565
+ - n_estimators: 673
566
+ - min_child_weight: 3
567
+ - subsample: 0.8234
568
+ - colsample_bytree: 0.9123
569
+ - Optimized performance: RΒ² = 0.743, RMSE = 0.231
570
+ - Improvement: +3.8% RΒ² over baseline
571
+ - Time: 312.4 seconds
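The search space above can be expressed programmatically. The sketch below uses plain random sampling as a simplified stand-in for Optuna's TPE sampler (the real study uses `trial.suggest_*` calls); only the ranges are taken from the configuration above.

```python
import math
import random

SEARCH_SPACE = {  # ranges from the study configuration above
    "max_depth": (3, 10),
    "learning_rate": (0.001, 0.3),  # sampled on a log scale
    "n_estimators": (100, 1000),
    "min_child_weight": (1, 10),
    "subsample": (0.6, 1.0),
    "colsample_bytree": (0.6, 1.0),
}

def sample_trial(rng: random.Random) -> dict:
    """Draw one candidate configuration from the space (random-search stand-in for TPE)."""
    lo, hi = SEARCH_SPACE["learning_rate"]
    return {
        "max_depth": rng.randint(*SEARCH_SPACE["max_depth"]),
        "learning_rate": math.exp(rng.uniform(math.log(lo), math.log(hi))),
        "n_estimators": rng.randint(*SEARCH_SPACE["n_estimators"]),
        "min_child_weight": rng.randint(*SEARCH_SPACE["min_child_weight"]),
        "subsample": rng.uniform(*SEARCH_SPACE["subsample"]),
        "colsample_bytree": rng.uniform(*SEARCH_SPACE["colsample_bytree"]),
    }
```

Log-scale sampling of `learning_rate` matters: a uniform draw over [0.001, 0.3] would almost never propose values near 0.001, while the log transform explores each order of magnitude evenly.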
572
+
573
+ **Phase 7: Model Validation** (Step 11)
574
+ - Tool: `cross_validate_model`
575
+ - Method: 5-fold stratified cross-validation
576
+ - Results:
577
+ - Fold 1: RΒ² = 0.741, RMSE = 0.232
578
+ - Fold 2: RΒ² = 0.745, RMSE = 0.230
579
+ - Fold 3: RΒ² = 0.738, RMSE = 0.234
580
+ - Fold 4: RΒ² = 0.747, RMSE = 0.229
581
+ - Fold 5: RΒ² = 0.742, RMSE = 0.232
582
+ - Mean performance: RΒ² = 0.743 Β± 0.003, RMSE = 0.231 Β± 0.002
583
+ - Interpretation: Low variance across folds indicates robust, generalizable model
584
+ - Time: 267.8 seconds
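The "mean Β± std" summary is computed directly from the five fold scores listed above:

```python
import statistics

# Per-fold RΒ² scores from the 5-fold run above
fold_r2 = [0.741, 0.745, 0.738, 0.747, 0.742]

mean_r2 = statistics.mean(fold_r2)
std_r2 = statistics.pstdev(fold_r2)  # population std over the k folds

print(f"RΒ² = {mean_r2:.3f} Β± {std_r2:.3f}")  # RΒ² = 0.743 Β± 0.003
```

The tiny spread (0.003) is what justifies the "robust, generalizable model" interpretation.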
585
+
586
+ **Phase 8: Visualization and Reporting** (Steps 12-13)
587
+
588
+ Step 12 - Feature Importance Analysis:
589
+ - Tool: `plot_feature_importance`
590
+ - Top 10 features by importance:
591
+ 1. depth (0.284)
592
+ 2. distance_to_plate_boundary (0.167)
593
+ 3. latitude (0.142)
594
+ 4. longitude (0.138)
595
+ 5. regional_frequency (0.095)
596
+ 6. depth_magnitude_ratio (0.067)
597
+ 7. hour_sin (0.034)
598
+ 8. month (0.028)
599
+ 9. location_encoded (0.024)
600
+ 10. year (0.021)
601
+ - Output: Interactive Plotly bar chart saved to `outputs/feature_importance.html`
602
+
603
+ Step 13 - Comprehensive Dashboard:
604
+ - Tool: `create_plotly_dashboard`
605
+ - Visualizations included:
606
+ - Correlation heatmap (28x28 features)
607
+ - Actual vs Predicted scatter plot
608
+ - Residual distribution plot
609
+ - Feature importance ranking
610
+ - Temporal patterns in predictions
611
+ - Output: Multi-panel interactive dashboard saved to `outputs/model_dashboard.html`
612
+
613
+ ### Final Results Summary
614
+
615
+ **Model Performance:**
616
+ - Algorithm: XGBoost with optimized hyperparameters
617
+ - Training RΒ²: 0.743
618
+ - Cross-validated RΒ²: 0.743 Β± 0.003
619
+ - RMSE: 0.231 (on magnitude scale 0-10)
620
+ - MAE: 0.176
621
+ - Explanation: Model explains 74.3% of variance in earthquake magnitudes
622
+
623
+ **Artifacts Generated:**
624
+ - Trained model file: `outputs/xgboost_model_optimized.pkl`
625
+ - YData profiling report: `outputs/earthquake_profile.html`
626
+ - Feature importance plot: `outputs/feature_importance.html`
627
+ - Interactive dashboard: `outputs/model_dashboard.html`
628
+ - Cleaned dataset: `data/earthquake_data_cleaned.parquet`
629
+ - Feature engineered dataset: `data/earthquake_data_featured.parquet`
630
+
631
+ **Total Execution Time:** 12 minutes 43 seconds
632
+
633
+ **Key Insights:**
634
+ 1. Depth is the strongest predictor of earthquake magnitude (28.4% importance)
635
+ 2. Spatial features (distance to plate boundaries, lat/lon) are highly informative
636
+ 3. Temporal patterns show cyclical variations in earthquake characteristics
637
+ 4. Model performance is consistent across cross-validation folds (low variance)
638
+ 5. The optimized XGBoost model provides reliable magnitude predictions suitable for deployment
639
+
640
+ ### Robust Error Recovery System
641
+
642
+ The agent implements a comprehensive error recovery system designed to handle failures gracefully and guide users toward successful task completion.
643
+
644
+ **Error Recovery Mechanisms:**
645
+
646
+ 1. **Automatic Retry with Correction**: When a tool execution fails due to incorrect parameters, the agent analyzes the error message, adjusts parameters based on the error type, and automatically retries the operation with corrected inputs.
647
+
648
+ 2. **File Existence Validation**: Before executing tools that require specific file inputs, the system validates file existence and accessibility, providing clear guidance when files are missing.
649
+
650
+ 3. **Column Name Validation**: Validates that requested column names exist in the dataset before performing operations, suggesting similar column names when exact matches aren't found.
651
+
652
+ 4. **Dependency Tracking**: Ensures tools are executed in proper sequence, checking that prerequisite operations (e.g., data cleaning before training) have been completed.
653
+
654
+ 5. **Loop Detection**: Monitors tool execution patterns to detect and prevent infinite retry loops. If the same operation fails multiple times with the same error, the agent stops retrying and requests user intervention.
655
+
656
+ 6. **Recovery Guidance**: When errors cannot be automatically resolved, the system provides detailed guidance including:
657
+ - Clear explanation of what went wrong
658
+ - The last successful file state that can be used to continue
659
+ - Suggested alternative approaches
660
+ - Specific parameter corrections needed
661
+
662
+ 7. **Graceful Degradation**: If a requested operation cannot be completed, the agent attempts to provide partial results or alternative analysis that may still be valuable.
663
+
664
+ **Example Error Recovery Flow:**
665
+
666
+ ```
667
+ Request: "Train a model to predict 'Price' column"
668
+
669
+ Error Detected: Column 'Price' not found in dataset
670
+ Recovery Action: Search for similar columns β†’ Find 'price', 'PRICE', 'SalePrice'
671
+ Agent Response: "Column 'Price' not found. Did you mean 'SalePrice'? I found these similar columns: ['SalePrice', 'price_usd']. Please specify which column to use."
672
+
673
+ User: "Yes, use SalePrice"
674
+ Agent: [Continues with corrected column name]
675
+ ```
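The "did you mean" step in the flow above can be implemented with fuzzy string matching. A minimal sketch using the standard library's `difflib` (the project's actual matcher may differ):

```python
import difflib

def suggest_columns(requested: str, available: list[str], n: int = 3) -> list[str]:
    """Case-insensitive fuzzy match to recover from 'column not found' errors."""
    lowered = {col.lower(): col for col in available}
    hits = difflib.get_close_matches(requested.lower(), lowered, n=n, cutoff=0.6)
    return [lowered[h] for h in hits]

columns = ["SalePrice", "price_usd", "LotArea", "YearBuilt"]
print(suggest_columns("Price", columns))
```

With a cutoff of 0.6, `"Price"` matches both `SalePrice` and `price_usd` but not unrelated columns, mirroring the agent response in the example above.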
676
+
677
+ ### Interactive Report Viewing
678
+
679
+ The web interface includes an integrated report viewer that displays comprehensive HTML reports generated during analysis without requiring users to download files or switch to external tools.
680
+
681
+ **Report Viewer Features:**
682
+
683
+ - **In-Application Display**: Reports open in a full-screen modal overlay within the chat interface
684
+ - **Multiple Report Types**: Supports YData Profiling reports and custom HTML dashboards
685
+ - **Professional Styling**: Modal features glassmorphism design, smooth animations, and responsive layout
686
+ - **Interactive Navigation**: Users can zoom, scroll, and interact with report elements directly in the viewer
687
+ - **Download Option**: Reports can be downloaded as standalone HTML files for sharing or archival
688
+ - **Automatic Detection**: System automatically detects when tools generate HTML reports and creates "View Report" buttons in the chat interface
689
+
690
+ **Supported Report Types:**
691
+
692
+ 1. **YData Profiling Reports**: Comprehensive automated EDA with variable statistics, distributions, correlations, missing value analysis, and alerts for data quality issues
693
+
694
+ 2. **Custom Dashboards**: User-created Plotly dashboards with multiple interactive visualizations
695
+
696
+ The report extraction system uses multiple strategies to locate report files, including checking tool return values, parsing workflow history, and using regex pattern matching on agent responses.
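The regex strategy mentioned above can be sketched as follows; the pattern shown is illustrative, not the production one.

```python
import re

# Matches path-like tokens ending in .html, e.g. "outputs/model_dashboard.html"
HTML_PATH = re.compile(r"[\w\-./\\]+\.html", re.IGNORECASE)

def extract_report_paths(agent_response: str) -> list[str]:
    """Pull generated-report paths out of free-form agent text."""
    return HTML_PATH.findall(agent_response)

msg = "Profile saved to outputs/earthquake_profile.html and dashboard to outputs/model_dashboard.html"
print(extract_report_paths(msg))
# ['outputs/earthquake_profile.html', 'outputs/model_dashboard.html']
```

Each extracted path is then used to render a "View Report" button in the chat interface.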
697
+ - Use different API keys for development and production
698
+ - Rotate API keys periodically
699
+ - Set restrictive file permissions on `.env` (chmod 600 on Linux/macOS)
+
+ **Linux/macOS:**
700
+ ```bash
701
+ chmod +x build-and-deploy.sh
702
+ ./build-and-deploy.sh
703
+ ```
704
+
705
+ These scripts handle building the image, stopping any existing containers, and starting a new container with proper configuration.
+
+ **4. Build the frontend**
+
+ ```bash
+ cd FRRONTEEEND
706
+ npm install
707
+ npm run build
708
+ cd ..
709
+ ```
710
+
711
+ **5. Run the application**
712
+
713
+ **Windows:**
714
+ ```powershell
715
+ .\start.ps1
716
+ ```
717
+
718
+ **Linux/Mac:**
719
+ ```bash
720
+ chmod +x start.sh
721
+ ./start.sh
722
+ ```
723
+
724
+ The application will be available at **http://localhost:8080**
725
+
726
+ ---
727
+
728
+ ## πŸ“– Usage
729
+
730
+ ### Web Interface
731
+
732
+ 1. **Navigate to http://localhost:8080**
733
+ 2. **Click "Launch Agent"** from the landing page
734
+ 3. **Upload your dataset** (CSV or Parquet format)
735
+ 4. **Type your request** in natural language:
736
+ - "Generate a comprehensive report on this dataset"
737
+ - "Train a model to predict [target_column]"
738
+ - "Clean the data and show me visualizations"
739
+ - "Perform feature engineering and train the best model"
740
+ 5. **View results** in the chat and click "View Report" buttons to see detailed HTML reports
741
+
742
+ ### Example Queries
743
+
744
+ ```
745
+ πŸ“Š "Profile this dataset and tell me about data quality issues"
746
+
747
+ 🧹 "Clean the missing values and handle outliers"
748
+
749
+ 🎯 "Train a model to predict house prices with target column 'price'"
750
+
751
+ πŸ“ˆ "Generate a correlation heatmap and feature importance plot"
752
+
753
+ πŸ”§ "Create time-based features and perform hyperparameter tuning"
754
+
755
+ πŸ“‹ "Generate a comprehensive YData profiling report"
756
+ ```
757
+
758
+ ---
759
+
760
+ ## πŸ—οΈ Architecture
761
+
762
+ ```
763
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
764
+ β”‚ React Frontend (Port 8080) β”‚
765
+ β”‚ Landing Page β”‚ Chat Interface β”‚ Report Viewer β”‚
766
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
767
+ β”‚
768
+ β–Ό
769
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
770
+ β”‚ FastAPI Backend (Python 3.10+) β”‚
771
+ β”‚ /chat β”‚ /run β”‚ /outputs β”‚ /api/health β”‚
772
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
773
+ β”‚
774
+ β–Ό
775
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
776
+ β”‚ DataScienceCopilot Orchestrator β”‚
777
+ β”‚ β€’ Gemini 2.5 Flash Integration β”‚
778
+ β”‚ β€’ 82+ Specialized Tools β”‚
779
+ β”‚ β€’ Session Memory & Context β”‚
780
+ β”‚ β€’ Intelligent Intent Detection β”‚
781
+ β”‚ β€’ Error Recovery & Loop Prevention β”‚
782
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
783
+ β”‚
784
+ β–Ό
785
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
786
+ β”‚ Tool Categories β”‚
787
+ β”‚ Profiling β”‚ Cleaning β”‚ Feature Engineering β”‚ ML Training β”‚
788
+ β”‚ Visualization β”‚ EDA Reports β”‚ Data Wrangling β”‚
789
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
790
+ ```
791
+
792
+ ---
793
+
794
+ ## Tech Stack
795
+
796
+ ### Frontend
797
+ - **React 19** - Modern UI library
798
+ - **TypeScript 5.8** - Type-safe development
799
+ - **Vite 6** - Lightning-fast build tool
800
+ - **Tailwind CSS** - Utility-first styling
801
+ - **Framer Motion** - Smooth animations
802
+ - **React Markdown** - Formatted responses
803
+
804
+ ### Backend
805
+ - **FastAPI** - High-performance Python web framework
806
+ - **Google Gemini 2.5 Flash** - LLM for agent orchestration
807
+ - **Polars** - Fast dataframe library (10-100x faster than pandas)
808
+ - **Scikit-learn** - Classical ML algorithms
809
+ - **XGBoost / LightGBM / CatBoost** - Gradient boosting frameworks
810
+ - **Optuna** - Hyperparameter optimization
811
+ - **YData Profiling** - Automated EDA reports
812
+ - **Plotly / Matplotlib** - Interactive visualizations
813
+
814
+ ### DevOps
815
+ - **Docker** - Containerization with multi-stage builds
816
+ - **Python-dotenv** - Environment variable management
817
+ - **SQLite** - Caching layer for performance
818
+
819
+ ---
820
+
821
+ ## 🐳 Docker Deployment
822
+
823
+ **Build and run with Docker:**
824
+
825
+ ```bash
826
+ docker build -t ds-agent .
827
+ docker run -p 8080:8080 --env-file .env ds-agent
828
+ ```
829
+
830
+ **Or use the deployment script:**
831
+
832
+ ```bash
833
+ .\build-and-deploy.ps1 # Windows
834
+ ./build-and-deploy.sh # Linux/Mac
835
+ ```
836
+
837
+ ---
838
+
839
+ ## πŸ“‚ Project Structure
840
+
841
+ ```
842
+ .
843
+ β”œβ”€β”€ FRRONTEEEND/ # React frontend
844
+ β”‚ β”œβ”€β”€ components/ # UI components
845
+ β”‚ β”‚ β”œβ”€β”€ ChatInterface.tsx # Main chat interface
846
+ β”‚ β”‚ β”œβ”€β”€ HeroGeometric.tsx # Landing page hero
847
+ β”‚ β”‚ └── ...
848
+ β”‚ β”œβ”€β”€ dist/ # Built frontend
849
+ β”‚ └── package.json
850
+ β”‚
851
+ β”œβ”€β”€ src/ # Python backend
852
+ β”‚ β”œβ”€β”€ api/
853
+ β”‚ β”‚ └── app.py # FastAPI application
854
+ β”‚ β”œβ”€β”€ orchestrator.py # Agent orchestrator
855
+ β”‚ β”œβ”€β”€ session_memory.py # Session management
856
+ β”‚ β”œβ”€β”€ tools/ # 82+ ML tools
857
+ β”‚ β”‚ β”œβ”€β”€ data_profiling.py
858
+ β”‚ β”‚ β”œβ”€β”€ data_cleaning.py
859
+ β”‚ β”‚ β”œβ”€β”€ feature_engineering.py
860
+ β”‚ β”‚ β”œβ”€β”€ model_training.py
861
+ β”‚ β”‚ └── ...
862
+ β”‚ └── utils/ # Helper utilities
863
+ β”‚
864
+ β”œβ”€β”€ Dockerfile # Multi-stage Docker build
865
+ β”œβ”€β”€ requirements.txt # Python dependencies
866
+ β”œβ”€β”€ start.ps1 / start.sh # Quick start scripts
867
+ └── README.md # This file
868
+ ```
869
+
870
+ ---
871
+
872
+ ## πŸ”‘ Environment Variables
873
+
874
+ Create a `.env` file in the root directory:
875
+
876
+ ```bash
877
+ # LLM Provider Configuration
878
+ LLM_PROVIDER=gemini
879
+
880
+ # API Keys
881
+ GOOGLE_API_KEY=your_gemini_api_key_here
882
+
883
+ # Model Configuration
884
+ GEMINI_MODEL=gemini-2.5-flash
885
+
886
+ # Cache Configuration
887
+ CACHE_DB_PATH=./cache_db/cache.db
888
+ CACHE_TTL_SECONDS=86400
889
+
890
+ # Output Configuration
891
+ OUTPUT_DIR=./outputs
892
+ DATA_DIR=./data
893
+ ```
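The backend can read these settings with stdlib `os.getenv`, falling back to the documented defaults. A sketch (the helper name and dict shape are assumptions; the project may use python-dotenv instead):

```python
import os

def load_config() -> dict:
    """Read the environment variables above, with the documented defaults."""
    return {
        "llm_provider": os.getenv("LLM_PROVIDER", "gemini"),
        "google_api_key": os.getenv("GOOGLE_API_KEY", ""),
        "gemini_model": os.getenv("GEMINI_MODEL", "gemini-2.5-flash"),
        "cache_db_path": os.getenv("CACHE_DB_PATH", "./cache_db/cache.db"),
        "cache_ttl_seconds": int(os.getenv("CACHE_TTL_SECONDS", "86400")),
        "output_dir": os.getenv("OUTPUT_DIR", "./outputs"),
        "data_dir": os.getenv("DATA_DIR", "./data"),
    }
```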
894
+
895
+ ---
896
+
897
+ ## 🎯 Features in Detail
898
+
899
+ ### Intelligent Intent Detection
900
+ The agent automatically classifies your request and applies the appropriate workflow:
901
+ - **Full ML Pipeline** - Complete end-to-end workflow with training
902
+ - **Exploratory Analysis** - Data profiling and visualization only
903
+ - **Cleaning Only** - Data quality improvements without modeling
904
+ - **Visualization Only** - Generate plots and dashboards
905
+ - **Multi-Intent** - Combine multiple tasks intelligently
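A keyword-based sketch of this classification (illustrative keyword map only, not the production classifier, which relies on the LLM):

```python
INTENT_KEYWORDS = {  # hypothetical keyword map for illustration
    "full_ml_pipeline": ["train", "predict", "model"],
    "cleaning_only": ["clean", "missing", "impute", "outlier"],
    "visualization_only": ["plot", "chart", "dashboard", "visual", "heatmap"],
    "exploratory_analysis": ["profile", "explore", "summary", "report"],
}

def detect_intents(message: str) -> list[str]:
    """Return every matching intent; more than one match means a multi-intent workflow."""
    text = message.lower()
    return [intent for intent, words in INTENT_KEYWORDS.items()
            if any(w in text for w in words)]

print(detect_intents("Clean the data and show me visualizations"))
# ['cleaning_only', 'visualization_only']
```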
906
+
907
+ ### Session Memory
908
+ The agent remembers context across messages:
909
+ ```
910
+ You: "Train a model on this dataset"
911
+ Agent: [Trains XGBoost model with RΒ² = 0.85]
912
+
913
+ You: "Now try hyperparameter tuning"
914
+ Agent: [Automatically uses previous model and dataset]
915
+
916
+ You: "Cross-validate it"
917
+ Agent: [Applies CV to tuned model from context]
918
+ ```
919
+
920
+ ### Error Recovery
921
+ - Automatic retry with corrected parameters
922
+ - File existence validation before execution
923
+ - Recovery guidance showing last successful file
924
+ - Loop detection to prevent infinite retries
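Loop detection boils down to noticing when the same (tool, error) pair keeps recurring. A minimal sketch under assumed names:

```python
from collections import deque

class LoopDetector:
    """Abort retries when the same (tool, error) pair repeats within a sliding window."""
    def __init__(self, max_repeats: int = 3, window: int = 10):
        self.max_repeats = max_repeats
        self.recent: deque = deque(maxlen=window)  # last N failures

    def should_abort(self, tool: str, error: str) -> bool:
        self.recent.append((tool, error))
        return self.recent.count((tool, error)) >= self.max_repeats

detector = LoopDetector(max_repeats=3)
for _ in range(3):
    stuck = detector.should_abort("train_baseline_models", "column 'Price' not found")
print(stuck)  # True -> stop retrying and ask the user for guidance
```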
925
+
926
+ ### Report Viewing
927
+ - Click "View Report" buttons to see HTML reports in-app
928
+ - Full-screen modal with professional styling
929
+ - Supports YData Profiling and custom dashboards
930
+
931
+ ---
932
+
933
+ ## πŸ“Š Example Workflow
934
+
935
+ **Upload:** `earthquake_data.csv` (175K rows, 22 columns)
936
+
937
+ **Prompt:** "Train a model to predict earthquake magnitude"
938
+
939
+ **Agent Actions:**
940
+ 1. βœ… Profiles dataset (175,947 rows, 22 columns)
941
+ 2. βœ… Detects data quality issues (11.67% missing, outliers)
942
+ 3. βœ… Drops high-missing columns (>40% missing)
943
+ 4. βœ… Imputes remaining missing values with median/mode
944
+ 5. βœ… Handles outliers with IQR clipping
945
+ 6. βœ… Extracts time-based features (year, month, hour, cyclical)
946
+ 7. βœ… Encodes categorical variables
947
+ 8. βœ… Trains 6 baseline models (XGBoost wins with RΒ² = 0.716)
948
+ 9. βœ… Performs hyperparameter tuning (RΒ² = 0.743)
949
+ 10. βœ… Runs 5-fold cross-validation (RMSE = 0.167 Β± 0.0005)
950
+ 11. βœ… Generates YData profiling report
951
+ 12. βœ… Creates interactive Plotly dashboard
952
+
953
+ **Result:** Trained and tuned XGBoost model ready for deployment!
954
+
955
+ ---
956
+
957
+ ## 🀝 Contributing
958
+
959
+ Contributions are welcome! Please feel free to submit a Pull Request.
960
+
961
+ ---
962
+
963
+ ## πŸ“„ License
964
+
965
+ This project is licensed under the MIT License.
966
+
967
+ ---
968
+
969
+ ## πŸ™ Acknowledgments
970
+
971
+ - **Google Gemini** for powerful LLM capabilities
972
+ - **FastAPI** for excellent async Python framework
973
+ - **React** community for amazing UI libraries
974
+ - **Polars** for blazing-fast data processing
975
+ - **YData Profiling** for comprehensive EDA reports
976
+
977
+ ---
978
+
979
+ ## πŸ“§ Contact
980
+
981
+ **Pulastya B**
982
+ - GitHub: [@Pulastya-B](https://github.com/Pulastya-B)
983
+ - Project: [DevSprint-Data-Science-Agent](https://github.com/Pulastya-B/DevSprint-Data-Science-Agent)
984
+
985
+ ---
986
+
987
+ <div align="center">
988
+
989
+ **Built with ❀️ for DevSprint Hackathon**
990
+
991
+ ⭐ Star this repo if you find it helpful!
992
+
993
+ </div>