Commit d92d2aa
Pulastya B committed
Parent(s): e237c76

refactor: Remove Sweetviz and use YData Profiling as primary EDA tool

- Remove Sweetviz from requirements.txt due to NumPy 2.x incompatibility
- Remove generate_sweetviz_report and generate_combined_eda_report functions
- Update orchestrator to use only generate_ydata_profiling_report
- Remove Sweetviz imports from tools __init__.py
- Update tools_registry.py to remove Sweetviz tool definitions
- Update README.md to remove all Sweetviz references
- Uninstall sweetviz package
YData Profiling provides:
- Full NumPy 2.x compatibility
- More comprehensive analysis than Sweetviz
- Better maintained with regular updates
- Automated insights and data quality warnings
- README.md +481 -13
- requirements.txt +1 -2
- src/orchestrator.py +3 -7
- src/tools/__init__.py +3 -7
- src/tools/eda_reports.py +1 -210
- src/tools/tools_registry.py +2 -36
README.md
CHANGED
@@ -39,7 +39,7 @@ The frontend is built with React 19 and TypeScript 5.8, featuring a modern glass

- **Landing Page**: Geometric hero section with animated background paths, key capabilities showcase, problem-solution presentation, process flow visualization, and technology stack display
- **Chat Interface**: Real-time message streaming, file upload support for CSV and Parquet formats, markdown rendering for formatted responses with code syntax highlighting, loading states with animated indicators, and error handling with user-friendly messages
-- **Report Viewer**: In-application modal viewer for HTML reports generated by YData Profiling
- **Session Management**: Maintains conversation history across browser sessions, allows users to review previous analyses, and provides context for follow-up questions

### Complete Machine Learning Pipeline
@@ -127,7 +127,7 @@ Type your request in natural language in the chat input box. The agent understan

**Step 5: Review Results**

-The agent will execute the requested workflow and display results in the chat interface. For analyses that generate HTML reports (such as YData Profiling

### Example Queries and Use Cases
@@ -220,21 +220,489 @@ The startup script will:

- **Matplotlib 3.8+**: Fundamental plotting library for Python offering publication-quality static visualizations
- **Pydantic 2.5+**: Data validation library using Python type annotations for request/response models

-###
-
-- **Pandas 2.1+**: Secondary support for compatibility with legacy tools and libraries
-- **SQLite**: Embedded database for caching query results and session management
-- **Python-dotenv**: Environment variable management from .env files

-###
```bash
-
npm install
npm run build
cd ..
```
- **Landing Page**: Geometric hero section with animated background paths, key capabilities showcase, problem-solution presentation, process flow visualization, and technology stack display
- **Chat Interface**: Real-time message streaming, file upload support for CSV and Parquet formats, markdown rendering for formatted responses with code syntax highlighting, loading states with animated indicators, and error handling with user-friendly messages
- **Report Viewer**: In-application modal viewer for HTML reports generated by YData Profiling and custom dashboard tools. Full-screen modal with professional styling, iframe embedding for report content, and download capabilities
- **Session Management**: Maintains conversation history across browser sessions, allows users to review previous analyses, and provides context for follow-up questions

### Complete Machine Learning Pipeline
**Step 5: Review Results**

The agent will execute the requested workflow and display results in the chat interface. For analyses that generate HTML reports (such as YData Profiling), a "View Report" button will appear. Click this button to open the report in a full-screen modal viewer.

### Example Queries and Use Cases
- **Matplotlib 3.8+**: Fundamental plotting library for Python offering publication-quality static visualizations
- **Pydantic 2.5+**: Data validation library using Python type annotations for request/response models

### Docker Deployment

The application includes a multi-stage Dockerfile for optimized containerized deployment.
### Building the Docker Image

Build the Docker image with the following command:

```bash
docker build -t ds-agent:latest .
```

The multi-stage build process:

1. **Stage 1 (Builder)**: Installs Node.js dependencies and builds the React frontend
2. **Stage 2 (Runtime)**: Sets up the Python environment, installs backend dependencies, and copies the built frontend
3. **Result**: An optimized image that excludes development dependencies and build tools
### Running the Container

Run the containerized application:

```bash
docker run -d \
  -p 8080:8080 \
  --env-file .env \
  --name ds-agent-container \
  ds-agent:latest
```

Parameters explained:

- `-d`: Run the container in detached mode (in the background)
- `-p 8080:8080`: Map container port 8080 to host port 8080
- `--env-file .env`: Load environment variables from the `.env` file
- `--name ds-agent-container`: Assign a name to the container for easy management
### Docker Compose (Recommended)

For easier management, create a `docker-compose.yml` file:

```yaml
version: '3.8'

services:
  ds-agent:
    build: .
    container_name: ds-agent
    ports:
      - "8080:8080"
    env_file:
      - .env
    volumes:
```

## Environment Configuration

The application uses environment variables for configuration management. Create a `.env` file in the project root directory with the following variables:
### Required Configuration

```bash
# LLM Provider Selection
LLM_PROVIDER=gemini
# Options: gemini (currently supported)

# Google Gemini API Key (REQUIRED)
GOOGLE_API_KEY=your_api_key_here
# Obtain from: https://ai.google.dev/
# Free tier limits: 10 RPM, 20 RPD

# Gemini Model Selection
GEMINI_MODEL=gemini-2.5-flash
# Options:
# - gemini-2.5-flash (recommended, balanced performance)
# - gemini-1.5-pro (higher capability, lower rate limits)
# - gemini-1.5-flash (faster, lower cost)
```
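These variables can be read at startup with the standard library alone. The following is a minimal sketch; the helper names and the validation rules are illustrative, not the project's actual code:

```python
import os

def load_llm_config(env=os.environ) -> dict:
    """Read the LLM settings documented above; defaults are illustrative."""
    return {
        "provider": env.get("LLM_PROVIDER", "gemini"),
        "api_key": env.get("GOOGLE_API_KEY"),  # required, no default
        "model": env.get("GEMINI_MODEL", "gemini-2.5-flash"),
    }

def validate(config: dict) -> list:
    """Return a list of configuration problems (an empty list means OK)."""
    problems = []
    if not config["api_key"]:
        problems.append("GOOGLE_API_KEY is not set; add it to your .env file")
    if config["provider"] != "gemini":
        problems.append("unsupported LLM_PROVIDER: " + str(config["provider"]))
    return problems
```

In a real deployment, python-dotenv (or Docker's `--env-file`, shown above) would populate the environment before this code runs.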
### Optional Configuration

## Advanced Features
### Intelligent Intent Detection and Classification

The orchestration system employs sophisticated intent detection to automatically classify user requests and route them to appropriate workflow pipelines. The classification system analyzes incoming natural-language queries using keyword matching, pattern recognition, and contextual understanding.

**Intent Categories:**

1. **Full ML Pipeline Intent**: Triggered by keywords such as "train", "model", "predict", "machine learning", "regression", "classification". Executes the complete workflow, including data profiling, cleaning, feature engineering, model training, hyperparameter tuning, and evaluation.

2. **Exploratory Analysis Intent**: Activated by keywords like "explore", "profile", "report", "analysis", "overview", "insights", "understand". Performs comprehensive data profiling with statistical summaries, distribution analysis, correlation matrices, and automated insights generation.

3. **Data Cleaning Intent**: Detected via keywords such as "clean", "missing", "outliers", "duplicates", "impute", "handle". Focuses on data quality improvement operations without proceeding to modeling.

4. **Visualization Intent**: Identified through keywords like "plot", "visualize", "chart", "graph", "heatmap", "distribution". Generates the requested visualizations without performing modeling or extensive preprocessing.

5. **Feature Engineering Intent**: Recognized by keywords such as "feature", "engineer", "create features", "transform", "encode". Applies feature transformation and creation operations.

6. **Multi-Intent Workflows**: The system can detect and handle requests combining multiple intents, executing them in a logical sequence.

The intent classification system uses confidence scoring to handle ambiguous requests and can ask clarifying questions when intent is unclear.
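A keyword-and-confidence scheme like the one described can be sketched in a few lines. This is not the project's classifier; the keyword sets are distilled from the category descriptions above, and the confidence rule (share of total keyword hits) is an assumption:

```python
from collections import Counter

# Illustrative keyword lists, taken from the intent categories above.
INTENT_KEYWORDS = {
    "ml_pipeline": {"train", "model", "predict", "regression", "classification"},
    "eda": {"explore", "profile", "report", "analysis", "overview", "insights"},
    "cleaning": {"clean", "missing", "outliers", "duplicates", "impute"},
    "visualization": {"plot", "visualize", "chart", "graph", "heatmap"},
    "feature_engineering": {"feature", "engineer", "transform", "encode"},
}

def classify_intent(query: str):
    """Score each intent by keyword hits; return (best_intent, confidence).

    Confidence is the winning intent's share of all keyword hits; a low
    value signals an ambiguous request that warrants a clarifying question.
    """
    words = set(query.lower().split())
    scores = Counter({name: len(words & kw)
                      for name, kw in INTENT_KEYWORDS.items()})
    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    intent, hits = scores.most_common(1)[0]
    return intent, hits / total
```

A query like "plot the missing values" hits both the cleaning and visualization keyword sets, so its confidence drops below 1.0, which is exactly the ambiguous case the text says triggers a clarifying question.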
### Context-Aware Session Memory

The agent implements persistent session memory that maintains conversation context across multiple turns. This enables natural multi-turn dialogues where subsequent requests can reference previous operations without requiring full context repetition.

**Session Memory Capabilities:**

- **Workflow History**: Stores the complete history of executed tools, parameters, and results for the current session
- **File State Tracking**: Maintains references to uploaded files, intermediate processed datasets, and generated outputs
- **Model Persistence**: Remembers trained models and their performance metrics for comparison and further tuning
- **Error Context**: Stores information about encountered errors to avoid repeating failed operations
- **User Preferences**: Learns from user choices (e.g., preferred visualization types, imputation strategies)
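The five capabilities above map naturally onto a small per-session state object. A hedged sketch (field and method names are assumptions, not the project's actual data model):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SessionMemory:
    """Illustrative per-session state mirroring the capabilities above."""
    workflow_history: list = field(default_factory=list)  # (tool, params, result)
    file_state: dict = field(default_factory=dict)        # logical name -> path
    models: dict = field(default_factory=dict)            # model name -> metrics
    errors: list = field(default_factory=list)            # past failures
    preferences: dict = field(default_factory=dict)       # learned user choices

    def record_tool(self, tool: str, params: dict, result: str) -> None:
        self.workflow_history.append((tool, params, result))

    def latest_file(self) -> Optional[str]:
        """Return the most recently produced dataset path, if any."""
        return next(reversed(self.file_state.values()), None)
```

A follow-up request like "now train a model" can then resolve "the data" to `latest_file()` without the user repeating the path.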
**Example Multi-Turn Conversation:**
## Complete Workflow Example

This section demonstrates a complete end-to-end workflow for a real-world dataset, showing the agent's autonomous decision-making and execution capabilities.
### Dataset: Earthquake Magnitude Prediction

**Input Dataset:** `earthquake_data.csv`
- Rows: 175,947 earthquake records
- Columns: 22 features including latitude, longitude, depth, time, location, and magnitude
- Target Variable: Earthquake magnitude (continuous regression task)
- Data Quality: 11.67% missing values, presence of outliers, mixed data types

**User Prompt:**
```
"Train a model to predict earthquake magnitude with the highest possible accuracy"
```
### Automated Workflow Execution

**Phase 1: Data Profiling and Assessment** (Step 1)
- Tool: `generate_ydata_profile`
- Action: Comprehensive statistical analysis of all 22 features
- Findings:
  - Total records: 175,947
  - Missing values detected in 8 columns
  - Outliers present in depth, latitude, and longitude
  - High cardinality in the location column (15,000+ unique values)
  - Strong correlation between depth and magnitude (r=0.62)
- Output: YData Profiling HTML report saved to `outputs/earthquake_profile.html`
- Time: 18.3 seconds
## API Reference

The FastAPI backend exposes several endpoints for programmatic interaction.
### Endpoints

**POST /chat**
- Description: Send a message to the agent with optional file upload
- Content-Type: multipart/form-data
- Parameters:
  - message (string, required): User's natural language request
  - file (file, optional): Dataset file (CSV or Parquet)
- Response: JSON with the agent's response message and workflow history
- Example:

```bash
curl -X POST http://localhost:8080/chat \
  -F "message=Generate a data profile report" \
  -F "file=@dataset.csv"
```

**POST /run**
- Description: Execute a complete analysis workflow
- Content-Type: application/json
- Parameters:
  - query (string, required): Analysis request
  - use_cache (boolean, optional): Enable caching (default: true)
- Response: JSON with analysis results and generated artifacts
- Example:

```json
{
  "query": "Train a regression model to predict sales",
  "use_cache": true
}
```

**GET /outputs/{file_path}**
- Description: Retrieve generated reports and artifacts
- Parameters:
  - file_path (string, required): Path to the output file
- Response: File content (HTML, PNG, CSV, etc.)
- Example:

```bash
curl http://localhost:8080/outputs/ydata_profile.html
```

**GET /api/health**
- Description: Health check endpoint
- Response: JSON with status information
- Example response:

```json
{
  "status": "healthy",
  "version": "1.0.0",
  "timestamp": "2025-12-27T10:30:00Z"
}
```

### Interactive API Documentation

FastAPI automatically generates interactive API documentation:
- Swagger UI: http://localhost:8080/docs
- ReDoc: http://localhost:8080/redoc
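The same endpoints can be driven from Python. The helpers below only build the request bodies and URLs described above; the function names are illustrative, and actually sending the requests (sketched in the trailing comment) assumes the server is running locally on port 8080:

```python
import json

BASE_URL = "http://localhost:8080"  # default port from the run instructions above

def run_body(query: str, use_cache: bool = True) -> str:
    """Serialize the JSON body for POST /run, matching the parameters above."""
    return json.dumps({"query": query, "use_cache": use_cache})

def report_url(file_name: str) -> str:
    """URL for fetching a generated artifact via GET /outputs/{file_path}."""
    return BASE_URL + "/outputs/" + file_name

# To actually call the API (server must be running):
#   import urllib.request
#   req = urllib.request.Request(BASE_URL + "/run",
#                                data=run_body("profile my data").encode(),
#                                headers={"Content-Type": "application/json"})
#   response = urllib.request.urlopen(req).read()
```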
## Contributing

Contributions to improve the AI-Powered Data Science Agent are welcome. Please follow these guidelines:

### Development Setup

1. Fork the repository and clone your fork
2. Create a new branch for your feature: `git checkout -b feature/your-feature-name`
3. Install development dependencies: `pip install -r requirements-dev.txt`
4. Make your changes with appropriate tests
5. Ensure all tests pass: `pytest tests/`
6. Format code with black: `black src/`
7. Lint code with flake8: `flake8 src/`
8. Commit with descriptive messages
9. Push to your fork and submit a pull request

### Code Style

- Follow PEP 8 guidelines for Python code
- Use type hints for function parameters and return values
- Write docstrings for all functions and classes
- Keep functions focused and under 50 lines when possible
- Use meaningful variable names

### Testing

- Write unit tests for new features
- Ensure existing tests pass before submitting a PR
- Aim for >80% code coverage
## License

This project is licensed under the MIT License. See the LICENSE file for complete terms.

Copyright (c) 2025 Pulastya B

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
## Acknowledgments

This project builds upon several excellent open-source technologies and frameworks:

- **Google Gemini 2.5 Flash**: Advanced language model with function-calling capabilities enabling intelligent agent orchestration
- **FastAPI**: Modern, high-performance web framework for building APIs with Python, providing automatic documentation and validation
- **React**: JavaScript library for building user interfaces, enabling component-based architecture and efficient rendering
- **Polars**: High-performance DataFrame library written in Rust, offering significant speed improvements over traditional data processing libraries
- **Scikit-learn**: Machine learning library providing simple and efficient tools for data analysis and modeling
- **XGBoost, LightGBM, CatBoost**: Gradient boosting frameworks offering state-of-the-art performance for structured data
- **Optuna**: Hyperparameter optimization framework with efficient search algorithms
- **YData Profiling**: Automated exploratory data analysis tool generating comprehensive reports
- **Plotly**: Interactive visualization library for creating publication-quality graphs
- **TypeScript**: Typed superset of JavaScript enhancing code quality and developer experience
- **Tailwind CSS**: Utility-first CSS framework for rapid UI development
- **Vite**: Next-generation frontend build tool with instant server start

Special thanks to the open-source community for creating and maintaining these exceptional tools.
## Contact and Support

**Developer:** Pulastya B

**GitHub Profile:** [@Pulastya-B](https://github.com/Pulastya-B)

**Project Repository:** [DevSprint-Data-Science-Agent](https://github.com/Pulastya-B/DevSprint-Data-Science-Agent)

**Issues and Bug Reports:** Please use the GitHub Issues page to report bugs or request features

**Documentation:** Additional documentation and tutorials are available in the repository wiki

**Project Status:** Active development - built for the DevSprint Hackathon

For questions, suggestions, or collaboration opportunities, please open an issue on GitHub or contact through the repository.

---

**Last Updated:** December 27, 2025

**Version:** 1.0.0
Step 6 - Temporal Feature Extraction:
- Tool: `extract_time_features`
- Input column: 'timestamp'
- Features created:
  - year, month, day_of_week, hour
  - Cyclical encodings: hour_sin, hour_cos, month_sin, month_cos
- Justification: Earthquakes may have temporal patterns
- New columns: 8 time-based features

Step 7 - Categorical Encoding:
- Tool: `encode_categorical_features`
- Method: Target encoding for 'location' (high cardinality), one-hot encoding for 'type'
- Result: All categorical variables converted to numeric
- New columns: 3 (reduced from high-cardinality location)

Step 8 - Statistical Features:
- Tool: `create_statistical_features`
- Features created:
  - Distance from nearest plate boundary (calculated from lat/lon)
  - Depth-to-magnitude ratio
  - Regional earthquake frequency (rolling count)
- New columns: 3 domain-specific features

Final feature count: 28 engineered features
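The cyclical encodings from Step 6 (hour_sin, hour_cos, and so on) are a standard transform: a periodic value is mapped onto the unit circle so that, for example, hour 23 ends up numerically close to hour 0. A minimal sketch, independent of the agent's actual `extract_time_features` implementation:

```python
import math

def cyclical_encode(value: float, period: float):
    """Map a periodic value (e.g. hour of day, period=24) onto the unit
    circle, returning the (sin, cos) pair used as two model features."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

# hour 0 sits at (0, 1) on the circle; hour 23 lands right next to it,
# whereas the raw values 0 and 23 look maximally far apart to a model
midnight = cyclical_encode(0, 24)
eleven_pm = cyclical_encode(23, 24)
```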
**Phase 5: Model Training and Selection** (Step 9)
- Tool: `train_baseline_models`
- Algorithms trained in parallel:

  1. Ridge Regression: R² = 0.534, RMSE = 0.312
  2. Lasso Regression: R² = 0.541, RMSE = 0.309
  3. ElasticNet: R² = 0.538, RMSE = 0.311
  4. Random Forest: R² = 0.698, RMSE = 0.251
  5. XGBoost: R² = 0.716, RMSE = 0.243 (BEST)
  6. LightGBM: R² = 0.709, RMSE = 0.247
  7. CatBoost: R² = 0.712, RMSE = 0.245

- Best model selected: XGBoost
- Validation split: 80/20 stratified split
- Time: 124.7 seconds
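The selection step boils down to scoring every candidate on held-out data and keeping the highest R². A dependency-free sketch of that logic (the toy "models" here are plain callables standing in for the fitted estimators above; `r2_score` follows the usual definition):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination, the metric used to rank baselines."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def select_best(models, X_val, y_val):
    """Score already-trained predictors on validation data and keep the
    best, mirroring how XGBoost wins in the table above."""
    scored = {name: r2_score(y_val, [m(x) for x in X_val])
              for name, m in models.items()}
    best = max(scored, key=scored.get)
    return best, scored

# Two toy predictors on y = 2x: the mean predictor vs. the true slope
X, y = [1, 2, 3, 4], [2, 4, 6, 8]
best, scores = select_best({"mean": lambda x: 5.0, "slope": lambda x: 2 * x}, X, y)
```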
**Phase 6: Hyperparameter Optimization** (Step 10)
- Tool: `optimize_hyperparameters_optuna`
- Framework: Optuna with Tree-structured Parzen Estimator (TPE)
- Search space:
  - max_depth: [3, 10]
  - learning_rate: [0.001, 0.3] (log scale)
  - n_estimators: [100, 1000]
  - min_child_weight: [1, 10]
  - subsample: [0.6, 1.0]
  - colsample_bytree: [0.6, 1.0]
- Trials: 50 iterations
- Best parameters found:
  - max_depth: 7
  - learning_rate: 0.0847
  - n_estimators: 673
  - min_child_weight: 3
  - subsample: 0.8234
  - colsample_bytree: 0.9123
- Optimized performance: R² = 0.743, RMSE = 0.231
- Improvement: +3.8% R² over baseline
- Time: 312.4 seconds
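To show the shape of such a search without pulling in Optuna, here is plain random search (deliberately swapped in for the TPE sampler described above) over the same search space. The samplers and the trial loop are illustrative only:

```python
import random

# Search space matching the README's Phase 6 ranges.
SPACE = {
    "max_depth": lambda rng: rng.randint(3, 10),
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, -0.52),  # ~[0.001, 0.3], log scale
    "n_estimators": lambda rng: rng.randint(100, 1000),
    "min_child_weight": lambda rng: rng.randint(1, 10),
    "subsample": lambda rng: rng.uniform(0.6, 1.0),
    "colsample_bytree": lambda rng: rng.uniform(0.6, 1.0),
}

def random_search(objective, n_trials=50, seed=0):
    """Sample n_trials parameter sets, keep the highest-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: draw(rng) for name, draw in SPACE.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective: in a real run this would be a cross-validated model score
params, score = random_search(lambda p: -abs(p["max_depth"] - 7))
```

Optuna's TPE sampler replaces the blind `draw` calls with a model of which regions of the space have scored well so far, which is why it typically needs fewer trials.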
**Phase 7: Model Validation** (Step 11)
- Tool: `cross_validate_model`
- Method: 5-fold stratified cross-validation
- Results:
  - Fold 1: R² = 0.741, RMSE = 0.232
  - Fold 2: R² = 0.745, RMSE = 0.230
  - Fold 3: R² = 0.738, RMSE = 0.234
  - Fold 4: R² = 0.747, RMSE = 0.229
  - Fold 5: R² = 0.742, RMSE = 0.232
- Mean performance: R² = 0.743 ± 0.003, RMSE = 0.231 ± 0.002
- Interpretation: Low variance across folds indicates a robust, generalizable model
- Time: 267.8 seconds
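The fold mechanics behind this phase can be sketched without any ML library: split the index range into k disjoint validation folds, train on the rest each time, and aggregate the per-fold scores. Stratification (mentioned above) is omitted here for brevity:

```python
def kfold_indices(n: int, k: int = 5):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    Every sample appears in exactly one validation fold, so the k scores
    together cover the whole dataset, as in the Phase 7 table above.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

folds = list(kfold_indices(10, 5))  # five (train, val) index pairs
```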
**Phase 8: Visualization and Reporting** (Steps 12-13)

Step 12 - Feature Importance Analysis:
- Tool: `plot_feature_importance`
- Top 10 features by importance:
  1. depth (0.284)
  2. distance_to_plate_boundary (0.167)
  3. latitude (0.142)
  4. longitude (0.138)
  5. regional_frequency (0.095)
  6. depth_magnitude_ratio (0.067)
  7. hour_sin (0.034)
  8. month (0.028)
  9. location_encoded (0.024)
  10. year (0.021)
- Output: Interactive Plotly bar chart saved to `outputs/feature_importance.html`

Step 13 - Comprehensive Dashboard:
- Tool: `create_plotly_dashboard`
- Visualizations included:
  - Correlation heatmap (28x28 features)
  - Actual vs. predicted scatter plot
  - Residual distribution plot
  - Feature importance ranking
  - Temporal patterns in predictions
- Output: Multi-panel interactive dashboard saved to `outputs/model_dashboard.html`
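Before any chart is drawn, a ranking like the Step 12 list is just raw importance scores normalized to sum to 1 and sorted. A small sketch of that computation (the example scores are made up for illustration, not taken from the model):

```python
def top_features(importances: dict, n: int = 10):
    """Normalize raw importance scores so they sum to 1 and return the
    top-n (name, share) pairs, largest first."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, round(score / total, 3)) for name, score in ranked[:n]]

ranking = top_features({"depth": 28.4, "latitude": 14.2, "hour_sin": 3.4})
```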
### Final Results Summary

**Model Performance:**
- Algorithm: XGBoost with optimized hyperparameters
- Training R²: 0.743
- Cross-validated R²: 0.743 ± 0.003
- RMSE: 0.231 (on magnitude scale 0-10)
- MAE: 0.176
- Explanation: The model explains 74.3% of the variance in earthquake magnitudes

**Artifacts Generated:**
- Trained model file: `outputs/xgboost_model_optimized.pkl`
- YData profiling report: `outputs/earthquake_profile.html`
- Feature importance plot: `outputs/feature_importance.html`
- Interactive dashboard: `outputs/model_dashboard.html`
- Cleaned dataset: `data/earthquake_data_cleaned.parquet`
- Feature engineered dataset: `data/earthquake_data_featured.parquet`

**Total Execution Time:** 12 minutes 43 seconds

**Key Insights:**
1. Depth is the strongest predictor of earthquake magnitude (28.4% importance)
2. Spatial features (distance to plate boundaries, lat/lon) are highly informative
3. Temporal patterns show cyclical variations in earthquake characteristics
4. Model performance is consistent across cross-validation folds (low variance)
5. The optimized XGBoost model provides reliable magnitude predictions suitable for deployment
### Robust Error Recovery System

The agent implements a comprehensive error recovery system designed to handle failures gracefully and guide users toward successful task completion.

**Error Recovery Mechanisms:**

1. **Automatic Retry with Correction**: When a tool execution fails due to incorrect parameters, the agent analyzes the error message, adjusts parameters based on the error type, and automatically retries the operation with corrected inputs.

2. **File Existence Validation**: Before executing tools that require specific file inputs, the system validates file existence and accessibility, providing clear guidance when files are missing.

3. **Column Name Validation**: Validates that requested column names exist in the dataset before performing operations, suggesting similar column names when exact matches aren't found.

4. **Dependency Tracking**: Ensures tools are executed in the proper sequence, checking that prerequisite operations (e.g., data cleaning before training) have been completed.

5. **Loop Detection**: Monitors tool execution patterns to detect and prevent infinite retry loops. If the same operation fails multiple times with the same error, the agent stops retrying and requests user intervention.

6. **Recovery Guidance**: When errors cannot be automatically resolved, the system provides detailed guidance, including:
   - A clear explanation of what went wrong
   - The last successful file state that can be used to continue
   - Suggested alternative approaches
   - Specific parameter corrections needed

7. **Graceful Degradation**: If a requested operation cannot be completed, the agent attempts to provide partial results or alternative analysis that may still be valuable.
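The "suggest similar column names" behavior in mechanism 3 can be built on the standard library's `difflib.get_close_matches`. A minimal sketch (the helper name and the 0.6 cutoff are illustrative choices, not the project's code):

```python
from difflib import get_close_matches

def suggest_column(requested: str, columns) -> list:
    """Case-insensitive fuzzy lookup for a missing-column error: return an
    exact case-folded match if one exists, otherwise up to 3 close names."""
    lowered = {c.lower(): c for c in columns}
    exact = lowered.get(requested.lower())
    if exact is not None:
        return [exact]  # same name, different case
    matches = get_close_matches(requested.lower(), lowered, n=3, cutoff=0.6)
    return [lowered[m] for m in matches]

suggestions = suggest_column("Price", ["SalePrice", "price_usd", "LotArea"])
```

These suggestions are what the agent would splice into its "Did you mean ...?" reply in the flow below.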
**Example Error Recovery Flow:**

```
Request: "Train a model to predict 'Price' column"

Error Detected: Column 'Price' not found in dataset
Recovery Action: Search for similar columns → Find 'price', 'PRICE', 'SalePrice'
Agent Response: "Column 'Price' not found. Did you mean 'SalePrice'? I found these similar columns: ['SalePrice', 'price_usd']. Please specify which column to use."

User: "Yes, use SalePrice"
Agent: [Continues with corrected column name]
```
### Interactive Report Viewing

The web interface includes an integrated report viewer that displays comprehensive HTML reports generated during analysis without requiring users to download files or switch to external tools.

**Report Viewer Features:**

- **In-Application Display**: Reports open in a full-screen modal overlay within the chat interface
- **Multiple Report Types**: Supports YData Profiling reports and custom HTML dashboards
- **Professional Styling**: Modal features glassmorphism design, smooth animations, and responsive layout
- **Interactive Navigation**: Users can zoom, scroll, and interact with report elements directly in the viewer
- **Download Option**: Reports can be downloaded as standalone HTML files for sharing or archival
- **Automatic Detection**: System automatically detects when tools generate HTML reports and creates "View Report" buttons in the chat interface
**Supported Report Types:**

1. **YData Profiling Reports**: Comprehensive automated EDA with variable statistics, distributions, correlations, missing value analysis, and alerts for data quality issues

2. **Custom Dashboards**: User-created Plotly dashboards with multiple interactive visualizations

The report extraction system uses multiple strategies to locate report files, including checking tool return values, parsing workflow history, and using regex pattern matching on agent responses.
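A minimal sketch of the regex-matching strategy, assuming a simple path pattern (the pattern and function name here are illustrative, not the project's exact code):

```python
import re

# Matches relative or absolute paths ending in .html,
# e.g. ./outputs/reports/ydata_profile.html (assumed path shape).
HTML_PATH_PATTERN = re.compile(r"[\w.\\/-]+\.html")

def extract_report_paths(agent_response: str) -> list[str]:
    """Pull candidate HTML report paths out of free-form agent text, de-duplicated."""
    seen: list[str] = []
    for match in HTML_PATH_PATTERN.findall(agent_response):
        if match not in seen:
            seen.append(match)
    return seen

text = "Report saved to ./outputs/reports/ydata_profile.html for review."
print(extract_report_paths(text))
```

In practice a pattern like this would run only as a fallback, after the more reliable strategies (tool return values, workflow history) have come up empty.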
- Use different API keys for development and production
- Rotate API keys periodically
- Set restrictive file permissions on `.env` (chmod 600 on Linux/macOS)

**Linux/macOS:**

```bash
chmod +x build-and-deploy.sh
./build-and-deploy.sh
```

These scripts handle building the image, stopping any existing containers, and starting a new container with proper configuration.

**Frontend:**

```bash
npm install
npm run build
cd ..
```
requirements.txt CHANGED

```diff
@@ -31,8 +31,7 @@ seaborn>=0.13.1
 plotly>=5.18.0  # Interactive visualizations
 
 # EDA Report Generation
-
-ydata-profiling>=4.17.0  # Updated for Python 3.13 compatibility
+ydata-profiling>=4.17.0  # Comprehensive automated EDA reports with Python 3.13 compatibility
 
 # User Interface
 # gradio>=5.49.1  # Replaced with React frontend
```
src/orchestrator.py CHANGED

```diff
@@ -104,10 +104,8 @@ from tools import (
     generate_interactive_box_plots,
     generate_interactive_time_series,
     generate_plotly_dashboard,
-    # EDA Report Generation (
-    generate_sweetviz_report,
+    # EDA Report Generation (1) - NEW PHASE 2
     generate_ydata_profiling_report,
-    generate_combined_eda_report,
     # Code Interpreter (2) - NEW PHASE 2 - TRUE AI AGENT CAPABILITY
     execute_python_code,
     execute_code_from_file,
@@ -332,10 +330,8 @@ class DataScienceCopilot:
             "generate_interactive_box_plots": generate_interactive_box_plots,
             "generate_interactive_time_series": generate_interactive_time_series,
             "generate_plotly_dashboard": generate_plotly_dashboard,
-            # EDA Report Generation (
-            "generate_sweetviz_report": generate_sweetviz_report,
+            # EDA Report Generation (1) - NEW PHASE 2
             "generate_ydata_profiling_report": generate_ydata_profiling_report,
-            "generate_combined_eda_report": generate_combined_eda_report,
             # Code Interpreter (2) - NEW PHASE 2 - TRUE AI AGENT CAPABILITY
             "execute_python_code": execute_python_code,
             "execute_code_from_file": execute_code_from_file,
@@ -668,7 +664,7 @@ Use specialized tools FIRST. Only use execute_python_code for:
 - NEW Automation: auto_ml_pipeline (zero-config full pipeline), auto_feature_selection
 - NEW Visualization: generate_all_plots, generate_data_quality_plots, generate_eda_plots, generate_model_performance_plots, generate_feature_importance_plot
 - NEW Interactive Plotly Visualizations: generate_interactive_scatter, generate_interactive_histogram, generate_interactive_correlation_heatmap, generate_interactive_box_plots, generate_interactive_time_series, generate_plotly_dashboard (interactive web-based plots with zoom/pan/hover)
-- NEW EDA Report Generation:
+- NEW EDA Report Generation: generate_ydata_profiling_report (comprehensive detailed analysis with full statistics, distributions, correlations, and data quality insights)
 - NEW Enhanced Feature Engineering: create_ratio_features, create_statistical_features, create_log_features, create_binned_features
 
 **RULES:**
```
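The dict in the orchestrator diff above maps tool names to Python callables, which is how the agent routes model-issued tool calls. A self-contained sketch of that dispatch pattern (the stub tool and the `dispatch` helper are illustrative assumptions, not the repository's code):

```python
from typing import Any, Callable, Dict

def generate_ydata_profiling_report(file_path: str, **kwargs: Any) -> Dict[str, Any]:
    # Stand-in for the real tool so the dispatch pattern can run on its own.
    return {"success": True, "report_path": "./outputs/reports/ydata_profile.html"}

# Name -> callable map, mirroring the orchestrator's tool-function dict.
TOOL_FUNCTIONS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "generate_ydata_profiling_report": generate_ydata_profiling_report,
}

def dispatch(tool_name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Route an LLM tool call to its Python implementation."""
    fn = TOOL_FUNCTIONS.get(tool_name)
    if fn is None:
        return {"success": False, "error": f"Unknown tool: {tool_name}"}
    return fn(**arguments)

print(dispatch("generate_ydata_profiling_report", {"file_path": "data.csv"}))
```

Because unknown names return an error dict instead of raising, the agent loop can report a bad tool call back to the model rather than crash, matching the error-recovery behavior described earlier.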
src/tools/__init__.py CHANGED

```diff
@@ -141,11 +141,9 @@ from .plotly_visualizations import (
     generate_plotly_dashboard
 )
 
-# EDA Report Generation (
+# EDA Report Generation (1) - NEW PHASE 2
 from .eda_reports import (
-
-    generate_ydata_profiling_report,
-    generate_combined_eda_report
+    generate_ydata_profiling_report
 )
 
 # Code Interpreter (2) - NEW PHASE 2 - CRITICAL for True AI Agent
@@ -279,10 +277,8 @@ __all__ = [
     "generate_interactive_time_series",
     "generate_plotly_dashboard",
 
-    # EDA Report Generation (
-    "generate_sweetviz_report",
+    # EDA Report Generation (1) - NEW PHASE 2
     "generate_ydata_profiling_report",
-    "generate_combined_eda_report",
 
     # Code Interpreter (2) - NEW PHASE 2 - CRITICAL for True AI Agent
     "execute_python_code",
```
src/tools/eda_reports.py CHANGED

```diff
@@ -1,6 +1,6 @@
 """
 EDA Report Generation Tools
-Generates comprehensive HTML reports using
+Generates comprehensive HTML reports using ydata-profiling.
 """
 
 import os
@@ -9,128 +9,6 @@ from typing import Dict, Any, Optional
 import polars as pl
 
 
-def generate_sweetviz_report(
-    file_path: str,
-    output_path: str = "./outputs/reports/sweetviz_report.html",
-    target_column: Optional[str] = None,
-    compare_file_path: Optional[str] = None
-) -> Dict[str, Any]:
-    """
-    Generate a beautiful HTML report using Sweetviz.
-
-    Sweetviz creates stunning visualizations for EDA with:
-    - Target analysis (associations with target variable)
-    - Feature distributions and statistics
-    - Correlations and relationships
-    - Missing value analysis
-    - Comparison between datasets (train vs test)
-
-    Args:
-        file_path: Path to the dataset CSV file
-        output_path: Where to save the HTML report
-        target_column: Optional target variable for analysis
-        compare_file_path: Optional second dataset to compare against
-
-    Returns:
-        Dict with success status, report path, and summary
-    """
-    try:
-        import warnings
-        import pandas as pd
-
-        # Suppress NumPy deprecation warnings that Sweetviz triggers
-        warnings.filterwarnings('ignore', category=DeprecationWarning)
-
-        import sweetviz as sv
-
-        # Read dataset (Sweetviz requires pandas)
-        if file_path.endswith('.csv'):
-            df = pd.read_csv(file_path)
-        elif file_path.endswith('.parquet'):
-            df = pd.read_parquet(file_path)
-        else:
-            raise ValueError(f"Unsupported file format: {file_path}")
-
-        # Create output directory if needed
-        os.makedirs(os.path.dirname(output_path) or "./outputs/reports", exist_ok=True)
-
-        # Generate report based on configuration
-        if compare_file_path:
-            # Comparison report (e.g., train vs test)
-            if compare_file_path.endswith('.csv'):
-                df_compare = pd.read_csv(compare_file_path)
-            elif compare_file_path.endswith('.parquet'):
-                df_compare = pd.read_parquet(compare_file_path)
-            else:
-                raise ValueError(f"Unsupported compare file format: {compare_file_path}")
-
-            report = sv.compare([df, "Dataset 1"], [df_compare, "Dataset 2"], target_column)
-        elif target_column:
-            # Analysis with target variable
-            if target_column not in df.columns:
-                available = list(df.columns)
-                return {
-                    "success": False,
-                    "error": f"Column '{target_column}' not found. Available columns: {', '.join(available)}",
-                    "suggestion": f"Did you mean one of: {', '.join(available[:5])}?"
-                }
-            report = sv.analyze([df, "Dataset"], target_feat=target_column)
-        else:
-            # Basic analysis without target
-            report = sv.analyze(df)
-
-        # Generate HTML report
-        report.show_html(filepath=output_path, open_browser=False, layout='vertical', scale=1.0)
-
-        # Get summary statistics
-        num_features = len(df.columns)
-        num_rows = len(df)
-        num_numeric = df.select_dtypes(include=['number']).shape[1]
-        num_categorical = df.select_dtypes(include=['object', 'category']).shape[1]
-        missing_pct = (df.isnull().sum().sum() / (num_rows * num_features)) * 100
-
-        return {
-            "success": True,
-            "report_path": output_path,
-            "message": f"✅ Sweetviz report generated successfully at: {output_path}",
-            "summary": {
-                "features": num_features,
-                "rows": num_rows,
-                "numeric_features": num_numeric,
-                "categorical_features": num_categorical,
-                "missing_percentage": round(missing_pct, 2),
-                "target_column": target_column,
-                "has_comparison": compare_file_path is not None
-            }
-        }
-
-    except ImportError:
-        return {
-            "success": False,
-            "error": "Sweetviz not installed. Install with: pip install sweetviz",
-            "error_type": "MissingDependency",
-            "workaround": "Use generate_ydata_profiling_report as an alternative for comprehensive EDA reports."
-        }
-    except AttributeError as e:
-        if "VisibleDeprecationWarning" in str(e) or "numpy" in str(e).lower():
-            return {
-                "success": False,
-                "error": "Sweetviz is incompatible with NumPy 2.x. NumPy version downgrade required.",
-                "error_type": "DependencyConflict",
-                "solution": "Downgrade NumPy to 1.x: py -m pip install 'numpy<2.0'",
-                "workaround": "Use generate_ydata_profiling_report instead - it's fully compatible with NumPy 2.x and provides more comprehensive analysis.",
-                "alternative_report_path": output_path.replace("sweetviz", "ydata_profile")
-            }
-        raise
-    except Exception as e:
-        return {
-            "success": False,
-            "error": f"Failed to generate Sweetviz report: {str(e)}",
-            "error_type": type(e).__name__,
-            "workaround": "Try generate_ydata_profiling_report for a comprehensive EDA report instead."
-        }
-
-
 def generate_ydata_profiling_report(
     file_path: str,
     output_path: str = "./outputs/reports/ydata_profile.html",
@@ -250,90 +128,3 @@ def generate_ydata_profiling_report(
             "error": f"Failed to generate ydata-profiling report: {str(e)}",
             "error_type": type(e).__name__
         }
-
-
-def generate_combined_eda_report(
-    file_path: str,
-    output_dir: str = "./outputs/reports",
-    target_column: Optional[str] = None,
-    minimal: bool = False
-) -> Dict[str, Any]:
-    """
-    Generate both Sweetviz and ydata-profiling reports in one call.
-
-    This convenience function creates comprehensive EDA reports using both tools,
-    giving you the best of both worlds:
-    - Sweetviz: Beautiful, fast, focused visualizations
-    - ydata-profiling: Comprehensive, detailed analysis
-
-    Args:
-        file_path: Path to the dataset CSV file
-        output_dir: Directory to save both reports
-        target_column: Optional target variable for Sweetviz analysis
-        minimal: If True, uses minimal mode for ydata-profiling
-
-    Returns:
-        Dict with success status and paths to both reports
-    """
-    try:
-        # Create output directory
-        os.makedirs(output_dir, exist_ok=True)
-
-        # Generate Sweetviz report
-        sweetviz_path = os.path.join(output_dir, "sweetviz_report.html")
-        sweetviz_result = generate_sweetviz_report(
-            file_path=file_path,
-            output_path=sweetviz_path,
-            target_column=target_column
-        )
-
-        # Generate ydata-profiling report
-        ydata_path = os.path.join(output_dir, "ydata_profile.html")
-        ydata_result = generate_ydata_profiling_report(
-            file_path=file_path,
-            output_path=ydata_path,
-            minimal=minimal
-        )
-
-        # Check if both succeeded
-        both_success = sweetviz_result["success"] and ydata_result["success"]
-
-        if both_success:
-            return {
-                "success": True,
-                "message": f"✅ Generated both EDA reports successfully in: {output_dir}",
-                "reports": {
-                    "sweetviz": {
-                        "path": sweetviz_path,
-                        "summary": sweetviz_result.get("summary", {})
-                    },
-                    "ydata_profiling": {
-                        "path": ydata_path,
-                        "statistics": ydata_result.get("statistics", {})
-                    }
-                },
-                "recommendation": "Open both reports in your browser to get comprehensive insights!"
-            }
-        else:
-            # At least one failed
-            errors = []
-            if not sweetviz_result["success"]:
-                errors.append(f"Sweetviz: {sweetviz_result['error']}")
-            if not ydata_result["success"]:
-                errors.append(f"ydata-profiling: {ydata_result['error']}")
-
-            return {
-                "success": False,
-                "error": " | ".join(errors),
-                "partial_results": {
-                    "sweetviz": sweetviz_result,
-                    "ydata_profiling": ydata_result
-                }
-            }
-
-    except Exception as e:
-        return {
-            "success": False,
-            "error": f"Failed to generate combined reports: {str(e)}",
-            "error_type": type(e).__name__
-        }
```
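Both the removed Sweetviz wrapper and the surviving ydata-profiling wrapper return the same `success`/`error`/`error_type` result dict on failure. That contract can be factored into a decorator; this is an illustrative sketch of the pattern, not code from the repository:

```python
from typing import Any, Callable, Dict

def tool_result(fn: Callable[..., Dict[str, Any]]) -> Callable[..., Dict[str, Any]]:
    """Wrap a tool so any exception becomes the module's standard error dict."""
    def wrapper(**kwargs: Any) -> Dict[str, Any]:
        try:
            return fn(**kwargs)
        except Exception as e:  # mirrors the broad except blocks in eda_reports.py
            return {
                "success": False,
                "error": str(e),
                "error_type": type(e).__name__,
            }
    return wrapper

@tool_result
def flaky_tool(file_path: str) -> Dict[str, Any]:
    # Hypothetical tool reusing the file-format check seen in the removed function.
    if not file_path.endswith((".csv", ".parquet")):
        raise ValueError(f"Unsupported file format: {file_path}")
    return {"success": True, "report_path": "./outputs/reports/ydata_profile.html"}

print(flaky_tool(file_path="data.txt")["error_type"])
```

Returning a structured error dict instead of raising is what lets the orchestrator's recovery loop inspect `error_type` and decide whether to retry, substitute a tool, or ask the user.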
src/tools/tools_registry.py CHANGED

```diff
@@ -1431,29 +1431,12 @@ TOOLS = [
             }
         }
     },
-    # EDA Report Generation (
-    {
-        "type": "function",
-        "function": {
-            "name": "generate_sweetviz_report",
-            "description": "Generate beautiful HTML EDA report using Sweetviz. Creates stunning visualizations with target analysis, feature distributions, correlations, missing values. Fast and visually appealing. Supports dataset comparison (train vs test).",
-            "parameters": {
-                "type": "object",
-                "properties": {
-                    "file_path": {"type": "string", "description": "Path to the dataset CSV/Parquet file"},
-                    "output_path": {"type": "string", "description": "Where to save HTML report (default: ./outputs/reports/sweetviz_report.html)"},
-                    "target_column": {"type": "string", "description": "Optional target variable for association analysis"},
-                    "compare_file_path": {"type": "string", "description": "Optional second dataset to compare (e.g., train vs test)"}
-                },
-                "required": ["file_path"]
-            }
-        }
-    },
+    # EDA Report Generation (1) - NEW PHASE 2
     {
         "type": "function",
         "function": {
             "name": "generate_ydata_profiling_report",
-            "description": "Generate comprehensive HTML report using ydata-profiling (formerly pandas-profiling). Provides extensive analysis: overview, variable statistics, interactions, correlations (Pearson, Spearman, Cramér's V), missing values matrix, duplicate analysis, and more. Most detailed profiling tool.",
+            "description": "Generate comprehensive HTML report using ydata-profiling (formerly pandas-profiling). Provides extensive analysis: overview, variable statistics, interactions, correlations (Pearson, Spearman, Cramér's V), missing values matrix, duplicate analysis, and more. Most detailed and comprehensive profiling tool with automated insights and data quality warnings.",
             "parameters": {
                 "type": "object",
                 "properties": {
@@ -1466,23 +1449,6 @@ TOOLS = [
             }
         }
     },
-    {
-        "type": "function",
-        "function": {
-            "name": "generate_combined_eda_report",
-            "description": "Generate BOTH Sweetviz and ydata-profiling reports in one call. Best of both worlds: Sweetviz for beautiful fast visualizations + ydata-profiling for comprehensive detailed analysis. Recommended for complete EDA.",
-            "parameters": {
-                "type": "object",
-                "properties": {
-                    "file_path": {"type": "string", "description": "Path to the dataset CSV/Parquet file"},
-                    "output_dir": {"type": "string", "description": "Directory to save both reports (default: ./outputs/reports)"},
-                    "target_column": {"type": "string", "description": "Optional target variable for Sweetviz analysis"},
-                    "minimal": {"type": "boolean", "description": "If true, uses minimal mode for ydata-profiling (default: false)"}
-                },
-                "required": ["file_path"]
-            }
-        }
-    },
     # ========================================
     # CODE INTERPRETER - THE GAME CHANGER 🚀
     # ========================================
```