---
title: DevSprint Data Science Agent
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
---

# DevSprint Data Science Agent 🤖

An intelligent AI agent for automated data science workflows, powered by Google Gemini 2.5 Flash with 82+ specialized tools for data analysis, visualization, and machine learning.

## Features

- πŸ” **Automated EDA**: YData profiling, statistical analysis, data quality reports
- πŸ“Š **Smart Visualizations**: Plotly dashboards, matplotlib plots, interactive charts
- 🧹 **Data Cleaning**: Missing value handling, outlier detection, type conversion
- πŸ› οΈ **Feature Engineering**: Automated feature creation, encoding, scaling
- πŸ€– **ML Training**: AutoML with XGBoost, LightGBM, CatBoost, Neural Networks
- πŸ’¬ **Natural Language Interface**: Chat-based interaction for complex workflows
- πŸ“ˆ **Business Intelligence**: KPI tracking, trend analysis, forecasting

## Tech Stack

- **Backend**: FastAPI + Python 3.12
- **LLM**: Google Gemini 2.5 Flash (text-based tool calling)
- **Data Processing**: Polars (high-performance dataframes)
- **Frontend**: React 19 + TypeScript + Vite
- **ML Libraries**: Scikit-learn, XGBoost, LightGBM, CatBoost, PyTorch

## Usage

1. Upload your CSV/Excel dataset
2. Ask questions in natural language (e.g., "Generate a detailed profiling report")
3. The agent automatically selects and executes the right tools
4. View generated reports, visualizations, and insights

## Memory Optimization

For large datasets (>50k rows or >10MB), the agent automatically:
- Samples to 50,000 rows for profiling
- Enables minimal mode to reduce memory usage
- Disables expensive correlation/interaction calculations

These limits keep profiling within the 16 GB of RAM available on Hugging Face Spaces, even for large datasets.
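The thresholds above can be sketched as a small decision function. This is a minimal illustration, not the agent's actual code: `ProfilingPlan`, `plan_profiling`, and the constant names are hypothetical, and treating 10MB as 10 × 1024 × 1024 bytes is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds from this README; the names are hypothetical.
MAX_ROWS = 50_000
MAX_BYTES = 10 * 1024 * 1024  # assumed: 10MB means binary megabytes


@dataclass(frozen=True)
class ProfilingPlan:
    sample_rows: Optional[int]  # None means "profile every row"
    minimal: bool               # run the profiler in minimal mode
    correlations: bool          # compute correlations/interactions


def plan_profiling(n_rows: int, size_bytes: int) -> ProfilingPlan:
    """Pick profiling settings from the dataset's row count and file size."""
    is_large = n_rows > MAX_ROWS or size_bytes > MAX_BYTES
    if not is_large:
        return ProfilingPlan(sample_rows=None, minimal=False, correlations=True)
    # Large dataset: cap the sample and switch off the expensive calculations.
    return ProfilingPlan(
        sample_rows=min(n_rows, MAX_ROWS),
        minimal=True,
        correlations=False,
    )
```

Keeping the decision separate from the profiling call makes the thresholds easy to test and tune without loading any data.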

## Environment Variables

Set `GEMINI_API_KEY` in HuggingFace Spaces settings (Settings → Repository secrets):

```
GEMINI_API_KEY=your_google_gemini_api_key_here
```

Get your API key from: https://aistudio.google.com/app/apikey
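
On the Python side, a startup check along these lines can make a missing key fail fast with a readable message. It is only a sketch: `load_gemini_api_key` is a hypothetical helper, not necessarily how this project reads the variable.

```python
import os


def load_gemini_api_key() -> str:
    """Return GEMINI_API_KEY from the environment, or raise if it is missing."""
    key = os.environ.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; add it under Settings → Repository "
            "secrets or export it in your shell."
        )
    return key
```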

## Local Development

```bash
# Clone repository
git clone https://huggingface.co/spaces/YOUR_USERNAME/devs-print-data-science-agent
cd devs-print-data-science-agent

# Install dependencies
pip install -r requirements.txt
npm install --prefix FRRONTEEEND

# Build frontend
cd FRRONTEEEND && npm run build && cd ..

# Set API key
export GEMINI_API_KEY=your_key_here

# Run server
uvicorn src.api.app:app --host 0.0.0.0 --port 7860
```

## Architecture

```
┌─────────────────┐
│  React Frontend │  ← User uploads data + asks questions
└────────┬────────┘
         │
┌────────▼────────┐
│  FastAPI Server │  ← Serves frontend + API endpoints
└────────┬────────┘
         │
┌────────▼────────┐
│  Orchestrator   │  ← LLM-driven tool selection & execution
└────────┬────────┘
         │
┌────────▼────────┐
│   82+ Tools     │  ← Specialized data science functions
└─────────────────┘
```

## Key Components

- **Orchestrator** ([src/orchestrator.py](src/orchestrator.py)): ReAct-based tool calling with Gemini
- **Tools Registry** ([src/tools/](src/tools/)): 82+ specialized data science tools
- **Session Memory** ([src/session_memory.py](src/session_memory.py)): Conversation history + file tracking
- **Artifact Store** ([src/storage/artifact_store.py](src/storage/artifact_store.py)): File management + metadata
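
To illustrate what ReAct-style, text-based tool calling looks like in general, here is a minimal loop. It is a sketch under stated assumptions, not the code in `src/orchestrator.py`: the `run_agent` function, the JSON message shape (`{"tool": ..., "args": ...}`), and the rule that any non-JSON reply is the final answer are all hypothetical, and `llm` stands in for a real Gemini call.

```python
import json
from typing import Callable, Dict, List


def run_agent(
    question: str,
    llm: Callable[[str], str],
    tools: Dict[str, Callable[..., str]],
    max_steps: int = 5,
) -> str:
    """Alternate between LLM replies and tool calls until a final answer appears."""
    transcript: List[str] = [f"Question: {question}"]
    for _ in range(max_steps):
        reply = llm("\n".join(transcript)).strip()
        try:
            message = json.loads(reply)
        except json.JSONDecodeError:
            # Assumed convention: a reply that is not JSON is the final answer.
            return reply
        if not isinstance(message, dict) or "tool" not in message:
            return reply
        name, args = message["tool"], message.get("args", {})
        if name not in tools:
            observation = f"error: unknown tool {name!r}"
        else:
            try:
                observation = tools[name](**args)
            except Exception as exc:  # report tool failures back to the model
                observation = f"error: {exc}"
        # Record the action and its result so the next LLM call can see them.
        transcript.append(f"Action: {reply}")
        transcript.append(f"Observation: {observation}")
    return "Stopped: step limit reached without a final answer."
```

Passing the LLM in as a plain callable keeps the loop testable with a stub, and the step limit guards against a model that never produces a final answer.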

## Deployment

This Space uses a **Docker** deployment for maximum compatibility:
- Base image: `python:3.12-slim`
- Multi-stage build (Node.js for frontend, Python for backend)
- Exposes port 7860, matching `app_port` in the Space metadata
- All dependencies bundled in container

## Contributing

Built for DevSprint Hackathon 2025. Contributions welcome post-hackathon!

## License

MIT License - see LICENSE file for details