Data-Science-Agent / README_SPACES.md

Pulastya B

fix: Support both GOOGLE_API_KEY and GEMINI_API_KEY env vars for HuggingFace compatibility

fde5dd3 3 days ago

metadata

title: DevSprint Data Science Agent
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860

DevSprint Data Science Agent 🤖

An intelligent AI agent for automated data science workflows, powered by Google Gemini 2.5 Flash with 82+ specialized tools for data analysis, visualization, and machine learning.

Features

🔍 Automated EDA: YData profiling, statistical analysis, data quality reports
📊 Smart Visualizations: Plotly dashboards, matplotlib plots, interactive charts
🧹 Data Cleaning: Missing value handling, outlier detection, type conversion
🛠️ Feature Engineering: Automated feature creation, encoding, scaling
🤖 ML Training: AutoML with XGBoost, LightGBM, CatBoost, Neural Networks
💬 Natural Language Interface: Chat-based interaction for complex workflows
📈 Business Intelligence: KPI tracking, trend analysis, forecasting

Tech Stack

Backend: FastAPI + Python 3.12
LLM: Google Gemini 2.5 Flash (text-based tool calling)
Data Processing: Polars (high-performance dataframes)
Frontend: React 19 + TypeScript + Vite
ML Libraries: Scikit-learn, XGBoost, LightGBM, CatBoost, PyTorch

Usage

Upload your CSV/Excel dataset
Ask questions in natural language (e.g., "Generate a detailed profiling report")
The agent automatically selects and executes the right tools
View generated reports, visualizations, and insights

Memory Optimization

For large datasets (>50k rows or >10MB), the agent automatically:

Samples to 50,000 rows for profiling
Enables minimal mode to reduce memory usage
Disables expensive correlation/interaction calculations

These limits keep profiling within the 16GB of RAM available to the HuggingFace Space, even for large datasets.
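
The thresholds above can be sketched as a small decision function. This is a minimal illustration, not the agent's actual code: the names `ProfilingSettings` and `profiling_settings`, the returned fields, and reading "10MB" as 10 × 1024 × 1024 bytes are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds taken from the README; the byte interpretation of "10MB" is an assumption.
MAX_ROWS = 50_000
MAX_BYTES = 10 * 1024 * 1024


@dataclass(frozen=True)
class ProfilingSettings:
    sample_rows: Optional[int]  # None means "profile every row"
    minimal: bool               # reduced-memory profiling mode
    correlations: bool          # whether correlation/interaction steps run


def profiling_settings(n_rows: int, size_bytes: int) -> ProfilingSettings:
    """Hypothetical helper: choose profiling options from dataset size."""
    is_large = n_rows > MAX_ROWS or size_bytes > MAX_BYTES
    if is_large:
        # Never ask for more rows than the dataset actually has.
        return ProfilingSettings(
            sample_rows=min(n_rows, MAX_ROWS), minimal=True, correlations=False
        )
    return ProfilingSettings(sample_rows=None, minimal=False, correlations=True)
```

For example, `profiling_settings(200_000, 5 * 1024 * 1024)` selects a 50,000-row sample with minimal mode on and correlations off, while a 1,000-row file of a few kilobytes is profiled in full.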

Environment Variables

Set these in HuggingFace Spaces settings (Settings → Repository secrets):

Required:

GEMINI_API_KEY - Your Google Gemini API key (get from https://aistudio.google.com/app/apikey)
LLM_PROVIDER - Set to gemini to use Gemini (or to groq if you have a Groq API key)

Optional:

GROQ_API_KEY - Only if using Groq provider instead of Gemini

Note: The code supports both GOOGLE_API_KEY and GEMINI_API_KEY environment variable names.
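
One way to accept either variable name is a small lookup helper. The sketch below is illustrative only: `resolve_gemini_key` is a hypothetical name, and checking GEMINI_API_KEY before GOOGLE_API_KEY is an assumed precedence that the project's real code may not share.

```python
import os
from typing import Mapping, Optional


def resolve_gemini_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: return the Gemini key from either supported variable."""
    env = os.environ if env is None else env
    # Assumed precedence: GEMINI_API_KEY first, then GOOGLE_API_KEY.
    for name in ("GEMINI_API_KEY", "GOOGLE_API_KEY"):
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError("Set GEMINI_API_KEY or GOOGLE_API_KEY")
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in tests without touching the real environment.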

Local Development

# Clone repository
git clone https://huggingface.co/spaces/YOUR_USERNAME/devs-print-data-science-agent
cd devs-print-data-science-agent

# Install dependencies
pip install -r requirements.txt
npm install --prefix FRRONTEEEND

# Build frontend
cd FRRONTEEEND && npm run build && cd ..

# Set API key and provider (both are listed as required above)
export GEMINI_API_KEY=your_key_here
export LLM_PROVIDER=gemini

# Run server
uvicorn src.api.app:app --host 0.0.0.0 --port 7860

Architecture

┌─────────────────┐
│  React Frontend │  ← User uploads data + asks questions
└────────┬────────┘
         │
┌────────▼────────┐
│  FastAPI Server │  ← Serves frontend + API endpoints
└────────┬────────┘
         │
┌────────▼────────┐
│  Orchestrator   │  ← LLM-driven tool selection & execution
└────────┬────────┘
         │
┌────────▼────────┐
│   82+ Tools     │  ← Specialized data science functions
└─────────────────┘

Key Components

Orchestrator (src/orchestrator.py): ReAct-based tool calling with Gemini
Tools Registry (src/tools/): 82+ specialized data science tools
Session Memory (src/session_memory.py): Conversation history + file tracking
Artifact Store (src/storage/artifact_store.py): File management + metadata
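
To make the orchestrator and tools registry concrete, here is a minimal sketch of a ReAct-style loop with text-based tool calling. It does not reproduce src/orchestrator.py: the `TOOLS` registry, the `tool` decorator, the JSON reply format, `describe_column`, and the scripted stand-in for the LLM are all hypothetical, chosen so the example runs without an API key.

```python
import json
from typing import Callable, Dict, List

# Hypothetical registry: tool name -> plain Python function returning text.
TOOLS: Dict[str, Callable[..., str]] = {}


def tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function in TOOLS under its own name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def describe_column(values: List[float]) -> str:
    """Toy tool: report the mean of a numeric column."""
    return f"mean={sum(values) / len(values):.1f}"


def run_agent(llm: Callable[[List[str]], str], question: str, max_steps: int = 5) -> str:
    """Alternate between model replies and tool observations until a final answer."""
    history = [f"user: {question}"]
    for _ in range(max_steps):
        # Assumed protocol: the model replies with JSON text, either
        # {"tool": name, "args": {...}} or {"final": answer}.
        reply = json.loads(llm(history))
        if "final" in reply:
            return reply["final"]
        fn = TOOLS.get(reply["tool"])
        if fn is None:
            history.append(f"observation: unknown tool {reply['tool']}")
            continue
        history.append(f"observation: {fn(**reply.get('args', {}))}")
    return "stopped: step limit reached"


def scripted_llm(history: List[str]) -> str:
    """Stand-in for Gemini: call one tool, then echo its observation as the answer."""
    if not any(h.startswith("observation:") for h in history):
        return json.dumps({"tool": "describe_column", "args": {"values": [1, 2, 3]}})
    return json.dumps({"final": history[-1].split(": ", 1)[1]})
```

Calling `run_agent(scripted_llm, "Summarise the column")` runs one tool step and then returns the observation as the final answer. In a real deployment the `llm` callable would wrap a Gemini request, and `history` would be backed by the session memory component.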

Deployment

This Space uses a Docker deployment for maximum compatibility:

Base image: python:3.12-slim
Multi-stage build (Node.js for frontend, Python for backend)
Exposes port 7860, matching app_port in the Space metadata
All dependencies bundled in container

Contributing

Built for the DevSprint Hackathon 2025. Contributions are welcome after the hackathon!

License

MIT License - see the LICENSE file for details.