Pulastya B
metadata
title: DevSprint Data Science Agent
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860

DevSprint Data Science Agent 🤖

An intelligent AI agent for automated data science workflows, powered by Google Gemini 2.5 Flash with 82+ specialized tools for data analysis, visualization, and machine learning.

Features

  • 🔍 Automated EDA: YData profiling, statistical analysis, data quality reports
  • 📊 Smart Visualizations: Plotly dashboards, matplotlib plots, interactive charts
  • 🧹 Data Cleaning: Missing value handling, outlier detection, type conversion
  • 🛠️ Feature Engineering: Automated feature creation, encoding, scaling
  • 🤖 ML Training: AutoML with XGBoost, LightGBM, CatBoost, Neural Networks
  • 💬 Natural Language Interface: Chat-based interaction for complex workflows
  • 📈 Business Intelligence: KPI tracking, trend analysis, forecasting

Tech Stack

  • Backend: FastAPI + Python 3.12
  • LLM: Google Gemini 2.5 Flash (text-based tool calling)
  • Data Processing: Polars (high-performance dataframes)
  • Frontend: React 19 + TypeScript + Vite
  • ML Libraries: Scikit-learn, XGBoost, LightGBM, CatBoost, PyTorch

Usage

  1. Upload your CSV/Excel dataset
  2. Ask questions in natural language (e.g., "Generate a detailed profiling report")
  3. The agent automatically selects and executes the right tools
  4. View generated reports, visualizations, and insights

Memory Optimization

For large datasets (>50k rows or >10MB), the agent automatically:

  • Samples to 50,000 rows for profiling
  • Enables minimal mode to reduce memory usage
  • Disables expensive correlation/interaction calculations

These limits keep profiling of large datasets within the 16GB of RAM available on HuggingFace Spaces.
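
The thresholds above can be sketched as a small decision function. This is only an illustration of the rule as the README describes it, not the agent's actual code: the names `ProfilingPlan` and `plan_profiling` are hypothetical, and reading "10MB" as 10 MiB is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds taken from the README text; the constant names are hypothetical.
MAX_ROWS = 50_000
MAX_BYTES = 10 * 1024 * 1024  # assumption: "10MB" is read as 10 MiB


@dataclass(frozen=True)
class ProfilingPlan:
    sample_rows: Optional[int]  # None means "profile every row"
    minimal: bool               # reduced-memory profiling mode
    correlations: bool          # compute correlation/interaction statistics


def plan_profiling(n_rows: int, size_bytes: int) -> ProfilingPlan:
    """Pick profiling settings from the dataset's row count and size."""
    is_large = n_rows > MAX_ROWS or size_bytes > MAX_BYTES
    if not is_large:
        return ProfilingPlan(sample_rows=None, minimal=False, correlations=True)
    # Sampling only makes sense when there are more rows than the cap.
    sample = MAX_ROWS if n_rows > MAX_ROWS else None
    return ProfilingPlan(sample_rows=sample, minimal=True, correlations=False)
```

Under these assumptions, a wide 10,000-row file that exceeds the size limit would still switch to minimal mode but would not be sampled, since it is already under the row cap.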

Environment Variables

Set these in HuggingFace Spaces settings (Settings → Repository secrets):

Required:

  • GEMINI_API_KEY (or GOOGLE_API_KEY) - Google Gemini API key

Optional:

  • GROQ_API_KEY - Only if using Groq provider instead of Gemini

Note: The code supports both GOOGLE_API_KEY and GEMINI_API_KEY environment variable names.
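
A minimal sketch of how accepting both variable names might look. The function name `resolve_gemini_api_key` is hypothetical, and the lookup order (GEMINI_API_KEY first, then GOOGLE_API_KEY) is an assumption; the README does not say which name takes precedence.

```python
import os
from typing import Mapping, Optional


def resolve_gemini_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the first non-empty key found under either variable name."""
    env = os.environ if env is None else env
    # Assumed precedence: GEMINI_API_KEY wins if both are set.
    for name in ("GEMINI_API_KEY", "GOOGLE_API_KEY"):
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError("Set GEMINI_API_KEY or GOOGLE_API_KEY")
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.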

Local Development

# Clone repository
git clone https://huggingface.co/spaces/YOUR_USERNAME/devs-print-data-science-agent
cd devs-print-data-science-agent

# Install dependencies
pip install -r requirements.txt
npm install --prefix FRRONTEEEND

# Build frontend
cd FRRONTEEEND && npm run build && cd ..

# Set API key
export GEMINI_API_KEY=your_key_here

# Run server
uvicorn src.api.app:app --host 0.0.0.0 --port 7860

Architecture

┌─────────────────┐
│  React Frontend │  ← User uploads data + asks questions
└────────┬────────┘
         │
┌────────▼────────┐
│  FastAPI Server │  ← Serves frontend + API endpoints
└────────┬────────┘
         │
┌────────▼────────┐
│  Orchestrator   │  ← LLM-driven tool selection & execution
└────────┬────────┘
         │
┌────────▼────────┐
│   82+ Tools     │  ← Specialized data science functions
└─────────────────┘
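
The orchestrator layer in the diagram can be pictured as a registry of named tools plus a dispatcher that runs whichever tool the LLM names in its text reply. The sketch below is illustrative only: the JSON shape (`"tool"` and `"arguments"` keys), the `TOOLS` registry, the `tool` decorator, and the `column_mean` example are all assumptions, not the project's actual format or tool names.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry; the real project ships 82+ tools under its own names.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function in TOOLS under its own name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def column_mean(values: list) -> float:
    """Toy stand-in for a data science tool."""
    return sum(values) / len(values)


def dispatch(llm_reply: str) -> Any:
    """Parse a JSON tool call emitted as plain text and run the named tool."""
    call = json.loads(llm_reply)
    name = call["tool"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))
```

For example, `dispatch('{"tool": "column_mean", "arguments": {"values": [1, 2, 3]}}')` returns `2.0`, while a reply naming an unregistered tool raises `KeyError` instead of silently doing nothing.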

Key Components

Deployment

This Space uses a Docker deployment for maximum compatibility:

  • Base image: python:3.12-slim
  • Multi-stage build (Node.js for frontend, Python for backend)
  • Auto-exposes port 7860 for HuggingFace
  • All dependencies bundled in container

Contributing

Built for DevSprint Hackathon 2025. Contributions welcome post-hackathon!

License

MIT License - see LICENSE file for details