---
title: DevSprint Data Science Agent
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
---

# DevSprint Data Science Agent 🤖

An intelligent AI agent for automated data science workflows, powered by Google Gemini 2.5 Flash with 82+ specialized tools for data analysis, visualization, and machine learning.

## Features

- 🔍 **Automated EDA**: YData profiling, statistical analysis, data quality reports
- 📊 **Smart Visualizations**: Plotly dashboards, matplotlib plots, interactive charts
- 🧹 **Data Cleaning**: Missing value handling, outlier detection, type conversion
- 🛠️ **Feature Engineering**: Automated feature creation, encoding, scaling
- 🤖 **ML Training**: AutoML with XGBoost, LightGBM, CatBoost, Neural Networks
- 💬 **Natural Language Interface**: Chat-based interaction for complex workflows
- 📈 **Business Intelligence**: KPI tracking, trend analysis, forecasting

## Tech Stack

- **Backend**: FastAPI + Python 3.12
- **LLM**: Google Gemini 2.5 Flash (text-based tool calling)
- **Data Processing**: Polars (high-performance dataframes)
- **Frontend**: React 19 + TypeScript + Vite
- **ML Libraries**: Scikit-learn, XGBoost, LightGBM, CatBoost, PyTorch

## Usage

1. Upload your CSV/Excel dataset
2. Ask questions in natural language (e.g., "Generate a detailed profiling report")
3. The agent automatically selects and executes the right tools
4. View generated reports, visualizations, and insights

## Memory Optimization

For large datasets (>50k rows or >10MB), the agent automatically:

- Samples to 50,000 rows for profiling
- Enables minimal mode to reduce memory usage
- Disables expensive correlation/interaction calculations

This ensures smooth operation even with large datasets on HuggingFace's 16GB RAM.
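
The large-dataset rule above can be sketched as a small decision function. This is an illustrative sketch only, not the agent's actual code: the names `ProfilingPlan`, `plan_profiling`, `MAX_ROWS`, and `MAX_BYTES` are hypothetical, and treating 10MB as 10 × 1024 × 1024 bytes is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds taken from the section above; the constant names are hypothetical.
MAX_ROWS = 50_000
MAX_BYTES = 10 * 1024 * 1024  # assumes "10MB" means binary megabytes


@dataclass(frozen=True)
class ProfilingPlan:
    """Settings a profiling step could be run with (hypothetical structure)."""
    sample_rows: Optional[int]  # None means "profile every row"
    minimal: bool               # reduced-memory profiling mode
    correlations: bool          # compute correlation matrices
    interactions: bool          # compute pairwise interaction plots


def plan_profiling(n_rows: int, file_bytes: int) -> ProfilingPlan:
    """Pick profiling settings from dataset size, following the rule above."""
    is_large = n_rows > MAX_ROWS or file_bytes > MAX_BYTES
    if not is_large:
        # Small dataset: full profile, nothing disabled.
        return ProfilingPlan(None, False, True, True)
    # Large dataset: cap the row count (never above the rows that exist),
    # switch to minimal mode, and skip the expensive calculations.
    return ProfilingPlan(min(n_rows, MAX_ROWS), True, False, False)


if __name__ == "__main__":
    print(plan_profiling(n_rows=1_000, file_bytes=200_000))
    print(plan_profiling(n_rows=250_000, file_bytes=40 * 1024 * 1024))
```

A plan like this could then drive the actual calls, for example sampling a Polars dataframe with `df.sample(n=plan.sample_rows)` and passing `minimal=plan.minimal` to ydata-profiling's `ProfileReport`. How the agent wires these together internally is not stated in this README.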

## Environment Variables

Set `GEMINI_API_KEY` in HuggingFace Spaces settings (Settings → Repository secrets):

```
GEMINI_API_KEY=your_google_gemini_api_key_here
```

Get your API key from: https://aistudio.google.com/app/apikey

## Local Development

```bash
# Clone repository
git clone https://huggingface.co/spaces/YOUR_USERNAME/devs-print-data-science-agent
cd devs-print-data-science-agent

# Install dependencies
pip install -r requirements.txt
npm install --prefix FRRONTEEEND

# Build frontend
cd FRRONTEEEND && npm run build && cd ..

# Set API key
export GEMINI_API_KEY=your_key_here

# Run server
uvicorn src.api.app:app --host 0.0.0.0 --port 7860
```

## Architecture

```
┌─────────────────┐
│ React Frontend  │ ← User uploads data + asks questions
└────────┬────────┘
         │
┌────────▼────────┐
│ FastAPI Server  │ ← Serves frontend + API endpoints
└────────┬────────┘
         │
┌────────▼────────┐
│  Orchestrator   │ ← LLM-driven tool selection & execution
└────────┬────────┘
         │
┌────────▼────────┐
│    82+ Tools    │ ← Specialized data science functions
└─────────────────┘
```

## Key Components

- **Orchestrator** ([src/orchestrator.py](src/orchestrator.py)): ReAct-based tool calling with Gemini
- **Tools Registry** ([src/tools/](src/tools/)): 82+ specialized data science tools
- **Session Memory** ([src/session_memory.py](src/session_memory.py)): Conversation history + file tracking
- **Artifact Store** ([src/storage/artifact_store.py](src/storage/artifact_store.py)): File management + metadata

## Deployment

This Space uses a **Docker** deployment for maximum compatibility:

- Base image: `python:3.12-slim`
- Multi-stage build (Node.js for frontend, Python for backend)
- Auto-exposes port 7860 for HuggingFace
- All dependencies bundled in container

## Contributing

Built for DevSprint Hackathon 2025. Contributions welcome post-hackathon!

## License

MIT License - see LICENSE file for details