--- title: DataAnalyst Agent emoji: 📊 colorFrom: blue colorTo: green sdk: docker pinned: false --- # DataAnalyst Agent 🧠📊 **A Privacy-First, Autonomous Multi-Agent Data Analysis System** ![Python](https://img.shields.io/badge/Python-3.10+-blue?style=for-the-badge\&logo=python\&logoColor=white) ![FastAPI](https://img.shields.io/badge/FastAPI-009688?style=for-the-badge\&logo=fastapi\&logoColor=white) ![LangGraph](https://img.shields.io/badge/LangGraph-D5A6BD?style=for-the-badge) ![Pandas](https://img.shields.io/badge/Pandas-150458?style=for-the-badge\&logo=pandas\&logoColor=white) --- ## 📖 Overview **DataAnalyst Agent** is a privacy-preserving, agentic AI system that performs autonomous data analysis using a structured **LangGraph multi-agent pipeline**. The system ingests structured datasets (CSV/SQL), automatically profiles schema, generates analytical hypotheses, constructs deterministic execution plans, and produces human-readable insights — all without human intervention. Designed with a **zero-retention architecture**, all data is processed strictly in-memory and securely cleared after execution, ensuring strong privacy guarantees. --- ## 🧠 Key Capabilities * 📊 Automated dataset understanding (schema profiling) * ❓ AI-generated analytical questions & hypotheses * 🧮 Deterministic execution using Pandas (no hallucinated computation) * 🧠 LLM-powered insight generation * 🔐 Privacy-first processing (PII masking + zero retention) * ⚡ Asynchronous execution (non-blocking API) * 📄 Exportable reports (JSON / HTML / PDF) --- ## 💼 Real-World Use Cases * Customer behavior analytics (without exposing PII) * Financial reporting and summarization * Automated exploratory data analysis (EDA) * Internal enterprise analytics tools * Privacy-sensitive datasets (healthcare, business intelligence) --- ## 🏗️ Architectural Flow (Simplified View) ```mermaid graph TD; A[Frontend Dashboard] --> |Upload Request| B(FastAPI API Gateway) B --> C{Security & Validation Layer} C -->|Sanitized Data| D[(In-Memory DataFrame)] D --> E[LangGraph Orchestrator] subgraph Agent Pipeline E --> F[1. Schema Profiler] F --> G[2. Question Generator (LLM)] G --> H[3. Execution Planner (LLM)] H --> I[4. Sandboxed Python Execution] I --> J[5. Insight Generator (LLM)] end J --> K[Report Generator] K --> L[Memory Cleanup Daemon] L --> M[Results Returned to User] ``` --- ## 🧠 Design Principles * **Cognitive Isolation**: LLMs never access raw datasets directly * **Deterministic Execution**: All computations handled via Python (Pandas) * **Zero Data Persistence**: No dataset is written to disk * **Separation of Concerns**: Clear boundaries between reasoning, execution, and storage * **Fail-Safe Execution**: Sandboxed environment prevents unsafe operations --- ## ⚡ Core Engineering Highlights ### 🔹 Multi-Agent Orchestration (LangGraph) Implements a structured pipeline: * Schema Profiling → Question Generation → Execution Planning → Deterministic Execution → Insight Synthesis ### 🔹 Zero-Retention Architecture Data is processed exclusively in-memory and automatically cleared after execution via a cleanup daemon. ### 🔹 Dynamic PII Masking Sensitive fields are anonymized before any LLM interaction using regex-based detection and synthetic data replacement. ### 🔹 Asynchronous Processing Built using **FastAPI BackgroundTasks**, enabling non-blocking execution and responsive APIs. ### 🔹 Secure Logging Implements redacted logging to ensure sensitive data is never exposed in logs. --- ## 🚀 Quick Start Guide ### Prerequisites * Python 3.10+ * Git --- ### 1. Clone the Repository ```bash git clone https://github.com/mshoaib40458/DataAnalyst-Agent.git cd DataAnalyst-Agent ``` --- ### 2. Environment Configuration ```bash cp .env.example .env ``` Add your API key: ``` GROQ_API_KEY=your_api_key_here ``` --- ### 3. Install Dependencies ```bash pip install -r requirements.txt ``` --- ### 4. Run the System #### Backend (FastAPI) ```bash uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload ``` #### Frontend (Flask) ```bash cd frontend python app.py ``` Access the dashboard at: ``` http://127.0.0.1:5000 ``` --- ## 🛠️ Configuration (.env) | Variable | Description | | -------------------------- | ---------------------------------------------- | | `LLM_MODEL` | Model used for reasoning (e.g., llama-3.1-70b) | | `ENABLE_DATA_MASKING` | Enable/disable PII masking | | `DISABLE_DATA_PERSISTENCE` | Enforce zero-retention | | `MAX_UPLOAD_SIZE_BYTES` | Limit dataset size | | `PROXY_TRUST_MODE` | Enable trusted proxy validation | --- ## 🎯 System Highlights * 🔐 Privacy-first AI system * 🧠 Agentic architecture (LangGraph) * ⚡ Async & scalable backend * 🛡️ Secure execution environment * 📊 Fully automated data analysis --- ## 🧠 One-Line Summary > A privacy-preserving, agentic AI system that autonomously analyzes structured data using a controlled LangGraph pipeline with zero data retention. --- ## 📌 Future Improvements * Distributed task queue (Celery / Redis) * Vector memory for contextual recall * Advanced visualization dashboard * Multi-dataset comparative analysis --- > *"Designing AI systems that are not only intelligent, but also secure, controlled, and production-ready."*