| title: DataAnalyst Agent | |
| emoji: 📊 | |
| colorFrom: blue | |
| colorTo: green | |
| sdk: docker | |
| pinned: false | |
| # DataAnalyst Agent 🧠📊 | |
| **A Privacy-First, Autonomous Multi-Agent Data Analysis System** | |
| --- | |
| ## 📖 Overview | |
| **DataAnalyst Agent** is a privacy-preserving, agentic AI system that performs autonomous data analysis using a structured **LangGraph multi-agent pipeline**. | |
| The system ingests structured datasets (CSV/SQL), automatically profiles their schema, generates analytical hypotheses, constructs deterministic execution plans, and produces human-readable insights, all without human intervention. | |
| Built on a **zero-retention architecture**, the system processes all data strictly in memory and clears it once the analysis completes, so no dataset is retained after execution. | |
| --- | |
| ## 🧠 Key Capabilities | |
| * 📊 Automated dataset understanding (schema profiling) | |
| * ❓ AI-generated analytical questions & hypotheses | |
| * 🧮 Deterministic execution using Pandas (no hallucinated computation) | |
| * 🧠 LLM-powered insight generation | |
| * 🔐 Privacy-first processing (PII masking + zero retention) | |
| * ⚡ Asynchronous execution (non-blocking API) | |
| * 📄 Exportable reports (JSON / HTML / PDF) | |
| --- | |
| ## 💼 Real-World Use Cases | |
| * Customer behavior analytics (without exposing PII) | |
| * Financial reporting and summarization | |
| * Automated exploratory data analysis (EDA) | |
| * Internal enterprise analytics tools | |
| * Privacy-sensitive datasets (healthcare, business intelligence) | |
| --- | |
| ## 🏗️ Architectural Flow (Simplified View) | |
| ```mermaid | |
| graph TD; | |
| A[Frontend Dashboard] --> |Upload Request| B(FastAPI API Gateway) | |
| B --> C{Security & Validation Layer} | |
| C -->|Sanitized Data| D[(In-Memory DataFrame)] | |
| D --> E[LangGraph Orchestrator] | |
| subgraph Agent Pipeline | |
| E --> F["1. Schema Profiler"] | |
| F --> G["2. Question Generator (LLM)"] | |
| G --> H["3. Execution Planner (LLM)"] | |
| H --> I["4. Sandboxed Python Execution"] | |
| I --> J["5. Insight Generator (LLM)"] | |
| end | |
| J --> K[Report Generator] | |
| K --> L[Memory Cleanup Daemon] | |
| L --> M[Results Returned to User] | |
| ``` | |
| --- | |
| ## 🧠 Design Principles | |
| * **Cognitive Isolation**: LLMs never access raw datasets directly | |
| * **Deterministic Execution**: All computations handled via Python (Pandas) | |
| * **Zero Data Persistence**: No dataset is written to disk | |
| * **Separation of Concerns**: Clear boundaries between reasoning, execution, and storage | |
| * **Fail-Safe Execution**: Sandboxed environment prevents unsafe operations | |
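
To illustrate the Cognitive Isolation principle, here is a minimal sketch of a metadata-only profile: the kind of summary that could be handed to an LLM in place of raw rows. The function names `infer_type` and `profile_schema` and the shape of the returned dictionary are assumptions made for this example, not the project's actual code, and the sketch uses the standard-library `csv` module rather than Pandas to stay dependency-free.

```python
import csv
import io


def infer_type(values):
    """Return 'int', 'float', or 'str' for a list of non-empty strings."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "str"


def profile_schema(csv_text):
    """Build a metadata-only profile: names, inferred types, null and distinct counts."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    profile = {"row_count": len(rows), "columns": {}}
    if not rows:
        return profile
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        present = [v for v in values if v not in ("", None)]
        profile["columns"][col] = {
            "type": infer_type(present) if present else "unknown",
            "nulls": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return profile


# Hypothetical sample data; only the profile, never the cell values, leaves this function.
sample = "customer,age,spend\nalice,34,120.5\nbob,,80\ncarol,29,\n"
profile = profile_schema(sample)
```

The profile contains counts and type names only, so a prompt built from it cannot leak individual cell values.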
| --- | |
| ## ⚡ Core Engineering Highlights | |
| ### 🔹 Multi-Agent Orchestration (LangGraph) | |
| Implements a structured pipeline: | |
| * Schema Profiling → Question Generation → Execution Planning → Deterministic Execution → Insight Synthesis | |
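
The stage order above can be sketched in plain Python. This is not the LangGraph API and not the project's implementation: it is a simplified stand-in in which each stage is a function over a shared state dictionary, the LLM stages are replaced by stubs, and the planner may only name operations from a whitelist. All function names and the `OPERATIONS` table are hypothetical.

```python
from statistics import mean

# Whitelisted operations: a plan may only name these, never supply code.
OPERATIONS = {"mean": mean, "max": max, "min": min, "count": len}


def profile(state):
    rows = state["rows"]
    state["schema"] = sorted(rows[0].keys()) if rows else []
    return state


def generate_questions(state):
    # Stand-in for an LLM call that would see only state["schema"].
    state["questions"] = [f"What is the mean of {c}?" for c in state["schema"]]
    return state


def plan(state):
    # Stand-in for an LLM planner that emits structured steps, not code.
    state["plan"] = [{"op": "mean", "column": c} for c in state["schema"]]
    return state


def execute(state):
    results = {}
    for step in state["plan"]:
        func = OPERATIONS[step["op"]]  # raises KeyError for non-whitelisted ops
        values = [row[step["column"]] for row in state["rows"]]
        results[f'{step["op"]}({step["column"]})'] = func(values)
    state["results"] = results
    return state


def summarize(state):
    # Stand-in for LLM insight synthesis over computed results only.
    state["insights"] = [f"{k} = {v}" for k, v in state["results"].items()]
    return state


PIPELINE = [profile, generate_questions, plan, execute, summarize]


def run(rows):
    state = {"rows": rows}
    for stage in PIPELINE:
        state = stage(state)
    return state


final = run([{"price": 10, "qty": 2}, {"price": 20, "qty": 4}])
```

Because every number in `final["results"]` comes from a whitelisted Python function, the summarizing stage can only restate computed values rather than invent them.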
| ### 🔹 Zero-Retention Architecture | |
| Data is processed exclusively in-memory and automatically cleared after execution via a cleanup daemon. | |
| ### 🔹 Dynamic PII Masking | |
| Sensitive fields are anonymized before any LLM interaction using regex-based detection and synthetic data replacement. | |
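
A minimal sketch of regex-based masking follows. The two patterns and the token format are assumptions chosen for the example; they are deliberately simple and will miss many real-world PII formats. Instead of generating realistic synthetic values, the sketch substitutes a stable hashed placeholder so that repeated occurrences of the same value stay correlated after masking.

```python
import hashlib
import re

# Illustrative patterns only; a production detector would need far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def _token(kind, value):
    """Derive a short, stable placeholder from the matched value."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    return f"<{kind}_{digest}>"


def mask_pii(text):
    """Replace every match of each pattern with its placeholder token."""
    for kind, pattern in PII_PATTERNS.items():
        text = pattern.sub(lambda m, k=kind: _token(k, m.group(0)), text)
    return text


masked = mask_pii(
    "Contact alice@example.com or +1 415-555-0100; alice@example.com again."
)
```

Applying `mask_pii` to string cells before any prompt is built keeps the original values out of LLM requests while preserving which rows share the same value.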
| ### 🔹 Asynchronous Processing | |
| Built using **FastAPI BackgroundTasks**, enabling non-blocking execution and responsive APIs. | |
| ### 🔹 Secure Logging | |
| Implements redacted logging to ensure sensitive data is never exposed in logs. | |
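
Redacted logging can be approximated with a standard-library `logging.Filter` attached to a handler. This is a sketch under assumptions: it scrubs only email addresses, and the class name `RedactingFilter` and logger name `redaction_demo` are invented for the example.

```python
import io
import logging
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class RedactingFilter(logging.Filter):
    """Scrub email addresses from the fully formatted log message."""

    def filter(self, record):
        # Render args into the message first so values passed as args are scanned too.
        record.msg = EMAIL_RE.sub("[REDACTED]", record.getMessage())
        record.args = None
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactingFilter())

logger = logging.getLogger("redaction_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output confined to our handler

logger.info("Upload received from %s", "bob@example.com")
output = stream.getvalue()
```

Attaching the filter to the handler rather than to a single logger means every record that reaches that handler is scrubbed, including records from child loggers.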
| --- | |
| ## 🚀 Quick Start Guide | |
| ### Prerequisites | |
| * Python 3.10+ | |
| * Git | |
| --- | |
| ### 1. Clone the Repository | |
| ```bash | |
| git clone https://github.com/mshoaib40458/DataAnalyst-Agent.git | |
| cd DataAnalyst-Agent | |
| ``` | |
| --- | |
| ### 2. Environment Configuration | |
| ```bash | |
| cp .env.example .env | |
| ``` | |
| Add your API key: | |
| ``` | |
| GROQ_API_KEY=your_api_key_here | |
| ``` | |
| --- | |
| ### 3. Install Dependencies | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| --- | |
| ### 4. Run the System | |
| #### Backend (FastAPI) | |
| ```bash | |
| uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload | |
| ``` | |
| #### Frontend (Flask) | |
| ```bash | |
| cd frontend | |
| python app.py | |
| ``` | |
| Access the dashboard at: | |
| ``` | |
| http://127.0.0.1:5000 | |
| ``` | |
| --- | |
| ## 🛠️ Configuration (.env) | |
| | Variable | Description | | |
| | -------------------------- | ---------------------------------------------- | | |
| | `LLM_MODEL` | Model used for reasoning (e.g., llama-3.1-70b) | | |
| | `ENABLE_DATA_MASKING` | Enable/disable PII masking | | |
| | `DISABLE_DATA_PERSISTENCE` | Enforce zero-retention | | |
| | `MAX_UPLOAD_SIZE_BYTES` | Limit dataset size | | |
| | `PROXY_TRUST_MODE` | Enable trusted proxy validation | | |
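
The variables in the table could be loaded into a typed settings object along these lines. The variable names come from the table; the types, the default values, and the treatment of `PROXY_TRUST_MODE` as a boolean are assumptions, since the table does not specify them.

```python
import os
from dataclasses import dataclass


def _as_bool(raw, default):
    """Interpret common truthy strings; fall back to the default when unset."""
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    llm_model: str
    enable_data_masking: bool
    disable_data_persistence: bool
    max_upload_size_bytes: int
    proxy_trust_mode: bool


def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ). Defaults are hypothetical."""
    env = os.environ if env is None else env
    return Settings(
        llm_model=env.get("LLM_MODEL", "llama-3.1-70b"),
        enable_data_masking=_as_bool(env.get("ENABLE_DATA_MASKING"), True),
        disable_data_persistence=_as_bool(env.get("DISABLE_DATA_PERSISTENCE"), True),
        max_upload_size_bytes=int(env.get("MAX_UPLOAD_SIZE_BYTES", 10 * 1024 * 1024)),
        proxy_trust_mode=_as_bool(env.get("PROXY_TRUST_MODE"), False),
    )


settings = load_settings(
    {"ENABLE_DATA_MASKING": "false", "MAX_UPLOAD_SIZE_BYTES": "1048576"}
)
```

Defaulting the privacy switches to their safe values means a missing `.env` entry fails closed rather than silently disabling masking or retention controls.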
| --- | |
| ## 🎯 System Highlights | |
| * 🔐 Privacy-first AI system | |
| * 🧠 Agentic architecture (LangGraph) | |
| * ⚡ Async & scalable backend | |
| * 🛡️ Secure execution environment | |
| * 📊 Fully automated data analysis | |
| --- | |
| ## 🧠 One-Line Summary | |
| > A privacy-preserving, agentic AI system that autonomously analyzes structured data using a controlled LangGraph pipeline with zero data retention. | |
| --- | |
| ## 📌 Future Improvements | |
| * Distributed task queue (Celery / Redis) | |
| * Vector memory for contextual recall | |
| * Advanced visualization dashboard | |
| * Multi-dataset comparative analysis | |
| --- | |
| > *"Designing AI systems that are not only intelligent, but also secure, controlled, and production-ready."* | |