---
title: DataAnalyst Agent
emoji: 📊
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---
# DataAnalyst Agent
**A Privacy-First, Autonomous Multi-Agent Data Analysis System**




---
## Overview

**DataAnalyst Agent** is a privacy-preserving, agentic AI system that performs autonomous data analysis using a structured **LangGraph multi-agent pipeline**.

The system ingests structured datasets (CSV/SQL), profiles their schema, generates analytical hypotheses, constructs deterministic execution plans, and produces human-readable insights, all without human intervention.

Designed with a **zero-retention architecture**, the system processes all data strictly in memory and clears it after execution, so no dataset is retained once a run completes.

---
## Key Capabilities
* Automated dataset understanding (schema profiling)
* AI-generated analytical questions & hypotheses
* Deterministic execution using Pandas (no hallucinated computation)
* LLM-powered insight generation
* Privacy-first processing (PII masking + zero retention)
* Asynchronous execution (non-blocking API)
* Exportable reports (JSON / HTML / PDF)
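
As an illustration of the schema-profiling capability listed above, here is a minimal sketch that summarizes a CSV by column type and null count without exposing any cell values. The project itself is described as using Pandas; this version uses the standard-library `csv` module as a dependency-free stand-in, and `infer_type` and `profile_csv` are hypothetical names, not part of the repository.

```python
import csv
import io
from collections import Counter


def infer_type(value: str) -> str:
    """Classify one CSV cell as 'empty', 'int', 'float', or 'str'."""
    if value.strip() == "":
        return "empty"
    for caster, name in ((int, "int"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    return "str"


def profile_csv(text: str) -> dict:
    """Return per-column metadata only: inferred dtype and null count."""
    reader = csv.DictReader(io.StringIO(text))
    seen = {name: Counter() for name in reader.fieldnames or []}
    for row in reader:
        for name in seen:
            seen[name][infer_type(row[name] or "")] += 1

    profile = {}
    for name, counts in seen.items():
        nulls = counts.pop("empty", 0)
        kinds = set(counts)
        if not kinds:
            dtype = "unknown"          # column had no non-empty cells
        elif kinds <= {"int"}:
            dtype = "int"
        elif kinds <= {"int", "float"}:
            dtype = "float"            # mixed int/float is reported as float
        else:
            dtype = "str"
        profile[name] = {"dtype": dtype, "nulls": nulls}
    return profile


SAMPLE = "region,units,price\nnorth,3,9.5\nsouth,,12\n"
print(profile_csv(SAMPLE))
```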
---
## Real-World Use Cases
* Customer behavior analytics (without exposing PII)
* Financial reporting and summarization
* Automated exploratory data analysis (EDA)
* Internal enterprise analytics tools
* Privacy-sensitive datasets (healthcare, business intelligence)
---
## Architectural Flow (Simplified View)
```mermaid
graph TD;
    A[Frontend Dashboard] -->|Upload Request| B(FastAPI API Gateway)
    B --> C{"Security & Validation Layer"}
    C -->|Sanitized Data| D[(In-Memory DataFrame)]
    D --> E[LangGraph Orchestrator]
    subgraph pipeline [Agent Pipeline]
        E --> F["1: Schema Profiler"]
        F --> G["2: Question Generator (LLM)"]
        G --> H["3: Execution Planner (LLM)"]
        H --> I["4: Sandboxed Python Execution"]
        I --> J["5: Insight Generator (LLM)"]
    end
    J --> K[Report Generator]
    K --> L[Memory Cleanup Daemon]
    L --> M[Results Returned to User]
```
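
The five-stage agent pipeline in the diagram can be sketched as plain functions that pass a shared state dictionary along a fixed order. This is only a rough analogue of a LangGraph graph, written without the library so it runs anywhere. All function names are hypothetical, and the two "LLM" stages are canned stand-ins rather than real model calls.

```python
def profile_schema(state: dict) -> dict:
    # Column names only; this is all the LLM stand-ins below get to see.
    columns = sorted({name for row in state["rows"] for name in row})
    return {**state, "schema": columns}


def generate_questions(state: dict) -> dict:
    # Stand-in for an LLM call: one canned question per column.
    questions = [f"What is the total of {col}?" for col in state["schema"]]
    return {**state, "questions": questions}


def plan_execution(state: dict) -> dict:
    # Stand-in for an LLM call: turn each question into a structured step.
    plan = [{"op": "sum", "column": col} for col in state["schema"]]
    return {**state, "plan": plan}


def execute_plan(state: dict) -> dict:
    # Deterministic step: plain Python does the arithmetic, not a model.
    results = {
        f"sum({step['column']})": sum(row[step["column"]] for row in state["rows"])
        for step in state["plan"]
    }
    return {**state, "results": results}


def generate_insights(state: dict) -> dict:
    # Stand-in for an LLM call: phrase each computed result as a sentence.
    insights = [f"{name} = {value}" for name, value in state["results"].items()]
    return {**state, "insights": insights}


PIPELINE = [
    profile_schema,
    generate_questions,
    plan_execution,
    execute_plan,
    generate_insights,
]


def run_pipeline(rows: list) -> list:
    """Run every stage in order and return only the final insights."""
    state = {"rows": rows}
    for stage in PIPELINE:
        state = stage(state)
    return state["insights"]


print(run_pipeline([{"units": 3}, {"units": 5}]))  # ['sum(units) = 8']
```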
---
## Design Principles
* **Cognitive Isolation**: LLMs never access raw datasets directly
* **Deterministic Execution**: All computations handled via Python (Pandas)
* **Zero Data Persistence**: No dataset is written to disk
* **Separation of Concerns**: Clear boundaries between reasoning, execution, and storage
* **Fail-Safe Execution**: Sandboxed environment prevents unsafe operations
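
One way to read the deterministic-execution and fail-safe principles together: the planner emits structured steps, and the executor runs only operations from a fixed allow-list, never arbitrary code. The sketch below assumes a step format of `{"op": ..., "column": ...}` and uses lists of dictionaries instead of a Pandas DataFrame. Both the format and the names `ALLOWED_OPS` and `execute_step` are hypothetical.

```python
from statistics import mean

# Only these operations may ever run; anything else is rejected.
ALLOWED_OPS = {"sum": sum, "mean": mean, "max": max, "min": min, "count": len}


def execute_step(rows: list, step: dict):
    """Run one planned step against in-memory rows, or raise ValueError."""
    op = step.get("op")
    column = step.get("column")
    if op not in ALLOWED_OPS:
        raise ValueError(f"operation not allowed: {op!r}")
    if not rows or column not in rows[0]:
        raise ValueError(f"unknown column: {column!r}")
    values = [row[column] for row in rows]
    return ALLOWED_OPS[op](values)


ROWS = [{"price": 10.0}, {"price": 14.0}]
print(execute_step(ROWS, {"op": "mean", "column": "price"}))  # 12.0
```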
---
## Core Engineering Highlights
### Multi-Agent Orchestration (LangGraph)
Implements a structured pipeline:
* Schema Profiling → Question Generation → Execution Planning → Deterministic Execution → Insight Synthesis
### Zero-Retention Architecture
Data is processed exclusively in-memory and automatically cleared after execution via a cleanup daemon.
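
The source describes a cleanup daemon. As a simpler stand-in for the same idea, the sketch below uses a context manager that keeps a dataset in memory for one run and drops every reference on exit, even if the analysis raises. `ephemeral_dataset` is a hypothetical name, and note that dropping references does not overwrite the underlying memory.

```python
import gc
from contextlib import contextmanager


@contextmanager
def ephemeral_dataset(rows):
    """Hold a dataset in memory for one analysis run, then drop it."""
    holder = {"rows": list(rows)}
    try:
        yield holder
    finally:
        holder["rows"].clear()  # empty the list in place
        holder.clear()          # remove the reference from the holder
        gc.collect()            # prompt collection of any reference cycles


with ephemeral_dataset([{"units": 3}, {"units": 5}]) as data:
    total = sum(row["units"] for row in data["rows"])

print(total)  # 8
print(data)   # {}
```

Only the derived result (`total`) survives the `with` block; the holder the caller still references is empty.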
### Dynamic PII Masking
Sensitive fields are anonymized before any LLM interaction using regex-based detection and synthetic data replacement.
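
A minimal sketch of regex-based masking, assuming only two PII kinds (email addresses and phone numbers) and using numbered placeholders in place of the synthetic replacement data the project describes. The patterns are deliberately simple and will miss many real-world formats; `PATTERNS` and `mask_pii` are hypothetical names.

```python
import re

# Illustrative patterns only; real PII detection needs broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def mask_pii(text: str) -> str:
    """Replace each distinct email or phone number with a stable placeholder."""
    for kind, pattern in PATTERNS.items():
        seen: dict[str, str] = {}

        def substitute(match: re.Match) -> str:
            value = match.group(0)
            if value not in seen:
                seen[value] = f"<{kind}_{len(seen) + 1}>"
            return seen[value]

        text = pattern.sub(substitute, text)
    return text


print(mask_pii("Mail ann@example.com or call +1 555-123-4567; ann@example.com again"))
# Mail <EMAIL_1> or call <PHONE_1>; <EMAIL_1> again
```

Reusing the same placeholder for a repeated value keeps relationships in the text intact (the same person appears twice) without revealing who it is.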
### Asynchronous Processing
Built using **FastAPI BackgroundTasks**, enabling non-blocking execution and responsive APIs.
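
The non-blocking pattern this enables is: accept a job, return an ID immediately, run the analysis in the background, and let the client poll for the result. The sketch below shows that pattern with a standard-library thread pool as a stand-in for FastAPI's `BackgroundTasks`, so it runs without a web server. `submit_job`, `job_status`, and `analyze` are hypothetical names.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
jobs: dict[str, Future] = {}  # job ID -> pending or finished work


def analyze(values: list) -> dict:
    """Placeholder for the real analysis pipeline."""
    return {"total": sum(values)}


def submit_job(values: list) -> str:
    """Queue the analysis and return a job ID without waiting for it."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(analyze, values)
    return job_id


def job_status(job_id: str) -> dict:
    """Report whether a job is still running or has a result."""
    future = jobs[job_id]
    if not future.done():
        return {"status": "running"}
    return {"status": "done", "result": future.result()}


job_id = submit_job([1, 2, 3])
jobs[job_id].result(timeout=5)  # wait here only so the demo prints a result
print(job_status(job_id))
```

In an API, `submit_job` would back the upload endpoint and `job_status` a polling endpoint; the blocking wait in the demo would not exist.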
### Secure Logging
Implements redacted logging to ensure sensitive data is never exposed in logs.
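
Redacted logging can be sketched with a standard `logging.Filter` that rewrites each record before a handler emits it. This example redacts email addresses only, and the logger name `dataanalyst.sketch` and class `RedactingFilter` are hypothetical; the project's actual redaction rules are not shown in this README.

```python
import logging
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class RedactingFilter(logging.Filter):
    """Replace email addresses in a log record before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # merge the format string with its args
        record.msg = EMAIL.sub("[REDACTED]", message)
        record.args = ()               # args are already merged into msg
        return True                    # keep the record, now redacted


logger = logging.getLogger("dataanalyst.sketch")
handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Upload received from %s", "ann@example.com")
# emits: Upload received from [REDACTED]
```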
---
## Quick Start Guide
### Prerequisites
* Python 3.10+
* Git
---
### 1. Clone the Repository
```bash
git clone https://github.com/mshoaib40458/DataAnalyst-Agent.git
cd DataAnalyst-Agent
```
---
### 2. Environment Configuration
```bash
cp .env.example .env
```
Add your Groq API key to the new `.env` file:
```
GROQ_API_KEY=your_api_key_here
```
---
### 3. Install Dependencies
```bash
pip install -r requirements.txt
```
---
### 4. Run the System
#### Backend (FastAPI)
```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```
#### Frontend (Flask)
```bash
# Run in a second terminal, from the repository root
cd frontend
python app.py
```
Access the dashboard at:
```
http://127.0.0.1:5000
```
---
## Configuration (.env)
| Variable | Description |
| -------------------------- | ---------------------------------------------- |
| `LLM_MODEL` | Model used for reasoning (e.g., llama-3.1-70b) |
| `ENABLE_DATA_MASKING` | Enable/disable PII masking |
| `DISABLE_DATA_PERSISTENCE` | Enforce zero-retention |
| `MAX_UPLOAD_SIZE_BYTES` | Limit dataset size |
| `PROXY_TRUST_MODE` | Enable trusted proxy validation |
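
For illustration, the variables above could be read into a typed settings object as sketched below. The variable names come from the table; everything else is an assumption, including the defaults, the 10 MiB upload limit, and the treatment of `PROXY_TRUST_MODE` as a boolean flag. `Settings` and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass

TRUTHY = {"1", "true", "yes", "on"}


def _flag(env, name: str, default: str) -> bool:
    """Interpret an environment value as a boolean flag."""
    return env.get(name, default).strip().lower() in TRUTHY


@dataclass(frozen=True)
class Settings:
    llm_model: str
    enable_data_masking: bool
    disable_data_persistence: bool
    max_upload_size_bytes: int
    proxy_trust_mode: bool


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        llm_model=env.get("LLM_MODEL", "llama-3.1-70b"),
        enable_data_masking=_flag(env, "ENABLE_DATA_MASKING", "true"),
        disable_data_persistence=_flag(env, "DISABLE_DATA_PERSISTENCE", "true"),
        # Assumed default of 10 MiB; the README does not state one.
        max_upload_size_bytes=int(env.get("MAX_UPLOAD_SIZE_BYTES", str(10 * 1024 * 1024))),
        proxy_trust_mode=_flag(env, "PROXY_TRUST_MODE", "false"),
    )


print(load_settings({"MAX_UPLOAD_SIZE_BYTES": "2048", "ENABLE_DATA_MASKING": "false"}))
```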
---
## System Highlights
* Privacy-first AI system
* Agentic architecture (LangGraph)
* Async & scalable backend
* Secure execution environment
* Fully automated data analysis
---
## One-Line Summary
> A privacy-preserving, agentic AI system that autonomously analyzes structured data using a controlled LangGraph pipeline with zero data retention.
---
## Future Improvements
* Distributed task queue (Celery / Redis)
* Vector memory for contextual recall
* Advanced visualization dashboard
* Multi-dataset comparative analysis
---
> *"Designing AI systems that are not only intelligent, but also secure, controlled, and production-ready."*