---
title: DataAnalyst Agent
emoji: 📊
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---

# DataAnalyst Agent 🧠📊

**A Privacy-First, Autonomous Multi-Agent Data Analysis System**

![Python](https://img.shields.io/badge/Python-3.10+-blue?style=for-the-badge&logo=python&logoColor=white)
![FastAPI](https://img.shields.io/badge/FastAPI-009688?style=for-the-badge&logo=fastapi&logoColor=white)
![LangGraph](https://img.shields.io/badge/LangGraph-D5A6BD?style=for-the-badge)
![Pandas](https://img.shields.io/badge/Pandas-150458?style=for-the-badge&logo=pandas&logoColor=white)

---

## 📖 Overview

**DataAnalyst Agent** is a privacy-preserving, agentic AI system that performs autonomous data analysis using a structured **LangGraph multi-agent pipeline**.

The system ingests structured datasets (CSV/SQL), automatically profiles the schema, generates analytical hypotheses, constructs deterministic execution plans, and produces human-readable insights, all without human intervention.

Designed with a **zero-retention architecture**, all data is processed strictly in-memory and securely cleared after execution, ensuring strong privacy guarantees.

---

## 🧠 Key Capabilities

* 📊 Automated dataset understanding (schema profiling)
* ❓ AI-generated analytical questions & hypotheses
* 🧮 Deterministic execution using Pandas (no hallucinated computation)
* 🧠 LLM-powered insight generation
* 🔐 Privacy-first processing (PII masking + zero retention)
* ⚡ Asynchronous execution (non-blocking API)
* 📄 Exportable reports (JSON / HTML / PDF)

---

## 💼 Real-World Use Cases

* Customer behavior analytics (without exposing PII)
* Financial reporting and summarization
* Automated exploratory data analysis (EDA)
* Internal enterprise analytics tools
* Privacy-sensitive datasets (healthcare, business intelligence)

---

## 🏗️ Architectural Flow (Simplified View)

```mermaid
graph TD;
    A[Frontend Dashboard] --> |Upload Request| B(FastAPI API Gateway)
    B --> C{Security & Validation Layer}
    C -->|Sanitized Data| D[(In-Memory DataFrame)]
    D --> E[LangGraph Orchestrator]
    
    subgraph Agent Pipeline
    E --> F["1. Schema Profiler"]
    F --> G["2. Question Generator (LLM)"]
    G --> H["3. Execution Planner (LLM)"]
    H --> I["4. Sandboxed Python Execution"]
    I --> J["5. Insight Generator (LLM)"]
    end
    
    J --> K[Report Generator]
    K --> L[Memory Cleanup Daemon]
    L --> M[Results Returned to User]
```


---

## 🧠 Design Principles

* **Cognitive Isolation**: LLMs never access raw datasets directly
* **Deterministic Execution**: All computations handled via Python (Pandas)
* **Zero Data Persistence**: No dataset is written to disk
* **Separation of Concerns**: Clear boundaries between reasoning, execution, and storage
* **Fail-Safe Execution**: Sandboxed environment prevents unsafe operations
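
The cognitive-isolation principle can be illustrated with a sketch that reduces a dataset to a schema profile, so that only metadata would ever be placed in an LLM prompt. The function name `profile_schema`, the profile layout, and the type-inference rules are assumptions made for illustration, not the project's actual code, and the sketch uses the standard library instead of Pandas to stay self-contained.

```python
import csv
import io


def _infer_type(values):
    """Guess a coarse column type from its non-empty string values."""
    if not values:
        return "empty"
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


def profile_schema(csv_text):
    """Return column metadata only; no cell values leave this function."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = rows[0].keys() if rows else []
    profile = {}
    for col in columns:
        present = [r[col] for r in rows if r[col] not in (None, "")]
        profile[col] = {
            "type": _infer_type(present),
            "missing": len(rows) - len(present),
            "distinct": len(set(present)),
        }
    return {"row_count": len(rows), "columns": profile}


# Hypothetical sample data, used only to exercise the function.
sample = "name,age,score\nAda,36,9.5\nLin,,7.0\n"
schema = profile_schema(sample)
```

Here `schema` holds counts and inferred types per column, which is the kind of summary a question-generation stage could work from without seeing any raw rows.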

---

## ⚡ Core Engineering Highlights

### 🔹 Multi-Agent Orchestration (LangGraph)

Implements a structured pipeline:

* Schema Profiling → Question Generation → Execution Planning → Deterministic Execution → Insight Synthesis
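
The stage ordering above can be sketched as a chain of functions that each read and extend a shared state dictionary. This is plain Python standing in for the LangGraph graph, not LangGraph's API; the stage functions, state keys, and sample rows are all hypothetical, and the LLM-backed stages are replaced with fixed templates.

```python
from typing import Callable, Dict, List

State = Dict[str, object]


def profile(state: State) -> State:
    # Stage 1: derive metadata only (column names here) from the rows.
    rows = state["rows"]
    return {**state, "schema": sorted(rows[0].keys()) if rows else []}


def generate_questions(state: State) -> State:
    # Stage 2: an LLM would propose questions from the schema; stubbed here.
    return {**state, "questions": [f"What is the mean of {c}?" for c in state["schema"]]}


def plan(state: State) -> State:
    # Stage 3: turn each question into a deterministic operation spec.
    return {**state, "plan": [("mean", c) for c in state["schema"]]}


def execute(state: State) -> State:
    # Stage 4: compute results in plain Python; no model is involved.
    rows = state["rows"]
    results = {c: sum(r[c] for r in rows) / len(rows) for _, c in state["plan"]}
    return {**state, "results": results}


def summarize(state: State) -> State:
    # Stage 5: an LLM would phrase insights; stubbed with a template.
    return {**state, "insights": [f"Mean {c} is {v:.2f}" for c, v in state["results"].items()]}


PIPELINE: List[Callable[[State], State]] = [profile, generate_questions, plan, execute, summarize]


def run_pipeline(rows):
    """Run every stage in order, threading the state through each one."""
    state: State = {"rows": rows}
    for stage in PIPELINE:
        state = stage(state)
    return state


final = run_pipeline([{"price": 10.0, "qty": 2}, {"price": 20.0, "qty": 4}])
```

Keeping the numeric work in `execute` and the wording in `summarize` mirrors the split between deterministic computation and LLM reasoning described in this README.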

### 🔹 Zero-Retention Architecture

Data is processed exclusively in-memory and automatically cleared after execution via a cleanup daemon.
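
One way to picture in-memory processing with guaranteed cleanup is a context manager that parses an upload without touching disk and empties the parsed rows on exit. This is a minimal sketch under assumptions: `ephemeral_dataset` is a hypothetical name, the project's cleanup daemon is not reproduced here, and clearing a list only drops references (Python does not guarantee the underlying memory is scrubbed).

```python
import csv
import io
from contextlib import contextmanager


@contextmanager
def ephemeral_dataset(upload: bytes):
    """Parse uploaded CSV bytes in memory and drop the rows on exit."""
    rows = list(csv.DictReader(io.StringIO(upload.decode("utf-8"))))
    try:
        yield rows
    finally:
        # Runs on success and on error, so rows never outlive the block.
        rows.clear()


captured = None
with ephemeral_dataset(b"city,sales\nOslo,12\nLima,30\n") as data:
    captured = data  # keep a reference to show it is emptied afterwards
    total = sum(int(r["sales"]) for r in data)
```

After the `with` block, `total` still holds the computed aggregate while `captured` is an empty list, so only derived results survive.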

### 🔹 Dynamic PII Masking

Sensitive fields are anonymized before any LLM interaction using regex-based detection and synthetic data replacement.
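
A minimal sketch of regex-based masking, assuming two illustrative patterns (email and phone) and placeholder tokens in place of richer synthetic data. `PII_PATTERNS` and `mask_pii` are hypothetical names; the project's real detectors and replacement strategy may differ.

```python
import re

# Hypothetical patterns; a real deployment would need a broader, tested set.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def mask_pii(text, mapping=None):
    """Replace each detected value with a stable placeholder token."""
    mapping = {} if mapping is None else mapping

    def substitute(kind):
        def _sub(match):
            value = match.group(0)
            if value not in mapping:
                # Number tokens per kind so repeated values share one token.
                count = sum(1 for t in mapping.values() if t.startswith(f"<{kind}_"))
                mapping[value] = f"<{kind}_{count + 1}>"
            return mapping[value]
        return _sub

    for kind, pattern in PII_PATTERNS.items():
        text = pattern.sub(substitute(kind), text)
    return text, mapping


masked, seen = mask_pii("Mail ada@example.com or call +1 555 010 9999; cc ada@example.com")
```

Because the same input value always maps to the same token, grouping and counting on a masked column still behave consistently, while the original value stays in `seen` and never needs to reach the model.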

### 🔹 Asynchronous Processing

Built using **FastAPI BackgroundTasks**, enabling non-blocking execution and responsive APIs.

### 🔹 Secure Logging

Implements redacted logging to ensure sensitive data is never exposed in logs.
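
Redacted logging can be sketched with a standard-library `logging.Filter` that rewrites each record before the handler formats it. The `RedactingFilter` class, the single email pattern, and the logger name are assumptions for illustration; the project's own redaction rules are not shown in this README.

```python
import io
import logging
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class RedactingFilter(logging.Filter):
    """Scrub email-like strings from records before the handler emits them."""

    def filter(self, record):
        # Render the message with its args first, then redact the result.
        record.msg = EMAIL.sub("[REDACTED]", record.getMessage())
        record.args = None
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactingFilter())

logger = logging.getLogger("dataanalyst.demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.info("Upload received from %s", "ada@example.com")
output = stream.getvalue()
```

Attaching the filter to the handler means values passed as log arguments are redacted too, since the message is rendered before the pattern is applied.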

---

## 🚀 Quick Start Guide

### Prerequisites

* Python 3.10+
* Git

---

### 1. Clone the Repository

```bash
git clone https://github.com/mshoaib40458/DataAnalyst-Agent.git
cd DataAnalyst-Agent
```

---

### 2. Environment Configuration

```bash
cp .env.example .env
```

Add your API key:

```
GROQ_API_KEY=your_api_key_here
```

---

### 3. Install Dependencies

```bash
pip install -r requirements.txt
```

---

### 4. Run the System

#### Backend (FastAPI)

```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

#### Frontend (Flask)

```bash
cd frontend
python app.py
```

Access the dashboard at:

```
http://127.0.0.1:5000
```

---

## 🛠️ Configuration (.env)

| Variable                   | Description                                    |
| -------------------------- | ---------------------------------------------- |
| `LLM_MODEL`                | Model used for reasoning (e.g., llama-3.1-70b) |
| `ENABLE_DATA_MASKING`      | Enable/disable PII masking                     |
| `DISABLE_DATA_PERSISTENCE` | Enforce zero-retention                         |
| `MAX_UPLOAD_SIZE_BYTES`    | Limit dataset size                             |
| `PROXY_TRUST_MODE`         | Enable trusted proxy validation                |
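
As a rough illustration of how these variables might be read at startup, the sketch below parses them into a typed settings object. The `Settings` class, the boolean parsing rules, the default values, and the assumption that `PROXY_TRUST_MODE` is a boolean flag are all hypothetical; only the variable names come from the table above.

```python
import os
from dataclasses import dataclass


def _as_bool(value: str) -> bool:
    """Treat common truthy spellings as True; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    llm_model: str
    enable_data_masking: bool
    disable_data_persistence: bool
    max_upload_size_bytes: int
    proxy_trust_mode: bool


def load_settings(env=os.environ) -> Settings:
    # Defaults below are illustrative assumptions, not the project's values.
    return Settings(
        llm_model=env.get("LLM_MODEL", "llama-3.1-70b"),
        enable_data_masking=_as_bool(env.get("ENABLE_DATA_MASKING", "true")),
        disable_data_persistence=_as_bool(env.get("DISABLE_DATA_PERSISTENCE", "true")),
        max_upload_size_bytes=int(env.get("MAX_UPLOAD_SIZE_BYTES", str(10 * 1024 * 1024))),
        proxy_trust_mode=_as_bool(env.get("PROXY_TRUST_MODE", "false")),
    )


# Pass a plain dict instead of os.environ to try different configurations.
settings = load_settings({"MAX_UPLOAD_SIZE_BYTES": "2048", "ENABLE_DATA_MASKING": "no"})
```

Accepting the environment as a parameter keeps the loader easy to test without modifying the real process environment.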

---

## 🎯 System Highlights

* 🔐 Privacy-first AI system
* 🧠 Agentic architecture (LangGraph)
* ⚡ Async & scalable backend
* 🛡️ Secure execution environment
* 📊 Fully automated data analysis

---

## 🧠 One-Line Summary

> A privacy-preserving, agentic AI system that autonomously analyzes structured data using a controlled LangGraph pipeline with zero data retention.

---

## 📌 Future Improvements

* Distributed task queue (Celery / Redis)
* Vector memory for contextual recall
* Advanced visualization dashboard
* Multi-dataset comparative analysis

---

> *"Designing AI systems that are not only intelligent, but also secure, controlled, and production-ready."*