title: DataAnalyst Agent
emoji: ๐
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
DataAnalyst Agent
A Privacy-First, Autonomous Multi-Agent Data Analysis System
Overview
DataAnalyst Agent is a privacy-preserving, agentic AI system that performs autonomous data analysis using a structured LangGraph multi-agent pipeline.
The system ingests structured datasets (CSV/SQL), automatically profiles schema, generates analytical hypotheses, constructs deterministic execution plans, and produces human-readable insights, all without human intervention.
Designed with a zero-retention architecture, all data is processed strictly in-memory and securely cleared after execution, ensuring strong privacy guarantees.
Key Capabilities
- Automated dataset understanding (schema profiling)
- AI-generated analytical questions & hypotheses
- Deterministic execution using Pandas (no hallucinated computation)
- LLM-powered insight generation
- Privacy-first processing (PII masking + zero retention)
- Asynchronous execution (non-blocking API)
- Exportable reports (JSON / HTML / PDF)
Real-World Use Cases
- Customer behavior analytics (without exposing PII)
- Financial reporting and summarization
- Automated exploratory data analysis (EDA)
- Internal enterprise analytics tools
- Privacy-sensitive datasets (healthcare, business intelligence)
Architectural Flow (Simplified View)
graph TD;
    A[Frontend Dashboard] -->|Upload Request| B(FastAPI API Gateway)
    B --> C{"Security & Validation Layer"}
    C -->|Sanitized Data| D[(In-Memory DataFrame)]
    D --> E[LangGraph Orchestrator]
    subgraph Agent Pipeline
        E --> F["1. Schema Profiler"]
        F --> G["2. Question Generator (LLM)"]
        G --> H["3. Execution Planner (LLM)"]
        H --> I["4. Sandboxed Python Execution"]
        I --> J["5. Insight Generator (LLM)"]
    end
    J --> K[Report Generator]
    K --> L[Memory Cleanup Daemon]
    L --> M[Results Returned to User]
Design Principles
- Cognitive Isolation: LLMs never access raw datasets directly
- Deterministic Execution: All computations handled via Python (Pandas)
- Zero Data Persistence: No dataset is written to disk
- Separation of Concerns: Clear boundaries between reasoning, execution, and storage
- Fail-Safe Execution: Sandboxed environment prevents unsafe operations
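The first two principles can be illustrated with a minimal sketch. It assumes the planner LLM is shown only a structural profile of the dataset and returns a structured step, which Pandas then executes after validation. The names `profile_schema`, `execute_step`, `ALLOWED_AGGS`, and the step format are hypothetical and are not taken from the repository.

```python
import pandas as pd

# Hypothetical allow-list: the only aggregations a plan step may request.
ALLOWED_AGGS = {"sum", "mean", "min", "max", "count"}


def profile_schema(df: pd.DataFrame) -> dict:
    """Summarize structure only (names, dtypes, null counts), never cell values."""
    return {
        "rows": len(df),
        "columns": {
            name: {"dtype": str(df[name].dtype), "nulls": int(df[name].isna().sum())}
            for name in df.columns
        },
    }


def execute_step(df: pd.DataFrame, step: dict) -> dict:
    """Validate one plan step against the schema, then compute it with Pandas."""
    agg, group_by, target = step["agg"], step["group_by"], step["target"]
    if agg not in ALLOWED_AGGS:
        raise ValueError(f"aggregation not allowed: {agg}")
    for col in (group_by, target):
        if col not in df.columns:
            raise ValueError(f"unknown column: {col}")
    result = df.groupby(group_by)[target].agg(agg)
    return {str(key): float(value) for key, value in result.items()}


df = pd.DataFrame(
    {"region": ["north", "south", "north"], "revenue": [100.0, 50.0, 25.0]}
)
profile = profile_schema(df)  # the only view of the data a planner would receive
plan_step = {"agg": "sum", "group_by": "region", "target": "revenue"}  # stand-in for LLM output
totals = execute_step(df, plan_step)
```

Because the step is data rather than generated code, an invalid request fails validation before anything touches the DataFrame.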
Core Engineering Highlights
Multi-Agent Orchestration (LangGraph)
Implements a structured pipeline:
- Schema Profiling → Question Generation → Execution Planning → Deterministic Execution → Insight Synthesis
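The stage order above can be sketched as a linear pipeline over a shared state. This is a dependency-free stand-in for the wiring, not the LangGraph API, and every function body is a stub. The state keys and stage names are assumptions made for illustration.

```python
from typing import TypedDict


class PipelineState(TypedDict, total=False):
    schema: dict     # column names and dtypes only
    questions: list  # analytical questions (LLM stage)
    plan: list       # structured execution steps (LLM stage)
    results: list    # outputs of deterministic execution
    insights: list   # human-readable findings (LLM stage)


def profile_schema(state: PipelineState) -> dict:
    return {"schema": {"region": "object", "revenue": "float64"}}


def generate_questions(state: PipelineState) -> dict:
    # Stub for an LLM call that would see only state["schema"].
    return {"questions": ["What is total revenue per region?"]}


def plan_execution(state: PipelineState) -> dict:
    # Stub for an LLM call that turns each question into a structured step.
    step = {"agg": "sum", "group_by": "region", "target": "revenue"}
    return {"plan": [step for _ in state["questions"]]}


def execute_plan(state: PipelineState) -> dict:
    # Stub for the Pandas executor; a real one would compute the values.
    return {"results": [{"step": step, "value": None} for step in state["plan"]]}


def synthesize_insights(state: PipelineState) -> dict:
    # Stub for an LLM call that would see results, never raw rows.
    return {"insights": [f"Computed {len(state['results'])} result(s)."]}


STAGES = [profile_schema, generate_questions, plan_execution, execute_plan, synthesize_insights]


def run_pipeline() -> PipelineState:
    state: PipelineState = {}
    for stage in STAGES:
        state.update(stage(state))  # each stage returns only the keys it produces
    return state


final_state = run_pipeline()
```

Each stage reads earlier keys and returns a partial update, which keeps the reasoning stages and the execution stage separate.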
Zero-Retention Architecture
Data is processed exclusively in-memory and automatically cleared after execution via a cleanup daemon.
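One way to picture the in-memory lifecycle is a context manager that parses an upload into RAM and drops it when the block exits. This is a simplified stand-in for the cleanup daemon, using the standard `csv` module in place of a DataFrame. The name `in_memory_dataset` is hypothetical. Note that Python does not guarantee freed memory is overwritten, so this sketch only releases references.

```python
import csv
import io
from contextlib import contextmanager


@contextmanager
def in_memory_dataset(raw_bytes: bytes):
    """Parse CSV bytes into rows held only in RAM, then drop them on exit."""
    buffer = io.StringIO(raw_bytes.decode("utf-8"))
    rows = list(csv.DictReader(buffer))
    try:
        yield rows
    finally:
        rows.clear()    # empty the list so lingering references see no data
        buffer.close()  # release the in-memory text buffer


upload = b"region,revenue\nnorth,100\nsouth,50\n"
with in_memory_dataset(upload) as rows:
    total = sum(float(row["revenue"]) for row in rows)
    held = rows  # a reference that outlives the block

# After the block, `held` is empty; only the derived aggregate remains.
```

The `finally` clause runs even when analysis raises, so a failed job does not leave the dataset behind.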
Dynamic PII Masking
Sensitive fields are anonymized before any LLM interaction using regex-based detection and synthetic data replacement.
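A minimal sketch of regex-based masking with synthetic replacement follows. The two patterns, the `PiiMasker` class, and the replacement templates are assumptions for illustration. A real detector would need broader patterns and locale-aware rules.

```python
import re

# Illustrative patterns only; they are deliberately simple.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


class PiiMasker:
    """Replace detected values with synthetic ones, consistently per original."""

    def __init__(self) -> None:
        self._seen: dict = {}    # original value -> synthetic value
        self._counts: dict = {}  # kind -> number of distinct values seen

    def _replacer(self, kind: str, template: str):
        def replace(match: re.Match) -> str:
            original = match.group(0)
            if original not in self._seen:
                self._counts[kind] = self._counts.get(kind, 0) + 1
                self._seen[original] = template.format(n=self._counts[kind])
            return self._seen[original]

        return replace

    def mask(self, text: str) -> str:
        text = EMAIL_RE.sub(self._replacer("email", "user{n}@example.com"), text)
        text = PHONE_RE.sub(self._replacer("phone", "555-01{n:02d}"), text)
        return text


masker = PiiMasker()
masked = masker.mask(
    "Contact jane.doe@corp.io or +1 (415) 555-2671; cc jane.doe@corp.io"
)
# masked == "Contact user1@example.com or 555-0101; cc user1@example.com"
```

Mapping each original to the same synthetic value keeps joins and frequency counts meaningful after masking.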
Asynchronous Processing
Built using FastAPI BackgroundTasks, enabling non-blocking execution and responsive APIs.
Secure Logging
Implements redacted logging to ensure sensitive data is never exposed in logs.
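Redaction can be attached at the handler level with a standard `logging.Filter`, as in this sketch. It assumes email addresses are the only sensitive pattern. The filter class, pattern, and logger name are illustrative rather than the project's own.

```python
import io
import logging
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class RedactingFilter(logging.Filter):
    """Rewrite each record so matching values never reach the output."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # merge msg % args before scanning
        record.msg = EMAIL_RE.sub("[REDACTED_EMAIL]", message)
        record.args = ()               # args are already merged into msg
        return True                    # keep the (now redacted) record


stream = io.StringIO()  # stands in for a file or console sink
handler = logging.StreamHandler(stream)
handler.addFilter(RedactingFilter())

logger = logging.getLogger("dataanalyst.demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.info("Profiling upload from %s", "jane.doe@corp.io")
# stream.getvalue() == "Profiling upload from [REDACTED_EMAIL]\n"
```

Filtering on the handler covers every logger that writes to it, including arguments passed via `%s` formatting.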
Quick Start Guide
Prerequisites
- Python 3.10+
- Git
1. Clone the Repository
git clone https://github.com/mshoaib40458/DataAnalyst-Agent.git
cd DataAnalyst-Agent
2. Environment Configuration
cp .env.example .env
Add your API key:
GROQ_API_KEY=your_api_key_here
3. Install Dependencies
pip install -r requirements.txt
4. Run the System
Backend (FastAPI)
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
Frontend (Flask), in a second terminal from the repository root
cd frontend
python app.py
Access the dashboard at:
http://127.0.0.1:5000
Configuration (.env)
| Variable | Description |
|---|---|
| LLM_MODEL | Model used for reasoning (e.g., llama-3.1-70b) |
| ENABLE_DATA_MASKING | Enable/disable PII masking |
| DISABLE_DATA_PERSISTENCE | Enforce zero-retention |
| MAX_UPLOAD_SIZE_BYTES | Maximum dataset upload size, in bytes |
| PROXY_TRUST_MODE | Enable trusted proxy validation |
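The variables in the table could be read into a typed settings object along these lines. The defaults, the boolean parsing rules, and the treatment of `PROXY_TRUST_MODE` as a boolean are assumptions. The `Settings` class and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}


def _env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    raw = env.get(name)
    return default if raw is None else raw.strip().lower() in TRUTHY


@dataclass(frozen=True)
class Settings:
    llm_model: str
    enable_data_masking: bool
    disable_data_persistence: bool
    max_upload_size_bytes: int
    proxy_trust_mode: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the documented variables; every default here is an assumption."""
    env = os.environ if env is None else env
    return Settings(
        llm_model=env.get("LLM_MODEL", "llama-3.1-70b"),
        enable_data_masking=_env_bool(env, "ENABLE_DATA_MASKING", True),
        disable_data_persistence=_env_bool(env, "DISABLE_DATA_PERSISTENCE", True),
        max_upload_size_bytes=int(env.get("MAX_UPLOAD_SIZE_BYTES", 10 * 1024 * 1024)),
        proxy_trust_mode=_env_bool(env, "PROXY_TRUST_MODE", False),
    )


settings = load_settings(
    {"ENABLE_DATA_MASKING": "false", "MAX_UPLOAD_SIZE_BYTES": "2048"}
)
```

Passing a mapping instead of reading `os.environ` directly makes the loader easy to exercise in tests.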
System Highlights
- Privacy-first AI system
- Agentic architecture (LangGraph)
- Async & scalable backend
- Secure execution environment
- Fully automated data analysis
One-Line Summary
A privacy-preserving, agentic AI system that autonomously analyzes structured data using a controlled LangGraph pipeline with zero data retention.
Future Improvements
- Distributed task queue (Celery / Redis)
- Vector memory for contextual recall
- Advanced visualization dashboard
- Multi-dataset comparative analysis
"Designing AI systems that are not only intelligent, but also secure, controlled, and production-ready."