title: DataAnalyst Agent
emoji: 📊
colorFrom: blue
colorTo: green
sdk: docker
pinned: false

DataAnalyst Agent 🧠📊

A Privacy-First, Autonomous Multi-Agent Data Analysis System

📖 Overview

DataAnalyst Agent is a privacy-preserving, agentic AI system that performs autonomous data analysis using a structured LangGraph multi-agent pipeline.

The system ingests structured datasets (CSV or SQL), profiles the schema automatically, generates analytical hypotheses, constructs deterministic execution plans, and produces human-readable insights, all without human intervention.

The architecture is zero-retention: all data is processed strictly in memory and cleared after execution, so no dataset persists once a run completes.

🧠 Key Capabilities

📊 Automated dataset understanding (schema profiling)
❓ AI-generated analytical questions & hypotheses
🧮 Deterministic execution using Pandas (no hallucinated computation)
🧠 LLM-powered insight generation
🔐 Privacy-first processing (PII masking + zero retention)
⚡ Asynchronous execution (non-blocking API)
📄 Exportable reports (JSON / HTML / PDF)
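
As an illustration of the schema-profiling capability listed above, here is a minimal sketch of a column profile. It is an assumption about what such a profile might contain, not the project's profiler: the project works with Pandas, while this stand-in uses only the standard library `csv` module, and the names `infer_type` and `profile_csv` are hypothetical.

```python
import csv
import io


def infer_type(values):
    """Return 'int', 'float', or 'str' for a list of non-empty string cells."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "str"


def profile_csv(text):
    """Summarize each column: inferred type, missing count, distinct count."""
    rows = list(csv.DictReader(io.StringIO(text)))
    profile = {}
    for column in (rows[0].keys() if rows else []):
        cells = [row[column] for row in rows]
        present = [c for c in cells if c not in ("", None)]
        profile[column] = {
            "type": infer_type(present) if present else "unknown",
            "missing": len(cells) - len(present),
            "distinct": len(set(present)),
        }
    return profile


# Hypothetical sample data, used only for this illustration.
sample = "region,units,price\nnorth,10,2.5\nsouth,,3.0\nnorth,7,2.5\n"
profile = profile_csv(sample)
```

A summary like this contains column names, types, and counts but no row values, which is the kind of metadata that can be handed to an LLM without exposing the dataset itself.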

💼 Real-World Use Cases

Customer behavior analytics (without exposing PII)
Financial reporting and summarization
Automated exploratory data analysis (EDA)
Internal enterprise analytics tools
Privacy-sensitive datasets (healthcare, business intelligence)

🏗️ Architectural Flow (Simplified View)

graph TD
    A[Frontend Dashboard] -->|Upload Request| B(FastAPI API Gateway)
    B --> C{Security & Validation Layer}
    C -->|Sanitized Data| D[(In-Memory DataFrame)]
    D --> E[LangGraph Orchestrator]

    subgraph Agent Pipeline
    E --> F["1 - Schema Profiler"]
    F --> G["2 - Question Generator (LLM)"]
    G --> H["3 - Execution Planner (LLM)"]
    H --> I["4 - Sandboxed Python Execution"]
    I --> J["5 - Insight Generator (LLM)"]
    end

    J --> K[Report Generator]
    K --> L[Memory Cleanup Daemon]
    L --> M[Results Returned to User]

🧠 Design Principles

Cognitive Isolation: LLMs never access raw datasets directly
Deterministic Execution: All computations handled via Python (Pandas)
Zero Data Persistence: No dataset is written to disk
Separation of Concerns: Clear boundaries between reasoning, execution, and storage
Fail-Safe Execution: Sandboxed environment prevents unsafe operations
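
The Deterministic Execution and Fail-Safe Execution principles above can be sketched as an allow-listed plan executor: the LLM proposes steps as data, and only pre-approved operations are ever run. This is a minimal sketch under assumptions. The plan format, `ALLOWED_OPS`, and `execute_plan` are hypothetical, and the standard library `statistics` module stands in for the project's Pandas execution.

```python
import statistics

# Hypothetical allow-list: the only operations a plan step may request.
ALLOWED_OPS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "max": max,
    "min": min,
    "count": len,
}


def execute_plan(plan, columns):
    """Run each plan step against in-memory columns; reject anything unlisted."""
    results = []
    for step in plan:
        op, column = step.get("op"), step.get("column")
        if op not in ALLOWED_OPS:
            raise ValueError(f"operation not allowed: {op!r}")
        if column not in columns:
            raise KeyError(f"unknown column: {column!r}")
        value = ALLOWED_OPS[op](columns[column])
        results.append({"op": op, "column": column, "value": value})
    return results


# Hypothetical data and plan, used only for this illustration.
columns = {"units": [10, 7, 13], "price": [2.5, 3.0, 2.5]}
plan = [{"op": "mean", "column": "units"}, {"op": "max", "column": "price"}]
results = execute_plan(plan, columns)

# A step outside the allow-list is refused instead of being executed.
try:
    execute_plan([{"op": "eval", "column": "units"}], columns)
    rejected = False
except ValueError:
    rejected = True
```

Because the numbers come from ordinary Python functions rather than from model output, the same plan on the same data always yields the same result.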

⚡ Core Engineering Highlights

🔹 Multi-Agent Orchestration (LangGraph)

Implements a structured pipeline:

Schema Profiling → Question Generation → Execution Planning → Deterministic Execution → Insight Synthesis
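
To make the stage order concrete, the sketch below chains five plain functions over a shared state dictionary. It deliberately does not use the LangGraph API: every function name here is hypothetical, and the two LLM-backed stages are replaced with trivial stand-ins so the example runs on its own.

```python
def profile_schema(state):
    return {"schema": sorted(state["columns"])}


def generate_questions(state):
    # Stand-in for the LLM call: one question per column name.
    return {"questions": [f"What is the average {c}?" for c in state["schema"]]}


def plan_execution(state):
    # Stand-in for the LLM planner: one 'mean' step per column.
    return {"plan": [{"op": "mean", "column": c} for c in state["schema"]]}


def execute(state):
    cols = state["columns"]
    results = {}
    for step in state["plan"]:
        values = cols[step["column"]]
        results[step["column"]] = sum(values) / len(values)
    return {"results": results}


def synthesize_insights(state):
    return {"insights": [f"Average {c} is {v:.2f}" for c, v in state["results"].items()]}


STAGES = [profile_schema, generate_questions, plan_execution, execute, synthesize_insights]


def run_pipeline(columns):
    """Pass one state dict through every stage, merging each stage's output."""
    state = {"columns": columns}
    for stage in STAGES:
        state.update(stage(state))
    return state


final = run_pipeline({"units": [10, 7, 13], "price": [2.5, 3.0, 2.5]})
```

Each stage reads from the state and returns only the keys it adds, which is the same shape of contract a graph orchestrator would enforce between nodes.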

🔹 Zero-Retention Architecture

Data is processed exclusively in-memory and automatically cleared after execution via a cleanup daemon.
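
One way to picture the clear-after-execution behavior is a context manager that empties its container on exit. This is a simplified sketch, not the project's cleanup daemon, and `ephemeral_dataset` is a hypothetical name. Note that clearing a list only drops references; reclaiming the memory is left to Python's garbage collector.

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_dataset(rows):
    """Yield an in-memory copy of the rows and clear it on exit, even on error."""
    data = [dict(row) for row in rows]
    try:
        yield data
    finally:
        data.clear()  # drop every reference held by this container


with ephemeral_dataset([{"units": 10}, {"units": 7}]) as data:
    total = sum(row["units"] for row in data)
    leaked = data  # kept only to show the container is empty afterwards
```

After the `with` block, `total` still holds the computed result while `leaked` refers to an empty list.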

🔹 Dynamic PII Masking

Sensitive fields are anonymized before any LLM interaction using regex-based detection and synthetic data replacement.
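
A minimal sketch of regex-based masking with synthetic replacements might look like the following. The patterns, the placeholder formats, and the `mask_pii` function are assumptions for illustration, not the project's detection rules; in particular, the phone pattern is loose and would also match other long digit sequences.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_pii(text):
    """Replace each distinct email or phone number with a stable synthetic value."""
    mapping = {}

    def substitute(kind, template):
        def _sub(match):
            key = (kind, match.group(0))
            if key not in mapping:
                # Number placeholders per kind, in order of first appearance.
                n = sum(1 for k in mapping if k[0] == kind) + 1
                mapping[key] = template.format(n)
            return mapping[key]
        return _sub

    text = EMAIL_RE.sub(substitute("email", "user{}@example.com"), text)
    text = PHONE_RE.sub(substitute("phone", "555-01{:02d}"), text)
    return text


masked = mask_pii("Contact ana@corp.io or ana@corp.io, phone +1 (415) 555-2671.")
```

Mapping the same original value to the same placeholder keeps repeated values recognizable as repeats, so grouping and counting still behave sensibly on the masked text.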

🔹 Asynchronous Processing

Built using FastAPI BackgroundTasks, enabling non-blocking execution and responsive APIs.
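
The non-blocking pattern amounts to: accept the request, hand the work to a background worker, return a job identifier immediately, and let the client poll for the result. The sketch below shows that submit-then-poll shape with the standard library `concurrent.futures` module rather than FastAPI, so it runs without a web server; `submit_job`, `job_status`, and `analyze` are hypothetical names.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
jobs = {}  # job_id -> Future


def analyze(rows):
    # Stand-in for the real analysis pipeline.
    return {"row_count": len(rows)}


def submit_job(rows):
    """Schedule the analysis and return immediately with a job id."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(analyze, rows)
    return job_id


def job_status(job_id):
    """Report whether the job is still running or finished."""
    future = jobs[job_id]
    if not future.done():
        return {"status": "running"}
    return {"status": "done", "result": future.result()}


job_id = submit_job([{"units": 10}, {"units": 7}])
jobs[job_id].result(timeout=5)  # demo only: wait so the status below is final
status = job_status(job_id)
executor.shutdown()
```

In a web API, `submit_job` would sit behind the upload endpoint and `job_status` behind a polling endpoint.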

🔹 Secure Logging

Implements redacted logging to ensure sensitive data is never exposed in logs.
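
Redacted logging can be sketched with a standard `logging.Filter` that rewrites each record before a handler formats it. This is an illustrative assumption about the approach, not the project's logging code: the email-only pattern, `RedactingFilter`, and the in-memory `ListHandler` exist just for this example.

```python
import logging
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class RedactingFilter(logging.Filter):
    """Replace email addresses in the formatted message with a marker."""

    def filter(self, record):
        record.msg = EMAIL_RE.sub("[REDACTED]", record.getMessage())
        record.args = ()  # the message is already fully formatted
        return True


class ListHandler(logging.Handler):
    """Collect formatted log lines in memory so the demo can inspect them."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


logger = logging.getLogger("analyst.demo")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
handler.addFilter(RedactingFilter())
logger.addHandler(handler)

logger.info("Loaded row for %s", "ana@corp.io")
```

Attaching the filter to the handler means values passed as log arguments are redacted too, since the filter works on the fully formatted message.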

🚀 Quick Start Guide

Prerequisites

Python 3.10+
Git

1. Clone the Repository

git clone https://github.com/mshoaib40458/DataAnalyst-Agent.git
cd DataAnalyst-Agent

2. Environment Configuration

cp .env.example .env

Add your API key:

GROQ_API_KEY=your_api_key_here

3. Install Dependencies

pip install -r requirements.txt

4. Run the System

Backend (FastAPI)

uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

Frontend (Flask)

cd frontend
python app.py

Access the dashboard at:

http://127.0.0.1:5000

🛠️ Configuration (.env)

| Variable | Description |
| --- | --- |
| `LLM_MODEL` | Model used for reasoning (e.g., llama-3.1-70b) |
| `ENABLE_DATA_MASKING` | Enable/disable PII masking |
| `DISABLE_DATA_PERSISTENCE` | Enforce zero-retention |
| `MAX_UPLOAD_SIZE_BYTES` | Limit dataset size |
| `PROXY_TRUST_MODE` | Enable trusted proxy validation |
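
For orientation, here is a hedged sketch of how these variables could be read into a typed settings object. The variable names come from the table above; everything else is an assumption, including the `Settings` class, the default values, the accepted boolean spellings, and the treatment of `PROXY_TRUST_MODE` as a boolean.

```python
import os
from dataclasses import dataclass


def _as_bool(value):
    """Interpret common truthy spellings; anything else counts as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    llm_model: str
    enable_data_masking: bool
    disable_data_persistence: bool
    max_upload_size_bytes: int
    proxy_trust_mode: bool


def load_settings(env=os.environ):
    """Build Settings from a mapping of environment variables."""
    return Settings(
        llm_model=env.get("LLM_MODEL", "llama-3.1-70b"),
        enable_data_masking=_as_bool(env.get("ENABLE_DATA_MASKING", "true")),
        disable_data_persistence=_as_bool(env.get("DISABLE_DATA_PERSISTENCE", "true")),
        # Assumed default of 10 MiB; the real default is not documented here.
        max_upload_size_bytes=int(env.get("MAX_UPLOAD_SIZE_BYTES", str(10 * 1024 * 1024))),
        # Assumed to be a boolean switch; it may be a string mode in practice.
        proxy_trust_mode=_as_bool(env.get("PROXY_TRUST_MODE", "false")),
    )


settings = load_settings({"MAX_UPLOAD_SIZE_BYTES": "1024", "ENABLE_DATA_MASKING": "false"})
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.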

🎯 System Highlights

🔐 Privacy-first AI system
🧠 Agentic architecture (LangGraph)
⚡ Async & scalable backend
🛡️ Secure execution environment
📊 Fully automated data analysis

🧠 One-Line Summary

A privacy-preserving, agentic AI system that autonomously analyzes structured data using a controlled LangGraph pipeline with zero data retention.

📌 Future Improvements

Distributed task queue (Celery / Redis)
Vector memory for contextual recall
Advanced visualization dashboard
Multi-dataset comparative analysis

"Designing AI systems that are not only intelligent, but also secure, controlled, and production-ready."