---
title: DataAnalyst Agent
emoji: 📊
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---

DataAnalyst Agent 🧠📊

A Privacy-First, Autonomous Multi-Agent Data Analysis System

Built with: Python, FastAPI, LangGraph, Pandas


📖 Overview

DataAnalyst Agent is a privacy-preserving, agentic AI system that performs autonomous data analysis using a structured LangGraph multi-agent pipeline.

The system ingests structured datasets (CSV/SQL), automatically profiles their schema, generates analytical hypotheses, constructs deterministic execution plans, and produces human-readable insights, all without human intervention.

The architecture is zero-retention: all data is processed strictly in memory and cleared after execution, so no dataset is written to disk.


🧠 Key Capabilities

  • 📊 Automated dataset understanding (schema profiling)
  • ❓ AI-generated analytical questions & hypotheses
  • 🧮 Deterministic execution using Pandas (no hallucinated computation)
  • 🧠 LLM-powered insight generation
  • 🔐 Privacy-first processing (PII masking + zero retention)
  • ⚡ Asynchronous execution (non-blocking API)
  • 📄 Exportable reports (JSON / HTML / PDF)
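As an illustration of the schema profiling capability, the sketch below summarizes a DataFrame's structure without returning any cell values. The function name `profile_schema` and the fields in its output are assumptions made for this example, not the project's actual profiler.

```python
import pandas as pd


def profile_schema(df: pd.DataFrame) -> dict:
    """Summarize structure only: column names, dtypes, null and distinct counts."""
    columns = {}
    for name in df.columns:
        series = df[name]
        columns[str(name)] = {
            "dtype": str(series.dtype),
            "null_count": int(series.isna().sum()),
            "unique_count": int(series.nunique(dropna=True)),
            "is_numeric": bool(pd.api.types.is_numeric_dtype(series)),
        }
    return {"row_count": int(len(df)), "columns": columns}


# Hypothetical sample data, used only to exercise the function.
sample = pd.DataFrame(
    {"region": ["north", "south", None], "revenue": [120.0, 80.5, 99.0]}
)
profile = profile_schema(sample)
print(profile)
```

A summary like this is small enough to place in an LLM prompt, which is one plausible way to generate questions without exposing rows.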

💼 Real-World Use Cases

  • Customer behavior analytics (without exposing PII)
  • Financial reporting and summarization
  • Automated exploratory data analysis (EDA)
  • Internal enterprise analytics tools
  • Privacy-sensitive datasets (healthcare, business intelligence)

๐Ÿ—๏ธ Architectural Flow (Simplified View)

graph TD;
    A[Frontend Dashboard] --> |Upload Request| B(FastAPI API Gateway)
    B --> C{Security & Validation Layer}
    C -->|Sanitized Data| D[(In-Memory DataFrame)]
    D --> E[LangGraph Orchestrator]
    
    subgraph Agent Pipeline
    E --> F[1. Schema Profiler]
    F --> G["2. Question Generator (LLM)"]
    G --> H["3. Execution Planner (LLM)"]
    H --> I[4. Sandboxed Python Execution]
    I --> J["5. Insight Generator (LLM)"]
    end
    
    J --> K[Report Generator]
    K --> L[Memory Cleanup Daemon]
    L --> M[Results Returned to User]
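
The Security & Validation Layer in the diagram could look roughly like the following. This is a minimal sketch under assumptions: the function `validate_csv_upload`, the exception `UploadRejected`, the default size limit, and the specific checks are all hypothetical, chosen to show the kind of gatekeeping that happens before data reaches the in-memory DataFrame.

```python
import csv
import io

# Assumed default for illustration; the real limit comes from MAX_UPLOAD_SIZE_BYTES.
DEFAULT_MAX_BYTES = 5 * 1024 * 1024


class UploadRejected(ValueError):
    """Raised when an upload fails a validation check."""


def validate_csv_upload(
    filename: str, payload: bytes, max_bytes: int = DEFAULT_MAX_BYTES
) -> list[str]:
    """Return the cleaned header row if the upload passes basic checks."""
    if not filename.lower().endswith(".csv"):
        raise UploadRejected("only .csv uploads are accepted")
    if len(payload) == 0:
        raise UploadRejected("empty upload")
    if len(payload) > max_bytes:
        raise UploadRejected("upload exceeds the size limit")
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise UploadRejected("payload is not valid UTF-8") from exc
    # Parse only the first row; the body is not inspected here.
    header = next(csv.reader(io.StringIO(text)), [])
    if not header or any(not cell.strip() for cell in header):
        raise UploadRejected("missing or blank column names")
    return [cell.strip() for cell in header]


header = validate_csv_upload("sales.csv", b"region,revenue\nnorth,10\n")
print(header)
```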

🧠 Design Principles

  • Cognitive Isolation: LLMs never access raw datasets directly
  • Deterministic Execution: All computations handled via Python (Pandas)
  • Zero Data Persistence: No dataset is written to disk
  • Separation of Concerns: Clear boundaries between reasoning, execution, and storage
  • Fail-Safe Execution: Sandboxed environment prevents unsafe operations
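
One way to read the cognitive isolation and deterministic execution principles together: the LLM emits a structured plan, and a small executor validates that plan against an allowlist before running it with Pandas. The plan shape (`group_by`, `column`, `aggregate`) and the name `execute_plan` are assumptions for this sketch, not the project's actual plan format.

```python
import pandas as pd

# Only these aggregations may run; anything else in a plan is rejected.
ALLOWED_AGGREGATES = {"sum", "mean", "min", "max", "count"}


def execute_plan(df: pd.DataFrame, plan: dict) -> dict:
    """Run one validated group-and-aggregate step and return plain Python values."""
    group_by = plan["group_by"]
    column = plan["column"]
    aggregate = plan["aggregate"]
    if aggregate not in ALLOWED_AGGREGATES:
        raise ValueError(f"aggregate not allowed: {aggregate!r}")
    for name in (group_by, column):
        if name not in df.columns:
            raise ValueError(f"unknown column: {name!r}")
    result = df.groupby(group_by)[column].agg(aggregate)
    return {str(key): float(value) for key, value in result.items()}


# Hypothetical data and plan, as if the plan had come from the planning agent.
sales = pd.DataFrame(
    {"region": ["north", "north", "south"], "revenue": [10.0, 20.0, 5.0]}
)
averages = execute_plan(
    sales, {"group_by": "region", "column": "revenue", "aggregate": "mean"}
)
print(averages)
```

Because the numbers come from Pandas and the plan is checked before it runs, a malformed or malicious plan fails with an error instead of producing an invented result.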

⚡ Core Engineering Highlights

🔹 Multi-Agent Orchestration (LangGraph)

Implements a structured pipeline:

  • Schema Profiling → Question Generation → Execution Planning → Deterministic Execution → Insight Synthesis
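
The stage order above can be sketched in plain Python as a list of functions that each take and return a shared state dictionary. This deliberately does not use the LangGraph API; it only illustrates the data flow. The two LLM stages are replaced by deterministic stand-ins, and every function and key name here is hypothetical.

```python
from typing import Callable

State = dict  # shared state handed from one stage to the next


def schema_profiler(state: State) -> State:
    first_row = state["rows"][0]
    schema = {name: type(value).__name__ for name, value in first_row.items()}
    return {**state, "schema": schema}


def question_generator(state: State) -> State:
    # Stand-in for an LLM call: it reads the schema only, never the rows.
    numeric = [n for n, t in state["schema"].items() if t in ("int", "float")]
    questions = [{"text": f"What is the total {n}?", "column": n} for n in numeric]
    return {**state, "questions": questions}


def execution_planner(state: State) -> State:
    # Stand-in for an LLM call: each question becomes a structured step.
    plan = [{"column": q["column"], "aggregate": "sum"} for q in state["questions"]]
    return {**state, "plan": plan}


def deterministic_executor(state: State) -> State:
    results = {
        step["column"]: sum(row[step["column"]] for row in state["rows"])
        for step in state["plan"]
    }
    return {**state, "results": results}


def insight_generator(state: State) -> State:
    # Stand-in for an LLM call: it sees computed results, not raw rows.
    insights = [f"Total {col} is {val}" for col, val in state["results"].items()]
    return {**state, "insights": insights}


STAGES: list[Callable[[State], State]] = [
    schema_profiler,
    question_generator,
    execution_planner,
    deterministic_executor,
    insight_generator,
]


def run_pipeline(rows: list[dict]) -> State:
    state: State = {"rows": rows}
    for stage in STAGES:
        state = stage(state)
    return state


final_state = run_pipeline(
    [{"region": "north", "revenue": 10.0}, {"region": "south", "revenue": 5.5}]
)
print(final_state["insights"])
```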

🔹 Zero-Retention Architecture

Data is processed exclusively in-memory and automatically cleared after execution via a cleanup daemon.
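
A minimal sketch of the in-memory lifecycle, assuming the upload arrives as raw bytes: the DataFrame is built from a `BytesIO` buffer, used inside a context manager, and emptied on exit. The helper name `ephemeral_frame` is hypothetical, and this stands in for the cleanup daemon rather than reproducing it.

```python
import gc
import io
from contextlib import contextmanager

import pandas as pd


@contextmanager
def ephemeral_frame(raw_bytes: bytes):
    """Yield a DataFrame parsed from memory and empty it when the block exits."""
    df = pd.read_csv(io.BytesIO(raw_bytes))  # no temporary file is created
    try:
        yield df
    finally:
        # Empty the frame in place so any lingering reference sees no data.
        df.drop(index=df.index, inplace=True)
        df.drop(columns=df.columns, inplace=True)
        del df
        gc.collect()


with ephemeral_frame(b"region,revenue\nnorth,10\nsouth,5\n") as frame:
    total = int(frame["revenue"].sum())

print(total, frame.empty)
```

Note that emptying a frame and triggering garbage collection releases references; it does not guarantee that the underlying memory is overwritten.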

🔹 Dynamic PII Masking

Sensitive fields are anonymized before any LLM interaction using regex-based detection and synthetic data replacement.

🔹 Asynchronous Processing

Long-running analyses are scheduled with FastAPI BackgroundTasks, which run after the response is sent, so API calls return without waiting for the pipeline to finish.

🔹 Secure Logging

Log records are redacted before they are written, so sensitive values do not appear in log output.
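
Redaction of this kind can be done with a standard-library `logging.Filter` attached to a handler. The sketch below is one possible approach, not the project's implementation; the patterns and the names `RedactingFilter` and `build_logger` are assumptions.

```python
import io
import logging
import re

# Hypothetical patterns: email addresses and key=value style secrets.
REDACTION_PATTERNS = [
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    re.compile(r"(?i)\b(?:api[_-]?key|token|password)\s*=\s*\S+"),
]


class RedactingFilter(logging.Filter):
    """Scrub sensitive substrings from each record before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # merge args into the message first
        for pattern in REDACTION_PATTERNS:
            message = pattern.sub("[REDACTED]", message)
        record.msg, record.args = message, None
        return True  # keep the record, now scrubbed


def build_logger(name: str, stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.addFilter(RedactingFilter())
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid an unredacted copy via the root logger
    return logger


stream = io.StringIO()
demo_logger = build_logger("redaction-demo", stream)
demo_logger.info("upload by %s with api_key=%s", "jane@corp.com", "sk-12345")
print(stream.getvalue(), end="")
```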


🚀 Quick Start Guide

Prerequisites

  • Python 3.10+
  • Git

1. Clone the Repository

git clone https://github.com/mshoaib40458/DataAnalyst-Agent.git
cd DataAnalyst-Agent

2. Environment Configuration

cp .env.example .env

Add your API key:

GROQ_API_KEY=your_api_key_here

3. Install Dependencies

pip install -r requirements.txt

4. Run the System

Backend (FastAPI)

uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

Frontend (Flask)

cd frontend
python app.py

Access the dashboard at:

http://127.0.0.1:5000

🛠️ Configuration (.env)

| Variable | Description |
| --- | --- |
| LLM_MODEL | Model used for reasoning (e.g., llama-3.1-70b) |
| ENABLE_DATA_MASKING | Enable/disable PII masking |
| DISABLE_DATA_PERSISTENCE | Enforce zero-retention |
| MAX_UPLOAD_SIZE_BYTES | Limit dataset size |
| PROXY_TRUST_MODE | Enable trusted proxy validation |
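
For reference, these variables could be read into a typed settings object as sketched below. The variable names come from the table above; the `Settings` class, the defaults, and the choice to treat `PROXY_TRUST_MODE` as a boolean are assumptions for this example and may not match the application's own loader.

```python
import os
from dataclasses import dataclass
from typing import Mapping

TRUE_VALUES = {"1", "true", "yes", "on"}


def _as_bool(value: str) -> bool:
    return value.strip().lower() in TRUE_VALUES


@dataclass(frozen=True)
class Settings:
    llm_model: str
    enable_data_masking: bool
    disable_data_persistence: bool
    max_upload_size_bytes: int
    proxy_trust_mode: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping; all defaults are assumed."""
    return Settings(
        llm_model=env.get("LLM_MODEL", "llama-3.1-70b"),
        enable_data_masking=_as_bool(env.get("ENABLE_DATA_MASKING", "true")),
        disable_data_persistence=_as_bool(env.get("DISABLE_DATA_PERSISTENCE", "true")),
        max_upload_size_bytes=int(env.get("MAX_UPLOAD_SIZE_BYTES", str(10 * 1024 * 1024))),
        proxy_trust_mode=_as_bool(env.get("PROXY_TRUST_MODE", "false")),
    )


settings = load_settings(
    {"ENABLE_DATA_MASKING": "false", "MAX_UPLOAD_SIZE_BYTES": "1048576"}
)
print(settings)
```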

🎯 System Highlights

  • 🔐 Privacy-first AI system
  • 🧠 Agentic architecture (LangGraph)
  • ⚡ Async & scalable backend
  • 🛡️ Secure execution environment
  • 📊 Fully automated data analysis

🧠 One-Line Summary

A privacy-preserving, agentic AI system that autonomously analyzes structured data using a controlled LangGraph pipeline with zero data retention.


📌 Future Improvements

  • Distributed task queue (Celery / Redis)
  • Vector memory for contextual recall
  • Advanced visualization dashboard
  • Multi-dataset comparative analysis

"Designing AI systems that are not only intelligent, but also secure, controlled, and production-ready."