Spaces:

rohitdeshmukh318
/

Data_analysis_agent

Sleeping

App Files Files Community

Data_analysis_agent / README.md

rohitdeshmukh318

docs: add Medium article link to README

ed8db66 12 days ago

preview code

raw

history blame contribute delete

12.6 kB

metadata

title: Data Analyst Agent
emoji: 🤖
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
app_port: 7860

🤖 Autonomous Multi-Agent Data Analytics System

📖 Read the Deep-Dive Article on Medium: Static Dashboards Are Dying: Inside an Autonomous Analytics Agent

An advanced, self-correcting multi-agent AI system that transforms natural language queries into secure database operations, profiles data schemas, detects outliers dynamically, and compiles insights into professional analytical reports.

🛠️ System Architecture

The following diagram illustrates the complete end-to-end request-response life cycle, showcasing the LangGraph Orchestrator, secure AST Sandboxing, and dynamic Multi-Tenant Database Ingestion:

flowchart TD
    Client[Frontend Client - Vite/React] -->|1. Natural Language Query| API[FastAPI Backend]
    API -->|2. Ingests Query| LangGraph[LangGraph Orchestrator]
    
    subgraph LangGraph [LangGraph Workflow State Machine]
        direction TB
        Router{Intent Router} -->|Direct General Chat| Synth[Insight Synthesizer]
        Router -->|Data Query| Planner[Query Planner Node]
        
        Planner -->|Fetch Context| Memory[Memory Retriever Node]
        Memory -->|Select Target Schema| SchemaRAG[Schema Vector Retrieval]
        
        SchemaRAG -->|Generate PostgreSQL/Pandas| Generator[Code Generator Node]
        Generator -->|Verify AST| Validator{AST Safety Validator}
        
        Validator -->|Syntax/Policy Failure| SelfCorrect[Self-Correction Loop]
        SelfCorrect -->|Error Traceback Feed| Generator
        
        Validator -->|Security Passed| Executor[Safe Execution Engine]
        Executor -->|Inspect Output| Anomaly[Anomaly Detector Node]
        Anomaly -->|Format Data & Anomalies| Synth
    end
    
    subgraph Data & Execution Isolation
        Executor -->|Isolated Read-Only Connect| UserDB[(Target User DB)]
        SchemaRAG -->|Store/Match Vectors| MetaDB[(Central Metadata DB)]
    end
    
    subgraph Shared Services
        Generator -.->|Inference| Groq[Groq Llama 3.1 70B]
        Planner -.->|Inference| Groq8B[Groq Llama 3.1 8B]
        API -.->|Session Storage| Redis[Upstash Redis Cache]
    end
    
    Synth -->|3. Server-Sent Events SSE| Client

🤖 Specialized Agent Nodes

The LangGraph-powered state machine decouples execution into cooperative, specialized agent nodes. Each node acts as a distinct agent with unique prompts, reasoning scopes, and target model configurations:

Intent Router Agent (LLaMA-3.1-8B): Automatically scans raw user questions to classify user intent (sql, pandas, insight_only, or unsupported). Operates on the lightweight 8B model to keep initial routing latency under 100ms.
Query Planner Agent (LLaMA-3.1-8B): Evaluates previous conversation memory alongside relevant schemas retrieved via Schema RAG to generate a structured JSON execution strategy (specifying necessary joins, table scopes, and expected query complexity).
Code Generator Agent (LLaMA-3.3-70B): Executes the Planner's blueprint to write clean execution code. Generates ANSI-compliant SQL for database connectors or Pandas/Python scripts for CSV/flat files. Utilizes the larger 70B model for deep logical reasoning.
Safety Validator Agent (Pure-Python/AST): An essential security gate. SQL operations are constrained to read-only SELECT calls. Python scripts are parsed into an Abstract Syntax Tree (AST) using Python's native ast module, immediately terminating executions containing blacklisted imports (os, sys, subprocess) or dangerous built-in functions (open, eval).
Error Classifier Agent (LLaMA-3.1-8B): If code execution fails, this agent intercepts the compiler or database stack trace, classifies the exact issue (e.g. UndefinedColumn, SyntaxError), and structures a direct diagnosis prompt.
Self-Corrector Agent (LLaMA-3.3-70B): Absorbs the broken script and the diagnostic report from the Error Classifier, automatically self-heals the code, and re-routes it back to the Safety Validator. This forms the system's self-correction loop (capped at 3 attempts).
Insight Synthesizer Agent (LLaMA-3.1-8B): Receives raw, formatted tabular query results and synthesizes a concise, high-impact 2-3 sentence analytical summary focusing on business metrics and key statistics.

🔒 Security Architecture (Recruiter & Enterprise Review)

Allowing AI models to write and execute code on live relational databases presents extreme security risks. This platform solves this by enforcing four layers of strict containment:

Client-Side Credential Tokenization (Symmetric Key Cipher): When a user inputs their custom cloud PostgreSQL connection URI, the backend immediately encrypts it using a server-side symmetric block key derived dynamically using SHA-256 and an XOR stream cipher. The frontend only stores the secure token (postgres-enc:XYZ...) in its session state—meaning raw passwords never touch server logs, browser caches, or persistent disk storage.
Abstract Syntax Tree (AST) Parsing: Before any query is sent to a target database, it is compiled into an AST using sqlglot. If the parser detects any blocked operations (e.g., DROP, DELETE, ALTER, UPDATE, INSERT), the query is terminated instantly.
Restricted Python Sandboxing: For tabular files (CSVs), code is executed under RestrictedPython, blocking access to the local filesystem, shell, importing malicious packages, and private built-ins.
Session-Level Read-Only Isolation: All database sessions are forced into strict read-only transactions at the database driver level:
```
conn.set_session(readonly=True, autocommit=True)
```
Custom user database queries completely bypass the platform's global connection pool, ensuring zero session leakage or database cross-talk.

📖 User Perspective & Operating Guide

From an analyst's or business user's perspective, the system operates as a seamless companion dashboard:

1. Ingesting Your Data Sources

Click Upload Data in the left sidebar. The system supports two methods of analysis:

Option A (Local Files): Drag and drop a .csv, .sqlite, or .db file. The backend loads it dynamically into a sandbox-safe, fast in-memory database workspace.
Option B (Secure Cloud Database): Paste a standard public PostgreSQL connection string (from Railway, AWS RDS, GCP Cloud SQL, Azure, etc.) and type your target schema name (e.g. public).
- The system will automatically establish an isolated connection, profile your schema, embed your table descriptions, and index them into semantic search memory.

2. Conversational Analytical Chat

Navigate to the Chat screen:

Choose your active database source using the selector at the top.
Ask natural language questions like: "Show me our monthly order volume for the past year and point out any spikes." or "What are the top 5 highest-grossing products?"
The agent dynamically translates the request, generates clean Postgres-dialect SQL, executes it, creates interactive chart specifications using Plotly, and provides statistical insights.

3. Automated Anomaly Detection

Every query result is piped through our parallel statistical anomaly engine. If the system detects spikes, outlier points, or extreme values in your dataset, it highlights them automatically under the Anomalies Found banner.

4. Interactive Metrics Dashboard

Click Dashboard in the sidebar to review your saved charts, add custom titles, and compile real-time metric cards. You can trace query execution status (e.g., Latency, LLM token count, Self-correction rates) directly in the Metrics panel.

5. Professional Report Export

When your analysis is complete, click Export PDF at the top right of the Chat screen. The system compiles your session history, charts, and mathematical insights into a polished, print-ready executive PDF report.

🚀 Quickstart Guide (Local Development)

Prerequisites

Python 3.11+ installed
Node.js 18+ installed

1. Configure the Environment

Clone the repository, copy the example environment file, and populate your API credentials:

cp .env.example .env

Specify your GROQ_API_KEY, your active NEON_DATABASE_URL (metadata store), and other optional parameters.

2. Initialize the Metadata Database

The platform requires a PostgreSQL database with pgvector enabled to act as its primary system store. Run the migrations script against your database to set up system tables:

# Run migration setup script
python scripts/run_migration.py

3. Seed Demo Data (Optional)

To test the system immediately with realistic e-commerce datasets (customers, orders, products):

python scripts/seed_demo.py

4. Boot the FastAPI Backend

pip install -r requirements.txt
python -m uvicorn api.main:app --host 127.0.0.1 --port 8000

5. Boot the Vite/React Frontend

In a separate terminal:

cd frontend
npm install
npm run dev

Open your browser to http://localhost:5173/ to begin your analytical session!

Deployed Production Usage (Recruiter & Public Demo Guide)

Once this project is deployed to a cloud platform (e.g. Railway, Render, AWS, GCP), users and technical recruiters can utilize the live system instantly without downloading any code or setting up local files.

🌟 Instant Recruiter Testing (Zero Setup Required)

If you are evaluating this project as a hiring manager or recruiter:

Open the live production URL of the deployed application (e.g. https://your-data-analyst-agent.up.railway.app).
Immediate Demo Database Access: On the Chat page, look at the database selector in the top-left corner. It will be pre-configured with Demo DB (Neon).
Ask Queries Instantly: You can start chatting with the agent immediately using the pre-seeded e-commerce data (e.g. "Show me total revenue per product category as a bar chart" or "Highlight any anomalies in monthly customer signups").
Connect Your Own Cloud Database: Go to the Upload Data tab, click Connect Cloud Database, paste any standard PostgreSQL URL from Railway/AWS/GCP, and hit Securely Link & Query Database to immediately analyze your own datasets.

☁️ How to Deploy to Production (100% Free Stack)

This repository has been fully optimized to run on the ultimate free tier stack: Hugging Face Spaces (Compute) + Neon (Database) + Upstash (Cache). This gives you enterprise-grade performance (16GB RAM) for $0/month.

1. Set Up the Metadata Database (Neon.tech)

Go to Neon.tech and create a free account.
Create a new Postgres database (Neon automatically supports pgvector).
Copy your database connection string (it looks like postgresql://...).
Run the migrations locally against this remote database to set up all system tables:
```
NEON_DATABASE_URL="your-neon-url" python scripts/run_migration.py
```

2. Set Up the Cache (Upstash)

Go to Upstash.com and create a free account.
Create a new Redis database.
Scroll down to the REST API section and copy your UPSTASH_REDIS_REST_URL and UPSTASH_REDIS_REST_TOKEN.

3. Deploy to Hugging Face Spaces (Compute)

Hugging Face Spaces provides a free Docker environment with a massive 16GB of RAM. The Dockerfile is already pre-configured to expose port 7860 as required by HF Spaces!

Go to Hugging Face Spaces and click Create new Space.
Name your space and choose Docker as the Space SDK (select the "Blank" template).
Connect your GitHub repository or push this codebase directly to the Space.

Go to your Space's Settings -> Variables and secrets and add the following New Secrets:

# Core LLM
GROQ_API_KEY=gsk_your_groq_key

# Metadata Database
NEON_DATABASE_URL=postgresql://your-neon-url...

# Cache
UPSTASH_REDIS_REST_URL=https://your-upstash-url...
UPSTASH_REDIS_REST_TOKEN=your_upstash_token...

Hugging Face will automatically build the Docker container (which downloads the 420MB local embeddings model during build) and boot up your API. Your multi-agent system is now fully live and free!