| --- |
| title: Data Analyst Agent |
| emoji: 🤖 |
| colorFrom: blue |
| colorTo: indigo |
| sdk: docker |
| pinned: false |
| app_port: 7860 |
| --- |
| |
| # 🤖 Autonomous Multi-Agent Data Analytics System |
|
|
| 📖 **Read the Deep-Dive Article on Medium:** [Static Dashboards Are Dying: Inside an Autonomous Analytics Agent](https://medium.com/@rohitdeshmukh318/static-dashboards-are-dying-inside-an-autonomous-analytics-agent-04d871b32c07) |
|
|
| An advanced, self-correcting multi-agent AI system that transforms natural language queries into secure database operations, profiles data schemas, detects outliers dynamically, and compiles insights into professional analytical reports. |
|
|
| --- |
|
|
| ## 🛠️ System Architecture |
|
|
| The following diagram illustrates the complete end-to-end request-response life cycle, showcasing the **LangGraph Orchestrator**, secure **AST Sandboxing**, and dynamic **Multi-Tenant Database Ingestion**: |
|
|
| ```mermaid |
| flowchart TD |
| Client[Frontend Client - Vite/React] -->|1. Natural Language Query| API[FastAPI Backend] |
| API -->|2. Ingests Query| LangGraph[LangGraph Orchestrator] |
| |
| subgraph LangGraph [LangGraph Workflow State Machine] |
| direction TB |
| Router{Intent Router} -->|Direct General Chat| Synth[Insight Synthesizer] |
| Router -->|Data Query| Planner[Query Planner Node] |
| |
| Planner -->|Fetch Context| Memory[Memory Retriever Node] |
| Memory -->|Select Target Schema| SchemaRAG[Schema Vector Retrieval] |
| |
| SchemaRAG -->|Generate PostgreSQL/Pandas| Generator[Code Generator Node] |
| Generator -->|Verify AST| Validator{AST Safety Validator} |
| |
| Validator -->|Syntax/Policy Failure| SelfCorrect[Self-Correction Loop] |
| SelfCorrect -->|Error Traceback Feed| Generator |
| |
| Validator -->|Security Passed| Executor[Safe Execution Engine] |
| Executor -->|Inspect Output| Anomaly[Anomaly Detector Node] |
| Anomaly -->|Format Data & Anomalies| Synth |
| end |
| |
| subgraph Data & Execution Isolation |
| Executor -->|Isolated Read-Only Connect| UserDB[(Target User DB)] |
| SchemaRAG -->|Store/Match Vectors| MetaDB[(Central Metadata DB)] |
| end |
| |
| subgraph Shared Services |
| Generator -.->|Inference| Groq[Groq Llama 3.1 70B] |
| Planner -.->|Inference| Groq8B[Groq Llama 3.1 8B] |
| API -.->|Session Storage| Redis[Upstash Redis Cache] |
| end |
| |
| Synth -->|3. Server-Sent Events SSE| Client |
| ``` |
|
|
| --- |
|
|
| ### 🤖 Specialized Agent Nodes |
|
|
| The LangGraph-powered state machine decouples execution into cooperative, specialized agent nodes. Each node acts as a distinct agent with unique prompts, reasoning scopes, and target model configurations: |
|
|
| 1. **Intent Router Agent (LLaMA-3.1-8B):** Automatically scans raw user questions to classify user intent (`sql`, `pandas`, `insight_only`, or `unsupported`). Operates on the lightweight 8B model to keep initial routing latency under 100ms. |
| 2. **Query Planner Agent (LLaMA-3.1-8B):** Evaluates previous conversation memory alongside relevant schemas retrieved via Schema RAG to generate a structured JSON execution strategy (specifying necessary joins, table scopes, and expected query complexity). |
| 3. **Code Generator Agent (LLaMA-3.3-70B):** Executes the Planner's blueprint to write clean execution code. Generates ANSI-compliant SQL for database connectors or Pandas/Python scripts for CSV/flat files. Utilizes the larger 70B model for deep logical reasoning. |
| 4. **Safety Validator Agent (Pure-Python/AST):** An essential security gate. SQL operations are constrained to read-only `SELECT` calls. Python scripts are parsed into an Abstract Syntax Tree (AST) using Python's native `ast` module, immediately terminating executions containing blacklisted imports (`os`, `sys`, `subprocess`) or dangerous built-in functions (`open`, `eval`). |
| 5. **Error Classifier Agent (LLaMA-3.1-8B):** If code execution fails, this agent intercepts the compiler or database stack trace, classifies the exact issue (e.g. `UndefinedColumn`, `SyntaxError`), and structures a direct diagnosis prompt. |
| 6. **Self-Corrector Agent (LLaMA-3.3-70B):** Absorbs the broken script and the diagnostic report from the Error Classifier, automatically self-heals the code, and re-routes it back to the Safety Validator. This forms the system's self-correction loop (capped at 3 attempts). |
| 7. **Insight Synthesizer Agent (LLaMA-3.1-8B):** Receives raw, formatted tabular query results and synthesizes a concise, high-impact 2-3 sentence analytical summary focusing on business metrics and key statistics. |
|
|
| --- |
|
|
| ## 🔒 Security Architecture (Recruiter & Enterprise Review) |
|
|
| Allowing AI models to write and execute code on live relational databases presents extreme security risks. This platform solves this by enforcing **four layers of strict containment**: |
|
|
| 1. **Client-Side Credential Tokenization (Symmetric Key Cipher):** |
| When a user inputs their custom cloud PostgreSQL connection URI, the backend immediately encrypts it using a server-side symmetric block key derived dynamically using SHA-256 and an XOR stream cipher. The frontend only stores the secure token (`postgres-enc:XYZ...`) in its session state—meaning **raw passwords never touch server logs, browser caches, or persistent disk storage.** |
| 2. **Abstract Syntax Tree (AST) Parsing:** |
| Before any query is sent to a target database, it is compiled into an AST using `sqlglot`. If the parser detects any blocked operations (e.g., `DROP`, `DELETE`, `ALTER`, `UPDATE`, `INSERT`), the query is terminated instantly. |
| 3. **Restricted Python Sandboxing:** |
| For tabular files (CSVs), code is executed under `RestrictedPython`, blocking access to the local filesystem, shell, importing malicious packages, and private built-ins. |
| 4. **Session-Level Read-Only Isolation:** |
| All database sessions are forced into strict read-only transactions at the database driver level: |
| ```python |
| conn.set_session(readonly=True, autocommit=True) |
| ``` |
| Custom user database queries completely bypass the platform's global connection pool, ensuring zero session leakage or database cross-talk. |
|
|
| --- |
|
|
| ## 📖 User Perspective & Operating Guide |
|
|
| From an analyst's or business user's perspective, the system operates as a seamless companion dashboard: |
|
|
| ### 1. Ingesting Your Data Sources |
| Click **Upload Data** in the left sidebar. The system supports two methods of analysis: |
| * **Option A (Local Files):** Drag and drop a `.csv`, `.sqlite`, or `.db` file. The backend loads it dynamically into a sandbox-safe, fast in-memory database workspace. |
| * **Option B (Secure Cloud Database):** Paste a standard public PostgreSQL connection string (from Railway, AWS RDS, GCP Cloud SQL, Azure, etc.) and type your target schema name (e.g. `public`). |
| * The system will automatically establish an isolated connection, profile your schema, embed your table descriptions, and index them into semantic search memory. |
|
|
| ### 2. Conversational Analytical Chat |
| Navigate to the **Chat** screen: |
| * Choose your active database source using the selector at the top. |
| * Ask natural language questions like: *"Show me our monthly order volume for the past year and point out any spikes."* or *"What are the top 5 highest-grossing products?"* |
| * The agent dynamically translates the request, generates clean Postgres-dialect SQL, executes it, creates interactive chart specifications using Plotly, and provides statistical insights. |
|
|
| ### 3. Automated Anomaly Detection |
| Every query result is piped through our parallel statistical anomaly engine. If the system detects spikes, outlier points, or extreme values in your dataset, it highlights them automatically under the **Anomalies Found** banner. |
|
|
| ### 4. Interactive Metrics Dashboard |
| Click **Dashboard** in the sidebar to review your saved charts, add custom titles, and compile real-time metric cards. You can trace query execution status (e.g., Latency, LLM token count, Self-correction rates) directly in the **Metrics** panel. |
|
|
| ### 5. Professional Report Export |
| When your analysis is complete, click **Export PDF** at the top right of the Chat screen. The system compiles your session history, charts, and mathematical insights into a polished, print-ready executive PDF report. |
|
|
| --- |
|
|
| ## 🚀 Quickstart Guide (Local Development) |
|
|
| ### Prerequisites |
| * **Python 3.11+** installed |
| * **Node.js 18+** installed |
|
|
| ### 1. Configure the Environment |
| Clone the repository, copy the example environment file, and populate your API credentials: |
| ```bash |
| cp .env.example .env |
| ``` |
| *Specify your `GROQ_API_KEY`, your active `NEON_DATABASE_URL` (metadata store), and other optional parameters.* |
|
|
| ### 2. Initialize the Metadata Database |
| The platform requires a PostgreSQL database with `pgvector` enabled to act as its primary system store. Run the migrations script against your database to set up system tables: |
| ```bash |
| # Run migration setup script |
| python scripts/run_migration.py |
| ``` |
|
|
| ### 3. Seed Demo Data (Optional) |
| To test the system immediately with realistic e-commerce datasets (customers, orders, products): |
| ```bash |
| python scripts/seed_demo.py |
| ``` |
|
|
| ### 4. Boot the FastAPI Backend |
| ```bash |
| pip install -r requirements.txt |
| python -m uvicorn api.main:app --host 127.0.0.1 --port 8000 |
| ``` |
|
|
| ### 5. Boot the Vite/React Frontend |
| In a separate terminal: |
| ```bash |
| cd frontend |
| npm install |
| npm run dev |
| ``` |
|
|
| Open your browser to **`http://localhost:5173/`** to begin your analytical session! |
|
|
| --- |
|
|
| ## Deployed Production Usage (Recruiter & Public Demo Guide) |
|
|
| Once this project is deployed to a cloud platform (e.g. Railway, Render, AWS, GCP), users and technical recruiters can utilize the live system instantly without downloading any code or setting up local files. |
|
|
| ### 🌟 Instant Recruiter Testing (Zero Setup Required) |
| If you are evaluating this project as a hiring manager or recruiter: |
| 1. Open the live production URL of the deployed application (e.g. `https://your-data-analyst-agent.up.railway.app`). |
| 2. **Immediate Demo Database Access:** On the **Chat** page, look at the database selector in the top-left corner. It will be pre-configured with **`Demo DB (Neon)`**. |
| 3. **Ask Queries Instantly:** You can start chatting with the agent immediately using the pre-seeded e-commerce data (e.g. *"Show me total revenue per product category as a bar chart"* or *"Highlight any anomalies in monthly customer signups"*). |
| 4. **Connect Your Own Cloud Database:** Go to the **Upload Data** tab, click **Connect Cloud Database**, paste any standard PostgreSQL URL from Railway/AWS/GCP, and hit **Securely Link & Query Database** to immediately analyze your own datasets. |
|
|
| --- |
|
|
| ## ☁️ How to Deploy to Production (100% Free Stack) |
|
|
| This repository has been fully optimized to run on the ultimate free tier stack: **Hugging Face Spaces (Compute) + Neon (Database) + Upstash (Cache)**. This gives you enterprise-grade performance (16GB RAM) for $0/month. |
|
|
| ### 1. Set Up the Metadata Database (Neon.tech) |
| 1. Go to [Neon.tech](https://neon.tech/) and create a free account. |
| 2. Create a new Postgres database (Neon automatically supports `pgvector`). |
| 3. Copy your database connection string (it looks like `postgresql://...`). |
| 4. Run the migrations locally against this remote database to set up all system tables: |
| ```bash |
| NEON_DATABASE_URL="your-neon-url" python scripts/run_migration.py |
| ``` |
|
|
| ### 2. Set Up the Cache (Upstash) |
| 1. Go to [Upstash.com](https://upstash.com/) and create a free account. |
| 2. Create a new **Redis** database. |
| 3. Scroll down to the **REST API** section and copy your `UPSTASH_REDIS_REST_URL` and `UPSTASH_REDIS_REST_TOKEN`. |
|
|
| ### 3. Deploy to Hugging Face Spaces (Compute) |
| Hugging Face Spaces provides a free Docker environment with a massive 16GB of RAM. The `Dockerfile` is already pre-configured to expose port `7860` as required by HF Spaces! |
|
|
| 1. Go to [Hugging Face Spaces](https://huggingface.co/spaces) and click **Create new Space**. |
| 2. Name your space and choose **Docker** as the Space SDK (select the "Blank" template). |
| 3. Connect your GitHub repository or push this codebase directly to the Space. |
| 4. Go to your Space's **Settings** -> **Variables and secrets** and add the following **New Secrets**: |
|
|
| ```env |
| # Core LLM |
| GROQ_API_KEY=gsk_your_groq_key |
| |
| # Metadata Database |
| NEON_DATABASE_URL=postgresql://your-neon-url... |
| |
| # Cache |
| UPSTASH_REDIS_REST_URL=https://your-upstash-url... |
| UPSTASH_REDIS_REST_TOKEN=your_upstash_token... |
| ``` |
|
|
| 5. Hugging Face will automatically build the Docker container (which downloads the 420MB local embeddings model during build) and boot up your API. Your multi-agent system is now fully live and free! |
|
|
|
|