Spaces:

arthikrangan
/

data-analysis-agent

Sleeping

File size: 7,522 Bytes

---
title: "Data Analysis Agent"
emoji: "💬"
colorFrom: "indigo"
colorTo: "pink"
sdk: "streamlit"
sdk_version: "1.36.0"
app_file: "streamlit_app.py"
pinned: false
---

# Data Analysis Agent (Streamlit + LangGraph ReAct Agent)

A lightweight Streamlit application that ingests **Excel (.xlsx)** files into a local **DuckDB** database and lets you **chat with your data**. It uses a **LangGraph ReAct-style agent** to generate DuckDB SQL from natural language, refine on errors, and optionally render plots. Answers stream live, are aware of chat history, and show the generated SQL in an expander for transparency.

---

## Highlights

- **One‑click ingestion**: Excel → DuckDB using `excel_to_duckdb.py` (keeps ingestion logs).
- **Overview & preview**: Human‑friendly overview and quick table previews before chatting.
- **Chat with your dataset**: Natural‑language questions → SQL → answer (+ optional chart).
- **Streaming**: Answers appear token‑by‑token (“Answer pending…” shows immediately).
- **History‑aware**: Follow‑ups consider prior Q&A context for better intent & filters.
- **Conservative plotting**: Only plots when useful (trends/many rows or explicitly asked).
- **Clean UI**: Chart appears **below** the answer and **before** the generated SQL.
- **Session isolation**: Uploading a new file **clears** overview, previews, and chat.
- **Local by default**: DuckDB DB & plots are saved locally (`./dbs`, `./plots`).

---

## Repo Structure

```
streamlit_app.py          # Streamlit app (UI, flow, chat)
duckdb_react_agent.py     # LangGraph ReAct agent (SQL generation, refine, answer, plots)
excel_to_duckdb.py     # Excel → DuckDB ingestion script (unchanged behavior)
uploads/                  # Uploaded Excel files (created at runtime)
dbs/                      # DuckDB files per upload (created at runtime)
plots/                    # Charts generated by the agent (created at runtime)
```

> **Note:** The app intentionally skips internal tables (e.g., `__excel_*`) when discovering “user tables.” Those tables hold ingestion metadata and previews.

---

## Requirements

- Python 3.10+
- Recommended packages:
  ```bash
  pip install streamlit duckdb pandas matplotlib python-dotenv langgraph langchain langchain-openai
  ```
- **OpenAI** API key in environment or `.env` file:
  ```dotenv
  OPENAI_API_KEY=sk-...
  # optional: model override (defaults to gpt-4o-mini in app; gpt-4o in CLI agent)
  OPENAI_MODEL=gpt-4o-mini
  ```

If your ingestion or preview relies on Excel parsing features, make sure you have:
```bash
pip install openpyxl
```

---

## Quick Start

1. Place the three files side‑by‑side:
   ```
   streamlit_app.py
   duckdb_react_agent.py
   excel_to_duckdb.py
   ```

2. Set your API key (env var or `.env`):
   ```bash
   export OPENAI_API_KEY=sk-...        # macOS/Linux
   setx OPENAI_API_KEY "sk-..."        # Windows PowerShell (new sessions)
   ```

3. Run the app:
   ```bash
   streamlit run streamlit_app.py
   ```

4. In the UI:
   - **Upload** an `.xlsx` file → watch the **Processing logs**.
   - Once ingestion finishes, you’ll see the **overview** and **quick previews**.
   - Scroll down to **Chat with your dataset** and start asking questions.
   - Each bot reply shows: **answer text**, optional **chart**, and an expander for **SQL**.

> **New upload behavior:** As soon as you upload a different file, the app **clears** the previous overview, table previews, and chat, then ingests the new file.

---

## How It Works

### 1) Ingestion (Excel → DuckDB)
- The app shells out to `excel_to_duckdb.py`:
  - Detects multiple tables per sheet (using blank rows as separators), handles merged headers, ignores pivot/chart‑like blocks, and writes one DuckDB table per block.
  - Creates helpful metadata tables (e.g., `__excel_tables`, `__excel_schema`).
- The app then opens the DuckDB file and builds a **schema snapshot** for the agent.
- **User tables**: The app queries `information_schema` to list the “real” user tables (skips internals like `__excel_*`).

### 2) LangGraph ReAct Agent (DuckDB Q&A)
- **ReAct loop**: Draft a SELECT query → run → if error, **refine up to 3×** with the error and prior SQL.
- **Safety**: Only **SELECT** queries are permitted.
- **Streaming answers**: Final natural‑language answer streams to the UI.
- **History awareness**: Recent Q&A is summarized and fed to the LLM for better follow‑ups.
- **Plotting (optional)**: If the result truly benefits from visuals (or you ask for one), a chart is generated and saved to `./plots`; the app then displays it **below** the answer.

> The agent avoids revealing plot file names or paths. Answers refer generically to “the chart” and the UI renders the image inline.

---

## Configuration & Tweaks

- **Models**: Set `OPENAI_MODEL` in your env to change the LLM (`gpt-4o-mini`, `gpt-4o`, etc.).
- **Plot policy**: The agent is conservative by default. To make it more/less eager to chart, tune the **viz prompt** or the **row‑count threshold** in `duckdb_react_agent.py` (`many_rows = df.shape[0] > 20`).
- **Chart size**: In `streamlit_app.py`, the plot width is ~**560px** for answers and ~**320‑360px** in history; adjust to taste.
- **Schemas**: The schema summary focuses on `main`. To include more schemas, extend the `allowed_schemas` list where `get_schema_summary()` is called.

---

## Troubleshooting

**“No user tables were discovered. Showing ingestion metadata…”**  
- Your Excel may not have recognizable data blocks under the current heuristics. Check the **Processing logs** and the `__excel_tables` preview to verify what got detected.
- If it’s a header/merged‑cell issue, revisit your spreadsheet (ensure at least one empty row between blocks).

**Agent errors / SQL fails repeatedly**  
- The agent refines up to 3×. If it still fails, it returns the error. Try asking the question more explicitly (table names, time ranges, filters).

**Streamlit cache / stale code**  
- If updates don’t show up, restart Streamlit or clear cache:
  ```bash
  streamlit cache clear
  ```

**Windows console encoding warnings**  
- The app sets `PYTHONIOENCODING=utf-8` for ingestion logs. If you still see encoding issues, ensure system locale supports UTF‑8.

---

## Security & Privacy Notes

- **Local files**: DuckDB files and plots are stored locally by default (`./dbs`, `./plots`).
- **What goes to the LLM**: The agent sends the **schema snapshot** and a **small result preview** (first rows) along with your **question** and a compact **history summary**—not the full dataset. For stricter controls, reduce or strip the preview passed to the LLM in `duckdb_react_agent.py`.

---

## Extending the App

- **CSV support**: Add a CSV branch next to the Excel ingest that writes to the same DuckDB.  
- **Persisted chat**: Save `chat_history` to a file keyed by DB name to restore on reload.  
- **Role‑specific prompts**: Swap in system prompts for financial analysis, marketing analytics, etc.  
- **Auth & multi‑user**: Gate uploads and route per‑user databases under a workspace prefix.

---

## CLI (Agent Only)

You can also run the agent from the command line for ad‑hoc Q&A:

```bash
python duckdb_react_agent.py --duckdb ./dbs/mydata.duckdb --stream
```

Type your question; the agent prints streaming answers and shows where charts were saved (the Streamlit app displays them inline).

---

## License

Proprietary / internal. Adapt as needed for your environment.