Spaces:

arthikrangan
/

data-analysis-agent

Sleeping

App Files Files Community

data-analysis-agent / README.md

arthikrangan

Upload README.md

c9fb3f2 verified 7 months ago

preview code

raw

history blame contribute delete

7.52 kB

A newer version of the Streamlit SDK is available: 1.56.0

Upgrade

metadata

title: Data Analysis Agent
emoji: 💬
colorFrom: indigo
colorTo: pink
sdk: streamlit
sdk_version: 1.36.0
app_file: streamlit_app.py
pinned: false

Data Analysis Agent (Streamlit + LangGraph ReAct Agent)

A lightweight Streamlit application that ingests Excel (.xlsx) files into a local DuckDB database and lets you chat with your data. It uses a LangGraph ReAct-style agent to generate DuckDB SQL from natural language, refine on errors, and optionally render plots. Answers stream live, are aware of chat history, and show the generated SQL in an expander for transparency.

Highlights

One‑click ingestion: Excel → DuckDB using excel_to_duckdb.py (keeps ingestion logs).
Overview & preview: Human‑friendly overview and quick table previews before chatting.
Chat with your dataset: Natural‑language questions → SQL → answer (+ optional chart).
Streaming: Answers appear token‑by‑token (“Answer pending…” shows immediately).
History‑aware: Follow‑ups consider prior Q&A context for better intent & filters.
Conservative plotting: Only plots when useful (trends/many rows or explicitly asked).
Clean UI: Chart appears below the answer and before the generated SQL.
Session isolation: Uploading a new file clears overview, previews, and chat.
Local by default: DuckDB DB & plots are saved locally (./dbs, ./plots).

Repo Structure

streamlit_app.py          # Streamlit app (UI, flow, chat)
duckdb_react_agent.py     # LangGraph ReAct agent (SQL generation, refine, answer, plots)
excel_to_duckdb.py     # Excel → DuckDB ingestion script (unchanged behavior)
uploads/                  # Uploaded Excel files (created at runtime)
dbs/                      # DuckDB files per upload (created at runtime)
plots/                    # Charts generated by the agent (created at runtime)

Note: The app intentionally skips internal tables (e.g., __excel_*) when discovering “user tables.” Those tables hold ingestion metadata and previews.

Requirements

Python 3.10+

Recommended packages:

pip install streamlit duckdb pandas matplotlib python-dotenv langgraph langchain langchain-openai

OpenAI API key in environment or .env file:

OPENAI_API_KEY=sk-...
# optional: model override (defaults to gpt-4o-mini in app; gpt-4o in CLI agent)
OPENAI_MODEL=gpt-4o-mini

If your ingestion or preview relies on Excel parsing features, make sure you have:

pip install openpyxl

Quick Start

Place the three files side‑by‑side:

streamlit_app.py
duckdb_react_agent.py
excel_to_duckdb.py

Set your API key (env var or .env):

export OPENAI_API_KEY=sk-...        # macOS/Linux
setx OPENAI_API_KEY "sk-..."        # Windows PowerShell (new sessions)

Run the app:
```
streamlit run streamlit_app.py
```
In the UI:
- Upload an .xlsx file → watch the Processing logs.
- Once ingestion finishes, you’ll see the overview and quick previews.
- Scroll down to Chat with your dataset and start asking questions.
- Each bot reply shows: answer text, optional chart, and an expander for SQL.

New upload behavior: As soon as you upload a different file, the app clears the previous overview, table previews, and chat, then ingests the new file.

How It Works

1) Ingestion (Excel → DuckDB)

The app shells out to excel_to_duckdb.py:
- Detects multiple tables per sheet (using blank rows as separators), handles merged headers, ignores pivot/chart‑like blocks, and writes one DuckDB table per block.
- Creates helpful metadata tables (e.g., __excel_tables, __excel_schema).
The app then opens the DuckDB file and builds a schema snapshot for the agent.
User tables: The app queries information_schema to list the “real” user tables (skips internals like __excel_*).

2) LangGraph ReAct Agent (DuckDB Q&A)

ReAct loop: Draft a SELECT query → run → if error, refine up to 3× with the error and prior SQL.
Safety: Only SELECT queries are permitted.
Streaming answers: Final natural‑language answer streams to the UI.
History awareness: Recent Q&A is summarized and fed to the LLM for better follow‑ups.
Plotting (optional): If the result truly benefits from visuals (or you ask for one), a chart is generated and saved to ./plots; the app then displays it below the answer.

The agent avoids revealing plot file names or paths. Answers refer generically to “the chart” and the UI renders the image inline.

Configuration & Tweaks

Models: Set OPENAI_MODEL in your env to change the LLM (gpt-4o-mini, gpt-4o, etc.).
Plot policy: The agent is conservative by default. To make it more/less eager to chart, tune the viz prompt or the row‑count threshold in duckdb_react_agent.py (many_rows = df.shape[0] > 20).
Chart size: In streamlit_app.py, the plot width is ~560px for answers and ~320‑360px in history; adjust to taste.
Schemas: The schema summary focuses on main. To include more schemas, extend the allowed_schemas list where get_schema_summary() is called.

Troubleshooting

“No user tables were discovered. Showing ingestion metadata…”

Your Excel may not have recognizable data blocks under the current heuristics. Check the Processing logs and the __excel_tables preview to verify what got detected.
If it’s a header/merged‑cell issue, revisit your spreadsheet (ensure at least one empty row between blocks).

Agent errors / SQL fails repeatedly

The agent refines up to 3×. If it still fails, it returns the error. Try asking the question more explicitly (table names, time ranges, filters).

Streamlit cache / stale code

If updates don’t show up, restart Streamlit or clear cache:
```
streamlit cache clear
```

Windows console encoding warnings

The app sets PYTHONIOENCODING=utf-8 for ingestion logs. If you still see encoding issues, ensure system locale supports UTF‑8.

Security & Privacy Notes

Local files: DuckDB files and plots are stored locally by default (./dbs, ./plots).
What goes to the LLM: The agent sends the schema snapshot and a small result preview (first rows) along with your question and a compact history summary—not the full dataset. For stricter controls, reduce or strip the preview passed to the LLM in duckdb_react_agent.py.

Extending the App

CSV support: Add a CSV branch next to the Excel ingest that writes to the same DuckDB.
Persisted chat: Save chat_history to a file keyed by DB name to restore on reload.
Role‑specific prompts: Swap in system prompts for financial analysis, marketing analytics, etc.
Auth & multi‑user: Gate uploads and route per‑user databases under a workspace prefix.

CLI (Agent Only)

You can also run the agent from the command line for ad‑hoc Q&A:

python duckdb_react_agent.py --duckdb ./dbs/mydata.duckdb --stream

Type your question; the agent prints streaming answers and shows where charts were saved (the Streamlit app displays them inline).

License

Proprietary / internal. Adapt as needed for your environment.