Spaces:
Sleeping
A newer version of the Streamlit SDK is available: 1.56.0
title: Data Analysis Agent
emoji: 💬
colorFrom: indigo
colorTo: pink
sdk: streamlit
sdk_version: 1.36.0
app_file: streamlit_app.py
pinned: false
Data Analysis Agent (Streamlit + LangGraph ReAct Agent)
A lightweight Streamlit application that ingests Excel (.xlsx) files into a local DuckDB database and lets you chat with your data. It uses a LangGraph ReAct-style agent to generate DuckDB SQL from natural language, refine on errors, and optionally render plots. Answers stream live, are aware of chat history, and show the generated SQL in an expander for transparency.
Highlights
- One‑click ingestion: Excel → DuckDB using
excel_to_duckdb.py(keeps ingestion logs). - Overview & preview: Human‑friendly overview and quick table previews before chatting.
- Chat with your dataset: Natural‑language questions → SQL → answer (+ optional chart).
- Streaming: Answers appear token‑by‑token (“Answer pending…” shows immediately).
- History‑aware: Follow‑ups consider prior Q&A context for better intent & filters.
- Conservative plotting: Only plots when useful (trends/many rows or explicitly asked).
- Clean UI: Chart appears below the answer and before the generated SQL.
- Session isolation: Uploading a new file clears overview, previews, and chat.
- Local by default: DuckDB DB & plots are saved locally (
./dbs,./plots).
Repo Structure
streamlit_app.py # Streamlit app (UI, flow, chat)
duckdb_react_agent.py # LangGraph ReAct agent (SQL generation, refine, answer, plots)
excel_to_duckdb.py # Excel → DuckDB ingestion script (unchanged behavior)
uploads/ # Uploaded Excel files (created at runtime)
dbs/ # DuckDB files per upload (created at runtime)
plots/ # Charts generated by the agent (created at runtime)
Note: The app intentionally skips internal tables (e.g.,
__excel_*) when discovering “user tables.” Those tables hold ingestion metadata and previews.
Requirements
- Python 3.10+
- Recommended packages:
pip install streamlit duckdb pandas matplotlib python-dotenv langgraph langchain langchain-openai - OpenAI API key in environment or
.envfile:OPENAI_API_KEY=sk-... # optional: model override (defaults to gpt-4o-mini in app; gpt-4o in CLI agent) OPENAI_MODEL=gpt-4o-mini
If your ingestion or preview relies on Excel parsing features, make sure you have:
pip install openpyxl
Quick Start
Place the three files side‑by‑side:
streamlit_app.py duckdb_react_agent.py excel_to_duckdb.pySet your API key (env var or
.env):export OPENAI_API_KEY=sk-... # macOS/Linux setx OPENAI_API_KEY "sk-..." # Windows PowerShell (new sessions)Run the app:
streamlit run streamlit_app.pyIn the UI:
- Upload an
.xlsxfile → watch the Processing logs. - Once ingestion finishes, you’ll see the overview and quick previews.
- Scroll down to Chat with your dataset and start asking questions.
- Each bot reply shows: answer text, optional chart, and an expander for SQL.
- Upload an
New upload behavior: As soon as you upload a different file, the app clears the previous overview, table previews, and chat, then ingests the new file.
How It Works
1) Ingestion (Excel → DuckDB)
- The app shells out to
excel_to_duckdb.py:- Detects multiple tables per sheet (using blank rows as separators), handles merged headers, ignores pivot/chart‑like blocks, and writes one DuckDB table per block.
- Creates helpful metadata tables (e.g.,
__excel_tables,__excel_schema).
- The app then opens the DuckDB file and builds a schema snapshot for the agent.
- User tables: The app queries
information_schemato list the “real” user tables (skips internals like__excel_*).
2) LangGraph ReAct Agent (DuckDB Q&A)
- ReAct loop: Draft a SELECT query → run → if error, refine up to 3× with the error and prior SQL.
- Safety: Only SELECT queries are permitted.
- Streaming answers: Final natural‑language answer streams to the UI.
- History awareness: Recent Q&A is summarized and fed to the LLM for better follow‑ups.
- Plotting (optional): If the result truly benefits from visuals (or you ask for one), a chart is generated and saved to
./plots; the app then displays it below the answer.
The agent avoids revealing plot file names or paths. Answers refer generically to “the chart” and the UI renders the image inline.
Configuration & Tweaks
- Models: Set
OPENAI_MODELin your env to change the LLM (gpt-4o-mini,gpt-4o, etc.). - Plot policy: The agent is conservative by default. To make it more/less eager to chart, tune the viz prompt or the row‑count threshold in
duckdb_react_agent.py(many_rows = df.shape[0] > 20). - Chart size: In
streamlit_app.py, the plot width is ~560px for answers and ~320‑360px in history; adjust to taste. - Schemas: The schema summary focuses on
main. To include more schemas, extend theallowed_schemaslist whereget_schema_summary()is called.
Troubleshooting
“No user tables were discovered. Showing ingestion metadata…”
- Your Excel may not have recognizable data blocks under the current heuristics. Check the Processing logs and the
__excel_tablespreview to verify what got detected. - If it’s a header/merged‑cell issue, revisit your spreadsheet (ensure at least one empty row between blocks).
Agent errors / SQL fails repeatedly
- The agent refines up to 3×. If it still fails, it returns the error. Try asking the question more explicitly (table names, time ranges, filters).
Streamlit cache / stale code
- If updates don’t show up, restart Streamlit or clear cache:
streamlit cache clear
Windows console encoding warnings
- The app sets
PYTHONIOENCODING=utf-8for ingestion logs. If you still see encoding issues, ensure system locale supports UTF‑8.
Security & Privacy Notes
- Local files: DuckDB files and plots are stored locally by default (
./dbs,./plots). - What goes to the LLM: The agent sends the schema snapshot and a small result preview (first rows) along with your question and a compact history summary—not the full dataset. For stricter controls, reduce or strip the preview passed to the LLM in
duckdb_react_agent.py.
Extending the App
- CSV support: Add a CSV branch next to the Excel ingest that writes to the same DuckDB.
- Persisted chat: Save
chat_historyto a file keyed by DB name to restore on reload. - Role‑specific prompts: Swap in system prompts for financial analysis, marketing analytics, etc.
- Auth & multi‑user: Gate uploads and route per‑user databases under a workspace prefix.
CLI (Agent Only)
You can also run the agent from the command line for ad‑hoc Q&A:
python duckdb_react_agent.py --duckdb ./dbs/mydata.duckdb --stream
Type your question; the agent prints streaming answers and shows where charts were saved (the Streamlit app displays them inline).
License
Proprietary / internal. Adapt as needed for your environment.