Upload 5 files

- README.md (+161 -14)
- duckdb_react_agent.py (+628 -0)
- excel_to_duckdb.py (+595 -0)
- requirements.txt (+13 -3)
- streamlit_app.py (+574 -0)
**README.md** (CHANGED)
# Data Analysis Agent (Streamlit + LangGraph ReAct Agent)

A lightweight Streamlit application that ingests **Excel (.xlsx)** files into a local **DuckDB** database and lets you **chat with your data**. It uses a **LangGraph ReAct-style agent** to generate DuckDB SQL from natural language, refine on errors, and optionally render plots. Answers stream live, are aware of chat history, and show the generated SQL in an expander for transparency.

---

## Highlights

- **One-click ingestion**: Excel → DuckDB using `excel_to_duckdb.py` (keeps ingestion logs).
- **Overview & preview**: Human-friendly overview and quick table previews before chatting.
- **Chat with your dataset**: Natural-language questions → SQL → answer (+ optional chart).
- **Streaming**: Answers appear token by token ("Answer pending…" shows immediately).
- **History-aware**: Follow-ups consider prior Q&A context for better intent & filters.
- **Conservative plotting**: Only plots when useful (trends/many rows or explicitly asked).
- **Clean UI**: The chart appears **below** the answer and **before** the generated SQL.
- **Session isolation**: Uploading a new file **clears** the overview, previews, and chat.
- **Local by default**: DuckDB database files & plots are saved locally (`./dbs`, `./plots`).

---

## Repo Structure

```
streamlit_app.py        # Streamlit app (UI, flow, chat)
duckdb_react_agent.py   # LangGraph ReAct agent (SQL generation, refine, answer, plots)
excel_to_duckdb.py      # Excel → DuckDB ingestion script (unchanged behavior)
uploads/                # Uploaded Excel files (created at runtime)
dbs/                    # DuckDB files per upload (created at runtime)
plots/                  # Charts generated by the agent (created at runtime)
```

> **Note:** The app intentionally skips internal tables (e.g., `__excel_*`) when discovering "user tables." Those tables hold ingestion metadata and previews.

---
## Requirements

- Python 3.10+
- Recommended packages:
  ```bash
  pip install streamlit duckdb pandas matplotlib python-dotenv langgraph langchain langchain-openai
  ```
- An **OpenAI** API key in the environment or a `.env` file:
  ```dotenv
  OPENAI_API_KEY=sk-...
  # optional: model override (defaults to gpt-4o-mini in the app; gpt-4o in the CLI agent)
  OPENAI_MODEL=gpt-4o-mini
  ```

If your ingestion or preview relies on Excel parsing features, also install:

```bash
pip install openpyxl
```

---
## Quick Start

1. Place the three files side by side:
   ```
   streamlit_app.py
   duckdb_react_agent.py
   excel_to_duckdb.py
   ```

2. Set your API key (env var or `.env`):
   ```bash
   export OPENAI_API_KEY=sk-...   # macOS/Linux
   setx OPENAI_API_KEY "sk-..."   # Windows (takes effect in new sessions)
   ```

3. Run the app:
   ```bash
   streamlit run streamlit_app.py
   ```

4. In the UI:
   - **Upload** an `.xlsx` file → watch the **Processing logs**.
   - Once ingestion finishes, you'll see the **overview** and **quick previews**.
   - Scroll down to **Chat with your dataset** and start asking questions.
   - Each bot reply shows the **answer text**, an optional **chart**, and an expander with the generated **SQL**.

> **New upload behavior:** As soon as you upload a different file, the app **clears** the previous overview, table previews, and chat, then ingests the new file.

---
## How It Works

### 1) Ingestion (Excel → DuckDB)

- The app shells out to `excel_to_duckdb.py`, which:
  - Detects multiple tables per sheet (using blank rows as separators), handles merged headers, ignores pivot/chart-like blocks, and writes one DuckDB table per block.
  - Creates helpful metadata tables (e.g., `__excel_tables`, `__excel_schema`).
- The app then opens the DuckDB file and builds a **schema snapshot** for the agent.
- **User tables**: The app queries `information_schema` to list the "real" user tables (skipping internals like `__excel_*`).
### 2) LangGraph ReAct Agent (DuckDB Q&A)

- **ReAct loop**: Draft a SELECT query → run it → on error, **refine up to 3×** using the error message and the prior SQL.
- **Safety**: Only **SELECT** queries are permitted.
- **Streaming answers**: The final natural-language answer streams to the UI.
- **History awareness**: Recent Q&A is summarized and fed to the LLM for better follow-ups.
- **Plotting (optional)**: If the result truly benefits from a visual (or you ask for one), a chart is generated and saved to `./plots`; the app then displays it **below** the answer.

> The agent avoids revealing plot file names or paths. Answers refer generically to "the chart" and the UI renders the image inline.
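Stripped of the LLM calls, the draft → run → refine cycle reduces to a small retry loop. The sketch below uses assumed function names (`draft_sql`, `run_sql`, `refine_sql`); the real agent routes these steps through a LangGraph `StateGraph` instead of a plain loop:

```python
def run_with_refinement(draft_sql, run_sql, refine_sql, max_refinements=3):
    # draft_sql() -> str, run_sql(sql) -> result (raises on bad SQL),
    # refine_sql(sql, error) -> str. Names are illustrative.
    sql = draft_sql()
    for attempt in range(max_refinements + 1):
        try:
            return sql, run_sql(sql)
        except Exception as err:
            if attempt == max_refinements:
                raise  # give up after the allotted refinements
            sql = refine_sql(sql, str(err))

# Toy demo: the first two attempts "fail", the third succeeds.
attempts = []
def fake_run(sql):
    attempts.append(sql)
    if len(attempts) < 3:
        raise ValueError("syntax error")
    return [("ok",)]

sql, result = run_with_refinement(lambda: "SELECT 1", fake_run, lambda s, e: s + " -- fixed")
print(sql, result)
```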
---

## Configuration & Tweaks

- **Models**: Set `OPENAI_MODEL` in your environment to change the LLM (`gpt-4o-mini`, `gpt-4o`, etc.).
- **Plot policy**: The agent is conservative by default. To make it more or less eager to chart, tune the **viz prompt** or the **row-count threshold** in `duckdb_react_agent.py` (`many_rows = df.shape[0] > 20`).
- **Chart size**: In `streamlit_app.py`, the plot width is ~**560px** for answers and ~**320-360px** in history; adjust to taste.
- **Schemas**: The schema summary focuses on `main`. To include more schemas, extend the `allowed_schemas` list where `get_schema_summary()` is called.
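The model override boils down to an environment lookup with a per-entrypoint default. A small sketch (`pick_model` is an illustrative name, not a function in the repo):

```python
import os

def pick_model(default: str = "gpt-4o-mini") -> str:
    # OPENAI_MODEL wins when set; otherwise fall back to the entrypoint's
    # default (the app defaults to gpt-4o-mini, the CLI agent to gpt-4o).
    return os.environ.get("OPENAI_MODEL", default)

os.environ.pop("OPENAI_MODEL", None)
print(pick_model())            # gpt-4o-mini
os.environ["OPENAI_MODEL"] = "gpt-4o"
print(pick_model())            # gpt-4o
```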
---
## Troubleshooting

**"No user tables were discovered. Showing ingestion metadata…"**
- Your Excel file may not contain data blocks the current heuristics recognize. Check the **Processing logs** and the `__excel_tables` preview to verify what was detected.
- If it's a header/merged-cell issue, revisit your spreadsheet (ensure at least one empty row between blocks).

**Agent errors / SQL fails repeatedly**
- The agent refines up to 3×. If it still fails, it returns the error. Try asking the question more explicitly (table names, time ranges, filters).

**Streamlit cache / stale code**
- If updates don't show up, restart Streamlit or clear the cache:
  ```bash
  streamlit cache clear
  ```

**Windows console encoding warnings**
- The app sets `PYTHONIOENCODING=utf-8` for ingestion logs. If you still see encoding issues, make sure your system locale supports UTF-8.

---
## Security & Privacy Notes

- **Local files**: DuckDB files and plots are stored locally by default (`./dbs`, `./plots`).
- **What goes to the LLM**: The agent sends the **schema snapshot** and a **small result preview** (first rows) along with your **question** and a compact **history summary**—not the full dataset. For stricter controls, reduce or strip the preview passed to the LLM in `duckdb_react_agent.py`.
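The preview cap is two knobs: a row slice and a character limit. A stdlib sketch mirroring the agent's behavior (it sends the first 50 rows as JSON records and truncates the text at ~4000 characters in `node_answer`); `make_preview` is an illustrative name, and lowering both arguments tightens what reaches the LLM:

```python
import json

def make_preview(rows, max_rows=50, max_chars=4000):
    # First rows as JSON records, then a hard character cap.
    text = json.dumps(rows[:max_rows], default=str)
    return text if len(text) <= max_chars else text[:max_chars] + " ..."

rows = [{"region": "EU", "amount": i} for i in range(100)]
short = make_preview(rows, max_rows=5, max_chars=120)
print(short)  # at most 120 chars of JSON plus a truncation marker
```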
---
## Extending the App

- **CSV support**: Add a CSV branch next to the Excel ingest that writes to the same DuckDB.
- **Persisted chat**: Save `chat_history` to a file keyed by DB name to restore on reload.
- **Role-specific prompts**: Swap in system prompts for financial analysis, marketing analytics, etc.
- **Auth & multi-user**: Gate uploads and route per-user databases under a workspace prefix.
---
## CLI (Agent Only)

You can also run the agent from the command line for ad-hoc Q&A:

```bash
python duckdb_react_agent.py --duckdb ./dbs/mydata.duckdb --stream
```

Type your question; the agent prints streaming answers and shows where charts were saved (the Streamlit app displays them inline).

---

## License

Proprietary / internal. Adapt as needed for your environment.
**duckdb_react_agent.py** (ADDED)
"""
DuckDB Q&A Agent (LangGraph + ReAct) — Streaming Edition
========================================================

Adds:
- Streaming responses for the final natural-language answer (CLI prints tokens live).
- Stricter schema discovery focused on *user* tables from the provided DB (defaults to schema 'main').
- Keeps SELECT-only safety, up to 3 SQL refinements on error, optional plotting saved to ./plots.

Quick start
-----------
1) Install deps (example):
   pip install --upgrade duckdb pandas matplotlib python-dotenv langgraph langchain langchain-openai

2) Ensure an OpenAI API key is available:
   - Put it in a .env file as: OPENAI_API_KEY=sk-...
   - Or set an env var: set OPENAI_API_KEY=... (Windows) / export OPENAI_API_KEY=... (macOS/Linux)

3) Run the agent in an interactive loop:
   python duckdb_react_agent.py --duckdb path/to/your.db --stream

Notes
-----
- Targets DuckDB SQL. The LLM prompt instructs it to write *only* a SELECT statement.
- ReAct loop: on error, the LLM sees the error message + previous SQL and attempts a fix (max 3 tries).
- The agent avoids DDL/DML; SELECT-only for safety.
- Plots are saved to ./plots/ as PNG files (no GUI required). The script uses a non-interactive backend.
- Internal/helper tables beginning with "__" (e.g., ingestion metadata) and non-user schemas are ignored.
"""
import os
import re
import json
import uuid
import argparse
from typing import Any, Dict, List, Optional, TypedDict, Callable

# Non-GUI backend for saving plots on servers/Windows terminals
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

import duckdb
import pandas as pd

from dotenv import load_dotenv, find_dotenv

# LangChain / LangGraph
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
from langgraph.graph import StateGraph, END

# -----------------------------
# Utility: safe number formatting
# -----------------------------
def format_number(x: Any) -> str:
    try:
        if isinstance(x, (int, float)) and not isinstance(x, bool):
            if abs(x) >= 1000 and abs(x) < 1e12:
                return f"{x:,.2f}".rstrip('0').rstrip('.')
            else:
                return f"{x:.4g}" if isinstance(x, float) else str(x)
        return str(x)
    except Exception:
        return str(x)


# -----------------------------
# Schema introspection from DuckDB (user tables only)
# -----------------------------
def get_schema_summary(con: duckdb.DuckDBPyConnection, sample_rows: int = 3, include_views: bool = True,
                       allowed_schemas: Optional[List[str]] = None) -> str:
    """
    Build a compact schema snapshot for user tables in the *provided DB*.
    By default we only list tables/views in schema 'main' (user content) and skip internals.
    """
    if allowed_schemas is None:
        allowed_schemas = ["main"]

    type_filter = "('BASE TABLE')" if not include_views else "('BASE TABLE','VIEW')"

    tables_df = con.execute(
        f"""
        SELECT table_schema, table_name, table_type
        FROM information_schema.tables
        WHERE table_type IN {type_filter}
          AND table_schema IN ({','.join(['?'] * len(allowed_schemas))})
          AND table_name NOT LIKE 'duckdb_%'
          AND table_name NOT LIKE 'sqlite_%'
        ORDER BY table_schema, table_name
        """,
        allowed_schemas,
    ).fetchdf()

    lines: List[str] = []
    for _, row in tables_df.iterrows():
        schema = row["table_schema"]
        name = row["table_name"]
        if name.startswith("__"):
            # Common ingestion metadata; not user-facing
            continue

        cols = con.execute(
            """
            SELECT column_name, data_type
            FROM information_schema.columns
            WHERE table_schema = ? AND table_name = ?
            ORDER BY ordinal_position
            """,
            [schema, name],
        ).fetchdf()

        lines.append(f"TABLE {schema}.{name}")
        if len(cols) == 0:
            lines.append("  (no columns discovered)")
            continue

        # Sample a small set of values per column to guide the LLM
        try:
            sample = con.execute(f"SELECT * FROM {schema}.{name} LIMIT {sample_rows}").fetchdf()
        except Exception:
            sample = pd.DataFrame(columns=cols["column_name"].tolist())

        for _, c in cols.iterrows():
            col = c["column_name"]
            dtype = c["data_type"]
            examples: List[str] = []
            if col in sample.columns:
                for v in sample[col].tolist():
                    examples.append(format_number(v)[:80])
            example_str = ", ".join(examples) if examples else ""
            lines.append(f"  - {col} :: {dtype}  e.g. [{example_str}]")
        lines.append("")

    if not lines:
        lines.append("(No user tables discovered in schema(s): " + ", ".join(allowed_schemas) + ")")

    return "\n".join(lines)

# -----------------------------
# LLM helpers
# -----------------------------
def make_llm(model: str = "gpt-4o-mini", temperature: float = 0.0) -> ChatOpenAI:
    return ChatOpenAI(model=model, temperature=temperature)


SQL_SYSTEM_PROMPT = """You are an expert DuckDB SQL writer. You will receive:
- A database schema (tables and columns) with a few sample values
- A user's natural-language question

Write exactly ONE valid DuckDB SQL SELECT statement to answer the question.
Rules:
- Output ONLY the SQL (no backticks, no explanation, no comments)
- Use DuckDB SQL
- Never write DDL/DML (CREATE/INSERT/UPDATE/DELETE), only SELECT statements
- Prefer explicit column names, avoid SELECT *
- If time grouping is needed, use date_trunc('month', col), date_trunc('day', col), etc.
- If casting is needed, use CAST(... AS TYPE)
- Use ONLY the listed user tables from the provided database (e.g., schema 'main'). Do not rely on external/attached sources.
- Avoid tables starting with "__" unless explicitly referenced.
- If uncertain, make a reasonable assumption and produce the best possible query

Be careful with joins and filters. Ensure SQL parses successfully.
"""

REFINE_SYSTEM_PROMPT = """You are fixing a DuckDB SQL query that errored. You'll see:
- the user's question
- the previous SQL
- the exact error message

Return a corrected DuckDB SELECT statement that addresses the error.
Rules:
- Output ONLY the SQL (no backticks, no explanation, no comments)
- SELECT-only (no DDL/DML)
- Keep the intent of the question intact
- Fix joins, field names, casts, aggregations or date functions as needed
- Use ONLY the listed user tables from the provided database (e.g., schema 'main')
"""

ANSWER_SYSTEM_PROMPT = """You are a helpful data analyst. Given a user's question and the query result data,
write a clear, concise, conversational answer in plain English. Use bullet points or short paragraphs.
- Format large numbers with thousands separators where appropriate.
- If percentages are present, include % signs.
- If a chart/image accompanies the answer, DO NOT mention file paths or filenames. Refer generically, e.g., 'as depicted in the chart' or 'see the chart alongside this answer.'
- If the result is empty, say so and suggest a next step.
- Do not invent columns or values that are not in the result.
- (Optional) You may refer to prior Q&A context if it informs this answer.
"""

VIZ_SYSTEM_PROMPT = """You will decide whether a chart would help answer the question using the SQL result.
Return STRICT JSON with keys: {"make_plot": bool, "chart": "line|bar|scatter|hist|box|pie", "x": "<col or null>", "y": "<col or null>", "series": "<col or null>", "agg": "<sum|avg|count|max|min|null>", "reason": "<short reason>"}.
Conservative rules (prefer NO plot):
- Only set make_plot=true if the user explicitly asks for a chart/visualization OR the result involves trends over time, distributions, or comparisons with many rows (e.g., > 20) where a chart adds clarity.
- For simple aggregates or few rows (<= 10), set make_plot=false unless explicitly requested.
Guidelines when make_plot=true:
- Prefer line for trends over time
- Prefer bar for ranking/comparison of categories
- Prefer scatter for correlation between two numeric columns
- Prefer hist for distribution of a single numeric column
- Prefer box for distribution with quartiles per category
- Prefer pie sparingly for simple part-to-whole
JSON only. No prose.
"""


def extract_sql(text: str) -> str:
    """Extract SQL from potential LLM output. We expect raw SQL with no fences, but be defensive."""
    text = text.strip()
    fence = re.compile(r"```(?:sql)?(.*?)```", re.DOTALL | re.IGNORECASE)
    m = fence.search(text)
    if m:
        return m.group(1).strip()
    return text

# -----------------------------
# LangGraph State
# -----------------------------
class AgentState(TypedDict):
    question: str
    schema: str
    attempts: int
    sql: Optional[str]
    error: Optional[str]
    result_json: Optional[str]  # JSON records preview
    result_columns: Optional[List[str]]
    plot_path: Optional[str]
    final_answer: Optional[str]
    _result_df: Optional[pd.DataFrame]
    _viz_spec: Optional[Dict[str, Any]]
    _stream: bool
    _token_cb: Optional[Callable[[str], None]]

# -----------------------------
# Nodes: SQL drafting, execution, refinement, viz, answer
# -----------------------------
def node_draft_sql(state: AgentState, llm: ChatOpenAI) -> AgentState:
    msgs = [
        SystemMessage(content=SQL_SYSTEM_PROMPT),
        HumanMessage(content=f"SCHEMA:\n{state['schema']}\n\nQUESTION:\n{state['question']}"),
    ]
    resp = llm.invoke(msgs)  # type: ignore
    sql = extract_sql(resp.content or "")
    return {**state, "sql": sql, "error": None}


def run_duckdb_query(con: duckdb.DuckDBPyConnection, sql: str) -> pd.DataFrame:
    first_token = re.split(r"\s+", sql.strip(), maxsplit=1)[0].upper()
    if first_token != "SELECT" and not sql.strip().upper().startswith("WITH "):
        raise ValueError("Only SELECT queries are allowed.")
    df = con.execute(sql).fetchdf()
    return df


def node_run_sql(state: AgentState, con: duckdb.DuckDBPyConnection) -> AgentState:
    try:
        df = run_duckdb_query(con, state["sql"] or "")
        preview = df.head(50).to_dict(orient="records")
        return {
            **state,
            "error": None,
            "result_json": json.dumps(preview, default=str),
            "result_columns": list(df.columns),
            "_result_df": df,
        }
    except Exception as e:
        return {**state, "error": str(e), "result_json": None, "result_columns": None, "_result_df": None}


def node_refine_sql(state: AgentState, llm: ChatOpenAI) -> AgentState:
    if state.get("attempts", 0) >= 3:
        return state
    msgs = [
        SystemMessage(content=REFINE_SYSTEM_PROMPT),
        HumanMessage(
            content=(
                f"QUESTION:\n{state['question']}\n\n"
                f"PREVIOUS SQL:\n{state.get('sql', '')}\n\n"
                f"ERROR:\n{state.get('error', '')}"
            )
        ),
    ]
    resp = llm.invoke(msgs)  # type: ignore
    sql = extract_sql(resp.content or "")
    return {**state, "sql": sql, "attempts": state.get("attempts", 0) + 1, "error": None}

def node_decide_viz(state: AgentState, llm: ChatOpenAI) -> AgentState:
    if not state.get("result_json"):
        return state

    result_preview = state["result_json"]
    msgs = [
        SystemMessage(content=VIZ_SYSTEM_PROMPT),
        HumanMessage(
            content=(
                f"QUESTION:\n{state['question']}\n\n"
                f"COLUMNS: {state.get('result_columns', [])}\n"
                f"RESULT PREVIEW (first rows):\n{result_preview}"
            )
        ),
    ]
    resp = llm.invoke(msgs)  # type: ignore

    spec = {"make_plot": False}
    try:
        spec = json.loads(resp.content)  # type: ignore
        if not isinstance(spec, dict) or "make_plot" not in spec:
            spec = {"make_plot": False}
    except Exception:
        spec = {"make_plot": False}

    return {**state, "_viz_spec": spec}

def node_make_plot(state: AgentState) -> AgentState:
    spec = state.get("_viz_spec") or {}
    if not spec or not spec.get("make_plot"):
        return state

    df: Optional[pd.DataFrame] = state.get("_result_df")
    if df is None or df.empty:
        return state

    # Additional guard: avoid trivial plots unless explicitly requested
    q_lower = (state.get('question') or '').lower()
    explicit_viz = any(k in q_lower for k in ['chart', 'plot', 'graph', 'visual', 'visualize', 'trend'])
    many_rows = df.shape[0] > 20
    if not explicit_viz and not many_rows:
        return state

    x = spec.get("x")
    y = spec.get("y")
    series = spec.get("series")
    chart = (spec.get("chart") or "").lower()

    def col_ok(c: Optional[str]) -> bool:
        return isinstance(c, str) and c in df.columns

    if not col_ok(x) and df.shape[1] >= 1:
        x = df.columns[0]
    if not col_ok(y) and df.shape[1] >= 2:
        y = df.columns[1]

    try:
        os.makedirs("plots", exist_ok=True)
        fig = plt.figure()
        ax = fig.gca()

        if chart == "line":
            if series and col_ok(series):
                for k, g in df.groupby(series):
                    ax.plot(g[x], g[y], label=str(k))
                ax.legend(loc="best")
            else:
                ax.plot(df[x], df[y])
            ax.set_xlabel(str(x)); ax.set_ylabel(str(y))

        elif chart == "bar":
            if series and col_ok(series):
                pivot = df.pivot_table(index=x, columns=series, values=y, aggfunc="sum", fill_value=0)
                pivot.plot(kind="bar", ax=ax)
                ax.set_xlabel(str(x)); ax.set_ylabel(str(y))
            else:
                ax.bar(df[x], df[y])
                ax.set_xlabel(str(x)); ax.set_ylabel(str(y))
            fig.autofmt_xdate(rotation=45)

        elif chart == "scatter":
            ax.scatter(df[x], df[y])
            ax.set_xlabel(str(x)); ax.set_ylabel(str(y))

        elif chart == "hist":
            ax.hist(df[y] if col_ok(y) else df[x], bins=30)
            ax.set_xlabel(str(y if col_ok(y) else x)); ax.set_ylabel("Frequency")

        elif chart == "box":
            ax.boxplot([df[y]] if col_ok(y) else [df[x]])
            ax.set_ylabel(str(y if col_ok(y) else x))

        elif chart == "pie":
            labels = df[x].astype(str).tolist()
            values = df[y].tolist()
            ax.pie(values, labels=labels, autopct="%1.1f%%")
            ax.axis("equal")

        else:
            plt.close(fig)
            return state

        out_path = os.path.join("plots", f"duckdb_answer_{uuid.uuid4().hex[:8]}.png")
        fig.tight_layout()
        fig.savefig(out_path, dpi=150)
        plt.close(fig)
        return {**state, "plot_path": out_path}

    except Exception:
        return state

def _stream_llm_answer(llm: ChatOpenAI, msgs: List[Any], token_cb: Optional[Callable[[str], None]]) -> str:
    """Stream tokens for the final answer; returns the full text as well."""
    out = ""
    try:
        for chunk in llm.stream(msgs):  # type: ignore
            piece = getattr(chunk, "content", None)
            if not piece and hasattr(chunk, "message") and getattr(chunk, "message", None):
                piece = getattr(chunk.message, "content", "")
            if not piece:
                continue
            out += piece
            if token_cb:
                token_cb(piece)
    except Exception:
        # Fall back to non-streaming if streaming is not supported
        resp = llm.invoke(msgs)  # type: ignore
        out = resp.content if isinstance(resp.content, str) else str(resp.content)
        if token_cb:
            token_cb(out)
    return out

def node_answer(state: AgentState, llm: ChatOpenAI) -> AgentState:
|
| 427 |
+
preview = state.get("result_json") or "[]"
|
| 428 |
+
columns = state.get("result_columns") or []
|
| 429 |
+
plot_path = state.get("plot_path")
|
| 430 |
+
if len(preview) > 4000:
|
| 431 |
+
preview = preview[:4000] + " ..."
|
| 432 |
+
|
| 433 |
+
msgs = [
|
| 434 |
+
SystemMessage(content=ANSWER_SYSTEM_PROMPT),
|
| 435 |
+
HumanMessage(
|
| 436 |
+
content=(
|
| 437 |
+
f"QUESTION:\n{state['question']}\n\n"
|
| 438 |
+
f"COLUMNS: {columns}\n"
|
| 439 |
+
f"RESULT PREVIEW (rows):\n{preview}\n\n"
|
| 440 |
+
f"PLOT_PATH: {plot_path if plot_path else 'None'}"
|
| 441 |
+
)
|
| 442 |
+
),
|
| 443 |
+
]
|
| 444 |
+
|
| 445 |
+
if state.get("error") and not state.get("result_json"):
|
| 446 |
+
err_text = (
|
| 447 |
+
"I couldn't produce a working SQL query after 3 attempts.\n\n"
|
| 448 |
+
"Details:\n" + (state.get("error") or "Unknown error")
|
| 449 |
+
)
|
| 450 |
+
# Stream the error too for consistency
|
| 451 |
+
if state.get("_stream") and state.get("_token_cb"):
|
| 452 |
+
state["_token_cb"](err_text)
|
| 453 |
+
return {**state, "final_answer": err_text}
|
| 454 |
+
|
| 455 |
+
# Stream the final answer if requested
|
| 456 |
+
if state.get("_stream"):
|
| 457 |
+
answer = _stream_llm_answer(llm, msgs, state.get("_token_cb"))
|
| 458 |
+
else:
|
| 459 |
+
resp = llm.invoke(msgs) # type: ignore
|
| 460 |
+
answer = resp.content if isinstance(resp.content, str) else str(resp.content)
|
| 461 |
+
|
| 462 |
+
return {**state, "final_answer": answer}
|
| 463 |
+
|
| 464 |
+
|
| 465 |
+
# -----------------------------
|
| 466 |
+
# Graph assembly
|
| 467 |
+
# -----------------------------
|
| 468 |
+
def build_graph(con: duckdb.DuckDBPyConnection, llm: ChatOpenAI):
|
| 469 |
+
g = StateGraph(AgentState)
|
| 470 |
+
|
| 471 |
+
g.add_node("draft_sql", lambda s: node_draft_sql(s, llm))
|
| 472 |
+
g.add_node("run_sql", lambda s: node_run_sql(s, con))
|
| 473 |
+
g.add_node("refine_sql", lambda s: node_refine_sql(s, llm))
|
| 474 |
+
g.add_node("decide_viz", lambda s: node_decide_viz(s, llm))
|
| 475 |
+
g.add_node("make_plot", node_make_plot)
|
| 476 |
+
g.add_node("answer", lambda s: node_answer(s, llm))
|
| 477 |
+
|
| 478 |
+
g.set_entry_point("draft_sql")
|
| 479 |
+
g.add_edge("draft_sql", "run_sql")
|
| 480 |
+
|
| 481 |
+
def on_run_sql(state: AgentState):
|
| 482 |
+
if state.get("error"):
|
| 483 |
+
if state.get("attempts", 0) < 3:
|
| 484 |
+
return "refine_sql"
|
| 485 |
+
else:
|
| 486 |
+
return "answer"
|
| 487 |
+
return "decide_viz"
|
| 488 |
+
|
| 489 |
+
g.add_conditional_edges("run_sql", on_run_sql, {
|
| 490 |
+
"refine_sql": "refine_sql",
|
| 491 |
+
"decide_viz": "decide_viz",
|
| 492 |
+
"answer": "answer",
|
| 493 |
+
})
|
| 494 |
+
|
| 495 |
+
g.add_edge("refine_sql", "run_sql")
|
| 496 |
+
|
| 497 |
+
def on_decide_viz(state: AgentState):
|
| 498 |
+
spec = state.get("_viz_spec") or {}
|
| 499 |
+
if spec.get("make_plot"):
|
| 500 |
+
return "make_plot"
|
| 501 |
+
return "answer"
|
| 502 |
+
|
| 503 |
+
g.add_conditional_edges("decide_viz", on_decide_viz, {
|
| 504 |
+
"make_plot": "make_plot",
|
| 505 |
+
"answer": "answer",
|
| 506 |
+
})
|
| 507 |
+
|
| 508 |
+
g.add_edge("make_plot", "answer")
|
| 509 |
+
g.add_edge("answer", END)
|
| 510 |
+
|
| 511 |
+
return g.compile()
|
| 512 |
+
|
| 513 |
+
|
| 514 |
+
# -----------------------------
|
| 515 |
+
# Public function to ask a question
|
| 516 |
+
# -----------------------------
|
| 517 |
+
def answer_question(con: duckdb.DuckDBPyConnection, llm: ChatOpenAI, schema_text: str, question: str,
|
| 518 |
+
stream: bool = False, token_callback: Optional[Callable[[str], None]] = None) -> Dict[str, Any]:
|
| 519 |
+
app = build_graph(con, llm)
|
| 520 |
+
initial: AgentState = {
|
| 521 |
+
"question": question,
|
| 522 |
+
"schema": schema_text,
|
| 523 |
+
"attempts": 0,
|
| 524 |
+
"sql": None,
|
| 525 |
+
"error": None,
|
| 526 |
+
"result_json": None,
|
| 527 |
+
"result_columns": None,
|
| 528 |
+
"plot_path": None,
|
| 529 |
+
"final_answer": None,
|
| 530 |
+
"_result_df": None,
|
| 531 |
+
"_viz_spec": None,
|
| 532 |
+
"_stream": stream,
|
| 533 |
+
"_token_cb": token_callback,
|
| 534 |
+
}
|
| 535 |
+
final_state: AgentState = app.invoke(initial) # type: ignore
|
| 536 |
+
|
| 537 |
+
return {
|
| 538 |
+
"question": question,
|
| 539 |
+
"sql": final_state.get("sql"),
|
| 540 |
+
"answer": final_state.get("final_answer"),
|
| 541 |
+
"plot_path": final_state.get("plot_path"),
|
| 542 |
+
"error": final_state.get("error"),
|
| 543 |
+
}
|
| 544 |
+
|
| 545 |
+
|
| 546 |
+
# -----------------------------
|
| 547 |
+
# CLI
|
| 548 |
+
# -----------------------------
|
| 549 |
+
def main():
|
| 550 |
+
load_dotenv(find_dotenv())
|
| 551 |
+
|
| 552 |
+
parser = argparse.ArgumentParser(description="DuckDB Q&A ReAct Agent (Streaming)")
|
| 553 |
+
parser.add_argument("--duckdb", required=True, help="Path to DuckDB database file")
|
| 554 |
+
parser.add_argument("--model", default="gpt-4o", help="OpenAI chat model (default: gpt-4o)")
|
| 555 |
+
parser.add_argument("--stream", action="store_true", help="Stream the final answer to stdout")
|
| 556 |
+
parser.add_argument("--schemas", nargs="*", default=["main"], help="Schemas to include (default: main)")
|
| 557 |
+
args = parser.parse_args()
|
| 558 |
+
|
| 559 |
+
api_key = os.environ.get("OPENAI_API_KEY")
|
| 560 |
+
if not api_key:
|
| 561 |
+
print("ERROR: OPENAI_API_KEY not set. Put it in a .env file or export the env var.")
|
| 562 |
+
return
|
| 563 |
+
|
| 564 |
+
# Connect DuckDB strictly to the provided file (user DB)
|
| 565 |
+
try:
|
| 566 |
+
con = duckdb.connect(database=args.duckdb, read_only=True)
|
| 567 |
+
except Exception as e:
|
| 568 |
+
print(f"Failed to open DuckDB at {args.duckdb}: {e}")
|
| 569 |
+
return
|
| 570 |
+
|
| 571 |
+
# Introspect schema (user tables only)
|
| 572 |
+
schema_text = get_schema_summary(con, allowed_schemas=args.schemas)
|
| 573 |
+
|
| 574 |
+
# Init LLM; enable streaming capability (used only in final answer node)
|
| 575 |
+
llm = make_llm(model=args.model, temperature=0.0)
|
| 576 |
+
|
| 577 |
+
print("DuckDB Q&A Agent (ReAct, Streaming)\n")
|
| 578 |
+
print(f"Connected to: {args.duckdb}")
|
| 579 |
+
print(f"Schemas included: {', '.join(args.schemas)}")
|
| 580 |
+
print("\nSchema snapshot:\n-----------------\n")
|
| 581 |
+
print(schema_text)
|
| 582 |
+
print("\nType your question and press ENTER. Type 'exit' to quit.\n")
|
| 583 |
+
|
| 584 |
+
def print_token(t: str):
|
| 585 |
+
print(t, end="", flush=True)
|
| 586 |
+
|
| 587 |
+
while True:
|
| 588 |
+
try:
|
| 589 |
+
q = input("Q> ").strip()
|
| 590 |
+
except (EOFError, KeyboardInterrupt):
|
| 591 |
+
print("\nExiting.")
|
| 592 |
+
break
|
| 593 |
+
if q.lower() in ("exit", "quit"):
|
| 594 |
+
print("Goodbye.")
|
| 595 |
+
break
|
| 596 |
+
if not q:
|
| 597 |
+
continue
|
| 598 |
+
|
| 599 |
+
if args.stream:
|
| 600 |
+
print("\n--- ANSWER (streaming) ---")
|
| 601 |
+
result = answer_question(con, llm, schema_text, q, stream=True, token_callback=print_token)
|
| 602 |
+
print("") # newline after stream
|
| 603 |
+
if result.get("plot_path"):
|
| 604 |
+
print(f"\nChart saved to: {result['plot_path']}")
|
| 605 |
+
print("\n--- SQL ---")
|
| 606 |
+
print((result.get("sql") or "").strip())
|
| 607 |
+
if result.get("error"):
|
| 608 |
+
print("\nERROR: " + str(result.get("error")))
|
| 609 |
+
else:
|
| 610 |
+
result = answer_question(con, llm, schema_text, q, stream=False, token_callback=None)
|
| 611 |
+
print("\n--- SQL ---")
|
| 612 |
+
print((result.get("sql") or "").strip())
|
| 613 |
+
if result.get("error"):
|
| 614 |
+
print("\n--- RESULT ---")
|
| 615 |
+
print("Sorry, I couldn't resolve a working query after 3 attempts.")
|
| 616 |
+
print("Error: " + str(result.get("error")))
|
| 617 |
+
else:
|
| 618 |
+
print("\n--- ANSWER ---")
|
| 619 |
+
print(result.get("answer") or "")
|
| 620 |
+
if result.get("plot_path"):
|
| 621 |
+
print(f"\nChart saved to: {result['plot_path']}")
|
| 622 |
+
print("\n")
|
| 623 |
+
|
| 624 |
+
con.close()
|
| 625 |
+
|
| 626 |
+
|
| 627 |
+
if __name__ == "__main__":
|
| 628 |
+
main()
|
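For reference, the conditional-edge logic that drives the retry loop above can be exercised in isolation. This is a minimal sketch, not part of the repository: it copies the routing rule from `on_run_sql` onto plain dicts standing in for the real `AgentState`.

```python
def route_after_run_sql(state: dict) -> str:
    # Mirror of the graph's conditional edge: on a SQL error, retry
    # (refine_sql) up to 3 attempts, then give up and answer; on
    # success, move on to the visualization decision.
    if state.get("error"):
        if state.get("attempts", 0) < 3:
            return "refine_sql"
        return "answer"
    return "decide_viz"

print(route_after_run_sql({"error": "syntax error", "attempts": 1}))  # → refine_sql
print(route_after_run_sql({"error": "syntax error", "attempts": 3}))  # → answer
print(route_after_run_sql({"error": None, "attempts": 0}))            # → decide_viz
```

The same three branch names must appear as keys in the mapping passed to `add_conditional_edges`, otherwise LangGraph cannot resolve the returned label to a node.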
excel_to_duckdb.py
ADDED
@@ -0,0 +1,595 @@
"""
Excel → DuckDB ingestion (generic, robust, multi-table)

- Hierarchical headers with merged-cell parent context (titles removed)
- Merged rows/cols resolved to the master (top-left) value for consistent replication
- Multiple tables detected ONLY when separated by at least one completely empty row
- Footer detection (ignore trailing notes/summaries)
- Pivot detection (skip pivot-looking rows; optional sheet-level pivot/chart-sheet skip)
- Optional LLM inference for unnamed columns and table titles (EXCEL_LLM_INFER=1)
- One DuckDB table per detected table block
- Traceability:
    __excel_schema (sheet_name, table_name, column_ordinal, original_name, sql_column)
    __excel_tables (sheet_name, table_name, block_index, start_row, end_row,
                    header_rows_json, inferred_title, original_title_text)

Usage:
    python excel_to_duckdb.py --excel /path/file.xlsx --duckdb /path/out.duckdb
"""

import os
import re
import sys
import json
import argparse
from pathlib import Path
from typing import List, Tuple, Dict
import hashlib

from openpyxl import load_workbook
from openpyxl.worksheet.worksheet import Worksheet


def _nonempty(vals):
    return [v for v in vals if v not in (None, "")]

def _is_numlike(x):
    if isinstance(x, (int, float)):
        return True
    s = str(x).strip().replace(",", "")
    if s.endswith("%"):
        s = s[:-1]
    if not s:
        return False
    if any(c.isalpha() for c in s):
        return False
    try:
        float(s); return True
    except Exception:
        return False

def _is_year_token(x):
    if isinstance(x, int) and 1800 <= x <= 2100: return True
    s = str(x).strip()
    return s.isdigit() and 1800 <= int(s) <= 2100

def is_probably_footer(cells):
    nonempty = [(i, v) for i, v in enumerate(cells) if v not in (None, "")]
    if not nonempty: return False
    if len(nonempty) <= 2:
        text = " ".join(str(v) for _, v in nonempty).strip().lower()
        if any(text.startswith(k) for k in ["note", "notes", "source", "summary", "disclaimer"]): return True
        if len(text) > 50: return True
    return False

def is_probably_data(cells, num_cols):
    vals = [v for v in cells if v not in (None, "")]
    if not vals: return False
    nums_list = [v for v in vals if _is_numlike(v)]
    num_num = len(nums_list); num_text = len(vals) - num_num
    density = len(vals) / max(1, num_cols)
    if num_num >= 2 and all(_is_year_token(v) for v in nums_list) and num_text >= 2:
        return False
    if num_num >= max(2, num_text): return True
    if density >= 0.6 and num_num >= 2: return True
    first = str(vals[0]).strip().lower() if vals else ""
    if first in ("total", "totals", "grand total"): return True
    return False

PIVOT_MARKERS = {"row labels", "column labels", "values", "grand total", "report filter", "filters", "∑ values", "σ values", "Σ values"}

def is_pivot_marker_string(s: str) -> bool:
    if not s: return False
    t = str(s).strip().lower()
    if t in PIVOT_MARKERS: return True
    if t.startswith(("sum of ", "count of ", "avg of ", "average of ")): return True
    if t.endswith(" total") or t.startswith("total "): return True
    return False

def is_pivot_row(cells) -> bool:
    text_cells = [str(v).strip() for v in cells if v not in (None, "")]
    if not text_cells: return False
    if any(is_pivot_marker_string(x) for x in text_cells): return True
    agg_hits = sum(1 for x in text_cells if x.lower().startswith(("sum of", "count of", "avg of", "average of", "min of", "max of")))
    return agg_hits >= 2

def is_pivot_or_chart_sheet(ws: Worksheet) -> bool:
    try:
        if getattr(ws, "_charts", None): return True
    except Exception: pass
    if hasattr(ws, "_pivots") and getattr(ws, "_pivots"): return True
    scan_rows = min(ws.max_row, 40); scan_cols = min(ws.max_column, 20)
    pivotish = 0
    for r in range(1, scan_rows + 1):
        row = [ws.cell(r, c).value for c in range(1, scan_cols + 1)]
        if is_pivot_row(row):
            pivotish += 1
            if pivotish >= 2: return True
    name = (ws.title or "").lower()
    if any(k in name for k in ("pivot", "dashboard", "chart", "charts")): return True
    return False

def sanitize_table_name(name: str) -> str:
    t = re.sub(r"[^\w]", "_", str(name))
    t = re.sub(r"_+", "_", t).strip("_")
    if t and not t[0].isalpha(): t = "table_" + t
    return t or "sheet_data"

def _samples_for_column(rows, col_idx, max_items=20):
    vals = []
    for row in rows:
        if col_idx < len(row):
            v = row[col_idx]
            if v not in (None, ""): vals.append(v)
        if len(vals) >= max_items: break
    return vals

def _heuristic_infer_col_name(samples):
    if not samples: return None
    if sum(1 for v in samples if _is_year_token(v)) >= max(2, int(0.8 * len(samples))): return "year"
    pct_hits = 0
    for v in samples:
        s = str(v).strip()
        if s.endswith("%"): pct_hits += 1
        else:
            try:
                f = float(s.replace(",", ""))
                if 0 <= f <= 100: pct_hits += 0.5
            except Exception: pass
    if pct_hits >= max(2, int(0.7 * len(samples))): return "percentage"
    if sum(1 for v in samples if _is_numlike(v)) >= max(3, int(0.7 * len(samples))):
        intish = 0
        for v in samples:
            try:
                if float(str(v).replace(",", "")) == int(float(str(v).replace(",", ""))): intish += 1
            except Exception: pass
        if intish >= max(2, int(0.6 * len(samples))): return "count"
        return "value"
    uniq = {str(v).strip().lower() for v in samples}
    if len(uniq) <= 3 and max(len(str(v)) for v in samples) >= 30: return "question"
    if sum(1 for v in samples if re.search(r"\d", str(v)) and ("-" in str(v) or "–" in str(v))) >= max(2, int(0.6 * len(samples))): return "range"
    if len(uniq) < max(5, int(0.5 * len(samples))): return "category"
    return None

def clean_col_name(s: str) -> str:
    s = re.sub(r"[^\w\s%#‰]", "", str(s).strip())
    s = s.replace("%", " pct").replace("‰", " permille").replace("#", " count ")
    s = re.sub(r"\s+", " ", s)
    s = re.sub(r"\s+", "_", s)
    s = re.sub(r"_+", "_", s).strip("_")
    if s and s[0].isdigit(): s = "col_" + s
    return s or "unnamed_column"

def ensure_unique(names):
    seen = {}; out = []
    for n in names:
        base = (n or "unnamed_column").lower()
        if base not in seen:
            seen[base] = 0; out.append(n)
        else:
            i = seen[base] + 1
            while f"{n}_{i}".lower() in seen: i += 1
            seen[base] = i; out.append(f"{n}_{i}")
            seen[f"{n}_{i}".lower()] = 0
    return out

def compose_col(parts):
    cleaned = []; prev = None
    for p in parts:
        if not p: continue
        p_norm = str(p).strip()
        if prev is not None and p_norm.lower() == prev.lower(): continue
        cleaned.append(p_norm); prev = p_norm
    if not cleaned: return ""
    return clean_col_name("_".join(cleaned))

def used_bounds(ws: Worksheet) -> Tuple[int, int, int, int]:
    min_row, max_row, min_col, max_col = None, 0, None, 0
    for r in ws.iter_rows():
        for c in r:
            v = c.value
            if v is not None and str(v).strip() != "":
                if min_row is None or c.row < min_row: min_row = c.row
                if c.row > max_row: max_row = c.row
                if min_col is None or c.column < min_col: min_col = c.column
                if c.column > max_col: max_col = c.column
    if min_row is None: return 1, 0, 1, 0
    return min_row, max_row, min_col, max_col

def build_merged_master_map(ws: Worksheet):
    mapping = {}
    for mr in ws.merged_cells.ranges:
        min_col, min_row, max_col, max_row = mr.min_col, mr.min_row, mr.max_col, mr.max_row
        master = (min_row, min_col)
        for r in range(min_row, max_row + 1):
            for c in range(min_col, max_col + 1):
                mapping[(r, c)] = master
    return mapping

def build_value_grid(ws: Worksheet, min_row: int, max_row: int, min_col: int, max_col: int):
    merged_map = build_merged_master_map(ws)
    nrows = max_row - min_row + 1; ncols = max_col - min_col + 1
    grid = [[None] * ncols for _ in range(nrows)]
    for r in range(min_row, max_row + 1):
        rr = r - min_row
        for c in range(min_col, max_col + 1):
            cc = c - min_col
            master = merged_map.get((r, c))
            if master:
                mr, mc = master; grid[rr][cc] = ws.cell(mr, mc).value
            else:
                grid[rr][cc] = ws.cell(r, c).value
    return grid

def row_vals_from_grid(grid, r, min_row):
    return grid[r - min_row]

def is_empty_row_vals(vals):
    return not any(v not in (None, "") for v in vals)

def is_title_like_row_vals(vals, total_cols=20):
    vals_ne = _nonempty(vals)
    if not vals_ne: return False
    if len(vals_ne) == 1: return True
    coverage = len(vals_ne) / max(1, total_cols)
    if coverage <= 0.2 and all(isinstance(v, str) and len(str(v)) > 20 for v in vals_ne): return True
    uniq = {str(v).strip().lower() for v in vals_ne}
    if len(uniq) == 1: return True
    block = {"local currency unit per us dollar", "exchange rate", "average annual exchange rate"}
    if any(str(v).strip().lower() in block for v in vals_ne): return True
    return False

def is_header_candidate_row_vals(vals, total_cols=20):
    vals_ne = _nonempty(vals)
    if not vals_ne: return False
    if is_title_like_row_vals(vals, total_cols): return False
    nums = sum(1 for v in vals_ne if _is_numlike(v))
    years = sum(1 for v in vals_ne if _is_year_token(v))
    has_text = any(not _is_numlike(v) for v in vals_ne)
    if years >= 2 and has_text: return True
    if nums >= max(2, len(vals_ne) - nums): return years >= max(2, int(0.6 * len(vals_ne)))
    uniq_labels = {str(v).strip().lower() for v in vals_ne if not _is_numlike(v)}
    return (len(vals_ne) >= 2) or (len(uniq_labels) >= 2)

def detect_tables_fast(ws: Worksheet, grid, min_row, max_row, min_col, max_col):
    blocks = []
    if is_pivot_or_chart_sheet(ws): return blocks
    total_cols = max_col - min_col + 1
    r = min_row
    while r <= max_row:
        vals = row_vals_from_grid(grid, r, min_row)
        if is_empty_row_vals(vals) or is_title_like_row_vals(vals, total_cols) or is_pivot_row(vals):
            r += 1; continue
        if not is_probably_data(vals, total_cols):
            r += 1; continue
        data_start = r
        header_rows = []
        up = data_start - 1
        while up >= min_row:
            vup = row_vals_from_grid(grid, up, min_row)
            if is_empty_row_vals(vup): break
            if is_title_like_row_vals(vup, total_cols) or is_pivot_row(vup):
                up -= 1; continue
            if is_header_candidate_row_vals(vup, total_cols):
                header_rows = []
                hdr_row = up
                while hdr_row >= min_row:
                    hdr_vals = row_vals_from_grid(grid, hdr_row, min_row)
                    if is_empty_row_vals(hdr_vals): break
                    if is_header_candidate_row_vals(hdr_vals, total_cols):
                        header_rows.insert(0, hdr_row); hdr_row -= 1
                    else: break
            break
        data_end = data_start
        rr = data_start + 1
        while rr <= max_row:
            v = row_vals_from_grid(grid, rr, min_row)
            if is_probably_footer(v) or is_pivot_row(v): break
            if is_empty_row_vals(v): break
            if is_probably_data(v, total_cols) or is_header_candidate_row_vals(v, total_cols):
                data_end = rr
            rr += 1
        title_text = None
        if header_rows:
            top = header_rows[0]
            for tr in range(max(min_row, top - 3), top):
                tv = row_vals_from_grid(grid, tr, min_row)
                if is_title_like_row_vals(tv, total_cols):
                    first = next((str(x).strip() for x in tv if x not in (None, "")), None)
                    if first: title_text = first
                    break
        if (header_rows or data_end - data_start >= 1) and data_start <= data_end:
            blocks.append({"header_rows": header_rows, "data_start": data_start, "data_end": data_end, "title_text": title_text})
        r = data_end + 1
        while r <= max_row and is_empty_row_vals(row_vals_from_grid(grid, r, min_row)):
            r += 1
    return blocks
| 305 |
+
|
| 306 |
+
def expand_headers_from_grid(grid, header_rows, min_row, min_col, eff_max_col):
|
| 307 |
+
if not header_rows: return []
|
| 308 |
+
mat = []
|
| 309 |
+
for r in header_rows:
|
| 310 |
+
row_vals = row_vals_from_grid(grid, r, min_row)
|
| 311 |
+
row = [("" if (row_vals[c] is None) else str(row_vals[c]).strip()) for c in range(0, eff_max_col)]
|
| 312 |
+
last = ""
|
| 313 |
+
for i in range(len(row)):
|
| 314 |
+
if row[i] == "" and i > 0: row[i] = last
|
| 315 |
+
else: last = row[i]
|
| 316 |
+
mat.append(row)
|
| 317 |
+
return mat
|
| 318 |
+
|
| 319 |
+
def sheet_block_to_df_fast(ws, grid, min_row, max_row, min_col, max_col, header_rows, data_start, data_end):
|
| 320 |
+
import pandas as pd
|
| 321 |
+
total_cols = max_col - min_col + 1
|
| 322 |
+
if (not header_rows) and data_start and data_start > min_row:
|
| 323 |
+
prev = row_vals_from_grid(grid, data_start - 1, min_row)
|
| 324 |
+
if is_header_candidate_row_vals(prev, total_cols):
|
| 325 |
+
header_rows = [data_start - 1]
|
| 326 |
+
if (not header_rows) and data_start:
|
| 327 |
+
cur = row_vals_from_grid(grid, data_start, min_row)
|
| 328 |
+
nxt = row_vals_from_grid(grid, data_start + 1, min_row) if data_start + 1 <= max_row else []
|
| 329 |
+
if is_header_candidate_row_vals(cur, total_cols) and is_probably_data(nxt, total_cols):
|
| 330 |
+
header_rows = [data_start]; data_start += 1
|
| 331 |
+
if not header_rows or data_start is None or data_end is None or data_end < data_start:
|
| 332 |
+
return pd.DataFrame(), [], []
|
| 333 |
+
def used_upto_col():
|
| 334 |
+
maxc = 0
|
| 335 |
+
for r in list(header_rows) + list(range(data_start, data_end+1)):
|
| 336 |
+
vals = row_vals_from_grid(grid, r, min_row)
|
| 337 |
+
for c_off in range(total_cols):
|
| 338 |
+
v = vals[c_off]
|
| 339 |
+
if v not in (None, ""): maxc = max(maxc, c_off+1)
|
| 340 |
+
return maxc or total_cols
|
| 341 |
+
eff_max_col = used_upto_col()
|
| 342 |
+
header_mat = expand_headers_from_grid(grid, header_rows, min_row, min_col, eff_max_col)
|
| 343 |
+
def is_title_level(values):
|
| 344 |
+
total = len(values)
|
| 345 |
+
filled = [str(v).strip() for v in values if v not in (None, "")]
|
| 346 |
+
if total == 0: return False
|
| 347 |
+
coverage = len(filled) / total
|
| 348 |
+
if coverage <= 0.2 and len(filled) <= 2: return True
|
| 349 |
+
if filled:
|
| 350 |
+
uniq = {v.lower() for v in filled}
|
| 351 |
+
if len(uniq) == 1:
|
| 352 |
+
label = next(iter(uniq))
|
| 353 |
+
dom = sum(1 for v in values if isinstance(v,str) and v.strip().lower() == label)
|
| 354 |
+
if dom / total >= 0.6: return True
|
| 355 |
+
return False
|
| 356 |
+
usable_levels = [i for i in range(len(header_mat)) if not is_title_level(header_mat[i])]
|
| 357 |
+
if not usable_levels and header_mat: usable_levels = [len(header_mat) - 1]
|
| 358 |
+
cols = []
|
| 359 |
+
for c_off in range(eff_max_col):
|
| 360 |
+
parts = [header_mat[l][c_off] for l in range(usable_levels[0], usable_levels[-1]+1)] if usable_levels else []
|
| 361 |
+
cols.append(compose_col(parts))
|
| 362 |
+
cols = ensure_unique([clean_col_name(x) for x in cols])
|
| 363 |
+
data_rows = []
|
| 364 |
+
for r in range(data_start, data_end+1):
|
| 365 |
+
vals = row_vals_from_grid(grid, r, min_row)
|
| 366 |
+
row = [vals[c_off] for c_off in range(eff_max_col)]
|
| 367 |
+
if is_probably_footer(row): break
|
| 368 |
+
data_rows.append(row[:len(cols)])
|
| 369 |
+
if not data_rows: return pd.DataFrame(columns=cols), header_mat, cols
|
| 370 |
+
keep_mask = [any(row[i] not in (None, "") for row in data_rows) for i in range(len(cols))]
|
| 371 |
+
kept_cols = [c for c,k in zip(cols, keep_mask) if k]
|
| 372 |
+
trimmed_rows = [[v for v,k in zip(row, keep_mask) if k] for row in data_rows]
|
| 373 |
+
df = pd.DataFrame(trimmed_rows, columns=kept_cols)
|
| 374 |
+
if any(str(c).startswith("unnamed_column") for c in df.columns):
|
| 375 |
+
new_names = list(df.columns)
|
| 376 |
+
for idx, name in enumerate(list(df.columns)):
|
| 377 |
+
if not str(name).startswith("unnamed_column"): continue
|
| 378 |
+
samples = _samples_for_column(trimmed_rows, idx, max_items=20)
|
| 379 |
+
guess = _heuristic_infer_col_name(samples)
|
| 380 |
+
if guess: new_names[idx] = clean_col_name(guess)
|
| 381 |
+
df.columns = ensure_unique([clean_col_name(x) for x in new_names])
|
| 382 |
+
return df, header_mat, kept_cols
|
| 383 |
+
|
| 384 |
+
def _llm_infer_table_title(header_mat, sample_rows, sheet_name):
    if os.environ.get("EXCEL_LLM_INFER", "0") != "1": return None
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key: return None
    headers = []
    if header_mat:
        for c in range(len(header_mat[0])):
            parts = [header_mat[l][c] for l in range(len(header_mat))]
            parts = [p for p in parts if p]
            if parts: headers.append("_".join(parts))
    headers = headers[:10]
    samples = [[str(x) for x in r[:6]] for r in sample_rows[:5]]
    prompt = (
        "Propose a short, human-readable title for a data table.\n"
        "Keep it 3-6 words, Title Case, no punctuation at the end.\n"
        f"Sheet: {sheet_name}\nHeaders: {headers}\nRow samples: {samples}\n"
        "Answer with JSON: {\"title\": \"...\"}"
    )
    try:
        from openai import OpenAI
        client = OpenAI(api_key=api_key)
        resp = client.chat.completions.create(
            model=os.environ.get("OPENAI_MODEL", "gpt-4o-mini"),
            messages=[{"role": "user", "content": prompt}], temperature=0.2,
        )
        text = resp.choices[0].message.content.strip()
    except Exception:
        return None
    import re, json as _json
    m = re.search(r"\{.*\}", text, re.S)
    if not m: return None
    try:
        obj = _json.loads(m.group(0)); title = obj.get("title", "").strip()
        return title or None
    except Exception: return None
def _heuristic_table_title(header_mat, sheet_name, idx):
    if header_mat:
        parts = []
        levels = len(header_mat)
        cols = len(header_mat[0]) if header_mat else 0
        for c in range(min(6, cols)):
            colparts = [header_mat[l][c] for l in range(min(levels, 2)) if header_mat[l][c]]
            if colparts: parts.extend(colparts)
        if parts:
            base = " ".join(dict.fromkeys(parts))
            return base[:60]
    return f"{sheet_name} Table {idx}"


def infer_table_title(header_mat, sample_rows, sheet_name, idx):
    title = _heuristic_table_title(header_mat, sheet_name, idx)
    llm = _llm_infer_table_title(header_mat, sample_rows, sheet_name)
    return llm or title


def ensure_unique_table_name(existing: set, name: str) -> str:
    base = sanitize_table_name(name) or "table"
    if base not in existing:
        existing.add(base); return base
    i = 2
    while f"{base}_{i}" in existing: i += 1
    out = f"{base}_{i}"; existing.add(out); return out
# --- NEW: block coalescing to avoid nested/overlapping duplicates ---
def coalesce_blocks(blocks: List[Dict]) -> List[Dict]:
    """Keep only maximal non-overlapping blocks by data row range."""
    if not blocks: return blocks
    # Sort end descending at equal start so a maximal block is seen before
    # any block it fully contains (otherwise same-start nested blocks survive).
    blocks_sorted = sorted(blocks, key=lambda b: (b["data_start"], -b["data_end"]))
    result = []
    for b in blocks_sorted:
        if any(b["data_start"] >= x["data_start"] and b["data_end"] <= x["data_end"] for x in result):
            # fully contained -> drop
            continue
        result.append(b)
    return result
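The containment-based dedup above can be exercised in isolation. The sketch below mirrors the same logic on minimal block dicts (only the `data_start`/`data_end` keys are assumed); it is a standalone illustration, not the module's own test:

```python
# Standalone sketch of containment-based block dedup.
# Sorting by (start asc, end desc) guarantees a maximal block is considered
# before any block it fully contains.
def coalesce(blocks):
    result = []
    for b in sorted(blocks, key=lambda b: (b["data_start"], -b["data_end"])):
        if any(b["data_start"] >= x["data_start"] and b["data_end"] <= x["data_end"] for x in result):
            continue  # fully contained in an already-kept block -> drop
        result.append(b)
    return result

blocks = [
    {"data_start": 2, "data_end": 20},   # maximal block -> kept
    {"data_start": 5, "data_end": 10},   # nested inside (2, 20) -> dropped
    {"data_start": 25, "data_end": 30},  # disjoint -> kept
]
print([(b["data_start"], b["data_end"]) for b in coalesce(blocks)])
# -> [(2, 20), (25, 30)]
```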
# ------------------------- Persistence -------------------------
def persist(excel_path, duckdb_path):
    try:
        from duckdb import connect
    except ImportError:
        print("Error: DuckDB library not installed. Install with: pip install duckdb"); sys.exit(1)
    try:
        wb = load_workbook(excel_path, data_only=True)
    except FileNotFoundError:
        print(f"Error: Excel file not found: {excel_path}"); sys.exit(1)
    except Exception as e:
        print(f"Error loading Excel file: {e}"); sys.exit(1)

    db_path = Path(duckdb_path)
    db_path.parent.mkdir(parents=True, exist_ok=True)
    new_db = not db_path.exists()
    con = connect(str(db_path))
    if new_db: print(f"Created new DuckDB at: {db_path}")

    con.execute("""
        CREATE TABLE IF NOT EXISTS __excel_schema (
            sheet_name TEXT,
            table_name TEXT,
            column_ordinal INTEGER,
            original_name TEXT,
            sql_column TEXT
        )
    """)
    con.execute("""
        CREATE TABLE IF NOT EXISTS __excel_tables (
            sheet_name TEXT,
            table_name TEXT,
            block_index INTEGER,
            start_row INTEGER,
            end_row INTEGER,
            header_rows_json TEXT,
            inferred_title TEXT,
            original_title_text TEXT
        )
    """)
    con.execute("DELETE FROM __excel_schema")
    con.execute("DELETE FROM __excel_tables")

    used_names = set(); total_tables = 0; total_rows = 0

    for sheet in wb.sheetnames:
        ws = wb[sheet]
        try:
            if not isinstance(ws, Worksheet):
                print(f"Skipping chartsheet: {sheet}"); continue
        except Exception: pass
        if is_pivot_or_chart_sheet(ws):
            print(f"Skipping pivot/chart-like sheet: {sheet}"); continue

        min_row, max_row, min_col, max_col = used_bounds(ws)
        if max_row < min_row: continue
        grid = build_value_grid(ws, min_row, max_row, min_col, max_col)

        blocks = detect_tables_fast(ws, grid, min_row, max_row, min_col, max_col)
        blocks = coalesce_blocks(blocks)  # NEW: drop contained duplicates
        if not blocks: continue

        # NEW: per-sheet content hash set to avoid identical duplicate content
        seen_content = set()

        for idx, blk in enumerate(blocks, start=1):
            df, header_mat, kept_cols = sheet_block_to_df_fast(
                ws, grid, min_row, max_row, min_col, max_col,
                blk["header_rows"], blk["data_start"], blk["data_end"]
            )
            if df.empty: continue

            # Content hash (stable CSV representation)
            csv_bytes = df.to_csv(index=False).encode("utf-8")
            h = hashlib.sha256(csv_bytes).hexdigest()
            if h in seen_content:
                # Skip duplicate block with identical content
                print(f"Skipping duplicate content on sheet {sheet} (block {idx})")
                continue
            seen_content.add(h)

            # Build original composite header names for lineage mapping
            original_cols = []
            if header_mat:
                levels = len(header_mat)
                cols = len(header_mat[0]) if header_mat else 0
                for c in range(cols):
                    parts = [header_mat[l][c] for l in range(levels)]
                    original_cols.append("_".join([p for p in parts if p]))
            else:
                original_cols = list(df.columns)
                while len(original_cols) < len(df.columns): original_cols.append("unnamed")

            title_orig = blk.get("title_text")
            title = title_orig or infer_table_title(header_mat, df.values.tolist(), sheet, idx)
            candidate = title if title else f"{sheet} Table {idx}"
            table = ensure_unique_table_name(used_names, candidate)

            con.execute(f'DROP TABLE IF EXISTS "{table}"')
            con.register(f"{table}_temp", df)
            con.execute(f'CREATE TABLE "{table}" AS SELECT * FROM {table}_temp')
            con.unregister(f"{table}_temp")

            for cidx, (orig, sqlc) in enumerate(zip(original_cols[:len(df.columns)], df.columns), start=1):
                con.execute("INSERT INTO __excel_schema VALUES (?, ?, ?, ?, ?)", [sheet, table, cidx, str(orig), str(sqlc)])

            con.execute(
                """
                INSERT INTO __excel_tables
                (sheet_name, table_name, block_index, start_row, end_row, header_rows_json, inferred_title, original_title_text)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?)
                """,
                [sheet, table, int(idx), int(blk["data_start"]), int(blk["data_end"]), json.dumps(blk["header_rows"]), title if title else None, title_orig if title_orig else None],
            )

            print(f"Created table {table} from sheet {sheet} with {len(df)} rows and {len(df.columns)} columns.")
            total_tables += 1; total_rows += len(df)

    con.close()
    print(f"""\n✅ Completed.
- Created {total_tables} tables with {total_rows} total rows
- Column lineage: __excel_schema
- Block metadata: __excel_tables""")


def main():
    ap = argparse.ArgumentParser(description="Excel → DuckDB (v2-compatible perf + no-dup safe guards).")
    ap.add_argument("--excel", required=True, help="Path to .xlsx")
    ap.add_argument("--duckdb", required=True, help="Path to DuckDB file")
    args = ap.parse_args()
    if not os.path.exists(args.excel):
        print(f"Error: Excel file not found: {args.excel}"); sys.exit(1)
    persist(args.excel, args.duckdb)


if __name__ == "__main__":
    main()
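The per-sheet SHA-256 dedup used in `persist` works for any deterministic serialization, not just a CSV dump. A minimal pandas-free sketch of the same idea (the `rows` data below is hypothetical):

```python
import hashlib
import json

def content_hash(rows):
    # Stable serialization: fixed separators so equal content hashes equally.
    payload = json.dumps(rows, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

seen, kept = set(), []
for block in [[["a", 1]], [["a", 1]], [["b", 2]]]:  # second block duplicates the first
    h = content_hash(block)
    if h in seen:
        continue  # identical content already ingested -> skip
    seen.add(h)
    kept.append(block)

print(len(kept))  # -> 2
```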
requirements.txt
CHANGED

@@ -1,3 +1,13 @@
-
-pandas
-
+duckdb
+pandas
+openpyxl
+xlsxwriter
+langgraph
+langchain-openai
+langchain-core
+faker
+python-dotenv==1.0.1
+streamlit
+matplotlib
+openai
+tabulate
streamlit_app.py
ADDED
|
@@ -0,0 +1,574 @@
import os
import sys
import duckdb
import json
import subprocess
from pathlib import Path
from typing import Dict, List, Tuple

import streamlit as st


try:
    from dotenv import load_dotenv, find_dotenv
    load_dotenv(find_dotenv())
except Exception:
    pass

# ---------- Basic page setup ----------
st.set_page_config(page_title="Excel → Dataset", page_icon="📊", layout="wide")

PRIMARY_DIR = Path(__file__).parent.resolve()
UPLOAD_DIR = PRIMARY_DIR / "uploads"
DB_DIR = PRIMARY_DIR / "dbs"
SCRIPT_PATH = PRIMARY_DIR / "excel_to_duckdb.py"  # must be colocated

UPLOAD_DIR.mkdir(parents=True, exist_ok=True)
DB_DIR.mkdir(parents=True, exist_ok=True)

st.markdown(
    """
    <style>
      .logbox { border: 1px solid #e5e7eb; background:#fafafa; padding:10px; border-radius:12px; }
      .logbox code { white-space: pre-wrap; font-size: 0.85rem; }
    </style>
    """,
    unsafe_allow_html=True,
)

st.title("Data Analysis Agent")
# --------- Session state helpers ---------
if "processing" not in st.session_state:
    st.session_state.processing = False
if "processed_key" not in st.session_state:
    st.session_state.processed_key = None
if "last_overview_md" not in st.session_state:
    st.session_state.last_overview_md = None
if "last_preview_items" not in st.session_state:
    st.session_state.last_preview_items = []  # list of dicts: {'table_ref': str, 'label': str}

def _file_key(uploaded) -> str:
    # unique-ish key per upload (name + size)
    try:
        size = len(uploaded.getbuffer())
    except Exception:
        size = 0
    return f"{uploaded.name}:{size}"
# ---------- DuckDB helpers ----------
def list_user_tables(con: duckdb.DuckDBPyConnection) -> List[str]:
    """
    Robust discovery:
      1) information_schema.tables for BASE TABLE/VIEW in any schema, excluding __*
      2) duckdb_tables() as a secondary path
      3) __excel_tables mapping + existence check as last resort
    """
    # 1) information_schema (most portable)
    try:
        q = (
            "SELECT table_schema, table_name "
            "FROM information_schema.tables "
            "WHERE table_type IN ('BASE TABLE','VIEW') "
            "AND table_name NOT LIKE '__%%' "
            "ORDER BY table_schema, table_name"
        )
        rows = con.execute(q).fetchall()
        names = []
        for schema, name in rows:
            if (schema or '').lower() == 'main':
                names.append(name)
            else:
                names.append(f'{schema}."{name}"')
        if names:
            return names
    except Exception:
        pass

    # 2) duckdb_tables()
    try:
        q2 = (
            "SELECT schema_name, table_name "
            "FROM duckdb_tables() "
            "WHERE table_type = 'BASE TABLE' "
            "AND table_name NOT LIKE '__%%' "
            "ORDER BY schema_name, table_name"
        )
        rows = con.execute(q2).fetchall()
        names = []
        for schema, name in rows:
            if (schema or '').lower() == 'main':
                names.append(name)
            else:
                names.append(f'{schema}."{name}"')
        if names:
            return names
    except Exception:
        pass

    # 3) Fallback to metadata table
    try:
        meta = con.execute("SELECT DISTINCT table_name FROM __excel_tables").fetchall()
        names = []
        for (t,) in meta:
            try:
                con.execute(f'SELECT 1 FROM "{t}" LIMIT 1').fetchone()
                names.append(t)
            except Exception:
                continue
        return names
    except Exception:
        return []

def get_columns(con: duckdb.DuckDBPyConnection, table: str) -> List[Tuple[str, str]]:
    # Normalize table name for information_schema lookup
    if table.lower().startswith("main."):
        tname = table.split('.', 1)[1].strip('"')
        schema_filter = 'main'
    elif '.' in table:
        schema_filter, tname_raw = table.split('.', 1)
        tname = tname_raw.strip('"')
    else:
        schema_filter = 'main'
        tname = table.strip('"')
    q = (
        "SELECT column_name, data_type "
        "FROM information_schema.columns "
        "WHERE table_schema=? AND table_name=? "
        "ORDER BY ordinal_position"
    )
    return con.execute(q, [schema_filter, tname]).fetchall()
def detect_year_column(con, table: str, col: str) -> bool:
    try:
        sql = (
            f'SELECT AVG(CASE WHEN TRY_CAST("{col}" AS INTEGER) BETWEEN 1900 AND 2100 '
            f'THEN 1.0 ELSE 0.0 END) FROM {table}'
        )
        v = con.execute(sql).fetchone()[0]
        return (v or 0) > 0.7
    except Exception:
        return False

def role_of_column(con, table: str, col: str, dtype: str) -> str:
    d = (dtype or '').upper()
    if any(tok in d for tok in ["DATE", "TIMESTAMP"]):
        return "date"
    if any(tok in d for tok in ["INT", "BIGINT", "DOUBLE", "DECIMAL", "FLOAT", "HUGEINT", "REAL"]):
        if detect_year_column(con, table, col):
            return "year"
        return "numeric"
    if any(tok in d for tok in ["CHAR", "STRING", "TEXT", "VARCHAR"]):
        try:
            sql = f'SELECT COUNT(*), COUNT(DISTINCT "{col}") FROM {table}'
            n, nd = con.execute(sql).fetchone()
            if n and nd is not None:
                ratio = (nd / n) if n else 0
                if ratio > 0.95:
                    return "id_like"
                if 0.01 <= ratio <= 0.35:
                    return "category"
                if ratio < 0.01:
                    return "binary_flag"
        except Exception:
            pass
        return "text"
    return "other"
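The text-column classification above keys off the distinct-to-total ratio computed in SQL. The same cutoffs can be checked standalone on in-memory lists; this sketch mirrors `role_of_column`'s thresholds but is illustrative only:

```python
def role_from_values(values):
    # Same ratio cutoffs as role_of_column: >0.95 id-like, 0.01-0.35 category,
    # <0.01 near-constant flag, otherwise free text.
    n, nd = len(values), len(set(values))
    ratio = nd / n if n else 0
    if ratio > 0.95:
        return "id_like"
    if 0.01 <= ratio <= 0.35:
        return "category"
    if ratio < 0.01:
        return "binary_flag"
    return "text"

print(role_from_values([f"id-{i}" for i in range(1000)]))            # -> id_like
print(role_from_values([f"region-{i % 100}" for i in range(1000)]))  # -> category
print(role_from_values(["Y", "N"] * 500))                            # -> binary_flag
```

Note that a two-value column over many rows lands in `binary_flag` because its ratio falls below 0.01, which matches how the SQL version behaves on large tables.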
def quick_table_profile(con: duckdb.DuckDBPyConnection, table: str) -> Dict:
    rows = con.execute(f'SELECT COUNT(*) FROM {table}').fetchone()[0]
    cols = get_columns(con, table)
    roles = {"category": [], "numeric": [], "date": [], "year": [], "id_like": [], "text": [], "binary": [], "other": []}
    for c, d in cols:
        r = role_of_column(con, table, c, d)
        if r == "binary_flag":
            roles["binary"].append(c)
        else:
            roles.setdefault(r, []).append(c)
    return {
        "rows": int(rows or 0),
        "n_cols": len(cols),
        "n_cat": len(roles["category"]),
        "n_num": len(roles["numeric"]),
        "n_time": len(roles["year"]) + len(roles["date"]),
    }

def table_mapping(con: duckdb.DuckDBPyConnection, user_tables: List[str]) -> Dict[str, Dict]:
    """
    Map db_table (normalized) -> {sheet_name, title} using __excel_tables if present.
    """
    normalize = lambda t: t.split('.', 1)[1].strip('"') if '.' in t else t.strip('"')
    want_names = {normalize(t) for t in user_tables}
    mapping: Dict[str, Dict] = {}
    try:
        rows = con.execute(
            "SELECT sheet_name, table_name, inferred_title, original_title_text, block_index, start_row "
            "FROM __excel_tables ORDER BY block_index, start_row"
        ).fetchall()
        for sheet_name, table_name, inferred_title, original_title_text, block_index, start_row in rows:
            if table_name not in want_names:
                continue
            title = inferred_title or original_title_text or 'untitled'
            mapping[table_name] = {'sheet_name': sheet_name, 'title': title}
    except Exception:
        pass
    return mapping

def excel_schema_samples(con: duckdb.DuckDBPyConnection, mapping: Dict[str, Dict], max_cols: int = 8) -> Dict[str, List[str]]:
    """ Return up to max_cols original column names per table_name (normalized) for LLM hints. """
    samples: Dict[str, List[str]] = {}
    try:
        rows = con.execute("SELECT sheet_name, table_name, column_ordinal, original_name FROM __excel_schema ORDER BY sheet_name, table_name, column_ordinal").fetchall()
        for sheet_name, table_name, ordn, orig in rows:
            if table_name not in mapping:
                continue
            lst = samples.setdefault(table_name, [])
            if orig and len(lst) < max_cols:
                lst.append(str(orig))
    except Exception:
        pass
    return samples
# ---------- OpenAI ----------
def ai_overview_from_context(context: Dict) -> str:
    api_key = os.environ.get("OPENAI_API_KEY") or st.secrets.get("OPENAI_API_KEY", None)
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set. Please add it to .env or Streamlit secrets.")

    try:
        from openai import OpenAI
        client = OpenAI(api_key=api_key)
        model = os.environ.get("OPENAI_MODEL", "gpt-4o-mini")
    except Exception as e:
        raise RuntimeError("OpenAI client not available. Install 'openai' >= 1.0 and try again.") from e

    prompt = f'''
Start directly (no greeting). Write a concise, conversational overview (max two short paragraphs) of the dataset created from the uploaded Excel.
Requirements:
- Do NOT mention database engines, schemas, or technical column/table names.
- For each segment, reference it as: Sheet "<sheet_name>" — Table "<title>" (use "untitled" if missing).
- Use any provided original Excel header hints ONLY to infer friendlier human concepts; do not quote them verbatim.
- After the overview, list 6–8 simple questions a user could ask in natural language.
- Output Markdown with headings: "Overview" and "Try These Questions".

Context (JSON):
{json.dumps(context, ensure_ascii=False, indent=2)}
'''
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.4,
    )
    return resp.choices[0].message.content.strip()
# ---------- Orchestration ----------
def run_ingestion_pipeline(xlsx_path: Path, db_path: Path, log_placeholder):
    # Combined log function
    log_lines: List[str] = []
    def _append(line: str):
        log_lines.append(line)
        log_placeholder.markdown(
            f"<div class='logbox'><code>{'</code><br/><code>'.join(map(str, log_lines[-400:]))}</code></div>",
            unsafe_allow_html=True,
        )

    # 1) Save (already saved by caller, but we log here for a single place)
    _append("[app] Saving file…")
    _append("[app] Saved.")
    if not SCRIPT_PATH.exists():
        _append("[app] ERROR: ingestion component not found next to the app.")
        raise FileNotFoundError("Required ingestion component not found.")

    # 2) Ingest
    _append("[app] Ingesting…")
    env = os.environ.copy()
    env["PYTHONIOENCODING"] = "utf-8"

    cmd = [sys.executable, str(SCRIPT_PATH), "--excel", str(xlsx_path), "--duckdb", str(db_path)]
    try:
        proc = subprocess.Popen(
            cmd, cwd=str(PRIMARY_DIR),
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
            text=True, bufsize=1, universal_newlines=True, env=env
        )
    except Exception as e:
        _append(f"[app] ERROR: failed to start ingestion: {e}")
        raise

    for line in iter(proc.stdout.readline, ""):
        _append(line.rstrip("\n"))
    proc.wait()
    if proc.returncode != 0:
        _append("[app] ERROR: ingestion reported a non-zero exit code.")
        raise RuntimeError("Ingestion failed. See logs.")

    _append("[app] Ingestion complete.")

    # 3) Open dataset
    _append("[app] Opening dataset…")
    con = duckdb.connect(str(db_path))
    _append("[app] Dataset open.")

    return con, _append
def analyze_and_summarize(con: duckdb.DuckDBPyConnection):
    user_tables = list_user_tables(con)
    preview_items = []  # list of {'table_ref': t, 'label': label} for UI
    if not user_tables:
        # Try to provide metadata if no tables are found
        try:
            meta_df = con.execute("SELECT * FROM __excel_tables").fetchdf()
            st.warning("No user tables were discovered. Showing ingestion metadata for reference.")
            st.dataframe(meta_df, use_container_width=True, hide_index=True)
        except Exception:
            st.error("No user tables were discovered and no metadata table is available.")
        return "", []

    # Build mapping + schema hints
    mapping = table_mapping(con, user_tables)  # normalized table_name -> {sheet_name, title}
    schema_hints = excel_schema_samples(con, mapping, max_cols=8)

    # Build compact profiling context for LLM; avoid raw db table names
    per_table = []
    for idx, t in enumerate(user_tables, start=1):
        prof = quick_table_profile(con, t)
        norm = t.split('.', 1)[1].strip('"') if '.' in t else t.strip('"')
        m = mapping.get(norm, {})
        sheet = m.get('sheet_name')
        title = m.get('title')
        per_table.append({
            "idx": idx,
            "sheet_name": sheet,
            "title": title or "untitled",
            "rows": prof["rows"],
            "n_cols": prof["n_cols"],
            "category_fields": prof["n_cat"],
            "numeric_measures": prof["n_num"],
            "time_fields": prof["n_time"],
            "example_original_headers": schema_hints.get(norm, [])
        })
        label = f'Sheet "{sheet}" — Table "{title or "untitled"}"' if sheet else f'Table "{title or "untitled"}"'
        preview_items.append({'table_ref': t, 'label': label})

    context = {
        "segments": per_table
    }

    # Generate overview (LLM only)
    overview_md = ai_overview_from_context(context)
    return overview_md, preview_items
# ---------- UI flow ----------
file = st.file_uploader("Upload an Excel file", type=["xlsx"])

if file is None and not st.session_state.last_overview_md:
    st.info("Upload an .xlsx file to begin.")

# Only show logs AFTER there is an upload or some result to show
logs_placeholder = None
if file is not None or st.session_state.processing or st.session_state.last_overview_md:
    logs_exp = st.expander("Processing logs", expanded=False)
    logs_placeholder = logs_exp.empty()

if file is not None:
    key = _file_key(file)
    stem = Path(file.name).stem
    saved_xlsx = UPLOAD_DIR / f"{stem}.xlsx"
    db_path = DB_DIR / f"{stem}.duckdb"

    # --- CLEAR state immediately on new upload ---
    if st.session_state.get("processed_key") != key:
        st.session_state["last_overview_md"] = None
        st.session_state["last_preview_items"] = []
        st.session_state["chat_history"] = []
        st.session_state["schema_text"] = None
        st.session_state["db_path"] = None
        # Optional: clear any previous logs shown in UI on rerun
        # (no explicit log buffer stored; the log expander will refresh)

    # Auto-start ingestion exactly once per unique upload
    if (st.session_state.processed_key != key) and (not st.session_state.processing):
        st.session_state.processing = True
        if logs_placeholder is None:
            logs_exp = st.expander("Ingestion logs", expanded=False)
            logs_placeholder = logs_exp.empty()

        # Save uploaded file
        with open(saved_xlsx, "wb") as f:
            f.write(file.getbuffer())

        try:
            con, app_log = run_ingestion_pipeline(saved_xlsx, db_path, logs_placeholder)
            # Analyze + overview
            app_log("[app] Analyzing data…")
            overview_md, preview_items = analyze_and_summarize(con)
            app_log("[app] Overview complete.")
            con.close()

            st.session_state.last_overview_md = overview_md
            st.session_state.last_preview_items = preview_items
            st.session_state.processed_key = key
            st.session_state.processing = False

        except Exception as e:
            st.session_state.processing = False
            st.error(f"Ingestion failed. See logs for details. Error: {e}")

# Display results if available (and avoid re-triggering ingestion)
if st.session_state.last_overview_md:
    # st.subheader("Overview")
    st.markdown(st.session_state.last_overview_md)

    with st.expander("Quick preview (verification only)", expanded=False):
        try:
            # Reconnect to current dataset path (if present)
            if file is not None:
                stem = Path(file.name).stem
                db_path = DB_DIR / f"{stem}.duckdb"
                con = duckdb.connect(str(db_path))
                for item in st.session_state.last_preview_items:
                    t = item['table_ref']
                    label = item['label']
                    df = con.execute(f"SELECT * FROM {t} LIMIT 50").df()
                    st.caption(f"Preview — {label}")
                    st.dataframe(df, use_container_width=True, hide_index=True)
                con.close()
        except Exception as e:
            st.warning(f"Could not preview tables: {e}")
# =====================
# Chat with your dataset
# (Appends after overview & preview; leaves earlier logic untouched)
# =====================
if st.session_state.get("last_overview_md"):
    st.divider()
    st.subheader("Chat with your dataset")

    # Lazy imports so nothing changes before preview completes
    def _lazy_imports():
        from duckdb_react_agent import get_schema_summary, make_llm, answer_question  # noqa: F401
        return get_schema_summary, make_llm, answer_question

    # Initialize chat memory
    st.session_state.setdefault("chat_history", [])  # [{role, content, sql?, plot_path?}]

    # --- 1) Take input first so the user's question appears immediately ---
    user_q = st.chat_input("Ask a question about the dataset…")
    if user_q:
        st.session_state.chat_history.append({"role": "user", "content": user_q})

    # --- 2) Render history in strict User → Assistant order ---
    for msg in st.session_state.chat_history:
        with st.chat_message("user" if msg["role"] == "user" else "assistant"):
            # Always: text first
            st.markdown(msg["content"])

            # Then: optional plot (below the answer)
            plot_path_hist = msg.get("plot_path")
            if plot_path_hist:
                if not os.path.isabs(plot_path_hist):
                    plot_path_hist = str((PRIMARY_DIR / plot_path_hist).resolve())
                if os.path.exists(plot_path_hist):
                    st.image(plot_path_hist, caption="Chart", width=520)

            # Finally: optional SQL expander
            if msg.get("sql"):
                with st.expander("View generated SQL", expanded=False):
                    st.markdown(f"<div class='sqlbox'>{msg['sql']}</div>", unsafe_allow_html=True)

    # --- 3) If a new question arrived, stream the assistant answer now ---
    if user_q:
        with st.chat_message("assistant"):
            # Placeholders in sequence: text → plot → SQL
            stream_placeholder = st.empty()
            plot_placeholder = st.empty()
            sql_placeholder = st.empty()

            # Show pending immediately
            stream_placeholder.markdown("_Answer pending…_")
|
| 490 |
+
|
| 491 |
+
partial_chunks = []
|
| 492 |
+
def on_token(t: str):
|
| 493 |
+
partial_chunks.append(t)
|
| 494 |
+
stream_placeholder.markdown("".join(partial_chunks))
|
| 495 |
+
|
| 496 |
+
# Resolve DB path
|
| 497 |
+
if 'db_path' in locals():
|
| 498 |
+
_db_path = db_path # from the preview scope if defined
|
| 499 |
+
else:
|
| 500 |
+
if 'file' in locals() and file is not None:
|
| 501 |
+
_stem = Path(file.name).stem
|
| 502 |
+
_db_path = DB_DIR / f"{_stem}.duckdb"
|
| 503 |
+
else:
|
| 504 |
+
_candidates = sorted(DB_DIR.glob("*.duckdb"), key=lambda p: p.stat().st_mtime, reverse=True)
|
| 505 |
+
_db_path = _candidates[0] if _candidates else None
|
| 506 |
+
|
| 507 |
+
if not _db_path or not Path(_db_path).exists():
|
| 508 |
+
stream_placeholder.error("No dataset found. Please re-upload the Excel file in this session.")
|
| 509 |
+
else:
|
| 510 |
+
# Call agent lazily
|
| 511 |
+
get_schema_summary, make_llm, answer_question = _lazy_imports()
|
| 512 |
+
try:
|
| 513 |
+
try:
|
| 514 |
+
con2 = duckdb.connect(str(_db_path), read_only=True)
|
| 515 |
+
except Exception:
|
| 516 |
+
con2 = duckdb.connect(str(_db_path))
|
| 517 |
+
|
| 518 |
+
schema_text = get_schema_summary(con2, allowed_schemas=["main"])
|
| 519 |
+
llm = make_llm(model=os.environ.get("OPENAI_MODEL", "gpt-4o-mini"), temperature=0.0)
|
| 520 |
+
|
| 521 |
+
import inspect as _inspect
|
| 522 |
+
_sig = None
|
| 523 |
+
try:
|
| 524 |
+
_sig = _inspect.signature(answer_question)
|
| 525 |
+
except Exception:
|
| 526 |
+
_sig = None
|
| 527 |
+
|
| 528 |
+
def _call_answer():
|
| 529 |
+
try:
|
| 530 |
+
if _sig and "history" in _sig.parameters and "token_callback" in _sig.parameters and "stream" in _sig.parameters:
|
| 531 |
+
return answer_question(con2, llm, schema_text, user_q, stream=True, token_callback=on_token, history=st.session_state.chat_history)
|
| 532 |
+
elif _sig and "token_callback" in _sig.parameters and "stream" in _sig.parameters:
|
| 533 |
+
return answer_question(con2, llm, schema_text, user_q, stream=True, token_callback=on_token)
|
| 534 |
+
else:
|
| 535 |
+
return answer_question(con2, llm, schema_text, user_q)
|
| 536 |
+
except TypeError:
|
| 537 |
+
try:
|
| 538 |
+
return answer_question(con2, llm, schema_text, user_q, stream=True, token_callback=on_token)
|
| 539 |
+
except TypeError:
|
| 540 |
+
return answer_question(con2, llm, schema_text, user_q)
|
| 541 |
+
|
| 542 |
+
result = _call_answer()
|
| 543 |
+
con2.close()
|
| 544 |
+
|
| 545 |
+
# Finalize text
|
| 546 |
+
answer_text = result.get("answer") or "".join(partial_chunks) or "*No answer produced.*"
|
| 547 |
+
stream_placeholder.markdown(answer_text)
|
| 548 |
+
|
| 549 |
+
# Show plot next (slightly larger, below the text)
|
| 550 |
+
plot_path = result.get("plot_path")
|
| 551 |
+
if plot_path:
|
| 552 |
+
if not os.path.isabs(plot_path):
|
| 553 |
+
plot_path = str((PRIMARY_DIR / plot_path).resolve())
|
| 554 |
+
if os.path.exists(plot_path):
|
| 555 |
+
plot_placeholder.image(plot_path, caption="Chart", width=560)
|
| 556 |
+
|
| 557 |
+
# Finally show SQL
|
| 558 |
+
gen_sql = (result.get("sql") or "").strip()
|
| 559 |
+
if gen_sql:
|
| 560 |
+
with st.expander("View generated SQL", expanded=False):
|
| 561 |
+
st.markdown(f"<div class='sqlbox'>{gen_sql}</div>", unsafe_allow_html=True)
|
| 562 |
+
|
| 563 |
+
# Persist assistant message
|
| 564 |
+
st.session_state.chat_history.append({
|
| 565 |
+
"role": "assistant",
|
| 566 |
+
"content": answer_text,
|
| 567 |
+
"sql": gen_sql,
|
| 568 |
+
"plot_path": result.get("plot_path")
|
| 569 |
+
})
|
| 570 |
+
finally:
|
| 571 |
+
try:
|
| 572 |
+
con2.close()
|
| 573 |
+
except Exception:
|
| 574 |
+
pass
|