arthikrangan committed on
Commit
f7021ae
·
verified ·
1 Parent(s): 8c57ffe

Upload 5 files

Files changed (5)
  1. README.md +161 -14
  2. duckdb_react_agent.py +628 -0
  3. excel_to_duckdb.py +595 -0
  4. requirements.txt +13 -3
  5. streamlit_app.py +574 -0
README.md CHANGED
@@ -1,19 +1,166 @@
1
  ---
2
- title: Data Analysis Agent
3
- emoji: 🚀
4
- colorFrom: red
5
- colorTo: red
6
- sdk: docker
7
- app_port: 8501
8
- tags:
9
- - streamlit
10
- pinned: false
11
- short_description: Streamlit template space
12
  ---
13
 
14
- # Welcome to Streamlit!
15
 
16
- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
17
 
18
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
19
- forums](https://discuss.streamlit.io).
 
1
+ # Data Analysis Agent (Streamlit + LangGraph ReAct Agent)
2
+
3
+ A lightweight Streamlit application that ingests **Excel (.xlsx)** files into a local **DuckDB** database and lets you **chat with your data**. It uses a **LangGraph ReAct-style agent** to generate DuckDB SQL from natural language, refine on errors, and optionally render plots. Answers stream live, are aware of chat history, and show the generated SQL in an expander for transparency.
4
+
5
  ---
6
+
7
+ ## Highlights
8
+
9
+ - **One‑click ingestion**: Excel → DuckDB using `excel_to_duckdb.py` (keeps ingestion logs).
10
+ - **Overview & preview**: Human‑friendly overview and quick table previews before chatting.
11
+ - **Chat with your dataset**: Natural‑language questions → SQL → answer (+ optional chart).
12
+ - **Streaming**: Answers appear token‑by‑token (“Answer pending…” shows immediately).
13
+ - **History‑aware**: Follow‑ups consider prior Q&A context for better intent & filters.
14
+ - **Conservative plotting**: Only plots when useful (trends/many rows or explicitly asked).
15
+ - **Clean UI**: Chart appears **below** the answer and **before** the generated SQL.
16
+ - **Session isolation**: Uploading a new file **clears** overview, previews, and chat.
17
+ - **Local by default**: DuckDB DB & plots are saved locally (`./dbs`, `./plots`).
18
+
19
+ ---
20
+
21
+ ## Repo Structure
22
+
23
+ ```
24
+ streamlit_app.py # Streamlit app (UI, flow, chat)
25
+ duckdb_react_agent.py # LangGraph ReAct agent (SQL generation, refine, answer, plots)
26
+ excel_to_duckdb.py # Excel → DuckDB ingestion script (unchanged behavior)
27
+ uploads/ # Uploaded Excel files (created at runtime)
28
+ dbs/ # DuckDB files per upload (created at runtime)
29
+ plots/ # Charts generated by the agent (created at runtime)
30
+ ```
31
+
32
+ > **Note:** The app intentionally skips internal tables (e.g., `__excel_*`) when discovering “user tables.” Those tables hold ingestion metadata and previews.
33
+
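The "user table" discovery rule is simple: anything whose name starts with `__` (or a DuckDB/SQLite internal prefix) is hidden from the chat agent. A minimal sketch of that predicate, mirroring the filter used in `get_schema_summary()`:

```python
def is_user_table(name: str) -> bool:
    """Mirror of the app's filter: ingestion metadata and previews are
    prefixed with '__'; engine internals use 'duckdb_'/'sqlite_' prefixes."""
    return not name.startswith(("__", "duckdb_", "sqlite_"))

tables = ["sales_q1", "__excel_tables", "__excel_schema", "customers"]
user_tables = [t for t in tables if is_user_table(t)]
# user_tables == ["sales_q1", "customers"]
```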
34
  ---
35
 
36
+ ## Requirements
37
+
38
+ - Python 3.10+
39
+ - Recommended packages:
40
+ ```bash
41
+ pip install streamlit duckdb pandas matplotlib python-dotenv langgraph langchain langchain-openai
42
+ ```
43
+ - **OpenAI** API key in environment or `.env` file:
44
+ ```dotenv
45
+ OPENAI_API_KEY=sk-...
46
+ # optional: model override (defaults to gpt-4o-mini in app; gpt-4o in CLI agent)
47
+ OPENAI_MODEL=gpt-4o-mini
48
+ ```
49
+
50
+ If your ingestion or preview relies on Excel parsing features, make sure you have:
51
+ ```bash
52
+ pip install openpyxl
53
+ ```
54
+
55
+ ---
56
+
57
+ ## Quick Start
58
+
59
+ 1. Place the three files side‑by‑side:
60
+ ```
61
+ streamlit_app.py
62
+ duckdb_react_agent.py
63
+ excel_to_duckdb.py
64
+ ```
65
+
66
+ 2. Set your API key (env var or `.env`):
67
+ ```bash
68
+ export OPENAI_API_KEY=sk-... # macOS/Linux
69
+ setx OPENAI_API_KEY "sk-..." # Windows (takes effect in new sessions)
70
+ ```
71
+
72
+ 3. Run the app:
73
+ ```bash
74
+ streamlit run streamlit_app.py
75
+ ```
76
+
77
+ 4. In the UI:
78
+ - **Upload** an `.xlsx` file → watch the **Processing logs**.
79
+ - Once ingestion finishes, you’ll see the **overview** and **quick previews**.
80
+ - Scroll down to **Chat with your dataset** and start asking questions.
81
+ - Each bot reply shows: **answer text**, optional **chart**, and an expander for **SQL**.
82
+
83
+ > **New upload behavior:** As soon as you upload a different file, the app **clears** the previous overview, table previews, and chat, then ingests the new file.
84
+
85
+ ---
86
+
87
+ ## How It Works
88
+
89
+ ### 1) Ingestion (Excel → DuckDB)
90
+ - The app shells out to `excel_to_duckdb.py`:
91
+ - Detects multiple tables per sheet (using blank rows as separators), handles merged headers, ignores pivot/chart‑like blocks, and writes one DuckDB table per block.
92
+ - Creates helpful metadata tables (e.g., `__excel_tables`, `__excel_schema`).
93
+ - The app then opens the DuckDB file and builds a **schema snapshot** for the agent.
94
+ - **User tables**: The app queries `information_schema` to list the “real” user tables (skips internals like `__excel_*`).
95
+
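The full detector in `excel_to_duckdb.py` also handles merged headers and pivot/chart-like blocks; the core blank-row rule it builds on can be sketched like this (simplified, illustrative only):

```python
from typing import List

Row = List[object]

def split_blocks(rows: List[Row]) -> List[List[Row]]:
    """Split a sheet's rows into table blocks, treating fully blank rows
    as separators -- a simplified version of the ingestion heuristic."""
    blocks: List[List[Row]] = []
    current: List[Row] = []
    for row in rows:
        if all(cell in (None, "") for cell in row):
            if current:            # close the block at the first blank row
                blocks.append(current)
                current = []
        else:
            current.append(row)
    if current:                    # flush a trailing block
        blocks.append(current)
    return blocks

sheet = [
    ["Region", "Sales"], ["East", 100], ["West", 90],
    ["", ""],                      # blank separator row
    ["Product", "Units"], ["A", 5],
]
print(len(split_blocks(sheet)))   # 2 blocks detected
```

This is why the troubleshooting section asks for at least one empty row between blocks: without it, adjacent tables merge into a single block.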
96
+ ### 2) LangGraph ReAct Agent (DuckDB Q&A)
97
+ - **ReAct loop**: Draft a SELECT query → run → if error, **refine up to 3×** with the error and prior SQL.
98
+ - **Safety**: Only **SELECT** queries are permitted.
99
+ - **Streaming answers**: Final natural‑language answer streams to the UI.
100
+ - **History awareness**: Recent Q&A is summarized and fed to the LLM for better follow‑ups.
101
+ - **Plotting (optional)**: If the result truly benefits from visuals (or you ask for one), a chart is generated and saved to `./plots`; the app then displays it **below** the answer.
102
+
103
+ > The agent avoids revealing plot file names or paths. Answers refer generically to “the chart” and the UI renders the image inline.
104
+
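The SELECT-only safety check is a small syntactic gate applied before any query runs. It can be sketched as follows (this mirrors the guard in `run_duckdb_query` inside `duckdb_react_agent.py`, which also admits `WITH ... SELECT` CTEs):

```python
import re

def assert_select_only(sql: str) -> None:
    """Reject anything that is not a SELECT (or a WITH ... SELECT CTE)."""
    first = re.split(r"\s+", sql.strip(), maxsplit=1)[0].upper()
    if first != "SELECT" and not sql.strip().upper().startswith("WITH "):
        raise ValueError("Only SELECT queries are allowed.")

assert_select_only("SELECT 1")                              # passes
assert_select_only("WITH t AS (SELECT 1) SELECT * FROM t")  # passes
try:
    assert_select_only("DROP TABLE sales")
except ValueError as e:
    print(e)  # Only SELECT queries are allowed.
```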
105
+ ---
106
+
107
+ ## Configuration & Tweaks
108
+
109
+ - **Models**: Set `OPENAI_MODEL` in your env to change the LLM (`gpt-4o-mini`, `gpt-4o`, etc.).
110
+ - **Plot policy**: The agent is conservative by default. To make it more/less eager to chart, tune the **viz prompt** or the **row‑count threshold** in `duckdb_react_agent.py` (`many_rows = df.shape[0] > 20`).
111
+ - **Chart size**: In `streamlit_app.py`, the plot width is ~**560px** for answers and ~**320‑360px** in history; adjust to taste.
112
+ - **Schemas**: The schema summary focuses on `main`. To include more schemas, extend the `allowed_schemas` list where `get_schema_summary()` is called.
113
+
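The final plotting gate in `node_make_plot` combines the LLM's suggestion with a keyword check and the row-count threshold; the decision can be sketched as (function name illustrative):

```python
def should_plot(question: str, n_rows: int, llm_says_plot: bool) -> bool:
    """Even when the viz prompt suggests a chart, skip it unless the user
    asked for one or the result is large. Tune the keyword list and the
    20-row threshold to make the agent more or less eager to chart."""
    keywords = ("chart", "plot", "graph", "visual", "visualize", "trend")
    explicit = any(k in question.lower() for k in keywords)
    return llm_says_plot and (explicit or n_rows > 20)

print(should_plot("total revenue?", 1, True))         # False: tiny result, no ask
print(should_plot("plot monthly revenue", 12, True))  # True: explicitly requested
```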
114
+ ---
115
+
116
+ ## Troubleshooting
117
+
118
+ **“No user tables were discovered. Showing ingestion metadata…”**
119
+ - Your Excel may not have recognizable data blocks under the current heuristics. Check the **Processing logs** and the `__excel_tables` preview to verify what got detected.
120
+ - If it’s a header/merged‑cell issue, revisit your spreadsheet (ensure at least one empty row between blocks).
121
+
122
+ **Agent errors / SQL fails repeatedly**
123
+ - The agent refines up to 3×. If it still fails, it returns the error. Try asking the question more explicitly (table names, time ranges, filters).
124
+
125
+ **Streamlit cache / stale code**
126
+ - If updates don’t show up, restart Streamlit or clear cache:
127
+ ```bash
128
+ streamlit cache clear
129
+ ```
130
+
131
+ **Windows console encoding warnings**
132
+ - The app sets `PYTHONIOENCODING=utf-8` for ingestion logs. If you still see encoding issues, ensure system locale supports UTF‑8.
133
+
134
+ ---
135
+
136
+ ## Security & Privacy Notes
137
+
138
+ - **Local files**: DuckDB files and plots are stored locally by default (`./dbs`, `./plots`).
139
+ - **What goes to the LLM**: The agent sends the **schema snapshot** and a **small result preview** (first rows) along with your **question** and a compact **history summary**—not the full dataset. For stricter controls, reduce or strip the preview passed to the LLM in `duckdb_react_agent.py`.
140
+
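The preview that reaches the LLM is bounded twice: `node_run_sql` keeps only the first 50 rows, and `node_answer` caps the serialized JSON at roughly 4,000 characters. A sketch of that pipeline (function name illustrative) showing where to tighten the limits:

```python
import json

def llm_preview(rows: list, max_rows: int = 50, max_chars: int = 4000) -> str:
    """Build the compact result preview sent to the LLM: first rows only,
    serialized to JSON and character-capped. Shrink max_rows/max_chars
    for stricter data-sharing controls."""
    text = json.dumps(rows[:max_rows], default=str)
    if len(text) > max_chars:
        text = text[:max_chars] + " ..."
    return text

rows = [{"region": "East", "sales": 100}] * 200
preview = llm_preview(rows, max_rows=5)
print(len(json.loads(preview)))  # 5 rows, not the full dataset
```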
141
+ ---
142
+
143
+ ## Extending the App
144
+
145
+ - **CSV support**: Add a CSV branch next to the Excel ingest that writes to the same DuckDB.
146
+ - **Persisted chat**: Save `chat_history` to a file keyed by DB name to restore on reload.
147
+ - **Role‑specific prompts**: Swap in system prompts for financial analysis, marketing analytics, etc.
148
+ - **Auth & multi‑user**: Gate uploads and route per‑user databases under a workspace prefix.
149
+
150
+ ---
151
+
152
+ ## CLI (Agent Only)
153
+
154
+ You can also run the agent from the command line for ad‑hoc Q&A:
155
+
156
+ ```bash
157
+ python duckdb_react_agent.py --duckdb ./dbs/mydata.duckdb --stream
158
+ ```
159
+
160
+ Type your question; the agent prints streaming answers and shows where charts were saved (the Streamlit app displays them inline).
161
+
162
+ ---
163
 
164
+ ## License
165
 
166
+ Proprietary / internal. Adapt as needed for your environment.
 
duckdb_react_agent.py ADDED
@@ -0,0 +1,628 @@
1
+ """
2
+ DuckDB Q&A Agent (LangGraph + ReAct) — Streaming Edition
3
+ ========================================================
4
+
5
+ Adds:
6
+ - Streaming responses for the final natural-language answer (CLI prints tokens live).
7
+ - Stricter schema discovery focused on *user* tables from the provided DB (defaults to schema 'main').
8
+ - Keeps SELECT-only safety, up to 3 SQL refinements on error, optional plotting saved to ./plots.
9
+
10
+ Quick start
11
+ -----------
12
+ 1) Install deps (example):
13
+ pip install --upgrade duckdb pandas matplotlib python-dotenv langgraph langchain langchain-openai
14
+
15
+ 2) Ensure an OpenAI API key is available:
16
+ - Put it in a .env file as: OPENAI_API_KEY=sk-...
17
+ - Or set an env var: set OPENAI_API_KEY=... (Windows) / export OPENAI_API_KEY=... (macOS/Linux)
18
+
19
+ 3) Run the agent in an interactive loop:
20
+ python duckdb_react_agent.py --duckdb path/to/your.db --stream
21
+
22
+ Notes
23
+ -----
24
+ - Targets DuckDB SQL. The LLM prompt instructs the model to write *only* a SELECT statement.
25
+ - ReAct loop: on error, the LLM sees the error message + previous SQL and attempts to fix it (max 3 tries).
26
+ - The agent avoids DDL/DML; SELECT-only for safety.
27
+ - Plots are saved to ./plots/ as PNG files (no GUI required). The script uses a non-interactive backend.
28
+ - Internal/helper tables beginning with "__" (e.g., ingestion metadata) and non-user schemas are ignored.
29
+ """
30
+
31
+ import os
32
+ import re
33
+ import json
34
+ import uuid
35
+ import argparse
36
+ from typing import Any, Dict, List, Optional, TypedDict, Callable
37
+
38
+ # Non-GUI backend for saving plots on servers/Windows terminals
39
+ import matplotlib
40
+ matplotlib.use("Agg")
41
+ import matplotlib.pyplot as plt
42
+
43
+ import duckdb
44
+ import pandas as pd
45
+
46
+ from dotenv import load_dotenv, find_dotenv
47
+
48
+ # LangChain / LangGraph
49
+ from langchain_openai import ChatOpenAI
50
+ from langchain_core.messages import SystemMessage, HumanMessage
51
+ from langgraph.graph import StateGraph, END
52
+
53
+
54
+ # -----------------------------
55
+ # Utility: safe number formatting
56
+ # -----------------------------
57
+ def format_number(x: Any) -> str:
58
+ try:
59
+ if isinstance(x, (int, float)) and not isinstance(x, bool):
60
+ if abs(x) >= 1000 and abs(x) < 1e12:
61
+ return f"{x:,.2f}".rstrip('0').rstrip('.')
62
+ else:
63
+ return f"{x:.4g}" if isinstance(x, float) else str(x)
64
+ return str(x)
65
+ except Exception:
66
+ return str(x)
67
+
68
+
69
+ # -----------------------------
70
+ # Schema introspection from DuckDB (user tables only)
71
+ # -----------------------------
72
+ def get_schema_summary(con: duckdb.DuckDBPyConnection, sample_rows: int = 3, include_views: bool = True,
73
+ allowed_schemas: Optional[List[str]] = None) -> str:
74
+ """
75
+ Build a compact schema snapshot for user tables in the *provided DB*.
76
+ By default we only list tables/views in schema 'main' (user content) and skip internals.
77
+ """
78
+ if allowed_schemas is None:
79
+ allowed_schemas = ["main"]
80
+
81
+ type_filter = "('BASE TABLE')" if not include_views else "('BASE TABLE','VIEW')"
82
+
83
+ tables_df = con.execute(
84
+ f"""
85
+ SELECT table_schema, table_name, table_type
86
+ FROM information_schema.tables
87
+ WHERE table_type IN {type_filter}
88
+ AND table_schema IN ({','.join(['?']*len(allowed_schemas))})
89
+ AND table_name NOT LIKE 'duckdb_%'
90
+ AND table_name NOT LIKE 'sqlite_%'
91
+ ORDER BY table_schema, table_name
92
+ """,
93
+ allowed_schemas
94
+ ).fetchdf()
95
+
96
+ lines: List[str] = []
97
+ for _, row in tables_df.iterrows():
98
+ schema = row["table_schema"]
99
+ name = row["table_name"]
100
+ if name.startswith("__"):
101
+ # Common ingestion metadata; not user-facing
102
+ continue
103
+
104
+ cols = con.execute(
105
+ """
106
+ SELECT column_name, data_type
107
+ FROM information_schema.columns
108
+ WHERE table_schema = ? AND table_name = ?
109
+ ORDER BY ordinal_position
110
+ """,
111
+ [schema, name],
112
+ ).fetchdf()
113
+
114
+ lines.append(f"TABLE {schema}.{name}")
115
+ if len(cols) == 0:
116
+ lines.append(" (no columns discovered)")
117
+ continue
118
+
119
+ # Sample small set of values per column to guide the LLM
120
+ try:
121
+ sample = con.execute(f"SELECT * FROM {schema}.{name} LIMIT {sample_rows}").fetchdf()
122
+ except Exception:
123
+ sample = pd.DataFrame(columns=cols["column_name"].tolist())
124
+
125
+ for _, c in cols.iterrows():
126
+ col = c["column_name"]
127
+ dtype = c["data_type"]
128
+ examples: List[str] = []
129
+ if col in sample.columns:
130
+ for v in sample[col].tolist():
131
+ examples.append(format_number(v)[:80])
132
+ example_str = ", ".join(examples) if examples else ""
133
+ lines.append(f" - {col} :: {dtype} e.g. [{example_str}]")
134
+ lines.append("")
135
+
136
+ if not lines:
137
+ lines.append("(No user tables discovered in schema(s): " + ", ".join(allowed_schemas) + ")")
138
+
139
+ return "\n".join(lines)
140
+
141
+
142
+ # -----------------------------
143
+ # LLM helpers
144
+ # -----------------------------
145
+ def make_llm(model: str = "gpt-4o-mini", temperature: float = 0.0) -> ChatOpenAI:
146
+ return ChatOpenAI(model=model, temperature=temperature)
147
+
148
+
149
+ SQL_SYSTEM_PROMPT = """You are an expert DuckDB SQL writer. You will receive:
150
+ - A database schema (tables and columns) with a few sample values
151
+ - A user's natural-language question
152
+
153
+ Write exactly ONE valid DuckDB SQL SELECT statement to answer the question.
154
+ Rules:
155
+ - Output ONLY the SQL (no backticks, no explanation, no comments)
156
+ - Use DuckDB SQL
157
+ - Never write DDL/DML (CREATE/INSERT/UPDATE/DELETE), only SELECT statements
158
+ - Prefer explicit column names, avoid SELECT *
159
+ - If time grouping is needed, use date_trunc('month', col), date_trunc('day', col), etc.
160
+ - If casting is needed, use CAST(... AS TYPE)
161
+ - Use ONLY the listed user tables from the provided database (e.g., schema 'main'). Do not rely on external/attached sources.
162
+ - Avoid tables starting with "__" unless explicitly referenced.
163
+ - If uncertain, make a reasonable assumption and produce the best possible query
164
+
165
+ Be careful with joins and filters. Ensure SQL parses successfully.
166
+ """
167
+
168
+ REFINE_SYSTEM_PROMPT = """You are fixing a DuckDB SQL query that errored. You'll see:
169
+ - the user's question
170
+ - the previous SQL
171
+ - the exact error message
172
+
173
+ Return a corrected DuckDB SELECT statement that addresses the error.
174
+ Rules:
175
+ - Output ONLY the SQL (no backticks, no explanation, no comments)
176
+ - SELECT-only (no DDL/DML)
177
+ - Keep the intent of the question intact
178
+ - Fix joins, field names, casts, aggregations or date functions as needed
179
+ - Use ONLY the listed user tables from the provided database (e.g., schema 'main')
180
+ """
181
+
182
+ ANSWER_SYSTEM_PROMPT = """You are a helpful data analyst. Given a user's question and the query result data,
183
+ write a clear, concise, conversational answer in plain English. Use bullet points or short paragraphs.
184
+ - Format large numbers with thousands separators where appropriate.
185
+ - If percentages are present, include % signs.
186
+ - If a chart/image accompanies the answer, DO NOT mention file paths or filenames. Refer generically, e.g., 'as depicted in the chart' or 'see the chart alongside this answer.'
187
+ - If the result is empty, say so and suggest a next step.
188
+ - Do not invent columns or values that are not in the result.
189
+ - (Optional) You may refer to prior Q&A context if it informs this answer.
190
+ """
191
+
192
+ VIZ_SYSTEM_PROMPT = """You will decide whether a chart would help answer the question using the SQL result.
193
+ Return STRICT JSON with keys: {"make_plot": bool, "chart": "line|bar|scatter|hist|box|pie", "x": "<col or null>", "y": "<col or null>", "series": "<col or null>", "agg": "<sum|avg|count|max|min|null>", "reason": "<short reason>"}.
194
+ Conservative rules (prefer NO plot):
195
+ - Only set make_plot=true if the user explicitly asks for a chart/visualization OR the result involves trends over time, distributions, or comparisons with many rows (e.g., > 20) where a chart adds clarity.
196
+ - For simple aggregates or few rows (<= 10), set make_plot=false unless explicitly requested.
197
+ Guidelines when make_plot=true:
198
+ - Prefer line for trends over time
199
+ - Prefer bar for ranking/comparison of categories
200
+ - Prefer scatter for correlation between two numeric columns
201
+ - Prefer hist for distribution of a single numeric column
202
+ - Prefer box for distribution with quartiles per category
203
+ - Prefer pie sparingly for simple part-to-whole
204
+ JSON only. No prose.
205
+ """
206
+
207
+
208
+ def extract_sql(text: str) -> str:
209
+ """Extract SQL from potential LLM output. We expect raw SQL with no fences, but be defensive."""
210
+ text = text.strip()
211
+ fence = re.compile(r"```(?:sql)?(.*?)```", re.DOTALL | re.IGNORECASE)
212
+ m = fence.search(text)
213
+ if m:
214
+ return m.group(1).strip()
215
+ return text
216
+
217
+
218
+ # -----------------------------
219
+ # LangGraph State
220
+ # -----------------------------
221
+ class AgentState(TypedDict):
222
+ question: str
223
+ schema: str
224
+ attempts: int
225
+ sql: Optional[str]
226
+ error: Optional[str]
227
+ result_json: Optional[str] # JSON records preview
228
+ result_columns: Optional[List[str]]
229
+ plot_path: Optional[str]
230
+ final_answer: Optional[str]
231
+ _result_df: Optional[pd.DataFrame]
232
+ _viz_spec: Optional[Dict[str, Any]]
233
+ _stream: bool
234
+ _token_cb: Optional[Callable[[str], None]]
235
+
236
+
237
+ # -----------------------------
238
+ # Nodes: SQL drafting, execution, refinement, viz, answer
239
+ # -----------------------------
240
+ def node_draft_sql(state: AgentState, llm: ChatOpenAI) -> AgentState:
241
+ msgs = [
242
+ SystemMessage(content=SQL_SYSTEM_PROMPT),
243
+ HumanMessage(content=f"SCHEMA:\n{state['schema']}\n\nQUESTION:\n{state['question']}"),
244
+ ]
245
+ resp = llm.invoke(msgs) # type: ignore
246
+ sql = extract_sql(resp.content or "")
247
+ return {**state, "sql": sql, "error": None}
248
+
249
+
250
+ def run_duckdb_query(con: duckdb.DuckDBPyConnection, sql: str) -> pd.DataFrame:
251
+ first_token = re.split(r"\s+", sql.strip(), maxsplit=1)[0].upper()
252
+ if first_token != "SELECT" and not sql.strip().upper().startswith("WITH "):
253
+ raise ValueError("Only SELECT queries are allowed.")
254
+ df = con.execute(sql).fetchdf()
255
+ return df
256
+
257
+
258
+ def node_run_sql(state: AgentState, con: duckdb.DuckDBPyConnection) -> AgentState:
259
+ try:
260
+ df = run_duckdb_query(con, state["sql"] or "")
261
+ preview = df.head(50).to_dict(orient="records")
262
+ return {
263
+ **state,
264
+ "error": None,
265
+ "result_json": json.dumps(preview, default=str),
266
+ "result_columns": list(df.columns),
267
+ "_result_df": df,
268
+ }
269
+ except Exception as e:
270
+ return {**state, "error": str(e), "result_json": None, "result_columns": None, "_result_df": None}
271
+
272
+
273
+ def node_refine_sql(state: AgentState, llm: ChatOpenAI) -> AgentState:
274
+ if state.get("attempts", 0) >= 3:
275
+ return state
276
+ msgs = [
277
+ SystemMessage(content=REFINE_SYSTEM_PROMPT),
278
+ HumanMessage(
279
+ content=(
280
+ f"QUESTION:\n{state['question']}\n\n"
281
+ f"PREVIOUS SQL:\n{state.get('sql','')}\n\n"
282
+ f"ERROR:\n{state.get('error','')}"
283
+ )
284
+ ),
285
+ ]
286
+ resp = llm.invoke(msgs) # type: ignore
287
+ sql = extract_sql(resp.content or "")
288
+ return {**state, "sql": sql, "attempts": state.get("attempts", 0) + 1, "error": None}
289
+
290
+
291
+ def node_decide_viz(state: AgentState, llm: ChatOpenAI) -> AgentState:
292
+ if not state.get("result_json"):
293
+ return state
294
+
295
+ result_preview = state["result_json"]
296
+ msgs = [
297
+ SystemMessage(content=VIZ_SYSTEM_PROMPT),
298
+ HumanMessage(
299
+ content=(
300
+ f"QUESTION:\n{state['question']}\n\n"
301
+ f"COLUMNS: {state.get('result_columns',[])}\n"
302
+ f"RESULT PREVIEW (first rows):\n{result_preview}"
303
+ )
304
+ ),
305
+ ]
306
+ resp = llm.invoke(msgs) # type: ignore
307
+
308
+ spec = {"make_plot": False}
309
+ try:
310
+ spec = json.loads(resp.content) # type: ignore
311
+ if not isinstance(spec, dict) or "make_plot" not in spec:
312
+ spec = {"make_plot": False}
313
+ except Exception:
314
+ spec = {"make_plot": False}
315
+
316
+ return {**state, "_viz_spec": spec}
317
+
318
+
319
+ def node_make_plot(state: AgentState) -> AgentState:
320
+ spec = state.get("_viz_spec") or {}
321
+ if not spec or not spec.get("make_plot"):
322
+ return state
323
+
324
+ df: Optional[pd.DataFrame] = state.get("_result_df")
325
+ if df is None or df.empty:
326
+ return state
327
+
328
+ # Additional guard: avoid trivial plots unless explicitly requested
329
+ q_lower = (state.get('question') or '').lower()
330
+ explicit_viz = any(k in q_lower for k in ['chart', 'plot', 'graph', 'visual', 'visualize', 'trend'])
331
+ many_rows = df.shape[0] > 20
332
+ if not explicit_viz and not many_rows:
333
+ return state
334
+
335
+ x = spec.get("x")
336
+ y = spec.get("y")
337
+ series = spec.get("series")
338
+ chart = (spec.get("chart") or "").lower()
339
+
340
+ def col_ok(c: Optional[str]) -> bool:
341
+ return isinstance(c, str) and c in df.columns
342
+
343
+ if not col_ok(x) and df.shape[1] >= 1:
344
+ x = df.columns[0]
345
+ if not col_ok(y) and df.shape[1] >= 2:
346
+ y = df.columns[1]
347
+
348
+ try:
349
+ os.makedirs("plots", exist_ok=True)
350
+ fig = plt.figure()
351
+ ax = fig.gca()
352
+
353
+ if chart == "line":
354
+ if series and col_ok(series):
355
+ for k, g in df.groupby(series):
356
+ ax.plot(g[x], g[y], label=str(k))
357
+ ax.legend(loc="best")
358
+ else:
359
+ ax.plot(df[x], df[y])
360
+ ax.set_xlabel(str(x)); ax.set_ylabel(str(y))
361
+
362
+ elif chart == "bar":
363
+ if series and col_ok(series):
364
+ pivot = df.pivot_table(index=x, columns=series, values=y, aggfunc="sum", fill_value=0)
365
+ pivot.plot(kind="bar", ax=ax)
366
+ ax.set_xlabel(str(x)); ax.set_ylabel(str(y))
367
+ else:
368
+ ax.bar(df[x], df[y])
369
+ ax.set_xlabel(str(x)); ax.set_ylabel(str(y))
370
+ fig.autofmt_xdate(rotation=45)
371
+
372
+ elif chart == "scatter":
373
+ ax.scatter(df[x], df[y])
374
+ ax.set_xlabel(str(x)); ax.set_ylabel(str(y))
375
+
376
+ elif chart == "hist":
377
+ ax.hist(df[y] if col_ok(y) else df[x], bins=30)
378
+ ax.set_xlabel(str(y if col_ok(y) else x)); ax.set_ylabel("Frequency")
379
+
380
+ elif chart == "box":
381
+ ax.boxplot([df[y]] if col_ok(y) else [df[x]])
382
+ ax.set_ylabel(str(y if col_ok(y) else x))
383
+
384
+ elif chart == "pie":
385
+ labels = df[x].astype(str).tolist()
386
+ values = df[y].tolist()
387
+ ax.pie(values, labels=labels, autopct="%1.1f%%")
388
+ ax.axis("equal")
389
+
390
+ else:
391
+ plt.close(fig)
392
+ return state
393
+
394
+ out_path = os.path.join("plots", f"duckdb_answer_{uuid.uuid4().hex[:8]}.png")
395
+ fig.tight_layout()
396
+ fig.savefig(out_path, dpi=150)
397
+ plt.close(fig)
398
+ return {**state, "plot_path": out_path}
399
+
400
+ except Exception:
401
+ return state
402
+
403
+
404
+ def _stream_llm_answer(llm: ChatOpenAI, msgs: List[Any], token_cb: Optional[Callable[[str], None]]) -> str:
405
+ """Stream tokens for the final answer; returns full text as well."""
406
+ out = ""
407
+ try:
408
+ for chunk in llm.stream(msgs): # type: ignore
409
+ piece = getattr(chunk, "content", None)
410
+ if not piece and hasattr(chunk, "message") and getattr(chunk, "message", None):
411
+ piece = getattr(chunk.message, "content", "")
412
+ if not piece:
413
+ continue
414
+ out += piece
415
+ if token_cb:
416
+ token_cb(piece)
417
+ except Exception:
418
+ # Fallback to non-streaming if stream not supported
419
+ resp = llm.invoke(msgs) # type: ignore
420
+ out = resp.content if isinstance(resp.content, str) else str(resp.content)
421
+ if token_cb:
422
+ token_cb(out)
423
+ return out
424
+
425
+
426
+ def node_answer(state: AgentState, llm: ChatOpenAI) -> AgentState:
427
+ preview = state.get("result_json") or "[]"
428
+ columns = state.get("result_columns") or []
429
+ plot_path = state.get("plot_path")
430
+ if len(preview) > 4000:
431
+ preview = preview[:4000] + " ..."
432
+
433
+ msgs = [
434
+ SystemMessage(content=ANSWER_SYSTEM_PROMPT),
435
+ HumanMessage(
436
+ content=(
437
+ f"QUESTION:\n{state['question']}\n\n"
438
+ f"COLUMNS: {columns}\n"
439
+ f"RESULT PREVIEW (rows):\n{preview}\n\n"
440
+ f"PLOT_PATH: {plot_path if plot_path else 'None'}"
441
+ )
442
+ ),
443
+ ]
444
+
445
+ if state.get("error") and not state.get("result_json"):
446
+ err_text = (
447
+ "I couldn't produce a working SQL query after 3 attempts.\n\n"
448
+ "Details:\n" + (state.get("error") or "Unknown error")
449
+ )
450
+ # Stream the error too for consistency
451
+ if state.get("_stream") and state.get("_token_cb"):
452
+ state["_token_cb"](err_text)
453
+ return {**state, "final_answer": err_text}
454
+
455
+ # Stream the final answer if requested
456
+ if state.get("_stream"):
457
+ answer = _stream_llm_answer(llm, msgs, state.get("_token_cb"))
458
+ else:
459
+ resp = llm.invoke(msgs) # type: ignore
460
+ answer = resp.content if isinstance(resp.content, str) else str(resp.content)
461
+
462
+ return {**state, "final_answer": answer}
463
+
464
+
465
+ # -----------------------------
466
+ # Graph assembly
467
+ # -----------------------------
468
+ def build_graph(con: duckdb.DuckDBPyConnection, llm: ChatOpenAI):
469
+ g = StateGraph(AgentState)
470
+
471
+ g.add_node("draft_sql", lambda s: node_draft_sql(s, llm))
472
+ g.add_node("run_sql", lambda s: node_run_sql(s, con))
473
+ g.add_node("refine_sql", lambda s: node_refine_sql(s, llm))
474
+ g.add_node("decide_viz", lambda s: node_decide_viz(s, llm))
475
+ g.add_node("make_plot", node_make_plot)
476
+ g.add_node("answer", lambda s: node_answer(s, llm))
477
+
478
+ g.set_entry_point("draft_sql")
479
+ g.add_edge("draft_sql", "run_sql")
480
+
481
+ def on_run_sql(state: AgentState):
482
+ if state.get("error"):
483
+ if state.get("attempts", 0) < 3:
484
+ return "refine_sql"
485
+ else:
486
+ return "answer"
487
+ return "decide_viz"
488
+
489
+ g.add_conditional_edges("run_sql", on_run_sql, {
490
+ "refine_sql": "refine_sql",
491
+ "decide_viz": "decide_viz",
492
+ "answer": "answer",
493
+ })
494
+
495
+ g.add_edge("refine_sql", "run_sql")
496
+
497
+ def on_decide_viz(state: AgentState):
498
+ spec = state.get("_viz_spec") or {}
499
+ if spec.get("make_plot"):
500
+ return "make_plot"
501
+ return "answer"
502
+
503
+ g.add_conditional_edges("decide_viz", on_decide_viz, {
504
+ "make_plot": "make_plot",
505
+ "answer": "answer",
506
+ })
507
+
508
+ g.add_edge("make_plot", "answer")
509
+ g.add_edge("answer", END)
510
+
511
+ return g.compile()
512
+
513
+
514
+ # -----------------------------
515
+ # Public function to ask a question
516
+ # -----------------------------
517
+ def answer_question(con: duckdb.DuckDBPyConnection, llm: ChatOpenAI, schema_text: str, question: str,
518
+ stream: bool = False, token_callback: Optional[Callable[[str], None]] = None) -> Dict[str, Any]:
519
+ app = build_graph(con, llm)
520
+ initial: AgentState = {
521
+ "question": question,
522
+ "schema": schema_text,
523
+ "attempts": 0,
524
+ "sql": None,
525
+ "error": None,
526
+ "result_json": None,
527
+ "result_columns": None,
528
+ "plot_path": None,
529
+ "final_answer": None,
530
+ "_result_df": None,
531
+ "_viz_spec": None,
532
+ "_stream": stream,
533
+ "_token_cb": token_callback,
534
+ }
535
+ final_state: AgentState = app.invoke(initial) # type: ignore
536
+
537
+ return {
538
+ "question": question,
539
+ "sql": final_state.get("sql"),
540
+ "answer": final_state.get("final_answer"),
541
+ "plot_path": final_state.get("plot_path"),
542
+ "error": final_state.get("error"),
543
+ }
544
+
545
+
546
+ # -----------------------------
547
+ # CLI
548
+ # -----------------------------
549
+ def main():
550
+ load_dotenv(find_dotenv())
551
+
552
+ parser = argparse.ArgumentParser(description="DuckDB Q&A ReAct Agent (Streaming)")
553
+ parser.add_argument("--duckdb", required=True, help="Path to DuckDB database file")
554
+ parser.add_argument("--model", default="gpt-4o", help="OpenAI chat model (default: gpt-4o)")
555
+ parser.add_argument("--stream", action="store_true", help="Stream the final answer to stdout")
556
+ parser.add_argument("--schemas", nargs="*", default=["main"], help="Schemas to include (default: main)")
557
+ args = parser.parse_args()
558
+
559
+ api_key = os.environ.get("OPENAI_API_KEY")
560
+ if not api_key:
561
+ print("ERROR: OPENAI_API_KEY not set. Put it in a .env file or export the env var.")
562
+ return
563
+
564
+ # Connect DuckDB strictly to the provided file (user DB)
565
+ try:
566
+ con = duckdb.connect(database=args.duckdb, read_only=True)
567
+ except Exception as e:
568
+ print(f"Failed to open DuckDB at {args.duckdb}: {e}")
569
+ return
570
+
571
+ # Introspect schema (user tables only)
572
+ schema_text = get_schema_summary(con, allowed_schemas=args.schemas)
573
+
574
+ # Init LLM; enable streaming capability (used only in final answer node)
575
+ llm = make_llm(model=args.model, temperature=0.0)
576
+
577
+ print("DuckDB Q&A Agent (ReAct, Streaming)\n")
578
+ print(f"Connected to: {args.duckdb}")
579
+ print(f"Schemas included: {', '.join(args.schemas)}")
580
+ print("\nSchema snapshot:\n-----------------\n")
581
+ print(schema_text)
582
+ print("\nType your question and press ENTER. Type 'exit' to quit.\n")
583
+
584
+ def print_token(t: str):
585
+ print(t, end="", flush=True)
586
+
587
+ while True:
588
+ try:
589
+ q = input("Q> ").strip()
590
+ except (EOFError, KeyboardInterrupt):
591
+ print("\nExiting.")
592
+ break
593
+ if q.lower() in ("exit", "quit"):
594
+ print("Goodbye.")
595
+ break
596
+ if not q:
597
+ continue
598
+
599
+ if args.stream:
600
+ print("\n--- ANSWER (streaming) ---")
601
+ result = answer_question(con, llm, schema_text, q, stream=True, token_callback=print_token)
602
+ print("") # newline after stream
603
+ if result.get("plot_path"):
604
+ print(f"\nChart saved to: {result['plot_path']}")
605
+ print("\n--- SQL ---")
606
+ print((result.get("sql") or "").strip())
607
+ if result.get("error"):
608
+ print("\nERROR: " + str(result.get("error")))
609
+ else:
610
+ result = answer_question(con, llm, schema_text, q, stream=False, token_callback=None)
611
+ print("\n--- SQL ---")
612
+ print((result.get("sql") or "").strip())
613
+ if result.get("error"):
614
+ print("\n--- RESULT ---")
615
+ print("Sorry, I couldn't resolve a working query after 3 attempts.")
616
+ print("Error: " + str(result.get("error")))
617
+ else:
618
+ print("\n--- ANSWER ---")
619
+ print(result.get("answer") or "")
620
+ if result.get("plot_path"):
621
+ print(f"\nChart saved to: {result['plot_path']}")
622
+ print("\n")
623
+
624
+ con.close()
625
+
626
+
627
+ if __name__ == "__main__":
628
+ main()
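
The `--stream` path above hands `print_token` to `answer_question` as a token callback. A minimal self-contained sketch of that callback pattern (the names `stream_answer` and `tokens` are illustrative, not the module's API):

```python
from typing import Callable, Iterable, List, Optional

def stream_answer(tokens: Iterable[str], token_callback: Optional[Callable[[str], None]] = None) -> str:
    """Forward each token to the callback as it arrives, then return the full text."""
    parts: List[str] = []
    for t in tokens:
        parts.append(t)
        if token_callback:
            token_callback(t)  # e.g. print(t, end="", flush=True) for a live terminal stream
    return "".join(parts)

received: List[str] = []
full = stream_answer(["Hel", "lo ", "world"], received.append)
# full == "Hello world"; received == ["Hel", "lo ", "world"]
```

The same shape lets the Streamlit front end append tokens to a placeholder widget instead of stdout.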
excel_to_duckdb.py ADDED
@@ -0,0 +1,595 @@
+ """
2
+ Excel → DuckDB ingestion (generic, robust, multi-table)
3
+
4
+ - Hierarchical headers with merged-cell parent context (titles removed)
5
+ - Merged rows/cols resolved to master (top-left) value for consistent replication
6
+ - Multiple tables detected ONLY when separated by at least one completely empty row
7
+ - Footer detection (ignore trailing notes/summaries)
8
+ - Pivot detection (skip pivot-looking rows; optional sheet-level pivot/charthood skip)
9
+ - Optional LLM inference for unnamed columns and table titles (EXCEL_LLM_INFER=1)
10
+ - One DuckDB table per detected table block
11
+ - Traceability:
12
+ __excel_schema (sheet_name, table_name, column_ordinal, original_name, sql_column)
13
+ __excel_tables (sheet_name, table_name, block_index, start_row, end_row,
14
+ header_rows_json, inferred_title, original_title_text)
15
+
16
+ Usage:
17
+ python excel_to_duckdb.py --excel /path/file.xlsx --duckdb /path/out.duckdb
18
+ """
19
+
20
+ import os
21
+ import re
22
+ import sys
23
+ import json
24
+ import argparse
25
+ from pathlib import Path
26
+ from typing import List, Tuple, Dict
27
+ import hashlib
28
+
29
+ from openpyxl import load_workbook
30
+ from openpyxl.worksheet.worksheet import Worksheet
31
+
32
+
33
+ def _nonempty(vals):
34
+ return [v for v in vals if v not in (None, "")]
35
+
36
+ def _is_numlike(x):
37
+ if isinstance(x, (int, float)):
38
+ return True
39
+ s = str(x).strip().replace(",", "")
40
+ if s.endswith("%"):
41
+ s = s[:-1]
42
+ if not s:
43
+ return False
44
+ if any(c.isalpha() for c in s):
45
+ return False
46
+ try:
47
+ float(s); return True
48
+ except: return False
49
+
50
+ def _is_year_token(x):
51
+ if isinstance(x, int) and 1800 <= x <= 2100: return True
52
+ s = str(x).strip()
53
+ return s.isdigit() and 1800 <= int(s) <= 2100
54
+
55
+ def is_probably_footer(cells):
56
+ nonempty = [(i, v) for i, v in enumerate(cells) if v not in (None, "")]
57
+ if not nonempty: return False
58
+ if len(nonempty) <= 2:
59
+ text = " ".join(str(v) for _, v in nonempty).strip().lower()
60
+ if any(text.startswith(k) for k in ["note","notes","source","summary","disclaimer"]): return True
61
+ if len(text) > 50: return True
62
+ return False
63
+
64
+ def is_probably_data(cells, num_cols):
65
+ vals = [v for v in cells if v not in (None, "")]
66
+ if not vals: return False
67
+ nums_list = [v for v in vals if _is_numlike(v)]
68
+ num_num = len(nums_list); num_text = len(vals) - num_num
69
+ density = len(vals) / max(1, num_cols)
70
+ if num_num >= 2 and all(_is_year_token(v) for v in nums_list) and num_text >= 2:
71
+ return False
72
+ if num_num >= max(2, num_text): return True
73
+ if density >= 0.6 and num_num >= 2: return True
74
+ first = str(vals[0]).strip().lower() if vals else ""
75
+ if first in ("total","totals","grand total"): return True
76
+ return False
77
+
78
+ PIVOT_MARKERS = {"row labels","column labels","values","grand total","report filter","filters","∑ values","σ values","Σ values"}
79
+ def is_pivot_marker_string(s: str) -> bool:
80
+ if not s: return False
81
+ t = str(s).strip().lower()
82
+ if t in PIVOT_MARKERS: return True
83
+ if t.startswith(("sum of ","count of ","avg of ","average of ")): return True
84
+ if t.endswith(" total") or t.startswith("total "): return True
85
+ return False
86
+
87
+ def is_pivot_row(cells) -> bool:
88
+ text_cells = [str(v).strip() for v in cells if v not in (None, "")]
89
+ if not text_cells: return False
90
+ if any(is_pivot_marker_string(x) for x in text_cells): return True
91
+ agg_hits = sum(1 for x in text_cells if x.lower().startswith(("sum of","count of","avg of","average of","min of","max of")))
92
+ return agg_hits >= 2
93
+
94
+ def is_pivot_or_chart_sheet(ws: Worksheet) -> bool:
95
+ try:
96
+ if getattr(ws, "_charts", None): return True
97
+ except Exception: pass
98
+ if hasattr(ws, "_pivots") and getattr(ws, "_pivots"): return True
99
+ scan_rows = min(ws.max_row, 40); scan_cols = min(ws.max_column, 20)
100
+ pivotish = 0
101
+ for r in range(1, scan_rows+1):
102
+ row = [ws.cell(r,c).value for c in range(1, scan_cols+1)]
103
+ if is_pivot_row(row):
104
+ pivotish += 1
105
+ if pivotish >= 2: return True
106
+ name = (ws.title or "").lower()
107
+ if any(k in name for k in ("pivot","dashboard","chart","charts")): return True
108
+ return False
109
+
110
+ def sanitize_table_name(name: str) -> str:
111
+ t = re.sub(r"[^\w]", "_", str(name))
112
+ t = re.sub(r"_+", "_", t).strip("_")
113
+ if t and not t[0].isalpha(): t = "table_" + t
114
+ return t or "sheet_data"
115
+
116
+ def _samples_for_column(rows, col_idx, max_items=20):
117
+ vals = []
118
+ for row in rows:
119
+ if col_idx < len(row):
120
+ v = row[col_idx]
121
+ if v not in (None, ""): vals.append(v)
122
+ if len(vals) >= max_items: break
123
+ return vals
124
+
125
+ def _heuristic_infer_col_name(samples):
126
+ if not samples: return None
127
+ if sum(1 for v in samples if _is_year_token(v)) >= max(2, int(0.8*len(samples))): return "year"
128
+ pct_hits = 0
129
+ for v in samples:
130
+ s = str(v).strip()
131
+ if s.endswith("%"): pct_hits += 1
132
+ else:
133
+ try:
134
+ f = float(s.replace(",",""))
135
+ if 0 <= f <= 1.0 or 0 <= f <= 100: pct_hits += 0.5
136
+ except: pass
137
+ if pct_hits >= max(2, int(0.7*len(samples))): return "percentage"
138
+ if sum(1 for v in samples if _is_numlike(v)) >= max(3, int(0.7*len(samples))):
139
+ intish = 0
140
+ for v in samples:
141
+ try:
142
+ if float(str(v).replace(",","")) == int(float(str(v).replace(",",""))): intish += 1
143
+ except: pass
144
+ if intish >= max(2, int(0.6*len(samples))): return "count"
145
+ return "value"
146
+ uniq = {str(v).strip().lower() for v in samples}
147
+ if len(uniq) <= 3 and max(len(str(v)) for v in samples) >= 30: return "question"
148
+ if sum(1 for v in samples if re.search(r"\d", str(v)) and ("-" in str(v) or "–" in str(v))) >= max(2, int(0.6*len(samples))): return "range"
149
+ if len(uniq) < max(5, int(0.5*len(samples))): return "category"
150
+ return None
151
+
152
+ def clean_col_name(s: str) -> str:
153
+ s = re.sub(r"[^\w\s%#‰]", "", str(s).strip())
154
+ s = s.replace("%"," pct").replace("‰"," permille").replace("#"," count ")
155
+ s = re.sub(r"\s+"," ", s)
156
+ s = re.sub(r"\s+","_", s)
157
+ s = re.sub(r"_+","_", s).strip("_")
158
+ if s and s[0].isdigit(): s = "col_" + s
159
+ return s or "unnamed_column"
160
+
161
+ def ensure_unique(names):
162
+ seen = {}; out = []
163
+ for n in names:
164
+ base = (n or "unnamed_column").lower()
165
+ if base not in seen:
166
+ seen[base] = 0; out.append(n)
167
+ else:
168
+ i = seen[base] + 1
169
+ while f"{n}_{i}".lower() in seen: i += 1
170
+ seen[base] = i; out.append(f"{n}_{i}")
171
+ seen[(f"{n}_{i}").lower()] = 0
172
+ return out
173
+
174
+ def compose_col(parts):
175
+ cleaned = []; prev = None
176
+ for p in parts:
177
+ if not p: continue
178
+ p_norm = str(p).strip()
179
+ if prev is not None and p_norm.lower() == prev.lower(): continue
180
+ cleaned.append(p_norm); prev = p_norm
181
+ if not cleaned: return ""
182
+ return clean_col_name("_".join(cleaned))
183
+
184
+ def used_bounds(ws: Worksheet) -> Tuple[int,int,int,int]:
185
+ min_row, max_row, min_col, max_col = None, 0, None, 0
186
+ for r in ws.iter_rows():
187
+ for c in r:
188
+ v = c.value
189
+ if v is not None and str(v).strip() != "":
190
+ if min_row is None or c.row < min_row: min_row = c.row
191
+ if c.row > max_row: max_row = c.row
192
+ if min_col is None or c.column < min_col: min_col = c.column
193
+ if c.column > max_col: max_col = c.column
194
+ if min_row is None: return 1,0,1,0
195
+ return min_row, max_row, min_col, max_col
196
+
197
+ def build_merged_master_map(ws: Worksheet):
198
+ mapping = {}
199
+ for mr in ws.merged_cells.ranges:
200
+ min_col, min_row, max_col, max_row = mr.min_col, mr.min_row, mr.max_col, mr.max_row
201
+ master = (min_row, min_col)
202
+ for r in range(min_row, max_row+1):
203
+ for c in range(min_col, max_col+1):
204
+ mapping[(r,c)] = master
205
+ return mapping
206
+
207
+ def build_value_grid(ws: Worksheet, min_row: int, max_row: int, min_col: int, max_col: int):
208
+ merged_map = build_merged_master_map(ws)
209
+ nrows = max_row - min_row + 1; ncols = max_col - min_col + 1
210
+ grid = [[None]*ncols for _ in range(nrows)]
211
+ for r in range(min_row, max_row+1):
212
+ rr = r - min_row
213
+ for c in range(min_col, max_col+1):
214
+ cc = c - min_col
215
+ master = merged_map.get((r,c))
216
+ if master:
217
+ mr, mc = master; grid[rr][cc] = ws.cell(mr, mc).value
218
+ else:
219
+ grid[rr][cc] = ws.cell(r, c).value
220
+ return grid
221
+
222
+ def row_vals_from_grid(grid, r, min_row):
223
+ return grid[r - min_row]
224
+
225
+ def is_empty_row_vals(vals):
226
+ return not any(v not in (None, "") for v in vals)
227
+
228
+ def is_title_like_row_vals(vals, total_cols=20):
229
+ vals_ne = _nonempty(vals)
230
+ if not vals_ne: return False
231
+ if len(vals_ne) == 1: return True
232
+ coverage = len(vals_ne) / max(1, total_cols)
233
+ if coverage <= 0.2 and all(isinstance(v,str) and len(str(v))>20 for v in vals_ne): return True
234
+ uniq = {str(v).strip().lower() for v in vals_ne}
235
+ if len(uniq) == 1: return True
236
+ block = {"local currency unit per us dollar","exchange rate","average annual exchange rate"}
237
+ if any(str(v).strip().lower() in block for v in vals_ne): return True
238
+ return False
239
+
240
+ def is_header_candidate_row_vals(vals, total_cols=20):
241
+ vals_ne = _nonempty(vals)
242
+ if not vals_ne: return False
243
+ if is_title_like_row_vals(vals, total_cols): return False
244
+ nums = sum(1 for v in vals_ne if _is_numlike(v))
245
+ years = sum(1 for v in vals_ne if _is_year_token(v))
246
+ has_text = any(not _is_numlike(v) for v in vals_ne)
247
+ if years >= 2 and has_text: return True
248
+ if nums >= max(2, len(vals_ne)-nums): return years >= max(2, int(0.6*len(vals_ne)))
249
+ uniq_labels = {str(v).strip().lower() for v in vals_ne if not _is_numlike(v)}
250
+ return (len(vals_ne) >= 2) or (len(uniq_labels) >= 2)
251
+
252
+ def detect_tables_fast(ws: Worksheet, grid, min_row, max_row, min_col, max_col):
253
+ blocks = []
254
+ if is_pivot_or_chart_sheet(ws): return blocks
255
+ total_cols = max_col - min_col + 1
256
+ r = min_row
257
+ while r <= max_row:
258
+ vals = row_vals_from_grid(grid, r, min_row)
259
+ if is_empty_row_vals(vals) or is_title_like_row_vals(vals, total_cols) or is_pivot_row(vals):
260
+ r += 1; continue
261
+ if not is_probably_data(vals, total_cols):
262
+ r += 1; continue
263
+ data_start = r
264
+ header_rows = []
265
+ up = data_start - 1
266
+ while up >= min_row:
267
+ vup = row_vals_from_grid(grid, up, min_row)
268
+ if is_empty_row_vals(vup): break
269
+ if is_title_like_row_vals(vup, total_cols) or is_pivot_row(vup):
270
+ up -= 1; continue
271
+ if is_header_candidate_row_vals(vup, total_cols):
272
+ header_rows = []
273
+ hdr_row = up
274
+ while hdr_row >= min_row:
275
+ hdr_vals = row_vals_from_grid(grid, hdr_row, min_row)
276
+ if is_empty_row_vals(hdr_vals): break
277
+ if is_header_candidate_row_vals(hdr_vals, total_cols):
278
+ header_rows.insert(0, hdr_row); hdr_row -= 1
279
+ else: break
280
+ break
281
+ data_end = data_start
282
+ rr = data_start + 1
283
+ while rr <= max_row:
284
+ v = row_vals_from_grid(grid, rr, min_row)
285
+ if is_probably_footer(v) or is_pivot_row(v): break
286
+ if is_empty_row_vals(v): break
287
+ if is_probably_data(v, total_cols) or is_header_candidate_row_vals(v, total_cols):
288
+ data_end = rr
289
+ rr += 1
290
+ title_text = None
291
+ if header_rows:
292
+ top = header_rows[0]
293
+ for tr in range(max(min_row, top-3), top):
294
+ tv = row_vals_from_grid(grid, tr, min_row)
295
+ if is_title_like_row_vals(tv, total_cols):
296
+ first = next((str(x).strip() for x in tv if x not in (None,"")), None)
297
+ if first: title_text = first
298
+ break
299
+ if (header_rows or data_end - data_start >= 1) and data_start <= data_end:
300
+ blocks.append({"header_rows": header_rows, "data_start": data_start, "data_end": data_end, "title_text": title_text})
301
+ r = data_end + 1
302
+ while r <= max_row and is_empty_row_vals(row_vals_from_grid(grid, r, min_row)):
303
+ r += 1
304
+ return blocks
305
+
306
+ def expand_headers_from_grid(grid, header_rows, min_row, min_col, eff_max_col):
307
+ if not header_rows: return []
308
+ mat = []
309
+ for r in header_rows:
310
+ row_vals = row_vals_from_grid(grid, r, min_row)
311
+ row = [("" if (row_vals[c] is None) else str(row_vals[c]).strip()) for c in range(0, eff_max_col)]
312
+ last = ""
313
+ for i in range(len(row)):
314
+ if row[i] == "" and i > 0: row[i] = last
315
+ else: last = row[i]
316
+ mat.append(row)
317
+ return mat
318
+
319
+ def sheet_block_to_df_fast(ws, grid, min_row, max_row, min_col, max_col, header_rows, data_start, data_end):
320
+ import pandas as pd
321
+ total_cols = max_col - min_col + 1
322
+ if (not header_rows) and data_start and data_start > min_row:
323
+ prev = row_vals_from_grid(grid, data_start - 1, min_row)
324
+ if is_header_candidate_row_vals(prev, total_cols):
325
+ header_rows = [data_start - 1]
326
+ if (not header_rows) and data_start:
327
+ cur = row_vals_from_grid(grid, data_start, min_row)
328
+ nxt = row_vals_from_grid(grid, data_start + 1, min_row) if data_start + 1 <= max_row else []
329
+ if is_header_candidate_row_vals(cur, total_cols) and is_probably_data(nxt, total_cols):
330
+ header_rows = [data_start]; data_start += 1
331
+ if not header_rows or data_start is None or data_end is None or data_end < data_start:
332
+ return pd.DataFrame(), [], []
333
+ def used_upto_col():
334
+ maxc = 0
335
+ for r in list(header_rows) + list(range(data_start, data_end+1)):
336
+ vals = row_vals_from_grid(grid, r, min_row)
337
+ for c_off in range(total_cols):
338
+ v = vals[c_off]
339
+ if v not in (None, ""): maxc = max(maxc, c_off+1)
340
+ return maxc or total_cols
341
+ eff_max_col = used_upto_col()
342
+ header_mat = expand_headers_from_grid(grid, header_rows, min_row, min_col, eff_max_col)
343
+ def is_title_level(values):
344
+ total = len(values)
345
+ filled = [str(v).strip() for v in values if v not in (None, "")]
346
+ if total == 0: return False
347
+ coverage = len(filled) / total
348
+ if coverage <= 0.2 and len(filled) <= 2: return True
349
+ if filled:
350
+ uniq = {v.lower() for v in filled}
351
+ if len(uniq) == 1:
352
+ label = next(iter(uniq))
353
+ dom = sum(1 for v in values if isinstance(v,str) and v.strip().lower() == label)
354
+ if dom / total >= 0.6: return True
355
+ return False
356
+ usable_levels = [i for i in range(len(header_mat)) if not is_title_level(header_mat[i])]
357
+ if not usable_levels and header_mat: usable_levels = [len(header_mat) - 1]
358
+ cols = []
359
+ for c_off in range(eff_max_col):
360
+ parts = [header_mat[l][c_off] for l in range(usable_levels[0], usable_levels[-1]+1)] if usable_levels else []
361
+ cols.append(compose_col(parts))
362
+ cols = ensure_unique([clean_col_name(x) for x in cols])
363
+ data_rows = []
364
+ for r in range(data_start, data_end+1):
365
+ vals = row_vals_from_grid(grid, r, min_row)
366
+ row = [vals[c_off] for c_off in range(eff_max_col)]
367
+ if is_probably_footer(row): break
368
+ data_rows.append(row[:len(cols)])
369
+ if not data_rows: return pd.DataFrame(columns=cols), header_mat, cols
370
+ keep_mask = [any(row[i] not in (None, "") for row in data_rows) for i in range(len(cols))]
371
+ kept_cols = [c for c,k in zip(cols, keep_mask) if k]
372
+ trimmed_rows = [[v for v,k in zip(row, keep_mask) if k] for row in data_rows]
373
+ df = pd.DataFrame(trimmed_rows, columns=kept_cols)
374
+ if any(str(c).startswith("unnamed_column") for c in df.columns):
375
+ new_names = list(df.columns)
376
+ for idx, name in enumerate(list(df.columns)):
377
+ if not str(name).startswith("unnamed_column"): continue
378
+ samples = _samples_for_column(trimmed_rows, idx, max_items=20)
379
+ guess = _heuristic_infer_col_name(samples)
380
+ if guess: new_names[idx] = clean_col_name(guess)
381
+ df.columns = ensure_unique([clean_col_name(x) for x in new_names])
382
+ return df, header_mat, kept_cols
383
+
384
+ def _llm_infer_table_title(header_mat, sample_rows, sheet_name):
385
+ if os.environ.get("EXCEL_LLM_INFER","0") != "1": return None
386
+ api_key = os.environ.get("OPENAI_API_KEY")
387
+ if not api_key: return None
388
+ headers = []
389
+ if header_mat:
390
+ for c in range(len(header_mat[0])):
391
+ parts = [header_mat[l][c] for l in range(len(header_mat))]
392
+ parts = [p for p in parts if p]
393
+ if parts: headers.append("_".join(parts))
394
+ headers = headers[:10]
395
+ samples = [[str(x) for x in r[:6]] for r in sample_rows[:5]]
396
+ prompt = (
397
+ "Propose a short, human-readable title for a data table.\n"
398
+ "Keep it 3-6 words, Title Case, no punctuation at the end.\n"
399
+ f"Sheet: {sheet_name}\nHeaders: {headers}\nRow samples: {samples}\n"
400
+ "Answer with JSON: {\"title\": \"...\"}"
401
+ )
402
+ try:
403
+ from openai import OpenAI
404
+ client = OpenAI(api_key=api_key)
405
+ resp = client.chat.completions.create(
406
+ model=os.environ.get("OPENAI_MODEL","gpt-4o-mini"),
407
+ messages=[{"role":"user","content":prompt}], temperature=0.2,
408
+ )
409
+ text = resp.choices[0].message.content.strip()
410
+ except Exception:
411
+ return None
412
+ import re, json as _json
413
+ m = re.search(r"\{.*\}", text, re.S)
414
+ if not m: return None
415
+ try:
416
+ obj = _json.loads(m.group(0)); title = obj.get("title","").strip()
417
+ return title or None
418
+ except Exception: return None
419
+
420
+ def _heuristic_table_title(header_mat, sheet_name, idx):
421
+ if header_mat:
422
+ parts = []
423
+ levels = len(header_mat)
424
+ cols = len(header_mat[0]) if header_mat else 0
425
+ for c in range(min(6, cols)):
426
+ colparts = [header_mat[l][c] for l in range(min(levels, 2)) if header_mat[l][c]]
427
+ if colparts: parts.extend(colparts)
428
+ if parts:
429
+ base = " ".join(dict.fromkeys(parts))
430
+ return base[:60]
431
+ return f"{sheet_name} Table {idx}"
432
+
433
+ def infer_table_title(header_mat, sample_rows, sheet_name, idx):
434
+ title = _heuristic_table_title(header_mat, sheet_name, idx)
435
+ llm = _llm_infer_table_title(header_mat, sample_rows, sheet_name)
436
+ return llm or title
437
+
438
+ def ensure_unique_table_name(existing: set, name: str) -> str:
439
+ base = sanitize_table_name(name) or "table"
440
+ if base not in existing:
441
+ existing.add(base); return base
442
+ i = 2
443
+ while f"{base}_{i}" in existing: i += 1
444
+ out = f"{base}_{i}"; existing.add(out); return out
445
+
446
+ # --- NEW: block coalescing to avoid nested/overlapping duplicates ---
447
+ def coalesce_blocks(blocks: List[Dict]) -> List[Dict]:
448
+ """Keep only maximal non-overlapping blocks by data row range."""
449
+ if not blocks: return blocks
450
+ blocks_sorted = sorted(blocks, key=lambda b: (b["data_start"], b["data_end"]))
451
+ result = []
452
+ for b in blocks_sorted:
453
+ if any(b["data_start"] >= x["data_start"] and b["data_end"] <= x["data_end"] for x in result):
454
+ # fully contained -> drop
455
+ continue
456
+ result.append(b)
457
+ return result
458
+
459
+ # ------------------------- Persistence -------------------------
460
+ def persist(excel_path, duckdb_path):
461
+ try:
462
+ from duckdb import connect
463
+ except ImportError:
464
+ print("Error: DuckDB library not installed. Install with: pip install duckdb"); sys.exit(1)
465
+ try:
466
+ wb = load_workbook(excel_path, data_only=True)
467
+ except FileNotFoundError:
468
+ print(f"Error: Excel file not found: {excel_path}"); sys.exit(1)
469
+ except Exception as e:
470
+ print(f"Error loading Excel file: {e}"); sys.exit(1)
471
+
472
+ db_path = Path(duckdb_path)
473
+ db_path.parent.mkdir(parents=True, exist_ok=True)
474
+ new_db = not db_path.exists()
475
+ con = connect(str(db_path))
476
+ if new_db: print(f"Created new DuckDB at: {db_path}")
477
+
478
+ con.execute("""
479
+ CREATE TABLE IF NOT EXISTS __excel_schema (
480
+ sheet_name TEXT,
481
+ table_name TEXT,
482
+ column_ordinal INTEGER,
483
+ original_name TEXT,
484
+ sql_column TEXT
485
+ )
486
+ """)
487
+ con.execute("""
488
+ CREATE TABLE IF NOT EXISTS __excel_tables (
489
+ sheet_name TEXT,
490
+ table_name TEXT,
491
+ block_index INTEGER,
492
+ start_row INTEGER,
493
+ end_row INTEGER,
494
+ header_rows_json TEXT,
495
+ inferred_title TEXT,
496
+ original_title_text TEXT
497
+ )
498
+ """)
499
+ con.execute("DELETE FROM __excel_schema")
500
+ con.execute("DELETE FROM __excel_tables")
501
+
502
+ used_names = set(); total_tables = 0; total_rows = 0
503
+
504
+ for sheet in wb.sheetnames:
505
+ ws = wb[sheet]
506
+ try:
507
+ if not isinstance(ws, Worksheet):
508
+ print(f"Skipping chartsheet: {sheet}"); continue
509
+ except Exception: pass
510
+ if is_pivot_or_chart_sheet(ws):
511
+ print(f"Skipping pivot/chart-like sheet: {sheet}"); continue
512
+
513
+ min_row, max_row, min_col, max_col = used_bounds(ws)
514
+ if max_row < min_row: continue
515
+ grid = build_value_grid(ws, min_row, max_row, min_col, max_col)
516
+
517
+ blocks = detect_tables_fast(ws, grid, min_row, max_row, min_col, max_col)
518
+ blocks = coalesce_blocks(blocks) # NEW: drop contained duplicates
519
+ if not blocks: continue
520
+
521
+ # NEW: per-sheet content hash set to avoid identical duplicate content
522
+ seen_content = set()
523
+
524
+ for idx, blk in enumerate(blocks, start=1):
525
+ df, header_mat, kept_cols = sheet_block_to_df_fast(
526
+ ws, grid, min_row, max_row, min_col, max_col,
527
+ blk["header_rows"], blk["data_start"], blk["data_end"]
528
+ )
529
+ if df.empty: continue
530
+
531
+ # Content hash (stable CSV representation)
532
+ csv_bytes = df.to_csv(index=False).encode("utf-8")
533
+ h = hashlib.sha256(csv_bytes).hexdigest()
534
+ if h in seen_content:
535
+ # Skip duplicate block with identical content
536
+ print(f"Skipping duplicate content on sheet {sheet} (block {idx})")
537
+ continue
538
+ seen_content.add(h)
539
+
540
+ # Build original composite header names for lineage mapping
541
+ original_cols = []
542
+ if header_mat:
543
+ levels = len(header_mat)
544
+ cols = len(header_mat[0]) if header_mat else 0
545
+ for c in range(cols):
546
+ parts = [header_mat[l][c] for l in range(levels)]
547
+ original_cols.append("_".join([p for p in parts if p]))
548
+ else:
549
+ original_cols = list(df.columns)
550
+ while len(original_cols) < len(df.columns): original_cols.append("unnamed")
551
+
552
+ title_orig = blk.get("title_text")
553
+ title = title_orig or infer_table_title(header_mat, df.values.tolist(), sheet, idx)
554
+ candidate = title if title else f"{sheet} Table {idx}"
555
+ table = ensure_unique_table_name(used_names, candidate)
556
+
557
+ con.execute(f'DROP TABLE IF EXISTS "{table}"')
558
+ con.register(f"{table}_temp", df)
559
+ con.execute(f'CREATE TABLE "{table}" AS SELECT * FROM {table}_temp')
560
+ con.unregister(f"{table}_temp")
561
+
562
+ for cidx, (orig, sqlc) in enumerate(zip(original_cols[:len(df.columns)], df.columns), start=1):
563
+ con.execute("INSERT INTO __excel_schema VALUES (?, ?, ?, ?, ?)", [sheet, table, cidx, str(orig), str(sqlc)])
564
+
565
+ con.execute(
566
+ """
567
+ INSERT INTO __excel_tables
568
+ (sheet_name, table_name, block_index, start_row, end_row, header_rows_json, inferred_title, original_title_text)
569
+ VALUES (?, ?, ?, ?, ?, ?, ?, ?)
570
+ """,
571
+ [sheet, table, int(idx), int(blk["data_start"]), int(blk["data_end"]), json.dumps(blk["header_rows"]), title if title else None, title_orig if title_orig else None],
572
+ )
573
+
574
+ print(f"Created table {table} from sheet {sheet} with {len(df)} rows and {len(df.columns)} columns.")
575
+ total_tables += 1; total_rows += len(df)
576
+
577
+ con.close()
578
+ print(f"""\n✅ Completed.
579
+ - Created {total_tables} tables with {total_rows} total rows
580
+ - Column lineage: __excel_schema
581
+ - Block metadata: __excel_tables""")
582
+
583
+
584
+ def main():
585
+ ap = argparse.ArgumentParser(description="Excel → DuckDB (v2-compatible perf + no-dup safe guards).")
586
+ ap.add_argument("--excel", required=True, help="Path to .xlsx")
587
+ ap.add_argument("--duckdb", required=True, help="Path to DuckDB file")
588
+ args = ap.parse_args()
589
+ if not os.path.exists(args.excel):
590
+ print(f"Error: Excel file not found: {args.excel}"); sys.exit(1)
591
+ persist(args.excel, args.duckdb)
592
+
593
+
594
+ if __name__ == "__main__":
595
+ main()
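
The column-name pipeline above (`clean_col_name` feeding `ensure_unique`) is the part most worth sanity-checking, since every generated SQL identifier passes through it. A standalone re-implementation of `clean_col_name`, copied from the rules in the file, for quick verification:

```python
import re

def clean_col_name(s: str) -> str:
    """Normalize a header cell to a SQL-safe snake_case identifier (same rules as the ingestion script)."""
    s = re.sub(r"[^\w\s%#‰]", "", str(s).strip())   # drop punctuation except %, #, ‰
    s = s.replace("%", " pct").replace("‰", " permille").replace("#", " count ")
    s = re.sub(r"\s+", " ", s)                       # collapse whitespace
    s = re.sub(r"\s+", "_", s)                       # spaces -> underscores
    s = re.sub(r"_+", "_", s).strip("_")             # collapse/trim underscores
    if s and s[0].isdigit():
        s = "col_" + s                               # identifiers must not start with a digit
    return s or "unnamed_column"

print(clean_col_name("Revenue (%)"))   # Revenue_pct
print(clean_col_name("2024 Sales"))    # col_2024_Sales
print(clean_col_name(""))              # unnamed_column
```

Duplicate names produced by this normalization are then disambiguated by `ensure_unique` (e.g. `a`, `a` becomes `a`, `a_1`).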
requirements.txt CHANGED
@@ -1,3 +1,13 @@
-altair
-pandas
-streamlit
+duckdb
+pandas
+openpyxl
+xlsxwriter
+langgraph
+langchain-openai
+langchain-core
+faker
+python-dotenv==1.0.1
+streamlit
+matplotlib
+openai
+tabulate
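
With the dependencies above in place, a typical local setup looks like this (virtual-env name and `.env` contents are illustrative; the Space itself runs via Docker on port 8501):

```shell
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
echo "OPENAI_API_KEY=sk-..." > .env   # the agent reads this via python-dotenv
streamlit run streamlit_app.py
```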
streamlit_app.py ADDED
@@ -0,0 +1,574 @@
+ import os
2
+ import sys
3
+ import duckdb
4
+ import json
5
+ import subprocess
6
+ from pathlib import Path
7
+ from typing import Dict, List, Tuple
8
+
9
+ import streamlit as st
10
+
11
+
12
+ try:
13
+ from dotenv import load_dotenv, find_dotenv
14
+ load_dotenv(find_dotenv())
15
+ except Exception:
16
+ pass
17
+
18
+ # ---------- Basic page setup ----------
19
+ st.set_page_config(page_title="Excel → Dataset", page_icon="📊", layout="wide")
20
+
21
+ PRIMARY_DIR = Path(__file__).parent.resolve()
22
+ UPLOAD_DIR = PRIMARY_DIR / "uploads"
23
+ DB_DIR = PRIMARY_DIR / "dbs"
24
+ SCRIPT_PATH = PRIMARY_DIR / "excel_to_duckdb.py" # must be colocated
25
+
26
+ UPLOAD_DIR.mkdir(parents=True, exist_ok=True)
27
+ DB_DIR.mkdir(parents=True, exist_ok=True)
28
+
29
+ st.markdown(
30
+ """
31
+ <style>
32
+ .logbox { border: 1px solid #e5e7eb; background:#fafafa; padding:10px; border-radius:12px; }
33
+ .logbox code { white-space: pre-wrap; font-size: 0.85rem; }
34
+ </style>
35
+ """,
36
+ unsafe_allow_html=True,
37
+ )
38
+
39
+ st.title("Data Analysis Agent")
40
+ # --------- Session state helpers ---------
41
+ if "processing" not in st.session_state:
42
+ st.session_state.processing = False
43
+ if "processed_key" not in st.session_state:
44
+ st.session_state.processed_key = None
45
+ if "last_overview_md" not in st.session_state:
46
+ st.session_state.last_overview_md = None
47
+ if "last_preview_items" not in st.session_state:
48
+ st.session_state.last_preview_items = [] # list of dicts: {'table_ref': str, 'label': str}
49
+
50
+ def _file_key(uploaded) -> str:
51
+ # unique-ish key per upload (name + size)
52
+ try:
53
+ size = len(uploaded.getbuffer())
54
+ except Exception:
55
+ size = 0
56
+ return f"{uploaded.name}:{size}"
57
+
58
+ # ---------- DuckDB helpers ----------
+ def list_user_tables(con: duckdb.DuckDBPyConnection) -> List[str]:
+     """
+     Robust discovery:
+     1) information_schema.tables for BASE TABLE/VIEW in any schema, excluding __* tables
+     2) duckdb_tables() as a secondary path
+     3) __excel_tables mapping + existence check as a last resort
+     """
+     # 1) information_schema (most portable).
+     # Note: in a LIKE pattern '_' is a single-character wildcard, so
+     # "NOT LIKE '__%'" would exclude every name of two or more characters;
+     # starts_with() matches the literal '__' prefix instead.
+     try:
+         q = (
+             "SELECT table_schema, table_name "
+             "FROM information_schema.tables "
+             "WHERE table_type IN ('BASE TABLE','VIEW') "
+             "AND NOT starts_with(table_name, '__') "
+             "ORDER BY table_schema, table_name"
+         )
+         rows = con.execute(q).fetchall()
+         names = []
+         for schema, name in rows:
+             if (schema or '').lower() == 'main':
+                 names.append(name)
+             else:
+                 names.append(f'{schema}."{name}"')
+         if names:
+             return names
+     except Exception:
+         pass
+
+     # 2) duckdb_tables() (lists base tables; 'internal' flags system tables)
+     try:
+         q2 = (
+             "SELECT schema_name, table_name "
+             "FROM duckdb_tables() "
+             "WHERE NOT internal "
+             "AND NOT starts_with(table_name, '__') "
+             "ORDER BY schema_name, table_name"
+         )
+         rows = con.execute(q2).fetchall()
+         names = []
+         for schema, name in rows:
+             if (schema or '').lower() == 'main':
+                 names.append(name)
+             else:
+                 names.append(f'{schema}."{name}"')
+         if names:
+             return names
+     except Exception:
+         pass
+
+     # 3) Fall back to the ingestion metadata table
+     try:
+         meta = con.execute("SELECT DISTINCT table_name FROM __excel_tables").fetchall()
+         names = []
+         for (t,) in meta:
+             try:
+                 con.execute(f'SELECT 1 FROM "{t}" LIMIT 1').fetchone()
+                 names.append(t)
+             except Exception:
+                 continue
+         return names
+     except Exception:
+         return []
+
+ def get_columns(con: duckdb.DuckDBPyConnection, table: str) -> List[Tuple[str, str]]:
+     # Normalize the table reference for the information_schema lookup
+     if table.lower().startswith("main."):
+         schema_filter = 'main'
+         tname = table.split('.', 1)[1].strip('"')
+     elif '.' in table:
+         schema_filter, tname_raw = table.split('.', 1)
+         tname = tname_raw.strip('"')
+     else:
+         schema_filter = 'main'
+         tname = table.strip('"')
+     q = (
+         "SELECT column_name, data_type "
+         "FROM information_schema.columns "
+         "WHERE table_schema=? AND table_name=? "
+         "ORDER BY ordinal_position"
+     )
+     return con.execute(q, [schema_filter, tname]).fetchall()
+
+ def detect_year_column(con, table: str, col: str) -> bool:
+     # Treat a numeric column as a "year" when >70% of values fall in 1900–2100
+     try:
+         sql = (
+             f'SELECT AVG(CASE WHEN TRY_CAST("{col}" AS INTEGER) BETWEEN 1900 AND 2100 '
+             f'THEN 1.0 ELSE 0.0 END) FROM {table}'
+         )
+         v = con.execute(sql).fetchone()[0]
+         return (v or 0) > 0.7
+     except Exception:
+         return False
+
+ def role_of_column(con, table: str, col: str, dtype: str) -> str:
+     d = (dtype or '').upper()
+     if any(tok in d for tok in ["DATE", "TIMESTAMP"]):
+         return "date"
+     if any(tok in d for tok in ["INT", "BIGINT", "DOUBLE", "DECIMAL", "FLOAT", "HUGEINT", "REAL"]):
+         if detect_year_column(con, table, col):
+             return "year"
+         return "numeric"
+     if any(tok in d for tok in ["CHAR", "STRING", "TEXT", "VARCHAR"]):
+         try:
+             sql = f'SELECT COUNT(*), COUNT(DISTINCT "{col}") FROM {table}'
+             n, nd = con.execute(sql).fetchone()
+             if n and nd is not None:
+                 ratio = (nd / n) if n else 0
+                 if ratio > 0.95:
+                     return "id_like"
+                 if 0.01 <= ratio <= 0.35:
+                     return "category"
+                 if ratio < 0.01:
+                     return "binary_flag"
+         except Exception:
+             pass
+         return "text"
+     return "other"
+
+ def quick_table_profile(con: duckdb.DuckDBPyConnection, table: str) -> Dict:
+     rows = con.execute(f'SELECT COUNT(*) FROM {table}').fetchone()[0]
+     cols = get_columns(con, table)
+     roles = {"category": [], "numeric": [], "date": [], "year": [], "id_like": [], "text": [], "binary": [], "other": []}
+     for c, d in cols:
+         r = role_of_column(con, table, c, d)
+         if r == "binary_flag":
+             roles["binary"].append(c)
+         else:
+             roles.setdefault(r, []).append(c)
+     return {
+         "rows": int(rows or 0),
+         "n_cols": len(cols),
+         "n_cat": len(roles["category"]),
+         "n_num": len(roles["numeric"]),
+         "n_time": len(roles["year"]) + len(roles["date"]),
+     }
+
+ def table_mapping(con: duckdb.DuckDBPyConnection, user_tables: List[str]) -> Dict[str, Dict]:
+     """
+     Map db_table (normalized) -> {sheet_name, title} using __excel_tables if present.
+     """
+     def normalize(t: str) -> str:
+         return t.split('.', 1)[1].strip('"') if '.' in t else t.strip('"')
+
+     want_names = {normalize(t) for t in user_tables}
+     mapping: Dict[str, Dict] = {}
+     try:
+         rows = con.execute(
+             "SELECT sheet_name, table_name, inferred_title, original_title_text, block_index, start_row "
+             "FROM __excel_tables ORDER BY block_index, start_row"
+         ).fetchall()
+         for sheet_name, table_name, inferred_title, original_title_text, block_index, start_row in rows:
+             if table_name not in want_names:
+                 continue
+             title = inferred_title or original_title_text or 'untitled'
+             mapping[table_name] = {'sheet_name': sheet_name, 'title': title}
+     except Exception:
+         pass
+     return mapping
+
+ def excel_schema_samples(con: duckdb.DuckDBPyConnection, mapping: Dict[str, Dict], max_cols: int = 8) -> Dict[str, List[str]]:
+     """Return up to max_cols original column names per (normalized) table_name for LLM hints."""
+     samples: Dict[str, List[str]] = {}
+     try:
+         rows = con.execute(
+             "SELECT sheet_name, table_name, column_ordinal, original_name "
+             "FROM __excel_schema ORDER BY sheet_name, table_name, column_ordinal"
+         ).fetchall()
+         for sheet_name, table_name, _ordinal, orig in rows:
+             if table_name not in mapping:
+                 continue
+             lst = samples.setdefault(table_name, [])
+             if orig and len(lst) < max_cols:
+                 lst.append(str(orig))
+     except Exception:
+         pass
+     return samples
+
+ # ---------- OpenAI ----------
+ def ai_overview_from_context(context: Dict) -> str:
+     api_key = os.environ.get("OPENAI_API_KEY")
+     if not api_key:
+         try:
+             api_key = st.secrets.get("OPENAI_API_KEY", None)
+         except Exception:  # st.secrets raises when no secrets file exists
+             api_key = None
+     if not api_key:
+         raise RuntimeError("OPENAI_API_KEY is not set. Please add it to .env or Streamlit secrets.")
+
+     try:
+         from openai import OpenAI
+         client = OpenAI(api_key=api_key)
+         model = os.environ.get("OPENAI_MODEL", "gpt-4o-mini")
+     except Exception as e:
+         raise RuntimeError("OpenAI client not available. Install 'openai' >= 1.0 and try again.") from e
+
+     prompt = f'''
+ Start directly (no greeting). Write a concise, conversational overview (max two short paragraphs) of the dataset created from the uploaded Excel.
+ Requirements:
+ - Do NOT mention database engines, schemas, or technical column/table names.
+ - For each segment, reference it as: Sheet "<sheet_name>" — Table "<title>" (use "untitled" if missing).
+ - Use any provided original Excel header hints ONLY to infer friendlier human concepts; do not quote them verbatim.
+ - After the overview, list 6–8 simple questions a user could ask in natural language.
+ - Output Markdown with headings: "Overview" and "Try These Questions".
+
+ Context (JSON):
+ {json.dumps(context, ensure_ascii=False, indent=2)}
+ '''
+     resp = client.chat.completions.create(
+         model=model,
+         messages=[{"role": "user", "content": prompt}],
+         temperature=0.4,
+     )
+     return resp.choices[0].message.content.strip()
+
+ # ---------- Orchestration ----------
+ def run_ingestion_pipeline(xlsx_path: Path, db_path: Path, log_placeholder):
+     # Combined log function
+     log_lines: List[str] = []
+
+     def _append(line: str):
+         log_lines.append(line)
+         log_placeholder.markdown(
+             f"<div class='logbox'><code>{'</code><br/><code>'.join(map(str, log_lines[-400:]))}</code></div>",
+             unsafe_allow_html=True,
+         )
+
+     # 1) Save (already saved by the caller; logged here so all steps appear in one place)
+     _append("[app] Saving file…")
+     _append("[app] Saved.")
+     if not SCRIPT_PATH.exists():
+         _append("[app] ERROR: ingestion component not found next to the app.")
+         raise FileNotFoundError("Required ingestion component not found.")
+
+     # 2) Ingest
+     _append("[app] Ingesting…")
+     env = os.environ.copy()
+     env["PYTHONIOENCODING"] = "utf-8"
+
+     cmd = [sys.executable, str(SCRIPT_PATH), "--excel", str(xlsx_path), "--duckdb", str(db_path)]
+     try:
+         proc = subprocess.Popen(
+             cmd, cwd=str(PRIMARY_DIR),
+             stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
+             text=True, bufsize=1, universal_newlines=True, env=env,
+         )
+     except Exception as e:
+         _append(f"[app] ERROR: failed to start ingestion: {e}")
+         raise
+
+     for line in iter(proc.stdout.readline, ""):
+         _append(line.rstrip("\n"))
+     proc.wait()
+     if proc.returncode != 0:
+         _append("[app] ERROR: ingestion reported a non-zero exit code.")
+         raise RuntimeError("Ingestion failed. See logs.")
+
+     _append("[app] Ingestion complete.")
+
+     # 3) Open dataset
+     _append("[app] Opening dataset…")
+     con = duckdb.connect(str(db_path))
+     _append("[app] Dataset open.")
+
+     return con, _append
+
+ def analyze_and_summarize(con: duckdb.DuckDBPyConnection):
+     user_tables = list_user_tables(con)
+     preview_items = []  # list of {'table_ref': t, 'label': label} for the UI
+     if not user_tables:
+         # Surface ingestion metadata if no tables were found
+         try:
+             meta_df = con.execute("SELECT * FROM __excel_tables").fetchdf()
+             st.warning("No user tables were discovered. Showing ingestion metadata for reference.")
+             st.dataframe(meta_df, use_container_width=True, hide_index=True)
+         except Exception:
+             st.error("No user tables were discovered and no metadata table is available.")
+         return "", []
+
+     # Build mapping + schema hints
+     mapping = table_mapping(con, user_tables)  # normalized table_name -> {sheet_name, title}
+     schema_hints = excel_schema_samples(con, mapping, max_cols=8)
+
+     # Build a compact profiling context for the LLM; avoid raw db table names
+     per_table = []
+     for idx, t in enumerate(user_tables, start=1):
+         prof = quick_table_profile(con, t)
+         norm = t.split('.', 1)[1].strip('"') if '.' in t else t.strip('"')
+         m = mapping.get(norm, {})
+         sheet = m.get('sheet_name')
+         title = m.get('title')
+         per_table.append({
+             "idx": idx,
+             "sheet_name": sheet,
+             "title": title or "untitled",
+             "rows": prof["rows"],
+             "n_cols": prof["n_cols"],
+             "category_fields": prof["n_cat"],
+             "numeric_measures": prof["n_num"],
+             "time_fields": prof["n_time"],
+             "example_original_headers": schema_hints.get(norm, []),
+         })
+         label = f'Sheet "{sheet}" — Table "{title or "untitled"}"' if sheet else f'Table "{title or "untitled"}"'
+         preview_items.append({'table_ref': t, 'label': label})
+
+     context = {"segments": per_table}
+
+     # Generate the overview (LLM only)
+     overview_md = ai_overview_from_context(context)
+     return overview_md, preview_items
+
+ # ---------- UI flow ----------
+ file = st.file_uploader("Upload an Excel file", type=["xlsx"])
+
+ if file is None and not st.session_state.last_overview_md:
+     st.info("Upload an .xlsx file to begin.")
+
+ # Only show logs AFTER there is an upload or some result to show
+ logs_placeholder = None
+ if file is not None or st.session_state.processing or st.session_state.last_overview_md:
+     logs_exp = st.expander("Processing logs", expanded=False)
+     logs_placeholder = logs_exp.empty()
+
+ if file is not None:
+     key = _file_key(file)
+     stem = Path(file.name).stem
+     saved_xlsx = UPLOAD_DIR / f"{stem}.xlsx"
+     db_path = DB_DIR / f"{stem}.duckdb"
+
+     # --- Clear state immediately on a new upload ---
+     if st.session_state.get("processed_key") != key:
+         st.session_state["last_overview_md"] = None
+         st.session_state["last_preview_items"] = []
+         st.session_state["chat_history"] = []
+         st.session_state["schema_text"] = None
+         st.session_state["db_path"] = None
+         # No explicit log buffer is stored; the log expander refreshes on rerun.
+
+     # Auto-start ingestion exactly once per unique upload
+     if (st.session_state.processed_key != key) and (not st.session_state.processing):
+         st.session_state.processing = True
+         if logs_placeholder is None:
+             logs_exp = st.expander("Ingestion logs", expanded=False)
+             logs_placeholder = logs_exp.empty()
+
+         # Save the uploaded file
+         with open(saved_xlsx, "wb") as f:
+             f.write(file.getbuffer())
+
+         try:
+             con, app_log = run_ingestion_pipeline(saved_xlsx, db_path, logs_placeholder)
+             # Analyze + overview
+             app_log("[app] Analyzing data…")
+             overview_md, preview_items = analyze_and_summarize(con)
+             app_log("[app] Overview complete.")
+             con.close()
+
+             st.session_state.last_overview_md = overview_md
+             st.session_state.last_preview_items = preview_items
+             st.session_state.processed_key = key
+             st.session_state.processing = False
+         except Exception as e:
+             st.session_state.processing = False
+             st.error(f"Ingestion failed. See logs for details. Error: {e}")
+
+ # Display results if available (and avoid re-triggering ingestion)
+ if st.session_state.last_overview_md:
+     st.markdown(st.session_state.last_overview_md)
+
+     with st.expander("Quick preview (verification only)", expanded=False):
+         try:
+             # Reconnect to the current dataset path (if present)
+             if file is not None:
+                 stem = Path(file.name).stem
+                 db_path = DB_DIR / f"{stem}.duckdb"
+                 con = duckdb.connect(str(db_path))
+                 for item in st.session_state.last_preview_items:
+                     t = item['table_ref']
+                     label = item['label']
+                     df = con.execute(f"SELECT * FROM {t} LIMIT 50").df()
+                     st.caption(f"Preview — {label}")
+                     st.dataframe(df, use_container_width=True, hide_index=True)
+                 con.close()
+         except Exception as e:
+             st.warning(f"Could not preview tables: {e}")
+
+ # =====================
+ # Chat with your dataset
+ # (Appends after overview & preview; leaves earlier logic untouched)
+ # =====================
+ if st.session_state.get("last_overview_md"):
+     st.divider()
+     st.subheader("Chat with your dataset")
+
+     # Lazy imports so nothing changes before the preview completes
+     def _lazy_imports():
+         from duckdb_react_agent import get_schema_summary, make_llm, answer_question  # noqa: F401
+         return get_schema_summary, make_llm, answer_question
+
+     # Initialize chat memory
+     st.session_state.setdefault("chat_history", [])  # [{role, content, sql?, plot_path?}]
+
+     # --- 1) Take input first so the user's question appears immediately ---
+     user_q = st.chat_input("Ask a question about the dataset…")
+     if user_q:
+         st.session_state.chat_history.append({"role": "user", "content": user_q})
+
+     # --- 2) Render history in strict User → Assistant order ---
+     for msg in st.session_state.chat_history:
+         with st.chat_message("user" if msg["role"] == "user" else "assistant"):
+             # Always: text first
+             st.markdown(msg["content"])
+
+             # Then: optional plot (below the answer)
+             plot_path_hist = msg.get("plot_path")
+             if plot_path_hist:
+                 if not os.path.isabs(plot_path_hist):
+                     plot_path_hist = str((PRIMARY_DIR / plot_path_hist).resolve())
+                 if os.path.exists(plot_path_hist):
+                     st.image(plot_path_hist, caption="Chart", width=520)
+
+             # Finally: optional SQL expander
+             if msg.get("sql"):
+                 with st.expander("View generated SQL", expanded=False):
+                     st.markdown(f"<div class='sqlbox'>{msg['sql']}</div>", unsafe_allow_html=True)
+
+     # --- 3) If a new question arrived, stream the assistant answer now ---
+     if user_q:
+         with st.chat_message("assistant"):
+             # Placeholders in sequence: text → plot → SQL
+             stream_placeholder = st.empty()
+             plot_placeholder = st.empty()
+             sql_placeholder = st.empty()
+
+             # Show a pending state immediately
+             stream_placeholder.markdown("_Answer pending…_")
+
+             partial_chunks = []
+
+             def on_token(t: str):
+                 partial_chunks.append(t)
+                 stream_placeholder.markdown("".join(partial_chunks))
+
+             # Resolve the dataset path for this session
+             if 'db_path' in locals():
+                 _db_path = db_path  # from the preview scope, if defined
+             elif 'file' in locals() and file is not None:
+                 _stem = Path(file.name).stem
+                 _db_path = DB_DIR / f"{_stem}.duckdb"
+             else:
+                 _candidates = sorted(DB_DIR.glob("*.duckdb"), key=lambda p: p.stat().st_mtime, reverse=True)
+                 _db_path = _candidates[0] if _candidates else None
+
+             if not _db_path or not Path(_db_path).exists():
+                 stream_placeholder.error("No dataset found. Please re-upload the Excel file in this session.")
+             else:
+                 # Call the agent lazily
+                 get_schema_summary, make_llm, answer_question = _lazy_imports()
+                 try:
+                     try:
+                         con2 = duckdb.connect(str(_db_path), read_only=True)
+                     except Exception:
+                         con2 = duckdb.connect(str(_db_path))
+
+                     schema_text = get_schema_summary(con2, allowed_schemas=["main"])
+                     llm = make_llm(model=os.environ.get("OPENAI_MODEL", "gpt-4o-mini"), temperature=0.0)
+
+                     # Feature-detect the agent's signature so older versions still work
+                     import inspect as _inspect
+                     try:
+                         _sig = _inspect.signature(answer_question)
+                     except Exception:
+                         _sig = None
+
+                     def _call_answer():
+                         try:
+                             if _sig and all(p in _sig.parameters for p in ("history", "token_callback", "stream")):
+                                 return answer_question(con2, llm, schema_text, user_q, stream=True, token_callback=on_token, history=st.session_state.chat_history)
+                             elif _sig and all(p in _sig.parameters for p in ("token_callback", "stream")):
+                                 return answer_question(con2, llm, schema_text, user_q, stream=True, token_callback=on_token)
+                             else:
+                                 return answer_question(con2, llm, schema_text, user_q)
+                         except TypeError:
+                             try:
+                                 return answer_question(con2, llm, schema_text, user_q, stream=True, token_callback=on_token)
+                             except TypeError:
+                                 return answer_question(con2, llm, schema_text, user_q)
+
+                     result = _call_answer()
+
+                     # Finalize the text
+                     answer_text = result.get("answer") or "".join(partial_chunks) or "*No answer produced.*"
+                     stream_placeholder.markdown(answer_text)
+
+                     # Show the plot next (slightly larger, below the text)
+                     plot_path = result.get("plot_path")
+                     if plot_path:
+                         if not os.path.isabs(plot_path):
+                             plot_path = str((PRIMARY_DIR / plot_path).resolve())
+                         if os.path.exists(plot_path):
+                             plot_placeholder.image(plot_path, caption="Chart", width=560)
+
+                     # Finally, show the SQL
+                     gen_sql = (result.get("sql") or "").strip()
+                     if gen_sql:
+                         with sql_placeholder.expander("View generated SQL", expanded=False):
+                             st.markdown(f"<div class='sqlbox'>{gen_sql}</div>", unsafe_allow_html=True)
+
+                     # Persist the assistant message
+                     st.session_state.chat_history.append({
+                         "role": "assistant",
+                         "content": answer_text,
+                         "sql": gen_sql,
+                         "plot_path": result.get("plot_path"),
+                     })
+                 finally:
+                     # Single close path; safe even if the connection never opened
+                     try:
+                         con2.close()
+                     except Exception:
+                         pass