jashdoshi77 committed on
Commit 559689a
1 Parent(s): f409fe8

changed million and billion to lakhs and crores

Files changed (2)
  1. ai/signatures.py +10 -0
  2. claude.md +379 -0
ai/signatures.py CHANGED
@@ -104,6 +104,16 @@ class SQLCritiqueAndFix(dspy.Signature):
 class InterpretAndInsight(dspy.Signature):
     """Interpret SQL query results for a non-technical user and generate insights.
 
+    All monetary values are in INDIAN RUPEES (INR).
+    When talking about amounts, you MUST:
+    - Prefer the Indian number system (thousands, lakhs, crores) instead of millions/billions.
+    - Example conversions:
+      - 1,00,000 = 1 lakh
+      - 10,00,000 = 10 lakhs
+      - 1,00,00,000 = 1 crore
+    - Never say "million" or "billion". Use "lakhs" and "crores" instead when numbers are large.
+    - If exact conversion is unclear, keep numbers as raw INR amounts with commas (e.g., 12,34,56,789 INR).
+
     1. Summarize the main findings in plain English (2-3 sentences)
     2. Identify patterns, dominant contributors, outliers, and business implications"""
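The prompt text above only tells the LLM how to phrase amounts; the same convention can also be applied deterministically in code. A minimal sketch (a hypothetical helper, not part of this commit):

```python
def format_inr(amount: float) -> str:
    """Format an INR amount using the Indian number system (lakh/crore)."""
    sign = "-" if amount < 0 else ""
    amount = abs(amount)
    if amount >= 1_00_00_000:      # 1 crore = 1,00,00,000
        return f"{sign}{amount / 1_00_00_000:.2f} crore"
    if amount >= 1_00_000:         # 1 lakh = 1,00,000
        return f"{sign}{amount / 1_00_000:.2f} lakh"
    return f"{sign}{amount:,.0f} INR"
```

The thresholds follow the diff's own examples: 1,00,000 = 1 lakh and 1,00,00,000 = 1 crore.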
 
claude.md ADDED
@@ -0,0 +1,379 @@
# SQLBOT — AI SQL Analyst (Project Overview)

This document gives a detailed overview of the project:

- What the app does
- Tech stack
- Project structure
- Responsibilities of each module/file
- How the end-to-end pipeline works
- How data and schema stay in sync as Excel files change

---

## 1. What this project does

**Goal:** Provide an AI SQL analyst that can answer natural-language questions about your PostgreSQL (Neon) data, generating SQL, executing it safely, and returning both results and human-readable explanations.

Key properties:

- **Dynamic schema awareness**: No hardcoded table/column lists. The app introspects the live Neon database on every run.
- **Excel-driven data**: Source data lives in Excel files. A sync script loads them into Neon as normalized tables.
- **Safe SQL execution**: Only `SELECT` queries are allowed; dangerous commands are blocked.
- **Multi-turn memory**: The chatbot remembers the last few Q&A turns per browser session (stored in Neon) to handle follow-up questions.
- **Deployable**:
  - As a Docker app on Hugging Face Spaces (`sdk: docker`).
  - Mirrored in a GitHub repo.

---

## 2. Tech stack

- **Backend framework**: FastAPI
- **Application server**: uvicorn
- **Database**: PostgreSQL (Neon cloud), accessed via SQLAlchemy engines
- **Data loading**: pandas + SQLAlchemy `to_sql` from Excel into Postgres
- **AI / LLM orchestration**:
  - dspy for defining prompt "signatures" and multi-step pipelines
  - groq client (and optionally openai) via litellm / client libraries
- **Config / env**: python-dotenv and a central `config.py`
- **Frontend**: Vanilla HTML / CSS / JS (no framework), served by FastAPI
- **Containerization**: Dockerfile for Hugging Face Spaces (`sdk: docker`)
- **Version control**: git, with remotes to Hugging Face Space and GitHub

---

## 3. High-level architecture

The system has four main layers:

1. **Data layer (Neon + Excel sync)**
   - Excel files (`inventory_v5.xlsx`, `purchase_orders_v6.xlsx`, `sales_table_v2.xlsx`, etc.)
   - `data_sync.py` converts Excel sheets → normalized Postgres tables.
   - Dynamic schema + relationship + profiling components.

2. **AI reasoning layer**
   - `ai/signatures.py`: prompt contracts (Analyze & Plan, Generate SQL, Repair, Interpret & Insight).
   - `ai/pipeline.py`: orchestrates LLM calls, validation, and execution.
   - `ai/groq_setup.py`: loads LLM clients from environment/config.

3. **API layer (FastAPI)**
   - `app.py`: defines REST endpoints (`/chat`, `/generate-sql`, `/schema`, etc.).
   - Handles conversation IDs and stores/retrieves chat history from Neon.

4. **Frontend**
   - `frontend/index.html` + `style.css` + `script.js`.
   - SPA-style UI that calls `/chat` and renders SQL, table results, explanations, and insights.

---

## 4. Project structure and file responsibilities

### 4.1. Top-level

#### `app.py`

Main FastAPI application and entrypoint when run with uvicorn.

Responsibilities:

- Create the FastAPI app and configure CORS.
- Define request/response models:
  - `QuestionRequest`: `{ question, provider, conversation_id }`
  - `GenerateSQLResponse`: `{ sql }`
  - `ChatResponse`: `{ sql, data, answer, insights }`
- Endpoints:
  - `POST /generate-sql`
    - Uses `SQLAnalystPipeline.generate_sql_only(question)` to return SQL only.
  - `POST /chat`
    - Imports `SQLAnalystPipeline` and `db.memory` functions.
    - Accepts `question`, `provider`, `conversation_id`.
    - Fetches the last 5 conversation turns for that `conversation_id`.
    - Builds an augmented prompt including recent Q&A context.
    - Runs `pipeline.run(question_with_context)`.
    - Stores the new turn (original question, answer, SQL) in the `chat_history` table.
  - `GET /schema`
    - Returns the structured schema from `db.schema.get_schema()`.
  - `GET /relationships`
    - Returns inferred table relationships from `db.relationships.discover_relationships()`.
- Frontend serving:
  - Mounts `/static` to serve `frontend` assets.
  - `GET /` returns `frontend/index.html`.
- Local dev entrypoint: `if __name__ == "__main__": uvicorn.run("app:app", ...)`.

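The context-building step of `POST /chat` can be sketched as follows; the helper name and the history shape are illustrative assumptions, not the repo's actual code:

```python
def build_prompt_with_context(question: str, history: list[dict]) -> str:
    """Prepend recent Q&A turns (oldest first) to the new question.

    `history` is assumed to be a list of {"question": ..., "answer": ...}
    dicts, as a get_recent_history-style helper would return.
    """
    if not history:
        return question
    lines = []
    for turn in history:
        lines.append(f"Q: {turn['question']}")
        lines.append(f"A: {turn['answer']}")
    return "Previous conversation:\n" + "\n".join(lines) + f"\n\nNew question: {question}"
```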

#### `config.py`

Central configuration for environment variables and defaults.

- Loads `.env` via `dotenv.load_dotenv()`.
- Exposes:
  - `DATABASE_URL`
  - `GROQ_API_KEY`, `GROQ_MODEL`
  - `OPENAI_API_KEY`, `OPENAI_MODEL`

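A minimal sketch of what such a `config.py` typically looks like; the model-name defaults below are placeholders, not values taken from the repo:

```python
# Sketch of a dotenv-backed config module for this stack.
import os
from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # read .env into the process environment

DATABASE_URL = os.getenv("DATABASE_URL")
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
GROQ_MODEL = os.getenv("GROQ_MODEL", "llama-3.1-8b-instant")      # placeholder default
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")           # placeholder default
```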

#### `data_sync.py`

Excel → PostgreSQL data synchronization script.

- CLI usage:
  - `python data_sync.py path/to/file.xlsx`
  - `python data_sync.py path/to/folder/`
- `normalize_column(name)`: cleans and normalizes column names (lowercase, non-alphanumeric → `_`, dedupe).
- `sync_dataframe(df, table_name)`: writes the DataFrame to Postgres with `if_exists="replace"`.
- `sync_excel(filepath)`:
  - If one sheet: table name from the file name.
  - If multiple sheets: each becomes `filebasename_sheetname` (normalized).

This script is how new Excel data is pushed into Neon; the chatbot then automatically picks up the new schema.

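The column-cleaning step can be sketched like this (a simplified version; the dedupe handling mentioned above is omitted and the real implementation may differ):

```python
import re

def normalize_column(name: str) -> str:
    """Lowercase a column name and collapse non-alphanumeric runs to '_'."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", str(name).strip().lower())
    return name.strip("_")
```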

#### `Dockerfile`

Container build for Hugging Face Spaces (Docker SDK).

- Base image: `python:3.11-slim`.
- Installs system deps, copies the project, installs `requirements.txt`.
- Sets `PORT=7860` and runs `uvicorn app:app`.

#### `.dockerignore`

Avoids sending unnecessary/secret files in the Docker build context:

- Ignores `__pycache__`, `.git`, `.env`, virtualenvs, and large `.xlsx` files.

#### `.gitignore`

Standard Python/git hygiene and secret protection:

- Ignores `.env`, virtualenvs, `__pycache__`, editor/OS junk, Excel data files.

#### `README.md`

Space / repo metadata and human‑oriented overview.

- YAML frontmatter recognized by Hugging Face:

```yaml
---
title: sqlbot
emoji: 🧠
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---
```

---

### 4.2. `ai` package — LLM reasoning

#### `ai/signatures.py`

Defines DSPy Signatures that describe the inputs/outputs of each LLM stage.

Main signatures:

- `AnalyzeAndPlan`
  - Inputs: `question`, `schema_info`, `relationships`, `data_profile`.
  - Outputs: `intent`, `relevant_tables`, `relevant_columns`, `join_conditions`, `where_conditions`, `aggregations`, `group_by`, `order_by`, `limit_val`.
  - Prompt includes business rules (e.g., how to treat status columns, transaction vs catalog queries).
- `SQLGeneration`
  - Inputs: `question`, `schema_info`, `query_plan`.
  - Output: `sql_query` as raw PostgreSQL `SELECT` (no markdown, no explanation).
- `SQLCritiqueAndFix`
  - Evaluates SQL vs schema and can generate corrected SQL.
- `InterpretAndInsight`
  - Inputs: `question`, `sql_query`, `query_results` (JSON).
  - Outputs: `answer` (plain-language explanation) and `insights` (3–5 analytic bullet points).
- `SQLRepair`
  - Given failing SQL + error message + schema + question, outputs corrected raw SQL.

#### `ai/pipeline.py`

Orchestrates the full reasoning flow via `SQLAnalystPipeline`.

Key steps in `run(question)`:

1. Build context:
   - `schema_str = format_schema()` from `db.schema`.
   - `rels_str = format_relationships()` from `db.relationships`.
   - `profile_str = get_data_profile()` from `db.profiler`.
2. Analyze & Plan:
   - Calls `self.analyze(...)` to create a structured plan.
3. SQL Generation:
   - Calls `self.generate_sql(...)`, cleans the raw text to pure SQL.
4. Schema validation:
   - Uses `check_sql_against_schema` to detect non-existing tables/columns and optionally regenerates SQL with feedback.
5. Safety validation:
   - `validate_sql(sql)` ensures a safe `SELECT` query only.
6. Execution + repair loop:
   - Uses `execute_sql(sql)`.
   - On DB error, calls `self.repair(...)` and retries up to `MAX_REPAIR_RETRIES`.
7. Interpretation & insights:
   - Serializes up to 50 result rows.
   - Calls `self.interpret(question=..., sql_query=..., query_results=...)`.
8. Returns a dict: `{ "sql": sql, "data": rows, "answer": answer, "insights": insights }`.

Also exposes `generate_sql_only(question)` and helper `_clean_sql`.

---

### 4.3. `db` package — database utilities

#### `db/connection.py`

Singleton SQLAlchemy engine and connection helpers.

- `get_engine()` uses `config.DATABASE_URL`.
- `get_connection()` returns a new connection context manager.

#### `db/schema.py`

Schema introspection via `information_schema.columns`.

- `get_schema(force_refresh=False)` returns `{table_name: [{column_name, data_type, is_nullable}, ...]}`.
- `format_schema()` returns a prompt‑friendly string view of the schema.
- `get_table_names()` returns a list of all public tables.

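The formatting half of `format_schema()` can be sketched as a pure function over the dict shape above (the real function takes no arguments and reads the live schema):

```python
def format_schema(schema: dict) -> str:
    """Render {table: [{column_name, data_type, is_nullable}, ...]} as prompt text."""
    lines = []
    for table, columns in schema.items():
        lines.append(f"Table: {table}")
        for col in columns:
            nullable = "NULL" if col["is_nullable"] == "YES" else "NOT NULL"
            lines.append(f"  - {col['column_name']} ({col['data_type']}, {nullable})")
    return "\n".join(lines)
```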
#### `db/relationships.py`

Relationship discovery between tables.

- Reads explicit foreign keys from `information_schema.table_constraints`.
- Adds implicit relationships:
  - Exact column name matches
  - ID-pattern matches (`*_id`, `*_key`)
  - Fuzzy name similarity
- `format_relationships()` renders them as readable text.

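The implicit-relationship heuristics can be sketched with the standard library; the threshold and function names here are illustrative assumptions, not the repo's code:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Fuzzy similarity between two column names (0.0-1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def implicitly_related(col_a: str, col_b: str, threshold: float = 0.75) -> bool:
    """Exact name match, or an ID-pattern column fuzzily similar to the other."""
    if col_a == col_b:
        return True
    id_pattern = col_a.endswith(("_id", "_key")) or col_b.endswith(("_id", "_key"))
    return id_pattern and name_similarity(col_a, col_b) >= threshold
```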
#### `db/profiler.py`

Profiles actual database content to give the LLM richer context:

- Row counts.
- Distinct values and counts for categorical columns.
- Min/max/avg for numeric columns.
- Date ranges for date columns.
- Adds business-rule text to the profile for the LLM to follow.

Results are cached to reduce DB load.

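The caching behavior can be sketched as a simple TTL cache; the real profiler's signature and TTL are not specified here, so the `compute` parameter is purely illustrative:

```python
import time

_cache = {"profile": None, "at": 0.0}
PROFILE_TTL = 300  # seconds; illustrative value

def get_data_profile(compute, force_refresh=False):
    """Return a cached profile, recomputing via compute() when stale."""
    now = time.time()
    stale = _cache["profile"] is None or now - _cache["at"] > PROFILE_TTL
    if force_refresh or stale:
        _cache["profile"] = compute()
        _cache["at"] = now
    return _cache["profile"]
```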
#### `db/executor.py`

Safe SQL execution against PostgreSQL.

- Validates SQL with `validate_sql` (only `SELECT`/`WITH`).
- Executes using SQLAlchemy and returns:
  - Success flag
  - Data rows (as a list of dicts)
  - Column names
  - Error string (on failure)

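A minimal sketch of a `validate_sql`-style check; the repo's actual validator may be stricter:

```python
import re

FORBIDDEN = ("insert", "update", "delete", "drop", "alter",
             "truncate", "create", "grant", "revoke")

def validate_sql(sql: str) -> bool:
    """Allow a single SELECT/WITH statement and reject destructive SQL."""
    stripped = sql.strip().lower().rstrip(";")
    if ";" in stripped:  # more than one statement
        return False
    if not stripped.startswith(("select", "with")):
        return False
    # Reject whole-word occurrences of dangerous keywords anywhere.
    return not any(re.search(rf"\b{kw}\b", stripped) for kw in FORBIDDEN)
```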
#### `db/memory.py`

Conversation memory stored in Neon (`chat_history` table).

- Ensures the table exists:

```sql
CREATE TABLE IF NOT EXISTS chat_history (
    id BIGSERIAL PRIMARY KEY,
    conversation_id TEXT,
    question TEXT,
    answer TEXT,
    sql_query TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW()
)
```

- `add_turn(conversation_id, question, answer, sql_query)` inserts a Q/A turn.
- `get_recent_history(conversation_id, limit=5)` returns the last `limit` turns (oldest first).

Used by `/chat` to give the LLM context for follow-up questions.

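The oldest-first ordering of a `get_recent_history`-style helper is typically done with a nested query; a sketch (`:cid` is a bound parameter, and the function name is illustrative):

```python
def recent_history_query(limit: int = 5) -> str:
    """Build SQL for the last `limit` turns of one conversation.

    The inner query takes the newest rows (DESC + LIMIT); the outer
    query re-sorts them oldest-first, the order the prompt wants.
    """
    return (
        "SELECT question, answer, sql_query, created_at FROM ("
        " SELECT question, answer, sql_query, created_at"
        " FROM chat_history WHERE conversation_id = :cid"
        f" ORDER BY created_at DESC LIMIT {int(limit)}"
        ") AS recent ORDER BY created_at ASC"
    )
```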
---

### 4.4. Frontend (`frontend` folder)

#### `frontend/index.html`

Main UI markup.

- Header with logo, title, and model switcher (Groq / OpenAI).
- Input section with textarea and submit button.
- Loading indicator with step animation.
- Results section:
  - Generated SQL
  - Query results table
  - Explanation
  - Insights
- Error section for displaying API errors.

#### `frontend/style.css`

Visual styling for a modern, dark-themed UI:

- Glassmorphism cards, gradient background, responsive layout.
- Styled table, tags, buttons, and loading indicators.

#### `frontend/script.js`

Frontend logic and API integration.

- Tracks:
  - Selected model provider.
  - A persistent `conversationId` stored in `localStorage`.
- Handles:
  - Submitting questions (button or Enter).
  - Calling `POST /chat` with `{ question, provider, conversation_id }`.
  - Rendering SQL, tabular data, answer text, and insights.
  - Showing loading state and errors.
  - Copy-to-clipboard for generated SQL.

---

## 5. End-to-end flow summary

1. **Data ingestion**
   - You add/update Excel files.
   - Run `python data_sync.py <file or folder>` to replace tables in Neon.

2. **User interaction**
   - User opens the web UI (locally or on Hugging Face).
   - Types a question and clicks submit.

3. **API request**
   - Frontend sends JSON `{ question, provider, conversation_id }` to `/chat`.

4. **Context building & memory**
   - Backend loads recent chat history from `db.memory`.
   - Builds an augmented question including recent Q&A.

5. **Reasoning pipeline**
   - `SQLAnalystPipeline` uses live schema, relationships, and data profile from Neon.
   - Generates, validates, and (if needed) repairs SQL.
   - Executes SQL and interprets results into explanations and insights.

6. **Response**
   - API returns `{ sql, data, answer, insights }`.
   - Frontend renders the results and the turn is saved to `chat_history` for future context.

---

## 6. Design principles

- **Schema-driven, not hardcoded**
  - Schema and relationships are discovered dynamically from Neon.
- **Separation of concerns**
  - Clear layers: data sync, schema/relationships, LLM reasoning, API, frontend.
- **Safe by default**
  - Only `SELECT` queries are executed; destructive SQL is rejected.
- **Deployable & portable**
  - Dockerfile + `sdk: docker` make it simple to run on Hugging Face or other container platforms.
- **Extensible**
  - Business rules live in prompt text and can evolve.
  - Memory is a simple table and can be extended with more metadata as needed.