sofhiaazzhr commited on
Commit
8557a20
Β·
1 Parent(s): c93ec90

initial commit

Browse files
ARCHITECTURE.md ADDED
@@ -0,0 +1,343 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Architecture β€” Data Eyond Agentic Service
2
+
3
+ **Last updated**: 2026-05-07
4
+ **Status**: Design phase β€” folder skeleton in place, implementation in progress
5
+
6
+ ---
7
+
8
+ ## TL;DR
9
+
10
+ A catalog-driven AI service for data analysis. Users upload documents and register databases or tabular files; they ask natural-language questions and get answers grounded in their data.
11
+
12
+ The architecture has two paths:
13
+
14
+ - **Unstructured** (PDF, DOCX, TXT) β€” dense similarity over prose chunks (the right primitive for free-form text).
15
+ - **Structured** (databases, XLSX, CSV, Parquet) β€” a per-user **data catalog** describes what tables/columns exist; an LLM produces a structured **JSON intermediate representation (IR)** of the user's intent; a deterministic compiler turns the IR into SQL or pandas operations.
16
+
17
+ The LLM produces *intent*, not query syntax. Deterministic code does the rest.
18
+
19
+ ---
20
+
21
+ ## 1. Why catalog-driven design
22
+
23
+ For a database or spreadsheet, a user's question maps to *known tables and columns* β€” not to *similar text fragments*. Treating structured data with the same retrieval primitive as prose (chunk + embed + rank top-K) makes the right column survive a probabilistic ranking lottery. Catalog-based **lookup** is the right primitive instead.
24
+
25
+ A central per-user catalog also means:
26
+
27
+ - One place to keep table/column descriptions (AI-generated, refreshed when the source changes).
28
+ - The query planner sees the user's full data landscape in a single prompt.
29
+ - Schema stays stable across user sessions without hitting the source DB on every query.
30
+ - New sources auto-update the catalog without re-embedding chunks.
31
+
32
+ ---
33
+
34
+ ## 2. Source taxonomy
35
+
36
+ ```
37
+ Sources
38
+ β”œβ”€β”€ Unstructured (pdf, docx, txt) β†’ Cu (prose chunks via DocumentRetriever)
39
+ └── Structured
40
+ β”œβ”€β”€ Schema (DB) β†’ Cs (DB tables + columns)
41
+ └── Tabular (xlsx, csv, parquet) β†’ Ct (sheets + columns)
42
+ Cs βˆͺ Ct = Data Catalog Context
43
+ ```
44
+
45
+ - **Cu** = unstructured prose context. Retrieval primitive: dense similarity over chunks.
46
+ - **Cs** = DB schema context (tables, columns, descriptions, sample values).
47
+ - **Ct** = tabular file context (sheets, columns, descriptions, sample values).
48
+ - **Data Catalog Context** = `Cs βˆͺ Ct`. Passed to the query planner as a single unified view.
49
+
50
+ DB vs tabular is **not** a routing concern β€” it's a per-source attribute (`source_type`) on each catalog entry. The split only matters at execution time (SQL vs pandas).
51
+
52
+ ---
53
+
54
+ ## 3. Routing model
55
+
56
+ ```
57
+ source_hint ∈ { chat, unstructured, structured }
58
+ ```
59
+
60
+ - `chat` β€” no search, conversational reply only
61
+ - `unstructured` β€” DocumentRetriever path (Cu)
62
+ - `structured` β€” catalog-driven path (Cs βˆͺ Ct β†’ planner β†’ compiler β†’ executor)
63
+
64
+ The router commits to one path. Cross-source questions ("compare DB sales vs uploaded customer file") are handled inside the structured path because the planner sees both Cs and Ct in one prompt.
65
+
66
+ ---
67
+
68
+ ## 4. Core architectural decisions
69
+
70
+ ### 4.1 Catalog as primary context, not retrieval
71
+
72
+ For most users (≀50 tables), the entire catalog fits in ~3-5k tokens and is passed verbatim to the planner. No vector search, no BM25, no chunk retrieval. The LLM reads the whole catalog and picks the right table.
73
+
74
+ When a user has hundreds of tables, **catalog-level retrieval** (BM25 + table-level vectors with RRF) can be added as a slicer between `CatalogReader` and `Planner`. Deferred until measurably needed.
75
+
76
+ ### 4.2 JSON IR over raw SQL
77
+
78
+ The planner LLM emits a structured JSON IR describing query intent β€” not a SQL string. A deterministic compiler turns the IR into SQL (per dialect) or pandas/polars operations.
79
+
80
+ Benefits:
81
+
82
+ - Validatable with Pydantic before execution
83
+ - Compiler whitelists allowed operations (no DROP, DELETE, etc.)
84
+ - Portable: same IR β†’ SQL (any dialect) / pandas / polars
85
+ - Cheaper tokens, easier to debug, trivially testable without an LLM
86
+ - LLM cannot emit valid-but-wrong SQL syntax
87
+
88
+ ### 4.3 Deterministic compiler, not LLM SQL writer
89
+
90
+ The LLM produces *intent* (the IR). All actual query construction is deterministic Python. Compiler bugs are reproducible and fixable. Same IR always produces the same query.
91
+
92
+ ### 4.4 Pipeline stage isolation
93
+
94
+ Each stage is its own module with typed input and typed output. No god classes. Stages: `IntentRouter`, `CatalogReader`, `QueryPlanner`, `IRValidator`, `QueryCompiler`, `QueryExecutor`, `ChatbotAgent`. Each is testable in isolation.
95
+
96
+ ### 4.5 Minimal LLM surface
97
+
98
+ LLM calls happen in exactly four places:
99
+
100
+ 1. **`IntentRouter`** β€” once per user message
101
+ 2. **`CatalogEnricher`** β€” once per source, at ingestion (not query time)
102
+ 3. **`QueryPlanner`** β€” once per structured query (produces the IR)
103
+ 4. **`ChatbotAgent`** β€” once per answer (formats the response)
104
+
105
+ Compiler and executors are pure code. No LLM in the hot path of query construction.
106
+
107
+ ---
108
+
109
+ ## 5. End-to-end flow
110
+
111
+ ### Ingestion (when user uploads a file or connects a DB)
112
+
113
+ ```
114
+ source upload / DB connect
115
+ ↓
116
+ introspect schema (DB: information_schema; tabular: file headers + sample rows)
117
+ ↓
118
+ CatalogEnricher (1 LLM call per source β€” generates AI descriptions)
119
+ ↓
120
+ validate (Pydantic)
121
+ ↓
122
+ write to catalog store (Postgres jsonb, keyed by user_id)
123
+ ```
124
+
125
+ For unstructured files: chunk + embed β†’ PGVector.
126
+
127
+ ### Query (per user message)
128
+
129
+ ```
130
+ User message
131
+ ↓
132
+ Chat cache check (Redis, 24h TTL)
133
+ ↓ miss
134
+ Load chat history
135
+ ↓
136
+ IntentRouter LLM β†’ needs_search? source_hint?
137
+ ↓
138
+ β”œβ”€β”€ chat β†’ ChatbotAgent β†’ SSE stream
139
+ β”œβ”€β”€ unstructured β†’ DocumentRetriever β†’ answerer
140
+ └── structured β†’
141
+ CatalogReader (load full Cs βˆͺ Ct for user)
142
+ ↓
143
+ QueryPlanner LLM β†’ JSON IR
144
+ ↓
145
+ IRValidator (Pydantic + columns-exist + ops whitelist)
146
+ ↓
147
+ QueryCompiler β†’ SQL (schema source) or pandas (tabular source)
148
+ ↓
149
+ QueryExecutor (DbExecutor or TabularExecutor)
150
+ ↓
151
+ QueryResult
152
+ ↓
153
+ ChatbotAgent β†’ SSE stream
154
+ ```
155
+
156
+ ---
157
+
158
+ ## 6. Data catalog
159
+
160
+ ### Storage
161
+
162
+ Per-user JSON document, stored as a `jsonb` row in Postgres keyed by `user_id`.
163
+
164
+ ### Schema (initial scope)
165
+
166
+ ```
167
+ Catalog
168
+ β”œβ”€β”€ user_id, schema_version, generated_at
169
+ └── sources[]
170
+ └── Source
171
+ β”œβ”€β”€ source_id, source_type, name, description, location_ref, updated_at
172
+ └── tables[]
173
+ └── Table
174
+ β”œβ”€β”€ table_id, name, description, row_count
175
+ └── columns[]
176
+ └── Column
177
+ β”œβ”€β”€ column_id, name, data_type, description
178
+ β”œβ”€β”€ nullable
179
+ β”œβ”€β”€ pii_flag
180
+ β”œβ”€β”€ sample_values[]
181
+ └── stats: { min, max, distinct_count } | null
182
+ ```
183
+
184
+ ### Best-practice fields deferred
185
+
186
+ `description_human`, `synonyms[]`, `tags[]`, `primary_key`, `foreign_keys`, `unit`, `semantic_type`, `example_questions[]`, `schema_hash`, `enrichment_status`. Add when justified by user need.
187
+
188
+ ### Stable IDs
189
+
190
+ `source_id`, `table_id`, `column_id` are stable internal references. `name` fields can change (e.g. column rename in source DB) without invalidating cached IRs.
191
+
192
+ ### PII handling
193
+
194
+ Columns with `pii_flag: true` have `sample_values: null` β€” real values never enter LLM prompts. Auto-detected at ingestion via name patterns + value regex.
195
+
196
+ ---
197
+
198
+ ## 7. JSON IR
199
+
200
+ ### Schema (initial scope)
201
+
202
+ ```
203
+ QueryIR
204
+ β”œβ”€β”€ ir_version : "1.0"
205
+ β”œβ”€β”€ source_id : str (references catalog)
206
+ β”œβ”€β”€ table_id : str (references catalog)
207
+ β”œβ”€β”€ select[] : SelectItem
208
+ β”‚ β”œβ”€β”€ { kind: "column", column_id, alias? }
209
+ β”‚ └── { kind: "agg", fn, column_id?, alias? }
210
+ β”œβ”€β”€ filters[] : { column_id, op, value, value_type }
211
+ β”œβ”€β”€ group_by[] : column_id
212
+ β”œβ”€β”€ order_by[] : { column_id | alias, dir }
213
+ └── limit : int | null
214
+ ```
215
+
216
+ ### Whitelisted operators
217
+
218
+ ```
219
+ Filter ops: = != < <= > >= in not_in is_null is_not_null like between
220
+ Agg fns: count count_distinct sum avg min max
221
+ ```
222
+
223
+ ### Validation rules (enforced before execution)
224
+
225
+ - `source_id` exists in catalog for this user
226
+ - `table_id` belongs to that source
227
+ - Every `column_id` exists in that table
228
+ - Every `agg.fn` and `filter.op` is whitelisted
229
+ - `value_type` consistent with column's `data_type`
230
+ - `limit` positive int, ≀ hard cap (e.g. 10000)
231
+
232
+ If any rule fails β†’ reject IR β†’ re-prompt planner with error context (max 3 retries).
233
+
234
+ ### Deferred features
235
+
236
+ `having`, `offset`, boolean tree filters (OR/NOT), `distinct`, joins, window functions. Add as user demand proves the limitation.
237
+
238
+ ---
239
+
240
+ ## 8. Executors
241
+
242
+ Same input (validated IR), same output (`QueryResult`), different backends.
243
+
244
+ ### DbExecutor (schema sources)
245
+
246
+ ```
247
+ IR β†’ SqlCompiler β†’ SQL string + params
248
+ ↓
249
+ sqlglot validation (SELECT-only, whitelist tables/columns, LIMIT enforced)
250
+ ↓
251
+ asyncpg / pymysql in read-only transaction with timeout (30s)
252
+ ↓
253
+ QueryResult
254
+ ```
255
+
256
+ Identifiers come from catalog (verified at validation time, safe to inline as quoted identifiers). Values are always parameterized β€” never inlined as strings.
257
+
258
+ ### TabularExecutor (tabular sources)
259
+
260
+ ```
261
+ IR β†’ PandasCompiler β†’ operation chain
262
+ ↓
263
+ choose strategy by file size:
264
+ ≀ 100 MB β†’ eager pandas
265
+ 100 MB-1 GB β†’ pyarrow with predicate pushdown
266
+ > 1 GB β†’ polars lazy scan
267
+ ↓
268
+ execute in asyncio.to_thread (CPU work off the event loop)
269
+ ↓
270
+ QueryResult
271
+ ```
272
+
273
+ Initially eager pandas is sufficient. Add the others when a real file is too big.
274
+
275
+ ### Shared safety guarantees
276
+
277
+ 1. IR validated before reaching compiler
278
+ 2. Compiler is deterministic (no LLM)
279
+ 3. Identifiers from catalog (trusted)
280
+ 4. Values parameterized
281
+ 5. sqlglot second-line defence for SQL
282
+ 6. Read-only at every layer
283
+ 7. Timeouts and row caps
284
+
285
+ ---
286
+
287
+ ## 9. Implementation scope
288
+
289
+ ### Initial PR β€” what ships first
290
+
291
+ | Item | Folder |
292
+ |---|---|
293
+ | Data catalog Pydantic models | `src/catalog/models.py` |
294
+ | Catalog ingestion (introspect β†’ enrich β†’ validate β†’ store) | `src/catalog/`, `src/pipeline/` |
295
+ | `IntentRouter` with 3-way source_hint | `src/agents/` |
296
+ | `CatalogReader` (loads full catalog) | `src/catalog/reader.py` |
297
+ | `QueryPlanner` LLM call | `src/query/planner/` |
298
+ | JSON IR Pydantic models | `src/query/ir/models.py` |
299
+ | IR validator | `src/query/ir/validator.py` |
300
+
301
+ **Output**: a validated JSON IR object. Execution lands in a follow-up PR.
302
+
303
+ ### Follow-up PRs
304
+
305
+ | PR | Scope |
306
+ |---|---|
307
+ | 2 | `QueryCompiler` (IR β†’ SQL / pandas) |
308
+ | 3 | `QueryExecutor` split: `DbExecutor` + `TabularExecutor` |
309
+ | 4 | Retry / self-correction loop on execution failure |
310
+ | 5 | Eval harness (golden question→IR→result examples) |
311
+ | 6 | Auto PII tagging in catalog |
312
+ | Later | Joins in IR, schema drift detection, hybrid catalog search |
313
+
314
+ ---
315
+
316
+ ## 10. Open questions
317
+
318
+ | # | Question | Why it matters |
319
+ |---|---|---|
320
+ | 1 | Catalog storage: JSON file per user vs Postgres `jsonb` row? | Affects ingestion + read performance |
321
+ | 2 | Should the catalog also list unstructured files (with descriptions only)? | Gives router unified view of all user sources |
322
+ | 3 | Catalog refresh trigger: explicit "rebuild" button, on every upload, or background TTL? | Staleness vs latency tradeoff |
323
+ | 4 | Confirm joins are out of initial IR scope? | Limits what user questions can be answered |
324
+ | 5 | PII handling for sample_values: mask, synthesize, or skip? | Affects what gets sent to LLM prompts |
325
+
326
+ ---
327
+
328
+ ## 11. References
329
+
330
+ - `docs/flowchart.html` β€” interactive end-to-end diagram (open in browser)
331
+ - `docs/flowchart.mmd` β€” mermaid source for the diagram
332
+
333
+ ---
334
+
335
+ ## Glossary
336
+
337
+ - **Cu** β€” unstructured context (prose chunks)
338
+ - **Cs** β€” schema context (DB tables/columns from catalog)
339
+ - **Ct** β€” tabular context (file sheets/columns from catalog)
340
+ - **IR** β€” intermediate representation (the JSON query shape)
341
+ - **PR** β€” pull request (a unit of code change)
342
+ - **PII** β€” personally identifiable information (names, emails, etc.)
343
+ - **ABC** β€” abstract base class (Python contract for subclasses)
src/catalog/README.md ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ # catalog
2
+
3
+ Per-user data catalog: identity layer for structured sources (DB schemas + tabular files).
4
+ Holds AI-enriched table/column descriptions, consumed by `query/planner` to generate JSON IR.
5
+
6
+ See `ARCHITECTURE.md` (root) for the full design.
src/catalog/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """Catalog domain β€” per-user data catalog (Cs + Ct)."""
src/catalog/introspect/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """Source-specific schema introspection (databases, tabular files)."""
src/query/README.md ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # query
2
+
3
+ Catalog-driven query subsystem. User question β†’ IR β†’ SQL/pandas β†’ result.
4
+
5
+ Subpackages:
6
+ - `ir/` β€” JSON IR Pydantic models + validator
7
+ - `planner/` β€” LLM step: question + catalog β†’ IR
8
+ - `compiler/` β€” deterministic IR β†’ SQL or pandas op chain (no LLM)
9
+ - `executor/` β€” runs the compiled query against DB or Parquet
10
+
11
+ See `ARCHITECTURE.md` (root) for the full design.
src/query/compiler/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """Deterministic IR β†’ SQL / pandas compilers (no LLM)."""
src/query/executor/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """Query executors β€” run compiled queries against user DBs or tabular files."""
src/query/ir/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """JSON IR (intermediate representation) for catalog-driven queries."""
src/query/planner/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """LLM-based query planner β€” turns user questions + catalog into JSON IR."""
src/retrieval/README.md ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ # retrieval
2
+
3
+ Unstructured-source retrieval (PDF, DOCX, TXT) β€” Cu in the architecture.
4
+ Dense similarity over prose chunks via PGVector.
5
+
6
+ Structured (DB / tabular) sources do **not** pass through here β€” they go through `catalog/` + `query/`.
7
+
8
+ See `ARCHITECTURE.md` (root) for the full design.
src/retrieval/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """Retrieval for unstructured sources (Cu) β€” prose chunks via dense similarity."""
src/security/README.md ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ # security
2
+
3
+ Cross-cutting security primitives:
4
+ - credential encryption (Fernet) for stored DB credentials
5
+ - authentication / password / JWT helpers
6
+ - PII detection patterns used by the catalog enricher
7
+
8
+ Consolidates utilities previously split between `utils/` and `users/`.
src/security/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """Security primitives β€” credentials, auth, PII patterns."""