Commit Β·
8557a20
1
Parent(s): c93ec90
initial commit
Browse files- ARCHITECTURE.md +343 -0
- src/catalog/README.md +6 -0
- src/catalog/__init__.py +1 -0
- src/catalog/introspect/__init__.py +1 -0
- src/query/README.md +11 -0
- src/query/compiler/__init__.py +1 -0
- src/query/executor/__init__.py +1 -0
- src/query/ir/__init__.py +1 -0
- src/query/planner/__init__.py +1 -0
- src/retrieval/README.md +8 -0
- src/retrieval/__init__.py +1 -0
- src/security/README.md +8 -0
- src/security/__init__.py +1 -0
ARCHITECTURE.md
ADDED
|
@@ -0,0 +1,343 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Architecture β Data Eyond Agentic Service
|
| 2 |
+
|
| 3 |
+
**Last updated**: 2026-05-07
|
| 4 |
+
**Status**: Design phase β folder skeleton in place, implementation in progress
|
| 5 |
+
|
| 6 |
+
---
|
| 7 |
+
|
| 8 |
+
## TL;DR
|
| 9 |
+
|
| 10 |
+
A catalog-driven AI service for data analysis. Users upload documents and register databases or tabular files; they ask natural-language questions and get answers grounded in their data.
|
| 11 |
+
|
| 12 |
+
The architecture has two paths:
|
| 13 |
+
|
| 14 |
+
- **Unstructured** (PDF, DOCX, TXT) β dense similarity over prose chunks (the right primitive for free-form text).
|
| 15 |
+
- **Structured** (databases, XLSX, CSV, Parquet) β a per-user **data catalog** describes what tables/columns exist; an LLM produces a structured **JSON intermediate representation (IR)** of the user's intent; a deterministic compiler turns the IR into SQL or pandas operations.
|
| 16 |
+
|
| 17 |
+
The LLM produces *intent*, not query syntax. Deterministic code does the rest.
|
| 18 |
+
|
| 19 |
+
---
|
| 20 |
+
|
| 21 |
+
## 1. Why catalog-driven design
|
| 22 |
+
|
| 23 |
+
For a database or spreadsheet, a user's question maps to *known tables and columns* β not to *similar text fragments*. Treating structured data with the same retrieval primitive as prose (chunk + embed + rank top-K) makes the right column survive a probabilistic ranking lottery. Catalog-based **lookup** is the right primitive instead.
|
| 24 |
+
|
| 25 |
+
A central per-user catalog also means:
|
| 26 |
+
|
| 27 |
+
- One place to keep table/column descriptions (AI-generated, refreshed when the source changes).
|
| 28 |
+
- The query planner sees the user's full data landscape in a single prompt.
|
| 29 |
+
- Schema stays stable across user sessions without hitting the source DB on every query.
|
| 30 |
+
- New sources auto-update the catalog without re-embedding chunks.
|
| 31 |
+
|
| 32 |
+
---
|
| 33 |
+
|
| 34 |
+
## 2. Source taxonomy
|
| 35 |
+
|
| 36 |
+
```
|
| 37 |
+
Sources
|
| 38 |
+
βββ Unstructured (pdf, docx, txt) β Cu (prose chunks via DocumentRetriever)
|
| 39 |
+
βββ Structured
|
| 40 |
+
βββ Schema (DB) β Cs (DB tables + columns)
|
| 41 |
+
βββ Tabular (xlsx, csv, parquet) β Ct (sheets + columns)
|
| 42 |
+
Cs βͺ Ct = Data Catalog Context
|
| 43 |
+
```
|
| 44 |
+
|
| 45 |
+
- **Cu** = unstructured prose context. Retrieval primitive: dense similarity over chunks.
|
| 46 |
+
- **Cs** = DB schema context (tables, columns, descriptions, sample values).
|
| 47 |
+
- **Ct** = tabular file context (sheets, columns, descriptions, sample values).
|
| 48 |
+
- **Data Catalog Context** = `Cs βͺ Ct`. Passed to the query planner as a single unified view.
|
| 49 |
+
|
| 50 |
+
DB vs tabular is **not** a routing concern β it's a per-source attribute (`source_type`) on each catalog entry. The split only matters at execution time (SQL vs pandas).
|
| 51 |
+
|
| 52 |
+
---
|
| 53 |
+
|
| 54 |
+
## 3. Routing model
|
| 55 |
+
|
| 56 |
+
```
|
| 57 |
+
source_hint β { chat, unstructured, structured }
|
| 58 |
+
```
|
| 59 |
+
|
| 60 |
+
- `chat` β no search, conversational reply only
|
| 61 |
+
- `unstructured` β DocumentRetriever path (Cu)
|
| 62 |
+
- `structured` β catalog-driven path (Cs βͺ Ct β planner β compiler β executor)
|
| 63 |
+
|
| 64 |
+
The router commits to one path. Cross-source questions ("compare DB sales vs uploaded customer file") are handled inside the structured path because the planner sees both Cs and Ct in one prompt.
|
| 65 |
+
|
| 66 |
+
---
|
| 67 |
+
|
| 68 |
+
## 4. Core architectural decisions
|
| 69 |
+
|
| 70 |
+
### 4.1 Catalog as primary context, not retrieval
|
| 71 |
+
|
| 72 |
+
For most users (β€50 tables), the entire catalog fits in ~3-5k tokens and is passed verbatim to the planner. No vector search, no BM25, no chunk retrieval. The LLM reads the whole catalog and picks the right table.
|
| 73 |
+
|
| 74 |
+
When a user has hundreds of tables, **catalog-level retrieval** (BM25 + table-level vectors with RRF) can be added as a slicer between `CatalogReader` and `Planner`. Deferred until measurably needed.
|
| 75 |
+
|
| 76 |
+
### 4.2 JSON IR over raw SQL
|
| 77 |
+
|
| 78 |
+
The planner LLM emits a structured JSON IR describing query intent β not a SQL string. A deterministic compiler turns the IR into SQL (per dialect) or pandas/polars operations.
|
| 79 |
+
|
| 80 |
+
Benefits:
|
| 81 |
+
|
| 82 |
+
- Validatable with Pydantic before execution
|
| 83 |
+
- Compiler whitelists allowed operations (no DROP, DELETE, etc.)
|
| 84 |
+
- Portable: same IR β SQL (any dialect) / pandas / polars
|
| 85 |
+
- Cheaper tokens, easier to debug, trivially testable without an LLM
|
| 86 |
+
- LLM cannot emit valid-but-wrong SQL syntax
|
| 87 |
+
|
| 88 |
+
### 4.3 Deterministic compiler, not LLM SQL writer
|
| 89 |
+
|
| 90 |
+
The LLM produces *intent* (the IR). All actual query construction is deterministic Python. Compiler bugs are reproducible and fixable. Same IR always produces the same query.
|
| 91 |
+
|
| 92 |
+
### 4.4 Pipeline stage isolation
|
| 93 |
+
|
| 94 |
+
Each stage is its own module with typed input and typed output. No god classes. Stages: `IntentRouter`, `CatalogReader`, `QueryPlanner`, `IRValidator`, `QueryCompiler`, `QueryExecutor`, `ChatbotAgent`. Each is testable in isolation.
|
| 95 |
+
|
| 96 |
+
### 4.5 Minimal LLM surface
|
| 97 |
+
|
| 98 |
+
LLM calls happen in exactly four places:
|
| 99 |
+
|
| 100 |
+
1. **`IntentRouter`** β once per user message
|
| 101 |
+
2. **`CatalogEnricher`** β once per source, at ingestion (not query time)
|
| 102 |
+
3. **`QueryPlanner`** β once per structured query (produces the IR)
|
| 103 |
+
4. **`ChatbotAgent`** β once per answer (formats the response)
|
| 104 |
+
|
| 105 |
+
Compiler and executors are pure code. No LLM in the hot path of query construction.
|
| 106 |
+
|
| 107 |
+
---
|
| 108 |
+
|
| 109 |
+
## 5. End-to-end flow
|
| 110 |
+
|
| 111 |
+
### Ingestion (when user uploads a file or connects a DB)
|
| 112 |
+
|
| 113 |
+
```
|
| 114 |
+
source upload / DB connect
|
| 115 |
+
β
|
| 116 |
+
introspect schema (DB: information_schema; tabular: file headers + sample rows)
|
| 117 |
+
β
|
| 118 |
+
CatalogEnricher (1 LLM call per source β generates AI descriptions)
|
| 119 |
+
β
|
| 120 |
+
validate (Pydantic)
|
| 121 |
+
β
|
| 122 |
+
write to catalog store (Postgres jsonb, keyed by user_id)
|
| 123 |
+
```
|
| 124 |
+
|
| 125 |
+
For unstructured files: chunk + embed β PGVector.
|
| 126 |
+
|
| 127 |
+
### Query (per user message)
|
| 128 |
+
|
| 129 |
+
```
|
| 130 |
+
User message
|
| 131 |
+
β
|
| 132 |
+
Chat cache check (Redis, 24h TTL)
|
| 133 |
+
β miss
|
| 134 |
+
Load chat history
|
| 135 |
+
β
|
| 136 |
+
IntentRouter LLM β needs_search? source_hint?
|
| 137 |
+
β
|
| 138 |
+
βββ chat β ChatbotAgent β SSE stream
|
| 139 |
+
βββ unstructured β DocumentRetriever β answerer
|
| 140 |
+
βββ structured β
|
| 141 |
+
CatalogReader (load full Cs βͺ Ct for user)
|
| 142 |
+
β
|
| 143 |
+
QueryPlanner LLM β JSON IR
|
| 144 |
+
β
|
| 145 |
+
IRValidator (Pydantic + columns-exist + ops whitelist)
|
| 146 |
+
β
|
| 147 |
+
QueryCompiler β SQL (schema source) or pandas (tabular source)
|
| 148 |
+
β
|
| 149 |
+
QueryExecutor (DbExecutor or TabularExecutor)
|
| 150 |
+
β
|
| 151 |
+
QueryResult
|
| 152 |
+
β
|
| 153 |
+
ChatbotAgent β SSE stream
|
| 154 |
+
```
|
| 155 |
+
|
| 156 |
+
---
|
| 157 |
+
|
| 158 |
+
## 6. Data catalog
|
| 159 |
+
|
| 160 |
+
### Storage
|
| 161 |
+
|
| 162 |
+
Per-user JSON document, stored as a `jsonb` row in Postgres keyed by `user_id`.
|
| 163 |
+
|
| 164 |
+
### Schema (initial scope)
|
| 165 |
+
|
| 166 |
+
```
|
| 167 |
+
Catalog
|
| 168 |
+
βββ user_id, schema_version, generated_at
|
| 169 |
+
βββ sources[]
|
| 170 |
+
βββ Source
|
| 171 |
+
βββ source_id, source_type, name, description, location_ref, updated_at
|
| 172 |
+
βββ tables[]
|
| 173 |
+
βββ Table
|
| 174 |
+
βββ table_id, name, description, row_count
|
| 175 |
+
βββ columns[]
|
| 176 |
+
βββ Column
|
| 177 |
+
βββ column_id, name, data_type, description
|
| 178 |
+
βββ nullable
|
| 179 |
+
βββ pii_flag
|
| 180 |
+
βββ sample_values[]
|
| 181 |
+
βββ stats: { min, max, distinct_count } | null
|
| 182 |
+
```
|
| 183 |
+
|
| 184 |
+
### Best-practice fields deferred
|
| 185 |
+
|
| 186 |
+
`description_human`, `synonyms[]`, `tags[]`, `primary_key`, `foreign_keys`, `unit`, `semantic_type`, `example_questions[]`, `schema_hash`, `enrichment_status`. Add when justified by user need.
|
| 187 |
+
|
| 188 |
+
### Stable IDs
|
| 189 |
+
|
| 190 |
+
`source_id`, `table_id`, `column_id` are stable internal references. `name` fields can change (e.g. column rename in source DB) without invalidating cached IRs.
|
| 191 |
+
|
| 192 |
+
### PII handling
|
| 193 |
+
|
| 194 |
+
Columns with `pii_flag: true` have `sample_values: null` β real values never enter LLM prompts. Auto-detected at ingestion via name patterns + value regex.
|
| 195 |
+
|
| 196 |
+
---
|
| 197 |
+
|
| 198 |
+
## 7. JSON IR
|
| 199 |
+
|
| 200 |
+
### Schema (initial scope)
|
| 201 |
+
|
| 202 |
+
```
|
| 203 |
+
QueryIR
|
| 204 |
+
βββ ir_version : "1.0"
|
| 205 |
+
βββ source_id : str (references catalog)
|
| 206 |
+
βββ table_id : str (references catalog)
|
| 207 |
+
βββ select[] : SelectItem
|
| 208 |
+
β βββ { kind: "column", column_id, alias? }
|
| 209 |
+
β βββ { kind: "agg", fn, column_id?, alias? }
|
| 210 |
+
βββ filters[] : { column_id, op, value, value_type }
|
| 211 |
+
βββ group_by[] : column_id
|
| 212 |
+
βββ order_by[] : { column_id | alias, dir }
|
| 213 |
+
βββ limit : int | null
|
| 214 |
+
```
|
| 215 |
+
|
| 216 |
+
### Whitelisted operators
|
| 217 |
+
|
| 218 |
+
```
|
| 219 |
+
Filter ops: = != < <= > >= in not_in is_null is_not_null like between
|
| 220 |
+
Agg fns: count count_distinct sum avg min max
|
| 221 |
+
```
|
| 222 |
+
|
| 223 |
+
### Validation rules (enforced before execution)
|
| 224 |
+
|
| 225 |
+
- `source_id` exists in catalog for this user
|
| 226 |
+
- `table_id` belongs to that source
|
| 227 |
+
- Every `column_id` exists in that table
|
| 228 |
+
- Every `agg.fn` and `filter.op` is whitelisted
|
| 229 |
+
- `value_type` consistent with column's `data_type`
|
| 230 |
+
- `limit` positive int, β€ hard cap (e.g. 10000)
|
| 231 |
+
|
| 232 |
+
If any rule fails β reject IR β re-prompt planner with error context (max 3 retries).
|
| 233 |
+
|
| 234 |
+
### Deferred features
|
| 235 |
+
|
| 236 |
+
`having`, `offset`, boolean tree filters (OR/NOT), `distinct`, joins, window functions. Add as user demand proves the limitation.
|
| 237 |
+
|
| 238 |
+
---
|
| 239 |
+
|
| 240 |
+
## 8. Executors
|
| 241 |
+
|
| 242 |
+
Same input (validated IR), same output (`QueryResult`), different backends.
|
| 243 |
+
|
| 244 |
+
### DbExecutor (schema sources)
|
| 245 |
+
|
| 246 |
+
```
|
| 247 |
+
IR β SqlCompiler β SQL string + params
|
| 248 |
+
β
|
| 249 |
+
sqlglot validation (SELECT-only, whitelist tables/columns, LIMIT enforced)
|
| 250 |
+
β
|
| 251 |
+
asyncpg / pymysql in read-only transaction with timeout (30s)
|
| 252 |
+
β
|
| 253 |
+
QueryResult
|
| 254 |
+
```
|
| 255 |
+
|
| 256 |
+
Identifiers come from catalog (verified at validation time, safe to inline as quoted identifiers). Values are always parameterized β never inlined as strings.
|
| 257 |
+
|
| 258 |
+
### TabularExecutor (tabular sources)
|
| 259 |
+
|
| 260 |
+
```
|
| 261 |
+
IR β PandasCompiler β operation chain
|
| 262 |
+
β
|
| 263 |
+
choose strategy by file size:
|
| 264 |
+
β€ 100 MB β eager pandas
|
| 265 |
+
100 MB-1 GB β pyarrow with predicate pushdown
|
| 266 |
+
> 1 GB β polars lazy scan
|
| 267 |
+
β
|
| 268 |
+
execute in asyncio.to_thread (CPU work off the event loop)
|
| 269 |
+
β
|
| 270 |
+
QueryResult
|
| 271 |
+
```
|
| 272 |
+
|
| 273 |
+
Initially eager pandas is sufficient. Add the others when a real file is too big.
|
| 274 |
+
|
| 275 |
+
### Shared safety guarantees
|
| 276 |
+
|
| 277 |
+
1. IR validated before reaching compiler
|
| 278 |
+
2. Compiler is deterministic (no LLM)
|
| 279 |
+
3. Identifiers from catalog (trusted)
|
| 280 |
+
4. Values parameterized
|
| 281 |
+
5. sqlglot second-line defence for SQL
|
| 282 |
+
6. Read-only at every layer
|
| 283 |
+
7. Timeouts and row caps
|
| 284 |
+
|
| 285 |
+
---
|
| 286 |
+
|
| 287 |
+
## 9. Implementation scope
|
| 288 |
+
|
| 289 |
+
### Initial PR β what ships first
|
| 290 |
+
|
| 291 |
+
| Item | Folder |
|
| 292 |
+
|---|---|
|
| 293 |
+
| Data catalog Pydantic models | `src/catalog/models.py` |
|
| 294 |
+
| Catalog ingestion (introspect β enrich β validate β store) | `src/catalog/`, `src/pipeline/` |
|
| 295 |
+
| `IntentRouter` with 3-way source_hint | `src/agents/` |
|
| 296 |
+
| `CatalogReader` (loads full catalog) | `src/catalog/reader.py` |
|
| 297 |
+
| `QueryPlanner` LLM call | `src/query/planner/` |
|
| 298 |
+
| JSON IR Pydantic models | `src/query/ir/models.py` |
|
| 299 |
+
| IR validator | `src/query/ir/validator.py` |
|
| 300 |
+
|
| 301 |
+
**Output**: a validated JSON IR object. Execution lands in a follow-up PR.
|
| 302 |
+
|
| 303 |
+
### Follow-up PRs
|
| 304 |
+
|
| 305 |
+
| PR | Scope |
|
| 306 |
+
|---|---|
|
| 307 |
+
| 2 | `QueryCompiler` (IR β SQL / pandas) |
|
| 308 |
+
| 3 | `QueryExecutor` split: `DbExecutor` + `TabularExecutor` |
|
| 309 |
+
| 4 | Retry / self-correction loop on execution failure |
|
| 310 |
+
| 5 | Eval harness (golden questionβIRβresult examples) |
|
| 311 |
+
| 6 | Auto PII tagging in catalog |
|
| 312 |
+
| Later | Joins in IR, schema drift detection, hybrid catalog search |
|
| 313 |
+
|
| 314 |
+
---
|
| 315 |
+
|
| 316 |
+
## 10. Open questions
|
| 317 |
+
|
| 318 |
+
| # | Question | Why it matters |
|
| 319 |
+
|---|---|---|
|
| 320 |
+
| 1 | Catalog storage: JSON file per user vs Postgres `jsonb` row? | Affects ingestion + read performance |
|
| 321 |
+
| 2 | Should the catalog also list unstructured files (with descriptions only)? | Gives router unified view of all user sources |
|
| 322 |
+
| 3 | Catalog refresh trigger: explicit "rebuild" button, on every upload, or background TTL? | Staleness vs latency tradeoff |
|
| 323 |
+
| 4 | Confirm joins are out of initial IR scope? | Limits what user questions can be answered |
|
| 324 |
+
| 5 | PII handling for sample_values: mask, synthesize, or skip? | Affects what gets sent to LLM prompts |
|
| 325 |
+
|
| 326 |
+
---
|
| 327 |
+
|
| 328 |
+
## 11. References
|
| 329 |
+
|
| 330 |
+
- `docs/flowchart.html` β interactive end-to-end diagram (open in browser)
|
| 331 |
+
- `docs/flowchart.mmd` β mermaid source for the diagram
|
| 332 |
+
|
| 333 |
+
---
|
| 334 |
+
|
| 335 |
+
## Glossary
|
| 336 |
+
|
| 337 |
+
- **Cu** β unstructured context (prose chunks)
|
| 338 |
+
- **Cs** β schema context (DB tables/columns from catalog)
|
| 339 |
+
- **Ct** β tabular context (file sheets/columns from catalog)
|
| 340 |
+
- **IR** β intermediate representation (the JSON query shape)
|
| 341 |
+
- **PR** β pull request (a unit of code change)
|
| 342 |
+
- **PII** β personally identifiable information (names, emails, etc.)
|
| 343 |
+
- **ABC** β abstract base class (Python contract for subclasses)
|
src/catalog/README.md
ADDED
|
@@ -0,0 +1,6 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# catalog
|
| 2 |
+
|
| 3 |
+
Per-user data catalog: identity layer for structured sources (DB schemas + tabular files).
|
| 4 |
+
Holds AI-enriched table/column descriptions, consumed by `query/planner` to generate JSON IR.
|
| 5 |
+
|
| 6 |
+
See `ARCHITECTURE.md` (root) for the full design.
|
src/catalog/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
"""Catalog domain β per-user data catalog (Cs + Ct)."""
|
src/catalog/introspect/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
"""Source-specific schema introspection (databases, tabular files)."""
|
src/query/README.md
ADDED
|
@@ -0,0 +1,11 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# query
|
| 2 |
+
|
| 3 |
+
Catalog-driven query subsystem. User question β IR β SQL/pandas β result.
|
| 4 |
+
|
| 5 |
+
Subpackages:
|
| 6 |
+
- `ir/` β JSON IR Pydantic models + validator
|
| 7 |
+
- `planner/` β LLM step: question + catalog β IR
|
| 8 |
+
- `compiler/` β deterministic IR β SQL or pandas op chain (no LLM)
|
| 9 |
+
- `executor/` β runs the compiled query against DB or Parquet
|
| 10 |
+
|
| 11 |
+
See `ARCHITECTURE.md` (root) for the full design.
|
src/query/compiler/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
"""Deterministic IR β SQL / pandas compilers (no LLM)."""
|
src/query/executor/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
"""Query executors β run compiled queries against user DBs or tabular files."""
|
src/query/ir/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
"""JSON IR (intermediate representation) for catalog-driven queries."""
|
src/query/planner/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
"""LLM-based query planner β turns user questions + catalog into JSON IR."""
|
src/retrieval/README.md
ADDED
|
@@ -0,0 +1,8 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# retrieval
|
| 2 |
+
|
| 3 |
+
Unstructured-source retrieval (PDF, DOCX, TXT) β Cu in the architecture.
|
| 4 |
+
Dense similarity over prose chunks via PGVector.
|
| 5 |
+
|
| 6 |
+
Structured (DB / tabular) sources do **not** pass through here β they go through `catalog/` + `query/`.
|
| 7 |
+
|
| 8 |
+
See `ARCHITECTURE.md` (root) for the full design.
|
src/retrieval/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
"""Retrieval for unstructured sources (Cu) β prose chunks via dense similarity."""
|
src/security/README.md
ADDED
|
@@ -0,0 +1,8 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# security
|
| 2 |
+
|
| 3 |
+
Cross-cutting security primitives:
|
| 4 |
+
- credential encryption (Fernet) for stored DB credentials
|
| 5 |
+
- authentication / password / JWT helpers
|
| 6 |
+
- PII detection patterns used by the catalog enricher
|
| 7 |
+
|
| 8 |
+
Consolidates utilities previously split between `utils/` and `users/`.
|
src/security/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
"""Security primitives β credentials, auth, PII patterns."""
|