Agentic-Service-Data-Eyond-Catalog

sofhiaazzhr committed 22 days ago

Commit

8557a20

1 Parent(s): c93ec90

initial commit

Files changed (13)

ARCHITECTURE.md +343 -0
src/catalog/README.md +6 -0
src/catalog/__init__.py +1 -0
src/catalog/introspect/__init__.py +1 -0
src/query/README.md +11 -0
src/query/compiler/__init__.py +1 -0
src/query/executor/__init__.py +1 -0
src/query/ir/__init__.py +1 -0
src/query/planner/__init__.py +1 -0
src/retrieval/README.md +8 -0
src/retrieval/__init__.py +1 -0
src/security/README.md +8 -0
src/security/__init__.py +1 -0

ARCHITECTURE.md ADDED

	@@ -0,0 +1,343 @@

+# Architecture — Data Eyond Agentic Service
+**Last updated**: 2026-05-07
+**Status**: Design phase — folder skeleton in place, implementation in progress
+---
+## TL;DR
+A catalog-driven AI service for data analysis. Users upload documents and register databases or tabular files; they ask natural-language questions and get answers grounded in their data.
+The architecture has two paths:
+- **Unstructured** (PDF, DOCX, TXT) — dense similarity over prose chunks (the right primitive for free-form text).
+- **Structured** (databases, XLSX, CSV, Parquet) — a per-user **data catalog** describes what tables/columns exist; an LLM produces a structured **JSON intermediate representation (IR)** of the user's intent; a deterministic compiler turns the IR into SQL or pandas operations.
+The LLM produces *intent*, not query syntax. Deterministic code does the rest.
+---
+## 1. Why catalog-driven design
+For a database or spreadsheet, a user's question maps to *known tables and columns*, not to *similar text fragments*. Treating structured data with the same retrieval primitive as prose (chunk + embed + rank top-K) forces the right column to survive a probabilistic ranking lottery before the planner ever sees it. Catalog-based **lookup** is the right primitive instead.
+A central per-user catalog also means:
+- One place to keep table/column descriptions (AI-generated, refreshed when the source changes).
+- The query planner sees the user's full data landscape in a single prompt.
+- Schema stays stable across user sessions without hitting the source DB on every query.
+- New sources auto-update the catalog without re-embedding chunks.
+---
+## 2. Source taxonomy
+```
+Sources
+├── Unstructured (pdf, docx, txt)        →  Cu  (prose chunks via DocumentRetriever)
+└── Structured
+    ├── Schema (DB)                       →  Cs  (DB tables + columns)
+    └── Tabular (xlsx, csv, parquet)      →  Ct  (sheets + columns)
+                                           Cs ∪ Ct = Data Catalog Context
+```
+- **Cu** = unstructured prose context. Retrieval primitive: dense similarity over chunks.
+- **Cs** = DB schema context (tables, columns, descriptions, sample values).
+- **Ct** = tabular file context (sheets, columns, descriptions, sample values).
+- **Data Catalog Context** = `Cs ∪ Ct`. Passed to the query planner as a single unified view.
+DB vs tabular is **not** a routing concern — it's a per-source attribute (`source_type`) on each catalog entry. The split only matters at execution time (SQL vs pandas).
+---
+## 3. Routing model
+```
+source_hint ∈ { chat, unstructured, structured }
+```
+- `chat` — no search, conversational reply only
+- `unstructured` — DocumentRetriever path (Cu)
+- `structured` — catalog-driven path (Cs ∪ Ct → planner → compiler → executor)
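+The router commits to one path. Cross-source questions ("compare DB sales vs uploaded customer file") are routed to the structured path because the planner sees both Cs and Ct in one prompt. The initial IR (section 7) targets one `source_id` and `table_id` per query and defers joins, so a single IR cannot yet combine two sources (see open question 4).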
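The three-way routing above can be sketched as a small dispatch table. This is only an illustration: `SourceHint` and `route` are hypothetical names, and the stage names are taken from sections 4.4 and 5 rather than from any existing code.

```python
from enum import Enum


class SourceHint(str, Enum):
    """The three routing outcomes named in section 3."""
    CHAT = "chat"
    UNSTRUCTURED = "unstructured"
    STRUCTURED = "structured"


def route(source_hint: str) -> list[str]:
    """Return the ordered stage names for exactly one path (hypothetical helper)."""
    hint = SourceHint(source_hint)  # raises ValueError on an unknown hint
    if hint is SourceHint.CHAT:
        return ["ChatbotAgent"]
    if hint is SourceHint.UNSTRUCTURED:
        return ["DocumentRetriever", "answerer"]
    # Structured path: catalog-driven pipeline from section 5
    return [
        "CatalogReader",
        "QueryPlanner",
        "IRValidator",
        "QueryCompiler",
        "QueryExecutor",
        "ChatbotAgent",
    ]
```

Because the hint is parsed into an enum first, an unexpected router output fails loudly instead of silently falling through to a default path.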
+---
+## 4. Core architectural decisions
+### 4.1 Catalog as primary context, not retrieval
+For most users (≤50 tables), the entire catalog fits in ~3-5k tokens and is passed verbatim to the planner. No vector search, no BM25, no chunk retrieval. The LLM reads the whole catalog and picks the right table.
+When a user has hundreds of tables, **catalog-level retrieval** (BM25 + table-level vectors with RRF) can be added as a slicer between `CatalogReader` and `Planner`. Deferred until measurably needed.
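If the deferred catalog-level retrieval is ever added, the RRF step could look roughly like the sketch below. It assumes each retriever (BM25, table-level vectors) already returns a ranked list of `table_id` strings; the function name `rrf_fuse` and the constant `k = 60` (a commonly used default for reciprocal rank fusion) are assumptions, not part of the design.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Fuse several ranked lists of table_ids with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, table_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every table it ranks
            scores[table_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by table_id for determinism
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [table_id for table_id, _ in ordered[:top_n]]


bm25_ranking = ["orders", "customers", "products"]
vector_ranking = ["orders", "invoices", "customers"]
fused = rrf_fuse([bm25_ranking, vector_ranking])
```

The fused list would then be used to slice the catalog before it is handed to the planner.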
+### 4.2 JSON IR over raw SQL
+The planner LLM emits a structured JSON IR describing query intent — not a SQL string. A deterministic compiler turns the IR into SQL (per dialect) or pandas/polars operations.
+Benefits:
+- Validatable with Pydantic before execution
+- Compiler whitelists allowed operations (no DROP, DELETE, etc.)
+- Portable: same IR → SQL (any dialect) / pandas / polars
+- Cheaper tokens, easier to debug, trivially testable without an LLM
+- The LLM never writes SQL, so syntax and dialect errors cannot originate from the model
+### 4.3 Deterministic compiler, not LLM SQL writer
+The LLM produces *intent* (the IR). All actual query construction is deterministic Python. Compiler bugs are reproducible and fixable. Same IR always produces the same query.
+### 4.4 Pipeline stage isolation
+Each stage is its own module with typed input and typed output. No god classes. Stages: `IntentRouter`, `CatalogReader`, `QueryPlanner`, `IRValidator`, `QueryCompiler`, `QueryExecutor`, `ChatbotAgent`. Each is testable in isolation.
+### 4.5 Minimal LLM surface
+LLM calls happen in exactly four places:
+1. **`IntentRouter`** — once per user message
+2. **`CatalogEnricher`** — once per source, at ingestion (not query time)
+3. **`QueryPlanner`** — once per structured query (produces the IR)
+4. **`ChatbotAgent`** — once per answer (formats the response)
+Compiler and executors are pure code. No LLM in the hot path of query construction.
+---
+## 5. End-to-end flow
+### Ingestion (when user uploads a file or connects a DB)
+```
+source upload / DB connect
+    ↓
+introspect schema (DB: information_schema; tabular: file headers + sample rows)
+    ↓
+CatalogEnricher  (1 LLM call per source — generates AI descriptions)
+    ↓
+validate (Pydantic)
+    ↓
+write to catalog store (Postgres jsonb, keyed by user_id)
+```
+For unstructured files: chunk + embed → PGVector.
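For tabular sources, the "file headers + sample rows" introspection step in the ingestion flow might be sketched as follows. It uses only the standard library `csv` module and handles CSV text only; `introspect_csv` and its output shape are hypothetical and simply mirror the `name` and `sample_values` fields of the catalog schema in section 6.

```python
import csv
import io


def introspect_csv(text: str, sample_rows: int = 3) -> list[dict]:
    """Read the header row plus a few non-empty sample values per column."""
    reader = csv.DictReader(io.StringIO(text))
    columns: dict[str, list[str]] = {name: [] for name in (reader.fieldnames or [])}
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break  # only the first few rows are sampled
        for name in columns:
            value = row.get(name)
            if value not in (None, ""):
                columns[name].append(value)
    return [{"name": name, "sample_values": values} for name, values in columns.items()]


csv_text = "id,city\n1,Jakarta\n2,\n3,Bandung\n4,Medan\n"
introspected = introspect_csv(csv_text)
```

The resulting list is the raw material that `CatalogEnricher` would describe and that the Pydantic validation step would check before the catalog is written.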
+### Query (per user message)
+```
+User message
+    ↓
+Chat cache check (Redis, 24h TTL)
+    ↓ miss
+Load chat history
+    ↓
+IntentRouter LLM   →  needs_search?  source_hint?
+    ↓
+    ├── chat        → ChatbotAgent → SSE stream
+    ├── unstructured → DocumentRetriever → answerer
+    └── structured  →
+            CatalogReader (load full Cs ∪ Ct for user)
+                ↓
+            QueryPlanner LLM  →  JSON IR
+                ↓
+            IRValidator  (Pydantic + columns-exist + ops whitelist)
+                ↓
+            QueryCompiler  →  SQL (schema source) or pandas (tabular source)
+                ↓
+            QueryExecutor  (DbExecutor or TabularExecutor)
+                ↓
+            QueryResult
+                ↓
+            ChatbotAgent → SSE stream
+```
+---
+## 6. Data catalog
+### Storage
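+Per-user JSON document, stored as a `jsonb` row in Postgres keyed by `user_id`. This is the working assumption; the alternative of a JSON file per user is still listed as open question 1 (section 10).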
+### Schema (initial scope)
+```
+Catalog
+├── user_id, schema_version, generated_at
+└── sources[]
+    └── Source
+        ├── source_id, source_type, name, description, location_ref, updated_at
+        └── tables[]
+            └── Table
+                ├── table_id, name, description, row_count
+                └── columns[]
+                    └── Column
+                        ├── column_id, name, data_type, description
+                        ├── nullable
+                        ├── pii_flag
+                        ├── sample_values[]
+                        └── stats: { min, max, distinct_count } | null
+```
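The tree above maps naturally onto nested models. The design calls for Pydantic; the sketch below uses standard-library dataclasses purely so it runs without dependencies, and the field types and defaults (for example `source_type` being `"schema"` or `"tabular"`, timestamps as ISO strings) are assumptions.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class ColumnStats:
    min: Any
    max: Any
    distinct_count: int


@dataclass
class Column:
    column_id: str
    name: str
    data_type: str
    description: str = ""
    nullable: bool = True
    pii_flag: bool = False
    sample_values: Optional[list[Any]] = field(default_factory=list)
    stats: Optional[ColumnStats] = None


@dataclass
class Table:
    table_id: str
    name: str
    description: str = ""
    row_count: Optional[int] = None
    columns: list[Column] = field(default_factory=list)


@dataclass
class Source:
    source_id: str
    source_type: str  # assumed values: "schema" or "tabular"
    name: str
    description: str = ""
    location_ref: str = ""
    updated_at: str = ""
    tables: list[Table] = field(default_factory=list)


@dataclass
class Catalog:
    user_id: str
    schema_version: str
    generated_at: str
    sources: list[Source] = field(default_factory=list)


amount = Column(column_id="c1", name="amount", data_type="float",
                stats=ColumnStats(min=0.0, max=950.5, distinct_count=120))
sales = Table(table_id="t1", name="sales", row_count=1000, columns=[amount])
catalog = Catalog(
    user_id="u1", schema_version="1", generated_at="2026-05-07T00:00:00Z",
    sources=[Source(source_id="s1", source_type="tabular", name="sales.csv", tables=[sales])],
)
catalog_json_ready = asdict(catalog)  # plain dicts, ready for a jsonb column
```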
+### Best-practice fields deferred
+`description_human`, `synonyms[]`, `tags[]`, `primary_key`, `foreign_keys`, `unit`, `semantic_type`, `example_questions[]`, `schema_hash`, `enrichment_status`. Add when justified by user need.
+### Stable IDs
+`source_id`, `table_id`, `column_id` are stable internal references. `name` fields can change (e.g. column rename in source DB) without invalidating cached IRs.
+### PII handling
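+Columns with `pii_flag: true` have `sample_values: null`, so real values never enter LLM prompts. Flags are meant to be auto-detected at ingestion via name patterns + value regex; auto-tagging itself is scheduled as follow-up PR 6, and whether to mask, synthesize, or skip sample values is open question 5.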
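A minimal sketch of "name patterns + value regex" detection is shown below. The specific patterns are illustrative assumptions (a real list would be broader and tuned against false positives), and `tag_pii` is a hypothetical helper operating on plain dicts.

```python
import re

# Assumed, deliberately small pattern sets for illustration only
PII_NAME_PATTERN = re.compile(r"(email|phone|ssn|passport|birth|address)", re.IGNORECASE)
EMAIL_VALUE_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def tag_pii(column: dict) -> dict:
    """Return a copy of the column with pii_flag set and samples nulled if flagged."""
    name_hit = bool(PII_NAME_PATTERN.search(column["name"]))
    values = column.get("sample_values") or []
    value_hit = any(
        isinstance(v, str) and EMAIL_VALUE_PATTERN.match(v) for v in values
    )
    flagged = name_hit or value_hit
    return {
        **column,
        "pii_flag": flagged,
        # Flagged columns never carry real values into LLM prompts
        "sample_values": None if flagged else values,
    }
```

Running this before `CatalogEnricher` would keep flagged values out of the enrichment prompt as well as the planner prompt.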
+---
+## 7. JSON IR
+### Schema (initial scope)
+```
+QueryIR
+├── ir_version          : "1.0"
+├── source_id           : str   (references catalog)
+├── table_id            : str   (references catalog)
+├── select[]            : SelectItem
+│   ├── { kind: "column", column_id, alias? }
+│   └── { kind: "agg",    fn, column_id?, alias? }
+├── filters[]           : { column_id, op, value, value_type }
+├── group_by[]          : column_id
+├── order_by[]          : { column_id | alias, dir }
+└── limit               : int | null
+```
+### Whitelisted operators
+```
+Filter ops:  = != < <= > >= in not_in is_null is_not_null like between
+Agg fns:     count count_distinct sum avg min max
+```
+### Validation rules (enforced before execution)
+- `source_id` exists in catalog for this user
+- `table_id` belongs to that source
+- Every `column_id` exists in that table
+- Every `agg.fn` and `filter.op` is whitelisted
+- `value_type` consistent with column's `data_type`
+- `limit` positive int, ≤ hard cap (e.g. 10000)
+If any rule fails → reject IR → re-prompt planner with error context (max 3 retries).
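Most of the validation rules above can be sketched as a pure function over plain dicts. This is a simplified illustration, not the planned `IRValidator`: it skips the `value_type` consistency check and `order_by`, and the name `validate_ir`, the dict shapes, and the cap value are assumptions.

```python
FILTER_OPS = frozenset({"=", "!=", "<", "<=", ">", ">=", "in", "not_in",
                        "is_null", "is_not_null", "like", "between"})
AGG_FNS = frozenset({"count", "count_distinct", "sum", "avg", "min", "max"})
LIMIT_CAP = 10_000  # assumed hard cap, matching the example in the rules


def validate_ir(ir: dict, catalog: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the IR passed."""
    source = next((s for s in catalog["sources"]
                   if s["source_id"] == ir["source_id"]), None)
    if source is None:
        return [f"unknown source_id {ir['source_id']!r}"]
    table = next((t for t in source["tables"]
                  if t["table_id"] == ir["table_id"]), None)
    if table is None:
        return [f"table_id {ir['table_id']!r} not in source"]

    known = {c["column_id"] for c in table["columns"]}
    errors: list[str] = []
    for item in ir.get("select", []):
        if item["kind"] == "agg" and item["fn"] not in AGG_FNS:
            errors.append(f"agg fn {item['fn']!r} not whitelisted")
        column_id = item.get("column_id")
        if column_id is not None and column_id not in known:
            errors.append(f"unknown column_id {column_id!r} in select")
    for flt in ir.get("filters", []):
        if flt["op"] not in FILTER_OPS:
            errors.append(f"filter op {flt['op']!r} not whitelisted")
        if flt["column_id"] not in known:
            errors.append(f"unknown column_id {flt['column_id']!r} in filters")
    for column_id in ir.get("group_by", []):
        if column_id not in known:
            errors.append(f"unknown column_id {column_id!r} in group_by")

    limit = ir.get("limit")
    if limit is not None:
        is_int = isinstance(limit, int) and not isinstance(limit, bool)
        if not (is_int and 0 < limit <= LIMIT_CAP):
            errors.append(f"limit must be a positive int <= {LIMIT_CAP}")
    return errors
```

Returning messages rather than raising makes it straightforward to feed the full error list back to the planner as the "error context" for a retry.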
+### Deferred features
+`having`, `offset`, boolean tree filters (OR/NOT), `distinct`, joins, window functions. Add as user demand proves the limitation.
+---
+## 8. Executors
+Same input (validated IR), same output (`QueryResult`), different backends.
+### DbExecutor (schema sources)
+```
+IR → SqlCompiler → SQL string + params
+     ↓
+sqlglot validation (SELECT-only, whitelist tables/columns, LIMIT enforced)
+     ↓
+asyncpg / pymysql in read-only transaction with timeout (30s)
+     ↓
+QueryResult
+```
+Identifiers come from catalog (verified at validation time, safe to inline as quoted identifiers). Values are always parameterized — never inlined as strings.
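The split between inlined identifiers and parameterized values can be illustrated with a small compiler sketch. It covers only simple comparison filters, `GROUP BY`, and `LIMIT`, emits PostgreSQL-style `$n` placeholders (the positional style asyncpg uses), and assumes the caller has already resolved `table_id` and `column_id` values to names via the catalog. `compile_select` and `quote_ident` are hypothetical names.

```python
COMPARISON_OPS = {"=", "!=", "<", "<=", ">", ">="}
AGG_SQL = {
    "count": "COUNT({})", "count_distinct": "COUNT(DISTINCT {})",
    "sum": "SUM({})", "avg": "AVG({})", "min": "MIN({})", "max": "MAX({})",
}


def quote_ident(name: str) -> str:
    """Quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def compile_select(ir: dict, table_name: str,
                   column_names: dict[str, str]) -> tuple[str, list]:
    """Compile a validated IR dict into (sql, params). Sketch only."""
    params: list = []
    select_parts = []
    for item in ir["select"]:
        column_id = item.get("column_id")
        target = quote_ident(column_names[column_id]) if column_id else "*"
        expr = target if item["kind"] == "column" else AGG_SQL[item["fn"]].format(target)
        if item.get("alias"):
            expr += " AS " + quote_ident(item["alias"])
        select_parts.append(expr)
    sql = f"SELECT {', '.join(select_parts)} FROM {quote_ident(table_name)}"

    where = []
    for flt in ir.get("filters", []):
        if flt["op"] not in COMPARISON_OPS:
            raise ValueError(f"op {flt['op']!r} is not covered by this sketch")
        params.append(flt["value"])  # the value travels as a parameter, never inlined
        where.append(f"{quote_ident(column_names[flt['column_id']])} {flt['op']} ${len(params)}")
    if where:
        sql += " WHERE " + " AND ".join(where)
    if ir.get("group_by"):
        sql += " GROUP BY " + ", ".join(quote_ident(column_names[c]) for c in ir["group_by"])
    if ir.get("limit") is not None:
        params.append(ir["limit"])
        sql += f" LIMIT ${len(params)}"
    return sql, params
```

The same IR always yields the same string and parameter list, which is what makes the compiler testable without an LLM or a database.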
+### TabularExecutor (tabular sources)
+```
+IR → PandasCompiler → operation chain
+     ↓
+choose strategy by file size:
+  ≤ 100 MB    → eager pandas
+  100 MB-1 GB → pyarrow with predicate pushdown
+  > 1 GB      → polars lazy scan
+     ↓
+execute in asyncio.to_thread (CPU work off the event loop)
+     ↓
+QueryResult
+```
+Initially eager pandas is sufficient. Add the others when a real file is too big.
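The size-based strategy choice and the off-loop execution can be sketched with the standard library alone. The thresholds come from the diagram above; the strategy labels and the names `choose_strategy` and `run_off_loop` are hypothetical, and `asyncio.to_thread` requires Python 3.9 or newer.

```python
import asyncio

MB = 1024 * 1024
GB = 1024 * MB


def choose_strategy(file_size_bytes: int) -> str:
    """Pick an execution backend label from the file size."""
    if file_size_bytes <= 100 * MB:
        return "pandas_eager"
    if file_size_bytes <= GB:
        return "pyarrow_pushdown"
    return "polars_lazy"


async def run_off_loop(fn, *args):
    """Run blocking work in a worker thread so the event loop stays responsive."""
    return await asyncio.to_thread(fn, *args)


# Stand-in for a compiled pandas operation chain: any blocking callable works
total = asyncio.run(run_off_loop(sum, [1, 2, 3]))
```

Keeping the strategy choice in one pure function means the pyarrow and polars branches can be added later without touching the executor's call sites.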
+### Shared safety guarantees
+1. IR validated before reaching compiler
+2. Compiler is deterministic (no LLM)
+3. Identifiers from catalog (trusted)
+4. Values parameterized
+5. sqlglot second-line defence for SQL
+6. Read-only at every layer
+7. Timeouts and row caps
+---
+## 9. Implementation scope
+### Initial PR — what ships first
+| Item | Folder |
+|---|---|
+| Data catalog Pydantic models | `src/catalog/models.py` |
+| Catalog ingestion (introspect → enrich → validate → store) | `src/catalog/`, `src/pipeline/` |
+| `IntentRouter` with 3-way source_hint | `src/agents/` |
+| `CatalogReader` (loads full catalog) | `src/catalog/reader.py` |
+| `QueryPlanner` LLM call | `src/query/planner/` |
+| JSON IR Pydantic models | `src/query/ir/models.py` |
+| IR validator | `src/query/ir/validator.py` |
+**Output**: a validated JSON IR object. Execution lands in a follow-up PR.
+### Follow-up PRs
+| PR | Scope |
+|---|---|
+| 2 | `QueryCompiler` (IR → SQL / pandas) |
+| 3 | `QueryExecutor` split: `DbExecutor` + `TabularExecutor` |
+| 4 | Retry / self-correction loop on execution failure |
+| 5 | Eval harness (golden question→IR→result examples) |
+| 6 | Auto PII tagging in catalog |
+| Later | Joins in IR, schema drift detection, hybrid catalog search |
+---
+## 10. Open questions
+| # | Question | Why it matters |
+|---|---|---|
+| 1 | Catalog storage: JSON file per user vs Postgres `jsonb` row? | Affects ingestion + read performance |
+| 2 | Should the catalog also list unstructured files (with descriptions only)? | Gives router unified view of all user sources |
+| 3 | Catalog refresh trigger: explicit "rebuild" button, on every upload, or background TTL? | Staleness vs latency tradeoff |
+| 4 | Confirm joins are out of initial IR scope? | Limits what user questions can be answered |
+| 5 | PII handling for sample_values: mask, synthesize, or skip? | Affects what gets sent to LLM prompts |
+---
+## 11. References
+- `docs/flowchart.html` — interactive end-to-end diagram (open in browser)
+- `docs/flowchart.mmd` — mermaid source for the diagram
+---
+## Glossary
+- **Cu** — unstructured context (prose chunks)
+- **Cs** — schema context (DB tables/columns from catalog)
+- **Ct** — tabular context (file sheets/columns from catalog)
+- **IR** — intermediate representation (the JSON query shape)
+- **PR** — pull request (a unit of code change)
+- **PII** — personally identifiable information (names, emails, etc.)
+- **ABC** — abstract base class (Python contract for subclasses)

src/catalog/README.md ADDED

	@@ -0,0 +1,6 @@

+# catalog
+Per-user data catalog: identity layer for structured sources (DB schemas + tabular files).
+Holds AI-enriched table/column descriptions, consumed by `query/planner` to generate JSON IR.
+See `ARCHITECTURE.md` (root) for the full design.

src/catalog/__init__.py ADDED

	@@ -0,0 +1 @@

+"""Catalog domain — per-user data catalog (Cs + Ct)."""

src/catalog/introspect/__init__.py ADDED

	@@ -0,0 +1 @@

+"""Source-specific schema introspection (databases, tabular files)."""

src/query/README.md ADDED

	@@ -0,0 +1,11 @@

+# query
+Catalog-driven query subsystem. User question → IR → SQL/pandas → result.
+Subpackages:
+- `ir/` — JSON IR Pydantic models + validator
+- `planner/` — LLM step: question + catalog → IR
+- `compiler/` — deterministic IR → SQL or pandas op chain (no LLM)
+- `executor/` — runs the compiled query against the DB or tabular file (XLSX, CSV, Parquet)
+See `ARCHITECTURE.md` (root) for the full design.

src/query/compiler/__init__.py ADDED

	@@ -0,0 +1 @@

+"""Deterministic IR → SQL / pandas compilers (no LLM)."""

src/query/executor/__init__.py ADDED

	@@ -0,0 +1 @@

+"""Query executors — run compiled queries against user DBs or tabular files."""

src/query/ir/__init__.py ADDED

	@@ -0,0 +1 @@

+"""JSON IR (intermediate representation) for catalog-driven queries."""

src/query/planner/__init__.py ADDED

	@@ -0,0 +1 @@

+"""LLM-based query planner — turns user questions + catalog into JSON IR."""

src/retrieval/README.md ADDED

	@@ -0,0 +1,8 @@

+# retrieval
+Unstructured-source retrieval (PDF, DOCX, TXT) — Cu in the architecture.
+Dense similarity over prose chunks via PGVector.
+Structured (DB / tabular) sources do **not** pass through here — they go through `catalog/` + `query/`.
+See `ARCHITECTURE.md` (root) for the full design.

src/retrieval/__init__.py ADDED

	@@ -0,0 +1 @@

+"""Retrieval for unstructured sources (Cu) — prose chunks via dense similarity."""

src/security/README.md ADDED

	@@ -0,0 +1,8 @@

+# security
+Cross-cutting security primitives:
+- credential encryption (Fernet) for stored DB credentials
+- authentication / password / JWT helpers
+- PII detection patterns used by the catalog enricher
+Consolidates utilities previously split between `utils/` and `users/`.

src/security/__init__.py ADDED

	@@ -0,0 +1 @@

+"""Security primitives — credentials, auth, PII patterns."""