Agentic-Service-Data-Eyond-Catalog


sofhiaazzhr committed 25 days ago

Commit

c749544

1 Parent(s): 04e5c48

feat(query): PandasCompiler + TabularExecutor (PR3-TAB)

Files changed (3)

PROGRESS.md +21 -7
src/query/compiler/pandas.py +281 -7
src/query/executor/tabular.py +136 -4

PROGRESS.md CHANGED

@@ -2,8 +2,8 @@
 Persistent tracker mirroring the 42-item ownership table in `REPO_CONTEXT.md` "Team — division of work". Update as PRs land. Future Claude Code sessions read this to know what's already done.
-**Last updated**: 2026-05-07 (PR1-tab — tabular introspector + on_tabular_uploaded trigger + 31 tests)
-**Current open PRs**: PR3-DB (DB owner — SqlCompiler + DbExecutor + golden IR→SQL tests) · PR1-tab (TAB owner — tabular introspector)
 ---
@@ -25,7 +25,7 @@ Persistent tracker mirroring the 42-item ownership table in `REPO_CONTEXT.md` "T
 | PR2a | `[x]` merged | DB | CatalogEnricher + StructuredPipeline + on_db_registered trigger + FK extension on Table |
 | PR2b | `[ ]` | B | IntentRouter + planner prompt (pair) + planner LLM service |
 | PR3-DB | `[~]` open | DB | SqlCompiler (Postgres) + DbExecutor (sqlglot guard, RO + statement_timeout, asyncio.to_thread) + 36 golden IR→SQL tests |
-| PR3-TAB | `[ ]` | TAB | Pandas compiler + tabular executor + golden IR→DataFrame tests |
 | PR4 | `[ ]` | B (pair) | ExecutorDispatcher + QueryService + chat stream endpoint integration |
 | PR5 | `[ ]` | B | Retry/self-correction loop on execution failure |
 | PR6 | `[ ]` | B | Eval harness (golden question→IR→result examples) |
@@ -45,7 +45,7 @@ Persistent tracker mirroring the 42-item ownership table in `REPO_CONTEXT.md` "T
 | 3 | IR operator whitelists (`query/ir/operators.py`) | `[x]` | PR1 filled `TYPE_COMPATIBILITY` matrix |
 | 4 | PII patterns / regex (`security/pii_patterns.py`) | `[x]` | Pre-existing |
 | — | `catalogs` Postgres jsonb table (`db/postgres/models.py`) | `[x]` | PR1 added `Catalog` SQLAlchemy class + `init_db.py` import |
-| — | `QueryResult` shape (`query/executor/base.py`) | `[x]` | Pre-existing scaffold; revisit in PR3 if `column_types` needed |
 | — | `Source.location_ref` URI scheme | `[x]` | PR1 documented in `catalog/models.py` docstring |
 ### Ingestion — introspection
@@ -101,8 +101,8 @@ Persistent tracker mirroring the 42-item ownership table in `REPO_CONTEXT.md` "T
 | # | Item | Status | Notes |
 |---|---|---|---|
-| 29 | Pandas compiler (`query/compiler/pandas.py`) | `[ ]` | PR3 — IR → callable on DataFrame |
-| 30 | Tabular executor (`query/executor/tabular.py`) | `[ ]` | PR3 — eager pandas first; pyarrow/polars later if file size demands |
 | 31 | Parquet upload/download wrapper | `[ ]` | Phase 1 has `knowledge/parquet_service.py` — reuse or move to `storage/` in cleanup |
 ### Agents + chat
@@ -126,7 +126,7 @@ Persistent tracker mirroring the 42-item ownership table in `REPO_CONTEXT.md` "T
 | # | Item | Owner | Status | Notes |
 |---|---|---|---|---|
 | 38 | DB compiler golden tests (`tests/query/compiler/test_sql.py`) | DB | `[x]` | PR3-DB — 36 tests across all whitelisted ops, identifier quoting, agg / count_distinct / count(*), order_by alias resolution, parameter sequencing, error paths. Pure-Python, no LLM, no DB. |
-| 39 | Pandas compiler golden tests (`tests/query/compiler/test_pandas.py`) | TAB | `[ ]` | PR3 — pure-Python, no LLM |
 | 40 | IR validator tests (`tests/query/ir/test_validator.py`) | B | `[x]` | PR1 — 19 tests, all rules covered |
 | — | PII detector tests (`tests/catalog/test_pii_detector.py`) | B | `[x]` | PR1 — 26 tests (parametrized) |
 | — | Catalog validator tests (`tests/catalog/test_validator.py`) | B | `[x]` | PR1 — 5 tests |
@@ -153,6 +153,20 @@ Persistent tracker mirroring the 42-item ownership table in `REPO_CONTEXT.md` "T
 ---
 ## What just shipped (PR3-DB — DB owner)
 **Files implemented**:

 Persistent tracker mirroring the 42-item ownership table in `REPO_CONTEXT.md` "Team — division of work". Update as PRs land. Future Claude Code sessions read this to know what's already done.
+**Last updated**: 2026-05-07 (PR3-TAB — PandasCompiler + TabularExecutor + 43+12 tests)
+**Current open PRs**: PR3-DB (DB owner — SqlCompiler + DbExecutor + golden IR→SQL tests) · PR1-tab (TAB owner — tabular introspector) · PR3-TAB (TAB owner — pandas compiler + tabular executor)
 ---
 | PR2a | `[x]` merged | DB | CatalogEnricher + StructuredPipeline + on_db_registered trigger + FK extension on Table |
 | PR2b | `[ ]` | B | IntentRouter + planner prompt (pair) + planner LLM service |
 | PR3-DB | `[~]` open | DB | SqlCompiler (Postgres) + DbExecutor (sqlglot guard, RO + statement_timeout, asyncio.to_thread) + 36 golden IR→SQL tests |
+| PR3-TAB | `[~]` open | TAB | Pandas compiler + tabular executor + golden IR→DataFrame tests |
 | PR4 | `[ ]` | B (pair) | ExecutorDispatcher + QueryService + chat stream endpoint integration |
 | PR5 | `[ ]` | B | Retry/self-correction loop on execution failure |
 | PR6 | `[ ]` | B | Eval harness (golden question→IR→result examples) |
 | 3 | IR operator whitelists (`query/ir/operators.py`) | `[x]` | PR1 filled `TYPE_COMPATIBILITY` matrix |
 | 4 | PII patterns / regex (`security/pii_patterns.py`) | `[x]` | Pre-existing |
 | — | `catalogs` Postgres jsonb table (`db/postgres/models.py`) | `[x]` | PR1 added `Catalog` SQLAlchemy class + `init_db.py` import |
+| — | `QueryResult` shape (`query/executor/base.py`) | `[x]` | Pre-existing scaffold; `columns: list[str]` added (TAB owner, PR1-tab) — DbExecutor updated to populate it. |
 | — | `Source.location_ref` URI scheme | `[x]` | PR1 documented in `catalog/models.py` docstring |
 ### Ingestion — introspection
 | # | Item | Status | Notes |
 |---|---|---|---|
+| 29 | Pandas compiler (`query/compiler/pandas.py`) | `[~]` | PR3-TAB — `CompiledPandas` dataclass; all 12 filter ops; all 6 aggs; group_by via `pd.concat` of Series; alias-aware order_by; `_like_to_regex` (`%`→`.*`, `_`→`.`); pure module-level helpers |
+| 30 | Tabular executor (`query/executor/tabular.py`) | `[~]` | PR3-TAB — `fetch_blob` injectable for tests; blob path: single-table → `{uid}/{did}.parquet`, multi-table → `{uid}/{did}__{table.name}.parquet`; `asyncio.to_thread`; 10k row hard cap; errors → `QueryResult.error` |
 | 31 | Parquet upload/download wrapper | `[ ]` | Phase 1 has `knowledge/parquet_service.py` — reuse or move to `storage/` in cleanup |
 ### Agents + chat
 | # | Item | Owner | Status | Notes |
 |---|---|---|---|---|
 | 38 | DB compiler golden tests (`tests/query/compiler/test_sql.py`) | DB | `[x]` | PR3-DB — 36 tests across all whitelisted ops, identifier quoting, agg / count_distinct / count(*), order_by alias resolution, parameter sequencing, error paths. Pure-Python, no LLM, no DB. |
+| 39 | Pandas compiler golden tests (`tests/unit/query/compiler/test_pandas_compiler.py`) | TAB | `[~]` | PR3-TAB — 43 tests: all 12 filter ops, all 6 aggs, group_by, order_by, limit, aliases, empty DataFrame, error paths. `test_tabular_executor.py` adds 12 more (blob name resolution + happy path + error paths). |
 | 40 | IR validator tests (`tests/query/ir/test_validator.py`) | B | `[x]` | PR1 — 19 tests, all rules covered |
 | — | PII detector tests (`tests/catalog/test_pii_detector.py`) | B | `[x]` | PR1 — 26 tests (parametrized) |
 | — | Catalog validator tests (`tests/catalog/test_validator.py`) | B | `[x]` | PR1 — 5 tests |
 ---
+## What just shipped (PR3-TAB — TAB owner)
+**Files implemented**:
+- `src/query/compiler/pandas.py` — `PandasCompiler` + `CompiledPandas(apply, output_columns)` dataclass. Pure helper functions (easier to test in isolation): `_apply_filters` (all 12 ops, `_like_to_regex` for LIKE), `_apply_select` (column pick + rename), `_apply_agg` (scalar + group_by via `pd.concat` of Series → `reset_index`), `_apply_orderby` (alias-aware via `_resolve_order_col`). Closure captures all IR fields explicitly so `apply(df)` is self-contained.
+- `src/query/executor/tabular.py` — `TabularExecutor` with injectable `fetch_blob` (same testability pattern as `TabularIntrospector`). Resolves Parquet blob path from `az_blob://{uid}/{did}` + table: single-table → `{uid}/{did}.parquet`, multi-table → `{uid}/{did}__{table.name}.parquet`. Runs compile → download → `asyncio.to_thread(_load_and_apply)` → 10k hard cap. Never raises; errors populate `QueryResult.error`. Uses `compiled.output_columns` for column labels (safe on empty DataFrame).
+**Tests added** (55 new; total suite now 86, all passing):
+- `tests/unit/query/compiler/test_pandas_compiler.py` — 43 tests across all 12 filter ops (including `is_null`, `not_in`, `like`, `between`), all 6 agg fns, group_by, order_by asc/desc, limit-after-order, alias round-trip, empty DataFrame, error paths.
+- `tests/unit/query/executor/test_tabular_executor.py` — 12 tests: `_resolve_blob_name` (single/multi-table, bad prefix), happy-path `QueryResult` shape (columns, rows, backend, truncated, source_id), wrong source_type → error, blob fetch failure → error, unknown source → error.
+**Lint**: `ruff check` clean on both files.
+---
 ## What just shipped (PR3-DB — DB owner)
 **Files implemented**:
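
The LIKE translation noted in the PR3-TAB entry above (`%` → `.*`, `_` → `.`, matched with `fullmatch`) can be illustrated standalone. This is a minimal sketch using only the standard library, not the module's code: `like_to_regex` and `like` are hypothetical names, and the per-value `re.fullmatch` call stands in for the vectorised `Series.str.fullmatch` used in the commit.

```python
import re


def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into a regex body (no anchors)."""
    parts: list[str] = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # any run of characters, including none
        elif ch == "_":
            parts.append(".")  # exactly one character
        else:
            parts.append(re.escape(ch))  # literal; regex metacharacters escaped
    return "".join(parts)


def like(value: str, pattern: str) -> bool:
    # fullmatch supplies the whole-string anchoring that LIKE implies.
    return re.fullmatch(like_to_regex(pattern), value) is not None


print(like("report_2024.csv", "report%.csv"))  # True
print(like("abc", "a_c"))                      # True
print(like("abc", "a.c"))                      # False: "." is a literal dot here
print(like("xabc", "a%"))                      # False: no implicit leading wildcard
```

Escaping every non-wildcard character is what keeps a user-supplied `.` or `(` from being read as regex syntax. One limit of this sketch: `.` does not match a newline unless `re.DOTALL` is passed, so multi-line values behave differently from SQL `%`.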

src/query/compiler/pandas.py CHANGED

@@ -1,22 +1,296 @@
 """PandasCompiler — IR → callable that runs against a DataFrame.
 For tabular sources. The callable encapsulates the chain of operations
-(filter → groupby → agg → sort → limit) so the executor can apply them
-to a DataFrame loaded eagerly or via predicate pushdown / polars lazy scan.
 """
 from collections.abc import Callable
-from ...catalog.models import Catalog
-from ..ir.models import QueryIR
 from .base import BaseCompiler
 class PandasCompiler(BaseCompiler):
-    """Deterministic IR → pandas/polars op chain. No LLM."""
     def __init__(self, catalog: Catalog) -> None:
         self._catalog = catalog
-    def compile(self, ir: QueryIR) -> Callable:
-        raise NotImplementedError

 """PandasCompiler — IR → callable that runs against a DataFrame.
 For tabular sources. The callable encapsulates the chain of operations
+(filter → select/agg → sort → limit) so the executor can apply them
+to a DataFrame loaded from a Parquet blob.
+Returns a `CompiledPandas` dataclass (mirrors `CompiledSql`) whose `.apply`
+is a pure function `(pd.DataFrame) -> pd.DataFrame`. No LLM, no I/O.
 """
+from __future__ import annotations
+import re
 from collections.abc import Callable
+from dataclasses import dataclass
+from typing import Any
+import pandas as pd
+from ...catalog.models import Catalog, Column, Source, Table
+from ..ir.models import AggSelect, ColumnSelect, FilterClause, OrderByClause, QueryIR, SelectItem
 from .base import BaseCompiler
+@dataclass
+class CompiledPandas:
+    """Compiled IR as a pandas operation chain.
+    `apply(df)` executes the full filter → select/agg → sort → limit
+    pipeline and returns the result as a new DataFrame.
+    `output_columns` lists the expected column names so callers can label
+    an empty result without inspecting rows.
+    """
+    apply: Callable[[pd.DataFrame], pd.DataFrame]
+    output_columns: list[str]
+class PandasCompilerError(Exception):
+    pass
 class PandasCompiler(BaseCompiler):
+    """Deterministic IR → pandas op chain. No LLM."""
     def __init__(self, catalog: Catalog) -> None:
         self._catalog = catalog
+    def compile(self, ir: QueryIR) -> CompiledPandas:
+        _, table, cols_by_id = self._lookup(ir)
+        output_columns = _output_column_names(ir.select, cols_by_id)
+        # Capture IR fields explicitly so the closure is self-contained
+        _filters = ir.filters
+        _select = ir.select
+        _group_by = ir.group_by
+        _order_by = ir.order_by
+        _limit = ir.limit
+        _cols = cols_by_id
+        def apply(df: pd.DataFrame) -> pd.DataFrame:
+            df = _apply_filters(df, _filters, _cols)
+            has_agg = any(isinstance(s, AggSelect) for s in _select)
+            if has_agg:
+                df = _apply_agg(df, _select, _group_by, _cols)
+            else:
+                df = _apply_select(df, _select, _cols)
+            if _order_by:
+                df = _apply_orderby(df, _order_by, _select, _cols)
+            if _limit is not None:
+                df = df.head(_limit)
+            return df.reset_index(drop=True)
+        return CompiledPandas(apply=apply, output_columns=output_columns)
+    # ------------------------------------------------------------------
+    # Catalog lookup (mirrors SqlCompiler._lookup)
+    # ------------------------------------------------------------------
+    def _lookup(self, ir: QueryIR) -> tuple[Source, Table, dict[str, Column]]:
+        source = next((s for s in self._catalog.sources if s.source_id == ir.source_id), None)
+        if source is None:
+            raise PandasCompilerError(f"source_id {ir.source_id!r} not in catalog")
+        table = next((t for t in source.tables if t.table_id == ir.table_id), None)
+        if table is None:
+            raise PandasCompilerError(
+                f"table_id {ir.table_id!r} not in source {ir.source_id!r}"
+            )
+        return source, table, {c.column_id: c for c in table.columns}
+# ---------------------------------------------------------------------------
+# Module-level helpers (pure functions — easier to test in isolation)
+# ---------------------------------------------------------------------------
+def _output_column_names(select: list[SelectItem], cols_by_id: dict[str, Column]) -> list[str]:
+    names = []
+    for s in select:
+        if isinstance(s, ColumnSelect):
+            names.append(s.alias or cols_by_id[s.column_id].name)
+        else:
+            names.append(_agg_output_name(s, cols_by_id))
+    return names
+def _agg_output_name(s: AggSelect, cols_by_id: dict[str, Column]) -> str:
+    if s.alias:
+        return s.alias
+    if s.fn == "count" and s.column_id is None:
+        return "count"
+    return f"{s.fn}_{cols_by_id[s.column_id].name}"
+def _like_to_regex(pattern: str) -> str:
+    """Convert SQL LIKE pattern to Python regex string (no anchors — use fullmatch)."""
+    parts: list[str] = []
+    for ch in pattern:
+        if ch == "%":
+            parts.append(".*")
+        elif ch == "_":
+            parts.append(".")
+        else:
+            parts.append(re.escape(ch))
+    return "".join(parts)
+def _apply_filters(
+    df: pd.DataFrame,
+    filters: list[FilterClause],
+    cols_by_id: dict[str, Column],
+) -> pd.DataFrame:
+    if not filters:
+        return df
+    mask = pd.Series(True, index=df.index)
+    for f in filters:
+        col_name = cols_by_id[f.column_id].name
+        series = df[col_name]
+        op, val = f.op, f.value
+        if op == "=":
+            mask &= series == val
+        elif op == "!=":
+            mask &= series != val
+        elif op == "<":
+            mask &= series < val
+        elif op == "<=":
+            mask &= series <= val
+        elif op == ">":
+            mask &= series > val
+        elif op == ">=":
+            mask &= series >= val
+        elif op == "in":
+            mask &= series.isin(val)
+        elif op == "not_in":
+            mask &= ~series.isin(val)
+        elif op == "is_null":
+            mask &= series.isna()
+        elif op == "is_not_null":
+            mask &= series.notna()
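+        elif op == "like":
+            # astype(str) can turn NaN into the string "nan", so nulls are
+            # excluded explicitly; otherwise LIKE '%' would match them.
+            like_hit = series.astype(str).str.fullmatch(_like_to_regex(val), case=True, na=False)
+            mask &= series.notna() & like_hit
+        elif op == "between":
+            mask &= (series >= val[0]) & (series <= val[1])
+        else:
+            raise PandasCompilerError(f"unhandled filter op {op!r}")
+    return df[mask].copy()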
+def _apply_select(
+    df: pd.DataFrame,
+    select: list[SelectItem],
+    cols_by_id: dict[str, Column],
+) -> pd.DataFrame:
+    col_names = [cols_by_id[s.column_id].name for s in select if isinstance(s, ColumnSelect)]
+    result = df[col_names].copy()
+    rename_map = {
+        cols_by_id[s.column_id].name: s.alias
+        for s in select
+        if isinstance(s, ColumnSelect) and s.alias
+    }
+    if rename_map:
+        result = result.rename(columns=rename_map)
+    return result
+def _scalar_agg(df: pd.DataFrame, s: AggSelect, cols_by_id: dict[str, Column]) -> Any:
+    if s.fn == "count" and s.column_id is None:
+        return int(len(df))
+    col_name = cols_by_id[s.column_id].name
+    series = df[col_name]
+    match s.fn:
+        case "count":
+            return int(series.count())
+        case "count_distinct":
+            return int(series.nunique())
+        case "sum":
+            return series.sum()
+        case "avg":
+            return series.mean()
+        case "min":
+            return series.min()
+        case "max":
+            return series.max()
+    raise PandasCompilerError(f"unhandled agg fn {s.fn!r}")
+def _group_agg_series(
+    grouped: Any,
+    s: AggSelect,
+    cols_by_id: dict[str, Column],
+) -> pd.Series:
+    if s.fn == "count" and s.column_id is None:
+        return grouped.size()
+    col_name = cols_by_id[s.column_id].name
+    match s.fn:
+        case "count":
+            return grouped[col_name].count()
+        case "count_distinct":
+            return grouped[col_name].nunique()
+        case "sum":
+            return grouped[col_name].sum()
+        case "avg":
+            return grouped[col_name].mean()
+        case "min":
+            return grouped[col_name].min()
+        case "max":
+            return grouped[col_name].max()
+    raise PandasCompilerError(f"unhandled agg fn {s.fn!r}")
+def _apply_agg(
+    df: pd.DataFrame,
+    select: list[SelectItem],
+    group_by: list[str],
+    cols_by_id: dict[str, Column],
+) -> pd.DataFrame:
+    agg_items = [s for s in select if isinstance(s, AggSelect)]
+    col_items = [s for s in select if isinstance(s, ColumnSelect)]
+    group_col_names = [cols_by_id[col_id].name for col_id in group_by]
+    if group_col_names:
+        grouped = df.groupby(group_col_names, sort=False)
+        series_list = [
+            _group_agg_series(grouped, s, cols_by_id).rename(_agg_output_name(s, cols_by_id))
+            for s in agg_items
+        ]
+        result = pd.concat(series_list, axis=1).reset_index()
+        rename_map = {
+            cols_by_id[s.column_id].name: s.alias
+            for s in col_items
+            if s.alias
+        }
+        if rename_map:
+            result = result.rename(columns=rename_map)
+    else:
+        row = {
+            _agg_output_name(s, cols_by_id): _scalar_agg(df, s, cols_by_id)
+            for s in agg_items
+        }
+        result = pd.DataFrame([row])
+    return result
+def _resolve_order_col(
+    col_id_or_alias: str,
+    select: list[SelectItem],
+    cols_by_id: dict[str, Column],
+) -> str:
+    """Map an order_by column_id (or alias) to the actual output column name."""
+    for s in select:
+        if isinstance(s, ColumnSelect) and s.column_id == col_id_or_alias:
+            return s.alias or cols_by_id[s.column_id].name
+        if isinstance(s, AggSelect) and s.column_id == col_id_or_alias:
+            return _agg_output_name(s, cols_by_id)
+    return col_id_or_alias  # treat as alias / output name directly
+def _apply_orderby(
+    df: pd.DataFrame,
+    order_by: list[OrderByClause],
+    select: list[SelectItem],
+    cols_by_id: dict[str, Column],
+) -> pd.DataFrame:
+    sort_cols: list[str] = []
+    ascending: list[bool] = []
+    for ob in order_by:
+        out_name = _resolve_order_col(ob.column_id, select, cols_by_id)
+        if out_name in df.columns:
+            sort_cols.append(out_name)
+            ascending.append(ob.dir == "asc")
+    if sort_cols:
+        return df.sort_values(by=sort_cols, ascending=ascending)
+    return df
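
The grouped branch of `_apply_agg` above builds one Series per aggregate, renames each to its output name, and joins them with `pd.concat(..., axis=1)` before `reset_index()`. A small worked example may make the resulting shape easier to see. The DataFrame and column names below are invented for illustration. The output names follow the `{fn}_{column}` convention from `_agg_output_name`.

```python
import pandas as pd

# Illustrative data only; not taken from the repository's tests.
df = pd.DataFrame(
    {
        "region": ["east", "west", "east", "west", "east"],
        "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
    }
)

# sort=False keeps groups in order of first appearance.
grouped = df.groupby(["region"], sort=False)

# One Series per aggregate, each indexed by the group key.
series_list = [
    grouped.size().rename("count"),                 # count(*)
    grouped["amount"].sum().rename("sum_amount"),   # sum(amount)
    grouped["amount"].mean().rename("avg_amount"),  # avg(amount)
]

# The Series share the same index, so concat aligns them column-wise;
# reset_index turns the group key back into an ordinary column.
result = pd.concat(series_list, axis=1).reset_index()
print(result)
```

The group key columns always come first in `result`, followed by the aggregates in `select` order. Callers that rely on `output_columns` for ordering may want to reindex the columns accordingly.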

src/query/executor/tabular.py CHANGED

@@ -9,16 +9,148 @@ Initial scope ships eager pandas only; the others are added when a real
 file is too big.
 """
-from ...catalog.models import Catalog
-from ..compiler.pandas import PandasCompiler
 from ..ir.models import QueryIR
 from .base import BaseExecutor, QueryResult
 class TabularExecutor(BaseExecutor):
-    def __init__(self, catalog: Catalog) -> None:
         self._catalog = catalog
         self._compiler = PandasCompiler(catalog)
     async def run(self, ir: QueryIR) -> QueryResult:
-        raise NotImplementedError

 file is too big.
 """
+from __future__ import annotations
+import asyncio
+import io
+import time
+from collections.abc import Callable, Coroutine
+from typing import Any
+import pandas as pd
+from ...catalog.models import Catalog, Source, Table
+from ...middlewares.logging import get_logger
+from ..compiler.pandas import CompiledPandas, PandasCompiler
 from ..ir.models import QueryIR
 from .base import BaseExecutor, QueryResult
+logger = get_logger("tabular_executor")
+_AZ_BLOB_PREFIX = "az_blob://"
+_ROW_HARD_CAP = 10_000
 class TabularExecutor(BaseExecutor):
+    """Executes compiled pandas chain on a Parquet blob.
+    `fetch_blob` is injectable for tests — defaults to AzureBlobStorage.
+    """
+    def __init__(
+        self,
+        catalog: Catalog,
+        fetch_blob: Callable[[str], Coroutine[Any, Any, bytes]] | None = None,
+    ) -> None:
         self._catalog = catalog
         self._compiler = PandasCompiler(catalog)
+        self._fetch_blob = fetch_blob or self._default_fetch_blob
+    @staticmethod
+    async def _default_fetch_blob(blob_name: str) -> bytes:
+        from ...storage.az_blob.az_blob import blob_storage
+        return await blob_storage.download_file(blob_name)
     async def run(self, ir: QueryIR) -> QueryResult:
+        started = time.perf_counter()
+        try:
+            source, table = self._lookup(ir)
+            if source.source_type != "tabular":
+                raise ValueError(
+                    f"TabularExecutor cannot run on source_type={source.source_type!r}; "
+                    "expected 'tabular'"
+                )
+            compiled = self._compiler.compile(ir)
+            blob_name = _resolve_blob_name(source, table)
+            blob_bytes = await self._fetch_blob(blob_name)
+            result_df = await asyncio.to_thread(_load_and_apply, blob_bytes, compiled)
+            truncated = len(result_df) > _ROW_HARD_CAP
+            capped = result_df.head(_ROW_HARD_CAP)
+            columns = compiled.output_columns
+            rows = capped.to_dict(orient="records")
+            elapsed_ms = int((time.perf_counter() - started) * 1000)
+            logger.info(
+                "tabular query complete",
+                source_id=ir.source_id,
+                rows=len(rows),
+                truncated=truncated,
+                elapsed_ms=elapsed_ms,
+            )
+            return QueryResult(
+                source_id=ir.source_id,
+                backend="tabular",
+                columns=columns,
+                rows=rows,
+                row_count=len(rows),
+                truncated=truncated,
+                elapsed_ms=elapsed_ms,
+            )
+        except Exception as e:
+            elapsed_ms = int((time.perf_counter() - started) * 1000)
+            logger.error(
+                "tabular executor failed",
+                source_id=ir.source_id,
+                error=str(e),
+                elapsed_ms=elapsed_ms,
+            )
+            return QueryResult(
+                source_id=ir.source_id,
+                backend="tabular",
+                elapsed_ms=elapsed_ms,
+                error=str(e),
+            )
+    # ------------------------------------------------------------------
+    # Helpers
+    # ------------------------------------------------------------------
+    def _lookup(self, ir: QueryIR) -> tuple[Source, Table]:
+        source = next(
+            (s for s in self._catalog.sources if s.source_id == ir.source_id), None
+        )
+        if source is None:
+            raise ValueError(f"source_id {ir.source_id!r} not in catalog")
+        table = next(
+            (t for t in source.tables if t.table_id == ir.table_id), None
+        )
+        if table is None:
+            raise ValueError(f"table_id {ir.table_id!r} not in source {ir.source_id!r}")
+        return source, table
+# ---------------------------------------------------------------------------
+# Module-level helpers (pure functions — easier to test in isolation)
+# ---------------------------------------------------------------------------
+def _resolve_blob_name(source: Source, table: Table) -> str:
+    """Map source.location_ref + table → the Parquet blob name to download.
+    Convention:
+      CSV / single-sheet Parquet → ``{user_id}/{document_id}.parquet``
+      XLSX (multi-sheet)        → ``{user_id}/{document_id}__{table.name}.parquet``
+    """
+    if not source.location_ref.startswith(_AZ_BLOB_PREFIX):
+        raise ValueError(
+            f"TabularExecutor expects 'az_blob://...' location_ref, "
+            f"got {source.location_ref!r}"
+        )
+    path = source.location_ref[len(_AZ_BLOB_PREFIX):]
+    parts = path.split("/", 1)
+    if len(parts) != 2 or not parts[0] or not parts[1]:
+        raise ValueError(f"Malformed az_blob location_ref: {source.location_ref!r}")
+    user_id, document_id = parts
+    if len(source.tables) == 1:
+        return f"{user_id}/{document_id}.parquet"
+    return f"{user_id}/{document_id}__{table.name}.parquet"
+def _load_and_apply(blob_bytes: bytes, compiled: CompiledPandas) -> pd.DataFrame:
+    """Load Parquet bytes into a DataFrame and apply the compiled op chain."""
+    df = pd.read_parquet(io.BytesIO(blob_bytes))
+    return compiled.apply(df)
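
The control flow of `TabularExecutor.run` above (injectable async fetch, blocking work moved to `asyncio.to_thread`, a hard row cap with a `truncated` flag, and errors captured instead of raised) can be sketched without pandas or Azure. Everything named here is a hypothetical stand-in: `Result` for `QueryResult`, `parse_rows` for `_load_and_apply`, `fake_fetch` for the injected `fetch_blob`. The cap is set to 3 for the demo rather than 10,000.

```python
from __future__ import annotations

import asyncio
import time
from collections.abc import Awaitable, Callable
from dataclasses import dataclass, field

ROW_HARD_CAP = 3  # demo value; the commit uses 10_000


@dataclass
class Result:
    """Hypothetical stand-in for QueryResult."""
    rows: list[dict] = field(default_factory=list)
    truncated: bool = False
    elapsed_ms: int = 0
    error: str | None = None


def parse_rows(blob: bytes) -> list[dict]:
    """Blocking work; stands in for pd.read_parquet + compiled.apply."""
    return [{"n": int(tok)} for tok in blob.decode().split(",")]


async def run(blob_name: str, fetch_blob: Callable[[str], Awaitable[bytes]]) -> Result:
    started = time.perf_counter()
    try:
        blob = await fetch_blob(blob_name)
        # Keep the event loop free while the blocking parse runs.
        rows = await asyncio.to_thread(parse_rows, blob)
        elapsed = int((time.perf_counter() - started) * 1000)
        return Result(rows=rows[:ROW_HARD_CAP],
                      truncated=len(rows) > ROW_HARD_CAP,
                      elapsed_ms=elapsed)
    except Exception as exc:  # never raise; report via Result.error
        elapsed = int((time.perf_counter() - started) * 1000)
        return Result(error=str(exc), elapsed_ms=elapsed)


async def fake_fetch(blob_name: str) -> bytes:
    """Test double injected in place of real blob storage."""
    if blob_name != "u1/d1.parquet":
        raise FileNotFoundError(f"no blob named {blob_name!r}")
    return b"1,2,3,4,5"


ok = asyncio.run(run("u1/d1.parquet", fake_fetch))
bad = asyncio.run(run("u1/missing.parquet", fake_fetch))
print(ok.rows, ok.truncated)  # [{'n': 1}, {'n': 2}, {'n': 3}] True
print(bad.error)              # no blob named 'u1/missing.parquet'
```

Passing the fetch function in as a parameter is what lets tests exercise both the happy path and the failure path without network access.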