Agentic-Service-Data-Eyond-Catalog

/fix analysis + report_inputs and storage update

by rhbt6767 - opened 3 days ago

base: refs/heads/main ← from: refs/pr/9

Files changed (9): +168 -57

DEV_PLAN.md +1 -1
REPO_STATUS.md +5 -3
src/agents/chat_handler.py +11 -25
src/api/v1/chat.py +2 -3
src/api/v2/chat.py +7 -13
src/config/settings.py +19 -6
src/query/executor/tabular.py +25 -6
src/storage/object_storage/__init__.py +13 -0
src/storage/object_storage/supabase_s3.py +85 -0

DEV_PLAN.md CHANGED

@@ -173,7 +173,7 @@ Status legend: ⬜ not started · 🔄 in progress · ✅ done · ⛔ blocked ·
 | 22 | Finalize `report_inputs` schema → hand to Harry for the dedorch migration | Rifqi → Harry | 🔄 | **DDL ready** (uuid `id`/`analysis_id` + FK→`analyses(id)`; `user_id`/`plan_id` text; `data` jsonb = serialized `AnalysisRecord`, shape documented). dedorch has empty `analysis_records` → rename. Resolves #16. **Action: send Harry the DDL + `data` shape** |
 | 23 | Report markdown formatting: tables, **bold**, *italic*, horizontal separators | Sofhia | ✅ | Done 2026-06-25. Added `---` separators between header + each section in `_render_markdown`. Tables (EDA) / bold (method labels) / italic (meta + citations) already emitted. Relaxed `report_summary.md` to allow inline `**bold**`/`*italic*` for emphasis (kept no-headings/no-bullets so it doesn't duplicate the section structure / Key Findings). Compile + ruff clean |
 | 24 | Clarify report input contract: records table (+ `last_report` for edit mode?) | Rifqi/Sofhia ↔ Harry | ⬜ new | Edit-mode input left open at the checkpoint |
-| 25 | Migrate Python chat path to Go `analyses_messages` (+ `analyses`) | Rifqi ↔ Harry | ⬜ | **Bigger than "confirm" (verified 2026-06-25):** dedorch `rooms` + `chat_messages` are **deprecated** (`zdeprecated_*`). Python's `Room`/`ChatMessage` models + `chat.py` `load_history`/`save_messages` target them → **break post-cutover**. Move history read/write to `analyses_messages` before the conn-string cutover |
 | 26 | **Charts (DEFERRED):** store Plotly JSON in a future `chart` table (not matplotlib PNG) | — | ⏸️ | After the markdown path is done end-to-end |
 | 27 | **Images (DEFERRED):** image table (id, analysis_id, msg/report ref, order) + originals in a bucket | — | ⏸️ | Maintenance-heavy; parked |
 | 28 | **UI research** (FE): new-analysis form, knowledge menu (user vs analysis level), report artifacts + version selector | Team | ⬜ new | No dedicated UI person; interview + old analysis UI removed |

 | 22 | Finalize `report_inputs` schema → hand to Harry for the dedorch migration | Rifqi → Harry | 🔄 | **DDL ready** (uuid `id`/`analysis_id` + FK→`analyses(id)`; `user_id`/`plan_id` text; `data` jsonb = serialized `AnalysisRecord`, shape documented). dedorch has empty `analysis_records` → rename. Resolves #16. **Action: send Harry the DDL + `data` shape** |
 | 23 | Report markdown formatting: tables, **bold**, *italic*, horizontal separators | Sofhia | ✅ | Done 2026-06-25. Added `---` separators between header + each section in `_render_markdown`. Tables (EDA) / bold (method labels) / italic (meta + citations) already emitted. Relaxed `report_summary.md` to allow inline `**bold**`/`*italic*` for emphasis (kept no-headings/no-bullets so it doesn't duplicate the section structure / Key Findings). Compile + ruff clean |
 | 24 | Clarify report input contract: records table (+ `last_report` for edit mode?) | Rifqi/Sofhia ↔ Harry | ⬜ new | Edit-mode input left open at the checkpoint |
+| 25 | Migrate Python chat path to Go `analyses_messages` (+ `analyses`) | Rifqi ↔ Harry | ✅ | Done 2026-07-02. Read path already on `analyses_messages` (commit `0066161`). This change makes Python **read-only**: removed the `save_messages` calls from `/api/v2/chat/stream` so **Go is the sole writer** — fixes the double-write both Go+Python were producing. `load_history` still reads `analyses_messages`. v1 `/chat/stream` is unwired so left untouched |
 | 26 | **Charts (DEFERRED):** store Plotly JSON in a future `chart` table (not matplotlib PNG) | — | ⏸️ | After the markdown path is done end-to-end |
 | 27 | **Images (DEFERRED):** image table (id, analysis_id, msg/report ref, order) + originals in a bucket | — | ⏸️ | Maintenance-heavy; parked |
 | 28 | **UI research** (FE): new-analysis form, knowledge menu (user vs analysis level), report artifacts + version selector | Team | ⬜ new | No dedicated UI person; interview + old analysis UI removed |

REPO_STATUS.md CHANGED

@@ -129,8 +129,10 @@ GET /report/{analysis_id}/{ver}  → fetch one version
 ```
 Two facts to internalise:
-- **Records only exist on the slow path.** With `ENABLE_SLOW_PATH=false` (the default) no
-  records accumulate, so generation 409s — by design, not a bug.
 - **dedorch `reports` stores markdown only.** Structured report fields are computed at
   generation, rendered into `rendered_markdown`, and only the markdown is persisted; on
   read-back the structured fields come back empty.
@@ -289,7 +291,7 @@ only.
 | Flag | Where | Default | Effect |
 |---|---|---|---|
-| `ENABLE_SLOW_PATH` | `settings.enable_slow_path` | **off** | Route `structured_flow` through Planner/TaskRunner/Assembler (vs single-query `QueryService`). Records persist only on the slow path → reports require this on. |
 | `ENABLE_GATE` | `settings.enable_gate` | **off** | **Deprecated 2026-06-25** — gate neutered; the flag has no effect. Kept to avoid `.env` churn. |
 | `SKIP_INIT_DB` | `settings.skip_init_db` (.env/env) | **on** | Skip `init_db()` on startup — the dedorch cutover switch. **Defaults TRUE** (Go owns the dedorch schema); set `false` only for a local Python-owned DB. |
 | `enable_tracing` | hardcoded `True` in `chat.py` | on (endpoint) | Langfuse tracing. |

 ```
 Two facts to internalise:
+- **Records only exist on the slow path.** The slow path is now **always on** for
+  `structured_flow` (the `ENABLE_SLOW_PATH` flag was removed 2026-07-02), so every
+  structured question persists a record. Reports still 409 until at least one `analyze_*`
+  task has actually succeeded (chat/help/check/unstructured turns write no record).
 - **dedorch `reports` stores markdown only.** Structured report fields are computed at
   generation, rendered into `rendered_markdown`, and only the markdown is persisted; on
   read-back the structured fields come back empty.
 | Flag | Where | Default | Effect |
 |---|---|---|---|
+| ~~`ENABLE_SLOW_PATH`~~ | — | **removed 2026-07-02** | Flag deleted. `structured_flow` now **always** runs Planner/TaskRunner/Assembler (the single-query `QueryService` fast path was retired from the chat handler), so records always persist. `extra="allow"` ignores a stale `ENABLE_SLOW_PATH` left in any `.env`. |
 | `ENABLE_GATE` | `settings.enable_gate` | **off** | **Deprecated 2026-06-25** — gate neutered; the flag has no effect. Kept to avoid `.env` churn. |
 | `SKIP_INIT_DB` | `settings.skip_init_db` (.env/env) | **on** | Skip `init_db()` on startup — the dedorch cutover switch. **Defaults TRUE** (Go owns the dedorch schema); set `false` only for a local Python-owned DB. |
 | `enable_tracing` | hardcoded `True` in `chat.py` | on (endpoint) | Langfuse tracing. |
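
Reviewer sketch of the report-readiness rule described above: generation returns 409 until at least one `analyze_*` task has succeeded. `TaskResult` and `report_status_code` are hypothetical names for illustration, not code from this repository, and the real check may look at persisted `report_inputs` rows instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    """Hypothetical stand-in for one slow-path task outcome."""
    name: str        # e.g. "analyze_trend" (illustrative task name)
    succeeded: bool


def report_status_code(results: List[TaskResult]) -> int:
    """Return 200 when a report can be generated, otherwise 409.

    Assumption: only successful tasks whose name starts with "analyze_"
    count as report input.
    """
    has_input = any(r.succeeded and r.name.startswith("analyze_") for r in results)
    return 200 if has_input else 409


no_tasks = report_status_code([])
failed_only = report_status_code([TaskResult("analyze_trend", False)])
one_success = report_status_code(
    [TaskResult("analyze_trend", False), TaskResult("analyze_summary", True)]
)
```

Under these assumptions an analysis with no successful analytical task still gets 409, even though the slow path is always on.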

src/agents/chat_handler.py CHANGED

@@ -5,7 +5,8 @@ End-to-end flow per user message:
   1. `OrchestratorAgent.classify` → RouterDecision (one of six intents).
   2. Route by intent:
        - `chat`              → no context. Pass straight to ChatbotAgent.
-       - `structured_flow`   → CatalogReader → slow path / QueryService.
        - `unstructured_flow` → DocumentRetriever (RAG over PGVector) →
                                list[DocumentChunk].
        - `check`             → check_data / check_knowledge tool → rendered table.
@@ -48,7 +49,6 @@ from .orchestration import OrchestratorAgent
 if TYPE_CHECKING:
     from ..catalog.reader import CatalogReader
-    from ..query.service import QueryService
     from ..retrieval.router import RetrievalRouter
     from .gate import AnalysisState
     from .slow_path.coordinator import SlowPathCoordinator
@@ -75,10 +75,8 @@ class ChatHandler:
         intent_router: OrchestratorAgent | None = None,
         answer_agent: ChatbotAgent | None = None,
         catalog_reader: CatalogReader | None = None,
-        query_service: QueryService | None = None,
         document_retriever: RetrievalRouter | None = None,
         *,
-        enable_slow_path: bool = False,
         slow_path_coordinator_factory: (
             Callable[[str], SlowPathCoordinator] | None
         ) = None,
@@ -94,16 +92,13 @@ class ChatHandler:
         self._intent_router = intent_router
         self._answer_agent = answer_agent
         self._catalog_reader = catalog_reader
-        self._query_service = query_service
         self._document_retriever = document_retriever
         # Langfuse tracing (tokens + latency). OFF by default so tests never hit
         # Langfuse; the live endpoint opts in with ChatHandler(enable_tracing=True).
         self._enable_tracing = enable_tracing
-        # Slow analytical path (Planner -> TaskRunner -> Assembler). OFF by default:
-        # gated until the lead's real BusinessContext lands. When True, `structured`
-        # intents route here instead of the single-query QueryService path. The
         # factory + store are injectable for tests.
-        self._enable_slow_path = enable_slow_path
         self._slow_path_factory = slow_path_coordinator_factory
         self._analysis_store = analysis_store
         # `check` skill: builds the data-access invoker (check_data/check_knowledge)
@@ -144,13 +139,6 @@ class ChatHandler:
             self._catalog_reader = CatalogReader(CatalogStore())
         return self._catalog_reader
-    def _get_query_service(self) -> QueryService:
-        if self._query_service is None:
-            from ..query.service import QueryService
-            self._query_service = QueryService()
-        return self._query_service
     def _get_document_retriever(self) -> RetrievalRouter:
         if self._document_retriever is None:
             from ..retrieval.router import RetrievalRouter
@@ -351,15 +339,13 @@ class ChatHandler:
                 bound = await self._bound_source_ids(analysis_id)
                 reader = _ScopedCatalogReader(req_reader, bound) if bound else req_reader
                 catalog = await reader.read(user_id, "structured")
-                if self._enable_slow_path:
-                    async for event in self._run_slow_path(
-                        user_id, rewritten, catalog, tracer, reader, analysis_id
-                    ):
-                        yield event
-                    return
-                query_result = await self._get_query_service().run(
-                    user_id, rewritten, catalog
-                )
             except Exception as e:
                 logger.error(
                     "structured route failed",

   1. `OrchestratorAgent.classify` → RouterDecision (one of six intents).
   2. Route by intent:
        - `chat`              → no context. Pass straight to ChatbotAgent.
+       - `structured_flow`   → CatalogReader → slow analytical path
+                               (Planner → TaskRunner → Assembler).
        - `unstructured_flow` → DocumentRetriever (RAG over PGVector) →
                                list[DocumentChunk].
        - `check`             → check_data / check_knowledge tool → rendered table.
 if TYPE_CHECKING:
     from ..catalog.reader import CatalogReader
     from ..retrieval.router import RetrievalRouter
     from .gate import AnalysisState
     from .slow_path.coordinator import SlowPathCoordinator
         intent_router: OrchestratorAgent | None = None,
         answer_agent: ChatbotAgent | None = None,
         catalog_reader: CatalogReader | None = None,
         document_retriever: RetrievalRouter | None = None,
         *,
         slow_path_coordinator_factory: (
             Callable[[str], SlowPathCoordinator] | None
         ) = None,
         self._intent_router = intent_router
         self._answer_agent = answer_agent
         self._catalog_reader = catalog_reader
         self._document_retriever = document_retriever
         # Langfuse tracing (tokens + latency). OFF by default so tests never hit
         # Langfuse; the live endpoint opts in with ChatHandler(enable_tracing=True).
         self._enable_tracing = enable_tracing
+        # Slow analytical path (Planner -> TaskRunner -> Assembler): the only route for
+        # `structured_flow` now (the ENABLE_SLOW_PATH flag was removed 2026-07-02). The
         # factory + store are injectable for tests.
         self._slow_path_factory = slow_path_coordinator_factory
         self._analysis_store = analysis_store
         # `check` skill: builds the data-access invoker (check_data/check_knowledge)
             self._catalog_reader = CatalogReader(CatalogStore())
         return self._catalog_reader
     def _get_document_retriever(self) -> RetrievalRouter:
         if self._document_retriever is None:
             from ..retrieval.router import RetrievalRouter
                 bound = await self._bound_source_ids(analysis_id)
                 reader = _ScopedCatalogReader(req_reader, bound) if bound else req_reader
                 catalog = await reader.read(user_id, "structured")
+                # structured_flow always runs the slow analytical path (the
+                # ENABLE_SLOW_PATH flag was removed 2026-07-02).
+                async for event in self._run_slow_path(
+                    user_id, rewritten, catalog, tracer, reader, analysis_id
+                ):
+                    yield event
+                return
             except Exception as e:
                 logger.error(
                     "structured route failed",
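
Reviewer sketch of the control flow after this change: the structured branch forwards every event from the slow path with no flag check. `SketchHandler` and its method bodies are hypothetical stand-ins; the real `_run_slow_path` takes more arguments and emits richer events.

```python
import asyncio
from typing import AsyncIterator, List


class SketchHandler:
    """Hypothetical, trimmed-down stand-in for ChatHandler."""

    async def _run_slow_path(self, question: str) -> AsyncIterator[dict]:
        # Stand-in for Planner -> TaskRunner -> Assembler progress events.
        for stage in ("planner", "task_runner", "assembler"):
            yield {"event": "status", "data": stage}
        yield {"event": "done", "data": f"answer to: {question}"}

    async def handle_structured(self, question: str) -> AsyncIterator[dict]:
        # No enable_slow_path branch: every structured question goes this way.
        async for event in self._run_slow_path(question):
            yield event


async def collect(question: str) -> List[dict]:
    handler = SketchHandler()
    return [event async for event in handler.handle_structured(question)]


events = asyncio.run(collect("monthly revenue by region"))
```

With the single-query branch gone, a caller can no longer bypass the analytical path by constructor argument.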

src/api/v1/chat.py CHANGED

@@ -26,11 +26,10 @@ router = APIRouter(prefix="/api/v1", tags=["Chat"])
 # is passed into handle()), and lazily builds + caches the Orchestrator/Chatbot
 # chains — so reusing it keeps the Azure OpenAI clients (and their httpx/TLS pools)
 # warm across requests instead of re-handshaking on the first call of every request.
-# enable_slow_path is env-gated (ENABLE_SLOW_PATH): when on, structured intents route
-# Orchestrator -> Planner -> TaskRunner -> Assembler so the team can test e2e here.
 _chat_handler = ChatHandler(
     enable_tracing=True,
-    enable_slow_path=settings.enable_slow_path,
     enable_gate=settings.enable_gate,
 )

 # is passed into handle()), and lazily builds + caches the Orchestrator/Chatbot
 # chains — so reusing it keeps the Azure OpenAI clients (and their httpx/TLS pools)
 # warm across requests instead of re-handshaking on the first call of every request.
+# Structured intents always route Orchestrator -> Planner -> TaskRunner -> Assembler
+# (the analytical slow path); the ENABLE_SLOW_PATH flag was removed 2026-07-02.
 _chat_handler = ChatHandler(
     enable_tracing=True,
     enable_gate=settings.enable_gate,
 )

src/api/v2/chat.py CHANGED

@@ -12,10 +12,11 @@
 Only chat moves to v2; the tools group + observability stay on `/api/v1` (contract:
 API_ENDPOINTS_RESTRUCTURE.md §1).
-⚠️ Persistence (transitional). This mirrors v1: it still load/saves turn history via the
-analysis-keyed message tables so multi-turn context works in the playground. Moving the
-read/write to Go-owned `analyses_messages` (and making Python read-only) is DEV_PLAN #25.
-Note Sofhia's `/tools/help` is already generative-only — align chat with that under #25.
 """
 import json
@@ -38,7 +39,6 @@ from src.api.v1.chat import (
     cache_response,
     get_cached_response,
     load_history,
-    save_messages,
 )
 from src.db.postgres.connection import get_db
 from src.db.redis.connection import get_redis
@@ -88,7 +88,6 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)):
         logger.info("Returning cached response")
         cached_text = cached["response"]
         cached_sources = cached["sources"]
-        await save_messages(db, analysis_id, request.user_id, request.message, cached_text)
         async def stream_cached():
             yield {"event": "sources", "data": json.dumps(cached_sources)}
@@ -103,7 +102,6 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)):
         direct = _fast_intent(request.message)
         if direct:
             await cache_response(redis, cache_key, direct, sources=[])
-            await save_messages(db, analysis_id, request.user_id, request.message, direct)
             async def stream_direct():
                 yield {"event": "sources", "data": json.dumps([])}
@@ -142,12 +140,8 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)):
                     # Only cache stateless `chat` replies (see _CACHEABLE_INTENTS).
                     if effective_intent in _CACHEABLE_INTENTS:
                         await cache_response(redis, cache_key, full_response, sources=sources)
-                    try:
-                        await save_messages(
-                            db, analysis_id, request.user_id, request.message, full_response
-                        )
-                    except Exception as e:
-                        logger.error("save_messages failed", analysis_id=analysis_id, error=str(e))
                     yield done_event
                 elif event["event"] == "status":
                     # slow-path progress: forward so the client shows activity.

 Only chat moves to v2; the tools group + observability stay on `/api/v1` (contract:
 API_ENDPOINTS_RESTRUCTURE.md §1).
+Persistence (DEV_PLAN #25 — done). Python is **read-only** on the Go-owned
+`analyses_messages` table: it *reads* turn history (so multi-turn context works) but no
+longer writes the user/AI turns — Go is the sole writer. This removes the double-write
+that appeared when both Go and Python's stream persisted the same turn. Aligns chat with
+`/tools/help`, which was already generative-only.
 """
 import json
     cache_response,
     get_cached_response,
     load_history,
 )
 from src.db.postgres.connection import get_db
 from src.db.redis.connection import get_redis
         logger.info("Returning cached response")
         cached_text = cached["response"]
         cached_sources = cached["sources"]
         async def stream_cached():
             yield {"event": "sources", "data": json.dumps(cached_sources)}
         direct = _fast_intent(request.message)
         if direct:
             await cache_response(redis, cache_key, direct, sources=[])
             async def stream_direct():
                 yield {"event": "sources", "data": json.dumps([])}
                     # Only cache stateless `chat` replies (see _CACHEABLE_INTENTS).
                     if effective_intent in _CACHEABLE_INTENTS:
                         await cache_response(redis, cache_key, full_response, sources=sources)
+                    # Persistence is Go's job now (DEV_PLAN #25): Python reads history but
+                    # no longer writes turns, so Go stays the sole writer of analyses_messages.
                     yield done_event
                 elif event["event"] == "status":
                     # slow-path progress: forward so the client shows activity.
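
Reviewer sketch of a regression check for the read-only contract: the stream reads history once and never writes a turn. `RecordingStore` and `stream_reply` are hypothetical and only model the shape of `load_history` and the removed `save_messages` call, not their real signatures.

```python
import asyncio


class RecordingStore:
    """Hypothetical in-memory stand-in for the analyses_messages table."""

    def __init__(self, history):
        self._history = list(history)
        self.reads = 0
        self.writes = 0

    async def load_history(self, analysis_id):
        self.reads += 1
        return list(self._history)

    async def save_messages(self, analysis_id, user_msg, ai_msg):
        # Kept only so the test can prove it is never called.
        self.writes += 1
        self._history += [user_msg, ai_msg]


async def stream_reply(store, analysis_id, message):
    """Read-only handler: loads context, streams a reply, persists nothing."""
    history = await store.load_history(analysis_id)
    yield {"event": "chunk", "data": f"({len(history)} prior turns) echo: {message}"}
    yield {"event": "done", "data": ""}


async def main():
    store = RecordingStore(["hi", "hello"])
    events = [e async for e in stream_reply(store, "a-1", "next question")]
    return store, events


store, events = asyncio.run(main())
```

A check of this kind could guard against a later change reintroducing the double-write. It assumes writes go through a single function that a test can observe.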

src/config/settings.py CHANGED

@@ -17,12 +17,10 @@ class Settings(BaseSettings):
     )
     # Feature flags
-    # Route `structured` chat intents through the analytical SLOW PATH
-    # (Planner -> TaskRunner -> Assembler) instead of the single-query QueryService.
-    # Off by default; the team flips ENABLE_SLOW_PATH=true to test end-to-end from
-    # the /chat/stream endpoint. BusinessContext is still a stub until the lead's
-    # real source lands, so this stays opt-in.
-    enable_slow_path: bool = Field(alias="enable_slow_path", default=False)
     # DEPRECATED 2026-06-24: the problem_validated gate was removed (the goal is now
     # user-entered objective + business_questions, no agent validation). This flag no
@@ -65,6 +63,21 @@ class Settings(BaseSettings):
     azureai_container_name: str = Field(alias="azureai__container__name", default="")
     azureai_container_account_name: str = Field(alias="azureai__container__account__name", default="")
     # Langfuse
     LANGFUSE_PUBLIC_KEY: str
     LANGFUSE_SECRET_KEY: str

     )
     # Feature flags
+    # REMOVED 2026-07-02: the former ENABLE_SLOW_PATH flag is gone. `structured_flow`
+    # now always runs the analytical slow path (Planner -> TaskRunner -> Assembler), so
+    # every structured question persists a report_inputs record (reports no longer
+    # depend on flipping a flag). extra="allow" ignores a stale ENABLE_SLOW_PATH in .env.
     # DEPRECATED 2026-06-24: the problem_validated gate was removed (the goal is now
     # user-entered objective + business_questions, no agent validation). This flag no
     azureai_container_name: str = Field(alias="azureai__container__name", default="")
     azureai_container_account_name: str = Field(alias="azureai__container__account__name", default="")
+    # Object storage provider toggle. Mirrors the Go data plane's `storage.provider`
+    # (see Orchestrator config: azure_blob | supabase_s3). Blank falls back to azure_blob.
+    # Tabular query execution reads the processed Parquet from whichever backend is active.
+    storage_provider: str = Field(alias="STORAGE_PROVIDER", default="azure_blob")
+    # Supabase S3-compatible object storage (active when storage_provider == "supabase_s3").
+    # BRIDGE (Option C): lets Python read the same bucket Go writes to, until Go exposes a
+    # source-data endpoint (see docs/proposal_go_source_data_endpoint.md) — then this client
+    # is removed and Python stops touching storage directly.
+    supabase_s3_bucket: str = Field(alias="SUPABASE_S3_BUCKET", default="")
+    supabase_s3_endpoint: str = Field(alias="SUPABASE_S3_ENDPOINT", default="")
+    supabase_s3_region: str = Field(alias="SUPABASE_S3_REGION", default="")
+    supabase_s3_access_key_id: str = Field(alias="SUPABASE_S3_ACCESS_KEY_ID", default="")
+    supabase_s3_secret_access_key: str = Field(alias="SUPABASE_S3_SECRET_ACCESS_KEY", default="")
     # Langfuse
     LANGFUSE_PUBLIC_KEY: str
     LANGFUSE_SECRET_KEY: str
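
Reviewer sketch of how the `STORAGE_PROVIDER` value could be normalised before use, following the "blank falls back to azure_blob" comment above. `resolve_provider` and the two constants are hypothetical names; the PR does this inline in `TabularExecutor._default_fetch_blob`.

```python
from typing import Optional

KNOWN_PROVIDERS = ("azure_blob", "supabase_s3")
DEFAULT_PROVIDER = "azure_blob"  # the pre-migration default per the settings comment


def resolve_provider(raw: Optional[str]) -> str:
    """Normalise a raw STORAGE_PROVIDER value to one of the known providers.

    Blank, None, or unrecognised values fall back to the default, which
    matches the fallback behaviour described in this PR.
    """
    provider = (raw or "").strip().lower()
    return provider if provider in KNOWN_PROVIDERS else DEFAULT_PROVIDER


from_env = resolve_provider("  Supabase_S3 ")
from_blank = resolve_provider("")
from_typo = resolve_provider("supabase-s3")
```

One point to consider: with a silent fallback, a typo such as `supabase-s3` reads from Azure Blob without any error. Logging a warning for unrecognised values is one option if that matters for this deployment.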

src/query/executor/tabular.py CHANGED

@@ -29,13 +29,18 @@ from .base import BaseExecutor, QueryResult
 logger = get_logger("tabular_executor")
 _AZ_BLOB_PREFIX = "az_blob://"
 _ROW_HARD_CAP = 10_000
 class TabularExecutor(BaseExecutor):
     """Executes compiled pandas chain on a Parquet blob.
-    `fetch_blob` is injectable for tests — defaults to AzureBlobStorage.
     """
     def __init__(
@@ -49,6 +54,16 @@ class TabularExecutor(BaseExecutor):
     @staticmethod
     async def _default_fetch_blob(blob_name: str) -> bytes:
         from ...storage.az_blob.az_blob import blob_storage
         return await blob_storage.download_file(blob_name)
@@ -157,15 +172,19 @@ def _resolve_blob_name(source: Source, table: Table) -> str:
     on the upload pipeline preserving the file extension, which it does today
     because `Document.filename` is set once at upload and never renamed.
     """
-    if not source.location_ref.startswith(_AZ_BLOB_PREFIX):
         raise ValueError(
-            f"TabularExecutor expects 'az_blob://...' location_ref, "
-            f"got {source.location_ref!r}"
         )
-    path = source.location_ref[len(_AZ_BLOB_PREFIX):]
     parts = path.split("/", 1)
     if len(parts) != 2 or not parts[0] or not parts[1]:
-        raise ValueError(f"Malformed az_blob location_ref: {source.location_ref!r}")
     user_id, document_id = parts
     is_xlsx = source.name.lower().endswith(".xlsx")
     sheet_name = table.name if is_xlsx else None

 logger = get_logger("tabular_executor")
 _AZ_BLOB_PREFIX = "az_blob://"
+# Go's Supabase S3 data plane writes location_ref with this prefix instead of az_blob://.
+# Both encode the same path structure after the prefix: {user_id}/{document_id}.
+_OBJECT_STORAGE_PREFIX = "object_storage://"
+_LOCATION_REF_PREFIXES = (_AZ_BLOB_PREFIX, _OBJECT_STORAGE_PREFIX)
 _ROW_HARD_CAP = 10_000
 class TabularExecutor(BaseExecutor):
     """Executes compiled pandas chain on a Parquet blob.
+    `fetch_blob` is injectable for tests — defaults to the storage backend
+    selected by `settings.storage_provider` (azure_blob | supabase_s3).
     """
     def __init__(
     @staticmethod
     async def _default_fetch_blob(blob_name: str) -> bytes:
+        # Pick the storage backend by the same toggle the Go data plane uses.
+        # Blank/unknown falls back to Azure Blob (the pre-migration default).
+        from ...config.settings import settings
+        provider = (settings.storage_provider or "").strip().lower()
+        if provider == "supabase_s3":
+            from ...storage.object_storage import object_storage
+            return await object_storage.download_file(blob_name)
         from ...storage.az_blob.az_blob import blob_storage
         return await blob_storage.download_file(blob_name)
     on the upload pipeline preserving the file extension, which it does today
     because `Document.filename` is set once at upload and never renamed.
     """
+    matched_prefix = next(
+        (p for p in _LOCATION_REF_PREFIXES if source.location_ref.startswith(p)),
+        None,
+    )
+    if matched_prefix is None:
         raise ValueError(
+            f"TabularExecutor expects 'az_blob://...' or 'object_storage://...' "
+            f"location_ref, got {source.location_ref!r}"
         )
+    path = source.location_ref[len(matched_prefix):]
     parts = path.split("/", 1)
     if len(parts) != 2 or not parts[0] or not parts[1]:
+        raise ValueError(f"Malformed location_ref: {source.location_ref!r}")
     user_id, document_id = parts
     is_xlsx = source.name.lower().endswith(".xlsx")
     sheet_name = table.name if is_xlsx else None
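
Reviewer sketch that isolates the prefix handling from `_resolve_blob_name` so both accepted schemes can be checked side by side. `split_location_ref` is a hypothetical helper; the logic mirrors the added lines above and returns only the `{user_id}/{document_id}` pair.

```python
from typing import Tuple

_AZ_BLOB_PREFIX = "az_blob://"
_OBJECT_STORAGE_PREFIX = "object_storage://"
_LOCATION_REF_PREFIXES = (_AZ_BLOB_PREFIX, _OBJECT_STORAGE_PREFIX)


def split_location_ref(location_ref: str) -> Tuple[str, str]:
    """Strip a known scheme prefix and return (user_id, document_id)."""
    matched_prefix = next(
        (p for p in _LOCATION_REF_PREFIXES if location_ref.startswith(p)),
        None,
    )
    if matched_prefix is None:
        raise ValueError(
            "expected 'az_blob://...' or 'object_storage://...' "
            f"location_ref, got {location_ref!r}"
        )
    path = location_ref[len(matched_prefix):]
    # Split once: everything after the first "/" is the document id.
    parts = path.split("/", 1)
    if len(parts) != 2 or not parts[0] or not parts[1]:
        raise ValueError(f"Malformed location_ref: {location_ref!r}")
    return parts[0], parts[1]


azure_ref = split_location_ref("az_blob://user-1/doc-9")
supabase_ref = split_location_ref("object_storage://user-1/doc-9")
```

Both schemes resolve to the same pair, so the blob name built from it is the same for either storage backend.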

src/storage/object_storage/__init__.py ADDED

@@ -0,0 +1,13 @@

+"""Supabase S3-compatible object storage (read side).
+BRIDGE MODULE (Option C) — self-contained on purpose. It exists only so the
+tabular query path can read the processed Parquet from the same bucket the Go
+data plane writes to, after Go switched storage provider Azure Blob -> Supabase S3.
+Once Go exposes a source-data endpoint (docs/proposal_go_source_data_endpoint.md),
+delete this whole package and point TabularExecutor at that endpoint instead.
+"""
+from .supabase_s3 import object_storage
+__all__ = ["object_storage"]

src/storage/object_storage/supabase_s3.py ADDED

@@ -0,0 +1,85 @@

+"""Supabase S3-compatible object storage client (read side).
+Mirrors the Go data plane's Supabase S3 client (Orchestrator
+`internal/documents/supabase_s3.go`): path-style addressing against a custom
+endpoint, static credentials, single bucket.
+boto3 is synchronous; download runs in a worker thread so the async call sites
+(TabularExecutor._fetch_blob) stay non-blocking. Only the read path is
+implemented — Python is read-only here; Go owns all writes.
+"""
+from __future__ import annotations
+import asyncio
+from src.config.settings import settings
+from src.middlewares.logging import get_logger
+logger = get_logger("supabase_s3")
+class SupabaseS3Storage:
+    """Read-only client for the Supabase S3 bucket Go writes to."""
+    def __init__(self) -> None:
+        self._bucket = settings.supabase_s3_bucket
+        self._endpoint = settings.supabase_s3_endpoint.rstrip("/")
+        self._region = settings.supabase_s3_region
+        self._access_key_id = settings.supabase_s3_access_key_id
+        self._secret_access_key = settings.supabase_s3_secret_access_key
+        self._client = None  # lazy — avoid import-time failure on Azure deployments
+    def _is_configured(self) -> bool:
+        return all(
+            (
+                self._bucket,
+                self._endpoint,
+                self._region,
+                self._access_key_id,
+                self._secret_access_key,
+            )
+        )
+    def _get_client(self):
+        if self._client is None:
+            if not self._is_configured():
+                raise RuntimeError(
+                    "Supabase S3 storage is not fully configured "
+                    "(set STORAGE_PROVIDER=supabase_s3 and SUPABASE_S3_* env vars)"
+                )
+            import boto3
+            from botocore.config import Config
+            self._client = boto3.client(
+                "s3",
+                endpoint_url=self._endpoint,
+                region_name=self._region,
+                aws_access_key_id=self._access_key_id,
+                aws_secret_access_key=self._secret_access_key,
+                config=Config(
+                    signature_version="s3v4",
+                    s3={"addressing_style": "path"},  # path-style, like Go's UsePathStyle
+                ),
+            )
+        return self._client
+    def _download_sync(self, object_name: str) -> bytes:
+        client = self._get_client()
+        resp = client.get_object(Bucket=self._bucket, Key=object_name)
+        return resp["Body"].read()
+    async def download_file(self, object_name: str) -> bytes:
+        """Download an object's bytes. Interface-compatible with AzureBlobStorage.download_file."""
+        try:
+            logger.info(f"Downloading object {object_name}")
+            content = await asyncio.to_thread(self._download_sync, object_name)
+            logger.info(f"Successfully downloaded {object_name} ({len(content)} bytes)")
+            return content
+        except Exception as e:
+            logger.error(f"Failed to download object {object_name}", error=str(e))
+            raise
+# Singleton (lazy client construction — safe to import even when not configured)
+object_storage = SupabaseS3Storage()
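
Reviewer sketch of the two patterns this module combines: a client built lazily on first use, and a blocking download moved off the event loop with `asyncio.to_thread`. `FakeBlockingClient` and `LazyReader` are hypothetical stand-ins so the example runs without boto3 or credentials.

```python
import asyncio
import time


class FakeBlockingClient:
    """Hypothetical stand-in for a synchronous SDK client."""

    def get_bytes(self, key: str) -> bytes:
        time.sleep(0.05)  # simulate blocking network I/O
        return f"payload:{key}".encode()


class LazyReader:
    def __init__(self) -> None:
        self._client = None  # built on first use, not at import time
        self.builds = 0

    def _get_client(self) -> FakeBlockingClient:
        if self._client is None:
            self.builds += 1
            self._client = FakeBlockingClient()
        return self._client

    def _download_sync(self, key: str) -> bytes:
        return self._get_client().get_bytes(key)

    async def download_file(self, key: str) -> bytes:
        # Run the blocking call in a worker thread; the event loop stays free.
        return await asyncio.to_thread(self._download_sync, key)


async def main():
    reader = LazyReader()
    first = await reader.download_file("a")  # builds the client
    rest = await asyncio.gather(reader.download_file("b"), reader.download_file("c"))
    return reader, first, rest


reader, first, rest = asyncio.run(main())
```

The sketch awaits the first download alone before issuing concurrent ones. If the very first calls arrive concurrently, two worker threads could both see `_client is None` and each build a client. Guarding `_get_client` with a `threading.Lock` is one way to rule that out, if it turns out to matter for the real client.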