Agentic-Service-Data-Eyond-Catalog

sofhiaazzhr committed 1 day ago

Commit

78c598c

1 Parent(s): 6b3637d

[KM-624] Implement Compute Functions Tools

Files changed (14)

.gitignore +3 -1
ARCHITECTURE.md +6 -5
PHASE1_TO_PHASE2_REPORT.md +16 -3
REPO_CONTEXT.md +2 -2
src/tools/__init__.py +7 -0
src/tools/analytics/__init__.py +1 -0
src/tools/analytics/aggregation.py +91 -0
src/tools/analytics/comparison.py +108 -0
src/tools/analytics/decomposition.py +110 -0
src/tools/analytics/descriptive.py +111 -0
src/tools/analytics/quality.py +108 -0
src/tools/analytics/relationship.py +108 -0
src/tools/analytics/segmentation.py +124 -0
src/tools/analytics/temporal.py +159 -0

.gitignore CHANGED

@@ -48,4 +48,6 @@ software/
 tests/
 .claude/
-migratego/
+migratego/
+docs/specs/tabular_parquet_contract.md
+docs/specs/tabular_parquet.md

ARCHITECTURE.md CHANGED

@@ -1,7 +1,7 @@
 # Architecture — Data Eyond Agentic Service
-**Last updated**: 2026-05-07
-**Status**: Design phase — folder skeleton in place, implementation in progress
+**Last updated**: 2026-05-20
+**Status**: Phase 2 catalog path shipped; document ingestion has moved to a separate Go service. The long-term split is **Python = agent/ML layer, Go = data plane**; this document covers the Python side only.
 ---
@@ -21,7 +21,7 @@ A catalog-driven AI service for data analysis. Users upload documents and regist
 The architecture has two paths:
-- **Unstructured** (PDF, DOCX, TXT) — dense similarity over prose chunks (the right primitive for free-form text).
+- **Unstructured** (PDF, DOCX, TXT) — dense similarity over prose chunks (the right primitive for free-form text). **Ingestion is handled by a separate Go service**; this Python service reads embeddings from PGVector at query time.
 - **Structured** (databases, XLSX, CSV, Parquet) — a per-user **data catalog** describes what tables/columns exist; an LLM produces a structured **JSON intermediate representation (IR)** of the user's intent; a deterministic compiler turns the IR into SQL or pandas operations.
 The LLM produces *intent*, not query syntax. Deterministic code does the rest.
@@ -120,6 +120,7 @@ Compiler and executors are pure code. No LLM in the hot path of query constructi
 ### Ingestion (when user uploads a file or connects a DB)
 ```
+Structured sources (DB connect / XLSX / CSV / Parquet upload) — Python:
 source upload / DB connect
     ↓
 introspect schema (DB: information_schema; tabular: file headers + sample rows)
@@ -129,7 +130,7 @@ validate (Pydantic)
 write to catalog store (Postgres jsonb in `data_catalog`, keyed by user_id)
 ```
-For unstructured files: chunk + embed → PGVector.
+**Unstructured ingestion (PDF / DOCX / TXT) is handled by a separate Go service**, which writes chunks + embeddings into the `documents` collection in PGVector. The Python service does not own this path — it reads only.
 ### Query (per user message)
@@ -143,7 +144,7 @@ Load chat history
 IntentRouter LLM   →  needs_search?  source_hint?
     ↓
     ├── chat        → ChatbotAgent → SSE stream
-    ├── unstructured → DocumentRetriever → answerer
+    ├── unstructured → DocumentRetriever (raw SQL: pgvector `<=>` cosine or `<+>` manhattan) → answerer
     └── structured  →
             CatalogReader (load full Cs ∪ Ct for user)
                 ↓
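
Reviewer sketch: ARCHITECTURE.md says the LLM emits a JSON IR and a deterministic compiler turns it into SQL. The snippet below illustrates that split under stated assumptions. The IR keys (`table`, `select`, `filters`, `limit`), the `compile_sql` helper, and the `ALLOWED_OPS` whitelist are all hypothetical; the project's real `QueryIR` schema and compiler are not part of this diff.

```python
from __future__ import annotations

# Hypothetical operator whitelist; the real compiler's set is not shown in this diff.
ALLOWED_OPS = {"eq": "=", "gt": ">", "lt": "<"}


def compile_sql(ir: dict, catalog: dict[str, set[str]]) -> tuple[str, list]:
    """Compile an illustrative single-table IR into (sql, params). No LLM involved."""
    table = ir["table"]
    if table not in catalog:
        raise ValueError(f"unknown table: {table}")
    filters = ir.get("filters", [])
    referenced = list(ir["select"]) + [flt["column"] for flt in filters]
    unknown = [c for c in referenced if c not in catalog[table]]
    if unknown:
        raise ValueError(f"unknown columns: {unknown}")

    # Identifiers are interpolated only after the catalog check above;
    # values always travel as bound parameters ($1, $2, ...).
    sql = f"SELECT {', '.join(ir['select'])} FROM {table}"
    params: list = []
    clauses = []
    for flt in filters:
        if flt["op"] not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {flt['op']}")
        params.append(flt["value"])
        clauses.append(f"{flt['column']} {ALLOWED_OPS[flt['op']]} ${len(params)}")
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    if ir.get("limit") is not None:
        sql += f" LIMIT {int(ir['limit'])}"
    return sql, params


# Usage: the catalog plays the role of the per-user data catalog.
catalog = {"orders": {"region", "revenue"}}
ir = {
    "table": "orders",
    "select": ["region", "revenue"],
    "filters": [{"column": "revenue", "op": "gt", "value": 100}],
    "limit": 10,
}
sql, params = compile_sql(ir, catalog)
```

The same IR always compiles to the same SQL text, and an IR that names a table or column missing from the catalog is rejected before anything reaches the database.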

PHASE1_TO_PHASE2_REPORT.md CHANGED

@@ -52,7 +52,7 @@ Everything else — IR validation, SQL/pandas compilation, execution — is dete
 | `src/agents/chatbot.py` (Phase 1) → deleted, then `src/agents/answer_agent.py` (`AnswerAgent`) → renamed | `src/agents/chatbot.py::ChatbotAgent` | Final answer formation; streams via `astream` |
 | `src/knowledge/parquet_service.py` | `src/storage/parquet.py` | Parquet upload/download helper |
 | `src/pipeline/document_pipeline/document_pipeline.py` (folder) | `src/pipeline/document_pipeline.py` (flat) | Single module |
-| `src/rag/retrievers/document.py` | `src/retrieval/document.py` | `DocumentRetriever` migrated; tabular file types filtered out of results |
+| `src/rag/retrievers/document.py` | `src/retrieval/document.py` | `DocumentRetriever` migrated; tabular file types filtered out of results. **Post-report update (mentor commit 61c746f, 2026-05-20):** rewritten to raw SQL (pgvector `<=>` cosine, `<+>` manhattan only) to dodge asyncpg type-mapping issues with the Go-ingested schema. MMR / euclidean / inner_product dropped. |
 | `src/rag/router.py` | `src/retrieval/router.py` | `RetrievalRouter`, Redis-cached, unstructured-only; dead `db: AsyncSession` + `source_hint` params removed |
 | `src/rag/base.py` (`RetrievalResult`, `BaseRetriever`) | `src/retrieval/base.py` | Same dataclass + ABC |
@@ -177,7 +177,7 @@ POST /api/v1/chat/stream
                 │
                 ├── source_hint == "unstructured"
                 │     → RetrievalRouter.retrieve()                 [retrieval/router.py, Redis-cached]
-                │         → DocumentRetriever (PGVector MMR/cosine/etc.)
+                │         → DocumentRetriever (raw SQL: pgvector `<=>` cosine or `<+>` manhattan)
                 │     → ChatbotAgent.astream(chunks=...)
                 │
                 └── source_hint == "structured"
@@ -237,7 +237,7 @@ POST /api/v1/chat/stream
 | Top-level chat orchestration | Inline in `api/v1/chat.py` | `agents/chat_handler.py::ChatHandler` | Extracted to a reusable module |
 | Final answer formation | `agents/chatbot.py` (Phase 1 LangChain) | `agents/chatbot.py::ChatbotAgent` (was `AnswerAgent` in `answer_agent.py` mid-cycle) | Rewritten + renamed |
 | Schema retrieval (DB / tabular) | `rag/retrievers/schema.py` + PGVector chunks | **Removed**. Replaced by catalog (`catalog/store.py` jsonb) loaded verbatim into planner prompt | Whole concept replaced |
-| Doc retrieval (PDF / DOCX / TXT) | `rag/retrievers/document.py`, `rag/router.py` | `retrieval/document.py`, `retrieval/router.py` | Moved; Redis cache restored; tabular files filtered |
+| Doc retrieval (PDF / DOCX / TXT) | `rag/retrievers/document.py`, `rag/router.py` | `retrieval/document.py`, `retrieval/router.py` | Moved; Redis cache restored; tabular files filtered. **Post-report update:** rewritten to raw SQL (cosine / manhattan only); collection renamed `document_embeddings` → `documents` to match the Go ingestion service. |
 | Query writing | `query/query_executor.py` + `models/sql_query.py` (LLM writes SQL) | `query/planner/service.py` (LLM writes IR) + `query/compiler/sql.py` (deterministic) | LLM emits intent, not SQL |
 | DB execution | `query/executors/db_executor.py` | `query/executor/db.py::DbExecutor` | Folder renamed (`executors` → `executor`); sqlglot guard + RO txn + 30 s timeout kept |
 | Tabular execution | `query/executors/tabular.py` | `query/executor/tabular.py::TabularExecutor` | Parquet-only; pandas compiler split out |
@@ -258,3 +258,16 @@ POST /api/v1/chat/stream
 **Bottom line.** Every Phase 1 file under `src/rag/`, `src/tools/`, `src/query/executors/`, `src/query/query_executor.py`, `src/query/base.py`, `src/api/v1/knowledge.py`, and `src/config/agents/` is gone. Phase 1 introspection helpers under `src/pipeline/db_pipeline/` and `src/database_client/` are still imported by Phase 2 — they were not rewritten, just wrapped. The three LLM call sites are now explicit and the SQL-writing one no longer exists; the planner emits a Pydantic-validated `QueryIR` instead.
 The one filename gotcha to remember: the **intent router** still lives at `src/agents/orchestration.py` as class `OrchestratorAgent` (Phase 1 name kept for import-site compatibility, Phase 2 body). The matching prompt and tests use the `intent_router` name, but the source module does not.
+---
+## 5. Addendum — post-report changes (2026-05-20, mentor commit `61c746f`)
+This report was originally written as a snapshot at Phase 2 completion. The Phase 2 architecture itself hasn't changed, but a few implementation details have shifted as the Go migration progresses. Captured here so the report stays trustworthy:
+- **Doc ingestion is now a Go service.** PDF/DOCX/TXT chunking + embedding + writes into PGVector are no longer done by Python. The Python service reads only.
+- **PGVector collection renamed:** `document_embeddings` → `documents` (to match the Go service's writes). Touched files: `db/postgres/vector_store.py`, `retrieval/document.py`.
+- **`DocumentRetriever` rewritten to raw SQL.** Uses pgvector operators directly (`<=>` cosine, `<+>` manhattan). The LangChain ORM path couldn't cope with the schema written by the Go service (asyncpg type-mapping issues — id String vs UUID, jsonb_path_match binding quirks). MMR / euclidean / inner_product were dropped as part of the rewrite.
+- **Intent router defaults flipped.** Ambiguous topical/knowledge questions now prefer `unstructured` (was `structured`). Indonesian few-shot examples added to the prompt.
+- **Cache management endpoints added:** `DELETE /api/v1/chat/cache`, `DELETE /api/v1/chat/cache/room/{id}`, `DELETE /api/v1/retrieval/cache/{user_id}`. Redis chat cache now stores `{response, sources}` (was just `response`) so cached replies repopulate `message_sources`.
+- **Direction.** The long-term split is **Python = agent/ML layer, Go = data plane**. More pieces are expected to follow doc ingestion out of Python.
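
Reviewer sketch: the addendum says `DocumentRetriever` now issues raw SQL with pgvector's `<=>` (cosine) and `<+>` (manhattan) operators and no longer supports MMR, euclidean, or inner product. One way to express that method-to-operator choice is shown below. The function name, the `doc_chunks` table, and the column names are placeholders invented for illustration; the schema written by the Go service is not shown in this diff.

```python
# pgvector distance operators for the two methods the addendum keeps.
DISTANCE_OPERATORS = {
    "cosine": "<=>",     # cosine distance
    "manhattan": "<+>",  # L1 (taxicab) distance
}


def build_similarity_sql(method: str = "cosine") -> str:
    """Return a parameterized top-k query; a smaller distance means a closer match."""
    try:
        op = DISTANCE_OPERATORS[method]
    except KeyError:
        raise ValueError(
            f"unsupported method '{method}'; supported: {sorted(DISTANCE_OPERATORS)}"
        ) from None
    # Table and column names are placeholders, not the Go service's real schema.
    # $1 = query embedding, $2 = user id, $3 = k.
    return (
        "SELECT id, content, "
        f"embedding {op} $1::vector AS distance "
        "FROM doc_chunks "
        "WHERE user_id = $2 "
        "ORDER BY distance "
        "LIMIT $3"
    )


cosine_sql = build_similarity_sql("cosine")
manhattan_sql = build_similarity_sql("manhattan")
```

Only the operator is chosen from a fixed whitelist; the embedding, user id, and limit stay bound parameters, so a dropped method such as `mmr` fails loudly instead of producing SQL.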

REPO_CONTEXT.md CHANGED

@@ -156,7 +156,7 @@ makes any LLM calls.)
 | `db/postgres/connection.py` | two async engines: `engine` (app) and `_pgvector_engine` (PGVector) |
 | `db/postgres/init_db.py` | startup: creates `vector` extension, all tables, HNSW + GIN indexes |
 | `db/postgres/models.py` | SQLAlchemy app tables (users, rooms, chat messages, …) |
-| `db/postgres/vector_store.py` | shared PGVector instance (collection `document_embeddings`) |
+| `db/postgres/vector_store.py` | shared PGVector instance (collection `documents` — written by Go ingestion service) |
 | `db/redis/connection.py` | async Redis client |
 | `storage/az_blob/az_blob.py` | Azure Blob async wrapper (uploads + Parquet) |
 | `middlewares/{cors,logging,rate_limit}.py` | CORS allow-all (POC), structlog JSON, slowapi |
@@ -318,7 +318,7 @@ Single-table only in v1. `having`, `offset`, boolean filter trees, `distinct`, j
 | QueryService | ✅ | plan → validate → retry-on-fail (max 3) → dispatch → execute → `QueryResult` |
 | `ChatbotAgent` + prompt + guardrails | ✅ | Renamed from `AnswerAgent` in Cleanup PR. Guardrails appended to `chatbot_system.md` |
 | `ChatHandler` (top-level chat orchestrator) | ✅ | SSE events: `intent` / `chunk` / `done` / `error` |
-| `DocumentRetriever` + `RetrievalRouter` (Redis-cached) | ✅ | Migrated from `src/rag/` (now deleted); MMR/cosine/euclidean/manhattan/inner_product |
+| `DocumentRetriever` + `RetrievalRouter` (Redis-cached) | ✅ | Migrated from `src/rag/` (now deleted). Mentor commit `61c746f` rewrote to raw SQL (pgvector `<=>` cosine, `<+>` manhattan) to dodge asyncpg type-mapping issues with Go-ingested schema. Methods reduced to `cosine | manhattan`. Collection: `documents`. |
 | `/api/v1/chat/stream` | ✅ | Rewired to `ChatHandler`; Redis cache + fast intent + history + message persistence remain in chat.py |
 | `/api/v1/db-clients/{id}/ingest` | ✅ | Calls only `on_db_registered`; Phase 1 dual-write removed |
 | `/api/v1/document/{upload,process,delete}` | ✅ | `/process` triggers `on_tabular_uploaded` for CSV/XLSX |
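
Reviewer sketch: the QueryService row above compresses its control flow into "plan → validate → retry-on-fail (max 3)". A minimal reading of that loop is sketched below. Everything here is an assumption for illustration: the `plan_with_retry` name, the callable signatures, feeding the previous validation error back to the planner, and treating "max 3" as three attempts in total.

```python
from __future__ import annotations

from typing import Callable

MAX_ATTEMPTS = 3  # assumed reading of "retry-on-fail (max 3)"


def plan_with_retry(
    plan: Callable[[str, str | None], dict],
    validate: Callable[[dict], str | None],
    question: str,
) -> dict:
    """Ask the planner for an IR; on a validation error, retry with that error as a hint."""
    last_error: str | None = None
    for _ in range(MAX_ATTEMPTS):
        ir = plan(question, last_error)
        last_error = validate(ir)  # None means the IR passed validation
        if last_error is None:
            return ir
    raise RuntimeError(f"planner failed after {MAX_ATTEMPTS} attempts: {last_error}")


# Usage with stand-in callables: the first plan is invalid, the second is accepted.
seen_hints = []


def fake_plan(question, error):
    seen_hints.append(error)
    return {"table": "orders"} if error else {"table": "missing"}


def fake_validate(ir):
    return None if ir["table"] == "orders" else "unknown table: missing"


ir = plan_with_retry(fake_plan, fake_validate, "total revenue?")
```

Dispatch and execution would only start once this loop returns a validated IR.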

src/tools/__init__.py ADDED

@@ -0,0 +1,7 @@

+"""Analytics & utility tools (KM-608).
+Each tool is a deterministic computation (no LLM, no SQL generation) invoked by
+the Planner/TaskRunner. The compute layer (calculation logic) lives in per-family
+submodules (e.g. `analytics`); the wrapper layer (ToolSpec + ToolOutput +
+registry) is added once the Planner seam is settled.
+"""

src/tools/analytics/__init__.py ADDED

@@ -0,0 +1 @@

+"""Analytics tool family (KM-608)."""

src/tools/analytics/aggregation.py ADDED

@@ -0,0 +1,91 @@

+"""analyze_aggregate — group-by aggregation (KM-608).
+An analytical "family" tool: in ONE call it groups rows by one or more keys
+and computes aggregates (sum, mean, count, min, max, median, nunique) per
+group. This is the deterministic compute twin of the Planner's
+`query_structured` step — the wrapper layer later maps a QueryIR onto this.
+STATUS: compute layer only — the function takes an already-materialized
+DataFrame. The wrapper layer (fetching data from the catalog via source_id,
+the ToolOutput envelope, ToolSpec registration) is added once the Planner
+seam (KM-418) is settled. Keeping compute separate from data-fetching makes
+this function easy to unit-test in isolation and stable when wrapped.
+"""
+from __future__ import annotations
+import pandas as pd
+from src.tools.analytics.descriptive import ColumnNotFoundError
+# Aggregation functions the tool understands. Whitelisted on purpose so an
+# unknown function fails loudly instead of silently doing the wrong thing.
+SUPPORTED_AGGS = ("sum", "mean", "count", "min", "max", "median", "nunique")
+class UnsupportedAggregationError(ValueError):
+    """Requested aggregation is not in SUPPORTED_AGGS (maps to UNSUPPORTED_AGG)."""
+def _clean(value: object) -> object:
+    """Convert numpy scalars to plain Python so the output is JSON-clean."""
+    if hasattr(value, "item"):
+        return value.item()
+    return value
+def analyze_aggregate(
+    df: pd.DataFrame,
+    aggregations: dict[str, list[str]],
+    group_by: list[str] | None = None,
+) -> list[dict[str, object]]:
+    """Group-by aggregation over one or many keys.
+    Args:
+        df: already-materialized data (in the real system the wrapper fetches
+            this from a source_id).
+        aggregations: which columns to aggregate and how, e.g.
+            {"revenue": ["sum", "mean"], "order_id": ["count"]}.
+        group_by: grouping keys. If None/empty, the whole table is aggregated
+            into a single row.
+    Returns:
+        list[dict]: one row per group. Each row holds the group keys plus a
+        column per aggregate, named "<column>_<func>" (e.g. "revenue_sum").
+    Raises:
+        ColumnNotFoundError: if any group_by or aggregated column is absent.
+        UnsupportedAggregationError: if a requested function is not supported.
+    """
+    group_by = group_by or []
+    # Validate columns first (fail-fast on caller mistakes).
+    referenced = list(group_by) + list(aggregations.keys())
+    missing = [c for c in referenced if c not in df.columns]
+    if missing:
+        raise ColumnNotFoundError(f"columns not found: {missing}")
+    # Validate aggregation functions.
+    for col, funcs in aggregations.items():
+        bad = [f for f in funcs if f not in SUPPORTED_AGGS]
+        if bad:
+            raise UnsupportedAggregationError(
+                f"unsupported aggregation(s) {bad} for column '{col}'; "
+                f"supported: {list(SUPPORTED_AGGS)}"
+            )
+    # Build named aggregations: {"revenue_sum": ("revenue", "sum"), ...}
+    named = {
+        f"{col}_{func}": (col, func)
+        for col, funcs in aggregations.items()
+        for func in funcs
+    }
+    # No grouping → aggregate the entire table into a single row.
+    if not group_by:
+        row = {name: _clean(df[col].agg(func)) for name, (col, func) in named.items()}
+        return [row]
+    # dropna=False keeps groups whose key is null so completeness is honest.
+    grouped = df.groupby(group_by, dropna=False).agg(**named).reset_index()
+    return [{k: _clean(v) for k, v in record.items()} for record in grouped.to_dict("records")]
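
Reviewer sketch: to make the output contract of `analyze_aggregate` easier to see (one row per group, one `"<column>_<func>"` key per aggregate, unknown functions rejected), here is a pandas-free illustration of the same shape. `aggregate_rows` and `AGG_FUNCS` are hypothetical stand-ins, not part of this commit. The sketch covers only a subset of `SUPPORTED_AGGS`, ignores nulls, and keeps groups in first-seen order, whereas pandas sorts group keys by default.

```python
from __future__ import annotations

import statistics
from collections import defaultdict

# Stand-ins for a subset of the pandas aggregations named in SUPPORTED_AGGS.
AGG_FUNCS = {
    "sum": sum,
    "mean": statistics.fmean,
    "count": len,
    "min": min,
    "max": max,
}


def aggregate_rows(rows, aggregations, group_by=None):
    """Group a list of dict rows and emit '<column>_<func>' keys per group."""
    group_by = group_by or []
    # Whitelist check first, so a typo fails loudly before any work is done.
    for col, funcs in aggregations.items():
        bad = [f for f in funcs if f not in AGG_FUNCS]
        if bad:
            raise ValueError(f"unsupported aggregation(s) {bad} for column '{col}'")

    groups: dict[tuple, list[dict]] = defaultdict(list)
    for row in rows:
        # With no group_by every row maps to the empty key, i.e. one group.
        groups[tuple(row[k] for k in group_by)].append(row)

    out = []
    for key, members in groups.items():
        record = dict(zip(group_by, key))
        for col, funcs in aggregations.items():
            values = [m[col] for m in members]
            for func in funcs:
                record[f"{col}_{func}"] = AGG_FUNCS[func](values)
        out.append(record)
    return out


rows = [
    {"region": "A", "revenue": 100, "order_id": 1},
    {"region": "A", "revenue": 50, "order_id": 2},
    {"region": "B", "revenue": 70, "order_id": 3},
]
result = aggregate_rows(
    rows, {"revenue": ["sum", "mean"], "order_id": ["count"]}, group_by=["region"]
)
# result[0] == {"region": "A", "revenue_sum": 150, "revenue_mean": 75.0, "order_id_count": 2}
```

Flat `"<column>_<func>"` keys keep each row a plain JSON object, which is presumably why the module prefers named aggregation over pandas' default multi-level column index.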

src/tools/analytics/comparison.py ADDED

@@ -0,0 +1,108 @@

+"""analyze_comparison — compare a metric across two groups (KM-608).
+An analytical "family" tool: in ONE call it aggregates a value for two groups
+of a dimension (e.g. region "A" vs "B", channel "online" vs "store") and
+reports the gap between them — absolute difference, percent difference, and
+direction. group_a is treated as the baseline. Answers questions like
+"how does revenue in region A compare to region B?".
+STATUS: compute layer only — the function takes an already-materialized
+DataFrame. The wrapper layer (fetching data from the catalog via source_id,
+the ToolOutput envelope, ToolSpec registration) is added once the Planner
+seam (KM-418) is settled. Keeping compute separate from data-fetching makes
+this function easy to unit-test in isolation and stable when wrapped.
+"""
+from __future__ import annotations
+import pandas as pd
+from src.tools.analytics.descriptive import ColumnNotFoundError
+# How to aggregate the value within each group before comparing.
+SUPPORTED_AGGS = ("sum", "mean", "count", "min", "max", "median")
+class UnsupportedAggregationError(ValueError):
+    """The requested aggregation is not supported (maps to error_code UNSUPPORTED_AGG)."""
+class GroupNotFoundError(ValueError):
+    """A requested group value does not occur in the dimension column (maps to GROUP_NOT_FOUND)."""
+def analyze_comparison(
+    df: pd.DataFrame,
+    dimension: str,
+    value_column: str,
+    group_a: object,
+    group_b: object,
+    agg: str = "sum",
+) -> dict[str, object]:
+    """Compare one aggregated metric between two groups of a dimension.
+    Args:
+        df: already-materialized data (in the real system the wrapper fetches
+            this from a source_id).
+        dimension: the categorical column whose values define the two groups.
+        value_column: numeric column to aggregate for each group.
+        group_a: baseline group value (the "from").
+        group_b: comparison group value (the "to").
+        agg: how to aggregate within each group — one of SUPPORTED_AGGS.
+    Returns:
+        dict with:
+            dimension, value_column, agg  — echo of the chosen settings
+            group_a, value_a              — baseline group + its aggregate
+            group_b, value_b              — comparison group + its aggregate
+            diff_abs                      — value_b - value_a
+            diff_pct                      — diff_abs / value_a, or None if value_a == 0
+            comparison                    — "higher" | "lower" | "equal" (b relative to a)
+    Raises:
+        ColumnNotFoundError: if dimension or value_column is absent.
+        UnsupportedAggregationError: if agg is not supported.
+        GroupNotFoundError: if group_a or group_b has no rows.
+    """
+    missing = [c for c in (dimension, value_column) if c not in df.columns]
+    if missing:
+        raise ColumnNotFoundError(f"columns not found: {missing}")
+    if agg not in SUPPORTED_AGGS:
+        raise UnsupportedAggregationError(
+            f"unsupported aggregation '{agg}'; supported: {list(SUPPORTED_AGGS)}"
+        )
+    rows_a = df.loc[df[dimension] == group_a, value_column]
+    rows_b = df.loc[df[dimension] == group_b, value_column]
+    empty = [g for g, rows in ((group_a, rows_a), (group_b, rows_b)) if rows.empty]
+    if empty:
+        raise GroupNotFoundError(
+            f"no rows for group(s) {empty} in column '{dimension}'"
+        )
+    value_a = float(rows_a.agg(agg))
+    value_b = float(rows_b.agg(agg))
+    diff_abs = value_b - value_a
+    diff_pct = (diff_abs / value_a) if value_a != 0 else None
+    if diff_abs > 0:
+        comparison = "higher"
+    elif diff_abs < 0:
+        comparison = "lower"
+    else:
+        comparison = "equal"
+    return {
+        "dimension": dimension,
+        "value_column": value_column,
+        "agg": agg,
+        "group_a": group_a,
+        "value_a": value_a,
+        "group_b": group_b,
+        "value_b": value_b,
+        "diff_abs": diff_abs,
+        "diff_pct": diff_pct,
+        "comparison": comparison,
+    }
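
Reviewer sketch: the gap arithmetic in `analyze_comparison` is small enough to check by hand. The helper below mirrors only the last step (`diff_abs`, `diff_pct`, `comparison`) on two already-aggregated values; `compare_values` is a hypothetical name used for illustration, not a function in this commit.

```python
from __future__ import annotations


def compare_values(value_a: float, value_b: float) -> dict[str, object]:
    """Mirror the diff logic: group_a is the baseline, group_b is compared against it."""
    diff_abs = value_b - value_a
    # Percent change is undefined for a zero baseline, so it degrades to None.
    diff_pct = (diff_abs / value_a) if value_a != 0 else None
    if diff_abs > 0:
        comparison = "higher"
    elif diff_abs < 0:
        comparison = "lower"
    else:
        comparison = "equal"
    return {"diff_abs": diff_abs, "diff_pct": diff_pct, "comparison": comparison}


growth = compare_values(200.0, 250.0)          # diff_pct 0.25, "higher"
zero_base = compare_values(0.0, 10.0)          # diff_pct None, "higher"
negative_base = compare_values(-100.0, -50.0)  # diff_pct -0.5, yet "higher"
```

The third case is worth noting for the wrapper layer: with a negative baseline, `diff_pct` comes out negative while `comparison` still reads `"higher"`. If the sign of the percentage should follow the direction, dividing by `abs(value_a)` is one option.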

src/tools/analytics/decomposition.py ADDED

@@ -0,0 +1,110 @@

+"""analyze_contribution — share-of-total per category (KM-608).
+An analytical "family" tool: in ONE call it breaks a total down into each
+category's contribution (absolute value, share of total, and running
+cumulative share), sorted largest-first. This is a single snapshot — "who
+makes up the total right now" — as opposed to analyze_comparison (two groups)
+or analyze_trend (movement over time). Answers questions like "which region
+drives most of revenue?" and supports Pareto (80/20) reasoning.
+STATUS: compute layer only — the function takes an already-materialized
+DataFrame. The wrapper layer (fetching data from the catalog via source_id,
+the ToolOutput envelope, ToolSpec registration) is added once the Planner
+seam (KM-418) is settled. Keeping compute separate from data-fetching makes
+this function easy to unit-test in isolation and stable when wrapped.
+"""
+from __future__ import annotations
+import pandas as pd
+from src.tools.analytics.descriptive import ColumnNotFoundError
+# Share-of-total is most meaningful for additive aggregates (sum/count), but
+# mean/median/min/max are also allowed; "total" is then the sum of the group aggregates.
+SUPPORTED_AGGS = ("sum", "mean", "count", "min", "max", "median")
+class UnsupportedAggregationError(ValueError):
+    """The requested aggregation is not supported (maps to error_code UNSUPPORTED_AGG)."""
+def _clean(value: object) -> object:
+    """Convert numpy scalars to plain Python so the output is JSON-clean."""
+    if hasattr(value, "item"):
+        return value.item()
+    return value
+def analyze_contribution(
+    df: pd.DataFrame,
+    dimension: str,
+    value_column: str,
+    agg: str = "sum",
+    top_n: int | None = None,
+) -> dict[str, object]:
+    """Contribution (share of total) of each category, largest first.
+    Args:
+        df: already-materialized data (in the real system the wrapper fetches
+            this from a source_id).
+        dimension: categorical column to break the total down by.
+        value_column: numeric column to aggregate per category.
+        agg: how to aggregate within each category — one of SUPPORTED_AGGS.
+        top_n: if set, keep the top N categories and lump the remainder into a
+            single "Others" row.
+    Returns:
+        dict with:
+            dimension, value_column, agg  — echo of the chosen settings
+            total                         — sum of all category aggregates
+            items                         — [{"category", "value", "share",
+                                             "cumulative_share"}] largest first
+    Raises:
+        ColumnNotFoundError: if dimension or value_column is absent.
+        UnsupportedAggregationError: if agg is not supported.
+    """
+    missing = [c for c in (dimension, value_column) if c not in df.columns]
+    if missing:
+        raise ColumnNotFoundError(f"columns not found: {missing}")
+    if agg not in SUPPORTED_AGGS:
+        raise UnsupportedAggregationError(
+            f"unsupported aggregation '{agg}'; supported: {list(SUPPORTED_AGGS)}"
+        )
+    grouped = df.groupby(dimension, dropna=False)[value_column].agg(agg)
+    grouped = grouped.sort_values(ascending=False)
+    pairs = list(grouped.items())
+    # Optionally collapse the long tail into a single "Others" bucket.
+    if top_n is not None and len(pairs) > top_n:
+        head = pairs[:top_n]
+        others_value = sum(val for _, val in pairs[top_n:])
+        head.append(("Others", others_value))
+        pairs = head
+    total = sum(val for _, val in pairs)
+    items = []
+    cumulative = 0.0
+    for cat, val in pairs:
+        share = (val / total) if total else None
+        if share is not None:
+            cumulative += share
+        items.append(
+            {
+                "category": _clean(cat),
+                "value": _clean(val),
+                "share": share,
+                "cumulative_share": cumulative if total else None,
+            }
+        )
+    return {
+        "dimension": dimension,
+        "value_column": value_column,
+        "agg": agg,
+        "total": _clean(total),
+        "items": items,
+    }
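
Reviewer sketch: the share, cumulative-share, and "Others" logic of `analyze_contribution` can be followed on plain numbers. The helper below starts from per-category aggregates that have already been computed; `contribution` is a hypothetical name used for illustration, not the function in this commit.

```python
from __future__ import annotations


def contribution(values: dict[str, float], top_n: int | None = None) -> list[dict[str, object]]:
    """Share of total per category, largest first, with an optional 'Others' bucket."""
    pairs = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
    # Collapse the long tail into one bucket, appended after the top N.
    if top_n is not None and len(pairs) > top_n:
        head = pairs[:top_n]
        head.append(("Others", sum(val for _, val in pairs[top_n:])))
        pairs = head
    total = sum(val for _, val in pairs)

    items = []
    cumulative = 0.0
    for cat, val in pairs:
        # A zero total makes every share undefined, so it degrades to None.
        share = (val / total) if total else None
        if share is not None:
            cumulative += share
        items.append(
            {
                "category": cat,
                "value": val,
                "share": share,
                "cumulative_share": cumulative if total else None,
            }
        )
    return items


items = contribution({"A": 50.0, "B": 30.0, "C": 15.0, "D": 5.0}, top_n=2)
# Categories come back as A, B, Others with values 50.0, 30.0, 20.0.
```

Two properties follow from this structure. The "Others" row is always last, so it can be larger than entries in the head when the tail is heavy. For non-additive aggregates such as `mean` or `median`, the "Others" value is a sum of group aggregates rather than an aggregate over the tail rows.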

src/tools/analytics/descriptive.py ADDED

@@ -0,0 +1,111 @@

+"""analyze_descriptive — single/multi-column EDA (KM-608).
+An analytical "family" tool: in ONE call it computes a column's center,
+spread, shape, and completeness (mean, median, mode, std, variance,
+quartiles, min/max, skew, null_rate).
+STATUS: compute layer only — the function takes an already-materialized
+DataFrame. The wrapper layer (fetching data from the catalog via source_id,
+the ToolOutput envelope, ToolSpec registration) is added once the Planner
+seam (KM-418) is settled. Keeping compute separate from data-fetching makes
+this function easy to unit-test in isolation and stable when wrapped.
+"""
+from __future__ import annotations
+import pandas as pd
+# Default metrics used when the caller does not narrow them via `metrics`.
+DEFAULT_METRICS = (
+    "count",
+    "mean",
+    "median",
+    "mode",
+    "std",
+    "var",
+    "q1",
+    "q3",
+    "min",
+    "max",
+    "skew",
+    "null_count",
+    "null_rate",
+)
+class ColumnNotFoundError(ValueError):
+    """A requested column is absent from the DataFrame (maps to error_code COLUMN_NOT_FOUND)."""
+def _describe_one(series: pd.Series, metrics: tuple[str, ...]) -> dict[str, object]:
+    """Compute descriptive metrics for a single column.
+    Numeric metrics are computed over non-null values. `count`, `null_count`,
+    and `null_rate` are computed over all rows (nulls included) so they reflect
+    completeness as-is. Numeric metrics on a non-numeric column, and undefined
+    cases (e.g. std of a single value), return None: they degrade gracefully
+    instead of raising.
+    """
+    total = len(series)
+    non_null = series.dropna()
+    is_numeric = pd.api.types.is_numeric_dtype(series)
+    out: dict[str, object] = {}
+    for m in metrics:
+        if m == "count":
+            out["count"] = int(total)
+        elif m == "null_count":
+            out["null_count"] = int(series.isna().sum())
+        elif m == "null_rate":
+            out["null_rate"] = float(series.isna().mean()) if total else 0.0
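+        elif m == "mode":
+            modes = non_null.mode()
+            # .item() turns a numpy scalar into plain Python so the output stays JSON-clean.
+            top = modes.iloc[0] if not modes.empty else None
+            out["mode"] = top.item() if hasattr(top, "item") else top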
+        elif not is_numeric:
+            out[m] = None
+        elif m == "mean":
+            out["mean"] = float(non_null.mean()) if not non_null.empty else None
+        elif m == "median":
+            out["median"] = float(non_null.median()) if not non_null.empty else None
+        elif m == "std":
+            out["std"] = float(non_null.std()) if non_null.shape[0] > 1 else None
+        elif m == "var":
+            out["var"] = float(non_null.var()) if non_null.shape[0] > 1 else None
+        elif m == "q1":
+            out["q1"] = float(non_null.quantile(0.25)) if not non_null.empty else None
+        elif m == "q3":
+            out["q3"] = float(non_null.quantile(0.75)) if not non_null.empty else None
+        elif m == "min":
+            out["min"] = float(non_null.min()) if not non_null.empty else None
+        elif m == "max":
+            out["max"] = float(non_null.max()) if not non_null.empty else None
+        elif m == "skew":
+            out["skew"] = float(non_null.skew()) if non_null.shape[0] > 2 else None
+    return out
+def analyze_descriptive(
+    df: pd.DataFrame,
+    column_ids: list[str],
+    metrics: list[str] | None = None,
+) -> dict[str, dict[str, object]]:
+    """Descriptive EDA for one or many columns.
+    Args:
+        df: already-materialized data (in the real system the wrapper fetches
+            this from a source_id).
+        column_ids: columns to analyze.
+        metrics: subset of metrics; defaults to all of DEFAULT_METRICS.
+    Returns:
+        dict: { column_id: { metric: value, ... }, ... }
+    Raises:
+        ColumnNotFoundError: if any column_id is absent from df.
+    """
+    chosen = tuple(metrics) if metrics else DEFAULT_METRICS
+    missing = [c for c in column_ids if c not in df.columns]
+    if missing:
+        raise ColumnNotFoundError(f"columns not found: {missing}")
+    return {col: _describe_one(df[col], chosen) for col in column_ids}
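
The "return None instead of raising" behaviour described in `_describe_one` can be illustrated in isolation. This is a minimal standalone sketch, not a call into the module: `safe_std` is a hypothetical helper that mirrors only the `std` branch, and the sample data is made up.

```python
from __future__ import annotations

import pandas as pd


def safe_std(series: pd.Series) -> float | None:
    """Sample standard deviation over non-null values, or None if undefined."""
    non_null = series.dropna()
    # pandas yields NaN for the std of a single value; report None instead.
    return float(non_null.std()) if non_null.shape[0] > 1 else None


single = pd.Series([5.0, None])             # one usable value
several = pd.Series([1.0, 2.0, 3.0, None])  # three usable values

std_single = safe_std(single)
std_several = safe_std(several)
null_rate = float(several.isna().mean())    # computed over all four rows

print(std_single, std_several, null_rate)
```

The null rate is taken over every row, while the standard deviation only sees the non-null values, which is the split the docstring describes.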

src/tools/analytics/quality.py ADDED

	@@ -0,0 +1,108 @@

+"""analyze_profile — per-column data-quality profile (KM-608).
+An analytical "family" tool: in ONE call it profiles each column's health —
+dtype, inferred type, completeness (null count/rate), cardinality (distinct
+count/rate, constant flag), and — for numeric columns — min/max/mean plus an
+IQR-based outlier count; for non-numeric columns the most frequent value.
+Answers "is this data clean enough to analyze?" and surfaces issues (lots of
+nulls, a constant column, outliers) before deeper analysis.
+STATUS: compute layer only — the function takes an already-materialized
+DataFrame. The wrapper layer (fetching data from the catalog via source_id,
+the ToolOutput envelope, ToolSpec registration) is added once the Planner
+seam (KM-418) is settled. Keeping compute separate from data-fetching makes
+this function easy to unit-test in isolation and stable when wrapped.
+"""
+from __future__ import annotations
+import pandas as pd
+from src.tools.analytics.descriptive import ColumnNotFoundError
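+def _clean(value: object) -> object:
+    """Convert numpy scalars and timestamps to JSON-friendly Python values."""
+    if isinstance(value, pd.Timestamp):
+        return value.isoformat()  # e.g. top_value of a datetime column
+    if hasattr(value, "item"):
+        return value.item()
+    return value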
+def _profile_one(series: pd.Series) -> dict[str, object]:
+    """Build the quality profile for a single column."""
+    total = len(series)
+    non_null = series.dropna()
+    nn = len(non_null)
+    distinct = int(series.nunique(dropna=True))
+    is_bool = pd.api.types.is_bool_dtype(series)
+    is_datetime = pd.api.types.is_datetime64_any_dtype(series)
+    # bool is technically numeric in pandas; treat it as its own type.
+    is_numeric = pd.api.types.is_numeric_dtype(series) and not is_bool
+    if is_bool:
+        inferred = "boolean"
+    elif is_datetime:
+        inferred = "datetime"
+    elif is_numeric:
+        inferred = "numeric"
+    else:
+        inferred = "categorical"
+    out: dict[str, object] = {
+        "dtype": str(series.dtype),
+        "inferred_type": inferred,
+        "count": int(total),
+        "null_count": int(series.isna().sum()),
+        "null_rate": float(series.isna().mean()) if total else 0.0,
+        "distinct_count": distinct,
+        "distinct_rate": (distinct / nn) if nn else 0.0,  # over non-null values
+        "is_constant": distinct <= 1,
+    }
+    if is_numeric and nn > 0:
+        out["min"] = _clean(non_null.min())
+        out["max"] = _clean(non_null.max())
+        out["mean"] = _clean(non_null.mean())
+        # IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers.
+        # Needs enough points for stable quartiles.
+        if nn >= 4:
+            q1 = non_null.quantile(0.25)
+            q3 = non_null.quantile(0.75)
+            iqr = q3 - q1
+            lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
+            out["outlier_count"] = int(((non_null < lower) | (non_null > upper)).sum())
+        else:
+            out["outlier_count"] = None
+    elif not is_numeric and nn > 0:
+        counts = non_null.value_counts()
+        out["top_value"] = _clean(counts.index[0])
+        out["top_freq"] = int(counts.iloc[0])
+    return out
+def analyze_profile(
+    df: pd.DataFrame,
+    column_ids: list[str] | None = None,
+) -> dict[str, dict[str, object]]:
+    """Per-column data-quality profile.
+    Args:
+        df: already-materialized data (in the real system the wrapper fetches
+            this from a source_id).
+        column_ids: columns to profile. If None, every column is profiled.
+    Returns:
+        dict: { column_id: { profile fields, ... }, ... }
+    Raises:
+        ColumnNotFoundError: if any column_id is absent from df.
+    """
+    cols = list(column_ids) if column_ids is not None else list(df.columns)
+    missing = [c for c in cols if c not in df.columns]
+    if missing:
+        raise ColumnNotFoundError(f"columns not found: {missing}")
+    return {col: _profile_one(df[col]) for col in cols}
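
The IQR outlier rule used by `_profile_one` can be tried on a small series. The sketch below is standalone and only mirrors that branch; `iqr_outlier_count` is a hypothetical name and the numbers are invented for illustration.

```python
from __future__ import annotations

import pandas as pd


def iqr_outlier_count(values: pd.Series) -> int | None:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; None if < 4 points."""
    non_null = values.dropna()
    if len(non_null) < 4:
        return None  # too few points for meaningful quartiles
    q1 = non_null.quantile(0.25)
    q3 = non_null.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((non_null < lower) | (non_null > upper)).sum())


prices = pd.Series([10, 11, 12, 13, 14, 100])  # 100 sits far above the rest
too_short = pd.Series([1, 2, 3])

n_outliers = iqr_outlier_count(prices)
n_short = iqr_outlier_count(too_short)
print(n_outliers, n_short)
```

For `prices` the quartiles are 11.25 and 13.75, so the fences are 7.5 and 17.5 and only the value 100 falls outside them.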

src/tools/analytics/relationship.py ADDED

	@@ -0,0 +1,108 @@

+"""analyze_correlation — correlation among numeric columns (KM-608).
+An analytical "family" tool: in ONE call it measures how strongly numeric
+columns move together. Returns the full correlation matrix plus a list of
+column pairs ranked by strength. Answers questions like "does price relate to
+units sold?".
+STATUS: compute layer only — the function takes an already-materialized
+DataFrame. The wrapper layer (fetching data from the catalog via source_id,
+the ToolOutput envelope, ToolSpec registration) is added once the Planner
+seam (KM-418) is settled. Keeping compute separate from data-fetching makes
+this function easy to unit-test in isolation and stable when wrapped.
+"""
+from __future__ import annotations
+import math
+import pandas as pd
+from src.tools.analytics.descriptive import ColumnNotFoundError
+# Correlation methods supported by pandas .corr().
+SUPPORTED_METHODS = ("pearson", "spearman", "kendall")
+class InvalidMethodError(ValueError):
+    """The requested method is not supported (maps to error_code INVALID_METHOD)."""
+class NonNumericColumnError(ValueError):
+    """A requested column is not numeric (maps to error_code NON_NUMERIC_COLUMN)."""
+class NotEnoughColumnsError(ValueError):
+    """Correlation needs at least two numeric columns (maps to NOT_ENOUGH_COLUMNS)."""
+def _clean(value: object) -> float | None:
+    """Cast to plain float; NaN (e.g. a zero-variance column) -> None."""
+    if value is None:
+        return None
+    f = float(value)  # type: ignore[arg-type]
+    return None if math.isnan(f) else f
+def analyze_correlation(
+    df: pd.DataFrame,
+    column_ids: list[str] | None = None,
+    method: str = "pearson",
+) -> dict[str, object]:
+    """Pairwise correlation across numeric columns.
+    Args:
+        df: already-materialized data (in the real system the wrapper fetches
+            this from a source_id).
+        column_ids: numeric columns to correlate. If None, every numeric
+            column in df is used.
+        method: "pearson" (linear), "spearman" (rank), or "kendall".
+    Returns:
+        dict with:
+            method   — echo of the chosen method
+            columns  — the numeric columns actually correlated
+            matrix   — { col: { col: corr|None } } full square matrix
+            pairs    — [{"a", "b", "corr"}] unique pairs, strongest |corr| first
+                       (pairs whose correlation is undefined are omitted)
+    Raises:
+        InvalidMethodError: if method is unknown.
+        ColumnNotFoundError: if an explicit column is absent.
+        NonNumericColumnError: if an explicit column is not numeric.
+        NotEnoughColumnsError: if fewer than two numeric columns remain.
+    """
+    if method not in SUPPORTED_METHODS:
+        raise InvalidMethodError(
+            f"unknown method '{method}'; supported: {list(SUPPORTED_METHODS)}"
+        )
+    if column_ids is None:
+        cols = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
+    else:
+        missing = [c for c in column_ids if c not in df.columns]
+        if missing:
+            raise ColumnNotFoundError(f"columns not found: {missing}")
+        non_numeric = [
+            c for c in column_ids if not pd.api.types.is_numeric_dtype(df[c])
+        ]
+        if non_numeric:
+            raise NonNumericColumnError(f"columns are not numeric: {non_numeric}")
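+        # De-duplicate while keeping order: a repeated column would make
+        # corr.loc[a, b] return a frame instead of a scalar.
+        cols = list(dict.fromkeys(column_ids))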
+    if len(cols) < 2:
+        raise NotEnoughColumnsError(
+            f"need >= 2 numeric columns, got {len(cols)}: {cols}"
+        )
+    corr = df[cols].corr(method=method)
+    matrix = {a: {b: _clean(corr.loc[a, b]) for b in cols} for a in cols}
+    pairs = []
+    for i in range(len(cols)):
+        for j in range(i + 1, len(cols)):
+            val = _clean(corr.iloc[i, j])
+            if val is not None:
+                pairs.append({"a": cols[i], "b": cols[j], "corr": val})
+    pairs.sort(key=lambda p: abs(p["corr"]), reverse=True)
+    return {"method": method, "columns": cols, "matrix": matrix, "pairs": pairs}
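
How the ranked `pairs` list comes out of the correlation matrix can be sketched on toy data. This is a standalone sketch of the same loop, not the module itself; the column names and values are assumptions chosen so that one column has zero variance.

```python
import math

import pandas as pd

df = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0],
    "units": [8.0, 6.0, 4.0, 2.0],  # falls exactly as price rises
    "noise": [1.0, 3.0, 2.0, 4.0],
    "const": [5.0, 5.0, 5.0, 5.0],  # zero variance -> correlation is NaN
})

corr = df.corr(method="pearson")
cols = list(df.columns)

# Walk the upper triangle so each unordered pair appears once.
pairs = []
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        val = float(corr.iloc[i, j])
        if not math.isnan(val):  # undefined correlations are skipped
            pairs.append({"a": cols[i], "b": cols[j], "corr": val})

# Strongest relationship first, regardless of sign.
pairs.sort(key=lambda p: abs(p["corr"]), reverse=True)
print(pairs)
```

Every pair involving `const` is dropped from `pairs`, and the perfectly negative `price`/`units` pair ranks first because sorting uses the absolute value.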

src/tools/analytics/segmentation.py ADDED

	@@ -0,0 +1,124 @@

+"""analyze_segment — bucket rows into segments (KM-608).
+An analytical "family" tool: in ONE call it bins a numeric column into
+segments and reports how rows distribute across them (count, and optionally an
+aggregate of another column per segment). Two binning modes: explicit cut
+"edges" (e.g. age 0-18-35-60) or equal-frequency "quantile" buckets (quartiles,
+deciles). Answers questions like "split customers into age brackets" or "bucket
+orders into value tiers".
+STATUS: compute layer only — the function takes an already-materialized
+DataFrame. The wrapper layer (fetching data from the catalog via source_id,
+the ToolOutput envelope, ToolSpec registration) is added once the Planner
+seam (KM-418) is settled. Keeping compute separate from data-fetching makes
+this function easy to unit-test in isolation and stable when wrapped.
+"""
+from __future__ import annotations
+import math
+import pandas as pd
+from src.tools.analytics.descriptive import ColumnNotFoundError
+# Binning strategies.
+SUPPORTED_METHODS = ("edges", "quantile")
+# How to aggregate the value column within each segment.
+SUPPORTED_AGGS = ("sum", "mean", "count", "min", "max", "median")
+class InvalidMethodError(ValueError):
+    """The requested binning method is not supported (maps to INVALID_METHOD)."""
+class NonNumericColumnError(ValueError):
+    """The column to segment on is not numeric (maps to NON_NUMERIC_COLUMN)."""
+class UnsupportedAggregationError(ValueError):
+    """The requested aggregation is not supported (maps to UNSUPPORTED_AGG)."""
+def _clean(value: object) -> object:
+    """Convert numpy scalars to plain Python; NaN -> None for JSON-clean output."""
+    if value is None:
+        return None
+    if hasattr(value, "item"):
+        value = value.item()
+    if isinstance(value, float) and math.isnan(value):
+        return None
+    return value
+def analyze_segment(
+    df: pd.DataFrame,
+    column: str,
+    bins: list[float] | int,
+    method: str = "edges",
+    labels: list[str] | None = None,
+    value_column: str | None = None,
+    agg: str = "sum",
+) -> dict[str, object]:
+    """Segment rows by binning a numeric column.
+    Args:
+        df: already-materialized data (in the real system the wrapper fetches
+            this from a source_id).
+        column: numeric column to bin on.
+        bins: for method "edges", the list of cut boundaries (e.g.
+            [0, 18, 35, 60]); for method "quantile", the number of
+            equal-frequency buckets (e.g. 4 for quartiles). Rows whose value is
+            null or falls outside the edges are not counted in any segment.
+        method: "edges" (explicit boundaries) or "quantile" (equal frequency).
+        labels: optional segment names; for "edges" there must be
+            len(bins) - 1 of them, for "quantile" one per resulting bucket.
+        value_column: if given, also aggregate this column per segment.
+        agg: how to aggregate value_column — one of SUPPORTED_AGGS.
+    Returns:
+        dict with:
+            column, method   — echo of the chosen settings
+            agg              — present only when value_column is given
+            segments         — [{"segment", "count", ("value")}], in bin order
+    Raises:
+        ColumnNotFoundError: if column or value_column is absent.
+        NonNumericColumnError: if column is not numeric.
+        InvalidMethodError: if method is unknown.
+        UnsupportedAggregationError: if agg is not supported.
+    """
+    referenced = [column] + ([value_column] if value_column else [])
+    missing = [c for c in referenced if c not in df.columns]
+    if missing:
+        raise ColumnNotFoundError(f"columns not found: {missing}")
+    if not pd.api.types.is_numeric_dtype(df[column]):
+        raise NonNumericColumnError(f"column '{column}' is not numeric")
+    if method not in SUPPORTED_METHODS:
+        raise InvalidMethodError(
+            f"unknown method '{method}'; supported: {list(SUPPORTED_METHODS)}"
+        )
+    if value_column is not None and agg not in SUPPORTED_AGGS:
+        raise UnsupportedAggregationError(
+            f"unsupported aggregation '{agg}'; supported: {list(SUPPORTED_AGGS)}"
+        )
+    if method == "edges":
+        cats = pd.cut(df[column], bins=bins, labels=labels, include_lowest=True)
+    else:  # quantile
+        cats = pd.qcut(df[column], q=bins, labels=labels, duplicates="drop")
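+    grouped = df.groupby(cats, observed=False)
+    counts = grouped.size()
+    # Aggregate once up front rather than once per segment inside the loop.
+    values = grouped[value_column].agg(agg) if value_column is not None else None
+    segments = []
+    for pos, (seg, n) in enumerate(counts.items()):
+        row: dict[str, object] = {"segment": str(seg), "count": int(n)}
+        if values is not None:
+            # counts and values share the same group index, so align by position.
+            row["value"] = _clean(values.iloc[pos])
+        segments.append(row)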
+    out: dict[str, object] = {"column": column, "method": method, "segments": segments}
+    if value_column is not None:
+        out["agg"] = agg
+    return out
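
The two binning modes behave differently at the boundaries, which a small standalone sketch makes visible. It uses the same `pd.cut` / `pd.qcut` calls as `analyze_segment` on invented data; the `age` and `spend` columns and the labels are assumptions for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "age":   [5, 17, 18, 30, 42, 59, 75],
    "spend": [10, 20, 30, 40, 50, 60, 70],
})

# "edges" mode: explicit, right-inclusive boundaries. Age 18 lands in the
# first bin, and age 75 lies past the last edge so it joins no segment.
by_edges = pd.cut(
    df["age"],
    bins=[0, 18, 35, 60],
    labels=["child", "young", "adult"],
    include_lowest=True,
)
grouped = df.groupby(by_edges, observed=False)
edge_counts = grouped.size().tolist()
edge_spend = grouped["spend"].sum().tolist()

# "quantile" mode: two equal-frequency buckets split at the median age.
by_quantile = pd.qcut(df["age"], q=2, duplicates="drop")
quantile_counts = df.groupby(by_quantile, observed=False).size().tolist()

print(edge_counts, edge_spend, quantile_counts)
```

Only six of the seven rows are counted in "edges" mode, so the segment counts do not necessarily add up to the row count when values fall outside the supplied boundaries.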

src/tools/analytics/temporal.py ADDED

	@@ -0,0 +1,159 @@

+"""analyze_trend — time-series trend over a period (KM-608).
+An analytical "family" tool: in ONE call it buckets rows into time periods
+(day/week/month/quarter/year), aggregates a value per period, and summarizes
+the movement (first vs last, absolute & percent change, direction, linear
+slope). Answers questions like "how did revenue trend month over month?".
+STATUS: compute layer only — the function takes an already-materialized
+DataFrame. The wrapper layer (fetching data from the catalog via source_id,
+the ToolOutput envelope, ToolSpec registration) is added once the Planner
+seam (KM-418) is settled. Keeping compute separate from data-fetching makes
+this function easy to unit-test in isolation and stable when wrapped.
+"""
+from __future__ import annotations
+import numpy as np
+import pandas as pd
+from src.tools.analytics.descriptive import ColumnNotFoundError
+# Friendly period name -> pandas resample rule. The ME/QE/YE aliases replace
+# the M/Q/Y codes deprecated in pandas 2.2, so they avoid FutureWarnings; older
+# pandas releases may not recognize them.
+FREQ_MAP = {
+    "day": "D",
+    "week": "W",
+    "month": "ME",
+    "quarter": "QE",
+    "year": "YE",
+}
+# How to aggregate the value within each period.
+SUPPORTED_AGGS = ("sum", "mean", "count", "min", "max", "median")
+class InvalidFrequencyError(ValueError):
+    """The requested period is not in FREQ_MAP (maps to error_code INVALID_FREQUENCY)."""
+class UnsupportedAggregationError(ValueError):
+    """The requested aggregation is not supported (maps to error_code UNSUPPORTED_AGG)."""
+def _clean(value: object) -> object:
+    """Convert numpy scalars to plain Python; NaN -> None for JSON-clean output."""
+    if value is None:
+        return None
+    if hasattr(value, "item"):
+        value = value.item()
+    if isinstance(value, float) and np.isnan(value):
+        return None
+    return value
+def _period_label(ts: pd.Timestamp, freq: str) -> str:
+    """Human-readable period label keyed off the friendly frequency name."""
+    if freq == "month":
+        return str(ts.strftime("%Y-%m"))
+    if freq == "quarter":
+        return f"{ts.year}-Q{ts.quarter}"
+    if freq == "year":
+        return str(ts.strftime("%Y"))
+    # day / week: resample("W") labels each week by its closing Sunday.
+    return str(ts.strftime("%Y-%m-%d"))
+def analyze_trend(
+    df: pd.DataFrame,
+    date_column: str,
+    value_column: str,
+    freq: str = "month",
+    agg: str = "sum",
+) -> dict[str, object]:
+    """Time-series trend of one value over evenly-spaced periods.
+    Args:
+        df: already-materialized data (in the real system the wrapper fetches
+            this from a source_id).
+        date_column: column holding dates/timestamps.
+        value_column: numeric column to aggregate per period.
+        freq: period granularity — one of FREQ_MAP keys (default "month").
+        agg: how to aggregate within a period — one of SUPPORTED_AGGS.
+    Returns:
+        dict with:
+            freq, agg          — echo of the chosen settings
+            points             — [{"period": str, "value": number|None}, ...]
+            first, last        — value of the first/last non-empty period
+            change_abs         — last - first
+            change_pct         — (last - first) / first, or None if first == 0
+            direction          — "up" | "down" | "flat"
+            slope              — linear slope across periods, or None if < 2 points
+        change_pct is a fraction (0.25 means +25%). "sum" and "count" yield 0
+        rather than NaN for a period with no rows, so such periods still count
+        as non-empty for first/last/slope.
+    Raises:
+        ColumnNotFoundError: if date_column or value_column is absent.
+        InvalidFrequencyError: if freq is not a known period.
+        UnsupportedAggregationError: if agg is not supported.
+    """
+    missing = [c for c in (date_column, value_column) if c not in df.columns]
+    if missing:
+        raise ColumnNotFoundError(f"columns not found: {missing}")
+    if freq not in FREQ_MAP:
+        raise InvalidFrequencyError(
+            f"unknown frequency '{freq}'; supported: {list(FREQ_MAP)}"
+        )
+    if agg not in SUPPORTED_AGGS:
+        raise UnsupportedAggregationError(
+            f"unsupported aggregation '{agg}'; supported: {list(SUPPORTED_AGGS)}"
+        )
+    # Build a clean datetime-indexed series, then resample into periods.
+    s = df[[date_column, value_column]].copy()
+    s[date_column] = pd.to_datetime(s[date_column])
+    s = s.dropna(subset=[date_column]).set_index(date_column).sort_index()
+    resampled = s[value_column].resample(FREQ_MAP[freq]).agg(agg)
+    points = [
+        {"period": _period_label(ts, freq), "value": _clean(val)}
+        for ts, val in resampled.items()
+    ]
+    # Summary stats skip periods whose aggregate is NaN (e.g. "mean" of no rows).
+    non_null = resampled.dropna()
+    first: float | None
+    last: float | None
+    change_abs: float | None
+    change_pct: float | None
+    slope: float | None
+    if non_null.empty:
+        first = last = change_abs = change_pct = slope = None
+        direction = "flat"
+    else:
+        first = float(non_null.iloc[0])
+        last = float(non_null.iloc[-1])
+        change_abs = last - first
+        change_pct = (change_abs / first) if first != 0 else None
+        if change_abs > 0:
+            direction = "up"
+        elif change_abs < 0:
+            direction = "down"
+        else:
+            direction = "flat"
+        if non_null.shape[0] > 1:
+            x = np.arange(non_null.shape[0])
+            slope = float(np.polyfit(x, non_null.to_numpy(dtype=float), 1)[0])
+        else:
+            slope = None
+    return {
+        "freq": freq,
+        "agg": agg,
+        "points": points,
+        "first": first,
+        "last": last,
+        "change_abs": change_abs,
+        "change_pct": change_pct,
+        "direction": direction,
+        "slope": slope,
+    }
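
The resample-then-summarize flow can be followed end to end on four rows. This is a standalone sketch of the same steps rather than a call to `analyze_trend`; it uses the daily rule `"D"` (which does not depend on the pandas 2.2 aliases), and the `day` and `revenue` columns are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "day": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-04"],
    "revenue": [10.0, 5.0, 20.0, 30.0],
})

s = df.copy()
s["day"] = pd.to_datetime(s["day"])
series = s.set_index("day").sort_index()["revenue"]

daily_mean = series.resample("D").mean()  # 2024-01-03 has no rows -> NaN
daily_sum = series.resample("D").sum()    # 2024-01-03 has no rows -> 0.0

# Summary over periods that produced a value.
non_null = daily_mean.dropna()
first, last = float(non_null.iloc[0]), float(non_null.iloc[-1])
change_abs = last - first
change_pct = change_abs / first if first != 0 else None

# Slope is fitted against the position of each kept period (0, 1, 2, ...),
# so a gap in the calendar does not stretch the x axis.
x = np.arange(non_null.shape[0])
slope = float(np.polyfit(x, non_null.to_numpy(dtype=float), 1)[0])

print(first, last, change_abs, change_pct, slope)
```

With `"mean"` the empty day is skipped in the summary, whereas with `"sum"` the same day would contribute a 0 and pull the fitted slope down, so the choice of aggregation changes what "non-empty" means.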