AdithyaVardan/GodSpeed

Ananth Shyam committed 17 days ago

Commit

451d52a

1 Parent(s): b391fb0

Implement anomaly detection and forecasting features

- Added new documentation for Area 4: Anomaly Detection & Forecasting, detailing its purpose, implementation, and design principles.
- Created the anomaly detection module with the following components:
  - `src/anomaly/__init__.py`: Initialization file for the anomaly module.
  - `src/anomaly/db.py`: Supabase persistence layer for anomaly signals and query events.
  - `src/anomaly/notifier.py`: In-process WebSocket broadcast bridge for real-time notifications of critical/high anomaly signals.
  - `src/anomaly/router.py`: FastAPI router for the anomaly detection API endpoints.
  - `src/anomaly/tasks.py`: Celery tasks for running anomaly detection algorithms, including Z-score detection, staleness scoring, and dependency risk modeling.
- Developed SQL migration script (`supabase/anomaly_migration.sql`) to create necessary tables for storing query events and anomaly signals.

Files changed (17)

.claude/settings.json +21 -1
Docs/03_analytics_and_intelligence.md +3 -2
Docs/README.md +6 -5
Docs/anomaly-and-forecasting/01_data_layer.md +257 -0
Docs/anomaly-and-forecasting/02_detection_algorithms.md +328 -0
Docs/anomaly-and-forecasting/03_api_reference.md +240 -0
Docs/anomaly-and-forecasting/04_jobs_and_scheduling.md +198 -0
Docs/anomaly-and-forecasting/README.md +86 -0
agent/api.py +9 -0
main.py +2 -0
src/anomaly/__init__.py +0 -0
src/anomaly/db.py +262 -0
src/anomaly/notifier.py +74 -0
src/anomaly/router.py +135 -0
src/anomaly/tasks.py +406 -0
src/celery_app.py +48 -2
supabase/anomaly_migration.sql +86 -0

.claude/settings.json CHANGED

@@ -4,7 +4,27 @@
       "Bash(Get-ChildItem -Path \"d:\\\\Godspeed\" -Recurse -Include \"*.py\")",
       "Bash(Select-Object -First 20)",
       "Bash(dir /s /b d:\\\\Godspeed)",
-      "Bash(xargs wc -l)"
     ]
   }
 }

       "Bash(Get-ChildItem -Path \"d:\\\\Godspeed\" -Recurse -Include \"*.py\")",
       "Bash(Select-Object -First 20)",
       "Bash(dir /s /b d:\\\\Godspeed)",
+      "Bash(xargs wc -l)",
+      "Bash(Get-ChildItem -Path \"d:\\\\Godspeed\" -Directory)",
+      "Bash(Select-Object Name)",
+      "Bash(xargs grep -l \"anomaly\\\\|forecast\\\\|spike\\\\|trend\")",
+      "Bash(xargs grep -l \"validation\\\\|critic\\\\|generator\")",
+      "Bash(grep -E \"\\\\.py$\")",
+      "Bash(xargs grep -l \"BM25\\\\|qdrant\\\\|dense\\\\|sparse\\\\|retrieval\")",
+      "Bash(Get-ChildItem -Path \"d:\\\\Godspeed\" -Force)",
+      "Bash(Select-Object Name, FullName)",
+      "Bash(Get-ChildItem -Path \"d:\\\\Godspeed\" -Recurse -Directory)",
+      "Bash(Select-Object -ExpandProperty FullName)",
+      "Bash(xargs grep -l \"FastAPI\\\\|APIRouter\")",
+      "Bash(xargs grep -l \"redis\\\\|Redis\")",
+      "Bash(xargs grep -l \"apiFetch\\\\|http\")",
+      "PowerShell(Get-ChildItem -Path \"d:\\\\Godspeed\" -Force | Select-Object Name, @{Name=\"Type\"; Expression={if\\($_.PSIsContainer\\) {\"Dir\"} else {\"File\"}}} | Format-Table -AutoSize)",
+      "PowerShell(Get-ChildItem -Path \"d:\\\\Godspeed\\\\agent\" -Recurse | Select-Object FullName | ForEach-Object { $_.FullName.Replace\\(\"d:\\\\Godspeed\\\\\", \"\"\\) })",
+      "PowerShell(Get-ChildItem -Path \"d:\\\\Godspeed\\\\src\" -Recurse | Select-Object FullName | ForEach-Object { $_.FullName.Replace\\(\"d:\\\\Godspeed\\\\\", \"\"\\) })",
+      "PowerShell(Get-ChildItem -Path \"d:\\\\Godspeed\\\\frontend\" -Recurse -Depth 2 | Where-Object { -not $_.PSIsContainer } | Select-Object FullName | ForEach-Object { $_.FullName.Replace\\(\"d:\\\\Godspeed\\\\\", \"\"\\) } | Sort-Object | Select-Object -First 40)",
+      "PowerShell(Get-ChildItem -Path \"d:\\\\Godspeed\\\\ingestion\" -Recurse | Where-Object { -not $_.PSIsContainer } | Select-Object FullName | ForEach-Object { $_.FullName.Replace\\(\"d:\\\\Godspeed\\\\\", \"\"\\) })",
+      "PowerShell(Get-ChildItem -Path \"d:\\\\Godspeed\\\\graph_store\" -Recurse | Where-Object { -not $_.PSIsContainer } | Select-Object FullName | ForEach-Object { $_.FullName.Replace\\(\"d:\\\\Godspeed\\\\\", \"\"\\) })",
+      "Bash(git checkout *)"
     ]
   }
 }

Docs/03_analytics_and_intelligence.md CHANGED

@@ -512,9 +512,10 @@ ANALYTICS_PERMISSIONS = {
 ---
-## 9. Area 4 — Anomaly Detection & Forecasting (Planned)
-> **Status: Planned Extension.** All input streams already exist in Areas 1–3. No new data pipelines needed. Requires: time-series aggregation layer (InfluxDB or PostgreSQL time-series extension).
 ### Component 1 — Query Spike Detector

 ---
+## 9. Area 4 — Anomaly Detection & Forecasting
+> **Status: Implemented** on branch `anomaly-and-forecasting`. See [`anomaly-and-forecasting/`](./anomaly-and-forecasting/README.md) for full implementation docs.
+> The specs below describe the original design intent. The actual implementation uses plain PostgreSQL (no InfluxDB) and stdlib-only algorithms (no external ML dependencies).
 ### Component 1 — Query Spike Detector

Docs/README.md CHANGED

@@ -13,17 +13,18 @@
 | [`03_analytics_and_intelligence.md`](./03_analytics_and_intelligence.md) | Query classification, interaction log schema, retrieval feedback loop, NL analytics, health dashboard, proactive agent, silo detector, Areas 4 & 5 planned specs | Building analytics, dashboards, or Area 3 agents |
 | [`04_integrations_and_tech_stack.md`](./04_integrations_and_tech_stack.md) | Notion, Confluence, GitHub integration specs with full code; RBAC enforcement; change detection; tech stack; local dev setup; env vars | Building any integration or setting up development environment |
 | [`05_market_strategy_and_gtm.md`](./05_market_strategy_and_gtm.md) | Target customer, tool strategy, competitor analysis, USPs, country-by-country market analysis, GTM sequencing | Product, positioning, or expansion decisions |
 ---
 ## The Five Focus Areas at a Glance
 ```
-Area 1 — Hybrid RAG System          [Core — In Scope]
-Area 2 — Data Pipelines & Validation [Core — In Scope]
-Area 3 — Analytics & NL Intelligence [Core — In Scope]
-Area 4 — Anomaly & Forecasting       [Planned Extension]
-Area 5 — Knowledge Graph             [Planned Extension]
 ```
 All five areas are one system — not five products. See `01_problem_and_architecture.md` for the interaction map.

 | [`03_analytics_and_intelligence.md`](./03_analytics_and_intelligence.md) | Query classification, interaction log schema, retrieval feedback loop, NL analytics, health dashboard, proactive agent, silo detector, Areas 4 & 5 planned specs | Building analytics, dashboards, or Area 3 agents |
 | [`04_integrations_and_tech_stack.md`](./04_integrations_and_tech_stack.md) | Notion, Confluence, GitHub integration specs with full code; RBAC enforcement; change detection; tech stack; local dev setup; env vars | Building any integration or setting up development environment |
 | [`05_market_strategy_and_gtm.md`](./05_market_strategy_and_gtm.md) | Target customer, tool strategy, competitor analysis, USPs, country-by-country market analysis, GTM sequencing | Product, positioning, or expansion decisions |
+| [`anomaly-and-forecasting/`](./anomaly-and-forecasting/README.md) | **Area 4 implementation docs** — data layer, detection algorithms, API reference, Celery scheduling | Building or extending anomaly detection, forecasting, or the Anomalies frontend tab |
 ---
 ## The Five Focus Areas at a Glance
 ```
+Area 1 — Hybrid RAG System          [Core — Implemented]
+Area 2 — Data Pipelines & Validation [Core — Implemented]
+Area 3 — Analytics & NL Intelligence [Core — Implemented]
+Area 4 — Anomaly & Forecasting       [Implemented — branch: anomaly-and-forecasting]
+Area 5 — Knowledge Graph             [Implemented]
 ```
 All five areas are one system — not five products. See `01_problem_and_architecture.md` for the interaction map.

Docs/anomaly-and-forecasting/01_data_layer.md ADDED

	@@ -0,0 +1,257 @@

+# 01 · Data Layer — Time-Series Tables & Migration
+---
+## Why a New Data Layer?
+Redis (`gs:queries`) caps history at 1,000 events. That is enough for a live dashboard but useless for anomaly detection, which requires:
+- Hourly counts over a 14-day sliding window (≈ 336 rows per team)
+- Per-day escalation rates over a 14-day comparison window
+- Document age relative to query volume over 30 days
+- Persistent anomaly signal records with resolution workflow
+Three PostgreSQL tables are added in Supabase. No TimescaleDB extension is required — plain indexed `TIMESTAMPTZ` columns with a pre-aggregation function serve the access patterns needed.
+---
+## Migration File
+**Path:** `supabase/anomaly_migration.sql`
+**Run after:** `rbac_migration.sql`
+**Idempotent:** Yes — all statements use `IF NOT EXISTS` / `ON CONFLICT DO UPDATE`
+Apply via the Supabase dashboard SQL editor or:
+```bash
+supabase db push
+# or
+psql $DATABASE_URL -f supabase/anomaly_migration.sql
+```
+---
+## Table 1: `query_events`
+Persistent copy of every query event. Redis keeps the last 1,000; this table keeps the last 90 days (purged nightly by the staleness task).
+```sql
+CREATE TABLE IF NOT EXISTS query_events (
+    id              uuid        PRIMARY KEY DEFAULT gen_random_uuid(),
+    event_id        text        NOT NULL UNIQUE,   -- same UUID as Redis event["id"]
+    team_id         text        NOT NULL,
+    session_id      text,
+    success         boolean     NOT NULL DEFAULT true,
+    duration_ms     integer,
+    escalated       boolean     NOT NULL DEFAULT false,
+    guardrail_score float,
+    agent_metrics   jsonb       NOT NULL DEFAULT '{}',
+    created_at      timestamptz NOT NULL DEFAULT now()
+);
+```
+| Column | Source | Notes |
+|---|---|---|
+| `event_id` | `agent/api.py` event `"id"` field | Used for idempotent upsert — Redis replays are safe |
+| `team_id` | Server-enforced from session | Never trusted from client |
+| `escalated` | `guardrail_score < 0.5` flag | Primary input for escalation trend detection |
+| `agent_metrics` | `event["agents"]` dict | `{"doc_search": {"chunk_count": 5, "confidence": "high"}}` |
+**Indexes:**
+- `(team_id, created_at DESC)` — primary scan for per-team time ranges
+- `(created_at DESC)` — global purge query
+- `(team_id, escalated, created_at DESC)` — escalation rate aggregation
+**Write path:** `agent/api.py` → `asyncio.ensure_future(async_upsert_query_event(event))` — fire-and-forget, never blocks SSE stream.
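A minimal sketch of the fire-and-forget write path described above, assuming the blocking upsert is offloaded to the default thread pool. `STORE`, `fire_and_forget`, and the in-memory body of `upsert_query_event` are hypothetical stand-ins for illustration, not the contents of `src/anomaly/db.py`.

```python
import asyncio

# Hypothetical in-memory stand-in for the Supabase table.
STORE: dict = {}

# Strong references so pending tasks are not garbage-collected mid-flight.
_background_tasks: set = set()


def upsert_query_event(event: dict) -> None:
    """Blocking upsert keyed by event id (stand-in for the Supabase call)."""
    STORE[event["id"]] = event


async def async_upsert_query_event(event: dict) -> None:
    """Run the blocking upsert in the default executor; never raise to the caller."""
    loop = asyncio.get_running_loop()
    try:
        await loop.run_in_executor(None, upsert_query_event, event)
    except Exception:
        pass  # persistence is best-effort; a real module would log here


def fire_and_forget(event: dict) -> asyncio.Future:
    """Schedule the upsert without awaiting it (requires a running event loop)."""
    task = asyncio.ensure_future(async_upsert_query_event(event))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return task
```

Holding a reference in `_background_tasks` is a precaution: the event loop keeps only weak references to tasks, so an unreferenced task can disappear before it completes.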
+---
+## Table 2: `query_events_hourly`
+Pre-aggregated hourly buckets. The Z-score detection task reads this table, not `query_events` directly, to avoid shipping thousands of raw rows to the Celery worker.
+```sql
+CREATE TABLE IF NOT EXISTS query_events_hourly (
+    team_id          text        NOT NULL,
+    hour_bucket      timestamptz NOT NULL,     -- truncated to the hour, UTC
+    query_count      integer     NOT NULL DEFAULT 0,
+    escalation_count integer     NOT NULL DEFAULT 0,
+    avg_duration_ms  integer,
+    PRIMARY KEY (team_id, hour_bucket)
+);
+```
+**Index:** `(team_id, hour_bucket DESC)` — range scan for last N days.
+**Write path:** The `aggregate_hourly_bucket` PostgreSQL function (see below) is called via Supabase RPC. The function does the aggregation inside the database — no rows are shipped to Python.
+---
+## Table 3: `anomaly_signals`
+Every detected anomaly is a row here. The frontend Anomalies tab reads from this table via the API.
+```sql
+CREATE TABLE IF NOT EXISTS anomaly_signals (
+    id           uuid        PRIMARY KEY DEFAULT gen_random_uuid(),
+    team_id      text,
+    signal_type  text        NOT NULL,
+    entity_type  text,
+    entity_id    text,
+    severity     text        NOT NULL DEFAULT 'medium',
+    score        float       NOT NULL DEFAULT 0.0,
+    details      jsonb       NOT NULL DEFAULT '{}',
+    resolved     boolean     NOT NULL DEFAULT false,
+    resolved_by  uuid        REFERENCES users(id),
+    resolved_at  timestamptz,
+    detected_at  timestamptz NOT NULL DEFAULT now()
+);
+```
+### `signal_type` Values
+| Value | Produced By | `entity_type` | `score` Meaning |
+|---|---|---|---|
+| `query_spike` | Z-score task | `Team` | Z-score value (e.g. 4.2) |
+| `query_drop` | Z-score task | `Team` | Z-score value (negative, e.g. −2.8) |
+| `escalation_trend` | Z-score task | `Team` | Rate ratio (e.g. 1.8 = 80% worse) |
+| `staleness` | Staleness task | `Document` | Staleness risk 0.0–1.0 |
+| `dependency_risk` | Dep risk task | `Library` | Composite risk score 0.0–1.0 |
+### `severity` Values
+| Value | Trigger |
+|---|---|
+| `critical` | query_spike: \|Z\| ≥ 5.0 · staleness: ≥ 0.8 · dep_risk: ≥ 0.7 |
+| `high` | query_spike: \|Z\| ≥ 4.0 · staleness: ≥ 0.6 · dep_risk: ≥ 0.5 |
+| `medium` | query_spike: \|Z\| ≥ 3.5 · staleness: ≥ 0.3 · dep_risk: ≥ 0.3 |
+| `low` | Everything else above detection threshold |
+### `details` JSONB Shapes
+**query_spike / query_drop:**
+```json
+{
+  "z_score": 4.2,
+  "current_count": 87,
+  "baseline_mean": 23.4,
+  "baseline_stdev": 15.1,
+  "hour_bucket": "2026-05-17T14:00:00",
+  "window_hours": 335
+}
+```
+**escalation_trend:**
+```json
+{
+  "ratio": 1.92,
+  "current_rate": 0.192,
+  "prior_rate": 0.1,
+  "current_total_queries": 52,
+  "prior_total_queries": 60
+}
+```
+**staleness:**
+```json
+{
+  "title": "Kubernetes Ingress Setup Guide",
+  "age_days": 240,
+  "age_factor": 0.9305,
+  "query_pressure": 0.72,
+  "updated_at": "2025-09-20T10:00:00"
+}
+```
+**dependency_risk:**
+```json
+{
+  "library_name": "fastapi",
+  "current_version": "0.95.0",
+  "latest_version": "0.115.0",
+  "version_lag": 0.6,
+  "downstream_count": 8,
+  "downstream_normalized": 0.62,
+  "incident_count": 2,
+  "incident_rate": 0.0055,
+  "poisson_30d": 0.152
+}
+```
+**Indexes:**
+- `(team_id, signal_type, detected_at DESC)` — filtered list queries
+- `(severity, resolved, detected_at DESC)` — severity-sorted alert feed
+- `(resolved, team_id, detected_at DESC)` — active signals per team
+- `(entity_type, entity_id)` — entity-specific lookups
+**Deduplication:** `db.insert_signal()` checks for an identical unresolved signal for the same `(signal_type, team_id, entity_id)` in the last 2 hours before inserting, preventing the 15-minute task from creating redundant rows.
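The deduplication rule above can be sketched as a pure function. This is an illustration under stated assumptions: `is_duplicate` is a hypothetical helper, signals are plain dicts with timezone-aware `detected_at` values, and the real `db.insert_signal()` presumably performs the equivalent check as a Supabase query.

```python
from datetime import datetime, timedelta, timezone

DEDUP_WINDOW = timedelta(hours=2)


def is_duplicate(existing: list, signal_type: str, team_id, entity_id,
                 now: datetime) -> bool:
    """True if an unresolved signal with the same key was detected within the window."""
    cutoff = now - DEDUP_WINDOW
    return any(
        s["signal_type"] == signal_type
        and s["team_id"] == team_id
        and s["entity_id"] == entity_id
        and not s["resolved"]
        and s["detected_at"] >= cutoff
        for s in existing
    )
```

With a 15-minute schedule and a 2-hour window, up to eight consecutive runs can observe the same condition while only the first one inserts a row.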
+---
+## PostgreSQL Aggregation Function
+`aggregate_hourly_bucket` runs inside the database, keeping the aggregation logic server-side.
+```sql
+CREATE OR REPLACE FUNCTION aggregate_hourly_bucket(p_team_id text, p_hour timestamptz)
+RETURNS void LANGUAGE plpgsql AS $$
+BEGIN
+    INSERT INTO query_events_hourly
+           (team_id, hour_bucket, query_count, escalation_count, avg_duration_ms)
+    SELECT  p_team_id, p_hour,
+            count(*)::integer,
+            count(*) FILTER (WHERE escalated = true)::integer,
+            avg(duration_ms)::integer
+    FROM    query_events
+    WHERE   team_id = p_team_id
+      AND   date_trunc('hour', created_at AT TIME ZONE 'UTC') = p_hour
+    ON CONFLICT (team_id, hour_bucket) DO UPDATE
+        SET query_count      = EXCLUDED.query_count,
+            escalation_count = EXCLUDED.escalation_count,
+            avg_duration_ms  = EXCLUDED.avg_duration_ms;
+END;
+$$;
+```
+**Called via Supabase RPC:**
+```python
+_sb().rpc("aggregate_hourly_bucket", {
+    "p_team_id": team_id,
+    "p_hour":    hour_bucket,
+}).execute()
+```
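The RPC call expects `hour_bucket` to already be truncated to the hour in UTC. A small sketch of how that value might be produced; `to_hour_bucket` is a hypothetical helper, and passing the bucket as an ISO 8601 string is an assumption about how the caller serialises it.

```python
from datetime import datetime, timezone


def to_hour_bucket(ts: datetime) -> str:
    """Truncate an aware datetime to the start of its UTC hour, as ISO 8601."""
    if ts.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    utc = ts.astimezone(timezone.utc)
    return utc.replace(minute=0, second=0, microsecond=0).isoformat()
```

Rejecting naive datetimes avoids silently bucketing events by the worker's local time zone.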
+---
+## Data Persistence Layer: `src/anomaly/db.py`
+All Supabase reads and writes go through this module. Every function is wrapped in `try/except` — callers never crash if Supabase is unavailable.
+### Key Functions
+| Function | Direction | Called By |
+|---|---|---|
+| `upsert_query_event(event)` | Write | `agent/api.py` (sync, via executor) |
+| `async_upsert_query_event(event)` | Write | `agent/api.py` (async shim) |
+| `aggregate_hourly(team_id, hour)` | Write | Future: aggregation task |
+| `get_hourly_counts(team_id, days)` | Read | `tasks.run_zscore_anomaly_detection` |
+| `get_all_team_ids()` | Read | `tasks.run_zscore_anomaly_detection` |
+| `insert_signal(...)` | Write | All four detection algorithms |
+| `get_signals(team_id, ...)` | Read | `router.list_signals` |
+| `resolve_signal(id, user_id)` | Write | `router.resolve_signal` |
+| `get_signals_summary()` | Read | `router.signals_summary` |
+| `get_staleness_top(limit)` | Read | `router.staleness_list` |
+| `get_dependency_risk(limit)` | Read | `router.dependency_risk` |
+| `purge_old_events()` | Write | `tasks.run_staleness_scoring` (nightly) |
+---
+## 90-Day Event Purge
+`purge_old_events()` runs at the end of every `compute_staleness_scores` Celery task (daily 03:00 UTC):
+```python
+_sb().table("query_events").delete().lt("created_at", cutoff).execute()
+```
+This keeps the `query_events` table bounded. At 1 query/minute per team, 90 days = ~129,600 rows per team — manageable for PostgreSQL without partitioning.
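The `cutoff` value in the snippet above is not shown in the docs. One plausible way to derive it, along with the row estimate quoted above, is sketched here; `purge_cutoff` and `rows_per_team` are hypothetical names used only for illustration.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90


def purge_cutoff(now: datetime) -> str:
    """ISO 8601 timestamp before which query_events rows would be deleted."""
    return (now - timedelta(days=RETENTION_DAYS)).isoformat()


def rows_per_team(queries_per_minute: float, days: int = RETENTION_DAYS) -> int:
    """Upper bound on retained rows for one team at a steady query rate."""
    return round(queries_per_minute * 60 * 24 * days)
```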

Docs/anomaly-and-forecasting/02_detection_algorithms.md ADDED

	@@ -0,0 +1,328 @@

+# 02 · Detection Algorithms
+All four algorithms live in `src/anomaly/tasks.py`. No external ML libraries are used — only Python stdlib (`statistics`, `math`).
+---
+## Algorithm 1: Z-Score Query Spike / Drop Detection
+**Celery task:** `poll_metrics_anomalies` — every 15 minutes (queue: `polling`)
+**Function:** `run_zscore_anomaly_detection()`
+**Signal types produced:** `query_spike`, `query_drop`
+### How It Works
+For every team with at least 24 hours of history in `query_events_hourly`:
+1. Fetch hourly query counts for the last 14 days (`get_hourly_counts(team_id, days=14)`)
+2. Exclude the current (partial) hour from the baseline — it is still accumulating
+3. Compute baseline mean (μ) and standard deviation (σ) using `statistics.mean` / `statistics.stdev`
+4. Compute Z-score for the most recent complete hour:
+```
+Z = (current_count − μ) / σ
+```
+5. Flag as **spike** if `Z > 3.0`, **drop** if `Z < −2.0`
+### Why Z > 3.0?
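+A Z-score above 3.0 means the observation is more than 3 standard deviations above the mean. Under a normal distribution, a value that far above the mean occurs by chance about 0.13% of the time (0.27% counting both tails), so a flagged hour usually reflects a real event: an incident, a deployment failure, or a sudden surge of new users. Hourly query counts are only approximately normal, so treat the threshold as a heuristic rather than an exact false-positive rate.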
+The drop threshold is softer (`−2.0`) because a quiet hour is less urgent than a spike but still worth flagging — it may indicate the integration broke silently.
+### Severity Mapping
+| Z-score magnitude | Severity |
+|---|---|
+| \|Z\| ≥ 5.0 | `critical` |
+| \|Z\| ≥ 4.0 | `high` |
+| \|Z\| ≥ 3.5 | `medium` |
+| Any other flagged value (spikes with Z > 3.0, drops with Z < −2.0) | `low` |
+### Edge Cases Handled
+| Condition | Handling |
+|---|---|
+| Fewer than 24 hourly rows | Skip team — insufficient baseline |
+| `stdev == 0` (constant rate) | Skip team — Z-score undefined |
+| `baseline < 2 rows` | Skip team — `statistics.stdev` requires ≥ 2 values |
+| Duplicate signal within 2 hours | `insert_signal()` dedup suppresses it |
+### Example
+A team normally sends ~25 queries/hour (μ = 25, σ = 8). At 14:00, they send 72 queries:
+```
+Z = (72 − 25) / 8 = 5.875  →  severity: critical
+```
+The signal is inserted with `score = 5.875` and `details.hour_bucket = "2026-05-17T14:00:00"`.
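The detection steps, severity mapping, and edge cases above can be sketched with the standard library alone. This is a minimal illustration, not the code in `src/anomaly/tasks.py`: `zscore_signal` is a hypothetical name, and the caller is assumed to have already removed the current partial hour from `baseline`.

```python
import statistics
from typing import Optional, Tuple

MIN_BASELINE_HOURS = 24
SPIKE_Z, DROP_Z = 3.0, -2.0


def zscore_signal(baseline: list, current: int) -> Optional[Tuple[str, float, str]]:
    """Return (signal_type, z, severity), or None when nothing should be flagged."""
    if len(baseline) < MIN_BASELINE_HOURS:
        return None  # insufficient history for a stable baseline
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)  # sample standard deviation
    if sigma == 0:
        return None  # constant rate: Z is undefined
    z = (current - mu) / sigma
    if z > SPIKE_Z:
        kind = "query_spike"
    elif z < DROP_Z:
        kind = "query_drop"
    else:
        return None
    magnitude = abs(z)
    if magnitude >= 5.0:
        severity = "critical"
    elif magnitude >= 4.0:
        severity = "high"
    elif magnitude >= 3.5:
        severity = "medium"
    else:
        severity = "low"
    return kind, z, severity
```

Note that `statistics.stdev` is the sample standard deviation; using `statistics.pstdev` instead would give slightly larger Z-scores on short baselines.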
+---
+## Algorithm 2: Escalation Rate Trend Detection
+**Runs alongside:** Z-score task (same `poll_metrics_anomalies` Celery beat)
+**Function:** `_check_escalation_trend(team_id, rows, now)`
+**Signal type produced:** `escalation_trend`
+### How It Works
+Using the same hourly rows fetched for Z-score:
+1. Split rows into two 7-day windows:
+   - **Current window:** `now − 7d` to `now`
+   - **Prior window:** `now − 14d` to `now − 7d`
+2. For each window, sum `query_count` and `escalation_count`
+3. Compute escalation rate per window:
+```
+escalation_rate = escalation_count / query_count
+```
+4. Compute ratio:
+```
+ratio = current_rate / prior_rate
+```
+5. If `ratio > 1.5` and `current_queries ≥ 10` (noise threshold), insert signal
+### Why 10-Query Minimum?
+A team with 2 total queries and 1 escalation has a 50% escalation rate. That is meaningless — one bad query doubles the rate. The 10-query floor ensures the rate is statistically grounded.
+### Severity
+| Ratio | Severity |
+|---|---|
+| ratio > 2.5 | `high` |
+| ratio > 1.5 | `medium` |
+### Example
+- Prior 7 days: 60 queries, 6 escalations → rate = 10%
+- Current 7 days: 52 queries, 10 escalations → rate = 19.2%
+- Ratio: 1.92 → `medium` severity
+The team's escalation rate is 92% worse than the prior week.
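A sketch of the week-over-week comparison, reproducing the example above. `escalation_trend` is a hypothetical name; skipping teams whose prior rate is zero is an assumption, since the docs do not say how a zero denominator is handled.

```python
from typing import Optional


def escalation_trend(prior_queries: int, prior_escalations: int,
                     current_queries: int, current_escalations: int,
                     min_queries: int = 10) -> Optional[dict]:
    """Compare two 7-day windows; return signal details or None."""
    if current_queries < min_queries or prior_queries == 0:
        return None  # too few queries for a meaningful rate
    prior_rate = prior_escalations / prior_queries
    if prior_rate == 0:
        return None  # assumption: ratio is undefined, so no signal
    current_rate = current_escalations / current_queries
    ratio = current_rate / prior_rate
    if ratio > 2.5:
        severity = "high"
    elif ratio > 1.5:
        severity = "medium"
    else:
        return None
    return {
        "ratio": ratio,
        "current_rate": current_rate,
        "prior_rate": prior_rate,
        "severity": severity,
    }
```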
+---
+## Algorithm 3: Document Staleness Scoring
+**Celery task:** `compute_staleness_scores` — daily at 03:00 UTC (queue: `low`)
+**Function:** `run_staleness_scoring()`
+**Signal type produced:** `staleness`
+### The Core Formula
+```
+staleness_risk = age_factor × query_pressure
+```
+Both factors are independent, each in [0.0, 1.0]. Their product gives a score in [0.0, 1.0].
+### Age Factor
+Saturating exponential with a 90-day time constant (the factor reaches 50% at about 62 days and 63% at 90 days):
+```
+age_factor = 1 − exp(−age_days / 90)
+```
+| Document age | age_factor |
+|---|---|
+| 0 days (just updated) | 0.000 |
+| 30 days | 0.283 |
+| 60 days | 0.487 |
+| 90 days | 0.632 |
+| 180 days | 0.865 |
+| 365 days | 0.983 |
+| 2 years | 0.9997 |
+The exponential shape reflects reality: most knowledge goes stale rapidly in the first 3 months, then the rate slows. A 3-year-old document and a 2-year-old document are both "very stale."
+### Query Pressure
+How heavily queried is this document's team in the last 30 days, relative to all teams?
+```
+query_pressure = min(1.0, monthly_team_query_count / p95_team_query_count)
+```
+The p95 value is the 95th percentile of monthly query counts across all teams. This means only the top 5% most-active teams get `query_pressure = 1.0`. A moderately active team might get 0.4–0.6.
+**Why team-level, not document-level?**
+The current `query_events` schema stores per-query agent metrics (chunk counts) but does not track *which document* each chunk came from. Team-level pressure is a conservative proxy — it correctly identifies high-risk documents in busy teams, and under-weights documents in quiet teams (which is the safe direction: a stale doc nobody queries is low priority).
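A sketch of the query-pressure computation using only the standard library. The helper names are hypothetical, and the choice of the `inclusive` quantile method is an assumption: the docs specify a 95th percentile but not the interpolation rule.

```python
import statistics


def p95(monthly_counts: list) -> float:
    """95th percentile: the last of the 19 cut points that split the data into 20 groups."""
    return statistics.quantiles(monthly_counts, n=20, method="inclusive")[-1]


def query_pressure(team_monthly: float, p95_monthly: float) -> float:
    """Team query volume relative to the p95 team, clamped to [0.0, 1.0]."""
    if p95_monthly <= 0:
        return 0.0  # no query activity anywhere: no pressure
    return min(1.0, team_monthly / p95_monthly)
```

`statistics.quantiles` needs at least two data points, so a single-team deployment would need a fallback denominator.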
+### Severity
+| `staleness_risk` | Severity |
+|---|---|
+| ≥ 0.8 | `critical` |
+| ≥ 0.6 | `high` |
+| ≥ 0.3 | `medium` |
+| < 0.3 (but ≥ 0.1) | `low` |
+| < 0.1 | Not inserted (below noise floor) |
+### Example
+A Kubernetes runbook last updated 8 months ago (240 days), in a team that sends 180 queries/month when p95 is 250:
+```
+age_factor     = 1 − exp(−240 / 90) = 0.931
+query_pressure = min(1.0, 180 / 250) = 0.72
+staleness_risk = 0.931 × 0.72       = 0.670  →  high
+```
+The same runbook in a team that sends 5 queries/month:
+```
+query_pressure = min(1.0, 5 / 250) = 0.02
+staleness_risk = 0.931 × 0.02      = 0.019  →  below threshold, not inserted
+```
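The two worked examples above can be checked with a short sketch of the formula and severity bands. `age_factor` and `staleness` are hypothetical names, not the functions in `src/anomaly/tasks.py`.

```python
import math
from typing import Optional, Tuple


def age_factor(age_days: float, scale_days: float = 90.0) -> float:
    """1 - exp(-age / scale): 0 for a fresh document, approaching 1 with age."""
    return 1.0 - math.exp(-max(age_days, 0.0) / scale_days)


def staleness(age_days: float, pressure: float) -> Optional[Tuple[float, str]]:
    """Return (risk, severity), or None when the risk is below the 0.1 noise floor."""
    risk = age_factor(age_days) * pressure
    if risk < 0.1:
        return None
    if risk >= 0.8:
        severity = "critical"
    elif risk >= 0.6:
        severity = "high"
    elif risk >= 0.3:
        severity = "medium"
    else:
        severity = "low"
    return risk, severity
```

Because the score is a product, either factor near zero suppresses the signal: a very old document in a quiet team and a fresh document in a busy team both stay below the noise floor.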
+### Cleanup
+At the end of every staleness run, `purge_old_events()` deletes `query_events` rows older than 90 days, keeping the table bounded.
+---
+## Algorithm 4: Dependency Risk Modelling
+**Celery task:** `compute_dependency_risk` — daily at 03:30 UTC (queue: `low`)
+**Function:** `run_dependency_risk_modeling()`
+**Signal type produced:** `dependency_risk`
+### The Risk Formula
+Three independent sub-scores, weighted and summed:
+```
+risk = 0.40 × version_lag
+     + 0.35 × downstream_normalized
+     + 0.25 × incident_rate
+```
+All three sub-scores are normalised to [0.0, 1.0]. The final risk score is also [0.0, 1.0].
+### Sub-Score 1: Version Lag
+Derived from semantic versioning distance between `lib.version` and `lib.latest_version` in Neo4j:
+| Gap | Score |
+|---|---|
+| Different major version | 1.0 |
+| Different minor version (same major) | 0.6 |
+| Different patch (same major.minor) | 0.2 |
+| Same version | 0.0 |
+A library that is 2 major versions behind still scores 1.0: the score saturates once the major version differs.
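A literal reading of the table above can be sketched as follows. `parse_version` and `version_lag` are hypothetical helpers; treating missing or non-numeric components as 0 is an assumption, and pre-release tags are ignored.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'major.minor.patch'; missing or non-numeric parts count as 0."""
    nums = []
    for part in version.split(".")[:3]:
        match = re.match(r"\d+", part)
        nums.append(int(match.group()) if match else 0)
    nums += [0] * (3 - len(nums))
    return nums[0], nums[1], nums[2]


def version_lag(current: str, latest: str) -> float:
    """Score the gap by the most significant component that differs."""
    cur, new = parse_version(current), parse_version(latest)
    if cur[0] != new[0]:
        return 1.0
    if cur[1] != new[1]:
        return 0.6
    if cur[2] != new[2]:
        return 0.2
    return 0.0
```

Under this reading, `0.95.0` against `0.115.0` shares major version 0 and therefore scores 0.6, which matters for libraries that have not yet reached 1.0.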
+### Sub-Score 2: Downstream Exposure
+How many other services `DEPENDS_ON` this library in the Neo4j graph:
+```
+downstream_normalized = min(1.0, downstream_count / max_downstream_across_all_libs)
+```
+A library with 12 dependents when the most-connected library has 15 gets `downstream_normalized = 0.8`. This is an org-wide normalisation — the denominator is the most-exposed library in the entire graph.
+### Sub-Score 3: Historical Incident Rate
+How many `Incident` nodes have a `CAUSED_BY` edge to this library:
+```
+incident_rate = min(1.0, incident_count / 365)
+```
+This caps at 1.0 only at 365 or more incidents (one per day), which is theoretical. In practice the term is small: 5 incidents gives `incident_rate = 0.0137`, which adds about 0.003 to the final risk after the 0.25 weight, so version lag and downstream exposure dominate the score.
+### Poisson Forecast: Probability of Incident in Next 30 Days
+The Poisson model treats historical incidents as a counting process with rate λ:
+```
+λ = incident_count / 365   (incidents per day)
+P(at least one incident in next 30 days) = 1 − exp(−λ × 30)
+```
+This gives a concrete probability — not a vague "high risk" label — that can be shown in the UI:
+| incident_count | λ | P(incident in 30d) |
+|---|---|---|
+| 0 | 0 | 0% |
+| 1 | 0.00274 | 7.9% |
+| 2 | 0.00548 | 15.2% |
+| 5 | 0.01370 | 33.7% |
+| 12 | 0.03288 | 62.7% |
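The forecast is a one-line computation. The sketch below assumes, as the formula above does, that all recorded incidents fall within a 365-day observation window; `poisson_30d` here is a hypothetical function named after the `details` field.

```python
import math


def poisson_30d(incident_count: int, observation_days: int = 365,
                horizon_days: int = 30) -> float:
    """P(at least one incident in the horizon) for a Poisson process."""
    lam = incident_count / observation_days  # incidents per day
    return 1.0 - math.exp(-lam * horizon_days)
```

The probability of zero events in a Poisson process over time `t` is `exp(-λt)`, so its complement is the chance of one or more incidents.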
+### Neo4j Query
+```cypher
+MATCH (lib:Library)
+OPTIONAL MATCH (lib)<-[:DEPENDS_ON]-(downstream)
+OPTIONAL MATCH (lib)<-[:CAUSED_BY]-(inc:Incident)
+RETURN lib.name                              AS name,
+       coalesce(lib.version, '0.0.0')        AS current_version,
+       coalesce(lib.latest_version, '0.0.0') AS latest_version,
+       count(DISTINCT downstream)            AS downstream_count,
+       count(DISTINCT inc)                   AS incident_count
+```
+This is the only query to Neo4j in the algorithm — all scoring is done in Python after fetching.
+### Severity
+| `risk` | Severity |
+|---|---|
+| ≥ 0.70 | `critical` |
+| ≥ 0.50 | `high` |
+| ≥ 0.30 | `medium` |
+| < 0.30 | `low` |
+### Example
+`fastapi` library: 2 major versions behind, 8 downstream services, 2 incidents in Neo4j. Max downstream across all libs is 15.
+```
+version_lag           = 1.0      (major version gap)
+downstream_normalized = 8/15 = 0.533
+incident_rate         = 2/365 = 0.00548
+risk = 0.40×1.0 + 0.35×0.533 + 0.25×0.00548
+     = 0.400 + 0.187 + 0.001
+     = 0.588  →  high
+λ = 2/365 = 0.00548
+poisson_30d = 1 − exp(−0.00548 × 30) = 15.2%
+```
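The weighted sum in the example above can be reproduced with a short sketch. `dependency_risk` is a hypothetical function name; returning 0.0 exposure when the graph has no `DEPENDS_ON` edges is an assumption.

```python
from typing import Tuple

W_VERSION, W_DOWNSTREAM, W_INCIDENT = 0.40, 0.35, 0.25


def dependency_risk(version_lag: float, downstream_count: int,
                    max_downstream: int, incident_count: int) -> Tuple[float, str]:
    """Combine the three sub-scores into a risk in [0.0, 1.0] plus a severity."""
    if max_downstream > 0:
        downstream = min(1.0, downstream_count / max_downstream)
    else:
        downstream = 0.0  # assumption: empty dependency graph means no exposure
    incident_rate = min(1.0, incident_count / 365)
    risk = (W_VERSION * version_lag
            + W_DOWNSTREAM * downstream
            + W_INCIDENT * incident_rate)
    if risk >= 0.70:
        severity = "critical"
    elif risk >= 0.50:
        severity = "high"
    elif risk >= 0.30:
        severity = "medium"
    else:
        severity = "low"
    return risk, severity
```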
+---
+## Algorithm Interaction
+The four algorithms are independent and run on separate schedules, but they complement each other:
+```
+A query spike on team_infra (Z=4.8, high)
+  ↓
+Cross-reference gs:topics → "kubernetes" dominant this hour
+  ↓
+Staleness algorithm (ran last night) already flagged:
+  "k8s-ingress-runbook.md" → staleness_risk=0.72 (high)
+  ↓
+Dependency risk algorithm (ran last night) flagged:
+  "helm" library → risk=0.61 (high), poisson_30d=22%
+  ↓
+Three active anomaly_signals pointing at same root cause
+→ Alert feed surfaces them together, sorted by severity
+```
+No automated correlation is built yet — the signals exist separately in `anomaly_signals`. Grouping them is a UI/UX concern (future Area 4 extension).
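Since no correlation exists yet, the feed ordering is a plain sort. A sketch of "sorted by severity", assuming signals are dicts whose `detected_at` values are ISO 8601 strings in one consistent format; `alert_feed` is a hypothetical helper.

```python
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def alert_feed(signals: list) -> list:
    """Unresolved signals, most severe first, newest first within a severity."""
    active = [s for s in signals if not s["resolved"]]
    # Two stable sorts: secondary key (recency) first, primary key (severity) last.
    active.sort(key=lambda s: s["detected_at"], reverse=True)
    active.sort(key=lambda s: SEVERITY_RANK.get(s["severity"], len(SEVERITY_RANK)))
    return active
```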

Docs/anomaly-and-forecasting/03_api_reference.md ADDED

	@@ -0,0 +1,240 @@

+# 03 · API Reference — `/api/anomaly`
+**Router file:** `src/anomaly/router.py`
+**Registered in:** `main.py`
+**Auth:** All endpoints require a valid session cookie (`get_current_user`). Team isolation is enforced server-side for non-admin roles.
+---
+## `GET /api/anomaly/signals`
+List anomaly signals with optional filters. Non-admin users are automatically scoped to their own `team_id`.
+### Query Parameters
+| Param | Type | Default | Description |
+|---|---|---|---|
+| `type` | string | — | Filter by `signal_type` (`query_spike`, `query_drop`, `escalation_trend`, `staleness`, `dependency_risk`) |
+| `severity` | string | — | Filter by `severity` (`critical`, `high`, `medium`, `low`) |
+| `team_id` | string | — | Admin/org_admin only — query a specific team. Non-admins are always scoped to their own team. |
+| `resolved` | boolean | `false` | Include resolved signals |
+| `limit` | integer | `50` | Max rows returned (1–200) |
+### Response
+```json
+{
+  "signals": [
+    {
+      "id": "3f2a1b4c-...",
+      "team_id": "team_infra",
+      "signal_type": "query_spike",
+      "entity_type": "Team",
+      "entity_id": "team_infra",
+      "severity": "critical",
+      "score": 5.875,
+      "details": {
+        "z_score": 5.875,
+        "current_count": 72,
+        "baseline_mean": 25.0,
+        "baseline_stdev": 8.0,
+        "hour_bucket": "2026-05-17T14:00:00",
+        "window_hours": 335
+      },
+      "resolved": false,
+      "resolved_by": null,
+      "resolved_at": null,
+      "detected_at": "2026-05-17T14:15:03.221Z"
+    }
+  ],
+  "total": 1
+}
+```
+### Side Effect
+Triggers `broadcast_new_critical_signals()` as a FastAPI `BackgroundTask`. Rate-limited to once per 5 minutes — pushes `escalation_spike` or `knowledge_gap` WebSocket notifications to connected clients for new critical/high signals detected in the last 10 minutes.
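The 5-minute rate limit could be implemented in several ways. One minimal in-process sketch follows; `BroadcastLimiter` is a hypothetical class, and a single-process deployment is assumed (multiple workers would need shared state such as Redis).

```python
import time


class BroadcastLimiter:
    """Allow an action at most once per min_interval_s seconds."""

    def __init__(self, min_interval_s: float = 300.0, clock=time.monotonic):
        self._min_interval_s = min_interval_s
        self._clock = clock  # injectable for testing
        self._last = None

    def allow(self) -> bool:
        """Return True and record the time if enough time has passed."""
        now = self._clock()
        if self._last is not None and now - self._last < self._min_interval_s:
            return False
        self._last = now
        return True
```

Using a monotonic clock keeps the limiter correct across system clock adjustments.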
+---
+## `GET /api/anomaly/signals/summary`
+Returns unresolved signal counts grouped by type and severity. Used by the frontend to render the severity badge strip at the top of the Anomalies tab.
+### Response
+```json
+{
+  "total": 7,
+  "by_type": {
+    "query_spike": 2,
+    "staleness": 3,
+    "dependency_risk": 2
+  },
+  "by_severity": {
+    "critical": 1,
+    "high": 3,
+    "medium": 2,
+    "low": 1
+  }
+}
+```
+---
+## `PATCH /api/anomaly/signals/{signal_id}/resolve`
+Mark a signal as resolved. **Admin / org_admin only** — returns `403` for other roles.
+### Path Parameter
+| Param | Description |
+|---|---|
+| `signal_id` | UUID of the `anomaly_signals` row |
+### Response
+```json
+{ "ok": true }
+```
+Returns `404` if the signal does not exist or is already resolved.
+**Effect:** Sets `resolved = true`, `resolved_by = user.id`, `resolved_at = now()` on the row. The signal is excluded from all `resolved=false` queries going forward.
+---
+## `GET /api/anomaly/query-patterns`
+Returns hourly query counts for the last N days, with anomaly overlay markers. Powers `QuerySpikeChart.tsx`.
+### Query Parameters
+| Param | Type | Default | Description |
+|---|---|---|---|
+| `team_id` | string | — | Admin only. Non-admins use their session team. |
+| `days` | integer | `14` | Lookback window (1–90) |
+### Response
+```json
+{
+  "team_id": "team_infra",
+  "hourly": [
+    {
+      "hour": "2026-05-03T09:00:00+00:00",
+      "count": 18,
+      "escalations": 2,
+      "anomaly_score": null,
+      "anomaly_type": null,
+      "anomaly_severity": null
+    },
+    {
+      "hour": "2026-05-17T14:00:00+00:00",
+      "count": 72,
+      "escalations": 4,
+      "anomaly_score": 5.875,
+      "anomaly_type": "query_spike",
+      "anomaly_severity": "critical"
+    }
+  ]
+}
+```
+`anomaly_score`, `anomaly_type`, `anomaly_severity` are `null` for normal hours and populated for hours that have a matching unresolved `query_spike` or `query_drop` signal. The frontend uses these to render `ReferenceArea` overlays on the chart.
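A sketch of how the overlay fields might be joined onto hourly rows. `overlay_anomalies` is a hypothetical helper; it assumes rows from `query_events_hourly` and signals whose `details.hour_bucket` may lack a UTC offset, as in the examples in these docs.

```python
from datetime import datetime, timezone


def _utc(value: str) -> datetime:
    """Parse an ISO 8601 string, treating a missing offset as UTC."""
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def overlay_anomalies(hourly: list, signals: list) -> list:
    """Attach spike/drop markers to the hourly rows they refer to."""
    marks = {
        _utc(s["details"]["hour_bucket"]): s
        for s in signals
        if s["signal_type"] in ("query_spike", "query_drop") and not s["resolved"]
    }
    result = []
    for row in hourly:
        hit = marks.get(_utc(row["hour_bucket"]))
        result.append({
            "hour": row["hour_bucket"],
            "count": row["query_count"],
            "escalations": row["escalation_count"],
            "anomaly_score": hit["score"] if hit else None,
            "anomaly_type": hit["signal_type"] if hit else None,
            "anomaly_severity": hit["severity"] if hit else None,
        })
    return result
```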
+---
+## `GET /api/anomaly/staleness`
+Top documents by staleness risk score. Powers `StalenessRiskList.tsx`.
+### Query Parameters
+| Param | Type | Default | Description |
+|---|---|---|---|
+| `limit` | integer | `30` | Max rows (1–100) |
+### Response
+```json
+{
+  "documents": [
+    {
+      "entity_id": "confluence:12345",
+      "score": 0.670,
+      "details": {
+        "title": "Kubernetes Ingress Setup Guide",
+        "age_days": 240,
+        "age_factor": 0.931,
+        "query_pressure": 0.72,
+        "updated_at": "2025-09-20T10:00:00"
+      },
+      "detected_at": "2026-05-17T03:04:22.100Z"
+    }
+  ],
+  "total": 1
+}
+```
+Results are sorted descending by `score`. The `details` object contains all the inputs to the staleness formula, making it possible to explain the score in a tooltip.
+---
+## `GET /api/anomaly/dependency-risk`
+Libraries scored by dependency risk, including Poisson incident probability. Powers the risk column added to `DependencyTracker.tsx`.
+### Response
+```json
+{
+  "libraries": [
+    {
+      "entity_id": "fastapi",
+      "score": 0.588,
+      "details": {
+        "library_name": "fastapi",
+        "current_version": "0.95.0",
+        "latest_version": "0.115.0",
+        "version_lag": 1.0,
+        "downstream_count": 8,
+        "downstream_normalized": 0.533,
+        "incident_count": 2,
+        "incident_rate": 0.0055,
+        "poisson_30d": 0.153
+      },
+      "detected_at": "2026-05-17T03:34:11.200Z"
+    }
+  ],
+  "total": 1
+}
+```
+`poisson_30d` is the probability (0.0–1.0) that this library causes at least one incident in the next 30 days, modelled as a Poisson process on historical `CAUSED_BY` edges in Neo4j. Rendered in the UI as a percentage (e.g. "15% in 30d").
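For a Poisson process, the probability of at least one event in a window is `1 - exp(-rate * window)`. The sketch below applies that identity to the sample values. Treating `incident_rate` as incidents per day is an inference from the sample (2 incidents at a rate of 0.0055), and the function names are hypothetical rather than the ones used in `src/anomaly/tasks.py`.

```python
import math


def poisson_at_least_one(rate_per_day: float, horizon_days: float = 30.0) -> float:
    """P(N >= 1) = 1 - P(N = 0) = 1 - exp(-rate * horizon) for a Poisson process."""
    if rate_per_day < 0 or horizon_days < 0:
        raise ValueError("rate and horizon must be non-negative")
    return 1.0 - math.exp(-rate_per_day * horizon_days)


def format_risk(probability: float, horizon_days: int = 30) -> str:
    """Render the probability in the UI style described above, e.g. '15% in 30d'."""
    return f"{round(probability * 100)}% in {horizon_days}d"


if __name__ == "__main__":
    p = poisson_at_least_one(0.0055, 30)
    print(format_risk(p))
```

With the rounded sample rate this gives roughly 0.152, slightly below the 0.153 in the response. That is consistent with the stored `incident_rate` having been rounded after `poisson_30d` was computed, though the source does not confirm it.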
+---
+## Error Responses
+All endpoints return standard HTTP error codes:
+| Code | Cause |
+|---|---|
+| `401` | No valid session cookie |
+| `403` | Insufficient role (e.g. non-admin calling PATCH /resolve) |
+| `404` | Signal not found or already resolved |
+| `500` | Unexpected internal error (Supabase/Neo4j unavailable) |
+All 5xx errors are logged via `src/utils/logger.py` with request ID for tracing.
+---
+## Authentication Notes
+The `/api/anomaly/signals` endpoint accepts an optional `team_id` query param:
+- **admin / org_admin roles:** `team_id` param is respected — they can query any team's signals
+- **All other roles:** `team_id` is overridden with `user.get("team_id")` from the session — the client cannot change this
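+The `dependency_risk` signals have `team_id = null` (library risk is org-wide). They are served by `GET /api/anomaly/dependency-risk`, which applies no team filter, so they appear for all roles. A team-scoped `/signals` query filters on `team_id` and therefore does not return them.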
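The scoping rule can be sketched as a small pure function. This is an illustration of the rule as documented, not the router code itself; `effective_team_id` and `ADMIN_ROLES` are hypothetical names.

```python
from typing import Optional

# Roles allowed to query across teams, per the notes above.
ADMIN_ROLES = ("admin", "org_admin")


def effective_team_id(user: dict, requested_team_id: Optional[str]) -> Optional[str]:
    """Return the team filter to apply for this caller.

    Admins keep whatever they asked for (None means "all teams").
    Everyone else is pinned to the team stored in their session,
    and the client-supplied value is ignored entirely.
    """
    if user.get("role") in ADMIN_ROLES:
        return requested_team_id
    return user.get("team_id")


if __name__ == "__main__":
    member = {"role": "member", "team_id": "team_infra"}
    print(effective_team_id(member, "team_payments"))
```

Keeping the decision in one function makes it straightforward to unit-test and to reuse across endpoints, so the rule cannot drift between handlers.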

Docs/anomaly-and-forecasting/04_jobs_and_scheduling.md ADDED

	@@ -0,0 +1,198 @@

+# 04 · Celery Jobs & Scheduling
+---
+## Overview
+Three Celery tasks handle all anomaly detection. They run in the background, writing to `anomaly_signals` in Supabase. The FastAPI app only reads from that table — no detection logic runs in the request path.
+```
+Every 15 minutes   poll_metrics_anomalies    → Z-score + escalation trend
+Daily 03:00 UTC    compute_staleness_scores  → Staleness risk for all documents
+Daily 03:30 UTC    compute_dependency_risk   → Library risk from Neo4j
+```
+---
+## Task Definitions
+All tasks are defined in `src/celery_app.py`.
+### `poll_metrics_anomalies`
+```python
+@shared_task(queue="polling", bind=True, max_retries=3)
+def poll_metrics_anomalies(self):
+    from src.anomaly.tasks import run_zscore_anomaly_detection
+    run_zscore_anomaly_detection()
+```
+| Property | Value |
+|---|---|
+| Queue | `polling` (priority 3) |
+| Schedule | Every `settings.sync.metrics_poll_interval` seconds (default 900s / 15 min) |
+| Max retries | 3, with 120s countdown on failure |
+| Runs | `run_zscore_anomaly_detection()` → `_check_escalation_trend()` |
+| Writes | `query_spike`, `query_drop`, `escalation_trend` signals |
+The interval is controlled by the `SYNC__METRICS_POLL_INTERVAL` env var (default 900). Lower it in staging to test faster:
+```env
+SYNC__METRICS_POLL_INTERVAL=60
+```
+### `compute_staleness_scores`
+```python
+@shared_task(queue="low", bind=True, max_retries=2)
+def compute_staleness_scores(self):
+    from src.anomaly.tasks import run_staleness_scoring
+    run_staleness_scoring()
+```
+| Property | Value |
+|---|---|
+| Queue | `low` (priority 1) |
+| Schedule | Daily at 03:00 UTC (`crontab(hour=3, minute=0)`) |
+| Max retries | 2, with 300s countdown |
+| Reads | `documents` table (Supabase), `query_events` table (Supabase) |
+| Writes | `staleness` signals, purges `query_events` > 90 days |
+### `compute_dependency_risk`
+```python
+@shared_task(queue="low", bind=True, max_retries=2)
+def compute_dependency_risk(self):
+    from src.anomaly.tasks import run_dependency_risk_modeling
+    run_dependency_risk_modeling()
+```
+| Property | Value |
+|---|---|
+| Queue | `low` (priority 1) |
+| Schedule | Daily at 03:30 UTC (`crontab(hour=3, minute=30)`) |
+| Max retries | 2, with 300s countdown |
+| Reads | Neo4j — `Library`, `DEPENDS_ON`, `CAUSED_BY` edges |
+| Writes | `dependency_risk` signals |
+---
+## Beat Schedule Registration
+Beat entries are registered in `setup_periodic_tasks()` in `src/celery_app.py`:
+```python
+@app.on_after_finalize.connect
+def setup_periodic_tasks(sender, **kwargs):
+    # ... existing tasks ...
+    from celery.schedules import crontab
+    sender.add_periodic_task(
+        crontab(hour=3, minute=0),
+        compute_staleness_scores.s(),
+        name="compute-staleness-scores",
+    )
+    sender.add_periodic_task(
+        crontab(hour=3, minute=30),
+        compute_dependency_risk.s(),
+        name="compute-dependency-risk",
+    )
+```
+The `crontab` import is inside the function to avoid a module-level circular import with `celery.schedules`. This mirrors the pattern already used in `ingestion/jobs/celery_app.py`.
+---
+## Running Workers
+Start a worker that handles the `polling` and `low` queues (required for anomaly tasks):
+```bash
+celery -A src.celery_app worker -Q polling,low --loglevel=info
+```
+Start the Beat scheduler:
+```bash
+celery -A src.celery_app beat --loglevel=info
+```
+For local development, combine both in one process:
+```bash
+celery -A src.celery_app worker --beat -Q polling,low,default --loglevel=info
+```
+---
+## Manually Triggering Tasks
+Useful for testing without waiting for the schedule:
+```bash
+# Trigger Z-score detection immediately
+celery -A src.celery_app call src.celery_app.poll_metrics_anomalies
+# Trigger staleness scoring
+celery -A src.celery_app call src.celery_app.compute_staleness_scores
+# Trigger dependency risk
+celery -A src.celery_app call src.celery_app.compute_dependency_risk
+```
+Or from a Python shell:
+```python
+from src.celery_app import poll_metrics_anomalies, compute_staleness_scores
+poll_metrics_anomalies.delay()
+compute_staleness_scores.delay()
+```
+---
+## Monitoring Task Execution
+Check the `anomaly_signals` table after a run:
+```sql
+SELECT signal_type, severity, score, entity_id, detected_at
+FROM   anomaly_signals
+WHERE  detected_at > NOW() - INTERVAL '1 hour'
+ORDER  BY detected_at DESC;
+```
+Check for failures in the Celery dead letter queue (`ingest:deadletter` Redis key) or via Flower:
+```bash
+pip install flower
+celery -A src.celery_app flower --port=5555
+# Open http://localhost:5555
+```
+---
+## Failure Modes
+| Task | Failure Mode | Effect |
+|---|---|---|
+| `poll_metrics_anomalies` | Supabase unavailable | `get_all_team_ids()` returns `[]` → task is a no-op; the error is swallowed in `db.py`, so no Celery retry fires and the next scheduled run tries again |
+| `poll_metrics_anomalies` | Fewer than 24h of data | Teams with fewer than 24 hourly rows are skipped: not enough baseline, so no false alarms |
+| `compute_staleness_scores` | `documents` table empty | Early return, no signals written |
+| `compute_staleness_scores` | `query_events` table empty | All teams get `query_pressure = 0` → staleness_risk = 0 → no signals |
+| `compute_dependency_risk` | Neo4j unavailable | `_fetch_library_risk_rows()` returns `[]` → early return |
+| Any task | Unhandled exception | Celery retries up to `max_retries`, then marks the task as `FAILURE` |
+All per-entity errors inside loops are caught individually (`except: logger.warning(...)`) — a single bad document or library never aborts the entire run.
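The per-entity isolation pattern can be sketched as follows. This is a generic illustration, not the code in `src/anomaly/tasks.py`; `process_each` and `score_one` are hypothetical names.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


def process_each(entities: Iterable[dict], score_one: Callable[[dict], None]) -> int:
    """Score every entity, logging and skipping the ones that fail.

    Returns the number of failures so the caller can report them.
    """
    failures = 0
    for entity in entities:
        try:
            score_one(entity)
        except Exception:
            # One bad document or library must not abort the whole run.
            failures += 1
            logger.warning("skipping entity %s", entity.get("id"), exc_info=True)
    return failures


if __name__ == "__main__":
    scored = []

    def score_one(entity: dict) -> None:
        if entity.get("age_days") is None:
            raise ValueError("missing age_days")
        scored.append(entity["id"])

    docs = [{"id": "a", "age_days": 10}, {"id": "b"}, {"id": "c", "age_days": 3}]
    print(process_each(docs, score_one), scored)
```

Catching `Exception` rather than using a bare `except:` keeps `KeyboardInterrupt` and `SystemExit` propagating, which matters when a worker is being shut down.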
+---
+## Adding a New Detection Algorithm
+1. Add the algorithm function to `src/anomaly/tasks.py`
+2. Add a new `@shared_task` to `src/celery_app.py`
+3. Register with `sender.add_periodic_task(...)` in `setup_periodic_tasks()`
+4. Use `insert_signal()` from `src/anomaly/db.py` with an appropriate `signal_type` value
+5. Add a new `GET /api/anomaly/<endpoint>` if the frontend needs to query the results separately
+The `signal_type` field in `anomaly_signals` is free-form text — no migration needed for a new type. Add it to the `signal_type` documentation in `01_data_layer.md`.

Docs/anomaly-and-forecasting/README.md ADDED

	@@ -0,0 +1,86 @@

+# Area 4 — Anomaly Detection & Forecasting
+> **Status: Implemented** on branch `anomaly-and-forecasting`.
+> Previously marked "Planned Extension" in `03_analytics_and_intelligence.md`.
+---
+## What This Area Does
+Area 4 consumes the raw event streams produced by Areas 1–3 and adds a detection and forecasting layer on top. It answers questions that static dashboards cannot:
+- Is this query spike normal, or did something break?
+- Which documents are dangerously stale *and* heavily queried?
+- Is our escalation rate trending worse or better?
+- Which library is most likely to cause an incident in the next 30 days?
+None of this required new data pipelines. All signals already existed in Redis, Supabase, and Neo4j — Area 4 adds the time-series persistence and mathematical models to make them actionable.
+---
+## Document Map
+| File | Contents |
+|---|---|
+| [`01_data_layer.md`](./01_data_layer.md) | Three new Supabase tables, PostgreSQL function, migration guide, data flow |
+| [`02_detection_algorithms.md`](./02_detection_algorithms.md) | Z-score spike detection, escalation trend, staleness scoring, dependency risk + Poisson model |
+| [`03_api_reference.md`](./03_api_reference.md) | All six `/api/anomaly` endpoints with request/response shapes |
+| [`04_jobs_and_scheduling.md`](./04_jobs_and_scheduling.md) | Celery Beat schedule, task configuration, manual trigger guide |
+---
+## File Map (Implementation)
+```
+supabase/
+  anomaly_migration.sql           ← Run first: 3 tables + aggregate function
+src/anomaly/
+  __init__.py
+  db.py                           ← All Supabase reads/writes
+  tasks.py                        ← Detection algorithms (stdlib only)
+  router.py                       ← FastAPI /api/anomaly endpoints
+  notifier.py                     ← In-process WebSocket broadcast bridge
+agent/api.py                      ← +9 lines: fire-and-forget event persist
+src/celery_app.py                 ← poll_metrics_anomalies implemented;
+                                     2 new daily tasks added
+main.py                           ← anomaly_router registered
+```
+---
+## How It Fits Into the Five-Area System
+```
+Area 1 (RAG)          → query retrieval confidence feeds staleness pressure signal
+Area 2 (Pipelines)    → document updated_at feeds staleness age factor
+Area 3 (Analytics)    → query events → query_events table → Z-score + trend models
+                        escalation events → escalation_trend detection
+                        Neo4j graph → dependency risk model
+Area 4 (This)         → anomaly_signals table → API → frontend Anomalies tab
+                        WebSocket push on critical/high signals → NotificationCenter
+Area 5 (Graph)        → Library DEPENDS_ON edges consumed by dependency risk model
+                        Incident CAUSED_BY edges feed Poisson λ estimate
+```
+---
+## Design Principles
+**1. Never break the query path.**
+Every write from `agent/api.py` is fire-and-forget with `asyncio.ensure_future` and a catch-all `except Exception: pass`. If Supabase is unavailable, the SSE stream is unaffected.
+**2. No new Python dependencies.**
+Z-score uses `statistics.mean` / `statistics.stdev` (stdlib). Staleness and Poisson use `math.exp` (stdlib). No scikit-learn, no numpy.
+**3. Deduplication at the signal level.**
+`insert_signal()` checks for an identical unresolved signal for the same (type, team, entity) in the last 2 hours before inserting. A 15-minute Celery task cannot flood the table.
+**4. Team isolation everywhere.**
+Non-admin API callers are scoped to `user.get("team_id")` regardless of query params. Admin and org_admin can query across teams.
+**5. Celery workers cannot push WebSockets directly.**
+Workers run in a separate OS process; `_notification_clients` in `src/ws/router.py` is empty there. Real-time push uses an in-process FastAPI `BackgroundTask` via `src/anomaly/notifier.py`, rate-limited to once per 5 minutes.
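Principle 2 can be illustrated with a stdlib-only Z-score check. The sketch below is a simplified stand-in for the detection logic: the helper names are hypothetical, while the severity thresholds (5.0, 4.0, 3.5) mirror `_zscore_severity` in `src/anomaly/tasks.py`.

```python
import statistics
from typing import Optional, Sequence


def zscore(baseline: Sequence[float], current: float) -> Optional[float]:
    """Z-score of `current` against `baseline`, or None if no score can be computed."""
    if len(baseline) < 2:
        return None  # statistics.stdev needs at least two points
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return None  # a flat baseline gives no meaningful scale
    return (current - statistics.mean(baseline)) / stdev


def severity(z: float) -> str:
    """Map |z| to a severity label using the thresholds from tasks.py."""
    abs_z = abs(z)
    if abs_z >= 5.0:
        return "critical"
    if abs_z >= 4.0:
        return "high"
    if abs_z >= 3.5:
        return "medium"
    return "low"


if __name__ == "__main__":
    hourly_counts = [10, 12, 14, 16, 18]
    z = zscore(hourly_counts, 30)
    print(round(z, 2), severity(z))
```

Returning `None` for a short or flat baseline lets the caller skip the team quietly instead of dividing by zero.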

agent/api.py CHANGED

@@ -84,6 +84,15 @@ async def _store_query_event(
                 await r.lpush("gs:escalations", json.dumps(escalation))
                 await r.ltrim("gs:escalations", 0, 499)
+            # Persist to Supabase for time-series anomaly detection.
+            # Fire-and-forget: never allowed to fail the SSE stream.
+            try:
+                import asyncio as _asyncio
+                from src.anomaly.db import async_upsert_query_event
+                _asyncio.ensure_future(async_upsert_query_event(event))
+            except Exception:
+                pass
         finally:
             await r.aclose()
     except Exception as exc:

main.py CHANGED

@@ -17,6 +17,7 @@ from src.jira_agent.router import router as jira_router
 from src.utils.middleware import RequestLoggingMiddleware
 from src.auth.router import router as auth_router
 from src.analytics.router import router as analytics_router
+from src.anomaly.router import router as anomaly_router
 from src.admin.router import router as admin_router
 from src.admin.users_api import router as admin_users_router
 from src.admin.users_api import audit_router as admin_audit_router
@@ -59,6 +60,7 @@ app.add_middleware(RequestLoggingMiddleware)
 # ---------------------------------------------------------------------------
 app.include_router(auth_router)
 app.include_router(analytics_router)
+app.include_router(anomaly_router)
 app.include_router(admin_router)
 app.include_router(admin_users_router)
 app.include_router(admin_audit_router)

src/anomaly/__init__.py ADDED

File without changes

src/anomaly/db.py ADDED

	@@ -0,0 +1,262 @@

+"""Supabase persistence layer for anomaly detection.
+Every public function is wrapped in try/except so callers never crash
+if Supabase is unavailable or misconfigured.
+"""
+from __future__ import annotations
+import asyncio
+import logging
+from datetime import datetime, timedelta
+logger = logging.getLogger(__name__)
+def _sb():
+    """Return the Supabase sync client (service role key, bypasses RLS)."""
+    from src.auth.db import _client
+    return _client()
+# ── query_events ──────────────────────────────────────────────────────────────
+def upsert_query_event(event: dict) -> None:
+    """Write one query event row, idempotent on event_id."""
+    try:
+        _sb().table("query_events").upsert(
+            {
+                "event_id":        event["id"],
+                "team_id":         event.get("team_id", "unknown"),
+                "session_id":      event.get("session_id"),
+                "success":         event.get("success", True),
+                "duration_ms":     event.get("duration_ms"),
+                "escalated":       event.get("escalated", False),
+                "guardrail_score": event.get("guardrail_score"),
+                "agent_metrics":   event.get("agents", {}),
+                "created_at":      event.get("created_at", datetime.utcnow().isoformat()),
+            },
+            on_conflict="event_id",
+            ignore_duplicates=True,
+        ).execute()
+    except Exception:
+        logger.warning("anomaly_db: upsert_query_event failed", exc_info=True)
+async def async_upsert_query_event(event: dict) -> None:
+    """Async shim — runs the sync upsert in the default executor."""
+    loop = asyncio.get_running_loop()
+    await loop.run_in_executor(None, upsert_query_event, event)
+# ── query_events_hourly ───────────────────────────────────────────────────────
+def aggregate_hourly(team_id: str, hour_bucket: str) -> None:
+    """Recompute one (team_id, hour_bucket) row via the Supabase RPC."""
+    try:
+        _sb().rpc("aggregate_hourly_bucket", {
+            "p_team_id": team_id,
+            "p_hour":    hour_bucket,
+        }).execute()
+    except Exception:
+        logger.warning(
+            "anomaly_db: aggregate_hourly failed team=%s hour=%s",
+            team_id, hour_bucket, exc_info=True,
+        )
+def get_hourly_counts(team_id: str, days: int = 14) -> list[dict]:
+    """Return hourly aggregate rows for the last N days for a team."""
+    try:
+        cutoff = (datetime.utcnow() - timedelta(days=days)).isoformat()
+        result = (
+            _sb()
+            .table("query_events_hourly")
+            .select("hour_bucket,query_count,escalation_count")
+            .eq("team_id", team_id)
+            .gte("hour_bucket", cutoff)
+            .order("hour_bucket", desc=False)
+            .execute()
+        )
+        return result.data or []
+    except Exception:
+        logger.warning("anomaly_db: get_hourly_counts failed", exc_info=True)
+        return []
+def get_all_team_ids() -> list[str]:
+    """Return distinct team_ids with events in the last 90 days."""
+    try:
+        cutoff = (datetime.utcnow() - timedelta(days=90)).isoformat()
+        result = (
+            _sb()
+            .table("query_events")
+            .select("team_id")
+            .gte("created_at", cutoff)
+            .execute()
+        )
+        return list({r["team_id"] for r in (result.data or [])})
+    except Exception:
+        logger.warning("anomaly_db: get_all_team_ids failed", exc_info=True)
+        return []
+# ── anomaly_signals ───────────────────────────────────────────────────────────
+def insert_signal(
+    signal_type: str,
+    severity: str,
+    score: float,
+    details: dict,
+    team_id: str | None = None,
+    entity_type: str | None = None,
+    entity_id: str | None = None,
+) -> dict | None:
+    """Insert one anomaly signal with 2-hour dedup suppression.
+    If an identical unresolved signal already exists for the same
+    (signal_type, team_id, entity_id) within the last 2 hours, the
+    insert is skipped and None is returned.
+    """
+    try:
+        two_hours_ago = (datetime.utcnow() - timedelta(hours=2)).isoformat()
+        dupe_q = (
+            _sb()
+            .table("anomaly_signals")
+            .select("id")
+            .eq("signal_type", signal_type)
+            .eq("resolved", False)
+            .gte("detected_at", two_hours_ago)
+        )
+        if team_id:
+            dupe_q = dupe_q.eq("team_id", team_id)
+        if entity_id:
+            dupe_q = dupe_q.eq("entity_id", entity_id)
+        if dupe_q.execute().data:
+            return None  # suppress duplicate
+        row: dict = {
+            "signal_type": signal_type,
+            "severity":    severity,
+            "score":       score,
+            "details":     details,
+        }
+        if team_id:     row["team_id"]     = team_id
+        if entity_type: row["entity_type"] = entity_type
+        if entity_id:   row["entity_id"]   = entity_id
+        result = _sb().table("anomaly_signals").insert(row).execute()
+        return result.data[0] if result.data else None
+    except Exception:
+        logger.warning("anomaly_db: insert_signal failed type=%s", signal_type, exc_info=True)
+        return None
+def get_signals(
+    team_id: str | None,
+    severity: str | None,
+    signal_type: str | None,
+    resolved: bool,
+    limit: int,
+) -> list[dict]:
+    try:
+        q = (
+            _sb()
+            .table("anomaly_signals")
+            .select("*")
+            .eq("resolved", resolved)
+            .order("detected_at", desc=True)
+            .limit(limit)
+        )
+        if team_id:     q = q.eq("team_id", team_id)
+        if severity:    q = q.eq("severity", severity)
+        if signal_type: q = q.eq("signal_type", signal_type)
+        return q.execute().data or []
+    except Exception:
+        logger.warning("anomaly_db: get_signals failed", exc_info=True)
+        return []
+def resolve_signal(signal_id: str, resolver_user_id: str) -> bool:
+    try:
+        result = _sb().table("anomaly_signals").update({
+            "resolved":    True,
+            "resolved_by": resolver_user_id,
+            "resolved_at": datetime.utcnow().isoformat(),
+        }).eq("id", signal_id).eq("resolved", False).execute()
+        # No updated rows means the signal does not exist or was already resolved.
+        return bool(result.data)
+    except Exception:
+        logger.warning("anomaly_db: resolve_signal failed id=%s", signal_id, exc_info=True)
+        return False
+def get_signals_summary() -> dict:
+    """Return unresolved signal counts grouped by type and severity."""
+    try:
+        result = (
+            _sb()
+            .table("anomaly_signals")
+            .select("signal_type,severity")
+            .eq("resolved", False)
+            .execute()
+        )
+        rows = result.data or []
+        by_type: dict[str, int] = {}
+        by_severity: dict[str, int] = {}
+        for r in rows:
+            by_type[r["signal_type"]] = by_type.get(r["signal_type"], 0) + 1
+            by_severity[r["severity"]] = by_severity.get(r["severity"], 0) + 1
+        return {"total": len(rows), "by_type": by_type, "by_severity": by_severity}
+    except Exception:
+        logger.warning("anomaly_db: get_signals_summary failed", exc_info=True)
+        return {"total": 0, "by_type": {}, "by_severity": {}}
+def get_staleness_top(limit: int = 30) -> list[dict]:
+    try:
+        result = (
+            _sb()
+            .table("anomaly_signals")
+            .select("entity_id,score,details,detected_at")
+            .eq("signal_type", "staleness")
+            .eq("resolved", False)
+            .order("score", desc=True)
+            .limit(limit)
+            .execute()
+        )
+        return result.data or []
+    except Exception:
+        logger.warning("anomaly_db: get_staleness_top failed", exc_info=True)
+        return []
+def get_dependency_risk(limit: int = 50) -> list[dict]:
+    try:
+        result = (
+            _sb()
+            .table("anomaly_signals")
+            .select("entity_id,score,details,detected_at")
+            .eq("signal_type", "dependency_risk")
+            .eq("resolved", False)
+            .order("score", desc=True)
+            .limit(limit)
+            .execute()
+        )
+        return result.data or []
+    except Exception:
+        logger.warning("anomaly_db: get_dependency_risk failed", exc_info=True)
+        return []
+def purge_old_events() -> int:
+    """Delete query_events older than 90 days. Returns approximate deleted count."""
+    try:
+        cutoff = (datetime.utcnow() - timedelta(days=90)).isoformat()
+        result = _sb().table("query_events").delete().lt("created_at", cutoff).execute()
+        deleted = len(result.data or [])
+        if deleted:
+            logger.info("anomaly_db: purged %d old query_events", deleted)
+        return deleted
+    except Exception:
+        logger.warning("anomaly_db: purge_old_events failed", exc_info=True)
+        return 0
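
The 2-hour suppression in `insert_signal()` can be illustrated without a database. The sketch below uses a hypothetical in-memory store (`InMemorySignalStore` is not part of the codebase) and a simplification: it matches the full `(signal_type, team_id, entity_id)` key strictly, whereas the real function skips the `team_id` and `entity_id` filters when those arguments are `None`.

```python
from datetime import datetime, timedelta
from typing import List, Optional

DEDUP_WINDOW = timedelta(hours=2)  # same window insert_signal() uses


class InMemorySignalStore:
    """Hypothetical stand-in for the anomaly_signals table (illustration only)."""

    def __init__(self) -> None:
        self.rows: List[dict] = []

    def insert_signal(self, signal_type: str, team_id: Optional[str],
                      entity_id: Optional[str], now: datetime) -> Optional[dict]:
        cutoff = now - DEDUP_WINDOW
        key = (signal_type, team_id, entity_id)
        for row in self.rows:
            same_key = (row["signal_type"], row["team_id"], row["entity_id"]) == key
            if same_key and not row["resolved"] and row["detected_at"] >= cutoff:
                return None  # an unresolved twin exists inside the window
        row = {"signal_type": signal_type, "team_id": team_id,
               "entity_id": entity_id, "resolved": False, "detected_at": now}
        self.rows.append(row)
        return row


if __name__ == "__main__":
    store = InMemorySignalStore()
    start = datetime(2026, 5, 17, 14, 0)
    # Simulate a 15-minute poller firing ten times while the spike persists.
    results = [
        store.insert_signal("query_spike", "team_infra", "team_infra",
                            start + timedelta(minutes=15 * i))
        for i in range(10)
    ]
    print(sum(1 for r in results if r is not None))
```

In this simulation only the first run and the run just past the 2-hour mark insert a row; every run in between is suppressed.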

src/anomaly/notifier.py ADDED

	@@ -0,0 +1,74 @@

+"""In-process WebSocket broadcast bridge for anomaly signals.
+Celery workers run in a separate OS process, so `_notification_clients`
+in src/ws/router.py is always empty there. Real-time push is handled here:
+a FastAPI BackgroundTask (running inside the API server process) polls for
+recently-detected critical/high signals and calls broadcast_notification().
+Rate-limited to one check per 5 minutes via a module-level timestamp.
+"""
+from __future__ import annotations
+import logging
+from datetime import datetime, timedelta
+logger = logging.getLogger(__name__)
+_last_notified: datetime = datetime.utcnow() - timedelta(minutes=5)
+async def broadcast_new_critical_signals() -> None:
+    """Check for new critical/high anomaly signals and push via WebSocket.
+    Called as a BackgroundTask from GET /api/anomaly/signals.
+    No-ops if fewer than 5 minutes have elapsed since the last dispatch.
+    """
+    global _last_notified
+    now = datetime.utcnow()
+    if (now - _last_notified).total_seconds() < 300:
+        return
+    _last_notified = now
+    try:
+        from src.anomaly.db import get_signals
+        from src.ws.router import broadcast_notification
+        recent_signals = get_signals(
+            team_id=None,
+            severity=None,
+            signal_type=None,
+            resolved=False,
+            limit=20,
+        )
+        ten_minutes_ago = now - timedelta(minutes=10)
+        for sig in recent_signals:
+            if sig.get("severity") not in ("critical", "high"):
+                continue
+            try:
+                detected = datetime.fromisoformat(
+                    str(sig["detected_at"]).replace("Z", "+00:00")
+                ).replace(tzinfo=None)
+            except Exception:
+                continue
+            if detected < ten_minutes_ago:
+                continue
+            ws_type = (
+                "escalation_spike"
+                if sig["signal_type"] in ("query_spike", "query_drop", "escalation_trend")
+                else "knowledge_gap"
+            )
+            entity_suffix = f" — {sig['entity_id']}" if sig.get("entity_id") else ""
+            await broadcast_notification({
+                "type":      ws_type,
+                "message":   sig["signal_type"].replace("_", " ").title() + entity_suffix,
+                "severity":  sig["severity"],
+                "timestamp": sig["detected_at"],
+            })
+    except Exception:
+        logger.warning("notifier: broadcast_new_critical_signals failed", exc_info=True)

src/anomaly/router.py ADDED

	@@ -0,0 +1,135 @@

+"""Anomaly detection API — /api/anomaly endpoints."""
+from __future__ import annotations
+from fastapi import APIRouter, BackgroundTasks, Depends, HTTPException, Query
+from src.auth.deps import get_current_user, require_role
+from src.utils.logger import get_logger
+logger = get_logger(__name__)
+router = APIRouter(prefix="/api/anomaly", tags=["anomaly"])
+@router.get("/signals")
+async def list_signals(
+    background_tasks: BackgroundTasks,
+    team_id:     str | None = Query(default=None),
+    severity:    str | None = Query(default=None),
+    signal_type: str | None = Query(default=None, alias="type"),
+    resolved:    bool       = Query(default=False),
+    limit:       int        = Query(default=50, ge=1, le=200),
+    user: dict = Depends(get_current_user),
+) -> dict:
+    # Non-admin users are scoped to their own team only
+    effective_team = team_id
+    if user.get("role") not in ("admin", "org_admin"):
+        effective_team = user.get("team_id")
+    from src.anomaly.db import get_signals
+    signals = get_signals(
+        team_id=effective_team,
+        severity=severity,
+        signal_type=signal_type,
+        resolved=resolved,
+        limit=limit,
+    )
+    # Dispatch WebSocket push for new critical/high signals (in-process, rate-limited)
+    from src.anomaly.notifier import broadcast_new_critical_signals
+    background_tasks.add_task(broadcast_new_critical_signals)
+    return {"signals": signals, "total": len(signals)}
+@router.get("/signals/summary")
+async def signals_summary(
+    user: dict = Depends(get_current_user),
+) -> dict:
+    from src.anomaly.db import get_signals_summary
+    return get_signals_summary()
+@router.patch("/signals/{signal_id}/resolve")
+async def resolve_signal(
+    signal_id: str,
+    user: dict = Depends(require_role("admin", "org_admin")),
+) -> dict:
+    from src.anomaly.db import resolve_signal
+    ok = resolve_signal(signal_id, resolver_user_id=user["id"])
+    if not ok:
+        raise HTTPException(status_code=404, detail="Signal not found or already resolved")
+    return {"ok": True}
+@router.get("/query-patterns")
+async def query_patterns(
+    team_id: str | None = Query(default=None),
+    days:    int        = Query(default=14, ge=1, le=90),
+    user: dict = Depends(get_current_user),
+) -> dict:
+    """Hourly query counts with anomaly overlay markers for QuerySpikeChart."""
+    from src.anomaly.db import get_hourly_counts, get_signals
+    effective_team = team_id
+    if user.get("role") not in ("admin", "org_admin"):
+        # Never fall back to the client-supplied team_id for non-admins.
+        effective_team = user.get("team_id")
+    if not effective_team:
+        return {"hourly": [], "team_id": None}
+    hourly  = get_hourly_counts(effective_team, days=days)
+    signals = get_signals(
+        team_id=effective_team,
+        severity=None,
+        signal_type=None,
+        resolved=False,
+        limit=200,
+    )
+    # Build a lookup of anomaly info keyed by "YYYY-MM-DDTHH" prefix
+    anomaly_map: dict[str, dict] = {}
+    for sig in signals:
+        if sig["signal_type"] in ("query_spike", "query_drop"):
+            hb = (sig.get("details") or {}).get("hour_bucket", "")
+            if hb:
+                anomaly_map[hb[:13]] = {
+                    "score":    sig["score"],
+                    "type":     sig["signal_type"],
+                    "severity": sig["severity"],
+                }
+    result = []
+    for row in hourly:
+        hb_str  = str(row["hour_bucket"])
+        anomaly = anomaly_map.get(hb_str[:13])
+        result.append({
+            "hour":             hb_str,
+            "count":            row["query_count"],
+            "escalations":      row["escalation_count"],
+            "anomaly_score":    anomaly["score"]    if anomaly else None,
+            "anomaly_type":     anomaly["type"]     if anomaly else None,
+            "anomaly_severity": anomaly["severity"] if anomaly else None,
+        })
+    return {"hourly": result, "team_id": effective_team}
+@router.get("/staleness")
+async def staleness_list(
+    limit: int = Query(default=30, ge=1, le=100),
+    user: dict = Depends(get_current_user),
+) -> dict:
+    from src.anomaly.db import get_staleness_top
+    items = get_staleness_top(limit=limit)
+    return {"documents": items, "total": len(items)}
+@router.get("/dependency-risk")
+async def dependency_risk(
+    user: dict = Depends(get_current_user),
+) -> dict:
+    from src.anomaly.db import get_dependency_risk
+    items = get_dependency_risk()
+    return {"libraries": items, "total": len(items)}

src/anomaly/tasks.py ADDED

	@@ -0,0 +1,406 @@

+"""Anomaly detection algorithms — called from Celery tasks in src/celery_app.py.
+Three top-level functions are exported:
+  run_zscore_anomaly_detection()   — query spikes + escalation trend (every 15 min)
+  run_staleness_scoring()          — document staleness risk (daily 03:00 UTC)
+  run_dependency_risk_modeling()   — library risk + Poisson forecast (daily 03:30 UTC)
+All use only stdlib (statistics, math) — no new pip dependencies.
+"""
+from __future__ import annotations
+import logging
+import math
+import statistics
+from datetime import datetime, timedelta
+logger = logging.getLogger(__name__)
+# ── Z-score helpers ───────────────────────────────────────────────────────────
+def _zscore_severity(z: float) -> str:
+    abs_z = abs(z)
+    if abs_z >= 5.0:
+        return "critical"
+    if abs_z >= 4.0:
+        return "high"
+    if abs_z >= 3.5:
+        return "medium"
+    return "low"
+# ── 1. Z-score spike detection + escalation trend ────────────────────────────
+def run_zscore_anomaly_detection() -> None:
+    """For every active team: compute hourly Z-scores and escalation rate trends.
+    Reads from query_events_hourly (pre-aggregated).
+    Writes to anomaly_signals via insert_signal() with 2-hour dedup suppression.
+    Fires WebSocket broadcast for critical/high signals (best-effort).
+    """
+    from src.anomaly.db import get_all_team_ids, get_hourly_counts, insert_signal
+    team_ids = get_all_team_ids()
+    now = datetime.utcnow()
+    current_hour = now.replace(minute=0, second=0, microsecond=0).isoformat()
+    for team_id in team_ids:
+        try:
+            rows = get_hourly_counts(team_id, days=14)
+            if len(rows) < 24:
+                continue
+            counts = [r["query_count"] for r in rows]
+            # Exclude the current (partial) hour from the baseline
+            baseline = counts[:-1]
+            if len(baseline) < 2:
+                continue
+            mean  = statistics.mean(baseline)
+            stdev = statistics.stdev(baseline)
+            if stdev == 0:
+                continue
+            current_count = counts[-1]
+            z = (current_count - mean) / stdev
+            if z > 3.0:
+                signal = insert_signal(
+                    signal_type="query_spike",
+                    team_id=team_id,
+                    entity_type="Team",
+                    entity_id=team_id,
+                    severity=_zscore_severity(z),
+                    score=round(z, 3),
+                    details={
+                        "z_score":         round(z, 3),
+                        "current_count":   current_count,
+                        "baseline_mean":   round(mean, 2),
+                        "baseline_stdev":  round(stdev, 2),
+                        "hour_bucket":     current_hour,
+                        "window_hours":    len(baseline),
+                    },
+                )
+                if signal and signal.get("severity") in ("critical", "high"):
+                    _try_broadcast({
+                        "type":      "escalation_spike",
+                        "message":   f"Query spike for team {team_id}: Z={z:.1f}",
+                        "timestamp": now.isoformat(),
+                    })
+            elif z < -2.0:
+                insert_signal(
+                    signal_type="query_drop",
+                    team_id=team_id,
+                    entity_type="Team",
+                    entity_id=team_id,
+                    severity=_zscore_severity(z),
+                    score=round(z, 3),
+                    details={
+                        "z_score":        round(z, 3),
+                        "current_count":  current_count,
+                        "baseline_mean":  round(mean, 2),
+                        "hour_bucket":    current_hour,
+                    },
+                )
+            _check_escalation_trend(team_id, rows, now)
+        except Exception:
+            logger.warning("zscore: error for team %s", team_id, exc_info=True)
+def _check_escalation_trend(team_id: str, rows: list[dict], now: datetime) -> None:
+    from src.anomaly.db import insert_signal
+    cutoff_7d  = now - timedelta(days=7)
+    cutoff_14d = now - timedelta(days=14)
+    current_window: list[dict] = []
+    prior_window:   list[dict] = []
+    for r in rows:
+        try:
+            hb = datetime.fromisoformat(
+                str(r["hour_bucket"]).replace("Z", "+00:00")
+            ).replace(tzinfo=None)
+        except Exception:
+            continue
+        if hb >= cutoff_7d:
+            current_window.append(r)
+        elif hb >= cutoff_14d:
+            prior_window.append(r)
+    current_queries     = sum(r["query_count"]     for r in current_window)
+    current_escalations = sum(r["escalation_count"] for r in current_window)
+    prior_queries       = sum(r["query_count"]     for r in prior_window)
+    prior_escalations   = sum(r["escalation_count"] for r in prior_window)
+    if current_queries < 10:
+        return
+    current_rate = current_escalations / current_queries if current_queries else 0.0
+    prior_rate   = prior_escalations   / prior_queries   if prior_queries   else 0.0
+    if prior_rate == 0.0:
+        return
+    ratio = current_rate / prior_rate
+    if ratio > 1.5:
+        severity = "high" if ratio > 2.5 else "medium"
+        insert_signal(
+            signal_type="escalation_trend",
+            team_id=team_id,
+            entity_type="Team",
+            entity_id=team_id,
+            severity=severity,
+            score=round(ratio, 3),
+            details={
+                "ratio":                  round(ratio, 3),
+                "current_rate":           round(current_rate, 4),
+                "prior_rate":             round(prior_rate,   4),
+                "current_total_queries":  current_queries,
+                "prior_total_queries":    prior_queries,
+            },
+        )
+# ── 2. Staleness scoring ──────────────────────────────────────────────────────
+def run_staleness_scoring() -> None:
+    """Compute staleness_risk = age_factor × query_pressure for all documents.
+    age_factor    = 1 − exp(−age_days / 90)   (saturating exponential, reaches 0.5 at ~62 days)
+    query_pressure = min(1.0, monthly_team_queries / p95_monthly_team_queries)
+    Documents with staleness_risk < 0.1 are skipped to avoid noise.
+    Cleans up query_events older than 90 days at the end of each run.
+    """
+    from src.anomaly.db import insert_signal, purge_old_events
+    from src.auth.db import _client as _sb_client
+    sb  = _sb_client()
+    now = datetime.utcnow()
+    # Load all documents
+    try:
+        docs = sb.table("documents").select("id,doc_id,title,team_id,updated_at").execute().data or []
+    except Exception:
+        logger.warning("staleness: failed to load documents", exc_info=True)
+        return
+    if not docs:
+        return
+    # Monthly query count per team (proxy for query pressure at team level)
+    cutoff_30d = (now - timedelta(days=30)).isoformat()
+    team_query_counts: dict[str, int] = {}
+    try:
+        rows = (
+            sb.table("query_events")
+            .select("team_id")
+            .gte("created_at", cutoff_30d)
+            .execute()
+            .data or []
+        )
+        for r in rows:
+            tid = r.get("team_id", "unknown")
+            team_query_counts[tid] = team_query_counts.get(tid, 0) + 1
+    except Exception:
+        logger.warning("staleness: team query count failed", exc_info=True)
+    # p95 of team query counts
+    count_values = sorted(team_query_counts.values()) or [1]
+    p95_idx   = max(0, int(len(count_values) * 0.95) - 1)
+    p95_count = count_values[p95_idx] or 1
+    for doc in docs:
+        try:
+            try:
+                updated = datetime.fromisoformat(
+                    str(doc["updated_at"]).replace("Z", "+00:00")
+                ).replace(tzinfo=None)
+            except Exception:
+                updated = now - timedelta(days=180)
+            age_days       = max(0, (now - updated).days)
+            age_factor     = 1.0 - math.exp(-age_days / 90.0)
+            monthly_count  = team_query_counts.get(doc.get("team_id", ""), 0)
+            query_pressure = min(1.0, monthly_count / p95_count)
+            staleness_risk = round(age_factor * query_pressure, 4)
+            if staleness_risk < 0.1:
+                continue
+            severity = (
+                "critical" if staleness_risk >= 0.8 else
+                "high"     if staleness_risk >= 0.6 else
+                "medium"   if staleness_risk >= 0.3 else
+                "low"
+            )
+            insert_signal(
+                signal_type="staleness",
+                team_id=doc.get("team_id"),
+                entity_type="Document",
+                entity_id=doc.get("doc_id"),
+                severity=severity,
+                score=staleness_risk,
+                details={
+                    "title":          doc.get("title", ""),
+                    "age_days":       age_days,
+                    "age_factor":     round(age_factor,     4),
+                    "query_pressure": round(query_pressure, 4),
+                    "updated_at":     doc.get("updated_at"),
+                },
+            )
+        except Exception:
+            logger.warning("staleness: error on doc %s", doc.get("doc_id"), exc_info=True)
+    purge_old_events()
+# ── 3. Dependency risk modelling ──────────────────────────────────────────────
+def run_dependency_risk_modeling() -> None:
+    """Score every Library node in Neo4j for dependency risk.
+    risk = 0.40 × version_lag  +  0.35 × downstream_normalized  +  0.25 × incident_rate
+    poisson_30d = 1 − exp(−(incident_count / 365) × 30)
+    """
+    from src.anomaly.db import insert_signal
+    rows = _fetch_library_risk_rows()
+    if not rows:
+        logger.info("dep_risk: no library rows from Neo4j")
+        return
+    max_downstream = max((r.get("downstream_count", 0) for r in rows), default=1) or 1
+    for r in rows:
+        try:
+            name           = r.get("name", "")
+            current_ver    = r.get("current_version",  "0.0.0")
+            latest_ver     = r.get("latest_version",   "0.0.0")
+            downstream     = int(r.get("downstream_count", 0))
+            incident_count = int(r.get("incident_count",  0))
+            version_lag           = _version_lag_score(current_ver, latest_ver)
+            downstream_normalized = min(1.0, downstream / max_downstream)
+            incident_rate         = min(1.0, incident_count / 365.0)
+            risk = round(
+                0.40 * version_lag +
+                0.35 * downstream_normalized +
+                0.25 * incident_rate,
+                4,
+            )
+            lam          = incident_count / 365.0
+            poisson_30d  = round(1.0 - math.exp(-lam * 30), 4)
+            severity = (
+                "critical" if risk >= 0.7 else
+                "high"     if risk >= 0.5 else
+                "medium"   if risk >= 0.3 else
+                "low"
+            )
+            insert_signal(
+                signal_type="dependency_risk",
+                team_id=None,
+                entity_type="Library",
+                entity_id=name,
+                severity=severity,
+                score=risk,
+                details={
+                    "library_name":           name,
+                    "current_version":        current_ver,
+                    "latest_version":         latest_ver,
+                    "version_lag":            round(version_lag,           4),
+                    "downstream_count":       downstream,
+                    "downstream_normalized":  round(downstream_normalized, 4),
+                    "incident_count":         incident_count,
+                    "incident_rate":          round(incident_rate,         4),
+                    "poisson_30d":            poisson_30d,
+                },
+            )
+        except Exception:
+            logger.warning("dep_risk: error on library %s", r.get("name"), exc_info=True)
+def _fetch_library_risk_rows() -> list[dict]:
+    import asyncio
+    async def _query() -> list[dict]:
+        from graph_store.config import settings as neo4j_cfg
+        from neo4j import AsyncGraphDatabase
+        driver = AsyncGraphDatabase.driver(
+            neo4j_cfg.neo4j_uri,
+            auth=(neo4j_cfg.neo4j_username, neo4j_cfg.neo4j_password),
+        )
+        try:
+            async with driver.session() as session:
+                result = await session.run("""
+                    MATCH (lib:Library)
+                    OPTIONAL MATCH (lib)<-[:DEPENDS_ON]-(downstream)
+                    OPTIONAL MATCH (lib)<-[:CAUSED_BY]-(inc:Incident)
+                    RETURN lib.name                              AS name,
+                           coalesce(lib.version, '0.0.0')        AS current_version,
+                           coalesce(lib.latest_version, '0.0.0') AS latest_version,
+                           count(DISTINCT downstream)            AS downstream_count,
+                           count(DISTINCT inc)                   AS incident_count
+                """)
+                return await result.data()
+        finally:
+            await driver.close()
+    try:
+        return asyncio.run(_query())
+    except Exception:
+        logger.warning("dep_risk: neo4j fetch failed", exc_info=True)
+        return []
+def _version_lag_score(current: str, latest: str) -> float:
+    """Return 0.0–1.0 based on semver distance. Falls back to 0.5 on parse error."""
+    try:
+        def _parts(v: str) -> tuple[int, int, int]:
+            parts = v.lstrip("v").split(".")[:3]
+            ints = [(int(p) if p.isdigit() else 0) for p in (parts + ["0", "0", "0"])[:3]]
+            return ints[0], ints[1], ints[2]
+        c = _parts(current)
+        l = _parts(latest)
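+        # Compare minor/patch only when the higher-order parts match, so a
+        # current version that is already ahead (e.g. 2.0.0 vs 1.5.0) scores 0.0.
+        if l[0] != c[0]:
+            return 1.0 if l[0] > c[0] else 0.0   # major version behind
+        if l[1] != c[1]:
+            return 0.6 if l[1] > c[1] else 0.0   # minor version behind
+        if l[2] > c[2]:
+            return 0.2   # patch behind
+        return 0.0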
+    except Exception:
+        return 0.5
+# ── WebSocket broadcast (best-effort, in-process only) ───────────────────────
+def _try_broadcast(payload: dict) -> None:
+    """Best-effort WebSocket push from a Celery worker.
+    Celery workers run in a separate process so _notification_clients in
+    src/ws/router.py will be empty here. This is intentional — real-time
+    push is handled by src/anomaly/notifier.py (in-process BackgroundTask).
+    This call is a no-op in the worker context but kept for testability.
+    """
+    try:
+        import asyncio
+        from src.ws.router import broadcast_notification
+        asyncio.run(broadcast_notification(payload))
+    except Exception:
+        pass
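
Review note: the spike and drop thresholds in `run_zscore_anomaly_detection()` are easier to check against concrete numbers. The sketch below repeats the same arithmetic with only the standard library. `classify_hour` and the sample counts are hypothetical and exist only for illustration; the thresholds (3.0 for spikes, -2.0 for drops) and the severity bands are copied from the diff above. It assumes the counts are ordered oldest to newest, which the diff also relies on when it takes `counts[:-1]` as the baseline.

```python
from __future__ import annotations

import statistics


def zscore_severity(z: float) -> str:
    """Severity bands copied from _zscore_severity in the diff."""
    abs_z = abs(z)
    if abs_z >= 5.0:
        return "critical"
    if abs_z >= 4.0:
        return "high"
    if abs_z >= 3.5:
        return "medium"
    return "low"


def classify_hour(counts: list[int]) -> tuple[str, float] | None:
    """Hypothetical helper: the last count is the current hour, the rest is the baseline."""
    baseline, current = counts[:-1], counts[-1]
    if len(baseline) < 2:
        return None  # stdev needs at least two points
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)  # sample standard deviation
    if stdev == 0:
        return None  # flat baseline, Z-score undefined
    z = (current - mean) / stdev
    if z > 3.0:
        return ("query_spike", round(z, 3))
    if z < -2.0:
        return ("query_drop", round(z, 3))
    return None


if __name__ == "__main__":
    # Made-up hourly counts: a quiet baseline followed by a burst.
    result = classify_hour([10, 12, 11, 9, 13, 10, 11, 12, 30])
    if result is not None:
        signal_type, z = result
        print(signal_type, z, zscore_severity(z))
```

One consequence worth noting: a drop is reported from Z < -2.0, but the severity bands start at 3.5, so any drop with |Z| below 3.5 is labelled `low`.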

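Review note: the staleness and Poisson formulas from the docstrings in `src/anomaly/tasks.py` can be evaluated in isolation. This is a minimal sketch, not the module's code: the function names `age_factor`, `staleness_risk` and `poisson_30d` are hypothetical wrappers, and the inputs are invented. The constants (90-day scale, division by 365, 30-day horizon) come from the diff. Dividing by 365 treats `incident_count` as roughly a one-year total, which the diff implies but does not state.

```python
import math


def age_factor(age_days: float, scale_days: float = 90.0) -> float:
    """1 - exp(-age/scale): 0 for a fresh document, approaching 1 as it ages."""
    return 1.0 - math.exp(-age_days / scale_days)


def staleness_risk(age_days: float, monthly_queries: int, p95_queries: int) -> float:
    """age_factor x query_pressure, with pressure capped at 1.0 as in the diff."""
    pressure = min(1.0, monthly_queries / max(p95_queries, 1))
    return round(age_factor(age_days) * pressure, 4)


def poisson_30d(incident_count: int) -> float:
    """P(at least one incident in 30 days) for a Poisson rate of incident_count / 365 per day."""
    lam = incident_count / 365.0
    return round(1.0 - math.exp(-lam * 30), 4)


if __name__ == "__main__":
    # The age factor crosses 0.5 at 90 * ln(2) days, about 62 days.
    print("half-point (days):", round(90 * math.log(2), 1))
    # Invented inputs: a 90-day-old document owned by a team at the p95 query volume.
    print("staleness_risk:", staleness_risk(90, 100, 100))
    # Invented input: a library linked to 12 incidents.
    print("poisson_30d:", poisson_30d(12))
```

Because the query pressure is computed per team, every document owned by the same team gets the same pressure value and is ranked by age alone.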
src/celery_app.py CHANGED

@@ -1,9 +1,13 @@
 """Celery app configuration and task setup."""
 from celery import Celery
 from kombu import Exchange, Queue
 from src.config import settings
 # Create Celery app
 app = Celery(settings.app_name)
@@ -93,6 +97,21 @@ def setup_periodic_tasks(sender, **kwargs):
         name="poll-error-traces",
     )
 # Task imports (to be implemented in tasks module)
 from celery import shared_task
@@ -124,8 +143,13 @@ def poll_server_logs(self):
 @shared_task(queue="polling", bind=True, max_retries=3)
 def poll_metrics_anomalies(self):
-    """Poll metrics for anomalies."""
-    pass
 @shared_task(queue="polling", bind=True, max_retries=3)
@@ -146,5 +170,27 @@ def enrich_and_index_document(self, document: dict):
     pass
 if __name__ == "__main__":
     app.start()

 """Celery app configuration and task setup."""
+import logging
 from celery import Celery
 from kombu import Exchange, Queue
 from src.config import settings
+logger = logging.getLogger(__name__)
 # Create Celery app
 app = Celery(settings.app_name)
         name="poll-error-traces",
     )
+    # Staleness scoring daily at 03:00 UTC
+    from celery.schedules import crontab
+    sender.add_periodic_task(
+        crontab(hour=3, minute=0),
+        compute_staleness_scores.s(),
+        name="compute-staleness-scores",
+    )
+    # Dependency risk daily at 03:30 UTC
+    sender.add_periodic_task(
+        crontab(hour=3, minute=30),
+        compute_dependency_risk.s(),
+        name="compute-dependency-risk",
+    )
 # Task imports (to be implemented in tasks module)
 from celery import shared_task
 @shared_task(queue="polling", bind=True, max_retries=3)
 def poll_metrics_anomalies(self):
+    """Z-score spike detection and escalation trend detection — runs every 15 min."""
+    try:
+        from src.anomaly.tasks import run_zscore_anomaly_detection
+        run_zscore_anomaly_detection()
+    except Exception as exc:
+        logger.error("poll_metrics_anomalies failed: %s", exc)
+        raise self.retry(exc=exc, countdown=120)
 @shared_task(queue="polling", bind=True, max_retries=3)
     pass
+@shared_task(queue="low", bind=True, max_retries=2)
+def compute_staleness_scores(self):
+    """Daily staleness risk scoring for all documents (03:00 UTC)."""
+    try:
+        from src.anomaly.tasks import run_staleness_scoring
+        run_staleness_scoring()
+    except Exception as exc:
+        logger.error("compute_staleness_scores failed: %s", exc)
+        raise self.retry(exc=exc, countdown=300)
+@shared_task(queue="low", bind=True, max_retries=2)
+def compute_dependency_risk(self):
+    """Daily dependency risk modelling from Neo4j graph (03:30 UTC)."""
+    try:
+        from src.anomaly.tasks import run_dependency_risk_modeling
+        run_dependency_risk_modeling()
+    except Exception as exc:
+        logger.error("compute_dependency_risk failed: %s", exc)
+        raise self.retry(exc=exc, countdown=300)
 if __name__ == "__main__":
     app.start()

supabase/anomaly_migration.sql ADDED

	@@ -0,0 +1,86 @@

+-- ============================================================
+-- Anomaly Detection Migration
+-- Safe to run multiple times (all IF NOT EXISTS / ON CONFLICT).
+-- Run AFTER rbac_migration.sql.
+-- ============================================================
+-- ── query_events: raw event persistence (90-day rolling window) ──────────────
+CREATE TABLE IF NOT EXISTS query_events (
+    id              uuid        PRIMARY KEY DEFAULT gen_random_uuid(),
+    event_id        text        NOT NULL UNIQUE,   -- same UUID as Redis event["id"]
+    team_id         text        NOT NULL,
+    session_id      text,
+    success         boolean     NOT NULL DEFAULT true,
+    duration_ms     integer,
+    escalated       boolean     NOT NULL DEFAULT false,
+    guardrail_score float,
+    agent_metrics   jsonb       NOT NULL DEFAULT '{}',
+    created_at      timestamptz NOT NULL DEFAULT now()
+);
+CREATE INDEX IF NOT EXISTS qe_team_time_idx  ON query_events (team_id, created_at DESC);
+CREATE INDEX IF NOT EXISTS qe_created_at_idx ON query_events (created_at DESC);
+CREATE INDEX IF NOT EXISTS qe_escalated_idx  ON query_events (team_id, escalated, created_at DESC);
+-- ── query_events_hourly: pre-aggregated hourly buckets ────────────────────────
+CREATE TABLE IF NOT EXISTS query_events_hourly (
+    team_id          text        NOT NULL,
+    hour_bucket      timestamptz NOT NULL,
+    query_count      integer     NOT NULL DEFAULT 0,
+    escalation_count integer     NOT NULL DEFAULT 0,
+    avg_duration_ms  integer,
+    PRIMARY KEY (team_id, hour_bucket)
+);
+CREATE INDEX IF NOT EXISTS qeh_team_hour_idx ON query_events_hourly (team_id, hour_bucket DESC);
+-- ── anomaly_signals: detected anomaly records ─────────────────────────────────
+CREATE TABLE IF NOT EXISTS anomaly_signals (
+    id           uuid        PRIMARY KEY DEFAULT gen_random_uuid(),
+    team_id      text,
+    signal_type  text        NOT NULL,
+    -- query_spike | query_drop | escalation_trend | staleness | dependency_risk
+    entity_type  text,
+    entity_id    text,
+    severity     text        NOT NULL DEFAULT 'medium',
+    -- critical | high | medium | low
+    score        float       NOT NULL DEFAULT 0.0,
+    details      jsonb       NOT NULL DEFAULT '{}',
+    resolved     boolean     NOT NULL DEFAULT false,
+    resolved_by  uuid        REFERENCES users(id),
+    resolved_at  timestamptz,
+    detected_at  timestamptz NOT NULL DEFAULT now()
+);
+CREATE INDEX IF NOT EXISTS as_team_type_idx    ON anomaly_signals (team_id, signal_type, detected_at DESC);
+CREATE INDEX IF NOT EXISTS as_severity_idx     ON anomaly_signals (severity, resolved, detected_at DESC);
+CREATE INDEX IF NOT EXISTS as_resolved_idx     ON anomaly_signals (resolved, team_id, detected_at DESC);
+CREATE INDEX IF NOT EXISTS as_entity_idx       ON anomaly_signals (entity_type, entity_id);
+-- RLS: all three tables are service-role only — no anon/authenticated policies
+-- needed because all reads go through the FastAPI backend using the service key.
+ALTER TABLE query_events          ENABLE ROW LEVEL SECURITY;
+ALTER TABLE query_events_hourly   ENABLE ROW LEVEL SECURITY;
+ALTER TABLE anomaly_signals       ENABLE ROW LEVEL SECURITY;
+-- ── PostgreSQL function: in-DB hourly aggregation ────────────────────────────
+-- Called via Supabase RPC so the worker never ships raw rows just to count them.
+CREATE OR REPLACE FUNCTION aggregate_hourly_bucket(p_team_id text, p_hour timestamptz)
+RETURNS void LANGUAGE plpgsql AS $$
+BEGIN
+    INSERT INTO query_events_hourly (team_id, hour_bucket, query_count, escalation_count, avg_duration_ms)
+    SELECT
+        p_team_id,
+        p_hour,
+        count(*)::integer,
+        count(*) FILTER (WHERE escalated = true)::integer,
+        avg(duration_ms)::integer
+    FROM   query_events
+    WHERE  team_id = p_team_id
+      AND  date_trunc('hour', created_at AT TIME ZONE 'UTC') = p_hour
+    ON CONFLICT (team_id, hour_bucket) DO UPDATE
+        SET query_count      = EXCLUDED.query_count,
+            escalation_count = EXCLUDED.escalation_count,
+            avg_duration_ms  = EXCLUDED.avg_duration_ms;
+END;
+$$;
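
Review note: for readers less familiar with `INSERT ... ON CONFLICT`, the effect of `aggregate_hourly_bucket` can be approximated in plain Python. This is only an in-memory sketch under assumptions: the event dictionaries, the `hourly` dict keyed by `(team_id, hour_bucket)` and the sample data are hypothetical stand-ins for the two tables, and Python's `round()` breaks exact .5 ties differently from the PostgreSQL integer cast.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def aggregate_hourly_bucket(events: list[dict], hourly: dict, team_id: str, hour: datetime) -> None:
    """Recompute one (team_id, hour) bucket from raw events and upsert it."""
    matching = [
        e for e in events
        if e["team_id"] == team_id
        and e["created_at"].astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0) == hour
    ]
    # SQL avg() ignores NULLs, so skip events without a duration.
    durations = [e["duration_ms"] for e in matching if e.get("duration_ms") is not None]
    # Assigning to the key overwrites any earlier row, like ON CONFLICT ... DO UPDATE.
    hourly[(team_id, hour)] = {
        "query_count": len(matching),
        "escalation_count": sum(1 for e in matching if e["escalated"]),
        "avg_duration_ms": round(sum(durations) / len(durations)) if durations else None,
    }


if __name__ == "__main__":
    hour = datetime(2024, 1, 1, 10, tzinfo=timezone.utc)
    # Invented events: two in the 10:00 bucket, one in the next hour.
    events = [
        {"team_id": "a", "created_at": hour + timedelta(minutes=5), "duration_ms": 100, "escalated": False},
        {"team_id": "a", "created_at": hour + timedelta(minutes=50), "duration_ms": 200, "escalated": True},
        {"team_id": "a", "created_at": hour + timedelta(hours=1), "duration_ms": 900, "escalated": True},
    ]
    hourly: dict = {}
    aggregate_hourly_bucket(events, hourly, "a", hour)
    print(hourly[("a", hour)])
```

As in the SQL, an aggregate without `GROUP BY` always yields one row, so a bucket with no matching events is still written with zero counts and an empty average. Re-running the function for the same bucket replaces the row rather than adding to it, which makes the aggregation safe to repeat.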