Spaces: AdithyaVardan/GodSpeed
AdithyaVardan committed 23 days ago

Commit 159f5a5 (1 parent: 0493349)

Add agent and ingestion modules for Enterprise Knowledge Copilot

- agent/: LangGraph orchestration with Gemini, Qdrant hybrid retrieval,
BGE-M3 embeddings, BGE reranker, GLiNER PII masking, FastAPI SSE endpoint
- ingestion/: document ingestion pipeline with Confluence, GitHub, PDF, and
Jira sources, semantic chunker, Celery background jobs, Supabase + Qdrant
storage, and nightly CAG snapshot job
- main.py: FastAPI entry point mounting both routers
- supabase/schema.sql: full schema with RLS team isolation
- requirements.txt, .env.example, .gitignore

Files changed (45)

.env.example +36 -0
.gitignore +39 -0
agent/README.md +142 -0
agent/__init__.py +6 -0
agent/agents/__init__.py +1 -0
agent/agents/_gemini.py +88 -0
agent/agents/guardrail.py +42 -0
agent/agents/planner.py +31 -0
agent/agents/synthesiser.py +62 -0
agent/api.py +80 -0
agent/config.py +50 -0
agent/graph.py +251 -0
agent/models.py +58 -0
agent/prompts.py +107 -0
agent/tools/__init__.py +1 -0
agent/tools/doc_search.py +272 -0
agent/tools/live_docs.py +20 -0
agent/tools/summariser.py +31 -0
agent/tools/ticket_lookup.py +20 -0
ingestion/__init__.py +3 -0
ingestion/api.py +117 -0
ingestion/config.py +56 -0
ingestion/jobs/__init__.py +0 -0
ingestion/jobs/cag_job.py +185 -0
ingestion/jobs/celery_app.py +33 -0
ingestion/jobs/ingest_job.py +124 -0
ingestion/models.py +86 -0
ingestion/pipeline/__init__.py +0 -0
ingestion/pipeline/chunker.py +96 -0
ingestion/pipeline/embedder.py +75 -0
ingestion/pipeline/pii_masker.py +47 -0
ingestion/pipeline/reranker.py +32 -0
ingestion/sources/__init__.py +0 -0
ingestion/sources/base.py +15 -0
ingestion/sources/confluence.py +117 -0
ingestion/sources/github.py +163 -0
ingestion/sources/jira.py +69 -0
ingestion/sources/pdf.py +51 -0
ingestion/storage/__init__.py +0 -0
ingestion/storage/bm25_store.py +44 -0
ingestion/storage/qdrant_store.py +101 -0
ingestion/storage/supabase_store.py +138 -0
main.py +19 -0
requirements.txt +61 -6
supabase/schema.sql +81 -0

.env.example ADDED

	@@ -0,0 +1,36 @@

+# Google AI Studio
+GOOGLE_API_KEY=
+# Qdrant (defaults to localhost:6333)
+QDRANT_HOST=localhost
+QDRANT_PORT=6333
+QDRANT_COLLECTION=knowledge_base
+# Supabase
+SUPABASE_URL=https://your-project.supabase.co
+SUPABASE_KEY=your-anon-or-service-role-key
+# Redis (Celery broker + backend)
+REDIS_URL=redis://localhost:6379/0
+# Confluence
+CONFLUENCE_BASE_URL=https://your-org.atlassian.net
+CONFLUENCE_EMAIL=you@your-org.com
+CONFLUENCE_TOKEN=
+# GitHub
+GITHUB_TOKEN=
+GITHUB_PATH_FILTER=docs/
+GITHUB_BRANCH=main
+# Jira
+JIRA_BASE_URL=https://your-org.atlassian.net
+JIRA_API_TOKEN=
+JIRA_PROJECT_KEY=
+# Model overrides (optional — defaults shown)
+PLANNER_MODEL=gemini-2.5-pro
+SYNTHESISER_MODEL=gemini-2.5-pro
+SUMMARISER_MODEL=gemini-2.5-flash
+GUARDRAIL_MODEL=gemini-2.5-flash
+CAG_MODEL=gemini-2.5-pro

.gitignore ADDED

	@@ -0,0 +1,39 @@

+# Python
+__pycache__/
+*.py[cod]
+*.pyo
+*.pyd
+.Python
+*.egg-info/
+dist/
+build/
+*.egg
+# Virtual environments
+.venv/
+venv/
+env/
+# Environment variables
+.env
+# Data / model artifacts
+data/
+*.pkl
+*.pt
+*.bin
+# IDE
+.vscode/
+.idea/
+# OS
+.DS_Store
+Thumbs.db
+# Logs
+*.log
+# Celery
+celerybeat-schedule
+celerybeat.pid

agent/README.md ADDED

	@@ -0,0 +1,142 @@

+# Enterprise Knowledge Copilot — Agent Module
+LangGraph-based multi-agent RAG system with Gemini, Qdrant, BGE-M3, and streaming SSE.
+## Architecture
+```
+POST /agent/query
+       │
+       ▼
+  planner_node  (Gemini 2.5 Pro)
+       │  ExecutionPlan
+       ▼
+ ┌─────┴──────────┐
+ │                │       (parallel)
+doc_search    ticket_lookup
+ │    └──────────┘
+ │  live_docs   (conditional)
+ └──────────────┘
+       │
+  synthesiser_node  (Gemini 2.5 Pro, streaming)
+       │
+  guardrail_node   (Gemini 2.5 Flash)
+       │
+    done / escalate
+```
+### Two-level orchestration
+1. **Planner** (Level 1): Gemini analyses the query and returns a structured `ExecutionPlan` — which agents to run and which can be parallelised.
+2. **LangGraph** (Level 2): Executes the plan, running independent nodes concurrently via `asyncio`.
+### Parallelism rules
+- `doc_search` and `ticket_lookup` always run in parallel when both are needed.
+- `live_docs` runs after `doc_search` only if confidence is low OR the query names an external library.
+- Each agent node calls exactly one tool. No agent calls two tools.
+### Confidence gating
+After BGE reranker scoring:
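+- `≥ 0.6` → `high`
+- `≥ 0.4` and `< 0.6` → `medium`
+- `< 0.4` → `low`
+The synthesiser prefixes a warning at `low` confidence and adds a verification caveat at `medium`. The guardrail separately scores groundedness and escalates when that score falls below 0.5.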
+## Setup
+```bash
+# 1. Install dependencies
+pip install -r requirements.txt
+# 2. Copy env file and fill in keys
+cp .env.example .env
+# Set at minimum: GOOGLE_API_KEY
+# 3. Start Qdrant locally
+docker run -p 6333:6333 qdrant/qdrant
+# 4. Run the API
+uvicorn main:app --reload
+```
+Your `main.py` should include:
+```python
+from fastapi import FastAPI
+from agent.api import router
+app = FastAPI()
+app.include_router(router)
+```
+## Environment variables
+| Variable | Required | Description |
+|---|---|---|
+| `GOOGLE_API_KEY` | ✅ | Google AI Studio key |
+| `QDRANT_HOST` | optional | Default: `localhost` |
+| `QDRANT_PORT` | optional | Default: `6333` |
+| `JIRA_BASE_URL` | optional | Enables ticket_lookup |
+| `JIRA_API_TOKEN` | optional | Enables ticket_lookup |
+| `FIRECRAWL_API_KEY` | optional | Enables live_docs |
+| `TAVILY_API_KEY` | optional | Enables live_docs |
+## BM25 index
+`doc_search` expects a BM25 index at `data/bm25_index.pkl` as a pickle with:
+```python
+{
+  "index": BM25Okapi(...),
+  "corpus": ["doc text 1", "doc text 2", ...],
+  "doc_ids": ["chunk_id_1", "chunk_id_2", ...]
+}
+```
+If the file is missing, BM25 is silently skipped and only Qdrant vectors are used.
+## Qdrant collection schema
+Collection name: `knowledge_base`
+```
+dense vector:  name="dense",  size=1024
+sparse vector: name="sparse"
+payload:       chunk_id, text, source, source_type, team_id
+```
+Data is filtered by `team_id` on every query — teams see only their own documents.
+## Adding a new agent
+1. Add a new tool in `tools/my_tool.py` with `async def run_my_tool(query, team_id) -> list[RetrievedChunk]`.
+2. Add `"my_tool"` to the `Literal` in `models.py → AgentTask.agent`.
+3. Add a node function in `graph.py`:
+```python
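+async def my_tool_node(state: KnowledgeGraphState) -> dict:
+    queue = state.sse_queue
+    await _push_event(queue, "agent_started", {"agent": "my_tool"})
+    # Use the planner's rephrased input when present, else the raw query.
+    task_input = _find_task_input(state, "my_tool") or state.query_input.query
+    chunks = await run_my_tool(task_input, state.query_input.team_id)
+    ...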
+```
+4. Register the node and wire its edges in `build_graph()`.
+5. Update `PLANNER_SYSTEM_PROMPT` in `prompts.py` to describe when to use the new agent.
+## SSE event stream
+Events emitted in order:
+| Event | Payload |
+|---|---|
+| `plan_ready` | `{tasks, reasoning}` |
+| `agent_started` | `{agent}` |
+| `agent_done` | `{agent, chunks, confidence}` |
+| `synthesis_started` | `{}` |
+| `answer_chunk` | `{chunk}` (one per token) |
+| `guardrail_result` | `{score, escalate}` |
+| `done` | `{}` |
+| `error` | `{message}` |
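
Note on the README's confidence gating: the three bands can be sketched as a plain threshold function. This is a minimal illustration only. It is not the commit's `compute_retrieval_confidence`, whose body is not shown in this part of the diff; the name `confidence_label` and the idea that the gate keys on a single reranker score are assumptions. The defaults mirror `reranker_high_threshold` and `reranker_medium_threshold` from `agent/config.py`.

```python
def confidence_label(score: float, high: float = 0.6, medium: float = 0.4) -> str:
    """Map a reranker score to the README's bands (hypothetical helper)."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


# Scores exactly on a boundary fall into the higher band.
print([confidence_label(s) for s in (0.75, 0.6, 0.45, 0.1)])
# ['high', 'high', 'medium', 'low']
```

Keeping the thresholds as parameters would let the same function follow any overrides made in the settings object.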

agent/__init__.py ADDED

	@@ -0,0 +1,6 @@

+"""Enterprise Knowledge Copilot agent package."""
+from agent.api import router
+from agent.config import settings
+__all__ = ["router", "settings"]

agent/agents/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Agent implementations package."""

agent/agents/_gemini.py ADDED

	@@ -0,0 +1,88 @@

+"""Shared Gemini call helper with exponential-backoff retry."""
+from __future__ import annotations
+import asyncio
+import json
+import logging
+from typing import Any
+from langchain_google_genai import ChatGoogleGenerativeAI
+from langchain_core.messages import HumanMessage, SystemMessage
+from agent.config import settings
+logger = logging.getLogger(__name__)
+async def call_gemini_text(
+    model_name: str,
+    system_prompt: str,
+    user_message: str,
+) -> str:
+    llm = ChatGoogleGenerativeAI(
+        model=model_name,
+        google_api_key=settings.google_api_key,
+        temperature=0.0,
+    )
+    messages = [SystemMessage(content=system_prompt), HumanMessage(content=user_message)]
+    for attempt in range(settings.gemini_max_retries):
+        try:
+            response = await llm.ainvoke(messages)
+            return response.content
+        except Exception as exc:
+            if attempt == settings.gemini_max_retries - 1:
+                logger.error("Gemini call failed after %d retries: %s", settings.gemini_max_retries, exc)
+                raise
+            delay = settings.gemini_retry_base_delay * (2 ** attempt)
+            logger.warning("Gemini call attempt %d failed (%s); retrying in %.1fs", attempt + 1, exc, delay)
+            await asyncio.sleep(delay)
+    raise RuntimeError("Unreachable")
+async def call_gemini_json(
+    model_name: str,
+    system_prompt: str,
+    user_message: str,
+) -> dict[str, Any]:
+    raw = await call_gemini_text(model_name, system_prompt, user_message)
+    # Strip markdown code fences if model ignores the instruction
+    cleaned = raw.strip()
+    if cleaned.startswith("```"):
+        cleaned = cleaned.split("\n", 1)[-1]
+        if cleaned.endswith("```"):
+            cleaned = cleaned[: cleaned.rfind("```")]
+    try:
+        return json.loads(cleaned)
+    except json.JSONDecodeError as exc:
+        logger.error("Failed to parse Gemini JSON response: %s\nRaw: %s", exc, raw[:500])
+        raise
+async def stream_gemini_text(
+    model_name: str,
+    system_prompt: str,
+    user_message: str,
+):
+    llm = ChatGoogleGenerativeAI(
+        model=model_name,
+        google_api_key=settings.google_api_key,
+        temperature=0.0,
+        streaming=True,
+    )
+    messages = [SystemMessage(content=system_prompt), HumanMessage(content=user_message)]
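+    for attempt in range(settings.gemini_max_retries):
+        emitted = False  # set once any chunk has reached the caller
+        try:
+            async for chunk in llm.astream(messages):
+                emitted = True
+                yield chunk.content
+            return
+        except Exception as exc:
+            # Retrying after a partial stream would replay tokens the caller already received.
+            if emitted or attempt == settings.gemini_max_retries - 1:
+                logger.error("Gemini stream failed on attempt %d of %d: %s", attempt + 1, settings.gemini_max_retries, exc)
+                raise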
+            delay = settings.gemini_retry_base_delay * (2 ** attempt)
+            logger.warning("Gemini stream attempt %d failed (%s); retrying in %.1fs", attempt + 1, exc, delay)
+            await asyncio.sleep(delay)
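
Note on the retry schedule in `_gemini.py`: the delay doubles on each failed attempt (`base * 2 ** attempt`), and no sleep follows the final attempt. The sketch below reproduces that pattern with the standard library only. `backoff_delays`, `retry_async`, and `demo` are hypothetical names for illustration, and the flaky callable stands in for a real Gemini call.

```python
import asyncio


def backoff_delays(max_retries: int, base_delay: float) -> list[float]:
    """Delays slept between attempts; the last attempt re-raises instead of sleeping."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries - 1)]


async def retry_async(fn, max_retries: int = 3, base_delay: float = 1.0):
    """Await fn() until it succeeds or max_retries attempts have failed."""
    for attempt in range(max_retries):
        try:
            return await fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt))


async def demo():
    calls = {"n": 0}

    async def flaky():
        # Fails twice, then succeeds on the third call.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("transient")
        return "ok"

    result = await retry_async(flaky, max_retries=3, base_delay=0.0)
    return result, calls["n"]


print(backoff_delays(3, 1.0))  # [1.0, 2.0]
print(asyncio.run(demo()))     # ('ok', 3)
```

With the defaults in `agent/config.py` (3 retries, 1.0 s base), a call that keeps failing would wait 1 s and then 2 s before the third attempt raises.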

agent/agents/guardrail.py ADDED

	@@ -0,0 +1,42 @@

+"""Guardrail agent — checks answer groundedness, returns confidence score."""
+from __future__ import annotations
+import logging
+from typing import Optional
+from agent.agents._gemini import call_gemini_json
+from agent.config import settings
+from agent.models import RetrievedChunk
+from agent.prompts import GUARDRAIL_SYSTEM_PROMPT, build_guardrail_prompt
+logger = logging.getLogger(__name__)
+async def run_guardrail(
+    answer: str,
+    chunks: list[RetrievedChunk],
+) -> tuple[float, bool]:
+    """Returns (score, escalate) — score in [0,1], escalate True if score < 0.5."""
+    if not answer.strip():
+        logger.warning("guardrail: empty answer received, escalating")
+        return 0.0, True
+    chunks_text = "\n\n".join(
+        f"[{c.source}] {c.text}" for c in chunks
+    )
+    prompt = build_guardrail_prompt(answer, chunks_text)
+    try:
+        data = await call_gemini_json(
+            model_name=settings.guardrail_model,
+            system_prompt=GUARDRAIL_SYSTEM_PROMPT,
+            user_message=prompt,
+        )
+        score = float(data.get("score", 0.0))
+        escalate = bool(data.get("escalate", score < 0.5))
+        logger.info("guardrail: score=%.3f escalate=%s reasoning=%r", score, escalate, data.get("reasoning"))
+        return score, escalate
+    except Exception:
+        logger.exception("guardrail: evaluation failed — escalating by default")
+        return 0.0, True
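
Note on the `escalate` parsing in `guardrail.py`: `bool(...)` treats any non-empty string as true, so a model reply carrying `"escalate": "false"` as a string would still escalate. A stricter coercion might look like the sketch below. `coerce_escalate` is a hypothetical helper, not part of the commit, and the 0.5 fallback threshold is taken from the guardrail prompt.

```python
def coerce_escalate(data: dict, threshold: float = 0.5) -> tuple[float, bool]:
    """Return (score, escalate), accepting real booleans or 'true'/'false' strings."""
    score = float(data.get("score", 0.0))
    raw = data.get("escalate")
    if isinstance(raw, bool):
        escalate = raw
    elif isinstance(raw, str) and raw.strip().lower() in {"true", "false"}:
        escalate = raw.strip().lower() == "true"
    else:
        # Missing or unrecognised value: fall back to the score threshold.
        escalate = score < threshold
    return score, escalate


print(coerce_escalate({"score": 0.9, "escalate": "false"}))  # (0.9, False)
print(coerce_escalate({"score": 0.3}))                       # (0.3, True)
```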

agent/agents/planner.py ADDED

	@@ -0,0 +1,31 @@

+"""Planner agent — analyses query, returns ExecutionPlan via Gemini Pro."""
+from __future__ import annotations
+import logging
+from agent.agents._gemini import call_gemini_json
+from agent.config import settings
+from agent.models import AgentTask, ExecutionPlan, QueryInput
+from agent.prompts import PLANNER_SYSTEM_PROMPT
+logger = logging.getLogger(__name__)
+async def run_planner(query_input: QueryInput) -> ExecutionPlan:
+    logger.info("planner: generating execution plan for query=%r", query_input.query)
+    data = await call_gemini_json(
+        model_name=settings.planner_model,
+        system_prompt=PLANNER_SYSTEM_PROMPT,
+        user_message=f"Query: {query_input.query}",
+    )
+    tasks = [AgentTask(**t) for t in data["tasks"]]
+    plan = ExecutionPlan(tasks=tasks, reasoning=data.get("reasoning", ""))
+    logger.info(
+        "planner: plan has %d tasks — %s",
+        len(tasks),
+        [t.agent for t in tasks],
+    )
+    return plan

agent/agents/synthesiser.py ADDED

	@@ -0,0 +1,62 @@

+"""Synthesiser agent — merges agent results and streams a cited answer."""
+from __future__ import annotations
+import asyncio
+import logging
+from typing import AsyncGenerator
+from agent.agents._gemini import stream_gemini_text
+from agent.config import settings
+from agent.models import AgentResult, RetrievedChunk
+from agent.prompts import SYNTHESISER_SYSTEM_PROMPT, build_synthesiser_prompt
+logger = logging.getLogger(__name__)
+def _overall_confidence(agent_results: dict[str, AgentResult]) -> str:
+    levels = [r.retrieval_confidence for r in agent_results.values() if r.chunks]
+    if not levels:
+        return "low"
+    if "high" in levels:
+        return "high"
+    if "medium" in levels:
+        return "medium"
+    return "low"
+def _collect_all_chunks(agent_results: dict[str, AgentResult]) -> list[RetrievedChunk]:
+    seen_ids: set[str] = set()
+    chunks: list[RetrievedChunk] = []
+    for result in agent_results.values():
+        for chunk in result.chunks:
+            if chunk.chunk_id not in seen_ids:
+                seen_ids.add(chunk.chunk_id)
+                chunks.append(chunk)
+    return chunks
+async def stream_synthesis(
+    query: str,
+    agent_results: dict[str, AgentResult],
+) -> AsyncGenerator[str, None]:
+    confidence = _overall_confidence(agent_results)
+    all_chunks = _collect_all_chunks(agent_results)
+    chunks_text = "\n\n".join(
+        f"[{c.source}] {c.text}" for c in all_chunks
+    )
+    prompt = build_synthesiser_prompt(query, confidence, chunks_text)
+    logger.info(
+        "synthesiser: streaming answer — confidence=%s, chunks=%d",
+        confidence,
+        len(all_chunks),
+    )
+    async for token in stream_gemini_text(
+        model_name=settings.synthesiser_model,
+        system_prompt=SYNTHESISER_SYSTEM_PROMPT,
+        user_message=prompt,
+    ):
+        yield token

agent/api.py ADDED

	@@ -0,0 +1,80 @@

+"""FastAPI router with SSE streaming endpoint for the knowledge copilot."""
+from __future__ import annotations
+import asyncio
+import json
+import logging
+from typing import AsyncGenerator
+from fastapi import APIRouter
+from fastapi.responses import StreamingResponse
+from agent.graph import graph
+from agent.models import KnowledgeGraphState, QueryInput
+logger = logging.getLogger(__name__)
+router = APIRouter(prefix="/agent", tags=["agent"])
+async def _event_generator(
+    query_input: QueryInput,
+    queue: asyncio.Queue,
+) -> AsyncGenerator[str, None]:
+    _SENTINEL = object()
+    async def run_graph() -> None:
+        initial_state = KnowledgeGraphState(
+            query_input=query_input,
+            sse_queue=queue,
+        )
+        try:
+            await graph.ainvoke(initial_state)
+        except Exception as exc:
+            logger.exception("Graph execution error for session=%s", query_input.session_id)
+            await queue.put({"event": "error", "data": {"message": str(exc)}})
+        finally:
+            await queue.put(_SENTINEL)
+    task = asyncio.create_task(run_graph())
+    try:
+        while True:
+            item = await queue.get()
+            if item is _SENTINEL:
+                break
+            event_name = item.get("event", "message")
+            data_str = json.dumps(item.get("data", {}))
+            yield f"event: {event_name}\ndata: {data_str}\n\n"
+        yield "event: done\ndata: {}\n\n"
+    except asyncio.CancelledError:
+        logger.info("SSE stream cancelled for session=%s", query_input.session_id)
+        task.cancel()
+        raise
+    finally:
+        if not task.done():
+            task.cancel()
+@router.post("/query")
+async def query_endpoint(query_input: QueryInput) -> StreamingResponse:
+    queue: asyncio.Queue = asyncio.Queue()
+    logger.info(
+        "query_endpoint: session=%s team=%s query=%r",
+        query_input.session_id,
+        query_input.team_id,
+        query_input.query,
+    )
+    return StreamingResponse(
+        _event_generator(query_input, queue),
+        media_type="text/event-stream",
+        headers={
+            "Cache-Control": "no-cache",
+            "X-Accel-Buffering": "no",
+        },
+    )
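
Note on consuming the stream from `agent/api.py`: each event is written as an `event:` line, a `data:` line holding JSON, and a blank line. A client could split the body into frames along those lines. The parser below is a minimal sketch under that assumption; `parse_sse` and the sample payloads are made up for illustration, and a production client would read the response incrementally instead of from one string.

```python
import json


def parse_sse(stream_text: str) -> list[tuple[str, dict]]:
    """Split an SSE body into (event, data) pairs."""
    events = []
    for frame in stream_text.split("\n\n"):
        event, data_lines = "message", []
        for line in frame.splitlines():
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if data_lines:  # skip empty trailing frames
            events.append((event, json.loads("\n".join(data_lines))))
    return events


sample = (
    'event: agent_started\ndata: {"agent": "doc_search"}\n\n'
    'event: answer_chunk\ndata: {"chunk": "Hello"}\n\n'
    "event: done\ndata: {}\n\n"
)
for name, payload in parse_sse(sample):
    print(name, payload)
```

Concatenating the `chunk` fields of successive `answer_chunk` events rebuilds the streamed answer.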

agent/config.py ADDED

	@@ -0,0 +1,50 @@

+from pydantic_settings import BaseSettings, SettingsConfigDict
+class Settings(BaseSettings):
+    model_config = SettingsConfigDict(
+        env_file=".env",
+        env_file_encoding="utf-8",
+        case_sensitive=False,
+        extra="ignore",
+    )
+    google_api_key: str = ""
+    planner_model: str = "gemini-2.5-pro"
+    synthesiser_model: str = "gemini-2.5-pro"
+    summariser_model: str = "gemini-2.5-flash"
+    guardrail_model: str = "gemini-2.5-flash"
+    qdrant_host: str = "localhost"
+    qdrant_port: int = 6333
+    qdrant_collection: str = "knowledge_base"
+    qdrant_dense_vector_name: str = "dense"
+    qdrant_sparse_vector_name: str = "sparse"
+    qdrant_dense_size: int = 1024
+    bge_embedding_model: str = "BAAI/bge-m3"
+    bge_reranker_model: str = "BAAI/bge-reranker-v2-m3"
+    bm25_index_path: str = "data/bm25_index.pkl"
+    gliner_model: str = "urchade/gliner_mediumv2.1"
+    rrf_top_k: int = 50
+    final_top_k: int = 5
+    reranker_high_threshold: float = 0.6
+    reranker_medium_threshold: float = 0.4
+    live_docs_confidence_threshold: float = 0.5
+    gemini_max_retries: int = 3
+    gemini_retry_base_delay: float = 1.0
+    jira_base_url: str = ""
+    jira_api_token: str = ""
+    jira_project_key: str = ""
+    firecrawl_api_key: str = ""
+    tavily_api_key: str = ""
+settings = Settings()

agent/graph.py ADDED

	@@ -0,0 +1,251 @@

+"""LangGraph graph definition — nodes, edges, and parallel execution."""
+from __future__ import annotations
+import asyncio
+import logging
+from typing import Any
+from langgraph.graph import END, StateGraph
+from agent.agents.guardrail import run_guardrail
+from agent.agents.planner import run_planner
+from agent.agents.synthesiser import stream_synthesis
+from agent.models import AgentResult, KnowledgeGraphState, RetrievedChunk
+from agent.tools.doc_search import compute_retrieval_confidence, run_doc_search
+from agent.tools.live_docs import run_live_docs
+from agent.tools.ticket_lookup import run_ticket_lookup
+logger = logging.getLogger(__name__)
+async def _push_event(queue: asyncio.Queue, event: str, data: Any) -> None:
+    if queue is not None:
+        await queue.put({"event": event, "data": data})
+async def planner_node(state: KnowledgeGraphState) -> dict:
+    queue = state.sse_queue
+    plan = await run_planner(state.query_input)
+    await _push_event(
+        queue,
+        "plan_ready",
+        {
+            "tasks": [t.model_dump() for t in plan.tasks],
+            "reasoning": plan.reasoning,
+        },
+    )
+    return {"execution_plan": plan}
+async def doc_search_node(state: KnowledgeGraphState) -> dict:
+    queue = state.sse_queue
+    await _push_event(queue, "agent_started", {"agent": "doc_search"})
+    task_input = _find_task_input(state, "doc_search") or state.query_input.query
+    chunks: list[RetrievedChunk] = []
+    error: str | None = None
+    try:
+        chunks = await run_doc_search(task_input, state.query_input.team_id)
+    except Exception as exc:
+        logger.exception("doc_search_node error")
+        error = str(exc)
+    confidence = compute_retrieval_confidence(chunks)
+    result = AgentResult(
+        agent="doc_search",
+        chunks=chunks,
+        retrieval_confidence=confidence,
+        error=error,
+    )
+    await _push_event(
+        queue,
+        "agent_done",
+        {"agent": "doc_search", "chunks": len(chunks), "confidence": confidence},
+    )
+    return {"agent_results": {**state.agent_results, "doc_search": result}}
+async def ticket_lookup_node(state: KnowledgeGraphState) -> dict:
+    queue = state.sse_queue
+    await _push_event(queue, "agent_started", {"agent": "ticket_lookup"})
+    task_input = _find_task_input(state, "ticket_lookup") or state.query_input.query
+    chunks: list[RetrievedChunk] = []
+    error: str | None = None
+    try:
+        chunks = await run_ticket_lookup(task_input, state.query_input.team_id)
+    except Exception as exc:
+        logger.exception("ticket_lookup_node error")
+        error = str(exc)
+    confidence = compute_retrieval_confidence(chunks)
+    result = AgentResult(
+        agent="ticket_lookup",
+        chunks=chunks,
+        retrieval_confidence=confidence,
+        error=error,
+    )
+    await _push_event(
+        queue,
+        "agent_done",
+        {"agent": "ticket_lookup", "chunks": len(chunks), "confidence": confidence},
+    )
+    return {"agent_results": {**state.agent_results, "ticket_lookup": result}}
+async def live_docs_node(state: KnowledgeGraphState) -> dict:
+    queue = state.sse_queue
+    await _push_event(queue, "agent_started", {"agent": "live_docs"})
+    task_input = _find_task_input(state, "live_docs") or state.query_input.query
+    chunks: list[RetrievedChunk] = []
+    error: str | None = None
+    try:
+        chunks = await run_live_docs(task_input, state.query_input.team_id)
+    except Exception as exc:
+        logger.exception("live_docs_node error")
+        error = str(exc)
+    confidence = compute_retrieval_confidence(chunks)
+    result = AgentResult(
+        agent="live_docs",
+        chunks=chunks,
+        retrieval_confidence=confidence,
+        error=error,
+    )
+    await _push_event(
+        queue,
+        "agent_done",
+        {"agent": "live_docs", "chunks": len(chunks), "confidence": confidence},
+    )
+    return {"agent_results": {**state.agent_results, "live_docs": result}}
+async def synthesiser_node(state: KnowledgeGraphState) -> dict:
+    queue = state.sse_queue
+    await _push_event(queue, "synthesis_started", {})
+    full_answer_parts: list[str] = []
+    async for token in stream_synthesis(state.query_input.query, state.agent_results):
+        full_answer_parts.append(token)
+        await _push_event(queue, "answer_chunk", {"chunk": token})
+    final_answer = "".join(full_answer_parts)
+    all_chunks: list[RetrievedChunk] = []
+    seen: set[str] = set()
+    for result in state.agent_results.values():
+        for chunk in result.chunks:
+            if chunk.chunk_id not in seen:
+                seen.add(chunk.chunk_id)
+                all_chunks.append(chunk)
+    return {"final_answer": final_answer, "citations": all_chunks}
+async def join_node(state: KnowledgeGraphState) -> dict:
+    """Fan-in synchronisation point — waits for all retrieval nodes, then hands off to synthesiser."""
+    await _push_event(state.sse_queue, "agent_started", {"agent": "synthesiser"})
+    return {}
+async def guardrail_node(state: KnowledgeGraphState) -> dict:
+    queue = state.sse_queue
+    score, escalate = await run_guardrail(
+        state.final_answer or "",
+        state.citations,
+    )
+    await _push_event(
+        queue,
+        "guardrail_result",
+        {"score": score, "escalate": escalate},
+    )
+    return {
+        "guardrail_passed": not escalate,
+        "guardrail_score": score,
+        "escalate": escalate,
+    }
+def _find_task_input(state: KnowledgeGraphState, agent: str) -> str | None:
+    if state.execution_plan is None:
+        return None
+    for task in state.execution_plan.tasks:
+        if task.agent == agent:
+            return task.input
+    return None
+def _plan_includes(state: KnowledgeGraphState, agent: str) -> bool:
+    if state.execution_plan is None:
+        return False
+    return any(t.agent == agent for t in state.execution_plan.tasks)
+def _route_after_planner(state: KnowledgeGraphState) -> list[str]:
+    if state.execution_plan is None:
+        return ["synthesiser_node"]
+    plan = state.execution_plan
+    immediate: list[str] = []
+    for task in plan.tasks:
+        if not task.depends_on:
+            immediate.append(f"{task.agent}_node")
+    # If nothing is immediate (shouldn't happen), fall back to synthesiser
+    return immediate or ["synthesiser_node"]
+def _route_after_guardrail(state: KnowledgeGraphState) -> str:
+    return "escalate" if state.escalate else END
+def build_graph() -> Any:
+    builder = StateGraph(KnowledgeGraphState)
+    builder.add_node("planner_node", planner_node)
+    builder.add_node("doc_search_node", doc_search_node)
+    builder.add_node("ticket_lookup_node", ticket_lookup_node)
+    builder.add_node("live_docs_node", live_docs_node)
+    builder.add_node("join_node", join_node)
+    builder.add_node("synthesiser_node", synthesiser_node)
+    builder.add_node("guardrail_node", guardrail_node)
+    builder.set_entry_point("planner_node")
+    builder.add_conditional_edges(
+        "planner_node",
+        _route_after_planner,
+        {
+            "doc_search_node": "doc_search_node",
+            "ticket_lookup_node": "ticket_lookup_node",
+            "live_docs_node": "live_docs_node",
+            "summariser_node": "synthesiser_node",
+            "synthesiser_node": "synthesiser_node",
+        },
+    )
+    # Retrieval nodes all converge on join_node. Nodes dispatched together by the
+    # planner run in the same superstep, so join_node fires once after they finish.
+    builder.add_edge("doc_search_node", "join_node")
+    builder.add_edge("ticket_lookup_node", "join_node")
+    builder.add_edge("live_docs_node", "join_node")
+    builder.add_edge("join_node", "synthesiser_node")
+    builder.add_edge("synthesiser_node", "guardrail_node")
+    builder.add_conditional_edges(
+        "guardrail_node",
+        _route_after_guardrail,
+        {END: END, "escalate": END},
+    )
+    return builder.compile()
+graph = build_graph()
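
Note on the fan-out in `_route_after_planner`: only tasks with an empty `depends_on` are dispatched straight from the planner, and the synthesiser is the fallback when none qualify. The sketch below mirrors that selection with a plain dataclass so it runs without LangGraph. `Task`, `immediate_nodes`, and the sample plan are hypothetical stand-ins for `AgentTask` and a real planner response.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    agent: str
    input: str
    depends_on: list[str] = field(default_factory=list)


def immediate_nodes(tasks: list[Task]) -> list[str]:
    """Node names for tasks with no dependencies, else the synthesiser fallback."""
    ready = [f"{t.agent}_node" for t in tasks if not t.depends_on]
    return ready or ["synthesiser_node"]


plan = [
    Task("doc_search", "rollback runbook"),
    Task("ticket_lookup", "open incidents"),
    Task("live_docs", "upstream changelog", depends_on=["doc_search"]),
]
print(immediate_nodes(plan))  # ['doc_search_node', 'ticket_lookup_node']
print(immediate_nodes([]))    # ['synthesiser_node']
```

In this sketch the dependent `live_docs` task is left out of the first wave. In the graph as wired in this commit, no later edge leads from `doc_search_node` to `live_docs_node`, so a task planned with `depends_on: ["doc_search"]` appears to need an additional conditional edge before it would run.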

agent/models.py ADDED

	@@ -0,0 +1,58 @@

+from __future__ import annotations
+import asyncio
+from typing import Any, Literal, Optional
+from pydantic import BaseModel, Field
+class QueryInput(BaseModel):
+    query: str
+    team_id: str
+    session_id: str
+class AgentTask(BaseModel):
+    agent: Literal["doc_search", "ticket_lookup", "live_docs", "summariser"]
+    input: str
+    depends_on: list[str] = Field(default_factory=list)
+class ExecutionPlan(BaseModel):
+    tasks: list[AgentTask]
+    reasoning: str
+class RetrievedChunk(BaseModel):
+    chunk_id: str
+    text: str
+    source: str
+    source_type: str
+    score: float
+    reranker_score: Optional[float] = None
+class AgentResult(BaseModel):
+    agent: str
+    chunks: list[RetrievedChunk] = Field(default_factory=list)
+    retrieval_confidence: Literal["high", "medium", "low"] = "low"
+    error: Optional[str] = None
+class KnowledgeGraphState(BaseModel):
+    model_config = {"arbitrary_types_allowed": True}
+    query_input: QueryInput
+    execution_plan: Optional[ExecutionPlan] = None
+    agent_results: dict[str, AgentResult] = Field(default_factory=dict)
+    final_answer: Optional[str] = None
+    citations: list[RetrievedChunk] = Field(default_factory=list)
+    guardrail_passed: Optional[bool] = None
+    guardrail_score: Optional[float] = None
+    escalate: bool = False
+    sse_queue: Any = None  # asyncio.Queue for streaming events
+class SSEEvent(BaseModel):
+    event: str
+    data: Any

agent/prompts.py ADDED

	@@ -0,0 +1,107 @@

+PLANNER_SYSTEM_PROMPT = """You are a planning agent for an Enterprise Knowledge Copilot.
+Given a user query, decide which retrieval agents are needed and in what order.
+Available agents:
+- doc_search: Searches the internal knowledge base (Qdrant vector DB). Use for product docs, runbooks, internal wikis, architecture docs.
+- ticket_lookup: Searches Jira tickets. Use when query mentions bugs, issues, tickets, sprints, or task tracking.
+- live_docs: Fetches live web content via Firecrawl/Tavily. Use ONLY when the query mentions a specific external library, framework, or third-party tool where internal docs are insufficient.
+- summariser: Summarises a large set of retrieved chunks. Use ONLY when more than 10 chunks are expected.
+Rules:
+1. doc_search and ticket_lookup should run in parallel when both are needed — set depends_on: [] for both.
+2. live_docs only runs if you expect doc_search confidence will be low OR the query names a specific external library/framework. Set depends_on: ["doc_search"] to run after.
+3. summariser only runs after doc_search. Set depends_on: ["doc_search"].
+4. Do NOT include agents that are not needed for this query.
+5. Rephrase the input for each agent to be focused and specific to what that agent can retrieve.
+Return ONLY valid JSON matching this exact schema. No preamble. No markdown code fences. No explanation outside the JSON.
+Schema:
+{
+  "tasks": [
+    {
+      "agent": "<agent_name>",
+      "input": "<focused query for this agent>",
+      "depends_on": []
+    }
+  ],
+  "reasoning": "<one sentence explaining your agent selection>"
+}"""
+SYNTHESISER_SYSTEM_PROMPT = """You are a synthesiser agent for an Enterprise Knowledge Copilot.
+Your job: given a user query and retrieved knowledge chunks from multiple agents, produce a clear, accurate, cited answer.
+Rules:
+1. Every factual claim MUST be followed by an inline citation in the format [source_name].
+2. Do NOT make any claim that is not directly supported by the retrieved chunks.
+3. If retrieval_confidence is "low", explicitly state at the top: "Note: retrieved knowledge has low confidence. This answer may be incomplete."
+4. If retrieval_confidence is "medium", add a brief caveat recommending the user verify key details.
+5. Structure your answer with clear paragraphs. Use bullet points for lists of steps or options.
+6. If chunks from different agents contradict each other, note the discrepancy and present both views.
+7. Be concise — prefer 3-5 sentences over long paragraphs unless complexity demands more.
+You will receive:
+- original_query: the user's question
+- retrieval_confidence: overall confidence level
+- chunks: list of retrieved chunks with their source and text"""
+GUARDRAIL_SYSTEM_PROMPT = """You are a guardrail agent for an Enterprise Knowledge Copilot.
+Your job: evaluate whether the generated answer is fully grounded in the provided source chunks.
+For each claim in the answer, check if it appears in or is directly inferrable from the provided chunks.
+Scoring:
+- 1.0: Every claim is directly supported by a chunk.
+- 0.7-0.9: Most claims are supported; minor inferences acceptable.
+- 0.5-0.7: Some claims lack clear chunk support; uncertain.
+- 0.0-0.5: Significant claims are unsupported or hallucinated.
+Rules:
+- If score < 0.5, set escalate: true.
+- Return ONLY valid JSON. No preamble. No markdown code fences.
+Schema:
+{
+  "score": <float between 0.0 and 1.0>,
+  "escalate": <true or false>,
+  "reasoning": "<one sentence explaining the score>"
+}"""
+def build_synthesiser_prompt(
+    query: str,
+    retrieval_confidence: str,
+    chunks_text: str,
+) -> str:
+    return f"""original_query: {query}
+retrieval_confidence: {retrieval_confidence}
+Retrieved chunks:
+{chunks_text}
+Generate your answer now."""
+def build_guardrail_prompt(answer: str, chunks_text: str) -> str:
+    return f"""Answer to evaluate:
+{answer}
+Source chunks:
+{chunks_text}
+Evaluate grounding and return JSON now."""
+def build_summariser_prompt(chunks_text: str, query: str) -> str:
+    return f"""Summarise the following retrieved chunks in relation to this query: {query}
+Chunks:
+{chunks_text}
+Provide a concise summary (3-5 sentences) capturing the key points."""
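
Note on the guardrail schema: the prompt asks the model to set `escalate` itself, but a caller may still want to enforce the `score < 0.5` rule on its side. The following is a minimal sketch under that assumption; `parse_guardrail_response` and its `escalate_below` parameter are hypothetical names for illustration and are not taken from `agent/agents/guardrail.py`.

```python
import json


def parse_guardrail_response(raw: str, escalate_below: float = 0.5) -> dict:
    """Parse the guardrail JSON and enforce the 'score < 0.5 => escalate' rule."""
    data = json.loads(raw)
    score = min(1.0, max(0.0, float(data["score"])))  # clamp to [0.0, 1.0]
    return {
        "score": score,
        # Escalate if the model asked for it or if the score is under the threshold.
        "escalate": bool(data.get("escalate", False)) or score < escalate_below,
        "reasoning": str(data.get("reasoning", "")),
    }


# Hypothetical model reply that forgot to set escalate despite a low score.
reply = '{"score": 0.42, "escalate": false, "reasoning": "Two claims lack support."}'
verdict = parse_guardrail_response(reply)
print(verdict)
```

Recomputing `escalate` from the score keeps the escalation decision deterministic even when the model's own flag disagrees with its score.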

agent/tools/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Agent tools package."""

agent/tools/doc_search.py ADDED

	@@ -0,0 +1,272 @@

+"""Qdrant hybrid retrieval with BGE-M3 embeddings, BM25, RRF fusion, and reranking."""
+from __future__ import annotations
+import logging
+import pickle
+from pathlib import Path
+from typing import Optional
+from FlagEmbedding import BGEM3FlagModel, FlagReranker
+from gliner import GLiNER
+from qdrant_client import AsyncQdrantClient
+from qdrant_client.http import models as qmodels
+from rank_bm25 import BM25Okapi
+from agent.config import settings
+from agent.models import RetrievedChunk
+logger = logging.getLogger(__name__)
+_PII_ENTITY_TYPES = [
+    "person",
+    "email",
+    "phone",
+    "ssn",
+    "credit_card",
+    "address",
+    "date_of_birth",
+]
+# singletons — loaded once on first call, models are expensive to initialise
+_embedding_model: Optional[BGEM3FlagModel] = None
+_reranker: Optional[FlagReranker] = None
+_gliner: Optional[GLiNER] = None
+_bm25_index: Optional[BM25Okapi] = None
+_bm25_corpus: Optional[list[str]] = None
+_bm25_doc_ids: Optional[list[str]] = None
+def _get_embedding_model() -> BGEM3FlagModel:
+    global _embedding_model
+    if _embedding_model is None:
+        logger.info("Loading BGE-M3 embedding model: %s", settings.bge_embedding_model)
+        _embedding_model = BGEM3FlagModel(
+            settings.bge_embedding_model, use_fp16=True
+        )
+    return _embedding_model
+def _get_reranker() -> FlagReranker:
+    global _reranker
+    if _reranker is None:
+        logger.info("Loading BGE reranker: %s", settings.bge_reranker_model)
+        _reranker = FlagReranker(settings.bge_reranker_model, use_fp16=True)
+    return _reranker
+def _get_gliner() -> GLiNER:
+    global _gliner
+    if _gliner is None:
+        logger.info("Loading GLiNER model: %s", settings.gliner_model)
+        _gliner = GLiNER.from_pretrained(settings.gliner_model)
+    return _gliner
+def _load_bm25() -> tuple[Optional[BM25Okapi], Optional[list[str]], Optional[list[str]]]:
+    global _bm25_index, _bm25_corpus, _bm25_doc_ids
+    if _bm25_index is not None:
+        return _bm25_index, _bm25_corpus, _bm25_doc_ids
+    index_path = Path(settings.bm25_index_path)
+    if not index_path.exists():
+        logger.warning("BM25 index not found at %s — skipping BM25 retrieval", index_path)
+        return None, None, None
+    try:
+        with index_path.open("rb") as f:
+            data = pickle.load(f)
+        _bm25_index = data["index"]
+        _bm25_corpus = data["corpus"]
+        _bm25_doc_ids = data["doc_ids"]
+        logger.info("Loaded BM25 index with %d documents", len(_bm25_doc_ids))
+    except Exception:
+        logger.exception("Failed to load BM25 index from %s", index_path)
+        return None, None, None
+    return _bm25_index, _bm25_corpus, _bm25_doc_ids
+def _mask_pii(text: str) -> str:
+    try:
+        gliner = _get_gliner()
+        entities = gliner.predict_entities(text, _PII_ENTITY_TYPES, threshold=0.5)
+        # iterate reverse so substring replacements don't shift later indices
+        entities_sorted = sorted(entities, key=lambda e: e["start"], reverse=True)
+        masked = text
+        for ent in entities_sorted:
+            masked = masked[: ent["start"]] + "[REDACTED]" + masked[ent["end"] :]
+        return masked
+    except Exception:
+        logger.exception("GLiNER PII masking failed; using raw query")
+        return text
+def _reciprocal_rank_fusion(
+    ranked_lists: list[list[str]], k: int = 60
+) -> dict[str, float]:
+    scores: dict[str, float] = {}
+    for ranked in ranked_lists:
+        for rank, doc_id in enumerate(ranked):
+            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
+    return scores
+async def run_doc_search(query: str, team_id: str) -> list[RetrievedChunk]:
+    masked_query = _mask_pii(query)
+    logger.info("doc_search: masked query = %r", masked_query)
+    embed_model = _get_embedding_model()
+    output = embed_model.encode(
+        [masked_query],
+        return_dense=True,
+        return_sparse=True,
+        return_colbert_vecs=False,
+    )
+    dense_vector: list[float] = output["dense_vecs"][0].tolist()
+    sparse_weights: dict[int, float] = output["lexical_weights"][0]
+    sparse_indices = list(sparse_weights.keys())
+    sparse_values = [sparse_weights[i] for i in sparse_indices]
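+    # NOTE: unlike the Qdrant queries below, BM25 hits are not filtered by team_id here;
+    # this assumes the index at settings.bm25_index_path only holds chunks this team may see.
+    bm25_ranked_ids: list[str] = []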
+    bm25, corpus, doc_ids = _load_bm25()
+    if bm25 is not None and doc_ids:
+        tokenized = masked_query.lower().split()
+        bm25_scores = bm25.get_scores(tokenized)
+        top_bm25_idx = sorted(
+            range(len(bm25_scores)), key=lambda i: bm25_scores[i], reverse=True
+        )[: settings.rrf_top_k]
+        bm25_ranked_ids = [doc_ids[i] for i in top_bm25_idx]
+    qdrant_ranked_ids: list[str] = []
+    qdrant_payload_map: dict[str, dict] = {}
+    qdrant_score_map: dict[str, float] = {}
+    try:
+        client = AsyncQdrantClient(host=settings.qdrant_host, port=settings.qdrant_port)
+        results = await client.search(
+            collection_name=settings.qdrant_collection,
+            query_vector=qmodels.NamedVector(
+                name=settings.qdrant_dense_vector_name,
+                vector=dense_vector,
+            ),
+            query_filter=qmodels.Filter(
+                must=[
+                    qmodels.FieldCondition(
+                        key="team_id",
+                        match=qmodels.MatchValue(value=team_id),
+                    )
+                ]
+            ),
+            limit=settings.rrf_top_k,
+            with_payload=True,
+        )
+        for hit in results:
+            doc_id = hit.payload.get("chunk_id", str(hit.id))
+            qdrant_ranked_ids.append(doc_id)
+            qdrant_payload_map[doc_id] = hit.payload
+            qdrant_score_map[doc_id] = hit.score
+        sparse_results = await client.search(
+            collection_name=settings.qdrant_collection,
+            query_vector=qmodels.NamedSparseVector(
+                name=settings.qdrant_sparse_vector_name,
+                vector=qmodels.SparseVector(
+                    indices=sparse_indices,
+                    values=sparse_values,
+                ),
+            ),
+            query_filter=qmodels.Filter(
+                must=[
+                    qmodels.FieldCondition(
+                        key="team_id",
+                        match=qmodels.MatchValue(value=team_id),
+                    )
+                ]
+            ),
+            limit=settings.rrf_top_k,
+            with_payload=True,
+        )
+        sparse_ranked_ids: list[str] = []
+        for hit in sparse_results:
+            doc_id = hit.payload.get("chunk_id", str(hit.id))
+            sparse_ranked_ids.append(doc_id)
+            qdrant_payload_map.setdefault(doc_id, hit.payload)
+            qdrant_score_map.setdefault(doc_id, hit.score)
+        await client.close()
+    except Exception:
+        logger.exception("Qdrant search failed; falling back to any other retrieval results")
+        sparse_ranked_ids = []
+    ranked_lists = [lst for lst in [qdrant_ranked_ids, sparse_ranked_ids, bm25_ranked_ids] if lst]
+    if not ranked_lists:
+        logger.warning("All retrieval sources returned empty — no results")
+        return []
+    rrf_scores = _reciprocal_rank_fusion(ranked_lists)
+    top_ids = sorted(rrf_scores, key=lambda x: rrf_scores[x], reverse=True)[: settings.rrf_top_k]
+    candidates: list[dict] = []
+    for doc_id in top_ids:
+        payload = qdrant_payload_map.get(doc_id)
+        if payload is None:
+            # BM25-only hit has no Qdrant payload — reconstruct from corpus
+            if doc_ids and doc_id in doc_ids:
+                idx = doc_ids.index(doc_id)
+                text = corpus[idx] if corpus else ""
+                payload = {"chunk_id": doc_id, "text": text, "source": "bm25", "source_type": "internal"}
+            else:
+                continue
+        candidates.append({"id": doc_id, "payload": payload, "rrf_score": rrf_scores[doc_id]})
+    if not candidates:
+        return []
+    reranker = _get_reranker()
+    pairs = [(masked_query, c["payload"].get("text", "")) for c in candidates]
+    try:
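+        raw_scores = reranker.compute_score(pairs, normalize=True)
+        # some FlagEmbedding versions return a bare float when given a single pair
+        rerank_scores: list[float] = (
+            list(raw_scores) if isinstance(raw_scores, (list, tuple)) else [float(raw_scores)]
+        )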
+    except Exception:
+        logger.exception("Reranker failed — falling back to RRF scores")
+        rerank_scores = [c["rrf_score"] for c in candidates]
+    for i, cand in enumerate(candidates):
+        cand["reranker_score"] = rerank_scores[i]
+    candidates.sort(key=lambda c: c["reranker_score"], reverse=True)
+    top_candidates = candidates[: settings.final_top_k]
+    chunks: list[RetrievedChunk] = []
+    for cand in top_candidates:
+        p = cand["payload"]
+        chunks.append(
+            RetrievedChunk(
+                chunk_id=p.get("chunk_id", cand["id"]),
+                text=p.get("text", ""),
+                source=p.get("source", "unknown"),
+                source_type=p.get("source_type", "internal"),
+                score=cand["rrf_score"],
+                reranker_score=cand["reranker_score"],
+            )
+        )
+    logger.info(
+        "doc_search: returning %d chunks, top reranker score=%.3f",
+        len(chunks),
+        chunks[0].reranker_score if chunks else 0.0,
+    )
+    return chunks
+def compute_retrieval_confidence(chunks: list[RetrievedChunk]) -> str:
+    if not chunks:
+        return "low"
+    top_score = chunks[0].reranker_score or 0.0
+    if top_score >= settings.reranker_high_threshold:
+        return "high"
+    if top_score >= settings.reranker_medium_threshold:
+        return "medium"
+    return "low"
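
Note on the fusion step in `run_doc_search`: a small worked example may make the scoring easier to follow. This is a standalone sketch, not part of the commit; `rrf` mirrors `_reciprocal_rank_fusion` with the same `k = 60` default, and the chunk ids (`"a"` to `"d"`) and the three ranked lists are made up for illustration.

```python
def rrf(ranked_lists: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) over every list a document appears in (rank is 1-based)."""
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


# Hypothetical ranked id lists from the three retrievers.
dense = ["a", "b", "c"]
sparse = ["b", "a", "d"]
bm25 = ["b", "d"]

fused = rrf([dense, sparse, bm25])
ranking = sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)
print(ranking)  # ['b', 'a', 'd', 'c']
```

Here `b` comes first because it sits near the top of all three lists, while `c`, which appears in only one list, comes last.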

agent/tools/live_docs.py ADDED

	@@ -0,0 +1,20 @@

+"""Live web documentation fetcher — Firecrawl + Tavily stub."""
+from __future__ import annotations
+import logging
+from agent.config import settings
+from agent.models import RetrievedChunk
+logger = logging.getLogger(__name__)
+async def run_live_docs(query: str, team_id: str) -> list[RetrievedChunk]:
+    """Stub: always returns an empty list; the Firecrawl/Tavily fetch is not implemented yet, even when FIRECRAWL_API_KEY or TAVILY_API_KEY is configured."""
+    if not settings.firecrawl_api_key and not settings.tavily_api_key:
+        logger.info("live_docs: no API keys configured, returning empty results")
+        return []
+    logger.warning("live_docs: stub returning empty results")
+    return []

agent/tools/summariser.py ADDED

	@@ -0,0 +1,31 @@

+"""Summariser tool — condenses large chunk sets using Gemini Flash."""
+from __future__ import annotations
+import logging
+from agent.config import settings
+from agent.models import RetrievedChunk
+from agent.prompts import build_summariser_prompt
+logger = logging.getLogger(__name__)
+async def run_summariser(chunks: list[RetrievedChunk], query: str) -> str:
+    from agent.agents._gemini import call_gemini_text
+    if not chunks:
+        return ""
+    chunks_text = "\n\n".join(
+        f"[{c.source}] {c.text}" for c in chunks
+    )
+    prompt = build_summariser_prompt(chunks_text, query)
+    summary = await call_gemini_text(
+        model_name=settings.summariser_model,
+        system_prompt="You are a concise technical summariser. Summarise only what is in the provided text.",
+        user_message=prompt,
+    )
+    logger.info("summariser: produced %d-char summary", len(summary))
+    return summary

agent/tools/ticket_lookup.py ADDED

	@@ -0,0 +1,20 @@

+"""Jira ticket lookup tool — interface stub, no live credentials yet."""
+from __future__ import annotations
+import logging
+from agent.config import settings
+from agent.models import RetrievedChunk
+logger = logging.getLogger(__name__)
+async def run_ticket_lookup(query: str, team_id: str) -> list[RetrievedChunk]:
+    """Stub: always returns an empty list; the Jira lookup is not implemented yet, even when JIRA_BASE_URL and JIRA_API_TOKEN are configured."""
+    if not settings.jira_base_url or not settings.jira_api_token:
+        logger.info("ticket_lookup: Jira credentials not configured, returning empty results")
+        return []
+    logger.warning("ticket_lookup: stub returning empty results")
+    return []

ingestion/__init__.py ADDED

	@@ -0,0 +1,3 @@


+from ingestion.api import router
+
+__all__ = ["router"]

ingestion/api.py ADDED

	@@ -0,0 +1,117 @@

+from __future__ import annotations
+import logging
+import uuid
+from datetime import datetime
+from fastapi import APIRouter, HTTPException, UploadFile
+from fastapi.responses import JSONResponse
+from ingestion.models import (
+    ConfluenceIngestRequest,
+    GithubIngestRequest,
+    IngestJobResponse,
+    IngestJobStatus,
+    IngestSourcePayload,
+)
+logger = logging.getLogger(__name__)
+router = APIRouter(prefix="/ingest", tags=["ingest"])
+def _dispatch(payload: IngestSourcePayload) -> IngestJobResponse:
+    from ingestion.jobs.ingest_job import run_ingest
+    from ingestion.models import IngestJobRecord
+    from ingestion.storage.supabase_store import get_client, upsert_job
+    task = run_ingest.delay(payload.model_dump())
+    job_id = task.id
+    record = IngestJobRecord(
+        job_id=job_id,
+        celery_task_id=task.id,
+        status=IngestJobStatus.pending,
+        source_type=payload.source_type,
+        team_id=payload.team_id,
+        created_at=datetime.utcnow(),
+    )
+    try:
+        upsert_job(record, client=get_client())
+    except Exception:
+        logger.exception("api: failed to persist job record for task %s", job_id)
+    return IngestJobResponse(job_id=job_id, status=IngestJobStatus.pending)
+@router.post("/confluence", response_model=IngestJobResponse)
+async def ingest_confluence(request: ConfluenceIngestRequest) -> IngestJobResponse:
+    payload = IngestSourcePayload(
+        source_type="confluence",
+        team_id=request.team_id,
+        params={"space_key": request.space_key, "page_ids": request.page_ids},
+    )
+    return _dispatch(payload)
+@router.post("/github", response_model=IngestJobResponse)
+async def ingest_github(request: GithubIngestRequest) -> IngestJobResponse:
+    payload = IngestSourcePayload(
+        source_type="github",
+        team_id=request.team_id,
+        params={
+            "repo_url": request.repo_url,
+            "path_filter": request.path_filter,
+            "branch": request.branch,
+        },
+    )
+    return _dispatch(payload)
+@router.post("/upload", response_model=IngestJobResponse)
+async def ingest_pdf(team_id: str, file: UploadFile) -> IngestJobResponse:
+    if not file.filename or not file.filename.lower().endswith(".pdf"):
+        raise HTTPException(status_code=400, detail="Only PDF files are accepted")
+    content = await file.read()
+    if not content:
+        raise HTTPException(status_code=400, detail="Uploaded file is empty")
+    # PDF bytes are not JSON-serialisable, so base64-encode them and pass filename + content in the task payload.
+    # For production, upload to object storage (S3/GCS) and pass the URL instead.
+    import base64
+    payload = IngestSourcePayload(
+        source_type="pdf",
+        team_id=team_id,
+        params={
+            "filename": file.filename,
+            "content": base64.b64encode(content).decode(),
+        },
+    )
+    return _dispatch(payload)
+@router.get("/jobs/{job_id}", response_model=IngestJobResponse)
+async def get_job_status(job_id: str) -> IngestJobResponse:
+    from ingestion.storage.supabase_store import get_job
+    record = get_job(job_id)
+    if record is None:
+        # Fall back to Celery task state if Supabase record not yet written
+        from ingestion.jobs.celery_app import celery_app
+        task = celery_app.AsyncResult(job_id)
+        state_map = {
+            "PENDING": IngestJobStatus.pending,
+            "STARTED": IngestJobStatus.running,
+            "SUCCESS": IngestJobStatus.completed,
+            "FAILURE": IngestJobStatus.failed,
+        }
+        status = state_map.get(task.state, IngestJobStatus.pending)
+        return IngestJobResponse(job_id=job_id, status=status)
+    return IngestJobResponse(
+        job_id=record["job_id"],
+        status=IngestJobStatus(record["status"]),
+    )
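
Note on the PDF upload path: because Celery is configured with the JSON serialiser, the uploaded bytes are base64-encoded in `ingest_pdf` and decoded again in the worker. The round trip can be sketched as follows; the filename and byte string are stand-ins chosen for illustration, not values from the repository.

```python
import base64
import json

# Stand-in for the bytes read from an uploaded file (includes non-ASCII bytes).
pdf_bytes = b"%PDF-1.7\n\x00\xff\x10 binary payload"

# API side: bytes -> ASCII string so the payload survives JSON serialisation.
params = {"filename": "handbook.pdf", "content": base64.b64encode(pdf_bytes).decode()}
wire = json.dumps(params)

# Worker side: decode back to the original bytes before parsing the PDF.
received = json.loads(wire)
restored = base64.b64decode(received["content"])

print(restored == pdf_bytes)
```

Base64 grows the payload by roughly a third, which is one reason the code comment suggests passing an object-storage URL instead for production use.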

ingestion/config.py ADDED

	@@ -0,0 +1,56 @@

+from pydantic_settings import BaseSettings, SettingsConfigDict
+class IngestionSettings(BaseSettings):
+    model_config = SettingsConfigDict(
+        env_file=".env",
+        env_file_encoding="utf-8",
+        case_sensitive=False,
+        extra="ignore",
+    )
+    google_api_key: str = ""
+    cag_model: str = "gemini-2.5-pro"
+    qdrant_host: str = "localhost"
+    qdrant_port: int = 6333
+    qdrant_collection: str = "knowledge_base"
+    qdrant_dense_vector_name: str = "dense"
+    qdrant_sparse_vector_name: str = "sparse"
+    qdrant_dense_size: int = 1024
+    supabase_url: str = ""
+    supabase_key: str = ""
+    redis_url: str = "redis://localhost:6379/0"
+    bge_embedding_model: str = "BAAI/bge-m3"
+    bge_reranker_model: str = "BAAI/bge-reranker-v2-m3"
+    gliner_model: str = "urchade/gliner_mediumv2.1"
+    spacy_model: str = "en_core_web_sm"
+    bm25_index_path: str = "data/bm25_index.pkl"
+    embed_batch_size: int = 32
+    chunk_target_tokens: int = 512
+    chunk_max_tokens: int = 768
+    chunk_overlap_ratio: float = 0.15
+    confluence_base_url: str = ""
+    confluence_token: str = ""
+    confluence_email: str = ""
+    github_token: str = ""
+    github_api_url: str = "https://api.github.com"
+    github_path_filter: str = "docs/"
+    github_branch: str = "main"
+    jira_base_url: str = ""
+    jira_api_token: str = ""
+    jira_project_key: str = ""
+    cag_lookback_days: int = 14
+    cag_max_tokens: int = 50_000
+settings = IngestionSettings()

ingestion/jobs/__init__.py ADDED

File without changes

ingestion/jobs/cag_job.py ADDED

	@@ -0,0 +1,185 @@

+from __future__ import annotations
+import asyncio
+import logging
+from datetime import datetime, timedelta
+from typing import Any
+from ingestion.jobs.celery_app import celery_app
+logger = logging.getLogger(__name__)
+_CAG_SYSTEM_PROMPT = """You are a technical project analyst for an enterprise engineering team.
+Given a team's recent Jira activity and GitHub commits (the lookback window is stated in the input), produce a structured project snapshot.
+Return a JSON object with:
+{
+  "summary": "<2-3 sentence executive summary of team activity>",
+  "active_areas": ["<area>", ...],
+  "recent_issues": [{"key": "<JIRA-123>", "title": "...", "status": "..."}],
+  "recent_commits": [{"sha": "<short sha>", "message": "...", "repo": "..."}],
+  "blockers": ["<blocker if evident from tickets>"],
+  "generated_at": "<ISO datetime>"
+}
+Be concise. Do not hallucinate. Use only the data provided."""
+@celery_app.task(name="ingestion.jobs.cag_job.run_cag")
+def run_cag() -> dict[str, Any]:
+    return asyncio.run(_run_cag_async())
+async def _run_cag_async() -> dict[str, Any]:
+    from ingestion.storage.supabase_store import get_all_teams, get_client, update_cag_snapshot
+    sb = get_client()
+    teams = get_all_teams(client=sb)
+    logger.info("cag_job: processing %d teams", len(teams))
+    results: dict[str, str] = {}
+    for team in teams:
+        team_id = team["team_id"]
+        try:
+            snapshot = await _build_team_snapshot(team_id, sb)
+            update_cag_snapshot(team_id, snapshot, client=sb)
+            results[team_id] = "ok"
+        except Exception:
+            logger.exception("cag_job: failed to build snapshot for team %s", team_id)
+            results[team_id] = "error"
+    return results
+async def _build_team_snapshot(team_id: str, sb: Any) -> str:
+    from ingestion.config import settings
+    since = datetime.utcnow() - timedelta(days=settings.cag_lookback_days)
+    since_str = since.strftime("%Y-%m-%d")
+    jira_text = await _fetch_jira_activity(since_str)
+    github_text = await _fetch_github_activity(team_id, sb, since_str)
+    combined = f"Jira activity (last {settings.cag_lookback_days} days):\n{jira_text}\n\nGitHub commits (last {settings.cag_lookback_days} days):\n{github_text}"
+    # Truncate to stay under token budget; rough estimate is 4 chars/token
+    max_chars = settings.cag_max_tokens * 4
+    if len(combined) > max_chars:
+        combined = combined[:max_chars]
+        logger.warning("cag_job: truncated input for team %s to ~%d tokens", team_id, settings.cag_max_tokens)
+    snapshot = await _call_gemini(combined)
+    return snapshot
+async def _fetch_jira_activity(since: str) -> str:
+    from ingestion.config import settings
+    if not settings.jira_base_url or not settings.jira_api_token:
+        return "(Jira not configured)"
+    import base64
+    import httpx
+    jql = f'project = "{settings.jira_project_key}" AND updated >= "{since}" ORDER BY updated DESC'
+    url = f"{settings.jira_base_url}/rest/api/3/search"
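+    # Jira basic auth is "email:api_token"; IngestionSettings has no Jira-specific email,
+    # so this reuses confluence_email (assumes one Atlassian account covers both products).
+    credentials = base64.b64encode(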
+        f"{settings.confluence_email}:{settings.jira_api_token}".encode()
+    ).decode()
+    headers = {"Authorization": f"Basic {credentials}", "Accept": "application/json"}
+    try:
+        async with httpx.AsyncClient(headers=headers, timeout=20) as client:
+            resp = await client.get(url, params={"jql": jql, "maxResults": 50, "fields": "summary,status,assignee,updated"})
+            resp.raise_for_status()
+            issues = resp.json().get("issues", [])
+        lines = []
+        for issue in issues:
+            f = issue["fields"]
+            status = f.get("status", {}).get("name", "?")
+            assignee = (f.get("assignee") or {}).get("displayName", "unassigned")
+            lines.append(f"- [{issue['key']}] {f.get('summary', '')} | {status} | {assignee}")
+        return "\n".join(lines) if lines else "(no recent Jira activity)"
+    except Exception:
+        logger.exception("cag_job: Jira fetch failed")
+        return "(Jira fetch failed)"
+async def _fetch_github_activity(team_id: str, sb: Any, since: str) -> str:
+    from ingestion.config import settings
+    if not settings.github_token:
+        return "(GitHub not configured)"
+    import httpx
+    try:
+        result = (
+            sb.table("documents")
+            .select("source_url, metadata")
+            .eq("team_id", team_id)
+            .eq("source_type", "github")
+            .execute()
+        )
+        repos = result.data or []
+    except Exception:
+        logger.exception("cag_job: failed to fetch repos for team %s", team_id)
+        return "(GitHub repo lookup failed)"
+    headers = {
+        "Authorization": f"Bearer {settings.github_token}",
+        "Accept": "application/vnd.github+json",
+        "X-GitHub-Api-Version": "2022-11-28",
+    }
+    lines: list[str] = []
+    async with httpx.AsyncClient(headers=headers, timeout=20) as client:
+        seen_repos: set[str] = set()
+        for doc in repos:
+            meta = doc.get("metadata") or {}
+            repo = meta.get("repo")
+            if not repo or repo in seen_repos:
+                continue
+            seen_repos.add(repo)
+            url = f"{settings.github_api_url}/repos/{repo}/commits"
+            try:
+                resp = await client.get(url, params={"since": f"{since}T00:00:00Z", "per_page": 20})
+                resp.raise_for_status()
+                for commit in resp.json():
+                    sha = commit["sha"][:7]
+                    message = commit["commit"]["message"].split("\n")[0]
+                    lines.append(f"- [{repo}] {sha} {message}")
+            except Exception:
+                logger.exception("cag_job: commit fetch failed for %s", repo)
+    return "\n".join(lines) if lines else "(no recent GitHub activity)"
+async def _call_gemini(user_message: str) -> str:
+    from langchain_core.messages import HumanMessage, SystemMessage
+    from langchain_google_genai import ChatGoogleGenerativeAI
+    from ingestion.config import settings
+    llm = ChatGoogleGenerativeAI(
+        model=settings.cag_model,
+        google_api_key=settings.google_api_key,
+        temperature=0.0,
+    )
+    messages = [SystemMessage(content=_CAG_SYSTEM_PROMPT), HumanMessage(content=user_message)]
+    for attempt in range(3):
+        try:
+            response = await llm.ainvoke(messages)
+            return response.content
+        except Exception:
+            if attempt == 2:
+                raise
+            # exponential backoff before the next attempt: 1s, then 2s
+            await asyncio.sleep(2 ** attempt)
+    raise RuntimeError("Unreachable")
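
Note on the retry loop in `_call_gemini`: the same pattern can be factored into a small reusable helper. The sketch below is an illustration under stated assumptions, not code from the commit; `retry_async`, `base_delay`, and the `flaky` coroutine (a stand-in for `llm.ainvoke`) are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_async(
    call: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Await call(); on failure sleep base_delay * 2**attempt, then try again."""
    for attempt in range(attempts):
        try:
            return await call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2**attempt)
    raise RuntimeError("unreachable")


calls = {"count": 0}


async def flaky() -> str:
    """Hypothetical stand-in for llm.ainvoke: fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = asyncio.run(retry_async(flaky, base_delay=0.01))
print(result, calls["count"])
```

With the defaults (`attempts=3`, `base_delay=1.0`) the waits are 1 s and 2 s, matching the loop in `_call_gemini`.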

ingestion/jobs/celery_app.py ADDED

	@@ -0,0 +1,33 @@

+from __future__ import annotations
+from celery import Celery
+from celery.schedules import crontab
+from ingestion.config import settings
+celery_app = Celery(
+    "ingestion",
+    broker=settings.redis_url,
+    backend=settings.redis_url,
+    include=[
+        "ingestion.jobs.ingest_job",
+        "ingestion.jobs.cag_job",
+    ],
+)
+celery_app.conf.update(
+    task_serializer="json",
+    result_serializer="json",
+    accept_content=["json"],
+    timezone="UTC",
+    enable_utc=True,
+    task_track_started=True,
+    result_expires=86400,
+)
+celery_app.conf.beat_schedule = {
+    "cag-nightly": {
+        "task": "ingestion.jobs.cag_job.run_cag",
+        "schedule": crontab(hour=2, minute=0),
+    }
+}

ingestion/jobs/ingest_job.py ADDED

	@@ -0,0 +1,124 @@

+from __future__ import annotations
+import asyncio
+import logging
+from datetime import datetime
+from typing import Any
+from celery import Task
+from ingestion.jobs.celery_app import celery_app
+from ingestion.models import IngestJobRecord, IngestJobStatus, IngestSourcePayload
+logger = logging.getLogger(__name__)
+_SOURCE_REGISTRY: dict[str, Any] = {}
+def _get_source_registry() -> dict[str, Any]:
+    if not _SOURCE_REGISTRY:
+        from ingestion.sources.confluence import ConfluenceSource
+        from ingestion.sources.github import GithubSource
+        from ingestion.sources.jira import JiraSource
+        from ingestion.sources.pdf import PDFSource
+        _SOURCE_REGISTRY.update(
+            {
+                "confluence": ConfluenceSource,
+                "github": GithubSource,
+                "pdf": PDFSource,
+                "jira": JiraSource,
+            }
+        )
+    return _SOURCE_REGISTRY
+@celery_app.task(bind=True, name="ingestion.jobs.ingest_job.run_ingest", max_retries=2)
+def run_ingest(self: Task, payload: dict[str, Any]) -> dict[str, Any]:
+    job_id = self.request.id
+    return asyncio.run(_run_ingest_async(job_id, IngestSourcePayload(**payload)))
+async def _run_ingest_async(job_id: str, payload: IngestSourcePayload) -> dict[str, Any]:
+    from ingestion.pipeline.chunker import chunk_document
+    from ingestion.pipeline.embedder import embed_chunks
+    from ingestion.pipeline.pii_masker import mask_pii
+    from ingestion.storage.bm25_store import rebuild_from_supabase
+    from ingestion.storage.qdrant_store import delete_chunks_for_doc, upsert_chunks as qdrant_upsert
+    from ingestion.storage.supabase_store import (
+        delete_chunks_for_doc as sb_delete_chunks,
+        get_client,
+        upsert_chunks as sb_upsert_chunks,
+        upsert_document,
+        upsert_job,
+    )
+    sb = get_client()
+    job = IngestJobRecord(
+        job_id=job_id,
+        celery_task_id=job_id,
+        status=IngestJobStatus.running,
+        source_type=payload.source_type,
+        team_id=payload.team_id,
+    )
+    upsert_job(job, client=sb)
+    total_chunks = 0
+    try:
+        registry = _get_source_registry()
+        source_cls = registry.get(payload.source_type)
+        if source_cls is None:
+            raise ValueError(f"Unknown source type: {payload.source_type}")
+        if payload.source_type == "github":
+            # GithubSource requires supabase_client for SHA change-detection
+            source = source_cls(team_id=payload.team_id, supabase_client=sb, **payload.params)
+        elif payload.source_type == "pdf":
+            # content arrives as base64 string because Celery serialises to JSON
+            import base64
+            params = dict(payload.params)
+            if isinstance(params.get("content"), str):
+                params["content"] = base64.b64decode(params["content"])
+            source = source_cls(team_id=payload.team_id, **params)
+        else:
+            source = source_cls(team_id=payload.team_id, **payload.params)
+        raw_docs = await source.fetch()
+        logger.info("ingest_job: fetched %d documents from %s", len(raw_docs), payload.source_type)
+        for doc in raw_docs:
+            # Delete stale vectors and chunk records before re-ingesting the same document
+            delete_chunks_for_doc(doc.doc_id)
+            sb_delete_chunks(doc.doc_id, client=sb)
+            chunks = chunk_document(doc)
+            if not chunks:
+                logger.warning("ingest_job: no chunks produced for doc_id=%s", doc.doc_id)
+                continue
+            # Mask PII in chunk text before embedding or storing
+            for chunk in chunks:
+                chunk.text = mask_pii(chunk.text)
+            embedded = embed_chunks(chunks)
+            qdrant_upsert(embedded)
+            sb_upsert_chunks(chunks, client=sb)
+            upsert_document(doc, client=sb)
+            total_chunks += len(chunks)
+            logger.info("ingest_job: ingested %d chunks for doc_id=%s", len(chunks), doc.doc_id)
+        rebuild_from_supabase()
+        job.status = IngestJobStatus.completed
+        job.completed_at = datetime.utcnow()
+        job.chunks_ingested = total_chunks
+    except Exception as exc:
+        logger.exception("ingest_job: job %s failed", job_id)
+        job.status = IngestJobStatus.failed
+        job.completed_at = datetime.utcnow()
+        job.error = str(exc)
+    upsert_job(job, client=sb)
+    return {"job_id": job_id, "status": job.status.value, "chunks_ingested": total_chunks}

ingestion/models.py ADDED

	@@ -0,0 +1,86 @@

+from __future__ import annotations
+from datetime import datetime
+from enum import Enum
+from typing import Any, Literal, Optional
+from pydantic import BaseModel, Field
+class RawDocument(BaseModel):
+    doc_id: str
+    title: str
+    content: str
+    source_url: str
+    source_type: str
+    team_id: str
+    metadata: dict[str, Any] = Field(default_factory=dict)
+    fetched_at: datetime = Field(default_factory=datetime.utcnow)
+class DocumentChunk(BaseModel):
+    chunk_id: str
+    doc_id: str
+    text: str
+    source: str
+    source_type: str
+    team_id: str
+    chunk_index: int
+    metadata: dict[str, Any] = Field(default_factory=dict)
+class EmbeddedChunk(BaseModel):
+    chunk_id: str
+    doc_id: str
+    text: str
+    source: str
+    source_type: str
+    team_id: str
+    chunk_index: int
+    dense_vector: list[float]
+    sparse_indices: list[int]
+    sparse_values: list[float]
+    metadata: dict[str, Any] = Field(default_factory=dict)
+class IngestJobStatus(str, Enum):
+    pending = "pending"
+    running = "running"
+    completed = "completed"
+    failed = "failed"
+class IngestJobRecord(BaseModel):
+    job_id: str
+    celery_task_id: str
+    status: IngestJobStatus
+    source_type: str
+    team_id: str
+    created_at: datetime = Field(default_factory=datetime.utcnow)
+    completed_at: Optional[datetime] = None
+    error: Optional[str] = None
+    chunks_ingested: int = 0
+class IngestSourcePayload(BaseModel):
+    source_type: Literal["confluence", "github", "pdf", "jira"]
+    team_id: str
+    params: dict[str, Any] = Field(default_factory=dict)
+class ConfluenceIngestRequest(BaseModel):
+    space_key: str
+    team_id: str
+    page_ids: Optional[list[str]] = None
+class GithubIngestRequest(BaseModel):
+    repo_url: str
+    team_id: str
+    path_filter: str = "docs/"
+    branch: str = "main"
+class IngestJobResponse(BaseModel):
+    job_id: str
+    status: IngestJobStatus

ingestion/pipeline/__init__.py ADDED

File without changes

ingestion/pipeline/chunker.py ADDED

	@@ -0,0 +1,96 @@

+from __future__ import annotations
+import hashlib
+import logging
+from typing import Optional
+import spacy
+from ingestion.config import settings
+from ingestion.models import DocumentChunk, RawDocument
+logger = logging.getLogger(__name__)
+_nlp: Optional[spacy.language.Language] = None
+def _get_nlp() -> spacy.language.Language:
+    global _nlp
+    if _nlp is None:
+        logger.info("chunker: loading spacy model %s", settings.spacy_model)
+        _nlp = spacy.load(settings.spacy_model, disable=["ner", "parser"])
+        _nlp.enable_pipe("senter")
+    return _nlp
+def _token_count(nlp: spacy.language.Language, text: str) -> int:
+    return len(nlp.tokenizer(text))
+def chunk_document(doc: RawDocument) -> list[DocumentChunk]:
+    nlp = _get_nlp()
+    spacy_doc = nlp(doc.content)
+    sentences = [s.text.strip() for s in spacy_doc.sents if s.text.strip()]
+    if not sentences:
+        return []
+    token_counts = [_token_count(nlp, s) for s in sentences]
+    target = settings.chunk_target_tokens
+    hard_max = settings.chunk_max_tokens
+    overlap_budget = int(target * settings.chunk_overlap_ratio)
+    chunks: list[DocumentChunk] = []
+    chunk_start = 0
+    while chunk_start < len(sentences):
+        accumulated = 0
+        chunk_end = chunk_start
+        while chunk_end < len(sentences):
+            next_count = token_counts[chunk_end]
+            # Force-include at least one sentence even if it exceeds hard_max
+            if accumulated + next_count > hard_max and accumulated > 0:
+                break
+            accumulated += next_count
+            chunk_end += 1
+            if accumulated >= target:
+                break
+        text = " ".join(sentences[chunk_start:chunk_end])
+        chunk_id = hashlib.sha256(
+            f"{doc.doc_id}:{chunk_start}:{chunk_end}".encode()
+        ).hexdigest()
+        chunks.append(
+            DocumentChunk(
+                chunk_id=chunk_id,
+                doc_id=doc.doc_id,
+                text=text,
+                source=doc.source_url,
+                source_type=doc.source_type,
+                team_id=doc.team_id,
+                chunk_index=len(chunks),
+                metadata={**doc.metadata, "title": doc.title},
+            )
+        )
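+        # Stop once the final sentence is consumed; otherwise the overlap window
+        # would be re-emitted as a redundant trailing chunk
+        if chunk_end >= len(sentences):
+            break
+        # Walk back from chunk_end to find the overlap window for the next chunk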
+        overlap_tokens = 0
+        overlap_start = chunk_end
+        while overlap_start > chunk_start + 1:
+            candidate = overlap_start - 1
+            if overlap_tokens + token_counts[candidate] <= overlap_budget:
+                overlap_tokens += token_counts[candidate]
+                overlap_start = candidate
+            else:
+                break
+        # Guard: if overlap would not advance the cursor, step forward unconditionally
+        next_start = overlap_start if overlap_start > chunk_start else chunk_end
+        if next_start <= chunk_start:
+            next_start = chunk_end
+        chunk_start = next_start
+    logger.debug("chunker: %s -> %d chunks", doc.doc_id, len(chunks))
+    return chunks
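
Review note: the windowing logic in `chunk_document` can be exercised without spaCy. The following is a minimal sketch that works on precomputed per-sentence token counts. The function name `plan_chunks` and the sample numbers are hypothetical, and the early exit after the final sentence is an assumption made for this sketch.

```python
def plan_chunks(token_counts, target, hard_max, overlap_ratio):
    """Return (start, end) sentence-index windows; end is exclusive."""
    overlap_budget = int(target * overlap_ratio)
    windows = []
    start = 0
    while start < len(token_counts):
        accumulated = 0
        end = start
        while end < len(token_counts):
            nxt = token_counts[end]
            # Always take at least one sentence, even if it exceeds hard_max
            if accumulated + nxt > hard_max and accumulated > 0:
                break
            accumulated += nxt
            end += 1
            if accumulated >= target:
                break
        windows.append((start, end))
        if end >= len(token_counts):
            break  # assumption: stop rather than re-emit the overlap tail
        # Walk back from end while trailing sentences fit the overlap budget
        overlap_tokens = 0
        next_start = end
        while next_start > start + 1:
            candidate = next_start - 1
            if overlap_tokens + token_counts[candidate] > overlap_budget:
                break
            overlap_tokens += token_counts[candidate]
            next_start = candidate
        start = next_start
    return windows


# Six sentences with made-up token counts
windows = plan_chunks([40, 50, 30, 60, 20, 45], target=100, hard_max=150, overlap_ratio=0.3)
print(windows)
```

Each window after the first starts one or more sentences before the previous window ended, as long as those sentences fit within the overlap budget.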

ingestion/pipeline/embedder.py ADDED

	@@ -0,0 +1,75 @@

+from __future__ import annotations
+import logging
+from typing import Optional
+from FlagEmbedding import BGEM3FlagModel
+from ingestion.config import settings
+from ingestion.models import DocumentChunk, EmbeddedChunk
+logger = logging.getLogger(__name__)
+# Loaded lazily on first use and cached at module level; BGE-M3 is large,
+# so per-request initialisation would add unacceptable latency
+_model: Optional[BGEM3FlagModel] = None
+def _get_model() -> BGEM3FlagModel:
+    global _model
+    if _model is None:
+        logger.info("embedder: loading BGE-M3 model %s", settings.bge_embedding_model)
+        _model = BGEM3FlagModel(settings.bge_embedding_model, use_fp16=True)
+    return _model
+def embed_chunks(chunks: list[DocumentChunk]) -> list[EmbeddedChunk]:
+    if not chunks:
+        return []
+    model = _get_model()
+    batch_size = settings.embed_batch_size
+    embedded: list[EmbeddedChunk] = []
+    for batch_start in range(0, len(chunks), batch_size):
+        batch = chunks[batch_start : batch_start + batch_size]
+        texts = [c.text for c in batch]
+        output = model.encode(
+            texts,
+            return_dense=True,
+            return_sparse=True,
+            return_colbert_vecs=False,
+            batch_size=batch_size,
+        )
+        for i, chunk in enumerate(batch):
+            dense_vector: list[float] = output["dense_vecs"][i].tolist()
+            # lexical_weights maps token id -> weight; FlagEmbedding may return the ids
+            # as strings and the weights as NumPy scalars, so cast both explicitly
+            sparse_weights = output["lexical_weights"][i]
+            sparse_indices = [int(k) for k in sparse_weights]
+            sparse_values = [float(v) for v in sparse_weights.values()]
+            embedded.append(
+                EmbeddedChunk(
+                    chunk_id=chunk.chunk_id,
+                    doc_id=chunk.doc_id,
+                    text=chunk.text,
+                    source=chunk.source,
+                    source_type=chunk.source_type,
+                    team_id=chunk.team_id,
+                    chunk_index=chunk.chunk_index,
+                    dense_vector=dense_vector,
+                    sparse_indices=sparse_indices,
+                    sparse_values=sparse_values,
+                    metadata=chunk.metadata,
+                )
+            )
+        logger.debug(
+            "embedder: embedded batch %d-%d of %d",
+            batch_start,
+            batch_start + len(batch),
+            len(chunks),
+        )
+    return embedded
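
Review note: the sparse half of each embedding is stored as two parallel lists. A minimal sketch of that conversion follows, assuming the weights arrive as a mapping from token id to weight. The helper name `to_sparse_arrays` and the sample mapping are hypothetical.

```python
def to_sparse_arrays(lexical_weights):
    """Split a {token_id: weight} mapping into parallel index/value lists."""
    indices, values = [], []
    for token_id, weight in lexical_weights.items():
        indices.append(int(token_id))  # accepts "1024" as well as 1024
        values.append(float(weight))   # accepts NumPy scalars as well as floats
    return indices, values


# Made-up weights mixing string and integer keys
indices, values = to_sparse_arrays({"6": 0.25, "1024": 0.5, 77: 0.125})
print(indices, values)
```

Casting on the way in keeps the lists compatible with the `list[int]` and `list[float]` fields on `EmbeddedChunk` without relying on implicit coercion.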

ingestion/pipeline/pii_masker.py ADDED

	@@ -0,0 +1,47 @@

+from __future__ import annotations
+import logging
+from typing import Optional
+from gliner import GLiNER
+from ingestion.config import settings
+logger = logging.getLogger(__name__)
+_PII_ENTITY_TYPES = [
+    "person",
+    "email",
+    "phone",
+    "ssn",
+    "credit_card",
+    "address",
+    "date_of_birth",
+]
+_model: Optional[GLiNER] = None
+def _get_model() -> GLiNER:
+    global _model
+    if _model is None:
+        logger.info("pii_masker: loading GLiNER model %s", settings.gliner_model)
+        _model = GLiNER.from_pretrained(settings.gliner_model)
+    return _model
+def mask_pii(text: str) -> str:
+    try:
+        model = _get_model()
+        entities = model.predict_entities(text, _PII_ENTITY_TYPES, threshold=0.5)
+        # iterate reverse so substring replacements don't shift later indices
+        for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
+            text = text[: ent["start"]] + "[REDACTED]" + text[ent["end"] :]
+        return text
+    except Exception:
+        logger.exception("pii_masker: masking failed; using original text")
+        return text
+def mask_chunks(texts: list[str]) -> list[str]:
+    return [mask_pii(t) for t in texts]
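
Review note: `mask_pii` replaces spans from right to left so that earlier offsets remain valid. A small sketch of that technique, independent of GLiNER, is shown here. The function name `redact_spans` and the sample sentence are hypothetical, and the spans are located with `str.index` purely for illustration.

```python
def redact_spans(text, spans, placeholder="[REDACTED]"):
    """Replace each (start, end) span with a placeholder, working right to left."""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


sample = "Contact Jane Doe at jane@example.com today."
# Stand-in for model output: character offsets of the two entities
spans = [
    (sample.index(s), sample.index(s) + len(s))
    for s in ("Jane Doe", "jane@example.com")
]
masked = redact_spans(sample, spans)
print(masked)
```

Processing left to right instead would shift every later offset by the length difference between the entity and the placeholder.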

ingestion/pipeline/reranker.py ADDED

	@@ -0,0 +1,32 @@

+from __future__ import annotations
+import logging
+from typing import Optional
+from FlagEmbedding import FlagReranker
+from ingestion.config import settings
+logger = logging.getLogger(__name__)
+_reranker: Optional[FlagReranker] = None
+def _get_reranker() -> FlagReranker:
+    global _reranker
+    if _reranker is None:
+        logger.info("reranker: loading model %s", settings.bge_reranker_model)
+        _reranker = FlagReranker(settings.bge_reranker_model, use_fp16=True)
+    return _reranker
+def rerank(query: str, texts: list[str], normalize: bool = True) -> list[float]:
+    if not texts:
+        return []
+    reranker = _get_reranker()
+    pairs = [(query, t) for t in texts]
+    try:
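+        scores = reranker.compute_score(pairs, normalize=normalize)
+        # Some FlagEmbedding releases return a bare float for a single pair
+        return [float(scores)] if isinstance(scores, (int, float)) else list(scores)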
+    except Exception:
+        logger.exception("reranker: compute_score failed")
+        return [0.0] * len(texts)

ingestion/sources/__init__.py ADDED

File without changes

ingestion/sources/base.py ADDED

	@@ -0,0 +1,15 @@

+from __future__ import annotations
+from abc import ABC, abstractmethod
+from ingestion.models import RawDocument
+class BaseSource(ABC):
+    @abstractmethod
+    async def fetch(self) -> list[RawDocument]:
+        ...
+    @abstractmethod
+    def source_type(self) -> str:
+        ...

ingestion/sources/confluence.py ADDED

	@@ -0,0 +1,117 @@

+from __future__ import annotations
+import hashlib
+import logging
+import re
+from datetime import datetime
+from typing import Optional
+import httpx
+from ingestion.config import settings
+from ingestion.models import RawDocument
+from ingestion.sources.base import BaseSource
+logger = logging.getLogger(__name__)
+_HTML_TAG_RE = re.compile(r"<[^>]+>")
+_WHITESPACE_RE = re.compile(r"\s{2,}")
+def _strip_html(html: str) -> str:
+    text = _HTML_TAG_RE.sub(" ", html)
+    return _WHITESPACE_RE.sub(" ", text).strip()
+class ConfluenceSource(BaseSource):
+    def __init__(
+        self,
+        team_id: str,
+        space_key: str,
+        page_ids: Optional[list[str]] = None,
+        base_url: str = "",
+        token: str = "",
+        email: str = "",
+    ) -> None:
+        self._team_id = team_id
+        self._space_key = space_key
+        self._page_ids = page_ids
+        self._base_url = base_url or settings.confluence_base_url
+        self._token = token or settings.confluence_token
+        self._email = email or settings.confluence_email
+    def source_type(self) -> str:
+        return "confluence"
+    def _auth_headers(self) -> dict[str, str]:
+        import base64
+        credentials = base64.b64encode(f"{self._email}:{self._token}".encode()).decode()
+        return {"Authorization": f"Basic {credentials}", "Accept": "application/json"}
+    async def fetch(self) -> list[RawDocument]:
+        if not self._base_url or not self._token:
+            logger.warning("confluence: credentials not configured, returning empty")
+            return []
+        async with httpx.AsyncClient(headers=self._auth_headers(), timeout=30) as client:
+            if self._page_ids:
+                return await self._fetch_pages(client, self._page_ids)
+            return await self._fetch_space(client)
+    async def _fetch_space(self, client: httpx.AsyncClient) -> list[RawDocument]:
+        docs: list[RawDocument] = []
+        url = f"{self._base_url}/wiki/rest/api/content"
+        params: dict = {"spaceKey": self._space_key, "expand": "body.storage", "limit": 50, "start": 0}
+        while True:
+            resp = await client.get(url, params=params)
+            resp.raise_for_status()
+            data = resp.json()
+            for page in data.get("results", []):
+                doc = self._page_to_raw_document(page)
+                if doc:
+                    docs.append(doc)
+            if data.get("_links", {}).get("next"):
+                params["start"] += params["limit"]
+            else:
+                break
+        logger.info("confluence: fetched %d pages from space %s", len(docs), self._space_key)
+        return docs
+    async def _fetch_pages(self, client: httpx.AsyncClient, page_ids: list[str]) -> list[RawDocument]:
+        docs: list[RawDocument] = []
+        for page_id in page_ids:
+            url = f"{self._base_url}/wiki/rest/api/content/{page_id}"
+            try:
+                resp = await client.get(url, params={"expand": "body.storage"})
+                resp.raise_for_status()
+                doc = self._page_to_raw_document(resp.json())
+                if doc:
+                    docs.append(doc)
+            except Exception:
+                logger.exception("confluence: failed to fetch page %s", page_id)
+        return docs
+    def _page_to_raw_document(self, page: dict) -> Optional[RawDocument]:
+        try:
+            page_id = page["id"]
+            title = page.get("title", "Untitled")
+            html_body = page.get("body", {}).get("storage", {}).get("value", "")
+            content = _strip_html(html_body)
+            if not content.strip():
+                return None
+            source_url = f"{self._base_url}/wiki/spaces/{self._space_key}/pages/{page_id}"
+            doc_id = hashlib.sha256(f"confluence:{page_id}".encode()).hexdigest()
+            return RawDocument(
+                doc_id=doc_id,
+                title=title,
+                content=content,
+                source_url=source_url,
+                source_type="confluence",
+                team_id=self._team_id,
+                metadata={"page_id": page_id, "space_key": self._space_key},
+            )
+        except Exception:
+            logger.exception("confluence: failed to parse page payload")
+            return None
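
Review note: `_fetch_space` pages through results by advancing `start` until the response no longer carries a `next` link. The loop can be sketched against a fake endpoint as follows. The names `fetch_all` and `fake_page` are hypothetical, and the response shape is assumed to match what the code above reads.

```python
def fetch_all(get_page, limit=50):
    """Collect results from an offset-paginated endpoint.

    get_page(start, limit) is a hypothetical callable returning a dict shaped
    like {"results": [...], "_links": {"next": ...}}.
    """
    results, start = [], 0
    while True:
        data = get_page(start, limit)
        results.extend(data.get("results", []))
        if not data.get("_links", {}).get("next"):
            break
        start += limit
    return results


def fake_page(start, limit, total=120):
    """Simulate a server holding `total` items numbered from zero."""
    items = list(range(start, min(start + limit, total)))
    links = {"next": "/more"} if start + limit < total else {}
    return {"results": items, "_links": links}


pages = fetch_all(fake_page)
print(len(pages))
```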

ingestion/sources/github.py ADDED

	@@ -0,0 +1,163 @@

+from __future__ import annotations
+import base64
+import hashlib
+import logging
+from typing import Optional
+from urllib.parse import urlparse
+import httpx
+from supabase import Client
+from ingestion.config import settings
+from ingestion.models import RawDocument
+from ingestion.sources.base import BaseSource
+logger = logging.getLogger(__name__)
+def _parse_owner_repo(repo_url: str) -> tuple[str, str]:
+    path = urlparse(repo_url).path.strip("/")
+    parts = path.split("/")
+    if len(parts) < 2:
+        raise ValueError(f"Cannot parse owner/repo from URL: {repo_url}")
+    return parts[0], parts[1]
+class GithubSource(BaseSource):
+    def __init__(
+        self,
+        team_id: str,
+        repo_url: str,
+        supabase_client: Client,
+        path_filter: str = "",
+        branch: str = "",
+        token: str = "",
+    ) -> None:
+        self._team_id = team_id
+        self._repo_url = repo_url
+        self._supabase = supabase_client
+        self._path_filter = path_filter or settings.github_path_filter
+        self._branch = branch or settings.github_branch
+        self._token = token or settings.github_token
+        self._owner, self._repo = _parse_owner_repo(repo_url)
+    def source_type(self) -> str:
+        return "github"
+    def _headers(self) -> dict[str, str]:
+        h = {"Accept": "application/vnd.github+json", "X-GitHub-Api-Version": "2022-11-28"}
+        if self._token:
+            h["Authorization"] = f"Bearer {self._token}"
+        return h
+    async def fetch(self) -> list[RawDocument]:
+        async with httpx.AsyncClient(headers=self._headers(), timeout=30) as client:
+            latest_sha = await self._get_latest_commit_sha(client)
+            stored_sha = self._get_stored_sha()
+            if latest_sha and latest_sha == stored_sha:
+                logger.info("github: %s/%s unchanged at SHA %s — skipping", self._owner, self._repo, latest_sha)
+                return []
+            docs = await self._fetch_markdown_files(client, latest_sha or self._branch)
+            if docs and latest_sha:
+                self._store_sha(latest_sha)
+            logger.info("github: fetched %d documents from %s/%s", len(docs), self._owner, self._repo)
+            return docs
+    async def _get_latest_commit_sha(self, client: httpx.AsyncClient) -> Optional[str]:
+        url = f"{settings.github_api_url}/repos/{self._owner}/{self._repo}/commits"
+        params = {"path": self._path_filter, "per_page": 1, "sha": self._branch}
+        try:
+            resp = await client.get(url, params=params)
+            resp.raise_for_status()
+            commits = resp.json()
+            return commits[0]["sha"] if commits else None
+        except Exception:
+            logger.exception("github: failed to get latest commit SHA")
+            return None
+    def _get_stored_sha(self) -> Optional[str]:
+        repo_doc_id = self._repo_doc_id()
+        try:
+            result = (
+                self._supabase.table("documents")
+                .select("last_commit_sha")
+                .eq("doc_id", repo_doc_id)
+                .maybe_single()
+                .execute()
+            )
+            return result.data["last_commit_sha"] if result.data else None
+        except Exception:
+            logger.exception("github: failed to fetch stored SHA")
+            return None
+    def _store_sha(self, sha: str) -> None:
+        repo_doc_id = self._repo_doc_id()
+        try:
+            self._supabase.table("documents").upsert(
+                {
+                    "doc_id": repo_doc_id,
+                    "title": f"{self._owner}/{self._repo}",
+                    "source_url": self._repo_url,
+                    "source_type": "github",
+                    "team_id": self._team_id,
+                    "last_commit_sha": sha,
+                },
+                on_conflict="doc_id",
+            ).execute()
+        except Exception:
+            logger.exception("github: failed to store new SHA")
+    def _repo_doc_id(self) -> str:
+        return hashlib.sha256(f"github:repo:{self._owner}/{self._repo}".encode()).hexdigest()
+    async def _fetch_markdown_files(self, client: httpx.AsyncClient, ref: str) -> list[RawDocument]:
+        url = f"{settings.github_api_url}/repos/{self._owner}/{self._repo}/git/trees/{ref}"
+        try:
+            resp = await client.get(url, params={"recursive": "1"})
+            resp.raise_for_status()
+            tree = resp.json().get("tree", [])
+        except Exception:
+            logger.exception("github: failed to fetch file tree")
+            return []
+        md_paths = [
+            item["path"]
+            for item in tree
+            if item["type"] == "blob"
+            and item["path"].endswith(".md")
+            and item["path"].startswith(self._path_filter)
+        ]
+        docs: list[RawDocument] = []
+        for path in md_paths:
+            doc = await self._fetch_file(client, path)
+            if doc:
+                docs.append(doc)
+        return docs
+    async def _fetch_file(self, client: httpx.AsyncClient, path: str) -> Optional[RawDocument]:
+        url = f"{settings.github_api_url}/repos/{self._owner}/{self._repo}/contents/{path}"
+        try:
+            resp = await client.get(url, params={"ref": self._branch})
+            resp.raise_for_status()
+            data = resp.json()
+            content = base64.b64decode(data["content"]).decode("utf-8", errors="replace")
+            doc_id = hashlib.sha256(f"github:{self._owner}/{self._repo}/{path}".encode()).hexdigest()
+            source_url = data.get("html_url", f"{self._repo_url}/blob/{self._branch}/{path}")
+            return RawDocument(
+                doc_id=doc_id,
+                title=path.split("/")[-1].removesuffix(".md"),
+                content=content,
+                source_url=source_url,
+                source_type="github",
+                team_id=self._team_id,
+                metadata={"repo": f"{self._owner}/{self._repo}", "path": path, "branch": self._branch},
+            )
+        except Exception:
+            logger.exception("github: failed to fetch file %s", path)
+            return None
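
Review note: `_parse_owner_repo` takes the first two path segments of the repository URL. A sketch of the same idea follows, with one addition that is not in the diff: it strips a trailing `.git`, on the assumption that clone URLs may be passed in. The function name `parse_owner_repo` and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse


def parse_owner_repo(repo_url):
    """Return (owner, repo) from a GitHub-style repository URL."""
    parts = urlparse(repo_url).path.strip("/").split("/")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"Cannot parse owner/repo from URL: {repo_url}")
    # Assumption for this sketch: tolerate clone URLs ending in ".git"
    return parts[0], parts[1].removesuffix(".git")


clone_url = parse_owner_repo("https://github.com/octocat/hello-world.git")
deep_url = parse_owner_repo("https://github.com/octocat/hello-world/tree/main")
print(clone_url, deep_url)
```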

ingestion/sources/jira.py ADDED

	@@ -0,0 +1,69 @@

+from __future__ import annotations
+import logging
+from ingestion.config import settings
+from ingestion.models import RawDocument
+from ingestion.sources.base import BaseSource
+logger = logging.getLogger(__name__)
+class JiraSource(BaseSource):
+    def __init__(
+        self,
+        team_id: str,
+        project_key: str = "",
+        lookback_days: int = 30,
+        base_url: str = "",
+        api_token: str = "",
+    ) -> None:
+        self._team_id = team_id
+        self._project_key = project_key or settings.jira_project_key
+        self._lookback_days = lookback_days
+        self._base_url = base_url or settings.jira_base_url
+        self._api_token = api_token or settings.jira_api_token
+    def source_type(self) -> str:
+        return "jira"
+    async def fetch(self) -> list[RawDocument]:
+        """Stub: always returns an empty list, even when JIRA_BASE_URL and JIRA_API_TOKEN are set.
+
+        The intended implementation is sketched in the comments below.
+        """
+        if not self._base_url or not self._api_token:
+            logger.info("jira: credentials not configured, returning empty")
+            return []
+        logger.warning("jira: fetch stub returning empty results")
+        return []
+        # Real implementation shape (uncomment when credentials available):
+        #
+        # since = (datetime.utcnow() - timedelta(days=self._lookback_days)).strftime("%Y-%m-%d")
+        # jql = f'project = "{self._project_key}" AND updated >= "{since}" ORDER BY updated DESC'
+        # url = f"{self._base_url}/rest/api/3/search"
+        # import base64
+        # headers = {
+        #     "Authorization": "Basic " + base64.b64encode(
+        #         f"{settings.confluence_email}:{self._api_token}".encode()
+        #     ).decode(),
+        #     "Accept": "application/json",
+        # }
+        # async with httpx.AsyncClient(headers=headers, timeout=30) as client:
+        #     resp = await client.get(url, params={"jql": jql, "maxResults": 100, "fields": "summary,description,status,assignee,updated"})
+        #     resp.raise_for_status()
+        #     issues = resp.json().get("issues", [])
+        # docs = []
+        # for issue in issues:
+        #     f = issue["fields"]
+        #     text = f"{f.get('summary', '')}\n{f.get('description') or ''}"
+        #     doc_id = hashlib.sha256(f"jira:{issue['key']}".encode()).hexdigest()
+        #     docs.append(RawDocument(
+        #         doc_id=doc_id,
+        #         title=f"{issue['key']}: {f.get('summary', '')}",
+        #         content=text,
+        #         source_url=f"{self._base_url}/browse/{issue['key']}",
+        #         source_type="jira",
+        #         team_id=self._team_id,
+        #         metadata={"issue_key": issue["key"], "status": f.get("status", {}).get("name")},
+        #     ))
+        # return docs

ingestion/sources/pdf.py ADDED

	@@ -0,0 +1,51 @@

+from __future__ import annotations
+import hashlib
+import logging
+import fitz  # PyMuPDF
+from ingestion.models import RawDocument
+from ingestion.sources.base import BaseSource
+logger = logging.getLogger(__name__)
+class PDFSource(BaseSource):
+    def __init__(self, team_id: str, filename: str, content: bytes) -> None:
+        self._team_id = team_id
+        self._filename = filename
+        self._content = content
+    def source_type(self) -> str:
+        return "pdf"
+    async def fetch(self) -> list[RawDocument]:
+        try:
+            # Context manager closes the document even if text extraction raises
+            with fitz.open(stream=self._content, filetype="pdf") as doc:
+                pages_text: list[str] = [page.get_text() for page in doc]
+            full_text = "\n\n".join(t for t in pages_text if t.strip())
+            if not full_text.strip():
+                logger.warning("pdf: no text extracted from %s", self._filename)
+                return []
+            doc_id = hashlib.sha256(self._content).hexdigest()
+            title = self._filename.removesuffix(".pdf")
+            return [
+                RawDocument(
+                    doc_id=doc_id,
+                    title=title,
+                    content=full_text,
+                    source_url=self._filename,
+                    source_type="pdf",
+                    team_id=self._team_id,
+                    metadata={"filename": self._filename, "pages": len(pages_text)},
+                )
+            ]
+        except Exception:
+            logger.exception("pdf: failed to parse %s", self._filename)
+            return []

ingestion/storage/__init__.py ADDED

File without changes

ingestion/storage/bm25_store.py ADDED

	@@ -0,0 +1,44 @@

+from __future__ import annotations
+import logging
+import pickle
+from pathlib import Path
+from rank_bm25 import BM25Okapi
+from ingestion.config import settings
+logger = logging.getLogger(__name__)
+def rebuild_bm25_index(chunk_ids: list[str], texts: list[str]) -> None:
+    if not chunk_ids or not texts:
+        logger.warning("bm25_store: no chunks provided, skipping rebuild")
+        return
+    if len(chunk_ids) != len(texts):
+        raise ValueError("chunk_ids and texts must have equal length")
+    tokenized_corpus = [t.lower().split() for t in texts]
+    index = BM25Okapi(tokenized_corpus)
+    index_path = Path(settings.bm25_index_path)
+    index_path.parent.mkdir(parents=True, exist_ok=True)
+    with index_path.open("wb") as f:
+        pickle.dump({"index": index, "corpus": texts, "doc_ids": chunk_ids}, f)
+    logger.info("bm25_store: rebuilt index with %d documents -> %s", len(chunk_ids), index_path)
+def rebuild_from_supabase() -> None:
+    from ingestion.storage.supabase_store import get_all_chunks
+    rows = get_all_chunks()
+    if not rows:
+        logger.warning("bm25_store: no chunks in Supabase, skipping rebuild")
+        return
+    chunk_ids = [r["chunk_id"] for r in rows]
+    texts = [r["text"] for r in rows]
+    rebuild_bm25_index(chunk_ids, texts)

ingestion/storage/qdrant_store.py ADDED

	@@ -0,0 +1,101 @@

+from __future__ import annotations
+import logging
+import uuid
+from qdrant_client import QdrantClient
+from qdrant_client.http import models as qmodels
+from ingestion.config import settings
+from ingestion.models import EmbeddedChunk
+logger = logging.getLogger(__name__)
+def _chunk_uuid(chunk_id: str) -> str:
+    # stable UUID derived from chunk_id so re-ingestion overwrites the same point
+    return str(uuid.uuid5(uuid.NAMESPACE_DNS, chunk_id))
+def _get_client() -> QdrantClient:
+    return QdrantClient(host=settings.qdrant_host, port=settings.qdrant_port)
+def ensure_collection_exists() -> None:
+    client = _get_client()
+    existing = {c.name for c in client.get_collections().collections}
+    if settings.qdrant_collection in existing:
+        return
+    client.create_collection(
+        collection_name=settings.qdrant_collection,
+        vectors_config={
+            settings.qdrant_dense_vector_name: qmodels.VectorParams(
+                size=settings.qdrant_dense_size,
+                distance=qmodels.Distance.COSINE,
+            )
+        },
+        sparse_vectors_config={
+            settings.qdrant_sparse_vector_name: qmodels.SparseVectorParams(
+                index=qmodels.SparseIndexParams(on_disk=False)
+            )
+        },
+    )
+    logger.info("qdrant_store: created collection %s", settings.qdrant_collection)
+def upsert_chunks(chunks: list[EmbeddedChunk]) -> None:
+    if not chunks:
+        return
+    try:
+        ensure_collection_exists()
+        client = _get_client()
+        points = [
+            qmodels.PointStruct(
+                id=_chunk_uuid(chunk.chunk_id),
+                vector={
+                    settings.qdrant_dense_vector_name: chunk.dense_vector,
+                    settings.qdrant_sparse_vector_name: qmodels.SparseVector(
+                        indices=chunk.sparse_indices,
+                        values=chunk.sparse_values,
+                    ),
+                },
+                payload={
+                    # Spread metadata first so it cannot overwrite core fields such as team_id
+                    **chunk.metadata,
+                    "chunk_id": chunk.chunk_id,
+                    "doc_id": chunk.doc_id,
+                    "text": chunk.text,
+                    "source": chunk.source,
+                    "source_type": chunk.source_type,
+                    "team_id": chunk.team_id,
+                    "chunk_index": chunk.chunk_index,
+                },
+            )
+            for chunk in chunks
+        ]
+        client.upsert(collection_name=settings.qdrant_collection, points=points)
+        logger.info("qdrant_store: upserted %d points", len(points))
+    except Exception:
+        logger.exception("qdrant_store: upsert failed")
+        raise
+def delete_chunks_for_doc(doc_id: str) -> None:
+    try:
+        client = _get_client()
+        client.delete(
+            collection_name=settings.qdrant_collection,
+            points_selector=qmodels.FilterSelector(
+                filter=qmodels.Filter(
+                    must=[qmodels.FieldCondition(key="doc_id", match=qmodels.MatchValue(value=doc_id))]
+                )
+            ),
+        )
+        logger.info("qdrant_store: deleted points for doc_id=%s", doc_id)
+    except Exception:
+        logger.exception("qdrant_store: delete failed for doc_id=%s", doc_id)
+        raise
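
Review note: `_chunk_uuid` relies on `uuid.uuid5` being deterministic, so re-ingesting the same chunk overwrites the same Qdrant point. A short check of that property is sketched below. The chunk id strings are made up.

```python
import uuid


def chunk_uuid(chunk_id):
    """Derive a stable version-5 UUID string from a chunk id."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, chunk_id))


first = chunk_uuid("doc-1:0:3")
again = chunk_uuid("doc-1:0:3")
other = chunk_uuid("doc-1:2:5")
print(first == again, first == other)
```

The same input always maps to the same point id, while a different sentence window produces a different id.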

ingestion/storage/supabase_store.py ADDED

	@@ -0,0 +1,138 @@

+from __future__ import annotations
+import logging
+from datetime import datetime
+from typing import Any, Optional
+from supabase import Client, create_client
+from ingestion.config import settings
+from ingestion.models import DocumentChunk, IngestJobRecord, IngestJobStatus, RawDocument
+logger = logging.getLogger(__name__)
+def get_client() -> Client:
+    return create_client(settings.supabase_url, settings.supabase_key)
+def upsert_document(doc: RawDocument, client: Optional[Client] = None) -> None:
+    sb = client or get_client()
+    try:
+        sb.table("documents").upsert(
+            {
+                "doc_id": doc.doc_id,
+                "title": doc.title,
+                "source_url": doc.source_url,
+                "source_type": doc.source_type,
+                "team_id": doc.team_id,
+                "metadata": doc.metadata,
+                "last_commit_sha": doc.metadata.get("last_commit_sha"),
+                "updated_at": datetime.utcnow().isoformat(),
+            },
+            on_conflict="doc_id",
+        ).execute()
+    except Exception:
+        logger.exception("supabase_store: failed to upsert document %s", doc.doc_id)
+        raise
+def upsert_chunks(chunks: list[DocumentChunk], client: Optional[Client] = None) -> None:
+    if not chunks:
+        return
+    sb = client or get_client()
+    try:
+        rows = [
+            {
+                "chunk_id": c.chunk_id,
+                "doc_id": c.doc_id,
+                "text": c.text,
+                "source": c.source,
+                "source_type": c.source_type,
+                "team_id": c.team_id,
+                "chunk_index": c.chunk_index,
+            }
+            for c in chunks
+        ]
+        sb.table("chunks").upsert(rows, on_conflict="chunk_id").execute()
+        logger.info("supabase_store: upserted %d chunks", len(rows))
+    except Exception:
+        logger.exception("supabase_store: failed to upsert chunks")
+        raise
+def delete_chunks_for_doc(doc_id: str, client: Optional[Client] = None) -> None:
+    sb = client or get_client()
+    try:
+        sb.table("chunks").delete().eq("doc_id", doc_id).execute()
+    except Exception:
+        logger.exception("supabase_store: failed to delete chunks for doc_id=%s", doc_id)
+        raise
+def upsert_job(record: IngestJobRecord, client: Optional[Client] = None) -> None:
+    sb = client or get_client()
+    try:
+        sb.table("ingest_jobs").upsert(
+            {
+                "job_id": record.job_id,
+                "celery_task_id": record.celery_task_id,
+                "status": record.status.value,
+                "source_type": record.source_type,
+                "team_id": record.team_id,
+                "created_at": record.created_at.isoformat(),
+                "completed_at": record.completed_at.isoformat() if record.completed_at else None,
+                "error": record.error,
+                "chunks_ingested": record.chunks_ingested,
+            },
+            on_conflict="job_id",
+        ).execute()
+    except Exception:
+        # Logged but not re-raised: a failed job-status write does not abort the caller.
+        logger.exception("supabase_store: failed to upsert job %s", record.job_id)
+def get_job(job_id: str, client: Optional[Client] = None) -> Optional[dict[str, Any]]:
+    sb = client or get_client()
+    try:
+        result = sb.table("ingest_jobs").select("*").eq("job_id", job_id).maybe_single().execute()
+        return result.data
+    except Exception:
+        logger.exception("supabase_store: failed to get job %s", job_id)
+        return None
+def get_all_teams(client: Optional[Client] = None) -> list[dict[str, Any]]:
+    sb = client or get_client()
+    try:
+        result = sb.table("teams").select("team_id").execute()
+        return result.data or []
+    except Exception:
+        logger.exception("supabase_store: failed to fetch teams")
+        return []
+def update_cag_snapshot(team_id: str, snapshot: str, client: Optional[Client] = None) -> None:
+    sb = client or get_client()
+    try:
+        sb.table("teams").upsert(
+            {
+                "team_id": team_id,
+                "cag_snapshot": snapshot,
+                "snapshot_at": datetime.utcnow().isoformat(),
+            },
+            on_conflict="team_id",
+        ).execute()
+        logger.info("supabase_store: updated CAG snapshot for team %s", team_id)
+    except Exception:
+        logger.exception("supabase_store: failed to update CAG snapshot for team %s", team_id)
+        raise
+def get_all_chunks(client: Optional[Client] = None) -> list[dict[str, Any]]:
+    sb = client or get_client()
+    try:
+        result = sb.table("chunks").select("chunk_id, text").execute()
+        return result.data or []
+    except Exception:
+        logger.exception("supabase_store: failed to fetch all chunks")
+        return []
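
`get_all_chunks` issues a single unbounded `select`. If the project's API caps rows per response (hosted Supabase projects commonly default to 1000), the result could be silently truncated. The following is a minimal sketch of paging, assuming the query builder's `.range(start, end)` is used to request inclusive windows. `fetch_all_rows`, `fetch_page`, and `PAGE_SIZE` are hypothetical names and are not part of this commit.

```python
from typing import Any, Callable

PAGE_SIZE = 1000  # assumed page size; adjust to the API's actual row cap


def fetch_all_rows(
    fetch_page: Callable[[int, int], list[dict[str, Any]]],
    page_size: int = PAGE_SIZE,
) -> list[dict[str, Any]]:
    """Collect rows by requesting inclusive [start, end] windows until a short page."""
    rows: list[dict[str, Any]] = []
    start = 0
    while True:
        page = fetch_page(start, start + page_size - 1)
        rows.extend(page)
        # A page shorter than page_size means there is nothing left to fetch.
        if len(page) < page_size:
            return rows
        start += page_size
```

With supabase-py, `fetch_page` might be `lambda s, e: sb.table("chunks").select("chunk_id, text").order("chunk_id").range(s, e).execute().data or []`. The explicit ordering is there because paging over an unordered query can skip or repeat rows.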

main.py ADDED

@@ -0,0 +1,19 @@

+import logging
+from fastapi import FastAPI
+from agent.api import router as agent_router
+from ingestion.api import router as ingestion_router
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
+)
+app = FastAPI(
+    title="Enterprise Knowledge Copilot",
+    version="0.1.0",
+)
+app.include_router(agent_router)
+app.include_router(ingestion_router)

requirements.txt CHANGED

@@ -1,9 +1,64 @@
-redis==5.0.1
-celery==5.3.4
-pydantic==2.5.0
-pydantic-settings==2.1.0
 aiohttp==3.9.1
 beautifulsoup4==4.12.2
 docling==0.7.0
-requests==2.31.0
-python-dotenv==1.0.0

+# Core orchestration
+langgraph==0.2.55
+langchain-core==0.3.29
+langchain-google-genai==2.0.9
+# Embeddings and reranking
+FlagEmbedding==1.2.11
+# GLiNER for PII masking
+gliner==0.2.13
+# Vector database
+qdrant-client==1.12.1
+# BM25
+rank-bm25==0.2.2
+# API framework
+fastapi==0.115.6
+uvicorn[standard]==0.32.1
+# Pydantic
+pydantic==2.10.4
+pydantic-settings==2.7.0
+# Environment management
+python-dotenv==1.0.1
+# HTTP client
+httpx==0.28.1
+requests==2.31.0
 aiohttp==3.9.1
+# Google AI SDK (transitive, pinned for stability)
+google-generativeai==0.8.3
+google-auth==2.37.0
+# Numpy (required by FlagEmbedding)
+numpy==1.26.4
+# PyTorch CPU (FlagEmbedding dependency — use GPU build if available)
+torch==2.5.1
+# Transformers (FlagEmbedding dependency)
+transformers==4.47.1
+# Ingestion — document parsing
+PyMuPDF==1.24.14
+# Ingestion — NLP / chunking
+spacy==3.8.3
+# Run after install: python -m spacy download en_core_web_sm
+# Ingestion — Supabase
+supabase==2.11.0
+# Ingestion — background jobs
+celery[redis]==5.4.0
+redis==5.2.1
+# Existing project dependencies
 beautifulsoup4==4.12.2
 docling==0.7.0
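
Pinned lists like this can easily end up naming the same package twice, for example once bare and once with an extra. Below is a minimal sketch of a check that could run in CI. `find_duplicate_pins` is a hypothetical helper, and it only handles simple `name[extras]==version` style lines, not every requirement form pip accepts.

```python
import re
from collections import defaultdict


def find_duplicate_pins(requirements_text: str) -> dict[str, list[str]]:
    """Map each package name that appears more than once to its requirement lines."""
    seen: dict[str, list[str]] = defaultdict(list)
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        # The name is everything before an extras bracket or a version operator.
        name = re.split(r"[\[<>=!~; ]", line, maxsplit=1)[0]
        # Normalise so that "Foo_Bar" and "foo-bar" compare equal.
        key = re.sub(r"[-_.]+", "-", name).lower()
        seen[key].append(line)
    return {k: v for k, v in seen.items() if len(v) > 1}
```

Calling it on the contents of `requirements.txt` returns an empty dict when every package is pinned exactly once.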

supabase/schema.sql ADDED

@@ -0,0 +1,81 @@

+-- Documents table — one row per ingested document (PDF, Confluence page, GitHub file, Jira issue)
+create table if not exists documents (
+    id          uuid primary key default gen_random_uuid(),
+    doc_id      text not null unique,
+    title       text not null,
+    source_url  text not null,
+    source_type text not null,
+    team_id     text not null,
+    metadata    jsonb not null default '{}',
+    last_commit_sha text,
+    created_at  timestamptz not null default now(),
+    updated_at  timestamptz not null default now()
+);
+create index if not exists documents_team_id_idx        on documents (team_id);
+create index if not exists documents_source_type_idx    on documents (source_type);
+create index if not exists documents_team_source_idx    on documents (team_id, source_type);
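+-- Lets the CAG job find repos for a team without a full scan, provided it filters with
+-- jsonb containment (metadata @> '{"repo": "<name>"}'); metadata->>'repo' = ... cannot use a GIN index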
+create index if not exists documents_metadata_gin_idx   on documents using gin (metadata);
+-- Chunks table — one row per text chunk produced by the chunker
+create table if not exists chunks (
+    id          uuid primary key default gen_random_uuid(),
+    chunk_id    text not null unique,
+    doc_id      text not null references documents (doc_id) on delete cascade,
+    text        text not null,
+    source      text not null,
+    source_type text not null,
+    team_id     text not null,
+    chunk_index integer not null,
+    created_at  timestamptz not null default now()
+);
+create index if not exists chunks_doc_id_idx    on chunks (doc_id);
+create index if not exists chunks_team_id_idx   on chunks (team_id);
+-- Teams table — one row per tenant team
+create table if not exists teams (
+    team_id      text primary key,
+    cag_snapshot text,
+    snapshot_at  timestamptz,
+    created_at   timestamptz not null default now()
+);
+-- Ingest jobs table — tracks Celery task state for the API
+create table if not exists ingest_jobs (
+    job_id          text primary key,
+    celery_task_id  text not null,
+    status          text not null default 'pending',
+    source_type     text not null,
+    team_id         text not null,
+    chunks_ingested integer not null default 0,
+    error           text,
+    created_at      timestamptz not null default now(),
+    completed_at    timestamptz
+);
+create index if not exists ingest_jobs_team_id_idx  on ingest_jobs (team_id);
+create index if not exists ingest_jobs_status_idx   on ingest_jobs (status);
+-- RLS: each team only sees its own documents, chunks, and ingest jobs (teams has no RLS here)
+alter table documents  enable row level security;
+alter table chunks     enable row level security;
+alter table ingest_jobs enable row level security;
+-- Service role bypasses RLS; these policies cover the anon / authenticated roles
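+-- create policy has no "if not exists"; drop first so this file can be re-run safely
+drop policy if exists "team isolation — documents" on documents;
+create policy "team isolation — documents"
+    on documents for all
+    using (team_id = current_setting('app.team_id', true));
+drop policy if exists "team isolation — chunks" on chunks;
+create policy "team isolation — chunks"
+    on chunks for all
+    using (team_id = current_setting('app.team_id', true));
+drop policy if exists "team isolation — ingest_jobs" on ingest_jobs;
+create policy "team isolation — ingest_jobs"
+    on ingest_jobs for all
+    using (team_id = current_setting('app.team_id', true));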
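
The policies compare `team_id` with `current_setting('app.team_id', true)`. The second argument makes an unset setting return NULL instead of raising an error, so the comparison yields NULL and no rows match (the policies fail closed). Below is a minimal sketch of how a caller on a direct Postgres connection might set the value per transaction. It assumes a DB-API driver with `%s` placeholders, such as psycopg. `team_scope_statement` is a hypothetical helper and is not part of this commit.

```python
def team_scope_statement(team_id: str) -> tuple[str, tuple[str]]:
    """Build a parameterised statement that sets app.team_id for the current transaction."""
    if not team_id:
        raise ValueError("team_id must be a non-empty string")
    # set_config(name, value, is_local): is_local=true scopes the value to the
    # current transaction, so it does not leak to the next user of a pooled connection.
    return "select set_config('app.team_id', %s, true)", (team_id,)
```

A caller would run `cursor.execute(*team_scope_statement("team-a"))` at the start of a transaction, before querying `documents`, `chunks`, or `ingest_jobs` under a non-service role.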