Spaces: XQ/Dokumentassistent

XQ committed on Apr 7

Commit 1441fa0 · 1 Parent(s): e128a20

Refactor to Plan-and-Execute architecture

Files changed (13)

.github/README.md +15 -10
README.md +16 -11
src/agent/memory.py +125 -0
src/agent/plan_and_execute.py +506 -0
src/agent/react_router.py +15 -6
src/agent/tools.py +226 -32
src/api/main.py +5 -3
src/api/routes.py +8 -2
src/models.py +4 -0
src/ui/app.py +51 -6
tests/test_memory.py +239 -0
tests/test_plan_and_execute.py +370 -0
tests/test_tools.py +381 -0

.github/README.md CHANGED

@@ -4,7 +4,7 @@
 [xq-dokumentassistent.hf.space](https://xq-dokumentassistent.hf.space) — hosted on Hugging Face Spaces
-A document intelligence system covering PDF ingestion, semantic chunking, hybrid retrieval with reranking, and LLM-generated answers with source citations. The LLM layer is provider-agnostic. Two modes: a LangGraph ReAct agent (default) for queries that need multiple retrieval steps, and a pipeline for lightweight models without tool-calling support. Retrieval quality is evaluated with RAGAS.
 ## How it works
@@ -14,15 +14,18 @@ At query time both indexes are searched and their results merged with reciprocal
 **Two routing modes, switchable via `AGENT_MODE`:**
-- **ReAct Agent** (default): a reasoning loop where the LLM calls tools as many times as it needs before answering. Useful for multi-hop questions or comparisons across documents. Requires a model with tool-calling support.
   | Tool | Purpose |
   |------|---------|
-  | `hybrid_search(query, top_k)` | Retrieve relevant passages |
   | `list_documents()` | See what's in the knowledge base |
   | `fetch_document(document_id)` | Read a full document |
-- **Pipeline** (`AGENT_MODE=pipeline`): a fixed LangGraph graph — language detection → optional translation → hybrid retrieval → reranking → generation. Works with lightweight local models that lack tool-calling support.
 ## Tech Stack
@@ -49,12 +52,12 @@ See `.env.example` for per-provider configuration.
 | Mode | `AGENT_MODE` | Notes |
 |------|-------------|-------|
-| ReAct | `react` (default) | Tool-calling loop, needs a model that supports tool use |
-| Pipeline | `pipeline` | Fixed graph, works with lightweight models that lack tool calling |
 Tool-calling is supported by OpenAI, Anthropic, Google GenAI, Azure OpenAI, Groq, and some Ollama models (`gemma4`, `llama3.1`, `qwen2.5`, `mistral-nemo`).
-ReAct with local Ollama (default):
 ```dotenv
 AGENT_MODE=react
@@ -70,7 +73,7 @@ LLM_PROVIDER=ollama
 OLLAMA_MODEL=gemma3
 ```
-ReAct with OpenAI:
 ```dotenv
 AGENT_MODE=react
@@ -149,8 +152,10 @@ src/
   agent/
     intent_classifier.py
     router.py              # pipeline mode (AGENT_MODE=pipeline)
-    tools.py               # hybrid_search + ToolResultStore
-    react_router.py        # ReAct mode (AGENT_MODE=react)
   evaluation/
     evaluator.py           # RAGAS metrics
   ui/

 [xq-dokumentassistent.hf.space](https://xq-dokumentassistent.hf.space) — hosted on Hugging Face Spaces
+A document intelligence system covering PDF ingestion, semantic chunking, hybrid retrieval with reranking, and LLM-generated answers with source citations. The LLM layer is provider-agnostic. Two modes: a Plan-and-Execute agent (default) with conversation memory for complex multi-step queries, and a pipeline for lightweight models without tool-calling support. Retrieval quality is evaluated with RAGAS.
 ## How it works
 **Two routing modes, switchable via `AGENT_MODE`:**
+- **Plan-and-Execute Agent** (default): a structured multi-step pipeline — a planner decomposes the query into steps, an executor runs each step via a ReAct sub-agent with tool access, and a synthesizer produces the final cited answer. Includes conversation memory for multi-turn follow-ups. Requires a model with tool-calling support.
   | Tool | Purpose |
   |------|---------|
+  | `hybrid_search(query, top_k)` | Retrieve relevant passages via hybrid search + reranking |
+  | `multi_query_search(question, top_k)` | Decompose complex questions into sub-queries, search each, merge results |
+  | `search_within_document(document_id, query, top_k)` | Find specific sections inside a known document |
+  | `summarize_document(document_id)` | Generate a structured summary of a document |
   | `list_documents()` | See what's in the knowledge base |
   | `fetch_document(document_id)` | Read a full document |
+- **Pipeline** (`AGENT_MODE=pipeline`): a predefined LangGraph graph — language detection → optional translation → hybrid retrieval → reranking → generation, with a confidence-based retry loop. Works with lightweight local models that lack tool-calling support.
 ## Tech Stack
 | Mode | `AGENT_MODE` | Notes |
 |------|-------------|-------|
+| Plan-and-Execute | `react` (default) | Structured multi-step agent with conversation memory |
+| Pipeline | `pipeline` | Predefined graph, works with lightweight models that lack tool calling |
 Tool-calling is supported by OpenAI, Anthropic, Google GenAI, Azure OpenAI, Groq, and some Ollama models (`gemma4`, `llama3.1`, `qwen2.5`, `mistral-nemo`).
+Plan-and-Execute with local Ollama (default):
 ```dotenv
 AGENT_MODE=react
 OLLAMA_MODEL=gemma3
 ```
+Plan-and-Execute with OpenAI:
 ```dotenv
 AGENT_MODE=react
   agent/
     intent_classifier.py
     router.py              # pipeline mode (AGENT_MODE=pipeline)
+    tools.py               # 6 retrieval tools + ToolResultStore
+    react_router.py        # legacy ReAct loop (superseded by plan_and_execute)
+    plan_and_execute.py    # Plan-and-Execute agent (AGENT_MODE=react)
+    memory.py              # conversation memory for multi-turn
   evaluation/
     evaluator.py           # RAGAS metrics
   ui/
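
The planner, executor and synthesizer flow described in this README can be sketched without LangGraph as a plain loop. This is a minimal illustration, not the code in `src/agent/plan_and_execute.py`: `run_plan_and_execute` and the three `fake_*` callables are hypothetical stand-ins for the LLM-backed components, and the cap of six steps is borrowed from `_MAX_STEPS` in that file.

```python
from typing import Callable, TypedDict


class PlanStep(TypedDict):
    """One planned step: an action name plus a short description."""
    action: str
    detail: str


def run_plan_and_execute(
    query: str,
    planner: Callable[[str], list[PlanStep]],
    executor: Callable[[PlanStep, str], str],
    synthesizer: Callable[[str, list[tuple[str, str]]], str],
    max_steps: int = 6,  # assumption: mirrors _MAX_STEPS in plan_and_execute.py
) -> str:
    """Plan once, execute each step in order, then synthesize one answer."""
    plan = planner(query)
    step_results: list[tuple[str, str]] = []
    for step in plan[:max_steps]:
        desc = f'{step["action"]}: {step["detail"]}'
        step_results.append((desc, executor(step, query)))
    return synthesizer(query, step_results)


# Hypothetical stubs standing in for the LLM planner, ReAct sub-agent and synthesizer.
def fake_planner(query: str) -> list[PlanStep]:
    return [{"action": "search", "detail": query}]


def fake_executor(step: PlanStep, query: str) -> str:
    return f"found passages for '{step['detail']}'"


def fake_synthesizer(query: str, results: list[tuple[str, str]]) -> str:
    cited = "; ".join(f"[{i}] {text}" for i, (_, text) in enumerate(results, 1))
    return f"Answer to '{query}': {cited}"


answer = run_plan_and_execute("exam policy", fake_planner, fake_executor, fake_synthesizer)
```

Passing the three phases in as callables keeps the control flow testable without a model; in the real router each phase is a graph node instead.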

README.md CHANGED

@@ -12,7 +12,7 @@ noindex: true
 **Live Demo:** [xq-dokumentassistent.hf.space](https://xq-dokumentassistent.hf.space) — hosted on Hugging Face Spaces
-A document intelligence system built on a RAG architecture, covering PDF ingestion, semantic chunking, hybrid retrieval with reranking, and LLM-generated answers with source citations. The LLM layer is provider-agnostic. Two modes: a pipeline for lightweight models, a LangGraph ReAct agent for queries that need multiple retrieval steps. Retrieval quality is evaluated with RAGAS.
 ## How it works
@@ -22,13 +22,16 @@ At query time both indexes are searched and their results merged with reciprocal
 **Two routing modes, switchable via `AGENT_MODE`:**
-- **Pipeline** (default): a fixed LangGraph DAG — language detection → optional translation → hybrid retrieval → reranking → generation. Works with lightweight local models like `gemma4`.
-- **ReAct Agent** (`AGENT_MODE=react`): replaces the DAG with a reasoning loop where the LLM calls tools as many times as it needs before answering. Useful for multi-hop questions or comparisons across documents. Requires a model with tool-calling support.
   | Tool | Purpose |
   |------|---------|
-  | `hybrid_search(query, top_k)` | Retrieve relevant passages |
   | `list_documents()` | See what's in the knowledge base |
   | `fetch_document(document_id)` | Read a full document |
@@ -57,12 +60,12 @@ See `.env.example` for per-provider configuration.
 | Mode | `AGENT_MODE` | Notes |
 |------|-------------|-------|
-| Pipeline | `pipeline` (default) | Fixed DAG, works with `gemma4` |
-| ReAct | `react` | Tool-calling loop, needs a model that supports tool use |
-Tool-calling is supported by OpenAI, Anthropic, Google GenAI, Azure OpenAI, Groq, and some Ollama models (`llama3.1`, `qwen2.5`, `mistral-nemo`). The default `gemma4` does not support it — use `pipeline` mode with Ollama.
-ReAct with OpenAI:
 ```dotenv
 AGENT_MODE=react
@@ -76,7 +79,7 @@ Pipeline with local Ollama:
 ```dotenv
 AGENT_MODE=pipeline
 LLM_PROVIDER=ollama
-OLLAMA_MODEL=gemma4
 ```
 ## Quick Start
@@ -149,8 +152,10 @@ src/
   agent/
     intent_classifier.py
     router.py              # pipeline mode (AGENT_MODE=pipeline)
-    tools.py               # hybrid_search + ToolResultStore
-    react_router.py        # ReAct mode (AGENT_MODE=react)
   evaluation/
     evaluator.py           # RAGAS metrics
   ui/

 **Live Demo:** [xq-dokumentassistent.hf.space](https://xq-dokumentassistent.hf.space) — hosted on Hugging Face Spaces
+A document intelligence system built on a RAG architecture, covering PDF ingestion, semantic chunking, hybrid retrieval with reranking, and LLM-generated answers with source citations. The LLM layer is provider-agnostic. Two modes: a pipeline for lightweight models, and a Plan-and-Execute agent flow with conversation memory for complex multi-step queries. Retrieval quality is evaluated with RAGAS.
 ## How it works
 **Two routing modes, switchable via `AGENT_MODE`:**
+- **Pipeline**: a predefined LangGraph graph — language detection → optional translation → hybrid retrieval → reranking → generation, with a confidence-based retry loop. Works with lightweight local models.
+- **Plan-and-Execute Agent** (default, `AGENT_MODE=react`): a structured multi-step pipeline where a planner decomposes the query into steps, an executor runs each step via a ReAct sub-agent with tool access, and a synthesizer produces the final cited answer. Includes conversation memory for multi-turn follow-ups. Requires a model with tool-calling support.
   | Tool | Purpose |
   |------|---------|
+  | `hybrid_search(query, top_k)` | Retrieve relevant passages via hybrid search + reranking |
+  | `multi_query_search(question, top_k)` | Decompose complex questions into sub-queries, search each, merge results |
+  | `search_within_document(document_id, query, top_k)` | Find specific sections inside a known document |
+  | `summarize_document(document_id)` | Generate a structured summary of a document |
   | `list_documents()` | See what's in the knowledge base |
   | `fetch_document(document_id)` | Read a full document |
 | Mode | `AGENT_MODE` | Notes |
 |------|-------------|-------|
+| Pipeline | `pipeline` | Predefined graph, works with lightweight models |
+| Plan-and-Execute (default) | `react` | Structured multi-step agent with conversation memory |
+Tool-calling is supported by OpenAI, Anthropic, Google GenAI, Azure OpenAI, Groq, and some Ollama models (`llama3.1`, `qwen2.5`, `mistral-nemo`).
+Plan-and-Execute with OpenAI:
 ```dotenv
 AGENT_MODE=react
 ```dotenv
 AGENT_MODE=pipeline
 LLM_PROVIDER=ollama
+OLLAMA_MODEL=gemma3
 ```
 ## Quick Start
   agent/
     intent_classifier.py
     router.py              # pipeline mode (AGENT_MODE=pipeline)
+    tools.py               # 6 retrieval tools + ToolResultStore
+    react_router.py        # legacy ReAct loop (superseded by plan_and_execute)
+    plan_and_execute.py    # Plan-and-Execute agent (AGENT_MODE=react)
+    memory.py              # conversation memory for multi-turn
   evaluation/
     evaluator.py           # RAGAS metrics
   ui/

src/agent/memory.py ADDED

	@@ -0,0 +1,125 @@

+"""Conversation memory for multi-turn interactions.
+Stores message history and retrieved sources across turns so that:
+- Follow-up questions can reference prior context ("what about the other one?")
+- The planner/synthesizer can see what was already discussed
+- Previously retrieved sources are available without re-searching
+"""
+import logging
+from dataclasses import dataclass, field
+from src.models import QueryResult
+logger = logging.getLogger(__name__)
+_MAX_TURNS = 20
+@dataclass
+class Turn:
+    """A single conversation turn.
+    Attributes:
+        query: The user's question.
+        answer: The assistant's response.
+        sources: Retrieved sources used to generate the answer.
+    """
+    query: str
+    answer: str
+    sources: list[QueryResult] = field(default_factory=list)
+class ConversationMemory:
+    """Manages multi-turn conversation state.
+    Stores a rolling window of recent turns and provides formatted
+    context for the planner and synthesizer prompts.
+    """
+    def __init__(self, max_turns: int = _MAX_TURNS) -> None:
+        """Initialize conversation memory.
+        Args:
+            max_turns: Maximum number of turns to retain.
+        """
+        self._max_turns = max_turns
+        self._turns: list[Turn] = []
+    @property
+    def turns(self) -> list[Turn]:
+        """Return the list of conversation turns (read-only copy)."""
+        return list(self._turns)
+    @property
+    def is_empty(self) -> bool:
+        """Return True if no conversation history exists."""
+        return len(self._turns) == 0
+    def add_turn(self, query: str, answer: str, sources: list[QueryResult] | None = None) -> None:
+        """Record a completed conversation turn.
+        Args:
+            query: The user's question.
+            answer: The assistant's response.
+            sources: Retrieved sources (optional).
+        """
+        self._turns.append(Turn(query=query, answer=answer, sources=sources or []))
+        if len(self._turns) > self._max_turns:
+            removed = self._turns.pop(0)
+            logger.debug("Evicted oldest turn: %s", removed.query[:50])
+        logger.debug("Memory now has %d turns", len(self._turns))
+    def clear(self) -> None:
+        """Clear all conversation history."""
+        self._turns.clear()
+        logger.info("Conversation memory cleared")
+    def format_history(self, max_recent: int = 5) -> str:
+        """Format recent conversation history for inclusion in prompts.
+        Args:
+            max_recent: Maximum number of recent turns to include.
+        Returns:
+            Formatted string of recent Q&A pairs, or empty string if no history.
+        """
+        if not self._turns:
+            return ""
+        recent = self._turns[-max_recent:]
+        parts: list[str] = []
+        for i, turn in enumerate(recent, 1):
+            source_note = ""
+            if turn.sources:
+                doc_ids = sorted({s.chunk.document_id for s in turn.sources})
+                source_note = f" [sources: {', '.join(doc_ids)}]"
+            parts.append(
+                f"Turn {i}:\n"
+                f"  User: {turn.query}\n"
+                f"  Assistant: {turn.answer[:500]}{source_note}"
+            )
+        return "\n\n".join(parts)
+    def get_prior_sources(self) -> list[QueryResult]:
+        """Return all unique sources from prior turns, sorted by score.
+        Returns:
+            Deduplicated list of QueryResult from all past turns.
+        """
+        by_id: dict[str, QueryResult] = {}
+        for turn in self._turns:
+            for r in turn.sources:
+                cid = r.chunk.chunk_id
+                if cid not in by_id or r.score > by_id[cid].score:
+                    by_id[cid] = r
+        return sorted(by_id.values(), key=lambda r: r.score, reverse=True)
+    def last_query(self) -> str:
+        """Return the last user query, or empty string."""
+        return self._turns[-1].query if self._turns else ""
+    def last_sources(self) -> list[QueryResult]:
+        """Return sources from the most recent turn."""
+        return self._turns[-1].sources if self._turns else []
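
The rolling-window behaviour of `ConversationMemory` can be tried in isolation with a reduced stand-in. The sketch below is not the committed class: `RollingMemory` is a hypothetical name, and plain document-ID strings replace `QueryResult` objects so the example runs without the project's `src.models` module. The eviction rule and the history layout follow `add_turn` and `format_history` above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Turn:
    """One question/answer pair; doc_ids stands in for QueryResult sources."""
    query: str
    answer: str
    doc_ids: list[str] = field(default_factory=list)


class RollingMemory:
    """Keeps at most max_turns turns, evicting the oldest first."""

    def __init__(self, max_turns: int = 20) -> None:
        self._max_turns = max_turns
        self._turns: list[Turn] = []

    def add_turn(self, query: str, answer: str, doc_ids: list[str] | None = None) -> None:
        self._turns.append(Turn(query, answer, doc_ids or []))
        if len(self._turns) > self._max_turns:
            self._turns.pop(0)  # drop the oldest turn

    def format_history(self, max_recent: int = 5) -> str:
        parts = []
        for i, turn in enumerate(self._turns[-max_recent:], 1):
            # Sorted, de-duplicated document IDs, as in the original source note.
            ids = ", ".join(sorted(set(turn.doc_ids)))
            note = f" [sources: {ids}]" if turn.doc_ids else ""
            parts.append(
                f"Turn {i}:\n  User: {turn.query}\n  Assistant: {turn.answer[:500]}{note}"
            )
        return "\n\n".join(parts)


memory = RollingMemory(max_turns=2)
memory.add_turn("q1", "a1")
memory.add_turn("q2", "a2", ["b.pdf", "a.pdf"])
memory.add_turn("q3", "a3")  # evicts q1
history = memory.format_history()
```

With `max_turns=2`, the third call pushes out the first turn, so the formatted history starts at `q2` and numbers it as Turn 1.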

src/agent/plan_and_execute.py ADDED

	@@ -0,0 +1,506 @@

+"""Plan-and-Execute agent router using LangGraph.
+Replaces the flat ReAct loop with a structured three-phase pipeline:
+1. **Planner** — analyses the user query and produces an ordered list of
+   steps (e.g. "search for exam rules", "search for grading policy",
+   "compare both").
+2. **Executor** — runs each step via a short ReAct sub-graph that has
+   access to all retrieval tools.
+3. **Synthesizer** — collects the results from all executed steps and
+   produces a final, cited answer.
+The separation gives the pipeline *predictable structure* while still
+allowing the executor to reason freely within each step.
+"""
+import json
+import logging
+from collections.abc import Generator
+from typing import TypedDict
+from langchain_core.messages import AIMessage, HumanMessage, SystemMessage, ToolMessage
+from langchain_core.runnables import Runnable
+from langgraph.graph import END, StateGraph
+from langgraph.prebuilt import create_react_agent
+from src.agent.memory import ConversationMemory
+from src.agent.tools import ToolResultStore, make_retrieval_tools
+from src.models import GenerationResponse, IntentType, PipelineDetails, QueryResult
+from src.retrieval.hybrid import HybridRetriever
+from src.retrieval.reranker import Reranker
+from src.retrieval.vector_store import VectorStore
+logger = logging.getLogger(__name__)
+_MAX_STEPS = 6
+# ------------------------------------------------------------------
+# Prompts
+# ------------------------------------------------------------------
+_PLANNER_PROMPT = (
+    "You are a planning assistant for the University of Copenhagen (KU) document system.\n\n"
+    "Given a user question, produce a JSON list of 1–4 steps needed to answer it.\n"
+    "Each step is an object with:\n"
+    '  - "action": one of "search", "search_within", "multi_search", '
+    '"summarize", "list_docs", "fetch_doc"\n'
+    '  - "detail": a short description of what to do (e.g. the search query, document ID)\n\n'
+    "Rules:\n"
+    "- For simple factual questions: 1 search step is enough.\n"
+    "- For comparison questions: use multi_search or separate search steps.\n"
+    "- For document overview requests: use summarize.\n"
+    "- Always end with the steps needed; do NOT include a final 'answer' step.\n\n"
+    "Reply with ONLY the JSON array, nothing else.\n\n"
+    "Examples:\n"
+    'Question: "What is the exam policy?"\n'
+    '[{"action": "search", "detail": "KU eksamensregler"}]\n\n'
+    'Question: "Compare vacation rules for academic vs administrative staff"\n'
+    '[{"action": "search", "detail": "ferieregler videnskabeligt personale"}, '
+    '{"action": "search", "detail": "ferieregler administrativt personale"}]\n\n'
+    'Question: "Summarize the AI policy document"\n'
+    '[{"action": "summarize", "detail": "ku_ai_policy.pdf"}]\n\n'
+    "Now plan for this question:\n"
+)
+_EXECUTOR_SYSTEM = (
+    "You are executing ONE step of a plan to answer a user's question about "
+    "University of Copenhagen (KU) documents.\n\n"
+    "You have retrieval tools available. Execute the step described below, "
+    "then summarise what you found in 2-3 sentences. If you find nothing "
+    "relevant, say so clearly.\n\n"
+    "Do NOT produce a final answer — just report what you found for this step."
+)
+_SYNTHESIZER_PROMPT = (
+    "You are a helpful assistant for administrative staff at the University "
+    "of Copenhagen (KU).\n\n"
+    "Below are the results gathered from multiple research steps. "
+    "Synthesize them into a single coherent answer to the user's original question.\n\n"
+    "Guidelines:\n"
+    "- Cite document sources using [1], [2], etc.\n"
+    "- Answer in the same language as the user's question.\n"
+    "- Be concise but thorough.\n"
+    "- If some steps found no results, acknowledge gaps honestly.\n\n"
+)
+# ------------------------------------------------------------------
+# Graph state
+# ------------------------------------------------------------------
+class PlanStep(TypedDict):
+    """A single step in the execution plan."""
+    action: str
+    detail: str
+class PlanExecState(TypedDict):
+    """State for the Plan-and-Execute graph.
+    Attributes:
+        query: The user's original question.
+        top_k: Number of results per retrieval call.
+        plan: Ordered list of steps produced by the planner.
+        step_index: Index of the next step to execute.
+        step_results: List of (step_description, result_text) pairs.
+        answer: Final synthesised answer.
+    """
+    query: str
+    top_k: int
+    plan: list[PlanStep]
+    step_index: int
+    step_results: list[tuple[str, str]]
+    answer: str
+# ------------------------------------------------------------------
+# Router class
+# ------------------------------------------------------------------
+class PlanAndExecuteRouter:
+    """Routes queries through a Plan-and-Execute pipeline.
+    Graph topology::
+        plan → should_execute? ─┬─ yes → execute_step → should_execute?
+                                └─ no  → synthesize → END
+    """
+    def __init__(
+        self,
+        llm: Runnable,
+        hybrid_retriever: HybridRetriever,
+        reranker: Reranker,
+        vector_store: VectorStore,
+        default_top_k: int = 5,
+        memory: ConversationMemory | None = None,
+    ) -> None:
+        """Initialise the Plan-and-Execute router.
+        Args:
+            llm: LLM with tool-calling support.
+            hybrid_retriever: HybridRetriever instance.
+            reranker: Reranker instance.
+            vector_store: VectorStore instance.
+            default_top_k: Default number of results per retrieval call.
+            memory: Optional ConversationMemory for multi-turn context.
+                When provided, prior conversation history is injected into
+                planner and synthesizer prompts, and each completed turn
+                is automatically recorded.
+        """
+        self._llm = llm
+        self._hybrid_retriever = hybrid_retriever
+        self._reranker = reranker
+        self._vector_store = vector_store
+        self._default_top_k = default_top_k
+        self._store = ToolResultStore()
+        self._memory = memory or ConversationMemory()
+    # ------------------------------------------------------------------
+    # Node functions
+    # ------------------------------------------------------------------
+    def _plan_node(self, state: PlanExecState) -> dict:
+        """Generate an execution plan from the user query."""
+        history = self._memory.format_history()
+        history_section = ""
+        if history:
+            history_section = (
+                f"Conversation history (for context on follow-up questions):\n"
+                f"{history}\n\n"
+            )
+        prompt = _PLANNER_PROMPT + history_section + f'Question: "{state["query"]}"'
+        raw = str(self._llm.invoke(prompt)).strip()
+        logger.info("Planner raw output: %s", raw)
+        plan = _parse_plan(raw)
+        logger.info("Plan: %d steps — %s", len(plan), plan)
+        return {"plan": plan, "step_index": 0, "step_results": []}
+    @staticmethod
+    def _should_execute(state: PlanExecState) -> str:
+        """Decide whether to execute the next step or synthesize."""
+        if state["step_index"] < len(state["plan"]) and state["step_index"] < _MAX_STEPS:
+            return "execute"
+        return "synthesize"
+    def _execute_step_node(self, state: PlanExecState) -> dict:
+        """Execute the current plan step using a ReAct sub-agent."""
+        idx = state["step_index"]
+        step = state["plan"][idx]
+        step_desc = f'{step["action"]}: {step["detail"]}'
+        logger.info("Executing step %d/%d: %s", idx + 1, len(state["plan"]), step_desc)
+        # Build a fresh tool set and sub-agent for this step
+        tools = make_retrieval_tools(
+            self._hybrid_retriever,
+            self._reranker,
+            self._vector_store,
+            self._store,
+            self._default_top_k,
+            llm_chain=self._llm,
+        )
+        sub_agent = create_react_agent(self._llm, tools)
+        step_prompt = (
+            f'Step to execute: {step_desc}\n\n'
+            f'Original user question (for context): {state["query"]}'
+        )
+        result = sub_agent.invoke({
+            "messages": [
+                SystemMessage(content=_EXECUTOR_SYSTEM),
+                HumanMessage(content=step_prompt),
+            ]
+        })
+        # Extract the sub-agent's final text answer
+        answer = _extract_last_ai_text(result.get("messages", []))
+        logger.info("Step %d result: %s", idx + 1, answer[:200])
+        new_results = list(state["step_results"]) + [(step_desc, answer)]
+        return {"step_index": idx + 1, "step_results": new_results}
+    def _synthesize_node(self, state: PlanExecState) -> dict:
+        """Synthesize a final answer from all step results."""
+        step_texts = []
+        for i, (desc, result) in enumerate(state["step_results"], 1):
+            step_texts.append(f"### Step {i}: {desc}\n{result}")
+        gathered = "\n\n".join(step_texts)
+        history = self._memory.format_history()
+        history_section = ""
+        if history:
+            history_section = (
+                f"Prior conversation:\n{history}\n\n"
+            )
+        prompt = (
+            f"{_SYNTHESIZER_PROMPT}"
+            f"{history_section}"
+            f"Original question: {state['query']}\n\n"
+            f"Research results:\n{gathered}\n\n"
+            f"Answer:"
+        )
+        answer = str(self._llm.invoke(prompt)).strip()
+        logger.info("Synthesized final answer (%d chars)", len(answer))
+        return {"answer": answer}
+    # ------------------------------------------------------------------
+    # Graph construction
+    # ------------------------------------------------------------------
+    def _build_graph(self) -> object:
+        """Build the Plan-and-Execute LangGraph.
+        Returns:
+            Compiled LangGraph.
+        """
+        graph: StateGraph = StateGraph(PlanExecState)
+        graph.add_node("plan", self._plan_node)
+        graph.add_node("execute_step", self._execute_step_node)
+        graph.add_node("synthesize", self._synthesize_node)
+        graph.set_entry_point("plan")
+        graph.add_conditional_edges(
+            "plan",
+            self._should_execute,
+            {"execute": "execute_step", "synthesize": "synthesize"},
+        )
+        graph.add_conditional_edges(
+            "execute_step",
+            self._should_execute,
+            {"execute": "execute_step", "synthesize": "synthesize"},
+        )
+        graph.add_edge("synthesize", END)
+        return graph.compile()
+    # ------------------------------------------------------------------
+    # Public interface (mirrors QueryRouter / ReActRouter)
+    # ------------------------------------------------------------------
+    def route(self, query: str, top_k: int) -> GenerationResponse:
+        """Route a query through the Plan-and-Execute pipeline.
+        Args:
+            query: The user's natural language query.
+            top_k: Number of top documents to retrieve per tool call.
+        Returns:
+            GenerationResponse with answer, sources, intent, and confidence.
+        """
+        logger.info("PlanExec routing query: %s", query)
+        self._store = ToolResultStore()
+        initial_state = PlanExecState(
+            query=query,
+            top_k=top_k,
+            plan=[],
+            step_index=0,
+            step_results=[],
+            answer="",
+        )
+        graph = self._build_graph()
+        final_state: PlanExecState = graph.invoke(initial_state)
+        sources = self._store.retrieved[:top_k]
+        confidence = max((r.score for r in sources), default=0.0)
+        plan_step_strs = [
+            f'{s["action"]}: {s["detail"]}' for s in final_state.get("plan", [])
+        ]
+        tool_call_strs = [f"{name}: {arg}" for name, arg in self._store.tool_calls]
+        response = GenerationResponse(
+            answer=final_state["answer"],
+            sources=sources,
+            intent=IntentType.RAG if sources else IntentType.FACTUAL,
+            confidence=confidence,
+            pipeline_details=PipelineDetails(
+                original_query=query,
+                retrieval_query=", ".join(
+                    q for name, q in self._store.tool_calls if name == "hybrid_search"
+                ) or query,
+                dense_results=self._store.dense_results,
+                sparse_results=self._store.sparse_results,
+                fused_results=self._store.fused_results,
+                reranked_results=sources,
+                plan_steps=plan_step_strs,
+                tool_calls=tool_call_strs,
+            ),
+        )
+        self._memory.add_turn(query, response.answer, sources)
+        return response
+    def route_stream(self, query: str, top_k: int) -> Generator[dict, None, None]:
+        """Stream Plan-and-Execute events step by step.
+        Yields event dicts with step types:
+        - ``plan`` — plan was generated; carries ``steps``.
+        - ``execute_step`` — a step was executed; carries ``step_index``,
+          ``step_desc``, ``result_preview``.
+        - ``synthesize`` — final answer generated.
+        - ``done`` — final event with full result payload.
+        Args:
+            query: User query.
+            top_k: Number of results to retrieve per tool call.
+        Yields:
+            Step event dicts.
+        """
+        self._store = ToolResultStore()
+        initial_state = PlanExecState(
+            query=query,
+            top_k=top_k,
+            plan=[],
+            step_index=0,
+            step_results=[],
+            answer="",
+        )
+        graph = self._build_graph()
+        accumulated: dict = dict(initial_state)
+        for chunk in graph.stream(initial_state, stream_mode="updates"):
+            for node_name, update in chunk.items():
+                if update is None:
+                    continue
+                accumulated.update(update)
+                if node_name == "plan":
+                    yield {
+                        "step": "plan",
+                        "steps": [
+                            f'{s["action"]}: {s["detail"]}'
+                            for s in update.get("plan", [])
+                        ],
+                    }
+                elif node_name == "execute_step":
+                    results = update.get("step_results", [])
+                    if results:
+                        last_desc, last_result = results[-1]
+                        yield {
+                            "step": "execute_step",
+                            "step_index": update.get("step_index", 0),
+                            "step_desc": last_desc,
+                            "result_preview": last_result[:300],
+                        }
+                elif node_name == "synthesize":
+                    yield {"step": "synthesize"}
+        sources = self._store.retrieved[:top_k]
+        confidence = max((r.score for r in sources), default=0.0)
+        answer = accumulated.get("answer", "")
+        self._memory.add_turn(query, answer, sources)
+        yield {
+            "step": "done",
+            "result": {
+                "answer": answer,
+                "sources": [r.to_dict() for r in sources],
+                "intent": (IntentType.RAG if sources else IntentType.FACTUAL).value,
+                "confidence": confidence,
+                "pipeline_details": {
+                    "original_query": query,
+                    "retrieval_query": ", ".join(
+                        q for name, q in self._store.tool_calls if name == "hybrid_search"
+                    ) or query,
+                    "detected_language": "",
+                    "translated": False,
+                    "dense_results": [r.to_dict(include_text=False) for r in self._store.dense_results],
+                    "sparse_results": [r.to_dict(include_text=False) for r in self._store.sparse_results],
+                    "fused_results": [r.to_dict(include_text=False) for r in self._store.fused_results],
+                    "reranked_results": [r.to_dict(include_text=False) for r in sources],
+                    "plan_steps": [
+                        f'{s["action"]}: {s["detail"]}'
+                        for s in accumulated.get("plan", [])
+                    ],
+                    "tool_calls": [f"{n}: {a}" for n, a in self._store.tool_calls],
+                },
+            },
+        }
+# ------------------------------------------------------------------
+# Helpers
+# ------------------------------------------------------------------
+def _parse_plan(raw: str) -> list[PlanStep]:
+    """Parse the planner's JSON output into a list of PlanStep dicts.
+    Robust against markdown fences, trailing text, and minor formatting issues.
+    Args:
+        raw: Raw LLM output expected to contain a JSON array.
+    Returns:
+        List of PlanStep dicts. Falls back to a single search step on failure.
+    """
+    # Strip markdown code fences if present
+    cleaned = raw.strip()
+    if cleaned.startswith("```"):
+        lines = cleaned.splitlines()
+        # Remove opening and closing fences
+        lines = [l for l in lines if not l.strip().startswith("```")]
+        cleaned = "\n".join(lines).strip()
+    try:
+        parsed = json.loads(cleaned)
+    except json.JSONDecodeError:
+        # Try to extract a JSON array from the text
+        start = cleaned.find("[")
+        end = cleaned.rfind("]")
+        if start != -1 and end != -1:
+            try:
+                parsed = json.loads(cleaned[start:end + 1])
+            except json.JSONDecodeError:
+                logger.warning("Failed to parse plan, falling back to single search")
+                return [PlanStep(action="search", detail=cleaned[:200])]
+        else:
+            logger.warning("No JSON array found in plan output, falling back")
+            return [PlanStep(action="search", detail=cleaned[:200])]
+    if not isinstance(parsed, list):
+        logger.warning("Plan is not a list, wrapping")
+        parsed = [parsed]
+    steps: list[PlanStep] = []
+    for item in parsed:
+        if isinstance(item, dict) and "action" in item and "detail" in item:
+            steps.append(PlanStep(action=str(item["action"]), detail=str(item["detail"])))
+        else:
+            logger.warning("Skipping malformed plan step: %s", item)
+    if not steps:
+        return [PlanStep(action="search", detail="general search")]
+    return steps
+def _extract_last_ai_text(messages: list) -> str:
+    """Return the text content of the last non-tool-call AI message.
+    Args:
+        messages: List of LangChain message objects.
+    Returns:
+        The extracted text, or empty string if none found.
+    """
+    for msg in reversed(messages):
+        if (
+            isinstance(msg, AIMessage)
+            and msg.content
+            and not getattr(msg, "tool_calls", None)
+        ):
+            return str(msg.content)
+    return ""
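
The fallback behaviour of `_parse_plan` can be tried in isolation. The following is a minimal sketch, not the committed code: it assumes `PlanStep` is a `TypedDict` with `action` and `detail` keys (the diff only shows it being built with keyword arguments and read with `s["action"]`), and the names `parse_plan_sketch` and `FENCE` are made up for illustration.

```python
import json
from typing import TypedDict

FENCE = "`" * 3  # markdown code-fence marker, built this way to keep the example readable


class PlanStep(TypedDict):
    """Assumed shape of a plan step: an action name plus free-text detail."""
    action: str
    detail: str


def parse_plan_sketch(raw: str) -> list[PlanStep]:
    """Tolerant parse: drop fence lines, then fall back to the outermost [...] span."""
    cleaned = raw.strip()
    if cleaned.startswith(FENCE):
        kept = [ln for ln in cleaned.splitlines() if not ln.strip().startswith(FENCE)]
        cleaned = "\n".join(kept).strip()
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        start, end = cleaned.find("["), cleaned.rfind("]")
        if start == -1 or end == -1:
            # No array at all: treat the raw text as a single search request
            return [PlanStep(action="search", detail=cleaned[:200])]
        try:
            parsed = json.loads(cleaned[start:end + 1])
        except json.JSONDecodeError:
            return [PlanStep(action="search", detail=cleaned[:200])]
    if not isinstance(parsed, list):
        parsed = [parsed]
    steps = [
        PlanStep(action=str(item["action"]), detail=str(item["detail"]))
        for item in parsed
        if isinstance(item, dict) and "action" in item and "detail" in item
    ]
    # An empty or fully malformed plan still yields one generic search step
    return steps or [PlanStep(action="search", detail="general search")]
```

Three input shapes exercise the three paths: a fenced JSON array, an array surrounded by chatter, and text with no array, which degrades to a single `search` step carrying the raw text.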

src/agent/react_router.py CHANGED

@@ -26,14 +26,22 @@ logger = logging.getLogger(__name__)
 _SYSTEM_PROMPT = (
     "You are a helpful assistant for administrative staff at the University of Copenhagen (KU).\n\n"
-    "You have access to a hybrid_search tool that searches KU policy documents stored in the "
-    "knowledge base.\n\n"
     "Guidelines:\n"
-    "- Always call hybrid_search before answering questions about KU rules, policies, exams, "
     "employment conditions, or administrative procedures.\n"
-    "- If the first search does not return sufficient information, call hybrid_search again "
-    "with a refined or more specific query.\n"
-    "- For comparison questions, search for each item separately.\n"
     "- Cite the document sources ([1], [2], …) in your answer.\n"
     "- Answer in the same language as the user's question."
 )
@@ -83,6 +91,7 @@ class ReActRouter:
             self._vector_store,
             store,
             self._default_top_k,
         )
         return create_react_agent(self._llm, tools)

 _SYSTEM_PROMPT = (
     "You are a helpful assistant for administrative staff at the University of Copenhagen (KU).\n\n"
+    "You have access to several tools for searching KU policy documents:\n"
+    "- hybrid_search: General-purpose search across all documents.\n"
+    "- multi_query_search: For complex or comparison questions — decomposes into sub-queries.\n"
+    "- search_within_document: Pinpoint specific sections inside a known document.\n"
+    "- summarize_document: Generate an overview of an entire document.\n"
+    "- list_documents: See which documents are available.\n"
+    "- fetch_document: Get the full text of a specific document.\n\n"
     "Guidelines:\n"
+    "- Always search before answering questions about KU rules, policies, exams, "
     "employment conditions, or administrative procedures.\n"
+    "- Use multi_query_search for comparison questions or complex multi-part questions.\n"
+    "- Use search_within_document when you already know the relevant document and "
+    "need to find a specific clause or section.\n"
+    "- Use summarize_document when the user asks for an overview of a document.\n"
+    "- If the first search does not return sufficient information, try a different "
+    "tool or refine your query.\n"
     "- Cite the document sources ([1], [2], …) in your answer.\n"
     "- Answer in the same language as the user's question."
 )
             self._vector_store,
             store,
             self._default_top_k,
+            llm_chain=self._llm,
         )
         return create_react_agent(self._llm, tools)

src/agent/tools.py CHANGED

@@ -3,6 +3,7 @@
 import logging
 from dataclasses import dataclass, field
 from langchain_core.tools import tool
 from src.models import QueryResult
@@ -33,12 +34,54 @@ class ToolResultStore:
     fused_results: list[QueryResult] = field(default_factory=list)
 def make_retrieval_tools(
     hybrid_retriever: HybridRetriever,
     reranker: Reranker,
     vector_store: VectorStore,
     store: ToolResultStore,
     default_top_k: int = 5,
 ) -> list:
     """Create retrieval tools bound to the given components and result store.
@@ -52,11 +95,18 @@ def make_retrieval_tools(
         vector_store: VectorStore instance for document-level access.
         store: Shared ToolResultStore that captures structured results.
         default_top_k: Default number of results to return per call.
     Returns:
         List of LangChain tool callables ready for bind_tools / ToolNode.
     """
     @tool
     def hybrid_search(query: str, top_k: int = default_top_k) -> str:
         """Search the KU document knowledge base using hybrid retrieval.
@@ -84,36 +134,16 @@ def make_retrieval_tools(
         hybrid_result = hybrid_retriever.search_detailed(query, top_k=top_k)
         results = reranker.rerank(query, hybrid_result.fused_results, top_k=top_k)
-        # Accumulate intermediate pipeline stages
-        def _merge(existing_list: list[QueryResult], new_list: list[QueryResult]) -> list[QueryResult]:
-            by_id = {r.chunk.chunk_id: r for r in existing_list}
-            for r in new_list:
-                cid = r.chunk.chunk_id
-                if cid not in by_id or r.score > by_id[cid].score:
-                    by_id[cid] = r
-            return sorted(by_id.values(), key=lambda r: r.score, reverse=True)
-        store.dense_results = _merge(store.dense_results, hybrid_result.dense_results)
-        store.sparse_results = _merge(store.sparse_results, hybrid_result.sparse_results)
-        store.fused_results = _merge(store.fused_results, hybrid_result.fused_results)
-        # Accumulate reranked results across multiple calls (union by chunk_id, keep highest score)
-        existing = {r.chunk.chunk_id: r for r in store.retrieved}
-        for r in results:
-            cid = r.chunk.chunk_id
-            if cid not in existing or r.score > existing[cid].score:
-                existing[cid] = r
-        store.retrieved = sorted(existing.values(), key=lambda r: r.score, reverse=True)
-        if not results:
-            return "Ingen relevante dokumenter fundet. (No relevant documents found.)"
-        parts: list[str] = []
-        for i, r in enumerate(results, 1):
-            parts.append(
-                f"[{i}] {r.chunk.document_id}  (relevance: {r.score:.3f})\n{r.chunk.text}"
-            )
-        return "\n\n---\n\n".join(parts)
     @tool
     def list_documents() -> str:
@@ -161,11 +191,8 @@ def make_retrieval_tools(
                 f"(Document not found. Use list_documents to see available IDs.)"
             )
-        # Sort chunks by chunk_index to preserve document order
         chunks.sort(key=lambda c: c.metadata.get("chunk_index", 0))
-        # Register chunks as QueryResult so confidence and sources are surfaced in the UI.
-        # Score 1.0 indicates a direct full-document fetch (no ranking involved).
         existing = {r.chunk.chunk_id: r for r in store.retrieved}
         for chunk in chunks:
             if chunk.chunk_id not in existing:
@@ -178,4 +205,171 @@ def make_retrieval_tools(
             f"{full_text}"
         )
-    return [hybrid_search, list_documents, fetch_document]

 import logging
 from dataclasses import dataclass, field
+from langchain_core.runnables import Runnable
 from langchain_core.tools import tool
 from src.models import QueryResult
     fused_results: list[QueryResult] = field(default_factory=list)
+def _merge_results(existing: list[QueryResult], new: list[QueryResult]) -> list[QueryResult]:
+    """Merge two QueryResult lists by chunk_id, keeping the highest score.
+    Args:
+        existing: Previously accumulated results.
+        new: New results to merge in.
+    Returns:
+        Merged list sorted by descending score.
+    """
+    by_id = {r.chunk.chunk_id: r for r in existing}
+    for r in new:
+        cid = r.chunk.chunk_id
+        if cid not in by_id or r.score > by_id[cid].score:
+            by_id[cid] = r
+    return sorted(by_id.values(), key=lambda r: r.score, reverse=True)
+def _format_results(results: list[QueryResult]) -> str:
+    """Format a list of QueryResult into a readable string.
+    Args:
+        results: Ranked results to format.
+    Returns:
+        Formatted string with numbered entries, or a no-results message.
+    """
+    if not results:
+        return "Ingen relevante dokumenter fundet. (No relevant documents found.)"
+    parts: list[str] = []
+    for i, r in enumerate(results, 1):
+        page_info = ""
+        page = r.chunk.metadata.get("page_number")
+        if page is not None:
+            page_info = f"  side {page}"
+        parts.append(
+            f"[{i}] {r.chunk.document_id}{page_info}  (relevance: {r.score:.3f})\n{r.chunk.text}"
+        )
+    return "\n\n---\n\n".join(parts)
 def make_retrieval_tools(
     hybrid_retriever: HybridRetriever,
     reranker: Reranker,
     vector_store: VectorStore,
     store: ToolResultStore,
     default_top_k: int = 5,
+    llm_chain: Runnable | None = None,
 ) -> list:
     """Create retrieval tools bound to the given components and result store.
         vector_store: VectorStore instance for document-level access.
         store: Shared ToolResultStore that captures structured results.
         default_top_k: Default number of results to return per call.
+        llm_chain: Optional LLM chain for tools that need generation
+            (summarize_document, multi_query_search). When None, those
+            tools are excluded from the returned list.
     Returns:
         List of LangChain tool callables ready for bind_tools / ToolNode.
     """
+    # ------------------------------------------------------------------
+    # Core search tool
+    # ------------------------------------------------------------------
     @tool
     def hybrid_search(query: str, top_k: int = default_top_k) -> str:
         """Search the KU document knowledge base using hybrid retrieval.
         hybrid_result = hybrid_retriever.search_detailed(query, top_k=top_k)
         results = reranker.rerank(query, hybrid_result.fused_results, top_k=top_k)
+        store.dense_results = _merge_results(store.dense_results, hybrid_result.dense_results)
+        store.sparse_results = _merge_results(store.sparse_results, hybrid_result.sparse_results)
+        store.fused_results = _merge_results(store.fused_results, hybrid_result.fused_results)
+        store.retrieved = _merge_results(store.retrieved, results)
+        return _format_results(results)
+    # ------------------------------------------------------------------
+    # Document-level tools
+    # ------------------------------------------------------------------
     @tool
     def list_documents() -> str:
                 f"(Document not found. Use list_documents to see available IDs.)"
             )
         chunks.sort(key=lambda c: c.metadata.get("chunk_index", 0))
         existing = {r.chunk.chunk_id: r for r in store.retrieved}
         for chunk in chunks:
             if chunk.chunk_id not in existing:
             f"{full_text}"
         )
+    # ------------------------------------------------------------------
+    # Targeted within-document search
+    # ------------------------------------------------------------------
+    @tool
+    def search_within_document(document_id: str, query: str, top_k: int = 3) -> str:
+        """Search for specific information within a single document.
+        Retrieves all chunks belonging to the document and uses the cross-encoder
+        reranker to find the most relevant passages for the query. Use this when
+        you already know which document to look in and need to pinpoint the exact
+        section (e.g. a specific clause, page, or paragraph).
+        Args:
+            document_id: The exact document ID to search within.
+            query: What to look for inside the document.
+            top_k: Number of top passages to return (1–10). Default is 3.
+        Returns:
+            The most relevant passages within the document, ranked by relevance.
+        """
+        logger.info(
+            "Tool search_within_document: doc=%r query=%r top_k=%d",
+            document_id, query, top_k,
+        )
+        store.tool_calls.append(("search_within_document", f"{document_id}: {query}"))
+        chunks = vector_store.get_chunks_by_document_id(document_id)
+        if not chunks:
+            return (
+                f"Dokumentet '{document_id}' blev ikke fundet i vidensbasen. "
+                f"(Document not found. Use list_documents to see available IDs.)"
+            )
+        # Wrap chunks as QueryResult so the reranker can score them
+        candidates = [
+            QueryResult(chunk=c, score=0.0, source="search_within_document")
+            for c in chunks
+        ]
+        results = reranker.rerank(query, candidates, top_k=top_k)
+        store.retrieved = _merge_results(store.retrieved, results)
+        return _format_results(results)
+    # ------------------------------------------------------------------
+    # LLM-powered tools (only available when llm_chain is provided)
+    # ------------------------------------------------------------------
+    tools: list = [hybrid_search, list_documents, fetch_document, search_within_document]
+    if llm_chain is not None:
+        @tool
+        def multi_query_search(question: str, top_k: int = default_top_k) -> str:
+            """Decompose a complex question into sub-queries and search each independently.
+            Use this tool instead of hybrid_search when the question involves
+            multiple aspects, comparisons, or requires information from different
+            topics. For example: "How do exam rules differ between bachelor and
+            master programmes?" would be split into separate searches for each
+            programme's exam rules, then merged.
+            Args:
+                question: The complex user question to decompose and search.
+                top_k: Number of results to return per sub-query (1–10). Default is 5.
+            Returns:
+                Combined results from all sub-queries, deduplicated and ranked.
+            """
+            logger.info("Tool multi_query_search: question=%r", question)
+            store.tool_calls.append(("multi_query_search", question))
+            # Step 1: Ask LLM to decompose the question
+            decompose_prompt = (
+                "You are a search query planner. Given a complex question, "
+                "decompose it into 2-4 simple, independent search queries that "
+                "together cover all aspects of the question. The queries should "
+                "be in Danish (since the document base is Danish).\n\n"
+                "Reply with ONLY the queries, one per line, nothing else.\n\n"
+                f"Question: {question}"
+            )
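+            # invoke() may return a message object rather than a plain string; prefer its .content
+            response = llm_chain.invoke(decompose_prompt)
+            raw = str(getattr(response, "content", response)).strip()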
+            sub_queries = [q.strip().lstrip("0123456789.-) ") for q in raw.splitlines() if q.strip()]
+            if not sub_queries:
+                sub_queries = [question]
+            logger.info("Decomposed into %d sub-queries: %s", len(sub_queries), sub_queries)
+            # Step 2: Search each sub-query independently
+            all_results: list[QueryResult] = []
+            for sq in sub_queries:
+                hybrid_result = hybrid_retriever.search_detailed(sq, top_k=top_k)
+                reranked = reranker.rerank(sq, hybrid_result.fused_results, top_k=top_k)
+                all_results = _merge_results(all_results, reranked)
+                store.dense_results = _merge_results(store.dense_results, hybrid_result.dense_results)
+                store.sparse_results = _merge_results(store.sparse_results, hybrid_result.sparse_results)
+                store.fused_results = _merge_results(store.fused_results, hybrid_result.fused_results)
+            # Step 3: Keep top results across all sub-queries
+            final = all_results[:top_k]
+            store.retrieved = _merge_results(store.retrieved, final)
+            header = f"Søgning opdelt i {len(sub_queries)} delforespørgsler:\n"
+            header += "\n".join(f"  • {sq}" for sq in sub_queries)
+            header += "\n\n"
+            return header + _format_results(final)
+        @tool
+        def summarize_document(document_id: str) -> str:
+            """Generate a structured summary of a document in the knowledge base.
+            Fetches the full document and uses the LLM to produce a concise summary
+            covering the main topics, key rules, and important details. Use this
+            when the user asks "what is this document about?" or wants an overview
+            before diving into specifics.
+            Args:
+                document_id: The exact document ID to summarize.
+            Returns:
+                A structured summary of the document, or an error if not found.
+            """
+            logger.info("Tool summarize_document: document_id=%r", document_id)
+            store.tool_calls.append(("summarize_document", document_id))
+            chunks = vector_store.get_chunks_by_document_id(document_id)
+            if not chunks:
+                return (
+                    f"Dokumentet '{document_id}' blev ikke fundet i vidensbasen. "
+                    f"(Document not found. Use list_documents to see available IDs.)"
+                )
+            chunks.sort(key=lambda c: c.metadata.get("chunk_index", 0))
+            full_text = "\n\n".join(c.text for c in chunks)
+            # Register chunks as sources
+            existing = {r.chunk.chunk_id: r for r in store.retrieved}
+            for chunk in chunks:
+                if chunk.chunk_id not in existing:
+                    existing[chunk.chunk_id] = QueryResult(
+                        chunk=chunk, score=1.0, source="summarize_document",
+                    )
+            store.retrieved = sorted(existing.values(), key=lambda r: r.score, reverse=True)
+            # Truncate to avoid exceeding context limits
+            max_chars = 8000
+            if len(full_text) > max_chars:
+                full_text = full_text[:max_chars] + "\n\n[... teksten er forkortet ...]"
+            summary_prompt = (
+                "Produce a structured summary of the following document. "
+                "Include:\n"
+                "1. Document title/topic\n"
+                "2. Key points (3-7 bullet points)\n"
+                "3. Important rules, deadlines, or requirements mentioned\n"
+                "4. Who the document applies to\n\n"
+                "Write the summary in the same language as the document.\n\n"
+                f"Document ID: {document_id}\n\n"
+                f"Document text:\n{full_text}"
+            )
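+            # invoke() may return a message object rather than a plain string; prefer its .content
+            response = llm_chain.invoke(summary_prompt)
+            summary = str(getattr(response, "content", response)).strip()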
+            return f"Resumé af {document_id}:\n\n{summary}"
+        tools.extend([multi_query_search, summarize_document])
+    return tools
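
The fan-out in `multi_query_search` (clean the LLM's sub-queries, search each one, merge by chunk ID keeping the best score, then truncate) can be sketched without the retriever or reranker. This is an illustration under stated assumptions: `Hit`, `merge_hits`, `clean_sub_queries` and `FAKE_INDEX` are hypothetical stand-ins, and the per-query hits are invented data rather than real retrieval output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical stand-in for QueryResult: a chunk id plus a relevance score."""
    chunk_id: str
    score: float


def merge_hits(existing: list[Hit], new: list[Hit]) -> list[Hit]:
    """Union by chunk_id, keeping the higher score, sorted by descending score."""
    by_id = {h.chunk_id: h for h in existing}
    for h in new:
        if h.chunk_id not in by_id or h.score > by_id[h.chunk_id].score:
            by_id[h.chunk_id] = h
    return sorted(by_id.values(), key=lambda h: h.score, reverse=True)


def clean_sub_queries(raw: str) -> list[str]:
    """Split LLM output into lines and strip list markers such as '1. ' or '2) '."""
    return [q.strip().lstrip("0123456789.-) ") for q in raw.splitlines() if q.strip()]


# Invented per-query results; a real run would call the retriever and reranker here.
FAKE_INDEX = {
    "eksamensregler bachelor": [Hit("c1", 0.91), Hit("c2", 0.40)],
    "eksamensregler kandidat": [Hit("c2", 0.75), Hit("c3", 0.60)],
}

sub_queries = clean_sub_queries("1. eksamensregler bachelor\n2) eksamensregler kandidat\n")
merged: list[Hit] = []
for sq in sub_queries:
    merged = merge_hits(merged, FAKE_INDEX.get(sq, []))
top = merged[:2]  # keep the best results across all sub-queries
```

Chunk `c2` appears under both sub-queries and survives once with its higher score. One caveat of the marker stripping: a sub-query that legitimately starts with a digit (for example a year) loses that prefix, so a stricter pattern may be worth considering.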

src/api/main.py CHANGED

@@ -16,7 +16,8 @@ from src.retrieval.hybrid import HybridRetriever
 from src.retrieval.reranker import Reranker
 from src.agent.intent_classifier import IntentClassifier
 from src.agent.router import QueryRouter
-from src.agent.react_router import ReActRouter
 from src.ingestion.pipeline import IngestionPipeline
 from src.api.routes import router, set_dependencies
@@ -72,13 +73,14 @@ def create_app() -> FastAPI:
     reranker = Reranker(model=create_reranker(settings.reranker_model))
     if settings.agent_mode == "react":
-        logger.info("Agent mode: ReAct (tool-calling loop)")
-        query_router: QueryRouter | ReActRouter = ReActRouter(
             llm=llm,
             hybrid_retriever=hybrid_retriever,
             reranker=reranker,
             vector_store=vector_store,
             default_top_k=settings.top_k,
         )
     else:
         logger.info("Agent mode: pipeline (fixed DAG)")

 from src.retrieval.reranker import Reranker
 from src.agent.intent_classifier import IntentClassifier
 from src.agent.router import QueryRouter
+from src.agent.plan_and_execute import PlanAndExecuteRouter
+from src.agent.memory import ConversationMemory
 from src.ingestion.pipeline import IngestionPipeline
 from src.api.routes import router, set_dependencies
     reranker = Reranker(model=create_reranker(settings.reranker_model))
     if settings.agent_mode == "react":
+        logger.info("Agent mode: Plan-and-Execute (structured multi-step agent)")
+        query_router: QueryRouter | PlanAndExecuteRouter = PlanAndExecuteRouter(
             llm=llm,
             hybrid_retriever=hybrid_retriever,
             reranker=reranker,
             vector_store=vector_store,
             default_top_k=settings.top_k,
+            memory=ConversationMemory(),
         )
     else:
         logger.info("Agent mode: pipeline (fixed DAG)")

src/api/routes.py CHANGED

@@ -15,6 +15,7 @@ from pydantic import BaseModel
 if TYPE_CHECKING:
     from src.agent.router import QueryRouter
     from src.agent.react_router import ReActRouter
     from src.config import Settings
     from src.ingestion.pipeline import IngestionPipeline
     from src.retrieval.bm25_search import BM25Search
@@ -25,7 +26,7 @@ logger = logging.getLogger(__name__)
 router = APIRouter()
-_query_router: "QueryRouter | ReActRouter | None" = None
 _ingestion_pipeline: "IngestionPipeline | None" = None
 _embedder: "Embedder | None" = None
 _vector_store: "VectorStore | None" = None
@@ -34,7 +35,7 @@ _settings: "Settings | None" = None
 def set_dependencies(
-    query_router: "QueryRouter | ReActRouter",
     ingestion_pipeline: "IngestionPipeline",
     embedder: "Embedder",
     vector_store: "VectorStore",
@@ -75,6 +76,7 @@ class PipelineResultItem(BaseModel):
     chunk_id: str
     score: float
     source: str
 class PipelineDetailsResponse(BaseModel):
@@ -88,6 +90,8 @@ class PipelineDetailsResponse(BaseModel):
     sparse_results: list[PipelineResultItem] = []
     fused_results: list[PipelineResultItem] = []
     reranked_results: list[PipelineResultItem] = []
 class SourceItem(BaseModel):
@@ -206,6 +210,8 @@ async def query_documents(request: QueryRequest) -> QueryResponse:
         sparse_results=[PipelineResultItem(**r.to_dict(include_text=False)) for r in pd.sparse_results],
         fused_results=[PipelineResultItem(**r.to_dict(include_text=False)) for r in pd.fused_results],
         reranked_results=[PipelineResultItem(**r.to_dict(include_text=False)) for r in pd.reranked_results],
     )
     return QueryResponse(

 if TYPE_CHECKING:
     from src.agent.router import QueryRouter
     from src.agent.react_router import ReActRouter
+    from src.agent.plan_and_execute import PlanAndExecuteRouter
     from src.config import Settings
     from src.ingestion.pipeline import IngestionPipeline
     from src.retrieval.bm25_search import BM25Search
 router = APIRouter()
+_query_router: "QueryRouter | ReActRouter | PlanAndExecuteRouter | None" = None
 _ingestion_pipeline: "IngestionPipeline | None" = None
 _embedder: "Embedder | None" = None
 _vector_store: "VectorStore | None" = None
 def set_dependencies(
+    query_router: "QueryRouter | ReActRouter | PlanAndExecuteRouter",
     ingestion_pipeline: "IngestionPipeline",
     embedder: "Embedder",
     vector_store: "VectorStore",
     chunk_id: str
     score: float
     source: str
+    metadata: dict[str, str | int] = {}
 class PipelineDetailsResponse(BaseModel):
     sparse_results: list[PipelineResultItem] = []
     fused_results: list[PipelineResultItem] = []
     reranked_results: list[PipelineResultItem] = []
+    plan_steps: list[str] = []
+    tool_calls: list[str] = []
 class SourceItem(BaseModel):
         sparse_results=[PipelineResultItem(**r.to_dict(include_text=False)) for r in pd.sparse_results],
         fused_results=[PipelineResultItem(**r.to_dict(include_text=False)) for r in pd.fused_results],
         reranked_results=[PipelineResultItem(**r.to_dict(include_text=False)) for r in pd.reranked_results],
+        plan_steps=pd.plan_steps,
+        tool_calls=pd.tool_calls,
     )
     return QueryResponse(

src/models.py CHANGED

@@ -91,6 +91,8 @@ class PipelineDetails:
         sparse_results: Results from sparse (BM25) retrieval.
         fused_results: Results after reciprocal rank fusion.
         reranked_results: Results after cross-encoder reranking.
     """
     original_query: str = ""
@@ -101,6 +103,8 @@ class PipelineDetails:
     sparse_results: list[QueryResult] = field(default_factory=list)
     fused_results: list[QueryResult] = field(default_factory=list)
     reranked_results: list[QueryResult] = field(default_factory=list)
 @dataclass

         sparse_results: Results from sparse (BM25) retrieval.
         fused_results: Results after reciprocal rank fusion.
         reranked_results: Results after cross-encoder reranking.
+        plan_steps: Ordered descriptions of planned steps (Plan-and-Execute mode).
+        tool_calls: Log of tool invocations as "tool_name: argument" strings.
     """
     original_query: str = ""
     sparse_results: list[QueryResult] = field(default_factory=list)
     fused_results: list[QueryResult] = field(default_factory=list)
     reranked_results: list[QueryResult] = field(default_factory=list)
+    plan_steps: list[str] = field(default_factory=list)
+    tool_calls: list[str] = field(default_factory=list)
 @dataclass
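
The two new `PipelineDetails` fields hold plain strings. As a rough sketch of how they might be populated, the example below uses a trimmed, hypothetical `PipelineDetailsSketch` class and the same `"name: argument"` formatting the router uses for its `done` event; the plan and tool-call values are invented, and the `synthesize` action name is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineDetailsSketch:
    """Trimmed, hypothetical stand-in for PipelineDetails with the two new fields."""
    original_query: str = ""
    plan_steps: list[str] = field(default_factory=list)
    tool_calls: list[str] = field(default_factory=list)


# Invented planner output and tool log for illustration only
plan = [
    {"action": "search", "detail": "ferieregler"},
    {"action": "synthesize", "detail": "svar"},
]
calls = [("hybrid_search", "ferieregler")]

details = PipelineDetailsSketch(
    original_query="Hvad er reglerne for ferie?",
    plan_steps=[f'{s["action"]}: {s["detail"]}' for s in plan],
    tool_calls=[f"{name}: {arg}" for name, arg in calls],
)
other = PipelineDetailsSketch()  # defaults: fresh empty lists per instance
```

Using `field(default_factory=list)` gives every instance its own list, so pipeline-mode responses that never set these fields still serialize as empty lists without sharing state.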

src/ui/app.py CHANGED

@@ -53,7 +53,7 @@ TEXTS: dict[str, dict[str, str]] = {
             "- **LLM-integration** — provider-agnostisk, prompt-styret "
             "svargenerering\n"
             "- **Evaluering** — RAGAS-baseret kvalitetsmåling\n"
-            "- **Agent Flows** — ReAct-loop med værktøjskald.\n"
             "- [**Kildedokumenter**](https://github.com/Xiiqiing/Dokumentassistent/tree/main/docs)"
             " — de dokumenter systemet er indekseret fra"
         ),
@@ -67,8 +67,8 @@ TEXTS: dict[str, dict[str, str]] = {
             "Et dokumentintelligens-system bygget på en RAG-arkitektur, dækkende file-indlæsning, semantisk chunking, "
             "hybrid søgning med reranking "
             "og LLM-genererede svar med kildehenvisninger. LLM-laget er provider-agnostisk. "
-            "To tilstande: en LangGraph ReAct-agent (standard) til forespørgsler der kræver flere søgetrin, "
-            "og en pipeline til lette modeller uden værktøjskald. Søgekvaliteten evalueres med RAGAS."
         ),
         "search_label": "Stil et spørgsmål om ... ",
         "search_placeholder": "F.eks.: Hvad er reglerne for behandling af personoplysninger?",
@@ -113,6 +113,8 @@ TEXTS: dict[str, dict[str, str]] = {
         "pipeline_rank": "#",
         "pipeline_no_results": "Ingen resultater",
         "pipeline_score_change": "Score-ændring",
     },
     "en": {
         "page_title": "Document Assistant",
@@ -131,7 +133,7 @@ TEXTS: dict[str, dict[str, str]] = {
             "- **LLM integration** — provider-agnostic, prompt-driven "
             "answer generation\n"
             "- **Evaluation** — RAGAS-based quality measurement\n"
-            "- **Agent Flows** — ReAct loop with tool calling\n"
             "- [**Source documents**](https://github.com/Xiiqiing/Dokumentassistent/tree/main/docs)"
             " — the documents indexed into the knowledge base"
         ),
@@ -145,8 +147,8 @@ TEXTS: dict[str, dict[str, str]] = {
             "A document intelligence system built on a RAG architecture, covering file ingestion, semantic chunking, "
             "hybrid retrieval with reranking, "
             "and LLM-generated answers with source citations. The LLM layer is provider-agnostic. "
-            "Two modes: a LangGraph ReAct agent (default) for queries that need multiple retrieval steps, "
-            "and a pipeline for lightweight models without tool-calling support. "
             "Retrieval quality is evaluated with RAGAS."
         ),
         "search_label": "Ask a question ...",
@@ -192,6 +194,8 @@ TEXTS: dict[str, dict[str, str]] = {
         "pipeline_rank": "#",
         "pipeline_no_results": "No results",
         "pipeline_score_change": "Score change",
     },
 }
@@ -711,6 +715,32 @@ if search_clicked and question.strip():
                             else (f"Reranked to **{_rc}** results · confidence **{_cf:.0%}**")
                         )
                     elif _step == "tool_call":
                         _tool_name = _event.get("tool", "")
                         _tool_query = _event.get("query", "")
@@ -850,6 +880,21 @@ if search_clicked and question.strip():
     pd = data.get("pipeline_details", {})
     if pd:
         with st.expander(t["pipeline_heading"], expanded=False):
             # 1) Query translation (only show if translation actually happened)
             if pd.get("translated"):
                 st.markdown(f'**{t["pipeline_translation"]}**')

             "- **LLM-integration** — provider-agnostisk, prompt-styret "
             "svargenerering\n"
             "- **Evaluering** — RAGAS-baseret kvalitetsmåling\n"
+            "- **Agent Flows** — LangGraph Plan-and-Execute med værktøjskald og samtalehukommelse\n"
             "- [**Kildedokumenter**](https://github.com/Xiiqiing/Dokumentassistent/tree/main/docs)"
             " — de dokumenter systemet er indekseret fra"
         ),
             "Et dokumentintelligens-system bygget på en RAG-arkitektur, dækkende file-indlæsning, semantisk chunking, "
             "hybrid søgning med reranking "
             "og LLM-genererede svar med kildehenvisninger. LLM-laget er provider-agnostisk. "
+            "To tilstande: en LangGraph Plan-and-Execute-agent (standard) med samtalehukommelse til komplekse forespørgsler, "
+            "og en foruddefineret pipeline til lette modeller. Søgekvaliteten evalueres med RAGAS."
         ),
         "search_label": "Stil et spørgsmål om ... ",
         "search_placeholder": "F.eks.: Hvad er reglerne for behandling af personoplysninger?",
         "pipeline_rank": "#",
         "pipeline_no_results": "Ingen resultater",
         "pipeline_score_change": "Score-ændring",
+        "pipeline_plan_steps": "Udførelsesplan",
+        "pipeline_tool_calls": "Værktøjskald",
     },
     "en": {
         "page_title": "Document Assistant",
             "- **LLM integration** — provider-agnostic, prompt-driven "
             "answer generation\n"
             "- **Evaluation** — RAGAS-based quality measurement\n"
+            "- **Agent Flows** — LangGraph Plan-and-Execute with tool calling and conversation memory\n"
             "- [**Source documents**](https://github.com/Xiiqiing/Dokumentassistent/tree/main/docs)"
             " — the documents indexed into the knowledge base"
         ),
             "A document intelligence system built on a RAG architecture, covering file ingestion, semantic chunking, "
             "hybrid retrieval with reranking, "
             "and LLM-generated answers with source citations. The LLM layer is provider-agnostic. "
+            "Two modes: a LangGraph Plan-and-Execute agent (default) with conversation memory for complex multi-step queries, "
+            "and a predefined pipeline for lightweight models. "
             "Retrieval quality is evaluated with RAGAS."
         ),
         "search_label": "Ask a question ...",
         "pipeline_rank": "#",
         "pipeline_no_results": "No results",
         "pipeline_score_change": "Score change",
+        "pipeline_plan_steps": "Execution Plan",
+        "pipeline_tool_calls": "Tool Calls",
     },
 }
                             else (f"Reranked to **{_rc}** results · confidence **{_cf:.0%}**")
                         )
+                    elif _step == "plan":
+                        _steps = _event.get("steps", [])
+                        st.write(
+                            (f"Plan oprettet med **{len(_steps)}** trin")
+                            if lang == "da"
+                            else (f"Plan created with **{len(_steps)}** steps")
+                        )
+                        for _ps in _steps:
+                            st.write(f"  - {_ps}")
+                    elif _step == "execute_step":
+                        _si = _event.get("step_index", 0)
+                        _sd = _event.get("step_desc", "")
+                        st.write(
+                            (f"Trin {_si} udført: _{_sd}_")
+                            if lang == "da"
+                            else (f"Step {_si} executed: _{_sd}_")
+                        )
+                    elif _step == "synthesize":
+                        st.write(
+                            "Syntetiserer endeligt svar ..."
+                            if lang == "da"
+                            else "Synthesizing final answer ..."
+                        )
                     elif _step == "tool_call":
                         _tool_name = _event.get("tool", "")
                         _tool_query = _event.get("query", "")
     pd = data.get("pipeline_details", {})
     if pd:
         with st.expander(t["pipeline_heading"], expanded=False):
+            # 0) Plan steps and tool calls (Plan-and-Execute mode)
+            plan_steps = pd.get("plan_steps", [])
+            if plan_steps:
+                st.markdown(f'**{t["pipeline_plan_steps"]}**')
+                for i, step in enumerate(plan_steps, 1):
+                    st.markdown(f"{i}. {step}")
+                st.markdown("---")
+            tool_calls = pd.get("tool_calls", [])
+            if tool_calls:
+                st.markdown(f'**{t["pipeline_tool_calls"]}**')
+                for tc in tool_calls:
+                    st.markdown(f"- `{tc}`")
+                st.markdown("---")
             # 1) Query translation (only show if translation actually happened)
             if pd.get("translated"):
                 st.markdown(f'**{t["pipeline_translation"]}**')
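
The new `plan`, `execute_step` and `synthesize` branches above turn stream events into status lines. As a rough sketch only, the same mapping can be written as a pure function so it can be checked without Streamlit. The name `describe_event` is hypothetical and not part of this commit; the event keys (`step`, `steps`, `step_index`, `step_desc`) and the message texts are taken from the diff.

```python
from __future__ import annotations

from typing import Any


def describe_event(event: dict[str, Any], lang: str = "en") -> list[str]:
    """Hypothetical helper: map one stream event to the lines the UI would write."""
    step = event.get("step", "")
    da = lang == "da"
    if step == "plan":
        steps = event.get("steps", [])
        head = (
            f"Plan oprettet med **{len(steps)}** trin"
            if da
            else f"Plan created with **{len(steps)}** steps"
        )
        # One bullet per planned step, as in the UI loop.
        return [head] + [f"  - {s}" for s in steps]
    if step == "execute_step":
        idx = event.get("step_index", 0)
        desc = event.get("step_desc", "")
        return [f"Trin {idx} udført: _{desc}_" if da else f"Step {idx} executed: _{desc}_"]
    if step == "synthesize":
        return ["Syntetiserer endeligt svar ..." if da else "Synthesizing final answer ..."]
    # Other event types (retrieval, rerank, tool_call, ...) are not covered here.
    return []
```

In the UI, each returned line could then be passed to `st.write`, which would keep the branching logic separate from the rendering.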

tests/test_memory.py ADDED

	@@ -0,0 +1,239 @@

+"""Tests for conversation memory."""
+import pytest
+from src.agent.memory import ConversationMemory, Turn
+from src.models import DocumentChunk, QueryResult
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+def _qr(chunk_id: str = "c1", doc_id: str = "doc.pdf", score: float = 0.8) -> QueryResult:
+    chunk = DocumentChunk(
+        chunk_id=chunk_id, document_id=doc_id, text="text",
+        metadata={"page_number": 1},
+    )
+    return QueryResult(chunk=chunk, score=score, source="test")
+# ---------------------------------------------------------------------------
+# Basic operations
+# ---------------------------------------------------------------------------
+class TestConversationMemory:
+    def test_initially_empty(self) -> None:
+        mem = ConversationMemory()
+        assert mem.is_empty
+        assert mem.turns == []
+        assert mem.last_query() == ""
+        assert mem.last_sources() == []
+    def test_add_turn(self) -> None:
+        mem = ConversationMemory()
+        mem.add_turn("What is X?", "X is Y.", [_qr()])
+        assert not mem.is_empty
+        assert len(mem.turns) == 1
+        assert mem.last_query() == "What is X?"
+    def test_multiple_turns(self) -> None:
+        mem = ConversationMemory()
+        mem.add_turn("Q1", "A1")
+        mem.add_turn("Q2", "A2")
+        assert len(mem.turns) == 2
+        assert mem.last_query() == "Q2"
+    def test_clear(self) -> None:
+        mem = ConversationMemory()
+        mem.add_turn("Q1", "A1")
+        mem.clear()
+        assert mem.is_empty
+    def test_turns_returns_copy(self) -> None:
+        mem = ConversationMemory()
+        mem.add_turn("Q1", "A1")
+        turns = mem.turns
+        turns.append(Turn(query="fake", answer="fake"))
+        assert len(mem.turns) == 1  # original unaffected
+# ---------------------------------------------------------------------------
+# Eviction
+# ---------------------------------------------------------------------------
+class TestEviction:
+    def test_max_turns_eviction(self) -> None:
+        mem = ConversationMemory(max_turns=3)
+        for i in range(5):
+            mem.add_turn(f"Q{i}", f"A{i}")
+        assert len(mem.turns) == 3
+        # Oldest should be Q2 (Q0 and Q1 evicted)
+        assert mem.turns[0].query == "Q2"
+    def test_max_turns_one(self) -> None:
+        mem = ConversationMemory(max_turns=1)
+        mem.add_turn("Q1", "A1")
+        mem.add_turn("Q2", "A2")
+        assert len(mem.turns) == 1
+        assert mem.turns[0].query == "Q2"
+# ---------------------------------------------------------------------------
+# format_history
+# ---------------------------------------------------------------------------
+class TestFormatHistory:
+    def test_empty_history(self) -> None:
+        mem = ConversationMemory()
+        assert mem.format_history() == ""
+    def test_includes_query_and_answer(self) -> None:
+        mem = ConversationMemory()
+        mem.add_turn("What is X?", "X is a policy.")
+        text = mem.format_history()
+        assert "What is X?" in text
+        assert "X is a policy." in text
+    def test_includes_source_doc_ids(self) -> None:
+        mem = ConversationMemory()
+        sources = [_qr(doc_id="policy.pdf"), _qr(chunk_id="c2", doc_id="rules.pdf")]
+        mem.add_turn("Q", "A", sources)
+        text = mem.format_history()
+        assert "policy.pdf" in text
+        assert "rules.pdf" in text
+    def test_max_recent_limits_output(self) -> None:
+        mem = ConversationMemory()
+        for i in range(10):
+            mem.add_turn(f"Q{i}", f"A{i}")
+        text = mem.format_history(max_recent=2)
+        assert "Q8" in text
+        assert "Q9" in text
+        assert "Q0" not in text
+    def test_long_answer_truncated(self) -> None:
+        mem = ConversationMemory()
+        mem.add_turn("Q", "x" * 1000)
+        text = mem.format_history()
+        # Answer should be truncated to 500 chars
+        assert len(text) < 1000
+# ---------------------------------------------------------------------------
+# get_prior_sources
+# ---------------------------------------------------------------------------
+class TestGetPriorSources:
+    def test_empty_returns_empty(self) -> None:
+        mem = ConversationMemory()
+        assert mem.get_prior_sources() == []
+    def test_collects_across_turns(self) -> None:
+        mem = ConversationMemory()
+        mem.add_turn("Q1", "A1", [_qr(chunk_id="c1", score=0.8)])
+        mem.add_turn("Q2", "A2", [_qr(chunk_id="c2", score=0.9)])
+        sources = mem.get_prior_sources()
+        assert len(sources) == 2
+        # Sorted by score descending
+        assert sources[0].score == 0.9
+    def test_deduplicates_by_chunk_id(self) -> None:
+        mem = ConversationMemory()
+        mem.add_turn("Q1", "A1", [_qr(chunk_id="c1", score=0.5)])
+        mem.add_turn("Q2", "A2", [_qr(chunk_id="c1", score=0.9)])
+        sources = mem.get_prior_sources()
+        assert len(sources) == 1
+        assert sources[0].score == 0.9  # keeps higher score
+    def test_no_sources_turns(self) -> None:
+        mem = ConversationMemory()
+        mem.add_turn("Q1", "A1")  # no sources
+        assert mem.get_prior_sources() == []
+# ---------------------------------------------------------------------------
+# Integration: memory in PlanAndExecuteRouter
+# ---------------------------------------------------------------------------
+class TestMemoryIntegration:
+    def test_route_records_turn(self) -> None:
+        """After route(), the conversation turn should be recorded in memory."""
+        from unittest.mock import MagicMock, patch
+        from langchain_core.messages import AIMessage
+        from src.agent.plan_and_execute import PlanAndExecuteRouter
+        llm = MagicMock()
+        retriever = MagicMock()
+        reranker = MagicMock()
+        vector_store = MagicMock()
+        memory = ConversationMemory()
+        plan_json = '[{"action": "search", "detail": "test"}]'
+        llm.invoke.side_effect = [plan_json, "The answer."]
+        mock_agent = MagicMock()
+        mock_agent.invoke.return_value = {"messages": [AIMessage(content="Found info.")]}
+        router = PlanAndExecuteRouter(
+            llm, retriever, reranker, vector_store, memory=memory,
+        )
+        with patch("src.agent.plan_and_execute.create_react_agent", return_value=mock_agent):
+            router.route("test question", top_k=5)
+        assert not memory.is_empty
+        assert memory.last_query() == "test question"
+        assert memory.turns[0].answer == "The answer."
+    def test_history_injected_into_planner(self) -> None:
+        """On a follow-up query, conversation history should appear in the planner prompt."""
+        from unittest.mock import MagicMock, patch
+        from langchain_core.messages import AIMessage
+        from src.agent.plan_and_execute import PlanAndExecuteRouter
+        llm = MagicMock()
+        memory = ConversationMemory()
+        memory.add_turn("What is the exam policy?", "The exam policy says...")
+        plan_json = '[{"action": "search", "detail": "follow-up"}]'
+        llm.invoke.side_effect = [plan_json, "Follow-up answer."]
+        mock_agent = MagicMock()
+        mock_agent.invoke.return_value = {"messages": [AIMessage(content="More info.")]}
+        router = PlanAndExecuteRouter(
+            llm, MagicMock(), MagicMock(), MagicMock(), memory=memory,
+        )
+        with patch("src.agent.plan_and_execute.create_react_agent", return_value=mock_agent):
+            router.route("What about the grading?", top_k=5)
+        # The first LLM call is the planner — check it includes history
+        planner_prompt = llm.invoke.call_args_list[0][0][0]
+        assert "exam policy" in planner_prompt
+        assert "Conversation history" in planner_prompt
+    def test_multi_turn_accumulates(self) -> None:
+        """Multiple route() calls should accumulate turns in memory."""
+        from unittest.mock import MagicMock, patch
+        from langchain_core.messages import AIMessage
+        from src.agent.plan_and_execute import PlanAndExecuteRouter
+        llm = MagicMock()
+        memory = ConversationMemory()
+        mock_agent = MagicMock()
+        mock_agent.invoke.return_value = {"messages": [AIMessage(content="info")]}
+        router = PlanAndExecuteRouter(
+            llm, MagicMock(), MagicMock(), MagicMock(), memory=memory,
+        )
+        for i in range(3):
+            plan_json = f'[{{"action": "search", "detail": "q{i}"}}]'
+            llm.invoke.side_effect = [plan_json, f"Answer {i}"]
+            with patch("src.agent.plan_and_execute.create_react_agent", return_value=mock_agent):
+                router.route(f"Question {i}", top_k=5)
+        assert len(memory.turns) == 3
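
The tests above pin down four behaviours of the conversation memory: bounded eviction, copy-on-read for `turns`, answer truncation in `format_history`, and per-chunk deduplication in `get_prior_sources`. `src/agent/memory.py` itself is not shown in this part of the diff, so the following is only a minimal sketch of one way to satisfy them. `SketchMemory` and `Source` are hypothetical stand-ins, and the defaults for `max_turns` and `max_recent` are assumptions; only the 500-character truncation limit comes from a test comment.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Source:
    """Stand-in for QueryResult with only the fields this sketch needs."""
    chunk_id: str
    document_id: str
    score: float


@dataclass
class Turn:
    query: str
    answer: str
    sources: list[Source] = field(default_factory=list)


class SketchMemory:
    """Hypothetical bounded conversation memory (not the project's class)."""

    def __init__(self, max_turns: int = 10) -> None:  # default is an assumption
        # deque(maxlen=...) discards the oldest turn once the limit is reached.
        self._turns: deque[Turn] = deque(maxlen=max_turns)

    @property
    def turns(self) -> list[Turn]:
        # Return a copy so callers cannot mutate the stored history.
        return list(self._turns)

    def add_turn(self, query: str, answer: str, sources: list[Source] | None = None) -> None:
        self._turns.append(Turn(query, answer, list(sources or [])))

    def format_history(self, max_recent: int = 3, max_answer_chars: int = 500) -> str:
        lines: list[str] = []
        for turn in self.turns[-max_recent:]:
            lines.append(f"User: {turn.query}")
            lines.append(f"Assistant: {turn.answer[:max_answer_chars]}")
            doc_ids = sorted({s.document_id for s in turn.sources})
            if doc_ids:
                lines.append(f"Sources: {', '.join(doc_ids)}")
        return "\n".join(lines)

    def get_prior_sources(self) -> list[Source]:
        # Keep the highest-scoring entry per chunk_id, then sort by score.
        best: dict[str, Source] = {}
        for turn in self._turns:
            for src in turn.sources:
                kept = best.get(src.chunk_id)
                if kept is None or src.score > kept.score:
                    best[src.chunk_id] = src
        return sorted(best.values(), key=lambda s: s.score, reverse=True)
```

Using `deque(maxlen=...)` keeps the eviction rule in one place. A plain list with slicing would work equally well if the real class needs random access to stored turns.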

tests/test_plan_and_execute.py ADDED

	@@ -0,0 +1,370 @@

+"""Tests for the Plan-and-Execute agent router."""
+from unittest.mock import MagicMock, patch
+import json
+import pytest
+from src.agent.plan_and_execute import (
+    PlanAndExecuteRouter,
+    PlanExecState,
+    PlanStep,
+    _extract_last_ai_text,
+    _parse_plan,
+)
+from src.models import (
+    DocumentChunk,
+    GenerationResponse,
+    IntentType,
+    QueryResult,
+)
+from src.retrieval.hybrid import HybridSearchResult
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+def _chunk(chunk_id: str = "c1", text: str = "text") -> DocumentChunk:
+    return DocumentChunk(
+        chunk_id=chunk_id, document_id="doc.pdf", text=text,
+        metadata={"page_number": 1, "chunk_index": 0},
+    )
+def _qr(chunk_id: str = "c1", score: float = 0.8, text: str = "text") -> QueryResult:
+    return QueryResult(chunk=_chunk(chunk_id=chunk_id, text=text), score=score, source="hybrid")
+def _hybrid_result(results: list[QueryResult]) -> HybridSearchResult:
+    return HybridSearchResult(
+        dense_results=results, sparse_results=results, fused_results=results,
+    )
+# ---------------------------------------------------------------------------
+# _parse_plan
+# ---------------------------------------------------------------------------
+class TestParsePlan:
+    def test_valid_json(self) -> None:
+        raw = '[{"action": "search", "detail": "exam rules"}]'
+        steps = _parse_plan(raw)
+        assert len(steps) == 1
+        assert steps[0]["action"] == "search"
+        assert steps[0]["detail"] == "exam rules"
+    def test_multiple_steps(self) -> None:
+        raw = json.dumps([
+            {"action": "search", "detail": "policy A"},
+            {"action": "search", "detail": "policy B"},
+            {"action": "summarize", "detail": "doc.pdf"},
+        ])
+        steps = _parse_plan(raw)
+        assert len(steps) == 3
+    def test_markdown_fenced(self) -> None:
+        raw = '```json\n[{"action": "search", "detail": "test"}]\n```'
+        steps = _parse_plan(raw)
+        assert len(steps) == 1
+        assert steps[0]["action"] == "search"
+    def test_json_with_surrounding_text(self) -> None:
+        raw = 'Here is the plan:\n[{"action": "search", "detail": "x"}]\nDone.'
+        steps = _parse_plan(raw)
+        assert len(steps) == 1
+    def test_invalid_json_falls_back(self) -> None:
+        raw = "this is not json at all"
+        steps = _parse_plan(raw)
+        assert len(steps) == 1
+        assert steps[0]["action"] == "search"
+    def test_empty_array_falls_back(self) -> None:
+        raw = "[]"
+        steps = _parse_plan(raw)
+        assert len(steps) == 1  # fallback to single search
+    def test_malformed_items_skipped(self) -> None:
+        raw = json.dumps([
+            {"action": "search", "detail": "good"},
+            {"bad": "step"},
+            {"action": "search", "detail": "also good"},
+        ])
+        steps = _parse_plan(raw)
+        assert len(steps) == 2
+    def test_non_list_wrapped(self) -> None:
+        raw = '{"action": "search", "detail": "test"}'
+        steps = _parse_plan(raw)
+        assert len(steps) == 1
+# ---------------------------------------------------------------------------
+# _extract_last_ai_text
+# ---------------------------------------------------------------------------
+class TestExtractLastAIText:
+    def test_returns_last_ai_message(self) -> None:
+        from langchain_core.messages import AIMessage, HumanMessage
+        messages = [
+            HumanMessage(content="question"),
+            AIMessage(content="first"),
+            AIMessage(content="second"),
+        ]
+        assert _extract_last_ai_text(messages) == "second"
+    def test_skips_tool_calls(self) -> None:
+        from langchain_core.messages import AIMessage
+        msg_with_tools = AIMessage(content="calling tool", tool_calls=[{"name": "t", "args": {}, "id": "1"}])
+        msg_final = AIMessage(content="the answer")
+        assert _extract_last_ai_text([msg_with_tools, msg_final]) == "the answer"
+    def test_empty_messages(self) -> None:
+        assert _extract_last_ai_text([]) == ""
+# ---------------------------------------------------------------------------
+# PlanAndExecuteRouter — plan node
+# ---------------------------------------------------------------------------
+class TestPlanNode:
+    def test_plan_node_generates_steps(self) -> None:
+        llm = MagicMock()
+        llm.invoke.return_value = '[{"action": "search", "detail": "KU regler"}]'
+        router = PlanAndExecuteRouter(
+            llm=llm,
+            hybrid_retriever=MagicMock(),
+            reranker=MagicMock(),
+            vector_store=MagicMock(),
+        )
+        state = PlanExecState(
+            query="What are the rules?",
+            top_k=5, plan=[], step_index=0, step_results=[], answer="",
+        )
+        result = router._plan_node(state)
+        assert len(result["plan"]) == 1
+        assert result["plan"][0]["action"] == "search"
+        assert result["step_index"] == 0
+    def test_plan_node_handles_bad_llm_output(self) -> None:
+        llm = MagicMock()
+        llm.invoke.return_value = "I cannot produce JSON"
+        router = PlanAndExecuteRouter(
+            llm=llm,
+            hybrid_retriever=MagicMock(),
+            reranker=MagicMock(),
+            vector_store=MagicMock(),
+        )
+        state = PlanExecState(
+            query="test", top_k=5, plan=[], step_index=0, step_results=[], answer="",
+        )
+        result = router._plan_node(state)
+        assert len(result["plan"]) >= 1  # fallback plan
+# ---------------------------------------------------------------------------
+# PlanAndExecuteRouter — should_execute
+# ---------------------------------------------------------------------------
+class TestShouldExecute:
+    def test_more_steps_returns_execute(self) -> None:
+        state = PlanExecState(
+            query="q", top_k=5,
+            plan=[PlanStep(action="search", detail="x")],
+            step_index=0, step_results=[], answer="",
+        )
+        assert PlanAndExecuteRouter._should_execute(state) == "execute"
+    def test_all_steps_done_returns_synthesize(self) -> None:
+        state = PlanExecState(
+            query="q", top_k=5,
+            plan=[PlanStep(action="search", detail="x")],
+            step_index=1, step_results=[], answer="",
+        )
+        assert PlanAndExecuteRouter._should_execute(state) == "synthesize"
+    def test_empty_plan_returns_synthesize(self) -> None:
+        state = PlanExecState(
+            query="q", top_k=5, plan=[], step_index=0, step_results=[], answer="",
+        )
+        assert PlanAndExecuteRouter._should_execute(state) == "synthesize"
+    def test_max_steps_cap(self) -> None:
+        """Step index at _MAX_STEPS should stop execution."""
+        state = PlanExecState(
+            query="q", top_k=5,
+            plan=[PlanStep(action="search", detail=f"q{i}") for i in range(10)],
+            step_index=6,  # == _MAX_STEPS
+            step_results=[], answer="",
+        )
+        assert PlanAndExecuteRouter._should_execute(state) == "synthesize"
+# ---------------------------------------------------------------------------
+# PlanAndExecuteRouter — synthesize node
+# ---------------------------------------------------------------------------
+class TestSynthesizeNode:
+    def test_synthesize_combines_results(self) -> None:
+        llm = MagicMock()
+        llm.invoke.return_value = "Combined answer about exams."
+        router = PlanAndExecuteRouter(
+            llm=llm,
+            hybrid_retriever=MagicMock(),
+            reranker=MagicMock(),
+            vector_store=MagicMock(),
+        )
+        state = PlanExecState(
+            query="exam rules",
+            top_k=5, plan=[],
+            step_index=2,
+            step_results=[
+                ("search: exam bachelor", "Found bachelor exam rules..."),
+                ("search: exam master", "Found master exam rules..."),
+            ],
+            answer="",
+        )
+        result = router._synthesize_node(state)
+        assert result["answer"] == "Combined answer about exams."
+        # Verify prompt includes both step results
+        prompt = llm.invoke.call_args[0][0]
+        assert "bachelor exam rules" in prompt
+        assert "master exam rules" in prompt
+# ---------------------------------------------------------------------------
+# PlanAndExecuteRouter — full route (integration with mocks)
+# ---------------------------------------------------------------------------
+class TestFullRoute:
+    def test_route_produces_response(self) -> None:
+        """Full route with mocked LLM and retrieval components."""
+        llm = MagicMock()
+        retriever = MagicMock()
+        reranker = MagicMock()
+        vector_store = MagicMock()
+        # Plan: single search step
+        plan_json = '[{"action": "search", "detail": "test query"}]'
+        # Sub-agent answer after executing step
+        from langchain_core.messages import AIMessage
+        sub_agent_result = {"messages": [AIMessage(content="Found relevant info about test.")]}
+        # Final synthesis
+        final_answer = "The test policy states..."
+        # llm.invoke is called twice: once by the planner and once for synthesis.
+        # Step execution goes through the patched create_react_agent mock instead.
+        llm.invoke.side_effect = [plan_json, final_answer]
+        results = [_qr(chunk_id="c1", score=0.9, text="test policy")]
+        retriever.search_detailed.return_value = _hybrid_result(results)
+        reranker.rerank.return_value = results
+        vector_store.list_document_ids.return_value = ["doc.pdf"]
+        router = PlanAndExecuteRouter(llm, retriever, reranker, vector_store)
+        # Patch create_react_agent to return a mock that returns our sub_agent_result
+        mock_agent = MagicMock()
+        mock_agent.invoke.return_value = sub_agent_result
+        with patch("src.agent.plan_and_execute.create_react_agent", return_value=mock_agent):
+            response = router.route("test question", top_k=5)
+        assert isinstance(response, GenerationResponse)
+        assert response.answer == "The test policy states..."
+    def test_route_with_no_results(self) -> None:
+        """Route when retrieval finds nothing."""
+        llm = MagicMock()
+        retriever = MagicMock()
+        reranker = MagicMock()
+        vector_store = MagicMock()
+        plan_json = '[{"action": "search", "detail": "nonexistent"}]'
+        from langchain_core.messages import AIMessage
+        sub_agent_result = {"messages": [AIMessage(content="No relevant documents found.")]}
+        llm.invoke.side_effect = [plan_json, "I could not find information."]
+        mock_agent = MagicMock()
+        mock_agent.invoke.return_value = sub_agent_result
+        router = PlanAndExecuteRouter(llm, retriever, reranker, vector_store)
+        with patch("src.agent.plan_and_execute.create_react_agent", return_value=mock_agent):
+            response = router.route("nonexistent topic", top_k=5)
+        assert response.intent == IntentType.FACTUAL
+        assert response.confidence == 0.0
+    def test_route_multi_step(self) -> None:
+        """Route with a multi-step plan."""
+        llm = MagicMock()
+        retriever = MagicMock()
+        reranker = MagicMock()
+        vector_store = MagicMock()
+        plan_json = json.dumps([
+            {"action": "search", "detail": "policy A"},
+            {"action": "search", "detail": "policy B"},
+        ])
+        from langchain_core.messages import AIMessage
+        sub_result_1 = {"messages": [AIMessage(content="Policy A info")]}
+        sub_result_2 = {"messages": [AIMessage(content="Policy B info")]}
+        llm.invoke.side_effect = [plan_json, "Comparison of A and B."]
+        mock_agent = MagicMock()
+        mock_agent.invoke.side_effect = [sub_result_1, sub_result_2]
+        router = PlanAndExecuteRouter(llm, retriever, reranker, vector_store)
+        with patch("src.agent.plan_and_execute.create_react_agent", return_value=mock_agent):
+            response = router.route("Compare A and B", top_k=5)
+        assert response.answer == "Comparison of A and B."
+        # Sub-agent should have been called twice
+        assert mock_agent.invoke.call_count == 2
+# ---------------------------------------------------------------------------
+# PlanAndExecuteRouter — route_stream
+# ---------------------------------------------------------------------------
+class TestRouteStream:
+    def test_stream_yields_plan_execute_synthesize_done(self) -> None:
+        """Streaming should yield a plan event and a final done event carrying the result."""
+        llm = MagicMock()
+        retriever = MagicMock()
+        reranker = MagicMock()
+        vector_store = MagicMock()
+        plan_json = '[{"action": "search", "detail": "test"}]'
+        from langchain_core.messages import AIMessage
+        sub_agent_result = {"messages": [AIMessage(content="Found info.")]}
+        llm.invoke.side_effect = [plan_json, "Final answer."]
+        mock_agent = MagicMock()
+        mock_agent.invoke.return_value = sub_agent_result
+        router = PlanAndExecuteRouter(llm, retriever, reranker, vector_store)
+        with patch("src.agent.plan_and_execute.create_react_agent", return_value=mock_agent):
+            events = list(router.route_stream("test", top_k=5))
+        step_names = [e["step"] for e in events]
+        assert "plan" in step_names
+        assert "done" in step_names
+        # done event has the result
+        done_event = [e for e in events if e["step"] == "done"][0]
+        assert "result" in done_event
+        assert done_event["result"]["answer"] == "Final answer."
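
The `TestParsePlan` cases above describe a tolerant parser: it accepts a bare JSON array, a Markdown-fenced array, an array surrounded by prose, or a single step object, and it falls back to one `search` step when nothing usable is found. The real `_parse_plan` in `src/agent/plan_and_execute.py` is not shown in this part of the diff, so the following is only a sketch of one way to get that behaviour. The name `parse_plan_sketch` and the `fallback_detail` parameter are hypothetical.

```python
from __future__ import annotations

import json
import re


def parse_plan_sketch(raw: str, fallback_detail: str = "") -> list[dict[str, str]]:
    """Hypothetical tolerant plan parser; the project's _parse_plan may differ."""
    fallback = [{"action": "search", "detail": fallback_detail or raw.strip()}]

    # Drop Markdown code-fence lines such as ```json and ```.
    text = re.sub(r"^```[a-zA-Z]*\s*$", "", raw, flags=re.MULTILINE).strip()

    # Try the whole text first, then the outermost [...] span, then {...}.
    candidates = [text]
    for open_ch, close_ch in (("[", "]"), ("{", "}")):
        start, end = text.find(open_ch), text.rfind(close_ch)
        if 0 <= start < end:
            candidates.append(text[start:end + 1])

    parsed = None
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
            break
        except json.JSONDecodeError:
            continue

    if isinstance(parsed, dict):
        parsed = [parsed]  # a single step object is wrapped in a list
    if not isinstance(parsed, list):
        return fallback

    # Keep only items that carry both expected keys; skip malformed ones.
    steps = [
        {"action": str(item["action"]), "detail": str(item["detail"])}
        for item in parsed
        if isinstance(item, dict) and "action" in item and "detail" in item
    ]
    return steps or fallback
```

Returning a fallback plan instead of raising means a model that cannot produce JSON still gets a single retrieval attempt, which appears to be what `test_plan_node_handles_bad_llm_output` relies on.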

tests/test_tools.py ADDED

	@@ -0,0 +1,381 @@

+"""Tests for agent tools (hybrid_search, list_documents, fetch_document,
+search_within_document, multi_query_search, summarize_document)."""
+from unittest.mock import MagicMock
+import pytest
+from src.agent.tools import ToolResultStore, make_retrieval_tools, _merge_results, _format_results
+from src.models import DocumentChunk, QueryResult
+from src.retrieval.hybrid import HybridSearchResult
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+def _chunk(chunk_id: str = "c1", document_id: str = "doc.pdf", text: str = "text",
+           page_number: int = 1, chunk_index: int = 0) -> DocumentChunk:
+    return DocumentChunk(
+        chunk_id=chunk_id,
+        document_id=document_id,
+        text=text,
+        metadata={"page_number": page_number, "chunk_index": chunk_index},
+    )
+def _qr(chunk_id: str = "c1", document_id: str = "doc.pdf", text: str = "text",
+         score: float = 0.8, source: str = "hybrid", page_number: int = 1) -> QueryResult:
+    return QueryResult(
+        chunk=_chunk(chunk_id=chunk_id, document_id=document_id, text=text, page_number=page_number),
+        score=score,
+        source=source,
+    )
+def _hybrid_result(results: list[QueryResult]) -> HybridSearchResult:
+    return HybridSearchResult(
+        dense_results=results,
+        sparse_results=results,
+        fused_results=results,
+    )
+@pytest.fixture
+def components():
+    """Create mock retriever, reranker, vector_store, and store."""
+    retriever = MagicMock()
+    reranker = MagicMock()
+    vector_store = MagicMock()
+    store = ToolResultStore()
+    return retriever, reranker, vector_store, store
+# ---------------------------------------------------------------------------
+# Unit tests for helper functions
+# ---------------------------------------------------------------------------
+class TestMergeResults:
+    def test_merge_empty(self) -> None:
+        assert _merge_results([], []) == []
+    def test_merge_keeps_higher_score(self) -> None:
+        old = [_qr(chunk_id="c1", score=0.5)]
+        new = [_qr(chunk_id="c1", score=0.9)]
+        merged = _merge_results(old, new)
+        assert len(merged) == 1
+        assert merged[0].score == 0.9
+    def test_merge_keeps_old_if_higher(self) -> None:
+        old = [_qr(chunk_id="c1", score=0.9)]
+        new = [_qr(chunk_id="c1", score=0.5)]
+        merged = _merge_results(old, new)
+        assert merged[0].score == 0.9
+    def test_merge_combines_different_ids(self) -> None:
+        old = [_qr(chunk_id="c1", score=0.5)]
+        new = [_qr(chunk_id="c2", score=0.9)]
+        merged = _merge_results(old, new)
+        assert len(merged) == 2
+        assert merged[0].chunk.chunk_id == "c2"  # higher score first
+    def test_merge_sorted_descending(self) -> None:
+        results = [_qr(chunk_id=f"c{i}", score=s) for i, s in enumerate([0.3, 0.9, 0.6])]
+        merged = _merge_results([], results)
+        scores = [r.score for r in merged]
+        assert scores == sorted(scores, reverse=True)
+class TestFormatResults:
+    def test_empty_returns_no_results_message(self) -> None:
+        result = _format_results([])
+        assert "Ingen relevante" in result
+    def test_includes_document_id_and_score(self) -> None:
+        results = [_qr(document_id="policy.pdf", score=0.85)]
+        text = _format_results(results)
+        assert "policy.pdf" in text
+        assert "0.850" in text
+    def test_includes_page_number(self) -> None:
+        results = [_qr(page_number=5)]
+        text = _format_results(results)
+        assert "side 5" in text
+# ---------------------------------------------------------------------------
+# hybrid_search
+# ---------------------------------------------------------------------------
+class TestHybridSearch:
+    def test_returns_formatted_results(self, components) -> None:
+        retriever, reranker, vector_store, store = components
+        results = [_qr(document_id="a.pdf", score=0.9, text="answer")]
+        retriever.search_detailed.return_value = _hybrid_result(results)
+        reranker.rerank.return_value = results
+        tools = make_retrieval_tools(retriever, reranker, vector_store, store)
+        hybrid_search = tools[0]
+        output = hybrid_search.invoke({"query": "test", "top_k": 5})
+        assert "a.pdf" in output
+        assert "answer" in output
+        retriever.search_detailed.assert_called_once_with("test", top_k=5)
+    def test_accumulates_in_store(self, components) -> None:
+        retriever, reranker, vector_store, store = components
+        results = [_qr(chunk_id="c1", score=0.8)]
+        retriever.search_detailed.return_value = _hybrid_result(results)
+        reranker.rerank.return_value = results
+        tools = make_retrieval_tools(retriever, reranker, vector_store, store)
+        tools[0].invoke({"query": "q1"})
+        assert len(store.retrieved) == 1
+        assert store.retrieved[0].chunk.chunk_id == "c1"
+        assert len(store.tool_calls) == 1
+        assert store.tool_calls[0] == ("hybrid_search", "q1")
+    def test_no_results(self, components) -> None:
+        retriever, reranker, vector_store, store = components
+        retriever.search_detailed.return_value = _hybrid_result([])
+        reranker.rerank.return_value = []
+        tools = make_retrieval_tools(retriever, reranker, vector_store, store)
+        output = tools[0].invoke({"query": "nothing"})
+        assert "Ingen relevante" in output
+# ---------------------------------------------------------------------------
+# list_documents
+# ---------------------------------------------------------------------------
+class TestListDocuments:
+    def test_returns_document_list(self, components) -> None:
+        retriever, reranker, vector_store, store = components
+        vector_store.list_document_ids.return_value = ["a.pdf", "b.pdf"]
+        tools = make_retrieval_tools(retriever, reranker, vector_store, store)
+        list_docs = tools[1]
+        output = list_docs.invoke({})
+        assert "a.pdf" in output
+        assert "b.pdf" in output
+        assert "2 i alt" in output
+    def test_empty_knowledge_base(self, components) -> None:
+        retriever, reranker, vector_store, store = components
+        vector_store.list_document_ids.return_value = []
+        tools = make_retrieval_tools(retriever, reranker, vector_store, store)
+        output = tools[1].invoke({})
+        assert "empty" in output.lower() or "Ingen" in output
+# ---------------------------------------------------------------------------
+# fetch_document
+# ---------------------------------------------------------------------------
+class TestFetchDocument:
+    def test_returns_full_text(self, components) -> None:
+        retriever, reranker, vector_store, store = components
+        chunks = [_chunk(chunk_id="c1", text="page1"), _chunk(chunk_id="c2", text="page2")]
+        vector_store.get_chunks_by_document_id.return_value = chunks
+        tools = make_retrieval_tools(retriever, reranker, vector_store, store)
+        fetch = tools[2]
+        output = fetch.invoke({"document_id": "doc.pdf"})
+        assert "page1" in output
+        assert "page2" in output
+        assert len(store.retrieved) == 2
+    def test_document_not_found(self, components) -> None:
+        retriever, reranker, vector_store, store = components
+        vector_store.get_chunks_by_document_id.return_value = []
+        tools = make_retrieval_tools(retriever, reranker, vector_store, store)
+        output = tools[2].invoke({"document_id": "missing.pdf"})
+        assert "ikke fundet" in output
+# ---------------------------------------------------------------------------
+# search_within_document
+# ---------------------------------------------------------------------------
+class TestSearchWithinDocument:
+    def test_reranks_document_chunks(self, components) -> None:
+        retriever, reranker, vector_store, store = components
+        chunks = [
+            _chunk(chunk_id="c1", text="irrelevant"),
+            _chunk(chunk_id="c2", text="relevant answer"),
+        ]
+        vector_store.get_chunks_by_document_id.return_value = chunks
+        reranker.rerank.return_value = [_qr(chunk_id="c2", text="relevant answer", score=0.95)]
+        tools = make_retrieval_tools(retriever, reranker, vector_store, store)
+        search_within = tools[3]
+        output = search_within.invoke({"document_id": "doc.pdf", "query": "answer"})
+        assert "relevant answer" in output
+        assert "0.950" in output
+        reranker.rerank.assert_called_once()
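+        # Verify that every chunk of the document was passed to the reranker.
+        # call_args[0] holds the positional arguments; index 1 is the candidate list.
+        candidates = reranker.rerank.call_args[0][1]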
+        assert len(candidates) == 2
+    def test_document_not_found(self, components) -> None:
+        retriever, reranker, vector_store, store = components
+        vector_store.get_chunks_by_document_id.return_value = []
+        tools = make_retrieval_tools(retriever, reranker, vector_store, store)
+        output = tools[3].invoke({"document_id": "missing.pdf", "query": "test"})
+        assert "ikke fundet" in output
+    def test_accumulates_in_store(self, components) -> None:
+        retriever, reranker, vector_store, store = components
+        chunks = [_chunk(chunk_id="c1")]
+        vector_store.get_chunks_by_document_id.return_value = chunks
+        reranker.rerank.return_value = [_qr(chunk_id="c1", score=0.7)]
+        tools = make_retrieval_tools(retriever, reranker, vector_store, store)
+        tools[3].invoke({"document_id": "doc.pdf", "query": "q"})
+        assert len(store.retrieved) == 1
+        assert store.tool_calls[-1][0] == "search_within_document"
+# ---------------------------------------------------------------------------
+# multi_query_search (requires llm_chain)
+# ---------------------------------------------------------------------------
+class TestMultiQuerySearch:
+    def test_decomposes_and_searches(self, components) -> None:
+        retriever, reranker, vector_store, store = components
+        llm_chain = MagicMock()
+        # LLM returns two sub-queries, one per line
+        llm_chain.invoke.return_value = "eksamensregler bachelor\neksamensregler kandidat"
+        results_a = [_qr(chunk_id="c1", score=0.9, text="bachelor exam")]
+        results_b = [_qr(chunk_id="c2", score=0.85, text="master exam")]
+        retriever.search_detailed.side_effect = [
+            _hybrid_result(results_a),
+            _hybrid_result(results_b),
+        ]
+        reranker.rerank.side_effect = [results_a, results_b]
+        tools = make_retrieval_tools(retriever, reranker, vector_store, store, llm_chain=llm_chain)
+        multi_search = tools[4]
+        output = multi_search.invoke({"question": "Compare exam rules"})
+        assert "delforespørgsler" in output
+        assert retriever.search_detailed.call_count == 2
+        assert reranker.rerank.call_count == 2
+        assert len(store.retrieved) == 2
+    def test_fallback_when_decompose_fails(self, components) -> None:
+        retriever, reranker, vector_store, store = components
+        llm_chain = MagicMock()
+        # LLM returns an empty string, so no sub-queries can be parsed
+        llm_chain.invoke.return_value = ""
+        results = [_qr(chunk_id="c1", score=0.8)]
+        retriever.search_detailed.return_value = _hybrid_result(results)
+        reranker.rerank.return_value = results
+        tools = make_retrieval_tools(retriever, reranker, vector_store, store, llm_chain=llm_chain)
+        output = tools[4].invoke({"question": "original question"})
+        # Should fall back to the original question as a single query
+        assert retriever.search_detailed.call_count == 1
+        assert "0.800" in output
+    def test_not_available_without_llm(self, components) -> None:
+        retriever, reranker, vector_store, store = components
+        tools = make_retrieval_tools(retriever, reranker, vector_store, store, llm_chain=None)
+        tool_names = [t.name for t in tools]
+        assert "multi_query_search" not in tool_names
+        assert "summarize_document" not in tool_names
+    def test_deduplicates_across_sub_queries(self, components) -> None:
+        retriever, reranker, vector_store, store = components
+        llm_chain = MagicMock()
+        llm_chain.invoke.return_value = "query1\nquery2"
+        # Both sub-queries return the same chunk
+        same_result = [_qr(chunk_id="c1", score=0.8)]
+        retriever.search_detailed.return_value = _hybrid_result(same_result)
+        reranker.rerank.return_value = same_result
+        tools = make_retrieval_tools(retriever, reranker, vector_store, store, llm_chain=llm_chain)
+        tools[4].invoke({"question": "test"})
+        # The chunk shared by both sub-queries should be stored only once
+        assert len(store.retrieved) == 1
+# ---------------------------------------------------------------------------
+# summarize_document (requires llm_chain)
+# ---------------------------------------------------------------------------
+class TestSummarizeDocument:
+    def test_generates_summary(self, components) -> None:
+        retriever, reranker, vector_store, store = components
+        llm_chain = MagicMock()
+        llm_chain.invoke.return_value = "This document covers exam policies."
+        chunks = [_chunk(chunk_id="c1", text="Exam rules...")]
+        vector_store.get_chunks_by_document_id.return_value = chunks
+        tools = make_retrieval_tools(retriever, reranker, vector_store, store, llm_chain=llm_chain)
+        summarize = tools[5]
+        output = summarize.invoke({"document_id": "exam.pdf"})
+        assert "Resumé af exam.pdf" in output
+        assert "exam policies" in output
+        llm_chain.invoke.assert_called_once()
+        # Verify the prompt includes the document text
+        prompt = llm_chain.invoke.call_args[0][0]
+        assert "Exam rules" in prompt
+    def test_document_not_found(self, components) -> None:
+        retriever, reranker, vector_store, store = components
+        llm_chain = MagicMock()
+        vector_store.get_chunks_by_document_id.return_value = []
+        tools = make_retrieval_tools(retriever, reranker, vector_store, store, llm_chain=llm_chain)
+        output = tools[5].invoke({"document_id": "missing.pdf"})
+        assert "ikke fundet" in output
+        llm_chain.invoke.assert_not_called()
+    def test_truncates_long_documents(self, components) -> None:
+        retriever, reranker, vector_store, store = components
+        llm_chain = MagicMock()
+        llm_chain.invoke.return_value = "summary"
+        # Create a document longer than 8000 chars
+        long_text = "x" * 10000
+        chunks = [_chunk(chunk_id="c1", text=long_text)]
+        vector_store.get_chunks_by_document_id.return_value = chunks
+        tools = make_retrieval_tools(retriever, reranker, vector_store, store, llm_chain=llm_chain)
+        tools[5].invoke({"document_id": "long.pdf"})
+        prompt = llm_chain.invoke.call_args[0][0]
+        assert "forkortet" in prompt
+    def test_registers_chunks_as_sources(self, components) -> None:
+        retriever, reranker, vector_store, store = components
+        llm_chain = MagicMock()
+        llm_chain.invoke.return_value = "summary"
+        chunks = [_chunk(chunk_id="c1"), _chunk(chunk_id="c2")]
+        vector_store.get_chunks_by_document_id.return_value = chunks
+        tools = make_retrieval_tools(retriever, reranker, vector_store, store, llm_chain=llm_chain)
+        tools[5].invoke({"document_id": "doc.pdf"})
+        assert len(store.retrieved) == 2
+        assert store.tool_calls[-1] == ("summarize_document", "doc.pdf")