XQ committed on
Commit
05c89bc
·
1 Parent(s): 3a623df

Update language and prompt

.github/README.md CHANGED
@@ -7,7 +7,7 @@ Hosted on Hugging Face Spaces: [xq-dokumentassistent.hf.space](https://xq-dokume
7
 
8
  ## Dansk
9
 
10
- En produktionsklar RAG-applikation, der gør det muligt at stille spørgsmål til dokumenter på dansk og få svar med kildehenvisninger. Systemet er bygget på open source-komponenter (LangChain, LangGraph, Qdrant, Ollama) og kan køre helt lokalt uden eksterne API-kald. Det implementerer hybrid søgning med reranking, en Plan-and-Execute agent med samtalehukommelse, og RAGAS-baseret evaluering af svarkvaliteten.
11
 
12
  ### Funktioner
13
 
@@ -79,7 +79,7 @@ Se `.env.example` for konfiguration pr. provider.
79
 
80
  Demoen ligger på [xq-dokumentassistent.hf.space](https://xq-dokumentassistent.hf.space).
81
 
82
- Prøv for eksempel disse spørgsmål på dansk.
83
 
84
  - "Hvad er KU's politik for brug af AI-værktøjer?"
85
  - "Hvilke regler gælder for brug af generativ AI i eksamen?"
@@ -177,7 +177,7 @@ docs/ # eksempel-PDF'er eller tekster (KU AI-dokumenter)
177
 
178
  ## English
179
 
180
- A production-ready RAG application that lets users ask questions about documents in Danish and receive answers with source citations. The system is built on open source components (LangChain, LangGraph, Qdrant, Ollama) and can run fully local without any external API calls. It implements hybrid search with reranking, a Plan-and-Execute agent with conversation memory, and RAGAS-based evaluation of answer quality.
181
 
182
  ### Capabilities
183
 
@@ -249,7 +249,7 @@ See `.env.example` for per-provider configuration.
249
 
250
  The demo lives at [xq-dokumentassistent.hf.space](https://xq-dokumentassistent.hf.space).
251
 
252
- Try asking these questions in Danish.
253
 
254
  - "Hvad er KU's politik for brug af AI-værktøjer?"
255
  - "Hvilke regler gælder for brug af generativ AI i eksamen?"
 
7
 
8
  ## Dansk
9
 
10
+ En produktionsklar RAG-applikation, der gør det muligt at stille spørgsmål til dokumenter på et hvilket som helst sprog og få svar med kildehenvisninger. Systemet er bygget på open source-komponenter (LangChain, LangGraph, Qdrant, Ollama) og kan køre helt lokalt uden eksterne API-kald. Det implementerer hybrid søgning med reranking, en Plan-and-Execute agent med samtalehukommelse, og RAGAS-baseret evaluering af svarkvaliteten.
11
 
12
  ### Funktioner
13
 
 
79
 
80
  Demoen ligger på [xq-dokumentassistent.hf.space](https://xq-dokumentassistent.hf.space).
81
 
82
+ Prøv for eksempel disse spørgsmål på et hvilket som helst sprog.
83
 
84
  - "Hvad er KU's politik for brug af AI-værktøjer?"
85
  - "Hvilke regler gælder for brug af generativ AI i eksamen?"
 
177
 
178
  ## English
179
 
180
+ A production-ready RAG application that lets users ask questions about documents in any language and receive answers with source citations. The system is built on open source components (LangChain, LangGraph, Qdrant, Ollama) and can run fully local without any external API calls. It implements hybrid search with reranking, a Plan-and-Execute agent with conversation memory, and RAGAS-based evaluation of answer quality.
181
 
182
  ### Capabilities
183
 
 
249
 
250
  The demo lives at [xq-dokumentassistent.hf.space](https://xq-dokumentassistent.hf.space).
251
 
252
+ Try asking these questions, or in your language.
253
 
254
  - "Hvad er KU's politik for brug af AI-værktøjer?"
255
  - "Hvilke regler gælder for brug af generativ AI i eksamen?"
README.md CHANGED
@@ -17,7 +17,7 @@ Hosted on Hugging Face Spaces: [xq-dokumentassistent.hf.space](https://xq-dokume
17
 
18
  ## Dansk
19
 
20
- En produktionsklar RAG-applikation, der gør det muligt at stille spørgsmål til dokumenter på dansk og få svar med kildehenvisninger. Systemet er bygget på open source-komponenter (LangChain, LangGraph, Qdrant, Ollama) og kan køre helt lokalt uden eksterne API-kald. Det implementerer hybrid søgning med reranking, en Plan-and-Execute agent med samtalehukommelse, og RAGAS-baseret evaluering af svarkvaliteten.
21
 
22
  ### Funktioner
23
 
@@ -89,7 +89,7 @@ Se `.env.example` for konfiguration pr. provider.
89
 
90
  Demoen ligger på [xq-dokumentassistent.hf.space](https://xq-dokumentassistent.hf.space).
91
 
92
- Prøv for eksempel disse spørgsmål på dansk.
93
 
94
  - "Hvad er KU's politik for brug af AI-værktøjer?"
95
  - "Hvilke regler gælder for brug af generativ AI i eksamen?"
@@ -187,7 +187,7 @@ docs/ # eksempel-PDF'er eller tekster (KU AI-dokumenter)
187
 
188
  ## English
189
 
190
- A production-ready RAG application that lets users ask questions about documents in Danish and receive answers with source citations. The system is built on open source components (LangChain, LangGraph, Qdrant, Ollama) and can run fully local without any external API calls. It implements hybrid search with reranking, a Plan-and-Execute agent with conversation memory, and RAGAS-based evaluation of answer quality.
191
 
192
  ### Capabilities
193
 
@@ -259,7 +259,7 @@ See `.env.example` for per-provider configuration.
259
 
260
  The demo lives at [xq-dokumentassistent.hf.space](https://xq-dokumentassistent.hf.space).
261
 
262
- Try asking these questions in Danish.
263
 
264
  - "Hvad er KU's politik for brug af AI-værktøjer?"
265
  - "Hvilke regler gælder for brug af generativ AI i eksamen?"
 
17
 
18
  ## Dansk
19
 
20
+ En produktionsklar RAG-applikation, der gør det muligt at stille spørgsmål til dokumenter på et hvilket som helst sprog og få svar med kildehenvisninger. Systemet er bygget på open source-komponenter (LangChain, LangGraph, Qdrant, Ollama) og kan køre helt lokalt uden eksterne API-kald. Det implementerer hybrid søgning med reranking, en Plan-and-Execute agent med samtalehukommelse, og RAGAS-baseret evaluering af svarkvaliteten.
21
 
22
  ### Funktioner
23
 
 
89
 
90
  Demoen ligger på [xq-dokumentassistent.hf.space](https://xq-dokumentassistent.hf.space).
91
 
92
+ Prøv for eksempel disse spørgsmål på et hvilket som helst sprog.
93
 
94
  - "Hvad er KU's politik for brug af AI-værktøjer?"
95
  - "Hvilke regler gælder for brug af generativ AI i eksamen?"
 
187
 
188
  ## English
189
 
190
+ A production-ready RAG application that lets users ask questions about Danish documents in any language and receive answers with source citations. The system is built on open source components (LangChain, LangGraph, Qdrant, Ollama) and can run fully local without any external API calls. It implements hybrid search with reranking, a Plan-and-Execute agent with conversation memory, and RAGAS-based evaluation of answer quality.
191
 
192
  ### Capabilities
193
 
 
259
 
260
  The demo lives at [xq-dokumentassistent.hf.space](https://xq-dokumentassistent.hf.space).
261
 
262
+ Try asking these questions, or in your language.
263
 
264
  - "Hvad er KU's politik for brug af AI-værktøjer?"
265
  - "Hvilke regler gælder for brug af generativ AI i eksamen?"
src/agent/plan_and_execute.py CHANGED
@@ -26,7 +26,7 @@ from langgraph.graph import END, StateGraph
26
  from langgraph.prebuilt import create_react_agent
27
 
28
  from src.agent.memory import ConversationMemory
29
- from src.agent.tools import ToolResultStore, make_retrieval_tools
30
  from src.models import GenerationResponse, IntentType, PipelineDetails, QueryResult
31
  from src.retrieval.hybrid import HybridRetriever
32
  from src.retrieval.reranker import Reranker
@@ -145,6 +145,7 @@ class PlanAndExecuteRouter:
145
  vector_store: VectorStore,
146
  default_top_k: int = 5,
147
  memory: ConversationMemory | None = None,
 
148
  ) -> None:
149
  """Initialise the Plan-and-Execute router.
150
 
@@ -158,6 +159,9 @@ class PlanAndExecuteRouter:
158
  When provided, prior conversation history is injected into
159
  planner and synthesizer prompts, and each completed turn
160
  is automatically recorded.
 
 
 
161
  """
162
  self._llm = llm
163
  self._hybrid_retriever = hybrid_retriever
@@ -165,6 +169,24 @@ class PlanAndExecuteRouter:
165
  self._vector_store = vector_store
166
  self._default_top_k = default_top_k
167
  self._memory = memory or ConversationMemory()
168
 
169
  # ------------------------------------------------------------------
170
  # Node functions
@@ -217,6 +239,7 @@ class PlanAndExecuteRouter:
217
  store,
218
  self._default_top_k,
219
  llm_chain=self._llm,
 
220
  )
221
  sub_agent = create_react_agent(self._llm, tools)
222
 
 
26
  from langgraph.prebuilt import create_react_agent
27
 
28
  from src.agent.memory import ConversationMemory
29
+ from src.agent.tools import ToolResultStore, detect_document_languages, make_retrieval_tools
30
  from src.models import GenerationResponse, IntentType, PipelineDetails, QueryResult
31
  from src.retrieval.hybrid import HybridRetriever
32
  from src.retrieval.reranker import Reranker
 
145
  vector_store: VectorStore,
146
  default_top_k: int = 5,
147
  memory: ConversationMemory | None = None,
148
+ document_languages: list[str] | None = None,
149
  ) -> None:
150
  """Initialise the Plan-and-Execute router.
151
 
 
159
  When provided, prior conversation history is injected into
160
  planner and synthesizer prompts, and each completed turn
161
  is automatically recorded.
162
+ document_languages: Optional pre-detected list of corpus
163
+ languages. When omitted, the router lazily detects them
164
+ from the vector store on first use via the LLM.
165
  """
166
  self._llm = llm
167
  self._hybrid_retriever = hybrid_retriever
 
169
  self._vector_store = vector_store
170
  self._default_top_k = default_top_k
171
  self._memory = memory or ConversationMemory()
172
+ self._document_languages: list[str] | None = (
173
+ list(document_languages) if document_languages else None
174
+ )
175
+
176
+ def _ensure_document_languages(self) -> list[str]:
177
+ """Lazily detect and cache the document corpus languages via the LLM.
178
+
179
+ Returns:
180
+ List of detected language names (e.g. ``["Danish"]`` or
181
+ ``["Danish", "English"]``). Empty list when the corpus is empty
182
+ or no readable text could be sampled.
183
+ """
184
+ if self._document_languages is not None:
185
+ return self._document_languages
186
+ self._document_languages = detect_document_languages(self._vector_store, self._llm)
187
+ if self._document_languages:
188
+ logger.info("Detected document corpus languages: %s", self._document_languages)
189
+ return self._document_languages
190
 
191
  # ------------------------------------------------------------------
192
  # Node functions
 
239
  store,
240
  self._default_top_k,
241
  llm_chain=self._llm,
242
+ document_languages=self._ensure_document_languages(),
243
  )
244
  sub_agent = create_react_agent(self._llm, tools)
245
 
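The lazy detect-and-cache behaviour added to `PlanAndExecuteRouter` can be sketched in isolation. This is a minimal illustration, not the project's code: `LanguageCache` and `fake_detector` are hypothetical names, and the detector stands in for `detect_document_languages`. It shows two details visible in the diff: an empty or missing `document_languages` argument is stored as `None` (meaning "not detected yet"), and only `None` triggers detection, so a detected empty list is cached as well.

```python
from typing import Callable, List, Optional


class LanguageCache:
    """Hypothetical stand-in for the router's lazy language detection."""

    def __init__(
        self,
        detector: Callable[[], List[str]],
        document_languages: Optional[List[str]] = None,
    ) -> None:
        self._detector = detector
        # Same expression as in the diff: [] and None both become None.
        self._document_languages: Optional[List[str]] = (
            list(document_languages) if document_languages else None
        )

    def ensure(self) -> List[str]:
        # Detection runs at most once; later calls return the cached list.
        if self._document_languages is None:
            self._document_languages = self._detector()
        return self._document_languages


calls: List[int] = []


def fake_detector() -> List[str]:
    """Assumed detector that records how often it is invoked."""
    calls.append(1)
    return ["Danish"]


cache = LanguageCache(fake_detector)
first = cache.ensure()
second = cache.ensure()
preset = LanguageCache(fake_detector, ["Danish", "English"]).ensure()
```

Passing a pre-detected list, as `tests/test_router.py` does with `document_languages=["Danish"]`, skips the detector entirely, which keeps the number of LLM calls in a test predictable.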
src/agent/router.py CHANGED
@@ -20,6 +20,7 @@ from langgraph.graph import END, StateGraph
20
 
21
  from src.models import IntentType, GenerationResponse, PipelineDetails, QueryResult
22
  from src.agent.intent_classifier import IntentClassifier
 
23
  from src.retrieval.hybrid import HybridRetriever
24
  from src.retrieval.reranker import Reranker
25
 
@@ -138,6 +139,7 @@ class QueryRouter:
138
  llm_chain: Runnable,
139
  *,
140
  translate_query: bool = True,
 
141
  ) -> None:
142
  """Initialize the query router.
143
 
@@ -147,17 +149,42 @@ class QueryRouter:
147
  reranker: Reranker instance.
148
  llm_chain: LLM chain (llm | StrOutputParser) for generation,
149
  translation, and language detection.
150
- translate_query: Whether to translate non-Danish queries to Danish
151
- before retrieval. When False, language detection still runs for
152
- the answer-language rule but no translation is performed.
 
 
 
 
 
153
  """
154
  self._intent_classifier = intent_classifier
155
  self._hybrid_retriever = hybrid_retriever
156
  self._reranker = reranker
157
  self._llm_chain = llm_chain
158
  self._translate_query_enabled = translate_query
 
 
 
159
  self._graph = self._build_graph()
160
 
 
161
  def _detect_language_and_intent(self, query: str) -> tuple[str, IntentType]:
162
  """Detect the query language and classify intent in a single LLM call.
163
 
@@ -203,29 +230,49 @@ class QueryRouter:
203
  return detected, intent
204
 
205
  def _translate_query(self, query: str, detected_language: str) -> str:
206
- """Translate the query to Danish if needed.
 
 
 
 
 
 
207
 
208
  Args:
209
  query: The user's original query.
210
  detected_language: Detected language of the query.
211
 
212
  Returns:
213
- The Danish retrieval query, or the original if already Danish.
214
  """
215
- if detected_language.lower() in ("danish", "dansk"):
216
  return query
217
 
218
  if not self._translate_query_enabled:
219
  logger.info("Query translation disabled; using original query for retrieval")
220
  return query
221
 
 
222
  translate_prompt = (
223
- "Translate the following text to Danish. "
224
  "Reply with ONLY the translated text, nothing else.\n\n"
225
  f"Text: {query}"
226
  )
227
  translated = _extract_content(self._llm_chain.invoke(translate_prompt))
228
- logger.info("Translated query to Danish: %s", translated)
229
  return translated
230
 
231
  # ------------------------------------------------------------------
@@ -552,10 +599,21 @@ class QueryRouter:
552
 
553
  instruction = intent_instructions[intent]
554
 
555
  language_rule = (
556
  f"IMPORTANT: You MUST answer in {user_language}. "
557
  f"The user asked in {user_language}, so your entire response must be in {user_language}. "
558
- f"The context documents may be in Danish — use them as reference but always reply in {user_language}."
559
  )
560
 
561
  return (
 
20
 
21
  from src.models import IntentType, GenerationResponse, PipelineDetails, QueryResult
22
  from src.agent.intent_classifier import IntentClassifier
23
+ from src.agent.tools import detect_document_languages
24
  from src.retrieval.hybrid import HybridRetriever
25
  from src.retrieval.reranker import Reranker
26
 
 
139
  llm_chain: Runnable,
140
  *,
141
  translate_query: bool = True,
142
+ document_languages: list[str] | None = None,
143
  ) -> None:
144
  """Initialize the query router.
145
 
 
149
  reranker: Reranker instance.
150
  llm_chain: LLM chain (llm | StrOutputParser) for generation,
151
  translation, and language detection.
152
+ translate_query: Whether to translate the user query into a
153
+ corpus language before BM25 retrieval when the query
154
+ language does not already match one of the corpus languages.
155
+ When False, no translation is performed.
156
+ document_languages: Optional pre-detected list of corpus
157
+ languages. When omitted, the router lazily detects them
158
+ from the vector store on first translation/generation via
159
+ the LLM.
160
  """
161
  self._intent_classifier = intent_classifier
162
  self._hybrid_retriever = hybrid_retriever
163
  self._reranker = reranker
164
  self._llm_chain = llm_chain
165
  self._translate_query_enabled = translate_query
166
+ self._document_languages: list[str] | None = (
167
+ list(document_languages) if document_languages else None
168
+ )
169
  self._graph = self._build_graph()
170
 
171
+ def _ensure_document_languages(self) -> list[str]:
172
+ """Lazily detect and cache the document corpus languages via the LLM.
173
+
174
+ Returns:
175
+ List of detected language names (e.g. ``["Danish"]`` or
176
+ ``["Danish", "English"]``). Empty list when the corpus is empty
177
+ or no readable text could be sampled.
178
+ """
179
+ if self._document_languages is not None:
180
+ return self._document_languages
181
+ self._document_languages = detect_document_languages(
182
+ self._hybrid_retriever.vector_store, self._llm_chain
183
+ )
184
+ if self._document_languages:
185
+ logger.info("Detected document corpus languages: %s", self._document_languages)
186
+ return self._document_languages
187
+
188
  def _detect_language_and_intent(self, query: str) -> tuple[str, IntentType]:
189
  """Detect the query language and classify intent in a single LLM call.
190
 
 
230
  return detected, intent
231
 
232
  def _translate_query(self, query: str, detected_language: str) -> str:
233
+ """Translate the query into a corpus language when needed.
234
+
235
+ BM25 needs token-level matches against the corpus, so when the user's
236
+ query language is not present in the corpus we translate it to the
237
+ primary corpus language. When the corpus contains the user's
238
+ language already (single- or multi-language corpus), no translation
239
+ is performed — the original query is used as-is.
240
 
241
  Args:
242
  query: The user's original query.
243
  detected_language: Detected language of the query.
244
 
245
  Returns:
246
+ The retrieval query, translated when necessary.
247
  """
248
+ doc_langs = self._ensure_document_languages()
249
+
250
+ # Without a known corpus language we cannot pick a translation target.
251
+ if not doc_langs:
252
+ return query
253
+
254
+ user_lang = detected_language.lower().strip()
255
+ doc_lang_set = {lang.lower() for lang in doc_langs}
256
+ # Accept the Danish autonym so legacy "dansk" detection still matches.
257
+ if user_lang == "dansk":
258
+ user_lang = "danish"
259
+
260
+ # Query already in one of the corpus languages → BM25 will work as-is.
261
+ if user_lang in doc_lang_set:
262
  return query
263
 
264
  if not self._translate_query_enabled:
265
  logger.info("Query translation disabled; using original query for retrieval")
266
  return query
267
 
268
+ target = doc_langs[0]
269
  translate_prompt = (
270
+ f"Translate the following text to {target}. "
271
  "Reply with ONLY the translated text, nothing else.\n\n"
272
  f"Text: {query}"
273
  )
274
  translated = _extract_content(self._llm_chain.invoke(translate_prompt))
275
+ logger.info("Translated query to %s: %s", target, translated)
276
  return translated
277
 
278
  # ------------------------------------------------------------------
 
599
 
600
  instruction = intent_instructions[intent]
601
 
602
+ doc_langs = self._ensure_document_languages()
603
+ if doc_langs:
604
+ corpus_clause = (
605
+ f"The context documents may be in {' or '.join(doc_langs)} — "
606
+ f"use them as reference but always reply in {user_language}."
607
+ )
608
+ else:
609
+ corpus_clause = (
610
+ f"The context documents may be in a different language — "
611
+ f"use them as reference but always reply in {user_language}."
612
+ )
613
  language_rule = (
614
  f"IMPORTANT: You MUST answer in {user_language}. "
615
  f"The user asked in {user_language}, so your entire response must be in {user_language}. "
616
+ f"{corpus_clause}"
617
  )
618
 
619
  return (
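The decision logic in the new `_translate_query` can be read as a pure function. The sketch below is an illustration under assumptions: `pick_translation_target` is a hypothetical helper that does not exist in the repository, and it returns the target language (or `None` for "use the query as-is") instead of calling the LLM.

```python
from typing import List, Optional


def pick_translation_target(
    detected_language: str,
    doc_langs: List[str],
    translate_enabled: bool = True,
) -> Optional[str]:
    """Return the language to translate the query into, or None to keep it."""
    # No known corpus language: there is no sensible translation target.
    if not doc_langs:
        return None

    user_lang = detected_language.lower().strip()
    # Accept the Danish autonym, as the diff does for legacy detection output.
    if user_lang == "dansk":
        user_lang = "danish"

    # Query language already present in the corpus: BM25 can match tokens.
    if user_lang in {lang.lower() for lang in doc_langs}:
        return None

    # Translation switched off by the caller.
    if not translate_enabled:
        return None

    # Otherwise translate into the first (primary) corpus language.
    return doc_langs[0]
```

With a corpus of `["Danish", "English"]`, an English query is left untouched while a German query is translated to Danish, the first entry in the list. The ordering of `doc_langs` therefore determines the fallback target.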
src/agent/tools.py CHANGED
@@ -69,6 +69,81 @@ class ToolResultStore:
69
  fused_results: list[QueryResult] = field(default_factory=list)
70
 
71
 
72
  def _merge_results(existing: list[QueryResult], new: list[QueryResult]) -> list[QueryResult]:
73
  """Merge two QueryResult lists by chunk_id, keeping the highest score.
74
 
@@ -117,6 +192,7 @@ def make_retrieval_tools(
117
  store: ToolResultStore,
118
  default_top_k: int = 5,
119
  llm_chain: Runnable | None = None,
 
120
  ) -> list:
121
  """Create retrieval tools bound to the given components and result store.
122
 
@@ -133,10 +209,34 @@ def make_retrieval_tools(
133
  llm_chain: Optional LLM chain for tools that need generation
134
  (summarize_document, multi_query_search). When None, those
135
  tools are excluded from the returned list.
 
 
 
 
 
136
 
137
  Returns:
138
  List of LangChain tool callables ready for bind_tools / ToolNode.
139
  """
140
 
141
  # ------------------------------------------------------------------
142
  # Core search tool
@@ -317,8 +417,7 @@ def make_retrieval_tools(
317
  decompose_prompt = (
318
  "You are a search query planner. Given a complex question, "
319
  "decompose it into 2-4 simple, independent search queries that "
320
- "together cover all aspects of the question. The queries should "
321
- "be in Danish (since the document base is Danish).\n\n"
322
  "Reply with ONLY the queries, one per line, nothing else.\n\n"
323
  f"Question: {question}"
324
  )
 
69
  fused_results: list[QueryResult] = field(default_factory=list)
70
 
71
 
72
+ def detect_document_languages(
73
+ vector_store: VectorStore,
74
+ llm: Runnable,
75
+ *,
76
+ max_documents: int = 5,
77
+ chunks_per_document: int = 2,
78
+ sample_chars: int = 2000,
79
+ ) -> list[str]:
80
+ """Detect all languages present in the document corpus via the LLM.
81
+
82
+ Samples chunks from up to ``max_documents`` distinct documents and asks the
83
+ LLM in a single call to identify every language present. Used by routers
84
+ so that intermediate retrieval queries can be phrased in the corpus
85
+ language(s) without hardcoding any specific language.
86
+
87
+ Args:
88
+ vector_store: VectorStore to sample chunks from.
89
+ llm: LLM runnable used for the single detection call.
90
+ max_documents: Maximum number of documents to sample from.
91
+ chunks_per_document: Chunks taken from each sampled document.
92
+ sample_chars: Cap on total sample text length sent to the LLM.
93
+
94
+ Returns:
95
+ List of detected language names in English (e.g. ``["Danish"]`` or
96
+ ``["Danish", "English"]``), preserving the order returned by the LLM.
97
+ Returns an empty list when the corpus is empty or no readable text
98
+ could be sampled (e.g. when the vector store is mocked in tests).
99
+ """
100
+ try:
101
+ ids = vector_store.list_document_ids()
102
+ except Exception:
103
+ return []
104
+ if not isinstance(ids, list) or not ids:
105
+ return []
106
+
107
+ samples: list[str] = []
108
+ for doc_id in ids[:max_documents]:
109
+ try:
110
+ chunks = vector_store.get_chunks_by_document_id(doc_id)
111
+ except Exception:
112
+ continue
113
+ if not isinstance(chunks, list):
114
+ continue
115
+ for c in chunks[:chunks_per_document]:
116
+ text = (getattr(c, "text", "") or "").strip()
117
+ if text:
118
+ samples.append(text)
119
+
120
+ sample_text = "\n---\n".join(samples)[:sample_chars].strip()
121
+ if not sample_text:
122
+ return []
123
+
124
+ prompt = (
125
+ "You are a language detector. The text samples below come from "
126
+ "different documents in a knowledge base. Identify ALL distinct "
127
+ "languages present across the samples (do not list a language more "
128
+ "than once). Reply with ONLY the language names in English, one per "
129
+ "line, no explanation.\n\n"
130
+ f"Samples:\n{sample_text}"
131
+ )
132
+ raw = _extract_content(llm.invoke(prompt))
133
+
134
+ seen: set[str] = set()
135
+ detected: list[str] = []
136
+ for line in raw.strip().splitlines():
137
+ name = line.strip().lstrip("-•*0123456789.) ").rstrip(".").strip()
138
+ if not name:
139
+ continue
140
+ name = name.capitalize()
141
+ if name.lower() not in seen:
142
+ seen.add(name.lower())
143
+ detected.append(name)
144
+ return detected
145
+
146
+
147
  def _merge_results(existing: list[QueryResult], new: list[QueryResult]) -> list[QueryResult]:
148
  """Merge two QueryResult lists by chunk_id, keeping the highest score.
149
 
 
192
  store: ToolResultStore,
193
  default_top_k: int = 5,
194
  llm_chain: Runnable | None = None,
195
+ document_languages: list[str] | None = None,
196
  ) -> list:
197
  """Create retrieval tools bound to the given components and result store.
198
 
 
209
  llm_chain: Optional LLM chain for tools that need generation
210
  (summarize_document, multi_query_search). When None, those
211
  tools are excluded from the returned list.
212
+ document_languages: Detected languages of the document corpus
213
+ (e.g. ``["Danish"]`` or ``["Danish", "English"]``). Used by
214
+ multi_query_search to phrase sub-queries in the corpus
215
+ language(s) for best BM25 recall. When None or empty, the
216
+ sub-query language is left unconstrained.
217
 
218
  Returns:
219
  List of LangChain tool callables ready for bind_tools / ToolNode.
220
  """
221
+ if document_languages:
222
+ if len(document_languages) == 1:
223
+ _lang_clause = (
224
+ f"The queries should be in {document_languages[0]} "
225
+ f"(the document base is {document_languages[0]})."
226
+ )
227
+ else:
228
+ _lang_list = ", ".join(document_languages)
229
+ _lang_clause = (
230
+ f"The document base contains multiple languages: {_lang_list}. "
231
+ f"For each sub-query, write it in whichever of these languages "
232
+ f"best matches the topic; mix languages across sub-queries if "
233
+ f"the topic is likely covered by documents in different languages."
234
+ )
235
+ else:
236
+ _lang_clause = (
237
+ "Write each sub-query in the language most likely used by the "
238
+ "underlying documents."
239
+ )
240
 
241
  # ------------------------------------------------------------------
242
  # Core search tool
 
417
  decompose_prompt = (
418
  "You are a search query planner. Given a complex question, "
419
  "decompose it into 2-4 simple, independent search queries that "
420
+ f"together cover all aspects of the question. {_lang_clause}\n\n"
 
421
  "Reply with ONLY the queries, one per line, nothing else.\n\n"
422
  f"Question: {question}"
423
  )
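The reply-parsing step at the end of `detect_document_languages` is easy to exercise without an LLM. In the sketch below, `parse_language_reply` is a hypothetical wrapper name. The loop body follows the diff: strip list markers and numbering, drop empty lines, normalise capitalisation, and de-duplicate case-insensitively while preserving order.

```python
from typing import List, Set


def parse_language_reply(raw: str) -> List[str]:
    """Turn a one-language-per-line LLM reply into a de-duplicated list."""
    seen: Set[str] = set()
    detected: List[str] = []
    for line in raw.strip().splitlines():
        # Remove leading bullets/numbering and a trailing full stop.
        name = line.strip().lstrip("-•*0123456789.) ").rstrip(".").strip()
        if not name:
            continue
        # str.capitalize upper-cases the first character and lower-cases the rest.
        name = name.capitalize()
        if name.lower() not in seen:
            seen.add(name.lower())
            detected.append(name)
    return detected


reply = "1. Danish\n- english\n\nDANISH.\n"
languages = parse_language_reply(reply)  # ["Danish", "English"]
```

One consequence of using `str.capitalize` is that multi-word names are lower-cased after the first letter, so `"Norwegian Bokmål"` becomes `"Norwegian bokmål"`. This is harmless for the lower-cased comparison in `_translate_query`, but the name is inserted into prompts in that form.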
src/retrieval/hybrid.py CHANGED
@@ -52,6 +52,11 @@ class HybridRetriever:
52
  self._dense_weight = dense_weight
53
  self._bm25_weight = bm25_weight
54
 
 
 
 
 
 
55
  def search(self, query: str, top_k: int) -> list[QueryResult]:
56
  """Execute hybrid search combining dense and sparse results.
57
 
 
52
  self._dense_weight = dense_weight
53
  self._bm25_weight = bm25_weight
54
 
55
+ @property
56
+ def vector_store(self) -> VectorStore:
57
+ """Underlying vector store, exposed for callers that need corpus-level access."""
58
+ return self._vector_store
59
+
60
  def search(self, query: str, top_k: int) -> list[QueryResult]:
61
  """Execute hybrid search combining dense and sparse results.
62
 
tests/test_router.py CHANGED
@@ -244,7 +244,7 @@ class TestQueryTranslation:
244
  retriever.search_detailed.assert_called_once_with("Hvad er reglerne?", top_k=3)
245
 
246
  def test_english_query_translated_for_retrieval(self, mock_components) -> None:
247
- """English queries should be translated to Danish for retrieval."""
248
  classifier, retriever, reranker, llm_chain = mock_components
249
 
250
  results = [_make_query_result("ctx", 0.5)]
@@ -252,7 +252,10 @@ class TestQueryTranslation:
252
  reranker.rerank.return_value = results
253
  _setup_llm_chain_english(llm_chain, "Hvad er reglerne?", "The rules are...", intent="rag")
254
 
255
- router = QueryRouter(classifier, retriever, reranker, llm_chain, translate_query=True)
 
 
 
256
  response = router.route("What are the rules?", top_k=3)
257
 
258
  # 3 invoke calls: combined detection + translation + generation
 
244
  retriever.search_detailed.assert_called_once_with("Hvad er reglerne?", top_k=3)
245
 
246
  def test_english_query_translated_for_retrieval(self, mock_components) -> None:
247
+ """English queries should be translated into the corpus language for retrieval."""
248
  classifier, retriever, reranker, llm_chain = mock_components
249
 
250
  results = [_make_query_result("ctx", 0.5)]
 
252
  reranker.rerank.return_value = results
253
  _setup_llm_chain_english(llm_chain, "Hvad er reglerne?", "The rules are...", intent="rag")
254
 
255
+ router = QueryRouter(
256
+ classifier, retriever, reranker, llm_chain,
257
+ translate_query=True, document_languages=["Danish"],
258
+ )
259
  response = router.route("What are the rules?", top_k=3)
260
 
261
  # 3 invoke calls: combined detection + translation + generation