Update language and prompt
Commit 05c89bc · Parent(s): 3a623df · Committed by XQ
- .github/README.md +4 -4
- README.md +4 -4
- src/agent/plan_and_execute.py +24 -1
- src/agent/router.py +67 -9
- src/agent/tools.py +101 -2
- src/retrieval/hybrid.py +5 -0
- tests/test_router.py +5 -2
.github/README.md
CHANGED
@@ -7,7 +7,7 @@ Hosted on Hugging Face Spaces: [xq-dokumentassistent.hf.space](https://xq-dokume
 
 ## Dansk
 
-En produktionsklar RAG-applikation, der gør det muligt at stille spørgsmål til dokumenter på
+En produktionsklar RAG-applikation, der gør det muligt at stille spørgsmål til dokumenter på et hvilket som helst sprog og få svar med kildehenvisninger. Systemet er bygget på open source-komponenter (LangChain, LangGraph, Qdrant, Ollama) og kan køre helt lokalt uden eksterne API-kald. Det implementerer hybrid søgning med reranking, en Plan-and-Execute agent med samtalehukommelse, og RAGAS-baseret evaluering af svarkvaliteten.
 
 ### Funktioner
 
@@ -79,7 +79,7 @@ Se `.env.example` for konfiguration pr. provider.
 
 Demoen ligger på [xq-dokumentassistent.hf.space](https://xq-dokumentassistent.hf.space).
 
-Prøv for eksempel disse spørgsmål på
+Prøv for eksempel disse spørgsmål på et hvilket som helst sprog.
 
 - "Hvad er KU's politik for brug af AI-værktøjer?"
 - "Hvilke regler gælder for brug af generativ AI i eksamen?"
@@ -177,7 +177,7 @@ docs/ # eksempel-PDF'er eller tekster (KU AI-dokumenter)
 
 ## English
 
-A production-ready RAG application that lets users ask questions about documents in
+A production-ready RAG application that lets users ask questions about documents in any language and receive answers with source citations. The system is built on open source components (LangChain, LangGraph, Qdrant, Ollama) and can run fully local without any external API calls. It implements hybrid search with reranking, a Plan-and-Execute agent with conversation memory, and RAGAS-based evaluation of answer quality.
 
 ### Capabilities
 
@@ -249,7 +249,7 @@ See `.env.example` for per-provider configuration.
 
 The demo lives at [xq-dokumentassistent.hf.space](https://xq-dokumentassistent.hf.space).
 
-Try asking these questions in
+Try asking these questions, or in your language.
 
 - "Hvad er KU's politik for brug af AI-værktøjer?"
 - "Hvilke regler gælder for brug af generativ AI i eksamen?"
README.md
CHANGED
@@ -17,7 +17,7 @@ Hosted on Hugging Face Spaces: [xq-dokumentassistent.hf.space](https://xq-dokume
 
 ## Dansk
 
-En produktionsklar RAG-applikation, der gør det muligt at stille spørgsmål til dokumenter på
+En produktionsklar RAG-applikation, der gør det muligt at stille spørgsmål til dokumenter på et hvilket som helst sprog og få svar med kildehenvisninger. Systemet er bygget på open source-komponenter (LangChain, LangGraph, Qdrant, Ollama) og kan køre helt lokalt uden eksterne API-kald. Det implementerer hybrid søgning med reranking, en Plan-and-Execute agent med samtalehukommelse, og RAGAS-baseret evaluering af svarkvaliteten.
 
 ### Funktioner
 
@@ -89,7 +89,7 @@ Se `.env.example` for konfiguration pr. provider.
 
 Demoen ligger på [xq-dokumentassistent.hf.space](https://xq-dokumentassistent.hf.space).
 
-Prøv for eksempel disse spørgsmål på
+Prøv for eksempel disse spørgsmål på et hvilket som helst sprog.
 
 - "Hvad er KU's politik for brug af AI-værktøjer?"
 - "Hvilke regler gælder for brug af generativ AI i eksamen?"
@@ -187,7 +187,7 @@ docs/ # eksempel-PDF'er eller tekster (KU AI-dokumenter)
 
 ## English
 
-A production-ready RAG application that lets users ask questions about documents in
+A production-ready RAG application that lets users ask questions about Danish documents in any language and receive answers with source citations. The system is built on open source components (LangChain, LangGraph, Qdrant, Ollama) and can run fully local without any external API calls. It implements hybrid search with reranking, a Plan-and-Execute agent with conversation memory, and RAGAS-based evaluation of answer quality.
 
 ### Capabilities
 
@@ -259,7 +259,7 @@ See `.env.example` for per-provider configuration.
 
 The demo lives at [xq-dokumentassistent.hf.space](https://xq-dokumentassistent.hf.space).
 
-Try asking these questions in
+Try asking these questions, or in your language.
 
 - "Hvad er KU's politik for brug af AI-værktøjer?"
 - "Hvilke regler gælder for brug af generativ AI i eksamen?"
src/agent/plan_and_execute.py
CHANGED
@@ -26,7 +26,7 @@ from langgraph.graph import END, StateGraph
 from langgraph.prebuilt import create_react_agent
 
 from src.agent.memory import ConversationMemory
-from src.agent.tools import ToolResultStore, make_retrieval_tools
+from src.agent.tools import ToolResultStore, detect_document_languages, make_retrieval_tools
 from src.models import GenerationResponse, IntentType, PipelineDetails, QueryResult
 from src.retrieval.hybrid import HybridRetriever
 from src.retrieval.reranker import Reranker
@@ -145,6 +145,7 @@ class PlanAndExecuteRouter:
         vector_store: VectorStore,
         default_top_k: int = 5,
         memory: ConversationMemory | None = None,
+        document_languages: list[str] | None = None,
     ) -> None:
         """Initialise the Plan-and-Execute router.
 
@@ -158,6 +159,9 @@ class PlanAndExecuteRouter:
                 When provided, prior conversation history is injected into
                 planner and synthesizer prompts, and each completed turn
                 is automatically recorded.
+            document_languages: Optional pre-detected list of corpus
+                languages. When omitted, the router lazily detects them
+                from the vector store on first use via the LLM.
         """
         self._llm = llm
         self._hybrid_retriever = hybrid_retriever
@@ -165,6 +169,24 @@ class PlanAndExecuteRouter:
         self._vector_store = vector_store
         self._default_top_k = default_top_k
         self._memory = memory or ConversationMemory()
+        self._document_languages: list[str] | None = (
+            list(document_languages) if document_languages else None
+        )
+
+    def _ensure_document_languages(self) -> list[str]:
+        """Lazily detect and cache the document corpus languages via the LLM.
+
+        Returns:
+            List of detected language names (e.g. ``["Danish"]`` or
+            ``["Danish", "English"]``). Empty list when the corpus is empty
+            or no readable text could be sampled.
+        """
+        if self._document_languages is not None:
+            return self._document_languages
+        self._document_languages = detect_document_languages(self._vector_store, self._llm)
+        if self._document_languages:
+            logger.info("Detected document corpus languages: %s", self._document_languages)
+        return self._document_languages
 
     # ------------------------------------------------------------------
     # Node functions
@@ -217,6 +239,7 @@ class PlanAndExecuteRouter:
             store,
             self._default_top_k,
             llm_chain=self._llm,
+            document_languages=self._ensure_document_languages(),
         )
         sub_agent = create_react_agent(self._llm, tools)
 
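The lazy detection added to `PlanAndExecuteRouter` can be looked at in isolation. The following is a minimal sketch of the caching pattern only, not the project's code: `LazyLanguages`, `fake_detector`, and the `calls` list are hypothetical stand-ins for the router, `detect_document_languages`, and an LLM call counter.

```python
from typing import Callable, Optional


class LazyLanguages:
    """Caches corpus languages after the first detection (hypothetical helper)."""

    def __init__(
        self,
        detector: Callable[[], list],
        document_languages: Optional[list] = None,
    ) -> None:
        self._detector = detector
        # As in the diff: a missing or empty list means "not detected yet".
        self._document_languages = (
            list(document_languages) if document_languages else None
        )

    def ensure(self) -> list:
        # Only call the detector while nothing has been cached.
        if self._document_languages is not None:
            return self._document_languages
        self._document_languages = self._detector()
        return self._document_languages


calls = []  # records every detector invocation


def fake_detector() -> list:
    """Stand-in for the single LLM detection call."""
    calls.append(1)
    return ["Danish"]


lazy = LazyLanguages(fake_detector)
first = lazy.ensure()    # triggers detection
second = lazy.ensure()   # served from the cache
preset = LazyLanguages(fake_detector, ["English"]).ensure()  # never detects
```

Because the cache check is `is not None`, an empty detection result is cached as well, so an empty corpus does not trigger a new detection call on every request. An empty list passed to the constructor, by contrast, is treated the same as `None`.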
src/agent/router.py
CHANGED
@@ -20,6 +20,7 @@ from langgraph.graph import END, StateGraph
 
 from src.models import IntentType, GenerationResponse, PipelineDetails, QueryResult
 from src.agent.intent_classifier import IntentClassifier
+from src.agent.tools import detect_document_languages
 from src.retrieval.hybrid import HybridRetriever
 from src.retrieval.reranker import Reranker
 
@@ -138,6 +139,7 @@ class QueryRouter:
         llm_chain: Runnable,
         *,
         translate_query: bool = True,
+        document_languages: list[str] | None = None,
     ) -> None:
         """Initialize the query router.
 
@@ -147,17 +149,42 @@ class QueryRouter:
             reranker: Reranker instance.
             llm_chain: LLM chain (llm | StrOutputParser) for generation,
                 translation, and language detection.
-            translate_query: Whether to translate
-
-
+            translate_query: Whether to translate the user query into a
+                corpus language before BM25 retrieval when the query
+                language does not already match one of the corpus languages.
+                When False, no translation is performed.
+            document_languages: Optional pre-detected list of corpus
+                languages. When omitted, the router lazily detects them
+                from the vector store on first translation/generation via
+                the LLM.
         """
         self._intent_classifier = intent_classifier
         self._hybrid_retriever = hybrid_retriever
         self._reranker = reranker
         self._llm_chain = llm_chain
         self._translate_query_enabled = translate_query
+        self._document_languages: list[str] | None = (
+            list(document_languages) if document_languages else None
+        )
         self._graph = self._build_graph()
 
+    def _ensure_document_languages(self) -> list[str]:
+        """Lazily detect and cache the document corpus languages via the LLM.
+
+        Returns:
+            List of detected language names (e.g. ``["Danish"]`` or
+            ``["Danish", "English"]``). Empty list when the corpus is empty
+            or no readable text could be sampled.
+        """
+        if self._document_languages is not None:
+            return self._document_languages
+        self._document_languages = detect_document_languages(
+            self._hybrid_retriever.vector_store, self._llm_chain
+        )
+        if self._document_languages:
+            logger.info("Detected document corpus languages: %s", self._document_languages)
+        return self._document_languages
+
     def _detect_language_and_intent(self, query: str) -> tuple[str, IntentType]:
         """Detect the query language and classify intent in a single LLM call.
 
@@ -203,29 +230,49 @@ class QueryRouter:
         return detected, intent
 
     def _translate_query(self, query: str, detected_language: str) -> str:
-        """Translate the query
+        """Translate the query into a corpus language when needed.
+
+        BM25 needs token-level matches against the corpus, so when the user's
+        query language is not present in the corpus we translate it to the
+        primary corpus language. When the corpus contains the user's
+        language already (single- or multi-language corpus), no translation
+        is performed — the original query is used as-is.
 
         Args:
             query: The user's original query.
             detected_language: Detected language of the query.
 
         Returns:
-            The
+            The retrieval query, translated when necessary.
         """
-
+        doc_langs = self._ensure_document_languages()
+
+        # Without a known corpus language we cannot pick a translation target.
+        if not doc_langs:
+            return query
+
+        user_lang = detected_language.lower().strip()
+        doc_lang_set = {lang.lower() for lang in doc_langs}
+        # Accept the Danish autonym so legacy "dansk" detection still matches.
+        if user_lang == "dansk":
+            user_lang = "danish"
+
+        # Query already in one of the corpus languages → BM25 will work as-is.
+        if user_lang in doc_lang_set:
             return query
 
         if not self._translate_query_enabled:
             logger.info("Query translation disabled; using original query for retrieval")
             return query
 
+        target = doc_langs[0]
         translate_prompt = (
-            "Translate the following text to
+            f"Translate the following text to {target}. "
             "Reply with ONLY the translated text, nothing else.\n\n"
             f"Text: {query}"
         )
         translated = _extract_content(self._llm_chain.invoke(translate_prompt))
-        logger.info("Translated query to
+        logger.info("Translated query to %s: %s", target, translated)
         return translated
 
     # ------------------------------------------------------------------
@@ -552,10 +599,21 @@ class QueryRouter:
 
         instruction = intent_instructions[intent]
 
+        doc_langs = self._ensure_document_languages()
+        if doc_langs:
+            corpus_clause = (
+                f"The context documents may be in {' or '.join(doc_langs)} — "
+                f"use them as reference but always reply in {user_language}."
+            )
+        else:
+            corpus_clause = (
+                f"The context documents may be in a different language — "
+                f"use them as reference but always reply in {user_language}."
+            )
         language_rule = (
             f"IMPORTANT: You MUST answer in {user_language}. "
             f"The user asked in {user_language}, so your entire response must be in {user_language}. "
-            f"
+            f"{corpus_clause}"
         )
 
         return (
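The branching in the new `_translate_query` reduces to one question: which language, if any, should the query be translated into? Below is a minimal sketch of that decision as a pure function, with the LLM call left out. The function name `pick_translation_target` and its `translate_enabled` parameter are hypothetical and do not exist in the repository.

```python
from typing import Optional


def pick_translation_target(
    detected_language: str,
    doc_langs: list,
    translate_enabled: bool = True,
) -> Optional[str]:
    """Return the language to translate into, or None to keep the query as-is."""
    # No known corpus language: there is nothing to translate towards.
    if not doc_langs:
        return None

    user_lang = detected_language.lower().strip()
    # Normalise the Danish autonym, as the diff does for legacy detection.
    if user_lang == "dansk":
        user_lang = "danish"

    # The query already matches a corpus language, so BM25 can use it directly.
    if user_lang in {lang.lower() for lang in doc_langs}:
        return None

    if not translate_enabled:
        return None

    # The first detected language is treated as the primary corpus language.
    return doc_langs[0]


english_on_danish = pick_translation_target("English", ["Danish"])
autonym = pick_translation_target("dansk", ["Danish"])
mixed_corpus = pick_translation_target(" English ", ["Danish", "English"])
```

Since the target is always `doc_langs[0]`, the order in which the LLM lists the detected languages decides which language a foreign-language query is translated into for a multi-language corpus.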
src/agent/tools.py
CHANGED
@@ -69,6 +69,81 @@ class ToolResultStore:
     fused_results: list[QueryResult] = field(default_factory=list)
 
 
+def detect_document_languages(
+    vector_store: VectorStore,
+    llm: Runnable,
+    *,
+    max_documents: int = 5,
+    chunks_per_document: int = 2,
+    sample_chars: int = 2000,
+) -> list[str]:
+    """Detect all languages present in the document corpus via the LLM.
+
+    Samples chunks from up to ``max_documents`` distinct documents and asks the
+    LLM in a single call to identify every language present. Used by routers
+    so that intermediate retrieval queries can be phrased in the corpus
+    language(s) without hardcoding any specific language.
+
+    Args:
+        vector_store: VectorStore to sample chunks from.
+        llm: LLM runnable used for the single detection call.
+        max_documents: Maximum number of documents to sample from.
+        chunks_per_document: Chunks taken from each sampled document.
+        sample_chars: Cap on total sample text length sent to the LLM.
+
+    Returns:
+        List of detected language names in English (e.g. ``["Danish"]`` or
+        ``["Danish", "English"]``), preserving the order returned by the LLM.
+        Returns an empty list when the corpus is empty or no readable text
+        could be sampled (e.g. when the vector store is mocked in tests).
+    """
+    try:
+        ids = vector_store.list_document_ids()
+    except Exception:
+        return []
+    if not isinstance(ids, list) or not ids:
+        return []
+
+    samples: list[str] = []
+    for doc_id in ids[:max_documents]:
+        try:
+            chunks = vector_store.get_chunks_by_document_id(doc_id)
+        except Exception:
+            continue
+        if not isinstance(chunks, list):
+            continue
+        for c in chunks[:chunks_per_document]:
+            text = (getattr(c, "text", "") or "").strip()
+            if text:
+                samples.append(text)
+
+    sample_text = "\n---\n".join(samples)[:sample_chars].strip()
+    if not sample_text:
+        return []
+
+    prompt = (
+        "You are a language detector. The text samples below come from "
+        "different documents in a knowledge base. Identify ALL distinct "
+        "languages present across the samples (do not list a language more "
+        "than once). Reply with ONLY the language names in English, one per "
+        "line, no explanation.\n\n"
+        f"Samples:\n{sample_text}"
+    )
+    raw = _extract_content(llm.invoke(prompt))
+
+    seen: set[str] = set()
+    detected: list[str] = []
+    for line in raw.strip().splitlines():
+        name = line.strip().lstrip("-•*0123456789.) ").rstrip(".").strip()
+        if not name:
+            continue
+        name = name.capitalize()
+        if name.lower() not in seen:
+            seen.add(name.lower())
+            detected.append(name)
+    return detected
+
+
 def _merge_results(existing: list[QueryResult], new: list[QueryResult]) -> list[QueryResult]:
     """Merge two QueryResult lists by chunk_id, keeping the highest score.
 
@@ -117,6 +192,7 @@ def make_retrieval_tools(
     store: ToolResultStore,
     default_top_k: int = 5,
     llm_chain: Runnable | None = None,
+    document_languages: list[str] | None = None,
 ) -> list:
     """Create retrieval tools bound to the given components and result store.
 
@@ -133,10 +209,34 @@ def make_retrieval_tools(
         llm_chain: Optional LLM chain for tools that need generation
             (summarize_document, multi_query_search). When None, those
             tools are excluded from the returned list.
+        document_languages: Detected languages of the document corpus
+            (e.g. ``["Danish"]`` or ``["Danish", "English"]``). Used by
+            multi_query_search to phrase sub-queries in the corpus
+            language(s) for best BM25 recall. When None or empty, the
+            sub-query language is left unconstrained.
 
     Returns:
         List of LangChain tool callables ready for bind_tools / ToolNode.
     """
+    if document_languages:
+        if len(document_languages) == 1:
+            _lang_clause = (
+                f"The queries should be in {document_languages[0]} "
+                f"(the document base is {document_languages[0]})."
+            )
+        else:
+            _lang_list = ", ".join(document_languages)
+            _lang_clause = (
+                f"The document base contains multiple languages: {_lang_list}. "
+                f"For each sub-query, write it in whichever of these languages "
+                f"best matches the topic; mix languages across sub-queries if "
+                f"the topic is likely covered by documents in different languages."
+            )
+    else:
+        _lang_clause = (
+            "Write each sub-query in the language most likely used by the "
+            "underlying documents."
+        )
 
     # ------------------------------------------------------------------
     # Core search tool
@@ -317,8 +417,7 @@ def make_retrieval_tools(
         decompose_prompt = (
             "You are a search query planner. Given a complex question, "
             "decompose it into 2-4 simple, independent search queries that "
-            "together cover all aspects of the question.
-            "be in Danish (since the document base is Danish).\n\n"
+            f"together cover all aspects of the question. {_lang_clause}\n\n"
             "Reply with ONLY the queries, one per line, nothing else.\n\n"
             f"Question: {question}"
         )
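The last loop of `detect_document_languages` turns the raw LLM reply into a clean, de-duplicated list. The sketch below lifts that loop into a standalone function so its behaviour on messy replies is easier to see. The name `parse_language_lines` is hypothetical; the string handling is copied from the diff.

```python
def parse_language_lines(raw: str) -> list:
    """Parse one-language-per-line LLM output into a de-duplicated list."""
    seen = set()
    detected = []
    for line in raw.strip().splitlines():
        # Drop list markers and numbering ("1.", "-", "•"), then a trailing period.
        name = line.strip().lstrip("-•*0123456789.) ").rstrip(".").strip()
        if not name:
            continue
        # capitalize() upper-cases the first letter and lower-cases the rest.
        name = name.capitalize()
        if name.lower() not in seen:
            seen.add(name.lower())
            detected.append(name)
    return detected


reply = "1. Danish\n- english\n• DANISH.\n\n"
languages = parse_language_lines(reply)
two_word = parse_language_lines("Norwegian Bokmål")
```

Note that `str.capitalize()` lower-cases everything after the first character, so a multi-word name such as "Norwegian Bokmål" comes back as "Norwegian bokmål". This does not affect the comparison in `_translate_query`, which lower-cases both sides, but it does change how the name appears in prompts and logs.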
src/retrieval/hybrid.py
CHANGED
@@ -52,6 +52,11 @@ class HybridRetriever:
         self._dense_weight = dense_weight
         self._bm25_weight = bm25_weight
 
+    @property
+    def vector_store(self) -> VectorStore:
+        """Underlying vector store, exposed for callers that need corpus-level access."""
+        return self._vector_store
+
     def search(self, query: str, top_k: int) -> list[QueryResult]:
         """Execute hybrid search combining dense and sparse results.
 
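The new `vector_store` property gives `QueryRouter` read access to the store without handing out the private attribute. As a small illustration of what a getter-only property allows and forbids, here is a sketch with a hypothetical `Retriever` class standing in for `HybridRetriever`:

```python
class Retriever:
    """Hypothetical stand-in for HybridRetriever with a read-only property."""

    def __init__(self, vector_store: object) -> None:
        self._vector_store = vector_store

    @property
    def vector_store(self) -> object:
        """Expose the underlying store for corpus-level access."""
        return self._vector_store


store = object()
retriever = Retriever(store)

try:
    # No setter is defined, so assignment through the property fails.
    retriever.vector_store = object()
    reassigned = True
except AttributeError:
    reassigned = False
```

Callers can read the store but cannot swap it through the public name, which keeps the retriever and any router built on top of it pointing at the same corpus.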
tests/test_router.py
CHANGED
|
@@ -244,7 +244,7 @@ class TestQueryTranslation:
|
|
| 244 |
retriever.search_detailed.assert_called_once_with("Hvad er reglerne?", top_k=3)
|
| 245 |
|
| 246 |
def test_english_query_translated_for_retrieval(self, mock_components) -> None:
|
| 247 |
-
"""English queries should be translated
|
| 248 |
classifier, retriever, reranker, llm_chain = mock_components
|
| 249 |
|
| 250 |
results = [_make_query_result("ctx", 0.5)]
|
|
@@ -252,7 +252,10 @@ class TestQueryTranslation:
|
|
| 252 |
reranker.rerank.return_value = results
|
| 253 |
_setup_llm_chain_english(llm_chain, "Hvad er reglerne?", "The rules are...", intent="rag")
|
| 254 |
|
| 255 |
-
router = QueryRouter(
|
|
|
|
|
|
|
|
|
|
| 256 |
response = router.route("What are the rules?", top_k=3)
|
| 257 |
|
| 258 |
# 3 invoke calls: combined detection + translation + generation
|
|
|
|
| 244 |
retriever.search_detailed.assert_called_once_with("Hvad er reglerne?", top_k=3)
|
| 245 |
|
| 246 |
def test_english_query_translated_for_retrieval(self, mock_components) -> None:
|
| 247 |
+
"""English queries should be translated into the corpus language for retrieval."""
|
| 248 |
classifier, retriever, reranker, llm_chain = mock_components
|
| 249 |
|
| 250 |
results = [_make_query_result("ctx", 0.5)]
|
|
|
|
| 252 |
reranker.rerank.return_value = results
|
| 253 |
_setup_llm_chain_english(llm_chain, "Hvad er reglerne?", "The rules are...", intent="rag")
|
| 254 |
|
| 255 |
+
router = QueryRouter(
|
| 256 |
+
classifier, retriever, reranker, llm_chain,
|
| 257 |
+
translate_query=True, document_languages=["Danish"],
|
| 258 |
+
)
|
| 259 |
response = router.route("What are the rules?", top_k=3)
|
| 260 |
|
| 261 |
# 3 invoke calls: combined detection + translation + generation
|
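One plausible reason the updated test passes `document_languages=["Danish"]` explicitly: with a mocked retriever, the guard at the top of `detect_document_languages` receives a mock instead of a list and returns no languages, which would leave `_translate_query` without a translation target. The sketch below isolates that guard under this assumption. `sample_ids` is a hypothetical helper, not a function in the repository.

```python
from unittest.mock import MagicMock


def sample_ids(vector_store) -> list:
    """Hypothetical extract of the guard that opens detect_document_languages."""
    try:
        ids = vector_store.list_document_ids()
    except Exception:
        return []
    # A bare MagicMock returns another MagicMock here, which is not a list.
    if not isinstance(ids, list) or not ids:
        return []
    return ids


mocked_store = MagicMock()

configured_store = MagicMock()
configured_store.list_document_ids.return_value = ["doc-1", "doc-2"]

broken_store = MagicMock()
broken_store.list_document_ids.side_effect = RuntimeError("backend down")

from_mock = sample_ids(mocked_store)
from_configured = sample_ids(configured_store)
from_broken = sample_ids(broken_store)
```

Seeding the router with the corpus languages keeps the test at the three `invoke` calls named in its comment and avoids depending on how the mocked vector store behaves.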