Spaces: scvcoder/kpaa


scvcoder committed on May 5

Commit 8863f87 (verified) · 1 Parent(s): ca27852

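Retriever: combine the keyword query and the original question with RRF. Even if the LLM drops a core topic term (e.g. '처방전 보관기간' ("prescription retention period") → ['보관기간'] ("retention period")), the original question acts as a safety net. Applied identically to cases and guides.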


src/kpaa/retrieval/retriever.py +35 -5


@@ -211,9 +211,23 @@ async def _fetch_cases(
         if on_progress:
             await on_progress("fetch_done", {"source": "case", "count": 0, "keyword": ""})
         return []
-    # Join the 1-3 matched keywords with spaces and pass them on → search ORs them internally
-    query = " ".join((plan.search_keywords or [plan.query])[:3])
-    hits = idx.search(query, k=k)
+    # Search with two queries (keywords + original question) and fuse them with RRF (Reciprocal
+    # Rank Fusion): if the LLM-extracted keywords miss the core topic term, the original question
+    # is the safety net. Plain concat can let one side dominate via BM25 token-weight differences, so fuse by rank.
+    _RRF_K = 60
+    queries: list[str] = []
+    if plan.search_keywords:
+        queries.append(" ".join(plan.search_keywords[:3]))
+    if plan.query and plan.query not in queries:
+        queries.append(plan.query)
+    rrf_scores: dict = {}
+    hit_map: dict = {}
+    for q in queries:
+        for rank, h in enumerate(idx.search(q, k=k)):
+            rrf_scores[h.ntt_id] = rrf_scores.get(h.ntt_id, 0.0) + 1.0 / (_RRF_K + rank)
+            hit_map.setdefault(h.ntt_id, h)
+    top_ids = sorted(rrf_scores, key=lambda i: -rrf_scores[i])[:k]
+    hits = [hit_map[i] for i in top_ids]
     out: list[Excerpt] = []
     for h in hits:
         category = " > ".join(filter(None, (h.category1, h.category2, h.category3)))
@@ -260,8 +274,24 @@ async def _fetch_guides(
         if on_progress:
             await on_progress("fetch_done", {"source": "guide", "count": 0, "keyword": ""})
         return []
-    query = " ".join((plan.search_keywords or [plan.query])[:3])
-    hits = idx.search(query, k=k)
+    # Search with two queries (keywords + original question) and fuse them with RRF: even if the
+    # LLM-extracted keywords miss the core topic term, the original question is the safety net
+    # (e.g. "처방전 보관기간" → only ["보관기간"] extracted; the original query can still hit the
+    # "처방전" (prescription) token). Plain concat lets one side dominate via BM25 weight differences, so a rank-based union is needed.
+    _RRF_K = 60
+    queries: list[str] = []
+    if plan.search_keywords:
+        queries.append(" ".join(plan.search_keywords[:3]))
+    if plan.query and plan.query not in queries:
+        queries.append(plan.query)
+    rrf_scores: dict = {}
+    hit_map: dict = {}
+    for q in queries:
+        for rank, h in enumerate(idx.search(q, k=k)):
+            rrf_scores[h.chunk_id] = rrf_scores.get(h.chunk_id, 0.0) + 1.0 / (_RRF_K + rank)
+            hit_map.setdefault(h.chunk_id, h)
+    top_ids = sorted(rrf_scores, key=lambda i: -rrf_scores[i])[:k]
+    hits = [hit_map[i] for i in top_ids]
     out: list[Excerpt] = []
     for h in hits:
         # doc_date "YYYY.MM" → extract only the year for the recency score
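
The fusion step that both hunks repeat can be looked at in isolation. The following is a minimal sketch, not code from this commit: the helper name `rrf_fuse`, its `key` parameter, and the toy rankings are hypothetical, and plain strings stand in for the hit objects that `idx.search` returns. It keeps the diff's conventions: 0-based ranks from `enumerate`, a constant of 60, and the first-seen hit kept per id.

```python
from typing import Callable, Hashable, Iterable, Sequence, TypeVar

T = TypeVar("T")


def rrf_fuse(
    rankings: Iterable[Sequence[T]],
    key: Callable[[T], Hashable],
    k: int,
    rrf_k: int = 60,
) -> list[T]:
    """Fuse several ranked hit lists with Reciprocal Rank Fusion (hypothetical helper)."""
    scores: dict[Hashable, float] = {}
    first_seen: dict[Hashable, T] = {}
    for ranking in rankings:
        # 0-based rank, mirroring the enumerate() call in the diff
        for rank, hit in enumerate(ranking):
            hit_id = key(hit)
            scores[hit_id] = scores.get(hit_id, 0.0) + 1.0 / (rrf_k + rank)
            # Keep the first object seen for each id, like hit_map.setdefault
            first_seen.setdefault(hit_id, hit)
    top_ids = sorted(scores, key=lambda i: -scores[i])[:k]
    return [first_seen[i] for i in top_ids]


# Toy rankings: one list per query (keyword query, original question).
keyword_hits = ["a", "b", "c"]
question_hits = ["d", "b", "a"]
fused = rrf_fuse([keyword_hits, question_hits], key=lambda h: h, k=3)
print(fused)
```

In this toy run, `"d"` appears only in the original-question ranking yet still reaches the fused top 3, which is the safety-net behaviour the commit message describes. Items found by both queries (`"a"`, `"b"`) rank above items found by only one. With such a helper, the two functions would differ only in the `key` they pass (`ntt_id` for cases, `chunk_id` for guides).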