# app/modules/processing/query_processor.py
import re
from typing import List


def normalize_query(query: str) -> str:
    """
    Normalize a query string.
    - Trim whitespace
    - (Optional) Add rules for punctuation or casing if needed
    """
    return query.strip()


def deduplicate_queries(queries: List[str]) -> List[str]:
    """
    Remove exact duplicate queries, keeping the first occurrence of each
    and preserving the original order.
    """
    seen = set()
    unique = []
    for q in queries:
        if q not in seen:
            unique.append(q)
            seen.add(q)
    return unique


def select_top_k_queries(queries: List[str], reranker, k: int = 5) -> List[str]:
    """
    Select top-k queries using a reranker.
    - Keep broad coverage while reducing redundant API calls
    - Reranker should be provided externally (function/class)
    """
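    if not queries or k <= 0:
        # A negative k would otherwise slice from the end of the list.
        return []

    # The reranker is expected to return (query, score) pairs.
    # sorted() builds a new list, so the reranker's own result is not mutated.
    scored = sorted(reranker(queries), key=lambda pair: pair[1], reverse=True)
    return [q for q, _ in scored[:k]]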


def process_queries(raw_output: str, max_queries: int = 5) -> List[str]:
    """
    Entry point: process raw LLM output into final queries.
    - Split by line
    - Remove leading numbers or list markers
    - Normalize each query
    - Deduplicate identical queries
    - Limit to max_queries
    """
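    queries = []
    for line in raw_output.splitlines():
        # Strip one leading list marker such as "1.", "2)", "-", "*", or "•".
        # Requiring whitespace after the marker keeps queries that merely
        # start with a number (e.g. "2024 budget") intact.
        q = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s+", "", line)
        q = normalize_query(q)
        if q:
            queries.append(q)

    # Deduplicate before truncating so duplicates do not use up the limit.
    return deduplicate_queries(queries)[:max(max_queries, 0)]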
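

# --- Illustrative sketch, not part of the original module ---
# select_top_k_queries() only states that `reranker` returns a list of
# (query, score) pairs. The function below is a hypothetical stand-in that
# shows that contract; its name and its scoring rule (number of distinct
# words) are assumptions made for illustration. A real reranker would
# presumably be backed by a model or a relevance heuristic.
def example_reranker(queries: list[str]) -> list[tuple[str, float]]:
    """Hypothetical reranker: score each query by its count of distinct words."""
    return [(q, float(len(set(q.lower().split())))) for q in queries]


# Usage sketch: select_top_k_queries(["a b", "a b c", "a"], example_reranker, k=2)
# returns ["a b c", "a b"], since those two queries receive the highest scores.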