Vivek Vaddina committed
♻️ Refactor to improve model performance

Files changed:
- README.md +68 -0
- app.py +117 -446
- requirements.txt +4 -0
- src/config.py +33 -1
- src/hyde_rag.py +0 -206
- src/main.py +249 -323
- src/prompts.yaml +43 -44
- src/utils.py +148 -0
README.md CHANGED

@@ -11,3 +11,71 @@ short_description: answer based on input documents
 ---
 
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+# RAG HYDE
+
+This challenge is to build a minimal but powerful Retrieval-Augmented Generation (RAG) workflow inspired by the techniques in the articles already shared.
+
+Your solution should:
+
+* Ingest and chunk local PDFs efficiently.
+* Use a small, fast embedding model for retrieval.
+* Apply HyDE to improve query relevance.
+* Fuse multiple query variants with Reciprocal Rank Fusion for higher accuracy.
+* Generate concise, context-grounded answers with clear source citations.
+
+Goal: Deliver a working RAG example that is fast, lightweight, and high-quality in both retrieval and final answers.
+
+
+## Description
+
+### Approach
+
+- Get the folder containing the data
+- Build the corpus & index
+- Get the user query
+- Transform it to separate search & intent (Query Rewrite)
+- Generate hypothetical documents from the (short) user query (HyDE)
+- Get their corresponding embeddings
+- For each of those embeddings, get relevant results from the corpus
+- Fuse them all together using Reciprocal Rank Fusion (a sketch follows this list)
+- Extract the top relevant results
+- Pre-process those into a context and send it to the LLM one last time with the user query
+- Receive the final answer based on the user's query & provided context
+
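For reference, a minimal sketch of the Reciprocal Rank Fusion step described above; the repo's own version lives in `src/utils.py`, and the `k`/`top_k` defaults here are illustrative:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rank_lists, k=60, top_k=3):
    """Fuse ranked lists of chunk ids from several query variants.

    Each chunk earns 1 / (k + rank) per list; chunks ranked highly by
    many variants accumulate the largest fused scores.
    """
    scores = defaultdict(float)
    for ranks in rank_lists:
        for rank, chunk_id in enumerate(ranks, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [chunk_id for chunk_id, _ in fused[:top_k]]

# chunk 7 sits near the top of all three variant rankings, so it wins
print(reciprocal_rank_fusion([[7, 2, 9], [7, 4, 2], [1, 7, 3]]))  # [7, 2, 1]
```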
+
+## Run
+
+`MODEL_COMBOS` in [config.py](config.py) provides multiple combinations of embedding & generative LLM models, chosen with the host system's limitations & capabilities in mind.
+
+
+## Tips/Observations
+
+- Extracting text as markdown greatly preserved the structure and continuity of the text. This resulted in better logical chunking, which in turn led to better embeddings and, as a consequence, better search results.
+
+- Reading documents via `docling` extracted more (and more accurate) text than `pymupdf4llm`, at a slight cost in speed. It is enabled by default to prioritise accuracy.
+  - This proved especially useful for extracting data containing lots of tables spread over multiple pages.
+  - You can pass `--fast-extract` from the CLI or tick a box in the Gradio UI to use PyMuPDF instead; a sketch of both extraction paths follows below.
+
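For illustration, a minimal sketch of the two extraction paths under those defaults; the file path is hypothetical, while `DocumentConverter` and `pymupdf4llm.to_markdown` are the usual entry points of the two libraries:

```python
import pymupdf4llm
from docling.document_converter import DocumentConverter

pdf_path = "annual_report.pdf"  # hypothetical input file

# default, accuracy-first path: docling -> markdown
md_accurate = DocumentConverter().convert(pdf_path).document.export_to_markdown()

# --fast-extract path: pymupdf4llm -> markdown, quicker but less thorough
md_fast = pymupdf4llm.to_markdown(pdf_path)
```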
+- Increasing the model size (coupled with correct markdown text extraction) greatly improved performance. The Qwen3 models adhered closely to instructions, and the smaller variants, instead of hallucinating, simply fell back to saying _'I don't know'_ (as per instructions). The `4B` variant understood the user intent, which was sometimes vague, and still managed to give relevant results. The base variant is huge and wouldn't fit or run fast enough on a consumer-grade laptop GPU; loading its `AWQ` variant helped, as it occupied substantially less memory than the original without much loss in performance.
+
+- This model also showed great multilingual capabilities. Users can upload a document in one language and ask questions in another, or upload multilingual documents and ask multilingual queries. For the demo, I tested mostly in English & German.
+
+- The data is now stored in the `datasets` format, which allows for better storage & scaling (Arrow) along with indexing (FAISS) for querying.
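A minimal sketch of that storage pattern, assuming the Hugging Face `datasets` FAISS integration; the column name and toy corpus here are illustrative, not the repo's actual ones:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
chunks = ["chunk about model quantization...", "chunk about PDF tables..."]

# Arrow-backed storage for the corpus plus its embeddings
ds = Dataset.from_dict({"text": chunks})
ds = ds.map(lambda row: {"emb": embedder.encode(row["text"], normalize_embeddings=True)})

# FAISS index over the embedding column, queried by nearest neighbours
ds.add_faiss_index(column="emb")
q = embedder.encode("how are tables extracted?", normalize_embeddings=True)
scores, hits = ds.get_nearest_examples("emb", q, k=1)
print(hits["text"])
```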
+
+---
+
+## Limitations / Known Issues
+
+- Even though `docling` with mostly default options proved better than `pymupdf4llm` at extracting text, it's not perfect every time. There are instances where _pymupdf_ extracted text from an image embedded inside a PDF better than docling did. However, docling is highly configurable and allows for deep customization via 'pipelines', and it comes with a far more permissive license for commercial use than PyMuPDF.
+  - docling ships with `easyocr` by default for OCR. It's not as powerful as _tesseract_ or similar engines, but since installing the latter and linking it with docling involves touching system config, that route wasn't pursued.
+
+- When a user uploads multiple PDFs, load times could be improved by reading them asynchronously. Attempts to do that with `docling` sometimes produced pages ordered differently from the original, so it was dropped for the demo. More investigation is needed.
+
+
+## Next Steps
+
+- Check out [EmbeddingGemma](https://huggingface.co/blog/embeddinggemma) for embeddings
+- Check out [fastembed](https://github.com/qdrant/fastembed) to generate embeddings faster
+- Improve text extraction via the docling pipeline
+- Check out `GGUF` models for CPU inference
app.py CHANGED

@@ -1,474 +1,145 @@
-import re
-import math
-import yaml
-import json
-import torch
-import faiss
 import string
-import asyncio
-import pymupdf
 import gradio as gr
 
-from time import time
-from pathlib import Path
-from functools import lru_cache
-from ast import literal_eval
-from collections import defaultdict
-from concurrent.futures import ThreadPoolExecutor
-from sentence_transformers import SentenceTransformer
-from langchain_text_splitters import SentenceTransformersTokenTextSplitter
-from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM
-
-
-async def load_pdfs(files, max_concurrence=…):
-    """
-    …
-    max_concurrence (int): Maximum number of concurrent PDF processing tasks
-
-    Returns:
-        list: List of tuples containing (filename, extracted_text)
-    """
-
-    def _load_pdf_sync(file):
-        """Synchronous PDF loading function for thread pool execution"""
-        text = ""
-        try:
-            with pymupdf.open(file, filetype="pdf") as doc:
-                text = "\n".join(page.get_text() for page in doc)
-        except Exception:
-            log.exception(f"Error reading {file.name}")
-            pass
-
-        return (file.name, text)
-
-    loop = asyncio.get_event_loop()
-    with ThreadPoolExecutor(max_workers=max_concurrence) as executor:
-        futures = [
-            loop.run_in_executor(executor, _load_pdf_sync, file)
-            for file in files
-            if file is not None
-        ]
-
-        results = await asyncio.gather(*futures, return_exceptions=True)
-
-    valid_results = [result for result in results if not isinstance(result, Exception)]
-
-    log.info(f"successfully processed {len(valid_results)} out of {len(files)} PDFs")
-    return valid_results
-
-
-async def build_corpus(pdfs, text_splitter, **load_kwargs):
-    texts = await load_pdfs(pdfs, **load_kwargs)
-    corpus, meta = [], []
-    for file_name, raw_text in texts:
-        chunks = text_splitter.split_text(raw_text)
-        for i, chunk in enumerate(chunks):
-            corpus.append(chunk)
-            meta.append({"file": file_name, "chunk_id": i})
-    return corpus, meta
-
-
-def generate_text(
-    tokenizer, model, user_prompts, system_prompt=None, **llm_kwargs
-):  # max_new_tokens=512, temperature=.4):
-    if system_prompt is None or "":
-        system_prompt = "You are a helpful assistant."
-
-    if isinstance(user_prompts, str):
-        user_prompts = [user_prompts]
-
-    messages = [
-        [
-            {"role": "system", "content": system_prompt},
-            {"role": "user", "content": user_prompt},
-        ]
-        for user_prompt in user_prompts
-    ]
-
-    texts = tokenizer.apply_chat_template(
-        messages, tokenize=False, add_generation_prompt=True
-    )
-
-    model_inputs = tokenizer(
-        texts, return_tensors="pt", truncation=True, padding=True
-    ).to(model.device)
-    generated_ids = model.generate(
-        **model_inputs,
-        max_new_tokens=llm_kwargs.pop("max_new_tokens", 512),
-        temperature=llm_kwargs.pop("temperature", 0.4),
-    )
-    generated_ids = [
-        output_ids[len(input_ids) :]
-        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
-    ]
-    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
-    return response if len(user_prompts) > 1 else response[0]
-
-
-def load_models(
-    embed_model_name: str,
-    gen_model_name: str,
-    causal_lm: bool = False,
-    device=None,
-    bitsandbytesconfig=None,
-):
-    # This will take some time to run for the first time if the model(s) don't exist locally.
-    if not device:
-        device = "cuda" if torch.cuda.is_available() else "cpu"
-    embedder = SentenceTransformer(
-        embed_model_name,
-        device=device,
-        model_kwargs={"dtype": "float16"} if device == "cuda" else {},
-    )
-
-    if not causal_lm:
-        tok = AutoTokenizer.from_pretrained(gen_model_name)
-        gen = AutoModelForSeq2SeqLM.from_pretrained(
-            gen_model_name,  # device_map='auto',
-            quantization_config=bitsandbytesconfig if bitsandbytesconfig else None,
-        )
-    else:
-        tok = AutoTokenizer.from_pretrained(gen_model_name, padding_side="left")
-        gen = AutoModelForCausalLM.from_pretrained(
-            gen_model_name,
-            dtype="float16",  # device_map='auto',
-            quantization_config=bitsandbytesconfig if bitsandbytesconfig else None,
-        )
-    gen.to(device)
-    return embedder, tok, gen
-
-
-def make_query_variants(
-    tokenizer, model, query: str, prompt: str, n: int = 3, **llm_kwargs
-):
-    instructions = f"Now give me at least {n} variations."
-    resp = generate_text(tokenizer, model, query + instructions, prompt, **llm_kwargs)
-
-    clean_resp = re.sub(r"^\d+\.\s*", "", resp, flags=re.MULTILINE).split("\n")
-    return [query] + [q for q in clean_resp if q.strip()]
-
-
-def clean_rewrite_resp(resp):
-    try:
-        resp = json.loads(resp)  # Parse JSON
-    except json.JSONDecodeError:
-        try:
-            resp = literal_eval(resp)  # Fallback parse
-        except Exception:
-            pass  # Keep resp as-is if both fail
-
-    # Ensure resp is a string before strip and slicing
-    if isinstance(resp, str):
-        resp = resp.strip()
-        if resp:
-            start = resp.find("{")
-            if start != -1:
-                end = resp[::-1].find("}")
-                if end != -1:
-                    resp = resp[start : len(resp) - end]
-                    return clean_rewrite_resp(resp)
-    return resp
-
-
-def transform_query(
-    tokenizer, model, query: str, rewrite_prompt: str, **llm_kwargs
-) -> dict:
-    """split the query into things to search and actions to take"""
-    resp = generate_text(tokenizer, model, query, rewrite_prompt, **llm_kwargs)
-    try:
-        resp = clean_rewrite_resp(resp)
-    except:
-        pass
-    return resp
-
-
-def aggregate_queries_and_tasks(
-    tokenizer,
-    model,
-    orig_query,
-    rewrite_prompt,
-    variants_prompt,
-    n_variations=3,
-    **llm_kwargs,
-):
-    # make variations for the original query as is
-    queries = make_query_variants(
-        tokenizer,
-        model,
-        orig_query.strip(),
-        variants_prompt,
-        n_variations,
-        **llm_kwargs,
-    )
-
-    start = time()
-    tr_q = transform_query(tokenizer, model, orig_query.strip(), rewrite_prompt)
-    end = time()
-    log.debug(f"\t\t transforming query task took {(end - start):.1f} seconds...")
-
-    # transformed query might have multiple things to search and tasks to perform depending on user query
-    # recursively get variations for each of the search queries but keep the tasks as is.
-    tasks = []
-    if isinstance(tr_q, dict):
-        search_results, tasks = tr_q.get("search", []), tr_q.get("tasks", [])
-        for search_result in search_results:
-            queries.extend(
-                make_query_variants(
-                    tokenizer,
-                    model,
-                    search_result,
-                    variants_prompt,
-                    n_variations,
-                    **llm_kwargs,
-                )
-            )
-
-    queries = [q.strip(string.punctuation) for q in queries]
-    tasks = [t.strip(string.punctuation) for t in tasks]
-
-    # keep the original user query as is (if in case LLM messes up the original query) and pick some after shuffling the rest
-    # This is disabled as we don't do loops and instead take advantage of batches.
-    # Since it's efficient, we can take many query variations at once without worrying about performance.
-    # q, queries = queries[:1], queries[1:]
-    # shuffle(queries)
-    # q += queries[:n_variations-1]
-
-    return queries, tasks
-
-
-def build_index(corpus_emb, n_cells=5, n_probe=2):
-    log.debug(f"building index with {n_cells=}, {n_probe=}")
-    d = corpus_emb.shape[1]
-    quantizer = faiss.IndexFlatIP(d)
-    index = faiss.IndexIVFFlat(quantizer, d, n_cells)
-    index.n_probe = n_probe
-    index.train(corpus_emb)
-    index.add(corpus_emb)
-    # index.make_direct_map()
-    return index
-
-
-def reciprocal_rank_fusion(indices, top_k=3, denom=50):
-    ii = indices.tolist()
-    scores = defaultdict(int)
-    for row in ii:
-        for rank, chunk_id in enumerate(row):
-            scores[chunk_id] += 1 / (rank + denom)
-    results = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
-    return [chunk_id for chunk_id, _ in results]
-
-
-class HyDeRAGFusion:
-    def __init__(
-        self,
-        embed_model: str,
-        generator_llm_model: str,
-        causal_lm: bool = True,
-        chunk_overlap: int = 50,
-        tokens_per_chunk: int = 256,
-        embed_batch_size: int = 64,
-        bitsandbytesconfig=None,
-    ):
-        self.embed_batch_size = embed_batch_size
-        self.text_splitter = SentenceTransformersTokenTextSplitter(
-            chunk_overlap, embed_model, tokens_per_chunk
-        )
-        self.embedder, self.tok, self.gen = load_models(
-            embed_model, generator_llm_model, causal_lm, bitsandbytesconfig
-        )
-        with open(PROMPTS_FILEPATH) as fl:
-            self.prompts = yaml.safe_load(fl)
-
-    @lru_cache(maxsize=8)
-    def preprocess_pdfs(self, pdfs, data_load_kwargs={}, faiss_index_kwargs={}):
-        self.corpus, self.meta = asyncio.run(
-            build_corpus(pdfs, self.text_splitter, **data_load_kwargs)
-        )
-        log.debug(f"{len(self.corpus)}, {len(self.meta)}")
-        self.corpus_emb = self.embedder.encode(
-            self.corpus,
-            batch_size=self.embed_batch_size,
-            show_progress_bar=True,
-            normalize_embeddings=True,
-        )
-        log.debug(f"{self.corpus_emb.shape}")
-
-        # https://github.com/facebookresearch/faiss/issues/112
-        # n_cells = int(round(4 * (self.corpus_emb.shape[0])**.5))
-
-        # one centroid for every 100 or so vectors and 20% of them as n_probe
-        n_cells = faiss_index_kwargs.pop("n_cells", self.corpus_emb.shape[0] // 100 + 1)
-        n_probe = faiss_index_kwargs.pop("n_probe", math.ceil(0.2 * n_cells))
-
-        self.index = build_index(self.corpus_emb, n_cells, n_probe)
-
-    def retrieve(
-        self, query, n_variants=3, top_k_per_variant=10, top_k_retrieve=3, **llm_kwargs
-    ):
-        start = time()
-
-        queries, tasks = aggregate_queries_and_tasks(
-            self.tok,
-            self.gen,
-            query.strip(),
-            self.prompts["rewrite"],
-            self.prompts["variants"],
-            n_variants,
-            **llm_kwargs,
-        )
-
-        end = time()
-        log.debug(f"aggregate task took {(end - start):.1f} seconds...")
-
-        start = time()
-        hyde_docs = generate_text(
-            self.tok, self.gen, queries, self.prompts["hyde"], **llm_kwargs
-        )
-        end = time()
-        log.debug(f"generating hyde docs took {(end - start):.1f} seconds...")
-
-        start = time()
-        chunks = []
-        for hyde_doc in hyde_docs:
-            chunks.extend(self.text_splitter.split_text(hyde_doc))
-        q_emb = self.embedder.encode(
-            chunks, batch_size=self.embed_batch_size, normalize_embeddings=True
-        )
-        end = time()
-        log.debug(f"embedding hyde docs took {(end - start):.1f} seconds...")
-
-        _, I = self.index.search(q_emb, top_k_per_variant)
-        chunk_ids = reciprocal_rank_fusion(I, top_k_retrieve)
-        return chunk_ids, tasks
-
-    def answer(self, query, doc_ids, tasks, max_ctx_chars=128000):
-        total, text, prompt_length = 0, "", 10000
-        sep = "\n\n-----\n\n"
-        tasks = ", ".join(tasks)
-
-        for doc_id in doc_ids:
-            # adding tags in the context caused more hallucinations.
-            # Instead, we list them as sources beneath the model response.
-            # _meta = self.meta[doc_id]
-            # tag = f"(source: {_meta['file_name']}:{_meta['chunk_id']})"
-            chunk = self.corpus[doc_id].strip()
-            tag = ""
-
-            ctx = f"{sep}{tag}\n\n{chunk}"
-            if total + len(ctx) + len(tasks) + len(sep) + prompt_length > max_ctx_chars:
-                break
-
-            text += ctx
-            total = len(text)
-
-        text += f"{sep}{tasks}"
-
-        # instruction = "Answer concisely and also cite file names & chunk ids inline like (pdf_file_name:chunk_id)."
-        instruction = "go ahead and answer!"
-        user_query = f"\nq: {query}\n\nctx:{text}" + f"\n\n{instruction}\n\n"
-
-        start = time()
-        resp = generate_text(
-            self.tok,
-            self.gen,
-            user_query,
-            self.prompts["final_answer"],
-            temperature=0.3,
-        )
-        end = time()
-        log.debug(f"final resp took {(end - start):.1f} seconds...")
-
-        return resp
-
-
-def initial_setup(embed_model, generator_model, bitsandbytesconfig=None):
-    return HyDeRAGFusion(
-        embed_model, generator_model, bitsandbytesconfig=bitsandbytesconfig
-    )
-
-
-start = time()
-HRF = initial_setup("sentence-transformers/LaBSE", "Qwen/Qwen2.5-0.5B-Instruct")
-end = time()
-msg = f"init took {(end - start):.1f} seconds"
-log.debug(msg)
-
-
-def main(
-    pdfs,
-    query,
-    n_variants=3,
-    top_k_per_variant=5,
-    top_k_retrieve=3,
-    temperature=0.4,
-    max_new_tokens=512,
-):
-    start = time()
-    if pdfs:
-        HRF.preprocess_pdfs(tuple(sorted(pdfs)))
-
-    llm_kwargs = {
-        "temperature": temperature,
-        "max_new_tokens": max_new_tokens,
-    }
-    doc_ids, tasks = HRF.retrieve(
-        query,
-        int(n_variants),
-        int(top_k_per_variant),
-        int(top_k_retrieve),
-        **llm_kwargs,
-    )
-    docs = [HRF.corpus[doc_id] for doc_id in doc_ids]
-    reply = HRF.answer(query, doc_ids, tasks)
-    sources = [
-        {
-            "source": f"{Path(HRF.meta[doc_id]['file']).stem}:{HRF.meta[doc_id]['chunk_id']}",
-            "content": doc,
-        }
-        for doc_id, doc in zip(doc_ids, docs)
-    ]
-
-    resp = f"{reply}\n\n{'-' * 25}\n\n"
-    resp += "Top 3 sources:"
-    resp += f"\n\n{'-' * 25}\n\n"
-    for source in sources:
-        resp += f"source: {source['source']}\n\n"
-        resp += source["content"]
-        resp += f"\n\n{'-' * 25}\n\n"
-
-    return resp
-
-
-def reset(pdfs):
-    """
-    Reset text input when input docs change
-    """
-    return ""
-
 
 with gr.Blocks(title="RAG with HYDE") as demo:
     gr.Markdown("# RAG with HYDE")
     with gr.Row():
         pdf_input = gr.File(
-            label="upload PDF(s)",
-            …
         )
-        query = gr.Textbox(label="…
-    gr.…
-    …
 
 if __name__ == "__main__":
     demo.launch()
+import sys
 import string
 import gradio as gr
+from src.main import ask
+from src.utils import empty_cache
+from src.config import log
 
 
+def ask_wrapper(
+    pdfs,
+    query,
+    model_combo_key,
+    fast_extract,
+    n_variants,
+    top_k_per_variant,
+    top_k_retrieve,
+    temperature,
+):
+    resp = ask.callback(
+        pdfs,
+        query,
+        model_combo_key,
+        fast_extract,
+        n_variants,
+        top_k_per_variant,
+        top_k_retrieve,
+        temperature,
+    )
 
+    return f"## Final Answer:\n\n{resp}\n"
+
+
+def reset(pdfs):
+    """
+    Reset text input and empty cache
+    """
+    log.warning("emptying cache")
+    empty_cache()
+    return ""
+
+
+# Enable the button only when both fields are nonempty
+def _enable_submit_if_filled(pdfs, query):
+    status = bool(pdfs) and bool(len(query.strip(string.punctuation + " ")) > 10)
+    return gr.update(interactive=status)
+
+
+def disable_button():
+    return gr.update(interactive=False, value="Processing...")
+
+
+def enable_button():
+    return gr.update(interactive=True, value="Submit")
 
 
 with gr.Blocks(title="RAG with HYDE") as demo:
     gr.Markdown("# RAG with HYDE")
     with gr.Row():
         pdf_input = gr.File(
+            label="upload PDF(s)",
+            file_types=[".pdf"],
+            file_count="multiple",
         )
+        query = gr.Textbox(label="Question (Enter at least 10 valid characters)")
 
+    with gr.Accordion("Advanced Settings", open=False):
+        gr.Markdown(
+            "*These parameters have sensible defaults but can be customized if needed*"
+        )
+        with gr.Row():
+            _default_combo = "linux" if sys.platform == "linux" else "mac"
+            model_combo_key = gr.Dropdown(
+                label="Model Combo Key",
+                choices=[_default_combo, "HF-mid"],
+                value=_default_combo,
+            )
+            fast_extract = gr.Checkbox(
+                value=False, label="Use PyMuPDF to extract content in markdown"
+            )
+            n_variants = gr.Number(
+                value=3, minimum=1, maximum=5, label="no. of query variants"
+            )
+        with gr.Row():
+            top_k_per_variant = gr.Number(
+                value=5,
+                minimum=2,
+                maximum=10,
+                label="top `k` hits per query variant for RRF",
+            )
+            top_k_retrieve = gr.Number(
+                value=3,
+                minimum=1,
+                maximum=5,
+                label="top `k` chunks to retrieve after RRF",
+            )
+            temperature = gr.Slider(
+                value=0.7, minimum=0.1, maximum=1.0, step=0.1, label="temperature"
+            )
 
+    gr.Markdown(
+        "### *Please be patient after hitting the submit button* esp. for the first question after uploading new document(s)"
+    )
+    submit_btn = gr.Button("Submit", variant="primary", interactive=False)
+    answer = gr.Markdown(label="## Answer")
 
+    pdf_input.change(
+        _enable_submit_if_filled, [pdf_input, query], submit_btn, queue=False
+    )
+    query.change(_enable_submit_if_filled, [pdf_input, query], submit_btn, queue=False)
+
+    submit_btn.click(fn=disable_button, outputs=submit_btn).then(
+        fn=ask_wrapper,
+        inputs=[
+            pdf_input,
+            query,
+            model_combo_key,
+            fast_extract,
+            n_variants,
+            top_k_per_variant,
+            top_k_retrieve,
+            temperature,
+        ],
+        outputs=answer,
+    ).then(fn=enable_button, outputs=submit_btn)
+
+    query.submit(fn=disable_button, outputs=submit_btn).then(
+        fn=ask_wrapper,
+        inputs=[
+            pdf_input,
+            query,
+            model_combo_key,
+            fast_extract,
+            n_variants,
+            top_k_per_variant,
+            top_k_retrieve,
+            temperature,
+        ],
+        outputs=answer,
+    ).then(fn=enable_button, outputs=submit_btn)
+
+    pdf_input.change(reset, pdf_input, query)
+    demo.load(reset, pdf_input, query)
 
 if __name__ == "__main__":
     demo.launch()
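Worth noting: `ask` imported from `src.main` is a `click` command, so the UI invokes `ask.callback`, the undecorated function that click stores on every command object. A minimal sketch of that pattern (the `greet` command here is hypothetical, standing in for `ask`):

```python
import click

@click.command()
@click.option("--name", default="world")
def greet(name):
    """A toy command standing in for src.main.ask."""
    return f"hello {name}"

# greet() would parse sys.argv as a CLI entry point;
# greet.callback(...) calls the plain Python function directly.
print(greet.callback(name="HyDE"))  # -> hello HyDE
```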
requirements.txt CHANGED

@@ -3,6 +3,10 @@ numpy
 transformers
 sentence-transformers
 pymupdf
+pymupdf4llm
 langchain-text-splitters
 #faiss-cpu
 faiss-gpu-cu12
+autoawq
+docling
src/config.py CHANGED

@@ -8,7 +8,7 @@ def get_logger(LOG_LEVEL="INFO"):
     LOG_PATH = Path("logs.log")
     formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
 
-    log = logging.Logger("…
+    log = logging.Logger("agentic_search")
     log.setLevel(LOG_LEVEL)
 
     file_handler = logging.FileHandler(LOG_PATH)

@@ -21,3 +21,35 @@ def get_logger(LOG_LEVEL="INFO"):
 
 
 log = get_logger("DEBUG")
+
+MODEL_COMBOS = {
+    "linux": {
+        "embed_model": "Qwen/Qwen3-Embedding-0.6B",
+        "gen_model": "Qwen/Qwen3-4B-AWQ",
+        # 'gen_model': "Qwen/Qwen3-0.6B-GPTQ-Int8"
+        # 'gen_model': "Qwen/Qwen3-1.7B-GPTQ-Int8"
+    },
+    # feel free to replace with any ??B-MLX-?bit versions from Qwen3 Collection at:
+    # https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f
+    "mac": {
+        "embed_model": "Qwen/Qwen3-Embedding-0.6B",
+        "gen_model": "Qwen/Qwen3-4B-MLX-4bit",
+    },
+    "mac_mid": {
+        "embed_model": "Qwen/Qwen3-Embedding-0.6B",
+        "gen_model": "Qwen/Qwen3-4B-MLX-6bit",
+    },
+    "mac_high": {
+        "embed_model": "Qwen/Qwen3-Embedding-0.6B",
+        "gen_model": "Qwen/Qwen3-4B-MLX-8bit",
+    },
+    # HF-low is same as `linux-local`
+    "HF-mid": {
+        "embed_model": "Qwen/Qwen3-Embedding-0.6B",
+        "gen_model": "Qwen/Qwen3-8B-AWQ",
+    },
+    "HF-high": {
+        "embed_model": "Qwen/Qwen3-Embedding-4B",
+        "gen_model": "Qwen/Qwen3-14B-AWQ",
+    },
+}
src/hyde_rag.py DELETED

@@ -1,206 +0,0 @@
-# hyde_ragfusion.py
-# Minimal HyDE + RAG-Fusion over local PDFs.
-# Dependencies: transformers, sentence-transformers, scikit-learn, pymupdf, numpy
-
-import os
-import re
-import heapq
-import fitz  # PyMuPDF
-from sklearn.neighbors import NearestNeighbors
-from sentence_transformers import SentenceTransformer
-from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
-
-
-# -----------------------------
-# Ingestion & Chunking
-# -----------------------------
-def load_pdfs(folder):
-    docs = []
-    for fn in os.listdir(folder):
-        if fn.lower().endswith(".pdf"):
-            path = os.path.join(folder, fn)
-            with fitz.open(path) as doc:
-                text = "\n".join(page.get_text("text") for page in doc)
-            text = re.sub(r"\s+\n", "\n", text).strip()
-            docs.append((fn, text))
-    return docs
-
-
-def chunk_text(text, chunk_size=300, overlap=50):
-    words = text.split()
-    chunks, i = [], 0
-    while i < len(words):
-        chunk = " ".join(words[i : i + chunk_size])
-        chunks.append(chunk)
-        i += chunk_size - overlap
-    return chunks
-
-
-def build_corpus(pdf_folder):
-    raw = load_pdfs(pdf_folder)
-    corpus, meta = [], []
-    for fn, txt in raw:
-        for i, ch in enumerate(chunk_text(txt)):
-            corpus.append(ch)
-            meta.append({"file": fn, "chunk_id": i})
-    return corpus, meta
-
-
-# -----------------------------
-# Models (local)
-# -----------------------------
-def load_models():
-    # Small, fast encoder for embeddings
-    embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
-    # Lightweight local generator for HyDE + answers
-    gen_name = "google/flan-t5-base"
-    tok = AutoTokenizer.from_pretrained(gen_name)
-    gen = AutoModelForSeq2SeqLM.from_pretrained(gen_name)
-    return embedder, tok, gen
-
-
-# -----------------------------
-# Index (cosine)
-# -----------------------------
-def fit_index(embeddings, n_neighbors=12):
-    nn = NearestNeighbors(metric="cosine", algorithm="auto")
-    nn.fit(embeddings)
-    return nn
-
-
-# -----------------------------
-# RAG-Fusion (query variants) + HyDE
-# -----------------------------
-Q_VARIANTS_PROMPT = """You rewrite the user query into {n} diverse, specific search queries (short).
-User query: "{q}"
-Return each on a new line, no numbering, no extra text."""
-
-HYDE_PROMPT = """Write a factual, neutral, self-contained paragraph that could answer:
-"{q}"
-Avoid fluff. Include likely key terms and entities. 120-180 words."""
-
-ANSWER_PROMPT = """You are a helpful assistant. Use ONLY the provided context.
-Question: {q}
-
-Context:
-{ctx}
-
-Answer concisely and cite file names & chunk ids inline like (file:chunk).
-"""
-
-
-def generate_text(gen, tok, prompt, max_new_tokens=160, temperature=0.3):
-    inputs = tok(prompt, return_tensors="pt")
-    out = gen.generate(
-        **inputs,
-        max_new_tokens=max_new_tokens,
-        do_sample=False,
-        temperature=temperature,
-    )
-    return tok.decode(out[0], skip_special_tokens=True).strip()
-
-
-def make_query_variants(gen, tok, q, n=4):
-    txt = generate_text(
-        gen, tok, Q_VARIANTS_PROMPT.format(q=q, n=n), max_new_tokens=120
-    )
-    # Split cleanly into lines (drop empties/dups; include original)
-    lines = [l.strip(" -•\t") for l in txt.split("\n") if l.strip()]
-    uniq = []
-    seen = set()
-    for l in lines + [q]:
-        if l not in seen:
-            seen.add(l)
-            uniq.append(l)
-    return uniq[:n]
-
-
-def hyde_doc(gen, tok, q):
-    return generate_text(gen, tok, HYDE_PROMPT.format(q=q), max_new_tokens=220)
-
-
-# -----------------------------
-# Retrieval + RRF
-# -----------------------------
-def cosine_search(nn, corpus_embeddings, query_vec, top_k=8):
-    dists, idxs = nn.kneighbors(query_vec.reshape(1, -1), n_neighbors=top_k)
-    # Convert cosine distance to similarity
-    sims = 1 - dists[0]
-    return list(zip(idxs[0].tolist(), sims.tolist()))
-
-
-def reciprocal_rank_fusion(rank_lists, k=60, top_k=8):
-    # rank_lists: list of [doc_id, ...] ordered best→worst
-    scores = {}
-    for ranks in rank_lists:
-        for rank, doc_id in enumerate(ranks, start=1):
-            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
-    # top by fused score
-    best = heapq.nlargest(top_k, scores.items(), key=lambda x: x[1])
-    return [doc_id for doc_id, _ in best]
-
-
-# -----------------------------
-# Pipeline
-# -----------------------------
-class HyDeRAGFusion:
-    def __init__(self, pdf_folder):
-        self.corpus, self.meta = build_corpus(pdf_folder)
-        self.embedder, self.tok, self.gen = load_models()
-        self.corpus_emb = self.embedder.encode(
-            self.corpus,
-            batch_size=64,
-            show_progress_bar=True,
-            normalize_embeddings=True,
-        )
-        self.nn = fit_index(self.corpus_emb)
-
-    def retrieve(self, query, n_variants=4, per_variant_k=8, final_top_k=6, rrf_k=60):
-        variants = make_query_variants(self.gen, self.tok, query, n=n_variants)
-        rank_lists = []
-        for v in variants:
-            hypo = hyde_doc(self.gen, self.tok, v)  # HyDE
-            q_vec = self.embedder.encode([hypo], normalize_embeddings=True)[0]
-            hits = cosine_search(self.nn, self.corpus_emb, q_vec, top_k=per_variant_k)
-            rank_lists.append([doc_id for doc_id, _ in hits])
-        fused = reciprocal_rank_fusion(rank_lists, k=rrf_k, top_k=final_top_k)
-        return fused
-
-    def answer(self, query, doc_ids, max_ctx_chars=4000):
-        # Build compact context with inline provenance
-        ctx_parts = []
-        total = 0
-        for i in doc_ids:
-            piece = self.corpus[i]
-            tag = f"(source: {self.meta[i]['file']}:{self.meta[i]['chunk_id']})"
-            chunk = piece.strip()
-            if total + len(chunk) + len(tag) + 5 > max_ctx_chars:
-                break
-            ctx_parts.append(f"{chunk}\n{tag}")
-            total += len(chunk) + len(tag) + 5
-        ctx = "\n\n---\n\n".join(ctx_parts)
-        prompt = ANSWER_PROMPT.format(q=query, ctx=ctx)
-        return generate_text(self.gen, self.tok, prompt, max_new_tokens=300)
-
-
-# -----------------------------
-# Example usage
-# -----------------------------
-if __name__ == "__main__":
-    import argparse
-
-    ap = argparse.ArgumentParser()
-    ap.add_argument("--pdf_folder", required=True, help="Folder with PDFs to index")
-    ap.add_argument("--query", required=True, help="Your user question")
-    ap.add_argument("--show_sources", action="store_true")
-    args = ap.parse_args()
-
-    rag = HyDeRAGFusion(args.pdf_folder)
-    doc_ids = rag.retrieve(args.query)
-    answer = rag.answer(args.query, doc_ids)
-    print("\n=== ANSWER ===\n")
-    print(answer)
-    if args.show_sources:
-        print("\n=== TOP SOURCES ===")
-        for i in doc_ids:
-            print(f"- {rag.meta[i]['file']}:{rag.meta[i]['chunk_id']}")
src/main.py CHANGED

@@ -1,79 +1,32 @@
-import pymupdf
-import math
-import faiss
 import string
 import yaml
 import re
-import json
-import asyncio
 import torch
-import streamlit as st
 import click
 
-from collections import defaultdict
-from ast import literal_eval
 from time import time
 from sentence_transformers import SentenceTransformer
-from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM
-from langchain_text_splitters import SentenceTransformersTokenTextSplitter
-
-from src.config import PROMPTS_FILEPATH, log
-
-
-async def load_pdfs(files, max_concurrence=…):
-    """
-    …
-    folder (str): Path to folder containing PDF files
-    max_concurrence (int): Maximum number of concurrent PDF processing tasks
-
-    Returns:
-        list: List of tuples containing (filename, extracted_text)
-    """
-
-    def _load_pdf_sync(file):
-        """Synchronous PDF loading function for thread pool execution"""
-        text = ""
-        try:
-            with pymupdf.open(stream=file.getvalue(), filetype="pdf") as doc:
-                text = "\n".join(page.get_text() for page in doc)
-        except Exception:
-            log.exception(f"Error reading {file.name}")
-            pass
-
-        return (file.name, text)
-
-    loop = asyncio.get_event_loop()
-    with ThreadPoolExecutor(max_workers=max_concurrence) as executor:
-        futures = [
-            loop.run_in_executor(executor, _load_pdf_sync, file)
-            for file in files
-            if file is not None
-        ]
-
-        results = await asyncio.gather(*futures, return_exceptions=True)
-
-    valid_results = [result for result in results if not isinstance(result, Exception)]
-
-    log.info(f"successfully processed {len(valid_results)} out of {len(files)} PDFs")
-    return valid_results
-
-
-async def build_corpus(pdfs, text_splitter, **load_kwargs):
-    texts = await load_pdfs(pdfs, **load_kwargs)
-    corpus, meta = [], []
-    for file_name, raw_text in texts:
-        chunks = text_splitter.split_text(raw_text)
-        for i, chunk in enumerate(chunks):
-            corpus.append(chunk)
-            meta.append({"file": file_name, "chunk_id": i})
-    return corpus, meta
 
 
 def generate_text(
-    tokenizer, model, user_prompts, system_prompt=None, **llm_kwargs
-):
     if system_prompt is None or "":
         system_prompt = "You are a helpful assistant."

@@ -88,96 +41,122 @@ def generate_text(
         for user_prompt in user_prompts
     ]
 
-    texts = tokenizer.apply_chat_template(
-        messages, tokenize=False, add_generation_prompt=True
-    )
 
-    …
 
-def load_models(
-    embed_model_name: str,
-    gen_model_name: str,
-    causal_lm: bool = False,
-    device=None,
-    bitsandbytesconfig=None,
-):
     # This will take some time to run for the first time if the model(s) don't exist locally.
     if not device:
-        device = "cuda" if torch.cuda.is_available() else "cpu"
     embedder = SentenceTransformer(
         embed_model_name,
         device=device,
-        model_kwargs={"dtype": "float16"} if device == "cuda" else {},
     )
-
-    if not causal_lm:
-        tok = AutoTokenizer.from_pretrained(gen_model_name)
-        gen = AutoModelForSeq2SeqLM.from_pretrained(
-            gen_model_name,  # device_map='auto',
-            quantization_config=bitsandbytesconfig if bitsandbytesconfig else None,
-        )
-    else:
-        tok = AutoTokenizer.from_pretrained(gen_model_name, padding_side="left")
-        gen = AutoModelForCausalLM.from_pretrained(
-            gen_model_name,
-            dtype="float16",  # device_map='auto',
-            quantization_config=bitsandbytesconfig if bitsandbytesconfig else None,
-        )
-    gen.to(device)
     return embedder, tok, gen
 
 
 def make_query_variants(
-    tokenizer, model, query: str, prompt: str, n: int = 3, **llm_kwargs
 ):
-    instructions = f"Now give me at least {n} variations."
-    …
     clean_resp = re.sub(r"^\d+\.\s*", "", resp, flags=re.MULTILINE).split("\n")
-    …
 
-    try:
-        resp = json.loads(resp)  # Parse JSON
-    except json.JSONDecodeError:
-        try:
-            resp = literal_eval(resp)  # Fallback parse
-        except Exception:
-            pass  # Keep resp as-is if both fail
-
-    # Ensure resp is a string before strip and slicing
-    if isinstance(resp, str):
-        resp = resp.strip()
-        if resp:
-            start = resp.find("{")
-            if start != -1:
-                end = resp[::-1].find("}")
-                if end != -1:
-                    resp = resp[start : len(resp) - end]
-                    return clean_rewrite_resp(resp)
-    return resp
 
 
 def transform_query(
-    tokenizer, model, query: str, rewrite_prompt: str, **llm_kwargs
 ) -> dict:
     """split the query into things to search and actions to take"""
-    resp = generate_text(…
     try:
         resp = clean_rewrite_resp(resp)
     except:

@@ -192,6 +171,7 @@ def aggregate_queries_and_tasks(
     rewrite_prompt,
     variants_prompt,
     n_variations=3,
+    …
     **llm_kwargs,
 ):
     # make variations for the original query as is

@@ -201,14 +181,13 @@ def aggregate_queries_and_tasks(
         orig_query.strip(),
         variants_prompt,
         n_variations,
+        …
         **llm_kwargs,
     )
 
-    start = time()
-    tr_q = transform_query(tokenizer, model, orig_query.strip(), rewrite_prompt)
-    end = time()
-    log.debug(f"\t\t transforming query task took {(end - start):.1f} seconds...")
-
     # transformed query might have multiple things to search and tasks to perform depending on user query
     # recursively get variations for each of the search queries but keep the tasks as is.
     tasks = []

@@ -222,91 +201,77 @@ def aggregate_queries_and_tasks(
                     search_result,
                     variants_prompt,
                     n_variations,
+                    …
                     **llm_kwargs,
                 )
             )
 
-    queries = [q.strip(string.punctuation) for q in queries]
-    tasks = [t.strip(string.punctuation) for t in tasks]
-
     # keep the original user query as is (if in case LLM messes up the original query) and pick some after shuffling the rest
-    …
 
     return queries, tasks
 
 
-def build_index(corpus_emb, n_cells=5, n_probe=2):
-    log.debug(f"building index with {n_cells=}, {n_probe=}")
-    d = corpus_emb.shape[1]
-    quantizer = faiss.IndexFlatIP(d)
-    index = faiss.IndexIVFFlat(quantizer, d, n_cells)
-    index.n_probe = n_probe
-    index.train(corpus_emb)
-    index.add(corpus_emb)
-    # index.make_direct_map()
-    return index
-
-
-def reciprocal_rank_fusion(indices, top_k=3, denom=50):
-    ii = indices.tolist()
-    scores = defaultdict(int)
-    for row in ii:
-        for rank, chunk_id in enumerate(row):
-            scores[chunk_id] += 1 / (rank + denom)
-    results = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
-    return [chunk_id for chunk_id, _ in results]
 
 
 class HyDeRAGFusion:
     def __init__(
         self,
         embed_model: str,
         generator_llm_model: str,
-        …
-        chunk_overlap: int = 50,
-        tokens_per_chunk: int = 256,
-        embed_batch_size: int = 64,
-        bitsandbytesconfig=None,
     ):
         self.embed_batch_size = embed_batch_size
-        self.text_splitter = SentenceTransformersTokenTextSplitter(
-            chunk_overlap, embed_model, tokens_per_chunk
-        )
         self.embedder, self.tok, self.gen = load_models(
-            embed_model, generator_llm_model, …
         )
         with open(PROMPTS_FILEPATH) as fl:
             self.prompts = yaml.safe_load(fl)
 
-    def preprocess_pdfs(…
-        …
         )
-        …
             batch_size=self.embed_batch_size,
-            …
         )
 
-        # …
 
     def retrieve(
-        self, query, n_variants=3, top_k_per_variant=…
     ):
-        start = time()
-
         queries, tasks = aggregate_queries_and_tasks(
             self.tok,
             self.gen,

@@ -314,48 +279,38 @@ class HyDeRAGFusion:
             self.prompts["rewrite"],
             self.prompts["variants"],
             n_variants,
+            …
             **llm_kwargs,
         )
-
-        end = time()
-        log.debug(f"aggregate task took {(end - start):.1f} seconds...")
-
-        start = time()
         hyde_docs = generate_text(
-            self.tok, …
         )
-        end = time()
-        log.debug(f"generating hyde docs took {(end - start):.1f} seconds...")
-
-        start = time()
         chunks = []
         for hyde_doc in hyde_docs:
            chunks.extend(self.text_splitter.split_text(hyde_doc))
-        q_emb = self.embedder.encode(
-            …
        )
-        …
 
-        _, I = self.index.search(q_emb, top_k_per_variant)
-        chunk_ids = reciprocal_rank_fusion(I, top_k_retrieve)
-        return chunk_ids, tasks
-
-    def answer(self, query, doc_ids, tasks, max_ctx_chars=128000):
         total, text, prompt_length = 0, "", 10000
         sep = "\n\n-----\n\n"
-        tasks = ", ".join(tasks)
-
-        …
-            # _meta = self.meta[doc_id]
-            # tag = f"(source: {_meta['file_name']}:{_meta['chunk_id']})"
-            chunk = self.corpus[doc_id].strip()
-            tag = ""
-
-            ctx = f"{sep}{tag}\n\n{chunk}"
            if total + len(ctx) + len(tasks) + len(sep) + prompt_length > max_ctx_chars:
                break
 
            text += ctx

@@ -363,41 +318,52 @@ class HyDeRAGFusion:
 
         text += f"{sep}{tasks}"
 
-        # instruction = "Answer concisely and also cite file names & chunk ids inline like (pdf_file_name:chunk_id)."
         instruction = "go ahead and answer!"
         user_query = f"\nq: {query}\n\nctx:{text}" + f"\n\n{instruction}\n\n"
-
-        start = time()
         resp = generate_text(
             self.tok,
             self.gen,
             user_query,
             self.prompts["final_answer"],
-            temperature=0.3,
         )
-        end = time()
-        log.debug(f"final resp took {(end - start):.1f} seconds...")
 
-        return resp
 
-…
-    return HyDeRAGFusion(
-        embed_model, generator_model, bitsandbytesconfig=bitsandbytesconfig
-    )
 
 
 @click.command(context_settings=dict(show_default=True))
 @click.option(
-    "--…
 )
 @click.option(
-    "--…
-    default=…
-    help="…
 )
 @click.option("--n-variants", default=3, help="no. of query variants")
 @click.option(

@@ -408,97 +374,57 @@ def initial_setup(embed_model, generator_model, bitsandbytesconfig=None):
 @click.option(
     "--top-k-retrieve", default=3, help="top `k` chunks to retrieve after RRF"
 )
-@click.option("--temperature", default=0.…
-…
-)
-def main(
-    embed_model,
-    generator_llm_model,
     n_variants,
     top_k_per_variant,
     top_k_retrieve,
     temperature,
-    max_new_tokens,
-    faiss_index_kwargs,
 ):
-    …
-    start = time()
-    hrf = initial_setup(embed_model, generator_llm_model)
-    end = time()
-    msg = f"init took {(end - start):.1f} seconds"
-    log.debug(msg)
-    st.write(msg)
-
-    st.set_page_config(page_title="RAG HYDE")
-    st.header("Ask Questions")
-
-    state = st.session_state
-    if "uploaded_names" not in state:
-        state.uploaded_names = []
-
-    pdfs = st.file_uploader(
-        "Upload your PDF(s)", type="pdf", accept_multiple_files=True, key="upload"
     )
-    …
-        f"corpus embeddings shape: {hrf.corpus_emb.shape}, computed in {end - start:.1f} seconds"
-    )
 
-    …
-    st.write("upload data to query")
 
-    query…
-
-    llm_kwargs = {
-        "temperature": temperature,
-        "max_new_tokens": max_new_tokens,
-    }
-    doc_ids, tasks = hrf.retrieve(
-        query,
         int(n_variants),
         int(top_k_per_variant),
         int(top_k_retrieve),
-        …
     )
-    …
     end = time()
-    …
-    sources = [
-        {
-            "source": f"{hrf.meta[doc_id]['file']}:{hrf.meta[doc_id]['chunk_id']}",
-            "content": doc,
-        }
-        for doc_id, doc in zip(doc_ids, docs)
-    ]
-    st.json(sources[:3])
 
 
 if __name__ == "__main__":
-    # faiss_index_kwargs = {
-    #     'n_cells': 20,
-    #     'n_probe': 8
-    # }
-    main()
 import string
+import faiss
 import yaml
 import re
+import sys
 import torch
 import click

 from time import time
+from random import shuffle
+from functools import lru_cache
 from sentence_transformers import SentenceTransformer
+from transformers import AutoTokenizer, AutoModelForCausalLM
+from langchain_text_splitters import MarkdownTextSplitter
+
+from src.config import PROMPTS_FILEPATH, MODEL_COMBOS, log
+from src.utils import (
+    empty_cache,
+    find_think_tag_in_each_row,
+    build_corpus,
+    reciprocal_rank_fusion,
+    clean_rewrite_resp,
+)

 def generate_text(
+    tokenizer, model, user_prompts, system_prompt=None, model_name="", **llm_kwargs
+):
+    assert model_name, "pass the generative model name"
     if not system_prompt:
         system_prompt = "You are a helpful assistant."

@@ … @@
         for user_prompt in user_prompts
     ]

+    if "mlx" in model_name.lower() and sys.platform == "darwin":
+        from mlx_lm import generate
+
+        texts = [
+            tokenizer.apply_chat_template(
+                message,
+                tokenize=False,
+                add_generation_prompt=True,
+                enable_thinking=False,
+            )
+            for message in messages
+        ]
+        responses = [
+            generate(
+                model,
+                tokenizer,
+                prompt=text,
+                verbose=False,
+                max_tokens=llm_kwargs.pop("max_new_tokens", 32768),
+            )
+            for text in texts
+        ]
+    else:
+        texts = tokenizer.apply_chat_template(
+            messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
+        )
+        model_inputs = tokenizer(
+            texts, return_tensors="pt", truncation=True, padding=True
+        ).to(model.device)
+
+        with torch.no_grad():
+            generated_ids = model.generate(
+                **model_inputs,
+                max_new_tokens=llm_kwargs.pop("max_new_tokens", 32768),
+                temperature=llm_kwargs.pop("temperature", 0.7),
+                top_p=llm_kwargs.pop("top_p", 0.8),
+                top_k=llm_kwargs.pop("top_k", 20),
+                min_p=llm_kwargs.pop("min_p", 0),
+                **llm_kwargs,
+            )
+
+        output_ids = generated_ids[:, model_inputs.input_ids.shape[1] :]
+        idxs = find_think_tag_in_each_row(output_ids)
+        thinking_contents = [
+            tokenizer.decode(output_ids[i][:idx], skip_special_tokens=True).strip("\n")
+            for i, idx in enumerate(idxs)
+        ]
+        contents = [
+            tokenizer.decode(output_ids[i][idx:], skip_special_tokens=True).strip("\n")
+            for i, idx in enumerate(idxs)
+        ]
+        responses = [
+            f"{think_resp}{cont}"
+            for think_resp, cont in zip(thinking_contents, contents)
+        ]

+    return responses[0] if len(user_prompts) == 1 else responses
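For reference, a minimal sketch of calling `generate_text` on the transformers branch. The checkpoint name here is an assumption for illustration; in the app the tokenizer and model come from `MODEL_COMBOS` via `load_models` below.

    # Minimal sketch, assuming a Qwen3-style chat checkpoint; not the app's actual wiring.
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B", padding_side="left")
    gen = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", device_map="auto").eval()

    answers = generate_text(
        tok,
        gen,
        ["What is HyDE?", "What is Reciprocal Rank Fusion?"],  # batched user prompts
        system_prompt="Answer in one sentence.",
        model_name="Qwen/Qwen3-4B",
        max_new_tokens=128,
    )
    print(answers)  # a list here, since more than one prompt was passed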

+def load_models(embed_model_name: str, gen_model_name: str, device: str = None):
     # This will take some time to run for the first time if the model(s) don't exist locally.
     if not device:
+        if torch.cuda.is_available():
+            device = "cuda"
+        elif torch.mps.is_available():
+            device = "mps"
+        else:
+            device = "cpu"
+
+    dtype = torch.bfloat16 if device == "cuda" else torch.float16
+    if device != "mps" or (device == "mps" and "mlx" not in gen_model_name.lower()):
+        tok = AutoTokenizer.from_pretrained(gen_model_name, padding_side="left")
+        # sometimes loading an AWQ model on my local machine fails on the first attempt, so retry once
+        try:
+            gen = AutoModelForCausalLM.from_pretrained(
+                gen_model_name, dtype=dtype, device_map=device
+            ).eval()
+        except ImportError:
+            gen = AutoModelForCausalLM.from_pretrained(
+                gen_model_name, dtype=dtype, device_map=device
+            ).eval()
+    else:
+        from mlx_lm import load
+
+        gen, tok = load(gen_model_name)
+
     embedder = SentenceTransformer(
         embed_model_name,
         device=device,
+        model_kwargs={"dtype": dtype},
     )
     return embedder, tok, gen

 def make_query_variants(
+    tokenizer, model, query: str, prompt: str, n: int = 3, model_name="", **llm_kwargs
 ):
+    # instructions = f"\n\n(Now give me at least {n} diverse variations of user query in the same language as the user provided query)"
+    # query += instructions
+    resp = generate_text(
+        tokenizer, model, query.format(n=n), prompt, model_name=model_name, **llm_kwargs
+    )
     clean_resp = re.sub(r"^\d+\.\s*", "", resp, flags=re.MULTILINE).split("\n")
+    queries = [q.strip() for q in clean_resp if q.strip()]
+    return [query.lower().strip()] + sorted(
+        set(map(lambda x: str.lower(x).strip(), queries))
+    )
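In isolation, the regex above strips the `1.`-style numbering the model tends to prepend to its variant list:

    import re

    resp = "1. What is HyDE?\n2. How does HyDE help retrieval?"
    lines = re.sub(r"^\d+\.\s*", "", resp, flags=re.MULTILINE).split("\n")
    print([q.strip() for q in lines if q.strip()])
    # -> ['What is HyDE?', 'How does HyDE help retrieval?']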
|
| 152 |
|
| 153 |
def transform_query(
|
| 154 |
+
tokenizer, model, query: str, rewrite_prompt: str, model_name="", **llm_kwargs
|
| 155 |
) -> dict:
|
| 156 |
"""split the query into things to search and actions to take"""
|
| 157 |
+
resp = generate_text(
|
| 158 |
+
tokenizer, model, query, rewrite_prompt, model_name=model_name, **llm_kwargs
|
| 159 |
+
)
|
| 160 |
try:
|
| 161 |
resp = clean_rewrite_resp(resp)
|
| 162 |
except:
|
|
|
|
| 171 |
rewrite_prompt,
|
| 172 |
variants_prompt,
|
| 173 |
n_variations=3,
|
| 174 |
+
gen_model_name="",
|
| 175 |
**llm_kwargs,
|
| 176 |
):
|
| 177 |
# make variations for the original query as is
|
|
|
|
| 181 |
orig_query.strip(),
|
| 182 |
variants_prompt,
|
| 183 |
n_variations,
|
| 184 |
+
gen_model_name,
|
| 185 |
**llm_kwargs,
|
| 186 |
+
)[: n_variations + 1]
|
| 187 |
+
tr_q = transform_query(
|
| 188 |
+
tokenizer, model, orig_query.strip(), rewrite_prompt, gen_model_name
|
| 189 |
)
|
| 190 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 191 |
# transformed query might have multiple things to search and tasks to perform depending on user query
|
| 192 |
# recursively get variations for each of the search queries but keep the tasks as is.
|
| 193 |
tasks = []
|
|
|
|
| 201 |
search_result,
|
| 202 |
variants_prompt,
|
| 203 |
n_variations,
|
| 204 |
+
gen_model_name,
|
| 205 |
**llm_kwargs,
|
| 206 |
)
|
| 207 |
)
|
| 208 |
|
|
|
|
|
|
|
|
|
|
| 209 |
# keep the original user query as is (if in case LLM messes up the original query) and pick some after shuffling the rest
|
| 210 |
+
q, qq = queries[0], queries[1:]
|
| 211 |
+
shuffle(qq)
|
| 212 |
+
queries = [q] + sorted(
|
| 213 |
+
set(map(lambda x: str.lower(x).strip(string.punctuation), qq[:n_variations]))
|
| 214 |
+
)
|
| 215 |
+
tasks = sorted(set(map(lambda x: str.lower(x).strip(string.punctuation), tasks)))
|
| 216 |
|
| 217 |
return queries, tasks
|
| 218 |
|
| 219 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
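Illustratively, `aggregate_queries_and_tasks` returns something shaped like the following; the exact strings depend entirely on the LLM, so these values are made up:

    queries = [
        "summarize the ebitda trend",                    # original query, always kept first
        "how has ebitda changed over recent quarters",   # deduped, lower-cased LLM variants
        "what is the current ebitda growth trend",
    ]
    tasks = ["summarize"]  # actions the rewrite prompt separated from the search part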
 class HyDeRAGFusion:
     def __init__(
         self,
         embed_model: str,
         generator_llm_model: str,
+        embed_batch_size: int = 8,
     ):
         self.embed_batch_size = embed_batch_size
+        self.gen_model_name = generator_llm_model
         self.embedder, self.tok, self.gen = load_models(
+            embed_model, generator_llm_model
         )
+        self.text_splitter = MarkdownTextSplitter(chunk_overlap=450, chunk_size=3000)
         with open(PROMPTS_FILEPATH) as fl:
             self.prompts = yaml.safe_load(fl)

+    def get_embeddings(self, texts, **kwargs):
+        log.debug(f"batch of {len(texts)} texts, batch size {self.embed_batch_size}")
+        return self.embedder.encode(
+            texts, batch_size=self.embed_batch_size, normalize_embeddings=True, **kwargs
         )
+
+    @lru_cache(maxsize=2)
+    def preprocess_pdfs(self, pdfs, **data_load_kwargs):
+        log.debug(f"\n\n{'@@@@' * 20}\n\n preprocessing {pdfs=}")
+        empty_cache()
+        self.dataset = build_corpus(pdfs, self.text_splitter, **data_load_kwargs)
+        empty_cache()
+        self.dataset = self.dataset.map(
+            lambda x: {
+                "embeddings": self.get_embeddings(
+                    x["chunk"], prompt_name="query", show_progress_bar=False
+                )
+            },
+            batched=True,
             batch_size=self.embed_batch_size,
+        )
+        empty_cache()
+        self.dataset.add_faiss_index(
+            "embeddings", metric_type=faiss.METRIC_INNER_PRODUCT
         )

+    def get_filtered_entries(self, idxs):
+        # We need to drop the index before filtering/selecting the desired indices and re-add it later.
+        # Since it's FAISS and we index very little data, the cost is not noticeable.
+        self.dataset.drop_index("embeddings")
+        entries = self.dataset.select(idxs)
+        self.dataset.add_faiss_index(
+            "embeddings", metric_type=faiss.METRIC_INNER_PRODUCT
+        )
+        return entries
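The `MarkdownTextSplitter` configured in `__init__` above caps chunks at 3000 characters with a 450-character overlap; a quick standalone check of that behaviour:

    from langchain_text_splitters import MarkdownTextSplitter

    splitter = MarkdownTextSplitter(chunk_size=3000, chunk_overlap=450)
    chunks = splitter.split_text("# Report\n\n" + "One markdown sentence. " * 400)
    print(len(chunks), max(len(c) for c in chunks))  # every chunk stays within 3000 chars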
def retrieve(
|
| 273 |
+
self, query, n_variants=3, top_k_per_variant=5, top_k_retrieve=3, **llm_kwargs
|
| 274 |
):
|
|
|
|
|
|
|
| 275 |
queries, tasks = aggregate_queries_and_tasks(
|
| 276 |
self.tok,
|
| 277 |
self.gen,
|
|
|
|
| 279 |
self.prompts["rewrite"],
|
| 280 |
self.prompts["variants"],
|
| 281 |
n_variants,
|
| 282 |
+
self.gen_model_name,
|
| 283 |
**llm_kwargs,
|
| 284 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 285 |
hyde_docs = generate_text(
|
| 286 |
+
self.tok,
|
| 287 |
+
self.gen,
|
| 288 |
+
queries,
|
| 289 |
+
self.prompts["hyde"],
|
| 290 |
+
self.gen_model_name,
|
| 291 |
+
**llm_kwargs,
|
| 292 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 293 |
chunks = []
|
| 294 |
for hyde_doc in hyde_docs:
|
| 295 |
chunks.extend(self.text_splitter.split_text(hyde_doc))
|
| 296 |
+
q_emb = self.get_embeddings(chunks)
|
| 297 |
+
matches = self.dataset.get_nearest_examples_batch(
|
| 298 |
+
"embeddings", q_emb, top_k_per_variant
|
| 299 |
)
|
| 300 |
+
indices = [match["id"] for match in matches.total_examples]
|
| 301 |
+
top_idxs = reciprocal_rank_fusion(indices, top_k_retrieve)
|
| 302 |
+
return top_idxs, tasks
|
| 303 |
|
+    def answer(self, query, idxs, tasks, max_ctx_chars=32768):
         total, text, prompt_length = 0, "", 10000
         sep = "\n\n-----\n\n"
+        tasks = ", ".join(tasks) if tasks else ""
+        log.debug("filtering entries")
+        entries = self.get_filtered_entries(idxs)
+        for chunk in entries["chunk"]:
+            ctx = f"{sep}\n\n{chunk}"
             if total + len(ctx) + len(tasks) + len(sep) + prompt_length > max_ctx_chars:
+                log.warning("context overflow")
                 break

             text += ctx
             total += len(ctx)

         text += f"{sep}{tasks}"

         instruction = "go ahead and answer!"
         user_query = f"\nq: {query}\n\nctx:{text}" + f"\n\n{instruction}\n\n"
         resp = generate_text(
             self.tok,
             self.gen,
             user_query,
             self.prompts["final_answer"],
+            self.gen_model_name,
         )

+        sources = ""
+        for idx, entry in enumerate(entries):
+            source = f'<h2 style="color: cyan;">Source {idx + 1} :: {entry["file"]}:{entry["chunk_id"]}</h2>'
+            sources += f"{sep}{source}\n\n{entry['chunk']}"

+        return resp, sources.replace("```", "`")


+def initial_setup(model_combo_key):
+    models = MODEL_COMBOS[model_combo_key]
+    hrf = HyDeRAGFusion(models["embed_model"], models["gen_model"])
+    hrf._model_combo_key = model_combo_key
+    return hrf


+HRF = None
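Putting the pieces together, a rough end-to-end sketch; the combo key and file name are placeholders, and `preprocess_pdfs` takes a tuple because `lru_cache` needs hashable arguments:

    hrf = initial_setup("linux")
    hrf.preprocess_pdfs(("annual_report.pdf",), fast_extract=False)

    question = "What was the EBITDA last quarter?"
    top_idxs, tasks = hrf.retrieve(question, n_variants=3, top_k_per_variant=5, top_k_retrieve=3)
    resp, sources = hrf.answer(question, top_idxs, tasks)
    print(resp)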
 @click.command(context_settings=dict(show_default=True))
 @click.option(
+    "--pdfs",
+    multiple=True,
+    type=click.Path(exists=True),
+    help="list of PDF filepaths to extract text from",
+)
+@click.option("--query", help="user query")
+@click.option(
+    "--model-combo-key",
+    type=click.Choice(["linux", "HF-mid"]),
+    default="linux",
+    help="embedder and generator llm models combination to load (see config.py)",
 )
 @click.option(
+    "--fast-extract/--no-fast-extract",
+    default=False,
+    help="Extract markdown text quickly (uses pymupdf if set, else docling if available)",
 )
 @click.option("--n-variants", default=3, help="no. of query variants")
 @click.option(
@@ … @@
 @click.option(
     "--top-k-retrieve", default=3, help="top `k` chunks to retrieve after RRF"
 )
+@click.option("--temperature", default=0.7, help="LLM model temperature")
+def ask(
+    pdfs,
+    query,
+    model_combo_key,
+    fast_extract,
     n_variants,
     top_k_per_variant,
     top_k_retrieve,
     temperature,
 ):
+    pdfs = tuple(sorted(pdfs))
+    log.debug(
+        f"{pdfs=}, {query=}, {model_combo_key=}, {fast_extract=}, {n_variants=}, {top_k_per_variant=}, {top_k_retrieve=}, {temperature=}"
     )
+    global HRF
+    if HRF is None or HRF._model_combo_key != model_combo_key:
+        if HRF is not None:
+            log.debug("deleting HRF object")
+            del HRF
+            log.debug("emptying cache")
+            empty_cache()
+        log.debug(f"\n\n{'=:-:' * 20}\n\n initializing")
+        start = time()
+        HRF = initial_setup(model_combo_key)
+        end = time()
+        msg = f"init took {(end - start):.1f} seconds"
+        log.debug(msg)

+    start = time()
+    if pdfs:
+        HRF.preprocess_pdfs(pdfs, fast_extract=fast_extract)

+    if query and query.strip():
+        top_idxs, tasks = HRF.retrieve(
+            query.strip(),
             int(n_variants),
             int(top_k_per_variant),
             int(top_k_retrieve),
+            temperature=temperature,
         )
+
+        log.debug("answering")
+        resp, sources = HRF.answer(query, top_idxs, tasks)
         end = time()
+        final_response = f"\nSearch took {(end - start):.1f} seconds\n\n{resp}{sources}"
+        log.debug(final_response)
+        return final_response
+
+    return ""


 if __name__ == "__main__":
+    ask()
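Since `ask` returns the final response rather than printing it, one way to exercise the command programmatically is click's test runner (the PDF path and query are placeholders):

    from click.testing import CliRunner
    from src.main import ask

    runner = CliRunner()
    # standalone_mode=False makes click hand back ask()'s return value
    result = runner.invoke(
        ask,
        ["--pdfs", "annual_report.pdf", "--query", "What was the EBITDA last quarter?"],
        standalone_mode=False,
    )
    print(result.return_value)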
src/prompts.yaml
CHANGED

@@ -1,7 +1,7 @@
 rewrite: >
   You are a professional content writer and editor who deeply pays attention to user's query & intention. You strictly reply ONLY in JSON.

-  Your mission is to analyze user input and intention and transform it

   The user input can be a query or a statement. There can be multiple of them. And sometimes the input also contains actions to be taken depending on the query/statement.

@@ -109,7 +109,7 @@ rewrite: >
   -----------

-  Remember to

   user:
@@ -117,10 +117,10 @@ rewrite: >

 variants: >
   You are a multilingual professional content writer and editor who deeply pays attention to user's query & intention.
-
-  Your goal is to transform the given query into diverse search queries keeping the user's context & intention in mind.

-
   You MUST respond with only what's asked. Avoid explanations or verbose information of your actions.

@@ -130,59 +130,59 @@ variants: >
   --------------

-  user:

   assistant:
-
-
-
-
-

-  user:

   assistant:
-
-
-
-

-  user:

   assistant:
-
-
-

-  user:

   assistant:
-
-
-
-
-  "Welche Faktoren könnten die Veränderungen des EBITDA beeinflussen?",

-  user:

   assistant:
-
-
-
-

-  user:

   assistant:
-
-
-
-

   --------------

@@ -193,14 +193,14 @@ variants: >
 hyde: >
   You are a professional editor at a prestigious international media organization.
   Given user's query, write a neutral, self-contained paragraph ABSOLUTELY GROUNDED IN FACTS and established sources. Avoid fluff. Include likely key terms and entities. 120-180 words.
-  You write content in the same language as the user query

   Examples:
   --------

-  user: Quelle est le niveau actuel de l'engagement de

-  assistant:

   user: BMW Group expansion into southern Asia
@@ -223,8 +223,8 @@ final_answer: >
   You are a journalist at a media organization. Your main specializations include fact checking, accurate information retrieval from sources among others.

   YOU ALWAYS ADHERE TO THE FOLLOWING INSTRUCTIONS:
-  - When given a user query `q` and a context `ctx`, your goal is to answer `q` FROM ONLY WITHIN the given context `ctx
-  - You reply in the same language as user
   - You do not state anything that is not present within `ctx`. NEVER GUESS.
   - ALWAYS GROUND YOUR TRUTH based only on what was provided within the context `ctx`.
   - If you believe `q` has nothing to do with `ctx`, simply state "I don't know" (or its equivalent in the user query language) instead of guessing.
@@ -238,7 +238,7 @@ final_answer: >
   ctx:
   -----

-

   go ahead and answer!

@@ -267,8 +267,7 @@ final_answer: >
   Zur Erhöhung der Transparenz und Effektivität kooperiert sie verstärkt mit öffentlichen und privaten Organisationen.
   Sicherheit ist durch das „Security by Design“-Prinzip fester Bestandteil im Entwicklungsprozess neuer Produkte und Informationssysteme.
   Es werden intensive und obligatorische digitale Sicherheitstests durchgeführt, um Schwachstellen systematisch aufzudecken.
-  Alle sicherheitsrelevanten Abteilungen wurden unter dem Dach der Deutschen Telekom Security zusammengeführt. Mit diesem End-to-End-Sicherheitsportfolio zielt das Unternehmen darauf ab, Marktanteile zu gewinnen und im Rahmen der Megatrends Internet der Dinge und Industrie 4.0 neue Sicherheitskonzepte zu etablieren.
-  Zudem wird das Partner-Ökosystem im Bereich Cybersicherheit kontinuierlich ausgebaut, und auf der Unternehmenswebsite wird fortlaufend über aktuelle Entwicklungen in Datenschutz und Datensicherheit berichtet.

   =====
 rewrite: >
   You are a professional content writer and editor who deeply pays attention to user's query & intention. You strictly reply ONLY in JSON.

+  Your mission is to analyze user input and intention and transform it **IN THE SAME LANGUAGE** as the input query to make it search engine optimised by determining the appropriate context.

   The user input can be a query or a statement. There can be multiple of them. And sometimes the input also contains actions to be taken depending on the query/statement.

@@ … @@
   -----------

+  Remember to **NOT** answer the user's question but only transform it and **IN THE SAME LANGUAGE** as given.

   user:
@@ … @@

 variants: >
   You are a multilingual professional content writer and editor who deeply pays attention to user's query & intention.

+  Your goal is to transform the given query into at least {n} diverse queries as long as they're related to the topic of the original query.
+
+  You MUST ALWAYS respond **IN THE SAME LANGUAGE** as the user.

   You MUST respond with only what's asked. Avoid explanations or verbose information of your actions.

@@ … @@
   --------------

+  user: EBITDA last quarter

   assistant:
+  What was the EBITDA for the quarter ending March?
+  How has the company's performance changed from the previous quarter?
+  What is the current trend of EBITDA growth over the past few quarters?
+  Which companies have had similar EBITDA performance recently?
+  What factors might be influencing the changes in EBITDA?

+  user: what are the growing concerns of the middle class?

   assistant:
+  How are the economic challenges impacting the middle class?
+  How are middle-class people coping with social and political pressures?
+  What are the long-term implications for the middle class's well-being?
+  What are the current trends and future prospects for the middle class?

+  user: Capital of France

   assistant:
+  What is the capital city of France?
+  How is Paris known internationally?
+  Where is the capital of France located?

+  user: Wer ist der aktuelle CEO dieses Unternehmens?

   assistant:
+  Wer ist der Geschäftsführer dieses Unternehmens?
+  Wie sieht das Organigramm des Unternehmens aus?
+  Wann wurde der aktuelle CEO ernannt?
+  Wer war in der Vergangenheit CEO dieses Unternehmens?

+  user: Wie reitet man ein Pferd?

   assistant:
+  Wo fängt man an, ein Pferd zu lernen?
+  Wo kann man Reiten lernen?
+  Grundlagen des Reitens
+  Wo kann ich Reiten lernen?

+  user: Effets de la gravité dans l'espace

   assistant:
+  Comment fonctionne la gravité ?
+  Qu'est-ce que la gravité ?
+  Quelle sensation procure l'apesanteur dans l'espace ?
+  Comment la gravité affecte-t-elle l'espace-temps ?
+  L'attraction gravitationnelle des objets dans l'espace

   --------------

@@ … @@
 hyde: >
   You are a professional editor at a prestigious international media organization.
   Given user's query, write a neutral, self-contained paragraph ABSOLUTELY GROUNDED IN FACTS and established sources. Avoid fluff. Include likely key terms and entities. 120-180 words.
+  You MUST ALWAYS write content in the same language as the user query.

   Examples:
   --------

+  user: Quelle est le niveau actuel de l'engagement de Lego avec le développement durable

+  assistant: Lego, la plus grande entreprise d'électricité et d'information au monde, a lancé un programme d'engagement durable en 2015 pour atteindre ses objectifs de développement durable. Ce programme visait à réduire son émission de gaz à effet de serre (CO2) par rapport aux niveaux de 1990, ainsi que les émissions de déchets et de produits chimiques. Le groupe a mis en place des politiques d'énergie basées sur la transition verte, comme la production de biogaz, l'utilisation de technologies solaires et l'innovation dans la gestion des ressources naturelles. L'objectif principal était de se démarquer du marché mondial en termes de performance énergétique et environnementale.

   user: BMW Group expansion into southern Asia
@@ … @@
   You are a journalist at a media organization. Your main specializations include fact checking, accurate information retrieval from sources among others.

   YOU ALWAYS ADHERE TO THE FOLLOWING INSTRUCTIONS:
+  - When given a user query `q` and a context `ctx`, your goal is to answer `q` FROM ONLY WITHIN the given context `ctx`.
+  - You MUST ALWAYS reply in the same language as the user's.
   - You do not state anything that is not present within `ctx`. NEVER GUESS.
   - ALWAYS GROUND YOUR TRUTH based only on what was provided within the context `ctx`.
   - If you believe `q` has nothing to do with `ctx`, simply state "I don't know" (or its equivalent in the user query language) instead of guessing.
@@ … @@
   ctx:
   -----

+  This will enable us to guarantee transparency and comparability in the validation and measurement of our targets and, at the same time, ensure they are in line with the latest scientific findings. ↗ Carbon emissions ↗ Control parameters such as ↗ carbon emissions over the entire product life cycle are important ↗ Performance indicators during the development phase of our vehicle projects. The Board of Management receives and discusses a status report on sustainability every quarter and derives appropriate measures as required. The BMW Group is actively working on numerous projects and initiatives to improve the framework conditions for electromobility, including the expansion of charging infrastructure on a broad basis. The ambitious goals of the Paris Climate Agreement are designed to tackle climate change in the transport sector, requiring a combination of modern drive technologies that are closely aligned with customer needs and different mobility requirements around the world. In addition to all-electric models, plug-in hybrids and modern combustion engine technologies also make an important contribution to the reduction of global CO2 emissions. The BMW Group is also continuously forging ahead with its work with hydrogen. ↗ Products ESG criteria are built into individual market strategies across our global organisation. Best practices in the fields of environmental protection, social sustainability, corporate citizenship and gov

   go ahead and answer!

@@ … @@
   Zur Erhöhung der Transparenz und Effektivität kooperiert sie verstärkt mit öffentlichen und privaten Organisationen.
   Sicherheit ist durch das „Security by Design“-Prinzip fester Bestandteil im Entwicklungsprozess neuer Produkte und Informationssysteme.
   Es werden intensive und obligatorische digitale Sicherheitstests durchgeführt, um Schwachstellen systematisch aufzudecken.
+  Alle sicherheitsrelevanten Abteilungen wurden unter dem Dach der Deutschen Telekom Security zusammengeführt. Mit diesem End-to-End-Sicherheitsportfolio zielt das Unternehmen darauf ab, Marktanteile zu gewinnen und im Rahmen der Megatrends Internet der Dinge und Industrie 4.0 neue Sicherheitskonzepte zu etablieren. Zudem wird das Partner-Ökosystem im Bereich Cybersicherheit kontinuierlich ausgebaut, und auf der Unternehmenswebsite wird fortlaufend über aktuelle Entwicklungen in Datenschutz und Datensicherheit berichtet.

   =====
src/utils.py
ADDED

@@ -0,0 +1,148 @@
+import gc
+import json
+import torch
+import pymupdf, pymupdf4llm
+
+from ast import literal_eval
+from pathlib import Path
+from datasets import Dataset
+from collections import defaultdict
+from docling.document_converter import DocumentConverter
+
+from src.config import log
+
+
+def empty_cache():
+    gc.collect()
+    if torch.cuda.is_available():
+        torch.cuda.empty_cache()
+    elif torch.mps.is_available():
+        torch.mps.empty_cache()
+
+
| 25 |
+
if backend == "pymupdf":
|
| 26 |
+
if not markdown:
|
| 27 |
+
with pymupdf.open(file, filetype="pdf") as doc:
|
| 28 |
+
return "\n".join(page.get_text(**kwargs) for page in doc)
|
| 29 |
+
else:
|
| 30 |
+
log.debug("\n\n using pymupdf4llm \n\n")
|
| 31 |
+
return pymupdf4llm.to_markdown(file, show_progress=True, **kwargs)
|
| 32 |
+
|
| 33 |
+
elif backend == "docling":
|
| 34 |
+
converter = DocumentConverter(allowed_formats=["pdf"])
|
| 35 |
+
doc = converter.convert(file, **kwargs).document
|
| 36 |
+
res = doc.export_to_markdown() if markdown else doc.export_to_text()
|
| 37 |
+
del converter, doc
|
| 38 |
+
empty_cache()
|
| 39 |
+
return res
|
| 40 |
+
|
| 41 |
+
|
| 42 |
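For orientation, the two backends called on the same file (the path is a placeholder):

    fast_md = extract_text("report.pdf", markdown=True, backend="pymupdf")
    slow_md = extract_text("report.pdf", markdown=True, backend="docling")  # slower, usually more accurate
    print(len(fast_md), len(slow_md))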
+
def _load_pdf_sync(file, markdown=True, fast=False, **kwargs):
|
| 43 |
+
"""Synchronous PDF loading function for thread pool execution"""
|
| 44 |
+
text = extract_text(
|
| 45 |
+
file,
|
| 46 |
+
markdown,
|
| 47 |
+
backend="docling"
|
| 48 |
+
if ((not fast) and (torch.cuda.is_available() or torch.mps.is_available()))
|
| 49 |
+
else "pymupdf",
|
| 50 |
+
**kwargs,
|
| 51 |
+
)
|
| 52 |
+
|
| 53 |
+
return (Path(file).stem, text)
|
| 54 |
+
|
| 55 |
+
|
+def load_pdfs(files, markdown=True, fast_extract=False, **kwargs):
+    """
+    Load multiple PDF files
+
+    Args:
+        files: PDF filepaths
+        markdown: whether to extract text in markdown
+        fast_extract: whether to use pymupdf to extract text in markdown
+    Returns:
+        list: List of tuples containing (filename, extracted_text)
+    """
+    # # Use ThreadPoolExecutor to run synchronous operations concurrently
+    # loop = asyncio.get_event_loop()
+
+    # # Create executor with limited workers
+    # with ThreadPoolExecutor(max_workers=max_concurrence) as executor:
+    #     # Submit all PDF processing tasks
+    #     futures = [
+    #         loop.run_in_executor(executor, _load_pdf_sync, file, markdown, fast_extract, **kwargs) for file in files if file is not None
+    #     ]
+
+    #     results = await asyncio.gather(*futures, return_exceptions=True)
+
+    #     valid_results = [result for result in results if not isinstance(result, Exception)]
+
+    #     log.debug(f"Successfully processed {len(valid_results)} out of {len(files)} PDFs")
+    #     return valid_results
+
+    results = []
+    for file in files:
+        results.append(_load_pdf_sync(file, markdown, fast_extract, **kwargs))
+    return results
+
+
+def find_think_tag_in_each_row(tensor):
+    # look for the `</think>` tag (token id 151668 in the Qwen3 vocabulary)
+    res = dict((tensor == 151668).nonzero().tolist())
+    if not res:
+        return [0] * len(tensor)
+    idxs = []
+    for idx in range(len(tensor)):
+        idxs.append(res.get(idx, -1))
+    return [x + 1 for x in idxs]
+
+
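A tiny check of the splitting logic, assuming the Qwen3 token id above; the returned offsets mark where post-thinking content starts in each row:

    import torch

    rows = torch.tensor([[5, 151668, 9, 9], [7, 7, 7, 7]])
    print(find_think_tag_in_each_row(rows))  # -> [2, 0]; row 1 has no tag, so it is all content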
+def build_corpus(pdfs, text_splitter, **load_pdf_kwargs):
+    texts = load_pdfs(pdfs, **load_pdf_kwargs)
+    corpus_with_meta = []
+    _id = 0
+    for file_name, raw_text in texts:
+        chunks = text_splitter.split_text(raw_text)
+        for idx, chunk in enumerate(chunks):
+            corpus_with_meta.append(
+                {
+                    "id": _id,
+                    "file": Path(file_name).stem,
+                    "chunk_id": idx,
+                    "chunk": chunk,
+                }
+            )
+            _id += 1
+    return Dataset.from_list(corpus_with_meta)
+
+
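The resulting `datasets.Dataset` carries one row per chunk (the file path is a placeholder):

    from langchain_text_splitters import MarkdownTextSplitter

    ds = build_corpus(("report.pdf",), MarkdownTextSplitter(chunk_size=3000, chunk_overlap=450))
    print(ds)  # Dataset({features: ['id', 'file', 'chunk_id', 'chunk'], num_rows: ...})
    print(ds[0]["file"], ds[0]["chunk_id"], ds[0]["chunk"][:80])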
+def reciprocal_rank_fusion(indices, top_k=3, denom=50):
+    scores = defaultdict(int)
+    for row in indices:
+        for rank, idx in enumerate(row):
+            scores[idx] += 1 / (rank + denom)
+    results = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
+    return [idx for idx, _ in results]
+
+
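A worked example of the fusion: three ranked lists of chunk ids, fused with the default denom=50. Id 7 scores 1/51 + 1/50 + 1/50 and wins over id 12 at 1/50 + 1/51 + 1/52:

    ranked = [
        [12, 7, 3],  # ids ranked by variant 1
        [7, 12, 9],  # ids ranked by variant 2
        [7, 3, 12],  # ids ranked by variant 3
    ]
    print(reciprocal_rank_fusion(ranked, top_k=2))  # -> [7, 12]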
+def clean_rewrite_resp(resp):
+    try:
+        resp = json.loads(resp)  # Parse JSON
+    except json.JSONDecodeError:
+        try:
+            resp = literal_eval(resp)  # Fallback parse
+        except Exception:
+            pass  # Keep resp as-is if both fail
+
+    # Ensure resp is a string before strip and slicing
+    if isinstance(resp, str):
+        resp = resp.strip()
+        if resp:
+            start = resp.find("{")
+            if start != -1:
+                end = resp[::-1].find("}")
+                if end != -1:
+                    trimmed = resp[start : len(resp) - end]
+                    if trimmed != resp:  # guard: avoid infinite recursion on unparseable input
+                        return clean_rewrite_resp(trimmed)
+    return resp