juliaturc committed
Commit 8699925 · Parent(s): 9802b75

Retrieval benchmark (#39)


* Fixes for previous PR

* Add retriever.py

* Add retrieval benchmark

* Add --retrieval-alpha flag

* Fit BM25 to the current corpus and add Voyage embeddings

* Add benchmark README.

* Nits to retrieve.py

* Add Voyage reranker.

* Update README to reflect Voyage embeddings and reranker.

* Address reviewer comments

README.md CHANGED
@@ -72,22 +72,28 @@ pip install git+https://github.com/Storia-AI/sage.git@main
 <details>
 <summary><strong>:cloud: Using external providers (higher quality)</strong></summary>

-1. We support <a href="https://openai.com/">OpenAI</a> for embeddings (they have a super fast batch embedding API) and <a href="https://www.pinecone.io/">Pinecone</a> for the vector store. So you will need two API keys:
+1. For embeddings, we support <a href="https://platform.openai.com/docs/guides/embeddings">OpenAI</a> and <a href="https://docs.voyageai.com/docs/embeddings">Voyage</a>. According to [our experiments](benchmarks/retrieval/README.md), OpenAI offers better quality. Their batch API is also faster, with more generous rate limits. Export the API key of the desired provider:

 ```
-export OPENAI_API_KEY=...
-export PINECONE_API_KEY=...
+export OPENAI_API_KEY=... # or
+export VOYAGE_API_KEY=...
 ```

-2. Create a Pinecone account. Export the desired index name (if it doesn't exist yet, we'll create it):
+2. We use <a href="https://www.pinecone.io/">Pinecone</a> for the vector store, so you will need an API key:
+
+```
+export PINECONE_API_KEY=...
+```
+
+If you want to reuse an existing Pinecone index, specify it. Otherwise we'll create a new one called `sage`.
 ```
 export PINECONE_INDEX_NAME=...
 ```

-3. For reranking, we use <a href="https://cohere.com/rerank">Cohere</a> by default, but you can also try rerankers from <a href="https://developer.nvidia.com/blog/enhancing-rag-pipelines-with-re-ranking/">NVIDIA</a> or <a href="https://jina.ai/reranker/">Jina</a>:
+3. For reranking, we support <a href="https://developer.nvidia.com/blog/enhancing-rag-pipelines-with-re-ranking/">NVIDIA</a>, <a href="https://docs.voyageai.com/docs/reranker">Voyage</a>, <a href="https://cohere.com/rerank">Cohere</a>, and <a href="https://jina.ai/reranker/">Jina</a>. According to [our experiments](benchmarks/retrieval/README.md), NVIDIA performs best. Export the API key of the desired provider:
 ```
-export COHERE_API_KEY=... # or
 export NVIDIA_API_KEY=... # or
+export VOYAGE_API_KEY=... # or
+export COHERE_API_KEY=... # or
 export JINA_API_KEY=...
 ```
benchmarks/retrieval/README.md ADDED
@@ -0,0 +1,132 @@
# Chat-with-your-codebase: Retrieval Benchmark

When using this repository (which allows you to chat with your codebase in two commands), you are indirectly making a series of choices that greatly influence the quality of your AI copilot: chunking strategy, embeddings, retrieval algorithm, rerankers, etc.

Our role as maintainers is two-fold: to give you options and flexibility, but also to find good defaults. We're not here just to dump code on the Internet. We're here to *make it work*.

To make progress, we need a ladder to climb. That's why we partnered with our friends at [Morph Labs](https://morph.so) to produce a benchmark that allows us to make informed decisions and measure progress. We will make it public soon, but if you really can't wait, let us know at [founders@storia.ai](mailto:founders@storia.ai).

Here you will find our first learnings enabled by this dataset. We focused on proprietary APIs, but we plan to extend the experiments to open-source models as well.

#### TL;DR
- OpenAI's `text-embedding-3-small` embeddings perform best.
- NVIDIA's reranker outperforms Cohere, Voyage, and Jina.
- Sparse retrieval (e.g. BM25) actively hurts code retrieval if your index contains natural-language files (e.g. Markdown).
- Chunks of 800 tokens are ideal; going smaller yields very marginal gains.
- Going beyond `top_k=25` for retrieval has diminishing returns.

And now, if you want to nerd out, here's a bunch of plots and stats.

## Dataset
Our dataset consists of 1,000 `<question, answer, relevant_documents>` triples that focus on Hugging Face's [Transformers](https://github.com/huggingface/transformers) library.

The dataset was generated artificially and checked for quality by humans (we collaborated with [Morph Labs](https://morph.so)). The questions were designed to require context from 1-3 different Python files in order to be answered correctly.

A sample of 10 instances is provided in [sample.json](sample.json).

### Code Retrieval Benchmark
Here, we use the `<question, relevant_documents>` pairs as a code retrieval benchmark. For instance:
```
- Question:
When developing a specialized model class in the Transformers library, how does `auto_class_update` ensure that the new class's methods are tailored specifically for its requirements while preserving the functionality of the original methods from the base class?

- Relevant documents:
huggingface/transformers/src/transformers/models/auto/auto_factory.py
huggingface/transformers/src/transformers/utils/doc.py
```

#### Why not use an already-established code retrieval benchmark?
Indeed, there are already comprehensive code retrieval benchmarks like [CoIR](https://arxiv.org/abs/2407.02883). In fact, the [CosQA](https://arxiv.org/abs/2105.13239) subset of this benchmark has a format similar to ours (text-to-code retrieval for web queries).

However, we designed our document space to be *an entire codebase*, as opposed to a set of isolated Python functions. A real-world codebase contains a variety of files, including ones that are distracting and get undeservedly selected by the retriever. For instance, dense retrievers tend to prefer short files, and READMEs tend to score high even when irrelevant, since they're written in natural language. Our benchmark is able to surface such behaviors. It also allows us to experiment with a variety of strategies, such as file chunking.

In the rest of this document, we share a few initial learnings enabled by our benchmark.

### Metrics

Throughout this report, we use the following evaluation metrics, as implemented by the [ir-measures](https://ir-measur.es/en/latest/) library:
- [R-Precision](https://ir-measur.es/en/latest/measures.html#rprec): the precision at R, where R is the number of relevant documents for a given query. Since our queries have a variable number of relevant documents (1-3), this is a convenient metric.
- [Precision@1 (P@1)](https://ir-measur.es/en/latest/measures.html#p): reflects how often the document retrieved in first position is actually a golden document. Note that P@3 would be misleading: since not all queries have 3 relevant documents, not even the golden dataset would score 100%.
- [Recall@3 (R@3)](https://ir-measur.es/en/latest/measures.html#r): reflects how many of the golden documents were retrieved by the system. Note that R@1 would be misleading: since a query can have multiple equally relevant documents, not even the golden dataset would score 100%.
- [Mean Reciprocal Rank (MRR)](https://ir-measur.es/en/latest/measures.html#rr): for each query, takes the first golden document and looks up its rank among the retrieved documents. For instance, if the first golden document is retrieved second, the score for that query is 1/2. Note that this metric is somewhat incomplete for our benchmark, since a query can have multiple relevant documents.

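To make these definitions concrete, here is a dependency-free Python sketch on toy data (the actual numbers in this report come from the ir-measures implementations, not this code):

```python
def r_precision(golden, retrieved):
    """Precision among the top-R retrieved docs, where R = number of golden docs."""
    r = len(golden)
    return len(set(retrieved[:r]) & golden) / r

def precision_at_k(golden, retrieved, k):
    """Fraction of the top-k retrieved docs that are golden."""
    return len(set(retrieved[:k]) & golden) / k

def recall_at_k(golden, retrieved, k):
    """Fraction of the golden docs that appear in the top-k retrieved docs."""
    return len(set(retrieved[:k]) & golden) / len(golden)

def reciprocal_rank(golden, retrieved):
    """1/rank of the first golden doc among the retrieved ones (0 if absent)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in golden:
            return 1.0 / rank
    return 0.0

# A query with two equally relevant golden files, like many in our dataset.
golden = {"src/auto_factory.py", "src/doc.py"}
retrieved = ["src/auto_factory.py", "README.md", "src/doc.py"]

print(r_precision(golden, retrieved))        # 0.5: only 1 of the top-2 is golden
print(precision_at_k(golden, retrieved, 1))  # 1.0: the first doc is golden
print(recall_at_k(golden, retrieved, 3))     # 1.0: both golden docs retrieved
print(reciprocal_rank(golden, retrieved))    # 1.0: first retrieved doc is golden
```

Note how R-Precision penalizes the irrelevant README even though both golden files were eventually retrieved.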

## Embeddings
:classical_building: **Verdict**: Use OpenAI's `text-embedding-3-small` embeddings.

Today, most retrieval systems are *dense*. They pre-compute document *embeddings* and store them in an index. At inference time, queries are also mapped to the same embedding space. In this world, retrieval is equivalent to finding the nearest neighbors of the query embedding in the index.

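Concretely, dense retrieval reduces to a nearest-neighbor search. A minimal NumPy sketch with toy 3-dimensional vectors (real embeddings such as `text-embedding-3-small` have 1,536 dimensions):

```python
import numpy as np

def top_k_nearest(query_emb, doc_embs, k):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = d @ q                  # cosine similarity of each document to the query
    return np.argsort(-sims)[:k]  # indices of the k most similar documents

# Toy embeddings for three documents and one query.
docs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.9, 0.1, 0.0]])
query = np.array([1.0, 0.05, 0.0])
print(top_k_nearest(query, docs, k=2))  # [0 2]: docs 0 and 2 point the same way as the query
```

In production, the index lookup is handled by a vector store (Pinecone, in our setup) rather than a brute-force scan.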

To this end, the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) (Massive Text Embeddings Benchmark) offers a comprehensive comparison of open-source embeddings.

To complement this, we compared proprietary embedding APIs from [OpenAI](https://platform.openai.com/docs/guides/embeddings) and [Voyage](https://docs.voyageai.com/docs/embeddings). The main advantage of using these providers (in addition to quality) is that they offer *batch* embedding APIs, so you can get an entire repository indexed relatively quickly without the headache of hosting your own embedding models (you can do so with a simple `sage-index $GITHUB_REPO` command).

![embeddings-plot](assets/embeddings.png)

The plot above shows the performance of the three embedding models from OpenAI (`text-embedding-3-small`, `text-embedding-3-large`, `text-embedding-ada-002`) and the code-specific embeddings from Voyage (`voyage-code-2`).

#### Experiment settings

- File chunks of <= 800 tokens;
- Dense retriever (nearest neighbors according to cosine distance of embeddings);
- Retrieved `top_k=25` documents;
- Reranked documents using the [NVIDIA re-ranker](https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/using-reranking.html) and selected `top_k=3`.

#### Results

- Across most evaluation metrics, OpenAI's `text-embedding-3-small` performs best.
- It's remarkable that the `text-embedding-3-large` embeddings don't perform better, despite being double the size (3072 vs. 1536 dimensions).
- The older `text-embedding-ada-002` embeddings trail last with a huge performance gap, so this is your call to update your pipeline if you haven't already.


## Rerankers
:classical_building: **Verdict**: Use NVIDIA's reranker.

In a world with infinitely fast compute, we would perform retrieval by passing each `<query, document>` pair through a Transformer, allowing all the query tokens to attend to all the document tokens. However, this is prohibitively expensive.

In practice, all documents are embedded independently and stored in a vector database. Most retrieval systems are two-staged: (1) embed the query independently and find its top N nearest-neighbor documents, and (2) re-encode all top N `<query, document>` pairs and select the top K scoring ones. The second stage is called *reranking*.

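The two-stage pipeline can be sketched as follows. The toy scoring functions below are hypothetical stand-ins for the embedding model and the reranker (in our experiments, OpenAI embeddings and the NVIDIA reranker):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def two_stage_retrieve(query, docs, embed, rerank_score, n=25, k=3):
    """Stage 1: cheap nearest-neighbor search over independent embeddings.
    Stage 2: expensive joint scoring (reranking) of the surviving candidates."""
    q = embed(query)
    candidates = sorted(docs, key=lambda d: -dot(q, embed(d)))[:n]
    return sorted(candidates, key=lambda d: -rerank_score(query, d))[:k]

# Toy stand-ins: embed by counts over a tiny vocabulary, rerank by word overlap.
def embed(text):
    vocab = ["pipeline", "model", "readme", "install"]
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def rerank_score(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

docs = ["pipeline model code", "readme install notes", "pipeline utils"]
print(two_stage_retrieve("pipeline model", docs, embed, rerank_score, n=2, k=1))
# ['pipeline model code']
```

The key property: the reranker sees the query and document *together*, so it can catch relevance signals that independent embeddings miss.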

![rerankers-plot](assets/rerankers.png)

While the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) compares *open-source* embedding models based on their ability to rerank documents, we conducted experiments on the most popular *proprietary* reranking APIs: [NVIDIA](https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/using-reranking.html), [Voyage](https://docs.voyageai.com/docs/reranker), [Cohere](https://cohere.com/rerank), and [Jina](https://jina.ai/reranker/).

#### Experiment settings
- File chunks of <= 800 tokens;
- Dense retriever using OpenAI's `text-embedding-3-small` model;
- Retrieved `top_k=25` documents;
- Reranked documents and selected `top_k=3`.

#### Results
- Across all evaluation metrics, the highest-performing rerankers are, in this order: [NVIDIA](https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/using-reranking.html), [Voyage](https://docs.voyageai.com/docs/reranker), [Cohere](https://cohere.com/rerank), and [Jina](https://jina.ai/reranker/).
- Not using a reranker at all completely tanks performance.


## Retrieval: Sparse vs. Dense
:classical_building: **Verdict**: Use fully dense retrieval.

So far, we've been experimenting with purely *dense* retrieval. That is, documents are selected solely based on the cosine distance between their embedding and the query embedding.

Before the emergence of deep learning, retrievers used to be *sparse*. Such retrievers (e.g. [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) or [BM25](https://en.wikipedia.org/wiki/Okapi_BM25)) are based on vectors of word counts: the vector of a document has the length of the vocabulary, with each entry showing how many times a token occurs in the document. The term *sparse* comes from the fact that most entries are 0.

Since sparse retrievers rely on exact string matching, one might assume they come in handy when the query contains a relatively unique token (e.g. a class name) that occurs in a small number of documents.

At the intersection of dense and sparse retrievers, *hybrid* retrievers score documents by a weighted average of the dense and sparse scores.

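In other words, a hybrid retriever computes a convex combination of the two scores. This is the kind of knob exposed by this repo's `--retrieval-alpha` flag (the exact flag semantics are the implementation's; the sketch below assumes `alpha = 1.0` means fully dense and that both scores were normalized to a comparable range beforehand):

```python
def hybrid_score(dense_score, sparse_score, alpha):
    """alpha = 1.0 -> purely dense; alpha = 0.0 -> purely sparse (e.g. BM25)."""
    return alpha * dense_score + (1 - alpha) * sparse_score

# A document that the embeddings like (0.8) but BM25 does not (0.2):
print(hybrid_score(0.8, 0.2, alpha=1.0))  # 0.8 (dense only)
print(hybrid_score(0.8, 0.2, alpha=0.5))  # 0.5 (even blend)
print(hybrid_score(0.8, 0.2, alpha=0.0))  # 0.2 (sparse only)
```

As the results below show, on our benchmark the best setting for this blend turned out to be the fully dense end of the dial.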

![retrievers-plot](assets/retrievers.png)

In the experiment above, we compared the three types of retrievers (dense, hybrid, and sparse).

#### Experiment settings
- File chunks of <= 800 tokens;
- For the dense and hybrid retrievers, we used OpenAI's `text-embedding-3-small` model for embeddings;
- Retrieved `top_k=25` documents;
- Reranked documents using the [NVIDIA re-ranker](https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/using-reranking.html) and selected `top_k=3`.

#### Results
Somewhat surprisingly, sparse retrieval actively hurts performance. The reason is that exact string matching favors files written in natural language, which match the token distribution of the query.

The plot below shows what percentage of the retrieved files are in Markdown. The purely sparse retriever chooses a Markdown file 40% of the time! Remember that we designed our questions so that the required context consists of Python files. This doesn't preclude Markdown files from being genuinely helpful for some questions, but surely not to this degree.

![markdown-plot](assets/markdown.png)


## Chunk sizes
:classical_building: **Verdict**: 800 tokens per chunk works well.

The [CodeRag paper](https://arxiv.org/pdf/2406.14497) suggests that the ideal chunk size is somewhere between 200 and 800 tokens. All our experiments above used 800 tokens per chunk. When experimenting with the other end of the spectrum, we saw only very mild improvements from smaller chunks. We believe these marginal gains are not worth the increased indexing time (since we would need to send 4x more requests to the batch embedding APIs).

![chunks-plot](assets/chunks.png)
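For reference, fixed-size chunking can be sketched as a naive token splitter. Real chunkers typically try to respect syntactic boundaries (functions, classes), so treat this as an illustration of the size trade-off only:

```python
def chunk_tokens(tokens, max_tokens=800):
    """Split a file's token sequence into consecutive chunks of at most max_tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

# A 2,000-token file yields chunks of 800, 800, and 400 tokens.
tokens = list(range(2000))
print([len(c) for c in chunk_tokens(tokens)])  # [800, 800, 400]
```

Halving `max_tokens` doubles the number of chunks, which is exactly the indexing-time cost weighed against the marginal quality gains above.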
benchmarks/retrieval/assets/chunks.png ADDED
benchmarks/retrieval/assets/embeddings.png ADDED
benchmarks/retrieval/assets/markdown.png ADDED
benchmarks/retrieval/assets/rerankers.png ADDED
benchmarks/retrieval/assets/retrievers.png ADDED
benchmarks/retrieval/retrieve.py ADDED
@@ -0,0 +1,108 @@
"""Script to run retrieval on a benchmark dataset.

Make sure to `pip install ir_measures` before running this script.
"""

import json
import logging
import os
import time

import configargparse
from ir_measures import MAP, MRR, P, Qrel, R, Rprec, ScoredDoc, calc_aggregate, nDCG

import sage.config
from sage.retriever import build_retriever_from_args

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()
logger.setLevel(logging.INFO)


def main():
    parser = configargparse.ArgParser(
        description="Runs retrieval on a benchmark dataset.", ignore_unknown_config_file_keys=True
    )
    parser.add("--benchmark", required=True, help="Path to the benchmark dataset.")
    parser.add(
        "--gold-field", default="context_files", help="Field in the benchmark dataset that contains the golden file paths."
    )
    parser.add(
        "--question-field", default="question", help="Field in the benchmark dataset that contains the questions."
    )
    parser.add(
        "--logs-dir",
        default=None,
        help="Path where to output predictions and metrics. Optional, since metrics are also printed to the console.",
    )
    parser.add("--max-instances", default=None, type=int, help="Maximum number of instances to process.")

    sage.config.add_config_args(parser)
    sage.config.add_embedding_args(parser)
    sage.config.add_vector_store_args(parser)
    sage.config.add_reranking_args(parser)
    args = parser.parse_args()
    sage.config.validate_vector_store_args(args)

    retriever = build_retriever_from_args(args)

    with open(args.benchmark, "r") as f:
        benchmark = json.load(f)
    if args.max_instances is not None:
        benchmark = benchmark[: args.max_instances]

    golden_docs = []  # List of ir_measures.Qrel objects.
    retrieved_docs = []  # List of ir_measures.ScoredDoc objects.

    for question_idx, item in enumerate(benchmark):
        print(f"Processing question {question_idx}...")

        query_id = str(question_idx)  # Solely needed for the ir_measures library.

        for golden_filepath in item[args.gold_field]:
            # All the file paths in the golden answer are equally relevant for the query (i.e. the order is
            # irrelevant), so we set relevance=1 for all of them.
            golden_docs.append(Qrel(query_id=query_id, doc_id=golden_filepath, relevance=1))

        # Make a retrieval call for the current question.
        retrieved = retriever.invoke(item[args.question_field])
        item["retrieved"] = []
        for doc_idx, doc in enumerate(retrieved):
            # The absolute value of the scores below does not affect the metrics; it merely determines the ranking
            # of the retrieved documents. The key of the score varies depending on the underlying retriever. If
            # there's no score, we use 1/(doc_idx+1), since it preserves the order of the documents.
            score = doc.metadata.get("score", doc.metadata.get("relevance_score", 1 / (doc_idx + 1)))
            retrieved_docs.append(ScoredDoc(query_id=query_id, doc_id=doc.metadata["file_path"], score=score))
            # Update the output dictionary with the retrieved documents.
            item["retrieved"].append({"file_path": doc.metadata["file_path"], "score": score})

        if "answer" in item:
            item.pop("answer")  # Long answers make the output file harder to read.

    print("Calculating metrics...")
    results = calc_aggregate([Rprec, P @ 1, R @ 3, nDCG @ 3, MAP, MRR], golden_docs, retrieved_docs)
    results = {str(key): value for key, value in results.items()}

    if args.logs_dir:
        if not os.path.exists(args.logs_dir):
            os.makedirs(args.logs_dir)

        out_data = {
            "data": benchmark,
            "metrics": results,
            "flags": vars(args),  # For reproducibility.
        }

        output_file = os.path.join(args.logs_dir, f"{time.time()}.json")
        with open(output_file, "w") as f:
            json.dump(out_data, f, indent=4)

    for key in sorted(results.keys()):
        print(f"{key}: {results[key]}")
    if args.logs_dir:
        # Only mention the output file when one was actually written.
        print(f"Predictions and metrics saved to {output_file}")


if __name__ == "__main__":
    main()
benchmarks/retrieval/sample.json ADDED
@@ -0,0 +1,177 @@
+ [
+ {
+ "repo": "huggingface/transformers",
+ "commit": "7bb1c99800d235791dace10305731f377db8077b",
+ "context_files": [
+ "huggingface/transformers/src/transformers/commands/serving.py",
+ "huggingface/transformers/src/transformers/pipelines/__init__.py"
+ ],
+ "question": "With the introduction of a new translation service for \"en_to_es\", how does `serve_command_factory` ensure the server is prepared to handle this specific task efficiently?",
+ "answer": "The `serve_command_factory` function is designed to dynamically configure and deploy a server that can handle a variety of tasks, including complex and parameterized tasks such as language-specific translations. When a new translation service for \"en_to_es\" is introduced, the function ensures efficient handling through several key steps and mechanisms:\n\n1. **Task Specification and Validation**: The function begins by accepting command line arguments (`args`) which specify the task among other parameters. The task, in this case \"translation_en_to_es\", is passed to the `pipeline` function. This function internally calls `check_task` to validate and process the task string. The `check_task` function checks if the task is recognized and supported, and it returns structured data that includes default configurations and any task-specific options necessary for the task.\n\n2. **Dynamic Pipeline Configuration**: The `pipeline` function uses the information returned by `check_task` to dynamically configure a pipeline suitable for the specified task. This includes loading the appropriate model, tokenizer, and any additional components like feature extractors or image processors that are required for the task. For translation tasks, this might involve loading a specific model that is optimized for the language pair \"en_to_es\".\n\n3. **Model and Tokenizer Loading**: Depending on the task and the provided arguments, the `pipeline` function loads the necessary model and tokenizer. If the model or tokenizer is not explicitly specified in the command line arguments, the function will load default models and tokenizers that are appropriate for the task. This ensures that the server is using the most suitable tools for the translation between English and Spanish.\n\n4. 
**Device and Performance Configuration**: The `pipeline` function also handles device allocation (e.g., CPU, GPU) based on the provided arguments, ensuring that the model runs on the optimal hardware. This is crucial for performance, especially when handling complex tasks like real-time translation.\n\n5. **Server Initialization**: Once the pipeline is configured, `serve_command_factory` initializes a `ServeCommand` with the configured pipeline and server parameters such as host, port, and number of workers. This command encapsulates all the necessary settings to run the server efficiently.\n\n6. **Adaptability and Scalability**: The design of `serve_command_factory` and the underlying `pipeline` function allows for easy adaptation to new tasks and changes in task requirements. Adding support for a new translation like \"en_to_es\" primarily involves ensuring that the `check_task` function recognizes the task and that appropriate models and tokenizers are available and correctly configured.\n\nIn summary, `serve_command_factory` ensures that the server is prepared to handle the \"en_to_es\" translation task efficiently by dynamically configuring the server based on the specified task, validating and processing the task to load appropriate models and tools, and setting up the server with optimal hardware and network settings. This process guarantees that the server can adapt to new and complex tasks with minimal manual intervention.",
+ "expected_identifiers": [
+ "args",
+ "check_task",
+ "pipeline",
+ "ServeCommand"
+ ]
+ },
+ {
+ "repo": "huggingface/transformers",
+ "commit": "7bb1c99800d235791dace10305731f377db8077b",
+ "context_files": [
+ "huggingface/transformers/src/transformers/models/esm/modeling_esmfold.py",
+ "huggingface/transformers/src/transformers/models/esm/openfold_utils/feats.py",
+ "huggingface/transformers/src/transformers/models/esm/openfold_utils/tensor_utils.py"
+ ],
+ "question": "In a high-throughput setting where multiple protein structures are processed simultaneously, how does `EsmForProteinFolding.output_to_pdb` ensure accurate and independent structural representation in the resulting PDB files?",
+ "answer": "In a high-throughput setting where multiple protein structures are processed simultaneously, the function `output_to_pdb` ensures accurate and independent structural representation in the resulting PDB files through a combination of specialized tensor operations and careful indexing. This is achieved primarily through the use of the `atom14_to_atom37` function, which itself relies on the `batched_gather` function to correctly map atom positions from a simplified model output to a more detailed atomic representation.\n\n### Detailed Workflow:\n\n1. **Batch Processing and Tensor Operations**:\n - The `output_to_pdb` function begins by converting all tensor data to the CPU and converting them to NumPy arrays for easier manipulation. This step is crucial for performance and compatibility with subsequent operations that may not be optimized for GPU tensors.\n\n2. **Mapping Atom Positions**:\n - The function `atom14_to_atom37` is called within `output_to_pdb`. This function is responsible for expanding the reduced atom representation (14 atoms per amino acid) to a fuller representation (37 atoms per amino acid). It uses the `batched_gather` function to achieve this mapping accurately across potentially multiple proteins in a batch.\n\n3. **Complex Indexing with `batched_gather`**:\n - `batched_gather` plays a critical role in ensuring that the atom positions are mapped correctly. It constructs a complex indexing tuple that combines batch indices with the provided indices for gathering (`inds`). This tuple (`ranges`) includes both batch dimensions and the specific indices where atoms need to be gathered from the `atom14` tensor.\n - The use of `ranges` in `batched_gather` ensures that each protein's data is handled independently, preventing any cross-contamination or mixing of data between different proteins in the batch. This is crucial for maintaining the structural integrity of each protein.\n\n4. 
**Application of Mask and Final Adjustments**:\n - After mapping the positions, `atom14_to_atom37` applies a mask (`batch[\"atom37_atom_exists\"]`) to ensure that only existing atoms are considered. This step further ensures the accuracy of the structural data by zeroing out positions of non-existent atoms, preventing any erroneous data from affecting the structural representation.\n\n5. **Generation of PDB Data**:\n - Back in `output_to_pdb`, for each protein in the batch, an instance of `OFProtein` is created with the mapped atom positions, types, and other relevant data. The `to_pdb` function is then used to convert these protein data into the PDB format, ready for downstream applications like molecular dynamics simulations.\n\n### Conclusion:\n\nThrough the careful use of tensor operations, complex indexing, and data masking, `output_to_pdb` ensures that each protein's structural data is accurately and independently represented in the PDB outputs. This methodical approach is essential in high-throughput settings, where the accuracy and integrity of structural data are paramount for subsequent scientific analysis and applications.",
+ "expected_identifiers": [
+ "atom14_to_atom37",
+ "batched_gather",
+ "batch[\"atom37_atom_exists\"]",
+ "OFProtein"
+ ]
+ },
+ {
+ "repo": "huggingface/transformers",
+ "commit": "7bb1c99800d235791dace10305731f377db8077b",
+ "context_files": [
+ "huggingface/transformers/src/transformers/models/auto/auto_factory.py",
+ "huggingface/transformers/src/transformers/dynamic_module_utils.py"
+ ],
+ "question": "Following a security update in the production environment that limits internet connectivity, how does `_BaseAutoModelClass.from_pretrained` guarantee that the loaded model adheres strictly to the predefined version and settings?",
+ "answer": "In the updated production environment with restricted internet connectivity, `_BaseAutoModelClass.from_pretrained` ensures that the model loaded adheres strictly to the predefined version and settings through several key mechanisms, primarily involving the management of model files and code via a version control system and secure access to private repositories.\n\n### Version Control and Revision Specification\n\nThe function leverages a version control system that allows users to specify exact revisions of the model or code they wish to use. This is evident in the handling of the `revision` parameter in functions like `get_cached_module_file` and `get_class_from_dynamic_module`. The `revision` parameter can accept any identifier allowed by git, such as a branch name, a tag name, or a commit id. This ensures that the exact version of the model or code that was tested and approved in other environments (like development or staging) is the same version being deployed in production.\n\nFor example, in the `get_cached_module_file` function, the `revision` parameter is used to fetch the specific version of a module file from a repository:\n```python\nresolved_module_file = cached_file(\n pretrained_model_name_or_path,\n module_file,\n cache_dir=cache_dir,\n force_download=force_download,\n proxies=proxies,\n resume_download=resume_download,\n local_files_only=local_files_only,\n token=token,\n revision=revision,\n repo_type=repo_type,\n _commit_hash=_commit_hash,\n)\n```\n\n### Secure Access to Private Repositories\n\nThe function can authenticate access to private repositories using tokens, which is crucial when operating in environments with strict security protocols. The `token` parameter, which can be set to a string or `True` (to use the token generated by `huggingface-cli login`), is used to authenticate HTTP requests for remote files. 
This is handled securely in both `get_cached_module_file` and `get_class_from_dynamic_module`, ensuring that only authorized users can access private model files or code.\n\nFor instance, in `get_class_from_dynamic_module`, the `token` parameter is used to authenticate and download the necessary module file:\n```python\nfinal_module = get_cached_module_file(\n repo_id,\n module_file + \".py\",\n cache_dir=cache_dir,\n force_download=force_download,\n resume_download=resume_download,\n proxies=proxies,\n token=token,\n revision=code_revision,\n local_files_only=local_files_only,\n repo_type=repo_type,\n)\n```\n\n### Handling Restricted Internet Connectivity\n\nIn environments with limited internet access, the `local_files_only` parameter becomes particularly important. This parameter, when set to `True`, forces the function to only look for model files locally and not attempt to download them from the internet. This is crucial for ensuring that the model loading process does not fail due to lack of internet access and adheres to strict security protocols that might block external internet connections.\n\n### Conclusion\n\nBy utilizing these mechanisms, `_BaseAutoModelClass.from_pretrained` ensures that the model loaded in a production environment with restricted internet access is exactly the version specified, using secure and authenticated access where necessary. This approach guarantees consistency, reproducibility, and adherence to security protocols across different environments.",
44
+ "expected_identifiers": [
45
+ "revision",
46
+ "token",
47
+ "local_files_only"
48
+ ]
49
+ },
50
+ {
51
+ "repo": "huggingface/transformers",
52
+ "commit": "7bb1c99800d235791dace10305731f377db8077b",
53
+ "context_files": [
54
+ "huggingface/transformers/src/transformers/models/auto/auto_factory.py",
55
+ "huggingface/transformers/src/transformers/utils/doc.py"
56
+ ],
57
+ "question": "When developing a specialized model class in the Transformers library, how does `auto_class_update` ensure that the new class's methods are tailored specifically for its requirements while preserving the functionality of the original methods from the base class?",
58
+ "answer": "In the Transformers library, the `auto_class_update` function plays a crucial role in dynamically creating specialized model classes that inherit functionalities from a base class but also have unique customizations. This is particularly important when different model classes need specific configurations or preprocessing steps that are not shared across all models.\n\nThe core mechanism that allows `auto_class_update` to achieve this functionality without altering the behavior of the base class methods lies in its use of the `copy_func` function. Here's how it works step-by-step:\n\n1. **Copying the Function**: `copy_func` is used to create an exact copy of the methods `from_config` and `from_pretrained` from the base class `_BaseAutoModelClass`. This is done by duplicating the `__code__` object of these methods. The `__code__` object contains the compiled executable code that the Python interpreter runs. By copying this code object, the new function retains the exact behavior and logic of the original function.\n\n2. **Customization of the Copied Function**: After copying, `auto_class_update` modifies the docstrings of these methods to tailor them to the specific subclass. This involves inserting a specific `head_doc`, replacing placeholders like `\"BaseAutoModelClass\"` with the subclass's name, and updating example checkpoints specific to the model type (e.g., `\"google-bert/bert-base-cased\"`). These modifications are crucial for providing accurate and relevant documentation and guidance specific to each subclass.\n\n3. **Re-assignment as Class Methods**: Once the functions are copied and customized, they are re-assigned to the subclass as class methods. This is done using `classmethod(from_config)` and `classmethod(from_pretrained)`. This step ensures that these methods, now tailored and documented specifically for the subclass, are callable on the subclass itself.\n\n4. 
**Preservation of Base Class Functionality**: Since the original methods are copied before being modified, the base class `_BaseAutoModelClass` retains its original `from_config` and `from_pretrained` methods without any changes. This isolation ensures that modifications specific to one subclass do not impact the behavior or documentation of these methods in the base class or any other subclasses.\n\nBy following this process, `auto_class_update` ensures that each subclass in the Transformers library can have methods that are specifically tailored to its requirements, both in terms of functionality and documentation, while preserving the integrity and functionality of the original methods from the base class. This approach enhances modularity and flexibility in the library, allowing developers to easily extend and customize model classes for various use cases.",
59
+ "expected_identifiers": [
60
+ "__code__",
61
+ "copy_func",
62
+ "from_config",
63
+ "from_pretrained"
64
+ ]
65
+ },
66
+ {
67
+ "repo": "huggingface/transformers",
68
+ "commit": "7bb1c99800d235791dace10305731f377db8077b",
69
+ "context_files": [
70
+ "huggingface/transformers/src/transformers/models/megatron_gpt2/checkpoint_reshaping_and_interoperability.py",
71
+ "huggingface/transformers/src/transformers/modeling_utils.py"
72
+ ],
73
+ "question": "Given a system limitation of 5GB per file, how does `convert_checkpoint_from_megatron_to_transformers` manage the storage of a large model's data to comply with this restriction?",
74
+ "answer": "The `convert_checkpoint_from_megatron_to_transformers` function manages the storage of a large model's data to comply with a system limitation of 5GB per file by utilizing the `shard_checkpoint` function to split the model's state dictionary into multiple sub-checkpoints, each of which does not exceed the specified maximum size.\n\nHere's a detailed breakdown of how this is achieved:\n\n1. **Sharding Process**: The `shard_checkpoint` function is called within `convert_checkpoint_from_megatron_to_transformers` to handle the division of the model's weights into smaller parts or shards. This function takes the entire state dictionary of the model (`output_state_dict`) and a maximum shard size as inputs.\n\n2. **Size Calculation**: The function calculates the byte size of each tensor in the state dictionary using the `dtype_byte_size` function. This function determines the number of bytes each element of a tensor occupies in memory, based on the tensor's data type (`dtype`). This calculation is crucial as it helps in accurately assessing how much space each tensor will take when saved as part of a shard.\n\n3. **Iterative Sharding**: The `shard_checkpoint` iterates through each tensor in the state dictionary and adds them to the current shard until adding another tensor would exceed the maximum shard size (5GB in this scenario). When this limit is reached, a new shard is started. This ensures that no individual shard file exceeds the specified size limit.\n\n4. **Handling Oversized Tensors**: If a single tensor is larger than the maximum shard size, it is placed in its own shard. This is a necessary exception to prevent the function from failing due to an inability to split a tensor.\n\n5. **Saving Shards**: Each shard is saved as a separate file. The naming convention and indexing ensure that each part of the model can be identified and accessed correctly. 
The function also generates an index file if the model is split into multiple shards, detailing where each parameter is stored.\n\n6. **Parameter Mapping**: The function maintains a mapping (`weight_map`) of model parameters to their respective shard files. This mapping is crucial for efficiently loading the model from its sharded state.\n\nBy following these steps, the `convert_checkpoint_from_megatron_to_transformers` function ensures that each shard of the converted model adheres to the 5GB file size limit imposed by the system. This methodical sharding allows for efficient storage and handling of large models without exceeding system file size limitations.",
75
+ "expected_identifiers": [
76
+ "shard_checkpoint",
77
+ "dtype_byte_size",
78
+ "output_state_dict",
79
+ "weight_map"
80
+ ]
81
+ },
82
+ {
83
+ "repo": "huggingface/transformers",
84
+ "commit": "7bb1c99800d235791dace10305731f377db8077b",
85
+ "context_files": [
86
+ "huggingface/transformers/src/transformers/quantizers/quantizer_hqq.py",
87
+ "huggingface/transformers/src/transformers/integrations/hqq.py"
88
+ ],
89
+ "question": "In a scenario where a neural network model is being optimized for deployment, how does `HqqHfQuantizer._process_model_before_weight_loading` ensure that each linear module is appropriately and uniquely quantized?",
90
+ "answer": "In the scenario where a neural network model is being optimized for deployment using the `HqqHfQuantizer._process_model_before_weight_loading` function, the process of ensuring that each linear module is appropriately and uniquely quantized involves several key steps and functions.\n\n1. **Tagging Modules with Unique Identifiers**: The process begins with the `get_linear_tags` function, which is responsible for identifying and tagging all linear modules within the model. This function uses a `set` to collect the names of these modules, which inherently ensures that each tag is unique (since sets do not allow duplicates). This is crucial because it prevents any confusion or errors in later stages when quantization parameters are applied to these tags.\n\n2. **Applying Quantization Configuration**: Once the linear modules are tagged, the `prepare_for_hqq_linear` function takes over. This function receives a `quantization_config` and a list of modules not to convert. It first calls `autoname_modules` to ensure each module in the model has a unique name, and then retrieves the linear tags using `get_linear_tags`. The function then filters these tags to exclude any specified in `skip_modules` or `modules_to_not_convert`, ensuring that the quantization process is applied only to the relevant modules.\n\n3. **Mapping Quantization Parameters**: The core of the quantization process happens when `prepare_for_hqq_linear` maps the quantization parameters to each linear tag. This is done by creating a dictionary (`patch_params`) where each key is a linear tag and the value is the corresponding quantization parameter. If specific quantization parameters are not provided for a tag, a default configuration is applied. This mapping ensures that each linear module (identified uniquely by its tag) receives a tailored set of quantization parameters.\n\n4. 
**Updating Model Configuration**: After mapping the quantization parameters, the `prepare_for_hqq_linear` function updates the model's configuration to include these parameters, ensuring that each linear module's configuration reflects its unique quantization settings. This step is crucial for the actual quantization process, where linear modules might be replaced with their quantized counterparts (`HQQLinear`), depending on the configuration.\n\n5. **Final Verification and Logging**: The function checks if any linear modules have been replaced and logs a warning if no modules were found for quantization. This serves as a final check to ensure that the quantization process has been applied as expected.\n\nIn summary, the `HqqHfQuantizer._process_model_before_weight_loading` function ensures that each linear module is uniquely and appropriately quantized by meticulously tagging each module, applying a tailored quantization configuration, and updating the model to reflect these settings. This process is designed to optimize the model's performance for deployment, ensuring that each module operates efficiently and accurately under the constraints of quantization.",
91
+ "expected_identifiers": [
92
+ "get_linear_tags",
93
+ "autoname_modules",
94
+ "prepare_for_hqq_linear",
95
+ "patch_params"
96
+ ]
97
+ },
98
+ {
99
+ "repo": "huggingface/transformers",
100
+ "commit": "7bb1c99800d235791dace10305731f377db8077b",
101
+ "context_files": [
102
+ "huggingface/transformers/src/transformers/models/esm/modeling_esmfold.py",
103
+ "huggingface/transformers/src/transformers/models/esm/openfold_utils/loss.py"
104
+ ],
105
+ "question": "When analyzing a protein sequence with low complexity using `EsmForProteinFolding.forward`, how is the stability and definition of the output ensured?",
106
+ "answer": "When analyzing a protein sequence with low complexity using the `EsmForProteinFolding.forward` function, the stability and definition of the output are ensured through several key mechanisms embedded within the function's implementation, particularly in how it handles normalization and potential numerical instabilities.\n\n1. **Normalization of Residue Weights**: In the `compute_tm` function, residue weights are normalized by their sum, with the addition of a small constant `eps` (epsilon) to prevent division by zero. This is crucial when dealing with sequences of low complexity where certain residues might be overrepresented or underrepresented. The normalization step is represented in the code as:\n ```python\n normed_residue_mask = residue_weights / (eps + residue_weights.sum())\n ```\n Here, `eps` acts as a safeguard against division by zero, ensuring that the function remains numerically stable and produces defined outputs even when the sum of residue weights is extremely small or zero.\n\n2. **Weighted Average Calculation**: The function calculates a weighted average of the Template Modeling (TM) scores across different bins, which is critical for obtaining a reliable TM score. This is done using the normalized residue weights, ensuring that each residue's contribution is proportionate to its presence, thus maintaining accuracy and stability in the final score calculation:\n ```python\n per_alignment = torch.sum(predicted_tm_term * normed_residue_mask, dim=-1)\n ```\n This step aggregates the TM scores across all residues, factoring in their normalized weights, which is particularly important in low complexity sequences where certain residues might dominate.\n\n3. **Handling of Edge Cases**: The use of `eps` in the normalization process is a direct method to handle edge cases, such as sequences with low complexity or unusual amino acid distributions. 
By ensuring that the denominator in the normalization step is never zero, the function avoids potential runtime errors (like NaN or infinite values), which could disrupt the analysis process.\n\n4. **Integration within `EsmForProteinFolding.forward`**: The stability and definition of outputs from the `EsmForProteinFolding.forward` function are further supported by how `compute_tm` integrates with other components of the model. The TM scores computed are used alongside other structural predictions, contributing to a comprehensive evaluation of the predicted protein structures. This integration ensures that the outputs are not only stable and defined but also meaningful in the context of protein structure prediction.\n\nIn summary, the `EsmForProteinFolding.forward` function ensures stable and defined outputs for protein structure predictions, particularly in scenarios involving low complexity sequences, by employing robust normalization techniques and handling potential numerical instabilities through the careful addition of a small epsilon value in critical calculations. This approach guarantees that the function can reliably process a wide range of input data without encountering computational errors.",
107
+ "expected_identifiers": [
108
+ "normed_residue_mask",
109
+ "eps",
110
+ "residue_weights / (eps + residue_weights.sum())",
111
+ "torch.sum(predicted_tm_term * normed_residue_mask, dim=-1)"
112
+ ]
113
+ },
114
+ {
115
+ "repo": "huggingface/transformers",
116
+ "commit": "7bb1c99800d235791dace10305731f377db8077b",
117
+ "context_files": [
118
+ "huggingface/transformers/src/transformers/pipelines/question_answering.py",
119
+ "huggingface/transformers/src/transformers/data/processors/squad.py"
120
+ ],
121
+ "question": "In a scenario where the textual data includes unusually lengthy paragraphs, how does `QuestionAnsweringPipeline.preprocess` ensure comprehensive coverage of all context tokens in the model's input sequences?",
122
+ "answer": "In scenarios where the textual data includes unusually lengthy paragraphs that exceed the model's maximum input length, the `QuestionAnsweringPipeline.preprocess` function ensures comprehensive coverage of all context tokens in the model's input sequences through a meticulous management of tokenization and handling of overflow tokens. This process is crucial for maintaining the integrity and continuity of the context information, which is essential for the model to accurately answer questions based on the provided context.\n\n### Step-by-Step Explanation:\n\n1. **Tokenization and Pairing**:\n The function begins by tokenizing the question and context separately. Depending on the tokenizer's configuration (`tokenizer.padding_side`), the question and context are arranged in a specific order (either question first or context first). This is handled in the lines where `encoded_inputs` is defined using `self.tokenizer(text, text_pair, ...)`. \n\n2. **Handling Long Contexts with Overflow Tokens**:\n The key parameter here is `return_overflowing_tokens=True` within the tokenizer call. This setting ensures that when the combined length of the question and context exceeds `max_seq_len`, the tokenizer automatically generates additional input sequences that contain the \"overflow\" tokens from the context. These sequences overlap by a number of tokens defined by `doc_stride`, which is calculated as `min(max_seq_len // 2, 128)`.\n\n3. **Creating Overlapping Spans**:\n The overlapping spans are crucial for ensuring that tokens near the boundaries of a sequence are also seen in different contextual surroundings, enhancing the model's ability to understand and answer questions about tokens that appear near the maximum sequence length limit. This overlap is managed by the `stride` parameter in the tokenizer, which is set to `doc_stride`.\n\n4. 
**Feature Construction**:\n For each span generated from the overflowing tokens, the function constructs a feature object that includes not only the token ids (`input_ids`) but also attention masks, token type ids, and a special mask (`p_mask`) which indicates which tokens can be part of an answer. The `p_mask` is particularly important as it helps the model distinguish between context tokens (potential answer locations) and non-context tokens (like those belonging to the question or special tokens).\n\n5. **Yielding Processed Features**:\n Each feature constructed from the spans is then yielded one by one, with additional metadata such as whether it is the last feature of the example. This is handled in the loop `for i, feature in enumerate(features):` where each feature is prepared according to the model's requirements, potentially converting them into tensors suitable for the model's computation framework (PyTorch or TensorFlow).\n\n### Conclusion:\n\nBy managing the tokenization and overflow tokens effectively, `QuestionAnsweringPipeline.preprocess` ensures that every token from a lengthy context is included in at least one input sequence to the model. This comprehensive coverage is achieved through the creation of multiple, overlapping input sequences that ensure no contextual information is lost, thereby enabling the model to perform accurately even with contexts that far exceed its maximum input length. This methodical approach is crucial for handling real-world data where lengthy paragraphs are common, ensuring robust model performance across varied datasets.",
123
+ "expected_identifiers": [
124
+ "return_overflowing_tokens=True",
125
+ "doc_stride",
126
+ "p_mask"
127
+ ]
128
+ },
129
+ {
130
+ "repo": "huggingface/transformers",
131
+ "commit": "7bb1c99800d235791dace10305731f377db8077b",
132
+ "context_files": [
133
+ "huggingface/transformers/examples/research_projects/movement-pruning/masked_run_squad.py",
134
+ "huggingface/transformers/src/transformers/data/processors/squad.py"
135
+ ],
136
+ "question": "Given the challenge of training models on datasets with varying context lengths, how does `load_and_cache_examples` in `examples/research_projects/movement-pruning/masked_run_squad.py` ensure the production of a dataset that supports accurate answer position prediction by the models?",
137
+ "answer": "The `load_and_cache_examples` function in `examples/research_projects/movement-pruning/masked_run_squad.py` is designed to ensure that the dataset produced supports accurate answer position prediction by models, even when dealing with varying context lengths. This is achieved through a series of steps that involve careful handling and processing of the data, particularly when contexts exceed the model's maximum sequence length. Here's how the function manages this:\n\n1. **Data Loading and Caching**: Initially, the function checks if a cached version of the processed data exists. If it does, it loads the features, dataset, and examples directly from the cache, bypassing the need for reprocessing. This step ensures efficiency and consistency in data usage across different training runs.\n\n2. **Dynamic Data Processing**: If no cached data is available, the function processes the raw data to generate features suitable for model training. This involves tokenizing the text and converting the SQuAD examples into features that models can understand and learn from.\n\n3. **Handling Extended Contexts**: The core of handling varying context lengths lies in the `squad_convert_examples_to_features` function, which is called within `load_and_cache_examples`. This function uses `squad_convert_example_to_features` to process each example individually.\n\n4. **Segmentation and Token Index Adjustment**: In `squad_convert_example_to_features`, the context is potentially split into multiple spans if its length exceeds the model's maximum sequence length. This is crucial because it allows the model to handle long contexts by breaking them down into manageable parts. Each span is processed to ensure that the start and end positions of answers are correctly adjusted within the tokenized context. 
This adjustment is handled by the `_improve_answer_span` function, which ensures that the answer spans are accurately placed within the tokens, even if the context is segmented.\n\n5. **Feature Construction**: Each span is then converted into a set of features, including input IDs, attention masks, token type IDs, and the positions of the answers. Special care is taken to mark tokens that cannot be part of the answers (using a p_mask), and to identify the maximum context for each token, which is critical for understanding which part of the split context a token belongs to.\n\n6. **Dataset Compilation**: After processing, the features are compiled into a dataset format (either PyTorch or TensorFlow, based on the configuration). This dataset includes all necessary information for the model to learn from, including the context, the question, and the correct positions of the answers.\n\nBy carefully managing the tokenization, segmentation, and feature construction processes, `load_and_cache_examples` ensures that the dataset it produces allows models to accurately predict answer positions, regardless of the length of the context. This capability is essential for training robust question-answering models that can handle real-world data, where context lengths can vary significantly.",
138
+ "expected_identifiers": [
139
+ "squad_convert_examples_to_features",
140
+ "squad_convert_example_to_features",
141
+ "_improve_answer_span",
142
+ "p_mask"
143
+ ]
144
+ },
145
+ {
146
+ "repo": "huggingface/transformers",
147
+ "commit": "7bb1c99800d235791dace10305731f377db8077b",
148
+ "context_files": [
149
+ "huggingface/transformers/src/transformers/modeling_flax_utils.py",
150
+ "huggingface/transformers/src/transformers/utils/hub.py"
151
+ ],
152
+ "question": "In a scenario where network conditions are suboptimal, how does `FlaxPreTrainedModel.from_pretrained` manage to reduce the model loading time?",
153
+ "answer": "In scenarios where network conditions are suboptimal, the `FlaxPreTrainedModel.from_pretrained` function effectively reduces model loading time by leveraging a sophisticated caching mechanism. This mechanism is crucial for managing the download and storage of model shards, ensuring efficient and faster model initialization.\n\n### Caching Mechanism:\nThe function first checks if the required model shards are already available in the local cache before attempting any network requests. This is achieved through the `try_to_load_from_cache` function, which inspects the cache for the presence of the last shard of the model. If the last shard is found in the cache, it is likely that all previous shards are also cached, thus avoiding the need for further network requests.\n\n### Download and Cache Management:\nIf the shards are not found in the cache, `FlaxPreTrainedModel.from_pretrained` proceeds to download them. Each shard's presence is verified using the `cached_file` function, which handles the downloading and caching of the shard if it is not already present. This function also supports resuming downloads, which is particularly useful in suboptimal network conditions where downloads might be interrupted.\n\n### Efficient Shard Handling:\nThe function `get_checkpoint_shard_files` is specifically designed to manage sharded model files. It reads the checkpoint index file to determine all the necessary shards for the model and then ensures each shard is either fetched from the cache or downloaded. This process is streamlined by the use of a progress bar (managed by `tqdm`), which provides visual feedback on the download process, enhancing user experience especially in network-constrained environments.\n\n### Impact of Caching on Model Loading Time:\nBy prioritizing cached shards, `FlaxPreTrainedModel.from_pretrained` significantly reduces the dependency on network bandwidth and stability. 
This is particularly beneficial in scenarios with limited network resources, as it minimizes the time spent in downloading model components. The caching mechanism ensures that once a model shard is downloaded and stored locally, subsequent loads of the same model will utilize the cached versions, thereby bypassing the network entirely and leading to much faster model initialization times.\n\n### Conclusion:\nThe caching strategy employed by `FlaxPreTrainedModel.from_pretrained` not only optimizes the use of network resources but also ensures consistent and reduced model loading times, regardless of network conditions. This approach is instrumental in scenarios where models need to be switched frequently or reloaded, providing a seamless and efficient user experience.",
154
+ "expected_identifiers": [
155
+ "try_to_load_from_cache",
156
+ "cached_file",
157
+ "get_checkpoint_shard_files",
158
+ "tqdm"
159
+ ]
160
+ },
161
+ {
162
+ "repo": "huggingface/transformers",
163
+ "commit": "7bb1c99800d235791dace10305731f377db8077b",
164
+ "context_files": [
165
+ "huggingface/transformers/examples/research_projects/information-gain-filtration/run_clm_igf.py",
166
+ "huggingface/transformers/examples/research_projects/information-gain-filtration/igf/igf.py"
167
+ ],
168
+ "question": "In light of recent dataset size restrictions for training purposes, how does `generate_n_pairs` maintain compliance by ensuring the objective set adheres to the specified size and article length requirements?",
169
+ "answer": "The `generate_n_pairs` function ensures compliance with dataset size restrictions by meticulously managing the creation of the objective set through its subordinate function `generate_datasets`. This process is governed by specific parameters and conditions set within the code to meet the required criteria of size and article length.\n\n1. **Size of the Objective Set**: The function `generate_datasets` is designed to create an objective set that contains exactly the number of articles specified by the `number` parameter, which is passed from `generate_n_pairs` as `size_objective_set`. In the provided code, this value is set to 100. The loop within `generate_datasets` that populates the `objective_set` list includes a condition to break once the length of this list reaches the specified `number` (see the line `if len(objective_set) >= number: break`). This ensures that no more than 100 articles are added to the objective set, directly adhering to the dataset size restrictions.\n\n2. **Article Length Management**: The function also manages the length of each article in the objective set based on the `context_len` parameter. If `trim` is set to `True`, the function trims the articles to ensure they do not exceed the specified `context_len`. This is achieved by selecting a starting point randomly within the article and then slicing the article to obtain a segment of the specified `context_len` (see the line `objective_set.append(example[0, start : start + context_len])`). This ensures that each article in the objective set adheres to the length restrictions.\n\n3. **Compliance with Regulations**: By strictly controlling both the number of articles and their lengths as described, `generate_n_pairs` ensures that the objective set complies with new regulations requiring training datasets to contain no more than 100 articles, each of a specified maximum length. 
This compliance is crucial for ethical review and adherence to training dataset standards.\n\nIn summary, `generate_n_pairs` maintains compliance with dataset size and article length restrictions through careful implementation in `generate_datasets`, which explicitly controls the size of the objective set and trims articles to the required length based on the parameters provided. This methodical approach ensures that the objective set meets specified criteria, crucial for adhering to regulatory standards.",
170
+ "expected_identifiers": [
171
+ "generate_n_pairs",
172
+ "generate_datasets",
173
+ "size_objective_set",
174
+ "context_len"
175
+ ]
176
+ }
177
+ ]
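The greedy sharding procedure described in the `shard_checkpoint` answer above can be sketched in a few lines. This is a simplified illustration over byte sizes only, not the Transformers implementation; `shard_state_dict` is a hypothetical name.

```python
def shard_state_dict(sizes, max_shard_bytes):
    """Greedily split a {name: byte_size} mapping into shards, each at most
    max_shard_bytes. A single tensor larger than the limit gets its own shard,
    mirroring the oversized-tensor exception described above."""
    shards = [{}]
    current = 0
    for name, size in sizes.items():
        if size > max_shard_bytes:
            # Oversized tensor: isolate it in its own shard.
            shards.append({name: size})
            shards.append({})
            current = 0
            continue
        if current + size > max_shard_bytes:
            # Current shard is full; start a new one.
            shards.append({})
            current = 0
        shards[-1][name] = size
        current += size
    # Drop any empty shard, then build the parameter -> shard index map
    # (the analogue of the weight_map index file).
    shards = [s for s in shards if s]
    weight_map = {name: i for i, shard in enumerate(shards) for name in shard}
    return shards, weight_map
```

Every shard either respects the size cap or contains exactly one oversized tensor, and `weight_map` records where each parameter landed.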
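The eps-guarded normalization from the `compute_tm` answer above also reduces to a small, checkable pattern. This is a plain-Python sketch; `weighted_tm` is a hypothetical name, not the ESMFold API.

```python
def weighted_tm(predicted_tm_term, residue_weights, eps=1e-8):
    """Weighted average of per-residue TM terms. The eps term in the
    denominator keeps the result defined even when every residue weight
    is zero (the division-by-zero case discussed above)."""
    total = sum(residue_weights)
    normed = [w / (eps + total) for w in residue_weights]
    return sum(t * w for t, w in zip(predicted_tm_term, normed))
```

With all-zero weights the function returns 0.0 instead of raising or producing NaN, which is the edge case the epsilon exists to cover.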
requirements.txt CHANGED
@@ -1,18 +1,20 @@
1
  GitPython==3.1.43
2
  Pygments==2.18.0
3
  cohere==5.9.2
 
4
  fastapi==0.112.2
5
  gradio>=4.26.0
6
  langchain==0.2.16
7
  langchain-anthropic==0.1.23
8
  langchain-cohere==0.2.4
9
  langchain-community==0.2.17
10
- langchain-core==0.2.40
11
  langchain-experimental==0.0.65
12
  langchain-nvidia-ai-endpoints==0.2.2
13
  langchain-ollama==0.1.3
14
  langchain-openai==0.1.25
15
  langchain-text-splitters==0.2.4
 
16
  marqo==3.7.0
17
  nbformat==5.10.4
18
  openai==1.42.0
@@ -22,8 +24,10 @@ python-dotenv==1.0.1
22
  requests==2.32.3
23
  semchunk==2.2.0
24
  sentence-transformers==3.1.0
 
25
  tiktoken==0.7.0
26
  tokenizers==0.19.1
27
  transformers==4.44.2
28
  tree-sitter==0.22.3
29
  tree-sitter-language-pack==0.2.0
 
 
1
  GitPython==3.1.43
2
  Pygments==2.18.0
3
  cohere==5.9.2
4
+ configargparse
5
  fastapi==0.112.2
6
  gradio>=4.26.0
7
  langchain==0.2.16
8
  langchain-anthropic==0.1.23
9
  langchain-cohere==0.2.4
10
  langchain-community==0.2.17
11
+ langchain-core==0.2.41
12
  langchain-experimental==0.0.65
13
  langchain-nvidia-ai-endpoints==0.2.2
14
  langchain-ollama==0.1.3
15
  langchain-openai==0.1.25
16
  langchain-text-splitters==0.2.4
17
+ langchain-voyageai==0.1.1
18
  marqo==3.7.0
19
  nbformat==5.10.4
20
  openai==1.42.0
 
24
  requests==2.32.3
25
  semchunk==2.2.0
26
  sentence-transformers==3.1.0
27
+ tenacity==8.5.0
28
  tiktoken==0.7.0
29
  tokenizers==0.19.1
30
  transformers==4.44.2
31
  tree-sitter==0.22.3
32
  tree-sitter-language-pack==0.2.0
33
+ voyageai==0.2.3
sage/chat.py CHANGED
@@ -7,18 +7,15 @@ import logging

  import configargparse
  import gradio as gr
- import pkg_resources
  from dotenv import load_dotenv
  from langchain.chains import create_history_aware_retriever, create_retrieval_chain
  from langchain.chains.combine_documents import create_stuff_documents_chain
- from langchain.retrievers import ContextualCompressionRetriever
  from langchain.schema import AIMessage, HumanMessage
  from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

  import sage.config as sage_config
  from sage.llm import build_llm_via_langchain
- from sage.reranker import build_reranker
- from sage.vector_store import build_vector_store_from_args
+ from sage.retriever import build_retriever_from_args

  load_dotenv()

@@ -26,12 +23,7 @@ load_dotenv()
  def build_rag_chain(args):
      """Builds a RAG chain via LangChain."""
      llm = build_llm_via_langchain(args.llm_provider, args.llm_model)
-
-     retriever_top_k = 5 if args.reranker_provider == "none" else 25
-     retriever = build_vector_store_from_args(args).as_retriever(top_k=retriever_top_k)
-     compressor = build_reranker(args.reranker_provider, args.reranker_model)
-     if compressor:
-         retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)
+     retriever = build_retriever_from_args(args)

      # Prompt to contextualize the latest query based on the chat history.
      contextualize_q_system_prompt = (
@@ -83,6 +75,7 @@ def main():

      arg_validators = [
          sage_config.add_repo_args(parser),
+         sage_config.add_embedding_args(parser),
          sage_config.add_vector_store_args(parser),
          sage_config.add_reranking_args(parser),
          sage_config.add_llm_args(parser),
sage/chunker.py CHANGED
@@ -299,6 +299,7 @@ class UniversalFileChunker(Chunker):
      """Chunks a file into smaller pieces, regardless of whether it's code or text."""

      def __init__(self, max_tokens: int):
+         self.max_tokens = max_tokens
          self.code_chunker = CodeFileChunker(max_tokens)
          self.ipynb_chunker = IpynbFileChunker(self.code_chunker)
          self.text_chunker = TextFileChunker(max_tokens)
sage/config.py CHANGED
@@ -28,6 +28,25 @@ OPENAI_DEFAULT_EMBEDDING_SIZE = {
      "text-embedding-3-large": 3072,
  }

+ VOYAGE_MAX_CHUNKS_PER_BATCH = 128
+
+
+ def get_voyage_max_tokens_per_batch(model: str) -> int:
+     """Returns the maximum number of tokens per batch for the Voyage model.
+     See https://docs.voyageai.com/reference/embeddings-api."""
+     if model == "voyage-3-lite":
+         return 1_000_000
+     if model in ["voyage-3", "voyage-2"]:
+         return 320_000
+     return 120_000
+
+
+ def get_voyage_embedding_size(model: str) -> int:
+     """Returns the embedding size for the Voyage model. See https://docs.voyageai.com/docs/embeddings#model-choices."""
+     if model == "voyage-3-lite":
+         return 512
+     if model == "voyage-code-2":
+         return 1536
+     return 1024
+

  def add_config_args(parser: ArgumentParser):
      """Adds configuration-related arguments to the parser."""
@@ -61,7 +80,7 @@ def add_repo_args(parser: ArgumentParser) -> Callable:

  def add_embedding_args(parser: ArgumentParser) -> Callable:
      """Adds embedding-related arguments to the parser and returns a validator."""
-     parser.add("--embedding-provider", default="marqo", choices=["openai", "marqo"])
+     parser.add("--embedding-provider", default="marqo", choices=["openai", "voyage", "marqo"])
      parser.add(
          "--embedding-model",
          type=str,
@@ -115,11 +134,17 @@ def add_vector_store_args(parser: ArgumentParser) -> Callable:
          help="URL for the Marqo server. Required if using Marqo as embedder or vector store.",
      )
      parser.add(
-         "--hybrid-retrieval",
-         action=argparse.BooleanOptionalAction,
-         default=True,
-         help="Whether to use a hybrid of vector DB + BM25 retrieval. When set to False, we only use vector DB "
-         "retrieval. This is only relevant if using Pinecone as the vector store.",
+         "--retrieval-alpha",
+         default=0.5,
+         type=float,
+         help="Takes effect for the Pinecone retriever only. The weight of the dense (embeddings-based) vs sparse (BM25) "
+         "encoder in the final retrieval score. A value of 0.0 means BM25 only, 1.0 means embeddings only.",
+     )
+     parser.add(
+         "--retriever-top-k",
+         default=25,
+         type=int,
+         help="The number of top documents to retrieve from the vector store.",
      )
      return validate_vector_store_args

@@ -169,6 +194,7 @@ def add_reranking_args(parser: ArgumentParser) -> Callable:
          help="The reranker model name. When --reranker-provider=huggingface, we suggest choosing a model from the "
          "SentenceTransformers Cross-Encoders library https://huggingface.co/cross-encoder?sort_models=downloads#models",
      )
+     parser.add("--reranker-top-k", default=5, type=int, help="The number of top documents to return after reranking.")
      # Trivial validator (nothing to check).
      return lambda _: True

@@ -228,6 +254,33 @@ def _validate_openai_embedding_args(args):
          raise ValueError(f"The maximum number of chunks per job is {OPENAI_MAX_TOKENS_PER_JOB}. Got {chunks_per_job}")


+ def _validate_voyage_embedding_args(args):
+     """Validates the configuration of the Voyage batch embedder and sets defaults."""
+     if not os.getenv("VOYAGE_API_KEY"):
+         raise ValueError("Please set the VOYAGE_API_KEY environment variable.")
+
+     if not args.embedding_model:
+         args.embedding_model = "voyage-code-2"
+
+     if not args.tokens_per_chunk:
+         # https://arxiv.org/pdf/2406.14497 recommends a value between 200-800.
+         args.tokens_per_chunk = 800
+
+     if not args.chunks_per_batch:
+         args.chunks_per_batch = VOYAGE_MAX_CHUNKS_PER_BATCH
+     elif args.chunks_per_batch > VOYAGE_MAX_CHUNKS_PER_BATCH:
+         args.chunks_per_batch = VOYAGE_MAX_CHUNKS_PER_BATCH
+         logging.warning(f"Voyage enforces a limit of {VOYAGE_MAX_CHUNKS_PER_BATCH} chunks per batch. Overwriting.")
+
+     max_tokens = get_voyage_max_tokens_per_batch(args.embedding_model)
+     if args.tokens_per_chunk * args.chunks_per_batch > max_tokens:
+         raise ValueError(f"Voyage enforces a limit of {max_tokens} tokens per batch. "
+                          "Reduce either --tokens-per-chunk or --chunks-per-batch.")
+
+     if not args.embedding_size:
+         args.embedding_size = get_voyage_embedding_size(args.embedding_model)
+
+
  def _validate_marqo_embedding_args(args):
      """Validates the configuration of the Marqo batch embedder and sets defaults."""
      if not args.embedding_model:
@@ -247,6 +300,8 @@ def validate_embedding_args(args):
      """Validates the configuration of the batch embedder and sets defaults."""
      if args.embedding_provider == "openai":
          _validate_openai_embedding_args(args)
+     elif args.embedding_provider == "voyage":
+         _validate_voyage_embedding_args(args)
      elif args.embedding_provider == "marqo":
          _validate_marqo_embedding_args(args)
      else:
@@ -257,8 +312,11 @@ def validate_vector_store_args(args):
      """Validates the configuration of the vector store and sets defaults."""

      if not args.index_namespace:
+         # Attempt to derive a default index namespace from the repository information.
+         if "repo_id" not in args:
+             raise ValueError("Please set a value for --index-namespace.")
          args.index_namespace = args.repo_id
-     if args.commit_hash:
+     if "commit_hash" in args and args.commit_hash:
          args.index_namespace += "/" + args.commit_hash
      if args.vector_store_provider == "marqo":
          # Marqo doesn't allow slashes in the index namespace.
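The Voyage limits above combine into a single batch-budget constraint: a batch is valid only if it respects both the 128-chunk cap and the per-model token cap. A standalone sketch of that check, using the limits quoted in the diff (the function name is illustrative, not part of the codebase):

```python
VOYAGE_MAX_CHUNKS_PER_BATCH = 128

def voyage_batch_is_valid(model: str, tokens_per_chunk: int, chunks_per_batch: int) -> bool:
    """Returns True if a batch of chunks fits Voyage's per-batch limits."""
    if model == "voyage-3-lite":
        max_tokens = 1_000_000
    elif model in ("voyage-3", "voyage-2"):
        max_tokens = 320_000
    else:  # e.g. voyage-code-2
        max_tokens = 120_000
    return (chunks_per_batch <= VOYAGE_MAX_CHUNKS_PER_BATCH
            and tokens_per_chunk * chunks_per_batch <= max_tokens)
```

With the validator's defaults (800 tokens/chunk, 128 chunks/batch), a full batch is 800 * 128 = 102,400 tokens, which fits voyage-code-2's 120k budget.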
sage/configs/remote.yaml CHANGED
@@ -14,5 +14,4 @@ llm-provider: openai
  llm-model: gpt-4

  # Reranking
- reranking-provider: cohere
- reranking-model: rerank-english-v3.0
+ reranker-provider: nvidia
sage/embedder.py CHANGED
@@ -9,7 +9,9 @@ from collections import Counter
  from typing import Dict, Generator, List, Optional, Tuple

  import marqo
+ import requests
  from openai import OpenAI
+ from tenacity import retry, stop_after_attempt, wait_random_exponential

  from sage.chunker import Chunk, Chunker
  from sage.constants import TEXT_FIELD
@@ -205,6 +207,72 @@ class OpenAIBatchEmbedder(BatchEmbedder):
      }


+ class VoyageBatchEmbedder(BatchEmbedder):
+     """Batch embedder that calls Voyage. See https://docs.voyageai.com/reference/embeddings-api."""
+
+     def __init__(self, data_manager: DataManager, chunker: Chunker, embedding_model: str):
+         self.data_manager = data_manager
+         self.chunker = chunker
+         self.embedding_model = embedding_model
+         self.embedding_data = []
+
+     def embed_dataset(self, chunks_per_batch: int, max_embedding_jobs: int = None):
+         """Issues batch embedding jobs for the entire dataset."""
+         batch = []
+         chunk_count = 0
+         tokens_since_pause = 0
+
+         for content, metadata in self.data_manager.walk():
+             chunks = self.chunker.chunk(content, metadata)
+             chunk_count += len(chunks)
+             batch.extend(chunks)
+
+             # Voyage rate-limits to 1M tokens per minute; pause after every ~900k tokens.
+             tokens_since_pause += len(chunks) * self.chunker.max_tokens
+             if tokens_since_pause >= 900_000:
+                 logging.info("Pausing for 60 seconds to avoid rate limiting...")
+                 time.sleep(60)
+                 tokens_since_pause = 0
+
+             if len(batch) > chunks_per_batch:
+                 for i in range(0, len(batch), chunks_per_batch):
+                     sub_batch = batch[i : i + chunks_per_batch]
+                     logging.info("Embedding %d chunks...", len(sub_batch))
+                     result = self._make_batch_request(sub_batch)
+                     for chunk, datum in zip(sub_batch, result["data"]):
+                         self.embedding_data.append((chunk.metadata, datum["embedding"]))
+                 batch = []
+
+         # Finally, commit the last batch.
+         if batch:
+             logging.info("Embedding %d chunks...", len(batch))
+             result = self._make_batch_request(batch)
+             for chunk, datum in zip(batch, result["data"]):
+                 self.embedding_data.append((chunk.metadata, datum["embedding"]))
+
+         logging.info(f"Successfully embedded {chunk_count} chunks.")
+
+     def embeddings_are_ready(self, *args, **kwargs) -> bool:
+         """Checks whether the batch embedding jobs are done."""
+         # The Voyage API is synchronous, so once embed_dataset() returns, the embeddings are ready.
+         return True
+
+     def download_embeddings(self, *args, **kwargs) -> Generator[Vector, None, None]:
+         """Yields (chunk_metadata, embedding) pairs for each chunk in the dataset."""
+         for chunk_metadata, embedding in self.embedding_data:
+             yield (chunk_metadata, embedding)
+
+     @retry(wait=wait_random_exponential(multiplier=1, max=60), stop=stop_after_attempt(6))
+     def _make_batch_request(self, chunks: List[Chunk]) -> Dict:
+         """Makes a batch request to the Voyage API with exponential backoff when we hit rate limits."""
+         url = "https://api.voyageai.com/v1/embeddings"
+         headers = {"Authorization": f"Bearer {os.environ['VOYAGE_API_KEY']}", "Content-Type": "application/json"}
+         payload = {"input": [chunk.content for chunk in chunks], "model": self.embedding_model}
+
+         response = requests.post(url, json=payload, headers=headers)
+         if response.status_code != 200:
+             raise ValueError(f"Failed to make batch request. Response: {response.text}")
+
+         return response.json()
+
+
  class MarqoEmbedder(BatchEmbedder):
      """Embedder that uses the open-source Marqo vector search engine.

@@ -270,6 +338,8 @@ class MarqoEmbedder(BatchEmbedder):
  def build_batch_embedder_from_flags(data_manager: DataManager, chunker: Chunker, args) -> BatchEmbedder:
      if args.embedding_provider == "openai":
          return OpenAIBatchEmbedder(data_manager, chunker, args.local_dir, args.embedding_model, args.embedding_size)
+     elif args.embedding_provider == "voyage":
+         return VoyageBatchEmbedder(data_manager, chunker, args.embedding_model)
      elif args.embedding_provider == "marqo":
          return MarqoEmbedder(
              data_manager, chunker, index_name=args.index_namespace, url=args.marqo_url, model=args.embedding_model
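Because the Voyage API is synchronous, `embed_dataset` simply accumulates chunks and flushes them in fixed-size sub-batches. The splitting step can be sketched in isolation (names are illustrative, not from the library):

```python
from typing import List

def split_into_batches(items: List[str], batch_size: int) -> List[List[str]]:
    """Splits items into consecutive sub-batches of at most batch_size elements each."""
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]
```

The last sub-batch may be smaller than `batch_size`, which mirrors the "finally, commit the last batch" step in the embedder.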
sage/index.py CHANGED
@@ -90,7 +90,7 @@ def main():
      time.sleep(30)

      logging.info("Moving embeddings to the repo vector store...")
-     repo_vector_store = build_vector_store_from_args(args)
+     repo_vector_store = build_vector_store_from_args(args, repo_manager)
      repo_vector_store.ensure_exists()
      repo_vector_store.upsert(repo_embedder.download_embeddings(repo_jobs_file))

@@ -101,7 +101,7 @@ def main():
      time.sleep(30)

      logging.info("Moving embeddings to the issues vector store...")
-     issues_vector_store = build_vector_store_from_args(args)
+     issues_vector_store = build_vector_store_from_args(args, issues_manager)
      issues_vector_store.ensure_exists()
      issues_vector_store.upsert(issues_embedder.download_embeddings(issues_jobs_file))
sage/reranker.py CHANGED
@@ -8,6 +8,7 @@ from langchain_community.cross_encoders import HuggingFaceCrossEncoder
  from langchain_community.document_compressors import JinaRerank
  from langchain_core.documents import BaseDocumentCompressor
  from langchain_nvidia_ai_endpoints import NVIDIARerank
+ from langchain_voyageai import VoyageAIRerank


  class RerankerProvider(Enum):
@@ -16,27 +17,33 @@ class RerankerProvider(Enum):
      COHERE = "cohere"
      NVIDIA = "nvidia"
      JINA = "jina"
+     VOYAGE = "voyage"


- def build_reranker(provider: str, model: Optional[str] = None, top_n: Optional[int] = 5) -> BaseDocumentCompressor:
+ def build_reranker(provider: str, model: Optional[str] = None, top_k: Optional[int] = 5) -> BaseDocumentCompressor:
      if provider == RerankerProvider.NONE.value:
          return None
      if provider == RerankerProvider.HUGGINGFACE.value:
          model = model or "cross-encoder/ms-marco-MiniLM-L-6-v2"
          encoder_model = HuggingFaceCrossEncoder(model_name=model)
-         return CrossEncoderReranker(model=encoder_model, top_n=top_n)
+         return CrossEncoderReranker(model=encoder_model, top_n=top_k)
      if provider == RerankerProvider.COHERE.value:
          if not os.environ.get("COHERE_API_KEY"):
              raise ValueError("Please set the COHERE_API_KEY environment variable")
          model = model or "rerank-english-v3.0"
-         return CohereRerank(model=model, cohere_api_key=os.environ.get("COHERE_API_KEY"), top_n=top_n)
+         return CohereRerank(model=model, cohere_api_key=os.environ.get("COHERE_API_KEY"), top_n=top_k)
      if provider == RerankerProvider.NVIDIA.value:
          if not os.environ.get("NVIDIA_API_KEY"):
              raise ValueError("Please set the NVIDIA_API_KEY environment variable")
          model = model or "nvidia/nv-rerankqa-mistral-4b-v3"
-         return NVIDIARerank(model=model, api_key=os.environ.get("NVIDIA_API_KEY"), top_n=top_n, truncate="END")
+         return NVIDIARerank(model=model, api_key=os.environ.get("NVIDIA_API_KEY"), top_n=top_k, truncate="END")
      if provider == RerankerProvider.JINA.value:
          if not os.environ.get("JINA_API_KEY"):
              raise ValueError("Please set the JINA_API_KEY environment variable")
-         return JinaRerank(top_n=top_n)
+         return JinaRerank(top_n=top_k)
+     if provider == RerankerProvider.VOYAGE.value:
+         if not os.environ.get("VOYAGE_API_KEY"):
+             raise ValueError("Please set the VOYAGE_API_KEY environment variable")
+         model = model or "rerank-1"
+         return VoyageAIRerank(model=model, api_key=os.environ.get("VOYAGE_API_KEY"), top_k=top_k)
      raise ValueError(f"Invalid reranker provider: {provider}")
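Each provider branch above follows the same pattern: check the API key, fall back to a provider-specific default model, then construct the compressor. The fallback step in isolation (table values copied from the diff; Jina uses its client-side default, so it has no entry here, and the function name is illustrative):

```python
from typing import Optional

# Default models per provider, mirroring build_reranker above.
DEFAULT_RERANKER_MODELS = {
    "huggingface": "cross-encoder/ms-marco-MiniLM-L-6-v2",
    "cohere": "rerank-english-v3.0",
    "nvidia": "nvidia/nv-rerankqa-mistral-4b-v3",
    "voyage": "rerank-1",
}

def resolve_reranker_model(provider: str, model: Optional[str] = None) -> Optional[str]:
    """Returns the model to use: the explicit one if given, else the provider default."""
    if provider == "none":
        return None
    if provider != "jina" and provider not in DEFAULT_RERANKER_MODELS:
        raise ValueError(f"Invalid reranker provider: {provider}")
    return model or DEFAULT_RERANKER_MODELS.get(provider)
```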
sage/retriever.py ADDED
@@ -0,0 +1,25 @@
+ from langchain.retrievers import ContextualCompressionRetriever
+ from langchain_openai import OpenAIEmbeddings
+ from langchain_voyageai import VoyageAIEmbeddings
+
+ from sage.reranker import build_reranker
+ from sage.vector_store import build_vector_store_from_args
+
+
+ def build_retriever_from_args(args):
+     """Builds a retriever (with optional reranking) from command-line arguments."""
+     if args.embedding_provider == "openai":
+         embeddings = OpenAIEmbeddings(model=args.embedding_model)
+     elif args.embedding_provider == "voyage":
+         embeddings = VoyageAIEmbeddings(model=args.embedding_model)
+     else:
+         embeddings = None
+
+     retriever = build_vector_store_from_args(args).as_retriever(top_k=args.retriever_top_k, embeddings=embeddings)
+
+     reranker = build_reranker(args.reranker_provider, args.reranker_model, args.reranker_top_k)
+     if reranker:
+         retriever = ContextualCompressionRetriever(base_compressor=reranker, base_retriever=retriever)
+     return retriever
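The retriever built here over-fetches (default `--retriever-top-k=25`) and relies on the reranker to cut the list down (default `--reranker-top-k=5`). The shape of that two-stage flow, with stand-in scoring functions (everything below is illustrative, not the library's API):

```python
from typing import Callable, List

def retrieve_then_rerank(
    docs: List[str],
    retrieval_score: Callable[[str], float],
    rerank_score: Callable[[str], float],
    retriever_top_k: int = 25,
    reranker_top_k: int = 5,
) -> List[str]:
    """Stage 1: rank by a cheap retrieval score and keep retriever_top_k candidates.
    Stage 2: re-rank the survivors with a more expensive score and keep reranker_top_k."""
    candidates = sorted(docs, key=retrieval_score, reverse=True)[:retriever_top_k]
    return sorted(candidates, key=rerank_score, reverse=True)[:reranker_top_k]
```

Over-fetching matters because the reranker can only promote documents the first stage surfaced; with `--reranker-provider=none`, the first stage's top results are returned directly.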
sage/vector_store.py CHANGED
@@ -1,19 +1,22 @@
  """Vector store abstraction and implementations."""

+ import logging
+ import os
  from abc import ABC, abstractmethod
  from functools import cached_property
- from typing import Dict, Generator, List, Tuple
+ from typing import Dict, Generator, List, Optional, Tuple

  import marqo
  from langchain_community.retrievers import PineconeHybridSearchRetriever
  from langchain_community.vectorstores import Marqo
  from langchain_community.vectorstores import Pinecone as LangChainPinecone
  from langchain_core.documents import Document
- from langchain_openai import OpenAIEmbeddings
+ from langchain_core.embeddings import Embeddings
  from pinecone import Pinecone, ServerlessSpec
  from pinecone_text.sparse import BM25Encoder

  from sage.constants import TEXT_FIELD
+ from sage.data_manager import DataManager

  Vector = Tuple[Dict, List[float]]  # (metadata, embedding)

@@ -41,24 +44,40 @@ class VectorStore(ABC):
          self.upsert_batch(batch)

      @abstractmethod
-     def as_retriever(self, top_k: int):
+     def as_retriever(self, top_k: int, embeddings: Embeddings):
          """Converts the vector store to a LangChain retriever object."""


  class PineconeVectorStore(VectorStore):
      """Vector store implementation using Pinecone."""

-     def __init__(self, index_name: str, namespace: str, dimension: int, hybrid: bool = True):
+     def __init__(self, index_name: str, namespace: str, dimension: int, alpha: float, bm25_cache: Optional[str] = None):
+         """
+         Args:
+             index_name: The name of the Pinecone index to use. If it doesn't exist already, we'll create it.
+             namespace: The namespace within the index to use.
+             dimension: The dimension of the vectors.
+             alpha: The alpha parameter for hybrid search: alpha == 1.0 means pure dense search, alpha == 0.0 means
+                 pure BM25, and 0.0 < alpha < 1.0 means a hybrid of the two.
+             bm25_cache: The path to the BM25 encoder file. If not specified, we'll use the default BM25 encoder
+                 (fitted on the MS MARCO dataset).
+         """
          self.index_name = index_name
          self.dimension = dimension
          self.client = Pinecone()
          self.namespace = namespace
-         self.hybrid = hybrid
-         # The default BM25 encoder was fit in the MS MARCO dataset.
-         # See https://docs.pinecone.io/guides/data/encode-sparse-vectors
-         # In the future, we should fit the encoder on the current dataset. It's somewhat non-trivial for large datasets,
-         # because most BM25 implementations require the entire dataset to fit in memory.
-         self.bm25_encoder = BM25Encoder.default() if hybrid else None
+         self.alpha = alpha
+
+         if alpha < 1.0:
+             if bm25_cache and os.path.exists(bm25_cache):
+                 logging.info("Loading BM25 encoder from cache.")
+                 self.bm25_encoder = BM25Encoder()
+                 self.bm25_encoder.load(path=bm25_cache)
+             else:
+                 logging.info("Using default BM25 encoder (fitted on MS MARCO).")
+                 self.bm25_encoder = BM25Encoder.default()
+         else:
+             self.bm25_encoder = None

      @cached_property
      def index(self):
@@ -84,7 +103,7 @@ class PineconeVectorStore(VectorStore):
              name=self.index_name,
              dimension=self.dimension,
              # See https://www.pinecone.io/learn/hybrid-search-intro/
-             metric="dotproduct" if self.hybrid else "cosine",
+             metric="dotproduct" if self.bm25_encoder else "cosine",
              spec=ServerlessSpec(cloud="aws", region="us-east-1"),
          )

@@ -98,19 +117,19 @@ class PineconeVectorStore(VectorStore):

          self.index.upsert(vectors=pinecone_vectors, namespace=self.namespace)

-     def as_retriever(self, top_k: int):
+     def as_retriever(self, top_k: int, embeddings: Embeddings):
          if self.bm25_encoder:
              return PineconeHybridSearchRetriever(
-                 embeddings=OpenAIEmbeddings(),
+                 embeddings=embeddings,
                  sparse_encoder=self.bm25_encoder,
                  index=self.index,
                  namespace=self.namespace,
                  top_k=top_k,
-                 alpha=0.5,
+                 alpha=self.alpha,
              )

          return LangChainPinecone.from_existing_index(
-             index_name=self.index_name, embedding=OpenAIEmbeddings(), namespace=self.namespace
+             index_name=self.index_name, embedding=embeddings, namespace=self.namespace
          ).as_retriever(search_kwargs={"k": top_k})


@@ -128,7 +147,8 @@ class MarqoVectorStore(VectorStore):
          # Since Marqo is both an embedder and a vector store, the embedder is already doing the upsert.
          pass

-     def as_retriever(self, top_k: int):
+     def as_retriever(self, top_k: int, embeddings: Embeddings = None):
+         del embeddings  # Unused; the Marqo vector store is also an embedder.
          vectorstore = Marqo(client=self.client, index_name=self.index_name)

          # Monkey-patch the _construct_documents_from_results_without_score method to not expect a "metadata" field in
@@ -146,14 +166,32 @@ class MarqoVectorStore(VectorStore):
          return vectorstore.as_retriever(search_kwargs={"k": top_k})


- def build_vector_store_from_args(args: dict) -> VectorStore:
-     """Builds a vector store from the given command-line arguments."""
+ def build_vector_store_from_args(args: dict, data_manager: Optional[DataManager] = None) -> VectorStore:
+     """Builds a vector store from the given command-line arguments.
+
+     When `data_manager` is specified and hybrid retrieval is requested, we'll use it to fit a BM25 encoder on the
+     corpus of documents.
+     """
      if args.vector_store_provider == "pinecone":
+         bm25_cache = os.path.join(".bm25_cache", args.index_namespace, "bm25_encoder.json")
+
+         if not os.path.exists(bm25_cache) and data_manager:
+             logging.info("Fitting BM25 encoder on the corpus...")
+             corpus = [content for content, _ in data_manager.walk()]
+             bm25_encoder = BM25Encoder()
+             bm25_encoder.fit(corpus)
+             # Make sure the folder exists before we dump the encoder.
+             os.makedirs(os.path.dirname(bm25_cache), exist_ok=True)
+             bm25_encoder.dump(bm25_cache)
+
          return PineconeVectorStore(
              index_name=args.pinecone_index_name,
              namespace=args.index_namespace,
              dimension=args.embedding_size if "embedding_size" in args else None,
-             hybrid=args.hybrid_retrieval,
+             alpha=args.retrieval_alpha,
+             bm25_cache=bm25_cache,
          )
      elif args.vector_store_provider == "marqo":
          return MarqoVectorStore(url=args.marqo_url, index_name=args.index_namespace)
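The new `--retrieval-alpha` flag follows Pinecone's convex-combination convention for hybrid search: the final score is a weighted blend of the dense (embedding) and sparse (BM25) scores. A minimal sketch of that rule (stand-in scores, not Pinecone's actual implementation, which scales the vectors themselves):

```python
def hybrid_score(dense: float, sparse: float, alpha: float) -> float:
    """Convex combination of dense (embedding) and sparse (BM25) relevance scores.
    alpha == 1.0 means dense only; alpha == 0.0 means BM25 only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * dense + (1 - alpha) * sparse
```

This also explains why the index metric is `dotproduct` only when a BM25 encoder is present: Pinecone's hybrid queries require dot-product indexes.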