juliaturc committed on Oct 4, 2024

Commit

c9295cd

1 Parent(s): ca3f128

Option for multi-query retrieval (#51)

* Don't filter out .ipynb files.

* Fail explicitly when the repo cannot be cloned.

* Fix indentation in retrieve.py

* Update default OpenAI embedding model to text-embedding-3-small

* Update plot which over-estimated R-Precision for Dense (it was a copy-paste error). The take-away still holds.

* Add multi-query retriever

* Update README with available retrieval strategies.

* Add LLM flags to retrieve_kaggle.py

Files changed (9)

README.md +25 -1
benchmarks/retrieval/assets/retrievers.png +0 -0
benchmarks/retrieval/retrieve.py +4 -3
benchmarks/retrieval/retrieve_kaggle.py +1 -0
sage/chunker.py +1 -1
sage/config.py +7 -0
sage/index.py +8 -1
sage/retriever.py +7 -0
sage/sample-exclude.txt +3 -1

README.md CHANGED

@@ -143,11 +143,13 @@ If you are planning on indexing GitHub issues in addition to the codebase, you w
 <details>
 <summary><strong>:lock: Working with private repositories</strong></summary>
-To index and chat with a private repository, simply set the GITHUB_TOKEN environment variable. To obtain this token: go to github.com > click on your profile icon > Settings > Developer settings > Personal access tokens. You can either make a fine-grained token for the desired repository, or a classic token.
 ```
 export GITHUB_TOKEN=...
 ```
 </details>
 <details>
@@ -181,10 +183,12 @@ To specify an exclusion file (i.e. index all files, except for the ones specifie
 sage-index $GITHUB_REPO --exclude=/path/to/exclusion/file
 ```
 By default, we use the exclusion file [sample-exclude.txt](sage/sample-exclude.txt).
 </details>
 <details>
 <summary><strong>:bug: Index open GitHub issues</strong></summary>
 You will need a GitHub token first:
 ```
@@ -205,6 +209,26 @@ To index GitHub issues, but not the codebase:
 ```
 sage-index $GITHUB_REPO --index-issues --no-index-repo
 ```
 </details>
 # Why chat with a codebase?

 <details>
 <summary><strong>:lock: Working with private repositories</strong></summary>
+To index and chat with a private repository, simply set the `GITHUB_TOKEN` environment variable. To obtain this token, go to github.com > click on your profile icon > Settings > Developer settings > Personal access tokens. You can either make a fine-grained token for the desired repository, or a classic token.
 ```
 export GITHUB_TOKEN=...
 ```
 </details>
 <details>
 sage-index $GITHUB_REPO --exclude=/path/to/exclusion/file
 ```
 By default, we use the exclusion file [sample-exclude.txt](sage/sample-exclude.txt).
 </details>
 <details>
 <summary><strong>:bug: Index open GitHub issues</strong></summary>
 You will need a GitHub token first:
 ```
 ```
 sage-index $GITHUB_REPO --index-issues --no-index-repo
 ```
+</details>
+<details>
+<summary><strong>:books: Experiment with retrieval strategies</strong></summary>
+Retrieving the right files from the vector database is arguably the quality bottleneck of the system. We are actively experimenting with various retrieval strategies and documenting our findings [here](benchmarks/retrieval/README.md).
+Currently, we support the following types of retrieval:
+- **Vanilla RAG** from a vector database (nearest neighbor between dense embeddings). This is the default.
+- **Hybrid RAG** that combines dense retrieval (embeddings-based) with sparse retrieval (BM25). Use `--retrieval-alpha` to weigh the two strategies.
+    - A value of 1 means dense-only retrieval and 0 means BM25-only retrieval.
+    - Note this is not available when running locally, only when using Pinecone as a vector store.
+    - Contrary to [Anthropic's findings](https://www.anthropic.com/news/contextual-retrieval), we find that BM25 is actually damaging performance *on codebases*, because it gives undeserved advantage to Markdown files.
+- **Multi-query retrieval** performs multiple query rewrites, makes a separate retrieval call for each, and takes the union of the retrieved documents. You can activate it by passing `--multi-query-retriever`.
+    - We find that [on our benchmark](benchmarks/retrieval/README.md) this only marginally improves retrieval quality (from 0.44 to 0.46 R-precision) while being significantly slower and more expensive due to LLM calls. But your mileage may vary.
 </details>
 # Why chat with a codebase?
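
The `--retrieval-alpha` description in the README hunk above can be read as a convex combination of a dense score and a sparse (BM25) score. The sketch below illustrates that reading only. It is an assumption about the weighting, not the code path that sage or Pinecone actually runs. `hybrid_score` and `rank_hybrid` are hypothetical names, and the scores are made-up toy values.

```python
def hybrid_score(dense_score: float, sparse_score: float, alpha: float) -> float:
    """Blend two relevance scores: alpha=1 is dense-only, alpha=0 is sparse-only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * dense_score + (1.0 - alpha) * sparse_score


def rank_hybrid(dense: dict[str, float], sparse: dict[str, float], alpha: float) -> list[str]:
    """Rank every document seen by either retriever by its blended score (highest first)."""
    docs = set(dense) | set(sparse)
    return sorted(
        docs,
        # A document missing from one retriever contributes 0.0 for that side.
        # Ties are broken alphabetically so the output is deterministic.
        key=lambda d: (-hybrid_score(dense.get(d, 0.0), sparse.get(d, 0.0), alpha), d),
    )


# Toy scores (hypothetical): the Markdown file wins on keyword overlap only.
dense_scores = {"auth.py": 0.9, "README.md": 0.2}
sparse_scores = {"auth.py": 0.1, "README.md": 0.8}

print(rank_hybrid(dense_scores, sparse_scores, alpha=1.0))
print(rank_hybrid(dense_scores, sparse_scores, alpha=0.0))
```

With these toy numbers, lowering alpha moves the Markdown file to the top of the ranking, which is the kind of effect the README attributes to BM25 on codebases.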

benchmarks/retrieval/assets/retrievers.png CHANGED
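
The plot updated here reports R-Precision. By the standard definition, a query with R relevant documents scores the fraction of the top R retrieved documents that are relevant. The benchmark's own implementation is not part of this commit, so the helper below is a hypothetical sketch of that standard definition, assuming the retrieved list is ranked and contains no duplicates.

```python
def r_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Precision over the top R results, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        # No relevant documents: the metric is undefined, so return 0.0 by convention here.
        return 0.0
    top_r = retrieved[:r]
    return sum(1 for doc in top_r if doc in relevant) / r


# Three relevant files, two of which appear in the top three results.
score = r_precision(["a.py", "x.py", "b.py", "c.py"], {"a.py", "b.py", "c.py"})
print(round(score, 2))
```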

benchmarks/retrieval/retrieve.py CHANGED

@@ -38,6 +38,7 @@ def main():
     parser.add("--max-instances", default=None, type=int, help="Maximum number of instances to process.")
     sage.config.add_config_args(parser)
     sage.config.add_embedding_args(parser)
     sage.config.add_vector_store_args(parser)
     sage.config.add_reranking_args(parser)
@@ -97,9 +98,9 @@ def main():
         with open(output_file, "w") as f:
             json.dump(out_data, f, indent=4)
-        for key in sorted(results.keys()):
-            print(f"{key}: {results[key]}")
-        print(f"Predictions and metrics saved to {output_file}")
 if __name__ == "__main__":

     parser.add("--max-instances", default=None, type=int, help="Maximum number of instances to process.")
     sage.config.add_config_args(parser)
+    sage.config.add_llm_args(parser)  # Needed for --multi-query-retriever, which rewrites the query with an LLM.
     sage.config.add_embedding_args(parser)
     sage.config.add_vector_store_args(parser)
     sage.config.add_reranking_args(parser)
         with open(output_file, "w") as f:
             json.dump(out_data, f, indent=4)
+    for key in sorted(results.keys()):
+        print(f"{key}: {results[key]}")
+    print(f"Predictions and metrics saved to {output_file}")
 if __name__ == "__main__":

benchmarks/retrieval/retrieve_kaggle.py CHANGED

@@ -22,6 +22,7 @@ def main():
     parser.add("--output-file", required=True, help="Path to the output file with predictions.")
     sage.config.add_config_args(parser)
     sage.config.add_embedding_args(parser)
     sage.config.add_vector_store_args(parser)
     sage.config.add_reranking_args(parser)

     parser.add("--output-file", required=True, help="Path to the output file with predictions.")
     sage.config.add_config_args(parser)
+    sage.config.add_llm_args(parser)  # Necessary for --multi-query-retriever, which calls an LLM.
     sage.config.add_embedding_args(parser)
     sage.config.add_vector_store_args(parser)
     sage.config.add_reranking_args(parser)

sage/chunker.py CHANGED

@@ -291,7 +291,7 @@ class IpynbFileChunker(Chunker):
         for chunk in chunks:
             # Update filenames back to .ipynb
-            chunk.metadata = metadata
         return chunks

         for chunk in chunks:
             # Update filenames back to .ipynb
+            chunk.metadata["file_path"] = filename
         return chunks

sage/config.py CHANGED

@@ -145,6 +145,13 @@ def add_vector_store_args(parser: ArgumentParser) -> Callable:
     parser.add(
         "--retriever-top-k", default=25, type=int, help="The number of top documents to retrieve from the vector store."
     )
     return validate_vector_store_args

     parser.add(
         "--retriever-top-k", default=25, type=int, help="The number of top documents to retrieve from the vector store."
     )
+    parser.add(
+        "--multi-query-retriever",
+        action=argparse.BooleanOptionalAction,
+        default=False,
+        help="When set to True, we rewrite the query 5 times, perform retrieval for each rewrite, and take the union "
+        "of retrieved documents. See https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/MultiQueryRetriever/."
+    )
     return validate_vector_store_args
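
`argparse.BooleanOptionalAction` (available since Python 3.9) registers both a positive and a negated form of the flag. The sketch below shows that behavior with the standard-library `ArgumentParser.add_argument` rather than the `parser.add` call used in the diff. `build_parser` is a hypothetical helper, and the help text is shortened.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--multi-query-retriever",
        action=argparse.BooleanOptionalAction,  # also creates --no-multi-query-retriever
        default=False,
        help="Rewrite the query several times and take the union of retrieved documents.",
    )
    return parser


parser = build_parser()
print(parser.parse_args([]).multi_query_retriever)
print(parser.parse_args(["--multi-query-retriever"]).multi_query_retriever)
print(parser.parse_args(["--no-multi-query-retriever"]).multi_query_retriever)
```

The parsed attribute is named `multi_query_retriever`, which matches the `args.multi_query_retriever` check in `sage/retriever.py`.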

sage/index.py CHANGED

@@ -58,7 +58,14 @@ def main():
             inclusion_file=args.include,
             exclusion_file=args.exclude,
         )
-        repo_manager.download()
         logging.info("Embedding the repo...")
         chunker = UniversalFileChunker(max_tokens=args.tokens_per_chunk)
         repo_embedder = build_batch_embedder_from_flags(repo_manager, chunker, args)

             inclusion_file=args.include,
             exclusion_file=args.exclude,
         )
+        success = repo_manager.download()
+        if not success:
+            raise ValueError(
+                f"Unable to clone {args.repo_id}. Please check that it exists and you have access to it. "
+                "For private repositories, please set the GITHUB_TOKEN variable in your environment."
+            )
         logging.info("Embedding the repo...")
         chunker = UniversalFileChunker(max_tokens=args.tokens_per_chunk)
         repo_embedder = build_batch_embedder_from_flags(repo_manager, chunker, args)

sage/retriever.py CHANGED

@@ -1,7 +1,9 @@
 from langchain.retrievers import ContextualCompressionRetriever
 from langchain_openai import OpenAIEmbeddings
 from langchain_voyageai import VoyageAIEmbeddings
 from sage.reranker import build_reranker
 from sage.vector_store import build_vector_store_from_args
@@ -20,6 +22,11 @@ def build_retriever_from_args(args):
         top_k=args.retriever_top_k, embeddings=embeddings, namespace=args.index_namespace
     )
     reranker = build_reranker(args.reranker_provider, args.reranker_model, args.reranker_top_k)
     if reranker:
         retriever = ContextualCompressionRetriever(base_compressor=reranker, base_retriever=retriever)

 from langchain.retrievers import ContextualCompressionRetriever
+from langchain.retrievers.multi_query import MultiQueryRetriever
 from langchain_openai import OpenAIEmbeddings
 from langchain_voyageai import VoyageAIEmbeddings
+from sage.llm import build_llm_via_langchain
 from sage.reranker import build_reranker
 from sage.vector_store import build_vector_store_from_args
         top_k=args.retriever_top_k, embeddings=embeddings, namespace=args.index_namespace
     )
+    if args.multi_query_retriever:
+        retriever = MultiQueryRetriever.from_llm(
+            retriever=retriever, llm=build_llm_via_langchain(args.llm_provider, args.llm_model)
+        )
     reranker = build_reranker(args.reranker_provider, args.reranker_model, args.reranker_top_k)
     if reranker:
         retriever = ContextualCompressionRetriever(base_compressor=reranker, base_retriever=retriever)
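
In the new code, the multi-query wrapper is applied before the reranker, so the reranker sees the union of documents from all rewrites. The sketch below illustrates the union step without LangChain. It is not the `MultiQueryRetriever` implementation. `multi_query_retrieve`, `rewrite`, and `retrieve` are hypothetical stand-ins, with `rewrite` replacing the LLM call and `retrieve` replacing the vector-store lookup. It also assumes only the rewrites are queried, not the original question.

```python
from typing import Callable


def multi_query_retrieve(
    query: str,
    rewrite: Callable[[str], list[str]],
    retrieve: Callable[[str], list[str]],
) -> list[str]:
    """Retrieve once per rewritten query and return the de-duplicated union."""
    seen: set[str] = set()
    merged: list[str] = []
    for rewritten in rewrite(query):
        for doc in retrieve(rewritten):
            if doc not in seen:  # keep the first occurrence, preserve arrival order
                seen.add(doc)
                merged.append(doc)
    return merged


# Toy stand-ins: a fixed set of rewrites and a dictionary acting as the index.
def toy_rewrite(query: str) -> list[str]:
    return [f"{query} (v1)", f"{query} (v2)"]


toy_index = {
    "how is auth handled? (v1)": ["auth.py", "login.py"],
    "how is auth handled? (v2)": ["login.py", "session.py"],
}

docs = multi_query_retrieve("how is auth handled?", toy_rewrite, lambda q: toy_index.get(q, []))
print(docs)
```

Each rewrite costs one retrieval call plus the LLM call that produces it, which is consistent with the README's note that this strategy is slower and more expensive.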

sage/sample-exclude.txt CHANGED

@@ -1,5 +1,7 @@
 # This list tends to be overly-aggressive. We're assuming by default devs are most interested in code files, not configs.
 dir:alembic
 dir:deprecated
 dir:docker
 dir:downgrades
@@ -39,7 +41,6 @@ ext:.gz
 ext:.icns
 ext:.ico
 ext:.inp
-ext:.ipynb
 ext:.isl
 ext:.jar
 ext:.jpeg
@@ -63,6 +64,7 @@ ext:.pt
 ext:.ptl
 ext:.s
 ext:.so
 ext:.sqlite
 ext:.stl
 ext:.sum

 # This list tends to be overly-aggressive. We're assuming by default devs are most interested in code files, not configs.
+dir:_build
 dir:alembic
+dir:build
 dir:deprecated
 dir:docker
 dir:downgrades
 ext:.icns
 ext:.ico
 ext:.inp
 ext:.isl
 ext:.jar
 ext:.jpeg
 ext:.ptl
 ext:.s
 ext:.so
+ext:.sql
 ext:.sqlite
 ext:.stl
 ext:.sum
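
The exclusion file uses `dir:` and `ext:` prefixes, with `#` lines as comments. The sketch below shows one way such rules could be applied to a repository path. The matching logic in sage is not shown in this commit, so `parse_rules` and `is_excluded` are hypothetical, and the assumption that `dir:` matches any directory component of the path may not hold for the real implementation.

```python
import os


def parse_rules(lines: list[str]) -> tuple[set[str], set[str]]:
    """Split rule lines into excluded directory names and excluded extensions."""
    dirs: set[str] = set()
    exts: set[str] = set()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        kind, _, value = line.partition(":")
        if kind == "dir":
            dirs.add(value)
        elif kind == "ext":
            exts.add(value)
    return dirs, exts


def is_excluded(path: str, dirs: set[str], exts: set[str]) -> bool:
    """Return True if any directory component or the file extension is excluded."""
    parts = path.split("/")
    if any(part in dirs for part in parts[:-1]):
        return True
    return os.path.splitext(parts[-1])[1] in exts


# Rules taken from the entries this commit adds.
dirs, exts = parse_rules(["# sample rules", "dir:_build", "dir:build", "ext:.sql"])
print(is_excluded("docs/_build/index.html", dirs, exts))
print(is_excluded("notebooks/demo.ipynb", dirs, exts))
```

Under these assumptions, a notebook path is no longer excluded once `ext:.ipynb` is removed from the list, which matches the commit's stated intent.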