Commit c9295cd by juliaturc · 1 Parent(s): ca3f128

Option for multi-query retrieval (#51)

* Don't filter out .ipynb files.

* Fail explicitly when the repo cannot be cloned.

* Fix indentation in retrieve.py

* Update default OpenAI embedding model to text-embedding-3-small

* Update plot which over-estimated R-Precision for Dense (it was a copy-paste error). The take-away still holds.

* Add multi-query retriever

* Update README with available retrieval strategies.

* Add LLM flags to retrieve_kaggle.py

README.md CHANGED
@@ -143,11 +143,13 @@ If you are planning on indexing GitHub issues in addition to the codebase, you w
 
 <details>
 <summary><strong>:lock: Working with private repositories</strong></summary>
-To index and chat with a private repository, simply set the GITHUB_TOKEN environment variable. To obtain this token: go to github.com > click on your profile icon > Settings > Developer settings > Personal access tokens. You can either make a fine-grained token for the desired repository, or a classic token.
+
+To index and chat with a private repository, simply set the `GITHUB_TOKEN` environment variable. To obtain this token, go to github.com > click on your profile icon > Settings > Developer settings > Personal access tokens. You can either make a fine-grained token for the desired repository, or a classic token.
 
 ```
 export GITHUB_TOKEN=...
 ```
+
 </details>
 
 <details>
@@ -181,10 +183,12 @@ To specify an exclusion file (i.e. index all files, except for the ones specifie
 sage-index $GITHUB_REPO --exclude=/path/to/exclusion/file
 ```
 By default, we use the exclusion file [sample-exclude.txt](sage/sample-exclude.txt).
+
 </details>
 
 <details>
 <summary><strong>:bug: Index open GitHub issues</strong></summary>
+
 You will need a GitHub token first:
 
 ```
@@ -205,6 +209,26 @@ To index GitHub issues, but not the codebase:
 ```
 sage-index $GITHUB_REPO --index-issues --no-index-repo
 ```
+
+</details>
+
+<details>
+<summary><strong>:books: Experiment with retrieval strategies</strong></summary>
+
+Retrieving the right files from the vector database is arguably the quality bottleneck of the system. We are actively experimenting with various retrieval strategies and documenting our findings [here](benchmarks/retrieval/README.md).
+
+Currently, we support the following types of retrieval:
+- **Vanilla RAG** from a vector database (nearest neighbor between dense embeddings). This is the default.
+- **Hybrid RAG** that combines dense retrieval (embeddings-based) with sparse retrieval (BM25). Use `--retrieval-alpha` to weigh the two strategies.
+
+    - A value of 1 means dense-only retrieval and 0 means BM25-only retrieval.
+    - Note this is not available when running locally, only when using Pinecone as a vector store.
+    - Contrary to [Anthropic's findings](https://www.anthropic.com/news/contextual-retrieval), we find that BM25 is actually damaging performance *on codebases*, because it gives undeserved advantage to Markdown files.
+
+- **Multi-query retrieval** performs multiple query rewrites, makes a separate retrieval call for each, and takes the union of the retrieved documents. You can activate it by passing `--multi-query-retriever`.
+
+    - We find that [on our benchmark](benchmarks/retrieval/README.md) this only marginally improves retrieval quality (from 0.44 to 0.46 R-precision) while being significantly slower and more expensive due to LLM calls. But your mileage may vary.
+
 </details>
 
 # Why chat with a codebase?
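To make the `--retrieval-alpha` convention concrete, here is a minimal sketch of a convex combination of a dense and a sparse relevance score. The function name `hybrid_score` and the example scores are hypothetical, and the README does not describe how the vector store actually combines the two signals, so read this as an illustration of "1 means dense-only, 0 means BM25-only" rather than as the real implementation.

```python
def hybrid_score(dense_score: float, sparse_score: float, alpha: float) -> float:
    """Blend a dense (embedding) score with a sparse (BM25) score.

    alpha=1.0 keeps only the dense score, alpha=0.0 keeps only the sparse one.
    This weighting scheme is an assumption made for illustration.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * dense_score + (1.0 - alpha) * sparse_score


if __name__ == "__main__":
    # Hypothetical scores for a single document.
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, hybrid_score(dense_score=0.8, sparse_score=0.2, alpha=alpha))
```

In this picture, moving alpha toward 1 reduces the influence of BM25, which is the direction the README's findings on codebases point to.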
benchmarks/retrieval/assets/retrievers.png CHANGED
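According to the commit message, this plot reports R-Precision. As a reminder of the metric, R-precision is the fraction of relevant items among the top R results, where R is the number of relevant items for the query. The sketch below is a generic implementation with a hypothetical function name and made-up file paths; the benchmark's own code may compute it differently.

```python
def r_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Precision over the top R retrieved items, with R = len(relevant)."""
    r = len(relevant)
    if r == 0:
        return 0.0  # Convention chosen for this sketch: no relevant items means a score of 0.
    top_r = retrieved[:r]
    hits = sum(1 for item in top_r if item in relevant)
    return hits / r


if __name__ == "__main__":
    # Hypothetical ranking: two files are relevant, one of them is in the top 2.
    ranking = ["sage/retriever.py", "README.md", "sage/config.py"]
    gold = {"sage/retriever.py", "sage/config.py"}
    print(r_precision(ranking, gold))
```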
benchmarks/retrieval/retrieve.py CHANGED
@@ -38,6 +38,7 @@ def main():
     parser.add("--max-instances", default=None, type=int, help="Maximum number of instances to process.")
 
     sage.config.add_config_args(parser)
+    sage.config.add_llm_args(parser)  # Needed for --multi-query-retriever, which rewrites the query with an LLM.
     sage.config.add_embedding_args(parser)
     sage.config.add_vector_store_args(parser)
     sage.config.add_reranking_args(parser)
@@ -97,9 +98,9 @@ def main():
     with open(output_file, "w") as f:
         json.dump(out_data, f, indent=4)
 
-        for key in sorted(results.keys()):
-            print(f"{key}: {results[key]}")
-        print(f"Predictions and metrics saved to {output_file}")
+    for key in sorted(results.keys()):
+        print(f"{key}: {results[key]}")
+    print(f"Predictions and metrics saved to {output_file}")
 
 
 if __name__ == "__main__":
benchmarks/retrieval/retrieve_kaggle.py CHANGED
@@ -22,6 +22,7 @@ def main():
     parser.add("--output-file", required=True, help="Path to the output file with predictions.")
 
     sage.config.add_config_args(parser)
+    sage.config.add_llm_args(parser)  # Necessary for --multi-query-retriever, which calls an LLM.
     sage.config.add_embedding_args(parser)
     sage.config.add_vector_store_args(parser)
     sage.config.add_reranking_args(parser)
sage/chunker.py CHANGED
@@ -291,7 +291,7 @@ class IpynbFileChunker(Chunker):
 
         for chunk in chunks:
             # Update filenames back to .ipynb
-            chunk.metadata = metadata
+            chunk.metadata["file_path"] = filename
         return chunks
 
 
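One plausible reading of this one-line change: assigning a whole dictionary to `chunk.metadata` replaces whatever per-chunk fields were already there, whereas updating a single key only restores the filename. The sketch below illustrates that difference with a hypothetical `Chunk` dataclass and a made-up `start_line` field; the real `Chunk` class and its metadata keys in `sage/chunker.py` are not shown in this diff and may look different.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for the chunk type used by the chunker."""
    content: str
    metadata: dict = field(default_factory=dict)


def restore_filename(chunks: list[Chunk], filename: str) -> list[Chunk]:
    """Point every chunk back at the original notebook, keeping other metadata."""
    for chunk in chunks:
        chunk.metadata["file_path"] = filename
    return chunks


if __name__ == "__main__":
    # Assume the notebook was converted to a temporary .py file before chunking.
    chunks = [
        Chunk("import os", {"file_path": "notebook.py", "start_line": 0}),
        Chunk("print(os.getcwd())", {"file_path": "notebook.py", "start_line": 1}),
    ]
    for chunk in restore_filename(chunks, "notebook.ipynb"):
        print(chunk.metadata)
```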
 
sage/config.py CHANGED
@@ -145,6 +145,13 @@ def add_vector_store_args(parser: ArgumentParser) -> Callable:
     parser.add(
         "--retriever-top-k", default=25, type=int, help="The number of top documents to retrieve from the vector store."
     )
+    parser.add(
+        "--multi-query-retriever",
+        action=argparse.BooleanOptionalAction,
+        default=False,
+        help="When set to True, we rewrite the query 5 times, perform retrieval for each rewrite, and take the union "
+        "of retrieved documents. See https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/MultiQueryRetriever/."
+    )
     return validate_vector_store_args
 
 
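For readers unfamiliar with `argparse.BooleanOptionalAction` (Python 3.9+), the sketch below shows how such a flag behaves on the command line. It uses the standard library's `ArgumentParser.add_argument` rather than the `parser.add` call seen in the diff, and the helper name `build_parser` is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a single on/off flag, mirroring the option added above."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--multi-query-retriever",
        action=argparse.BooleanOptionalAction,  # Also registers --no-multi-query-retriever.
        default=False,
    )
    return parser


if __name__ == "__main__":
    parser = build_parser()
    print(parser.parse_args([]).multi_query_retriever)
    print(parser.parse_args(["--multi-query-retriever"]).multi_query_retriever)
    print(parser.parse_args(["--no-multi-query-retriever"]).multi_query_retriever)
```

The dashes in the flag name become underscores in the parsed namespace, which is why the retriever code reads `args.multi_query_retriever`.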
 
sage/index.py CHANGED
@@ -58,7 +58,14 @@ def main():
             inclusion_file=args.include,
             exclusion_file=args.exclude,
         )
-        repo_manager.download()
+
+        success = repo_manager.download()
+        if not success:
+            raise ValueError(
+                f"Unable to clone {args.repo_id}. Please check that it exists and you have access to it. "
+                "For private repositories, please set the GITHUB_TOKEN variable in your environment."
+            )
+
         logging.info("Embedding the repo...")
         chunker = UniversalFileChunker(max_tokens=args.tokens_per_chunk)
         repo_embedder = build_batch_embedder_from_flags(repo_manager, chunker, args)
sage/retriever.py CHANGED
@@ -1,7 +1,9 @@
 from langchain.retrievers import ContextualCompressionRetriever
+from langchain.retrievers.multi_query import MultiQueryRetriever
 from langchain_openai import OpenAIEmbeddings
 from langchain_voyageai import VoyageAIEmbeddings
 
+from sage.llm import build_llm_via_langchain
 from sage.reranker import build_reranker
 from sage.vector_store import build_vector_store_from_args
 
@@ -20,6 +22,11 @@ def build_retriever_from_args(args):
         top_k=args.retriever_top_k, embeddings=embeddings, namespace=args.index_namespace
     )
 
+    if args.multi_query_retriever:
+        retriever = MultiQueryRetriever.from_llm(
+            retriever=retriever, llm=build_llm_via_langchain(args.llm_provider, args.llm_model)
+        )
+
     reranker = build_reranker(args.reranker_provider, args.reranker_model, args.reranker_top_k)
     if reranker:
         retriever = ContextualCompressionRetriever(base_compressor=reranker, base_retriever=retriever)
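The idea behind multi-query retrieval, as the README in this commit describes it (rewrite the query, retrieve once per rewrite, take the union), can be sketched without any LLM or vector store. Everything below is a toy under stated assumptions: `multi_query_retrieve`, `toy_rewrite`, `toy_retrieve`, and the tiny corpus are hypothetical stand-ins, and this is not how LangChain's `MultiQueryRetriever` is implemented internally.

```python
from typing import Callable

# Hypothetical corpus: file path -> a few keywords describing the file.
CORPUS = {
    "sage/retriever.py": "build retriever multi query rerank",
    "sage/index.py": "clone repo embed chunks index",
    "sage/chunker.py": "split notebook files into chunks",
}


def toy_rewrite(query: str) -> list[str]:
    """Stand-in for the LLM call that would produce query rewrites."""
    return [query, "index repo", "rerank chunks"]


def toy_retrieve(query: str) -> list[str]:
    """Stand-in for a vector store: return files sharing a word with the query."""
    words = set(query.lower().split())
    return [path for path, text in CORPUS.items() if words & set(text.split())]


def multi_query_retrieve(
    query: str,
    rewrite: Callable[[str], list[str]],
    retrieve: Callable[[str], list[str]],
) -> list[str]:
    """Retrieve once per rewrite and return the de-duplicated union, in first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for rewritten in rewrite(query):
        for doc in retrieve(rewritten):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


if __name__ == "__main__":
    print(multi_query_retrieve("multi query", toy_rewrite, toy_retrieve))
```

Because each rewrite triggers its own retrieval call on top of the rewriting step, the cost grows with the number of rewrites, which is consistent with the README's note that this strategy is slower and more expensive. In the diff above, the reranker wraps the multi-query retriever, so reranking applies to the merged set.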
sage/sample-exclude.txt CHANGED
@@ -1,5 +1,7 @@
 # This list tends to be overly-aggressive. We're assuming by default devs are most interested in code files, not configs.
+dir:_build
 dir:alembic
+dir:build
 dir:deprecated
 dir:docker
 dir:downgrades
@@ -39,7 +41,6 @@ ext:.gz
 ext:.icns
 ext:.ico
 ext:.inp
-ext:.ipynb
 ext:.isl
 ext:.jar
 ext:.jpeg
@@ -63,6 +64,7 @@ ext:.pt
 ext:.ptl
 ext:.s
 ext:.so
+ext:.sql
 ext:.sqlite
 ext:.stl
 ext:.sum