Option for multi-query retrieval (#51)
* Don't filter out .ipynb files.
* Fail explicitly when the repo cannot be cloned.
* Fix indentation in retrieve.py
* Update default OpenAI embedding model to text-embedding-3-small
* Update plot which over-estimated R-Precision for Dense (it was a copy-paste error). The take-away still holds.
* Add multi-query retriever
* Update README with available retrieval strategies.
* Add LLM flags to retrieve_kaggle.py
- README.md +25 -1
- benchmarks/retrieval/assets/retrievers.png +0 -0
- benchmarks/retrieval/retrieve.py +4 -3
- benchmarks/retrieval/retrieve_kaggle.py +1 -0
- sage/chunker.py +1 -1
- sage/config.py +7 -0
- sage/index.py +8 -1
- sage/retriever.py +7 -0
- sage/sample-exclude.txt +3 -1
README.md
CHANGED
|
@@ -143,11 +143,13 @@ If you are planning on indexing GitHub issues in addition to the codebase, you w
|
|
| 143 |
|
| 144 |
<details>
|
| 145 |
<summary><strong>:lock: Working with private repositories</strong></summary>
|
| 146 |
-
|
|
|
|
| 147 |
|
| 148 |
```
|
| 149 |
export GITHUB_TOKEN=...
|
| 150 |
```
|
|
|
|
| 151 |
</details>
|
| 152 |
|
| 153 |
<details>
|
|
@@ -181,10 +183,12 @@ To specify an exclusion file (i.e. index all files, except for the ones specifie
|
|
| 181 |
sage-index $GITHUB_REPO --exclude=/path/to/exclusion/file
|
| 182 |
```
|
| 183 |
By default, we use the exclusion file [sample-exclude.txt](sage/sample-exclude.txt).
|
|
|
|
| 184 |
</details>
|
| 185 |
|
| 186 |
<details>
|
| 187 |
<summary><strong>:bug: Index open GitHub issues</strong></summary>
|
|
|
|
| 188 |
You will need a GitHub token first:
|
| 189 |
|
| 190 |
```
|
|
@@ -205,6 +209,26 @@ To index GitHub issues, but not the codebase:
|
|
| 205 |
```
|
| 206 |
sage-index $GITHUB_REPO --index-issues --no-index-repo
|
| 207 |
```
|
| 208 |
</details>
|
| 209 |
|
| 210 |
# Why chat with a codebase?
|
|
|
|
| 143 |
|
| 144 |
<details>
|
| 145 |
<summary><strong>:lock: Working with private repositories</strong></summary>
|
| 146 |
+
|
| 147 |
+
To index and chat with a private repository, simply set the `GITHUB_TOKEN` environment variable. To obtain this token, go to github.com > click on your profile icon > Settings > Developer settings > Personal access tokens. You can either make a fine-grained token for the desired repository, or a classic token.
|
| 148 |
|
| 149 |
```
|
| 150 |
export GITHUB_TOKEN=...
|
| 151 |
```
|
| 152 |
+
|
| 153 |
</details>
|
| 154 |
|
| 155 |
<details>
|
|
|
|
| 183 |
sage-index $GITHUB_REPO --exclude=/path/to/exclusion/file
|
| 184 |
```
|
| 185 |
By default, we use the exclusion file [sample-exclude.txt](sage/sample-exclude.txt).
|
| 186 |
+
|
| 187 |
</details>
|
| 188 |
|
| 189 |
<details>
|
| 190 |
<summary><strong>:bug: Index open GitHub issues</strong></summary>
|
| 191 |
+
|
| 192 |
You will need a GitHub token first:
|
| 193 |
|
| 194 |
```
|
|
|
|
| 209 |
```
|
| 210 |
sage-index $GITHUB_REPO --index-issues --no-index-repo
|
| 211 |
```
|
| 212 |
+
|
| 213 |
+
</details>
|
| 214 |
+
|
| 215 |
+
<details>
|
| 216 |
+
<summary><strong>:books: Experiment with retrieval strategies</strong></summary>
|
| 217 |
+
|
| 218 |
+
Retrieving the right files from the vector database is arguably the quality bottleneck of the system. We are actively experimenting with various retrieval strategies and documenting our findings [here](benchmarks/retrieval/README.md).
|
| 219 |
+
|
| 220 |
+
Currently, we support the following types of retrieval:
|
| 221 |
+
- **Vanilla RAG** from a vector database (nearest neighbor between dense embeddings). This is the default.
|
| 222 |
+
- **Hybrid RAG** that combines dense retrieval (embeddings-based) with sparse retrieval (BM25). Use `--retrieval-alpha` to weigh the two strategies.
|
| 223 |
+
|
| 224 |
+
- A value of 1 means dense-only retrieval and 0 means BM25-only retrieval.
|
| 225 |
+
- Note this is not available when running locally, only when using Pinecone as a vector store.
|
| 226 |
+
- Contrary to [Anthropic's findings](https://www.anthropic.com/news/contextual-retrieval), we find that BM25 actually hurts performance *on codebases*, because it gives an undeserved advantage to Markdown files.
|
| 227 |
+
|
| 228 |
+
- **Multi-query retrieval** performs multiple query rewrites, makes a separate retrieval call for each, and takes the union of the retrieved documents. You can activate it by passing `--multi-query-retriever`.
|
| 229 |
+
|
| 230 |
+
- We find that [on our benchmark](benchmarks/retrieval/README.md) this only marginally improves retrieval quality (from 0.44 to 0.46 R-precision) while being significantly slower and more expensive due to the extra LLM calls, though your mileage may vary.
|
| 231 |
+
|
| 232 |
</details>
|
| 233 |
|
| 234 |
# Why chat with a codebase?
|
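Note on `--retrieval-alpha`: the README says a value of 1 means dense-only and 0 means BM25-only. One common way to realize such a weight is a convex combination of the two scores. The sketch below illustrates that idea only; `hybrid_score`, `rank`, and the toy scores are hypothetical, and it assumes both scores are already normalized to a comparable range. It is not taken from this repository or from Pinecone.

```python
from __future__ import annotations


def hybrid_score(dense_score: float, sparse_score: float, alpha: float) -> float:
    """Blend a dense score and a sparse (BM25) score.

    alpha=1.0 keeps only the dense score, alpha=0.0 keeps only the sparse score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * dense_score + (1.0 - alpha) * sparse_score


def rank(candidates: dict[str, tuple[float, float]], alpha: float) -> list[str]:
    """Return document ids sorted by blended score, best first."""

    def blended(doc_id: str) -> float:
        dense, sparse = candidates[doc_id]
        return hybrid_score(dense, sparse, alpha)

    return sorted(candidates, key=blended, reverse=True)


# Toy (dense, sparse) scores, made up for illustration.
candidates = {
    "README.md": (0.2, 0.9),
    "sage/retriever.py": (0.8, 0.4),
}

dense_only = rank(candidates, alpha=1.0)
sparse_only = rank(candidates, alpha=0.0)
```

With these toy numbers the Markdown file ranks first only when the sparse score dominates, which mirrors the concern the README raises about BM25 favoring Markdown files.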
benchmarks/retrieval/assets/retrievers.png
CHANGED
|
|
benchmarks/retrieval/retrieve.py
CHANGED
|
@@ -38,6 +38,7 @@ def main():
|
|
| 38 |
parser.add("--max-instances", default=None, type=int, help="Maximum number of instances to process.")
|
| 39 |
|
| 40 |
sage.config.add_config_args(parser)
|
|
|
|
| 41 |
sage.config.add_embedding_args(parser)
|
| 42 |
sage.config.add_vector_store_args(parser)
|
| 43 |
sage.config.add_reranking_args(parser)
|
|
@@ -97,9 +98,9 @@ def main():
|
|
| 97 |
with open(output_file, "w") as f:
|
| 98 |
json.dump(out_data, f, indent=4)
|
| 99 |
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
|
| 104 |
|
| 105 |
if __name__ == "__main__":
|
|
|
|
| 38 |
parser.add("--max-instances", default=None, type=int, help="Maximum number of instances to process.")
|
| 39 |
|
| 40 |
sage.config.add_config_args(parser)
|
| 41 |
+
sage.config.add_llm_args(parser) # Needed for --multi-query-retriever, which rewrites the query with an LLM.
|
| 42 |
sage.config.add_embedding_args(parser)
|
| 43 |
sage.config.add_vector_store_args(parser)
|
| 44 |
sage.config.add_reranking_args(parser)
|
|
|
|
| 98 |
with open(output_file, "w") as f:
|
| 99 |
json.dump(out_data, f, indent=4)
|
| 100 |
|
| 101 |
+
for key in sorted(results.keys()):
|
| 102 |
+
print(f"{key}: {results[key]}")
|
| 103 |
+
print(f"Predictions and metrics saved to {output_file}")
|
| 104 |
|
| 105 |
|
| 106 |
if __name__ == "__main__":
|
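Note on the metric: the commit message and README quote R-precision. Under the standard definition, if a query has R relevant documents, R-precision is the fraction of the top R retrieved documents that are relevant. The sketch below follows that definition; `r_precision` and the file names are illustrative, and this is not necessarily how the benchmark script computes the value it prints.

```python
def r_precision(retrieved: list, relevant: set) -> float:
    """Fraction of the top-R retrieved documents that are relevant, where R = len(relevant)."""
    r = len(relevant)
    if r == 0:
        # No relevant documents: the metric is undefined, so report 0.0 by convention here.
        return 0.0
    hits = sum(1 for doc in retrieved[:r] if doc in relevant)
    return hits / r


# Hypothetical ranking for one query.
retrieved = ["a.py", "b.py", "c.py", "d.py"]
half = r_precision(retrieved, {"a.py", "c.py"})     # only a.py is in the top 2
perfect = r_precision(retrieved, {"a.py", "b.py"})  # both relevant files are in the top 2
```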
benchmarks/retrieval/retrieve_kaggle.py
CHANGED
|
@@ -22,6 +22,7 @@ def main():
|
|
| 22 |
parser.add("--output-file", required=True, help="Path to the output file with predictions.")
|
| 23 |
|
| 24 |
sage.config.add_config_args(parser)
|
|
|
|
| 25 |
sage.config.add_embedding_args(parser)
|
| 26 |
sage.config.add_vector_store_args(parser)
|
| 27 |
sage.config.add_reranking_args(parser)
|
|
|
|
| 22 |
parser.add("--output-file", required=True, help="Path to the output file with predictions.")
|
| 23 |
|
| 24 |
sage.config.add_config_args(parser)
|
| 25 |
+
sage.config.add_llm_args(parser) # Necessary for --multi-query-retriever, which calls an LLM.
|
| 26 |
sage.config.add_embedding_args(parser)
|
| 27 |
sage.config.add_vector_store_args(parser)
|
| 28 |
sage.config.add_reranking_args(parser)
|
sage/chunker.py
CHANGED
|
@@ -291,7 +291,7 @@ class IpynbFileChunker(Chunker):
|
|
| 291 |
|
| 292 |
for chunk in chunks:
|
| 293 |
# Update filenames back to .ipynb
|
| 294 |
-
chunk.metadata =
|
| 295 |
return chunks
|
| 296 |
|
| 297 |
|
|
|
|
| 291 |
|
| 292 |
for chunk in chunks:
|
| 293 |
# Update filenames back to .ipynb
|
| 294 |
+
chunk.metadata["file_path"] = filename
|
| 295 |
return chunks
|
| 296 |
|
| 297 |
|
sage/config.py
CHANGED
|
@@ -145,6 +145,13 @@ def add_vector_store_args(parser: ArgumentParser) -> Callable:
|
|
| 145 |
parser.add(
|
| 146 |
"--retriever-top-k", default=25, type=int, help="The number of top documents to retrieve from the vector store."
|
| 147 |
)
|
| 148 |
return validate_vector_store_args
|
| 149 |
|
| 150 |
|
|
|
|
| 145 |
parser.add(
|
| 146 |
"--retriever-top-k", default=25, type=int, help="The number of top documents to retrieve from the vector store."
|
| 147 |
)
|
| 148 |
+
parser.add(
|
| 149 |
+
"--multi-query-retriever",
|
| 150 |
+
action=argparse.BooleanOptionalAction,
|
| 151 |
+
default=False,
|
| 152 |
+
help="When set to True, we rewrite the query 5 times, perform retrieval for each rewrite, and take the union "
|
| 153 |
+
"of retrieved documents. See https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/MultiQueryRetriever/."
|
| 154 |
+
)
|
| 155 |
return validate_vector_store_args
|
| 156 |
|
| 157 |
|
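Note on `argparse.BooleanOptionalAction`: this action (available since Python 3.9) registers both a positive and a negated form of the flag. The sketch below uses the standard library's `add_argument` rather than the `parser.add` call in the diff, and `build_parser` is a hypothetical helper, so treat it as an illustration of the flag's behavior, not of this project's parser setup.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a single on/off flag, mirroring the option added in the diff."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--multi-query-retriever",
        action=argparse.BooleanOptionalAction,  # also registers --no-multi-query-retriever
        default=False,
        help="Rewrite the query several times and take the union of retrieved documents.",
    )
    return parser


parser = build_parser()
default_value = parser.parse_args([]).multi_query_retriever
enabled = parser.parse_args(["--multi-query-retriever"]).multi_query_retriever
disabled = parser.parse_args(["--no-multi-query-retriever"]).multi_query_retriever
```

Dashes in the option name become underscores in the parsed namespace, which is why `sage/retriever.py` reads `args.multi_query_retriever`.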
sage/index.py
CHANGED
|
@@ -58,7 +58,14 @@ def main():
|
|
| 58 |
inclusion_file=args.include,
|
| 59 |
exclusion_file=args.exclude,
|
| 60 |
)
|
| 61 |
-
|
| 62 |
logging.info("Embedding the repo...")
|
| 63 |
chunker = UniversalFileChunker(max_tokens=args.tokens_per_chunk)
|
| 64 |
repo_embedder = build_batch_embedder_from_flags(repo_manager, chunker, args)
|
|
|
|
| 58 |
inclusion_file=args.include,
|
| 59 |
exclusion_file=args.exclude,
|
| 60 |
)
|
| 61 |
+
|
| 62 |
+
success = repo_manager.download()
|
| 63 |
+
if not success:
|
| 64 |
+
raise ValueError(
|
| 65 |
+
f"Unable to clone {args.repo_id}. Please check that it exists and you have access to it. "
|
| 66 |
+
"For private repositories, please set the GITHUB_TOKEN variable in your environment."
|
| 67 |
+
)
|
| 68 |
+
|
| 69 |
logging.info("Embedding the repo...")
|
| 70 |
chunker = UniversalFileChunker(max_tokens=args.tokens_per_chunk)
|
| 71 |
repo_embedder = build_batch_embedder_from_flags(repo_manager, chunker, args)
|
sage/retriever.py
CHANGED
|
@@ -1,7 +1,9 @@
|
|
| 1 |
from langchain.retrievers import ContextualCompressionRetriever
|
|
|
|
| 2 |
from langchain_openai import OpenAIEmbeddings
|
| 3 |
from langchain_voyageai import VoyageAIEmbeddings
|
| 4 |
|
|
|
|
| 5 |
from sage.reranker import build_reranker
|
| 6 |
from sage.vector_store import build_vector_store_from_args
|
| 7 |
|
|
@@ -20,6 +22,11 @@ def build_retriever_from_args(args):
|
|
| 20 |
top_k=args.retriever_top_k, embeddings=embeddings, namespace=args.index_namespace
|
| 21 |
)
|
| 22 |
|
| 23 |
reranker = build_reranker(args.reranker_provider, args.reranker_model, args.reranker_top_k)
|
| 24 |
if reranker:
|
| 25 |
retriever = ContextualCompressionRetriever(base_compressor=reranker, base_retriever=retriever)
|
|
|
|
| 1 |
from langchain.retrievers import ContextualCompressionRetriever
|
| 2 |
+
from langchain.retrievers.multi_query import MultiQueryRetriever
|
| 3 |
from langchain_openai import OpenAIEmbeddings
|
| 4 |
from langchain_voyageai import VoyageAIEmbeddings
|
| 5 |
|
| 6 |
+
from sage.llm import build_llm_via_langchain
|
| 7 |
from sage.reranker import build_reranker
|
| 8 |
from sage.vector_store import build_vector_store_from_args
|
| 9 |
|
|
|
|
| 22 |
top_k=args.retriever_top_k, embeddings=embeddings, namespace=args.index_namespace
|
| 23 |
)
|
| 24 |
|
| 25 |
+
if args.multi_query_retriever:
|
| 26 |
+
retriever = MultiQueryRetriever.from_llm(
|
| 27 |
+
retriever=retriever, llm=build_llm_via_langchain(args.llm_provider, args.llm_model)
|
| 28 |
+
)
|
| 29 |
+
|
| 30 |
reranker = build_reranker(args.reranker_provider, args.reranker_model, args.reranker_top_k)
|
| 31 |
if reranker:
|
| 32 |
retriever = ContextualCompressionRetriever(base_compressor=reranker, base_retriever=retriever)
|
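Note on what the wrapper does conceptually: per the README, multi-query retrieval rewrites the query, retrieves once per rewrite, and takes the union of the results. The sketch below shows that flow with plain functions. `multi_query_retrieve`, `toy_rewrite`, and `toy_retrieve` are hypothetical stand-ins (in the diff, an LLM produces the rewrites and the vector store does the retrieval), so this is not LangChain's implementation.

```python
from typing import Callable, Iterable


def multi_query_retrieve(
    query: str,
    rewrite: Callable[[str], Iterable[str]],
    retrieve: Callable[[str], Iterable[str]],
) -> list:
    """Retrieve once per rewritten query and return the de-duplicated union, in first-seen order."""
    seen = set()
    merged = []
    for rewritten in rewrite(query):
        for doc in retrieve(rewritten):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Toy index: maps a query string to the documents it retrieves.
TOY_INDEX = {
    "how are files chunked": ["sage/chunker.py"],
    "where is chunking implemented": ["sage/chunker.py", "sage/index.py"],
}


def toy_rewrite(query: str) -> list:
    # Stand-in for the LLM call: returns fixed paraphrases regardless of the input.
    return list(TOY_INDEX)


def toy_retrieve(query: str) -> list:
    return TOY_INDEX.get(query, [])


docs = multi_query_retrieve("chunking?", toy_rewrite, toy_retrieve)
```

Because the union can hold more documents than a single retrieval call returns, it makes sense that the diff applies the multi-query wrapper before the reranker, which can then trim the merged set.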
sage/sample-exclude.txt
CHANGED
|
@@ -1,5 +1,7 @@
|
|
| 1 |
# This list tends to be overly-aggressive. We're assuming by default devs are most interested in code files, not configs.
|
|
|
|
| 2 |
dir:alembic
|
|
|
|
| 3 |
dir:deprecated
|
| 4 |
dir:docker
|
| 5 |
dir:downgrades
|
|
@@ -39,7 +41,6 @@ ext:.gz
|
|
| 39 |
ext:.icns
|
| 40 |
ext:.ico
|
| 41 |
ext:.inp
|
| 42 |
-
ext:.ipynb
|
| 43 |
ext:.isl
|
| 44 |
ext:.jar
|
| 45 |
ext:.jpeg
|
|
@@ -63,6 +64,7 @@ ext:.pt
|
|
| 63 |
ext:.ptl
|
| 64 |
ext:.s
|
| 65 |
ext:.so
|
|
|
|
| 66 |
ext:.sqlite
|
| 67 |
ext:.stl
|
| 68 |
ext:.sum
|
|
|
|
| 1 |
# This list tends to be overly-aggressive. We're assuming by default devs are most interested in code files, not configs.
|
| 2 |
+
dir:_build
|
| 3 |
dir:alembic
|
| 4 |
+
dir:build
|
| 5 |
dir:deprecated
|
| 6 |
dir:docker
|
| 7 |
dir:downgrades
|
|
|
|
| 41 |
ext:.icns
|
| 42 |
ext:.ico
|
| 43 |
ext:.inp
|
|
|
|
| 44 |
ext:.isl
|
| 45 |
ext:.jar
|
| 46 |
ext:.jpeg
|
|
|
|
| 64 |
ext:.ptl
|
| 65 |
ext:.s
|
| 66 |
ext:.so
|
| 67 |
+
ext:.sql
|
| 68 |
ext:.sqlite
|
| 69 |
ext:.stl
|
| 70 |
ext:.sum
|