Default to Marqo to simplify "Getting started"

Files changed:
- README.md (+95 −60)
- repo2vec/chat.py (+11 −3)
- repo2vec/index.py (+58 −25)
README.md
CHANGED
|
@@ -51,6 +51,11 @@ To install the library, simply run `pip install repo2vec`!
 export PINECONE_API_KEY=...
 ```

+2. Create a Pinecone index [on their website](https://pinecone.io) and export the name:
+```
+export PINECONE_INDEX_NAME=...
+```
+
 2. For chatting with an LLM, we support OpenAI and Anthropic. For the latter, set an additional API key:

 ```
@@ -68,80 +73,110 @@ If you are planning on indexing GitHub issues in addition to the codebase, you w
 ## Running it

 <details open>
+<summary><strong>:computer: Run locally</strong></summary>
+
+1. Select your desired repository:
+```
+export GITHUB_REPO=huggingface/transformers
+```
+
+2. Index the repository. This might take a few minutes, depending on its size.
+```
+r2v-index $GITHUB_REPO
+```
+
+3. Chat with the repository, once it's indexed:
+```
+r2v-chat $GITHUB_REPO
+```
+To get a public URL for your chat app, set `--share=true`.
+
 </details>

 <details>
+<summary><strong>:cloud: Use external providers</strong></summary>
-<p>To index the codebase, run this command. This should take a few minutes, depending on the repo size.</p>

+1. Select your desired repository:
+```
+export GITHUB_REPO=huggingface/transformers
+```

+2. Index the repository. This might take a few minutes, depending on its size.
+```
+r2v-index $GITHUB_REPO \
+    --embedder-type=openai \
+    --vector-store-type=pinecone \
-    --index-name=your-index-name
+    --index-name=$PINECONE_INDEX_NAME
+```

+3. Chat with the repository, once it's indexed:
+```
+r2v-chat $GITHUB_REPO \
     --vector-store-type=pinecone \
+    --index-name=$PINECONE_INDEX_NAME \
     --llm-provider=openai \
     --llm-model=gpt-4
+```
+To get a public URL for your chat app, set `--share=true`.
 </details>

 ## Additional features

+<details>
+<summary><strong>:hammer_and_wrench: Control which files get indexed</strong></summary>
+
+You can specify an inclusion or exclusion file in the following format:
+```
+# This is a comment
+ext:.my-ext-1
+ext:.my-ext-2
+ext:.my-ext-3
+dir:my-dir-1
+dir:my-dir-2
+dir:my-dir-3
+file:my-file-1.md
+file:my-file-2.py
+file:my-file-3.cpp
+```
+where:
+- `ext` specifies a file extension
+- `dir` specifies a directory. This is not a full path. For instance, if you specify `dir:tests` in an exclusion file, then a file like `/path/to/my/tests/file.py` will be ignored.
+- `file` specifies a file name. This is also not a full path. For instance, if you specify `file:__init__.py`, then a file like `/path/to/my/__init__.py` will be ignored.
+
+To specify an inclusion file (i.e. only index the specified files):
+```
+r2v-index $GITHUB_REPO --include=/path/to/inclusion/file
+```
+
+To specify an exclusion file (i.e. index all files, except for the ones specified):
+```
+r2v-index $GITHUB_REPO --exclude=/path/to/exclusion/file
+```
+By default, we use the exclusion file [sample-exclude.txt](repo2vec/sample-exclude.txt).
+</details>

+<details>
+<summary><strong>:bug: Index open GitHub issues</strong></summary>
+You will need a GitHub token first:
+```
+export GITHUB_TOKEN=...
+```
+
+To index GitHub issues without comments:
+```
+r2v-index $GITHUB_REPO --index-issues
+```
+
+To index GitHub issues with comments:
+```
+r2v-index $GITHUB_REPO --index-issues --index-issue-comments
+```
+
+To index GitHub issues, but not the codebase:
+```
+r2v-index $GITHUB_REPO --index-issues --no-index-repo
+```
+</details>

 # Why chat with a codebase?
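The `ext:`/`dir:`/`file:` semantics described above can be sketched as a small matcher. This is an illustrative distillation, not code from repo2vec: `parse_filter_file` and `matches_filter` are hypothetical helper names, and only the documented behavior (comments, extension match, single-component directory and file-name match) is assumed.

```python
from pathlib import PurePosixPath


def parse_filter_file(text: str) -> dict:
    """Parse lines like 'ext:.py', 'dir:tests', 'file:__init__.py'; '#' starts a comment."""
    rules = {"ext": set(), "dir": set(), "file": set()}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        kind, _, value = line.partition(":")
        if kind in rules:
            rules[kind].add(value)
    return rules


def matches_filter(path: str, rules: dict) -> bool:
    """True if `path` matches any rule. `dir` and `file` match single path
    components, not full paths, as the README describes."""
    p = PurePosixPath(path)
    return (
        p.suffix in rules["ext"]
        or p.name in rules["file"]
        or any(part in rules["dir"] for part in p.parts[:-1])
    )
```

For example, with an exclusion file containing `dir:tests`, a path like `/path/to/my/tests/file.py` matches and would be skipped.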
repo2vec/chat.py
CHANGED
@@ -70,13 +70,13 @@ def append_sources_to_response(response):
 def main():
     parser = argparse.ArgumentParser(description="UI to chat with your codebase")
     parser.add_argument("repo_id", help="The ID of the repository to index")
+    parser.add_argument("--llm-provider", default="ollama", choices=["openai", "anthropic", "ollama"])
     parser.add_argument(
         "--llm-model",
         help="The LLM name. Must be supported by the provider specified via --llm-provider.",
     )
+    parser.add_argument("--vector-store-type", default="marqo", choices=["pinecone", "marqo"])
+    parser.add_argument("--index-name", help="Vector store index name. Required for Pinecone.")
     parser.add_argument(
         "--marqo-url",
         default="http://localhost:8882",
@@ -89,11 +89,19 @@ def main():
     )
     args = parser.parse_args()

+    if not args.index_name:
+        if args.vector_store_type == "marqo":
+            args.index_name = args.repo_id.split("/")[1]
+        elif args.vector_store_type == "pinecone":
+            parser.error("Please specify --index-name for Pinecone.")
+
     if not args.llm_model:
         if args.llm_provider == "openai":
             args.llm_model = "gpt-4"
         elif args.llm_provider == "anthropic":
             args.llm_model = "claude-3-opus-20240229"
+        elif args.llm_provider == "ollama":
+            args.llm_model = "llama3.1"
         else:
             raise ValueError("Please specify --llm_model")

repo2vec/index.py
CHANGED
@@ -16,9 +16,13 @@ logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger()
 logger.setLevel(logging.INFO)

+MARQO_MAX_CHUNKS_PER_BATCH = 64
+
+OPENAI_MAX_TOKENS_PER_CHUNK = 8192  # The ADA embedder from OpenAI has a maximum of 8192 tokens.
+OPENAI_MAX_CHUNKS_PER_BATCH = 2048  # The OpenAI batch embedding API enforces a maximum of 2048 chunks per batch.
+OPENAI_MAX_TOKENS_PER_JOB = (
+    3_000_000  # The OpenAI batch embedding API enforces a maximum of 3M tokens processed at once.
+)

 # Note that OpenAI embedding models have fixed dimensions, however, taking a slice of them is possible.
 # See "Reducing embedding dimensions" under https://platform.openai.com/docs/guides/embeddings/use-cases and
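Taken together, these constants bound a single batch job. A quick arithmetic check (illustrative, not code from the diff): with maximum-size chunks, far fewer than the 2048-chunk batch cap fit under the 3M-token job budget.

```python
OPENAI_MAX_TOKENS_PER_CHUNK = 8192
OPENAI_MAX_CHUNKS_PER_BATCH = 2048
OPENAI_MAX_TOKENS_PER_JOB = 3_000_000

# How many maximum-size chunks fit in one job under the 3M-token budget.
max_full_chunks_per_job = OPENAI_MAX_TOKENS_PER_JOB // OPENAI_MAX_TOKENS_PER_CHUNK
print(max_full_chunks_per_job)  # 366

# So the token budget, not the 2048-chunk batch cap, binds first for large chunks.
assert max_full_chunks_per_job < OPENAI_MAX_CHUNKS_PER_BATCH
```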
@@ -33,7 +37,7 @@ OPENAI_DEFAULT_EMBEDDING_SIZE = {
 def main():
     parser = argparse.ArgumentParser(description="Batch-embeds a GitHub repository and its issues.")
     parser.add_argument("repo_id", help="The ID of the repository to index")
+    parser.add_argument("--embedder-type", default="marqo", choices=["openai", "marqo"])
     parser.add_argument(
         "--embedding-model",
         type=str,
@@ -47,7 +51,7 @@ def main():
         help="The embedding size to use for OpenAI text-embedding-3* models. Defaults to 1536 for small and 3072 for "
         "large. Note that no other OpenAI models support a dynamic embedding size, nor do models used with Marqo.",
     )
+    parser.add_argument("--vector-store-type", default="marqo", choices=["pinecone", "marqo"])
     parser.add_argument(
         "--local-dir",
         default="repos",
@@ -62,13 +66,14 @@ def main():
     parser.add_argument(
         "--chunks-per-batch",
         type=int,
-        default=2000,
         help="Maximum chunks per batch. We recommend 2000 for the OpenAI embedder. Marqo enforces a limit of 64.",
     )
     parser.add_argument(
         "--index-name",
+        default=None,
+        help="Vector store index name. For Marqo, we default it to the repository name. Required for Pinecone, since "
+        "it needs to be created manually on their website. In Pinecone terminology, this is *not* the namespace (which "
+        "we default to the repo ID).",
     )
     parser.add_argument(
         "--include",
@@ -119,17 +124,51 @@ def main():
         parser.error("When using OpenAI embedder, the vector store type must be Pinecone.")
     if args.embedder_type == "marqo" and args.vector_store_type != "marqo":
         parser.error("When using the marqo embedder, the vector store type must also be marqo.")
+    if args.vector_store_type == "marqo":
+        if not args.index_name:
+            args.index_name = args.repo_id.split("/")[1]
+        if "/" in args.index_name:
+            parser.error("The index name cannot contain slashes when using Marqo as the vector store.")
+    elif args.vector_store_type == "pinecone" and not args.index_name:
+        parser.error(
+            "When using Pinecone as the vector store, you must specify an index name. You can create one on "
+            "the Pinecone website. Make sure to set the right --embedding-size."
+        )
+
+    # Validate embedder parameters.
+    if args.embedder_type == "marqo":
+        if args.embedding_model is None:
+            args.embedding_model = "hf/e5-base-v2"
+        if args.chunks_per_batch is None:
+            args.chunks_per_batch = MARQO_MAX_CHUNKS_PER_BATCH
+        elif args.chunks_per_batch > MARQO_MAX_CHUNKS_PER_BATCH:
+            args.chunks_per_batch = MARQO_MAX_CHUNKS_PER_BATCH
+            logging.warning(
+                f"Marqo enforces a limit of {MARQO_MAX_CHUNKS_PER_BATCH} chunks per batch. "
+                "Overwriting --chunks_per_batch."
+            )
+    elif args.embedder_type == "openai":
+        if args.tokens_per_chunk > OPENAI_MAX_TOKENS_PER_CHUNK:
+            args.tokens_per_chunk = OPENAI_MAX_TOKENS_PER_CHUNK
+            logging.warning(
+                f"OpenAI enforces a limit of {OPENAI_MAX_TOKENS_PER_CHUNK} tokens per chunk. "
+                "Overwriting --tokens_per_chunk."
+            )
+        if args.chunks_per_batch is None:
+            args.chunks_per_batch = 2000
+        elif args.chunks_per_batch > OPENAI_MAX_CHUNKS_PER_BATCH:
+            args.chunks_per_batch = OPENAI_MAX_CHUNKS_PER_BATCH
+            logging.warning(
+                f"OpenAI enforces a limit of {OPENAI_MAX_CHUNKS_PER_BATCH} chunks per batch. "
+                "Overwriting --chunks_per_batch."
+            )
+        if args.tokens_per_chunk * args.chunks_per_batch >= OPENAI_MAX_TOKENS_PER_JOB:
+            parser.error(f"The maximum number of tokens per job is {OPENAI_MAX_TOKENS_PER_JOB}.")
+        if args.embedding_model is None:
+            args.embedding_model = "text-embedding-ada-002"
+        if args.embedding_size is None:
+            args.embedding_size = OPENAI_DEFAULT_EMBEDDING_SIZE.get(args.embedding_model)
+
     if args.include and args.exclude:
         parser.error("At most one of --include and --exclude can be specified.")
     if not args.include and not args.exclude:
if not args.index_repo and not args.index_issues:
|
| 177 |
parser.error("At least one of --index-repo and --index-issues must be true.")
|
| 178 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 179 |
# Fail early on missing environment variables.
|
| 180 |
if args.embedder_type == "openai" and not os.getenv("OPENAI_API_KEY"):
|
| 181 |
parser.error("Please set the OPENAI_API_KEY environment variable.")
|