Commit 90af3bf by juliaturc · Parent: 7db04dd

Default to Marqo to simplify "Getting started"

Files changed (3):
  1. README.md +95 -60
  2. repo2vec/chat.py +11 -3
  3. repo2vec/index.py +58 -25
README.md CHANGED

````diff
@@ -51,6 +51,11 @@ To install the library, simply run `pip install repo2vec`!
    export PINECONE_API_KEY=...
    ```
 
+2. Create a Pinecone index [on their website](https://pinecone.io) and export the name:
+   ```
+   export PINECONE_INDEX_NAME=...
+   ```
+
 2. For chatting with an LLM, we support OpenAI and Anthropic. For the latter, set an additional API key:
 
    ```
@@ -68,80 +73,110 @@ If you are planning on indexing GitHub issues in addition to the codebase, you w
 ## Running it
 
 <details open>
-<summary><strong>:computer: Running locally</strong></summary>
-<p>To index the codebase, run this command. This should take a few minutes, depending on the repo size.</p>
-
-    # this can be any GitHub repository in the format ORG_NAME/REPO_NAME
-    r2v-index Storia-AI/repo2vec \
-        --embedder-type=marqo \
-        --vector-store-type=marqo \
-        --index-name=your-index-name
-
-<p>To chat with your codebase, run this command:</p>
-
-    # this can be any GitHub repository in the format ORG_NAME/REPO_NAME
-    r2v-chat Storia-AI/repo2vec \
-        --vector-store-type=marqo \
-        --index-name=your-index-name \
-        --llm-provider=ollama \
-        --llm-model=llama3.1
+<summary><strong>:computer: Run locally</strong></summary>
+
+1. Select your desired repository:
+   ```
+   export GITHUB_REPO=huggingface/transformers
+   ```
+
+2. Index the repository. This might take a few minutes, depending on its size.
+   ```
+   r2v-index $GITHUB_REPO
+   ```
+
+3. Chat with the repository, once it's indexed:
+   ```
+   r2v-chat $GITHUB_REPO
+   ```
+   To get a public URL for your chat app, set `--share=true`.
+
 </details>
 
 <details>
-<summary><strong>:cloud: Using external providers</strong></summary>
-<p>To index the codebase, run this command. This should take a few minutes, depending on the repo size.</p>
+<summary><strong>:cloud: Use external providers</strong></summary>
 
-    # this can be any GitHub repository in the format ORG_NAME/REPO_NAME
-    r2v-index Storia-AI/repo2vec \
-        --embedder-type=openai \
-        --vector-store-type=pinecone \
-        --index-name=your-index-name
+1. Select your desired repository:
+   ```
+   export GITHUB_REPO=huggingface/transformers
+   ```
 
-<p>To chat with your codebase, run this command:</p>
+2. Index the repository. This might take a few minutes, depending on its size.
+   ```
+   r2v-index $GITHUB_REPO \
+       --embedder-type=openai \
+       --vector-store-type=pinecone \
+       --index-name=$PINECONE_INDEX_NAME
+   ```
 
-    # this can be any GitHub repository in the format ORG_NAME/REPO_NAME
-    r2v-chat Storia-AI/repo2vec \
-        --vector-store-type=pinecone \
-        --index-name=your-index-name \
-        --llm-provider=openai \
-        --llm-model=gpt-4
-
-To get a public URL for your chat app, set `--share=true`.
+3. Chat with the repository, once it's indexed:
+   ```
+   r2v-chat $GITHUB_REPO \
+       --vector-store-type=pinecone \
+       --index-name=$PINECONE_INDEX_NAME \
+       --llm-provider=openai \
+       --llm-model=gpt-4
+   ```
+   To get a public URL for your chat app, set `--share=true`.
 </details>
 
 ## Additional features
 
-- **Control which files get indexed** based on their extension. You can whitelist or blacklist extensions by passing a file with one extension per line (in the format `.ext`):
-  - To only index a whitelist of files:
-
-    ```
-    r2v-index ... --include=/path/to/extensions/file
-    ```
-
-  - To index all code except a blacklist of files:
-
-    ```
-    r2v-index ... --exclude=/path/to/extensions/file
-    ```
-
-- **Index open GitHub issues** (remember to `export GITHUB_TOKEN=...`):
-  - To index GitHub issues without comments:
-
-    ```
-    r2v-index ... --index-issues
-    ```
-
-  - To index GitHub issues with comments:
-
-    ```
-    r2v-index ... --index-issues --index-issue-comments
-    ```
-
-  - To index GitHub issues, but not the codebase:
-
-    ```
-    r2v-index ... --index-issues --no-index-repo
-    ```
+<details>
+<summary><strong>:hammer_and_wrench: Control which files get indexed</strong></summary>
+
+You can specify an inclusion or exclusion file in the following format:
+```
+# This is a comment
+ext:.my-ext-1
+ext:.my-ext-2
+ext:.my-ext-3
+dir:my-dir-1
+dir:my-dir-2
+dir:my-dir-3
+file:my-file-1.md
+file:my-file-2.py
+file:my-file-3.cpp
+```
+where:
+- `ext` specifies a file extension
+- `dir` specifies a directory. This is not a full path. For instance, if you specify `dir:tests` in an exclusion file, then a file like `/path/to/my/tests/file.py` will be ignored.
+- `file` specifies a file name. This is also not a full path. For instance, if you specify `file:__init__.py`, then a file like `/path/to/my/__init__.py` will be ignored.
+
+To specify an inclusion file (i.e. only index the specified files):
+```
+r2v-index $GITHUB_REPO --include=/path/to/inclusion/file
+```
+
+To specify an exclusion file (i.e. index all files, except for the ones specified):
+```
+r2v-index $GITHUB_REPO --exclude=/path/to/exclusion/file
+```
+By default, we use the exclusion file [sample-exclude.txt](repo2vec/sample-exclude.txt).
+</details>
 
+<details>
+<summary><strong>:bug: Index open GitHub issues</strong></summary>
+
+You will need a GitHub token first:
+```
+export GITHUB_TOKEN=...
+```
+
+To index GitHub issues without comments:
+```
+r2v-index $GITHUB_REPO --index-issues
+```
+
+To index GitHub issues with comments:
+```
+r2v-index $GITHUB_REPO --index-issues --index-issue-comments
+```
+
+To index GitHub issues, but not the codebase:
+```
+r2v-index $GITHUB_REPO --index-issues --no-index-repo
+```
+</details>
 
 # Why chat with a codebase?
````
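The inclusion/exclusion file format introduced in the README (`ext:`/`dir:`/`file:` entries, `#` comments) can be parsed with a few lines of Python. This is an illustrative sketch of the matching semantics described above, not the repo's actual implementation; the function names are ours:

```python
from pathlib import PurePosixPath

def parse_filter_file(text: str) -> dict:
    """Parse an inclusion/exclusion file into ext/dir/file rule sets."""
    rules = {"ext": set(), "dir": set(), "file": set()}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        kind, _, value = line.partition(":")
        if kind in rules and value:
            rules[kind].add(value)
    return rules

def matches(path: str, rules: dict) -> bool:
    """True if any rule matches: extension, a directory component, or the file name."""
    p = PurePosixPath(path)
    return (
        p.suffix in rules["ext"]
        or p.name in rules["file"]
        or any(part in rules["dir"] for part in p.parts[:-1])
    )

rules = parse_filter_file("# comment\next:.py\ndir:tests\nfile:setup.cfg")
print(matches("/path/to/my/tests/file.js", rules))  # True (directory rule)
print(matches("src/main.py", rules))                # True (extension rule)
print(matches("src/main.cpp", rules))               # False
```

Note that `dir:` and `file:` match on a single path component, which is why `dir:tests` excludes `/path/to/my/tests/file.py` even though it is not a full path.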
repo2vec/chat.py CHANGED

```diff
@@ -70,13 +70,13 @@ def append_sources_to_response(response):
 def main():
     parser = argparse.ArgumentParser(description="UI to chat with your codebase")
     parser.add_argument("repo_id", help="The ID of the repository to index")
-    parser.add_argument("--llm-provider", default="anthropic", choices=["openai", "anthropic", "ollama"])
+    parser.add_argument("--llm-provider", default="ollama", choices=["openai", "anthropic", "ollama"])
     parser.add_argument(
         "--llm-model",
         help="The LLM name. Must be supported by the provider specified via --llm-provider.",
     )
-    parser.add_argument("--vector-store-type", default="pinecone", choices=["pinecone", "marqo"])
-    parser.add_argument("--index-name", required=True, help="Vector store index name")
+    parser.add_argument("--vector-store-type", default="marqo", choices=["pinecone", "marqo"])
+    parser.add_argument("--index-name", help="Vector store index name. Required for Pinecone.")
     parser.add_argument(
         "--marqo-url",
         default="http://localhost:8882",
@@ -89,11 +89,19 @@ def main():
     )
     args = parser.parse_args()
 
+    if not args.index_name:
+        if args.vector_store_type == "marqo":
+            args.index_name = args.repo_id.split("/")[1]
+        elif args.vector_store_type == "pinecone":
+            parser.error("Please specify --index-name for Pinecone.")
+
     if not args.llm_model:
         if args.llm_provider == "openai":
             args.llm_model = "gpt-4"
         elif args.llm_provider == "anthropic":
             args.llm_model = "claude-3-opus-20240229"
+        elif args.llm_provider == "ollama":
+            args.llm_model = "llama3.1"
         else:
             raise ValueError("Please specify --llm_model")
```
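The new fallback chain in chat.py (Marqo index name derived from the repo ID, plus a per-provider default model) can be read as a pure function. A minimal sketch, with names of our own choosing rather than anything from the repo:

```python
def resolve_defaults(repo_id, vector_store_type="marqo", index_name=None,
                     llm_provider="ollama", llm_model=None):
    """Mirror chat.py's defaulting: Marqo index name falls back to the repo
    name, while Pinecone requires an explicit one; each provider has a
    default model."""
    if index_name is None:
        if vector_store_type == "marqo":
            index_name = repo_id.split("/")[1]  # e.g. "org/repo" -> "repo"
        else:  # pinecone has no sensible default; the index is created manually
            raise ValueError("Please specify --index-name for Pinecone.")
    if llm_model is None:
        llm_model = {"openai": "gpt-4",
                     "anthropic": "claude-3-opus-20240229",
                     "ollama": "llama3.1"}[llm_provider]
    return index_name, llm_model

print(resolve_defaults("huggingface/transformers"))  # ('transformers', 'llama3.1')
```

This is what lets the README's local path reduce to a bare `r2v-chat $GITHUB_REPO` with no flags.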
repo2vec/index.py CHANGED

```diff
@@ -16,9 +16,13 @@ logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger()
 logger.setLevel(logging.INFO)
 
-MAX_TOKENS_PER_CHUNK = 8192  # The ADA embedder from OpenAI has a maximum of 8192 tokens.
-MAX_CHUNKS_PER_BATCH = 2048  # The OpenAI batch embedding API enforces a maximum of 2048 chunks per batch.
-MAX_TOKENS_PER_JOB = 3_000_000  # The OpenAI batch embedding API enforces a maximum of 3M tokens processed at once.
+MARQO_MAX_CHUNKS_PER_BATCH = 64
+
+OPENAI_MAX_TOKENS_PER_CHUNK = 8192  # The ADA embedder from OpenAI has a maximum of 8192 tokens.
+OPENAI_MAX_CHUNKS_PER_BATCH = 2048  # The OpenAI batch embedding API enforces a maximum of 2048 chunks per batch.
+OPENAI_MAX_TOKENS_PER_JOB = (
+    3_000_000  # The OpenAI batch embedding API enforces a maximum of 3M tokens processed at once.
+)
 
 # Note that OpenAI embedding models have fixed dimensions, however, taking a slice of them is possible.
 # See "Reducing embedding dimensions" under https://platform.openai.com/docs/guides/embeddings/use-cases and
@@ -33,7 +37,7 @@ OPENAI_DEFAULT_EMBEDDING_SIZE = {
 def main():
     parser = argparse.ArgumentParser(description="Batch-embeds a GitHub repository and its issues.")
     parser.add_argument("repo_id", help="The ID of the repository to index")
-    parser.add_argument("--embedder-type", default="openai", choices=["openai", "marqo"])
+    parser.add_argument("--embedder-type", default="marqo", choices=["openai", "marqo"])
     parser.add_argument(
         "--embedding-model",
         type=str,
@@ -47,7 +51,7 @@ def main():
         help="The embedding size to use for OpenAI text-embedding-3* models. Defaults to 1536 for small and 3072 for "
         "large. Note that no other OpenAI models support a dynamic embedding size, nor do models used with Marqo.",
     )
-    parser.add_argument("--vector-store-type", default="pinecone", choices=["pinecone", "marqo"])
+    parser.add_argument("--vector-store-type", default="marqo", choices=["pinecone", "marqo"])
     parser.add_argument(
         "--local-dir",
         default="repos",
@@ -62,13 +66,14 @@ def main():
     parser.add_argument(
         "--chunks-per-batch",
         type=int,
-        default=2000,
         help="Maximum chunks per batch. We recommend 2000 for the OpenAI embedder. Marqo enforces a limit of 64.",
     )
     parser.add_argument(
         "--index-name",
-        required=True,
-        help="Vector store index name. For Pinecone, make sure to create it with the right embedding size.",
+        default=None,
+        help="Vector store index name. For Marqo, we default it to the repository name. Required for Pinecone, since "
+        "it needs to be created manually on their website. In Pinecone terminology, this is *not* the namespace (which "
+        "we default to the repo ID).",
     )
     parser.add_argument(
         "--include",
@@ -119,17 +124,51 @@ def main():
         parser.error("When using OpenAI embedder, the vector store type must be Pinecone.")
     if args.embedder_type == "marqo" and args.vector_store_type != "marqo":
         parser.error("When using the marqo embedder, the vector store type must also be marqo.")
-    if args.embedder_type == "marqo" and args.chunks_per_batch > 64:
-        args.chunks_per_batch = 64
-        logging.warning("Marqo enforces a limit of 64 chunks per batch. Setting --chunks_per_batch to 64.")
-
-    # Validate other arguments.
-    if args.tokens_per_chunk > MAX_TOKENS_PER_CHUNK:
-        parser.error(f"The maximum number of tokens per chunk is {MAX_TOKENS_PER_CHUNK}.")
-    if args.chunks_per_batch > MAX_CHUNKS_PER_BATCH:
-        parser.error(f"The maximum number of chunks per batch is {MAX_CHUNKS_PER_BATCH}.")
-    if args.tokens_per_chunk * args.chunks_per_batch >= MAX_TOKENS_PER_JOB:
-        parser.error(f"The maximum number of chunks per job is {MAX_TOKENS_PER_JOB}.")
+    if args.vector_store_type == "marqo":
+        if not args.index_name:
+            args.index_name = args.repo_id.split("/")[1]
+        if "/" in args.index_name:
+            parser.error("The index name cannot contain slashes when using Marqo as the vector store.")
+    elif args.vector_store_type == "pinecone" and not args.index_name:
+        parser.error(
+            "When using Pinecone as the vector store, you must specify an index name. You can create one on "
+            "the Pinecone website. Make sure to set it to the right --embedding-size."
+        )
+
+    # Validate embedder parameters.
+    if args.embedder_type == "marqo":
+        if args.embedding_model is None:
+            args.embedding_model = "hf/e5-base-v2"
+        if args.chunks_per_batch is None:
+            args.chunks_per_batch = MARQO_MAX_CHUNKS_PER_BATCH
+        elif args.chunks_per_batch > MARQO_MAX_CHUNKS_PER_BATCH:
+            args.chunks_per_batch = MARQO_MAX_CHUNKS_PER_BATCH
+            logging.warning(
+                f"Marqo enforces a limit of {MARQO_MAX_CHUNKS_PER_BATCH} chunks per batch. "
+                "Overwriting --chunks_per_batch."
+            )
+    elif args.embedder_type == "openai":
+        if args.tokens_per_chunk > OPENAI_MAX_TOKENS_PER_CHUNK:
+            args.tokens_per_chunk = OPENAI_MAX_TOKENS_PER_CHUNK
+            logging.warning(
+                f"OpenAI enforces a limit of {OPENAI_MAX_TOKENS_PER_CHUNK} tokens per chunk. "
+                "Overwriting --tokens_per_chunk."
+            )
+        if args.chunks_per_batch is None:
+            args.chunks_per_batch = 2000
+        elif args.chunks_per_batch > OPENAI_MAX_CHUNKS_PER_BATCH:
+            args.chunks_per_batch = OPENAI_MAX_CHUNKS_PER_BATCH
+            logging.warning(
+                f"OpenAI enforces a limit of {OPENAI_MAX_CHUNKS_PER_BATCH} chunks per batch. "
+                "Overwriting --chunks_per_batch."
+            )
+        if args.tokens_per_chunk * args.chunks_per_batch >= OPENAI_MAX_TOKENS_PER_JOB:
+            parser.error(f"The maximum number of chunks per job is {OPENAI_MAX_TOKENS_PER_JOB}.")
+        if args.embedding_model is None:
+            args.embedding_model = "text-embedding-ada-002"
+        if args.embedding_size is None:
+            args.embedding_size = OPENAI_DEFAULT_EMBEDDING_SIZE.get(args.embedding_model)
+
     if args.include and args.exclude:
         parser.error("At most one of --include and --exclude can be specified.")
     if not args.include and not args.exclude:
@@ -137,12 +176,6 @@ def main():
     if not args.index_repo and not args.index_issues:
         parser.error("At least one of --index-repo and --index-issues must be true.")
 
-    # Set default values based on other arguments
-    if args.embedding_model is None:
-        args.embedding_model = "text-embedding-ada-002" if args.embedder_type == "openai" else "hf/e5-base-v2"
-    if args.embedding_size is None and args.embedder_type == "openai":
-        args.embedding_size = OPENAI_DEFAULT_EMBEDDING_SIZE.get(args.embedding_model)
-
     # Fail early on missing environment variables.
     if args.embedder_type == "openai" and not os.getenv("OPENAI_API_KEY"):
         parser.error("Please set the OPENAI_API_KEY environment variable.")
```