Commit 90af3bf by juliaturc · Parent: 7db04dd

Default to Marqo to simplify "Getting started"

Files changed (3):
  1. README.md +95 -60
  2. repo2vec/chat.py +11 -3
  3. repo2vec/index.py +58 -25
README.md CHANGED

````diff
@@ -51,6 +51,11 @@ To install the library, simply run `pip install repo2vec`!
    export PINECONE_API_KEY=...
    ```
 
+2. Create a Pinecone index [on their website](https://pinecone.io) and export the name:
+   ```
+   export PINECONE_INDEX_NAME=...
+   ```
+
 2. For chatting with an LLM, we support OpenAI and Anthropic. For the latter, set an additional API key:
 
    ```
@@ -68,80 +73,110 @@ If you are planning on indexing GitHub issues in addition to the codebase, you w
 ## Running it
 
 <details open>
-<summary><strong>:computer: Running locally</strong></summary>
-<p>To index the codebase, run this command. This should take a few minutes, depending on the repo size.</p>
-
-    # this can be any GitHub repository in the format ORG_NAME/REPO_NAME
-    r2v-index Storia-AI/repo2vec \
-        --embedder-type=marqo \
-        --vector-store-type=marqo \
-        --index-name=your-index-name
-
-<p>To chat with your codebase, run this command:</p>
-
-    # this can be any GitHub repository in the format ORG_NAME/REPO_NAME
-    r2v-chat Storia-AI/repo2vec \
-        --vector-store-type=marqo \
-        --index-name=your-index-name \
-        --llm-provider=ollama \
-        --llm-model=llama3.1
+<summary><strong>:computer: Run locally</strong></summary>
+
+1. Select your desired repository:
+   ```
+   export GITHUB_REPO=huggingface/transformers
+   ```
+
+2. Index the repository. This might take a few minutes, depending on its size.
+   ```
+   r2v-index $GITHUB_REPO
+   ```
+
+3. Chat with the repository, once it's indexed:
+   ```
+   r2v-chat $GITHUB_REPO
+   ```
+   To get a public URL for your chat app, set `--share=true`.
+
 </details>
 
 <details>
-<summary><strong>:cloud: Using external providers</strong></summary>
-<p>To index the codebase, run this command. This should take a few minutes, depending on the repo size.</p>
+<summary><strong>:cloud: Use external providers</strong></summary>
 
-    # this can be any GitHub repository in the format ORG_NAME/REPO_NAME
-    r2v-index Storia-AI/repo2vec \
-        --embedder-type=openai \
-        --vector-store-type=pinecone \
-        --index-name=your-index-name
+1. Select your desired repository:
+   ```
+   export GITHUB_REPO=huggingface/transformers
+   ```
 
-<p>To chat with your codebase, run this command:</p>
+2. Index the repository. This might take a few minutes, depending on its size.
+   ```
+   r2v-index $GITHUB_REPO \
+       --embedder-type=openai \
+       --vector-store-type=pinecone \
+       --index-name=$PINECONE_INDEX_NAME
+   ```
 
-    # this can be any GitHub repository in the format ORG_NAME/REPO_NAME
-    r2v-chat Storia-AI/repo2vec \
-        --vector-store-type=pinecone \
-        --index-name=your-index-name \
-        --llm-provider=openai \
-        --llm-model=gpt-4
-
-To get a public URL for your chat app, set `--share=true`.
+3. Chat with the repository, once it's indexed:
+   ```
+   r2v-chat $GITHUB_REPO \
+       --vector-store-type=pinecone \
+       --index-name=$PINECONE_INDEX_NAME \
+       --llm-provider=openai \
+       --llm-model=gpt-4
+   ```
+   To get a public URL for your chat app, set `--share=true`.
 </details>
 
 ## Additional features
 
-- **Control which files get indexed** based on their extension. You can whitelist or blacklist extensions by passing a file with one extension per line (in the format `.ext`):
-  - To only index a whitelist of files:
-
-    ```
-    r2v-index ... --include=/path/to/extensions/file
-    ```
-
-  - To index all code except a blacklist of files:
-
-    ```
-    r2v-index ... --exclude=/path/to/extensions/file
-    ```
-
-- **Index open GitHub issues** (remember to `export GITHUB_TOKEN=...`):
-  - To index GitHub issues without comments:
-
-    ```
-    r2v-index ... --index-issues
-    ```
-
-  - To index GitHub issues with comments:
-
-    ```
-    r2v-index ... --index-issues --index-issue-comments
-    ```
-
-  - To index GitHub issues, but not the codebase:
-
-    ```
-    r2v-index ... --index-issues --no-index-repo
-    ```
+<details>
+<summary><strong>:hammer_and_wrench: Control which files get indexed</strong></summary>
+
+You can specify an inclusion or exclusion file in the following format:
+```
+# This is a comment
+ext:.my-ext-1
+ext:.my-ext-2
+ext:.my-ext-3
+dir:my-dir-1
+dir:my-dir-2
+dir:my-dir-3
+file:my-file-1.md
+file:my-file-2.py
+file:my-file-3.cpp
+```
+where:
+- `ext` specifies a file extension
+- `dir` specifies a directory. This is not a full path. For instance, if you specify `dir:tests` in an exclusion file, then a file like `/path/to/my/tests/file.py` will be ignored.
+- `file` specifies a file name. This is also not a full path. For instance, if you specify `file:__init__.py`, then a file like `/path/to/my/__init__.py` will be ignored.
+
+To specify an inclusion file (i.e. only index the specified files):
+```
+r2v-index $GITHUB_REPO --include=/path/to/inclusion/file
+```
+
+To specify an exclusion file (i.e. index all files, except for the ones specified):
+```
+r2v-index $GITHUB_REPO --exclude=/path/to/exclusion/file
+```
+By default, we use the exclusion file [sample-exclude.txt](repo2vec/sample-exclude.txt).
+</details>
 
+<details>
+<summary><strong>:bug: Index open GitHub issues</strong></summary>
+
+You will need a GitHub token first:
+```
+export GITHUB_TOKEN=...
+```
+
+To index GitHub issues without comments:
+```
+r2v-index $GITHUB_REPO --index-issues
+```
+
+To index GitHub issues with comments:
+```
+r2v-index $GITHUB_REPO --index-issues --index-issue-comments
+```
+
+To index GitHub issues, but not the codebase:
+```
+r2v-index $GITHUB_REPO --index-issues --no-index-repo
+```
+</details>
 
 # Why chat with a codebase?
````
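The inclusion/exclusion file format introduced in the README (`ext:`/`dir:`/`file:` entries, `#` comments) can be parsed with a few lines of Python. This is an illustrative sketch of the matching semantics described above, not the repo's actual implementation; the function names are ours:

```python
from pathlib import PurePosixPath

def parse_filter_file(text: str) -> dict:
    """Parse an inclusion/exclusion file into ext/dir/file rule sets."""
    rules = {"ext": set(), "dir": set(), "file": set()}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        kind, _, value = line.partition(":")
        if kind in rules and value:
            rules[kind].add(value)
    return rules

def matches(path: str, rules: dict) -> bool:
    """True if any rule matches: extension, a directory component, or the file name."""
    p = PurePosixPath(path)
    return (
        p.suffix in rules["ext"]
        or p.name in rules["file"]
        or any(part in rules["dir"] for part in p.parts[:-1])
    )

rules = parse_filter_file("# comment\next:.py\ndir:tests\nfile:setup.cfg")
print(matches("/path/to/my/tests/file.js", rules))  # True (directory rule)
print(matches("src/main.py", rules))                # True (extension rule)
print(matches("src/main.cpp", rules))               # False
```

Note that `dir:` and `file:` match on a single path component, which is why `dir:tests` excludes `/path/to/my/tests/file.py` even though it is not a full path.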
repo2vec/chat.py CHANGED

```diff
@@ -70,13 +70,13 @@ def append_sources_to_response(response):
 def main():
     parser = argparse.ArgumentParser(description="UI to chat with your codebase")
     parser.add_argument("repo_id", help="The ID of the repository to index")
-    parser.add_argument("--llm-provider", default="anthropic", choices=["openai", "anthropic", "ollama"])
+    parser.add_argument("--llm-provider", default="ollama", choices=["openai", "anthropic", "ollama"])
     parser.add_argument(
         "--llm-model",
         help="The LLM name. Must be supported by the provider specified via --llm-provider.",
     )
-    parser.add_argument("--vector-store-type", default="pinecone", choices=["pinecone", "marqo"])
-    parser.add_argument("--index-name", required=True, help="Vector store index name")
+    parser.add_argument("--vector-store-type", default="marqo", choices=["pinecone", "marqo"])
+    parser.add_argument("--index-name", help="Vector store index name. Required for Pinecone.")
     parser.add_argument(
         "--marqo-url",
         default="http://localhost:8882",
@@ -89,11 +89,19 @@ def main():
     )
     args = parser.parse_args()
 
+    if not args.index_name:
+        if args.vector_store_type == "marqo":
+            args.index_name = args.repo_id.split("/")[1]
+        elif args.vector_store_type == "pinecone":
+            parser.error("Please specify --index-name for Pinecone.")
+
     if not args.llm_model:
         if args.llm_provider == "openai":
             args.llm_model = "gpt-4"
         elif args.llm_provider == "anthropic":
             args.llm_model = "claude-3-opus-20240229"
+        elif args.llm_provider == "ollama":
+            args.llm_model = "llama3.1"
         else:
             raise ValueError("Please specify --llm_model")
```
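The new fallback chain in chat.py (Marqo index name derived from the repo ID, plus a per-provider default model) can be read as a pure function. A minimal sketch, with names of our own choosing rather than anything from the repo:

```python
def resolve_defaults(repo_id, vector_store_type="marqo", index_name=None,
                     llm_provider="ollama", llm_model=None):
    """Mirror chat.py's defaulting: Marqo index name falls back to the repo
    name, while Pinecone requires an explicit one; each provider has a
    default model."""
    if index_name is None:
        if vector_store_type == "marqo":
            index_name = repo_id.split("/")[1]  # e.g. "org/repo" -> "repo"
        else:  # pinecone has no sensible default; the index is created manually
            raise ValueError("Please specify --index-name for Pinecone.")
    if llm_model is None:
        llm_model = {"openai": "gpt-4",
                     "anthropic": "claude-3-opus-20240229",
                     "ollama": "llama3.1"}[llm_provider]
    return index_name, llm_model

print(resolve_defaults("huggingface/transformers"))  # ('transformers', 'llama3.1')
```

This is what lets the README's local path reduce to a bare `r2v-chat $GITHUB_REPO` with no flags.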
repo2vec/index.py CHANGED

```diff
@@ -16,9 +16,13 @@ logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger()
 logger.setLevel(logging.INFO)
 
-MAX_TOKENS_PER_CHUNK = 8192  # The ADA embedder from OpenAI has a maximum of 8192 tokens.
-MAX_CHUNKS_PER_BATCH = 2048  # The OpenAI batch embedding API enforces a maximum of 2048 chunks per batch.
-MAX_TOKENS_PER_JOB = 3_000_000  # The OpenAI batch embedding API enforces a maximum of 3M tokens processed at once.
+MARQO_MAX_CHUNKS_PER_BATCH = 64
+
+OPENAI_MAX_TOKENS_PER_CHUNK = 8192  # The ADA embedder from OpenAI has a maximum of 8192 tokens.
+OPENAI_MAX_CHUNKS_PER_BATCH = 2048  # The OpenAI batch embedding API enforces a maximum of 2048 chunks per batch.
+OPENAI_MAX_TOKENS_PER_JOB = (
+    3_000_000  # The OpenAI batch embedding API enforces a maximum of 3M tokens processed at once.
+)
 
 # Note that OpenAI embedding models have fixed dimensions, however, taking a slice of them is possible.
 # See "Reducing embedding dimensions" under https://platform.openai.com/docs/guides/embeddings/use-cases and
@@ -33,7 +37,7 @@ OPENAI_DEFAULT_EMBEDDING_SIZE = {
 def main():
     parser = argparse.ArgumentParser(description="Batch-embeds a GitHub repository and its issues.")
     parser.add_argument("repo_id", help="The ID of the repository to index")
-    parser.add_argument("--embedder-type", default="openai", choices=["openai", "marqo"])
+    parser.add_argument("--embedder-type", default="marqo", choices=["openai", "marqo"])
     parser.add_argument(
         "--embedding-model",
         type=str,
@@ -47,7 +51,7 @@ def main():
         help="The embedding size to use for OpenAI text-embedding-3* models. Defaults to 1536 for small and 3072 for "
         "large. Note that no other OpenAI models support a dynamic embedding size, nor do models used with Marqo.",
     )
-    parser.add_argument("--vector-store-type", default="pinecone", choices=["pinecone", "marqo"])
+    parser.add_argument("--vector-store-type", default="marqo", choices=["pinecone", "marqo"])
     parser.add_argument(
         "--local-dir",
         default="repos",
@@ -62,13 +66,14 @@ def main():
     parser.add_argument(
         "--chunks-per-batch",
         type=int,
-        default=2000,
         help="Maximum chunks per batch. We recommend 2000 for the OpenAI embedder. Marqo enforces a limit of 64.",
     )
     parser.add_argument(
         "--index-name",
-        required=True,
-        help="Vector store index name. For Pinecone, make sure to create it with the right embedding size.",
+        default=None,
+        help="Vector store index name. For Marqo, we default it to the repository name. Required for Pinecone, since "
+        "it needs to be created manually on their website. In Pinecone terminology, this is *not* the namespace (which "
+        "we default to the repo ID).",
     )
     parser.add_argument(
         "--include",
@@ -119,17 +124,51 @@ def main():
         parser.error("When using OpenAI embedder, the vector store type must be Pinecone.")
     if args.embedder_type == "marqo" and args.vector_store_type != "marqo":
         parser.error("When using the marqo embedder, the vector store type must also be marqo.")
-    if args.embedder_type == "marqo" and args.chunks_per_batch > 64:
-        args.chunks_per_batch = 64
-        logging.warning("Marqo enforces a limit of 64 chunks per batch. Setting --chunks_per_batch to 64.")
-
-    # Validate other arguments.
-    if args.tokens_per_chunk > MAX_TOKENS_PER_CHUNK:
-        parser.error(f"The maximum number of tokens per chunk is {MAX_TOKENS_PER_CHUNK}.")
-    if args.chunks_per_batch > MAX_CHUNKS_PER_BATCH:
-        parser.error(f"The maximum number of chunks per batch is {MAX_CHUNKS_PER_BATCH}.")
-    if args.tokens_per_chunk * args.chunks_per_batch >= MAX_TOKENS_PER_JOB:
-        parser.error(f"The maximum number of chunks per job is {MAX_TOKENS_PER_JOB}.")
+    if args.vector_store_type == "marqo":
+        if not args.index_name:
+            args.index_name = args.repo_id.split("/")[1]
+        if "/" in args.index_name:
+            parser.error("The index name cannot contain slashes when using Marqo as the vector store.")
+    elif args.vector_store_type == "pinecone" and not args.index_name:
+        parser.error(
+            "When using Pinecone as the vector store, you must specify an index name. You can create one on "
+            "the Pinecone website. Make sure to set it to the right --embedding-size."
+        )
+
+    # Validate embedder parameters.
+    if args.embedder_type == "marqo":
+        if args.embedding_model is None:
+            args.embedding_model = "hf/e5-base-v2"
+        if args.chunks_per_batch is None:
+            args.chunks_per_batch = MARQO_MAX_CHUNKS_PER_BATCH
+        elif args.chunks_per_batch > MARQO_MAX_CHUNKS_PER_BATCH:
+            args.chunks_per_batch = MARQO_MAX_CHUNKS_PER_BATCH
+            logging.warning(
+                f"Marqo enforces a limit of {MARQO_MAX_CHUNKS_PER_BATCH} chunks per batch. "
+                "Overwriting --chunks_per_batch."
+            )
+    elif args.embedder_type == "openai":
+        if args.tokens_per_chunk > OPENAI_MAX_TOKENS_PER_CHUNK:
+            args.tokens_per_chunk = OPENAI_MAX_TOKENS_PER_CHUNK
+            logging.warning(
+                f"OpenAI enforces a limit of {OPENAI_MAX_TOKENS_PER_CHUNK} tokens per chunk. "
+                "Overwriting --tokens_per_chunk."
+            )
+        if args.chunks_per_batch is None:
+            args.chunks_per_batch = 2000
+        elif args.chunks_per_batch > OPENAI_MAX_CHUNKS_PER_BATCH:
+            args.chunks_per_batch = OPENAI_MAX_CHUNKS_PER_BATCH
+            logging.warning(
+                f"OpenAI enforces a limit of {OPENAI_MAX_CHUNKS_PER_BATCH} chunks per batch. "
+                "Overwriting --chunks_per_batch."
+            )
+        if args.tokens_per_chunk * args.chunks_per_batch >= OPENAI_MAX_TOKENS_PER_JOB:
+            parser.error(f"The maximum number of chunks per job is {OPENAI_MAX_TOKENS_PER_JOB}.")
+        if args.embedding_model is None:
+            args.embedding_model = "text-embedding-ada-002"
+        if args.embedding_size is None:
+            args.embedding_size = OPENAI_DEFAULT_EMBEDDING_SIZE.get(args.embedding_model)
+
     if args.include and args.exclude:
         parser.error("At most one of --include and --exclude can be specified.")
     if not args.include and not args.exclude:
@@ -137,12 +176,6 @@ def main():
     if not args.index_repo and not args.index_issues:
         parser.error("At least one of --index-repo and --index-issues must be true.")
 
-    # Set default values based on other arguments
-    if args.embedding_model is None:
-        args.embedding_model = "text-embedding-ada-002" if args.embedder_type == "openai" else "hf/e5-base-v2"
-    if args.embedding_size is None and args.embedder_type == "openai":
-        args.embedding_size = OPENAI_DEFAULT_EMBEDDING_SIZE.get(args.embedding_model)
-
     # Fail early on missing environment variables.
     if args.embedder_type == "openai" and not os.getenv("OPENAI_API_KEY"):
         parser.error("Please set the OPENAI_API_KEY environment variable.")
```