juliaturc committed on
Commit
d5c979a
·
1 Parent(s): 559dd34

Add inclusion and exclusion sets.

Browse files
Files changed (5) hide show
  1. README.md +17 -2
  2. src/chat.py +14 -5
  3. src/index.py +22 -1
  4. src/repo_manager.py +40 -32
  5. src/sample-exclude.txt +62 -0
README.md CHANGED
@@ -1,5 +1,5 @@
1
  # Overview
2
- `repo2vec` enables you to chat with your codebase by simply running two python scripts:
3
  ```
4
  pip install -r requirements.txt
5
 
@@ -11,7 +11,12 @@ export PINECONE_INDEX_NAME=...
11
  python src/index.py $GITHUB_REPO_NAME --pinecone_index_name=$PINECONE_INDEX_NAME
12
  python src/chat.py $GITHUB_REPO_NAME --pinecone_index_name=$PINECONE_INDEX_NAME
13
  ```
14
- This will bring up a `gradio` app where you can ask questions about your codebase. The assistant responses always include GitHub links to the documents retrieved for each query.
 
 
 
 
 
15
 
16
  Here is, for example, a conversation about the repo [Storia-AI/image-eval](https://github.com/Storia-AI/image-eval):
17
  ![screenshot](assets/chat_screenshot.png)
@@ -29,6 +34,16 @@ The `src/index.py` script performs the following steps:
29
  4. **Stores embeddings in a vector store**. See [VectorStore](src/vector_store.py).
30
  - By default, we use [Pinecone](https://pinecone.io) as a vector store, but you can easily plug in your own.
31
 
 
 
 
 
 
 
 
 
 
 
32
  ## Chatting via RAG
33
  The `src/chat.py` brings up a [Gradio app](https://www.gradio.app/) with a chat interface as shown above. We use [LangChain](https://langchain.com) to define a RAG chain which, given a user query about the repository:
34
 
 
1
  # Overview
2
+ `repo2vec` enables you to index your codebase and chat with it by simply running two python scripts:
3
  ```
4
  pip install -r requirements.txt
5
 
 
11
  python src/index.py $GITHUB_REPO_NAME --pinecone_index_name=$PINECONE_INDEX_NAME
12
  python src/chat.py $GITHUB_REPO_NAME --pinecone_index_name=$PINECONE_INDEX_NAME
13
  ```
14
+ This will index your entire codebase in a vector DB, then bring up a `gradio` app where you can ask questions about it. The assistant responses always include GitHub links to the documents retrieved for each query.
15
+
16
+ To make the gradio chat app accessible publicly, you can set `--share=true`:
17
+ ```
18
+ python src/chat.py $GITHUB_REPO_NAME --share=true ...
19
+ ```
20
 
21
  Here is, for example, a conversation about the repo [Storia-AI/image-eval](https://github.com/Storia-AI/image-eval):
22
  ![screenshot](assets/chat_screenshot.png)
 
34
  4. **Stores embeddings in a vector store**. See [VectorStore](src/vector_store.py).
35
  - By default, we use [Pinecone](https://pinecone.io) as a vector store, but you can easily plug in your own.
36
 
37
+ Note that you can specify an inclusion or exclusion set for the file extensions you want indexed. To specify an extension inclusion set, you can add the `--include` flag:
38
+ ```
39
+ python src/index.py repo-org/repo-name --include=/path/to/file/with/extensions
40
+ ```
41
+ Conversely, to specify an extension exclusion set, you can add the `--exclude` flag:
42
+ ```
43
+ python src/index.py repo-org/repo-name --exclude=src/sample-exclude.txt
44
+ ```
45
+ Extensions must be specified one per line, in the form `.ext`.
46
+
47
  ## Chatting via RAG
48
  The `src/chat.py` brings up a [Gradio app](https://www.gradio.app/) with a chat interface as shown above. We use [LangChain](https://langchain.com) to define a RAG chain which, given a user query about the repository:
49
 
src/chat.py CHANGED
@@ -86,11 +86,18 @@ if __name__ == "__main__":
86
  parser = argparse.ArgumentParser(description="UI to chat with your codebase")
87
  parser.add_argument("repo_id", help="The ID of the repository to index")
88
  parser.add_argument(
89
- "--openai_model", default="gpt-4", help="The OpenAI model to use for response generation"
 
 
90
  )
91
  parser.add_argument(
92
  "--pinecone_index_name", required=True, help="Pinecone index name"
93
  )
 
 
 
 
 
94
  args = parser.parse_args()
95
 
96
  rag_chain = build_rag_chain(args)
@@ -108,7 +115,9 @@ if __name__ == "__main__":
108
  answer = append_sources_to_response(response)
109
  return answer
110
 
111
- gr.ChatInterface(_predict,
112
- title=args.repo_id,
113
- description=f"Code sage for your repo: {args.repo_id}",
114
- examples=["What does this repo do?", "Give me some sample code."]).launch()
 
 
 
86
  parser = argparse.ArgumentParser(description="UI to chat with your codebase")
87
  parser.add_argument("repo_id", help="The ID of the repository to index")
88
  parser.add_argument(
89
+ "--openai_model",
90
+ default="gpt-4",
91
+ help="The OpenAI model to use for response generation",
92
  )
93
  parser.add_argument(
94
  "--pinecone_index_name", required=True, help="Pinecone index name"
95
  )
96
+ parser.add_argument(
97
+ "--share",
98
+ default=False,
99
+ help="Whether to make the gradio app publicly accessible.",
100
+ )
101
  args = parser.parse_args()
102
 
103
  rag_chain = build_rag_chain(args)
 
115
  answer = append_sources_to_response(response)
116
  return answer
117
 
118
+ gr.ChatInterface(
119
+ _predict,
120
+ title=args.repo_id,
121
+ description=f"Code sage for your repo: {args.repo_id}",
122
+ examples=["What does this repo do?", "Give me some sample code."],
123
+ ).launch(share=args.share)
src/index.py CHANGED
@@ -21,6 +21,11 @@ MAX_CHUNKS_PER_BATCH = (
21
  MAX_TOKENS_PER_JOB = 3_000_000 # The OpenAI batch embedding API enforces a maximum of 3M tokens processed at once.
22
 
23
 
 
 
 
 
 
24
  def main():
25
  parser = argparse.ArgumentParser(description="Batch-embeds a repository")
26
  parser.add_argument("repo_id", help="The ID of the repository to index")
@@ -41,6 +46,12 @@ def main():
41
  parser.add_argument(
42
  "--pinecone_index_name", required=True, help="Pinecone index name"
43
  )
 
 
 
 
 
 
44
 
45
  args = parser.parse_args()
46
 
@@ -55,9 +66,19 @@ def main():
55
  )
56
  if args.tokens_per_chunk * args.chunks_per_batch >= MAX_TOKENS_PER_JOB:
57
  parser.error(f"The maximum number of chunks per job is {MAX_TOKENS_PER_JOB}.")
 
 
 
 
 
58
 
59
  logging.info("Cloning the repository...")
60
- repo_manager = RepoManager(args.repo_id, local_dir=args.local_dir)
 
 
 
 
 
61
  repo_manager.clone()
62
 
63
  logging.info("Issuing embedding jobs...")
 
21
  MAX_TOKENS_PER_JOB = 3_000_000 # The OpenAI batch embedding API enforces a maximum of 3M tokens processed at once.
22
 
23
 
24
+ def _read_extensions(path):
25
+ with open(path, "r") as f:
26
+ return {line.strip().lower() for line in f}
27
+
28
+
29
  def main():
30
  parser = argparse.ArgumentParser(description="Batch-embeds a repository")
31
  parser.add_argument("repo_id", help="The ID of the repository to index")
 
46
  parser.add_argument(
47
  "--pinecone_index_name", required=True, help="Pinecone index name"
48
  )
49
+ parser.add_argument(
50
+ "--include", help="Path to a file containing a list of extensions to include. One extension per line."
51
+ )
52
+ parser.add_argument(
53
+ "--exclude", help="Path to a file containing a list of extensions to exclude. One extension per line."
54
+ )
55
 
56
  args = parser.parse_args()
57
 
 
66
  )
67
  if args.tokens_per_chunk * args.chunks_per_batch >= MAX_TOKENS_PER_JOB:
68
  parser.error(f"The maximum number of chunks per job is {MAX_TOKENS_PER_JOB}.")
69
+ if args.include and args.exclude:
70
+ parser.error("At most one of --include and --exclude can be specified.")
71
+
72
+ included_extensions = _read_extensions(args.include) if args.include else None
73
+ excluded_extensions = _read_extensions(args.exclude) if args.exclude else None
74
 
75
  logging.info("Cloning the repository...")
76
+ repo_manager = RepoManager(
77
+ args.repo_id,
78
+ local_dir=args.local_dir,
79
+ included_extensions=included_extensions,
80
+ excluded_extensions=excluded_extensions,
81
+ )
82
  repo_manager.clone()
83
 
84
  logging.info("Issuing embedding jobs...")
src/repo_manager.py CHANGED
@@ -11,7 +11,13 @@ from git import GitCommandError, Repo
11
  class RepoManager:
12
  """Class to manage a local clone of a GitHub repository."""
13
 
14
- def __init__(self, repo_id: str, local_dir: str = None):
 
 
 
 
 
 
15
  """
16
  Args:
17
  repo_id: The identifier of the repository in owner/repo format, e.g. "Storia-AI/repo2vec".
@@ -23,11 +29,15 @@ class RepoManager:
23
  os.makedirs(self.local_dir)
24
  self.local_path = os.path.join(self.local_dir, repo_id)
25
  self.access_token = os.getenv("GITHUB_TOKEN")
 
 
26
 
27
  @cached_property
28
  def is_public(self) -> bool:
29
  """Checks whether a GitHub repository is publicly visible."""
30
- response = requests.get(f"https://api.github.com/repos/{self.repo_id}", timeout=10)
 
 
31
  # Note that the response will be 404 for both private and non-existent repos.
32
  return response.status_code == 200
33
 
@@ -40,13 +50,17 @@ class RepoManager:
40
  if self.access_token:
41
  headers["Authorization"] = f"token {self.access_token}"
42
 
43
- response = requests.get(f"https://api.github.com/repos/{self.repo_id}", headers=headers)
 
 
44
  if response.status_code == 200:
45
  branch = response.json().get("default_branch", "main")
46
  else:
47
  # This happens sometimes when we exceed the Github rate limit. The best bet in this case is to assume the
48
  # most common naming for the default branch ("main").
49
- logging.warn(f"Unable to fetch default branch for {self.repo_id}: {response.text}")
 
 
50
  branch = "main"
51
  return branch
52
 
@@ -73,12 +87,20 @@ class RepoManager:
73
  return False
74
  return True
75
 
76
- def walk(
77
- self,
78
- included_extensions: set = None,
79
- excluded_extensions: set = None,
80
- log_dir: str = None,
81
- ):
 
 
 
 
 
 
 
 
82
  """Walks the local repository path and yields a tuple of (filepath, content) for each file.
83
  The filepath is relative to the root of the repository (e.g. "org/repo/your/file/path.py").
84
 
@@ -87,24 +109,6 @@ class RepoManager:
87
  excluded_extensions: Optional set of extensions to exclude.
88
  log_dir: Optional directory where to log the included and excluded files.
89
  """
90
- # Convert included and excluded extensions to lowercase.
91
- if included_extensions:
92
- included_extensions = {ext.lower() for ext in included_extensions}
93
- if excluded_extensions:
94
- excluded_extensions = {ext.lower() for ext in excluded_extensions}
95
-
96
- def include(file_path: str) -> bool:
97
- _, extension = os.path.splitext(file_path)
98
- extension = extension.lower()
99
- if included_extensions and extension not in included_extensions:
100
- return False
101
- if excluded_extensions and extension in excluded_extensions:
102
- return False
103
- # Exclude hidden files and directories.
104
- if any(part.startswith(".") for part in file_path.split(os.path.sep)):
105
- return False
106
- return True
107
-
108
  # We will keep appending to these files during the iteration, so we need to clear them first.
109
  if log_dir:
110
  repo_name = self.repo_id.replace("/", "_")
@@ -117,7 +121,7 @@ class RepoManager:
117
 
118
  for root, _, files in os.walk(self.local_path):
119
  file_paths = [os.path.join(root, file) for file in files]
120
- included_file_paths = [f for f in file_paths if include(f)]
121
 
122
  if log_dir:
123
  with open(included_log_file, "a") as f:
@@ -136,11 +140,15 @@ class RepoManager:
136
  try:
137
  contents = f.read()
138
  except UnicodeDecodeError:
139
- logging.warning("Unable to decode file %s. Skipping.", file_path)
 
 
140
  continue
141
  yield file_path[len(self.local_dir) + 1 :], contents
142
 
143
  def github_link_for_file(self, file_path: str) -> str:
144
  """Converts a repository file path to a GitHub link."""
145
- file_path = file_path[len(self.repo_id):]
146
- return f"https://github.com/{self.repo_id}/blob/{self.default_branch}/{file_path}"
 
 
 
11
  class RepoManager:
12
  """Class to manage a local clone of a GitHub repository."""
13
 
14
+ def __init__(
15
+ self,
16
+ repo_id: str,
17
+ local_dir: str = None,
18
+ included_extensions: set = None,
19
+ excluded_extensions: set = None,
20
+ ):
21
  """
22
  Args:
23
  repo_id: The identifier of the repository in owner/repo format, e.g. "Storia-AI/repo2vec".
 
29
  os.makedirs(self.local_dir)
30
  self.local_path = os.path.join(self.local_dir, repo_id)
31
  self.access_token = os.getenv("GITHUB_TOKEN")
32
+ self.included_extensions = included_extensions
33
+ self.excluded_extensions = excluded_extensions
34
 
35
  @cached_property
36
  def is_public(self) -> bool:
37
  """Checks whether a GitHub repository is publicly visible."""
38
+ response = requests.get(
39
+ f"https://api.github.com/repos/{self.repo_id}", timeout=10
40
+ )
41
  # Note that the response will be 404 for both private and non-existent repos.
42
  return response.status_code == 200
43
 
 
50
  if self.access_token:
51
  headers["Authorization"] = f"token {self.access_token}"
52
 
53
+ response = requests.get(
54
+ f"https://api.github.com/repos/{self.repo_id}", headers=headers
55
+ )
56
  if response.status_code == 200:
57
  branch = response.json().get("default_branch", "main")
58
  else:
59
  # This happens sometimes when we exceed the Github rate limit. The best bet in this case is to assume the
60
  # most common naming for the default branch ("main").
61
+ logging.warn(
62
+ f"Unable to fetch default branch for {self.repo_id}: {response.text}"
63
+ )
64
  branch = "main"
65
  return branch
66
 
 
87
  return False
88
  return True
89
 
90
+ def _should_include(self, file_path: str) -> bool:
91
+ """Checks whether the file should be indexed, based on the included and excluded extensions."""
92
+ _, extension = os.path.splitext(file_path)
93
+ extension = extension.lower()
94
+ if self.included_extensions and extension not in self.included_extensions:
95
+ return False
96
+ if self.excluded_extensions and extension in self.excluded_extensions:
97
+ return False
98
+ # Exclude hidden files and directories.
99
+ if any(part.startswith(".") for part in file_path.split(os.path.sep)):
100
+ return False
101
+ return True
102
+
103
+ def walk(self, log_dir: str = None):
104
  """Walks the local repository path and yields a tuple of (filepath, content) for each file.
105
  The filepath is relative to the root of the repository (e.g. "org/repo/your/file/path.py").
106
 
 
109
  excluded_extensions: Optional set of extensions to exclude.
110
  log_dir: Optional directory where to log the included and excluded files.
111
  """
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
112
  # We will keep appending to these files during the iteration, so we need to clear them first.
113
  if log_dir:
114
  repo_name = self.repo_id.replace("/", "_")
 
121
 
122
  for root, _, files in os.walk(self.local_path):
123
  file_paths = [os.path.join(root, file) for file in files]
124
+ included_file_paths = [f for f in file_paths if self._should_include(f)]
125
 
126
  if log_dir:
127
  with open(included_log_file, "a") as f:
 
140
  try:
141
  contents = f.read()
142
  except UnicodeDecodeError:
143
+ logging.warning(
144
+ "Unable to decode file %s. Skipping.", file_path
145
+ )
146
  continue
147
  yield file_path[len(self.local_dir) + 1 :], contents
148
 
149
def github_link_for_file(self, file_path: str) -> str:
    """Converts a repository file path to a GitHub link.

    The incoming path is prefixed with the repo ID (e.g.
    "org/repo/src/file.py", as yielded by `walk`); that prefix is stripped
    before building the URL. We also drop the path separator left over
    from the slice, which would otherwise produce a malformed URL with a
    double slash ("blob/main//src/file.py").
    """
    relative_path = file_path[len(self.repo_id) :].lstrip("/")
    return (
        f"https://github.com/{self.repo_id}/blob/{self.default_branch}/{relative_path}"
    )
src/sample-exclude.txt ADDED
@@ -0,0 +1,62 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ .avi
2
+ .bazel
3
+ .bin
4
+ .binpb
5
+ .bmp
6
+ .crt
7
+ .css
8
+ .dat
9
+ .db
10
+ .duckdb
11
+ .eot
12
+ .exe
13
+ .gif
14
+ .gguf
15
+ .glb
16
+ .gz
17
+ .ico
18
+ .icns
19
+ .inp
20
+ .ipynb
21
+ .isl
22
+ .jar
23
+ .jpeg
24
+ .jpg
25
+ .json
26
+ .key
27
+ .lock
28
+ .mo
29
+ .model
30
+ .mov
31
+ .mp3
32
+ .mp4
33
+ .otf
34
+ .out
35
+ .Packages
36
+ .pb
37
+ .pdf
38
+ .pem
39
+ .pickle
40
+ .png
41
+ .pt
42
+ .ptl
43
+ .s
44
+ .sqlite
45
+ .stl
46
+ .sum
47
+ .svg
48
+ .tar
49
+ .th
50
+ .tgz
51
+ .toml
52
+ .ts-fixture
53
+ .ttf
54
+ .wav
55
+ .webp
56
+ .wmv
57
+ .woff
58
+ .woff2
59
+ .xml
60
+ .yaml
61
+ .yml
62
+ .zip