juliaturc committed · Commit 2e96b6e · 1 Parent(s): 5f9eeb4

Update README.

Files changed (1): README.md (+53 -27)
**Ok, but why chat with a codebase?**

Sometimes you just want to learn how a codebase works and how to integrate it, without spending hours sifting through the code itself.

`repo2vec` is like GitHub Copilot but with the most up-to-date information about your repo.

Features:
- **Dead-simple set-up.** Run *two scripts* and you have a functional chat interface for your code. That's really it.
- **Heavily documented answers.** Every response shows where in the code the context for the answer was pulled from. Let's build trust in the AI.
- **Plug-and-play.** Want to improve the algorithms powering the code understanding/generation? We've made every component of the pipeline easily swappable. Customize to your heart's content.
# How to run it
## Indexing the codebase
We currently support two options for indexing the codebase:

1. **Locally**, using the open-source [Marqo vector store](https://github.com/marqo-ai/marqo). Marqo is both an embedder (you can choose your favorite embedding model from Hugging Face) and a vector store.

    You can bring up a Marqo instance using Docker:
    ```
    docker rm -f marqo
    docker pull marqoai/marqo:latest
    docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
    ```

    Then, to index your codebase, run:
    ```
    pip install -r requirements.txt

    # Replace github-repo-name with your repo, e.g. Storia-AI/repo2vec.
    python src/index.py github-repo-name \
        --embedder_type=marqo \
        --vector_store_type=marqo \
        --index_name=your-index-name
    ```

2. **Using external providers**: OpenAI for embeddings and [Pinecone](https://www.pinecone.io/) for the vector store. To index your codebase, run:
    ```
    pip install -r requirements.txt

    export OPENAI_API_KEY=...
    export PINECONE_API_KEY=...

    # Replace github-repo-name with your repo, e.g. Storia-AI/repo2vec.
    python src/index.py github-repo-name \
        --embedder_type=openai \
        --vector_store_type=pinecone \
        --index_name=your-index-name
    ```

We are planning on adding more providers soon, so that you can mix and match them. Contributions are also welcome!
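Mixing and matching providers boils down to a small factory keyed on the CLI flag values. Here is a minimal sketch of that idea (the class names are hypothetical stand-ins, not the repo's actual ones):

```python
# Hypothetical sketch: map --embedder_type / --vector_store_type flag
# values to pluggable implementations. Names are illustrative only.
class MarqoEmbedder: ...
class OpenAIBatchEmbedder: ...
class MarqoVectorStore: ...
class PineconeVectorStore: ...

EMBEDDERS = {"marqo": MarqoEmbedder, "openai": OpenAIBatchEmbedder}
VECTOR_STORES = {"marqo": MarqoVectorStore, "pinecone": PineconeVectorStore}

def build_pipeline(embedder_type: str, vector_store_type: str):
    """Instantiate the embedder and vector store selected on the CLI."""
    try:
        embedder = EMBEDDERS[embedder_type]()
        store = VECTOR_STORES[vector_store_type]()
    except KeyError as exc:
        raise ValueError(f"Unsupported provider: {exc.args[0]}") from exc
    return embedder, store
```

A registry like this is what makes every component swappable: adding a provider means registering one more class, with no changes to the indexing logic itself.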
## Chatting with the codebase
To bring up a `gradio` app where you can chat with your codebase, simply point it to your vector store:

```
# Replace github-repo-name with your repo, e.g. Storia-AI/repo2vec,
# and set --vector_store_type to marqo or pinecone.
python src/chat.py github-repo-name \
    --vector_store_type=marqo \
    --index_name=your-index-name
```
To get a public URL for your chat app, set `--share=true`.

Currently, the chat uses OpenAI's GPT-4, but we are working on adding support for other providers and local LLMs. Stay tuned!
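Schematically, each chat turn is retrieval-augmented generation: fetch the most relevant chunks from the vector store, then hand them to the LLM together with their GitHub URLs so the answer can cite its sources. A toy sketch with a word-overlap retriever standing in for the vector store (illustrative only, not the repo's actual code):

```python
# Sketch of one retrieval-augmented chat turn. The retriever is a toy
# stand-in; src/chat.py queries the real vector store and calls GPT-4.
def retrieve(query: str, chunks: list[dict], k: int = 2) -> list[dict]:
    """Toy retriever: rank chunks by word overlap with the query."""
    words = set(query.lower().split())
    ranked = sorted(
        chunks, key=lambda c: -len(words & set(c["text"].lower().split()))
    )
    return ranked[:k]

def build_prompt(query: str, retrieved: list[dict]) -> str:
    """Assemble the LLM prompt; each chunk keeps its GitHub URL so the
    response can link back to the exact source file."""
    context = "\n\n".join(f"[{c['url']}]\n{c['text']}" for c in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Because the URLs travel with the chunks all the way into the prompt, surfacing sources in the chat is just a matter of echoing them back.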
 
 
 
# Peeking under the hood

The `src/index.py` script performs the following steps:

1. **Clones the repo.**
    - Make sure to set the `GITHUB_TOKEN` environment variable for private repositories.
2. **Chunks files**. See [Chunker](src/chunker.py).
    - For code files, we implement a special `CodeChunker` that takes the parse tree into account.
3. **Batch-embeds chunks**. See [Embedder](src/embedder.py). We currently support:
    - [Marqo](https://github.com/marqo-ai/marqo) as an embedder, which allows you to specify your favorite Hugging Face embedding model;
    - OpenAI's [batch embedding API](https://platform.openai.com/docs/guides/batch/overview), which is much faster and cheaper than the regular synchronous embedding API.
4. **Stores embeddings in a vector store**. See [VectorStore](src/vector_store.py).
    - We currently support [Marqo](https://github.com/marqo-ai/marqo) and [Pinecone](https://pinecone.io), but you can easily plug in your own.
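The steps above reduce to chunk → embed → store. They can be sketched end to end with in-memory stand-ins (illustrative only; the real pluggable components are the classes linked above):

```python
# Toy end-to-end sketch of the indexing pipeline: chunk -> embed -> store.
# Both the chunker and embedder below are stand-ins for the real classes.
def chunk(text: str, max_words: int = 5) -> list[str]:
    """Naive fixed-size chunker (the real CodeChunker is parse-tree aware)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(chunks: list[str]) -> list[list[float]]:
    """Stand-in embedder: a tiny bag-of-characters vector per chunk."""
    return [[c.count(ch) / max(len(c), 1) for ch in "etaoinshr"] for c in chunks]

class InMemoryVectorStore:
    """Minimal stand-in for the VectorStore interface."""
    def __init__(self):
        self.items = []  # (vector, chunk) pairs

    def upsert(self, vectors, chunks):
        self.items.extend(zip(vectors, chunks))

    def query(self, vector, k=1):
        """Return the k chunks whose vectors are closest to the query."""
        dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
        ranked = sorted(self.items, key=lambda item: dist(item[0], vector))
        return [c for _, c in ranked][:k]
```

Swapping in a real chunker, embedder, or vector store only changes which class sits at each stage; the pipeline shape stays the same.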

Note you can specify an inclusion or exclusion set for the file extensions you want indexed. To specify an extension inclusion set, you can add the `--include` flag.
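What the inclusion/exclusion filtering amounts to can be sketched in a few lines (a hypothetical helper; the actual flag parsing in `src/index.py` may differ):

```python
# Hypothetical sketch of extension-based file filtering; the real
# --include/--exclude handling lives in src/index.py and may differ.
import os

def should_index(path, include=None, exclude=None):
    """Keep a file if its extension passes the inclusion/exclusion sets."""
    ext = os.path.splitext(path)[1]  # e.g. ".py"
    if include is not None:
        return ext in include
    if exclude is not None:
        return ext not in exclude
    return True  # no filter given: index everything
```

The two sets are mutually exclusive by design: an inclusion set is an allowlist, an exclusion set a denylist.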
 
The sources are conveniently surfaced in the chat and linked directly to GitHub.

# Want your repository hosted?

We're working to make all code on the internet searchable and understandable for devs. You can check out our early product, [Code Sage](https://sage.storia.ai). We pre-indexed a slew of OSS repos, and you can index your desired ones by simply pasting a GitHub URL.

If you're the maintainer of an OSS repo and would like a dedicated page on Code Sage (e.g. `sage.storia.ai/your-repo`), then send us a message at [founders@storia.ai](mailto:founders@storia.ai). We'll do it for free!

![](assets/sage.gif)