**Ok, but why chat with a codebase?**

Sometimes you just want to learn how a codebase works and how to integrate it, without spending hours sifting through the code itself.

`repo2vec` is like GitHub Copilot, but with the most up-to-date information about your repo.

Features:
- **Dead-simple set-up.** Run *two scripts* and you have a functional chat interface for your code. That's really it.
- **Heavily documented answers.** Every response shows where in the code the context for the answer was pulled from. Let's build trust in the AI.
- **Plug-and-play.** Want to improve the algorithms powering the code understanding/generation? We've made every component of the pipeline easily swappable. Customize to your heart's content.

# How to run it

## Indexing the codebase

We currently support two options for indexing the codebase:

1. **Locally**, using the open-source [Marqo vector store](https://github.com/marqo-ai/marqo). Marqo is both an embedder (you can choose your favorite embedding model from Hugging Face) and a vector store.

    You can bring up a Marqo instance using Docker:

    ```
    docker rm -f marqo
    docker pull marqoai/marqo:latest
    docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
    ```

    Then, to index your codebase, run:

    ```
    pip install -r requirements.txt

    python src/index.py \
        github-repo-name \ # e.g. Storia-AI/repo2vec
        --embedder_type=marqo \
        --vector_store_type=marqo \
        --index_name=your-index-name
    ```
2. **Using external providers**: OpenAI for embeddings and [Pinecone](https://www.pinecone.io/) for the vector store. To index your codebase, run:

    ```
    pip install -r requirements.txt

    export OPENAI_API_KEY=...
    export PINECONE_API_KEY=...

    python src/index.py \
        github-repo-name \ # e.g. Storia-AI/repo2vec
        --embedder_type=openai \
        --vector_store_type=pinecone \
        --index_name=your-index-name
    ```

We are planning on adding more providers soon, so that you can mix and match them. Contributions are also welcome!
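Whichever option you choose, indexing boils down to the same flow: chunk the files, embed each chunk, and store the vectors under the index name. The sketch below is a toy, pure-Python illustration of that flow; the hash-based embedder and the dict standing in for the vector store are stand-ins for Marqo/OpenAI/Pinecone, not the actual `repo2vec` code.

```python
import hashlib
import math

def chunk(text: str, max_chars: int = 40) -> list[str]:
    """Naive fixed-size chunking; the real Chunker is smarter for code files."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def embed(chunk_text: str, dim: int = 8) -> list[float]:
    """Toy deterministic 'embedding' standing in for a real embedding model."""
    digest = hashlib.sha256(chunk_text.encode()).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

# Stand-in for the vector store: chunk text -> unit-length vector.
index: dict[str, list[float]] = {}

for file_text in ["def add(a, b):\n    return a + b",
                  "README: repo2vec chats with your code"]:
    for piece in chunk(file_text):
        index[piece] = embed(piece)

print(len(index))  # → 2 (both sample files fit in one chunk each)
```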
## Chatting with the codebase

To bring up a `gradio` app where you can chat with your codebase, simply point it to your vector store:

```
python src/chat.py \
    github-repo-name \ # e.g. Storia-AI/repo2vec
    --vector_store_type=marqo \ # or pinecone
    --index_name=your-index-name
```

To get a public URL for your chat app, set `--share=true`.

Currently, the chat uses OpenAI's GPT-4, but we are working on adding support for other providers and local LLMs. Stay tuned!
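For intuition, the chat step is plain retrieval-augmented generation: embed the question, pull the nearest chunks from the vector store, and hand them to the LLM together with their file paths so the answer can cite its sources. The sketch below is illustrative only; the store contents, the keyword-overlap "retriever", and the prompt format are made up for the example, not taken from `src/chat.py`.

```python
# Toy retrieval-augmented prompt assembly. Instead of real vectors we score
# chunks by keyword overlap with the question, which is enough to show the flow.
store = [
    ("src/chunker.py", "class CodeChunker: splits files using the parse tree"),
    ("src/embedder.py", "class OpenAIBatchEmbedder: batch-embeds chunks"),
]

def retrieve(question: str, k: int = 1):
    """Return the k chunks whose words overlap most with the question."""
    def score(chunk: str) -> int:
        return len(set(question.lower().split()) & set(chunk.lower().split()))
    return sorted(store, key=lambda item: score(item[1]), reverse=True)[:k]

def build_prompt(question: str) -> str:
    """Prefix each retrieved chunk with its file path so answers can cite it."""
    context = "\n".join(f"[{path}] {text}" for path, text in retrieve(question))
    return (f"Answer using the context below; cite the bracketed files.\n"
            f"{context}\n\nQ: {question}")

prompt = build_prompt("How does the chunker use the parse tree?")
print(prompt.splitlines()[1])  # → [src/chunker.py] class CodeChunker: splits files using the parse tree
```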
# Peeking under the hood

The `src/index.py` script performs the following steps:

1. **Clones the repository**.
    - Make sure to set the `GITHUB_TOKEN` environment variable for private repositories.
2. **Chunks files**. See [Chunker](src/chunker.py).
    - For code files, we implement a special `CodeChunker` that takes the parse tree into account.
3. **Batch-embeds chunks**. See [Embedder](src/embedder.py). We currently support:
    - [Marqo](https://github.com/marqo-ai/marqo) as an embedder, which allows you to specify your favorite Hugging Face embedding model;
    - OpenAI's [batch embedding API](https://platform.openai.com/docs/guides/batch/overview), which is much faster and cheaper than the regular synchronous embedding API.
4. **Stores embeddings in a vector store**. See [VectorStore](src/vector_store.py).
    - We currently support [Marqo](https://github.com/marqo-ai/marqo) and [Pinecone](https://pinecone.io), but you can easily plug in your own.
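Because every pipeline component is swappable, plugging in your own vector store means implementing a small interface. The exact interface lives in [VectorStore](src/vector_store.py); the sketch below only shows the general shape, with assumed method names (`upsert`/`query`) and a brute-force in-memory implementation for illustration.

```python
import math
from abc import ABC, abstractmethod

class VectorStore(ABC):
    """Shape of a pluggable vector store; method names here are illustrative."""

    @abstractmethod
    def upsert(self, ident: str, vector: list[float]) -> None: ...

    @abstractmethod
    def query(self, vector: list[float], top_k: int) -> list[str]: ...

class InMemoryVectorStore(VectorStore):
    """Brute-force cosine-similarity store, fine for small experiments."""

    def __init__(self):
        self.rows: dict[str, list[float]] = {}

    def upsert(self, ident, vector):
        self.rows[ident] = vector

    def query(self, vector, top_k):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) *
                          math.sqrt(sum(x * x for x in b)))
        ranked = sorted(self.rows,
                        key=lambda ident: cosine(self.rows[ident], vector),
                        reverse=True)
        return ranked[:top_k]

store = InMemoryVectorStore()
store.upsert("chunker.py:1", [1.0, 0.0])
store.upsert("embedder.py:1", [0.0, 1.0])
print(store.query([0.9, 0.1], top_k=1))  # → ['chunker.py:1']
```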
Note you can specify an inclusion or exclusion set for the file extensions you want indexed. To specify an extension inclusion set, you can add the `--include` flag.
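The include/exclude semantics amount to a simple extension filter. Here is a sketch of that logic (illustrative only; the helper name and the actual flag parsing in `src/index.py` may differ):

```python
import os

def should_index(path, include=None, exclude=None):
    """Keep a file if its extension is in the inclusion set (when given)
    and not in the exclusion set (when given)."""
    ext = os.path.splitext(path)[1]  # e.g. ".py"
    if include is not None and ext not in include:
        return False
    if exclude is not None and ext in exclude:
        return False
    return True

print(should_index("src/chat.py", include={".py", ".md"}))     # → True
print(should_index("assets/sage.gif", include={".py", ".md"}))  # → False
```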
# Want your repository hosted?

We're working to make all code on the internet searchable and understandable for devs. You can check out our early product, [Code Sage](https://sage.storia.ai). We pre-indexed a slew of OSS repos, and you can index your desired ones by simply pasting a GitHub URL.

If you're the maintainer of an OSS repo and would like a dedicated page on Code Sage (e.g. `sage.storia.ai/your-repo`), then send us a message at [founders@storia.ai](mailto:founders@storia.ai). We'll do it for free!

