Improve the docs based on stargazers' feedback (#25)
README.md
<div align="center">
  <h1 align="center">repo2vec</h1>
  <p align="center">An open-source pair programmer for chatting with any codebase.</p>
  <figure>
    <img src="assets/chat_screenshot2.png" alt="screenshot" style="max-height: 500px; border: 1px solid black;">
    <figcaption align="center" style="font-size: smaller;">Our chat window, showing a conversation with the Transformers library. 🚀</figcaption>
  </figure>
</div>

# Getting started

## Installation

To install the library, simply run `pip install repo2vec`!
## Prerequisites

`repo2vec` performs two steps:

1. Indexes your codebase (requiring an embedder and a vector store)
2. Enables chatting via LLM + RAG (requiring access to an LLM)
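The two steps above can be sketched with a toy, self-contained example. This is a hedged illustration only: bag-of-words vectors and an in-memory list stand in for the real Marqo/OpenAI embedders and vector stores, and no LLM call is shown.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words term counts (the real pipeline
    # uses Marqo or OpenAI embeddings instead).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Step 1: index the "codebase" (chunk -> embed -> store).
chunks = [
    "def add(a, b): return a + b",
    "def connect(host): open a socket to host",
]
store = [(chunk, embed(chunk)) for chunk in chunks]

# Step 2: answer a query by retrieving the closest chunk; an LLM would
# then generate a response grounded in the retrieved context.
query = embed("how do I connect to a host")
best = max(store, key=lambda item: cosine(query, item[1]))[0]
print(best)
```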
<details open>
<summary><strong>:computer: Running locally</strong></summary>

1. To index the codebase locally, we use the open-source project <a href="https://github.com/marqo-ai/marqo">Marqo</a>, which is both an embedder and a vector store. To bring up a Marqo instance:

    ```
    docker rm -f marqo
    docker pull marqoai/marqo:latest
    docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
    ```

2. To chat with an LLM locally, we use <a href="https://github.com/ollama/ollama">Ollama</a>:

    - Head over to [ollama.com](https://ollama.com) to download the appropriate binary for your machine.
    - Pull the desired model, e.g. `ollama pull llama3.1`.

</details>
<details>
<summary><strong>:cloud: Using external providers</strong></summary>

1. We support <a href="https://openai.com/">OpenAI</a> for embeddings (they have a super fast batch embedding API) and <a href="https://www.pinecone.io/">Pinecone</a> for the vector store. So you will need two API keys:

    ```
    export OPENAI_API_KEY=...
    export PINECONE_API_KEY=...
    ```

2. For chatting with an LLM, we support OpenAI and Anthropic. For the latter, set an additional API key:

    ```
    export ANTHROPIC_API_KEY=...
    ```

</details>
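Before indexing with external providers, it can help to check that the keys above are actually exported. This is a hedged sketch (the `missing_keys` helper is hypothetical, not part of repo2vec); the dummy values below stand in for your real exports.

```python
import os

def missing_keys(required: list[str]) -> list[str]:
    # Return the required environment variables that are unset or empty.
    return [key for key in required if not os.environ.get(key)]

# Stand-ins for the real `export ...` commands shown above.
os.environ.setdefault("OPENAI_API_KEY", "sk-dummy")
os.environ.setdefault("PINECONE_API_KEY", "pc-dummy")

# Add ANTHROPIC_API_KEY to this list if you chat via Anthropic.
print(missing_keys(["OPENAI_API_KEY", "PINECONE_API_KEY"]))  # [] when both are set
```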
<br>

<details>
<summary><strong>Optional</strong></summary>

If you are planning on indexing GitHub issues in addition to the codebase, you will need a GitHub token:

```
export GITHUB_TOKEN=...
```

</details>
## Running it

<details open>
<summary><strong>:computer: Running locally</strong></summary>
<p>To index the codebase:</p>

```
# e.g. Storia-AI/repo2vec
index github-repo-name \
    --embedder-type=marqo \
    --vector-store-type=marqo \
    --index-name=your-index-name
```

<p>To chat with your codebase:</p>

```
chat github-repo-name \
    --vector-store-type=marqo \
    --index-name=your-index-name \
    --llm-provider=ollama \
    --llm-model=llama3.1
```

</details>
<details open>
<summary><strong>:cloud: Using external providers</strong></summary>
<p>To index the codebase:</p>

```
# e.g. Storia-AI/repo2vec
index github-repo-name \
    --embedder-type=openai \
    --vector-store-type=pinecone \
    --index-name=your-index-name
```

<p>To chat with your codebase:</p>

```
chat github-repo-name \
    --vector-store-type=pinecone \
    --index-name=your-index-name \
    --llm-provider=openai \
    --llm-model=gpt-4
```
To get a public URL for your chat app, set `--share=true`.

</details>
## Additional features

- **Control which files get indexed** based on their extension. You can whitelist or blacklist extensions by passing a file with one extension per line (in the format `.ext`):
  - To only index a whitelist of files:
    ```
    index ... --include=/path/to/extensions/file
    ```
  - To index all code except a blacklist of files:
    ```
    index ... --exclude=/path/to/extensions/file
    ```
- **Index open GitHub issues** (remember to `export GITHUB_TOKEN=...`):
  - To index GitHub issues without comments:
    ```
    index ... --index-issues
    ```
  - To index GitHub issues with comments:
    ```
    index ... --index-issues --index-issue-comments
    ```
  - To index GitHub issues, but not the codebase:
    ```
    index ... --index-issues --no-index-repo
    ```
# Why chat with a codebase?

Sometimes you just want to learn how a codebase works and how to integrate it, without spending hours sifting through the code itself.

`repo2vec` is like an open-source GitHub Copilot with the most up-to-date information about your repo.

Features:
- **Dead-simple set-up.** Run *two scripts* and you have a functional chat interface for your code. That's really it.
- **Heavily documented answers.** Every response shows where in the code the context for the answer was pulled from. Let's build trust in the AI.
- **Runs locally or on the cloud.**
- **Plug-and-play.** Want to improve the algorithms powering the code understanding/generation? We've made every component of the pipeline easily swappable. Google-grade engineering standards allow you to customize to your heart's content.
# Changelog

- 2024-09-03: `repo2vec` is now available on PyPI.
- 2024-09-03: Support for indexing GitHub issues.
- 2024-08-30: Support for running everything locally (Marqo for embeddings, Ollama for LLMs).
# Extensions & Contributions

We purposely built the code to be modular, so you can plug in your preferred embedding, LLM, and vector store providers by simply implementing the relevant abstract classes.
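As an illustration of the plug-in pattern, a custom vector store might subclass a minimal abstract interface like the one below. The class and method names here are hypothetical, chosen for this sketch; match the repo's actual abstract classes for the real signatures.

```python
import abc
import math

class VectorStore(abc.ABC):
    # Hypothetical interface for illustration only.
    @abc.abstractmethod
    def upsert(self, item_id: str, vector: list[float]) -> None: ...

    @abc.abstractmethod
    def query(self, vector: list[float], top_k: int = 5) -> list[str]: ...

class InMemoryStore(VectorStore):
    """Toy drop-in store backed by a dict, ranked by cosine similarity."""

    def __init__(self) -> None:
        self._items: dict[str, list[float]] = {}

    def upsert(self, item_id: str, vector: list[float]) -> None:
        self._items[item_id] = vector

    def query(self, vector: list[float], top_k: int = 5) -> list[str]:
        def cos(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0

        ranked = sorted(self._items, key=lambda k: cos(vector, self._items[k]), reverse=True)
        return ranked[:top_k]

store = InMemoryStore()
store.upsert("chunk-1", [1.0, 0.0])
store.upsert("chunk-2", [0.0, 1.0])
print(store.query([0.9, 0.1], top_k=1))  # ['chunk-1']
```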
Feel free to send feature requests to [founders@storia.ai](mailto:founders@storia.ai) or make a pull request!