juliaturc committed · Commit 2e96b6e · 1 Parent(s): 5f9eeb4

Update README.

Files changed (1): README.md (+53 -27)
**Ok, but why chat with a codebase?**

Sometimes you just want to learn how a codebase works and how to integrate it, without spending hours sifting through the code itself.

`repo2vec` is like GitHub Copilot but with the most up-to-date information about your repo.

Features:
- **Dead-simple set-up.** Run *two scripts* and you have a functional chat interface for your code. That's really it.
- **Heavily documented answers.** Every response shows where in the code the context for the answer was pulled from. Let's build trust in the AI.
- **Plug-and-play.** Want to improve the algorithms powering the code understanding/generation? We've made every component of the pipeline easily swappable. Customize to your heart's content.
# How to run it
## Indexing the codebase
We currently support two options for indexing the codebase:

1. **Locally**, using the open-source [Marqo vector store](https://github.com/marqo-ai/marqo). Marqo is both an embedder (you can choose your favorite embedding model from Hugging Face) and a vector store.

    You can bring up a Marqo instance using Docker:
    ```
    docker rm -f marqo
    docker pull marqoai/marqo:latest
    docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
    ```

    Then, to index your codebase, run:
    ```
    pip install -r requirements.txt

    # Replace github-repo-name with your repo, e.g. Storia-AI/repo2vec.
    python src/index.py github-repo-name \
        --embedder_type=marqo \
        --vector_store_type=marqo \
        --index_name=your-index-name
    ```

2. **Using external providers**: OpenAI for embeddings and [Pinecone](https://www.pinecone.io/) for the vector store. To index your codebase, run:
    ```
    pip install -r requirements.txt

    export OPENAI_API_KEY=...
    export PINECONE_API_KEY=...

    # Replace github-repo-name with your repo, e.g. Storia-AI/repo2vec.
    python src/index.py github-repo-name \
        --embedder_type=openai \
        --vector_store_type=pinecone \
        --index_name=your-index-name
    ```

We are planning on adding more providers soon, so that you can mix and match them. Contributions are also welcome!
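Mixing and matching providers boils down to a small factory keyed on the CLI flag values. Here is a minimal sketch of that idea (the class names are hypothetical stand-ins, not the repo's actual ones):

```python
# Hypothetical sketch: map --embedder_type / --vector_store_type flag
# values to pluggable implementations. Names are illustrative only.
class MarqoEmbedder: ...
class OpenAIBatchEmbedder: ...
class MarqoVectorStore: ...
class PineconeVectorStore: ...

EMBEDDERS = {"marqo": MarqoEmbedder, "openai": OpenAIBatchEmbedder}
VECTOR_STORES = {"marqo": MarqoVectorStore, "pinecone": PineconeVectorStore}

def build_pipeline(embedder_type: str, vector_store_type: str):
    """Instantiate the embedder and vector store selected on the CLI."""
    try:
        embedder = EMBEDDERS[embedder_type]()
        store = VECTOR_STORES[vector_store_type]()
    except KeyError as exc:
        raise ValueError(f"Unsupported provider: {exc.args[0]}") from exc
    return embedder, store
```

A registry like this is what makes every component swappable: adding a provider means registering one more class, with no changes to the indexing logic itself.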
## Chatting with the codebase
To bring up a `gradio` app where you can chat with your codebase, simply point it to your vector store:

```
# Replace github-repo-name with your repo, e.g. Storia-AI/repo2vec,
# and set --vector_store_type to marqo or pinecone.
python src/chat.py github-repo-name \
    --vector_store_type=marqo \
    --index_name=your-index-name
```
To get a public URL for your chat app, set `--share=true`.

Currently, the chat uses OpenAI's GPT-4, but we are working on adding support for other providers and local LLMs. Stay tuned!
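Schematically, each chat turn is retrieval-augmented generation: fetch the most relevant chunks from the vector store, then hand them to the LLM together with their GitHub URLs so the answer can cite its sources. A toy sketch with a word-overlap retriever standing in for the vector store (illustrative only, not the repo's actual code):

```python
# Sketch of one retrieval-augmented chat turn. The retriever is a toy
# stand-in; src/chat.py queries the real vector store and calls GPT-4.
def retrieve(query: str, chunks: list[dict], k: int = 2) -> list[dict]:
    """Toy retriever: rank chunks by word overlap with the query."""
    words = set(query.lower().split())
    ranked = sorted(
        chunks, key=lambda c: -len(words & set(c["text"].lower().split()))
    )
    return ranked[:k]

def build_prompt(query: str, retrieved: list[dict]) -> str:
    """Assemble the LLM prompt; each chunk keeps its GitHub URL so the
    response can link back to the exact source file."""
    context = "\n\n".join(f"[{c['url']}]\n{c['text']}" for c in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Because the URLs travel with the chunks all the way into the prompt, surfacing sources in the chat is just a matter of echoing them back.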
 
 
 
# Peeking under the hood

The `src/index.py` script performs the following steps:

1. **Clones the repo.**
    - Make sure to set the `GITHUB_TOKEN` environment variable for private repositories.
2. **Chunks files**. See [Chunker](src/chunker.py).
    - For code files, we implement a special `CodeChunker` that takes the parse tree into account.
3. **Batch-embeds chunks**. See [Embedder](src/embedder.py). We currently support:
    - [Marqo](https://github.com/marqo-ai/marqo) as an embedder, which allows you to specify your favorite Hugging Face embedding model;
    - OpenAI's [batch embedding API](https://platform.openai.com/docs/guides/batch/overview), which is much faster and cheaper than the regular synchronous embedding API.
4. **Stores embeddings in a vector store**. See [VectorStore](src/vector_store.py).
    - We currently support [Marqo](https://github.com/marqo-ai/marqo) and [Pinecone](https://pinecone.io), but you can easily plug in your own.
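The steps above reduce to chunk → embed → store. They can be sketched end to end with in-memory stand-ins (illustrative only; the real pluggable components are the classes linked above):

```python
# Toy end-to-end sketch of the indexing pipeline: chunk -> embed -> store.
# Both the chunker and embedder below are stand-ins for the real classes.
def chunk(text: str, max_words: int = 5) -> list[str]:
    """Naive fixed-size chunker (the real CodeChunker is parse-tree aware)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(chunks: list[str]) -> list[list[float]]:
    """Stand-in embedder: a tiny bag-of-characters vector per chunk."""
    return [[c.count(ch) / max(len(c), 1) for ch in "etaoinshr"] for c in chunks]

class InMemoryVectorStore:
    """Minimal stand-in for the VectorStore interface."""
    def __init__(self):
        self.items = []  # (vector, chunk) pairs

    def upsert(self, vectors, chunks):
        self.items.extend(zip(vectors, chunks))

    def query(self, vector, k=1):
        """Return the k chunks whose vectors are closest to the query."""
        dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
        ranked = sorted(self.items, key=lambda item: dist(item[0], vector))
        return [c for _, c in ranked][:k]
```

Swapping in a real chunker, embedder, or vector store only changes which class sits at each stage; the pipeline shape stays the same.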

Note you can specify an inclusion or exclusion set for the file extensions you want indexed. To specify an extension inclusion set, you can add the `--include` flag.
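What the inclusion/exclusion filtering amounts to can be sketched in a few lines (a hypothetical helper; the actual flag parsing in `src/index.py` may differ):

```python
# Hypothetical sketch of extension-based file filtering; the real
# --include/--exclude handling lives in src/index.py and may differ.
import os

def should_index(path, include=None, exclude=None):
    """Keep a file if its extension passes the inclusion/exclusion sets."""
    ext = os.path.splitext(path)[1]  # e.g. ".py"
    if include is not None:
        return ext in include
    if exclude is not None:
        return ext not in exclude
    return True  # no filter given: index everything
```

The two sets are mutually exclusive by design: an inclusion set is an allowlist, an exclusion set a denylist.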
 
The sources are conveniently surfaced in the chat and linked directly to GitHub.

# Want your repository hosted?

We're working to make all code on the internet searchable and understandable for devs. You can check out our early product, [Code Sage](https://sage.storia.ai). We pre-indexed a slew of OSS repos, and you can index your desired ones by simply pasting a GitHub URL.

If you're the maintainer of an OSS repo and would like a dedicated page on Code Sage (e.g. `sage.storia.ai/your-repo`), then send us a message at [founders@storia.ai](mailto:founders@storia.ai). We'll do it for free!

![](assets/sage.gif)