juliaturc committed
Commit 59fd872 · 1 Parent(s): 6f4d334

Make repo2vec a proper Python library (#22)

.gitignore CHANGED
@@ -1,4 +1,6 @@
 .env
 __pycache__
 *.cpython.*
-repos/
+build/
+repos/
+repo2vec.egg-info/
MANIFEST.in ADDED
@@ -0,0 +1 @@
+include repo2vec/sample-exclude.txt
README.md CHANGED
@@ -20,6 +20,10 @@ Features:
 - **Plug-and-play.** Want to improve the algorithms powering the code understanding/generation? We've made every component of the pipeline easily swappable. Google-grade engineering standards allow you to customize to your heart's content.
 
 # How to run it
+
+## Installation
+To install the library, simply run `pip install repo2vec`.
+
 ## Indexing the codebase
 We currently support two options for indexing the codebase:
 
@@ -34,10 +38,7 @@ We currently support two options for indexing the codebase:
 
 Then, to index your codebase, run:
 ```
-pip install -r requirements.txt
-
-python src/index.py
-github-repo-name \ # e.g. Storia-AI/repo2vec
+index github-repo-name \ # e.g. Storia-AI/repo2vec
 --embedder-type=marqo \
 --vector-store-type=marqo \
 --index-name=your-index-name
@@ -45,13 +46,10 @@ We currently support two options for indexing the codebase:
 
 2. **Using external providers** (OpenAI for embeddings and [Pinecone](https://www.pinecone.io/) for the vector store). To index your codebase, run:
 ```
-pip install -r requirements.txt
-
 export OPENAI_API_KEY=...
 export PINECONE_API_KEY=...
 
-python src/index.py
-github-repo-name \ # e.g. Storia-AI/repo2vec
+index github-repo-name \ # e.g. Storia-AI/repo2vec
 --embedder-type=openai \
 --vector-store-type=pinecone \
 --index-name=your-index-name
@@ -59,7 +57,7 @@ We currently support two options for indexing the codebase:
 We are planning on adding more providers soon, so that you can mix and match them. Contributions are also welcome!
 
 ## Indexing GitHub Issues
-By default, we also index the open GitHub issues associated with a codebase. You can control what gets index with the `--index-repo` and `--index-issues` flags (and their converse `--no-index-repo` and `--no-index-issues`).
+You can additionally index GitHub issues by setting the `--index-issues` flag. Conversely, you can turn off indexing the code (and solely index issues) by passing `--no-index-repo`.
 
 ## Chatting with the codebase
 We provide a `gradio` app where you can chat with your codebase. You can use either a local LLM (via [Ollama](https://ollama.com)), or a cloud provider like OpenAI or Anthropic.
@@ -69,8 +67,7 @@ To chat with a local LLM:
 2. Pull the desired model, e.g. `ollama pull llama3.1`.
 3. Start the `gradio` app:
 ```
-python src/chat.py \
-github-repo-name \ # e.g. Storia-AI/repo2vec
+chat github-repo-name \ # e.g. Storia-AI/repo2vec
 --llm-provider=ollama
 --llm-model=llama3.1
 --vector-store-type=marqo \ # or pinecone
@@ -81,8 +78,7 @@ To chat with a cloud-based LLM, for instance Anthropic's Claude:
 ```
 export ANTHROPIC_API_KEY=...
 
-python src/chat.py \
-github-repo-name \ # e.g. Storia-AI/repo2vec
+chat github-repo-name \ # e.g. Storia-AI/repo2vec
 --llm-provider=anthropic \
 --llm-model=claude-3-opus-20240229 \
 --vector-store-type=marqo \ # or pinecone
@@ -93,29 +89,29 @@ To get a public URL for your chat app, set `--share=true`.
 # Peeking under the hood
 
 ## Indexing the repo
-The `src/index.py` script performs the following steps:
-1. **Clones a GitHub repository**. See [RepoManager](src/repo_manager.py).
+The `repo2vec/index.py` script performs the following steps:
+1. **Clones a GitHub repository**. See [RepoManager](repo2vec/repo_manager.py).
 - Make sure to set the `GITHUB_TOKEN` environment variable for private repositories.
-2. **Chunks files**. See [Chunker](src/chunker.py).
+2. **Chunks files**. See [Chunker](repo2vec/chunker.py).
 - For code files, we implement a special `CodeChunker` that takes the parse tree into account.
-3. **Batch-embeds chunks**. See [Embedder](src/embedder.py). We currently support:
+3. **Batch-embeds chunks**. See [Embedder](repo2vec/embedder.py). We currently support:
 - [Marqo](https://github.com/marqo-ai/marqo) as an embedder, which allows you to specify your favorite Hugging Face embedding model, and
 - OpenAI's [batch embedding API](https://platform.openai.com/docs/guides/batch/overview), which is much faster and cheaper than the regular synchronous embedding API.
-4. **Stores embeddings in a vector store**. See [VectorStore](src/vector_store.py).
+4. **Stores embeddings in a vector store**. See [VectorStore](repo2vec/vector_store.py).
 - We currently support [Marqo](https://github.com/marqo-ai/marqo) and [Pinecone](https://pinecone.io), but you can easily plug in your own.
 
 Note you can specify an inclusion or exclusion set for the file extensions you want indexed. To specify an extension inclusion set, you can add the `--include` flag:
 ```
-python src/index.py repo-org/repo-name --include=/path/to/file/with/extensions
+index repo-org/repo-name --include=/path/to/file/with/extensions
 ```
 Conversely, to specify an extension exclusion set, you can add the `--exclude` flag:
 ```
-python src/index.py repo-org/repo-name --exclude=src/sample-exclude.txt
+index repo-org/repo-name --exclude=repo2vec/sample-exclude.txt
 ```
 Extensions must be specified one per line, in the form `.ext`.
 
 ## Chatting via RAG
-The `src/chat.py` brings up a [Gradio app](https://www.gradio.app/) with a chat interface as shown above. We use [LangChain](https://langchain.com) to define a RAG chain which, given a user query about the repository:
+The `repo2vec/chat.py` brings up a [Gradio app](https://www.gradio.app/) with a chat interface as shown above. We use [LangChain](https://langchain.com) to define a RAG chain which, given a user query about the repository:
 
 1. Rewrites the query to be self-contained based on previous queries
 2. Embeds the rewritten query using OpenAI embeddings
@@ -125,6 +121,7 @@ The `src/chat.py` brings up a [Gradio app](https://www.gradio.app/) with a chat
 The sources are conveniently surfaced in the chat and linked directly to GitHub.
 
 # Changelog
+- 2024-09-03: `repo2vec` is now available on pypi.
 - 2024-09-03: Support for indexing GitHub issues.
 - 2024-08-30: Support for running everything locally (Marqo for embeddings, Ollama for LLMs).
 
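The `--include`/`--exclude` files in the README hunk above hold one extension per line, in the form `.ext`. A minimal sketch of how such a file could be parsed and applied — the helper names here are illustrative, not part of this commit:

```python
from pathlib import Path


def load_extension_set(path):
    # One extension per line, in the form `.ext`; blank lines and
    # surrounding whitespace are ignored.
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def should_index(filename, include=None, exclude=None):
    # An inclusion set wins if given; otherwise the exclusion set
    # filters extensions out; with neither, everything is indexed.
    ext = Path(filename).suffix
    if include is not None:
        return ext in include
    if exclude is not None:
        return ext not in exclude
    return True
```

The library's own parsing may differ in detail; this only demonstrates the file format the README describes.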
pyproject.toml CHANGED
@@ -1,2 +1,6 @@
+[build-system]
+requires = ["setuptools>=42", "wheel"]
+build-backend = "setuptools.build_meta"
+
 [tool.black]
 line-length = 120
{src → repo2vec}/.sample-env RENAMED
File without changes
{src → repo2vec}/__init__.py RENAMED
File without changes
{src → repo2vec}/chat.py RENAMED
@@ -12,8 +12,8 @@ from langchain.chains.combine_documents import create_stuff_documents_chain
 from langchain.schema import AIMessage, HumanMessage
 from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
 
-import vector_store
-from llm import build_llm_via_langchain
+import repo2vec.vector_store as vector_store
+from repo2vec.llm import build_llm_via_langchain
 
 load_dotenv()
 
@@ -67,7 +67,7 @@ def append_sources_to_response(response):
     return response["answer"] + "\n\nSources:\n" + "\n".join(urls)
 
 
-if __name__ == "__main__":
+def main():
     parser = argparse.ArgumentParser(description="UI to chat with your codebase")
     parser.add_argument("repo_id", help="The ID of the repository to index")
     parser.add_argument("--llm-provider", default="anthropic", choices=["openai", "anthropic", "ollama"])
@@ -116,3 +116,6 @@ if __name__ == "__main__":
         description=f"Code sage for your repo: {args.repo_id}",
         examples=["What does this repo do?", "Give me some sample code."],
     ).launch(share=args.share)
+
+if __name__ == "__main__":
+    main()
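The `main()` refactor above is what makes the `chat` console script possible: a setuptools entry point must be an importable zero-argument callable, so the body of the old `if __name__ == "__main__":` block moves into a function. A minimal, self-contained sketch of the pattern (the fixed argument list is purely for illustration):

```python
import argparse


def main():
    # All CLI logic lives in a plain function so setuptools can generate
    # an executable that simply imports and calls it.
    parser = argparse.ArgumentParser(description="demo entry point")
    parser.add_argument("repo_id", help="The ID of the repository to chat with")
    # A real entry point would call parse_args() with no arguments;
    # we pass a fixed list here so the sketch runs standalone.
    args = parser.parse_args(["Storia-AI/repo2vec"])
    return args.repo_id


if __name__ == "__main__":
    main()
```

The module stays runnable directly (`python chat.py ...`) via the `__main__` guard, while `chat=repo2vec.chat:main` in `setup.py` exposes the same function as a command.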
{src → repo2vec}/chunker.py RENAMED
File without changes
{src → repo2vec}/data_manager.py RENAMED
File without changes
{src → repo2vec}/embedder.py RENAMED
@@ -10,8 +10,8 @@ from typing import Dict, Generator, List, Optional, Tuple
 import marqo
 from openai import OpenAI
 
-from chunker import Chunk, Chunker
-from data_manager import DataManager
+from repo2vec.chunker import Chunk, Chunker
+from repo2vec.data_manager import DataManager
 
 Vector = Tuple[Dict, List[float]]  # (metadata, embedding)
 
{src → repo2vec}/github.py RENAMED
@@ -8,8 +8,8 @@ import logging
 import requests
 import tiktoken
 
-from chunker import Chunk, Chunker
-from data_manager import DataManager
+from repo2vec.chunker import Chunk, Chunker
+from repo2vec.data_manager import DataManager
 
 tokenizer = tiktoken.get_encoding("cl100k_base")
 
{src → repo2vec}/index.py RENAMED
@@ -2,15 +2,19 @@
 
 import argparse
 import logging
+import os
+import pkg_resources
 import time
 
-from chunker import UniversalFileChunker
-from data_manager import GitHubRepoManager
-from embedder import build_batch_embedder_from_flags
-from github import GitHubIssuesChunker, GitHubIssuesManager
-from vector_store import build_from_args
+from repo2vec.chunker import UniversalFileChunker
+from repo2vec.data_manager import GitHubRepoManager
+from repo2vec.embedder import build_batch_embedder_from_flags
+from repo2vec.github import GitHubIssuesChunker, GitHubIssuesManager
+from repo2vec.vector_store import build_from_args
 
 logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger()
+logger.setLevel(logging.INFO)
 
 MAX_TOKENS_PER_CHUNK = 8192  # The ADA embedder from OpenAI has a maximum of 8192 tokens.
 MAX_CHUNKS_PER_BATCH = 2048  # The OpenAI batch embedding API enforces a maximum of 2048 chunks per batch.
@@ -77,7 +81,7 @@ def main():
     )
     parser.add_argument(
         "--exclude",
-        default="src/sample-exclude.txt",
+        default=pkg_resources.resource_filename(__name__, "sample-exclude.txt"),
         help="Path to a file containing a list of extensions to exclude. One extension per line.",
     )
     parser.add_argument(
@@ -102,8 +106,9 @@ def main():
     parser.add_argument(
         "--index-issues",
         action=argparse.BooleanOptionalAction,
-        default=True,
-        help="Whether to index GitHub issues. At least one of --index-repo and --index-issues must be True.",
+        default=False,
+        help="Whether to index GitHub issues. At least one of --index-repo and --index-issues must be True. When "
+        "--index-issues is set, you must also set a GITHUB_TOKEN environment variable.",
     )
     args = parser.parse_args()
 
@@ -134,6 +139,14 @@ def main():
     if args.embedding_size is None and args.embedder_type == "openai":
         args.embedding_size = OPENAI_DEFAULT_EMBEDDING_SIZE.get(args.embedding_model)
 
+    # Fail early on missing environment variables.
+    if args.embedder_type == "openai" and not os.getenv("OPENAI_API_KEY"):
+        parser.error("Please set the OPENAI_API_KEY environment variable.")
+    if args.vector_store_type == "pinecone" and not os.getenv("PINECONE_API_KEY"):
+        parser.error("Please set the PINECONE_API_KEY environment variable.")
+    if args.index_issues and not os.getenv("GITHUB_TOKEN"):
+        parser.error("Please set the GITHUB_TOKEN environment variable.")
+
     ######################
     # Step 1: Embeddings #
     ######################
@@ -159,7 +172,6 @@ def main():
 
     # Index the GitHub issues.
     issues_embedder = None
-    assert args.index_issues is True
     if args.index_issues:
         logging.info("Issuing embedding jobs for GitHub issues...")
         issues_manager = GitHubIssuesManager(args.repo_id)
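The fail-early checks added above follow one pattern: validate required environment variables before any indexing work starts, routing the failure through `parser.error` so the user sees a normal CLI error instead of a mid-run stack trace. A self-contained sketch of that pattern (the variable names here are illustrative):

```python
import argparse
import os


def require_env(parser, var):
    # parser.error prints the message to stderr and exits with status 2,
    # so a missing variable aborts before any expensive work begins.
    if not os.getenv(var):
        parser.error(f"Please set the {var} environment variable.")
```

In `index.py` the same idea guards `OPENAI_API_KEY`, `PINECONE_API_KEY`, and `GITHUB_TOKEN`, each conditioned on the flags that actually need them.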
{src → repo2vec}/llm.py RENAMED
File without changes
{src → repo2vec}/sample-exclude.txt RENAMED
@@ -5,6 +5,7 @@
 .bmp
 .crt
 .css
+.csv
 .dat
 .db
 .duckdb
{src → repo2vec}/vector_store.py RENAMED
File without changes
setup.py ADDED
@@ -0,0 +1,34 @@
+from setuptools import setup, find_packages
+
+def readfile(filename):
+    with open(filename, 'r+') as f:
+        return f.read()
+
+setup(
+    name="repo2vec",
+    version="0.1.2",
+    packages=find_packages(),
+    include_package_data=True,
+    package_data={
+        "repo2vec": ["sample-exclude.txt"],
+    },
+    install_requires=open("requirements.txt").readlines() + ["setuptools"],
+    entry_points={
+        "console_scripts": [
+            "index=repo2vec.index:main",
+            "chat=repo2vec.chat:main",
+        ],
+    },
+    author="Julia Turc & Mihail Eric / Storia AI",
+    author_email="founders@storia.ai",
+    description="A library to index a code repository and chat with it via LLMs.",
+    long_description=open("README.md").read(),
+    long_description_content_type="text/markdown",
+    url="https://github.com/Storia-AI/repo2vec",
+    classifiers=[
+        "Programming Language :: Python :: 3",
+        "License :: OSI Approved :: MIT License",
+        "Operating System :: OS Independent",
+    ],
+    python_requires='>=3.9',
+)
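One small wrinkle in the new `setup.py`: `readfile` opens its file in `'r+'` (read/write) mode and then goes unused, while `long_description` re-opens `README.md` directly with a bare `open(...).read()`. A hedged cleanup sketch, should the helper be reused:

```python
def readfile(filename):
    # Read-only mode: 'r+' would fail on read-only filesystems (e.g. some
    # build sandboxes) and permits accidental writes. The context manager
    # also guarantees the handle is closed, unlike a bare open(...).read().
    with open(filename, "r", encoding="utf-8") as f:
        return f.read()
```

With this, `long_description=readfile("README.md")` would replace the inline `open` call; the commit's behavior is otherwise unchanged.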