juliaturc committed
Commit 59fd872 · 1 Parent(s): 6f4d334

Make repo2vec a proper Python library (#22)

.gitignore CHANGED
@@ -1,4 +1,6 @@
 .env
 __pycache__
 *.cpython.*
-repos/
+build/
+repos/
+repo2vec.egg-info/
MANIFEST.in ADDED
@@ -0,0 +1 @@
+include repo2vec/sample-exclude.txt
README.md CHANGED
@@ -20,6 +20,10 @@ Features:
 - **Plug-and-play.** Want to improve the algorithms powering the code understanding/generation? We've made every component of the pipeline easily swappable. Google-grade engineering standards allow you to customize to your heart's content.
 
 # How to run it
+
+## Installation
+To install the library, simply run `pip install repo2vec`.
+
 ## Indexing the codebase
 We currently support two options for indexing the codebase:
 
@@ -34,10 +38,7 @@ We currently support two options for indexing the codebase:
 
 Then, to index your codebase, run:
 ```
-pip install -r requirements.txt
-
-python src/index.py
-github-repo-name \ # e.g. Storia-AI/repo2vec
+index github-repo-name \ # e.g. Storia-AI/repo2vec
 --embedder-type=marqo \
 --vector-store-type=marqo \
 --index-name=your-index-name
@@ -45,13 +46,10 @@ We currently support two options for indexing the codebase:
 
 2. **Using external providers** (OpenAI for embeddings and [Pinecone](https://www.pinecone.io/) for the vector store). To index your codebase, run:
 ```
-pip install -r requirements.txt
-
 export OPENAI_API_KEY=...
 export PINECONE_API_KEY=...
 
-python src/index.py
-github-repo-name \ # e.g. Storia-AI/repo2vec
+index github-repo-name \ # e.g. Storia-AI/repo2vec
 --embedder-type=openai \
 --vector-store-type=pinecone \
 --index-name=your-index-name
@@ -59,7 +57,7 @@ We currently support two options for indexing the codebase:
 We are planning on adding more providers soon, so that you can mix and match them. Contributions are also welcome!
 
 ## Indexing GitHub Issues
-By default, we also index the open GitHub issues associated with a codebase. You can control what gets index with the `--index-repo` and `--index-issues` flags (and their converse `--no-index-repo` and `--no-index-issues`).
+You can additionally index GitHub issues by setting the `--index-issues` flag. Conversely, you can turn off indexing the code (and solely index issues) by passing `--no-index-repo`.
 
 ## Chatting with the codebase
 We provide a `gradio` app where you can chat with your codebase. You can use either a local LLM (via [Ollama](https://ollama.com)), or a cloud provider like OpenAI or Anthropic.
@@ -69,8 +67,7 @@ To chat with a local LLM:
 2. Pull the desired model, e.g. `ollama pull llama3.1`.
 3. Start the `gradio` app:
 ```
-python src/chat.py \
-github-repo-name \ # e.g. Storia-AI/repo2vec
+chat github-repo-name \ # e.g. Storia-AI/repo2vec
 --llm-provider=ollama
 --llm-model=llama3.1
 --vector-store-type=marqo \ # or pinecone
@@ -81,8 +78,7 @@ To chat with a cloud-based LLM, for instance Anthropic's Claude:
 ```
 export ANTHROPIC_API_KEY=...
 
-python src/chat.py \
-github-repo-name \ # e.g. Storia-AI/repo2vec
+chat github-repo-name \ # e.g. Storia-AI/repo2vec
 --llm-provider=anthropic \
 --llm-model=claude-3-opus-20240229 \
 --vector-store-type=marqo \ # or pinecone
@@ -93,29 +89,29 @@ To get a public URL for your chat app, set `--share=true`.
 # Peeking under the hood
 
 ## Indexing the repo
-The `src/index.py` script performs the following steps:
-1. **Clones a GitHub repository**. See [RepoManager](src/repo_manager.py).
+The `repo2vec/index.py` script performs the following steps:
+1. **Clones a GitHub repository**. See [RepoManager](repo2vec/repo_manager.py).
 - Make sure to set the `GITHUB_TOKEN` environment variable for private repositories.
-2. **Chunks files**. See [Chunker](src/chunker.py).
+2. **Chunks files**. See [Chunker](repo2vec/chunker.py).
 - For code files, we implement a special `CodeChunker` that takes the parse tree into account.
-3. **Batch-embeds chunks**. See [Embedder](src/embedder.py). We currently support:
+3. **Batch-embeds chunks**. See [Embedder](repo2vec/embedder.py). We currently support:
 - [Marqo](https://github.com/marqo-ai/marqo) as an embedder, which allows you to specify your favorite Hugging Face embedding model, and
 - OpenAI's [batch embedding API](https://platform.openai.com/docs/guides/batch/overview), which is much faster and cheaper than the regular synchronous embedding API.
-4. **Stores embeddings in a vector store**. See [VectorStore](src/vector_store.py).
+4. **Stores embeddings in a vector store**. See [VectorStore](repo2vec/vector_store.py).
 - We currently support [Marqo](https://github.com/marqo-ai/marqo) and [Pinecone](https://pinecone.io), but you can easily plug in your own.
 
 Note you can specify an inclusion or exclusion set for the file extensions you want indexed. To specify an extension inclusion set, you can add the `--include` flag:
 ```
-python src/index.py repo-org/repo-name --include=/path/to/file/with/extensions
+index repo-org/repo-name --include=/path/to/file/with/extensions
 ```
 Conversely, to specify an extension exclusion set, you can add the `--exclude` flag:
 ```
-python src/index.py repo-org/repo-name --exclude=src/sample-exclude.txt
+index repo-org/repo-name --exclude=repo2vec/sample-exclude.txt
 ```
 Extensions must be specified one per line, in the form `.ext`.
 
 ## Chatting via RAG
-The `src/chat.py` brings up a [Gradio app](https://www.gradio.app/) with a chat interface as shown above. We use [LangChain](https://langchain.com) to define a RAG chain which, given a user query about the repository:
+The `repo2vec/chat.py` brings up a [Gradio app](https://www.gradio.app/) with a chat interface as shown above. We use [LangChain](https://langchain.com) to define a RAG chain which, given a user query about the repository:
 
 1. Rewrites the query to be self-contained based on previous queries
 2. Embeds the rewritten query using OpenAI embeddings
@@ -125,6 +121,7 @@ The `src/chat.py` brings up a [Gradio app](https://www.gradio.app/) with a chat
 The sources are conveniently surfaced in the chat and linked directly to GitHub.
 
 # Changelog
+- 2024-09-03: `repo2vec` is now available on pypi.
 - 2024-09-03: Support for indexing GitHub issues.
 - 2024-08-30: Support for running everything locally (Marqo for embeddings, Ollama for LLMs).
 
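The `--include`/`--exclude` files in the README hunk above hold one extension per line, in the form `.ext`. A minimal sketch of how such a file could be parsed and applied — the helper names here are illustrative, not part of this commit:

```python
from pathlib import Path


def load_extension_set(path):
    # One extension per line, in the form `.ext`; blank lines and
    # surrounding whitespace are ignored.
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def should_index(filename, include=None, exclude=None):
    # An inclusion set wins if given; otherwise the exclusion set
    # filters extensions out; with neither, everything is indexed.
    ext = Path(filename).suffix
    if include is not None:
        return ext in include
    if exclude is not None:
        return ext not in exclude
    return True
```

The library's own parsing may differ in detail; this only demonstrates the file format the README describes.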
pyproject.toml CHANGED
@@ -1,2 +1,6 @@
+[build-system]
+requires = ["setuptools>=42", "wheel"]
+build-backend = "setuptools.build_meta"
+
 [tool.black]
 line-length = 120
{src → repo2vec}/.sample-env RENAMED
File without changes
{src → repo2vec}/__init__.py RENAMED
File without changes
{src → repo2vec}/chat.py RENAMED
@@ -12,8 +12,8 @@ from langchain.chains.combine_documents import create_stuff_documents_chain
 from langchain.schema import AIMessage, HumanMessage
 from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
 
-import vector_store
-from llm import build_llm_via_langchain
+import repo2vec.vector_store as vector_store
+from repo2vec.llm import build_llm_via_langchain
 
 load_dotenv()
 
@@ -67,7 +67,7 @@ def append_sources_to_response(response):
     return response["answer"] + "\n\nSources:\n" + "\n".join(urls)
 
 
-if __name__ == "__main__":
+def main():
     parser = argparse.ArgumentParser(description="UI to chat with your codebase")
     parser.add_argument("repo_id", help="The ID of the repository to index")
     parser.add_argument("--llm-provider", default="anthropic", choices=["openai", "anthropic", "ollama"])
@@ -116,3 +116,6 @@ if __name__ == "__main__":
         description=f"Code sage for your repo: {args.repo_id}",
         examples=["What does this repo do?", "Give me some sample code."],
     ).launch(share=args.share)
+
+if __name__ == "__main__":
+    main()
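The `main()` refactor above is what makes the `chat` console script possible: a setuptools entry point must be an importable zero-argument callable, so the body of the old `if __name__ == "__main__":` block moves into a function. A minimal, self-contained sketch of the pattern (the fixed argument list is purely for illustration):

```python
import argparse


def main():
    # All CLI logic lives in a plain function so setuptools can generate
    # an executable that simply imports and calls it.
    parser = argparse.ArgumentParser(description="demo entry point")
    parser.add_argument("repo_id", help="The ID of the repository to chat with")
    # A real entry point would call parse_args() with no arguments;
    # we pass a fixed list here so the sketch runs standalone.
    args = parser.parse_args(["Storia-AI/repo2vec"])
    return args.repo_id


if __name__ == "__main__":
    main()
```

The module stays runnable directly (`python chat.py ...`) via the `__main__` guard, while `chat=repo2vec.chat:main` in `setup.py` exposes the same function as a command.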
{src → repo2vec}/chunker.py RENAMED
File without changes
{src → repo2vec}/data_manager.py RENAMED
File without changes
{src → repo2vec}/embedder.py RENAMED
@@ -10,8 +10,8 @@ from typing import Dict, Generator, List, Optional, Tuple
 import marqo
 from openai import OpenAI
 
-from chunker import Chunk, Chunker
-from data_manager import DataManager
+from repo2vec.chunker import Chunk, Chunker
+from repo2vec.data_manager import DataManager
 
 Vector = Tuple[Dict, List[float]]  # (metadata, embedding)
 
{src → repo2vec}/github.py RENAMED
@@ -8,8 +8,8 @@ import logging
 import requests
 import tiktoken
 
-from chunker import Chunk, Chunker
-from data_manager import DataManager
+from repo2vec.chunker import Chunk, Chunker
+from repo2vec.data_manager import DataManager
 
 tokenizer = tiktoken.get_encoding("cl100k_base")
 
{src → repo2vec}/index.py RENAMED
@@ -2,15 +2,19 @@
 
 import argparse
 import logging
+import os
+import pkg_resources
 import time
 
-from chunker import UniversalFileChunker
-from data_manager import GitHubRepoManager
-from embedder import build_batch_embedder_from_flags
-from github import GitHubIssuesChunker, GitHubIssuesManager
-from vector_store import build_from_args
+from repo2vec.chunker import UniversalFileChunker
+from repo2vec.data_manager import GitHubRepoManager
+from repo2vec.embedder import build_batch_embedder_from_flags
+from repo2vec.github import GitHubIssuesChunker, GitHubIssuesManager
+from repo2vec.vector_store import build_from_args
 
 logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger()
+logger.setLevel(logging.INFO)
 
 MAX_TOKENS_PER_CHUNK = 8192  # The ADA embedder from OpenAI has a maximum of 8192 tokens.
 MAX_CHUNKS_PER_BATCH = 2048  # The OpenAI batch embedding API enforces a maximum of 2048 chunks per batch.
@@ -77,7 +81,7 @@ def main():
     )
     parser.add_argument(
         "--exclude",
-        default="src/sample-exclude.txt",
+        default=pkg_resources.resource_filename(__name__, "sample-exclude.txt"),
         help="Path to a file containing a list of extensions to exclude. One extension per line.",
     )
     parser.add_argument(
@@ -102,8 +106,9 @@ def main():
     parser.add_argument(
         "--index-issues",
         action=argparse.BooleanOptionalAction,
-        default=True,
-        help="Whether to index GitHub issues. At least one of --index-repo and --index-issues must be True.",
+        default=False,
+        help="Whether to index GitHub issues. At least one of --index-repo and --index-issues must be True. When "
+        "--index-issues is set, you must also set a GITHUB_TOKEN environment variable.",
     )
     args = parser.parse_args()
 
@@ -134,6 +139,14 @@ def main():
     if args.embedding_size is None and args.embedder_type == "openai":
         args.embedding_size = OPENAI_DEFAULT_EMBEDDING_SIZE.get(args.embedding_model)
 
+    # Fail early on missing environment variables.
+    if args.embedder_type == "openai" and not os.getenv("OPENAI_API_KEY"):
+        parser.error("Please set the OPENAI_API_KEY environment variable.")
+    if args.vector_store_type == "pinecone" and not os.getenv("PINECONE_API_KEY"):
+        parser.error("Please set the PINECONE_API_KEY environment variable.")
+    if args.index_issues and not os.getenv("GITHUB_TOKEN"):
+        parser.error("Please set the GITHUB_TOKEN environment variable.")
+
     ######################
     # Step 1: Embeddings #
     ######################
@@ -159,7 +172,6 @@ def main():
 
     # Index the GitHub issues.
     issues_embedder = None
-    assert args.index_issues is True
     if args.index_issues:
         logging.info("Issuing embedding jobs for GitHub issues...")
         issues_manager = GitHubIssuesManager(args.repo_id)
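The fail-early checks added above follow one pattern: validate required environment variables before any indexing work starts, routing the failure through `parser.error` so the user sees a normal CLI error instead of a mid-run stack trace. A self-contained sketch of that pattern (the variable names here are illustrative):

```python
import argparse
import os


def require_env(parser, var):
    # parser.error prints the message to stderr and exits with status 2,
    # so a missing variable aborts before any expensive work begins.
    if not os.getenv(var):
        parser.error(f"Please set the {var} environment variable.")
```

In `index.py` the same idea guards `OPENAI_API_KEY`, `PINECONE_API_KEY`, and `GITHUB_TOKEN`, each conditioned on the flags that actually need them.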
{src → repo2vec}/llm.py RENAMED
File without changes
{src → repo2vec}/sample-exclude.txt RENAMED
@@ -5,6 +5,7 @@
 .bmp
 .crt
 .css
+.csv
 .dat
 .db
 .duckdb
{src → repo2vec}/vector_store.py RENAMED
File without changes
setup.py ADDED
@@ -0,0 +1,34 @@
+from setuptools import setup, find_packages
+
+def readfile(filename):
+    with open(filename, 'r+') as f:
+        return f.read()
+
+setup(
+    name="repo2vec",
+    version="0.1.2",
+    packages=find_packages(),
+    include_package_data=True,
+    package_data={
+        "repo2vec": ["sample-exclude.txt"],
+    },
+    install_requires=open("requirements.txt").readlines() + ["setuptools"],
+    entry_points={
+        "console_scripts": [
+            "index=repo2vec.index:main",
+            "chat=repo2vec.chat:main",
+        ],
+    },
+    author="Julia Turc & Mihail Eric / Storia AI",
+    author_email="founders@storia.ai",
+    description="A library to index a code repository and chat with it via LLMs.",
+    long_description=open("README.md").read(),
+    long_description_content_type="text/markdown",
+    url="https://github.com/Storia-AI/repo2vec",
+    classifiers=[
+        "Programming Language :: Python :: 3",
+        "License :: OSI Approved :: MIT License",
+        "Operating System :: OS Independent",
+    ],
+    python_requires='>=3.9',
+)
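One small wrinkle in the new `setup.py`: `readfile` opens its file in `'r+'` (read/write) mode and then goes unused, while `long_description` re-opens `README.md` directly with a bare `open(...).read()`. A hedged cleanup sketch, should the helper be reused:

```python
def readfile(filename):
    # Read-only mode: 'r+' would fail on read-only filesystems (e.g. some
    # build sandboxes) and permits accidental writes. The context manager
    # also guarantees the handle is closed, unlike a bare open(...).read().
    with open(filename, "r", encoding="utf-8") as f:
        return f.read()
```

With this, `long_description=readfile("README.md")` would replace the inline `open` call; the commit's behavior is otherwise unchanged.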