juliaturc commited on
Commit
559dd34
·
1 Parent(s): bf938c6

Add code for indexing and chatting.

Browse files
Files changed (12) hide show
  1. .gitignore +3 -0
  2. .pylintrc +8 -0
  3. README.md +45 -2
  4. requirements.txt +14 -0
  5. src/.sample-env +3 -0
  6. src/__init__.py +0 -0
  7. src/chat.py +114 -0
  8. src/chunker.py +237 -0
  9. src/embedder.py +199 -0
  10. src/index.py +86 -0
  11. src/repo_manager.py +146 -0
  12. src/vector_store.py +54 -0
.gitignore ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ .env
2
+ __pycache__
3
+ *.cpython.*
.pylintrc ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ [FORMAT]
2
+ max-line-length=120
3
+
4
+ [DESIGN]
5
+ min-public-methods=1
6
+
7
+ [MASTER]
8
+ init-hook='import sys; sys.path.append(".")'
README.md CHANGED
@@ -1,2 +1,45 @@
1
- # repo2vec
2
- From GitHub repo to vector store
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Overview
2
+ `repo2vec` enables you to chat with your codebase by simply running two python scripts:
3
+ ```
4
+ pip install -r requirements.txt
5
+
6
+ export GITHUB_REPO_NAME=...
7
+ export OPENAI_API_KEY=...
8
+ export PINECONE_API_KEY=...
9
+ export PINECONE_INDEX_NAME=...
10
+
11
+ python src/index.py $GITHUB_REPO_NAME --pinecone_index_name=$PINECONE_INDEX_NAME
12
+ python src/chat.py $GITHUB_REPO_NAME --pinecone_index_name=$PINECONE_INDEX_NAME
13
+ ```
14
+ This will bring up a `gradio` app where you can ask questions about your codebase. The assistant responses always include GitHub links to the documents retrieved for each query.
15
+
16
+ Here is, for example, a conversation about the repo [Storia-AI/image-eval](https://github.com/Storia-AI/image-eval):
17
+ ![screenshot](assets/chat_screenshot.png)
18
+
19
+ # Under the hood
20
+
21
+ ## Indexing the repo
22
+ The `src/index.py` script performs the following steps:
23
+ 1. **Clones a GitHub repository**. See [RepoManager](src/repo_manager.py).
24
+ - Make sure to set the `GITHUB_TOKEN` environment variable for private repositories.
25
+ 2. **Chunks files**. See [Chunker](src/chunker.py).
26
+ - For code files, we implement a special `CodeChunker` that takes the parse tree into account.
27
+ 3. **Batch-embeds chunks**. See [Embedder](src/embedder.py).
28
+ - By default, we use OpenAI's [batch embedding API](https://platform.openai.com/docs/guides/batch/overview), which is much faster and cheaper than the regular synchronous embedding API.
29
+ 4. **Stores embeddings in a vector store**. See [VectorStore](src/vector_store.py).
30
+ - By default, we use [Pinecone](https://pinecone.io) as a vector store, but you can easily plug in your own.
31
+
32
+ ## Chatting via RAG
33
+ The `src/chat.py` brings up a [Gradio app](https://www.gradio.app/) with a chat interface as shown above. We use [LangChain](https://langchain.com) to define a RAG chain which, given a user query about the repository:
34
+
35
+ 1. Rewrites the query to be self-contained based on previous queries
36
+ 2. Embeds the rewritten query using OpenAI embeddings
37
+ 3. Retrieves relevant documents from the vector store
38
+ 4. Calls an OpenAI LLM to respond to the user query based on the retrieved documents.
39
+
40
+ The sources are conveniently surfaced in the chat and linked directly to GitHub.
41
+
42
+ # Extensions & Contributions
43
+ We built the code purposefully modular so that you can plug in your desired embeddings, LLM and vector stores providers by simply implementing the relevant abstract classes.
44
+
45
+ Feel free to send feature requests to [founders@storia.ai](mailto:founders@storia.ai) or make a pull request!
requirements.txt ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ GitPython==3.1.43
2
+ Pygments==2.18.0
3
+ gradio==4.42.0
4
+ langchain==0.2.14
5
+ langchain-community==0.2.12
6
+ langchain-openai==0.1.22
7
+ openai==1.42.0
8
+ pinecone==5.0.1
9
+ python-dotenv==1.0.1
10
+ requests==2.32.3
11
+ semchunk==2.2.0
12
+ tiktoken==0.7.0
13
+ tree-sitter==0.22.3
14
+ tree-sitter-language-pack==0.2.0
src/.sample-env ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ OPENAI_API_KEY=
2
+ PINECONE_API_KEY=
3
+ GITHUB_TOKEN=
src/__init__.py ADDED
File without changes
src/chat.py ADDED
@@ -0,0 +1,114 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """A gradio app that enables users to chat with their codebase.
2
+
3
+ You must run index.py first in order to index the codebase into a vector store.
4
+ """
5
+
6
+ import argparse
7
+
8
+ from dotenv import load_dotenv
9
+
10
+ import gradio as gr
11
+ from langchain.chains import create_history_aware_retriever, create_retrieval_chain
12
+ from langchain.chains.combine_documents import create_stuff_documents_chain
13
+ from langchain.schema import AIMessage, HumanMessage
14
+ from langchain_community.vectorstores import Pinecone
15
+ from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
16
+ from langchain_openai import ChatOpenAI, OpenAIEmbeddings
17
+
18
+ from repo_manager import RepoManager
19
+
20
+ load_dotenv()
21
+
22
+
23
def build_rag_chain(args):
    """Builds a RAG chain via LangChain.

    The chain (1) rewrites the latest user query to be self-contained given the
    chat history, (2) retrieves relevant chunks from the Pinecone index, and
    (3) answers the query with an OpenAI chat model grounded in those chunks.

    Args:
        args: Parsed CLI arguments with `openai_model`, `pinecone_index_name`
            and `repo_id` attributes.

    Returns:
        A runnable LangChain retrieval chain; invoke it with
        {"input": ..., "chat_history": [...]}.
    """
    llm = ChatOpenAI(model=args.openai_model)

    vectorstore = Pinecone.from_existing_index(
        index_name=args.pinecone_index_name,
        embedding=OpenAIEmbeddings(),
        # Each repository lives in its own Pinecone namespace (see src/index.py).
        namespace=args.repo_id,
    )

    retriever = vectorstore.as_retriever()

    # Prompt to contextualize the latest query based on the chat history.
    # (Fixed typo: "formualte" -> "formulate".)
    contextualize_q_system_prompt = (
        "Given a chat history and the latest user question which might reference context in the chat history, "
        "formulate a standalone question which can be understood without the chat history. Do NOT answer the question, "
        "just reformulate it if needed and otherwise return it as is."
    )
    contextualize_q_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", contextualize_q_system_prompt),
            MessagesPlaceholder("chat_history"),
            ("human", "{input}"),
        ]
    )
    history_aware_retriever = create_history_aware_retriever(
        llm, retriever, contextualize_q_prompt
    )

    # Prompt for the final answer; "{context}" is filled with the retrieved chunks.
    # (Fixed missing space between the first two sentences.)
    qa_system_prompt = (
        f"You are my coding buddy, helping me quickly understand a GitHub repository called {args.repo_id}. "
        "Assume I am an advanced developer and answer my questions in the most succinct way possible."
        "\n\n"
        "Here are some snippets from the codebase."
        "\n\n"
        "{context}"
    )
    qa_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", qa_system_prompt),
            MessagesPlaceholder("chat_history"),
            ("human", "{input}"),
        ]
    )

    question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)
    rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)
    return rag_chain
71
+
72
+
73
def append_sources_to_response(response, repo_id=None):
    """Given an OpenAI completion response, appends to it GitHub links of the context sources.

    Args:
        response: A LangChain retrieval-chain response dict with an "answer" string and a
            "context" list of documents whose metadata contains "filename".
        repo_id: Repository in "owner/repo" form. Defaults to the module-level CLI
            argument `args.repo_id` for backward compatibility with existing callers.

    Returns:
        The answer string with a deduplicated "Sources:" section of GitHub links appended.
    """
    if repo_id is None:
        # Fall back to the global CLI args, which are set in the __main__ block. The original
        # hard-coded this global; the parameter makes the function usable standalone.
        repo_id = args.repo_id
    filenames = [document.metadata["filename"] for document in response["context"]]
    # Deduplicate filenames while preserving their order.
    filenames = list(dict.fromkeys(filenames))
    repo_manager = RepoManager(repo_id)
    github_links = [
        repo_manager.github_link_for_file(filename) for filename in filenames
    ]
    return response["answer"] + "\n\nSources:\n" + "\n".join(github_links)
83
+
84
+
85
+ if __name__ == "__main__":
86
+ parser = argparse.ArgumentParser(description="UI to chat with your codebase")
87
+ parser.add_argument("repo_id", help="The ID of the repository to index")
88
+ parser.add_argument(
89
+ "--openai_model", default="gpt-4", help="The OpenAI model to use for response generation"
90
+ )
91
+ parser.add_argument(
92
+ "--pinecone_index_name", required=True, help="Pinecone index name"
93
+ )
94
+ args = parser.parse_args()
95
+
96
+ rag_chain = build_rag_chain(args)
97
+
98
+ def _predict(message, history):
99
+ """Performs one RAG operation."""
100
+ history_langchain_format = []
101
+ for human, ai in history:
102
+ history_langchain_format.append(HumanMessage(content=human))
103
+ history_langchain_format.append(AIMessage(content=ai))
104
+ history_langchain_format.append(HumanMessage(content=message))
105
+ response = rag_chain.invoke(
106
+ {"input": message, "chat_history": history_langchain_format}
107
+ )
108
+ answer = append_sources_to_response(response)
109
+ return answer
110
+
111
+ gr.ChatInterface(_predict,
112
+ title=args.repo_id,
113
+ description=f"Code sage for your repo: {args.repo_id}",
114
+ examples=["What does this repo do?", "Give me some sample code."]).launch()
src/chunker.py ADDED
@@ -0,0 +1,237 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Chunker abstraction and implementations."""
2
+
3
+ import logging
4
+ from abc import ABC, abstractmethod
5
+ from dataclasses import dataclass
6
+ from functools import lru_cache
7
+ from typing import List, Optional
8
+
9
+ import pygments
10
+ import tiktoken
11
+ from semchunk import chunk as chunk_via_semchunk
12
+ from tree_sitter import Node
13
+ from tree_sitter_language_pack import get_parser
14
+
15
+ logger = logging.getLogger(__name__)
16
+
17
+
18
@dataclass
class Chunk:
    """A chunk of code or text extracted from a file in the repository."""

    filename: str
    start_byte: int
    end_byte: int
    _content: Optional[str] = None

    @property
    def content(self) -> Optional[str]:
        """The text content to be embedded. Might contain information beyond just the text snippet from the file."""
        return self._content

    def populate_content(self, file_content: str):
        """Fills in the embeddable content: the file path followed by the raw snippet."""
        snippet = file_content[self.start_byte : self.end_byte]
        self._content = f"{self.filename}\n\n{snippet}"

    def num_tokens(self, tokenizer):
        """Counts the number of tokens in the chunk. Requires populated content."""
        if not self.content:
            raise ValueError("Content not populated.")
        return Chunk._cached_num_tokens(self.content, tokenizer)

    @staticmethod
    @lru_cache(maxsize=1024)
    def _cached_num_tokens(content: str, tokenizer):
        """Caches token counts, since the same content is re-counted during chunk merging."""
        return len(tokenizer.encode(content, disallowed_special=()))

    def __eq__(self, other):
        # Identity is determined by location only; _content is derived data.
        if not isinstance(other, Chunk):
            return False
        return (self.filename, self.start_byte, self.end_byte) == (
            other.filename,
            other.start_byte,
            other.end_byte,
        )

    def __hash__(self):
        return hash((self.filename, self.start_byte, self.end_byte))
61
+
62
+
63
class Chunker(ABC):
    """Abstract class for chunking a file into smaller pieces."""

    @abstractmethod
    def chunk(self, file_path: str, file_content: str) -> List[Chunk]:
        """Chunks a file into smaller pieces.

        Args:
            file_path: Path of the file (used in each chunk's metadata).
            file_content: Full text content of the file.

        Returns:
            A list of Chunk objects covering the file.
        """
69
+
70
+
71
class CodeChunker(Chunker):
    """Splits a code file into chunks of at most `max_tokens` tokens each.

    Uses a tree-sitter parse tree so that chunk boundaries respect code structure,
    and falls back to plain text chunking for leaf nodes that are still too long.
    """

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.tokenizer = tiktoken.get_encoding("cl100k_base")
        # Fallback for oversized leaf nodes that cannot be split structurally.
        self.text_chunker = TextChunker(max_tokens)

    @staticmethod
    def _get_language_from_filename(filename: str):
        """Returns a canonical name for the language of the file, based on its extension.
        Returns None if the language is unknown to the pygments lexer.
        """
        try:
            lexer = pygments.lexers.get_lexer_for_filename(filename)
            return lexer.name.lower()
        except pygments.util.ClassNotFound:
            return None

    def _chunk_node(self, node: Node, filename: str, file_content: str) -> List[Chunk]:
        """Splits a node in the parse tree into a flat list of chunks.

        Recursively descends into children when the node itself exceeds max_tokens,
        then greedily merges neighboring chunks back together up to the limit.
        """
        node_chunk = Chunk(filename, node.start_byte, node.end_byte)
        node_chunk.populate_content(file_content)

        if node_chunk.num_tokens(self.tokenizer) <= self.max_tokens:
            return [node_chunk]

        if not node.children:
            # This is a leaf node, but it's too long. We'll have to split it with a text tokenizer.
            return self.text_chunker.chunk(
                filename, file_content[node.start_byte : node.end_byte]
            )

        chunks = []
        for child in node.children:
            chunks.extend(self._chunk_node(child, filename, file_content))

        for chunk in chunks:
            # This should always be true. Otherwise there must be a bug in the code.
            assert chunk.content and chunk.num_tokens(self.tokenizer) <= self.max_tokens

        # Merge neighboring chunks if their combined size doesn't exceed max_tokens. The goal is to avoid
        # pathologically small chunks that end up being undeservedly preferred by the retriever.
        merged_chunks = []
        for chunk in chunks:
            if not merged_chunks:
                merged_chunks.append(chunk)
            elif (
                merged_chunks[-1].num_tokens(self.tokenizer)
                + chunk.num_tokens(self.tokenizer)
                < self.max_tokens - 50
            ):
                # There's a good chance that merging these two chunks will be under the token limit. We're not
                # 100% sure at this point, because tokenization is not necessarily additive.
                merged = Chunk(
                    merged_chunks[-1].filename,
                    merged_chunks[-1].start_byte,
                    chunk.end_byte,
                )
                merged.populate_content(file_content)
                if merged.num_tokens(self.tokenizer) <= self.max_tokens:
                    merged_chunks[-1] = merged
                else:
                    merged_chunks.append(chunk)
            else:
                # BUGFIX: the original had no else-branch here, silently dropping every chunk
                # whose combined size with its predecessor reached max_tokens - 50.
                merged_chunks.append(chunk)

        for chunk in merged_chunks:
            # This should always be true. Otherwise there's a bug worth investigating.
            assert chunk.content and chunk.num_tokens(self.tokenizer) <= self.max_tokens

        return merged_chunks

    @staticmethod
    def is_code_file(filename: str) -> bool:
        """Checks whether pygment & tree_sitter can parse the file as code."""
        language = CodeChunker._get_language_from_filename(filename)
        # bool() so we honor the -> bool annotation instead of returning None for unknown files.
        return bool(language and language not in ["text only", "None"])

    @staticmethod
    def parse_tree(filename: str, content: str):
        """Parses the code in a file.

        Returns:
            A tree_sitter Tree, or None if the file cannot be parsed as code.
            (The original annotation claimed List[str], which was incorrect.)
        """
        language = CodeChunker._get_language_from_filename(filename)

        if not language or language in ["text only", "None"]:
            logger.debug("%s doesn't seem to be a code file.", filename)
            return None

        try:
            parser = get_parser(language)
        except LookupError:
            logger.debug("%s doesn't seem to be a code file.", filename)
            return None

        tree = parser.parse(bytes(content, "utf8"))

        if not tree.root_node.children or tree.root_node.children[0].type == "ERROR":
            logger.warning("Failed to parse code in %s.", filename)
            return None
        return tree

    def chunk(self, file_path: str, file_content: str) -> List[Chunk]:
        """Chunks a code file into smaller pieces."""
        tree = self.parse_tree(file_path, file_content)
        if tree is None:
            return []

        chunks = self._chunk_node(tree.root_node, file_path, file_content)
        for chunk in chunks:
            # Make sure that the chunk has content and doesn't exceed the max_tokens limit. Otherwise there
            # must be a bug in the code.
            assert chunk.content
            size = chunk.num_tokens(self.tokenizer)
            assert (
                size <= self.max_tokens
            ), f"Chunk size {size} exceeds max_tokens {self.max_tokens}."

        return chunks
188
+
189
+
190
class TextChunker(Chunker):
    """Wrapper around semchunk: https://github.com/umarbutler/semchunk."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens

        tokenizer = tiktoken.get_encoding("cl100k_base")
        # disallowed_special=() so special tokens appearing verbatim in files don't raise.
        self.count_tokens = lambda text: len(
            tokenizer.encode(text, disallowed_special=())
        )

    def chunk(self, file_path: str, file_content: str) -> List[Chunk]:
        """Chunks a text file into smaller pieces.

        Args:
            file_path: Path of the file, prepended to each chunk's content.
            file_content: Full text of the file.

        Returns:
            Chunks whose (start_byte, end_byte) spans locate each piece in file_content.
        """
        # We need to allocate some tokens for the filename, which is part of the chunk content.
        extra_tokens = self.count_tokens(file_path + "\n\n")
        text_chunks = chunk_via_semchunk(
            file_content, self.max_tokens - extra_tokens, self.count_tokens
        )

        chunks = []
        search_from = 0
        for text_chunk in text_chunks:
            # This assertion should always be true. Otherwise there's a bug worth finding.
            assert self.count_tokens(text_chunk) <= self.max_tokens - extra_tokens

            # Find the start/end positions of the chunk. BUGFIX: the original used str.index,
            # which raises ValueError instead of returning -1, making the -1 branch dead code
            # and leaving `end` unbound on a miss.
            start = file_content.find(text_chunk, search_from)
            if start == -1:
                logger.warning("Couldn't find semchunk in content: %s", text_chunk)
                continue
            end = start + len(text_chunk)

            # Populate content from the file so the filename prefix is included, matching
            # Chunk.populate_content and the extra_tokens budget reserved above. (The original
            # stored the bare text, despite reserving tokens for the filename.)
            piece = Chunk(file_path, start, end)
            piece.populate_content(file_content)
            chunks.append(piece)

            search_from = end
        return chunks
225
+
226
+
227
class UniversalChunker(Chunker):
    """Chunks a file into smaller pieces, regardless of whether it's code or text."""

    def __init__(self, max_tokens: int):
        self.code_chunker = CodeChunker(max_tokens)
        self.text_chunker = TextChunker(max_tokens)

    def chunk(self, file_path: str, file_content: str) -> List[Chunk]:
        """Dispatches to the code chunker for parseable code files, else the text chunker."""
        chunker = (
            self.code_chunker
            if CodeChunker.is_code_file(file_path)
            else self.text_chunker
        )
        return chunker.chunk(file_path, file_content)
src/embedder.py ADDED
@@ -0,0 +1,199 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Batch embedder abstraction and implementations."""
2
+
3
+ import json
4
+ import logging
5
+ import os
6
+ from abc import ABC, abstractmethod
7
+ from collections import Counter
8
+ from typing import Dict, Generator, List, Tuple
9
+
10
+ from openai import OpenAI
11
+
12
+ from chunker import Chunk, Chunker
13
+ from repo_manager import RepoManager
14
+
15
# A vector is the unit stored in the vector store: chunk metadata plus its embedding.
Vector = Tuple[Dict, List[float]]  # (metadata, embedding)


class BatchEmbedder(ABC):
    """Abstract class for batch embedding of a repository."""

    @abstractmethod
    def embed_repo(self, chunks_per_batch: int):
        """Issues batch embedding jobs for the entire repository.

        Args:
            chunks_per_batch: Maximum number of chunks to include in a single job.
        """

    @abstractmethod
    def embeddings_are_ready(self) -> bool:
        """Checks whether the batch embedding jobs are done."""

    @abstractmethod
    def download_embeddings(self) -> Generator[Vector, None, None]:
        """Yields (chunk_metadata, embedding) pairs for each chunk in the repository."""
32
+
33
+
34
class OpenAIBatchEmbedder(BatchEmbedder):
    """Batch embedder that calls OpenAI. See https://platform.openai.com/docs/guides/batch/overview."""

    def __init__(self, repo_manager: RepoManager, chunker: Chunker, local_dir: str):
        self.repo_manager = repo_manager
        self.chunker = chunker
        self.local_dir = local_dir
        # IDs issued by OpenAI for each batch job mapped to metadata about the chunks.
        self.openai_batch_ids = {}
        self.client = OpenAI()

    def embed_repo(self, chunks_per_batch: int):
        """Issues batch embedding jobs for the entire repository.

        Args:
            chunks_per_batch: Maximum number of chunks per job.

        Raises:
            ValueError: If jobs were already issued by this embedder instance.
        """
        if self.openai_batch_ids:
            raise ValueError("Embeddings are in progress.")

        batch = []
        chunk_count = 0
        repo_name = self.repo_manager.repo_id.split("/")[-1]

        for filepath, content in self.repo_manager.walk():
            chunks = self.chunker.chunk(filepath, content)
            chunk_count += len(chunks)
            batch.extend(chunks)

            # Issue a job for each full batch and keep the remainder for the next iteration.
            # BUGFIX: the original sliced `batch` while reassigning it inside the loop, which
            # dropped all chunks after the first slice and issued empty follow-up jobs.
            while len(batch) >= chunks_per_batch:
                full_batch, batch = batch[:chunks_per_batch], batch[chunks_per_batch:]
                self._issue_and_register(full_batch, repo_name)

        # Finally, commit the last batch.
        if batch:
            self._issue_and_register(batch, repo_name)
        logging.info(
            "Issued %d jobs for %d chunks.", len(self.openai_batch_ids), chunk_count
        )

        # Save the job IDs to a file, just in case this script is terminated by mistake.
        metadata_file = os.path.join(self.local_dir, "openai_batch_ids.json")
        with open(metadata_file, "w") as f:
            json.dump(self.openai_batch_ids, f)
        logging.info("Job metadata saved at %s", metadata_file)

    def _issue_and_register(self, chunks: List[Chunk], repo_name: str):
        """Issues one embedding job and records its OpenAI ID -> chunk metadata mapping."""
        openai_batch_id = self._issue_job_for_chunks(
            chunks, batch_id=f"{repo_name}/{len(self.openai_batch_ids)}"
        )
        self.openai_batch_ids[openai_batch_id] = self._metadata_for_chunks(chunks)

    def embeddings_are_ready(self) -> bool:
        """Checks whether the embeddings jobs are done (either completed or failed)."""
        if not self.openai_batch_ids:
            raise ValueError("No embeddings in progress.")
        job_ids = self.openai_batch_ids.keys()
        statuses = [self.client.batches.retrieve(job_id.strip()) for job_id in job_ids]
        are_ready = all(status.status in ["completed", "failed"] for status in statuses)
        status_counts = Counter(status.status for status in statuses)
        logging.info("Job statuses: %s", status_counts)
        return are_ready

    def download_embeddings(self) -> Generator[Vector, None, None]:
        """Yields a (chunk_metadata, embedding) pair for each chunk in the repository."""
        job_ids = self.openai_batch_ids.keys()
        statuses = [self.client.batches.retrieve(job_id.strip()) for job_id in job_ids]

        for status in statuses:
            if status.status == "failed":
                logging.error("Job failed: %s", status)
                continue

            if not status.output_file_id:
                error = self.client.files.content(status.error_file_id)
                logging.error("Job %s failed with error: %s", status.id, error.text)
                continue

            batch_metadata = self.openai_batch_ids[status.id]
            file_response = self.client.files.content(status.output_file_id)
            data = json.loads(file_response.text)["response"]["body"]["data"]
            logging.info("Job %s generated %d embeddings.", status.id, len(data))

            for datum in data:
                # "index" positions the embedding within the batch's input list.
                chunk_idx = int(datum["index"])
                metadata = batch_metadata[chunk_idx]
                embedding = datum["embedding"]
                yield (metadata, embedding)

    def _issue_job_for_chunks(self, chunks: List[Chunk], batch_id: str) -> str:
        """Issues a batch embedding job for the given chunks. Returns the job ID."""
        logging.info("*" * 100)
        logging.info("Issuing job for batch %s with %d chunks.", batch_id, len(chunks))

        # Create a .jsonl file with the batch.
        request = OpenAIBatchEmbedder._chunks_to_request(chunks, batch_id)
        input_file = os.path.join(self.local_dir, f"batch_{batch_id}.jsonl")
        OpenAIBatchEmbedder._export_to_jsonl([request], input_file)

        # Upload the file and issue the embedding job. (Context manager so the handle is closed.)
        with open(input_file, "rb") as f:
            batch_input_file = self.client.files.create(file=f, purpose="batch")
        batch_status = self._create_batch_job(batch_input_file.id)
        if batch_status is None:
            # The original dereferenced None and crashed with AttributeError; fail clearly instead.
            raise RuntimeError(f"Failed to create batch job for batch {batch_id}.")
        logging.info("Created job with ID %s", batch_status.id)
        return batch_status.id

    def _create_batch_job(self, input_file_id: str):
        """Creates a batch embedding job on OpenAI. Returns the job object, or None on failure."""
        try:
            return self.client.batches.create(
                input_file_id=input_file_id,
                endpoint="/v1/embeddings",
                completion_window="24h",  # This is the only allowed value for now.
                timeout=3 * 60,  # 3 minutes
                metadata={},
            )
        except Exception as e:  # Broad on purpose: one bad job shouldn't kill the whole run.
            logging.error(
                "Failed to create batch job with input_file_id=%s. Error: %s",
                input_file_id,
                e,
            )
            return None

    @staticmethod
    def _export_to_jsonl(list_of_dicts: List[Dict], output_file: str):
        """Exports a list of dictionaries to a .jsonl file."""
        directory = os.path.dirname(output_file)
        if directory:
            # exist_ok avoids crashing if the directory was created in the meantime; the original
            # also crashed when output_file had no directory component (dirname == "").
            os.makedirs(directory, exist_ok=True)
        with open(output_file, "w") as f:
            for item in list_of_dicts:
                json.dump(item, f)
                f.write("\n")

    @staticmethod
    def _chunks_to_request(chunks: List[Chunk], batch_id: str):
        """Converts a list of chunks to a single /v1/embeddings batch request."""
        return {
            "custom_id": batch_id,
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {
                "model": "text-embedding-ada-002",
                "input": [chunk.content for chunk in chunks],
            },
        }

    @staticmethod
    def _metadata_for_chunks(chunks):
        """Builds the vector-store metadata dict for each chunk."""
        metadata = []
        for chunk in chunks:
            filename_ascii = chunk.filename.encode("ascii", "ignore").decode("ascii")
            metadata.append(
                {
                    # Some vector stores require the IDs to be ASCII.
                    "id": f"{filename_ascii}_{chunk.start_byte}_{chunk.end_byte}",
                    "filename": chunk.filename,
                    "start_byte": chunk.start_byte,
                    "end_byte": chunk.end_byte,
                    # Note to developer: When choosing a large chunk size, you might exceed the vector store's
                    # metadata size limit. In that case, you can simply store the start/end bytes above, and
                    # fetch the content directly from the repository when needed.
                    "text": chunk.content,
                }
            )
        return metadata
src/index.py ADDED
@@ -0,0 +1,86 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Runs a batch job to compute embeddings for an entire repo and stores them into a vector store."""
2
+
3
+ import argparse
4
+ import logging
5
+ import time
6
+
7
+ from chunker import UniversalChunker
8
+ from embedder import OpenAIBatchEmbedder
9
+ from repo_manager import RepoManager
10
+ from vector_store import PineconeVectorStore
11
+
12
+ logging.basicConfig(level=logging.INFO)
13
+
14
# Dimension of the embedding vectors, passed to the vector store (see main() below).
OPENAI_EMBEDDING_SIZE = 1536
MAX_TOKENS_PER_CHUNK = (
    8192  # The ADA embedder from OpenAI has a maximum of 8192 tokens.
)
MAX_CHUNKS_PER_BATCH = (
    2048  # The OpenAI batch embedding API enforces a maximum of 2048 chunks per batch.
)
MAX_TOKENS_PER_JOB = 3_000_000  # The OpenAI batch embedding API enforces a maximum of 3M tokens processed at once.
22
+
23
+
24
def main():
    """Indexes a GitHub repository: clone -> chunk -> batch-embed -> upsert into the vector store."""
    parser = argparse.ArgumentParser(description="Batch-embeds a repository")
    parser.add_argument("repo_id", help="The ID of the repository to index")
    parser.add_argument(
        "--local_dir",
        default="repos",
        help="The local directory to store the repository",
    )
    parser.add_argument(
        "--tokens_per_chunk",
        type=int,
        default=800,
        help="https://arxiv.org/pdf/2406.14497 recommends a value between 200-800.",
    )
    parser.add_argument(
        "--chunks_per_batch", type=int, default=2000, help="Maximum chunks per batch"
    )
    parser.add_argument(
        "--pinecone_index_name", required=True, help="Pinecone index name"
    )

    args = parser.parse_args()

    # Validate the arguments.
    if args.tokens_per_chunk > MAX_TOKENS_PER_CHUNK:
        parser.error(
            f"The maximum number of tokens per chunk is {MAX_TOKENS_PER_CHUNK}."
        )
    if args.chunks_per_batch > MAX_CHUNKS_PER_BATCH:
        parser.error(
            f"The maximum number of chunks per batch is {MAX_CHUNKS_PER_BATCH}."
        )
    if args.tokens_per_chunk * args.chunks_per_batch >= MAX_TOKENS_PER_JOB:
        # Fixed message: this limit is on tokens per job, not chunks.
        parser.error(f"The maximum number of tokens per job is {MAX_TOKENS_PER_JOB}.")

    logging.info("Cloning the repository...")
    repo_manager = RepoManager(args.repo_id, local_dir=args.local_dir)
    if not repo_manager.clone():
        # clone() returns False on failure; the original ignored it and crashed later.
        parser.error(f"Unable to clone repository {args.repo_id}.")

    logging.info("Issuing embedding jobs...")
    chunker = UniversalChunker(max_tokens=args.tokens_per_chunk)
    embedder = OpenAIBatchEmbedder(repo_manager, chunker, args.local_dir)
    embedder.embed_repo(args.chunks_per_batch)

    logging.info("Waiting for embeddings to be ready...")
    while not embedder.embeddings_are_ready():
        logging.info("Sleeping for 30 seconds...")
        time.sleep(30)

    logging.info("Moving embeddings to the vector store...")
    # Note to developer: Replace this with your preferred vector store.
    vector_store = PineconeVectorStore(
        index_name=args.pinecone_index_name,
        dimension=OPENAI_EMBEDDING_SIZE,
        namespace=repo_manager.repo_id,
    )
    vector_store.ensure_exists()
    vector_store.upsert(embedder.download_embeddings())
    logging.info("Done!")


if __name__ == "__main__":
    main()
src/repo_manager.py ADDED
@@ -0,0 +1,146 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Utility classes to maniuplate GitHub repositories."""
2
+
3
+ import logging
4
+ import os
5
+ from functools import cached_property
6
+
7
+ import requests
8
+ from git import GitCommandError, Repo
9
+
10
+
11
class RepoManager:
    """Class to manage a local clone of a GitHub repository."""

    def __init__(self, repo_id: str, local_dir: str = None):
        """
        Args:
            repo_id: The identifier of the repository in owner/repo format, e.g. "Storia-AI/repo2vec".
            local_dir: The local directory where the repository will be cloned.
        """
        self.repo_id = repo_id
        self.local_dir = local_dir or "/tmp/"
        if not os.path.exists(self.local_dir):
            os.makedirs(self.local_dir)
        self.local_path = os.path.join(self.local_dir, repo_id)
        # Needed for private repositories; also raises GitHub's API rate limits.
        self.access_token = os.getenv("GITHUB_TOKEN")

    @cached_property
    def is_public(self) -> bool:
        """Checks whether a GitHub repository is publicly visible."""
        response = requests.get(f"https://api.github.com/repos/{self.repo_id}", timeout=10)
        # Note that the response will be 404 for both private and non-existent repos.
        return response.status_code == 200

    @cached_property
    def default_branch(self) -> str:
        """Fetches the default branch of the repository from GitHub."""
        headers = {
            "Accept": "application/vnd.github.v3+json",
        }
        if self.access_token:
            headers["Authorization"] = f"token {self.access_token}"

        # timeout added for consistency with is_public; the original request could hang forever.
        response = requests.get(
            f"https://api.github.com/repos/{self.repo_id}", headers=headers, timeout=10
        )
        if response.status_code == 200:
            branch = response.json().get("default_branch", "main")
        else:
            # This happens sometimes when we exceed the Github rate limit. The best bet in this case is to
            # assume the most common naming for the default branch ("main").
            # logging.warning replaces the deprecated logging.warn; lazy %-args replace the f-string.
            logging.warning(
                "Unable to fetch default branch for %s: %s", self.repo_id, response.text
            )
            branch = "main"
        return branch

    def clone(self) -> bool:
        """Clones the repository to the local directory, if it's not already cloned.

        Returns:
            True on success (or if already cloned), False if cloning failed.

        Raises:
            ValueError: If the repo is private/non-existent and no GITHUB_TOKEN is set.
        """
        if os.path.exists(self.local_path):
            # The repository is already cloned.
            return True

        if not self.is_public and not self.access_token:
            raise ValueError(f"Repo {self.repo_id} is private or doesn't exist.")

        if self.access_token:
            clone_url = f"https://{self.access_token}@github.com/{self.repo_id}.git"
        else:
            clone_url = f"https://github.com/{self.repo_id}.git"

        try:
            Repo.clone_from(clone_url, self.local_path, depth=1, single_branch=True)
        except GitCommandError as e:
            # Do not log clone_url: it may embed the access token.
            # NOTE(review): the GitCommandError message itself may still echo the git command
            # (and thus the URL) — consider sanitizing `e` as well.
            logging.error("Unable to clone %s. Error: %s", self.repo_id, e)
            return False
        return True

    def walk(
        self,
        included_extensions: set = None,
        excluded_extensions: set = None,
        log_dir: str = None,
    ):
        """Walks the local repository path and yields a tuple of (filepath, content) for each file.
        The filepath is relative to the root of the repository (e.g. "org/repo/your/file/path.py").

        Args:
            included_extensions: Optional set of extensions to include.
            excluded_extensions: Optional set of extensions to exclude.
            log_dir: Optional directory where to log the included and excluded files.
        """
        # Convert included and excluded extensions to lowercase.
        if included_extensions:
            included_extensions = {ext.lower() for ext in included_extensions}
        if excluded_extensions:
            excluded_extensions = {ext.lower() for ext in excluded_extensions}

        def include(file_path: str) -> bool:
            """Applies the extension filters and excludes hidden files/directories."""
            _, extension = os.path.splitext(file_path)
            extension = extension.lower()
            if included_extensions and extension not in included_extensions:
                return False
            if excluded_extensions and extension in excluded_extensions:
                return False
            # Exclude hidden files and directories.
            if any(part.startswith(".") for part in file_path.split(os.path.sep)):
                return False
            return True

        # We will keep appending to these files during the iteration, so we need to clear them first.
        if log_dir:
            repo_name = self.repo_id.replace("/", "_")
            included_log_file = os.path.join(log_dir, f"included_{repo_name}.txt")
            excluded_log_file = os.path.join(log_dir, f"excluded_{repo_name}.txt")
            if os.path.exists(included_log_file):
                os.remove(included_log_file)
            if os.path.exists(excluded_log_file):
                os.remove(excluded_log_file)

        for root, _, files in os.walk(self.local_path):
            file_paths = [os.path.join(root, file) for file in files]
            included_file_paths = [f for f in file_paths if include(f)]

            if log_dir:
                with open(included_log_file, "a") as f:
                    for path in included_file_paths:
                        f.write(path + "\n")

                excluded_file_paths = set(file_paths).difference(set(included_file_paths))
                with open(excluded_log_file, "a") as f:
                    for path in excluded_file_paths:
                        f.write(path + "\n")

            for file_path in included_file_paths:
                with open(file_path, "r") as f:
                    try:
                        contents = f.read()
                    except UnicodeDecodeError:
                        logging.warning("Unable to decode file %s. Skipping.", file_path)
                        continue
                # BUGFIX: os.path.relpath handles a trailing separator in local_dir; the original
                # `file_path[len(self.local_dir) + 1:]` mangled paths when local_dir ended with
                # "/" (the default "/tmp/"), cutting two characters off the repo owner's name.
                yield os.path.relpath(file_path, self.local_dir), contents

    def github_link_for_file(self, file_path: str) -> str:
        """Converts a repository-relative file path (e.g. "org/repo/src/x.py") to a GitHub link."""
        # Strip the "org/repo" prefix AND the leading slash; the original kept the slash,
        # producing links like ".../blob/main//src/x.py".
        file_path = file_path[len(self.repo_id):].lstrip("/")
        return f"https://github.com/{self.repo_id}/blob/{self.default_branch}/{file_path}"
src/vector_store.py ADDED
@@ -0,0 +1,54 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Vector store abstraction and implementations."""
2
+
3
+ from abc import ABC, abstractmethod
4
+ from typing import Dict, Generator, List, Tuple
5
+
6
+ from pinecone import Pinecone
7
+
8
# The unit stored in a vector store: chunk metadata plus its embedding.
Vector = Tuple[Dict, List[float]]  # (metadata, embedding)


class VectorStore(ABC):
    """Abstract class for a vector store."""

    @abstractmethod
    def ensure_exists(self):
        """Ensures that the vector store exists. Creates it if it doesn't."""

    @abstractmethod
    def upsert_batch(self, vectors: List[Vector]):
        """Upserts a batch of vectors."""

    def upsert(self, vectors: Generator[Vector, None, None]):
        """Upserts in batches of 100, since vector stores have a limit on upsert size."""
        pending = []
        for vector in vectors:
            pending.append(vector)
            if len(pending) == 100:
                self.upsert_batch(pending)
                pending = []
        # Flush whatever is left over after the generator is exhausted.
        if pending:
            self.upsert_batch(pending)
+
32
+
33
class PineconeVectorStore(VectorStore):
    """Vector store implementation using Pinecone."""

    def __init__(self, index_name: str, dimension: int, namespace: str):
        """
        Args:
            index_name: Name of the Pinecone index.
            dimension: Dimensionality of the embedding vectors.
            namespace: Pinecone namespace to write into (one per repository; see src/index.py).
        """
        self.index_name = index_name
        self.dimension = dimension
        # Pinecone() with no args reads PINECONE_API_KEY from the environment (see src/.sample-env).
        self.client = Pinecone()
        # NOTE(review): the Index handle is obtained before ensure_exists() can create the index;
        # presumably the handle is lazy so this is fine — confirm against the pinecone client docs.
        self.index = self.client.Index(self.index_name)
        self.namespace = namespace

    def ensure_exists(self):
        # Creates the index if it's missing.
        # NOTE(review): recent pinecone clients require a `spec` (serverless/pod) argument to
        # create_index — verify this call succeeds with the pinned pinecone==5.0.1.
        if self.index_name not in self.client.list_indexes().names():
            self.client.create_index(
                name=self.index_name, dimension=self.dimension, metric="cosine"
            )

    def upsert_batch(self, vectors: List[Vector]):
        # Pinecone expects (id, values, metadata) tuples. Falls back to the batch-local position
        # as the ID when metadata lacks one; note this fallback can collide across batches.
        pinecone_vectors = [
            (metadata.get("id", str(i)), embedding, metadata)
            for i, (metadata, embedding) in enumerate(vectors)
        ]
        self.index.upsert(vectors=pinecone_vectors, namespace=self.namespace)