whitelotus0 commited on
Commit
fff1c68
·
1 Parent(s): 02d7f34

code weaver

Browse files
.env.example ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ OPENAI_API_KEY=your_openai_api_key
2
+ ACTIVELOOP_TOKEN=your_activeloop_api_token
3
+ ACTIVELOOP_USERNAME=your_activeloop_username
.flake8 ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ [flake8]
2
+ ignore = E501, W503
3
+ max-line-length = 88
4
+ exclude = .git,__pycache__,build,dist
.gitignore ADDED
@@ -0,0 +1,172 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Byte-compiled / optimized / DLL files
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+
6
+ # C extensions
7
+ *.so
8
+
9
+ # Distribution / packaging
10
+ .Python
11
+ build/
12
+ develop-eggs/
13
+ dist/
14
+ downloads/
15
+ eggs/
16
+ .eggs/
17
+ lib/
18
+ lib64/
19
+ parts/
20
+ sdist/
21
+ var/
22
+ wheels/
23
+ share/python-wheels/
24
+ *.egg-info/
25
+ .installed.cfg
26
+ *.egg
27
+ MANIFEST
28
+
29
+ # PyInstaller
30
+ # Usually these files are written by a python script from a template
31
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
32
+ *.manifest
33
+ *.spec
34
+
35
+ # Installer logs
36
+ pip-log.txt
37
+ pip-delete-this-directory.txt
38
+
39
+ # Unit test / coverage reports
40
+ htmlcov/
41
+ .tox/
42
+ .nox/
43
+ .coverage
44
+ .coverage.*
45
+ .cache
46
+ nosetests.xml
47
+ coverage.xml
48
+ *.cover
49
+ *.py,cover
50
+ .hypothesis/
51
+ .pytest_cache/
52
+ cover/
53
+
54
+ # Translations
55
+ *.mo
56
+ *.pot
57
+
58
+ # Django stuff:
59
+ *.log
60
+ local_settings.py
61
+ db.sqlite3
62
+ db.sqlite3-journal
63
+
64
+ # Flask stuff:
65
+ instance/
66
+ .webassets-cache
67
+
68
+ # Scrapy stuff:
69
+ .scrapy
70
+
71
+ # Sphinx documentation
72
+ docs/_build/
73
+
74
+ # PyBuilder
75
+ .pybuilder/
76
+ target/
77
+
78
+ # Jupyter Notebook
79
+ .ipynb_checkpoints
80
+
81
+ # IPython
82
+ profile_default/
83
+ ipython_config.py
84
+
85
+ # pyenv
86
+ # For a library or package, you might want to ignore these files since the code is
87
+ # intended to run in multiple environments; otherwise, check them in:
88
+ # .python-version
89
+
90
+ # pipenv
91
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
92
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
93
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
94
+ # install all needed dependencies.
95
+ #Pipfile.lock
96
+
97
+ # UV
98
+ # Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
99
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
100
+ # commonly ignored for libraries.
101
+ #uv.lock
102
+
103
+ # poetry
104
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
105
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
106
+ # commonly ignored for libraries.
107
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
108
+ #poetry.lock
109
+
110
+ # pdm
111
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
112
+ #pdm.lock
113
+ # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
114
+ # in version control.
115
+ # https://pdm.fming.dev/latest/usage/project/#working-with-version-control
116
+ .pdm.toml
117
+ .pdm-python
118
+ .pdm-build/
119
+
120
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
121
+ __pypackages__/
122
+
123
+ # Celery stuff
124
+ celerybeat-schedule
125
+ celerybeat.pid
126
+
127
+ # SageMath parsed files
128
+ *.sage.py
129
+
130
+ # Environments
131
+ .env
132
+ .venv
133
+ env/
134
+ venv/
135
+ ENV/
136
+ env.bak/
137
+ venv.bak/
138
+
139
+ # Spyder project settings
140
+ .spyderproject
141
+ .spyproject
142
+
143
+ # Rope project settings
144
+ .ropeproject
145
+
146
+ # mkdocs documentation
147
+ /site
148
+
149
+ # mypy
150
+ .mypy_cache/
151
+ .dmypy.json
152
+ dmypy.json
153
+
154
+ # Pyre type checker
155
+ .pyre/
156
+
157
+ # pytype static type analyzer
158
+ .pytype/
159
+
160
+ # Cython debug symbols
161
+ cython_debug/
162
+
163
+ # PyCharm
164
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
165
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
166
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
167
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
168
+ #.idea/
169
+
170
+ # PyPI configuration file
171
+ .pypirc
172
+ repos
Dockerfile ADDED
@@ -0,0 +1,20 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Use an official Python runtime as a parent image
2
+ FROM python:3.11-slim
3
+
4
+ # Set the working directory in the container
5
+ WORKDIR /app
6
+
7
+ # Copy the requirements file into the container
8
+ COPY requirements.txt .
9
+
10
+ # Install any dependencies
11
+ RUN pip install --no-cache-dir -r requirements.txt
12
+
13
+ # Copy the rest of the application code into the container
14
+ COPY . .
15
+
16
+ # Expose the port that the app runs on (default 8501)
17
+ EXPOSE 8501
18
+
19
+ # Command to run the application
20
+ CMD ["streamlit", "run", "app.py"]
README.md CHANGED
@@ -8,4 +8,74 @@ pinned: false
8
  short_description: Interact with your github repo
9
  ---
10
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
11
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
8
  short_description: Interact with your github repo
9
  ---
10
 
11
+ # Code Weaver
12
+
13
+ This application is inspired by the [Chat-with-Github-Repo](https://github.com/peterw/Chat-with-Github-Repo) project by peterw, which demonstrates how to create a chatbot using Streamlit, OpenAI, and Activeloop's Deep Lake. Code Weaver builds upon that foundation, offering a more user-friendly experience and optimized performance.
14
+
15
+ Code Weaver is designed to let you interact with a GitHub repository through a conversational interface. It indexes the contents of a repository, allowing you to ask questions and receive answers based on the repository's documentation and code.
16
+
17
+ ![Code Weaver Screenshot](screenshot.png)
18
+
19
+ ## Key Improvements Over Original Implementation
20
+
21
+ Code Weaver offers several key improvements that enhance the user experience and the app's overall performance:
22
+
23
+ * **Purely Frontend UI:**
24
+ * The original application used a command-line interface (CLI) for configuration. Code Weaver provides a completely UI-based experience. All configurations (API keys, repo URLs, dataset names, etc.) are entered directly into the Streamlit app, making it much more accessible to users.
25
+ * **Optimized Document Parsing:**
26
+ * The document parsing and splitting process has been optimized for speed and handling of large files by using `RecursiveCharacterTextSplitter`.
27
+ * **Improved Loading times:**
28
+ * The app is now faster because the embeddings are not recalculated each time the app is started; this is achieved by using `InMemoryCache`.
29
+ * **Direct OpenAI Embeddings Model:**
30
+ * The app now directly specifies the usage of the `"text-embedding-ada-002"` OpenAI embedding model, resulting in faster embedding calculations.
31
+ * **In Memory Cache:**
32
+ * `InMemoryCache` is used to cache the results from the language model improving loading speeds.
33
+ * **Streamlined Setup:**
34
+ * No more need for `.env` files, as all necessary variables are inserted directly in the app UI.
35
+ * **Easier Deployment**: The app is built to be deployed in Hugging Face Spaces, making the deployment process easier.
36
+
37
+ ## How to Use Code Weaver
38
+
39
+ 1. **Clone the repository:**
40
+ ```bash
41
+ git clone https://github.com/WHITELOTUS0/chat-with-a-repo.git
42
+ ```
43
+ 2. **Navigate to the Project Directory:**
44
+
45
+ ```bash
46
+ cd chat-with-a-repo
47
+ ```
48
+ 3. **Install dependencies:**
49
+
50
+ ```bash
51
+ pip install -r requirements.txt
52
+ ```
53
+ 4. **Run the app:**
54
+ ```bash
55
+ streamlit run app.py
56
+ ```
57
+
58
+ 5. **Enter configurations**: The Streamlit app will open in your web browser. In the sidebar, provide:
59
+ * Your OpenAI API Key
60
+ * Your Activeloop Token
61
+ * Your Activeloop Username
62
+ * The GitHub repository URL you want to explore
63
+ * The file extensions to include (optional)
64
+ * A desired name for the DeepLake dataset
65
+ 6. **Process the repository**: Click on the "Process Repository" Button, and wait for it to finish.
66
+ 7. **Start Chatting:** After processing is done, you can start asking questions about the repository in the provided input text area.
67
+
68
+ ## Contributing
69
+
70
+ Contributions to this project are always welcome. If you find any bugs or would like to suggest new features, feel free to create a pull request or open an issue in this repository.
71
+
72
+ ## Credits
73
+
74
+ * Inspired by the [Chat-with-Github-Repo](https://github.com/peterw/Chat-with-Github-Repo) project by peterw.
75
+ * Built using Python, Streamlit, OpenAI, Activeloop, and Langchain.
76
+
77
+ ## License
78
+
79
+ [MIT License](LICENSE)
80
+
81
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
app.py ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
# app.py
# Entry point for the Code Weaver Streamlit app (run via `streamlit run app.py`).
import os
import sys
import streamlit as st

# Add the project's root directory to the Python path so that the `src`
# package is importable even when Streamlit executes this file directly
# from an arbitrary working directory.
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

# Import and run the chat app.
# NOTE: this import must come *after* the sys.path tweak above, which is
# why it is not grouped with the imports at the top of the file.
from src.utils.chat import run_chat_app


if __name__ == "__main__":
    run_chat_app()
dev-requirements.txt ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ black
2
+ flake8
pyproject.toml ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ [tool.black]
2
+ line-length = 88
3
+ target-version = ['py37', 'py38', 'py39']
requirements.txt ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ deeplake
2
+ langchain
3
+ openai
4
+ pathspec
5
+ python-dotenv
6
+ streamlit
7
+ streamlit_chat
8
+ langchain_community
screenshot.png ADDED
src/__init__.py ADDED
File without changes
src/main.py ADDED
@@ -0,0 +1,110 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import argparse
2
+ import os
3
+ import sys
4
+ from dotenv import load_dotenv
5
+ from streamlit.web import cli as stcli
6
+ from utils.process import process
7
+
8
+ # Load environment variables from a .env file (containing OPENAI_API_KEY)
9
+ load_dotenv()
10
+
11
+
12
def extract_repo_name(repo_url):
    """Extract the repository name from the given repository URL.

    Handles URLs with or without a trailing ``.git`` suffix and tolerates a
    trailing slash (e.g. ``https://host/user/repo/``).

    Args:
        repo_url: A git repository URL or path.

    Returns:
        The final path segment with any ``.git`` suffix removed.
    """
    # Strip a trailing slash first so the last path segment is the repo name.
    last_segment = repo_url.rstrip("/").split("/")[-1]
    # Only drop ".git" when it is a *suffix*; str.replace(".git", "") would
    # also mangle names that merely contain ".git" (e.g. "my.github.repo").
    if last_segment.endswith(".git"):
        last_segment = last_segment[: -len(".git")]
    return last_segment
16
+
17
+
18
def process_repo(args):
    """Clone the git repository, filter its files, and load the contents
    into an Activeloop Deep Lake dataset.

    Args:
        args: Parsed CLI namespace with ``repo_url``,
            ``activeloop_dataset_name``, ``include_file_extensions`` and
            ``repo_destination`` attributes.
    """
    repo_name = extract_repo_name(args.repo_url)
    activeloop_username = os.environ.get("ACTIVELOOP_USERNAME")

    # Fall back to the repository name when no explicit dataset name is given.
    dataset_name = args.activeloop_dataset_name or repo_name
    args.activeloop_dataset_path = f"hub://{activeloop_username}/{dataset_name}"

    process(
        args.repo_url,
        args.include_file_extensions,
        args.activeloop_dataset_path,
        args.repo_destination,
    )
39
+
40
+
41
def chat(args):
    """Start the Streamlit chat application for an existing Activeloop dataset.

    Args:
        args: Parsed CLI namespace with an ``activeloop_dataset_name`` attribute.
    """
    username = os.environ.get("ACTIVELOOP_USERNAME")
    args.activeloop_dataset_path = f"hub://{username}/{args.activeloop_dataset_name}"

    # Re-invoke through Streamlit's CLI; everything after "--" is forwarded
    # to src/utils/chat.py as script arguments.
    dataset_flag = f"--activeloop_dataset_path={args.activeloop_dataset_path}"
    sys.argv = ["streamlit", "run", "src/utils/chat.py", "--", dataset_flag]

    sys.exit(stcli.main())
60
+
61
+
62
def main():
    """Define and parse CLI arguments, then execute the appropriate subcommand.

    Subcommands:
        process: clone and index a git repository into a Deep Lake dataset.
        chat: launch the Streamlit chat UI on an existing dataset.
    """
    parser = argparse.ArgumentParser(description="Chat with a git repository")
    subparsers = parser.add_subparsers(dest="command")

    # Process subcommand
    process_parser = subparsers.add_parser("process", help="Process a git repository")
    process_parser.add_argument(
        "--repo-url", required=True, help="The git repository URL"
    )
    process_parser.add_argument(
        "--include-file-extensions",
        nargs="+",
        default=None,
        help=(
            "Exclude all files not matching these extensions. Example:"
            " --include-file-extensions .py .js .ts .html .css .md .txt"
        ),
    )
    process_parser.add_argument(
        "--activeloop-dataset-name",
        help=(
            "The name for the Activeloop dataset. Defaults to the git repository name."
        ),
    )
    process_parser.add_argument(
        "--repo-destination",
        default="repos",
        help="The destination to clone the repository. Defaults to 'repos'.",
    )

    # Chat subcommand
    chat_parser = subparsers.add_parser("chat", help="Start the chat application")
    chat_parser.add_argument(
        "--activeloop-dataset-name",
        required=True,
        help="The name of one of your existing Activeloop datasets.",
    )

    args = parser.parse_args()

    if args.command == "process":
        process_repo(args)
    elif args.command == "chat":
        chat(args)
    else:
        # Fix: previously the script exited silently when invoked with no
        # subcommand; show usage so the user knows what is expected.
        parser.print_help()
107
+
108
+
109
# Standard script guard: run the CLI only when this file is executed directly.
if __name__ == "__main__":
    main()
src/utils/__init__.py ADDED
File without changes
src/utils/chat.py ADDED
@@ -0,0 +1,153 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # src/utils/chat.py
2
+ import os
3
+ import tempfile
4
+ import streamlit as st
5
+ from langchain_community.vectorstores import DeepLake
6
+ from langchain_community.embeddings import OpenAIEmbeddings
7
+ from langchain_community.chat_models import ChatOpenAI
8
+ from langchain.chains import RetrievalQA
9
+ import openai
10
+ from streamlit_chat import message
11
+ from src.utils.process import process
12
+ from src.utils.load_and_split import load_docs, split_docs
13
+ import shutil
14
+ from langchain.cache import InMemoryCache
15
+ from langchain.globals import set_llm_cache
16
+ set_llm_cache(InMemoryCache())
17
+
18
+
19
def run_chat_app():
    """Run the chat application using the Streamlit framework.

    Renders a sidebar form for credentials and repository configuration,
    processes the repository on demand, and shows the chat exchange once
    the dataset is ready. All state lives in ``st.session_state``.
    """
    st.title("Code Weaver")  # App title

    # Seed the chat history so the UI has an opening exchange to render.
    if "generated" not in st.session_state:
        st.session_state["generated"] = ["I am ready to help you!"]
    if "past" not in st.session_state:
        st.session_state["past"] = ["Hello"]

    # Initialize configuration data and processing status in the session.
    if "data" not in st.session_state:
        st.session_state["data"] = {
            "repo_url": None,
            "include_file_extensions": None,
            "activeloop_dataset_path": None,
            "repo_destination": None,
            "status": "Please Provide Data"
        }
    # Sidebar for API keys and repository configuration.
    with st.sidebar:
        st.header("Configuration")
        # OpenAI key — exported to the environment so the langchain/openai
        # clients used in search_db()/process_repo() pick it up.
        openai_api_key = st.text_input("OpenAI API Key", type="password")
        if openai_api_key:
            os.environ["OPENAI_API_KEY"] = openai_api_key
        # Activeloop token (same environment-variable mechanism).
        activeloop_token = st.text_input("Activeloop Token", type="password")
        if activeloop_token:
            os.environ["ACTIVELOOP_TOKEN"] = activeloop_token
        # Activeloop username — used below to build the hub:// dataset path.
        activeloop_username = st.text_input("Activeloop Username")
        if activeloop_username:
            os.environ["ACTIVELOOP_USERNAME"] = activeloop_username


        st.session_state["data"]["repo_url"] = st.text_input("GitHub Repository URL")
        # Comma-separated extension filter; None means "include every file".
        file_extensions_input = st.text_input("File Extensions (comma-separated, e.g., .py,.js)").strip()
        st.session_state["data"]["include_file_extensions"] = [ext.strip() for ext in file_extensions_input.split(",")] if file_extensions_input else None

        dataset_name = st.text_input("Dataset Name")
        if dataset_name:
            st.session_state["data"]["activeloop_dataset_path"] = f"hub://{os.environ.get('ACTIVELOOP_USERNAME')}/{dataset_name}"
        else:
            st.session_state["data"]["activeloop_dataset_path"] = None

        st.session_state["data"]["repo_destination"] = "repos"

        # Only start processing when every required field/credential is set.
        if st.button("Process Repository"):
            if st.session_state["data"]["repo_url"] and st.session_state["data"]["activeloop_dataset_path"] and os.environ.get("OPENAI_API_KEY") and os.environ.get("ACTIVELOOP_TOKEN") and os.environ.get("ACTIVELOOP_USERNAME") :
                st.session_state["data"]["status"] = "Processing Data"
                with st.spinner("Processing the repository, please wait"):
                    process_repo()
                st.session_state["data"]["status"] = "Ready to Chat!"
            else :
                st.session_state["data"]["status"] = "Missing Data"


    # Chat input and display area — only enabled once processing succeeded.
    st.write(st.session_state["data"]["status"])
    if st.session_state["data"]["status"] == "Ready to Chat!":
        user_input = get_text()
        if user_input:
            output = search_db(user_input)
            st.session_state.past.append(user_input)
            st.session_state.generated.append(output)
        # Replay the whole conversation; past/generated are kept in lockstep
        # (one user message per generated answer, seeded above).
        if st.session_state["generated"]:
            for i in range(len(st.session_state["generated"])):
                message(st.session_state["past"][i], is_user=True, key=str(i) + "_user")
                message(st.session_state["generated"][i], key=str(i))
    # Footer
    st.markdown(
        """
        <br><hr style="border:2px solid gray">
        <p style="text-align:center; font-size: 12px;">
        Made with ❤️ by <a href="https://www.linkedin.com/in/glorry-sibomana/">Glorry Sibomana</a>
        </p>
        """,
        unsafe_allow_html=True,
    )
99
+
100
+
101
+
102
def get_text():
    """Render the query input box and return whatever the user typed."""
    # The visible label is hidden; the prompt text doubles as the label.
    return st.text_input("Enter your query:", key="input", label_visibility="hidden")
106
+
107
+
108
def search_db(query):
    """Answer *query* using the DeepLake vector store as a retrieval backend."""
    # Connect (read-only) to the dataset configured in the sidebar.
    embedding_fn = OpenAIEmbeddings(model="text-embedding-ada-002")
    vector_store = DeepLake(
        dataset_path=st.session_state["data"]["activeloop_dataset_path"],
        read_only=True,
        embedding_function=embedding_fn,
    )

    # Retriever with custom search parameters: cosine distance, a wide
    # candidate pool (fetch_k) narrowed down to the top k chunks.
    retriever = vector_store.as_retriever()
    retriever.search_kwargs.update(
        {"distance_metric": "cos", "fetch_k": 100, "k": 10}
    )

    # Answer the query with a RetrievalQA chain over the chat model.
    llm = ChatOpenAI(model="gpt-3.5-turbo")
    qa_chain = RetrievalQA.from_llm(llm, retriever=retriever)
    return qa_chain.run(query)
130
+
131
+
132
+
133
+
134
def process_repo():
    """Process the repository and save embeddings into a Deep Lake dataset.

    The repository is cloned into a temporary directory that is removed
    automatically once processing finishes.
    """
    cfg = st.session_state["data"]

    with tempfile.TemporaryDirectory() as temp_dir:
        clone_path = os.path.join(temp_dir, "repo_clone")
        process(
            cfg["repo_url"],
            cfg["include_file_extensions"],
            cfg["activeloop_dataset_path"],
            clone_path,
        )
150
+
151
+
152
# Allow running this module directly (e.g. `streamlit run src/utils/chat.py`).
if __name__ == "__main__":
    run_chat_app()
src/utils/load_and_split.py ADDED
@@ -0,0 +1,58 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # src/utils/load_and_split.py
2
+
3
+ import os
4
+ import pathspec
5
+ from langchain_community.document_loaders import TextLoader
6
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
7
+
8
+
9
def load_docs(root_dir, file_extensions=None):
    """
    Load documents from the specified root directory.

    Ignores dotfiles, dot directories, and files matching the repository's
    .gitignore rules; optionally filters by file extension.

    Args:
        root_dir: Directory to walk recursively.
        file_extensions: Optional collection of extensions (e.g. [".py"]);
            when given, only files with one of these extensions are kept.

    Returns:
        A list of LangChain documents produced by TextLoader.
    """
    docs = []

    # Load .gitignore rules; patterns are interpreted relative to root_dir.
    gitignore_path = os.path.join(root_dir, ".gitignore")

    if os.path.isfile(gitignore_path):
        with open(gitignore_path, "r") as gitignore_file:
            gitignore = gitignore_file.read()
        spec = pathspec.PathSpec.from_lines(
            pathspec.patterns.GitWildMatchPattern, gitignore.splitlines()
        )
    else:
        spec = None

    for dirpath, dirnames, filenames in os.walk(root_dir):
        # Prune dot directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]

        for file in filenames:
            file_path = os.path.join(dirpath, file)

            # Skip dotfiles
            if file.startswith("."):
                continue

            # Fix: .gitignore patterns are relative to the repository root,
            # so match against the path relative to root_dir rather than the
            # full joined path (which previously made root-anchored patterns
            # never match).
            if spec and spec.match_file(os.path.relpath(file_path, root_dir)):
                continue

            if file_extensions and os.path.splitext(file)[1] not in file_extensions:
                continue

            # Best-effort: skip binary/undecodable files instead of aborting
            # the whole walk.
            try:
                loader = TextLoader(file_path, encoding="utf-8")
                docs.extend(loader.load_and_split())
            except Exception:
                pass
    return docs
53
+
54
+
55
def split_docs(docs):
    """Split documents into ~1000-character chunks with 100 characters of overlap."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(docs)
    return chunks
src/utils/process.py ADDED
@@ -0,0 +1,52 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # src/utils/process.py
2
+
3
+ import deeplake
4
+ import openai
5
+ import os
6
+ import subprocess
7
+ from langchain_community.document_loaders import TextLoader
8
+ from langchain_community.embeddings import OpenAIEmbeddings
9
+ from langchain_community.vectorstores import DeepLake
10
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
11
+ from src.utils.load_and_split import load_docs, split_docs # Updated import
12
+
13
+
14
def clone_repository(repo_url, local_path):
    """Clone the specified git repository to the given local path.

    Runs git with a list argv (shell=False), so the URL is never
    shell-interpreted. check=True raises CalledProcessError on a non-zero
    exit status; capture_output=True keeps git's chatter off the console
    (the output remains available on the raised error's stdout/stderr).
    """
    subprocess.run(["git", "clone", repo_url, local_path], check=True, capture_output=True)
17
+
18
+
19
def create_deeplake_dataset(activeloop_dataset_path, activeloop_token):
    """Create an empty DeepLake dataset with the specified path and token.

    Any existing dataset at the path is overwritten, and the four tensors
    used by the LangChain DeepLake vector store are created.
    """
    dataset = deeplake.empty(
        activeloop_dataset_path,
        token=activeloop_token,
        overwrite=True,
    )

    # One tensor per field stored for each embedded chunk.
    for tensor_name in ("ids", "metadata", "embedding", "text"):
        dataset.create_tensor(tensor_name)
31
+
32
+
33
def process(
    repo_url, include_file_extensions, activeloop_dataset_path, repo_destination
):
    """
    Process a git repository by cloning it, filtering files, splitting documents,
    creating embeddings, and storing everything in a DeepLake dataset.
    """
    # The Activeloop token is expected in the environment (set by the UI
    # sidebar or a .env file).
    activeloop_token = os.getenv("ACTIVELOOP_TOKEN")
    create_deeplake_dataset(activeloop_dataset_path, activeloop_token)

    clone_repository(repo_url, repo_destination)

    # Load, filter, and chunk the repository contents.
    documents = load_docs(repo_destination, include_file_extensions)
    chunks = split_docs(documents)

    # Embed each chunk and persist it to the dataset.
    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
    vector_store = DeepLake(
        dataset_path=activeloop_dataset_path, embedding_function=embeddings
    )
    vector_store.add_documents(chunks)